Self-Hosted Inference with llama.cpp

llama.cpp is a high-performance inference engine for GGUF-format quantized language models. It exposes an OpenAI-compatible HTTP API, making it a drop-in backend for tools that expect an OpenAI endpoint.

What this page covers

Choosing and downloading a quantized model from Hugging Face
Running llama.cpp as a Docker container
Configuring the OpenAI-compatible API server
Hardware requirements and GPU acceleration options
Connecting llama.cpp to LiteLLM or directly to Claude Code

Model selection

Quantized models trade a small amount of accuracy for dramatically reduced memory requirements. Common quantization levels:

Quantization	VRAM / RAM needed	Quality loss
Q8_0	~8 GB for 7B model	Minimal
Q4_K_M	~4 GB for 7B model	Small
Q3_K_M	~3 GB for 7B model	Moderate

For coding tasks, models in the Qwen2.5-Coder or DeepSeek-Coder families perform well at the 7B–14B parameter range.

Running with Docker

docker run -d \
  --name llamacpp \
  -p 8080:8080 \
  -v /models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/my-model.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096

Table of Contents

Self-Hosted Inference with llama.cpp

What this page covers

Model selection

Running with Docker

Related reference docs