Self-Hosted Inference with llama.cpp
llama.cpp is a high-performance inference engine for GGUF-format quantized language models. It exposes an OpenAI-compatible HTTP API, making it a drop-in backend for tools that expect an OpenAI endpoint.
What this page covers
- Choosing and downloading a quantized model from Hugging Face
- Running llama.cpp as a Docker container
- Configuring the OpenAI-compatible API server
- Hardware requirements and GPU acceleration options
- Connecting llama.cpp to LiteLLM or directly to Claude Code
Model selection
Quantized models trade a small amount of accuracy for dramatically reduced memory requirements. Common quantization levels:
| Quantization | VRAM / RAM needed | Quality loss |
|---|---|---|
| Q8_0 | ~8 GB for 7B model | Minimal |
| Q4_K_M | ~4 GB for 7B model | Small |
| Q3_K_M | ~3 GB for 7B model | Moderate |
For coding tasks, models in the Qwen2.5-Coder or DeepSeek-Coder families perform well at the 7B–14B parameter range.
Running with Docker
docker run -d \
--name llamacpp \
-p 8080:8080 \
-v /models:/models \
ghcr.io/ggerganov/llama.cpp:server \
-m /models/my-model.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096