Table of Contents

Self-Hosted Inference with llama.cpp

llama.cpp is a high-performance inference engine for GGUF-format quantized language models. It exposes an OpenAI-compatible HTTP API, making it a drop-in backend for tools that expect an OpenAI endpoint.

What this page covers

  • Choosing and downloading a quantized model from Hugging Face
  • Running llama.cpp as a Docker container
  • Configuring the OpenAI-compatible API server
  • Hardware requirements and GPU acceleration options
  • Connecting llama.cpp to LiteLLM or directly to Claude Code

Model selection

Quantized models trade a small amount of accuracy for dramatically reduced memory requirements. Common quantization levels:

Quantization VRAM / RAM needed Quality loss
Q8_0 ~8 GB for 7B model Minimal
Q4_K_M ~4 GB for 7B model Small
Q3_K_M ~3 GB for 7B model Moderate

For coding tasks, models in the Qwen2.5-Coder or DeepSeek-Coder families perform well at the 7B–14B parameter range.

Running with Docker

docker run -d \
  --name llamacpp \
  -p 8080:8080 \
  -v /models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/my-model.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096