Run llama.cpp OAI Server in Docker

# Using docker to run llamacpp OpenAI compatible server

Run LLM Model

docker run `
-v </path/to/models>:/models `
-p 8033:8033 `
ghcr.io/ggml-org/llama.cpp:server `
-m /models/Llama-3-Taiwan-8B-Instruct/Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf `
--jinja `
-c 8192 `
--port 8033 `
--host 0.0.0.0

Tip

You can access the web ui at http://localhost:8033

Default batch is 2048 tokens, and ubatch is 512 token, modifing it using -b <batch num> and -ub <ubatch num>

Tip

For auto download model, replace the -m arg to -hf <huggingface model id>

Setting download model path with LLAMA_CACHE env variable.

Run Embedding Model

docker run `
-v </path/to/models>:/models `
-p 8033:8033 `
ghcr.io/ggml-org/llama.cpp:server `
-m /models/Qwen3-Embedding-0.6B-GGUF/Qwen3-Embedding-0.6B-Q8_0.gguf `
--embeddings `
--pooling cls `
-c 2048 `
-b 2048 `
-ub 1024 `
--port 8033 `
--host 0.0.0.0

Tip

For the OpenAI Embedding api, the path should use /v1 prefix

Use OpenAI client base URL with http://localhostL8033/v1

Tip

For gpu, you can use CUDA supported image, ghcr.io/ggml-org/llama.cpp:server-cuda

For cpu offload, add -ngl <num layers> to load only the to gpu, other remains on cpu memory.

Should install cuda container toolkit on docker host. And add --gpus all to docker run args.

Reference

llamacpp doc

Table of Contents

Run LLM Model

Tip

Tip

Run Embedding Model

Tip

Tip

Reference