# Deploy Litellm gateway
Litellm can deploy as a llm gateway, which selects the model based on the configuration. It can do a unified api gateway serving multiple models from different providers. It can also set up fallback models when models are down.
Configuration
A config.yaml for setting all backend models.
For a local lamacpp model with modelId: .\\Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf
model_list:
- model_name: chat-llm # this would be the modelId calling to this model
litellm_params: # backend model setting
model: "openai/.\\Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf" # the openai/ indicates this is an OpenAI competiable endpoint
api_base: http://10.1.0.100:8033 # use a local model listen on 10.1.0.100:8033
api_key: NO_USED # NO USED
Tip
When multiple models with the same model_name, the request will be routed to the one of them. It can add weight setting, for route control.
Adding fallback model using Groq
Add groq model as fallback model
model_list:
# ... ignore existing settings
- model_name: chat-llm-cloud # modelId
litellm_params:
model: "groq/llama-3.3-70b-versatile" # groq/ indicates the Groq provider
api_key: os.environ/GROQ_API_KEY # getting the secret from env `GROQ_API_KEY`
litellm_settings:
num_retries: 3 # retry call <num> times on each model_name
request_timeout: 10 # raise Timeout error if the call takes longer than 10 secondes.
fallbacks: [ { "chat-llm": [ "chat-llm-cloud" ] } ] # fallback to chat-llm-cloud if call fails num_retries
allowed_fails: 1 # cooldown model if it fails > 1 call in a minute.
cooldown_time: 300 # how long to cooldown model if fails/min > allowed_fails
Test config using Docker
Tip
Set the GROQ_API_KEY using $env:GROQ_API_KEY = <api key>
Run the litellm instance, open the website localhost:4000, then you can access the swagger to call the model.
docker run `
-e GROQ_API_KEY=$env:GROQ_API_KEY `
-v <path to the config.yaml>:/app/config.yaml `
-p 4000:4000 `
ghcr.io/berriai/litellm:main-latest `
--config /app/config.yaml `
--detailed_debug
Deploy to Kubernetes
Translate the docker command into K8s deployment and configmap.
Apply them using kubectl apply <filename>
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: litellm
namespace: litellm
labels:
app: litellm
spec:
replicas: 1
selector:
matchLabels:
app: litellm
template:
metadata:
name: litellm
labels:
app: litellm
spec:
volumes:
- name: config
configMap:
name: litellm-configmap
containers:
- name: litellm
image: ghcr.io/berriai/litellm:main-latest
args:
- "--config"
- "/app/config.yaml"
- "--detailed_debug"
imagePullPolicy: Always
envFrom:
- secretRef:
name: litellm-secret # mount the GROQ_API_KEY
volumeMounts:
- mountPath: /app/config.yaml # mount the config from configmap
name: config
subPath: config.yaml
ports:
- containerPort: 4000
name: web
resources:
requests:
cpu: 50m
memory: 200Mi
limits:
cpu: 1000m
memory: 1Gi
restartPolicy: Always
Configmap
Define the config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: litellm-configmap
namespace: litellm
data:
config.yaml: |
model_list:
- model_name: chat-llm
litellm_params:
model: "openai/.\\Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf"
api_base: http://10.1.0.100:8033
api_key: NO_USED
- model_name: chat-llm-cloud
litellm_params:
model: "groq/llama-3.3-70b-versatile"
api_key: os.environ/GROQ_API_KEY
litellm_settings:
num_retries: 3
request_timeout: 10
fallbacks: [ { "chat-llm": [ "chat-llm-cloud" ] } ]
allowed_fails: 1
cooldown_time: 300
Secret
Define the api key in config.
apiVersion: v1
kind: Secret
metadata:
name: litellm-secret
namespace: litellm
type: Opaque
data:
GROQ_API_KEY: <base64 api-key>
Service
Expose litellm to the K8s
apiVersion: v1
kind: Service
metadata:
name: litellm-svc
namespace: litellm
spec:
selector:
app: litellm
ports:
- protocol: TCP
port: 80
targetPort: 4000
name: web
Tip
Once the service is deployed, you can call the service by URL `http://<service_name>.<service_namespace>