Deploy LiteLLM Gateway | Davidhsaiou Docs

# Deploy Litellm gateway

Litellm can deploy as a llm gateway, which selects the model based on the configuration. It can do a unified api gateway serving multiple models from different providers. It can also set up fallback models when models are down.

Configuration

A config.yaml for setting all backend models.

For a local lamacpp model with modelId: .\\Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf

model_list:
  - model_name: chat-llm # this would be the modelId calling to this model
    litellm_params: # backend model setting
      model: "openai/.\\Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf" # the openai/ indicates this is an OpenAI competiable endpoint
      api_base: http://10.1.0.100:8033 # use a local model listen on 10.1.0.100:8033
      api_key: NO_USED  # NO USED

Tip

When multiple models with the same model_name, the request will be routed to the one of them. It can add weight setting, for route control.

Adding fallback model using Groq

Add groq model as fallback model

model_list:
# ... ignore existing settings
  - model_name: chat-llm-cloud # modelId
    litellm_params:
      model: "groq/llama-3.3-70b-versatile" # groq/ indicates the Groq provider
      api_key: os.environ/GROQ_API_KEY # getting the secret from env `GROQ_API_KEY`

litellm_settings:
  num_retries: 3 # retry call <num> times on each model_name
  request_timeout: 10 # raise Timeout error if the call takes longer than 10 secondes.
  fallbacks: [ { "chat-llm": [ "chat-llm-cloud" ] } ] # fallback to chat-llm-cloud if call fails num_retries
  allowed_fails: 1 # cooldown model if it fails > 1 call in a minute.
  cooldown_time: 300 # how long to cooldown model if fails/min > allowed_fails

Test config using Docker

Tip

Set the GROQ_API_KEY using $env:GROQ_API_KEY = <api key>

Run the litellm instance, open the website localhost:4000, then you can access the swagger to call the model.

docker run `
-e GROQ_API_KEY=$env:GROQ_API_KEY `
-v <path to the config.yaml>:/app/config.yaml `
-p 4000:4000 `
ghcr.io/berriai/litellm:main-latest `
--config /app/config.yaml `
--detailed_debug

Deploy to Kubernetes

Translate the docker command into K8s deployment and configmap. Apply them using kubectl apply <filename>

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  namespace: litellm
  labels:
    app: litellm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      name: litellm
      labels:
        app: litellm
    spec:
      volumes:
        - name: config
          configMap:
            name: litellm-configmap
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest
          args:
            - "--config"
            - "/app/config.yaml"
            - "--detailed_debug"
          imagePullPolicy: Always
          envFrom:
            - secretRef:
                name: litellm-secret # mount the GROQ_API_KEY
          volumeMounts:
            - mountPath: /app/config.yaml # mount the config from configmap
              name: config
              subPath: config.yaml
          ports:
            - containerPort: 4000
              name: web
          resources:
            requests:
              cpu: 50m
              memory: 200Mi
            limits:
              cpu: 1000m
              memory: 1Gi
      restartPolicy: Always

Configmap

Define the config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-configmap
  namespace: litellm
data:
  config.yaml: |
    model_list:
      - model_name: chat-llm
        litellm_params:
          model: "openai/.\\Llama-3-Taiwan-8B-Instruct-rc2-Q4_K_M.gguf"
          api_base: http://10.1.0.100:8033
          api_key: NO_USED
      - model_name: chat-llm-cloud
        litellm_params:
          model: "groq/llama-3.3-70b-versatile"
          api_key: os.environ/GROQ_API_KEY
    litellm_settings:
      num_retries: 3
      request_timeout: 10 
      fallbacks: [ { "chat-llm": [ "chat-llm-cloud" ] } ]
      allowed_fails: 1 
      cooldown_time: 300

Secret

Define the api key in config.

apiVersion: v1
kind: Secret
metadata:
  name: litellm-secret
  namespace: litellm
type: Opaque
data:
  GROQ_API_KEY: <base64 api-key>

Service

Expose litellm to the K8s

apiVersion: v1
kind: Service
metadata:
  name: litellm-svc
  namespace: litellm
spec:
  selector:
    app: litellm
  ports:
    - protocol: TCP
      port: 80
      targetPort: 4000
      name: web

Tip

Once the service is deployed, you can call the service by URL `http://<service_name>.<service_namespace>

Reference

LiteLLM

Table of Contents