
# Quantizing a model with llama.cpp

Prepare

  1. Clone the llama.cpp repository with git.
  2. Create a Python virtual environment (venv) in the project directory.
  3. Install the prebuilt llama.cpp CLI commands.

Tip

On Windows, you can install the prebuilt CLI with winget install llama.cpp.
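
The preparation steps can be sketched as a small shell function. This is a minimal sketch for a POSIX shell, not a definitive setup script: the function name `prepare_llamacpp`, the repository URL, the `.venv` directory name, and the use of `python3` are all assumptions. On Windows the activation script lives at `.venv\Scripts\activate` instead.

```shell
# Sketch of the preparation steps (assumed names and paths throughout).
prepare_llamacpp() {
  # Assumed upstream URL; pass your own fork or mirror as the first argument.
  repo_url="${1:-https://github.com/ggml-org/llama.cpp}"
  target_dir="${2:-llama.cpp}"

  # 1. Clone the project, unless the directory already exists.
  [ -d "$target_dir" ] || git clone "$repo_url" "$target_dir" || return 1

  # 2. Create and activate a virtual environment inside the project.
  cd "$target_dir" || return 1
  python3 -m venv .venv || return 1
  . .venv/bin/activate

  # 3. The prebuilt CLI is installed separately with a package manager,
  #    for example winget on Windows as described in the tip.
}
```

Calling `prepare_llamacpp` with no arguments clones into `./llama.cpp` and leaves the shell inside the activated environment.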

Convert the model to GGUF format

# run from the root of the llama.cpp repository
pip install -r requirements.txt
py convert_hf_to_gguf.py <model path>
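
The conversion command above leaves the output location to the script. As a hedged sketch, the script's `--outfile` and `--outtype` options can make the result explicit. The helper names `gguf_outfile` and `convert_to_gguf`, the `-f16` naming scheme, and the choice of `f16` are assumptions for illustration, as is calling the interpreter `python` rather than `py`.

```shell
# Hypothetical helper: name the output file after the model directory.
gguf_outfile() {
  printf '%s-f16.gguf\n' "$(basename "$1")"
}

# Convert a Hugging Face model directory to a single GGUF file.
# Run from the root of the llama.cpp repository.
convert_to_gguf() {
  model_dir="$1"
  # f16 is an assumed output type, not a recommendation.
  python convert_hf_to_gguf.py "$model_dir" \
    --outfile "$(gguf_outfile "$model_dir")" \
    --outtype f16
}
```

For example, `convert_to_gguf /models/my-model` would write `my-model-f16.gguf` to the current directory.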

Quantize the GGUF model

# run from the directory that contains the GGUF model
# uses the llama-quantize CLI from llama.cpp
llama-quantize <gguf model path> Q4_K_M
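
`llama-quantize` also accepts an explicit output path between the input file and the quantization type. The sketch below derives that path from the input name so the original file and the quantized file are easy to tell apart. The helper names `quantized_outfile` and `quantize_gguf` and the naming scheme are assumptions, not part of llama.cpp.

```shell
# Hypothetical helper: model-f16.gguf + Q4_K_M -> model-f16-Q4_K_M.gguf
quantized_outfile() {
  printf '%s-%s.gguf\n' "${1%.gguf}" "$2"
}

# Quantize with an explicit output path: input, output, then type.
quantize_gguf() {
  input="$1"
  qtype="${2:-Q4_K_M}"  # defaults to the type used in this guide
  llama-quantize "$input" "$(quantized_outfile "$input" "$qtype")" "$qtype"
}
```

For example, `quantize_gguf my-model-f16.gguf` would write `my-model-f16-Q4_K_M.gguf` next to the input.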

# serve the model through an OpenAI-compatible API, with the embeddings API and a simple web UI
llama-server -m <model-path> --jinja -c 0 --host 127.0.0.1 --port 8033 --embeddings
# -c sets the context length; 0 uses the model's default
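
Once the server is running, the embeddings endpoint can be queried with curl. This is a minimal sketch that assumes the host and port from the command above and the OpenAI-style `/v1/embeddings` route. The helper names are hypothetical, and the payload builder assumes the input text needs no JSON escaping. Depending on the llama.cpp version and the model, the server may also need a pooling option for this endpoint, and starting it with `--embeddings` may disable the chat completion routes.

```shell
# Assumed to match the host and port passed to llama-server.
BASE_URL="http://127.0.0.1:8033"

# Build a minimal JSON body for the embeddings endpoint.
# Assumes the text contains no quotes or backslashes.
embedding_payload() {
  printf '{"input": "%s"}\n' "$1"
}

# Send the request; requires curl and a running server.
request_embedding() {
  curl -s "$BASE_URL/v1/embeddings" \
    -H "Content-Type: application/json" \
    -d "$(embedding_payload "$1")"
}
```

For example, `request_embedding "hello world"` prints the JSON response from the server.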

Reference

llama.cpp