This document explains how this project interacts with and controls llama.cpp, including:
- Low-level C API bindings and call flow
- High-level Python inference APIs
- OpenAI-compatible server implementation
- Programmatic and operational server control with Python
- Practical code snippets for common management tasks
The repository has four integration layers:

- `llama_cpp/llama_cpp.py`
  - Loads the `llama` shared library (`.dll`, `.so`, `.dylib`) with `ctypes`
  - Exposes raw llama.cpp symbols and constants as Python-callable functions
- `llama_cpp/_internals.py`
  - Resource-safe wrappers for native handles: `LlamaModel` (`llama_model *`), `LlamaContext` (`llama_context *`), `LlamaBatch` (`llama_batch`), `LlamaSampler` (sampler chain)
- `llama_cpp/llama.py`
  - Public high-level API (the `Llama` class)
  - Implements tokenization, eval/decode, sampling, generation, embeddings, chat handling, state save/restore, cache integration
- `llama_cpp/server/*`
  - FastAPI OpenAI-compatible server
  - Model lifecycle and routing (`LlamaProxy`)
  - Streaming (SSE), authentication, multi-model config
The bridge into llama.cpp starts in `llama_cpp/llama_cpp.py`:

```python
_lib_base_name = "llama"
_override_base_path = os.environ.get("LLAMA_CPP_LIB_PATH")
_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / "lib" if _override_base_path is None else pathlib.Path(_override_base_path)
_lib = load_shared_library(_lib_base_name, _base_path)
ctypes_function = ctypes_function_for_shared_library(_lib)
```

Key control point:
- Set `LLAMA_CPP_LIB_PATH` to force loading a custom `llama` build.
After load, llama_cpp.py defines hundreds of C signatures/constants (llama_*, LLAMA_*) used by all higher layers.
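For example, a custom build can be selected by setting the variable before the first import of `llama_cpp` (the directory below is an illustrative path, not a real one):

```python
import os

# LLAMA_CPP_LIB_PATH must be set before llama_cpp is first imported,
# because the shared library is resolved at module import time.
# The directory below is an example path to a custom llama.cpp build.
os.environ["LLAMA_CPP_LIB_PATH"] = "/opt/llama.cpp/build/bin"

# import llama_cpp  # would now resolve libllama from the directory above
```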
`llama_cpp/_internals.py` wraps native objects with deterministic cleanup via `ExitStack`.

Creation:
- `llama_model_load_from_file()`
- `llama_model_get_vocab()`

Cleanup:
- `llama_model_free()`

Important metadata/introspection calls:
- `llama_model_n_ctx_train`, `llama_model_n_embd`, `llama_model_desc`, `llama_model_meta_*`
Creation:
- `llama_init_from_model()`

Core runtime calls:
- `llama_decode()` for the token forward pass
- `llama_encode()` in embedding flows
- `llama_get_logits()`, `llama_get_embeddings()`, `llama_get_embeddings_seq()`

KV cache controls:
- `llama_memory_clear`, `llama_memory_seq_rm`, `llama_memory_seq_cp`, `llama_memory_seq_keep`, `llama_memory_seq_add`
Creation/free:
- `llama_batch_init()` / `llama_batch_free()`
Used to populate token ids, positions, sequence ids, and logits flags for decode passes.
Builds a llama.cpp sampler pipeline by chaining:
- penalties
- top-k/top-p/min-p/typical
- temp/greedy/distribution
- mirostat
- grammar and custom samplers
This is how generation behavior is controlled at runtime.
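A toy pure-Python analogue of such a chain may help fix intuitions (illustrative only: the real pipeline is built from llama.cpp sampler objects, and the ordering and details here are simplified):

```python
import math
import random

def sample_next_token(logits, top_k=40, top_p=0.95, temperature=0.8, rng=None):
    """Toy sampler chain: top-k filter, then top-p (nucleus) filter,
    then temperature-weighted sampling over the survivors."""
    rng = rng or random.Random(0)
    # Keep the top_k highest-logit candidates.
    ranked = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max logit for stability).
    m = max(logit for _, logit in ranked)
    exps = [(tok, math.exp(logit - m)) for tok, logit in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Temperature: p ** (1/T) equals softmax(logit / T) up to normalization.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    r = rng.random() * sum(weights)
    acc = 0.0
    for (tok, _), w in zip(kept, weights):
        acc += w
        if acc >= r:
            return tok
    return kept[-1][0]
```

With a strongly peaked distribution the filters collapse onto the dominant token, which is why low-entropy logits give near-deterministic output regardless of seed.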
The public API is `llama_cpp.Llama` in `llama_cpp/llama.py`.

`Llama.__init__` performs these major steps:

- Initialize the backend once per process:

  ```python
  if not Llama.__backend_initialized:
      llama_cpp.llama_backend_init()
  ```

- Configure model params (`llama_model_default_params`) and context params (`llama_context_default_params`)
- Load model/context/batch wrappers: `internals.LlamaModel(...)`, `internals.LlamaContext(...)`, `internals.LlamaBatch(...)`
- Optionally attach:
  - LoRA adapter (`llama_adapter_lora_init`, `llama_set_adapter_lora`)
  - custom tokenizer
  - draft model for speculative decoding
- Build chat handler map from GGUF metadata (`tokenizer.chat_template.*`) or fallback chat formats
- Model placement/perf: `n_gpu_layers`, `split_mode`, `main_gpu`, `tensor_split`, `offload_kqv`, `flash_attn`
- Context/perf: `n_ctx`, `n_batch`, `n_threads`, `n_threads_batch`
- RoPE/YaRN scaling: `rope_scaling_type`, `rope_freq_base`, `rope_freq_scale`, `yarn_*`
- Memory behavior: `use_mmap`, `use_mlock`
- KV quantization: `type_k`, `type_v`
- Chat behavior: `chat_format`, `chat_handler`
- Grammar and constrained decoding: `grammar` at request time
At a high level:
- Tokenize prompt
- Evaluate prompt with `eval()`
- Sample next token with configured sampler chain
- Repeat until stop/eos/max-tokens/stopping criteria
- Detokenize and return OpenAI-style response
Core generation loop comes from `generate()` + `sample()` + `eval()`.

`eval()`:
- Prunes the KV tail for overwrite scenarios: `self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)`
- Fills `LlamaBatch`
- Calls `self._ctx.decode(self._batch)` (the llama.cpp forward pass)
- Copies logits into NumPy buffers when needed

`sample()`:
- Builds sampler chain with penalties + chosen strategy
- Calls `llama_sampler_sample(...)`

`generate()`:
- Adds stop handling, UTF-8-safe chunking, optional logprobs
- Supports streaming and non-streaming
- Supports cache lookup/save with `LlamaState`
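Stripped of batching, logits handling, and caching, the control flow of that loop can be sketched as follows (the function names and signatures here are illustrative, not the real API):

```python
def generate(tokens, eval_fn, sample_fn, eos_token, max_tokens=32, stop_tokens=()):
    """Skeleton of a generate() loop: evaluate pending tokens, sample,
    append, and stop on EOS / stop tokens / token budget."""
    produced = []
    pending = list(tokens)          # prompt tokens not yet evaluated
    while len(produced) < max_tokens:
        eval_fn(pending)            # forward pass over pending tokens
        tok = sample_fn()           # next token from the sampler chain
        if tok == eos_token or tok in stop_tokens:
            break
        produced.append(tok)
        pending = [tok]             # only the new token needs evaluation
    return produced

# Usage with trivial stubs: a "model" that always predicts last token + 1.
state = {"last": 0}
def eval_fn(toks):
    if toks:
        state["last"] = toks[-1]
def sample_fn():
    return state["last"] + 1

out = generate([1, 2, 3], eval_fn, sample_fn, eos_token=7)
```

The key property this preserves from the real loop: after the prompt pass, each iteration feeds exactly one new token back through the model, relying on the KV cache for everything earlier.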
`create_chat_completion()` routes through a chat handler:

```python
handler = (
    self.chat_handler
    or self._chat_handlers.get(self.chat_format)
    or llama_chat_format.get_chat_completion_handler(self.chat_format)
)
```

The selected handler:
- formats messages into a prompt
- may build JSON-schema grammar for tool calling / JSON mode
- delegates to `create_completion()`
- converts output back to OpenAI chat schema
`create_embedding()` and `embed()`:
- require a model created with `embedding=True`
- tokenize inputs
- decode in batches
- read embeddings from llama.cpp pointers
- optionally normalize
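The optional normalization amounts to scaling each vector to unit L2 norm; a minimal sketch:

```python
import math

def l2_normalize(vec):
    """Scale an embedding to unit L2 norm; a zero vector is returned
    unchanged to avoid division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]
```

Unit-norm embeddings make cosine similarity a plain dot product, which is why normalization is the common default for retrieval use cases.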
`save_state()`:
- `llama_get_state_size`, `llama_copy_state_data`
- snapshots prompt state + logits/input ids + seed

`load_state()`:
- `llama_set_state_data`
- restores prompt position and internal state
This enables prompt caching and fast resume.
The server lives under `llama_cpp/server`.

CLI entrypoint: `python -m llama_cpp.server`
- implemented in `llama_cpp/server/__main__.py`
- parses `ServerSettings` + `ModelSettings`
- supports `--config_file` (JSON/YAML)
- runs `uvicorn.run(app, ...)`

Programmatic entrypoint: `llama_cpp.server.app.create_app(...)`
`llama_cpp/server/settings.py` defines:

- `ServerSettings`: `host`, `port`, TLS files, `api_key`, `interrupt_requests`, `disable_ping_events`, `root_path`
- `ModelSettings`: almost all `Llama(...)` controls (GPU, context, sampling defaults, LoRA, tokenizer, draft model, cache, etc.)

All fields can be set via CLI args or environment variables (Pydantic settings behavior).
`llama_cpp/server/model.py`:
- loads the default model on startup
- keeps only one active `Llama` in memory at a time
- routes by the `model` request field to a model alias
- on model switch:
  - closes the current model (`.close()`)
  - loads the target model via `load_llama_from_model_settings`
Key behavior:
- unknown model alias falls back to default model
- this is an unload/reload switch, not concurrent multi-model residency
In `server/app.py`, the dependency `get_llama_proxy()` uses a double-lock strategy (`llama_outer_lock`, `llama_inner_lock`) so the server can:
- serialize model access safely
- optionally interrupt long streaming responses when a new request arrives (`interrupt_requests=True`)
This is crucial because most llama.cpp context usage is not safe for unconstrained concurrent generation on one context.
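A simplified sketch of the double-lock idea (not the server's actual code; the real implementation differs in details, but the handoff pattern is the same in spirit):

```python
import threading

outer_lock = threading.Lock()
inner_lock = threading.Lock()

def use_model(generate_step, max_steps):
    """The outer lock marks 'a newer request is waiting'; the inner
    lock guards the model itself. A running generation polls
    outer_lock.locked() and stops early when someone is queued,
    approximating the interrupt_requests behavior."""
    outer_lock.acquire()
    try:
        inner_lock.acquire()   # wait until the previous request finishes
    finally:
        outer_lock.release()   # let the next request queue up behind us
    chunks = []
    try:
        for i in range(max_steps):
            if outer_lock.locked():   # a newer request is waiting: yield
                break
            chunks.append(generate_step(i))
    finally:
        inner_lock.release()
    return chunks
```

With a single caller the loop runs to completion; with a competitor queued on the outer lock, the current stream observes `outer_lock.locked()` and terminates early instead of holding the context indefinitely.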
Primary endpoints:
- `POST /v1/completions`
- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `GET /v1/models`

Extra utility endpoints:
- `POST /extras/tokenize`
- `POST /extras/tokenize/count`
- `POST /extras/detokenize`
Streaming:
- Implemented with Server-Sent Events via `EventSourceResponse`
- Chunks are emitted as `data: <json>` and terminated with `data: [DONE]`
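A minimal client-side parser for this framing (illustrative; real clients should use an SSE library and handle multi-line events, and the chunk payloads below are simplified):

```python
import json

def parse_sse_stream(lines):
    """Parse the server's SSE framing: each event line is
    'data: <json>' and the stream ends with 'data: [DONE]'."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip comments / keepalives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Example frames in the shape the server emits (payloads simplified):
frames = [
    'data: {"choices": [{"text": "Hel"}]}',
    'data: {"choices": [{"text": "lo"}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["text"] for c in parse_sse_stream(frames))
```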
Auth:
- If `api_key` is set, a bearer token is required
Request-time controls in server:
- `logit_bias_type="tokens"` converts text tokens to input ids before passing to the model
- a `grammar` string is converted via `LlamaGrammar.from_string`
- `min_tokens` is converted to a logits processor that blocks EOS until the threshold is reached
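A sketch of what such a min-tokens processor amounts to (the factory name is hypothetical; the `(input_ids, scores) -> scores` shape follows the common logits-processor convention, not the exact server code):

```python
def make_min_tokens_processor(min_tokens, eos_token_id, prompt_len):
    """Until at least min_tokens completion tokens exist, force the
    EOS logit to -inf so the sampler cannot select it."""
    def processor(input_ids, scores):
        generated = len(input_ids) - prompt_len
        if generated < min_tokens:
            scores = list(scores)             # don't mutate the caller's buffer
            scores[eos_token_id] = float("-inf")
        return scores
    return processor

# Usage: with a 3-token prompt, EOS (id 0 here) is masked until two
# completion tokens have been produced.
proc = make_min_tokens_processor(min_tokens=2, eos_token_id=0, prompt_len=3)
masked = proc([1, 2, 3, 7], [0.5, 0.1])       # only 1 token generated so far
```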
Common runtime interactions:

- Backend init: `llama_backend_init`, `llama_numa_init`
- Model/context: `llama_model_load_from_file`, `llama_init_from_model`, `llama_free`, `llama_model_free`
- Tokenization: `llama_tokenize`, `llama_token_to_piece`
- Inference: `llama_decode`, `llama_get_logits`
- Sampling: `llama_sampler_chain_init`, `llama_sampler_init_*` (top-k/top-p/temp/mirostat/grammar/penalties/etc.), `llama_sampler_sample`
- Embeddings: `llama_get_embeddings`, `llama_get_embeddings_seq`
- KV cache/state: `llama_memory_seq_*`, `llama_get_state_size`, `llama_copy_state_data`, `llama_set_state_data`
Basic chat completion:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    n_threads=8,
    chat_format="chatml",
)
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "Explain KV cache in one paragraph."},
    ],
    max_tokens=200,
    temperature=0.2,
)
print(resp["choices"][0]["message"]["content"])
```

Streaming text completion:

```python
from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", n_ctx=4096)
for chunk in llm.create_completion(
    prompt="Write a haiku about compilers.",
    stream=True,
    max_tokens=64,
    temperature=0.8,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

State save/restore:

```python
from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", n_ctx=4096)
_ = llm("The quick brown fox ", max_tokens=8)
state = llm.save_state()

# Later: restore and continue
llm.load_state(state)
out = llm("", max_tokens=16)
print(out["choices"][0]["text"])
```

Running the OpenAI-compatible server programmatically:

```python
import uvicorn

from llama_cpp.server.app import create_app
from llama_cpp.server.settings import ServerSettings, ModelSettings

app = create_app(
    server_settings=ServerSettings(host="0.0.0.0", port=8000, api_key="sk-local"),
    model_settings=[
        ModelSettings(
            model="models/model.gguf",
            model_alias="gpt-3.5-turbo",
            n_ctx=4096,
            n_gpu_layers=-1,
            chat_format="chatml",
        )
    ],
)
uvicorn.run(app, host="0.0.0.0", port=8000)
```

Calling the server with the official OpenAI client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in JSON"}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```

config.json:

```json
{
  "host": "0.0.0.0",
  "port": 8000,
  "interrupt_requests": true,
  "models": [
    {
      "model": "models/chat.gguf",
      "model_alias": "gpt-3.5-turbo",
      "chat_format": "chatml",
      "n_ctx": 4096,
      "n_gpu_layers": -1
    },
    {
      "model": "models/code.gguf",
      "model_alias": "copilot-codex",
      "n_ctx": 16192,
      "n_gpu_layers": -1
    }
  ]
}
```

Run:

```shell
python -m llama_cpp.server --config_file config.json
```

Switch per request by setting the `model` field to one of the aliases.
- Single active model in the server process: a model switch unloads the current model and loads the target.
- Threading defaults: `n_threads` defaults to half the CPU cores, `n_threads_batch` to all CPU cores.
- `logprobs` requires `logits_all=True` at model creation.
- Embeddings require `embedding=True` at model creation.
- If LoRA is enabled, `use_mmap` is forced off in the current implementation path.
- `interrupt_requests=True` can improve responsiveness under competing streaming requests.
From the repository Makefile:

- CPU/OpenBLAS build: `make build.openblas`
- CUDA build: `make build.cuda`
- Server run: `MODEL=/path/to/model.gguf make run-server`

These targets ultimately install/build the Python package with specific `CMAKE_ARGS`, which configure embedded llama.cpp backend support.
- Low-level bindings: `llama_cpp/llama_cpp.py`
- Native wrappers: `llama_cpp/_internals.py`
- High-level API: `llama_cpp/llama.py`
- Chat formats/handlers: `llama_cpp/llama_chat_format.py`
- Server app: `llama_cpp/server/app.py`
- Model routing/loading: `llama_cpp/server/model.py`
- Server/model settings: `llama_cpp/server/settings.py`
- Server CLI parsing: `llama_cpp/server/cli.py`
- Server entrypoint: `llama_cpp/server/__main__.py`
- Server request/response schemas: `llama_cpp/server/types.py`
This guide is intentionally implementation-oriented so you can reason about real control flow and extend behavior safely.