This document explains how this project interacts with and controls llama.cpp, including:
- Low-level C API bindings and call flow
- High-level Python inference APIs
- OpenAI-compatible server implementation
- Programmatic and operational server control with Python
- Practical code snippets for common management tasks
The repository has four integration layers:

- `llama_cpp/llama_cpp.py`
  - Loads the `llama` shared library (`.dll`, `.so`, `.dylib`) with `ctypes`
  - Exposes raw llama.cpp symbols and constants as Python-callable functions
- `llama_cpp/_internals.py`
  - Resource-safe wrappers for native handles: `LlamaModel` (`llama_model *`), `LlamaContext` (`llama_context *`), `LlamaBatch` (`llama_batch`), `LlamaSampler` (sampler chain)
- `llama_cpp/llama.py`
  - Public high-level API (the `Llama` class)
  - Implements tokenization, eval/decode, sampling, generation, embeddings, chat handling, state save/restore, cache integration
- `llama_cpp/server/*`
  - FastAPI OpenAI-compatible server
  - Model lifecycle and routing (`LlamaProxy`)
  - Streaming (SSE), authentication, multi-model config
The bridge into llama.cpp starts in `llama_cpp/llama_cpp.py`:

```python
_lib_base_name = "llama"
_override_base_path = os.environ.get("LLAMA_CPP_LIB_PATH")
_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / "lib" if _override_base_path is None else pathlib.Path(_override_base_path)
_lib = load_shared_library(_lib_base_name, _base_path)
ctypes_function = ctypes_function_for_shared_library(_lib)
```

Key control point:
- Set `LLAMA_CPP_LIB_PATH` to force loading a custom `llama` build.
After load, llama_cpp.py defines hundreds of C signatures/constants (llama_*, LLAMA_*) used by all higher layers.
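For example, a custom build can be selected by setting the variable before the first import of `llama_cpp` (the directory below is an illustrative path, not a real one):

```python
import os

# LLAMA_CPP_LIB_PATH must be set before llama_cpp is first imported,
# because the shared library is resolved at module import time.
# The directory below is an example path to a custom llama.cpp build.
os.environ["LLAMA_CPP_LIB_PATH"] = "/opt/llama.cpp/build/bin"

# import llama_cpp  # would now resolve libllama from the directory above
```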
`llama_cpp/_internals.py` wraps native objects with deterministic cleanup via `ExitStack`.

Creation:
- `llama_model_load_from_file()`
- `llama_model_get_vocab()`

Cleanup:
- `llama_model_free()`

Important metadata/introspection calls:
- `llama_model_n_ctx_train`, `llama_model_n_embd`, `llama_model_desc`, `llama_model_meta_*`
Creation:
- `llama_init_from_model()`

Core runtime calls:
- `llama_decode()` for the token forward pass
- `llama_encode()` in embedding flows
- `llama_get_logits()`, `llama_get_embeddings()`, `llama_get_embeddings_seq()`

KV cache controls:
- `llama_memory_clear`, `llama_memory_seq_rm`, `llama_memory_seq_cp`, `llama_memory_seq_keep`, `llama_memory_seq_add`
Creation/free:
- `llama_batch_init()` / `llama_batch_free()`
Used to populate token ids, positions, sequence ids, and logits flags for decode passes.
Builds a llama.cpp sampler pipeline by chaining:
- penalties
- top-k/top-p/min-p/typical
- temp/greedy/distribution
- mirostat
- grammar and custom samplers
This is how generation behavior is controlled at runtime.
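A toy pure-Python analogue of such a chain may help fix intuitions (illustrative only: the real pipeline is built from llama.cpp sampler objects, and the ordering and details here are simplified):

```python
import math
import random

def sample_next_token(logits, top_k=40, top_p=0.95, temperature=0.8, rng=None):
    """Toy sampler chain: top-k filter, then top-p (nucleus) filter,
    then temperature-weighted sampling over the survivors."""
    rng = rng or random.Random(0)
    # Keep the top_k highest-logit candidates.
    ranked = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max logit for stability).
    m = max(logit for _, logit in ranked)
    exps = [(tok, math.exp(logit - m)) for tok, logit in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # Temperature: p ** (1/T) equals softmax(logit / T) up to normalization.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    r = rng.random() * sum(weights)
    acc = 0.0
    for (tok, _), w in zip(kept, weights):
        acc += w
        if acc >= r:
            return tok
    return kept[-1][0]
```

With a strongly peaked distribution the filters collapse onto the dominant token, which is why low-entropy logits give near-deterministic output regardless of seed.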
The public API is `llama_cpp.Llama` in `llama_cpp/llama.py`.

`Llama.__init__` performs these major steps:

- Initialize the backend once per process:

  ```python
  if not Llama.__backend_initialized:
      llama_cpp.llama_backend_init()
  ```

- Configure model params (`llama_model_default_params`) and context params (`llama_context_default_params`)
- Load model/context/batch wrappers: `internals.LlamaModel(...)`, `internals.LlamaContext(...)`, `internals.LlamaBatch(...)`
- Optionally attach:
  - LoRA adapter (`llama_adapter_lora_init`, `llama_set_adapter_lora`)
  - custom tokenizer
  - draft model for speculative decoding
- Build chat handler map from GGUF metadata (`tokenizer.chat_template.*`) or fallback chat formats
- Model placement/perf: `n_gpu_layers`, `split_mode`, `main_gpu`, `tensor_split`, `offload_kqv`, `flash_attn`
- Context/perf: `n_ctx`, `n_batch`, `n_threads`, `n_threads_batch`
- RoPE/YaRN scaling: `rope_scaling_type`, `rope_freq_base`, `rope_freq_scale`, `yarn_*`
- Memory behavior: `use_mmap`, `use_mlock`
- KV quantization: `type_k`, `type_v`
- Chat behavior: `chat_format`, `chat_handler`
- Grammar and constrained decoding: `grammar` at request time
At a high level:
- Tokenize prompt
- Evaluate prompt with `eval()`
- Sample next token with configured sampler chain
- Repeat until stop/eos/max-tokens/stopping criteria
- Detokenize and return OpenAI-style response
Core generation loop comes from `generate()` + `sample()` + `eval()`.

`eval()`:
- Prunes the KV tail for overwrite scenarios: `self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)`
- Fills `LlamaBatch`
- Calls `self._ctx.decode(self._batch)` (the llama.cpp forward pass)
- Copies logits into NumPy buffers when needed

`sample()`:
- Builds sampler chain with penalties + chosen strategy
- Calls `llama_sampler_sample(...)`

`generate()`:
- Adds stop handling, UTF-8-safe chunking, optional logprobs
- Supports streaming and non-streaming
- Supports cache lookup/save with `LlamaState`
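Stripped of batching, logits handling, and caching, the control flow of that loop can be sketched as follows (the function names and signatures here are illustrative, not the real API):

```python
def generate(tokens, eval_fn, sample_fn, eos_token, max_tokens=32, stop_tokens=()):
    """Skeleton of a generate() loop: evaluate pending tokens, sample,
    append, and stop on EOS / stop tokens / token budget."""
    produced = []
    pending = list(tokens)          # prompt tokens not yet evaluated
    while len(produced) < max_tokens:
        eval_fn(pending)            # forward pass over pending tokens
        tok = sample_fn()           # next token from the sampler chain
        if tok == eos_token or tok in stop_tokens:
            break
        produced.append(tok)
        pending = [tok]             # only the new token needs evaluation
    return produced

# Usage with trivial stubs: a "model" that always predicts last token + 1.
state = {"last": 0}
def eval_fn(toks):
    if toks:
        state["last"] = toks[-1]
def sample_fn():
    return state["last"] + 1

out = generate([1, 2, 3], eval_fn, sample_fn, eos_token=7)
```

The key property this preserves from the real loop: after the prompt pass, each iteration feeds exactly one new token back through the model, relying on the KV cache for everything earlier.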
`create_chat_completion()` routes through a chat handler:

```python
handler = (
    self.chat_handler
    or self._chat_handlers.get(self.chat_format)
    or llama_chat_format.get_chat_completion_handler(self.chat_format)
)
```

The selected handler:
- formats messages into a prompt
- may build JSON-schema grammar for tool calling / JSON mode
- delegates to `create_completion()`
- converts output back to OpenAI chat schema
`create_embedding()` and `embed()`:
- require a model created with `embedding=True`
- tokenize inputs
- decode in batches
- read embeddings from llama.cpp pointers
- optionally normalize
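The optional normalization amounts to scaling each vector to unit L2 norm; a minimal sketch:

```python
import math

def l2_normalize(vec):
    """Scale an embedding to unit L2 norm; a zero vector is returned
    unchanged to avoid division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0.0 else [x / norm for x in vec]
```

Unit-norm embeddings make cosine similarity a plain dot product, which is why normalization is the common default for retrieval use cases.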
`save_state()`:
- `llama_get_state_size`, `llama_copy_state_data`
- snapshots prompt state + logits/input ids + seed

`load_state()`:
- `llama_set_state_data`
- restores prompt position and internal state
This enables prompt caching and fast resume.
The server lives under `llama_cpp/server`.

CLI entrypoint: `python -m llama_cpp.server`
- implemented in `llama_cpp/server/__main__.py`
- parses `ServerSettings` + `ModelSettings`
- supports `--config_file` (JSON/YAML)
- runs `uvicorn.run(app, ...)`

Programmatic entrypoint: `llama_cpp.server.app.create_app(...)`
`llama_cpp/server/settings.py` defines:

- `ServerSettings`: `host`, `port`, TLS files, `api_key`, `interrupt_requests`, `disable_ping_events`, `root_path`
- `ModelSettings`: almost all `Llama(...)` controls (GPU, context, sampling defaults, LoRA, tokenizer, draft model, cache, etc.)

All fields can be set via CLI args or environment variables (Pydantic settings behavior).
`llama_cpp/server/model.py`:
- loads the default model on startup
- keeps only one active `Llama` in memory at a time
- routes by the `model` request field to a model alias
- on model switch:
  - closes the current model (`.close()`)
  - loads the target model via `load_llama_from_model_settings`
Key behavior:
- unknown model alias falls back to default model
- this is an unload/reload switch, not concurrent multi-model residency
In `server/app.py`, the dependency `get_llama_proxy()` uses a double-lock strategy (`llama_outer_lock`, `llama_inner_lock`) so the server can:
- serialize model access safely
- optionally interrupt long streaming responses when a new request arrives (`interrupt_requests=True`)
This is crucial because most llama.cpp context usage is not safe for unconstrained concurrent generation on one context.
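A simplified sketch of the double-lock idea (not the server's actual code; the real implementation differs in details, but the handoff pattern is the same in spirit):

```python
import threading

outer_lock = threading.Lock()
inner_lock = threading.Lock()

def use_model(generate_step, max_steps):
    """The outer lock marks 'a newer request is waiting'; the inner
    lock guards the model itself. A running generation polls
    outer_lock.locked() and stops early when someone is queued,
    approximating the interrupt_requests behavior."""
    outer_lock.acquire()
    try:
        inner_lock.acquire()   # wait until the previous request finishes
    finally:
        outer_lock.release()   # let the next request queue up behind us
    chunks = []
    try:
        for i in range(max_steps):
            if outer_lock.locked():   # a newer request is waiting: yield
                break
            chunks.append(generate_step(i))
    finally:
        inner_lock.release()
    return chunks
```

With a single caller the loop runs to completion; with a competitor queued on the outer lock, the current stream observes `outer_lock.locked()` and terminates early instead of holding the context indefinitely.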
Primary endpoints:
- `POST /v1/completions`
- `POST /v1/chat/completions`
- `POST /v1/embeddings`
- `GET /v1/models`

Extra utility endpoints:
- `POST /extras/tokenize`
- `POST /extras/tokenize/count`
- `POST /extras/detokenize`
Streaming:
- Implemented with Server-Sent Events via `EventSourceResponse`
- Chunks are emitted as `data: <json>` and terminated with `data: [DONE]`
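A minimal client-side parser for this framing (illustrative; real clients should use an SSE library and handle multi-line events, and the chunk payloads below are simplified):

```python
import json

def parse_sse_stream(lines):
    """Parse the server's SSE framing: each event line is
    'data: <json>' and the stream ends with 'data: [DONE]'."""
    for line in lines:
        if not line.startswith("data: "):
            continue                      # skip comments / keepalives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

# Example frames in the shape the server emits (payloads simplified):
frames = [
    'data: {"choices": [{"text": "Hel"}]}',
    'data: {"choices": [{"text": "lo"}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["text"] for c in parse_sse_stream(frames))
```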
Auth:
- If `api_key` is set, a bearer token is required
Request-time controls in server:
- `logit_bias_type="tokens"` converts text tokens to input ids before passing to the model
- a `grammar` string is converted via `LlamaGrammar.from_string`
- `min_tokens` is converted to a logits processor that blocks EOS until the threshold is reached
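A sketch of what such a min-tokens processor amounts to (the factory name is hypothetical; the `(input_ids, scores) -> scores` shape follows the common logits-processor convention, not the exact server code):

```python
def make_min_tokens_processor(min_tokens, eos_token_id, prompt_len):
    """Until at least min_tokens completion tokens exist, force the
    EOS logit to -inf so the sampler cannot select it."""
    def processor(input_ids, scores):
        generated = len(input_ids) - prompt_len
        if generated < min_tokens:
            scores = list(scores)             # don't mutate the caller's buffer
            scores[eos_token_id] = float("-inf")
        return scores
    return processor

# Usage: with a 3-token prompt, EOS (id 0 here) is masked until two
# completion tokens have been produced.
proc = make_min_tokens_processor(min_tokens=2, eos_token_id=0, prompt_len=3)
masked = proc([1, 2, 3, 7], [0.5, 0.1])       # only 1 token generated so far
```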
Common runtime interactions:

- Backend init: `llama_backend_init`, `llama_numa_init`
- Model/context: `llama_model_load_from_file`, `llama_init_from_model`, `llama_free`, `llama_model_free`
- Tokenization: `llama_tokenize`, `llama_token_to_piece`
- Inference: `llama_decode`, `llama_get_logits`
- Sampling: `llama_sampler_chain_init`, `llama_sampler_init_*` (top-k/top-p/temp/mirostat/grammar/penalties/etc.), `llama_sampler_sample`
- Embeddings: `llama_get_embeddings`, `llama_get_embeddings_seq`
- KV cache/state: `llama_memory_seq_*`, `llama_get_state_size`, `llama_copy_state_data`, `llama_set_state_data`
Basic chat completion:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    n_threads=8,
    chat_format="chatml",
)
resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "Explain KV cache in one paragraph."},
    ],
    max_tokens=200,
    temperature=0.2,
)
print(resp["choices"][0]["message"]["content"])
```

Streaming text completion:

```python
from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", n_ctx=4096)
for chunk in llm.create_completion(
    prompt="Write a haiku about compilers.",
    stream=True,
    max_tokens=64,
    temperature=0.8,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```

State save/restore:

```python
from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", n_ctx=4096)
_ = llm("The quick brown fox ", max_tokens=8)
state = llm.save_state()

# Later: restore and continue
llm.load_state(state)
out = llm("", max_tokens=16)
print(out["choices"][0]["text"])
```

Running the OpenAI-compatible server programmatically:

```python
import uvicorn

from llama_cpp.server.app import create_app
from llama_cpp.server.settings import ServerSettings, ModelSettings

app = create_app(
    server_settings=ServerSettings(host="0.0.0.0", port=8000, api_key="sk-local"),
    model_settings=[
        ModelSettings(
            model="models/model.gguf",
            model_alias="gpt-3.5-turbo",
            n_ctx=4096,
            n_gpu_layers=-1,
            chat_format="chatml",
        )
    ],
)
uvicorn.run(app, host="0.0.0.0", port=8000)
```

Calling the server with the official OpenAI client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in JSON"}],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```

config.json:

```json
{
  "host": "0.0.0.0",
  "port": 8000,
  "interrupt_requests": true,
  "models": [
    {
      "model": "models/chat.gguf",
      "model_alias": "gpt-3.5-turbo",
      "chat_format": "chatml",
      "n_ctx": 4096,
      "n_gpu_layers": -1
    },
    {
      "model": "models/code.gguf",
      "model_alias": "copilot-codex",
      "n_ctx": 16192,
      "n_gpu_layers": -1
    }
  ]
}
```

Run:

```shell
python -m llama_cpp.server --config_file config.json
```

Switch per request by setting the `model` field to one of the aliases.
- Single active model in the server process: a model switch unloads the current model and loads the target.
- Threading defaults: `n_threads` defaults to half the CPU cores, `n_threads_batch` to all CPU cores.
- `logprobs` requires `logits_all=True` at model creation.
- Embeddings require `embedding=True` at model creation.
- If LoRA is enabled, `use_mmap` is forced off in the current implementation path.
- `interrupt_requests=True` can improve responsiveness under competing streaming requests.
From the repository Makefile:

- CPU/OpenBLAS build: `make build.openblas`
- CUDA build: `make build.cuda`
- Server run: `MODEL=/path/to/model.gguf make run-server`

These targets ultimately install/build the Python package with specific `CMAKE_ARGS`, which configure embedded llama.cpp backend support.
- Low-level bindings: `llama_cpp/llama_cpp.py`
- Native wrappers: `llama_cpp/_internals.py`
- High-level API: `llama_cpp/llama.py`
- Chat formats/handlers: `llama_cpp/llama_chat_format.py`
- Server app: `llama_cpp/server/app.py`
- Model routing/loading: `llama_cpp/server/model.py`
- Server/model settings: `llama_cpp/server/settings.py`
- Server CLI parsing: `llama_cpp/server/cli.py`
- Server entrypoint: `llama_cpp/server/__main__.py`
- Server request/response schemas: `llama_cpp/server/types.py`
This guide is intentionally implementation-oriented so you can reason about real control flow and extend behavior safely.