llama-cpp-python and llama.cpp Integration Guide

This document explains how this project interacts with and controls llama.cpp, including:

  • Low-level C API bindings and call flow
  • High-level Python inference APIs
  • OpenAI-compatible server implementation
  • Programmatic and operational server control with Python
  • Practical code snippets for common management tasks

1. Architecture Overview

The repository has four integration layers:

  1. llama_cpp/llama_cpp.py
  • Loads the llama shared library (.dll, .so, .dylib) with ctypes
  • Exposes raw llama.cpp symbols and constants as Python-callable functions
  2. llama_cpp/_internals.py
  • Resource-safe wrappers for native handles:
    • LlamaModel (llama_model*)
    • LlamaContext (llama_context*)
    • LlamaBatch (llama_batch)
    • LlamaSampler (sampler chain)
  3. llama_cpp/llama.py
  • Public high-level API (Llama class)
  • Implements tokenization, eval/decode, sampling, generation, embeddings, chat handling, state save/restore, and cache integration
  4. llama_cpp/server/*
  • FastAPI OpenAI-compatible server
  • Model lifecycle and routing (LlamaProxy)
  • Streaming (SSE), authentication, multi-model config

2. How the C Library is Loaded

The bridge into llama.cpp starts in llama_cpp/llama_cpp.py.

Shared library resolution

_lib_base_name = "llama"
_override_base_path = os.environ.get("LLAMA_CPP_LIB_PATH")
_base_path = pathlib.Path(os.path.abspath(os.path.dirname(__file__))) / "lib" if _override_base_path is None else pathlib.Path(_override_base_path)
_lib = load_shared_library(_lib_base_name, _base_path)
ctypes_function = ctypes_function_for_shared_library(_lib)

Key control point:

  • Set LLAMA_CPP_LIB_PATH to force loading a custom llama build.

After load, llama_cpp.py defines hundreds of C signatures/constants (llama_*, LLAMA_*) used by all higher layers.
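For reference, the resolution rule can be restated as a tiny standalone sketch (resolve_lib_dir is our name for illustration, not a function the library exports):

```python
import os
import pathlib
from typing import Optional

def resolve_lib_dir(package_dir: str, override: Optional[str]) -> pathlib.Path:
    """Mirror of the lookup above: an explicit override (the value of the
    LLAMA_CPP_LIB_PATH environment variable) wins; otherwise the bundled
    <package>/lib directory is used."""
    if override is not None:
        return pathlib.Path(override)
    return pathlib.Path(os.path.abspath(package_dir)) / "lib"

# Default: the "lib" directory shipped inside the installed package.
print(resolve_lib_dir("/site-packages/llama_cpp", None))
# Override: point at a custom llama.cpp build instead.
print(resolve_lib_dir("/site-packages/llama_cpp", "/opt/llama.cpp/build/bin"))
```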

3. Native Handle Lifecycle and Safety

llama_cpp/_internals.py wraps native objects with deterministic cleanup via ExitStack.

Model lifecycle (LlamaModel)

Creation:

  • llama_model_load_from_file()
  • llama_model_get_vocab()

Cleanup:

  • llama_model_free()

Important metadata/introspection calls:

  • llama_model_n_ctx_train, llama_model_n_embd, llama_model_desc, llama_model_meta_*

Context lifecycle (LlamaContext)

Creation:

  • llama_init_from_model()

Core runtime calls:

  • llama_decode() for token forward pass
  • llama_encode() in embedding flows
  • llama_get_logits(), llama_get_embeddings(), llama_get_embeddings_seq()

KV cache controls:

  • llama_memory_clear
  • llama_memory_seq_rm, llama_memory_seq_cp, llama_memory_seq_keep, llama_memory_seq_add

Batch lifecycle (LlamaBatch)

Creation/free:

  • llama_batch_init() / llama_batch_free()

Used to populate token ids, positions, sequence ids, and logits flags for decode passes.

Sampler chain (LlamaSampler)

Builds a llama.cpp sampler pipeline by chaining:

  • penalties
  • top-k/top-p/min-p/typical
  • temp/greedy/distribution
  • mirostat
  • grammar and custom samplers

This is how generation behavior is controlled at runtime.
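The pipeline idea can be sketched in plain Python (an illustration of the chaining concept, not the real llama.cpp sampler API): each stage filters or reshapes the candidate distribution before the final draw.

```python
import math
import random

def top_k(logits, k):
    """Keep the k highest logits; mask the rest to -inf."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]

def temperature(logits, t):
    """Scale logits; t < 1 sharpens the distribution, t > 1 flattens it."""
    return [x / t for x in logits]

def sample(logits, rng):
    """Softmax over the surviving candidates, then draw one token id."""
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]
    total = sum(probs)
    return rng.choices(range(len(logits)), weights=[p / total for p in probs], k=1)[0]

# Compose the stages the way a llama.cpp sampler chain composes them.
logits = [0.1, 2.0, 1.5, -1.0, 0.3]
token = sample(temperature(top_k(logits, 2), 0.7), random.Random(0))
print(token)  # only token ids 1 or 2 survive top-k
```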

4. High-Level Llama Control Flow

The public API is llama_cpp.Llama in llama_cpp/llama.py.

4.1 Initialization

Llama.__init__ performs these major steps:

  1. Initialize backend once per process:
if not Llama.__backend_initialized:
    llama_cpp.llama_backend_init()
  2. Configure model params (llama_model_default_params) and context params (llama_context_default_params)
  3. Load model/context/batch wrappers:
  • internals.LlamaModel(...)
  • internals.LlamaContext(...)
  • internals.LlamaBatch(...)
  4. Optionally attach:
  • LoRA adapter (llama_adapter_lora_init, llama_set_adapter_lora)
  • custom tokenizer
  • draft model for speculative decoding
  5. Build chat handler map from GGUF metadata (tokenizer.chat_template.*) or fallback chat formats

Important init controls

  • Model placement/perf: n_gpu_layers, split_mode, main_gpu, tensor_split, offload_kqv, flash_attn
  • Context/perf: n_ctx, n_batch, n_threads, n_threads_batch
  • Rope/YaRN scaling: rope_scaling_type, rope_freq_base, rope_freq_scale, yarn_*
  • Memory behavior: use_mmap, use_mlock
  • KV quantization: type_k, type_v
  • Chat behavior: chat_format, chat_handler
  • Grammar and constrained decode: grammar at request time

4.2 Inference loop (create_completion)

At a high level:

  1. Tokenize prompt
  2. Evaluate prompt with eval()
  3. Sample next token with configured sampler chain
  4. Repeat until stop/eos/max-tokens/stopping criteria
  5. Detokenize and return OpenAI-style response

Core generation loop comes from generate() + sample() + eval().
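The loop can be reduced to a plain-Python skeleton (decode_step and sample_token are stand-ins for eval() and sample(), not real API names):

```python
def generate_sketch(prompt_ids, decode_step, sample_token, eos_id, max_tokens):
    """Skeleton of the create_completion loop: feed the prompt once,
    then alternate sample -> feed-back -> sample until a stop condition."""
    out = []
    logits = decode_step(prompt_ids)      # eval(): prompt forward pass
    for _ in range(max_tokens):
        tok = sample_token(logits)        # sample(): sampler-chain pick
        if tok == eos_id:                 # stop on end-of-sequence
            break
        out.append(tok)
        logits = decode_step([tok])       # eval(): single-token step
    return out

# Toy "model": always predicts (last token + 1); token 5 acts as EOS.
def decode_step(ids):
    return ids[-1] + 1                    # "logits" collapsed to an argmax id

def sample_token(logits):
    return logits                         # greedy

print(generate_sketch([0, 1], decode_step, sample_token, eos_id=5, max_tokens=10))
# [2, 3, 4]
```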

eval()

  • Prunes KV tail for overwrite scenarios:
    • self._ctx.kv_cache_seq_rm(-1, self.n_tokens, -1)
  • Fills LlamaBatch
  • Calls self._ctx.decode(self._batch) (llama.cpp forward pass)
  • Copies logits into NumPy buffers when needed

sample()

  • Builds sampler chain with penalties + chosen strategy
  • Calls llama_sampler_sample(...)

create_completion()

  • Adds stop handling, UTF-8 safe chunking, optional logprobs
  • Supports streaming and non-streaming
  • Supports cache lookup/save with LlamaState

4.3 Chat completions

create_chat_completion() routes through a chat handler:

handler = (
    self.chat_handler
    or self._chat_handlers.get(self.chat_format)
    or llama_chat_format.get_chat_completion_handler(self.chat_format)
)

The selected handler:

  • formats messages into a prompt
  • may build JSON-schema grammar for tool calling / JSON mode
  • delegates to create_completion()
  • converts output back to OpenAI chat schema

4.4 Embeddings

create_embedding() and embed():

  • require model created with embedding=True
  • tokenize inputs
  • decode in batches
  • read embeddings from llama.cpp pointers
  • optional normalization
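The optional normalization at the end is plain L2 normalization; a minimal sketch:

```python
import math

def l2_normalize(vec):
    """Scale an embedding to unit length so dot products become cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```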

4.5 Save/restore state

save_state():

  • llama_get_state_size
  • llama_copy_state_data
  • snapshots prompt state + logits/input ids + seed

load_state():

  • llama_set_state_data
  • restores prompt position and internal state

This enables prompt caching and fast resume.

5. OpenAI-Compatible Server: Design and Control

The server lives under llama_cpp/server.

5.1 Server startup paths

CLI entrypoint: python -m llama_cpp.server

  • implemented in llama_cpp/server/__main__.py
  • parses ServerSettings + ModelSettings
  • supports --config_file (JSON/YAML)
  • runs uvicorn.run(app, ...)

Programmatic entrypoint: llama_cpp.server.app.create_app(...)

5.2 Server settings and model settings

llama_cpp/server/settings.py defines:

  • ServerSettings

    • host, port, TLS files
    • api_key
    • interrupt_requests
    • disable_ping_events
    • root_path
  • ModelSettings

    • almost all Llama(...) controls (GPU, context, sampling defaults, LoRA, tokenizer, draft model, cache, etc.)

All fields can be set via CLI args or environment variables (Pydantic settings behavior).

5.3 Model management (LlamaProxy)

llama_cpp/server/model.py:

  • loads default model on startup
  • keeps only one active Llama in memory at a time
  • routes by model request field to model alias
  • on model switch:
    • closes current model (.close())
    • loads target model via load_llama_from_model_settings

Key behavior:

  • unknown model alias falls back to default model
  • this is an unload/reload switch, not concurrent multi-model residency

5.4 Request concurrency and interruption

In server/app.py, dependency get_llama_proxy() uses a double-lock strategy (llama_outer_lock, llama_inner_lock) so the server can:

  • serialize model access safely
  • optionally interrupt long streaming responses when a new request arrives (interrupt_requests=True)

This is crucial because a single llama.cpp context is not safe for unconstrained concurrent generation.
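A simplified, single-threaded sketch of the idea (not the server's exact code): the outer lock doubles as an "interrupt requested" signal that the streaming loop can poll between chunks.

```python
import threading

outer = threading.Lock()   # queue position / interrupt signal
inner = threading.Lock()   # exclusive access to the Llama context

# Request A starts: take both locks, then release the outer one so a
# newcomer can announce its arrival by grabbing it.
outer.acquire()
inner.acquire()
outer.release()

# Mid-stream, request A polls the outer lock between chunks.
assert not outer.locked()  # no competing request yet: keep streaming

# Request B arrives and acquires the outer lock.
outer.acquire()
# Now A's stream sees outer.locked() == True; with interrupt_requests=True
# it stops early, releases the inner lock, and B proceeds.
assert outer.locked()
inner.release()
print("request A interrupted, context handed to request B")
```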

5.5 OpenAI-compatible endpoints

Primary endpoints:

  • POST /v1/completions
  • POST /v1/chat/completions
  • POST /v1/embeddings
  • GET /v1/models

Extra utility endpoints:

  • POST /extras/tokenize
  • POST /extras/tokenize/count
  • POST /extras/detokenize

Streaming:

  • Implemented with Server-Sent Events via EventSourceResponse
  • Chunks are emitted as data: <json> and terminated with data: [DONE]
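On the client side, that wire format is easy to parse by hand; a minimal sketch (real OpenAI SDKs do this for you):

```python
import json

def parse_sse(lines):
    """Yield decoded JSON chunks from an SSE body until the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)

body = [
    'data: {"choices": [{"text": "Hel"}]}',
    '',
    'data: {"choices": [{"text": "lo"}]}',
    '',
    'data: [DONE]',
]
print("".join(c["choices"][0]["text"] for c in parse_sse(body)))  # Hello
```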

Auth:

  • If api_key is set, bearer token is required

Request-time controls in server:

  • logit_bias_type="tokens" tokenizes string keys into token ids before the bias map is passed to the model
  • grammar strings are converted via LlamaGrammar.from_string
  • min_tokens is converted into a logits processor that blocks EOS until the threshold is reached
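The min_tokens mechanism can be sketched as a logits processor that masks EOS while the completion is still below the threshold (a simplification of the server's implementation):

```python
def min_tokens_processor(min_tokens, eos_id, prompt_len):
    """Return a processor(input_ids, logits) that forbids EOS until at
    least min_tokens completion tokens have been produced."""
    def processor(input_ids, logits):
        generated = len(input_ids) - prompt_len
        if generated < min_tokens:
            logits = list(logits)
            logits[eos_id] = float("-inf")  # EOS can never win argmax/sampling
        return logits
    return processor

proc = min_tokens_processor(min_tokens=3, eos_id=0, prompt_len=4)
early = proc([1, 2, 3, 4, 5], [9.0, 1.0])        # 1 completion token so far
late = proc([1, 2, 3, 4, 5, 6, 7], [9.0, 1.0])   # threshold reached
print(early[0], late[0])  # -inf 9.0
```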

6. API Interaction Map (Python -> llama.cpp)

Common runtime interactions:

  • Backend init:

    • llama_backend_init
    • llama_numa_init
  • Model/context:

    • llama_model_load_from_file
    • llama_init_from_model
    • llama_free, llama_model_free
  • Tokenization:

    • llama_tokenize
    • llama_token_to_piece
  • Inference:

    • llama_decode
    • llama_get_logits
  • Sampling:

    • llama_sampler_chain_init
    • llama_sampler_init_* (top-k/top-p/temp/mirostat/grammar/penalties/etc.)
    • llama_sampler_sample
  • Embeddings:

    • llama_get_embeddings
    • llama_get_embeddings_seq
  • KV cache/state:

    • llama_memory_seq_*
    • llama_get_state_size
    • llama_copy_state_data
    • llama_set_state_data

7. Practical Python Control Snippets

7.1 Direct local model control

from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    n_threads=8,
    chat_format="chatml",
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "Explain KV cache in one paragraph."},
    ],
    max_tokens=200,
    temperature=0.2,
)

print(resp["choices"][0]["message"]["content"])

7.2 Streaming tokens

from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", n_ctx=4096)

for chunk in llm.create_completion(
    prompt="Write a haiku about compilers.",
    stream=True,
    max_tokens=64,
    temperature=0.8,
):
    print(chunk["choices"][0]["text"], end="", flush=True)

7.3 Save and restore generation state

from llama_cpp import Llama

llm = Llama(model_path="models/model.gguf", n_ctx=4096)

_ = llm("The quick brown fox ", max_tokens=8)
state = llm.save_state()

# Later: restore and continue
llm.load_state(state)
out = llm("", max_tokens=16)
print(out["choices"][0]["text"])

7.4 Start OpenAI-compatible server from Python

import uvicorn
from llama_cpp.server.app import create_app
from llama_cpp.server.settings import ServerSettings, ModelSettings

app = create_app(
    server_settings=ServerSettings(host="0.0.0.0", port=8000, api_key="sk-local"),
    model_settings=[
        ModelSettings(
            model="models/model.gguf",
            model_alias="gpt-3.5-turbo",
            n_ctx=4096,
            n_gpu_layers=-1,
            chat_format="chatml",
        )
    ],
)

uvicorn.run(app, host="0.0.0.0", port=8000)

7.5 Call local server with OpenAI client

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello in JSON"}],
    response_format={"type": "json_object"},
)

print(resp.choices[0].message.content)

7.6 Multi-model config and hot switching

config.json:

{
  "host": "0.0.0.0",
  "port": 8000,
  "interrupt_requests": true,
  "models": [
    {
      "model": "models/chat.gguf",
      "model_alias": "gpt-3.5-turbo",
      "chat_format": "chatml",
      "n_ctx": 4096,
      "n_gpu_layers": -1
    },
    {
      "model": "models/code.gguf",
      "model_alias": "copilot-codex",
      "n_ctx": 16192,
      "n_gpu_layers": -1
    }
  ]
}

Run:

python -m llama_cpp.server --config_file config.json

Switch per request by setting the request's model field to one of the configured aliases.

8. Operational Notes

  • Single active model in server process: model switches unload current model and load target model.
  • Threading defaults:
    • n_threads defaults to half the available CPU cores
    • n_threads_batch defaults to all available CPU cores
  • logprobs requires logits_all=True in model creation.
  • Embeddings require embedding=True at model creation.
  • If LoRA is enabled, use_mmap is forced off in current implementation path.
  • interrupt_requests=True can improve responsiveness under competing streaming requests.

9. Build/Run Controls Relevant to llama.cpp

From repository Makefile:

  • CPU/OpenBLAS build:
make build.openblas
  • CUDA build:
make build.cuda
  • Server run:
MODEL=/path/to/model.gguf make run-server

These ultimately install/build the Python package with specific CMAKE_ARGS, which configure embedded llama.cpp backend support.

10. File Map for Further Inspection

  • Low-level bindings: llama_cpp/llama_cpp.py
  • Native wrappers: llama_cpp/_internals.py
  • High-level API: llama_cpp/llama.py
  • Chat formats/handlers: llama_cpp/llama_chat_format.py
  • Server app: llama_cpp/server/app.py
  • Model routing/loading: llama_cpp/server/model.py
  • Server/model settings: llama_cpp/server/settings.py
  • Server CLI parsing: llama_cpp/server/cli.py
  • Server entrypoint: llama_cpp/server/__main__.py
  • Server request/response schemas: llama_cpp/server/types.py

This guide is intentionally implementation-oriented so you can reason about real control flow and extend behavior safely.