Grammar-constrained Pydantic AI agents that think smarter and respond faster.
> *And I'm thinking / While I'm thinking...* ("Crackerman", Stone Temple Pilots, 1992)
Reasoning models spend a lot of tokens on their reasoning output. This is resource-intensive without necessarily improving accuracy, so it may be desirable to cap the tokens used. Doing so can:
- Improve response speed
- Decrease GPU memory requirements
- Free up context space for content that matters
Requirements:
- Pydantic AI
- The model must be served with vLLM >= 0.13
- vLLM must be started without a reasoning parser
The `set_guide()` method accepts a sequence of elements that control model output. These primitives are reusable and composable; sequence them in any combination to shape model output.
Note: The model will follow whatever guide you provide, but pydantic-ai may not handle all combinations correctly (e.g., tool calls before think blocks). Use primitives outside the tested patterns at your own risk.
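For example, given an initialized `CRAgent` (see the quickstart below), a minimal guide might pin the opening of the reasoning and then hand control back to the model. The primitives are documented below; this particular sequence is illustrative.

```python
from cragents import Anchor, Free, Think, UseTools

# Illustrative guide: a fixed opening phrase, free-form reasoning,
# then a grammar-constrained tool call.
await agent.set_guide([
    Think([Anchor("Plan: "), Free()]),
    UseTools(),
])
```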
Force the model to generate exact text.
```python
Anchor(text: str)
```
- `text` - The exact text the model must generate
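For example, an anchor can pin the opening words of a response (the text is illustrative):

```python
from cragents import Anchor

# The model must emit exactly this text before anything else.
Anchor("I think ")
```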
Limit text expansion. Think of a text block that expands vertically through newlines and horizontally through all other characters.
```python
Constrain(
    max_newlines: int,
    max_char_captures: int,
    chars_to_capture: str = "."
)
```
- `max_newlines` - Upper bound on newlines (vertical expansion)
- `max_char_captures` - Upper bound on capture characters (horizontal expansion)
- `chars_to_capture` - Characters to count for horizontal limiting (default: `"."`)
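As a sketch, the following allows roughly one sentence on a single line by counting sentence-ending characters instead of the default `"."` (the same parameters appear in the quickstart below):

```python
from cragents import Constrain

# Stop after at most 1 newline (vertical) or 1 of ".", "?", "!"
# (horizontal): roughly one sentence on one line.
Constrain(max_newlines=1, max_char_captures=1, chars_to_capture=".?!")
```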
Allow unconstrained generation.
```python
Free()
```
Warning: The model decides when to stop, which may be never.
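If you want room for the model to elaborate but still need a hard ceiling, `Constrain` is the bounded alternative (the limits here are illustrative):

```python
from cragents import Constrain, Free

Free()                                           # open-ended: the model picks the stop point
Constrain(max_newlines=4, max_char_captures=10)  # bounded: output is capped at the limits
```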
Force tool call generation.
```python
UseTools(
    json_schema: dict | None = None,
    tool_name_regex: str = "/[a-zA-Z0-9_]+/",
    tool_names: list[str] | None = None,
    start_token: str = "<tool_call>",
    stop_token: str = "</tool_call>"
)
```
- `json_schema` - Schema for allowed tool calls (auto-built from agent config if `None`)
- `tool_name_regex` - Regex pattern for valid tool names
- `tool_names` - Explicit list of allowed tool names
- `start_token` - Token generated before tool calls
- `stop_token` - Token generated after tool calls
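For example, to restrict generation to an explicit allow-list of tools rather than everything registered on the agent (the tool names are hypothetical):

```python
from cragents import UseTools

# Only these tools may be called; argument schemas are presumably
# still auto-built from the agent configuration (json_schema=None).
UseTools(tool_names=["lookup_user", "send_email"])
```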
Wrap a sequence of primitives in reasoning tokens.
```python
Think(
    sequence: Sequence[Anchor | Constrain | Free],
    start_token: str = "<think>",
    stop_token: str = "</think>"
)
```
- `sequence` - Primitives that control the reasoning output
- `start_token` - Token generated before the sequence
- `stop_token` - Token generated after the sequence
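Models that use different reasoning delimiters can be accommodated through the token parameters. A sketch; the tokens shown are illustrative and must match your model's chat template:

```python
from cragents import Anchor, Free, Think

# Reasoning wrapped in custom delimiters (illustrative tokens).
Think(
    [Anchor("First, "), Free()],
    start_token="<reasoning>",
    stop_token="</reasoning>",
)
```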
Guide model output with a composable generation sequence.
- Start vLLM without a reasoning parser.

  ```bash
  vllm serve $VLLM_MODEL_NAME --gpu-memory-utilization 0.92 --api-key $VLLM_API_KEY --enable-auto-tool-choice --tool-call-parser hermes --max-model-len auto
  ```
- Pass the `vllm_model_profile` to a Pydantic AI `OpenAIChatModel`.

  ```python
  import os

  from cragents import vllm_model_profile
  from pydantic_ai.models.openai import OpenAIChatModel
  from pydantic_ai.providers.openai import OpenAIProvider

  model = OpenAIChatModel(
      model_name=os.environ["VLLM_MODEL_NAME"],
      provider=OpenAIProvider(
          api_key=os.environ["VLLM_API_KEY"],
          base_url=os.environ["VLLM_BASE_URL"],
      ),
      profile=vllm_model_profile,
  )
  ```
- Using the model, initialize a `CRAgent` the same as you would a Pydantic AI agent.

  ```python
  from cragents import CRAgent
  from pydantic_ai import ToolOutput

  agent = CRAgent(model, output_type=[ToolOutput(bool), ToolOutput(int)])
  ```
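  Since a `CRAgent` is initialized like a Pydantic AI agent, tool registration should carry over as well. A sketch, assuming Pydantic AI's usual `@agent.tool_plain` decorator works unchanged on a `CRAgent` (the tool itself is hypothetical):

  ```python
  @agent.tool_plain
  def is_even(n: int) -> bool:
      """Hypothetical tool: report whether a number is even."""
      return n % 2 == 0
  ```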
- Define a generation sequence to guide model output.

  ```python
  from cragents import Anchor, Constrain, Free, Think, UseTools

  generation_sequence = [
      Think(
          [
              Anchor("I think "),
              Constrain(max_newlines=1, max_char_captures=1, chars_to_capture=".?!"),
              Anchor("So I should "),
              Free(),
          ]
      ),
      UseTools(),
  ]

  await agent.set_guide(generation_sequence)
  ```

  Note: You can change the guide at any time by setting it again.
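  For example, you could drop to a terser guide between runs. A sketch, assuming a guide without a `Think` block is valid:

  ```python
  # Skip reasoning entirely and go straight to a tool call
  # (that a Think-less guide is supported is an assumption).
  await agent.set_guide([UseTools()])
  ```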
- Use the agent as you normally would use a Pydantic AI agent.

  ```python
  run = await agent.run("Hi")
  ```

  Inspecting `ThinkingPart`s should confirm that the output is constrained.

  ```python
  from pydantic_ai.messages import ThinkingPart

  for message in run.all_messages():
      for part in message.parts:
          if isinstance(part, ThinkingPart):
              print(part.content)
  ```