# TensorZero Evaluations Guide

This directory contains the code for the TensorZero Evaluations Guide.
We provide a configuration file (`./config/tensorzero.toml`) that specifies:

- A `write_haiku` function that generates a haiku, with `gpt_4o` and `gpt_4o_mini` variants.
- Evaluators for the `write_haiku` function, including exact match and assorted LLM judges.
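As a rough sketch, a function with two variants might be laid out along these lines (the model names and field layout below are illustrative assumptions; the actual `./config/tensorzero.toml` is authoritative):

```toml
# Hypothetical sketch — see ./config/tensorzero.toml for the real configuration
[functions.write_haiku]
type = "chat"

[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"

[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
```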
## Getting Started

- Install Docker.
- Install Python 3.10+.
- Install the Python dependencies. We recommend using `uv`: `uv sync`
- Generate an API key for OpenAI (`OPENAI_API_KEY`).
- Create a `.env` file with the `OPENAI_API_KEY` environment variable (see `.env.example` for an example).
- Run `docker compose up` to launch the TensorZero Gateway, the TensorZero UI, and a development ClickHouse database.
- Run the `main.py` script to generate 100 haikus.
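A script like `main.py` generates haikus by calling the TensorZero Gateway's HTTP inference endpoint. The sketch below is a minimal illustration, not the actual script: the `topic` argument name and the exact input/response shapes are assumptions based on the gateway's chat inference API, so consult `main.py` for the real payload.

```python
import json
import urllib.request

# Default TensorZero Gateway inference endpoint (assumed port)
GATEWAY_URL = "http://localhost:3000/inference"


def build_request(topic: str) -> dict:
    # Build an inference request for the write_haiku function.
    # The "topic" argument is a hypothetical example; the real input
    # schema is defined by the function's templates in ./config.
    return {
        "function_name": "write_haiku",
        "input": {
            "messages": [
                {
                    "role": "user",
                    "content": [{"type": "text", "arguments": {"topic": topic}}],
                }
            ]
        },
    }


def generate_haiku(topic: str) -> str:
    # POST the payload to the gateway and pull the text out of the response.
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(build_request(topic)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["content"][0]["text"]
```

Every call is recorded in ClickHouse by the gateway, which is what makes the dataset-building step below possible.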
## Build a Dataset

Let's generate a dataset composed of our 100 haikus.

- Open the UI, navigate to "Datasets", and select "Build Dataset" (`http://localhost:4000/datasets/builder`).
- Create a new dataset called `haiku_dataset`. Select your `write_haiku` function, "None" as the metric, and "Inference" as the dataset output.
## Run an Evaluation with the CLI

Let's evaluate our `gpt_4o` variant using the TensorZero Evaluations CLI tool.

- Launch an evaluation with the CLI:

```bash
docker compose run --rm evaluations \
    --function-name write_haiku \
    --evaluator-names valid_haiku,metaphor_count,exact_match,compare_haikus \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5
```

## Run an Evaluation with the UI

Let's evaluate our `gpt_4o_mini` variant using the TensorZero Evaluations UI, and compare the results.
- Navigate to "Evaluations" (`http://localhost:4000/evaluations`) and select "New Run".
- Launch an evaluation with the `gpt_4o_mini` variant.
- Select the previous evaluation run in the dropdown to compare the results.