
TensorZero Evaluations Tutorial

This guide shows how to define and run evaluations for your TensorZero functions.

Status Quo

Imagine we have a TensorZero function for writing haikus about a given topic, and we want to compare the behavior of GPT-4o and GPT-4o Mini on this task.

Initially, our configuration for this function might look like:

[functions.write_haiku]
type = "chat"
user_schema = "functions/write_haiku/user_schema.json"
[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
user_template = "functions/write_haiku/user_template.minijinja"
[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"
user_template = "functions/write_haiku/user_template.minijinja"
User Schema & Template
functions/write_haiku/user_schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "topic": {
      "type": "string"
    }
  },
  "required": ["topic"],
  "additionalProperties": false
}
functions/write_haiku/user_template.minijinja
Write a haiku about: {{ topic }}

How can we evaluate the behavior of our two variants in a principled way?

One option is to build a dataset of “test cases” that we can evaluate them against.

Datasets

To use TensorZero Evaluations, you first need to build a dataset.

A dataset is a collection of datapoints. Each datapoint has an input and, optionally, an output. In the context of evaluations, the output in the dataset should be a reference output, i.e. the output you’d have liked to see. You don’t necessarily need to provide a reference output: some evaluators (e.g. LLM judges) can score generated outputs without one, and evaluators that require a reference output simply skip datapoints that don’t have it.
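For intuition, here’s a rough sketch of what a single datapoint for write_haiku represents. This is purely illustrative: the field names and haiku below are assumptions, and the actual datapoint format is managed by TensorZero when you build a dataset.

# Illustrative only: a datapoint pairs a function input with an optional
# reference output; this is not TensorZero's exact storage schema.
datapoint = {
    "input": {"topic": "autumn"},  # arguments matching the function's user_schema
    "output": "Crisp leaves spiral down\nCool light on the quiet path\nThe year exhales slow",  # reference haiku (optional)
}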

Let’s create a dataset:

  1. Generate many haikus. (On GitHub, we provide a script main.py that generates 100 haikus with write_haiku; see the sketch after this list.)
  2. Open the UI, navigate to “Datasets”, and select “Build Dataset” (http://localhost:4000/datasets/builder).
  3. Create a new dataset called haiku_dataset. Select your write_haiku function, “None” as the metric, and “Inference” as the dataset output.
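
For step 1, here’s a minimal sketch of a generation script using the TensorZero Python client. The topics and the exact input format are illustrative assumptions; the main.py script on GitHub may differ.

# Minimal sketch: generate haikus by calling the write_haiku function
# through a TensorZero Gateway assumed to be running at localhost:3000.
from tensorzero import TensorZeroGateway

topics = ["autumn rain", "city lights", "a sleeping cat"]  # illustrative topics

with TensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    for topic in topics:
        response = client.inference(
            function_name="write_haiku",
            input={
                "messages": [
                    {
                        "role": "user",
                        # template arguments matching user_schema.json
                        "content": [{"type": "text", "arguments": {"topic": topic}}],
                    }
                ]
            },
        )
        print(response.content[0].text)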

Evaluations

Evaluations test the behavior of variants of a TensorZero function.

Let’s define an evaluation in our configuration file:

[evaluations.haiku_eval]
type = "static"
function_name = "write_haiku"

Evaluators

Each evaluation has one or more evaluators: a rule or behavior you’d like to test.

Today, TensorZero supports two types of evaluators: exact_match and llm_judge.

exact_match

The exact_match evaluator compares the generated output with the datapoint’s reference output. If they are identical, it returns true; otherwise, it returns false.

[evaluations.haiku_eval.evaluators.exact_match]
type = "exact_match"

llm_judge

LLM judges are special-purpose TensorZero functions that can be used to evaluate other TensorZero functions.

For example, our haikus should generally follow a specific format, but it’s hard to define a heuristic to determine if they’re correct. Why not ask an LLM?

Let’s do that:

[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean" # LLM judge should generate a boolean (or float)
optimize = "max" # higher is better
cutoff = 0.95 # if the variant scores <95% = bad
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
System Instructions
evaluations/haiku_eval/valid_haiku/system_instructions.txt
Evaluate if the text follows the haiku structure of exactly three lines with a 5-7-5 syllable pattern, totaling 17 syllables. Verify only this specific syllable structure of a haiku without making content assumptions.

Here, we defined an evaluator valid_haiku of type llm_judge, with a variant that uses GPT-4o Mini.

Similar to regular TensorZero functions, we can define multiple variants for an LLM judge. But unlike regular functions, only one variant can be active at a time during evaluation; you can denote that with the active property.

Example: Multiple Variants for an LLM Judge
[evaluations.haiku_eval.evaluators.valid_haiku]
type = "llm_judge"
output_type = "boolean"
optimize = "max"
cutoff = 0.95
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"
active = true
[evaluations.haiku_eval.evaluators.valid_haiku.variants.gpt_4o_judge]
type = "chat_completion"
model = "openai::gpt-4o"
system_instructions = "evaluations/haiku_eval/valid_haiku/system_instructions.txt"
json_mode = "strict"

The LLM judge we showed above generates a boolean, but LLM judges can also generate floats.

Let’s define another evaluator that counts the number of metaphors in our haikus.

[evaluations.haiku_eval.evaluators.metaphor_count]
type = "llm_judge"
output_type = "float" # LLM judge should generate a boolean (or float)
optimize = "max"
cutoff = 1 # <1 metaphor per haiku = bad
[evaluations.haiku_eval.evaluators.metaphor_count.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/metaphor_count/system_instructions.txt"
json_mode = "strict"
System Instructions
evaluations/haiku_eval/metaphor_count/system_instructions.txt
How many metaphors does the generated haiku have?

The LLM judges we’ve defined so far only look at the datapoint’s input and the generated output. But we can also provide the datapoint’s reference output to the judge:

[evaluations.haiku_eval.evaluators.compare_haikus]
type = "llm_judge"
include = { reference_output = true } # include the reference output in the LLM judge's context
output_type = "boolean"
optimize = "max"
[evaluations.haiku_eval.evaluators.compare_haikus.variants.gpt_4o_mini_judge]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_instructions = "evaluations/haiku_eval/compare_haikus/system_instructions.txt"
json_mode = "strict"
System Instructions
evaluations/haiku_eval/compare_haikus/system_instructions.txt
Does the generated haiku include the same figures of speech as the reference haiku?

Running an Evaluation

Let’s run our evaluations!

You can run evaluations using the TensorZero Evaluations CLI tool or the TensorZero UI.

CLI

To run evaluations in the CLI, you can use the tensorzero/evaluations container:

docker compose run --rm evaluations \
  --evaluation-name haiku_eval \
  --dataset-name haiku_dataset \
  --variant-name gpt_4o \
  --concurrency 5
Docker Compose

Here’s the relevant section of the docker-compose.yml for the evaluations tool.

You should provide credentials for the model providers used by your LLM judges. Alternatively, the evaluations tool can use an external TensorZero Gateway with the --gateway-url http://gateway:3000 flag (an example follows the configuration below).

services:
  # ...
  evaluations:
    profiles: [evaluations]
    image: tensorzero/evaluations
    volumes:
      - ./config:/app/config:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
      # ... and any other relevant API credentials ...
      - TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      clickhouse:
        condition: service_healthy
  # ...

See GitHub for the complete Docker Compose configuration.
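
For example, to point the evaluations tool at the gateway service from the same Docker Compose setup instead of providing model credentials to the evaluations container directly (assuming the gateway service is named gateway):

docker compose run --rm evaluations \
  --gateway-url http://gateway:3000 \
  --evaluation-name haiku_eval \
  --dataset-name haiku_dataset \
  --variant-name gpt_4o \
  --concurrency 5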

UI

To run evaluations in the UI, navigate to “Evaluations” (http://localhost:4000/evaluations) and select “New Run”.

You can compare multiple evaluation runs in the TensorZero UI (including evaluation runs started from the CLI).

TensorZero Evaluation UI