# Configuration Reference
## [evaluations.evaluation_name]
The `evaluations` sub-section of the config file defines the behavior of an evaluation in TensorZero. You can define multiple evaluations by including multiple `[evaluations.evaluation_name]` sections.
If your `evaluation_name` is not a basic string, it can be escaped with quotation marks. For example, periods are not allowed in basic strings, so you can define an evaluation named `foo.bar` as `[evaluations."foo.bar"]`.
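A minimal sketch of the quoted form:

```toml
# Quotation marks allow characters (like periods) that TOML bare keys disallow.
[evaluations."foo.bar"]
# ...
```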
Example:

```toml
[evaluations.email-guardrails]
# ...
```
### type

- Type: literal `"static"` (we may add other options here later on)
- Required: yes
### function_name

- Type: string
- Required: yes

This should be the name of a function defined in the `[functions]` section of the gateway config. This value determines which function the evaluation evaluates when run.
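For illustration, a minimal sketch of how the two sections relate (the function `draft-email` and its configuration are hypothetical):

```toml
# Hypothetical function that the evaluation targets
[functions.draft-email]
type = "chat"
# ...

[evaluations.email-guardrails]
type = "static"
function_name = "draft-email"  # must match a function name defined under [functions]
```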
## [evaluations.evaluation_name.evaluators.evaluator_name]
The `evaluators` sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation. You can define multiple evaluators by including multiple `[evaluations.evaluation_name.evaluators.evaluator_name]` sections.
If your `evaluator_name` is not a basic string, it can be escaped with quotation marks. For example, periods are not allowed in basic strings, so you can define an evaluator named `includes.jpg` as `[evaluations.evaluation_name.evaluators."includes.jpg"]`.
```toml
[evaluations.email-guardrails]
# ...

[evaluations.email-guardrails.evaluators."includes.jpg"]
# ...

[evaluations.email-guardrails.evaluators.check-signature]
# ...
```
### type

- Type: string
- Required: yes

Defines the type of the evaluator. TensorZero currently supports the following evaluator types:
| Type | Description |
| --- | --- |
| `llm_judge` | Uses a TensorZero function as a judge. |
| `exact_match` | Evaluates whether the generated output exactly matches the reference output (skips the datapoint if no reference output is available). |
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
# ...
```
type: "exact_match"
type: "exact_match"
type: "exact_match"
### cutoff

- Type: float
- Required: no

Sets a user-defined threshold at which the evaluation is considered passing. This can be useful for applications where the evaluations are run as an automated test. If the average value of this evaluator is below the cutoff, the evaluations binary will return a nonzero status code.
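For example, an `exact_match` evaluator with a cutoff might look like this (the evaluator name `exact-output` is hypothetical):

```toml
[evaluations.email-guardrails.evaluators.exact-output]
type = "exact_match"
cutoff = 0.9  # fail the run if fewer than 90% of datapoints match the reference output exactly
```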
type: "llm_judge"
type: "llm_judge"
type: "llm_judge"
### input_format

- Type: string
- Required: no (default: `serialized`)

Defines the format of the input provided to the LLM judge.

- `serialized`: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
- `messages`: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
input_format = "messages"
# ...
```
### output_type

- Type: string
- Required: yes

Defines the expected data type of the evaluation result from the LLM judge.

- `float`: The judge is expected to return a floating-point number.
- `boolean`: The judge is expected to return a boolean value.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
output_type = "float"
# ...
```
### include.reference_output

- Type: boolean
- Required: no (default: `false`)

If set to `true`, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge. In that case, the evaluation run will skip this evaluator for datapoints that have no reference output.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
include = { reference_output = true }
# ...
```
### optimize

- Type: string
- Required: yes

Defines whether the metric produced by the LLM judge should be maximized or minimized.

- `max`: Higher values are better.
- `min`: Lower values are better.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"
# ...
```
### cutoff

- Type: float
- Required: no

Sets a user-defined threshold at which the evaluation is considered passing. This may be useful for applications where the evaluations are run as an automated test. If the average value of this evaluator is below the cutoff (when `optimize` is `max`) or above the cutoff (when `optimize` is `min`), the evaluations binary will return a nonzero status code.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"  # Example: maximize the score
cutoff = 0.8      # Example: consider passing if the average score is >= 0.8
# ...
```
## [evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]
An LLM Judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function. Therefore, all the variant types that are available for a normal TensorZero function are also available for LLMs as judges — including all of our inference-time optimizations.
You can include a standard variant configuration in this block, with two modifications:

- Instead of assigning a `weight` to each variant, you simply mark a single variant as `active`.
- For `chat_completion` variants, instead of a `system_template` we require `system_instructions` as a text file and take no other templates.
Here we list only the configuration fields for these variants that differ from the configuration for a normal TensorZero function. Please refer to the variant configuration reference for the remaining options.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"

[evaluations.email-guardrails.evaluators.check-signature.variants."claude3.5sonnet"]
type = "chat_completion"
model = "anthropic::claude-3-5-sonnet-20241022"
temperature = 0.1
system_instructions = "./evaluations/email-guardrails/check-signature/system_instructions.txt"
# ... other chat completion configuration ...

[evaluations.email-guardrails.evaluators.check-signature.variants."mix3claude3.5sonnet"]
# if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
active = true
type = "experimental_mixture_of_n"
candidates = ["claude3.5sonnet", "claude3.5sonnet", "claude3.5sonnet"]
```
### active

- Type: boolean
- Required: defaults to `true` if there is a single variant configured; otherwise, this field must be set to `true` for exactly one variant

Sets which variant should be used for evaluation runs.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...

[evaluations.email-guardrails.evaluators.check-signature.variants."mix3claude3.5sonnet"]
# if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
active = true
type = "experimental_mixture_of_n"
```
### system_instructions

- Type: string (path)
- Required: yes

Defines the path to the system instructions file. This path is relative to the configuration file. The file should contain the system instructions for the LLM judge, and those instructions should direct the judge to output a float or boolean value.
We use JSON mode to enforce that the judge returns a JSON object of the form `{"thinking": "<thinking>", "score": <float or boolean>}`, where the type of `score` matches the `output_type` of the evaluator.
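For instance, a judge with `output_type = "boolean"` would be expected to produce a response shaped like the following (the content itself is illustrative):

```json
{
  "thinking": "The email ends with a name and sign-off, so it includes a signature.",
  "score": true
}
```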
For example, a system instructions file might contain:

```
Evaluate if the text follows the haiku structure of exactly three lines with a 5-7-5 syllable pattern, totaling 17 syllables. Verify only this specific syllable structure of a haiku without making content assumptions.
```
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
system_instructions = "./evaluations/email-guardrails/check-signature/claude_35_sonnet/system_instructions.txt"
# ...
```