# Configuration Reference
## [evaluations.evaluation_name]
The `evaluations` sub-section of the config file defines the behavior of an evaluation in TensorZero. You can define multiple evaluations by including multiple `[evaluations.evaluation_name]` sections.
If your `evaluation_name` is not a basic string, it can be escaped with quotation marks. For example, periods are not allowed in basic strings, so you can define an evaluation named `foo.bar` as `[evaluations."foo.bar"]`.
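A minimal sketch of the quoted form:

```toml
# Quotation marks allow characters (like periods) that TOML bare keys disallow.
[evaluations."foo.bar"]
# ...
```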
Example:

```toml
[evaluations.email-guardrails]
# ...
```
### type

- Type: literal `"static"` (we may add other options here later on)
- Required: yes
### function_name

- Type: string
- Required: yes

This should be the name of a function defined in the `[functions]` section of the gateway config. This value determines which function the evaluation evaluates when run.
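For illustration, a minimal sketch of how the two sections relate (the function `draft-email` and its configuration are hypothetical):

```toml
# Hypothetical function that the evaluation targets
[functions.draft-email]
type = "chat"
# ...

[evaluations.email-guardrails]
type = "static"
function_name = "draft-email"  # must match a function name defined under [functions]
```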
## [evaluations.evaluation_name.evaluators.evaluator_name]
The `evaluators` sub-section defines the behavior of a particular evaluator that will be run as part of its parent evaluation. You can define multiple evaluators by including multiple `[evaluations.evaluation_name.evaluators.evaluator_name]` sections.
If your `evaluator_name` is not a basic string, it can be escaped with quotation marks. For example, periods are not allowed in basic strings, so you can define an evaluator named `includes.jpg` as `[evaluations.evaluation_name.evaluators."includes.jpg"]`.
```toml
[evaluations.email-guardrails]
# ...

[evaluations.email-guardrails.evaluators."includes.jpg"]
# ...

[evaluations.email-guardrails.evaluators.check-signature]
# ...
```
### type

- Type: string
- Required: yes

Defines the type of the evaluator. TensorZero currently supports the following evaluator types:
| Type | Description |
| --- | --- |
| `llm_judge` | Uses a TensorZero function as a judge. |
| `exact_match` | Evaluates whether the generated output exactly matches the reference output (skips the datapoint if no reference output is available). |
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
# ...
```
type: "exact_match"
type: "exact_match"
type: "exact_match"
### cutoff

- Type: float
- Required: no

Sets a user-defined threshold at which the evaluation is considered passing. This can be useful for applications where the evaluations are run as an automated test. If the average value of this evaluator is below the cutoff, the evaluations binary will return a nonzero status code.
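For example, an `exact_match` evaluator with a cutoff might look like this (the evaluator name `exact-output` is hypothetical):

```toml
[evaluations.email-guardrails.evaluators.exact-output]
type = "exact_match"
cutoff = 0.9  # fail the run if fewer than 90% of datapoints match the reference output exactly
```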
type: "llm_judge"
type: "llm_judge"
type: "llm_judge"
### input_format

- Type: string
- Required: no (default: `serialized`)

Defines the format of the input provided to the LLM judge.

- `serialized`: Passes the input messages, generated output, and reference output (if included) as a single serialized string.
- `messages`: Passes the input messages, generated output, and reference output (if included) as distinct messages in the conversation history.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
input_format = "messages"
# ...
```
### output_type

- Type: string
- Required: yes

Defines the expected data type of the evaluation result from the LLM judge.

- `float`: The judge is expected to return a floating-point number.
- `boolean`: The judge is expected to return a boolean value.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
output_type = "float"
# ...
```
### include.reference_output

- Type: boolean
- Required: no (default: `false`)

If set to `true`, the reference output associated with the evaluation datapoint will be included in the input provided to the LLM judge. In that case, the evaluation run will skip this evaluator for datapoints that have no reference output.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
include = { reference_output = true }
# ...
```
### optimize

- Type: string
- Required: yes

Defines whether the metric produced by the LLM judge should be maximized or minimized.

- `max`: Higher values are better.
- `min`: Lower values are better.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"
# ...
```
### cutoff

- Type: float
- Required: no

Sets a user-defined threshold at which the evaluation is considered passing. This may be useful for applications where the evaluations are run as an automated test. If the average value of this evaluator is below the cutoff (when `optimize` is `max`) or above the cutoff (when `optimize` is `min`), the evaluations binary will return a nonzero status code.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"  # Example: maximize the score
cutoff = 0.8      # Example: consider passing if the average score is >= 0.8
# ...
```
## [evaluations.evaluation_name.evaluators.evaluator_name.variants.variant_name]
An LLM Judge evaluator defines a TensorZero function that is used to judge the output of another TensorZero function. Therefore, all the variant types that are available for a normal TensorZero function are also available for LLMs as judges — including all of our inference-time optimizations.
You can include a standard variant configuration in this block, with two modifications:

- Instead of assigning a `weight` to each variant, you simply mark a single variant as `active`.
- For `chat_completion` variants, instead of a `system_template` we require `system_instructions` as a text file and take no other templates.
Here we list only the configuration fields for these variants that differ from the configuration for a normal TensorZero function. Please refer to the variant configuration reference for the remaining options.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
type = "llm_judge"
optimize = "max"

[evaluations.email-guardrails.evaluators.check-signature.variants."claude3.5sonnet"]
type = "chat_completion"
model = "anthropic::claude-3-5-sonnet-20241022"
temperature = 0.1
system_instructions = "./evaluations/email-guardrails/check-signature/system_instructions.txt"
# ... other chat completion configuration ...

[evaluations.email-guardrails.evaluators.check-signature.variants."mix3claude3.5sonnet"]
# if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
active = true
type = "experimental_mixture_of_n"
candidates = ["claude3.5sonnet", "claude3.5sonnet", "claude3.5sonnet"]
```
### active

- Type: boolean
- Required: defaults to `true` if there is a single variant configured; otherwise, this field must be set to `true` for exactly one variant

Sets which variant should be used for evaluation runs.
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...

[evaluations.email-guardrails.evaluators.check-signature.variants."mix3claude3.5sonnet"]
# if we run the `email-guardrails` evaluation, this is the variant we'll use for the check-signature evaluator
active = true
type = "experimental_mixture_of_n"
```
### system_instructions

- Type: string (path)
- Required: yes

Defines the path to the system instructions file. This path is relative to the configuration file. The file should contain the system instructions for the LLM judge, and those instructions should direct the judge to output a float or boolean value.
We use JSON mode to enforce that the judge returns a JSON object of the form `{"thinking": "<thinking>", "score": <float or boolean>}`, where the type of `score` matches the `output_type` of the evaluator.
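For instance, a judge with `output_type = "boolean"` would be expected to produce a response shaped like the following (the content itself is illustrative):

```json
{
  "thinking": "The email ends with a name and sign-off, so it includes a signature.",
  "score": true
}
```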
For example, a system instructions file might contain:

```
Evaluate if the text follows the haiku structure of exactly three lines with a 5-7-5 syllable pattern, totaling 17 syllables. Verify only this specific syllable structure of a haiku without making content assumptions.
```
```toml
[evaluations.email-guardrails.evaluators.check-signature]
# ...
system_instructions = "./evaluations/email-guardrails/check-signature/claude_35_sonnet/system_instructions.txt"
# ...
```