Experimentation (A/B Testing)

The TensorZero Gateway provides built-in support for experimentation (A/B testing) through variants. Each function can have multiple variants, and the gateway will sample between them based on their weights.

Variants enable you to experiment with different models (e.g. GPT-4o vs Claude), prompts (e.g. different templates), parameters (e.g. different temperatures), inference strategies (e.g. dynamic in-context learning), and more.

During an episode, multiple calls to the same function will receive the same variant (unless fallbacks are necessary). This ensures consistency in multi-step LLM workflows. Formally, this consistent variant assignment acts as a randomized controlled experiment, providing the statistical foundation needed to make causal inferences about which configurations perform best.
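To build intuition for episode-sticky assignment, here is a minimal sketch (not TensorZero's actual implementation) that deterministically maps an episode ID to a variant, so every call within an episode sees the same variant while assignment across episodes remains effectively random:

```python
import hashlib

def assign_variant(episode_id: str, variants: list[str]) -> str:
    """Hypothetical sticky assignment: hash the episode ID to a variant.

    The same episode ID always maps to the same variant, which is the
    consistency property described above.
    """
    digest = hashlib.sha256(episode_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]

variants = ["gpt_4o_mini", "claude_3_5_haiku"]

# Repeated calls within the same episode return the same variant.
assert assign_variant("episode-123", variants) == assign_variant("episode-123", variants)
```

In practice the gateway handles this for you; the sketch only illustrates why multi-step workflows stay internally consistent.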

You can use the feedback collected about an inference or episode to compare how different variants perform in a principled way. You can visualize the performance of different variants over time using the TensorZero UI.
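As a simplified illustration of such a comparison, the sketch below aggregates hypothetical feedback records (a made-up 0/1 "email_accepted" metric per inference) by variant; a real analysis would also account for sample sizes and statistical significance:

```python
from statistics import mean

# Hypothetical feedback records: (variant name, metric value).
feedback = [
    ("gpt_4o_mini", 1), ("gpt_4o_mini", 0), ("gpt_4o_mini", 1),
    ("claude_3_5_haiku", 1), ("claude_3_5_haiku", 1),
]

def mean_by_variant(records):
    """Group feedback values by variant and compute each variant's mean."""
    by_variant: dict[str, list[int]] = {}
    for variant, value in records:
        by_variant.setdefault(variant, []).append(value)
    return {variant: mean(values) for variant, values in by_variant.items()}

print(mean_by_variant(feedback))
```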

Examples

Simple A/B Testing

Let’s say you want to experiment with different models for a function that drafts an email.

Let’s create a function with two variants: GPT-4o Mini and Claude 3.5 Haiku. In a production setting, you’d likely want to set up prompt templates, inference parameters, and more.

[functions.draft_email]
type = "chat"

[functions.draft_email.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"

[functions.draft_email.variants.claude_3_5_haiku]
type = "chat_completion"
model = "anthropic::claude-3.5-haiku"

With the configuration above, the gateway will sample between the variants with equal probability.

Variant Weights

You can also set variant weights to control the probability of each variant being chosen.

[functions.draft_email]
type = "chat"

[functions.draft_email.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
weight = 0.9

[functions.draft_email.variants.claude_3_5_haiku]
type = "chat_completion"
model = "anthropic::claude-3.5-haiku"
weight = 0.1

With the configuration above, the gateway will sample gpt_4o_mini 90% of the time and claude_3_5_haiku 10% of the time.
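To see how weights translate into sampling proportions, here is a rough sketch assuming the gateway's behavior is equivalent to weight-proportional random choice (illustrative only, not the gateway's actual code):

```python
import random

variants = ["gpt_4o_mini", "claude_3_5_haiku"]
weights = [0.9, 0.1]  # mirrors the weights in the config above

# Simulate many episodes and count how often each variant is chosen.
random.seed(0)  # fixed seed so the simulation is reproducible
counts = {variant: 0 for variant in variants}
for _ in range(10_000):
    chosen = random.choices(variants, weights=weights, k=1)[0]
    counts[chosen] += 1

print(counts)  # gpt_4o_mini lands near 9,000; claude_3_5_haiku near 1,000
```

Over many episodes the observed proportions converge to the configured 90/10 split.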

If you don’t specify the weight for a given variant, it will default to zero. If you mix variants with zero and non-zero weights, the gateway will sample the non-zero weighted variants first, and only use zero-weighted variants as fallbacks. See Retries & Fallbacks for more details.