Retries & Fallbacks

The TensorZero Gateway offers multiple strategies to handle errors and improve reliability.

These strategies are defined at three levels: models (model provider routing), variants (variant retries), and functions (variant fallbacks). You can combine these strategies to define complex fallback behavior.

Model Provider Routing

We can specify that a model is available from multiple providers using its routing field. If the list includes multiple providers, the gateway will try each one sequentially until one succeeds or all fail.

In the example below, the gateway will first try OpenAI, and if that fails, it will try Azure.

[models.gpt_4o_mini]
# Try the following providers in order:
# 1. `models.gpt_4o_mini.providers.openai`
# 2. `models.gpt_4o_mini.providers.azure`
routing = ["openai", "azure"]

[models.gpt_4o_mini.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"

[models.gpt_4o_mini.providers.azure]
type = "azure"
deployment_id = "gpt4o-mini-20240718"
endpoint = "https://your-azure-openai-endpoint.openai.azure.com"

[functions.extract_data]
type = "chat"

[functions.extract_data.variants.gpt_4o_mini]
type = "chat_completion"
model = "gpt_4o_mini"

Variant Retries

We can add a retries field to a variant to specify the number of times to retry that variant if it fails. The retry strategy is a truncated exponential backoff with jitter.

In the example below, the gateway will retry the variant four times (i.e. a total of five attempts), with a maximum delay of 10 seconds between retries.

[functions.extract_data]
type = "chat"

[functions.extract_data.variants.claude_3_5_haiku]
type = "chat_completion"
model = "anthropic::claude-3-5-haiku-20241022"
# Retry the variant up to four times, with a maximum delay of 10 seconds between retries.
retries = { num_retries = 4, max_delay_s = 10 }
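The backoff schedule can be sketched as follows. This is only an illustration of truncated exponential backoff with full jitter; the base delay and exact jitter formula are assumptions, not TensorZero's published internals:

```python
import random


def retry_delay_s(attempt: int, max_delay_s: float, base_delay_s: float = 1.0) -> float:
    """Illustrative truncated exponential backoff with full jitter."""
    # Exponential growth, capped ("truncated") at max_delay_s
    capped = min(max_delay_s, base_delay_s * 2**attempt)
    # Full jitter: a uniformly random delay up to the cap
    return random.uniform(0, capped)


# With num_retries = 4 and max_delay_s = 10, the caps grow as 1, 2, 4, 8 seconds
delays = [retry_delay_s(attempt, max_delay_s=10) for attempt in range(4)]
assert all(0 <= d <= 10 for d in delays)
```

Jitter spreads out retries from concurrent requests so they don't hammer a struggling provider in lockstep.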

Variant Fallbacks

If we specify multiple variants for a function, the gateway will try different variants until one succeeds or all fail. After each failed attempt, the gateway samples another unused variant with probability proportional to its weight (i.e. sampling without replacement).

In the example below, the gateway will first sample and attempt the variants with non-zero weights (GPT-4o Mini or Claude 3.5 Haiku). If all of those variants fail, the gateway will sample and attempt the variants with zero weights (Gemini 1.5 Flash 8B or Grok 2).

[functions.extract_data]
type = "chat"

[functions.extract_data.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini-2024-07-18"
weight = 0.7

[functions.extract_data.variants.claude_3_5_haiku]
type = "chat_completion"
model = "anthropic::claude-3-5-haiku-20241022"
weight = 0.3

[functions.extract_data.variants.gemini_1_5_flash_8b]
type = "chat_completion"
model = "google_ai_studio_gemini::gemini-1.5-flash-8b"
weight = 0

[functions.extract_data.variants.grok_2]
type = "chat_completion"
model = "xai::grok-2-1212"
weight = 0
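The fallback order described above can be sketched as follows. This is a minimal model of weighted sampling without replacement, not TensorZero's internal implementation; the helper name and behavior are assumptions based on the description:

```python
import random


def sample_variant(variants: dict[str, float]) -> str:
    """Remove and return one variant, weight-proportionally; zero-weight
    variants are only considered once all non-zero-weight ones are exhausted."""
    nonzero = {name: w for name, w in variants.items() if w > 0}
    pool = nonzero if nonzero else variants
    names = list(pool)
    if nonzero:
        choice = random.choices(names, weights=[pool[n] for n in names], k=1)[0]
    else:
        choice = random.choice(names)  # uniform among zero-weight variants
    del variants[choice]  # sampling without replacement
    return choice


variants = {"gpt_4o_mini": 0.7, "claude_3_5_haiku": 0.3,
            "gemini_1_5_flash_8b": 0.0, "grok_2": 0.0}
order = [sample_variant(variants) for _ in range(4)]
# The first two attempts are always the non-zero-weight variants
assert set(order[:2]) == {"gpt_4o_mini", "claude_3_5_haiku"}
```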

Combining Strategies

We can combine strategies to define complex fallback behavior.

The gateway will try the following strategies in order:

  1. Model Provider Routing
  2. Variant Retries
  3. Variant Fallbacks

In other words, the gateway will follow a strategy like the pseudocode below.

while variants:
    # First sample variants with non-zero weight, then variants with zero weight
    variant = sample_variant(variants)  # sampling without replacement
    for _ in range(num_retries + 1):  # initial attempt plus retries
        for provider in variant.routing:
            try:
                return inference(variant, provider)
            except Exception:
                continue

Load Balancing

TensorZero doesn’t currently offer an explicit strategy for load balancing API keys, but you can achieve a similar effect by defining multiple variants with appropriate weights. We plan to add a streamlined load balancing strategy in the future.

In the example below, the gateway will split the traffic between two variants (gpt_4o_mini_api_key_A and gpt_4o_mini_api_key_B). Each variant leverages a model with providers that use different API keys (OPENAI_API_KEY_A and OPENAI_API_KEY_B). See Configuration Reference for more details on credential management.

[models.gpt_4o_mini_api_key_A]
routing = ["openai"]

[models.gpt_4o_mini_api_key_A.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
api_key_location = "env:OPENAI_API_KEY_A"

[models.gpt_4o_mini_api_key_B]
routing = ["openai"]

[models.gpt_4o_mini_api_key_B.providers.openai]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
api_key_location = "env:OPENAI_API_KEY_B"

[functions.extract_data]
type = "chat"

[functions.extract_data.variants.gpt_4o_mini_api_key_A]
type = "chat_completion"
model = "gpt_4o_mini_api_key_A"
weight = 0.5

[functions.extract_data.variants.gpt_4o_mini_api_key_B]
type = "chat_completion"
model = "gpt_4o_mini_api_key_B"
weight = 0.5
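With equal weights, the per-request weighted sampling averages out to an even split across the two API keys. A quick simulation of the weighted choice (illustrative only; the gateway's sampling is server-side):

```python
import random
from collections import Counter

weights = {"gpt_4o_mini_api_key_A": 0.5, "gpt_4o_mini_api_key_B": 0.5}
names, ws = zip(*weights.items())

# Simulate 10,000 requests routed by weighted sampling
counts = Counter(random.choices(names, weights=ws, k=10_000))

# Each key serves roughly half of the traffic
assert abs(counts["gpt_4o_mini_api_key_A"] / 10_000 - 0.5) < 0.05
```

Unequal weights (e.g. 0.9 and 0.1) would skew the split proportionally, which is useful when one account has a higher rate limit.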

Technical Notes

  • For variant types that require multiple model inferences (e.g. best-of-N sampling), the routing fallback applies to each individual model inference separately.