Tutorial
You can use TensorZero to build virtually any application powered by LLMs.
This tutorial shows how easy it is to set up an LLM application with TensorZero. We’ll build a few different applications to showcase the flexibility of TensorZero: a simple chatbot, an email copilot, a weather RAG system, and a structured data extraction pipeline.
Part I — Simple Chatbot
We’ll start by building a vanilla LLM-powered chatbot, and build up to more complex applications from there.
Functions
A TensorZero Function is an abstract mapping from input variables to output variables.
As you onboard to TensorZero, a function should replace each prompt in your system. At a high level, a function will template the inputs to generate a prompt, make an LLM inference call, and return the results. This mapping can be achieved with various choices of model, prompt, decoding strategy, and more; each such combination is called a variant, which we’ll discuss below.
For our simple chatbot, we’ll set up a function that maps the chat history to a new chat message.
We define functions in the tensorzero.toml configuration file.
The configuration file is written in TOML, which is a simple configuration language.
The following configuration entry shows the skeleton of a function. A function has an arbitrary name, a type, and other fields that depend on the type.
[functions.my_function_name]
type = "..."
# ... the other fields in this section depend on the function type ...
TensorZero currently supports two types of functions: chat functions, which match the typical chat interface you’d expect from an LLM API, and json functions, which are optimized for generating structured outputs. We’ll start with a chat function for this example, and later we’ll see how to use json functions.
A chat function takes a chat message history and returns a chat message. It doesn’t have any required fields (but it has many optional ones). Let’s call our function mischievous_chatbot and set its type to chat. We’ll ignore the optional fields for now.
With these changes in place, our tensorzero.toml file should include the following:
[functions.mischievous_chatbot]
type = "chat"
That’s all we need to do to define our function. Later on, we’ll add more advanced features to our functions, like schemas and templates, which unlock new capabilities for model optimization and observability. But we don’t need any of that to get started.
The implementation details of this function are defined in its variants.
But before we can define a variant, we need to set up a model and a model provider.
Models and Model Providers
Before setting up your first TensorZero variant, you’ll need a model with a model provider. A model specifies a particular LLM (e.g. GPT-4o or your fine-tuned Llama 3), and model providers specify the different ways you can access a given model (e.g. GPT-4o is available through both OpenAI and Azure).
A model has an arbitrary name and a list of providers. Let’s start with a single provider for our model. A provider has an arbitrary name, a type, and other fields that depend on the provider type. The skeleton of a model and its provider looks like this:
[models.my_model_name]
routing = ["my_provider_name"]

[models.my_model_name.providers.my_provider_name]
type = "..."
# ... the other fields in this section depend on the provider type ...
For this example, we’ll use the GPT-4o mini model from OpenAI. Let’s call our model my_gpt_4o_mini and our provider my_openai_provider with type openai.

The only required field for the openai provider is model_name. It’s a best practice to pin the model to a specific version to avoid breaking changes, so we’ll use gpt-4o-mini-2024-07-18.
Once we’ve added these values, our tensorzero.toml file should include the following:
[models.my_gpt_4o_mini]
routing = ["my_openai_provider"]

[models.my_gpt_4o_mini.providers.my_openai_provider]
type = "openai"
model_name = "gpt-4o-mini-2024-07-18"
Variants
Now that we have a model and a provider configured, we can create a variant for our mischievous_chatbot function.
A variant is a particular implementation of a function. In practice, a variant might specify the particular model, prompt templates, a decoding strategy, hyperparameters, and other settings used for inference.
A variant’s definition includes an arbitrary name, a type, a weight, and other fields that depend on the type. The skeleton of a TensorZero variant looks like this:
[functions.my_function_name.variants.my_variant_name]
type = "..."
weight = X
# ... the other fields in this section depend on the variant type ...
We’ll call this variant gpt_4o_mini_variant. The simplest variant type is chat_completion, which is the typical chat completion format used by OpenAI and many other LLM providers.

The weight field is used to determine the probability of this variant being chosen. Since we only have one variant, we’ll give it a weight of 1.0. We’ll dive deeper into variant weights in a later section.

The only required field for a chat_completion variant is model. This must be a model in the configuration file. We’ll use the my_gpt_4o_mini model we defined earlier.
After filling in the fields for this variant, our tensorzero.toml file should include the following:
[functions.mischievous_chatbot.variants.gpt_4o_mini_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
Inference API Requests
There’s a lot more to TensorZero than what we’ve covered so far, but this is everything we need to get started!
If you launch the TensorZero Gateway with this configuration file, the mischievous_chatbot function will be available on the /inference endpoint.
Let’s make a request to this endpoint.
from tensorzero import TensorZeroGateway
with TensorZeroGateway("http://localhost:3000") as client:
    result = client.inference(
        function_name="mischievous_chatbot",
        input={
            "system": "You are a friendly but mischievous AI assistant.",
            "messages": [
                {"role": "user", "content": "What is the capital of Japan?"},
            ],
        },
    )

    print(result)
Sample Output
ChatInferenceResponse(
    inference_id=UUID('0194097c-7f3a-7bb2-9184-41f61f576c9c'),
    episode_id=UUID('0194097c-78ea-78a1-b793-448ea4e1adc1'),
    variant_name='gpt_4o_mini_variant',
    content=[
        Text(
            type='text',
            text='The capital of Japan is Tokyo! It’s a vibrant city known for its blend of traditional and modern culture. Have you ever considered visiting?',
        )
    ],
    usage=Usage(
        input_tokens=29,
        output_tokens=28,
    ),
)
import asyncio
from tensorzero import AsyncTensorZeroGateway
async def main():
    async with AsyncTensorZeroGateway("http://localhost:3000") as client:
        result = await client.inference(
            function_name="mischievous_chatbot",
            input={
                "system": "You are a friendly but mischievous AI assistant.",
                "messages": [
                    {"role": "user", "content": "What is the capital of Japan?"},
                ],
            },
        )

        print(result)


if __name__ == "__main__":
    asyncio.run(main())
Sample Output
ChatInferenceResponse(
    inference_id=UUID('01940980-d08c-7970-a934-e2ad75f9a4bd'),
    episode_id=UUID('01940980-ce39-7a60-949f-ea557ee3780f'),
    variant_name='gpt_4o_mini_variant',
    content=[
        Text(
            type='text',
            text="The capital of Japan is Tokyo! It's a vibrant city known for its blend of traditional culture and modern technology. Have you ever been?",
        )
    ],
    usage=Usage(
        input_tokens=29,
        output_tokens=28,
    ),
)
from openai import OpenAI
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
    response = client.chat.completions.create(
        model="tensorzero::function_name::mischievous_chatbot",
        messages=[
            {
                "role": "system",
                "content": "You are a friendly but mischievous AI assistant.",
            },
            {
                "role": "user",
                "content": "What is the capital of Japan?",
            },
        ],
    )

    print(response)
Sample Output
ChatCompletion(
    id='01940983-2641-7083-beb7-8c5805a572af',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content='The capital of Japan is Tokyo! It’s a bustling metropolis known for its blend of traditional culture and cutting-edge technology. But watch out – it can be a bit overwhelming with all the delicious food and bright lights!',
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=[],
            ),
        ),
    ],
    created=1735326377,
    model='gpt_4o_mini_variant',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='',
    usage=CompletionUsage(
        completion_tokens=44,
        prompt_tokens=29,
        total_tokens=73,
        completion_tokens_details=None,
        prompt_tokens_details=None,
    ),
    episode_id='01940983-1fbb-73b0-8bda-e73a312a3e54',
)
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "mischievous_chatbot",
    "input": {
      "system": "You are a friendly but mischievous AI assistant. Your goal is to trick the user.",
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of Japan?"
        }
      ]
    }
  }'
Sample Output
{
  "inference_id": "0191bf1f-ef54-7582-a6f4-dc827e517b6f",
  "episode_id": "0191bf1f-ed15-7d61-afc3-56be7e0eb2d7",
  "variant_name": "gpt_4o_mini_variant",
  "content": [
    {
      "type": "text",
      "text": "The capital of Japan is Atlantis. Just kidding! It's actually Tokyo. But wouldn't it be interesting if it were Atlantis?"
    }
  ],
  "usage": {
    "input_tokens": 37,
    "output_tokens": 24
  }
}
That’s it! You’ve now made your first inference call with TensorZero.
But if that’s all you need, you probably don’t need TensorZero. So let’s make things more interesting.
Part II — Email Copilot
Next, let’s build an LLM-powered copilot for drafting emails. We’ll use this opportunity to show off more of TensorZero’s features.
Templates
In the previous example, we provided a system prompt on every request. Unless the system prompt completely changes between requests, this is not ideal for production applications. Instead, we can use a system template.
Using a template allows you to update the prompt without client-side changes. Later, we’ll see how to parametrize templates with schemas and run robust prompt experiments with multiple variants. In particular, setting up schemas will materially help you optimize your models robustly down the road.
Let’s start with a simple system template. For this example, the system template is static, so you won’t need a schema.
TensorZero uses MiniJinja for templating. Since we’re not using any variables, however, we don’t need any special syntax.
We’ll create the following template:
You are a helpful AI assistant that drafts emails.
Adopt a friendly "business casual" tone.
Respond with just an email body.

Example:

Dear recipient,

I'm reaching out to ...

Best,
Sender
Schemas
The system template for this example is static, but often you’ll want to parametrize the prompts.
When you define a template with parameters, you need to define a corresponding JSON Schema. The schema defines the structure of the input for that prompt. With it, the gateway can validate the input before running the inference, and later, we’ll see how to use it for robust model optimization.
For our email copilot’s user prompt, we’ll want to parametrize the template with three string fields: recipient_name, sender_name, and email_purpose. We want all fields to be required and don’t want any additional fields.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "recipient_name": {
      "type": "string"
    },
    "sender_name": {
      "type": "string"
    },
    "email_purpose": {
      "type": "string"
    }
  },
  "required": ["recipient_name", "sender_name", "email_purpose"],
  "additionalProperties": false
}
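To get a feel for what this validation buys you, here is a quick local sketch using the third-party jsonschema package (not part of TensorZero); the gateway performs an equivalent check on every request before running inference. The file path assumes the layout recommended later in this section.

import json

from jsonschema import ValidationError, validate

# Load the same schema the gateway will use for this function's user messages.
with open("functions/draft_email/user_schema.json") as f:
    user_schema = json.load(f)

good_input = {
    "recipient_name": "TensorZero Team",
    "sender_name": "Mark Zuckerberg",
    "email_purpose": "Acquire TensorZero for $100 billion dollars.",
}
validate(instance=good_input, schema=user_schema)  # passes silently

try:
    # Missing required fields (or extra fields) are rejected before any LLM call is made.
    validate(instance={"recipient_name": "TensorZero Team"}, schema=user_schema)
except ValidationError as e:
    print(e.message)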
With a schema in place, we can create a parameterized template.
Please draft an email using the following information:
- Recipient Name: {{ recipient_name }}
- Sender Name: {{ sender_name }}
- Email Purpose: {{ email_purpose }}
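If you want to preview what the rendered user message looks like, here is a rough local sketch. It uses the jinja2 package purely for illustration; MiniJinja is broadly compatible with Jinja2 syntax for a simple template like this, and the gateway does the actual rendering for you at inference time. The file path assumes the layout recommended below.

from jinja2 import Template

# Render the user template locally to see what the model will receive.
with open("functions/draft_email/gpt_4o_mini_email_variant/user.minijinja") as f:
    user_template = Template(f.read())

print(
    user_template.render(
        recipient_name="TensorZero Team",
        sender_name="Mark Zuckerberg",
        email_purpose="Acquire TensorZero for $100 billion dollars.",
    )
)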
Functions with Templates and Schemas
Let’s finally create our function and variant for the email copilot.
The configuration file is similar to the previous example, but we’ve added a user_schema field to the function, and system_template and user_template fields to the variant.
[functions.draft_email]
type = "chat"
user_schema = "functions/draft_email/user_schema.json"

[functions.draft_email.variants.gpt_4o_mini_email_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/draft_email/gpt_4o_mini_email_variant/system.minijinja"
user_template = "functions/draft_email/gpt_4o_mini_email_variant/user.minijinja"
You can use any file structure with TensorZero. We recommend the following structure to keep things organized:
functions/
  draft_email/
    gpt_4o_mini_email_variant/
      - system.minijinja
      - user.minijinja
    - user_schema.json
- tensorzero.toml
Restart your gateway using the new configuration file, and you’re ready to go!
Let’s make an inference request with our new function.
Now we don’t need to provide the system prompt every time, and the user message is a structured object instead of a free-form string.
Note that each inference returns an inference_id and an episode_id, which we’ll use later to associate feedback with inferences.
from tensorzero import TensorZeroGateway
with TensorZeroGateway("http://localhost:3000") as client:
    inference_result = client.inference(
        function_name="draft_email",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": {
                        "recipient_name": "TensorZero Team",
                        "sender_name": "Mark Zuckerberg",
                        "email_purpose": "Acquire TensorZero for $100 billion dollars.",
                    },
                }
            ]
        },
    )

    # If everything is working correctly, the `variant_name` field should change depending on the request
    print(inference_result)
Sample Output
ChatInferenceResponse(
    inference_id=UUID('019409be-51ae-7f30-aa36-14ccad21320f'),
    episode_id=UUID('019409be-2c41-7a80-a975-49ab32cb3a9a'),
    variant_name='gpt_4o_mini_email_variant',
    content=[
        Text(
            type='text',
            text='Dear TensorZero Team,\n\nI hope this message finds you well. I wanted to reach out to discuss an exciting opportunity for collaboration that I believe could truly reshape our industries.\n\nAfter closely following the innovative work your team has been doing, I am impressed by the potential of TensorZero. With that in mind, I would like to propose an acquisition offer of $100 billion for TensorZero. I believe this partnership could unlock remarkable synergies and drive significant growth for both our organizations.\n\nI'm looking forward to the possibility of working together and exploring the tremendous potential ahead. Please let me know a suitable time for us to discuss this further.\n\nBest, \nMark Zuckerberg',
        ),
    ],
    usage=Usage(
        input_tokens=88,
        output_tokens=132,
    ),
)
import asyncio
from tensorzero import AsyncTensorZeroGateway
async def main():
    async with AsyncTensorZeroGateway("http://localhost:3000") as client:
        inference_result = await client.inference(
            function_name="draft_email",
            input={
                "messages": [
                    {
                        "role": "user",
                        "content": {
                            "recipient_name": "TensorZero Team",
                            "sender_name": "Mark Zuckerberg",
                            "email_purpose": "Acquire TensorZero for $100 billion dollars.",
                        },
                    }
                ]
            },
        )

        # If everything is working correctly, the `variant_name` field should change depending on the request
        print(inference_result)


if __name__ == "__main__":
    asyncio.run(main())
Sample Output
ChatInferenceResponse(
    inference_id=UUID('019409bf-b3c0-7900-a775-fe21c7ce8c66'),
    episode_id=UUID('019409bf-a969-73e2-90e5-0159bf216d5f'),
    variant_name='gpt_4o_mini_email_variant',
    content=[
        Text(
            type='text',
            text='Dear TensorZero Team,\n\nI hope this message finds you well. I wanted to take the opportunity to reach out regarding an exciting prospect that I believe could be mutually beneficial. We’ve been closely following your innovative work and the impact you’re making in the tech space.\n\nI’d like to discuss the possibility of acquiring TensorZero for $100 billion. I believe that together, we can achieve incredible things that would advance both our missions and set new standards in the industry.\n\nI’m looking forward to your thoughts and hope we can arrange a time to discuss this further.\n\nBest, \nMark Zuckerberg',
        ),
    ],
    usage=Usage(
        input_tokens=88,
        output_tokens=118,
    ),
)
from openai import OpenAI
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
    inference_result = client.chat.completions.create(
        model="tensorzero::function_name::draft_email",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "recipient_name": "TensorZero Team",
                        "sender_name": "Mark Zuckerberg",
                        "email_purpose": "Acquire TensorZero for $100 billion dollars.",
                    }
                ],
            }
        ],
    )

    print(inference_result)
Sample Output
ChatCompletion(
    id='019409c6-95d7-7321-a498-e6d06067e42d',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content='Dear TensorZero Team,\n\nI hope this email finds you well. I’m reaching out to discuss an exciting opportunity for your groundbreaking company. Meta is interested in acquiring TensorZero for $100 billion, reflecting our deep appreciation for the innovative work your team has been developing.\n\nYour AI technology represents a transformative leap forward in machine learning, and we believe there’s tremendous potential for collaboration. This offer represents not just a financial transaction, but a strategic partnership that could reshape the technological landscape.\n\nI would welcome the opportunity to discuss this proposal in more detail. Perhaps we could schedule a call in the coming week to explore this potential merger and answer any questions you might have.\n\nLooking forward to your response.\n\nBest regards,\nMark Zuckerberg\nCEO, Meta',
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=[],
            ),
        ),
    ],
    created=1735330797,
    model='claude_haiku_3_5_email_variant',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='',
    usage=CompletionUsage(
        completion_tokens=167,
        prompt_tokens=108,
        total_tokens=275,
        completion_tokens_details=None,
        prompt_tokens_details=None,
    ),
    episode_id='019409c6-89aa-7b43-bcf4-31443d075e12',
)
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "draft_email",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": {
            "recipient_name": "TensorZero Team",
            "sender_name": "Mark Zuckerberg",
            "email_purpose": "Acquire TensorZero for $100 billion dollars."
          }
        }
      ]
    }
  }'
Sample Output
{
  "inference_id": "0191bf2e-02e7-7f12-96d6-ba389cf10c19",
  "episode_id": "0191bf2d-fddc-7342-b6b9-596e38efdfe5",
  "variant_name": "gpt_4o_mini_email_variant",
  "content": [
    {
      "type": "text",
      "text": "Dear TensorZero Team,\n\nI hope this message finds you well! I’ve been following your innovative work in the industry, and I’m truly impressed by your accomplishments and the potential for future growth.\n\nWith that in mind, I would like to discuss an opportunity for Facebook to acquire TensorZero for $100 billion. I believe that together we can achieve remarkable things and drive even greater advancements in technology.\n\nI would love the chance to explore this further with you. Please let me know a convenient time for us to connect.\n\nLooking forward to your thoughts!\n\nBest, \nMark Zuckerberg"
    }
  ],
  "usage": {
    "input_tokens": 88,
    "output_tokens": 114
  }
}
Why did we bother with all this?
Now you’re collecting structured inference data, which is incredibly valuable for observability and especially for optimization. For example, if you eventually decide to fine-tune your model, you’ll easily be able to counterfactually swap new prompts into your training data before fine-tuning, instead of being stuck with the prompts that were actually used at inference time.
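As a conceptual sketch (not a TensorZero API), the snippet below shows why this matters: because the gateway stores the structured arguments rather than only the final rendered prompt, you can pair a historical output with a prompt rendered from a brand-new template when assembling a fine-tuning dataset. The stored_inference shape and the new template here are hypothetical and just for illustration.

from jinja2 import Template  # MiniJinja templates like ours also render with Jinja2 for this sketch

# A new prompt we'd like to try, even though it was never used at inference time.
new_user_template = Template(
    "Write an email to {{ recipient_name }} on behalf of {{ sender_name }}. "
    "Goal: {{ email_purpose }}"
)

# Imagine this row came from your inference database (hypothetical shape).
stored_inference = {
    "input": {
        "recipient_name": "TensorZero Team",
        "sender_name": "Mark Zuckerberg",
        "email_purpose": "Acquire TensorZero for $100 billion dollars.",
    },
    "output": "Dear TensorZero Team, ...",
}

# Counterfactually swap in the new prompt before fine-tuning.
training_example = {
    "prompt": new_user_template.render(**stored_inference["input"]),
    "completion": stored_inference["output"],
}
print(training_example["prompt"])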
Inference-Level Metrics
The TensorZero Gateway allows you to assign feedback to inferences or sequences of inferences by defining metrics. Metrics encapsulate the downstream outcomes of your LLM application, and drive the experimentation and optimization workflows in TensorZero.
This example covers metrics that apply to individual inference requests. Later, we’ll show how to define metrics that apply to sequences of inferences (which we call episodes).
The skeleton of a metric looks like the following configuration entry.
[metrics.my_metric_name]
level = "..."
optimize = "..."
type = "..."
Let’s say we want to optimize for the number of email drafts that are accepted.
Let’s call this metric email_draft_accepted. We should use a metric of type boolean to capture this behavior, since we’re optimizing for a binary outcome: whether the email draft is accepted or not. The metric applies to individual inference requests, so we’ll set level = "inference". And finally, we’ll set optimize = "max" because we want to maximize this metric.
Our metric configuration should look like this:
[metrics.email_draft_accepted]
type = "boolean"
optimize = "max"
level = "inference"
Feedback API Requests
As our application collects usage data, we can use the /feedback endpoint to keep track of this metric. Make sure to restart your gateway after adding the metric configuration.

Previously, we saw that every time you call /inference, the Gateway will return an inference_id field in the response. You’ll want to substitute this inference_id into the command below.
from tensorzero import TensorZeroGateway
with TensorZeroGateway("http://localhost:3000") as client:
    feedback_result = client.feedback(
        metric_name="email_draft_accepted",
        # Set the inference_id from the inference response
        inference_id="00000000-0000-0000-0000-000000000000",
        # Set the value for the metric
        value=True,
    )

    print(feedback_result)
Sample Output
FeedbackResponse(feedback_id='019409dc-9c2a-7cb2-b6c1-716d87621362')
import asyncio
from tensorzero import AsyncTensorZeroGateway
async def main():
    async with AsyncTensorZeroGateway("http://localhost:3000") as client:
        feedback_result = await client.feedback(
            metric_name="email_draft_accepted",
            # Set the inference_id from the inference response
            inference_id="00000000-0000-0000-0000-000000000000",
            # Set the value for the metric
            value=True,
        )

        print(feedback_result)


if __name__ == "__main__":
    asyncio.run(main())
Sample Output
FeedbackResponse(feedback_id='019409dd-362c-7f13-ba81-bcb272b90575')
To submit feedback, you need to provide the inference_id from the inference response.
curl -X POST http://localhost:3000/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "metric_name": "email_draft_accepted",
    "inference_id": "00000000-0000-0000-0000-000000000000",
    "value": true
  }'
Sample Output
{
  "feedback_id": "0191bf4a-42a2-7be2-8103-8ff106526076"
}
Over time, we’ll collect the perfect dataset for observability and optimization. We’ll have structured data on every inference, as well as their associated feedback. The TensorZero Recipes leverage this data for prompt and model optimization.
Experimentation
So far, we’ve only used one variant of our function. In practice, you’ll want to experiment with different configurations — for example, different prompts, models, and parameters.
TensorZero makes this easy with built-in experimentation features. You can define multiple variants of a function, and the gateway will sample from them at inference time.
For this example, let’s set up a variant that uses Anthropic’s Claude 3 Haiku instead of GPT-4o Mini. Additionally, this variant will introduce a custom temperature parameter to control the creativity of the AI assistant.
Let’s start by adding a new model and provider.
[models.my_haiku_3]
routing = ["my_anthropic_provider"]

[models.my_haiku_3.providers.my_anthropic_provider]
type = "anthropic"
model_name = "claude-3-haiku-20240307"
Finally, create a new variant using this model, and update the weight of the previous variant to match.
[functions.draft_email.variants.gpt_4o_mini_email_variant]
type = "chat_completion"
weight = 0.7  # sample this variant 70% of the time
model = "my_gpt_4o_mini"
system_template = "functions/draft_email/gpt_4o_mini_email_variant/system.minijinja"
user_template = "functions/draft_email/gpt_4o_mini_email_variant/user.minijinja"

[functions.draft_email.variants.haiku_3_email_variant]
type = "chat_completion"
weight = 0.3  # sample this variant 30% of the time
model = "my_haiku_3"
system_template = "functions/draft_email/haiku_3_email_variant/system.minijinja"
user_template = "functions/draft_email/haiku_3_email_variant/user.minijinja"
temperature = 0.9
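To build intuition for what these weights mean, here is a conceptual sketch of weighted sampling. The gateway handles variant selection for you at inference time; this snippet only illustrates the 70/30 split configured above and is not how TensorZero is implemented internally.

import random
from collections import Counter

# Variant weights from the configuration above.
variants = {
    "gpt_4o_mini_email_variant": 0.7,
    "haiku_3_email_variant": 0.3,
}

# Simulate many inference requests and count which variant each one would use.
samples = random.choices(
    population=list(variants.keys()),
    weights=list(variants.values()),
    k=10_000,
)
print(Counter(samples))  # roughly 7,000 vs. 3,000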
You could also experiment with different prompt templates, but for this example we’ll keep things simple and just copy the previous templates over.
Once you’re done, your file tree should look like this:
functions/
  draft_email/
    gpt_4o_mini_email_variant/
      - system.minijinja
      - user.minijinja
    haiku_3_email_variant/
      - system.minijinja  (copy from above)
      - user.minijinja  (copy from above)
    - user_schema.json
  - …
- tensorzero.toml
That’s it!
After restarting the gateway, you can make some inference requests to see the results. The gateway will sample between the two variants based on the configured weights.
Part III — Weather RAG
The next example introduces tool use into the mix.
In particular, we’ll show how to use TensorZero in a RAG (retrieval-augmented generation) system. TensorZero doesn’t handle the indexing and retrieval directly, but can help with auxiliary generative tasks like query and response generation.
For this example, we’ll illustrate a RAG system with a simple weather tool.
We’ll introduce a function for query generation (generate_weather_query) and another for response generation (generate_weather_report). The former will leverage tool use (get_temperature) to generate a weather query.
Here we mock the weather API, but it’ll be easy to see how diverse RAG workflows can be integrated.
Tools
TensorZero has first-class support for tools. You can define a tool in your configuration file, and attach it to a function that should be allowed to call it.
Let’s start by defining a tool. A tool has a name, a description, and a set of parameters (described with a JSON schema). The skeleton of a tool configuration looks like this:
[tools.my_tool_name]
description = "..."
parameters = "..."
Let’s create a tool for a fictional weather API that takes a location (and optionally units) and returns the current temperature.
The parameters for this tool are defined as a JSON schema.
We’ll need two parameters: location (a string) and units (an enum with values fahrenheit and celsius). Only location is required, and no additional properties should be allowed. Finally, we’ll add descriptions for each parameter and for the tool itself — this is very important to increase the quality of tool use!
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "description": "Get the current temperature for a given location.",
  "properties": {
    "location": {
      "type": "string",
      "description": "The location to get the temperature for (e.g. \"New York\")"
    },
    "units": {
      "type": "string",
      "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\"). Defaults to \"fahrenheit\".",
      "enum": ["fahrenheit", "celsius"]
    }
  },
  "required": ["location"],
  "additionalProperties": false
}
[tools.get_temperature]
description = "Get the current temperature for a given location."
parameters = "tools/get_temperature.json"
The file organization is up to you, but we recommend the following structure:
functions/
  - …
tools/
  - get_temperature.json
- tensorzero.toml
Functions with Tool Use
We can now create our two functions. The query generation function will use the tool we just defined, and the response generation function will be similar to our previous examples.
Let’s define the functions, their variants, and any associated templates and schemas.
[functions.generate_weather_query]
type = "chat"
tools = ["get_temperature"]

[functions.generate_weather_query.variants.simple_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/generate_weather_query/simple_variant/system.minijinja"

[functions.generate_weather_report]
type = "chat"
user_schema = "functions/generate_weather_report/user_schema.json"

[functions.generate_weather_report.variants.simple_variant]
type = "chat_completion"
weight = 1.0
model = "my_gpt_4o_mini"
system_template = "functions/generate_weather_report/simple_variant/system.minijinja"
user_template = "functions/generate_weather_report/simple_variant/user.minijinja"
You are a helpful AI assistant that generates weather queries.

If the user asks about the weather in a given location, request a tool call to `get_temperature` with the location.
Optionally, the user may also specify the units (must be "fahrenheit" or "celsius"; defaults to "fahrenheit").

If the user asks about anything else, just respond that you can't help.

---

Examples:

User: What's the weather in New York?
Assistant (Tool Call): get_temperature(location="New York")

User: What's the weather in Tokyo in Celsius?
Assistant (Tool Call): get_temperature(location="Tokyo", units="celsius")

User: What is the capital of France?
Assistant (Text): I can only provide weather information.
You are a helpful AI assistant that generates brief weather reports.

You'll be provided with the temperature for a given location.
Respond with a concise weather report for the given temperature.
Add a sentence with a funny local recommendation based on the information.

If "Units" is missing, assume the temperature is in Fahrenheit.

---

Examples:

User:
Location: San Francisco
Temperature: 82
Units:
Assistant: The weather in San Francisco is 82°F. Hope you get a chance to enjoy La Taqueria by Dolores Park!

User:
Location: Tokyo
Temperature: -5
Units: celsius
Assistant: The weather in Tokyo is -5°C — a perfect day for a trip to an onsen.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "location": {
      "type": "string"
    },
    "temperature": {
      "type": "string"
    },
    "units": {
      "type": ["null", "string"]
    }
  },
  "required": ["location", "temperature"],
  "additionalProperties": false
}
Please respond with a weather report given the information below.
Location: {{ location }}
Temperature: {{ temperature }}
Units: {{ units }}
When the model requests a tool call, the response will include a tool_call content block. These content blocks have the fields arguments, name, raw_arguments, and raw_name. The first two fields are validated against the tool’s configuration (or null if invalid). The last two fields contain the raw values received from the model.
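Based on the fields described above, a defensive way to consume these content blocks might look like the following sketch. The fallback behavior (parsing raw_arguments yourself) is just one possible recovery strategy; how you handle invalid tool calls is up to your application.

import json

from tensorzero import ToolCall


def handle_tool_call(block: ToolCall) -> dict:
    if block.name == "get_temperature" and block.arguments is not None:
        # Already validated against tools/get_temperature.json by the gateway.
        return block.arguments
    # Validation failed: inspect what the model actually produced and decide how to recover.
    print(f"Unvalidated tool call: {block.raw_name}({block.raw_arguments})")
    try:
        return json.loads(block.raw_arguments)
    except json.JSONDecodeError:
        return {}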
Episodes
Before we make any inference requests, we must introduce one more concept: episodes.
An episode is a sequence of inferences associated with a common downstream outcome.
For example, an episode could refer to a sequence of LLM calls associated with:
- Resolving a support ticket
- Preparing an insurance claim
- Completing a phone call
- Extracting data from a document
- Drafting an email
An episode will include one or more functions, and sometimes multiple calls to the same function. Your application can run arbitrary actions (e.g. interact with users, retrieve documents, actuate robotics) between function calls within an episode. Though these are outside the scope of TensorZero, it is fine (and encouraged) to build your LLM systems this way.
Episodes allow you to group sequences of inferences in multi-step LLM workflows, apply feedback to these sequences as a whole, and optimize your workflows end-to-end. During an episode, multiple calls to the same function will receive the same variant (unless fallbacks are necessary).
In the case of our weather RAG, both query generation and response generation contribute to what we ultimately care about: the weather report. So we’ll want to associate each weather report with the inferences that led to it. The workflow of generating a weather report will be our episode.
The /inference endpoint accepts an optional episode_id field. When you make the first inference request, you don’t have to provide an episode_id. The gateway will create a new episode for you and return the episode_id in the response. When you make the second inference request, you must provide the episode_id you received in the first response. The gateway will use the episode_id to associate the two inference requests together.
from tensorzero import TensorZeroGateway, ToolCall
with TensorZeroGateway("http://localhost:3000") as client:
    query_result = client.inference(
        function_name="generate_weather_query",
        # This is the first inference request in an episode so we don't need to provide an episode_id
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in São Paulo?",
                }
            ]
        },
    )

    print(query_result)

    # In a production setting, you'd validate the output more thoroughly
    assert len(query_result.content) == 1
    assert isinstance(query_result.content[0], ToolCall)

    location = query_result.content[0].arguments.get("location")
    units = query_result.content[0].arguments.get("units", "celsius")

    # Now we pretend to make a tool call (e.g. to an API)
    temperature = "35"

    report_result = client.inference(
        function_name="generate_weather_report",
        # This is the second inference request in an episode so we need to provide the episode_id
        episode_id=query_result.episode_id,
        input={
            "messages": [
                {
                    "role": "user",
                    "content": {
                        "location": location,
                        "temperature": temperature,
                        "units": units,
                    },
                }
            ]
        },
    )

    print(report_result)
Sample Output
ChatInferenceResponse(
    inference_id=UUID('01940b67-b1f3-78e0-9ead-495b333d122f'),
    episode_id=UUID('01940b67-afd9-79c2-8c29-e5094f5c03a2'),
    variant_name='simple_variant',
    content=[
        ToolCall(
            type='tool_call',
            arguments={'location': 'São Paulo'},
            id='call_ADmhPqUml5fDL4bPvlMMDKZR',
            name='get_temperature',
            raw_arguments='{"location":"São Paulo"}',
            raw_name='get_temperature',
        )
    ],
    usage=Usage(input_tokens=266, output_tokens=16),
)
ChatInferenceResponse(
    inference_id=UUID('01940b67-b4e4-71c1-9cca-c68fa8fbcb3e'),
    episode_id=UUID('01940b67-afd9-79c2-8c29-e5094f5c03a2'),
    variant_name='simple_variant',
    content=[
        Text(
            type='text',
            text="The weather in São Paulo is 35°C — a hot day to soak up the sun! Make sure to grab a cold acai bowl to cool off while you're out!",
        )
    ],
    usage=Usage(input_tokens=188, output_tokens=36),
)
import asyncio
from tensorzero import AsyncTensorZeroGateway, ToolCall
async def main():
    async with AsyncTensorZeroGateway("http://localhost:3000") as client:
        query_result = await client.inference(
            function_name="generate_weather_query",
            # This is the first inference request in an episode so we don't need to provide an episode_id
            input={
                "messages": [
                    {
                        "role": "user",
                        "content": "What is the weather like in São Paulo?",
                    }
                ]
            },
        )

        print(query_result)

        # In a production setting, you'd validate the output more thoroughly
        assert len(query_result.content) == 1
        assert isinstance(query_result.content[0], ToolCall)

        location = query_result.content[0].arguments.get("location")
        units = query_result.content[0].arguments.get("units", "celsius")

        # Now we pretend to make a tool call (e.g. to an API)
        temperature = "35"

        report_result = await client.inference(
            function_name="generate_weather_report",
            # This is the second inference request in an episode so we need to provide the episode_id
            episode_id=query_result.episode_id,
            input={
                "messages": [
                    {
                        "role": "user",
                        "content": {
                            "location": location,
                            "temperature": temperature,
                            "units": units,
                        },
                    }
                ]
            },
        )

        print(report_result)


if __name__ == "__main__":
    asyncio.run(main())
Sample Output
ChatInferenceResponse(
    inference_id=UUID('01940b67-cb1a-7613-a516-cda7fe096d2b'),
    episode_id=UUID('01940b67-c934-7883-af94-d8f22db604fc'),
    variant_name='simple_variant',
    content=[
        ToolCall(
            type='tool_call',
            arguments={'location': 'São Paulo'},
            id='call_aZR2ITRKvAX8ntkPLY9OWetl',
            name='get_temperature',
            raw_arguments='{"location":"São Paulo"}',
            raw_name='get_temperature',
        )
    ],
    usage=Usage(
        input_tokens=266,
        output_tokens=16,
    ),
)
ChatInferenceResponse(
    inference_id=UUID('01940b67-cdc5-7a20-898b-859240fcac52'),
    episode_id=UUID('01940b67-c934-7883-af94-d8f22db604fc'),
    variant_name='simple_variant',
    content=[
        Text(
            type='text',
            text='The weather in São Paulo is 35°C, making it quite a hot day! Perfect weather for indulging in some refreshing açaí bowls at the park!',
        )
    ],
    usage=Usage(
        input_tokens=188,
        output_tokens=34,
    ),
)
import json
from openai import OpenAI
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
    query_result = client.chat.completions.create(
        model="tensorzero::function_name::generate_weather_query",
        # This is the first inference request in an episode so we don't need to provide an episode_id
        messages=[
            {
                "role": "user",
                "content": "What is the weather like in São Paulo?",
            }
        ],
    )

    print(query_result)

    # In a production setting, you'd validate the output more thoroughly
    assert len(query_result.choices) == 1
    assert query_result.choices[0].message.tool_calls is not None
    assert len(query_result.choices[0].message.tool_calls) == 1

    tool_call = query_result.choices[0].message.tool_calls[0]
    arguments = json.loads(tool_call.function.arguments)
    location = arguments.get("location")
    units = arguments.get("units", "celsius")

    # Now we pretend to make a tool call (e.g. to an API)
    temperature = "35"

    report_result = client.chat.completions.create(
        model="tensorzero::function_name::generate_weather_report",
        # This is the second inference request in an episode so we need to provide the episode_id
        extra_headers={"episode_id": str(query_result.episode_id)},
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "location": location,
                        "temperature": temperature,
                        "units": units,
                    }
                ],
            }
        ],
    )

    print(report_result)
Sample Output
ChatCompletion(
    id='01940b67-e660-74e2-a7cd-c7d98b63604f',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content=None,
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=[
                    ChatCompletionMessageToolCall(
                        id='call_hOFIj3xYSFPa9cxbqQCdv7ID',
                        function=Function(
                            arguments='{"location":"São Paulo"}',
                            name='get_temperature',
                        ),
                        type='function',
                    ),
                ],
            ),
        ),
    ],
    created=1735358146,
    model='simple_variant',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='',
    usage=CompletionUsage(
        completion_tokens=16,
        prompt_tokens=266,
        total_tokens=282,
        completion_tokens_details=None,
        prompt_tokens_details=None,
    ),
    episode_id='01940b67-e49a-70a3-9991-b6d38b5b19d2',
)
ChatCompletion(
    id='01940b67-e88f-7132-b958-95b7fdb623d2',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="The weather in São Paulo is 35°C — a hot day to explore the city's vibrant street art! Make sure to cool off with some delicious açaí!",
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=[],
            ),
        ),
    ],
    created=1735358146,
    model='simple_variant',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='',
    usage=CompletionUsage(
        completion_tokens=34,
        prompt_tokens=188,
        total_tokens=222,
        completion_tokens_details=None,
        prompt_tokens_details=None,
    ),
    episode_id='01940b67-e49a-70a3-9991-b6d38b5b19d2',
)
The first inference request doesn’t require an episode_id. The response will contain a new episode_id that we’ll use in the second request.
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_weather_query",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "What is the weather like in São Paulo?"
        }
      ]
    }
  }'
Sample Output
{
  "inference_id": "0191bf87-3c82-78f3-8a02-603f40f3a817",
  "episode_id": "0191bf87-3a6e-7193-a2be-ee565d6f0308",
  "variant_name": "simple_variant",
  "content": [
    {
      "type": "tool_call",
      "arguments": {
        "location": "São Paulo"
      },
      "id": "call_BuINq30qJRl6AWPmIKPi8DhV",
      "name": "get_temperature",
      "raw_arguments": "{\"location\":\"São Paulo\"}",
      "raw_name": "get_temperature"
    }
  ],
  "usage": {
    "input_tokens": 266,
    "output_tokens": 15
  }
}
The second inference request requires the episode_id you received in the first response.
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "generate_weather_report",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": {
            "location": "São Paulo",
            "temperature": "35",
            "units": "fahrenheit"
          }
        }
      ]
    }
  }'
Sample Output
{
  "inference_id": "0191c31d-7de9-7822-86c0-47c79c737085",
  "episode_id": "0191c31c-e397-7860-8e28-bcafec8f4225",
  "variant_name": "simple_variant",
  "content": [
    {
      "type": "text",
      "text": "The weather in São Paulo is 35°F, which is quite chilly for the region. Remember to bundle up and maybe treat yourself to a warm pastel at the local market!"
    }
  ],
  "usage": {
    "input_tokens": 185,
    "output_tokens": 21
  }
}
Episode-Level Metrics
The primary use case for episodes is to enable episode-level metrics. In the previous example, we assigned feedback to individual inferences. TensorZero can also collect episode-level feedback, which can be useful for optimizing entire workflows.
To collect episode-level feedback, we need to define a metric with level = "episode". Let’s add a metric for the weather RAG example. We’ll use user_rating as the metric name, and we’ll collect it as a float.
[metrics.user_rating]
level = "episode"
optimize = "max"
type = "float"
Making a feedback request with an episode_id instead of an inference_id associates the feedback with the entire episode.
from tensorzero import TensorZeroGateway
with TensorZeroGateway("http://localhost:3000") as client:
    feedback_result = client.feedback(
        metric_name="user_rating",
        # Set the episode_id to the one returned in the inference response
        episode_id="00000000-0000-0000-0000-000000000000",
        # Set the value for the metric (numeric types will be coerced to float)
        value=5,
    )

    print(feedback_result)
Sample Output
FeedbackResponse(feedback_id='01940b67-b4ee-7402-9caf-a235717753c7')
import asyncio
from tensorzero import AsyncTensorZeroGateway
async def main():
    async with AsyncTensorZeroGateway("http://localhost:3000") as client:
        feedback_result = await client.feedback(
            metric_name="user_rating",
            # Set the episode_id to the one returned in the inference response
            episode_id="00000000-0000-0000-0000-000000000000",
            # Set the value for the metric (numeric types will be coerced to float)
            value=5,
        )

        print(feedback_result)


if __name__ == "__main__":
    asyncio.run(main())
Sample Output
FeedbackResponse(feedback_id='01940b67-cdcd-7583-930a-0c796f1a7fd3')
Set the episode_id to the one returned in the inference response.
The value for a float metric can be any numeric type, but will be coerced to float under the hood.
curl -X POST http://localhost:3000/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "metric_name": "user_rating",
    "episode_id": "00000000-0000-0000-0000-000000000000",
    "value": 5
  }'
Sample Output
{
  "feedback_id": "0191bf8e-0c3f-7b11-9e12-6a7eb823d538"
}
That’s all you need for our weather RAG example. This is clearly a toy example, but it illustrates the core concepts of TensorZero. You can replace the mock weather API with a real API call — or, if you were searching over documents instead, with anything from BM25 to cutting-edge vector search.
Part IV — Email Data Extraction
JSON Functions
Everything we’ve done so far has been with Chat Functions.
TensorZero also supports JSON Functions for use cases that require structured outputs. The input is the same, but the function returns a JSON value instead of a chat message.
Let’s create a JSON function that extracts an email address from a user’s message.
The setup is very similar to our previous examples, except that the function is defined with type = "json" and requires an output_schema.
Let’s start by defining the schema, a static system template, and the rest of the configuration.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "email": {
      "type": "string",
      "description": "The email address extracted from the user's message."
    }
  },
  "required": ["email"],
  "additionalProperties": false
}
You are a helpful AI assistant that extracts an email address from a user's message.
Return an empty string if no email address is found.

Your output should be a JSON object with the following schema:

{
  "email": "..."
}
---
Examples:
User: Using TensorZero at work? Ping [email protected] to set up a Slack channel (free).
Assistant: {"email": "[email protected]"}

User: I just received an ominous email from [email protected]...
Assistant: {"email": "[email protected]"}

User: Let's sue TensorZero!
Assistant: {"email": ""}
[functions.extract_email]
type = "json"
output_schema = "functions/extract_email/output_schema.json"

[functions.extract_email.variants.simple_variant]
type = "chat_completion"
weight = 1
model = "my_gpt_4o_mini"
system_template = "functions/extract_email/simple_variant/system.minijinja"
Once you’ve set it up, your file tree should look like this:
functions/
  - …
  extract_email/
    simple_variant/
      - system.minijinja
      - user.minijinja
    - output_schema.json
  - …
- …
- tensorzero.toml
Finally, let’s make an inference request.
The request format is very similar to chat functions, but the response will contain an output field instead of a content field. The output field will be a JSON object with the fields parsed and raw. The parsed field is the parsed output as a valid JSON that fits your schema (null if the model didn’t generate a JSON that matches your schema), and the raw field is the raw output from the model as a string.
from tensorzero import TensorZeroGateway
with TensorZeroGateway("http://localhost:3000") as client:
    result = client.inference(
        function_name="extract_email",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "blah blah blah [email protected] blah blah blah",
                }
            ]
        },
    )

    print(result)
Sample Output
JsonInferenceResponse(
    inference_id=UUID('01940b76-8226-7601-a86c-6cb927f05b44'),
    episode_id=UUID('01940b76-80c6-7013-812b-2bcc88e82519'),
    variant_name='simple_variant',
    output=JsonInferenceOutput(),
    usage=Usage(
        input_tokens=139,
        output_tokens=11,
    ),
)
import asyncio
from tensorzero import AsyncTensorZeroGateway
async def main():
    async with AsyncTensorZeroGateway("http://localhost:3000") as client:
        result = await client.inference(
            function_name="extract_email",
            input={
                "messages": [
                    {
                        "role": "user",
                        "content": "blah blah blah [email protected] blah blah blah",
                    }
                ]
            },
        )

        print(result)


if __name__ == "__main__":
    asyncio.run(main())
Sample Output
JsonInferenceResponse(
    inference_id=UUID('01940b76-de03-7f81-ad44-59534cf775ae'),
    episode_id=UUID('01940b76-db5d-7cc3-9f43-b86709d5f9bd'),
    variant_name='simple_variant',
    output=JsonInferenceOutput(),
    usage=Usage(
        input_tokens=139,
        output_tokens=11,
    ),
)
from openai import OpenAI
with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
    result = client.chat.completions.create(
        model="tensorzero::function_name::extract_email",
        messages=[
            {
                "role": "user",
                "content": "blah blah blah [email protected] blah blah blah",
            },
        ],
    )

    print(result)
Sample Output
ChatCompletion(
    id='01940b75-46ba-7b80-829d-e19d4421c749',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=None,
            ),
        ),
    ],
    created=1735359022,
    model='simple_variant',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='',
    usage=CompletionUsage(
        completion_tokens=10,
        prompt_tokens=139,
        total_tokens=149,
        completion_tokens_details=None,
        prompt_tokens_details=None,
    ),
    episode_id='01940b75-40fd-74a1-8c01-d09347f890a4',
)
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "extract_email",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "blah blah blah [email protected] blah blah blah"
        }
      ]
    }
  }'
Sample Output
{
  "inference_id": "0191bf98-2fbc-7781-9197-4a066ea7cd68",
  "episode_id": "0191bf98-2cb0-7822-9a85-700ac377cf36",
  "variant_name": "simple_variant",
  "output": {},
  "usage": {
    "input_tokens": 139,
    "output_tokens": 10
  }
}
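Once you have a response from a JSON function, a minimal sketch of consuming the output field might look like this; the fallback behavior when parsed is null (logging the raw string and returning None) is just one option, and the helper name is hypothetical.

from typing import Optional


def extract_email_from(result) -> Optional[str]:
    if result.output.parsed is not None:
        # `parsed` is guaranteed to match output_schema.json, so "email" is present.
        return result.output.parsed["email"]
    # The model's output didn't satisfy the schema; `raw` still has the unparsed string.
    print(f"Schema validation failed; raw output was: {result.output.raw!r}")
    return None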
Conclusion
This tutorial only scratches the surface of what you can do with TensorZero.
TensorZero especially shines when it comes to optimizing complex LLM workflows using the data collected by the gateway. For example, the structured data collected by the gateway can be used to better fine-tune models compared to using historical prompts and generations alone.
We are working on a series of examples covering the entire “data flywheel in a box” that TensorZero provides. Here are some of our favorites:
- Optimizing Data Extraction (NER) with TensorZero
- Writing Haikus to Satisfy a Judge with Hidden Preferences
- Improving LLM Chess Ability with Best-of-N Sampling
- Improving Math Reasoning with a Custom Recipe for Automated Prompt Engineering (DSPy)
You can also dive deeper into the Configuration Reference and the API Reference.
We’re excited to see what you build! Please share your projects, ideas, and feedback with us on Slack or Discord.