API Reference: Inference
POST /inference
The inference endpoint is the core of the TensorZero Gateway API.
Under the hood, the gateway validates the request, samples a variant from the function, handles templating when applicable, and routes the inference to the appropriate model provider. If a problem occurs, it attempts to gracefully fall back to a different model provider or variant. After a successful inference, it returns the data to the client and asynchronously stores structured information in the database.
Request
additional_tools
- Type: a list of tools (see below)
- Required: no (default: [])
A list of tools defined at inference time that the model is allowed to call. This field allows for dynamic tool use, i.e. defining tools at runtime.
You should prefer to define tools in the configuration file if possible. Only use this field if dynamic tool use is necessary for your use case.
Each tool is an object with the following fields: description, name, parameters, and strict.
The fields are identical to those in the configuration file, except that the parameters field should contain the JSON schema itself rather than a path to it.
See Configuration Reference for more details.
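As a rough sketch of the shape the gateway expects, the helper below assembles one entry for the additional_tools list. This is an illustration only, not part of the TensorZero API; the get_temperature tool and its schema are hypothetical.

```python
def build_dynamic_tool(name, description, parameters, strict=False):
    """Assemble one entry for the `additional_tools` list.

    Note that `parameters` must be the JSON schema itself, not a path
    to a schema file (unlike the configuration file).
    """
    return {
        "name": name,
        "description": description,
        "parameters": parameters,
        "strict": strict,
    }

# Hypothetical tool for illustration
tool = build_dynamic_tool(
    name="get_temperature",
    description="Get the current temperature in a given location",
    parameters={
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
        "additionalProperties": False,
    },
)
```

The resulting dict can be passed as one element of the additional_tools field of the request body.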
allowed_tools
- Type: list of strings
- Required: no
A list of tool names that the model is allowed to call. The tools must be defined in the configuration file.
Any tools provided in additional_tools
are always allowed, irrespective of this field.
cache_options
- Type: object
- Required: no (default: {"enabled": "write_only"})
Options for controlling inference caching behavior. The object has the fields below.
See Inference Caching for more details.
cache_options.enabled
- Type: string
- Required: no (default: "write_only")
The cache mode to use. Must be one of:
- "write_only" (default): Only write to cache but don't serve cached responses
- "read_only": Only read from cache but don't write new entries
- "on": Both read from and write to cache
- "off": Disable caching completely
Note: When using dryrun=true, the gateway never writes to the cache.
cache_options.max_age_s
- Type: integer
- Required: no (default: null)
Maximum age in seconds for cache entries. If set, cached responses older than this value will not be used.
For example, if you set max_age_s=3600, the gateway will only use cache entries that were created in the last hour.
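The age check can be sketched as follows. This is a simplified illustration of the documented rule, not the gateway's actual implementation:

```python
import time

def cache_entry_usable(created_at_s, max_age_s=None, now_s=None):
    """Return True if a cache entry may be served under `max_age_s`.

    `max_age_s=None` (the default) means no age limit.
    """
    if max_age_s is None:
        return True
    now = time.time() if now_s is None else now_s
    return (now - created_at_s) <= max_age_s

# With max_age_s=3600, only entries created within the last hour are served.
assert cache_entry_usable(created_at_s=0, max_age_s=3600, now_s=1800)
assert not cache_entry_usable(created_at_s=0, max_age_s=3600, now_s=7200)
```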
credentials
- Type: object (a map from dynamic credential names to API keys)
- Required: no (default: no credentials)
Each model provider in your TensorZero configuration can be configured to accept credentials at inference time by using the dynamic location (e.g. dynamic::my_dynamic_api_key_name).
See the configuration reference for more details.
The gateway expects the credentials to be provided in the credentials field of the request body as specified below.
The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.
Example
```toml
[models.my_model_name.providers.my_provider_name]
# ...
# Note: the name of the credential field (e.g. `api_key_location`) depends on the provider type
api_key_location = "dynamic::my_dynamic_api_key_name"
# ...
```
```json
{
  // ...
  "credentials": {
    // ...
    "my_dynamic_api_key_name": "sk-..."
    // ...
  }
  // ...
}
```
dryrun
- Type: boolean
- Required: no
If true, the inference request will be executed but won't be stored to the database.
The gateway will still call the downstream model providers.
This field is primarily for debugging and testing; you generally shouldn't use it in production.
episode_id
- Type: UUID
- Required: no
The ID of an existing episode to associate the inference with.
For the first inference of a new episode, you should not provide an episode_id. If omitted, the gateway will generate a new episode ID and return it in the response.
Only use episode IDs that were returned by the TensorZero gateway.
function_name
- Type: string
- Required: either function_name or model_name must be provided
The name of the function to call.
The function must be defined in the configuration file.
include_original_response
- Type: boolean
- Required: no
If true, the original response from the model will be included in the response in the original_response field as a string.
Currently, this field can't be used with streaming inferences.
See original_response in the response section for more details.
input
- Type: varies
- Required: yes
The input to the function.
The type of the input depends on the function type.
input.messages
- Type: list of messages (see below)
- Required: no (default: [])
A list of messages to provide to the model.
Each message is an object with the following fields:
- role: The role of the message (assistant or user).
- content: The content of the message (see below).
The content field can have one of the following types:
- string: the text for a text message (only allowed if there is no schema for that role)
- list of content blocks: the content blocks for the message (see below)
A content block is an object with the field type and additional fields depending on the type.
If the content block has type text, it must have exactly one of the following additional fields:
- text: The text for the content block.
- arguments: A JSON object containing the function arguments for TensorZero functions with templates and schemas (see Prompt Templates & Schemas for details).
If the content block has type tool_call, it must have the following additional fields:
- arguments: The arguments for the tool call.
- id: The ID for the content block.
- name: The name of the tool for the content block.
If the content block has type tool_result, it must have the following additional fields:
- id: The ID for the content block.
- name: The name of the tool for the content block.
- result: The result of the tool call.
If the content block has type image, it must have one of the following additional fields:
- url: The URL for a remote image.
- mime_type and data: The MIME type and base64-encoded data for an embedded image. We support the following MIME types: image/png, image/jpeg, and image/webp.
See the Multimodal Inference guide for more details on how to use images in inference.
If the content block has type raw_text, it must have the following additional field:
- value: The text for the content block. This content block will ignore any relevant templates and schemas for this function.
If the content block has type unknown, it must have the following additional fields:
- data: The original content block from the provider, without any validation or transformation by TensorZero.
- model_provider_name (optional): A string specifying when this content block should be included in the model provider input. If set, the content block will only be provided to this specific model provider. If not set, the content block is passed to all model providers.
For example, the following hypothetical unknown content block will send the daydreaming content block only to inference requests targeting the your_model_provider_name model provider:
```json
{
  "type": "unknown",
  "data": {
    "type": "daydreaming",
    "dream": "..."
  },
  "model_provider_name": "tensorzero::model_name::your_model_name::provider_name::your_model_provider_name"
}
```
This is the most complex field in the entire API. See this example for more details.
Example
```json
{
  // ...
  "input": {
    "messages": [
      // If you don't have a user (or assistant) schema...
      {
        "role": "user", // (or "assistant")
        "content": "What is the weather in Tokyo?"
      },
      // If you have a user (or assistant) schema...
      {
        "role": "user", // (or "assistant")
        "content": [
          {
            "type": "text",
            "arguments": {
              "location": "Tokyo"
            }
          }
        ]
      },
      // If the model previously called a tool...
      {
        "role": "assistant",
        "content": [
          {
            "type": "tool_call",
            "id": "0",
            "name": "get_temperature",
            "arguments": "{\"location\": \"Tokyo\"}"
          }
        ]
      },
      // ...and you're providing the result of that tool call...
      {
        "role": "user",
        "content": [
          {
            "type": "tool_result",
            "id": "0",
            "name": "get_temperature",
            "result": "70"
          }
        ]
      },
      // You can also specify a text message using a content block...
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What about NYC?" // (or object if there is a schema)
          }
        ]
      },
      // You can also provide multiple content blocks in a single message...
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "Sure, I can help you with that." // (or object if there is a schema)
          },
          {
            "type": "tool_call",
            "id": "0",
            "name": "get_temperature",
            "arguments": "{\"location\": \"New York\"}"
          }
        ]
      }
      // ...
    ]
    // ...
  }
  // ...
}
```
input.system
- Type: string or object
- Required: no
The input for the system message.
If the function does not have a system schema, this field should be a string.
If the function has a system schema, this field should be an object that matches the schema.
model_name
- Type: string
- Required: either model_name or function_name must be provided
The name of the model to call.
In this case, the function will be a built-in passthrough chat function called tensorzero::default.
The model must be defined in the configuration file, or correspond to a short-hand model name.
Short-hand model names follow the format provider::model_name (e.g. openai::gpt-4o-mini or anthropic::claude-3-5-haiku).
The following model providers support short-hand model names: anthropic, deepseek, fireworks, google_ai_studio_gemini, hyperbolic, mistral, openai, together, and xai.
The remaining providers do not support short-hand model names, and require an explicit model block in your configuration file.
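For illustration, a client-side check of the short-hand format might look like this. The provider list mirrors the one above; the helper itself is hypothetical, not part of the TensorZero API:

```python
# Providers that support short-hand model names, per the list above.
SHORTHAND_PROVIDERS = {
    "anthropic", "deepseek", "fireworks", "google_ai_studio_gemini",
    "hyperbolic", "mistral", "openai", "together", "xai",
}

def parse_shorthand(model_name):
    """Split `provider::model_name` and verify the provider supports short-hand names."""
    provider, sep, model = model_name.partition("::")
    if not sep or provider not in SHORTHAND_PROVIDERS:
        raise ValueError(f"not a supported short-hand model name: {model_name!r}")
    return provider, model

assert parse_shorthand("openai::gpt-4o-mini") == ("openai", "gpt-4o-mini")
```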
output_schema
- Type: object (valid JSON Schema)
- Required: no
If set, this schema will override the output_schema defined in the function configuration for a JSON function.
This dynamic output schema is used for validating the output of the function, and sent to providers which support structured outputs.
parallel_tool_calls
- Type: boolean
- Required: no
If true, the function will be allowed to request multiple tool calls in a single conversation turn.
If not set, we default to the configuration value for the function being called.
Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field. At the moment, only Fireworks AI and OpenAI support parallel tool calls.
params
- Type: object (see below)
- Required: no (default: {})
Override inference-time parameters for a particular variant type. This field allows for dynamic inference parameters, i.e. defining parameters at runtime.
This field's format is { variant_type: { param: value, ... }, ... }.
You should prefer to set these parameters in the configuration file if possible.
Only use this field if you need to set these parameters dynamically at runtime.
Note that the parameters will apply to every variant of the specified type.
Currently, we support the following parameters for the chat_completion variant type:
- frequency_penalty
- max_tokens
- presence_penalty
- seed
- temperature
- top_p
See Configuration Reference for more details on the parameters, and Examples below for usage.
Example
For example, if you wanted to dynamically override the temperature parameter for chat_completion variants, you'd include the following in the request body:
```json
{
  // ...
  "params": {
    "chat_completion": {
      "temperature": 0.7
    }
  }
  // ...
}
```
See “Chat Function with Dynamic Inference Parameters” for a complete example.
stream
- Type: boolean
- Required: no
If true, the gateway will stream the response from the model provider.
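As described in the Response section below, streaming responses arrive as an SSE stream of JSON messages terminated by a [DONE] sentinel. A minimal client-side parser for the data lines might look like this (simplified: real SSE also allows comments, multi-line data, and other fields):

```python
import json

def parse_sse_data_lines(lines):
    """Collect JSON payloads from `data:` lines, stopping at the `[DONE]` sentinel."""
    events = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank lines and non-data fields
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        events.append(json.loads(payload))
    return events

chunks = parse_sse_data_lines([
    'data: {"content": [{"type": "text", "id": "0", "text": "Hi"}]}',
    "",
    "data: [DONE]",
])
assert chunks[0]["content"][0]["text"] == "Hi"
```

In practice, the official clients handle this for you; the sketch only shows the wire format.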
tags
- Type: flat JSON object with string keys and values
- Required: no
User-provided tags to associate with the inference.
For example, {"user_id": "123"} or {"author": "Alice"}.
tool_choice
- Type: string
- Required: no
If set, overrides the tool choice strategy for the request.
The supported tool choice strategies are:
- none: The function should not use any tools.
- auto: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
- required: The model should use a tool. If multiple tools are available, the model decides which tool to use.
- { specific = "tool_name" }: The model should use a specific tool. The tool must be defined in the tools section of the configuration file or provided in additional_tools.
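Note that { specific = "tool_name" } above is TOML syntax; in a JSON request body the equivalent would presumably be written as {"specific": "tool_name"}. A hypothetical client-side validator for the documented strategies:

```python
def validate_tool_choice(tool_choice):
    """Accept the documented tool choice strategies; reject anything else.

    This helper is illustrative, not part of the TensorZero API.
    """
    if tool_choice in ("none", "auto", "required"):
        return tool_choice
    if isinstance(tool_choice, dict) and set(tool_choice) == {"specific"}:
        return tool_choice
    raise ValueError(f"unsupported tool_choice: {tool_choice!r}")

assert validate_tool_choice({"specific": "get_temperature"}) == {"specific": "get_temperature"}
```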
variant_name
- Type: string
- Required: no
If set, pins the inference request to a particular variant (not recommended).
You should generally not set this field, and instead let the TensorZero gateway assign a variant. This field is primarily used for testing or debugging purposes.
Response
The response format depends on the function type (as defined in the configuration file) and whether the response is streamed or not.
Chat Function
When the function type is chat
, the response is structured as follows.
In regular (non-streaming) mode, the response is a JSON object with the following fields:
content
- Type: a list of content blocks (see below)
The content blocks generated by the model.
A content block can have type equal to text or tool_call.
Reasoning models (e.g. DeepSeek R1) might also include thought content blocks.
If type is text, the content block has the following fields:
- text: The text for the content block.
If type is tool_call, the content block has the following fields:
- arguments (object): The validated arguments for the tool call (null if invalid).
- id (string): The ID of the content block.
- name (string): The validated name of the tool (null if invalid).
- raw_arguments (string): The arguments for the tool call generated by the model (which might be invalid).
- raw_name (string): The name of the tool generated by the model (which might be invalid).
If type is thought, the content block has the following fields:
- text (string): The text of the thought.
If the model provider responds with a content block of an unknown type, it will be included in the response as a content block of type unknown with the following additional fields:
- data: The original content block from the provider, without any validation or transformation by TensorZero.
- model_provider_name: The fully-qualified name of the model provider that returned the content block.
For example, if the model provider your_model_provider_name returns a content block of type daydreaming, it will be included in the response like this:
```json
{
  "type": "unknown",
  "data": {
    "type": "daydreaming",
    "dream": "..."
  },
  "model_provider_name": "tensorzero::model_name::your_model_name::provider_name::your_model_provider_name"
}
```
episode_id
- Type: UUID
The ID of the episode associated with the inference.
inference_id
- Type: UUID
The ID assigned to the inference.
original_response
- Type: string (optional)
The original response from the model provider (only available when include_original_response is true).
The returned data depends on the variant type:
- chat_completion: raw response from the inference to the model
- experimental_best_of_n_sampling: raw response from the inference to the evaluator
- experimental_mixture_of_n_sampling: raw response from the inference to the fuser
- experimental_dynamic_in_context_learning: raw response from the inference to the model
variant_name
- Type: string
The name of the variant used for the inference.
usage
- Type: object (optional)
The usage metrics for the inference.
The object has the following fields:
- input_tokens: The number of input tokens used for the inference.
- output_tokens: The number of output tokens used for the inference.
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message.
Each JSON message has the following fields:
content
- Type: a list of content block chunks (see below)
The content deltas for the inference.
A content block chunk can have type equal to text or tool_call.
Reasoning models (e.g. DeepSeek R1) might also include thought content block chunks.
If type is text, the chunk has the following fields:
- id: The ID of the content block.
- text: The text delta for the content block.
If type is tool_call, the chunk has the following fields (all strings):
- id: The ID of the content block.
- raw_name: The name of the tool. The gateway does not validate this field during streaming inference.
- raw_arguments: The arguments delta for the tool call. The gateway does not validate this field during streaming inference.
If type is thought, the chunk has the following fields:
- id: The ID of the content block.
- text: The text delta for the thought.
episode_id
- Type: UUID
The ID of the episode associated with the inference.
inference_id
- Type: UUID
The ID assigned to the inference.
variant_name
- Type: string
The name of the variant used for the inference.
usage
- Type: object (optional)
The usage metrics for the inference.
The object has the following fields:
- input_tokens: The number of input tokens used for the inference.
- output_tokens: The number of output tokens used for the inference.
JSON Function
When the function type is json
, the response is structured as follows.
In regular (non-streaming) mode, the response is a JSON object with the following fields:
inference_id
- Type: UUID
The ID assigned to the inference.
episode_id
- Type: UUID
The ID of the episode associated with the inference.
original_response
- Type: string (optional)
The original response from the model provider (only available when include_original_response is true).
The returned data depends on the variant type:
- chat_completion: raw response from the inference to the model
- experimental_best_of_n_sampling: raw response from the inference to the evaluator
- experimental_mixture_of_n_sampling: raw response from the inference to the fuser
- experimental_dynamic_in_context_learning: raw response from the inference to the model
output
- Type: object (see below)
The output object contains the following fields:
- raw: The raw response from the model provider (which might be invalid JSON).
- parsed: The parsed response from the model provider (null if invalid JSON).
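For example, a caller might prefer parsed and fall back to raw. This is a sketch of one reasonable pattern, not part of the API; re-parsing raw yourself may still fail if the model emitted invalid JSON:

```python
import json

def extract_output(output):
    """Prefer the schema-validated `parsed` object; otherwise try parsing `raw`."""
    if output.get("parsed") is not None:
        return output["parsed"]
    # Last resort: attempt to parse `raw` ourselves (raises on invalid JSON).
    return json.loads(output["raw"])

assert extract_output({"raw": "{\"email\": \"a@b.c\"}", "parsed": {"email": "a@b.c"}}) == {"email": "a@b.c"}
assert extract_output({"raw": "{\"email\": \"a@b.c\"}", "parsed": None}) == {"email": "a@b.c"}
```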
variant_name
- Type: string
The name of the variant used for the inference.
usage
- Type: object (optional)
The usage metrics for the inference.
The object has the following fields:
- input_tokens: The number of input tokens used for the inference.
- output_tokens: The number of output tokens used for the inference.
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message.
Each JSON message has the following fields:
episode_id
- Type: UUID
The ID of the episode associated with the inference.
inference_id
- Type: UUID
The ID assigned to the inference.
raw
- Type: string
The raw response delta from the model provider.
The TensorZero Gateway does not provide a parsed field for streaming JSON inferences.
If your application depends on a well-formed JSON response, we recommend using regular (non-streaming) inference.
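If you do stream, one workaround is to accumulate the raw deltas and parse once at the end. This is a sketch of that pattern, not part of the API, and it still fails if the model produced invalid JSON:

```python
import json

def assemble_streaming_json(chunks):
    """Concatenate `raw` deltas from streaming JSON inference messages, then parse."""
    raw = "".join(chunk.get("raw", "") for chunk in chunks)
    return json.loads(raw)

result = assemble_streaming_json([
    {"raw": '{"email":'},
    {"raw": ' "user@example.com"}'},
])
assert result == {"email": "user@example.com"}
```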
variant_name
- Type: string
The name of the variant used for the inference.
usage
- Type: object (optional)
The usage metrics for the inference.
The object has the following fields:
- input_tokens: The number of input tokens used for the inference.
- output_tokens: The number of output tokens used for the inference.
Examples
Chat Function
Configuration
```toml
# ...
[functions.draft_email]
type = "chat"
# ...
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="draft_email",
        input={
            "system": "You are an AI assistant...",
            "messages": [
                {
                    "role": "user",
                    "content": "I need to write an email to Gabriel explaining...",
                }
            ],
        },
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "draft_email",
    "input": {
      "system": "You are an AI assistant...",
      "messages": [
        {
          "role": "user",
          "content": "I need to write an email to Gabriel explaining..."
        }
      ]
    }
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "Hi Gabriel,\n\nI noticed..."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "id": "0",
      "text": "Hi Gabriel," // a text content delta
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
Chat Function with Schemas
Configuration
```toml
# ...
[functions.draft_email]
type = "chat"
system_schema = "system_schema.json"
user_schema = "user_schema.json"
# ...
```
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "tone": {
      "type": "string"
    }
  },
  "required": ["tone"],
  "additionalProperties": false
}
```
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "recipient": {
      "type": "string"
    },
    "email_purpose": {
      "type": "string"
    }
  },
  "required": ["recipient", "email_purpose"],
  "additionalProperties": false
}
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="draft_email",
        input={
            "system": {"tone": "casual"},
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "arguments": {
                                "recipient": "Gabriel",
                                "email_purpose": "Request a meeting to...",
                            },
                        }
                    ],
                }
            ],
        },
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "draft_email",
    "input": {
      "system": {"tone": "casual"},
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "arguments": {
                "recipient": "Gabriel",
                "email_purpose": "Request a meeting to..."
              }
            }
          ]
        }
      ]
    }
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "Hi Gabriel,\n\nI noticed..."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "id": "0",
      "text": "Hi Gabriel," // a text content delta
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
Chat Function with Tool Use
Configuration
```toml
# ...

[functions.weather_bot]
type = "chat"
tools = ["get_temperature"]

# ...

[tools.get_temperature]
description = "Get the current temperature in a given location"
parameters = "get_temperature.json"

# ...
```
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "location": {
      "type": "string",
      "description": "The location to get the temperature for (e.g. \"New York\")"
    },
    "units": {
      "type": "string",
      "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
      "enum": ["fahrenheit", "celsius"]
    }
  },
  "required": ["location"],
  "additionalProperties": false
}
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="weather_bot",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in Tokyo?",
                }
            ],
        },
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "weather_bot",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "What is the weather like in Tokyo?"
        }
      ]
    }
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "tool_call",
      "arguments": {
        "location": "Tokyo",
        "units": "celsius"
      },
      "id": "123456789",
      "name": "get_temperature",
      "raw_arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}",
      "raw_name": "get_temperature"
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "tool_call",
      "id": "123456789",
      "raw_name": "get_temperature",
      "raw_arguments": "{\"location\":" // a tool arguments delta
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
Chat Function with Multi-Turn Tool Use
Configuration
```toml
# ...

[functions.weather_bot]
type = "chat"
tools = ["get_temperature"]

# ...

[tools.get_temperature]
description = "Get the current temperature in a given location"
parameters = "get_temperature.json"

# ...
```
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "location": {
      "type": "string",
      "description": "The location to get the temperature for (e.g. \"New York\")"
    },
    "units": {
      "type": "string",
      "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
      "enum": ["fahrenheit", "celsius"]
    }
  },
  "required": ["location"],
  "additionalProperties": false
}
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="weather_bot",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in Tokyo?",
                },
                {
                    "role": "assistant",
                    "content": [
                        {
                            "type": "tool_call",
                            "arguments": {"location": "Tokyo", "units": "celsius"},
                            "id": "123456789",
                            "name": "get_temperature",
                        }
                    ],
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "tool_result",
                            "id": "123456789",
                            "name": "get_temperature",
                            "result": "25",  # the tool result must be a string
                        }
                    ],
                },
            ],
        },
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "weather_bot",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "What is the weather like in Tokyo?"
        },
        {
          "role": "assistant",
          "content": [
            {
              "type": "tool_call",
              "arguments": {
                "location": "Tokyo",
                "units": "celsius"
              },
              "id": "123456789",
              "name": "get_temperature"
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "type": "tool_result",
              "id": "123456789",
              "name": "get_temperature",
              "result": "25" // the tool result must be a string
            }
          ]
        }
      ]
    }
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "The weather in Tokyo is 25 degrees Celsius."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "id": "0",
      "text": "The weather in" // a text content delta
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
Chat Function with Dynamic Tool Use
Configuration
```toml
# ...

[functions.weather_bot]
type = "chat"
# Note: no `tools = ["get_temperature"]` field in configuration

# ...
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="weather_bot",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in Tokyo?",
                }
            ],
        },
        additional_tools=[
            {
                "name": "get_temperature",
                "description": "Get the current temperature in a given location",
                "parameters": {
                    "$schema": "http://json-schema.org/draft-07/schema#",
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The location to get the temperature for (e.g. \"New York\")",
                        },
                        "units": {
                            "type": "string",
                            "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
                            "enum": ["fahrenheit", "celsius"],
                        },
                    },
                    "required": ["location"],
                    "additionalProperties": False,
                },
            }
        ],
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "weather_bot",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "What is the weather like in Tokyo?"
        }
      ]
    },
    "additional_tools": [
      {
        "name": "get_temperature",
        "description": "Get the current temperature in a given location",
        "parameters": {
          "$schema": "http://json-schema.org/draft-07/schema#",
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the temperature for (e.g. \"New York\")"
            },
            "units": {
              "type": "string",
              "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
              "enum": ["fahrenheit", "celsius"]
            }
          },
          "required": ["location"],
          "additionalProperties": false
        }
      }
    ]
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "tool_call",
      "arguments": {
        "location": "Tokyo",
        "units": "celsius"
      },
      "id": "123456789",
      "name": "get_temperature",
      "raw_arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}",
      "raw_name": "get_temperature"
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "tool_call",
      "id": "123456789",
      "raw_name": "get_temperature",
      "raw_arguments": "{\"location\":" // a tool arguments delta
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
Chat Function with Dynamic Inference Parameters
Configuration
```toml
# ...
[functions.draft_email]
type = "chat"
# ...

[functions.draft_email.variants.prompt_v1]
type = "chat_completion"
temperature = 0.5 # the API request will override this value
# ...
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="draft_email",
        input={
            "system": "You are an AI assistant...",
            "messages": [
                {
                    "role": "user",
                    "content": "I need to write an email to Gabriel explaining...",
                }
            ],
        },
        # Override parameters for every variant with type "chat_completion"
        params={
            "chat_completion": {
                "temperature": 0.7,
            }
        },
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "draft_email",
    "input": {
      "system": "You are an AI assistant...",
      "messages": [
        {
          "role": "user",
          "content": "I need to write an email to Gabriel explaining..."
        }
      ]
    },
    // Override parameters for every variant with type "chat_completion"
    "params": {
      "chat_completion": {
        "temperature": 0.7
      }
    }
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "Hi Gabriel,\n\nI noticed..."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "id": "0",
      "text": "Hi Gabriel," // a text content delta
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
JSON Function
Configuration
```toml
# ...
[functions.extract_email]
type = "json"
output_schema = "output_schema.json"
# ...
```
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "email": {
      "type": "string"
    }
  },
  "required": ["email"]
}
```
Request
```python
from tensorzero import AsyncTensorZeroGateway

async with await AsyncTensorZeroGateway.build_http(gateway_url="http://localhost:3000") as client:
    result = await client.inference(
        function_name="extract_email",
        input={
            "system": "You are an AI assistant...",
            "messages": [
                {
                    "role": "user",
                    "content": "...blah blah blah [email protected] blah blah blah...",
                }
            ],
        },
        # optional: stream=True,
    )
```
```bash
curl -X POST http://localhost:3000/inference \
  -H "Content-Type: application/json" \
  -d '{
    "function_name": "extract_email",
    "input": {
      "system": "You are an AI assistant...",
      "messages": [
        {
          "role": "user",
          "content": "...blah blah blah [email protected] blah blah blah..."
        }
      ]
    }
    // optional: "stream": true
  }'
```
Response
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "output": {
    "raw": "{\"email\": \"...\"}",
    "parsed": {
      "email": "..."
    }
  },
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```
In streaming mode, the response is an SSE stream of JSON messages, followed by a final [DONE] message. Each JSON message looks like this:
```json
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "raw": "{\"email\":", // a JSON content delta
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}
```