API Reference

The TensorZero Gateway exposes two primary API endpoints: /inference and /feedback.

The gateway also exposes auxiliary endpoints for Prometheus-compatible metrics (/metrics), liveness probes (/status), and readiness probes (/health).

POST /inference

The inference endpoint is the core of the TensorZero Gateway API.

Under the hood, the gateway validates the request, samples a variant from the function, handles templating when applicable, and routes the inference to the appropriate model provider. If a problem occurs, it attempts to gracefully fall back to a different model provider or variant. After a successful inference, it returns the data to the client and asynchronously stores structured information in the database.

Request

additional_tools

  • Type: a list of tools (see below)
  • Required: no (default: [])

A list of tools defined at inference time that the model is allowed to call. This field allows for dynamic tool use, i.e. defining tools at runtime.

You should prefer to define tools in the configuration file if possible. Only use this field if dynamic tool use is necessary for your use case.

Each tool is an object with the following fields: description, name, parameters, and strict.

The fields are identical to those in the configuration file, except that the parameters field should contain the JSON schema itself rather than a path to it. See Configuration Reference for more details.
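
For example, a request body that defines a tool dynamically might look like the following sketch (the get_temperature tool and its schema here are illustrative):

{
  // ...
  "additional_tools": [
    {
      "name": "get_temperature",
      "description": "Get the current temperature in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": { "type": "string" }
        },
        "required": ["location"],
        "additionalProperties": false
      },
      "strict": false
    }
  ]
  // ...
}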

allowed_tools

  • Type: list of strings
  • Required: no

A list of tool names that the model is allowed to call. The tools must be defined in the configuration file.

Any tools provided in additional_tools are always allowed, irrespective of this field.
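
For example, if your configuration defines several tools for the function but you only want to expose one of them for this request, you might send something like the following (the tool name is illustrative):

{
  // ...
  "allowed_tools": ["get_temperature"]
  // ...
}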

credentials

  • Type: object (a map from dynamic credential names to API keys)
  • Required: no (default: no credentials)

Each model provider in your TensorZero configuration can be configured to accept credentials at inference time by using the dynamic location (e.g. dynamic::my_dynamic_api_key_name). See the configuration reference for more details. The gateway expects the credentials to be provided in the credentials field of the request body as specified below. The gateway will return a 400 error if the credentials are not provided and the model provider has been configured with dynamic credentials.

Example
[models.my_model_name.providers.my_provider_name]
# ...
# Note: the name of the credential field (e.g. `api_key_location`) depends on the provider type
api_key_location = "dynamic::my_dynamic_api_key_name"
# ...
{
  // ...
  "credentials": {
    // ...
    "my_dynamic_api_key_name": "sk-..."
    // ...
  }
  // ...
}

dryrun

  • Type: boolean
  • Required: no

If true, the inference request will be executed but won’t be stored to the database. The gateway will still call the downstream model providers.

This field is primarily for debugging and testing, and you should ignore it in production.
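
For example, to exercise a function without persisting the inference, you might include the field alongside the rest of the request body (sketch; other fields omitted):

{
  // ...
  "dryrun": true
}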

episode_id

  • Type: UUID
  • Required: no

The ID of an existing episode to associate the inference with.

For the first inference of a new episode, you should not provide an episode_id. If it is omitted, the gateway will generate a new episode ID and return it in the response.

Only use episode IDs that were returned by the TensorZero gateway.
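
For example, a typical multi-turn flow with the Python client looks like the following sketch: the first inference omits episode_id, and later inferences reuse the ID returned by the gateway (this assumes the client's response object exposes the episode_id field from the response, and that the client accepts episode_id as a keyword argument).

from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    # First inference: no episode_id, so the gateway starts a new episode
    first = await client.inference(
        function_name="draft_email",
        input={"messages": [{"role": "user", "content": "Draft an email to Gabriel..."}]},
    )

    # Later inference in the same episode: reuse the episode ID returned by the gateway
    second = await client.inference(
        function_name="draft_email",
        input={"messages": [{"role": "user", "content": "Make it more formal."}]},
        episode_id=first.episode_id,  # assumed attribute mirroring the response field
    )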

function_name

  • Type: string
  • Required: yes

The name of the function to call.

The function must be defined in the configuration file.

input

  • Type: varies
  • Required: yes

The input to the function.

The type of the input depends on the function type.

input.messages
  • Type: list of messages (see below)
  • Required: no (default: [])

A list of messages to provide to the model.

Each message is an object with the following fields:

  • role: The role of the message (assistant or user).
  • content: The content of the message (see below).

The content field can have one of the following types:

  • string: the text for a text message (only allowed if there is no schema for that role)
  • object: the arguments for a structured text message (only allowed if there is a schema for that role)
  • list of content blocks: the content blocks for the message (see below)

A content block is an object that can have type text, tool_call, or tool_result. We anticipate adding additional content block types in the future.

If the content block has type text, it must have an additional field text. The text should be a string or object depending on whether there is a schema for that role, similar to the content field above. If your message has a single text content block, setting content to a string or object is the short-hand equivalent to using this structure.

If the content block has type tool_call, it must have the following additional fields:

  • arguments: The arguments for the tool call.
  • id: The ID for the content block.
  • name: The name of the tool for the content block.

If the content block has type tool_result, it must have the following additional fields:

  • id: The ID for the content block.
  • name: The name of the tool for the content block.
  • result: The result of the tool call.

This is the most complex field in the entire API. See the example below for more details.

Example
{
  // ...
  "input": {
    "messages": [
      // If you don't have a user (or assistant) schema...
      {
        "role": "user", // (or "assistant")
        "content": "What is the weather in Tokyo?"
      },
      // If you have a user (or assistant) schema...
      {
        "role": "user", // (or "assistant")
        "content": {
          "location": "Tokyo"
          // ...
        }
      },
      // If the model previously called a tool...
      {
        "role": "assistant",
        "content": [
          {
            "type": "tool_call",
            "id": "0",
            "name": "get_temperature",
            "arguments": "{\"location\": \"Tokyo\"}"
          }
        ]
      },
      // ...and you're providing the result of that tool call...
      {
        "role": "user",
        "content": [
          {
            "type": "tool_result",
            "id": "0",
            "name": "get_temperature",
            "result": "70"
          }
        ]
      },
      // You can also specify a text message using a content block...
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What about NYC?" // (or object if there is a schema)
          }
        ]
      },
      // You can also provide multiple content blocks in a single message...
      {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "Sure, I can help you with that." // (or object if there is a schema)
          },
          {
            "type": "tool_call",
            "id": "0",
            "name": "get_temperature",
            "arguments": "{\"location\": \"New York\"}"
          }
        ]
      }
      // ...
    ]
    // ...
  }
  // ...
}
input.system
  • Type: string or object
  • Required: no

The input for the system message.

If the function does not have a system schema, this field should be a string.

If the function has a system schema, this field should be an object that matches the schema.

output_schema

  • Type: object (valid JSON Schema)
  • Required: no

If set, this schema will override the output_schema defined in the function configuration for a JSON function. This schema is used for validating the output of the function, and sent to providers which support structured outputs.
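
For example, to override the configured schema for a single request, you might include something like the following in the request body (the schema below is illustrative):

{
  // ...
  "output_schema": {
    "type": "object",
    "properties": {
      "email": { "type": "string" },
      "domain": { "type": "string" }
    },
    "required": ["email", "domain"]
  }
  // ...
}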

parallel_tool_calls

  • Type: boolean
  • Required: no

If true, the function will be allowed to request multiple tool calls in a single conversation turn. If not set, we default to the configuration value for the function being called.

Most model providers do not support parallel tool calls. In those cases, the gateway ignores this field. At the moment, only Fireworks AI and OpenAI support parallel tool calls.

params

  • Type: object (see below)
  • Required: no (default: {})

Override inference-time parameters for a particular variant type. This field allows for dynamic inference parameters, i.e. defining parameters at runtime.

This field’s format is { variant_type: { param: value, ... }, ... }. You should prefer to set these parameters in the configuration file if possible. Only use this field if you need to set these parameters dynamically at runtime.

Note that the parameters will apply to every variant of the specified type.

Currently, we support the following:

  • chat_completion
    • frequency_penalty
    • max_tokens
    • presence_penalty
    • seed
    • temperature
    • top_p

See Configuration Reference for more details on the parameters, and Examples below for usage.

Example

For example, if you wanted to dynamically override the temperature parameter for chat_completion variants, you’d include the following in the request body:

{
  // ...
  "params": {
    "chat_completion": {
      "temperature": 0.7
    }
  }
  // ...
}

See “Chat Function with Dynamic Inference Parameters” for a complete example.

stream

  • Type: boolean
  • Required: no

If true, the gateway will stream the response from the model provider.
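
For example, a minimal streaming sketch with the Python client, assuming the client returns an async iterator of chunks when stream=True is set:

from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    stream = await client.inference(
        function_name="draft_email",
        input={"messages": [{"role": "user", "content": "Draft an email to Gabriel..."}]},
        stream=True,
    )

    # Each chunk contains an incremental piece of the response
    async for chunk in stream:
        print(chunk)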

tags

  • Type: flat JSON object with string keys and values
  • Required: no

User-provided tags to associate with the inference.

For example, {"user_id": "123"} or {"author": "Alice"}.

tool_choice

  • Type: string or object
  • Required: no

If set, overrides the tool choice strategy for the request.

The supported tool choice strategies are:

  • none: The function should not use any tools.
  • auto: The model decides whether or not to use a tool. If it decides to use a tool, it also decides which tools to use.
  • required: The model should use a tool. If multiple tools are available, the model decides which tool to use.
  • {"specific": "tool_name"}: The model should use a specific tool. The tool must be defined in the tools section of the configuration file or provided in additional_tools.
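
For example, to force the function to call a tool on a given request, you might include the following in the request body (the other strategies listed above are passed the same way):

{
  // ...
  "tool_choice": "required"
  // ...
}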

variant_name

  • Type: string
  • Required: no

If set, pins the inference request to a particular variant (not recommended).

You should generally not set this field, and instead let the TensorZero gateway assign a variant. This field is primarily used for testing or debugging purposes.

Response

The response format depends on the function type (as defined in the configuration file) and whether the response is streamed or not.

Chat Function

When the function type is chat, the response is structured as follows.

In regular (non-streaming) mode, the response is a JSON object with the following fields:

content
  • Type: a list of content blocks (see below)

The content blocks generated by the model.

A content block can have type equal to text or tool_call.

If type is text, the content block has the following fields:

  • text: The text for the content block.

If type is tool_call, the content block has the following fields:

  • arguments (object): The validated arguments for the tool call (null if invalid).
  • id (string): The ID of the content block.
  • name (string): The validated name of the tool (null if invalid).
  • raw_arguments (string): The arguments for the tool call generated by the model (which might be invalid).
  • raw_name (string): The name of the tool generated by the model (which might be invalid).
episode_id
  • Type: UUID

The ID of the episode associated with the inference.

inference_id
  • Type: UUID

The ID assigned to the inference.

variant_name
  • Type: string

The name of the variant used for the inference.

usage
  • Type: object (optional)

The usage metrics for the inference.

The object has the following fields:

  • input_tokens: The number of input tokens used for the inference.
  • output_tokens: The number of output tokens used for the inference.

JSON Function

When the function type is json, the response is structured as follows.

In regular (non-streaming) mode, the response is a JSON object with the following fields:

inference_id
  • Type: UUID

The ID assigned to the inference.

episode_id
  • Type: UUID

The ID of the episode associated with the inference.

output
  • Type: object (see below)

The output object contains the following fields:

  • raw: The raw response from the model provider (which might be invalid JSON).
  • parsed: The parsed response from the model provider (null if invalid JSON).
variant_name
  • Type: string

The name of the variant used for the inference.

usage
  • Type: object (optional)

The usage metrics for the inference.

The object has the following fields:

  • input_tokens: The number of input tokens used for the inference.
  • output_tokens: The number of output tokens used for the inference.

Examples

Chat Function

Configuration
tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
# ...
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="draft_email",
        input={
            "system": "You are an AI assistant...",
            "messages": [
                {
                    "role": "user",
                    "content": "I need to write an email to Gabriel explaining..."
                }
            ]
        },
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "Hi Gabriel,\n\nI noticed..."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

Chat Function with Schemas

Configuration
tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
system_schema = "system_schema.json"
user_schema = "user_schema.json"
# ...
system_schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "tone": {
      "type": "string"
    }
  },
  "required": ["tone"],
  "additionalProperties": false
}
user_schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "recipient": {
      "type": "string"
    },
    "email_purpose": {
      "type": "string"
    }
  },
  "required": ["recipient", "email_purpose"],
  "additionalProperties": false
}
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="draft_email",
        input={
            "system": {"tone": "casual"},
            "messages": [
                {
                    "role": "user",
                    "content": {
                        "recipient": "Gabriel",
                        "email_purpose": "Request a meeting to..."
                    }
                }
            ]
        },
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "Hi Gabriel,\n\nI noticed..."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

Chat Function with Tool Use

Configuration
tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
tools = ["get_temperature"]
# ...
[tools.get_temperature]
description = "Get the current temperature in a given location"
parameters = "get_temperature.json"
# ...
get_temperature.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "location": {
      "type": "string",
      "description": "The location to get the temperature for (e.g. \"New York\")"
    },
    "units": {
      "type": "string",
      "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
      "enum": ["fahrenheit", "celsius"]
    }
  },
  "required": ["location"],
  "additionalProperties": false
}
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="weather_bot",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in Tokyo?"
                }
            ]
        },
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "tool_call",
      "arguments": {
        "location": "Tokyo",
        "units": "celsius"
      },
      "id": "123456789",
      "name": "get_temperature",
      "raw_arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}",
      "raw_name": "get_temperature"
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

Chat Function with Multi-Turn Tool Use

Configuration
tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
tools = ["get_temperature"]
# ...
[tools.get_temperature]
description = "Get the current temperature in a given location"
parameters = "get_temperature.json"
# ...
get_temperature.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "location": {
      "type": "string",
      "description": "The location to get the temperature for (e.g. \"New York\")"
    },
    "units": {
      "type": "string",
      "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
      "enum": ["fahrenheit", "celsius"]
    }
  },
  "required": ["location"],
  "additionalProperties": false
}
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="weather_bot",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in Tokyo?"
                },
                {
                    "role": "assistant",
                    "content": [
                        {
                            "type": "tool_call",
                            "arguments": {
                                "location": "Tokyo",
                                "units": "celsius"
                            },
                            "id": "123456789",
                            "name": "get_temperature",
                        }
                    ]
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "tool_result",
                            "id": "123456789",
                            "name": "get_temperature",
                            "result": "25"  # the tool result must be a string
                        }
                    ]
                }
            ]
        },
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "The weather in Tokyo is 25 degrees Celsius."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

Chat Function with Dynamic Tool Use

Configuration
tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
# Note: no `tools = ["get_temperature"]` field in configuration
# ...
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="weather_bot",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "What is the weather like in Tokyo?"
                }
            ]
        },
        additional_tools=[
            {
                "name": "get_temperature",
                "description": "Get the current temperature in a given location",
                "parameters": {
                    "$schema": "http://json-schema.org/draft-07/schema#",
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The location to get the temperature for (e.g. \"New York\")"
                        },
                        "units": {
                            "type": "string",
                            "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
                            "enum": ["fahrenheit", "celsius"]
                        }
                    },
                    "required": ["location"],
                    "additionalProperties": False
                }
            }
        ],
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "tool_call",
      "arguments": {
        "location": "Tokyo",
        "units": "celsius"
      },
      "id": "123456789",
      "name": "get_temperature",
      "raw_arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}",
      "raw_name": "get_temperature"
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

Chat Function with Dynamic Inference Parameters

Configuration
tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
# ...
[functions.draft_email.variants.prompt_v1]
type = "chat_completion"
temperature = 0.5 # the API request will override this value
# ...
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="draft_email",
        input={
            "system": "You are an AI assistant...",
            "messages": [
                {
                    "role": "user",
                    "content": "I need to write an email to Gabriel explaining..."
                }
            ]
        },
        # Override parameters for every variant with type "chat_completion"
        params={
            "chat_completion": {
                "temperature": 0.7,
            }
        },
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "content": [
    {
      "type": "text",
      "text": "Hi Gabriel,\n\nI noticed..."
    }
  ],
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

JSON Function

Configuration
tensorzero.toml
# ...
[functions.extract_email]
type = "json"
output_schema = "output_schema.json"
# ...
output_schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "email": {
      "type": "string"
    }
  },
  "required": ["email"]
}
Request
POST /inference
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.inference(
        function_name="extract_email",
        input={
            "system": "You are an AI assistant...",
            "messages": [
                {
                    "role": "user",
                    "content": "...blah blah blah [email protected] blah blah blah..."
                }
            ]
        },
        # optional: stream=True,
    )
Response
POST /inference
{
  "inference_id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "variant_name": "prompt_v1",
  "output": {
    "raw": "{\"email\": \"[email protected]\"}",
    "parsed": {
      "email": "[email protected]"
    }
  },
  "usage": {
    "input_tokens": 100,
    "output_tokens": 100
  }
}

POST /feedback

The /feedback endpoint assigns feedback to a particular inference or episode.

Each feedback is associated with a metric that is defined in the configuration file.

Request

dryrun

  • Type: boolean
  • Required: no

If true, the feedback request will be executed but won’t be stored to the database (i.e. no-op).

This field is primarily for debugging and testing, and you should ignore it in production.

episode_id

  • Type: UUID
  • Required: when the metric level is episode

The episode ID to provide feedback for.

You should use this field when the metric level is episode.

Only use episode IDs that were returned by the TensorZero gateway.

inference_id

  • Type: UUID
  • Required: when the metric level is inference

The inference ID to provide feedback for.

You should use this field when the metric level is inference.

Only use inference IDs that were returned by the TensorZero gateway.

metric_name

  • Type: string
  • Required: yes

The name of the metric to provide feedback for.

For example, if your metric is defined as [metrics.draft_accepted] in your configuration file, then you would set metric_name: "draft_accepted".

The metric names comment and demonstration are reserved for special types of feedback. A comment is free-form text (string) that can be assigned to either an inference or an episode. The demonstration metric is being finalized and is not yet available.
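
For example, to attach a free-form comment to an inference with the Python client, you might send feedback with metric_name set to comment (a sketch based on the feedback examples below):

from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    # Comments are free-form strings attached to an inference (or an episode via episode_id)
    await client.feedback(
        inference_id="00000000-0000-0000-0000-000000000000",
        metric_name="comment",
        value="The model ignored the requested tone.",
    )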

tags

  • Type: flat JSON object with string keys and values
  • Required: no

User-provided tags to associate with the feedback.

For example, {"user_id": "123"} or {"author": "Alice"}.

value

  • Type: varies
  • Required: yes

The value of the feedback.

The type of the value depends on the metric type (e.g. boolean for a metric with type = "boolean").

Response

feedback_id

  • Type: UUID

The ID assigned to the feedback.

Examples

Inference-Level Boolean Metric

Configuration
tensorzero.toml
# ...
[metrics.draft_accepted]
type = "boolean"
level = "inference"
# ...
Request
POST /feedback
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.feedback(
        inference_id="00000000-0000-0000-0000-000000000000",
        metric_name="draft_accepted",
        value=True,
    )
Response
POST /feedback
{ "feedback_id": "11111111-1111-1111-1111-111111111111" }

Episode-Level Float Metric

Configuration
tensorzero.toml
# ...
[metrics.user_rating]
type = "float"
level = "episode"
# ...
Request
POST /feedback
from tensorzero import AsyncTensorZeroGateway

async with AsyncTensorZeroGateway("http://localhost:3000") as client:
    result = await client.feedback(
        episode_id="00000000-0000-0000-0000-000000000000",
        metric_name="user_rating",
        value=10,
    )
Response
POST /feedback
{ "feedback_id": "11111111-1111-1111-1111-111111111111" }

POST /openai/v1/chat/completions

The /openai/v1/chat/completions endpoint allows TensorZero users to make TensorZero inferences with the OpenAI client. The gateway translates the OpenAI request parameters into the arguments expected by the inference endpoint and calls the same underlying implementation. This endpoint supports most of the features of the inference endpoint, but there are some limitations. Most notably, this endpoint doesn’t support dynamic credentials, so they must be specified with a different method.

Request

This endpoint leverages both the request body (as JSON) and the request headers to pass information to the inference endpoint. You should assume each field is in the body unless it is explicitly noted as a header.

dryrun

This field should be provided as a request header.

  • Type: boolean
  • Required: no

If true, the inference request will be executed but won’t be stored to the database. The gateway will still call the downstream model providers.

This field is primarily for debugging and testing, and you should ignore it in production.

episode_id

This field should be provided as a request header.

  • Type: UUID
  • Required: no

The ID of an existing episode to associate the inference with.

For the first inference of a new episode, you should not provide an episode_id. If it is omitted, the gateway will generate a new episode ID and return it in the response.

Only use episode IDs that were returned by the TensorZero gateway.

frequency_penalty

  • Type: float
  • Required: no (default: null)

If positive, penalizes new tokens based on their frequency in the text so far; if negative, encourages them. Overrides the frequency_penalty setting for any chat completion variants being used.

max_completion_tokens

  • Type: integer
  • Required: no (default: null)

Limits the number of tokens that can be generated by the model in a chat completion variant. If both this and max_tokens are set, the smaller value is used.

max_tokens

  • Type: integer
  • Required: no (default: null)

Limits the number of tokens that can be generated by the model in a chat completion variant. If both this and max_completion_tokens are set, the smaller value is used.

messages

  • Type: list
  • Required: yes

A list of messages to provide to the model.

Each message is an object with the following fields:

  • role (required): The role of the message sender in an OpenAI message (assistant, system, tool, or user).
  • content (required for user and system messages, optional for assistant and tool messages): The content of the message. Depending on the TensorZero function being called, the content must be either a string or a single-element array wrapping a JSON object that complies with the appropriate schema for the function and message role. The array is required so that the OpenAI Python client can pass structured data to the gateway.
  • tool_calls (optional for assistant messages, otherwise disallowed): A list of tool calls. Each tool call is an object with the following fields:
    • id: A unique identifier for the tool call
    • type: The type of tool being called (currently only "function" is supported)
    • function: An object containing:
      • name: The name of the function to call
      • arguments: A JSON string containing the function arguments
  • tool_call_id (required for tool messages, otherwise disallowed): The ID of the tool call to associate with the message. This should be an ID that was originally returned by the gateway in a tool call's id field.
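
For example, a request body for a multi-turn conversation that includes an assistant tool call and the corresponding tool result might be structured like this sketch (the tool name and ID are illustrative):

{
  "model": "tensorzero::weather_bot",
  "messages": [
    { "role": "user", "content": "What is the weather like in Tokyo?" },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "123456789",
          "type": "function",
          "function": {
            "name": "get_temperature",
            "arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "123456789",
      "content": "25"
    }
  ]
}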

model

  • Type: string
  • Required: yes

The name of the TensorZero function being called, prefixed with "tensorzero::". An error will be returned if the function name is not recognized or is missing the prefix.

parallel_tool_calls

  • Type: boolean
  • Required: no (default: null)

Overrides the parallel_tool_calls setting for the function being called.

presence_penalty

  • Type: float
  • Required: no (default: null)

If positive, penalizes new tokens based on whether they already appear in the text so far; if negative, encourages them. Overrides the presence_penalty setting for any chat completion variants being used.

response_format

  • Type: either a string or an object
  • Required: no (default: null)

The options are "text", "json_object", and {"type": "json_schema", "schema": ...}, where the schema field contains a valid JSON Schema. The gateway ignores this field except for the json_schema variant, in which case the schema field dynamically overrides the output_schema for a JSON function.

seed

  • Type: integer
  • Required: no (default: null)

Overrides the seed setting for any chat completion variants being used.

stream

  • Type: boolean
  • Required: no (default: false)

If true, the gateway will stream the response to the client in an OpenAI-compatible format.

temperature

  • Type: float
  • Required: no (default: null)

Overrides the temperature setting for any chat completion variants being used.

tools

  • Type: list of tool objects (see below)
  • Required: no (default: null)

Allows the user to dynamically specify tools at inference time in addition to those that are specified in the configuration.

Each tool object has the following structure:

  • type: Must be "function"
  • function: An object containing:
    • name: The name of the function (string, required)
    • description: A description of what the function does (string, optional)
    • parameters: A JSON Schema object describing the function’s parameters (required)
    • strict: Whether to enforce strict schema validation (boolean, defaults to false)

tool_choice

  • Type: string or object
  • Required: no (default: "none" if no tools are present, "auto" if tools are present)

Controls which (if any) tool is called by the model by overriding the value in configuration. Supported values:

  • "none": The model will not call any tool and instead generates a message
  • "auto": The model can pick between generating a message or calling one or more tools
  • "required": The model must call one or more tools
  • {"type": "function", "function": {"name": "my_function"}}: Forces the model to call the specified tool

top_p

  • Type: float
  • Required: no (default: null)

Overrides the top_p setting for any chat completion variants being used.

variant_name

This field should be provided as a request header.

  • Type: string
  • Required: no

If set, pins the inference request to a particular variant (not recommended).

You should generally not set this field, and instead let the TensorZero gateway assign a variant. This field is primarily used for testing or debugging purposes.
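
Like episode_id, this value is passed as a request header. With the OpenAI Python client you might pass it via extra_headers, as in the following sketch (the header name is assumed to mirror the field name, following the episode_id example below):

from openai import AsyncOpenAI

async with AsyncOpenAI(base_url="http://localhost:3000/openai/v1") as client:
    result = await client.chat.completions.create(
        extra_headers={"variant_name": "prompt_v1"},  # assumed header name
        model="tensorzero::draft_email",
        messages=[{"role": "user", "content": "Draft an email to Gabriel..."}],
    )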

Response

In regular (non-streaming) mode, the response is a JSON object with the following fields:

choices

  • Type: list of choice objects, where each choice contains:
    • index: A zero-based index indicating the choice’s position in the list (integer)
    • finish_reason: Always "stop".
    • message: An object containing:
      • content: The message content (string, optional)
      • tool_calls: List of tool calls made by the model (optional). The format is the same as in the request.
      • role: The role of the message sender (always "assistant").

created

  • Type: integer

The Unix timestamp (in seconds) of when the inference was created.

episode_id

  • Type: UUID

The ID of the episode that the inference was created for.

id

  • Type: UUID

The inference ID.

model

  • Type: string

The name of the variant that was actually used for the inference.

object

  • Type: string

The type of the inference object (always "chat.completion").

system_fingerprint

  • Type: string

Always ""

usage

  • Type: object

Contains token usage information for the request and response, with the following fields:

  • prompt_tokens: Number of tokens in the prompt (integer)
  • completion_tokens: Number of tokens in the completion (integer)
  • total_tokens: Total number of tokens used (integer)

Examples

Chat Function with Structured System Prompt

Configuration
tensorzero.toml
# ...
[functions.draft_email]
type = "chat"
system_schema = "functions/draft_email/system_schema.json"
# ...
functions/draft_email/system_schema.json
{
  "type": "object",
  "properties": {
    "assistant_name": { "type": "string" }
  }
}
Request
POST /inference
from openai import AsyncOpenAI

async with AsyncOpenAI(
    base_url="http://localhost:3000/openai/v1"
) as client:
    result = await client.chat.completions.create(
        # there already was an episode_id from an earlier inference
        extra_headers={"episode_id": str(episode_id)},
        messages=[
            {
                "role": "system",
                "content": [{"assistant_name": "Alfred Pennyworth"}]
                # NOTE: the JSON is in an array here so that a structured system message can be sent
            },
            {
                "role": "user",
                "content": "I need to write an email to Gabriel explaining..."
            }
        ],
        model="tensorzero::draft_email",
        temperature=0.4,
        # Optional: stream=True
    )
Response
POST /inference
{
  "id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "model": "email_draft_variant",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "content": "Hi Gabriel,\n\nI noticed...",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 100,
    "total_tokens": 200
  }
}

Chat Function with Dynamic Tool Use

Configuration
tensorzero.toml
# ...
[functions.weather_bot]
type = "chat"
# Note: no `tools = ["get_temperature"]` field in configuration
# ...
Request
POST /inference
from openai import AsyncOpenAI

async with AsyncOpenAI(
    base_url="http://localhost:3000/openai/v1"
) as client:
    result = await client.chat.completions.create(
        model="tensorzero::weather_bot",
        messages=[
            {
                "role": "user",
                "content": "What is the weather like in Tokyo?"
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "get_temperature",
                    "description": "Get the current temperature in a given location",
                    "parameters": {
                        "$schema": "http://json-schema.org/draft-07/schema#",
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": "The location to get the temperature for (e.g. \"New York\")"
                            },
                            "units": {
                                "type": "string",
                                "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
                                "enum": ["fahrenheit", "celsius"]
                            }
                        },
                        "required": ["location"],
                        "additionalProperties": False
                    }
                }
            }
        ],
        # optional: stream=True,
    )
Response
POST /inference
{
  "id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "model": "weather_bot_variant",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "content": null,
        "tool_calls": [
          {
            "id": "123456789",
            "type": "function",
            "function": {
              "name": "get_temperature",
              "arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}"
            }
          }
        ],
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 100,
    "total_tokens": 200
  }
}

JSON Function with Dynamic Output Schema

Configuration
tensorzero.toml
# ...
[functions.extract_email]
type = "json"
output_schema = "output_schema.json"
# ...
output_schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "email": {
      "type": "string"
    }
  },
  "required": ["email"]
}
Request
POST /inference
from openai import AsyncOpenAI

dynamic_output_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "email": { "type": "string" },
        "domain": { "type": "string" }
    },
    "required": ["email", "domain"]
}

async with AsyncOpenAI(
    base_url="http://localhost:3000/openai/v1"
) as client:
    result = await client.chat.completions.create(
        model="tensorzero::extract_email",
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant..."
            },
            {
                "role": "user",
                "content": "...blah blah blah [email protected] blah blah blah..."
            }
        ],
        # Override the output schema using the `response_format` field
        response_format={"type": "json_schema", "schema": dynamic_output_schema},
        # optional: stream=True,
    )
Response
POST /inference
{
  "id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "model": "extract_email_variant",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "content": "{\"email\": \"[email protected]\", \"domain\": \"tensorzero.com\"}",
        "role": "assistant"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 100,
    "total_tokens": 200
  }
}

Auxiliary Endpoints

GET /metrics

The TensorZero Gateway exposes a Prometheus-compatible /metrics endpoint for monitoring.

At the moment, the only available metric is request_count, which counts the number of successful requests to the gateway. The metric reports counts for both inference and feedback requests.

Example Response

GET /metrics
# ...
request_count{endpoint="inference",function_name="draft_email"} 10
request_count{endpoint="feedback",metric_name="draft_accepted"} 10
# ...

GET /status

The /status endpoint is a simple liveness probe. It returns HTTP status code 200 if the gateway is running.

Example Response

GET /status
{ "status": "ok" }

GET /health

The /health endpoint is a simple readiness probe that checks if the gateway can communicate with the database. It returns HTTP status code 200 if the gateway is ready to serve requests.

Example Response

GET /health
{ "gateway": "ok", "clickhouse": "ok" }