API Reference: Inference (OpenAI-Compatible)
POST /openai/v1/chat/completions
The `/openai/v1/chat/completions` endpoint allows you to make TensorZero inferences with the OpenAI client.
The gateway translates the OpenAI request parameters into the arguments expected by the inference endpoint and calls the same underlying implementation.
This endpoint supports most of the features of the inference endpoint, but there are some limitations.
Most notably, this endpoint doesn’t support dynamic credentials, so they must be specified with a different method.
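For instance, here's a minimal sketch of calling this endpoint with the OpenAI Python client, assuming the gateway is running locally on port 3000 and your configuration defines a chat function named `my_function` (both assumptions for illustration):

```python
from openai import OpenAI

# Point the standard OpenAI client at the TensorZero gateway.
# The gateway URL and the function name `my_function` are assumptions for illustration.
client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="not-used",  # assumed to be ignored by the gateway, but the client requires a value
)

result = client.chat.completions.create(
    model="tensorzero::function_name::my_function",
    messages=[{"role": "user", "content": "Hello, world!"}],
)

print(result.choices[0].message.content)
```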
Request
This endpoint leverages both the request body (as JSON) and the request headers to pass information to the inference endpoint.
You should assume each field is in the body unless it is explicitly noted as a header.
dryrun
This field should be provided as a request header.
- Type: boolean
- Required: no
If `true`, the inference request will be executed but won’t be stored to the database.
The gateway will still call the downstream model providers.
This field is primarily for debugging and testing, and you should ignore it in production.
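Continuing the client sketch above, here's a hedged example of setting this header with the OpenAI Python client (HTTP header values are strings, so the boolean is encoded as `"true"`; this encoding is an assumption):

```python
# Sketch: run an inference without storing it, e.g. during testing.
result = client.chat.completions.create(
    model="tensorzero::function_name::my_function",
    messages=[{"role": "user", "content": "Hello, world!"}],
    extra_headers={"dryrun": "true"},
)
```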
episode_id
This field should be provided as a request header.
- Type: UUID
- Required: no
The ID of an existing episode to associate the inference with.
For the first inference of a new episode, you should not provide an `episode_id`.
If omitted, the gateway will generate a new episode ID and return it in the response.
Only use episode IDs that were returned by the TensorZero gateway.
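For example, here's a sketch of carrying an episode across two inferences; reading `episode_id` off the response object assumes the OpenAI Python client retains this non-standard field:

```python
# First inference: no episode_id header, so the gateway starts a new episode.
first = client.chat.completions.create(
    model="tensorzero::function_name::my_function",
    messages=[{"role": "user", "content": "First turn"}],
)

# Reuse the episode ID that the gateway returned for subsequent inferences.
second = client.chat.completions.create(
    model="tensorzero::function_name::my_function",
    messages=[{"role": "user", "content": "Second turn"}],
    extra_headers={"episode_id": str(first.episode_id)},
)
```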
frequency_penalty
- Type: float
- Required: no (default: `null`)

If positive, penalizes new tokens based on their frequency in the text so far; if negative, encourages them.
Overrides the `frequency_penalty` setting for any chat completion variants being used.
max_completion_tokens
- Type: integer
- Required: no (default: `null`)

Limits the number of tokens that can be generated by the model in a chat completion variant.
If both this and `max_tokens` are set, the smaller value is used.
max_tokens
- Type: integer
- Required: no (default: `null`)

Limits the number of tokens that can be generated by the model in a chat completion variant.
If both this and `max_completion_tokens` are set, the smaller value is used.
messages
- Type: list
- Required: yes
A list of messages to provide to the model.
Each message is an object with the following fields:
- `role` (required): The role of the message sender in an OpenAI message (`assistant`, `system`, `tool`, or `user`).
- `content` (required for `user` and `system` messages, optional for `assistant` and `tool` messages): The content of the message. Depending on the TensorZero function being called, the content must be either a string or an array of length 1 that wraps a JSON object complying with the appropriate schema for the function and message type. The array is required in order for the OpenAI Python client to pass structured data to the gateway.
- `tool_calls` (optional for `assistant` messages, otherwise disallowed): A list of tool calls. Each tool call is an object with the following fields:
  - `id`: A unique identifier for the tool call
  - `type`: The type of tool being called (currently only `"function"` is supported)
  - `function`: An object containing:
    - `name`: The name of the function to call
    - `arguments`: A JSON string containing the function arguments
- `tool_call_id` (required for `tool` messages, otherwise disallowed): The ID of the tool call to associate with the message. This should be an ID that was originally returned by the gateway in a tool call `id` field.
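To make the relationship between `tool_calls` and `tool_call_id` concrete, here's a sketch of a message list that feeds a tool result back to the gateway (the tool name, call ID, and result value are hypothetical):

```python
messages = [
    {"role": "user", "content": "What is the weather like in Tokyo?"},
    # An assistant message echoing a tool call previously returned by the gateway.
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_123",  # hypothetical ID originally returned by the gateway
                "type": "function",
                "function": {
                    "name": "get_temperature",
                    "arguments": '{"location": "Tokyo"}',
                },
            }
        ],
    },
    # The tool result, linked to the call above by `tool_call_id`.
    {"role": "tool", "tool_call_id": "call_123", "content": "25"},
]
```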
model
- Type: string
- Required: yes

The name of the TensorZero function being called, prefixed with `tensorzero::function_name::` (e.g. `tensorzero::function_name::draft_email`).
An error will be returned if the function name is not recognized or is missing the prefix.
parallel_tool_calls
- Type: boolean
- Required: no (default: `null`)

Overrides the `parallel_tool_calls` setting for the function being called.
presence_penalty
- Type: float
- Required: no (default: `null`)

If positive, penalizes new tokens based on whether they appear in the text so far; if negative, encourages them.
Overrides the `presence_penalty` setting for any chat completion variants being used.
response_format
- Type: string or object
- Required: no (default: `null`)

Supported values are `"text"`, `"json_object"`, and `{"type": "json_schema", "schema": ...}`, where the `schema` field contains a valid JSON schema.
Only the `json_schema` variant has any effect: its `schema` field can be used to dynamically set the output schema for a `json` function.
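For example, here's a sketch of overriding a `json` function's output schema at inference time (the function name and schema are illustrative; see the JSON Function with Dynamic Output Schema example below for a complete version):

```python
result = client.chat.completions.create(
    model="tensorzero::function_name::extract_email",
    messages=[{"role": "user", "content": "..."}],
    # TensorZero expects the schema directly under `response_format`.
    response_format={
        "type": "json_schema",
        "schema": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
)
```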
seed
- Type: integer
- Required: no (default: `null`)

Overrides the `seed` setting for any chat completion variants being used.
stream
- Type: boolean
- Required: no (default: `false`)

If `true`, the gateway will stream the response to the client in an OpenAI-compatible format.
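Continuing the client sketch above, streamed responses can be consumed like any other OpenAI-compatible stream:

```python
stream = client.chat.completions.create(
    model="tensorzero::function_name::my_function",
    messages=[{"role": "user", "content": "Hello, world!"}],
    stream=True,
)

# Each chunk carries the next piece of content in `choices[0].delta`.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```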
temperature
- Type: float
- Required: no (default: `null`)

Overrides the `temperature` setting for any chat completion variants being used.
tools
- Type: list of `tool` objects (see below)
- Required: no (default: `null`)

Allows the user to dynamically specify tools at inference time in addition to those specified in the configuration (see the Chat Function with Dynamic Tool Use example below).

Each `tool` object has the following structure:

- `type`: Must be `"function"`
- `function`: An object containing:
  - `name`: The name of the function (string, required)
  - `description`: A description of what the function does (string, optional)
  - `parameters`: A JSON Schema object describing the function’s parameters (required)
  - `strict`: Whether to enforce strict schema validation (boolean, defaults to `false`)
tool_choice
- Type: string or object
- Required: no (default: `"none"` if no tools are present, `"auto"` if tools are present)

Controls which (if any) tool is called by the model by overriding the value in configuration. Supported values:

- `"none"`: The model will not call any tool and instead generates a message
- `"auto"`: The model can pick between generating a message or calling one or more tools
- `"required"`: The model must call one or more tools
- `{"type": "function", "function": {"name": "my_function"}}`: Forces the model to call the specified tool
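For example, a sketch of forcing a specific tool call (assuming a `get_temperature` tool is available to the function, whether from configuration or the `tools` field):

```python
result = client.chat.completions.create(
    model="tensorzero::function_name::weather_bot",
    messages=[{"role": "user", "content": "What is the weather like in Tokyo?"}],
    # Force the model to call the (assumed) `get_temperature` tool.
    tool_choice={"type": "function", "function": {"name": "get_temperature"}},
)
```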
top_p
- Type: float
- Required: no (default: `null`)

Overrides the `top_p` setting for any chat completion variants being used.
variant_name
This field should be provided as a request header.
- Type: string
- Required: no
If set, pins the inference request to a particular variant (not recommended).
You should generally not set this field, and instead let the TensorZero gateway assign a variant. This field is primarily used for testing or debugging purposes.
Response
In regular (non-streaming) mode, the response is a JSON object with the following fields:
choices
- Type: list of `choice` objects

Each choice contains:

- `index`: A zero-based index indicating the choice’s position in the list (integer)
- `finish_reason`: Always `"stop"`
- `message`: An object containing:
  - `content`: The message content (string, optional)
  - `tool_calls`: List of tool calls made by the model (optional). The format is the same as in the request.
  - `role`: The role of the message sender (always `"assistant"`)
created
- Type: integer
The Unix timestamp (in seconds) of when the inference was created.
episode_id
- Type: UUID
The ID of the episode that the inference was created for.
id
- Type: UUID
The inference ID.
model
- Type: string
The name of the variant that was actually used for the inference.
object
- Type: string
The type of the inference object (always `"chat.completion"`).
system_fingerprint
- Type: string
Always ""
usage
- Type: object
Contains token usage information for the request and response, with the following fields:

- `prompt_tokens`: Number of tokens in the prompt (integer)
- `completion_tokens`: Number of tokens in the completion (integer)
- `total_tokens`: Total number of tokens used (integer)
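Continuing the client sketch above, a brief example of reading these fields from the response object:

```python
print(result.choices[0].message.content)  # the generated message
print(result.usage.total_tokens)          # total tokens used
print(result.model)                       # the variant that was actually used
```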
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
Each JSON message has the following fields:
choices
- Type: list
A list of choices from the model, where each choice contains:
- `index`: The index of the choice (integer)
- `finish_reason`: Always `""`
- `delta`: An object containing either:
  - `content`: The next piece of generated text (string), or
  - `tool_calls`: A list of tool calls, each containing the next piece of the tool call being generated
created
- Type: integer
The Unix timestamp (in seconds) of when the inference was created.
episode_id
- Type: UUID
The ID of the episode that the inference was created for.
id
- Type: UUID
The inference ID.
model
- Type: string
The name of the variant that was actually used for the inference.
object
- Type: string
The type of the inference object (always `"chat.completion"`).
system_fingerprint
- Type: string
Always ""
usage
- Type: object
- Required: no
Contains token usage information for the request and response, with the following fields:

- `prompt_tokens`: Number of tokens in the prompt (integer)
- `completion_tokens`: Number of tokens in the completion (integer)
- `total_tokens`: Total number of tokens used (integer)
Examples
Chat Function with Structured System Prompt
Configuration
```toml
# ...

[functions.draft_email]
type = "chat"
system_schema = "functions/draft_email/system_schema.json"

# ...
```
{ "type": "object", "properties": { "assistant_name": { "type": "string" } }}
Request
```python
from openai import AsyncOpenAI

async with AsyncOpenAI(base_url="http://localhost:3000/openai/v1") as client:
    result = await client.chat.completions.create(
        # there already was an episode_id from an earlier inference
        extra_headers={"episode_id": str(episode_id)},
        messages=[
            {
                "role": "system",
                # NOTE: the JSON is in an array here so that a structured system message can be sent
                "content": [{"assistant_name": "Alfred Pennyworth"}],
            },
            {
                "role": "user",
                "content": "I need to write an email to Gabriel explaining...",
            },
        ],
        model="tensorzero::function_name::draft_email",
        temperature=0.4,
        # Optional: stream=True,
    )
```
```bash
# Optional: add "stream": true to the request body to stream the response.
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "episode_id: your_episode_id_here" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": [{"assistant_name": "Alfred Pennyworth"}]
      },
      {
        "role": "user",
        "content": "I need to write an email to Gabriel explaining..."
      }
    ],
    "model": "tensorzero::function_name::draft_email",
    "temperature": 0.4
  }'
```
Response
{ "id": "00000000-0000-0000-0000-000000000000", "episode_id": "11111111-1111-1111-1111-111111111111", "model": "email_draft_variant", "choices": [ { "index": 0, "finish_reason": "stop", "message": { "content": "Hi Gabriel,\n\nI noticed...", "role": "assistant" } } ], "usage": { "prompt_tokens": 100, "completion_tokens": 100, "total_tokens": 200 }}
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
For example, a chunk might look like:

```json
{
  "id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "model": "email_draft_variant",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "delta": {
        "content": "Hi Gabriel,\n\nI noticed..."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 100,
    "total_tokens": 200
  }
}
```
Chat Function with Dynamic Tool Use
Configuration
```toml
# ...

[functions.weather_bot]
type = "chat"
# Note: no `tools = ["get_temperature"]` field in configuration

# ...
```
Request
```python
from openai import AsyncOpenAI

async with AsyncOpenAI(base_url="http://localhost:3000/openai/v1") as client:
    result = await client.chat.completions.create(
        model="tensorzero::function_name::weather_bot",
        messages=[
            {
                "role": "user",
                "content": "What is the weather like in Tokyo?",
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "get_temperature",
                    "description": "Get the current temperature in a given location",
                    "parameters": {
                        "$schema": "http://json-schema.org/draft-07/schema#",
                        "type": "object",
                        "properties": {
                            "location": {
                                "type": "string",
                                "description": 'The location to get the temperature for (e.g. "New York")',
                            },
                            "units": {
                                "type": "string",
                                "description": 'The units to get the temperature in (must be "fahrenheit" or "celsius")',
                                "enum": ["fahrenheit", "celsius"],
                            },
                        },
                        "required": ["location"],
                        "additionalProperties": False,
                    },
                },
            }
        ],
        # optional: stream=True,
    )
```
```bash
# Optional: add "stream": true to the request body to stream the response.
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tensorzero::function_name::weather_bot",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather like in Tokyo?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_temperature",
          "description": "Get the current temperature in a given location",
          "parameters": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "The location to get the temperature for (e.g. \"New York\")"
              },
              "units": {
                "type": "string",
                "description": "The units to get the temperature in (must be \"fahrenheit\" or \"celsius\")",
                "enum": ["fahrenheit", "celsius"]
              }
            },
            "required": ["location"],
            "additionalProperties": false
          }
        }
      }
    ]
  }'
```
Response
{ "id": "00000000-0000-0000-0000-000000000000", "episode_id": "11111111-1111-1111-1111-111111111111", "model": "weather_bot_variant", "choices": [ { "index": 0, "finish_reason": "stop", "message": { "content": null, "tool_calls": [ { "id": "123456789", "type": "function", "function": { "name": "get_temperature", "arguments": "{\"location\": \"Tokyo\", \"units\": \"celsius\"}" } } ], "role": "assistant" } } ], "usage": { "prompt_tokens": 100, "completion_tokens": 100, "total_tokens": 200 }}
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
For example, a chunk might look like the following, where `arguments` carries a partial tool arguments delta:

```json
{
  "id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "model": "weather_bot_variant",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "delta": {
        "content": null,
        "tool_calls": [
          {
            "id": "123456789",
            "type": "function",
            "function": {
              "name": "get_temperature",
              "arguments": "{\"location\":"
            }
          }
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 100,
    "total_tokens": 200
  }
}
```
JSON Function with Dynamic Output Schema
Configuration
```toml
# ...

[functions.extract_email]
type = "json"
output_schema = "output_schema.json"

# ...
```
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "email": { "type": "string" } }, "required": ["email"]}
Request
```python
from openai import AsyncOpenAI

dynamic_output_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "email": {"type": "string"},
        "domain": {"type": "string"},
    },
    "required": ["email", "domain"],
}

async with AsyncOpenAI(base_url="http://localhost:3000/openai/v1") as client:
    result = await client.chat.completions.create(
        model="tensorzero::function_name::extract_email",
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant...",
            },
            {
                "role": "user",
                "content": "...blah blah blah [email protected] blah blah blah...",
            },
        ],
        # Override the output schema using the `response_format` field
        response_format={"type": "json_schema", "schema": dynamic_output_schema},
        # optional: stream=True,
    )
```
```bash
# Optional: add "stream": true to the request body to stream the response.
curl -X POST http://localhost:3000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tensorzero::function_name::extract_email",
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant..."
      },
      {
        "role": "user",
        "content": "...blah blah blah [email protected] blah blah blah..."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "schema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "email": { "type": "string" },
          "domain": { "type": "string" }
        },
        "required": ["email", "domain"]
      }
    }
  }'
```
Response
{ "id": "00000000-0000-0000-0000-000000000000", "episode_id": "11111111-1111-1111-1111-111111111111", "model": "extract_email_variant", "choices": [ { "index": 0, "finish_reason": "stop", "message": { } } ], "usage": { "prompt_tokens": 100, "completion_tokens": 100, "total_tokens": 200 }}
In streaming mode, the response is an SSE stream of JSON messages, followed by a final `[DONE]` message.
For example, a chunk might look like the following, where `content` carries a partial JSON delta:

```json
{
  "id": "00000000-0000-0000-0000-000000000000",
  "episode_id": "11111111-1111-1111-1111-111111111111",
  "model": "extract_email_variant",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "delta": {
        "content": "{\"email\":"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 100,
    "total_tokens": 200
  }
}
```