Batch Inference
The batch inference endpoint provides access to batch inference APIs offered by some model providers. These APIs offer large cost savings compared to real-time inference, at the expense of much higher latency (sometimes up to a day).
The batch inference workflow consists of two steps: submitting your batch request, then polling for the batch job status until completion.
See Integrations for model provider integrations that support batch inference.
Example
Imagine you have a simple TensorZero function that generates haikus using GPT-4o Mini.
You can submit a batch inference job to generate multiple haikus with a single request.
Each entry in `inputs` is equal to the `input` field in a regular inference request.
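For example, here is a minimal sketch of a batch submission over HTTP using Python's `requests` library. It assumes the gateway is running at `http://localhost:3000`, that the submission endpoint is `POST /batch_inference`, and that the function is named `generate_haiku` and accepts free-form user messages; adjust the payload to match your own function configuration.

```python
import requests

# Submit a batch inference job to the TensorZero gateway.
# Assumptions (adjust for your setup): the gateway listens on localhost:3000,
# and the function is named `generate_haiku`.
response = requests.post(
    "http://localhost:3000/batch_inference",
    json={
        "function_name": "generate_haiku",
        # Each entry in `inputs` has the same format as the `input` field
        # of a regular inference request.
        "inputs": [
            {"messages": [{"role": "user", "content": "Write a haiku about artificial intelligence."}]},
            {"messages": [{"role": "user", "content": "Write a haiku about the ocean."}]},
            {"messages": [{"role": "user", "content": "Write a haiku about autumn leaves."}]},
        ],
    },
)
response.raise_for_status()

batch = response.json()
print(batch["batch_id"])       # use this ID to poll for results
print(batch["inference_ids"])  # one ID per input
print(batch["episode_ids"])    # one ID per input
```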
The response contains a `batch_id` as well as `inference_ids` and `episode_ids` for each inference in the batch.
You can use this `batch_id` to poll for the status of the job or retrieve the results using the `GET /batch_inference/{batch_id}` endpoint.
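For example, a single status check might look like the following sketch, reusing the gateway address from above and the `batch_id` returned by the submission request.

```python
import requests

# Check the status of the batch job once.
# `batch_id` is the value returned by the submission request above.
status_response = requests.get(f"http://localhost:3000/batch_inference/{batch_id}")
status_response.raise_for_status()
print(status_response.json()["status"])
```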
While the job is pending, the response will only contain the `status` field.
Once the job is completed, the response will contain the `status` field and the `inferences` field.
Each inference object is the same as the response from a regular inference request.
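For example, you could poll until completion with a loop like the following sketch. It assumes the job reports `status` values of `"pending"` and `"completed"` as described above, and that a long polling interval is acceptable given the high latency of batch APIs.

```python
import time

import requests

# Poll the batch job until it completes, then inspect the results.
while True:
    response = requests.get(f"http://localhost:3000/batch_inference/{batch_id}")
    response.raise_for_status()
    body = response.json()
    if body["status"] == "completed":
        break
    time.sleep(60)  # batch jobs can take hours, so poll sparingly

# Each element of `inferences` has the same shape as a regular inference response.
for inference in body["inferences"]:
    print(inference)
```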
Technical Notes
- Observability
  - For now, pending batch inference jobs are not shown in the TensorZero UI. You can find the relevant information in the `BatchRequest` and `BatchModelInference` tables on ClickHouse. See Data Model for more information.
  - Inferences from completed batch inference jobs are shown in the UI alongside regular inferences.
- Experimentation
  - The gateway samples the same variant for the entire batch.
- Python Client
  - The TensorZero Python client doesn’t natively support batch inference yet. You’ll need to submit batch requests using HTTP requests, as shown above.