Performance & Latency

The TensorZero Gateway is designed from the ground up with performance in mind. It adds less than 1 ms of P99 latency overhead under extreme load (see Benchmarks).

Even with default settings, the gateway is fast and lightweight enough that its overhead goes unnoticed in most applications.

Optimizations

If your workload demands extreme concurrency and low latency, we recommend the following settings and workflows.

TensorZero Gateway

Enable gateway.observability.async_writes to hand the writing of inference data to ClickHouse over to a background task, so the gateway returns the inference response without waiting for the ClickHouse write to complete.
Ensure your application, the TensorZero Gateway, and ClickHouse are deployed in the same region to minimize network latency.
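To illustrate why offloading writes to a background task reduces latency, here is a conceptual Python sketch. It is not the gateway's implementation: the function names, the in-memory list standing in for ClickHouse, and the 50 ms simulated write delay are all assumptions made for illustration.

```python
import asyncio


async def write_record(store: list, record: str) -> None:
    """Simulate a slow write to an observability database (assumed 50 ms)."""
    await asyncio.sleep(0.05)
    store.append(record)


async def handle_blocking(store: list, record: str) -> str:
    """Wait for the write to finish before returning the response."""
    await write_record(store, record)
    return record


async def handle_background(store: list, record: str, tasks: set) -> str:
    """Schedule the write as a background task and return immediately."""
    task = asyncio.create_task(write_record(store, record))
    tasks.add(task)  # keep a reference so the task is not garbage-collected
    task.add_done_callback(tasks.discard)
    return record


async def demo() -> dict:
    store: list = []
    tasks: set = set()
    loop = asyncio.get_running_loop()

    start = loop.time()
    await handle_blocking(store, "a")
    blocking_elapsed = loop.time() - start

    start = loop.time()
    await handle_background(store, "b", tasks)
    background_elapsed = loop.time() - start
    pending_at_return = "b" not in store  # write has not landed yet

    # Let outstanding writes finish before shutting down.
    await asyncio.gather(*tasks)
    return {
        "blocking_elapsed": blocking_elapsed,
        "background_elapsed": background_elapsed,
        "pending_at_return": pending_at_return,
        "store": store,
    }
```

Running `asyncio.run(demo())` shows the trade-off: the background variant returns before the record is stored, so the caller sees lower latency, but the write is confirmed only later.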

Python Client

Initialize the client once and reuse it as much as possible, to avoid initialization overhead and to keep the connection alive.
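One common way to follow this advice is to build the client in a cached factory function, so every request handler shares the same instance. The sketch below uses a hypothetical `StubClient` class in place of the real TensorZero client so that it runs without dependencies. The class, its `inference` method, and the gateway URL are assumptions for illustration; substitute the client constructor documented for your version.

```python
from functools import lru_cache


class StubClient:
    """Hypothetical stand-in for the real client; counts how often it is built."""

    instances = 0

    def __init__(self, gateway_url: str) -> None:
        StubClient.instances += 1  # real clients pay connection setup cost here
        self.gateway_url = gateway_url

    def inference(self, prompt: str) -> str:
        # Placeholder behavior; a real client would call the gateway.
        return f"echo: {prompt}"


@lru_cache(maxsize=None)
def get_client() -> StubClient:
    """Build the client on first use and return the same instance afterwards."""
    # The URL is an assumed local address, not taken from the documentation.
    return StubClient("http://localhost:3000")


def handle_request(prompt: str) -> str:
    """Each request reuses the shared client instead of constructing a new one."""
    return get_client().inference(prompt)
```

Calling `handle_request` repeatedly constructs `StubClient` only once. The anti-pattern to avoid is creating a new client inside `handle_request`, which repeats the initialization work and discards the open connection on every call.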