Performance & Latency
The TensorZero Gateway is designed from the ground up with performance in mind. It achieves <1ms P99 latency overhead under extreme load (see Benchmarks).
Even with default settings, the gateway is fast and lightweight enough that its overhead goes unnoticed in most applications.
Optimizations
If you care about extreme concurrency and low latency, we recommend the following settings and workflows.
TensorZero Gateway
- Enable gateway.observability.async_writes to offload writing inference responses to ClickHouse to a background task, so the gateway returns the inference response without waiting for the ClickHouse write to complete.
- Ensure your application, the TensorZero Gateway, and ClickHouse are deployed in the same region to minimize network latency.
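As a minimal sketch of the async_writes setting, the fragment below assumes the gateway reads a TOML configuration file (the filename tensorzero.toml is an assumption here) and that the dotted setting name maps onto a table and key in the usual TOML way. Check it against your own configuration before relying on it.

```toml
# tensorzero.toml (assumed filename)
[gateway.observability]
# Write inference data to ClickHouse in a background task instead of
# blocking the response on the write.
async_writes = true
```

The trade-off is that a response can reach the caller before its record has been written to ClickHouse.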
Python Client
- Initialize the client once and reuse it as much as possible, to avoid initialization overhead and to keep the connection alive.
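The reuse pattern can be sketched without depending on the real client. In the example below, StandInClient, get_client, and handle_request are hypothetical names used only for illustration, and StandInClient is a placeholder rather than the TensorZero client. In an application you would construct the actual client inside get_client instead.

```python
import functools


class StandInClient:
    """Hypothetical stand-in for a gateway client; counts how often it is built."""

    instances_created = 0

    def __init__(self, gateway_url: str) -> None:
        # In a real client, connection setup would happen here.
        StandInClient.instances_created += 1
        self.gateway_url = gateway_url

    def infer(self, prompt: str) -> str:
        # Placeholder for a real inference call over a kept-alive connection.
        return f"response to {prompt!r} via {self.gateway_url}"


@functools.lru_cache(maxsize=None)
def get_client(gateway_url: str = "http://localhost:3000") -> StandInClient:
    """Build the client on first use, then return the same cached instance."""
    return StandInClient(gateway_url)


def handle_request(prompt: str) -> str:
    # Every request reuses the cached client instead of constructing a new one.
    return get_client().infer(prompt)
```

Caching the constructor call means the initialization cost is paid once per process, and every later request shares the same client and its underlying connection.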