Multimodal Inference (VLMs)

TensorZero Gateway supports multimodal inference (e.g. image inputs) with vision-language models (VLMs).

See Integrations for a list of supported models.

Setup

Object Storage

TensorZero uses object storage to store images used during multimodal inference. It supports any S3-compatible object storage service, including AWS S3, GCP Cloud Storage, Cloudflare R2, and many more. You can configure the object storage service in the object_storage section of the configuration file.

In this example, we’ll use a local deployment of MinIO, an open-source S3-compatible object storage service.

[object_storage]
type = "s3_compatible"
endpoint = "http://minio:9000" # optional: defaults to AWS S3
# region = "us-east-1" # optional: depends on your S3-compatible storage provider
bucket_name = "tensorzero" # optional: depends on your S3-compatible storage provider
# IMPORTANT: for production environments, remove the following setting and use a secure method of authentication in
# combination with a production-grade object storage service.
allow_http = true

You can also store images in a local directory (type = "filesystem") or disable image storage (type = "disabled"). See Configuration Reference for more details.
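For example, a filesystem-backed configuration might look like the following sketch (the `path` value is a placeholder; check the Configuration Reference for the exact fields supported by your gateway version):

```toml
[object_storage]
type = "filesystem"
path = "/path/to/image/storage" # placeholder: set to a writable directory
```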

The TensorZero Gateway will attempt to retrieve credentials from the following sources, in order of priority:

  1. S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY environment variables
  2. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
  3. Credentials from the AWS SDK (default profile)

Docker Compose

We’ll use Docker Compose to deploy the TensorZero Gateway, ClickHouse, and MinIO.

docker-compose.yml
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/gateway/deployment
services:
  clickhouse:
    image: clickhouse/clickhouse-server:24.12-alpine
    environment:
      - CLICKHOUSE_USER=chuser
      - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
      - CLICKHOUSE_PASSWORD=chpassword
    ports:
      - "8123:8123"
    healthcheck:
      test: wget --spider --tries 1 http://chuser:chpassword@clickhouse:8123/ping
      start_period: 30s
      start_interval: 1s
      timeout: 1s
  gateway:
    image: tensorzero/gateway
    volumes:
      # Mount our tensorzero.toml file into the container
      - ./config:/app/config:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
      - S3_ACCESS_KEY_ID=miniouser
      - S3_SECRET_ACCESS_KEY=miniopassword
      - TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
    ports:
      - "3000:3000"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      clickhouse:
        condition: service_healthy
      minio:
        condition: service_healthy
  # For a production deployment, you can use AWS S3, GCP Cloud Storage, Cloudflare R2, etc.
  minio:
    image: bitnami/minio
    ports:
      - "9000:9000" # API port
      - "9001:9001" # Console port
    environment:
      - MINIO_ROOT_USER=miniouser
      - MINIO_ROOT_PASSWORD=miniopassword
      - MINIO_DEFAULT_BUCKETS=tensorzero
    healthcheck:
      test: "mc ls local/tensorzero || exit 1"
      start_period: 30s
      start_interval: 1s
      timeout: 1s

Inference

With the setup out of the way, you can now use the TensorZero Gateway to perform multimodal inference.

The TensorZero Gateway accepts both embedded images (encoded as base64 strings) and remote images (specified by a URL).

from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Do the images share any common features?",
                        },
                        # Remote image of Ferris the crab
                        {
                            "type": "image",
                            "url": "https://raw.githubusercontent.com/tensorzero/tensorzero/ff3e17bbd3e32f483b027cf81b54404788c90dc1/tensorzero-internal/tests/e2e/providers/ferris.png",
                        },
                        # One-pixel orange image encoded as a base64 string
                        {
                            "type": "image",
                            "mime_type": "image/png",
                            "data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAA1JREFUGFdj+O/P8B8ABe0CTsv8mHgAAAAASUVORK5CYII=",
                        },
                    ],
                }
            ],
        },
    )

    print(response)