Multimodal Inference (VLMs)

TensorZero Gateway supports multimodal inference (e.g. image inputs) with vision-language models (VLMs).

See Integrations for a list of supported models.

Setup

Object Storage

TensorZero uses object storage to store images used during multimodal inference. It supports any S3-compatible object storage service, including AWS S3, GCP Cloud Storage, Cloudflare R2, and many more. You can configure the object storage service in the object_storage section of the configuration file.

In this example, we’ll use a local deployment of MinIO, an open-source S3-compatible object storage service.

[object_storage]
type = "s3_compatible"
endpoint = "http://minio:9000" # optional: defaults to AWS S3
# region = "us-east-1" # optional: depends on your S3-compatible storage provider
bucket_name = "tensorzero" # optional: depends on your S3-compatible storage provider
# IMPORTANT: for production environments, remove the following setting and use a secure method of authentication in
# combination with a production-grade object storage service.
allow_http = true

You can also store images in a local directory (type = "filesystem") or disable image storage (type = "disabled"). See Configuration Reference for more details.
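For example, a filesystem-backed configuration might look like the following sketch (the `path` value is a placeholder; check the Configuration Reference for the exact fields supported by your gateway version):

```toml
[object_storage]
type = "filesystem"
path = "/path/to/image/storage" # placeholder: set to a writable directory
```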

The TensorZero Gateway will attempt to retrieve credentials from the following sources, in order of priority:

  1. S3_ACCESS_KEY_ID and S3_SECRET_ACCESS_KEY environment variables
  2. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables
  3. Credentials from the AWS SDK (default profile)

Docker Compose

We’ll use Docker Compose to deploy the TensorZero Gateway, ClickHouse, and MinIO.

docker-compose.yml
# This is a simplified example for learning purposes. Do not use this in production.
# For production-ready deployments, see: https://www.tensorzero.com/docs/gateway/deployment
services:
  clickhouse:
    image: clickhouse/clickhouse-server:24.12-alpine
    environment:
      - CLICKHOUSE_USER=chuser
      - CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1
      - CLICKHOUSE_PASSWORD=chpassword
    ports:
      - "8123:8123"
    healthcheck:
      test: wget --spider --tries 1 http://chuser:chpassword@clickhouse:8123/ping
      start_period: 30s
      start_interval: 1s
      timeout: 1s
  gateway:
    image: tensorzero/gateway
    volumes:
      # Mount our tensorzero.toml file into the container
      - ./config:/app/config:ro
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY:?Environment variable OPENAI_API_KEY must be set.}
      - S3_ACCESS_KEY_ID=miniouser
      - S3_SECRET_ACCESS_KEY=miniopassword
      - TENSORZERO_CLICKHOUSE_URL=http://chuser:chpassword@clickhouse:8123/tensorzero
    ports:
      - "3000:3000"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    depends_on:
      clickhouse:
        condition: service_healthy
      minio:
        condition: service_healthy
  # For a production deployment, you can use AWS S3, GCP Cloud Storage, Cloudflare R2, etc.
  minio:
    image: bitnami/minio
    ports:
      - "9000:9000" # API port
      - "9001:9001" # Console port
    environment:
      - MINIO_ROOT_USER=miniouser
      - MINIO_ROOT_PASSWORD=miniopassword
      - MINIO_DEFAULT_BUCKETS=tensorzero
    healthcheck:
      test: "mc ls local/tensorzero || exit 1"
      start_period: 30s
      start_interval: 1s
      timeout: 1s

Inference

With the setup out of the way, you can now use the TensorZero Gateway to perform multimodal inference.

The TensorZero Gateway accepts both embedded images (encoded as base64 strings) and remote images (specified by a URL).

from tensorzero import TensorZeroGateway

with TensorZeroGateway.build_http(
    gateway_url="http://localhost:3000",
) as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Do the images share any common features?",
                        },
                        # Remote image of Ferris the crab
                        {
                            "type": "image",
                            "url": "https://raw.githubusercontent.com/tensorzero/tensorzero/ff3e17bbd3e32f483b027cf81b54404788c90dc1/tensorzero-internal/tests/e2e/providers/ferris.png",
                        },
                        # One-pixel orange image encoded as a base64 string
                        {
                            "type": "image",
                            "mime_type": "image/png",
                            "data": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAA1JREFUGFdj+O/P8B8ABe0CTsv8mHgAAAAASUVORK5CYII=",
                        },
                    ],
                }
            ],
        },
    )

    print(response)