Think of LLM Applications as POMDPs — Not Agents

LLM engineering is a young field. Practitioners suffer from a lack of clarity about their problem space, resulting in tools and applications that are hard to reason about, manage, and iterate on — slop.

The classic strategy to deal with slop is to make a formal model of the problem space. When we started TensorZero in early 2024, we took a stab at predicting what AI applications might look like in a few years. Eventually, we arrived at a model of LLM applications that we believe should stand the test of time. After some squinting, we realized this model was isomorphic to a partially observable Markov decision process (POMDP).

POMDPs are an old idea, dating back to at least 1965. They are mathematical models that help in making optimal decisions in situations where the true state of the system is not fully observable and outcomes are uncertain. POMDPs are widely used in fields like robotics and autonomous systems to handle uncertainty in perception and action. By concretely connecting LLM applications to POMDPs, we were able to take inspiration from the decades of thought and research spent analyzing and developing methods for them.

By viewing LLM applications as POMDPs, we inferred practical takeaways: determining the correct interface between LLM applications and the LLMs themselves, establishing how to store historical observations, pinpointing which parts of LLM engineering could be automated, and understanding the role of evals. We also saw a holistic approach to building LLM applications that loops inference, observability, optimization, evaluation, and experimentation.

Over the past year, we built a system based on these principles, validated it with real-world applications, and conducted a pilot that significantly improved an AI phone agent. Recently, we published a production-grade subset of our system as open-source software. In this post, we outline our general model of an LLM application, explain its connection to POMDPs, discuss actionable insights for LLM engineers, and describe how TensorZero realizes these to enable LLM systems that learn from real-world experience to optimize against business KPIs.

You don’t have to build an LLM application this way — we only assert that it is possible to build all LLM applications this way and that doing so carries a host of benefits.

Anatomy of an LLM Application

Here’s what happens when an application calls an LLM:

  1. The application templates a set of input variables $\mathbb{X}$ into a prompt
  2. The application sends the prompt $\mathbb{P}$ to an LLM inference provider
  3. The inference provider runs the model and returns a generation $\mathbb{G}$
  4. The application parses the generation into a set of outputs $\mathbb{Y}$ (e.g. text, structured data, tool calls)

Most people think about the interface between applications and LLMs in terms of prompts and generations. However, we argue that framing it in terms of variables can help us reason more effectively about LLM engineering. This distinction is subtle but important: it informs how to structure your LLM application for better observability, optimization, evaluations, and experimentation.


Diagram: LLM Functions


More formally, the interface between an application and an LLM should be a function between application variables ($f: \mathbb{X} \to \mathbb{Y}$), not between prompts and generations ($f: \mathbb{P} \to \mathbb{G}$).
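To make this concrete, here is a minimal Python sketch of a variable-based interface. Everything in it (the function name, the input and output fields, the prompt template, and the `call_llm` placeholder) is an illustrative assumption, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class DraftReplyInput:
    """Application variables (elements of X) for a hypothetical support function."""
    conversation_history: list[str]
    customer_name: str


@dataclass
class DraftReplyOutput:
    """Application outputs (elements of Y)."""
    reply_text: str


def call_llm(prompt: str) -> str:
    """Placeholder for any LLM provider client; swap in a real call."""
    return "Thanks for reaching out! Let me look into that for you."


def draft_reply(x: DraftReplyInput) -> DraftReplyOutput:
    """An LLM function f: X -> Y. Prompting and parsing are internal details."""
    # 1. Template the input variables into a prompt
    prompt = (
        f"You are a support agent helping {x.customer_name}.\n"
        + "\n".join(x.conversation_history)
        + "\nWrite the next reply."
    )
    # 2-3. Send the prompt to an inference provider and get a generation back
    generation = call_llm(prompt)
    # 4. Parse the generation into application outputs
    return DraftReplyOutput(reply_text=generation.strip())
```

The rest of the application only ever touches `DraftReplyInput` and `DraftReplyOutput`; the prompt template and parsing logic can change freely inside the function.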

In practice, an LLM application might use multiple LLM functions $f_i: \mathbb{X}_i \to \mathbb{Y}_i$ for different tasks. For example, a customer support application might involve several tasks for an LLM: deciding what to say next, searching for information, triggering workflows on external systems, escalating to a human, and so on.

The application engineer must determine when to call each function $f_i$, what data to provide as inputs $\mathbb{X}_i$ (e.g. conversation history, customer data), and the information it needs as outputs $\mathbb{Y}_i$. Designing these boundaries between the application and the LLMs is a critical problem-specific step of LLM engineering.

With these decisions out of the way, the goal of LLM engineering becomes finding good implementations for each $f_i$. Depending on the application, ‘good’ might involve a tradeoff between quality, cost, and latency. In any case, these implementations need to produce outputs that lead to good outcomes for the application.

Feedback in an LLM Application

Ideally, LLM applications eventually observe the downstream consequences of an inference or a sequence of inferences.

Not every LLM application has a perfectly objective, accurate, immediate, and quantitative KPI. In most use cases, the application relies on a feedback signal that is noisy, sparse, and delayed.

Fortunately, combined with human review — such as demonstrations of good behavior (labels) or qualitative comments — these feedback signals are often sufficient to guide application development.

For the customer support application we discussed earlier, feedback could take different forms. The application can monitor metrics such as customer satisfaction, time to resolution, deflection rate, and so on. In a copilot setting, feedback could also include edits from a human-in-the-loop reviewer. There could even be pairwise or more complicated preference signals from humans.

This feedback signal could be a boolean, a scalar, or an even more complex object. In many cases, there are multiple feedback signals that could be relevant. A substantial amount of work continues to be done on modeling reward functions from more complex data. For simplicity, we’ll just say that the feedback is associated with some reward $r$ for a particular run of the application.
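As a rough sketch (the field and metric names below are illustrative, not from any specific library), a feedback signal can be modeled as a value attached either to a single inference or to a whole episode:

```python
import uuid
from dataclasses import dataclass


@dataclass
class Feedback:
    """A feedback signal attached to one inference or to a whole episode."""
    metric_name: str                       # e.g. "ticket_resolved" or "satisfaction_score"
    value: bool | float                    # boolean or scalar reward
    episode_id: uuid.UUID                  # which run of the application it belongs to
    inference_id: uuid.UUID | None = None  # optionally, a specific inference within the episode


# Example: the support episode ended with the ticket resolved.
feedback = Feedback(metric_name="ticket_resolved", value=True, episode_id=uuid.uuid4())
```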

A General Model for LLM Applications

In short, an LLM application is a computer program that calls LLM functions $f_i: \mathbb{X}_i \to \mathbb{Y}_i$ and (hopefully) observes the consequences of its actions through a feedback signal with reward $r$.

Though we focus on LLMs in this post, this formalism also extends neatly to multimodal systems.

What are POMDPs? Why should we care?

Reinforcement learning (RL) deals with how agents can learn optimal behaviors — policies $\pi$ — through trial and error by interacting with their environment. It focuses on making sequences of decisions that maximize cumulative rewards over time. In the past decade, reinforcement learning’s successes include superhuman performance in complex games like Go and huge strides in robotics.

Partially observable Markov decision processes (POMDPs) are the setting for decision-making problems in which the agent must act under uncertainty due to incomplete information about the environment’s state. Over time, the agent must learn a policy $\pi$ — a set of rules for behaving in different situations to achieve its goal. POMDPs provide a mathematical framework for modeling complex, real-world scenarios with noisy or sparse observations. For this reason, they are a suitable model for the problems LLM application builders tackle today.

Formal Definition of a POMDP

POMDPs are given by a 7-tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, H \rangle$ where:

  • $\mathcal{S}$ is the state space (never observed)
  • $\mathcal{A}$ is the action space
  • $T: \mathcal{S} \times \mathcal{A} \to P(\mathcal{S})$ is the probabilistic state transition function
  • $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function
  • $\Omega$ is the observation space
  • $O: \mathcal{S} \to P(\Omega)$ is the observation function
  • $H \in \mathbb{N}$ is the horizon

The agent in a POMDP is determined by a time-varying policy $\pi_t: \mathbb{H}_t \to \mathcal{A}$, where $h_t \in \mathbb{H}_t$ is the history of observations and actions up to time $t$. Without loss of generality, we assume an initial state $s_0$; at each time period $t = 0, \dots, H-1$, an observation $o_t \sim O(s_t)$ is generated and an action $a_t = \pi_t(h_t)$ is chosen by the policy. A reward $r_t = R(s_t, a_t)$ is generated, and finally a new state $s_{t+1} \sim T(s_t, a_t)$ is sampled from the transition function.

The goal of the agent is to choose a policy $\pi = \pi_0 \dots \pi_{H-1}$ that maximizes the expected sum of rewards:

$$\max_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{H-1} R(s_t, a_t)\right]$$

Data is collected by the agent from the POMDP in episodes of $H$ steps. The agent uses its experience to improve the policy over the course of many episodes.
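The definitions above translate directly into an interaction loop. Here is a generic sketch; the `env` and `policy` objects are assumptions you would supply, with the interface described in the docstring.

```python
def run_episode(env, policy, horizon: int) -> float:
    """Roll out one episode of a POMDP and return the sum of rewards.

    `env` is assumed to expose reset()/observe()/step(); `policy` maps the
    history of observations and actions to the next action.
    """
    state = env.reset()                 # s_0 (never shown to the agent)
    history = []                        # h_t: observations and actions so far
    total_reward = 0.0
    for t in range(horizon):
        observation = env.observe(state)          # o_t ~ O(s_t)
        action = policy(history + [observation])  # a_t = pi_t(h_t)
        reward, state = env.step(state, action)   # r_t = R(s_t, a_t), s_{t+1} ~ T(s_t, a_t)
        history += [observation, action]
        total_reward += reward
    return total_reward
```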

As researchers have studied POMDPs for decades, we understand in detail sensible ways to approach them (and that they are generally hard to solve). Below, after we show how LLM applications can be framed as a special case of POMDPs, we discuss many of these properties and how they allow us to effectively observe, optimize, evaluate, and experiment with LLM applications.

LLM Applications are POMDPs

For the sake of LLM optimization, LLM applications are better framed as POMDPs — not as agents.

This can be counterintuitive at first!

Under this framing, the agent corresponds only to the LLM functions, not the LLM application as a whole. The environment includes not only the outside world but also all the remaining code in the LLM application. A policy is a set of implementations for the LLM functions. At each step, given the choice of LLM function to call and the associated inputs (i.e. an observation), the policy determines the LLM call’s outputs (i.e. an action). These actions influence the environment and possibly produce a reward $r$. The agent should learn a policy that maximizes the expected sum of rewards over time.

Formal Framing of LLM Applications as POMDPs

We define the POMDP $\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, H \rangle$ associated with an LLM application as follows.

State Space $\mathcal{S}$

The state of a POMDP is the set of information that might influence the future observations and rewards. We never actually observe the state, so we can leave the state space undefined in the context of an LLM application.

Conceptually, the state might include the identity and frame of mind of users interacting with the LLM application, the contents of a database and other external systems that an LLM might query, the state of the physical world (e.g. the weather), and so on.

Action Space $\mathcal{A}$

The action space $\mathcal{A}$ is the set of outputs which could come from the LLM functions, i.e. $\mathcal{A} = \bigcup_i \mathbb{Y}_i$.

Transition Function $T$

The transition function $T$ relates to changes in the state between LLM function calls. These changes could be due to interactions with humans, retrieval calls, arbitrary code execution, and so on.

In other words, the transition function is determined by all the code in the application other than the LLM functions, along with its impact on the outside world.

Reward Function $R$

As discussed earlier, a feedback signal for an LLM application is associated with some reward $r \in \mathbb{R}$ for a particular run of the application. The reward function $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ takes this into account. You can generalize this approach to multiple feedback signals and different types of feedback.

Observation Space $\Omega$

The observation space $\Omega$ is the set of all possible choices of LLM function along with the choice of inputs for that function call, i.e. $\Omega = \bigcup_i \left(\{i\} \times \mathbb{X}_i\right)$.

Observation Function $O$

At different points in time, the LLM application might choose to call an LLM function with a given set of inputs. The observation function $O$ determines the process by which the state of the program and world $s$ results in an observation $o \in \Omega$.

Horizon $H$

The horizon is an upper bound on the number of function calls that are part of a particular run or session of the application.

Solving this POMDP means finding a policy that maximizes the expected sum of rewards. In other words, we should find implementations for the LLM functions that lead to good outcomes in the context of our LLM application.

With this in mind, what the industry typically calls ‘building an agent’ corresponds to application engineering + POMDP optimization. This separation of concerns clarifies the role of LLMs in an LLM application and enables a more effective approach to LLM engineering.

So What?

Typically, applications of reinforcement learning loop through two steps to improve performance over time:

  1. Collect some data by deploying policies in the real world
  2. Use the data to generate improved variants of the policies

As with all ML systems, the data and its organization are crucial. In order to generate new policies in an RL system, you’ll want data in the form of trajectories, i.e. sequences of observations, actions, and rewards. For language modeling, this means that you’ll need to store elements of $\mathbb{X}$ and $\mathbb{Y}$ rather than prompts and generations, as is typically done in LLM observability tools. You’ll also want to collect data in a format that is agnostic of the policy, i.e. the choice of prompt, model, and so on. Finally, you’ll need to associate inferences that are part of the same episode, as well as feedback, which may be available at either the inference or episode level.
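Concretely, each inference can be stored as a structured, policy-agnostic record rather than a raw prompt/generation pair. The field names below are illustrative, not a prescribed schema:

```python
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class InferenceRecord:
    """One observation/action pair, stored independently of the policy that produced it."""
    episode_id: uuid.UUID          # groups inferences from the same run of the application
    function_name: str             # which LLM function f_i was called
    variant_name: str              # which implementation (prompt + model + strategy) was used
    inputs: dict[str, Any]         # elements of X_i, not the templated prompt
    outputs: dict[str, Any]        # elements of Y_i, not the raw generation
    inference_id: uuid.UUID = field(default_factory=uuid.uuid4)


# Feedback (e.g. the reward r) is stored separately and joined on inference_id or episode_id.
```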


Diagram: Application with Multiple LLM Functions


Since the policy is determined by the implementation used when calling each $f_i$, you’ll want an easy way to swap or experiment with various implementations for each $f_i$ that fit the same interface. If you experiment with multiple implementations of each $f_i$, you’ll want to independently sample which one to use per episode so that the effects of the other trials wash out and your estimates for any particular $f_i$ are unbiased (sketched below). Over time, you’ll want to gravitate towards the implementation for each function that gets the best feedback.
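A minimal sketch of per-episode sampling; the variant names and weights are illustrative assumptions:

```python
import random

# Candidate implementations (variants) for each LLM function, with sampling weights.
VARIANTS = {
    "draft_reply": {"baseline_prompt": 0.8, "fine_tuned_model": 0.2},
    "search_kb":   {"baseline_prompt": 0.5, "chain_of_thought": 0.5},
}


def sample_variants(rng: random.Random) -> dict[str, str]:
    """Pick one variant per function, independently, once per episode."""
    choices = {}
    for function_name, weights in VARIANTS.items():
        names, probs = zip(*weights.items())
        choices[function_name] = rng.choices(names, weights=probs, k=1)[0]
    return choices


# At the start of each episode, fix the variants and use them for every call in that episode.
variant_assignment = sample_variants(random.Random())
```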

There are many ways to improve LLM systems once $\mathbb{X}$ and $\mathbb{Y}$ are fixed. You can update the prompt, choose a different base model, fine-tune the model, apply RLHF techniques, and so on. You can even relax the assumption that a function call corresponds to a single LLM call — so long as you respect the interface of $f: \mathbb{X} \to \mathbb{Y}$ — enabling complex inference-time optimizations (like the OpenAI o1 family of models).
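For instance, a variant could make several LLM calls internally (say, best-of-n sampling with a judge) while still presenting the same $f: \mathbb{X} \to \mathbb{Y}$ interface. A hedged sketch, reusing the hypothetical `draft_reply` types from the earlier example; `judge_quality` is an assumed scorer, shown here as a trivial stand-in:

```python
def judge_quality(x: DraftReplyInput, candidate: DraftReplyOutput) -> float:
    """Assumed scorer (e.g. another LLM call or a heuristic); higher is better."""
    return float(len(candidate.reply_text))  # trivial stand-in heuristic


def draft_reply_best_of_n(x: DraftReplyInput, n: int = 3) -> DraftReplyOutput:
    """Same f: X -> Y interface as draft_reply, but samples n candidates and keeps the best."""
    candidates = [draft_reply(x) for _ in range(n)]
    return max(candidates, key=lambda c: judge_quality(x, c))
```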


Diagram: Optimizing LLM Systems


After you’ve used one or more of these techniques to generate a new policy, you’ll want to use historical data to backtest it and find out whether it’s any good. These are evals, but the problem has been studied for decades under the name offline policy evaluation. Once deployed, you’ll want to be able to call all of these implementations through the same interface so that you don’t have to change client code for each experiment and you have unified visibility into each policy and how it is performing.
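As a rough illustration, a simple (and statistically naive) backtest might compare the average reward of episodes served by each variant; proper offline policy evaluation uses estimators such as importance sampling to correct for how the data was collected. The episode format below is an assumption for the sketch:

```python
from collections import defaultdict


def mean_reward_by_variant(episodes: list[dict]) -> dict[str, float]:
    """Naive backtest: average episode reward grouped by the variant that served it.

    Each episode dict is assumed to look like {"variant": str, "reward": float}.
    This ignores sampling corrections (e.g. importance weighting) that real
    offline policy evaluation would apply.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        totals[episode["variant"]] += episode["reward"]
        counts[episode["variant"]] += 1
    return {variant: totals[variant] / counts[variant] for variant in totals}
```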

TensorZero: A Unified System for LLM-Applications-as-POMDPs

We are building TensorZero with this philosophy in mind. Today, we have an open-source production-grade system that makes the principles we discussed concrete.

Concretely, TensorZero creates a feedback loop for optimizing LLM applications — turning production data into smarter, faster, and cheaper models.

Under the hood, TensorZero provides a unified interface for LLM functions. This interface is agnostic to the choice of prompt, model, inference strategy, and more. The same interface allows you to assign feedback to individual inferences or sequences of inferences (episodes). Over time, TensorZero collects structured data that can be used for optimizing policies — generating new TensorZero function variants — be it updating prompts, fine-tuning models, leveraging inference-time optimizations, or anything else that changes the behavior of the function. Finally, TensorZero manages experimentation to ensure that new variants don’t just look good in a vacuum but actually drive real-world improvements.
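To give a feel for the workflow, here is a rough usage sketch. It approximates the TensorZero Python client, but the exact constructor and arguments may differ across versions, and the `draft_reply` function and `ticket_resolved` metric are placeholders you would define in your own `tensorzero.toml`; consult the docs for the current API.

```python
from tensorzero import TensorZeroGateway

# Connect to a running TensorZero Gateway (see the docs for setup details).
client = TensorZeroGateway("http://localhost:3000")

# Call a TensorZero function: the gateway picks a variant (prompt, model, strategy).
response = client.inference(
    function_name="draft_reply",  # placeholder function defined in tensorzero.toml
    input={"messages": [{"role": "user", "content": "My order hasn't arrived yet."}]},
)

# Later, attach feedback (a reward signal) to that inference's episode.
client.feedback(
    metric_name="ticket_resolved",  # placeholder metric defined in tensorzero.toml
    value=True,
    episode_id=response.episode_id,
)
```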

In the future, TensorZero will automate much of this process. After a TensorZero-powered LLM application is deployed, it will automatically generate, evaluate, and present new variants that can potentially improve the performance of the system. The engineer’s role shifts from managing data transformations and training jobs to high-level abstractions of the problem: What should be the interface between the application and LLMs? What are the feedback signals? Which variants are successful? We believe this approach has the potential to fundamentally change LLM engineering and help unlock the next generation of LLM applications: AI systems that learn from real-world experience.

Check out TensorZero on GitHub and get started today!

If you’re interested in working with us, visit our jobs page.