The problem
LLMs aren't deterministic. You can hand the same model the same prompt at the same temperature and get two different answers back. Most days, that's fine. Humans don't notice or care. It becomes a problem the moment the output isn't headed for a human. The moment it's feeding other code.
For a while, the standard response was to fight it. Turn the temperature down, set a seed, polish the prompt, add self-consistency checks, run an ensemble. All of these help around the edges. None of them fix what's actually going on, which is that you've got a probabilistic component sitting inside a deterministic system, and the system keeps acting surprised when the component behaves probabilistically.
The solution
The trick is to invert it. Stop trying to pin down the model. Pin down the system instead, and let the model do its fuzzy thing inside boundaries you actually control.
Here's the mental model that makes this work: an LLM is a compiler. Natural language, messy user input, whatever the rest of your code can't handle: all of it goes in. Typed objects with predictable shapes come out. Once you're past that boundary, you're back in normal software-engineering territory, where things have types and the same call returns the same thing twice in a row.
So the job is to draw those boundaries carefully. Every place the model touches your code is a place that needs a contract. A schema.
What that looks like
There are roughly four boundaries that matter.
Inputs first. Your prompt isn't a wall of text with user data jammed into it like a hostage note. It's the body of a function whose signature is a typed input. Same idea you'd use for any other function in your codebase.
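A minimal sketch of the idea. The Ticket fields here are illustrative, and classifyPrompt is the same helper the before-and-after example further down calls:

// The prompt as the body of a typed function rather than an inline template.
type Ticket = {
  id: string;
  subject: string;
  body: string;
};

function classifyPrompt(ticket: Ticket): string {
  return [
    "Classify the following support ticket.",
    `Subject: ${ticket.subject}`,
    `Body: ${ticket.body}`,
  ].join("\n");
}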
Then outputs. Every major provider now supports constrained decoding against a JSON schema, and you should use it. The model returns objects, not strings. If validation fails, you retry inside a bounded loop instead of letting bad output leak downstream.
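Here's one shape that bounded loop can take. The llm.complete call is a stand-in for whatever client you use; its option shape, and the completeWithSchema name, are assumptions for illustration:

import { z } from "zod";

// Stand-in for your client; the call signature is assumed, not from a real SDK.
declare const llm: {
  complete(opts: { prompt: string; schema: unknown }): Promise<unknown>;
};

// Validate at the boundary, retry a fixed number of times, then fail loudly.
async function completeWithSchema<S extends z.ZodTypeAny>(
  prompt: string,
  schema: S,
  maxAttempts = 3
): Promise<z.infer<S>> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await llm.complete({ prompt, schema });
    const parsed = schema.safeParse(raw);
    if (parsed.success) return parsed.data;
    lastError = parsed.error; // the error could also be fed back into the next attempt's prompt
  }
  throw new Error(`Output failed validation after ${maxAttempts} attempts: ${String(lastError)}`);
}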
Tools come next. Each one the model can call has a typed signature, in and out, and the set of legal moves at any given step stays small, finite, and inspectable.
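A sketch of what one typed tool might look like. The names (lookupOrder, fetchOrder) are made up for illustration; the point is that both sides of the call have schemas and the full set of tools is one small object you can read:

import { z } from "zod";

// Stub standing in for your own data layer; it returns data of unknown shape.
async function fetchOrder(orderId: string): Promise<unknown> {
  return { status: "shipped", total: 42.5 };
}

const LookupOrderInput = z.object({ orderId: z.string() });
const LookupOrderOutput = z.object({
  status: z.enum(["pending", "shipped", "delivered", "refunded"]),
  total: z.number(),
});

// Every legal move the model can make, in one inspectable place.
const tools = {
  lookupOrder: {
    description: "Fetch the current status of an order",
    input: LookupOrderInput,
    output: LookupOrderOutput,
    run: async (args: z.infer<typeof LookupOrderInput>) =>
      LookupOrderOutput.parse(await fetchOrder(args.orderId)),
  },
} as const;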
And then state. Conversation history, scratchpads, anything the agent carries between turns: none of it lives as raw strings. It's a typed object, the kind you can log and replay and diff between runs without parsing anything.
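Same move for state, with illustrative fields:

import { z } from "zod";

// Everything the agent carries between turns, as plain typed data.
const AgentState = z.object({
  turns: z.array(
    z.object({
      role: z.enum(["user", "assistant", "tool"]),
      content: z.string(),
    })
  ),
  scratchpad: z.string().default(""),
  openTicketIds: z.array(z.string()).default([]),
});
type AgentState = z.infer<typeof AgentState>;

// Plain data means you can log it, diff it between runs, and replay a
// session by serializing and re-parsing it, with validation on the way back in.
const saved = JSON.stringify(AgentState.parse({ turns: [] }));
const restored: AgentState = AgentState.parse(JSON.parse(saved));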
In TypeScript with Zod, the shift looks something like this:
import { z } from "zod";
// Before
const text = await llm.complete(`Classify this ticket: ${ticket.body}`);
const category = parseCategory(text); // regex, hope, prayer
// After
const Classification = z.object({
  category: z.enum(["billing", "technical", "account", "sales", "other"]),
  confidence: z.number().min(0).max(1),
  reasoning: z.string(),
});

const result = await llm.complete({
  prompt: classifyPrompt(ticket),
  schema: Classification,
});
// result.category is typed as the enum
// result.confidence is a number in [0, 1]
// if not, ZodError fired before this line
Zod is doing a lot of work here, and it's worth pausing on why. You define the shape of Classification once. From that one definition you get runtime validation, TypeScript types via z.infer, and a JSON schema you can hand to the model for structured output. One source of truth for what a classification is, used everywhere it's needed. That kind of consolidation is rare in this part of the stack, and it's what makes the deterministic-AI pattern practical instead of theoretical.
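Spelled out, the one-definition-three-artifacts move looks roughly like this. zod-to-json-schema is one common way to get the JSON schema; some SDKs accept the Zod schema directly:

import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Same schema as above, repeated so this snippet stands alone.
const Classification = z.object({
  category: z.enum(["billing", "technical", "account", "sales", "other"]),
  confidence: z.number().min(0).max(1),
  reasoning: z.string(),
});

// 1. A compile-time type, derived instead of hand-written.
type Classification = z.infer<typeof Classification>;

// 2. Runtime validation at the boundary.
const check = Classification.safeParse({
  category: "billing",
  confidence: 0.9,
  reasoning: "mentions an unpaid invoice",
});

// 3. A JSON schema to hand to the provider for constrained decoding.
const jsonSchema = zodToJsonSchema(Classification, "Classification");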
The "after" version is boring, and that's the whole point. You can write tests against it. When something breaks, it breaks loudly at the schema boundary instead of quietly two hops downstream. The LLM call becomes a typed function you can chain into other typed functions, or swap out entirely when a simple rule would do the job.
When it doesn't work
This isn't a free lunch.
Tight schemas can hurt model quality. If you squeeze the model into fields that don't fit the shape of the problem, the answers get worse. The usual fix is to give it a reasoning field somewhere, a place to think out loud in prose before it commits to structured output. The schema holds at the boundary, but the model still gets room to breathe on the way there.
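One cheap way to give it that room, assuming your provider emits keys in the order the schema declares them (worth verifying, since this varies by provider): put the reasoning field first.

import { z } from "zod";

// Reasoning declared first, so the model writes its prose rationale before
// it commits to the enum and the confidence score.
const Classification = z.object({
  reasoning: z.string(),
  category: z.enum(["billing", "technical", "account", "sales", "other"]),
  confidence: z.number().min(0).max(1),
});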
Some tasks just shouldn't be schematized at all. Creative writing, open-ended conversation, anything where the whole point is for the model to surprise you. A strict schema is a cage. Most real systems end up mixed: rigid pipelines for the parts other code depends on, looser text for the parts a human reads.
The line I keep coming back to is this. Deterministic AI is the right default when the model's output feeds other code. It's the wrong default when the output is the product. Most applications are both, in different places, and knowing which is which before you ship is the actual skill.