
What breaks when you take an AI prototype to production

The gap between a working LLM demo and a reliable product is wider than most teams expect. Here are the five failure modes I see most often, and how to engineer around them.

3 min read

Most AI prototypes work beautifully in a demo and break spectacularly the moment real users touch them. The gap between "it works on the founder's laptop" and "it works for 10,000 customers on a Tuesday morning" is wider than teams expect, and the bugs that live in that gap are almost never about the model.

After shipping LLM-backed products for Mercedes-Benz, MedLucy, and several founders you haven't heard of yet, I keep seeing the same five failure modes.

1. The model is deterministic in demos, non-deterministic in prod

Your demo ran the same prompt ten times and got ten believable answers. In production, that same prompt hits a rate-limited upstream at 3am and returns an HTTP 429, an empty completion, a refusal, or a partial JSON blob the parser can't handle. The "agent" collapses.

The fix is not retries. The fix is treating every model call as a flaky RPC: structured output schemas, explicit fallbacks, typed error boundaries, and a circuit breaker so a blip in OpenAI doesn't cascade into a site-wide outage.

2. Nobody tracks cost until the bill arrives

A single power user sending ten thousand tokens a day is rounding error in testing and a serious margin problem at scale. Without per-request cost tracking logged alongside the request, you can't tell whether your feature is profitable, whether a new prompt regressed unit economics, or whether a specific customer needs a rate limit.

Log input_tokens, output_tokens, and model on every call. Aggregate by user and by route. You'll find at least one surprise within a week.
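A sketch of what that logging can look like. The price table and model name here are placeholders, not real rates; substitute your provider's current pricing.

```python
from collections import defaultdict

# Hypothetical per-model pricing, USD per 1K tokens: (input_rate, output_rate).
# Plug in your provider's real numbers.
PRICES = {"example-model": (0.0025, 0.01)}

call_log: list[dict] = []


def log_call(user: str, route: str, model: str,
             input_tokens: int, output_tokens: int) -> None:
    """Record token usage and estimated cost alongside the request."""
    in_rate, out_rate = PRICES[model]
    cost = input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate
    call_log.append({"user": user, "route": route, "model": model,
                     "input_tokens": input_tokens,
                     "output_tokens": output_tokens, "cost": cost})


def cost_by_user() -> dict[str, float]:
    """Aggregate estimated spend per user; do the same per route."""
    totals: dict[str, float] = defaultdict(float)
    for rec in call_log:
        totals[rec["user"]] += rec["cost"]
    return dict(totals)
```

In production the log lines would go to your existing logging or metrics pipeline rather than an in-memory list; the shape of the record is what matters.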

3. Prompts ship with the code, but nobody can version them

Someone tweaks a system prompt. Quality drops across the whole product. There's no rollback, no A/B, no way to see what the prompt looked like yesterday. The prompt is a config value disguised as a string literal.

Store prompts in a dedicated module, git-history them like code, and add a logging hook that emits the prompt hash alongside each completion. Now when quality regresses, you can diff the two hashes.
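A minimal version of that module. The prompt text and name below are invented for illustration; the pattern is a single dictionary under version control plus a short content hash you attach to every completion log line.

```python
import hashlib

# All prompts live here, in one git-tracked module, keyed by name.
PROMPTS = {
    "summarize": "You are a concise assistant. Summarize the user's text in 3 bullets.",
}


def prompt_hash(name: str) -> str:
    """Short content hash to emit alongside every completion.

    When quality regresses, grep your logs for the hash, then
    `git log -S <old text>` the prompts module to find what changed.
    """
    return hashlib.sha256(PROMPTS[name].encode("utf-8")).hexdigest()[:12]
```

Any edit to the prompt string produces a new hash, so two log lines with different hashes point you straight at the diff.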

4. Observability stops at the HTTP boundary

Traditional APM tells you a request took 4.2 seconds. It does not tell you that 3.9 of those seconds were spent waiting on a re-rank call that could have been cached, or that the model picked the wrong tool and silently looped twice before giving up.

You need traces that follow the request inside the agent: each model call, each tool invocation, each retrieval, each retry. OpenTelemetry works fine for this; wire it up before you ship, not after your first production fire.
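To keep the example dependency-free, here is a hand-rolled stand-in for the span API (in real code you would use OpenTelemetry's tracer and an exporter instead of a list). The agent steps and attribute names are invented for illustration; the point is that every stage inside the loop gets its own timed, attributed span.

```python
import time
from contextlib import contextmanager

spans: list[dict] = []  # stand-in for an OpenTelemetry exporter/collector


@contextmanager
def span(name: str, **attrs):
    """Minimal span: name, attributes, wall-clock duration. Records on exit."""
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append({"name": name, "attrs": attrs,
                      "duration_s": time.monotonic() - start})


def run_agent(question: str) -> str:
    """Toy agent loop: every internal step is wrapped in its own span."""
    with span("agent.loop", question=question):
        with span("retrieval", k=5):
            docs = ["doc-1", "doc-2"]  # stand-in for your vector store call
        with span("model.call", model="example-model"):
            answer = f"answer using {len(docs)} docs"  # stand-in completion
        return answer
```

With real OpenTelemetry the shape is the same: `tracer.start_as_current_span(...)` around each step, attributes for model, token counts, and tool name, and your existing trace viewer shows where the 3.9 seconds went.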

5. Evals are a launch blocker, not a release gate

Teams sink weeks into a golden-dataset eval suite before launch, then never run it again. Six months later the product has drifted badly and nobody can say by how much.

Evals should run in CI on every PR that touches a prompt, a model, or a pipeline step. Even a 50-case suite that takes two minutes in CI catches the worst regressions before they ship.
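An eval suite can start as small as a list of (input, check) pairs run by pytest or a plain script in CI. Everything here is a hypothetical sketch: the model callable is a stub standing in for your real completion call, and the cases are placeholders for your product's golden dataset.

```python
from typing import Callable


def fake_model(prompt: str) -> str:
    """Stub standing in for the real model call under test."""
    return "Paris is the capital of France."


# Golden cases: (input, predicate-on-output). Grow this file as bugs are found.
EVAL_CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is the capital of France?", lambda out: len(out) < 200),
]


def run_evals(model: Callable[[str], str]) -> tuple[int, int]:
    """Return (passed, total); fail the CI job when passed < total."""
    passed = sum(1 for prompt, check in EVAL_CASES if check(model(prompt)))
    return passed, len(EVAL_CASES)
```

Wire `run_evals` into the CI job that runs on any PR touching prompts, models, or pipeline code, and a regression becomes a red check instead of a support ticket.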

What to do this week

If you're in the middle of this transition, the highest-leverage work is usually:

  • Wrap your model calls in a typed client with schemas, fallbacks, and cost logging. One afternoon.
  • Add OpenTelemetry traces to your agent loop. One day.
  • Write ten eval cases that capture your product's core promise, and run them in CI. One week.

None of this is exotic. It's the same discipline we apply to any third-party API; the culture around LLM features just hasn't caught up to it yet.

If you're stuck turning a working prototype into something your team can actually operate, this is the exact engagement I help founders and product teams run. You can see how I work in my AI-to-Production Sprint, or reach out and we'll talk specifics.

Working through this yourself?

This is the kind of work I help founders and product teams run. Book a discovery call; if it's not a fit, I'll point you to someone it is.

Start a Conversation