
What breaks when you take an AI prototype to production

The gap between a working LLM demo and a reliable product is wider than most teams expect. Here are the five failure modes I see most often, and how to engineer around them.

3 min read

Most AI prototypes work beautifully in a demo and break spectacularly the moment real users touch them. The gap between "it works on the founder's laptop" and "it works for 10,000 customers on a Tuesday morning" is wider than teams expect, and the bugs that live in that gap are almost never about the model.

After shipping LLM-backed products for Mercedes-Benz, MedLucy, and several founders you haven't heard of yet, I keep seeing the same five failure modes.

1. The model is deterministic in demos, non-deterministic in prod

Your demo ran the same prompt ten times and got ten believable answers. In production, that same prompt hits a rate-limited upstream at 3am and returns an HTTP 429, an empty completion, a refusal, or a partial JSON blob the parser can't handle. The "agent" collapses.

The fix is not retries. The fix is treating every model call as a flaky RPC: structured output schemas, explicit fallbacks, typed error boundaries, and a circuit breaker so a blip in OpenAI doesn't cascade into a site-wide outage.

2. Nobody tracks cost until the bill arrives

A single power user sending ten thousand tokens a day is rounding error in testing and a serious margin problem at scale. Without per-request cost tracking logged alongside the request, you can't tell whether your feature is profitable, whether a new prompt regressed unit economics, or whether a specific customer needs a rate limit.

Log input_tokens, output_tokens, and model on every call. Aggregate by user and by route. You'll find at least one surprise within a week.
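A sketch of what that logging can look like. The price table and model name here are placeholders, not real rates; substitute your provider's current pricing.

```python
from collections import defaultdict

# Hypothetical per-model pricing, USD per 1K tokens: (input_rate, output_rate).
# Plug in your provider's real numbers.
PRICES = {"example-model": (0.0025, 0.01)}

call_log: list[dict] = []


def log_call(user: str, route: str, model: str,
             input_tokens: int, output_tokens: int) -> None:
    """Record token usage and estimated cost alongside the request."""
    in_rate, out_rate = PRICES[model]
    cost = input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate
    call_log.append({"user": user, "route": route, "model": model,
                     "input_tokens": input_tokens,
                     "output_tokens": output_tokens, "cost": cost})


def cost_by_user() -> dict[str, float]:
    """Aggregate estimated spend per user; do the same per route."""
    totals: dict[str, float] = defaultdict(float)
    for rec in call_log:
        totals[rec["user"]] += rec["cost"]
    return dict(totals)
```

In production the log lines would go to your existing logging or metrics pipeline rather than an in-memory list; the shape of the record is what matters.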

3. Prompts ship with the code, but nobody can version them

Someone tweaks a system prompt. Quality drops across the whole product. There's no rollback, no A/B, no way to see what the prompt looked like yesterday. The prompt is a config value disguised as a string literal.

Store prompts in a dedicated module, git-history them like code, and add a logging hook that emits the prompt hash alongside each completion. Now when quality regresses, you can diff the two hashes.
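A minimal version of that module. The prompt text and name below are invented for illustration; the pattern is a single dictionary under version control plus a short content hash you attach to every completion log line.

```python
import hashlib

# All prompts live here, in one git-tracked module, keyed by name.
PROMPTS = {
    "summarize": "You are a concise assistant. Summarize the user's text in 3 bullets.",
}


def prompt_hash(name: str) -> str:
    """Short content hash to emit alongside every completion.

    When quality regresses, grep your logs for the hash, then
    `git log -S <old text>` the prompts module to find what changed.
    """
    return hashlib.sha256(PROMPTS[name].encode("utf-8")).hexdigest()[:12]
```

Any edit to the prompt string produces a new hash, so two log lines with different hashes point you straight at the diff.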

4. Observability stops at the HTTP boundary

Traditional APM tells you a request took 4.2 seconds. It does not tell you that 3.9 of those seconds were spent waiting on a re-rank call that could have been cached, or that the model picked the wrong tool and silently looped twice before giving up.

You need traces that follow the request inside the agent: each model call, each tool invocation, each retrieval, each retry. OpenTelemetry works fine for this; wire it up before you ship, not after your first production fire.
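To keep the example dependency-free, here is a hand-rolled stand-in for the span API (in real code you would use OpenTelemetry's tracer and an exporter instead of a list). The agent steps and attribute names are invented for illustration; the point is that every stage inside the loop gets its own timed, attributed span.

```python
import time
from contextlib import contextmanager

spans: list[dict] = []  # stand-in for an OpenTelemetry exporter/collector


@contextmanager
def span(name: str, **attrs):
    """Minimal span: name, attributes, wall-clock duration. Records on exit."""
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append({"name": name, "attrs": attrs,
                      "duration_s": time.monotonic() - start})


def run_agent(question: str) -> str:
    """Toy agent loop: every internal step is wrapped in its own span."""
    with span("agent.loop", question=question):
        with span("retrieval", k=5):
            docs = ["doc-1", "doc-2"]  # stand-in for your vector store call
        with span("model.call", model="example-model"):
            answer = f"answer using {len(docs)} docs"  # stand-in completion
        return answer
```

With real OpenTelemetry the shape is the same: `tracer.start_as_current_span(...)` around each step, attributes for model, token counts, and tool name, and your existing trace viewer shows where the 3.9 seconds went.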

5. Evals are a launch blocker, not a release gate

Teams sink weeks into a golden-dataset eval suite before launch, then never run it again. Six months later the product has drifted badly and nobody can say by how much.

Evals should run in CI on every PR that touches a prompt, a model, or a pipeline step. Even a 50-case suite that takes two minutes in CI catches the worst regressions before they ship.
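An eval suite can start as small as a list of (input, check) pairs run by pytest or a plain script in CI. Everything here is a hypothetical sketch: the model callable is a stub standing in for your real completion call, and the cases are placeholders for your product's golden dataset.

```python
from typing import Callable


def fake_model(prompt: str) -> str:
    """Stub standing in for the real model call under test."""
    return "Paris is the capital of France."


# Golden cases: (input, predicate-on-output). Grow this file as bugs are found.
EVAL_CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("What is the capital of France?", lambda out: "Paris" in out),
    ("What is the capital of France?", lambda out: len(out) < 200),
]


def run_evals(model: Callable[[str], str]) -> tuple[int, int]:
    """Return (passed, total); fail the CI job when passed < total."""
    passed = sum(1 for prompt, check in EVAL_CASES if check(model(prompt)))
    return passed, len(EVAL_CASES)
```

Wire `run_evals` into the CI job that runs on any PR touching prompts, models, or pipeline code, and a regression becomes a red check instead of a support ticket.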

What to do this week

If you're in the middle of this transition, the highest-leverage work is usually:

  • Wrap your model calls in a typed client with schemas, fallbacks, and cost logging. One afternoon.
  • Add OpenTelemetry traces to your agent loop. One day.
  • Write ten eval cases that capture your product's core promise, and run them in CI. One week.

None of this is exotic. It's the same discipline we apply to any third-party API; the culture around LLM features just hasn't caught up to it yet.

If you're stuck turning a working prototype into something your team can actually operate, this is the exact engagement I help founders and product teams run. You can see how I work in my AI-to-Production Sprint, or reach out and we'll talk specifics.

Working through this yourself?

This is the kind of work I help founders and product teams run. Book a discovery call; if it's not a fit, I'll point you to someone it is.

Start a Conversation