Why MVPs that work in the demo break with real users

Oussama Ibrahim, Founder & Lead Engineer
June 11, 2026 · 4 min read

Key takeaway: A demo only ever runs the happy path with one well-behaved user. The gaps that break under real use, like an AI agent quietly broadening a vague request into a query that crosses a tenant boundary, survive to production unless someone reads the code with that exact case in mind.

An AI agent built into a product takes a user request, rewrites it as a database query, and returns the result. In the demo the user asks a clear question, the agent writes a clean query, the right rows come back. Everyone nods.

The interesting case is the next one. A user types something ambiguous. The agent, doing what language models do, broadens the request to be helpful. The query it writes no longer matches what the user could see by hand in the UI. If nothing underneath catches that, the response includes rows from another tenant.

This is the shape of bug that survives demos. It isn't a syntax mistake or a missing semicolon. The code is doing exactly what it was written to do.

What the demo actually proves

A demo proves the happy path runs, not that the system holds when users behave like users.

A demo is one developer or one founder clicking through a flow they designed. They type the prompts the agent was tuned on. They use the one account that exists. They never replay a webhook by hand, never rotate a session id, never write an ambiguous question to see what the model does with it.

Nothing about that is dishonest. It's just a narrow slice of the surface. Production is the rest of the surface: vague prompts, two users at once, retries, bored people poking at things.

The class of bug that worries me lives in that gap. It works in every test the team ran, because the team never ran the test that would expose it.

The agent that broadens the query

An AI agent helpfully rewrites an ambiguous request into a wider read, and without a tool layer that enforces tenant scope, another tenant's rows come back.

On a capital-markets analytics platform we worked on, the system exposes data to AI agents through a tool layer rather than letting the agent write raw SQL. That choice looks like an implementation detail until you watch what happens with an ambiguous prompt.

A user asks something vague about execution quality. The agent, trying to be useful, rewrites it as a broader read than the user could have issued by hand. The tool the agent calls doesn't take the agent's word for the scope. It injects the caller's tenant context server-side and scopes the query to that tenant before it runs. The broader read either narrows back to the caller's data or refuses.

The agent never sees rows it shouldn't. Not because the agent is well-behaved, but because the layer underneath it doesn't trust the agent's output to define the boundary.
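A minimal sketch of that idea, not the platform's actual code: the table name `executions`, the `CallerContext` type, and the `fetch_executions` tool are all hypothetical, and SQLite stands in for whatever database sits underneath. The point it illustrates is that the tenant filter comes from server-side context on every call, while everything the agent supplies is treated as untrusted input.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerContext:
    """Server-side identity for the current call (hypothetical type)."""
    tenant_id: str


class ScopeRefused(Exception):
    """Raised when the agent names a tenant other than the caller's."""


def fetch_executions(conn, ctx, agent_args):
    """Tool the agent calls. agent_args is untrusted model output."""
    requested = agent_args.get("tenant_id")
    # Refuse an explicit attempt to read another tenant.
    if requested is not None and requested != ctx.tenant_id:
        raise ScopeRefused("tenant scope is set by the server, not the agent")

    # The tenant filter is always present and always the caller's.
    sql = "SELECT id, tenant_id, symbol FROM executions WHERE tenant_id = ?"
    params = [ctx.tenant_id]
    symbol = agent_args.get("symbol")
    if symbol is not None:
        sql += " AND symbol = ?"
        params.append(symbol)
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()


def make_demo_db():
    """Two tenants in one in-memory table, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions "
        "(id INTEGER PRIMARY KEY, tenant_id TEXT, symbol TEXT)"
    )
    conn.executemany(
        "INSERT INTO executions (tenant_id, symbol) VALUES (?, ?)",
        [("acme", "AAPL"), ("acme", "MSFT"), ("globex", "AAPL")],
    )
    return conn
```

In this sketch, a broadened request with no filters at all, `fetch_executions(conn, CallerContext("acme"), {})`, narrows back to the caller's own rows, and a request that names another tenant raises `ScopeRefused`. Whether to narrow or refuse in a given case is a design choice; the sketch shows one of each.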

Why the bug survives to production

Scanners can't see it, tests don't trigger it, and the demo only runs the happy path, so it lives until someone reads the code with the misuse case in mind.

Three things keep this class of bug alive longer than it should.

Scanners don't see it. There's no string pattern to match. Semgrep, Trivy, and the usual tools look for known-bad shapes. This is a logic gap, not a vulnerability signature.
Tests don't cover it. Tests usually assert the happy path returns the right rows. They rarely assert that an attacker-shaped or model-shaped query returns nothing.
The demo doesn't run it. Nobody types a vague prompt twice in a row to see if the agent's interpretation drifts. Nobody opens two accounts and tries to read across.

The code keeps shipping, the feature keeps working, and the gap waits for the first real user who phrases things in a way the team didn't anticipate.
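The missing test does not have to be large. As a rough sketch, assuming a tool that takes the caller's tenant and a dict of agent-supplied arguments (both tool functions and the sample rows below are invented for illustration), a check like this feeds model-shaped inputs to the tool and collects any row that belongs to someone else:

```python
# Sample data for two tenants (illustrative only).
ROWS = [
    {"tenant_id": "acme", "symbol": "AAPL"},
    {"tenant_id": "globex", "symbol": "AAPL"},
]


def naive_tool(caller_tenant, agent_args):
    """Trusts the agent: filters by tenant only if the agent supplied one."""
    wanted = agent_args.get("tenant_id")
    return [r for r in ROWS if wanted is None or r["tenant_id"] == wanted]


def scoped_tool(caller_tenant, agent_args):
    """Ignores any agent-supplied tenant and filters on the caller's."""
    return [r for r in ROWS if r["tenant_id"] == caller_tenant]


# Model-shaped inputs: what a broadened or confused request might look like.
MISUSE_ARGS = [
    {},                       # no filter at all
    {"tenant_id": None},      # filter present but empty
    {"tenant_id": "globex"},  # names another tenant outright
]


def leaked_rows(tool, caller_tenant):
    """Return every row the tool handed back that belongs to another tenant."""
    leaks = []
    for args in MISUSE_ARGS:
        for row in tool(caller_tenant, args):
            if row["tenant_id"] != caller_tenant:
                leaks.append(row)
    return leaks
```

A happy-path test passes for both tools, because a well-formed request with the right tenant returns the right rows either way. Only the misuse check tells them apart: `leaked_rows(scoped_tool, "acme")` is empty, while `leaked_rows(naive_tool, "acme")` is not.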

Reading the code for the unhappy path

Designing the boundary so it fails closed under bad input is the part of the job a happy-path test never reaches.

The pattern that holds up in production isn't "trust the agent and hope." It's a layered one.

The tenant scope lives in the tool layer, not in the prompt. The model can ask for anything; the tool decides what it's allowed to return.
Tenant context is read from the verified token on each call, not passed in by the agent or the client.
Database-level row isolation backs the application logic, so a missed check fails closed instead of leaking.
The misuse case has a test. A vague prompt that should return nothing returns nothing.

None of this is exotic. It's the boring layered work that lets a system survive an ambiguous prompt the same way it survives a clean one.
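To make the second layer concrete, here is a small sketch of reading tenant context from a verified token and failing closed when verification fails. The token format (`tenant_id.signature`), the hard-coded secret, and every function name are assumptions made for the example; a real system would more likely use an established token format and a secret store.

```python
import hashlib
import hmac

SECRET = b"demo-only-secret"  # assumption: loaded from a secret store in practice


class Unauthenticated(Exception):
    """Raised when a token cannot be verified."""


def _mac(tenant_id: str) -> str:
    return hmac.new(SECRET, tenant_id.encode(), hashlib.sha256).hexdigest()


def sign(tenant_id: str) -> str:
    """Issue a demo token of the form '<tenant_id>.<hex signature>'."""
    return f"{tenant_id}.{_mac(tenant_id)}"


def tenant_from_token(token: str) -> str:
    """Return the tenant id only if the signature checks out."""
    tenant_id, sep, mac = token.rpartition(".")
    valid = (
        bool(sep)
        and bool(tenant_id)
        and hmac.compare_digest(mac.encode(), _mac(tenant_id).encode())
    )
    if not valid:
        # Fail closed: no tenant context means no data access at all.
        raise Unauthenticated("token failed verification")
    return tenant_id


def handle_tool_call(token: str, agent_args: dict) -> dict:
    """Build the scoped request; verification happens before anything else."""
    tenant_id = tenant_from_token(token)
    # Any tenant_id inside agent_args is dropped, never consulted.
    filters = {k: v for k, v in agent_args.items() if k != "tenant_id"}
    return {"tenant_id": tenant_id, "filters": filters}
```

With a token issued for one tenant, a call whose agent arguments name a different tenant still resolves to the token's tenant, and a forged or malformed token raises `Unauthenticated` before any query is built. Database-level row isolation would sit underneath this as a separate backstop and is not shown here.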

When this is worth reading for

If your product lets an AI agent touch data on behalf of a user, the boundary between what the user could ask for and what the agent ends up reading is the thing worth reading carefully. The demo will not show you whether that boundary holds.

That read is most of what an audit is for, especially before the first real customer with their own data sits down with your agent and starts typing the way real people type.

Frequently Asked Questions

Why do MVPs that pass demos still break with real users?

Because the demo only ever runs the happy path with one cooperative user. The cases that break, like a vague AI prompt rewritten into a broader read, or a webhook arriving twice, never get exercised by hand, so nobody notices the gap until real traffic hits it.

How do you stop an AI agent from leaking data across tenants?

Don't trust the agent's output. Put the tenant scope in the tool layer underneath it, so every query the agent issues is automatically filtered to the caller's tenant. If the agent asks for something broader, the tool refuses it instead of returning another tenant's rows.

Can a scanner catch this kind of bug?

Not really. There's no syntactic vulnerability to match. The code is doing exactly what it was written to do. Catching it requires reading the system with the misuse case in mind, which is what an audit is for.
