RuyaTech

Why MVPs that work in the demo break with real users

Oussama Ibrahim, Founder & Lead Engineer | 4 min read

Key takeaway: A demo only ever runs the happy path with one well-behaved user. The gaps that break under real use, like an AI agent quietly broadening a vague request into a query that crosses a tenant boundary, survive to production unless someone reads the code with that exact case in mind.

An AI agent built into a product takes a user request, rewrites it as a database query, and returns the result. In the demo the user asks a clear question, the agent writes a clean query, the right rows come back. Everyone nods.

The interesting case is the next one. A user types something ambiguous. The agent, doing what language models do, broadens the request to be helpful. The query it writes no longer matches what the user could see by hand in the UI. If nothing underneath catches that, the response includes rows from another tenant.

This is the shape of bug that survives demos. It isn't a syntax mistake or a missing semicolon. The code is doing exactly what it was written to do.

What the demo actually proves

A demo proves the happy path runs, not that the system holds when users behave like users.

A demo is one developer or one founder clicking through a flow they designed. They type the prompts the agent was tuned on. They use the one account that exists. They never replay a webhook by hand, never rotate a session id, never write an ambiguous question to see what the model does with it.

Nothing about that is dishonest. It's just a narrow slice of the surface. Production is the rest of the surface: vague prompts, two users at once, retries, bored people poking at things.

The class of bug that worries me lives in that gap. It works in every test the team ran, because the team never ran the test that would expose it.

The agent that broadens the query

An AI agent helpfully rewrites an ambiguous request into a wider read, and without a tool layer that enforces tenant scope, another tenant's rows come back.

On a capital-markets analytics platform we worked on, the system exposes data to AI agents through a tool layer rather than letting the agent write raw SQL. That choice looks like an implementation detail until you watch what happens with an ambiguous prompt.

A user asks something vague about execution quality. The agent, trying to be useful, rewrites it as a broader read than the user could have issued by hand. The tool the agent calls doesn't take the agent's word for the scope. It injects the caller's tenant context server-side and scopes the query to that tenant before it runs. The broader read is either narrowed back to the caller's data or refused.

The agent never sees rows it shouldn't. Not because the agent is well-behaved, but because the layer underneath it doesn't trust the agent's output to define the boundary.
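A minimal sketch of that idea, not the platform's actual code. It assumes a SQLite table named executions and a tool function named execution_quality_tool; CallerContext, the column names, and the sample tenants are all hypothetical. The point it illustrates is that the tenant filter comes from the server-side caller context, and anything the agent supplies about scope is ignored.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerContext:
    """Tenant identity resolved server-side (hypothetical type)."""
    tenant_id: str


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table holding rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE executions (tenant_id TEXT, symbol TEXT, slippage_bps REAL)"
    )
    conn.executemany(
        "INSERT INTO executions VALUES (?, ?, ?)",
        [("acme", "AAPL", 1.2), ("acme", "MSFT", 0.8), ("globex", "AAPL", 2.5)],
    )
    return conn


def execution_quality_tool(
    conn: sqlite3.Connection, caller: CallerContext, agent_args: dict
) -> list:
    """Tool exposed to the agent. The agent controls filters, never the scope."""
    # Only an allow-listed filter is read from the agent's arguments.
    # A "tenant_id" key in agent_args is deliberately never looked at.
    symbol = agent_args.get("symbol")

    # The tenant predicate is always present and always bound to the caller.
    sql = "SELECT tenant_id, symbol, slippage_bps FROM executions WHERE tenant_id = ?"
    params = [caller.tenant_id]
    if symbol is not None:
        sql += " AND symbol = ?"
        params.append(symbol)
    sql += " ORDER BY symbol"
    return conn.execute(sql, params).fetchall()


# The agent broadened the request and named another tenant; the tool ignores it.
conn = make_demo_db()
rows = execution_quality_tool(conn, CallerContext("acme"), {"tenant_id": "globex"})
```

In this sketch the broadened request narrows back to the caller's rows. A stricter variant could raise an error whenever the agent's arguments mention a scope at all, which corresponds to the "refuse" branch described above.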

Why the bug survives to production

Scanners can't see it, tests don't trigger it, and the demo only runs the happy path, so it lives until someone reads the code with the misuse case in mind.

Three things keep this class of bug alive longer than it should.

  1. Scanners don't see it. There's no string pattern to match. Semgrep, Trivy, and the usual tools look for known-bad shapes. This is a logic gap, not a vulnerability signature.
  2. Tests don't cover it. Tests usually assert the happy path returns the right rows. They rarely assert that an attacker-shaped or model-shaped query returns nothing.
  3. The demo doesn't run it. Nobody types a vague prompt twice in a row to see if the agent's interpretation drifts. Nobody opens two accounts and tries to read across.
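The test that usually goes missing can be sketched like this. It is an illustration only: read_orders and the ROWS dictionary are hypothetical stand-ins for a real tool and data store. The first test is the one most suites already have; the other two assert that a model-shaped or attacker-shaped request returns nothing.

```python
import unittest

# Hypothetical stand-in for the real data store: rows keyed by tenant.
ROWS = {
    "acme": ["acme-order-1", "acme-order-2"],
    "globex": ["globex-order-1"],
}


def read_orders(caller_tenant: str, requested_tenants: list) -> list:
    """Hypothetical tool: fails closed if the request names any other tenant."""
    if any(t != caller_tenant for t in requested_tenants):
        return []
    return list(ROWS.get(caller_tenant, []))


class TenantBoundaryTests(unittest.TestCase):
    def test_happy_path_returns_own_rows(self):
        self.assertEqual(
            read_orders("acme", ["acme"]), ["acme-order-1", "acme-order-2"]
        )

    def test_broadened_query_returns_nothing(self):
        # Model-shaped misuse: the agent widened the read to include another tenant.
        self.assertEqual(read_orders("acme", ["acme", "globex"]), [])

    def test_cross_tenant_read_returns_nothing(self):
        # Attacker-shaped misuse: a direct request for someone else's data.
        self.assertEqual(read_orders("acme", ["globex"]), [])
```

Saved as a module, this runs with python -m unittest. Against a real system the same three assertions would go through the actual tool layer with two seeded tenant accounts.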

The code keeps shipping, the feature keeps working, and the gap waits for the first real user who phrases things in a way the team didn't anticipate.

Reading the code for the unhappy path

Designing the boundary so it fails closed under bad input is the part of the job a happy-path test never reaches.

The pattern that holds up in production isn't "trust the agent and hope." It's a layered one.

  • The tenant scope lives in the tool layer, not in the prompt. The model can ask for anything; the tool decides what it's allowed to return.
  • Tenant context is read from the verified token on each call, not passed in by the agent or the client.
  • Database-level row isolation backs the application logic, so a missed check fails closed instead of leaking.
  • The misuse case has a test. A vague prompt that should return nothing returns nothing.
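The second point in the list above, reading tenant context from the verified token, can be sketched as follows. This is a simplified illustration under stated assumptions: the token format (base64 payload, a dot, an HMAC-SHA256 signature) is a stand-in for whatever signed token a real system uses, and SECRET, sign_token, tenant_from_token, and resolve_scope are hypothetical names.

```python
import base64
import hashlib
import hmac
import json

# Assumption: a server-side signing key. Hard-coded here only for the demo.
SECRET = b"demo-secret-not-for-production"


def sign_token(claims: dict) -> str:
    """Issue a demo token: urlsafe-base64 JSON payload, '.', hex HMAC-SHA256."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig


def tenant_from_token(token: str) -> str:
    """Return the tenant_id claim, or raise PermissionError if the token is invalid."""
    try:
        payload, sig = token.rsplit(".", 1)
    except ValueError:
        raise PermissionError("malformed token")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; claims are parsed only after the signature checks out.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    tenant_id = claims.get("tenant_id")
    if not tenant_id:
        raise PermissionError("token carries no tenant")
    return tenant_id


def resolve_scope(token: str, agent_args: dict) -> str:
    """Scope for one tool call. Anything the agent says about tenancy is ignored."""
    return tenant_from_token(token)


token = sign_token({"tenant_id": "acme"})
scope = resolve_scope(token, {"tenant_id": "globex"})  # scope is "acme"
```

Resolving the scope on each call, rather than caching it in the conversation or the prompt, means a forged or swapped payload fails verification before any query is built.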

None of this is exotic. It's the boring layered work that lets a system survive an ambiguous prompt the same way it survives a clean one.

When this is worth reading for

If your product lets an AI agent touch data on behalf of a user, the boundary between what the user could ask for and what the agent ends up reading is the thing worth reading carefully. The demo will not show you whether that boundary holds.

That read is most of what an audit is for, especially before the first real customer with their own data sits down with your agent and starts typing the way real people type.

Frequently Asked Questions

Why do MVPs that pass demos still break with real users?

Because the demo only ever runs the happy path with one cooperative user. The cases that break, like a vague AI prompt rewritten into a broader read, or a webhook arriving twice, never get exercised by hand, so nobody notices the gap until real traffic hits it.

How do you stop an AI agent from leaking data across tenants?

Don't trust the agent's output. Put the tenant scope in the tool layer underneath it, so every query the agent issues is automatically filtered to the caller's tenant. If the agent asks for something broader, the tool narrows it back to the caller's data or refuses it instead of returning another tenant's rows.

Can a scanner catch this kind of bug?

Not really. There's no syntactic vulnerability to match. The code is doing exactly what it was written to do. Catching it requires reading the system with the misuse case in mind, which is what an audit is for.
