How do you cap spend on an AI endpoint?

Keep a counter table in the database, incremented on each call and checked before the next, and have the endpoint fail closed when a daily ceiling is crossed. When the cap is hit, the endpoint returns a refusal instead of a charge.

Do internal AI features need the same protection?

Yes. Any logged-in feature that triggers a paid model call with free text, like a rewrite tool or a summarise button, needs the same counter table, daily cap, and alert. A compromised account or a curious intern with a loop is enough.

How to Rate Limit a Paid AI Endpoint and Cap Spend

Why isn't a rate limit enough to protect a public AI endpoint?

Because the limit is often keyed on a session id the browser mints, which the client can rotate. A loop that clears the value between calls gets a fresh allowance every time. You need a second limiter on something the client can't freely rotate, plus a server-side daily spend cap.

Key takeaway: A public endpoint that calls a paid AI model on every request is a financial control, not a product decision. The usual rate limit is keyed on a browser-minted session id the client can rotate, so it's decorative. Real protection is a second limiter the client can't rotate (an IP-based window), a server-side daily spend cap that fails closed, validated input sizes, and a spend alert.

A public chatbot on a marketing site looks like a feature. Underneath, every message is a paid call to a language model. That makes the endpoint a spending decision the team made without quite noticing, and the bill shows up later.

The shape of the bug

A public, no-login route calls a paid model and rate-limits on a browser-minted session id, which the client can rotate to get a fresh allowance every time.

The pattern is almost always the same. There is a public route, no login required, that takes a list of messages and calls a model. The team knows uncapped spend is a risk, so they add a rate limit. They key the limit on a session id. That session id is minted by the browser when the page loads.

That is where the protection ends.

A session id the client mints is a session id the client can rotate. A loop that clears the value between calls gets a fresh allowance every time. There is no IP gate behind it, no global throttle, no daily spend ceiling. The model's cap on output tokens limits the length of one response, not the number of responses.

A modest script pointed at that endpoint overnight is a four-figure invoice nobody approved.
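To make the rotation concrete, here is a minimal sketch of a limiter keyed on a client-supplied session id, with one honest client and one rotating client run against it. The class name, the function names, and the allowance of five requests are illustrative assumptions, not taken from any particular codebase.

```python
import uuid
from collections import defaultdict

LIMIT_PER_SESSION = 5  # hypothetical allowance per session id


class SessionKeyedLimiter:
    """Counts requests per session id; the id is whatever the client sends."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)  # session id -> accepted requests

    def allow(self, session_id):
        if self.counts[session_id] >= self.limit:
            return False
        self.counts[session_id] += 1
        return True


def honest_client(limiter, attempts):
    """One browser tab: the same session id on every request."""
    session_id = str(uuid.uuid4())
    return sum(limiter.allow(session_id) for _ in range(attempts))


def rotating_client(limiter, attempts):
    """A script that mints a fresh session id before every request."""
    return sum(limiter.allow(str(uuid.uuid4())) for _ in range(attempts))
```

With 100 attempts each, the honest client gets 5 requests through and the rotating client gets all 100, because the limiter never sees the same key twice.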

Why this slips through review

In every demo the same browser tab makes every request, so the attacker-controlled limit key looks like it works; nobody clicks through the case where a script rotates the id.

Nobody decided to ship an open spend faucet. The route was added in a sprint where the work was making the chatbot answer well. Someone remembered to add a rate limit, which felt like the safety. The fact that the key behind the limit was attacker-controlled was not on anyone's mind because in every demo, the same browser tab made every request, and the limiter looked like it worked.

This is the general shape of how things break in fast-built apps. The demo only ever runs the happy path with one user. The endpoint is exercised the way a normal customer would use it, and it behaves. The cases where it falls over are the ones nobody clicks through by hand. A script that rotates a uuid is not exotic. It is the first thing an automated abuser tries.

What to actually check

Confirm the limit isn't keyed on something the client controls, add a second limiter and a server-side daily spend cap that fails closed, validate input sizes, and wire a spend alert.

When we read a codebase for production readiness, the questions on a paid AI endpoint are concrete.

What is the rate limit keyed on, and can the client change that value at will? If yes, the limit is decorative.
Is there a second limiter keyed on something the client cannot freely rotate? An IP-based sliding window is the usual second layer. It is not perfect, but it raises the cost of abuse from free to noticeable.
Is there a hard ceiling on total spend per day, enforced server-side, that fails closed when crossed? A counter table in the database, incremented on each call, checked before the next call. When the cap is hit, the endpoint returns a refusal instead of a charge.
Is the input shape validated? An array of messages with no length cap, no per-message size cap, and no role check is a way to make every single call as expensive as the model will allow. A four-thousand-character cap on content and a forty-element cap on the array close that hole.
Is there an alert? A simple one, looking at the model provider's daily spend and firing when it crosses three times the prior day's baseline. The alert does not stop the bleed on its own. It tells you the cap was wrong before the invoice does.
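The second limiter and the input check from the questions above might look like the sketch below. It is an in-memory illustration under stated assumptions: the class and function names are hypothetical, the set of allowed roles is a guess at what a public chatbot should accept, and the key passed to the limiter is assumed to be the client IP as seen by the server or a trusted proxy, not a header the client can set freely.

```python
import time
from collections import defaultdict, deque

MAX_MESSAGES = 40          # array cap from the text
MAX_CONTENT_CHARS = 4000   # per-message cap from the text
ALLOWED_ROLES = {"user", "assistant"}  # assumption: clients may not send "system"


class SlidingWindowLimiter:
    """Allows at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have slid out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


def validate_messages(messages):
    """Return an error string for a bad payload, or None if it is acceptable."""
    if not isinstance(messages, list) or not messages:
        return "messages must be a non-empty list"
    if len(messages) > MAX_MESSAGES:
        return "too many messages"
    for message in messages:
        if not isinstance(message, dict):
            return "each message must be an object"
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            return "unexpected role"
        content = message.get("content")
        if not isinstance(content, str) or len(content) > MAX_CONTENT_CHARS:
            return "content missing or too long"
    return None
```

A real deployment would likely keep the window state in a shared store so that every server instance sees the same counts, and would evict idle keys so the table does not grow without bound. Both are left out here to keep the sketch short.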

None of this is exotic. It is the boring layer underneath the feature, the layer that decides whether the feature is a product or a liability.
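As one illustration of that boring layer, the daily ceiling and the alert threshold could be sketched as follows. This is a sketch under assumptions, not a definitive implementation: it uses SQLite from the Python standard library as a stand-in for the application database, it counts calls as a proxy for spend, and the table name `ai_calls`, the cap of 1000, and the `call_model` callable are all hypothetical. The check and the increment are folded into a single `UPDATE` so two concurrent requests cannot both slip under the cap.

```python
import sqlite3
from datetime import date

DAILY_CALL_CAP = 1000   # hypothetical ceiling; derive it from your own budget
ALERT_MULTIPLIER = 3    # from the text: fire at three times the prior day


def open_counter_db(path=":memory:"):
    """Open the database and make sure the counter table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ai_calls "
        "(day TEXT PRIMARY KEY, calls INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def try_reserve_call(conn, cap=DAILY_CALL_CAP, today=None):
    """Reserve one call for today; False means refuse without calling the model."""
    day = (today or date.today()).isoformat()
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR IGNORE INTO ai_calls (day, calls) VALUES (?, 0)", (day,)
            )
            cur = conn.execute(
                "UPDATE ai_calls SET calls = calls + 1 WHERE day = ? AND calls < ?",
                (day, cap),
            )
            return cur.rowcount == 1
    except sqlite3.Error:
        # Fail closed: if the counter cannot be read or written, refuse.
        return False


def handle_chat(conn, call_model, messages, cap=DAILY_CALL_CAP):
    """Hypothetical handler: check the cap first, pay for the model call second."""
    if not try_reserve_call(conn, cap=cap):
        return {"status": 429, "error": "daily limit reached"}
    return {"status": 200, "reply": call_model(messages)}


def should_alert(spend_today, spend_yesterday, multiplier=ALERT_MULTIPLIER):
    """True when today's provider spend exceeds `multiplier` times yesterday's.

    With a baseline of zero, any spend at all triggers the alert.
    """
    return spend_today > multiplier * spend_yesterday
```

Reserving the call before the model is invoked means a failed model call still counts against the cap. That errs on the side of refusing too early, which is the direction a spend ceiling should err in.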

The internal version of the same bug

The same exposure lives behind login wherever a user can trigger a paid model call with free text, and "staff wouldn't abuse it" stops being true the first compromised account.

The public chatbot is the obvious case. The same shape lives inside the dashboard too, anywhere a logged-in user can trigger a paid model call with free text. A rewrite tool, a summarise button, a draft generator. The limit is often nothing at all, because the assumption is that staff would not abuse it. The assumption is usually true. It is also irrelevant the first time a compromised account, or a curious intern with a loop, runs it for a few hours.

The fix is the same. Counter table, daily cap, alert. The cap can be high. It just has to exist.

The broader point

Treat a paid API call on a public route like a refund button or a wire transfer: it's a financial control that needs a ceiling and an audit trail.

A paid third-party API call on a public route is a financial control, not a product decision. It belongs in the same mental category as a refund button or a wire transfer. You would not ship those without a ceiling and an audit trail. The AI endpoint deserves the same caution, even when the feature it powers is small.

If you have a chatbot, a rewrite tool, or any public endpoint that calls a paid model, half an hour with the code answering the questions above is worth doing before launch. If you are not sure what is in front of you, that read is most of what an audit is for, and one of the checks a security team runs before your first enterprise customer.

