A team uploads a spreadsheet of a few hundred payments. The first run looks clean. A week later, in a reconciliation meeting, somebody notices a handful of customers were charged twice.
The charges weren't the bug. The retried webhook was.
That gap, between a payment feature that works in a demo and a payment system that survives volume, is the one we keep finding when we read codebases that process money. It's almost never the successful charge that goes wrong. It's the retried event, the misread decline, and the gateway that went down at 2am with no fallback.
This is what designing for those events actually looks like, drawn from a payment orchestration engine we built that runs $1M+ in volume every month.
Key takeaway: Bulk payment systems rarely fail on the successful charge. They fail on the events nobody tested by hand: retried webhooks that double-charge, declines treated as one category, a single gateway going down, and settlements stuck in limbo. Designing for those four cases costs the happy path nothing and stops money from moving the wrong way at volume.
| Failure mode | Why it happens | The fix |
|---|---|---|
| Retried webhooks | Gateways redeliver until they get a clean response | Idempotency keyed on the provider's event id |
| Misread declines | Soft declines, no funds, expired cards, config errors all look the same | Classify before retrying; route, reduce, or surface for review |
| Single gateway | One processor is a single point of failure | Unified adapter with an automatic fallback processor |
| Stuck settlements | A started payment never returns a final status | Cron that force-resolves past a configurable timeout |
Webhooks arrive more than once
Gateways redeliver the same webhook until they get a clean response, so the handler must be idempotent on the provider's event id and reject anything unsigned.
Stripe, NMI, and most reputable gateways will redeliver a webhook if they don't get a quick clean response. That's the right behaviour on their end, because the alternative is dropping events. The cost on your end is that the same event can land in your handler two or three times.
If the handler treats each delivery as new, one payment becomes two ledger entries, two emails, two payouts to reconcile. The fix is boring and well known: idempotency keyed on the provider's own event id. Log the id the first time you see it, no-op every time after.
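A minimal sketch of that log-then-no-op pattern, assuming a SQL store with a uniqueness constraint on the event id. The table names, `make_store`, and `handle_event` are hypothetical, and SQLite stands in for whatever database the ledger actually lives in:

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    # In-memory database for illustration only.
    conn = sqlite3.connect(":memory:")
    # The PRIMARY KEY is what enforces "seen once": a second insert of the same id fails.
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (event_id TEXT, amount_cents INTEGER)")
    return conn


def handle_event(conn: sqlite3.Connection, event_id: str, amount_cents: int) -> bool:
    """Apply the event once. Returns True if applied, False if it was a redelivery."""
    try:
        # One transaction: claiming the id and writing the ledger row commit together,
        # and both roll back if either statement fails.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "INSERT INTO ledger (event_id, amount_cents) VALUES (?, ?)",
                (event_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        # Duplicate event id: this delivery was already handled, so do nothing.
        return False
    return True
```

Claiming the id and writing the ledger entry in the same transaction is the design choice that matters here. A separate "have I seen this?" read followed by a write leaves a window in which two concurrent deliveries both pass the check.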
HMAC verification is just as basic. Every inbound webhook is signature-checked before anything else runs, and anything unsigned, or signed with a value that doesn't match, is rejected. Doing the signature check after the business logic is one of the more common findings we surface in payment code, because it works fine until somebody points a script at the endpoint.
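As a rough illustration of checking first and processing second, here is a sketch that assumes a plain HMAC-SHA256 over the raw request body, delivered as a hex string. Each gateway defines its own signing scheme (header name, timestamp handling, encoding), so the scheme here and the names `is_valid_signature` and `receive_webhook` are assumptions, not any provider's API:

```python
import hashlib
import hmac
from typing import Callable, Optional


def is_valid_signature(secret: bytes, raw_body: bytes, signature_hex: Optional[str]) -> bool:
    # A missing signature is rejected outright.
    if not signature_hex:
        return False
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the comparison does not leak
    # how many leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def receive_webhook(
    secret: bytes,
    raw_body: bytes,
    signature_hex: Optional[str],
    process: Callable[[bytes], None],
) -> int:
    """Return an HTTP-style status code; business logic runs only after the check."""
    if not is_valid_signature(secret, raw_body, signature_hex):
        return 401
    process(raw_body)
    return 200
```

The signature is computed over the raw bytes as received. Parsing the JSON and re-serializing it before verifying can change the bytes and make valid signatures fail.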
Declines are not one thing
Soft declines, no funds, expired cards, and config errors all return as a decline, so classify each before retrying instead of burning retries that were never going to land.
A declined card looks like a single event, but it isn't. An issuer soft decline, insufficient funds, an expired card, and a fatal config error on the merchant account all return as declines, and each one needs a different strategy.
Treating them as a single category is how teams quietly burn through retries that were never going to succeed, get flagged for excessive declines, and eventually have their processing rates rerated.
Classifying declines before retrying means the system can do something sensible with each one: switch to the secondary gateway for a routing problem, retry at a reduced amount where the rules allow it, or stop and surface the card for manual review. The classification logic is small. The difference it makes to authorization rates and downstream reputation is large.
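To give a sense of how small that logic can be, here is a sketch under stated assumptions: the normalized decline codes are hypothetical (each gateway's real response codes would be mapped to them in its adapter), and the mapping from decline kind to action is illustrative rather than the engine's actual rules:

```python
from enum import Enum


class DeclineKind(Enum):
    SOFT = "soft"
    INSUFFICIENT_FUNDS = "insufficient_funds"
    EXPIRED_CARD = "expired_card"
    CONFIG_ERROR = "config_error"
    UNKNOWN = "unknown"


class Action(Enum):
    ROUTE_TO_FALLBACK = "route_to_fallback"
    RETRY_REDUCED_AMOUNT = "retry_reduced_amount"
    MANUAL_REVIEW = "manual_review"


# Hypothetical normalized codes, not any gateway's real response codes.
CODE_TO_KIND = {
    "issuer_soft_decline": DeclineKind.SOFT,
    "insufficient_funds": DeclineKind.INSUFFICIENT_FUNDS,
    "expired_card": DeclineKind.EXPIRED_CARD,
    "merchant_config_error": DeclineKind.CONFIG_ERROR,
}


def classify(code: str) -> DeclineKind:
    # Anything unrecognized is UNKNOWN rather than silently treated as retryable.
    return CODE_TO_KIND.get(code, DeclineKind.UNKNOWN)


def next_action(kind: DeclineKind, reduced_amount_allowed: bool) -> Action:
    if kind is DeclineKind.SOFT:
        return Action.ROUTE_TO_FALLBACK
    if kind is DeclineKind.INSUFFICIENT_FUNDS and reduced_amount_allowed:
        return Action.RETRY_REDUCED_AMOUNT
    # Expired cards, config errors, and unknown codes will not succeed on a blind retry.
    return Action.MANUAL_REVIEW
```

Defaulting unknown codes to review instead of retry is the conservative choice: a new or unmapped decline reason costs an operator a look, not a string of failed attempts.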
One gateway is a single point of failure
A single processor will go down eventually, so route through a unified adapter with an automatic fallback rather than a runbook someone has to trigger at 3am.
A primary processor going down is rare on any given day and inevitable across a year. If a batch of a few hundred payments depends on one gateway being up, the day it goes down a lot of money just stops moving until somebody notices.
The pattern that holds up is a unified adapter with more than one processor behind it. On the engine we run, NMI is the primary and ConvesioPay sits behind the same interface as a fallback. Identical retry and failover logic works across either, so the choice of provider is a routing decision, not a code change.
Failover only matters if it's automatic. A runbook that requires a human to flip a flag at 3am is not failover, it's an incident.
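The adapter-plus-fallback shape can be sketched roughly as follows. The `Gateway` interface, the `GatewayUnavailable` exception, and `charge_with_failover` are hypothetical names for illustration; the real NMI and ConvesioPay integrations are not shown and would each sit behind an adapter of this shape:

```python
from typing import Optional, Protocol, Sequence, Tuple


class GatewayUnavailable(Exception):
    """Raised by an adapter when the processor could not be reached at all."""


class Gateway(Protocol):
    name: str

    def charge(self, amount_cents: int, token: str) -> str:
        """Return a transaction id, or raise GatewayUnavailable."""
        ...


def charge_with_failover(
    gateways: Sequence[Gateway], amount_cents: int, token: str
) -> Tuple[str, str]:
    """Try each gateway in priority order; return (gateway name, transaction id)."""
    last_error: Optional[Exception] = None
    for gateway in gateways:
        try:
            return gateway.name, gateway.charge(amount_cents, token)
        except GatewayUnavailable as exc:
            # Only an outage moves us to the next processor. Declines and other
            # errors propagate to the caller untouched.
            last_error = exc
    raise GatewayUnavailable("all gateways unavailable") from last_error
```

One caveat worth stating: an adapter should raise the outage error only when it knows the charge did not go through, such as a refused connection. A timeout after the request was sent is ambiguous, and failing over on it risks charging the card on both processors.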
Stuck settlements are worse than failed ones
A payment with no clear final status eats operator time and trust, so a cron should force-resolve anything stuck past a timeout against the gateway's authoritative state.
The worst state for a payment is the one nobody's sure about. A settlement that started, didn't return a clear final status, and is now sitting in some pending limbo will eat hours of operator time and lose trust with the customer faster than a clean decline ever would.
A simple cron that force-resolves settlements stuck past a configurable timeout, by checking the gateway's authoritative state and writing a final answer, keeps that pool from growing. It's the kind of thing nobody specs at the start because nothing is stuck on day one. By month six it's the difference between a clean dashboard and a backlog.
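The body of such a job might look something like this sketch. The status names, the `Settlement` record, and the `fetch_gateway_status` callback are all assumptions made for illustration; the scheduler that invokes it and the persistence layer are left out:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List

FINAL_STATUSES = {"settled", "failed"}  # hypothetical status names


@dataclass
class Settlement:
    id: str
    status: str
    started_at: datetime


def resolve_stuck(
    settlements: Iterable[Settlement],
    fetch_gateway_status: Callable[[str], str],
    timeout: timedelta,
    now: datetime,
) -> List[Settlement]:
    """Write the gateway's final answer onto settlements pending longer than timeout."""
    resolved = []
    for settlement in settlements:
        if settlement.status != "pending":
            continue
        if now - settlement.started_at < timeout:
            continue  # still inside the normal settlement window
        # The gateway, not the local record, is the authority on what happened.
        authoritative = fetch_gateway_status(settlement.id)
        if authoritative in FINAL_STATUSES:
            settlement.status = authoritative
            resolved.append(settlement)
        # Assumption: if the gateway also reports a non-final state, the record
        # is left alone for the next run or for escalation.
    return resolved
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the timeout logic easy to test with fixed timestamps.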
What this changes about the happy path
Nothing. Designing for retries, declines, and outages doesn't slow the charge that works; it only stops the unplanned events from moving money the wrong way.
That's the point. The charge that works gets no slower, and the events you didn't plan for stop moving money the wrong way.
The reason these gaps survive into production is consistent. The demo only ever ran the successful path. Nobody triggered a duplicate webhook by hand, nobody simulated a gateway outage, nobody fed in a batch where 40 percent of the cards would decline for four different reasons. So the code that handles those cases was never read with the right question in mind.
That read, what happens when the bad path runs, is most of what an audit of a payment system actually is. If you're scaling recurring or batch payments and aren't sure your handlers are idempotent, your declines are classified, and your failover is real, that's the moment to get a set of eyes on the code before volume does it for you.
Frequently Asked Questions
Why do bulk payment systems double-charge customers?
Usually because of retried webhooks. Gateways redeliver a webhook if they don't get a quick clean response, so the same event can hit your handler two or three times. Without idempotency keyed on the provider's event id, one payment becomes two ledger entries and two charges.
Should webhook signatures be verified before or after business logic?
Before. Every inbound webhook should be HMAC signature-checked before anything else runs, and anything unsigned rejected. Checking after the business logic works fine until someone points a script at the endpoint.
What makes payment failover real?
It has to be automatic. In practice that means a unified adapter with more than one processor behind it, sharing identical retry and failover logic, so that switching providers is a routing decision rather than a code change. A runbook that needs a human to flip a flag at 3am is not failover, it's an incident.
