The 3 a.m. incident: how one contact form field nearly took down our on-call rotation

April 5, 2026 · 8 min read

Dimly lit server room — middle of the night on-call incident

This is a story from before FormTo existed. I was running infrastructure at a small SaaS company — maybe 20 engineers, a single on-call rotation that changed hands weekly, a healthy fear of Tuesday nights because Tuesday nights are statistically when things break.

What I'm about to describe is not the worst outage I've ever dealt with. It's the dumbest. I'm writing it down because every piece of how FormTo works today is a reaction to something that happened this specific night, and because I think it's a useful ghost story for anyone running a public form endpoint they wrote themselves.

2:47 am — the first page

The on-call pager went off. I was not on call. I was asleep. The on-call engineer that week was a colleague I'll call D, who was two months into the job and had never carried the pager on a bad night before.

The alert said: "Form submission backlog > 5,000 pending."

This was not a metric we normally looked at. We had a modest contact form on the main marketing site. Submissions arrived maybe 30 times a day. The queue depth alert was set to trigger at 5,000 because nobody believed it would ever fire, and the threshold felt like a safe "if this ever happens something is very wrong" number.

It was firing. It was going to keep firing.

2:51 am — D opens Grafana

D called me. I had carried the pager the week before he took over, so I was the fallback. He said, very calmly: "I think something's wrong with the contact form." I said, very calmly: "How wrong?"

The Grafana dashboard showed incoming POST requests to the form endpoint spiking from the normal baseline of ~2 per minute to something that looked like a heart attack. A vertical line. At the peak, about 400 requests per second. Not enormous, but well outside what our plumbing had ever handled.

The worker queue was filling faster than it could drain. Every submission was fanning out to:

  1. Insert into a Postgres table
  2. Send an email via our transactional provider
  3. Call a webhook to our CRM
  4. Log to our analytics pipeline

None of those were designed for bursty traffic. The Postgres insert was fine. The email provider started rate-limiting us at around request 800 and silently queued the overflow. The CRM webhook returned 200 for the first few thousand and then started returning 503 because our account had exceeded its plan's webhook quota for the month in about eleven minutes. The analytics logger, heroically, kept working.
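
To make the shape of the problem concrete, here's roughly what the handler looked like (a reconstruction with stand-in names, not our real code):

    # A reconstruction of the synchronous fan-out, with stub functions
    # standing in for the real integrations. Every downstream call ran
    # inline on the request path, so one slow or rate-limited dependency
    # backed up the entire queue.

    def insert_into_postgres(form: dict) -> None: ...      # held up fine
    def send_transactional_email(form: dict) -> None: ...  # rate-limited us around request 800
    def post_crm_webhook(form: dict) -> None: ...          # 503s once the quota blew
    def log_to_analytics(form: dict) -> None: ...          # kept working

    def handle_submission(form: dict) -> None:
        # No timeouts, no retries, and no queue between us and the slow parts.
        insert_into_postgres(form)
        send_transactional_email(form)
        post_crm_webhook(form)
        log_to_analytics(form)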

2:58 am — the log dive

I opened a second laptop and started tailing our form endpoint logs. What I saw was a lot of submissions from a lot of different IPs, each with a small, rotating set of field values. Not identical payloads. Varied ones. Real-looking names, real-looking email addresses, weird messages in a mix of languages.

It looked like human traffic. But the timing was wrong — none of the irregular gaps you'd see between real human submissions, and none of the variance in the headers you'd expect from real browsers.

The piece that confused me for about twenty minutes was this: our honeypot field was empty in every submission. We had a hidden input called url that real humans never filled in. Bots usually filled it. These submissions weren't filling it.

I stared at this for a while. A real attacker who bothered to leave the honeypot blank? That implied they'd read our HTML. Which meant they were targeting us, not spraying every form on the internet.
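
If you haven't run into the pattern: a honeypot is a field that's hidden from humans by CSS but visible to anything that parses the HTML. Ours looked roughly like this — the field name url is real, everything else in this sketch is illustrative:

    # Sketch of the honeypot check we were leaning on. The form ships a
    # visually hidden input, something like:
    #   <input type="text" name="url" style="display:none" tabindex="-1">
    # Humans never see it, naive bots auto-fill every field, and anyone
    # who actually reads the HTML can simply leave it blank.

    HONEYPOT_FIELD = "url"

    def passes_honeypot(form: dict) -> bool:
        # Accept only when the trap field is absent or empty.
        return not form.get(HONEYPOT_FIELD)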

3:14 am — the lightbulb

D was now reading the same logs on his screen. He asked, half-awake: "What's in the referrer header on these?"

I hadn't looked. I filtered for the Referer field and saw that almost every submission had a referrer of https://ourcompany.com/contact — which was our actual contact page. This wasn't a random scan. Whoever was doing this was submitting from a real browser session, or at least a very convincing fake of one.

Then D said: "What if they're using our own form to attack us?"

He was right.

Here's what had happened, as best we could reconstruct it. A few days earlier we had added a new hidden input to the form — a CSRF-like token — to make a different category of spam harder. One of our engineers, in a rush, had made a small mistake: the token was generated on the server and embedded in the form, as intended, but it was also echoed into the submission confirmation page as a debug query parameter. End users never saw it, but it sat in the HTML source of the confirmation page for anyone who viewed source.

Someone had noticed. They were scraping the confirmation page to get the token, then submitting the form with a valid token from a pool of residential proxies. From our endpoint's perspective, every submission looked legitimate. They had defeated our cheap filters by reading our HTML.
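
Reconstructed in code, the bug class looks something like this (all names invented; the real code was more tangled):

    # The shape of the mistake, with made-up names. Minting the token and
    # embedding it in the form was fine; echoing it back into a URL that
    # appears in the confirmation page's HTML was the leak.

    import secrets

    def render_form_token() -> str:
        token = secrets.token_hex(16)   # generated server-side, validated on submit
        return f'<input type="hidden" name="form_token" value="{token}">'

    def confirmation_url(token: str) -> str:
        # The leak: a "debug" query parameter carrying a still-valid token,
        # sitting in the page source for anyone who scrapes it.
        return f"/contact/thanks?debug_token={token}"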

The attacker wasn't being clever exactly — this is a pretty basic recon pattern — they were just being slightly more thoughtful than our defenses.

3:31 am — the stop-gap

We didn't have a way to stop them cleanly. We had rate limiting, but the attacker was rotating through enough IPs that the limits didn't trigger fast enough. We had basic IP reputation, but residential proxies punched through it. Our WAF rules were too broad to turn up without also blocking real customers.

The stop-gap was embarrassing: we changed the form's action URL, published a new version of the contact page, and let the old endpoint absorb the rest of the attack while returning a hard 410 to everything. The attacker kept hitting the old URL for another two hours, got a 410 every time, and eventually stopped.
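
For the curious, the 410 wall is a one-liner in most frameworks. A sketch with Flask, path invented:

    # Hypothetical sketch of the stop-gap: the retired endpoint answers
    # everything with a hard 410 Gone, which is nearly free to serve and
    # tells any well-behaved client to stop retrying for good.

    from flask import Flask

    app = Flask(__name__)

    @app.route("/api/contact", methods=["GET", "POST"])
    def retired_endpoint():
        return "This endpoint has been retired.", 410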

We sent our CRM and email provider apology notes the next morning. Blowing through the CRM's webhook quota ate the rest of that month's budget, about $400 in extra usage on the plan. Nothing nuclear. A meaningful chunk of annoyance.

4:17 am — the Slack channel fills up

By 4:17 am we had a stable situation. The new endpoint was in place. The queue was draining. D was awake-adjacent. I was making coffee.

I was also watching the Slack channel fill up with real contact form submissions that had been backlogged for an hour. The real ones. The humans who had, during all of this, actually been trying to ask us questions.

There were 14 of them. A mix of "I'm interested in a demo" and "is your API working?" and one person very politely pointing out that our marketing site was loading slowly. None of them knew they'd been queued behind 34,000 attack submissions. Their messages arrived in our inboxes at 4 am with no obvious timestamps and a faint air of "why is this system acting weird."

One of those 14 eventually became a paying customer. She told me, months later, that the reply she got from D at 4:30 am made her more confident in us, not less. "You answered me at four in the morning. I figured if you were on your game at four in the morning, you'd probably answer me during business hours too."

I did not tell her about the rest of the night.

7:03 am — the post-mortem

Five hours later we were all on a call. I had slept for about forty minutes. D had slept for approximately zero. Here's what we wrote down as the action items.

  1. Rate limiting should be enforced at the edge, not at the application layer. By the time a request hit our application, it had already consumed a connection, a thread, a database query, and a webhook call. Rate limiting was happening too late in the pipeline to save us.

  2. Spam filtering should be layered, not a single defense. One honeypot plus one timing check plus one IP reputation check is stronger than any of them alone. Our attacker had defeated our honeypot the moment they read our HTML.

  3. Webhook delivery should be decoupled from the ingestion path. If the CRM webhook is down or rate-limited, it should not back up the form submission queue. Submissions should be acknowledged fast, stored, and delivered asynchronously with retries and a dead-letter queue. (This, together with the layered checks from item 2, is sketched in code after the list.)

  4. The "debug query param in the confirmation page" class of mistake needs automated scanning. There's a whole category of data leaking into response headers and HTML that we never meant to expose. A build-time check would have caught this one.

  5. Queue depth should not be the primary alert for a form endpoint under attack. Request rate should be, because by the time the queue is backlogged you are already behind.

  6. On-call runbooks for form endpoints should exist, and they didn't. D was paging the air. I was reverse-engineering a system I had built.
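
None of these lessons are exotic, and items 2 and 3 fit in a page of code. Here's a minimal sketch in Python, stdlib only, with every function name invented for illustration; the point is the shape, not the implementation. Item 1, edge rate limiting, lives in front of all of this, in your CDN or load balancer, not in application code.

    import queue
    import threading
    import time

    delivery_q: "queue.Queue[dict]" = queue.Queue()
    dead_letters: list = []
    MAX_ATTEMPTS = 5

    def spam_score(form: dict, rendered_at: float, ip_reputation: float) -> int:
        """Layered checks: no single signal gets to decide on its own."""
        score = 0
        if form.get("url"):                       # honeypot field was filled in
            score += 2
        if time.time() - rendered_at < 2.0:       # submitted faster than a human reads
            score += 1
        if ip_reputation < 0.3:                   # the IP has a poor reputation score
            score += 1
        return score

    def store_submission(form: dict) -> None:
        ...                                       # stand-in for the fast local insert

    def deliver_webhook(form: dict) -> None:
        ...                                       # stand-in; a real one raises on failure

    def ingest(form: dict, rendered_at: float, ip_reputation: float) -> bool:
        if spam_score(form, rendered_at, ip_reputation) >= 2:
            return False                          # rejected before touching anything slow
        store_submission(form)
        delivery_q.put({"form": form, "attempts": 0})
        return True                               # acknowledged; delivery happens later

    def delivery_worker() -> None:
        while True:
            job = delivery_q.get()
            try:
                deliver_webhook(job["form"])
            except Exception:
                job["attempts"] += 1
                if job["attempts"] >= MAX_ATTEMPTS:
                    dead_letters.append(job)      # park it for manual replay
                else:
                    time.sleep(2 ** job["attempts"])  # exponential backoff
                    delivery_q.put(job)

    threading.Thread(target=delivery_worker, daemon=True).start()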

What this incident did to my thinking

This is the night I decided I never wanted to run my own form endpoint again. Not because the incident was a catastrophe — it wasn't, really — but because the shape of the failure was so specific and so avoidable in principle. Making the whole thing robust would have meant rewriting a meaningful chunk of our form infrastructure, and I did not want to spend my time being an SRE for a contact form.

A year later I was working on FormTo. Most of the defaults you see in the product come straight from this night.

  • Rate limiting is enforced at the edge, before your application logic runs.
  • Honeypot, timing check, IP reputation, content scoring, and header heuristics are layered by default, not opt-in.
  • Webhook delivery is fully decoupled from ingestion, with retries, backoff, correlation IDs, and a replay button in the dashboard.
  • Per-form and per-account rate limits exist separately, so one noisy form can't consume the whole budget.
  • Attack detection is continuous and automatic, not reactive to a human reading logs.

I don't think of these as features. I think of them as scars.

The boring moral

If you run a form endpoint you wrote yourself and you've never had a night like this — great. Maybe your threat model is smaller. Maybe you've been lucky. Maybe you've done more careful work than we did.

But I would strongly suggest, before it happens to you, that you audit your endpoint with the list above. Not as a sales pitch. As a favour to future-you, who I guarantee does not want to be on a 3 a.m. call with a junior engineer explaining what a residential proxy pool is.


If you'd rather skip the part where you write this infrastructure yourself, FormTo comes with all of this on by default. Including the part where we've already been through the 3 a.m. call so you don't have to.

And if you want the companion post — the one where we logged every bot that hit a public form for 30 days — it's here. Same lessons, different angle.
