3:14 AM. PagerDuty screaming. Production completely down. Database connection pool exhausted. API returning 500s. Payments failing. Customers tweeting.
I check Slack. CEO is typing. Heart rate: 140.
This is the story of the worst on-call incident of my career - and how the 5-minute protocol I used turned an 8-hour disaster into a 47-minute resolution. And a promotion.
The scene: database connection pool exhausted, API cascading 500s, payments failing, CEO in Slack - "What's happening? Major client just texted me." Every instinct said start changing things. Restart services. Scale up. Anything to make it stop. That's the mistake. I've watched engineers turn a 1-hour outage into an 8-hour catastrophe by doing exactly that - no plan, no documentation, no communication.
The 5-step protocol I actually used:
Step 1 - STOP. Don't touch anything for 30 seconds. Let the adrenaline pass its first peak.
Step 2 - ASSESS. 500s are a symptom. Connection pool exhausted is closer to the cause.
Step 3 - COMMUNICATE. One message in Slack: "I'm investigating. Update in 10 minutes." That single message reduced CEO anxiety by 80%.
Step 4 - DOCUMENT. Every command logged.
Step 5 - ACT. One change at a time, verify after each.
47 minutes later: back online. Root cause was connection pool config drift from a recent deploy. One config change. Three status updates. Zero secondary incidents. The CEO said "that was the calmest incident response I've seen." Two months later: promoted. Not for fixing a bug - for handling pressure.
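Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not the tooling from the incident: `IncidentLog`, `apply_one_change`, the `pool` dictionary and its values (5 and 50) are all hypothetical names and numbers chosen for the example. The idea is only that every action gets a timestamped log line, and each change is followed by its own verification before anything else is touched.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class IncidentLog:
    """Hypothetical in-memory incident timeline (Step 4 - DOCUMENT)."""
    entries: List[str] = field(default_factory=list)

    def record(self, message: str) -> None:
        # Prefix every entry with a UTC timestamp so the timeline can be replayed later.
        stamp = datetime.now(timezone.utc).strftime("%H:%M:%SZ")
        self.entries.append(f"{stamp} {message}")


def apply_one_change(
    log: IncidentLog,
    description: str,
    change: Callable[[], None],
    verify: Callable[[], bool],
) -> bool:
    """Apply a single change, then verify it (Step 5 - ACT)."""
    log.record(f"APPLY {description}")
    change()
    healthy = verify()
    log.record(f"VERIFY {description}: {'ok' if healthy else 'still failing'}")
    return healthy


# Simulated stand-in for a drifted connection pool setting (values are made up).
pool = {"max_connections": 5}


def restore_pool_size() -> None:
    pool["max_connections"] = 50


def pool_has_headroom() -> bool:
    return pool["max_connections"] >= 50


log = IncidentLog()
log.record("STATUS investigating, update in 10 minutes")
fixed = apply_one_change(log, "restore pool size", restore_pool_size, pool_has_headroom)
```

In a real incident, `change` and `verify` would wrap whatever config push and health check your stack uses. The point of the shape is that a second change cannot start until the first one has a logged verification result.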
0:00 3:14 AM - PagerDuty, CEO typing, heart rate 140
0:06 The scene: what was actually broken
0:23 What I almost did (the mistake)
0:40 The 5-minute protocol: what I did instead
1:08 Resolution: 47 minutes, 1 fix, 0 secondary incidents
1:29 The full incident response playbook
On-call stories that end well. Subscribe before the next incident.
devopsdive.com
#DevOps #OnCall #IncidentResponse #ProductionDown #DevOpsDive
