The First Thing You Should Do in a Production Incident Is Not What You Think

When a production incident hits, the instinct is to fix it immediately. I once acted on that instinct during a data issue in production and ended up making the incident worse before we understood what was actually broken. Part of the series: The True Code of a Complete Engineer.


There is a particular kind of adrenaline that hits when a production incident lands in your lap.

A message appears. Slack lights up. Someone from support says users are seeing wrong data. Your manager is already in the thread. The product team is asking for an ETA. And somewhere in your chest, a clock starts ticking.

Everything in you wants to move. Open the code. Check the logs. Find the thing. Fix it.

I know that feeling well. And for a long time, I thought acting on it immediately was the right response. That speed was the measure of how well you handled an incident. That the faster you found the fix, the better engineer you were.

It took me two incidents, one that went badly and one that went well, to understand that the instinct to fix first is one of the most expensive instincts in production engineering.


The First Incident: When I Went Straight for the Fix

A few years into my career, we had a data issue hit production on a busy weekday morning.

Users were seeing stale data. In some cases, figures from the previous day were showing up instead of the current ones. Support was flooded. The business team was escalating. It was the kind of incident that makes everyone in the chain very uncomfortable very quickly.

I was one of the engineers pulled in. And I did what felt natural. I went straight into the codebase.

I had a hunch. There was a caching layer we had recently touched. It felt like the obvious culprit. So I went there first, read through the changes, convinced myself I had found the problem, and pushed a fix.

The fix deployed. We told everyone it was resolved.

Twenty minutes later, the reports started coming back. Some users were still seeing wrong data. A different set of users now could not load the page at all. The incident, which had been bad, was now worse.

What had happened?

The caching layer was not the only problem. There was a background job that refreshed the reporting dataset every 30 minutes. That job had failed sometime around 2 AM because of a malformed record. My fix had addressed one symptom without understanding the full picture. And in the process of deploying under pressure, I had introduced a second issue on top of the first.
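To make that failure mode concrete, here is a minimal sketch. The record format, parser, and function names are all invented for illustration, not our actual pipeline: a refresh job that raises on the first malformed record stalls the entire dataset, while one that quarantines bad records and alerts keeps the rest fresh.

```python
import logging

def parse(record: str):
    # Hypothetical parser: records are "key=value" strings.
    key, sep, value = record.partition("=")
    if not sep or not key:
        raise ValueError(f"malformed record: {record!r}")
    return (key, value)

def refresh_dataset(records):
    """Brittle version: one malformed record at 2 AM stops the
    entire refresh, and users see stale data all morning."""
    return [parse(r) for r in records]  # raises on the first bad record

def refresh_dataset_resilient(records):
    """Quarantine bad records instead of failing the whole run,
    and surface them loudly so they get fixed, not ignored."""
    good, quarantined = [], []
    for r in records:
        try:
            good.append(parse(r))
        except ValueError:
            quarantined.append(r)
    if quarantined:
        logging.warning("quarantined %d malformed records", len(quarantined))
    return good
```

The resilient version is not automatically right, either. Silently skipping records can hide a real data problem, which is why the quarantine has to alert. The point is that the choice between "fail the whole run" and "degrade and alert" should be a design decision, not an accident discovered at 2 AM.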

We spent the rest of that day untangling two problems instead of one.

Looking back at that morning, the thing that strikes me is not that I was wrong about the caching layer. I was partly right; it was contributing. What strikes me is how quickly I moved from "something is wrong" to "I know what is wrong" to "I have fixed it."

There was no pause. No mapping of what was actually broken. No question of who was affected and how badly. No thought about whether there was a faster way to stop the bleeding while the real investigation happened.

I had confused speed with effectiveness. And in a production incident, those two things are not the same.


What I Didn't Know Then

Production incidents have two distinct problems, and most engineers, especially earlier in their careers, treat them as one.

The first problem is impact. Users are experiencing something wrong right now. Trust is eroding. The business is losing something. Transactions, confidence, time. This problem is urgent and it is real.

The second problem is cause. Something in the system is broken and needs to be understood and fixed properly so it does not happen again.

Here is the thing nobody told me: these two problems have different timelines and they need different responses.

The cause problem takes the time it takes. You cannot rush root cause analysis without risking exactly what happened to me. A partial fix that creates new problems. A deployment made under pressure that introduces instability. A solution built on an incomplete understanding of what is actually broken.

But the impact problem can often be addressed much faster. Not by fixing the system, but by finding a workaround that stops the bleeding while the real investigation happens.

A workaround is not a compromise. It is not admitting you cannot fix the problem. It is the recognition that restoring the user experience and finding the root cause are two separate jobs. And that trying to do both at once, under pressure, with incomplete information, is how incidents get worse.


The Second Incident: When I Paused First

About a year later, something similar happened. Wrong data again. This time, a reporting module was showing incorrect aggregates to a set of users. Different numbers than expected. Not catastrophic, but visible and confidence-damaging.

Same kind of pressure. Same Slack thread lighting up. Same clock in the chest.

But this time, before I opened the codebase, I stopped and asked a different set of questions.

How many users are affected? Not everyone. A specific subset. Users who had been created after a certain date.

What exactly are they seeing? Wrong totals in one specific report. Everything else was accurate.

When did this start? Sometime in the last twelve hours, based on the earliest complaint.

Is there a way to temporarily hide or disable this for affected users while we investigate? Yes. The report had a feature flag. We could turn it off for the affected segment within minutes.

We did that first. Three minutes of work. The visible wrong data was gone. Users saw a message saying the report was temporarily unavailable. Not ideal, but significantly better than wrong numbers. Support calls dropped immediately. The pressure in the thread dropped with it.
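As a sketch of what that kill switch amounted to (the flag mechanics, names, and cutoff date here are invented for illustration; ours lived in an internal flag system):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class User:
    name: str
    created_on: date

# Hypothetical flag state. The cutoff date is illustrative only;
# in the real incident the affected segment came from support reports.
REPORT_CUTOFF = date(2023, 1, 1)
report_enabled_for_new_users = False  # flipped off during the incident

def report_view(user: User) -> str:
    in_affected_segment = user.created_on >= REPORT_CUTOFF
    if in_affected_segment and not report_enabled_for_new_users:
        # Graceful degradation beats wrong numbers.
        return "This report is temporarily unavailable."
    return render_aggregates(user)

def render_aggregates(user: User) -> str:
    # Stand-in for the real report rendering.
    return f"Aggregates for {user.name}"
```

The key property is that the flag check sits in front of the broken path and scopes to the affected segment only. Users outside the segment keep their working report; users inside it get an honest message instead of wrong data.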

And then, with that breathing room, we actually investigated.

It took another forty minutes to find the real cause. A data pipeline had changed how it handled a specific type of record and the aggregation query had not been updated to match. A two-line fix, once we understood it properly. Tested carefully, deployed cleanly.

Total incident time: under an hour. No second wave. No new problems introduced.

The contrast with the first incident was not about skill. It was not that I had become a better engineer in the technical sense. It was that I had changed the order of things. Understand first. Stabilize next. Fix last.


What "Understand First" Actually Means

When an incident hits, the understanding phase does not need to be long. But it needs to happen before anything else.

Not hours. Sometimes just five minutes.

Over time I realized that in those first few minutes, I am really trying to answer a small set of questions before touching the code. I now think of them as a quick mental map of the situation.

BLAST.

It is not a formal framework, just a way to quickly understand the shape of the incident.

B — Blast radius
Who is affected? Everyone? A subset of users? A single feature or flow? Knowing the blast radius immediately tells you how serious the situation is and how quickly a mitigation is needed.

L — Location
Where exactly is the problem showing up? A specific API? A report? A screen in the UI? Narrowing the location prevents you from searching the entire system blindly.

A — Anchor time
When did this start? Ten minutes ago? After last night’s deployment? At midnight when a batch job ran? This question alone often removes half your possible theories.

S — System change
What changed recently? A deployment, configuration update, scheduled job, data pipeline change, or even a third-party dependency. Production issues rarely appear from nowhere. Something usually moved first.

T — Temporary mitigation
Is there a way to reduce the impact immediately without fixing the root cause yet? A feature flag. A rollback. Disabling a module. Showing a graceful degradation message. Sometimes restoring user trust takes three minutes, while the real fix takes forty.

None of this takes long. Sometimes the entire BLAST pass takes five minutes.
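If it helps to see the mnemonic as something concrete, here is one way to sketch it: a plain note structure that forces each BLAST question to get an explicit answer. The field names and defaults are my own, not a real tool.

```python
from dataclasses import dataclass

@dataclass
class BlastNotes:
    """Five-minute triage notes, one field per BLAST question."""
    blast_radius: str = "unknown"       # who is affected, and how many?
    location: str = "unknown"           # where exactly is it showing up?
    anchor_time: str = "unknown"        # when did it start?
    system_change: str = "unknown"      # what moved recently?
    temporary_mitigation: str = "none"  # how do we stop the bleeding now?

    def unanswered(self) -> list[str]:
        """Questions still sitting at their default placeholder."""
        return [name for name, value in vars(self).items()
                if value in ("unknown", "none")]

# Filled in with the answers from the second incident above:
notes = BlastNotes(
    blast_radius="users created after a certain date",
    location="one aggregates report; everything else accurate",
    anchor_time="within the last twelve hours",
    temporary_mitigation="feature flag off for the affected segment",
)
# system_change is still unknown -- which is exactly where the
# investigation goes next.
```

The structure itself is unimportant. What matters is that an empty field is visible: you can see, at a glance, which question you have not yet answered before you touch the code.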

But those five minutes change everything about what happens next.

Instead of reacting to pressure, you start responding to reality.

And that difference is often what prevents an incident from getting worse.


Workaround First. Fix Second. Always in That Order.

The mental model that changed how I handle incidents is simple.

A workaround restores trust. A fix restores the system.

Trust needs to be restored first. Because trust is what is actively being damaged while the incident is live. Every minute a user sees wrong data, confidence in the product erodes. Every escalation that goes unanswered increases the pressure on the team. The workaround addresses this. It does not solve the problem, but it stops the bleeding.

The fix comes after, with clearer heads, better information, and without the pressure of a live incident actively worsening in the background.

And here is what I have observed over the years: the engineers who are most trusted during incidents are not always the ones who find the fix fastest. They are the ones who, when the pressure is highest, stay methodical. Who ask questions before touching anything. Who find the workaround quickly and communicate it clearly before diving into root cause. Who do not make things worse in the rush to make things better.

That composure, the ability to slow down at exactly the moment when everything is pushing you to speed up, is not a personality trait. It is a habit, and it is learnable.


The One Thing I'd Tell My Earlier Self

If I could go back to that first incident, the one where I deployed a fix without understanding the full picture and made it worse, I would not tell myself to be smarter or more careful.

I would just tell myself to pause for five minutes before touching anything.

Ask what is broken and what is not. Ask when it started. Ask if there is a faster way to reduce the impact while you investigate. Write the answers down, even briefly. Let that five minutes of clarity shape everything that comes after.


Because in a production incident, the most dangerous moment is not when you do not know what to do.

It is when you are certain you do, and you are wrong.


Even today, I still have to remind myself to pause sometimes.
The adrenaline does not really go away.
🙂


About This Series: The True Code of a Complete Engineer

This is part of an ongoing series where I share things I wish someone had told me earlier, not theory, but the real stuff that shaped how I grew, earned trust, and learned to lead.