How I Debug Production Issues: A Process That Saves Hours

Panic is the enemy of debugging. After years of production incidents, I developed a systematic approach that starts with the right questions, not lucky guesses. Here is the process I actually follow.

Norehan Norrizan
· 9 min read

The worst debugging sessions I have been part of had one thing in common: the team jumped to solutions before understanding the problem. Someone would say "I bet it is the database" and spend twenty minutes checking the database. Someone else would say "it is probably the new deployment" and spend twenty minutes reverting. Forty minutes later, nobody was closer to the answer — because neither of them had started by asking the right questions.

The best debugging sessions I have been part of had a different structure: slow down, establish facts, form a hypothesis, test it. This sounds obvious. Under the pressure of a production incident at 11pm, it is not obvious at all. That is why I now have a written process.

Step One: Define the Problem Precisely

Before touching anything, I write down three things:

  • What is actually happening? (Observed behaviour, with specifics — not "it is slow" but "P95 latency on /api/checkout is 8.4 seconds since 22:17 UTC")
  • What should be happening? (Expected behaviour — "P95 latency is typically 320ms")
  • When did it start? (The exact time, if determinable)

Writing this down forces precision. "The site is slow" is not a problem statement. "The checkout API has been returning 504 timeout errors for 3% of requests since 22:17 UTC, up from a baseline of 0.02%" is a problem statement you can actually work with.

The "when did it start" question is especially important because it immediately suggests what to look for: what changed around that time? Deployments, configuration changes, traffic spikes, third-party dependency changes, or scheduled jobs that run at that hour.
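In practice this is a window query over your change log. A hypothetical sketch, assuming you can export change events as `(timestamp, description)` pairs from whatever deployment and cron tooling you use:

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def changes_near(events, incident_start, window_minutes=30):
    """Return change events in the window leading up to the incident start."""
    window_open = incident_start - timedelta(minutes=window_minutes)
    return sorted(
        (ts, desc) for ts, desc in events if window_open <= ts <= incident_start
    )


# Hypothetical change log; in reality this would come from your deploy/cron tooling.
events = [
    (datetime(2024, 3, 1, 20, 0, tzinfo=UTC), "config: increased worker count"),
    (datetime(2024, 3, 1, 22, 14, tzinfo=UTC), "deploy: checkout-service v2.8.1"),
    (datetime(2024, 3, 1, 22, 15, tzinfo=UTC), "cron: nightly order export started"),
]
incident = datetime(2024, 3, 1, 22, 17, tzinfo=UTC)

for ts, desc in changes_near(events, incident):
    print(ts.strftime("%H:%M"), desc)
```

Anything inside the window is a suspect until ruled out; anything well outside it usually is not.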

Step Two: Establish the Blast Radius

Before debugging, understand the scope. Is this affecting all users or a subset? All endpoints or one? All regions or one? All versions of the mobile app or only one?

Scope determines priority and also often determines cause. An issue affecting only users in one geographic region suggests a CDN, DNS, or region-specific infrastructure problem. An issue affecting only a specific user cohort suggests a data-related bug. An issue that started at exactly 22:17 and affects everything equally suggests a deployment or configuration change at 22:17.
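The scoping questions above amount to grouping requests by one dimension at a time and asking whether the failures are uniform or concentrated. A hypothetical sketch over request records (the field names and sample data are invented for illustration):

```python
from collections import defaultdict


def error_rate_by(requests, dimension):
    """Group request records by one dimension and compute the error rate per group."""
    totals, errors = defaultdict(int), defaultdict(int)
    for req in requests:
        key = req[dimension]
        totals[key] += 1
        if req["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical sample of request records pulled from access logs.
requests = [
    {"region": "eu-west", "endpoint": "/api/checkout", "status": 504},
    {"region": "eu-west", "endpoint": "/api/checkout", "status": 200},
    {"region": "us-east", "endpoint": "/api/checkout", "status": 200},
    {"region": "us-east", "endpoint": "/api/search", "status": 200},
]
print(error_rate_by(requests, "region"))
print(error_rate_by(requests, "endpoint"))
```

Running the same grouping over region, endpoint, and client version takes minutes and often narrows the cause before any hypothesis is formed.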

Step Three: Look at the Data Before Forming Hypotheses

I have a checklist I run through before I start guessing at causes:

  • Error logs: what errors are occurring? What is the error message, not just the count?
  • Application metrics: latency, error rate, request volume, queue depth — are any correlated with the problem?
  • Infrastructure metrics: CPU, memory, disk I/O, network — is anything saturated?
  • Database metrics: query latency, connection pool usage, lock contention, slow query log
  • Deployment history: what changed in the last 2-4 hours?
  • External dependencies: are any third-party services (payment providers, email services, CDN) showing issues?

The goal of this step is not to find the cause — it is to rule out entire categories. If CPU and memory are normal, the cause is probably not resource exhaustion. If the database is slow, the cause might be there. If nothing in the application changed but there is an upstream dependency issue, the cause is probably external.
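The "rule out categories" step can be sketched as a simple baseline comparison: anything within a tolerance of its normal value is provisionally cleared, and anything far outside it stays on the suspect list. The metric names, values, and the 3x tolerance below are all hypothetical:

```python
def rule_out(current, baseline, tolerance=3.0):
    """Split metrics into suspects (far above baseline) and provisionally cleared."""
    suspect, cleared = {}, []
    for name, value in current.items():
        if value > tolerance * baseline[name]:
            suspect[name] = (value, baseline[name])
        else:
            cleared.append(name)
    return suspect, cleared


# Hypothetical snapshot vs. pre-incident baseline.
current  = {"cpu_pct": 41, "mem_pct": 63, "db_p95_ms": 5200, "error_rate": 0.03}
baseline = {"cpu_pct": 38, "mem_pct": 60, "db_p95_ms": 45,   "error_rate": 0.0002}

suspect, cleared = rule_out(current, baseline)
print("suspect:", suspect)
print("ruled out:", cleared)
```

Here CPU and memory are cleared immediately, and attention narrows to the database and the error rate, which is exactly the kind of category-level elimination this step is for.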

Step Four: One Hypothesis at a Time

Once you have looked at the data, form the most specific hypothesis you can. Not "it could be the database" but "the slow checkout API calls all involve the orders table, and the orders table has a sequential scan on user_id in the slow query log — possibly because a recent data migration removed the index."

Test exactly one hypothesis at a time. If you change two things simultaneously and the problem goes away, you do not know which change fixed it. If the problem persists, you do not know whether neither worked or whether they cancelled each other out.

-- Good debugging practice: test one hypothesis with a targeted query
-- Hypothesis: there's lock contention on the orders table
SELECT
  pid,
  state,
  wait_event_type,
  wait_event,
  query_start,
  LEFT(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND query LIKE '%orders%'
ORDER BY query_start;

Step Five: Fix, Verify, and Document

When you have confirmed the hypothesis — meaning the change you made actually resolved the specific symptom you identified — verify it with metrics, not just "it feels better." The error rate should drop. The latency should return to baseline. The specific error in the logs should stop appearing.
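"Verify with metrics" can be made mechanical: sample requests after the fix and only declare victory when the error rate is back near the pre-incident baseline. The function, the 2x threshold, and the sample data are hypothetical:

```python
def fix_verified(post_fix_samples, baseline_rate, factor=2.0):
    """Return (verified, rate): the fix counts as verified only if the
    post-fix error rate is within `factor` x the pre-incident baseline."""
    rate = sum(post_fix_samples) / len(post_fix_samples)
    return rate <= factor * baseline_rate, rate


# 1 = request failed, 0 = request succeeded, sampled after the fix was deployed.
samples = [0] * 4999 + [1]          # 1 failure in 5,000 post-fix requests
ok, rate = fix_verified(samples, baseline_rate=0.0002)
print(f"post-fix error rate {rate:.4%}, verified: {ok}")
```

The point is the discipline, not the code: pick the threshold before you look at the numbers, so "it feels better" never substitutes for the baseline returning.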

Then document what happened. Not a ten-page postmortem — a short note with four things:

  • What happened (precise symptoms)
  • Root cause (what actually caused it)
  • How it was fixed
  • What will prevent recurrence
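The four-item note fits in a template. A hypothetical sketch, filled in with the orders-table example from earlier in the article:

```python
NOTE_TEMPLATE = """\
Incident note: {title}
- What happened: {symptoms}
- Root cause: {cause}
- How it was fixed: {fix}
- Prevention: {prevention}
"""

# Hypothetical incident, reusing the checkout example from above.
note = NOTE_TEMPLATE.format(
    title="checkout 504 spike, 2024-03-01",
    symptoms="3% of /api/checkout requests returned 504 from 22:17 UTC",
    cause="data migration dropped the orders index on user_id, causing sequential scans",
    fix="recreated the index; P95 latency back to baseline",
    prevention="migration review checklist now includes an index diff",
)
print(note)
```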

This documentation pays dividends the next time something similar happens, and it forces you to understand the incident well enough to explain it, which surfaces cases where you fixed the symptom without understanding the cause.

The Mental Habits That Actually Help

Beyond the process, here are a few mental habits I have found genuinely useful:

  • Explain the problem out loud (or in writing). The act of explaining forces clarity. More than once I have started writing a Slack message asking for help and figured out the answer before I sent it.
  • Ask "what changed?" before "what is wrong?". Most production issues are caused by recent changes. The closer you can tie a problem to a specific change, the faster you find the cause.
  • Doubt your assumptions. "The cache is definitely fresh" and "that code path has not changed in months" are both statements worth verifying explicitly, not just assuming.
  • Know when to stop and escalate. Spending four hours on an incident by yourself when a ten-minute conversation with someone who knows the system better would have found it is a pride problem, not a debugging problem.

Filed under

debugging · production · observability · software engineering