Production Troubleshooting Methodology

Production issues are among the most demanding challenges a software team can face, especially when they occur in environments outside your direct control.

To navigate these situations effectively, this methodology draws inspiration from the OODA loop (Observe, Orient, Decide, Act), a strategic decision-making framework originally developed for military operations. The goal is to establish control, gather facts, and make deliberate progress toward resolution, even in the face of complex or obscure problems.

Guiding Principles

This approach formalizes a repeatable process for investigating and resolving production or customer-environment issues. It emphasizes:

  • Structured analysis over reactive guesswork
  • Group collaboration and shared ownership
  • Prioritized, actionable steps
  • A balance of targeted and broad evidence gathering

When diagnosing production issues, the method aims to gather both targeted and broad evidence, advancing perhaps three hypotheses per cycle to maintain momentum. This works better than early guesswork, especially when little is known and customer confidence is at stake.

Steps

Given the high-stakes nature of production issues, it is important to have a formal procedure that helps teams “work the problem” effectively, rather than chasing scattershot fixes and potentially missing crucial evidence.

The following method has proven effective in complex and difficult situations:

  1. Context

    • Gather relevant technical and business context. Understand the system’s role, dependencies, and operational environment.
  2. Evidence Collection

    • Collect monitoring outputs, logs, exception traces, screenshots, and other artifacts. Prioritize verifiable facts with context, rather than relying solely on verbal reports or assumptions.
  3. Analyze the Evidence

    • Analyze the evidence thoroughly. Note observations and anomalies, categorize them (e.g., using Notepad++), and identify clues or red herrings. Use this to generate hypotheses.
  4. Hypothesis Formation

    • Develop hypotheses based on the evidence. Keep them open-ended and avoid premature dismissal of less likely possibilities.
  5. Convene & Review

    • Conduct a group review session to discuss hypotheses. This should blend brainstorming with prioritization. Shared ownership is key: no single person should bear sole responsibility for finding (or failing to find) the root cause.
  6. Actionable Steps

    • Create a prioritized list of investigative or resolution steps. For customer-facing cycles, provide 3–6 clear actions per round, and request confirmation in a verifiable format (e.g., screenshots, logs).
    • Carry outstanding items over into the next round as needed.
  7. Repeat the Cycle

    • Begin the next cycle with updated context and evidence. For on-premises customers, cycles may be daily; for DevOps or SaaS environments, they may run hourly or as needed.
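
The hand-off between cycles in steps 6 and 7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the methodology itself: the `Action` and `Cycle` classes, the `plan_next_cycle` helper, the numeric priority scale (1 = most urgent), and the sample hypotheses and actions are all hypothetical. It shows unconfirmed actions being carried over, merged with new ones, and capped at six per round.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    description: str
    priority: int            # hypothetical scale: 1 = most urgent
    confirmed: bool = False  # True once verifiable evidence has come back


@dataclass
class Cycle:
    number: int
    hypotheses: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)


def plan_next_cycle(previous: Cycle, new_actions: list[Action],
                    hypotheses: list[str], max_actions: int = 6) -> Cycle:
    """Carry over unconfirmed actions, add new ones, keep the top priorities."""
    outstanding = [a for a in previous.actions if not a.confirmed]
    # sorted() is stable, so carried-over items stay ahead of new ones on ties.
    candidates = sorted(outstanding + new_actions, key=lambda a: a.priority)
    return Cycle(previous.number + 1, list(hypotheses), candidates[:max_actions])


# Example data is invented purely for illustration.
first = Cycle(1, ["connection pool exhaustion"], [
    Action("Send application log for the failure window", 1, confirmed=True),
    Action("Screenshot of database connection settings", 2),
])
second = plan_next_cycle(
    first,
    [Action("Capture a thread dump during the hang", 1)],
    ["connection pool exhaustion", "thread deadlock"],
)
for action in second.actions:
    print(action.priority, action.description)
```

Keeping the cap as a parameter lets a team tighten it for customer-facing rounds and relax it for internal ones.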

Supporting Artefacts

Artefacts supporting the process should include:

  • Technical & Business Context
    Ideally documented early and shared via internal Wiki.
  • Detailed Analyses
    Can be informal (e.g., Notepad++ notes) but should be thorough.
  • JIRA Ticket
    Keep focused on the main issue. Spin off minor findings into separate tickets. Reiterate the core problem and current status as needed.
  • Wiki Page for the Problem
    Useful for tracking event history, facts, hypotheses, and action items in one place.
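
As one possible starting point for the problem page, the sketch below renders a plain-text skeleton with the four sections named above. The function name `render_problem_page`, the section order, and the sample entries are assumptions made for illustration; the output would still need adapting to whatever markup your Wiki uses.

```python
def render_problem_page(title, events, facts, hypotheses, actions):
    """Return a plain-text page skeleton; empty sections get a placeholder."""
    sections = [
        ("Event history", events),
        ("Facts", facts),
        ("Hypotheses", hypotheses),
        ("Action items", actions),
    ]
    lines = [title, ""]
    for heading, items in sections:
        lines.append(heading)
        lines.extend(f"  - {item}" for item in (items or ["(none yet)"]))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Example content is invented purely for illustration.
page = render_problem_page(
    "Checkout timeouts",
    events=["Day 1: customer reports intermittent timeouts"],
    facts=[],
    hypotheses=["connection pool exhaustion"],
    actions=[],
)
print(page)
```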
