Production issues are among the most demanding challenges a software team can face, especially when they occur in environments outside your direct control.
To navigate these situations effectively, this methodology draws inspiration from the OODA loop (Observe, Orient, Decide, Act), a strategic decision-making framework originally developed for military operations. The goal is to establish control, gather facts, and make deliberate progress toward resolution, even in the face of complex or obscure problems.
Guiding Principles
This approach formalizes a repeatable process for investigating and resolving production or customer-environment issues. It emphasizes:
- Structured analysis over reactive guesswork
- Group collaboration and shared ownership
- Prioritized, actionable steps
- A balance of targeted and broad evidence gathering
When diagnosing production issues, the approach aims to gather both specific and broad insights, advancing perhaps three hypotheses per cycle to maintain momentum. This works better than early guesswork, especially when little is known and customer confidence is at stake.
Steps
Given the high-stakes nature of production issues, it is important to have a formal procedure that helps teams "work the problem" effectively, rather than chasing scattershot fixes and potentially missing crucial evidence.
The following method has been found effective in complex and difficult situations:
Context
- Gather relevant technical and business context. Understand the system's role, dependencies, and operational environment.
Evidence Collection
- Collect monitoring outputs, logs, exception traces, screenshots, and other artifacts. Prioritize verifiable facts with context, rather than relying solely on verbal reports or assumptions.
Analyze the Evidence
- Analyze the evidence thoroughly. Note observations and anomalies, categorize them (e.g., using Notepad++), and identify clues or red herrings. Use this to generate hypotheses.
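As a minimal sketch of the categorization step, the snippet below groups log lines into buckets by pattern. The category names and regular expressions are hypothetical and would need to be adapted to the system under investigation; lines that match nothing are kept rather than discarded, since they may still turn out to be clues.

```python
import re
from collections import defaultdict

# Hypothetical categories and patterns (assumptions for illustration only).
CATEGORY_PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "exception": re.compile(r"exception|traceback", re.IGNORECASE),
    "memory": re.compile(r"out of memory|OutOfMemoryError", re.IGNORECASE),
}


def categorize(lines):
    """Group log lines by the first category whose pattern matches."""
    buckets = defaultdict(list)
    for line in lines:
        for name, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(line):
                buckets[name].append(line)
                break
        else:
            # No pattern matched: keep the line for manual review.
            buckets["uncategorized"].append(line)
    return dict(buckets)
```

Calling `categorize` on the lines of a collected log file returns a dictionary keyed by category, which can be pasted into the analysis notes as a starting point for observations.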
Hypothesis Formation
- Develop hypotheses based on the evidence. Keep them open-ended and avoid premature dismissal of less likely possibilities.
Convene & Review
- Conduct a group review session to discuss hypotheses. This should blend brainstorming with prioritization. Shared ownership is key: no single person should bear sole responsibility for finding (or failing to find) the root cause.
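One way to make the prioritization in the review session concrete is sketched below. The `Hypothesis` fields and the likelihood-divided-by-cost score are assumptions made for illustration, not part of the methodology itself; the point is only that a few hypotheses are advanced per cycle while the rest stay in a backlog rather than being dismissed.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    likelihood: float  # group's rough estimate, 0.0 to 1.0 (assumed scale)
    test_cost: int     # rough effort to test, 1 = cheap (assumed scale)


def select_for_cycle(hypotheses, per_cycle=3):
    """Return (to_advance, backlog), ranked by likelihood per unit of test cost."""
    ranked = sorted(
        hypotheses,
        key=lambda h: h.likelihood / h.test_cost,
        reverse=True,
    )
    # Less likely hypotheses are kept in the backlog, not discarded.
    return ranked[:per_cycle], ranked[per_cycle:]
```

The backlog is carried into the next cycle, where new evidence may raise or lower any estimate.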
Actionable Steps
- Create a prioritized list of investigative or resolution steps. For customer-facing cycles, provide 3 to 6 clear actions per round, and request confirmation in a verifiable format (e.g., screenshots, logs).
- Repeat outstanding items as needed.
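The action list can be tracked with something as simple as the sketch below. The `Action` fields and the priority scale are hypothetical; the function returns the outstanding items for the next round, highest priority first, capped at the upper end of the suggested range.

```python
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    priority: int            # 1 = highest (assumed scale)
    evidence_requested: str  # e.g. "screenshot", "log file"
    done: bool = False


def next_round(actions, max_actions=6):
    """Return outstanding actions for the next round, highest priority first."""
    outstanding = [a for a in actions if not a.done]
    outstanding.sort(key=lambda a: a.priority)
    return outstanding[:max_actions]
```

Items that remain unconfirmed simply reappear in the next call, which mirrors repeating outstanding items from one round to the next.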
Repeat the Cycle
- Begin the next cycle with updated context and evidence. For on-premises customers, cycles may be daily; for DevOps or SaaS environments, they may run hourly or as needed.
Supporting Artefacts
Artefacts supporting the process should include:
- Technical & Business Context: Ideally documented early and shared via internal Wiki.
- Detailed Analyses: Can be informal (e.g., Notepad++ notes) but should be thorough.
- JIRA Ticket: Keep focused on the main issue. Spin off minor findings into separate tickets. Reiterate the core problem and current status as needed.
- Wiki Page for the Problem: Useful for tracking event history, facts, hypotheses, and action items in one place.
References
- Performance Analysis Methodology (brendangregg.com)
- OODA loop (Wikipedia)