Your database starts returning errors. Customer complaints are flooding support. Your monitoring dashboard shows red across three services. The on-call engineer stares at the screen trying to decide what to do first.
This moment of confusion costs teams critical response time. Without structured assessment, engineers apply personal judgment under pressure, leading to inconsistent responses, delayed escalation, and extended incident duration. Teams with a clear triage framework tend to resolve incidents faster than those relying on ad-hoc assessment.
The solution is a set of golden questions that teams ask at the start of every incident. These five questions transform chaotic initial moments into structured assessment, helping teams make fast, consistent decisions regardless of who is on call or what time the incident occurs.
Why Structured Assessment Matters
When an incident strikes, your brain enters a stress response. Complex decision-making becomes harder. Memory recall suffers. The natural instinct is to start fixing immediately without understanding the full picture.
This instinct is dangerous. Engineers who dive into investigation without assessment often:
- Fix the wrong problem: Addressing symptoms rather than root causes, or focusing on secondary issues while the primary failure continues.
- Under-escalate: Working alone on incidents that require team coordination, losing valuable time before requesting help.
- Over-escalate: Waking entire teams for minor issues that one engineer could resolve, creating alert fatigue and eroding trust in the incident process.
- Miss stakeholder communication: Forgetting to update customers or leadership while focused on technical investigation, damaging trust and creating communication gaps.
Structured questions counteract these patterns by forcing deliberate assessment before action. The questions themselves act as a checklist that prevents critical oversights during high-stress moments.
The Five Golden Questions
Question 1: What is Broken and Who is Affected?
The first question establishes scope. Before determining severity or assigning responders, you need to understand what systems are failing and which users experience impact.
Specific prompts to answer this question:
- Which services or components are failing?
- Is this affecting all users, a geographic region, or a specific customer segment?
- What functionality is unavailable or degraded?
- Are there dependent services that might be impacted?
This question should produce a clear impact statement: “The payment processing service is returning 500 errors, affecting all checkout attempts across the platform” rather than vague descriptions like “something is wrong with payments.”
Linking incidents to affected services through your service catalog provides immediate context about dependencies and business criticality. When you know the payment service is down, you can instantly see that it impacts checkout, subscription renewals, and the mobile app purchase flow.
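As a rough illustration of how a service catalog can answer the dependency prompt, the sketch below walks a small dependency map to list everything downstream of a failing service. The catalog contents, the `DEPENDENTS` name, and the `order-confirmation-email` entry are hypothetical; a real catalog would come from your own tooling.

```python
from collections import deque

# Hypothetical service catalog: each key maps to the services or flows that
# depend on it directly. All names here are illustrative assumptions.
DEPENDENTS = {
    "payment-service": ["checkout", "subscription-renewals", "mobile-purchase-flow"],
    "checkout": ["order-confirmation-email"],
}


def impacted_services(failing: str, dependents: dict[str, list[str]]) -> list[str]:
    """Return every service downstream of `failing`, in breadth-first order."""
    seen = {failing}
    queue = deque([failing])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


print(impacted_services("payment-service", DEPENDENTS))
```

Direct dependents appear first, followed by services further downstream, which gives the responder a quick ordering for the impact statement.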
Question 2: How Severe is the Business Impact?
Severity determines everything that follows: response urgency, escalation paths, communication cadence, and resource allocation. This question maps the impact from Question 1 to your severity framework.
Factors to evaluate:
- Is revenue directly at risk right now?
- What percentage of users are affected?
- Are there workarounds available?
- How long has this been happening?
- Is the issue getting worse?
A common approach is a five-level severity system where level 1 represents critical outages requiring immediate all-hands response, and level 5 represents informational issues for tracking purposes. The key is having predefined criteria so engineers classify incidents based on objective factors rather than gut feeling.
For example: “Complete service outage affecting all users with no workaround = Severity 1” is objective. “This seems pretty bad” is not. Your severity framework should provide clear criteria that map symptoms to levels.
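One way to make such criteria explicit is to encode them as a small function. The sketch below is only an example: the `Impact` fields and every threshold are assumptions chosen for illustration, and your own severity framework should define the real rules.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    full_outage: bool              # is the service completely unavailable?
    percent_users_affected: float  # 0 to 100
    revenue_at_risk: bool
    workaround_available: bool


def classify_severity(impact: Impact) -> int:
    """Map impact facts to a level from 1 (critical) to 5 (informational).

    The thresholds below are illustrative assumptions, not a standard.
    """
    if (
        impact.full_outage
        and impact.percent_users_affected >= 100
        and not impact.workaround_available
    ):
        return 1
    if impact.revenue_at_risk and not impact.workaround_available:
        return 2
    if impact.percent_users_affected >= 25:
        return 3
    if impact.percent_users_affected > 0:
        return 4
    return 5


# Complete outage, all users, no workaround
print(classify_severity(Impact(True, 100, True, False)))
```

Because the rules are checked from most to least severe, an incident that satisfies several conditions is assigned the highest applicable level.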
Question 3: Who Needs to Respond?
Once you understand scope and severity, the third question identifies required responders. This prevents both under-staffing critical incidents and over-paging for minor issues.
Response considerations:
- Who is the current on-call engineer for affected services?
- Does this severity require additional responders or an incident lead?
- Are there subject matter experts who should be engaged?
- Do we need a communications lead for customer updates?
For severity 1 incidents, you likely need a designated incident lead to coordinate response, plus technical responders from each affected service area. For severity 4 issues, the on-call engineer can handle investigation solo with escalation if needed.
Clear participant roles eliminate confusion during response. Someone owns coordination and decision-making as the incident lead. Others contribute technical expertise. Still others handle stakeholder communication. When roles are assigned at the start, teams avoid the “who is doing what?” confusion that extends incidents.
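The mapping from severity to required roles can also be written down ahead of time. The sketch below assumes a five-level scale with 1 as most severe. The role labels and the cut-offs (an incident lead at severity 1 or 2, a communications lead at severity 1) are hypothetical policy choices, not a prescribed rule.

```python
def required_roles(severity: int, affected_services: list[str]) -> list[str]:
    """Return the roles to engage for an incident.

    Assumed policy: every affected service gets its on-call engineer,
    severity 1-2 adds an incident lead, severity 1 adds a communications lead.
    """
    roles = [f"on-call:{service}" for service in affected_services]
    if severity <= 2:
        roles.insert(0, "incident-lead")
    if severity == 1:
        roles.append("communications-lead")
    return roles


print(required_roles(1, ["payments", "checkout"]))
print(required_roles(4, ["payments"]))
```

Keeping this policy in code or configuration means the same incident gets the same staffing no matter who is on call.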
Question 4: What Resources Apply?
Before diving into investigation, check whether existing resources address this scenario. This question prevents reinventing solutions and ensures teams leverage organizational knowledge.
Resources to check:
- Do we have runbooks for this failure type?
- Have we seen similar incidents recently? What resolved them?
- Are there known issues or scheduled changes that might explain this?
- What monitoring dashboards provide relevant visibility?
Runbooks are particularly valuable during incident triage. A runbook for “payment service failures” might include diagnostic queries, common causes with solutions, and escalation contacts. Following documented procedures is faster than improvising under pressure.
Connecting incidents to past similar issues provides historical context. If the payment service failed last month due to database connection pool exhaustion, and today’s symptoms look similar, you have a starting hypothesis without extensive investigation.
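A minimal sketch of that lookup, assuming past incidents are stored with a service name and a set of symptom tags. The record layout, the incident IDs, and the `bad deploy` cause are invented for illustration; a real incident system may expose this search differently.

```python
# Hypothetical incident history; IDs, symptom tags, and causes are examples.
HISTORY = [
    {"id": "INC-101", "service": "payment-service",
     "symptoms": {"500 errors", "slow queries"},
     "cause": "database connection pool exhaustion"},
    {"id": "INC-102", "service": "search-service",
     "symptoms": {"500 errors"},
     "cause": "bad deploy"},
    {"id": "INC-103", "service": "payment-service",
     "symptoms": {"500 errors", "slow queries", "connection timeouts"},
     "cause": "database connection pool exhaustion"},
]


def similar_incidents(service: str, symptoms: set[str], history: list[dict]) -> list[dict]:
    """Return past incidents on the same service, most shared symptoms first."""
    matches = []
    for incident in history:
        shared = symptoms & incident["symptoms"]
        if incident["service"] == service and shared:
            matches.append((len(shared), incident))
    # Sort on the overlap count only, so the dicts themselves are never compared.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [incident for _, incident in matches]


for past in similar_incidents("payment-service", {"500 errors", "connection timeouts"}, HISTORY):
    print(past["id"], past["cause"])
```

The top match is a starting hypothesis to test, not a diagnosis.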
Question 5: Who Needs to Know?
The final question addresses communication. Incidents exist in organizational context, and stakeholders need timely updates even while technical response continues.
Communication considerations:
- Do customers need a status page update?
- Which internal teams should be notified?
- Does leadership require updates at this severity?
- Are there SLA implications requiring customer outreach?
Communication requirements typically scale with severity. Severity 1 incidents might require immediate status page updates, executive notification, and proactive customer communication. Severity 4 issues might only need internal tracking without external updates.
Answering this question early prevents the common pattern where teams become so focused on technical resolution that stakeholders learn about incidents from customer complaints rather than proactive communication.
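To show how communication can scale with severity, the sketch below builds a list of actions from a severity level. The action names and the level at which each one starts are assumptions for illustration; your own communication policy defines the real thresholds.

```python
def communication_plan(severity: int, sla_impacted: bool = False) -> list[str]:
    """Return communication actions for a severity level (1 = most severe).

    The cut-offs are illustrative assumptions, not a recommended standard.
    """
    actions = ["internal tracking"]
    if severity <= 3:
        actions.append("notify affected internal teams")
    if severity <= 2:
        actions.append("status page update")
    if severity == 1:
        actions += ["executive notification", "proactive customer communication"]
    if sla_impacted:
        actions.append("customer outreach for SLA terms")
    return actions


print(communication_plan(1))
print(communication_plan(4))
```

Deciding these actions during triage, rather than mid-investigation, is what keeps stakeholders from hearing about the incident second-hand.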
Putting the Framework into Practice
Document Your Answers
As you work through the five questions, document the answers. This documentation becomes your incident record and provides context for anyone joining the response. Most incident management systems provide structured fields for severity, affected services, and participants that capture these answers formally.
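If your tooling does not offer such fields, a simple record type can serve the same purpose. The sketch below is one possible shape: the `TriageRecord` class and its field names are hypothetical, with one field per golden question.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TriageRecord:
    impact_statement: Optional[str] = None               # Question 1
    severity: Optional[int] = None                       # Question 2
    responders: list[str] = field(default_factory=list)  # Question 3
    resources: list[str] = field(default_factory=list)   # Question 4
    notify: list[str] = field(default_factory=list)      # Question 5

    def unanswered(self) -> list[str]:
        """List the fields that are still empty."""
        missing = []
        if not self.impact_statement:
            missing.append("impact_statement")
        if self.severity is None:
            missing.append("severity")
        for name in ("responders", "resources", "notify"):
            if not getattr(self, name):
                missing.append(name)
        return missing


record = TriageRecord(
    impact_statement="Payment service returning 500 errors on all checkouts",
    severity=1,
)
print(record.unanswered())
```

In this sketch an empty list counts as unanswered, so a responder who finds no applicable runbook would record something like "none found" rather than leave the field blank.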
Time-Box the Assessment
The golden questions should take under five minutes for most incidents. This is assessment, not investigation. You are gathering enough information to make triage decisions, not diagnosing root cause. If you cannot answer a question quickly, note the uncertainty and move forward with your best current understanding.
Update as Understanding Evolves
Initial assessment is not final. As investigation progresses, you may discover the impact is broader than first understood, or that the issue is less severe than symptoms suggested. Update severity, add responders, and adjust communication as the picture clarifies. The initial answers get response started; ongoing assessment refines direction.
Train Teams on the Framework
Frameworks only work when teams use them consistently. Include the golden questions in your incident response training. Run tabletop exercises where teams practice working through the questions for hypothetical scenarios. Make the questions visible during actual incidents through your tooling or incident documentation templates.
Common Assessment Pitfalls
Even with structured questions, teams encounter pitfalls during incident assessment:
- Skipping straight to investigation: The urge to start fixing is strong. Discipline yourself to answer the five questions before diving into logs.
- Anchoring on initial assessment: First impressions can be wrong. If investigation reveals your severity assessment was incorrect, update it rather than defending the original classification.
- Assessment paralysis: Some teams over-analyze during triage. If you cannot determine severity precisely, default to higher severity. It is easier to downgrade than to escalate after delayed response.
- Solo assessment for major incidents: Critical incidents benefit from multiple perspectives during triage. Brief team assessment catches factors that individuals miss under stress.
Building Assessment into Your Process
The golden questions framework works best when integrated into your incident response process rather than treated as optional guidance. Consider:
- Incident templates: Structure your incident creation to prompt for answers to each golden question. Required fields for severity, affected services, and assigned participants enforce the framework.
- Triage checklists: Provide a visible checklist that responders work through during initial assessment. Physical or digital reminders keep the framework present during high-stress moments.
- Post-incident review: Include assessment quality in your post-incident reviews. Did the initial severity match actual impact? Were the right responders engaged immediately? Learning from assessment accuracy improves future triage.
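For the post-incident review item, one simple measure of assessment quality is how often the initial severity matched the final one. The sketch below assumes each incident is recorded as a pair of (initial severity, final severity) on a scale where 1 is most severe. The function name and the input format are hypothetical.

```python
def severity_accuracy(incidents: list[tuple[int, int]]) -> dict[str, float]:
    """Summarize how initial severity compared with final severity.

    Each item is (initial, final), with 1 as the most severe level, so an
    initial number higher than the final one means the incident was first
    rated as less severe than it turned out to be.
    """
    total = len(incidents)
    if total == 0:
        return {"matched": 0.0, "under_classified": 0.0, "over_classified": 0.0}
    matched = sum(1 for initial, final in incidents if initial == final)
    under = sum(1 for initial, final in incidents if initial > final)
    over = sum(1 for initial, final in incidents if initial < final)
    return {
        "matched": matched / total,
        "under_classified": under / total,
        "over_classified": over / total,
    }


print(severity_accuracy([(1, 1), (3, 2), (2, 3), (4, 4)]))
```

A rising share of under-classified incidents suggests the severity criteria are too lenient or hard to apply under pressure.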
The Value of Consistent Triage
Organizations that implement structured incident assessment see measurable improvements in response effectiveness. Consistent triage decisions mean incidents get appropriate response regardless of which engineer is on call. Faster initial assessment means earlier escalation for critical issues and less noise for minor ones.
The golden questions are simple, but their value compounds over hundreds of incidents. Each time a team works through the framework, they build muscle memory for effective assessment. Eventually, the questions become automatic, and fast, accurate triage becomes part of your incident response culture.
When your next incident strikes, resist the urge to immediately start investigating. Take five minutes to answer the golden questions. Understand what is broken, how severe it is, who should respond, what resources apply, and who needs to know. That structured assessment will save far more time than it costs.
