SRE On-Call Handover Checklist

Effective on-call handovers prevent context loss and ensure operational continuity. This checklist provides a scannable reference for what to cover during every shift transition, from active incidents to system state and upcoming events.

Why Handover Checklists Matter

On-call handover checklists ensure that critical operational context transfers completely between engineers during shift transitions. Without a structured checklist, important information gets lost, ongoing incidents fall through the cracks, and incoming engineers waste time rediscovering what their predecessors already knew.

The difference between smooth operations and chaotic response often comes down to handover quality. A checklist transforms handovers from informal conversations into systematic knowledge transfer, ensuring every transition covers essential categories regardless of time pressure or fatigue.

This checklist provides a scannable reference you can use during every shift change. Print it, bookmark it, or adapt it to your team’s specific needs.

Active Incidents Checklist

Before handing off responsibility, verify you have covered every active incident:

For each ongoing incident, document:

  • Current severity level and customer impact
  • Investigation steps already completed
  • Current working theory about root cause
  • Next troubleshooting actions planned
  • Blocking issues preventing resolution
  • Subject matter experts already consulted
  • Stakeholders who have been notified
  • Commitments made about resolution timing

Verification questions:

  • Are there any incidents you investigated but did not formally declare?
  • Are there any alerts currently suppressed or silenced that need attention?
  • Did you escalate anything that has not yet been resolved?

Even low-severity incidents need documentation. What seems straightforward to you might confuse an incoming engineer who is unfamiliar with that specific failure mode.
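
One way to make the per-incident fields above hard to skip is to capture them in a small structured record. The following is a minimal sketch, not a prescribed format: the class name `ActiveIncidentNote` and all field names are hypothetical and simply mirror the bullets in this section.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ActiveIncidentNote:
    """One handover entry per ongoing incident (names are illustrative)."""
    title: str
    severity: str = ""
    customer_impact: str = ""
    steps_completed: list[str] = field(default_factory=list)
    working_theory: str = ""
    next_actions: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    experts_consulted: list[str] = field(default_factory=list)
    stakeholders_notified: list[str] = field(default_factory=list)
    commitments: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: a note that is only partly filled in.
note = ActiveIncidentNote(
    title="Checkout latency",
    severity="SEV3",
    customer_impact="Slow checkout for some users",
)
print(note.missing_fields())
```

In this sketch an empty field counts as missing, so if there really are no blockers or commitments, record that explicitly (for example `["none"]`) rather than leaving the field blank.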

Recently Resolved Incidents Checklist

Incidents resolved during your shift can resurface. Transfer this context:

For each recently resolved incident:

  • Brief description of what happened
  • Resolution actions taken
  • Verification steps confirming the fix
  • Potential recurrence indicators to watch
  • Follow-up work required (if any)

Time window: Cover incidents resolved within the last 4-6 hours. Recent problems can return, and incoming engineers need context if similar symptoms reappear.

Why this matters: When incoming engineers encounter returning symptoms, understanding recent resolutions helps them distinguish recurrence from new problems, enabling faster pattern recognition.
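
If resolved incidents are available as structured data, the time window can be applied mechanically. This is a sketch under assumptions: the incident dictionaries, their `title` and `resolved_at` keys, and the function name `recently_resolved` are all hypothetical, and the default uses the upper end of the 4-6 hour window.

```python
from datetime import datetime, timedelta, timezone


def recently_resolved(incidents, now, window_hours=6):
    """Return incidents resolved within the last `window_hours`, newest first."""
    cutoff = now - timedelta(hours=window_hours)
    recent = [i for i in incidents if cutoff <= i["resolved_at"] <= now]
    return sorted(recent, key=lambda i: i["resolved_at"], reverse=True)


# Example data (invented for illustration).
now = datetime(2024, 1, 10, 18, 0, tzinfo=timezone.utc)
incidents = [
    {"title": "Queue backlog", "resolved_at": now - timedelta(hours=5, minutes=30)},
    {"title": "API 5xx spike", "resolved_at": now - timedelta(hours=2)},
    {"title": "Cert renewal failure", "resolved_at": now - timedelta(hours=9)},
]
for incident in recently_resolved(incidents, now):
    print(incident["title"])
```

With this sample data, the incident resolved nine hours ago falls outside the window and is left out of the handover list.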

System Health Checklist

Beyond specific incidents, transfer broader operational context:

Current state observations:

  • Overall system health assessment (normal, degraded, attention needed)
  • Error rates compared to normal baselines
  • Resource utilization patterns (CPU, memory, network trends)
  • Services that are degraded but stable
  • Known anomalies being tracked but not yet actionable
  • Traffic patterns or unusual load characteristics

Dashboard review:

  • Key metrics showing any concerning trends
  • Alerts in warning state that have not yet fired
  • Services approaching capacity limits

This broader context helps incoming engineers quickly distinguish abnormal behavior from expected patterns. For comprehensive guidance on building on-call practices around these observations, see the Complete Guide to On-Call Management.

Recent Changes Checklist

Recent deployments and configuration changes are among the most common sources of new problems:

Document all changes during your shift:

  • Services or features deployed
  • Configuration changes applied
  • Infrastructure modifications
  • Feature flags enabled or disabled
  • Database migrations executed
  • Scaling actions taken

For each change, note:

  • Current deployment status (stable, monitoring, partial rollout)
  • Rollback readiness if problems emerge
  • Expected behavior changes or known side effects

Critical detail: If 25 percent of traffic routes to a new service version, the incoming engineer needs to know this before investigating performance differences across request segments.
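
A change log kept as structured records makes partial rollouts hard to miss. The sketch below assumes a hypothetical `ChangeRecord` type; its fields and the status strings are taken from the bullets above, and the example services are invented.

```python
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    service: str
    description: str
    status: str               # "stable", "monitoring", or "partial rollout"
    traffic_percent: int = 100
    rollback_ready: bool = False


def changes_needing_attention(changes):
    """Summarize every change that is not yet stable at 100% of traffic."""
    lines = []
    for c in changes:
        if c.status != "stable" or c.traffic_percent < 100:
            rollback = "rollback ready" if c.rollback_ready else "NO rollback plan"
            lines.append(
                f"{c.service}: {c.description} "
                f"({c.status}, {c.traffic_percent}% of traffic, {rollback})"
            )
    return lines


# Example data (invented for illustration).
changes = [
    ChangeRecord("checkout-api", "v2 pricing engine", "partial rollout", 25, True),
    ChangeRecord("search", "index config update", "stable"),
    ChangeRecord("auth", "session timeout change", "monitoring"),
]
for line in changes_needing_attention(changes):
    print(line)
```

Only the partial rollout and the change still under monitoring appear in the summary, which is the short list the incoming engineer should read first.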

Temporary Fixes Checklist

Engineers under pressure often implement workarounds that need follow-up:

Document every temporary measure:

  • Services restarted to clear memory issues
  • Caches manually cleared
  • Features temporarily disabled
  • Manual processes covering automated system failures
  • Configuration changes applied as tactical fixes
  • Workarounds masking underlying problems

For each temporary fix:

  • What underlying issue requires proper resolution
  • How long the temporary fix is expected to hold
  • What monitoring indicates if the fix is failing
  • Who owns the permanent solution

Why this matters: Without explicit documentation, incoming engineers do not know these workarounds exist. Services appear healthy while running on stopgap fixes that could fail unexpectedly.
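
Recording how long each workaround is expected to hold lets you flag the ones likely to fail during the next shift. This is a rough sketch: the `TemporaryFix` type, its field names, and the example entries are hypothetical, and the expected hold time is only ever an estimate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TemporaryFix:
    description: str
    underlying_issue: str
    applied_at: datetime
    expected_to_hold: timedelta
    failure_signal: str
    owner: str

    @property
    def expires_at(self) -> datetime:
        return self.applied_at + self.expected_to_hold


def at_risk_before(fixes, shift_end):
    """Return fixes whose expected hold time runs out by the end of the shift."""
    return [f for f in fixes if f.expires_at <= shift_end]


# Example data (invented for illustration).
utc = timezone.utc
fixes = [
    TemporaryFix("Restarted worker pool", "memory leak in workers",
                 datetime(2024, 1, 10, 16, 0, tzinfo=utc), timedelta(hours=6),
                 "worker memory above 90%", "platform team"),
    TemporaryFix("Disabled export feature", "export job overloads database",
                 datetime(2024, 1, 10, 17, 0, tzinfo=utc), timedelta(hours=24),
                 "database CPU alerts", "data team"),
]
shift_end = datetime(2024, 1, 11, 2, 0, tzinfo=utc)
for fix in at_risk_before(fixes, shift_end):
    print(f"{fix.description}: watch for {fix.failure_signal} (owner: {fix.owner})")
```

Here only the worker restart is flagged, because its estimated six-hour hold ends before the incoming shift does.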

Upcoming Events Checklist

Alert incoming engineers to planned activities:

Scheduled events:

  • Maintenance windows during their shift
  • Planned deployments by other teams
  • Known traffic spikes (marketing campaigns, product launches)
  • External dependencies with announced maintenance
  • Holiday or weekend traffic pattern changes

For each event:

  • Expected timing
  • Anticipated impact on systems
  • Contacts responsible for the event
  • Rollback or cancellation procedures if needed

Advance awareness prevents confusion when expected changes occur and helps incoming engineers prepare mentally for anticipated load or disruption.

Access Verification Checklist

Before the outgoing engineer departs, verify the incoming engineer can access:

Essential systems:

  • Monitoring dashboards and alert systems
  • Production infrastructure (VPN, SSH, cloud consoles)
  • Incident management platform
  • Communication channels (team chat, paging systems)
  • Password vaults and credential management
  • Runbook and documentation repositories

Test verification:

  • Send a test alert to confirm paging works
  • Check that the roster configuration shows the correct assignment
  • Confirm the backup escalation path in case the primary fails

Why verify now: Access problems discovered mid-incident create dangerous delays. Five minutes of verification during handover prevents hours of scrambling later.
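
The verification pass can be scripted as a list of named checks, so that nothing is skipped under time pressure. The sketch below is tool-agnostic and purely illustrative: `run_access_checks` is a hypothetical helper, and the lambdas are placeholders standing in for whatever real checks your team has (or for a manual yes/no answer from the incoming engineer).

```python
def run_access_checks(checks):
    """Run each named check and return the names that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A check that errors out is treated the same as a failed check.
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Placeholder checks: replace each lambda with a real probe or manual answer.
checks = {
    "monitoring dashboards": lambda: True,
    "VPN and SSH": lambda: True,
    "incident management platform": lambda: True,
    "paging test alert received": lambda: False,
}
failed = run_access_checks(checks)
print(failed)
```

Any name in the returned list is an access problem to resolve before the outgoing engineer leaves.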

Communication Transfer Checklist

Ensure communication continuity:

Stakeholder status:

  • Who has been notified about ongoing situations
  • What information they received
  • Questions they asked
  • Commitments made about updates or resolution
  • Who expects the next communication and when

Channel status:

  • Active threads requiring follow-up
  • Questions awaiting answers
  • External parties expecting callbacks

This prevents incoming engineers from accidentally contradicting previous communications or missing stakeholders who require updates.

Handover Verification Checklist

Complete handover with explicit verification:

Outgoing engineer confirms:

  • All checklist categories have been covered
  • Written documentation is complete and accessible
  • No critical information remains undocumented

Incoming engineer confirms:

  • Understands current operational state
  • Knows immediate priorities and next actions
  • Has all necessary access verified
  • Accepts responsibility for on-call duty

Formal acknowledgment: Use explicit language like “I have reviewed the handover, understand current status, and am taking ownership of on-call responsibility. You are clear to hand off.”

This formal acknowledgment prevents ambiguity about who currently owns response responsibility, which is critical if issues emerge during the transition.
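
If your team records handovers in a tool or script, the two confirmation lists above can gate the transfer of ownership. This is a minimal sketch under assumptions: the function names are hypothetical, and the confirmation strings are paraphrased from the lists in this section.

```python
OUTGOING_CONFIRMATIONS = (
    "all checklist categories covered",
    "written documentation complete and accessible",
    "no critical information undocumented",
)
INCOMING_CONFIRMATIONS = (
    "understands current operational state",
    "knows immediate priorities and next actions",
    "access verified",
    "accepts on-call responsibility",
)


def outstanding_confirmations(outgoing_done, incoming_done):
    """List every confirmation that has not been given yet."""
    missing = [f"outgoing: {c}" for c in OUTGOING_CONFIRMATIONS if c not in outgoing_done]
    missing += [f"incoming: {c}" for c in INCOMING_CONFIRMATIONS if c not in incoming_done]
    return missing


def handover_complete(outgoing_done, incoming_done):
    """Ownership transfers only when nothing is outstanding."""
    return not outstanding_confirmations(outgoing_done, incoming_done)


# Example: the incoming engineer has not yet accepted responsibility.
outgoing_done = set(OUTGOING_CONFIRMATIONS)
incoming_done = set(INCOMING_CONFIRMATIONS[:3])
print(outstanding_confirmations(outgoing_done, incoming_done))
print(handover_complete(outgoing_done, incoming_done))
```

Until the final acceptance is recorded, the outgoing engineer remains the owner, which keeps responsibility unambiguous during the transition.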

Post-Handover Checklist

After formal transfer:

Outgoing engineer:

  • Remain available via chat for 30-60 minutes
  • Answer follow-up questions as the incoming engineer encounters documented situations
  • Provide quick clarification while context remains fresh

Incoming engineer:

  • Review written documentation thoroughly
  • Walk through key dashboards independently
  • Identify any remaining questions
  • Confirm you can respond if alerts fire

Adapting This Checklist

Every team operates differently. Adapt this checklist to your context:

Add items for:

  • Team-specific systems or services
  • Compliance requirements (audit logging, change documentation)
  • Regional handoffs with timezone considerations
  • Customer-specific arrangements requiring awareness

Remove items if:

  • Your team does not use certain categories
  • Automation handles specific transfer tasks
  • Other documentation captures the information

Review quarterly: Update the checklist as systems evolve, new services launch, or team feedback identifies gaps.

Using Tools to Support Handovers

Checklists work best when supported by appropriate tooling:

What helps:

  • Centralized incident documentation visible to all team members
  • Roster visibility showing exactly when shifts change
  • Structured templates prompting comprehensive information capture
  • Real-time status tracking that persists across shifts

Platforms like Upstat provide incident activity timelines, participant tracking, and roster management that support handover workflows. When incoming engineers can see complete incident history and current roster state in one place, handover conversations focus on context rather than hunting for basic information.

What to avoid:

  • Scattered documentation across chat, wikis, and personal notes
  • Reliance on verbal-only handovers without written backup
  • Assumptions that information will be obvious or remembered

Final Verification

Before completing any handover, ask yourself:

  1. Could someone unfamiliar with my shift understand current state from my documentation?
  2. Have I covered everything I would want to know if I were starting this shift?
  3. Is there anything I am assuming the incoming engineer already knows?

If you answer “no” to either of the first two questions, or “yes” to the third, add the missing context before signing off.

Quality handovers are not administrative overhead. They are operational insurance that maintains reliability when responsibility transfers between people. Use this checklist consistently, and your team will experience smoother transitions, faster incident response, and reduced context loss across every shift change.
