What is the difference between primary and secondary on-call?

Primary on-call receives alerts first and handles initial incident response. Secondary on-call serves as backup—if the primary doesn't acknowledge within a timeout (typically 5-15 minutes), the secondary gets paged. The secondary also provides additional expertise for complex incidents requiring multiple engineers.

Does secondary on-call need to be awake monitoring systems?

No. Secondary responders should be available to respond if paged but don't need to actively monitor unless the primary escalates. Most alerts never reach secondary—they're backup for when primary is unreachable or needs help. Secondary responds only when explicitly paged.

Should secondary on-call be compensated?

Yes. Secondary carries responsibility and must remain available to respond. While their alert volume is typically lower than primary, the availability requirement justifies compensation. Many organizations pay secondary at a reduced rate compared to primary but still recognize the commitment.

How do you decide escalation timeout from primary to secondary?

Common timeouts are 5-15 minutes. Shorter timeouts (5 minutes) ensure faster escalation for critical systems but may page secondary unnecessarily when primary is briefly delayed. Longer timeouts (15 minutes) reduce false escalations but delay response. Match timeout to your incident response SLAs and system criticality.

Primary vs Secondary On-Call: Understanding Backup Coverage

What Are Primary and Secondary On-Call Roles?

In a two-tier on-call model, each shift assigns two responders instead of one:

Primary on-call receives alerts first and handles initial incident response. They’re the first line of defense when something breaks.

Secondary on-call serves as backup. If the primary doesn’t acknowledge an alert within a defined timeframe—typically 5-15 minutes—the secondary gets paged. The secondary also provides additional expertise if incidents require multiple engineers.

This redundancy prevents the most common on-call failure mode: a single engineer being unreachable or overwhelmed when critical issues occur.

Why Use Two Tiers of Coverage?

Prevents Single Points of Failure

When only one person is on call, you’re one missed notification away from unacknowledged incidents. The primary might be:

In a meeting with notifications silenced
On a plane without connectivity
Asleep through an alarm (it happens)
Already handling another incident

Secondary coverage ensures someone responds even when the primary can’t.

Reduces Stress for Primary Responders

Knowing backup exists reduces psychological pressure. The primary can focus on the immediate problem without panic about missing subsequent alerts or needing specialized expertise they don’t have.

This matters especially for junior engineers. Secondary coverage gives them the confidence to take on-call responsibility, knowing that senior engineers provide a safety net.

Enables Complex Incident Response

Major incidents often require multiple people. Secondary on-call provides an immediate escalation path, so nobody has to hunt through Slack for available engineers. When the primary realizes they need help, the secondary is already designated and aware.

How Primary and Secondary Roles Differ

Primary Responsibilities

The primary owns first response:

Acknowledges all alerts within defined SLA (typically 5 minutes)
Investigates root cause using runbooks and monitoring tools
Coordinates initial response if incident requires multiple people
Escalates to secondary when additional expertise or capacity needed
Documents incident timeline from detection through resolution

The primary decides whether to handle solo or pull in backup. Not every alert requires the secondary.

Secondary Responsibilities

The secondary provides backstop coverage:

Responds to escalated alerts that the primary didn’t acknowledge
Remains available to respond if paged
Provides specialized knowledge when primary requests assistance
Takes over coordination if incident severity exceeds primary’s comfort level
Does NOT actively monitor systems unless explicitly paged

Critical distinction: Being secondary doesn’t mean working. It means being reachable if needed. The secondary responds when paged, not continuously.

Equal Readiness, Different Triggers

Both roles require equivalent response time commitments. The only difference is who receives the alert first. This means:

Secondary carries a similar on-call burden to primary
Both must remain reachable and able to respond
Both should understand system architecture and incident procedures
Both receive the same compensation or time-off for on-call duty

Teams sometimes mistakenly treat secondary as a “lighter” responsibility. It’s not. Secondary means “ready to respond immediately if primary needs help,” which requires the same availability commitment.

When to Use Two-Tier Coverage

Not every system needs primary and secondary on-call. Use this model when:

High-Impact Systems

Customer-facing production services where every minute of downtime matters. Financial platforms, healthcare systems, e-commerce checkouts—anything where incidents directly harm users or revenue.

Why: Double coverage provides insurance against response failures for critical systems.

Complex Technical Domains

Systems requiring specialized expertise that single engineers may not possess. Distributed databases, networking infrastructure, legacy monoliths with undocumented quirks.

Why: Secondary provides immediate access to different knowledge domains during diagnosis.

Small Teams Without Redundancy

Teams with only 3-5 engineers where losing one person creates coverage gaps. Startups, specialized infrastructure teams, niche platform groups.

Why: Two-tier rotation with same-size team provides better coverage than attempting 24/7 single-person coverage.

24/7 Business Requirements

Organizations that genuinely cannot tolerate delayed incident response overnight or during weekends. True 24/7 operations, not aspirational ones.

Why: Secondary ensures someone responds even when primary is unreachable across timezone boundaries.

When NOT to Use Two Tiers

Internal Tools Without SLAs

Development environments, internal dashboards, staging systems where delayed response is acceptable. Not everything merits 24/7 redundant coverage.

Alternative: Single-tier business hours coverage, or accept delayed weekend response.

Sufficient Team Size

Large teams (10+ engineers) can provide adequate single-tier coverage through frequent rotation. When any individual is on call only one week per month, adding secondary doubles burden unnecessarily.

Alternative: Rely on documented escalation paths to pull in additional engineers when needed.
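The burden arithmetic behind this case can be sketched in a few lines. This is a minimal illustration that assumes weekly shifts and a perfectly even rotation; `on_call_fraction` is a hypothetical helper name, not part of any scheduling tool.

```python
from fractions import Fraction


def on_call_fraction(team_size: int, people_per_shift: int = 1) -> Fraction:
    """Share of weeks each engineer holds an on-call role.

    Assumes weekly shifts and an even rotation across the whole team.
    """
    if team_size < people_per_shift:
        raise ValueError("team is smaller than the number of people per shift")
    return Fraction(people_per_shift, team_size)


# A ten-person team: single-tier vs. primary/secondary coverage.
single_tier = on_call_fraction(10, people_per_shift=1)  # 1/10 of weeks
two_tier = on_call_fraction(10, people_per_shift=2)     # 1/5 of weeks
```

Under these assumptions, adding a secondary doubles each engineer's share of weeks spent in an on-call role, which is the cost to weigh against the redundancy gained.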

Low Alert Volume

Systems that page fewer than once per week. Two-tier coverage for rare events wastes capacity that could handle other responsibilities.

Alternative: Single tier with clear escalation documentation for the rare complex incident.

Implementing Primary/Secondary Rotation

Configure Concurrent Users

On-call scheduling systems support concurrent users per shift—multiple people assigned simultaneously. Set this to 2 for primary/secondary coverage.

The system automatically rotates both positions through your team. User A serves as primary this week while User B is secondary, then they switch roles or advance to the next pair in rotation.

Define Escalation Timing

Specify how long the system waits before escalating from primary to secondary. Common thresholds:

5 minutes: High-urgency production systems requiring immediate response
10 minutes: Standard operations where brief delay is acceptable
15 minutes: Lower-priority systems with more relaxed SLAs

Shorter timeouts reduce incident duration but increase secondary pages for non-emergencies. Find a balance based on actual SLA requirements, not anxiety.
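One way to keep these thresholds explicit is to store them as configuration rather than tribal knowledge. The sketch below is illustrative only: the urgency labels (`high`, `standard`, `low`) and the function name are assumptions, and the values simply mirror the common thresholds listed above.

```python
from datetime import timedelta

# Hypothetical urgency labels mapped to the common thresholds listed above.
ESCALATION_TIMEOUTS = {
    "high": timedelta(minutes=5),       # high-urgency production systems
    "standard": timedelta(minutes=10),  # brief delay is acceptable
    "low": timedelta(minutes=15),       # lower-priority, relaxed SLAs
}


def escalation_timeout(urgency: str) -> timedelta:
    """Return how long to wait for the primary before paging the secondary."""
    try:
        return ESCALATION_TIMEOUTS[urgency]
    except KeyError:
        raise ValueError(f"unknown urgency: {urgency!r}") from None
```

Failing loudly on an unknown label is a deliberate choice here: a silent default could leave a critical service on a 15-minute timeout.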

Rotate Positions Regularly

Don’t create permanent primary/secondary roles. Everyone should experience both positions equally over time.

Why: Prevents junior engineers from getting stuck in permanent secondary roles while senior engineers always handle primary. Fair distribution means everyone develops both first-response and backup-support skills.

Rotation algorithms should cycle both positions. With weekly rotation, for example, User A might be primary in week 1, serve as User B’s secondary in week 2, and return to primary in week 3.
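A staggered rotation is one simple way to cycle both positions. The sketch below assumes weekly shifts and a fixed team order, and has each week's primary become the following week's secondary; `staggered_rotation` is a hypothetical name, and real scheduling tools may use different strategies.

```python
from typing import List, Tuple


def staggered_rotation(team: List[str], weeks: int) -> List[Tuple[str, str]]:
    """Return (primary, secondary) for each week.

    Each week's primary becomes the next week's secondary, so everyone
    passes through both positions equally over a full cycle.
    """
    if len(team) < 2:
        raise ValueError("need at least two engineers for two-tier coverage")
    schedule = []
    for week in range(weeks):
        primary = team[week % len(team)]
        secondary = team[(week - 1) % len(team)]  # last week's primary
        schedule.append((primary, secondary))
    return schedule


schedule = staggered_rotation(["A", "B", "C"], weeks=4)
# [("A", "C"), ("B", "A"), ("C", "B"), ("A", "C")]
```

With a two-person team the same function reduces to the pair swapping roles every week.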

Communicate Role Expectations

Make explicit what each role entails:

Primary expectations:

Acknowledge all alerts within X minutes
Investigate and attempt resolution before escalating
Page secondary when additional expertise or capacity needed
Document all actions and decisions

Secondary expectations:

Respond to escalated pages within Y minutes
Provide subject matter expertise when requested
Take over incident coordination if primary is overwhelmed
Do NOT proactively monitor unless explicitly asked

Ambiguity about when secondary “should” engage creates confusion during incidents. Clear triggers prevent hesitation and second-guessing.

Managing Alert Escalation Flow

Automatic Escalation

Modern incident management platforms handle this automatically:

Alert fires → Primary receives notification
Primary has 5 minutes to acknowledge
If no acknowledgment → Secondary automatically paged
If neither responds → Escalate to management layer

Configure these timeouts in your alerting system, not as manual procedures. Automation prevents forgotten escalations during high-stress incidents.
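The flow above can be expressed as a small decision function. This is a sketch, not any platform's actual logic: the function name is hypothetical, and the secondary's own timeout before management is paged is an assumed value, since only the primary's 5 minutes is stated above.

```python
from datetime import timedelta
from typing import Optional


def who_to_page(
    elapsed: timedelta,
    acknowledged: bool,
    primary_timeout: timedelta = timedelta(minutes=5),
    secondary_timeout: timedelta = timedelta(minutes=5),  # assumed value
) -> Optional[str]:
    """Return the tier currently responsible for an alert, or None once acknowledged."""
    if acknowledged:
        return None
    if elapsed < primary_timeout:
        return "primary"
    if elapsed < primary_timeout + secondary_timeout:
        return "secondary"
    return "management"
```

For example, an unacknowledged alert that has been open for seven minutes resolves to `"secondary"` under these defaults.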

Manual Escalation

Primary should manually page secondary before automatic timeout when:

Incident severity clearly exceeds single-engineer capability
Specialized knowledge required that primary lacks
Multiple simultaneous incidents need parallel investigation
Primary is already deep in debugging and another alert fires

Don’t wait for automatic escalation when you know you need help. Proactive escalation reduces incident duration.

Escalation to Additional Resources

Secondary isn’t the end of the escalation chain. Define clear paths beyond the on-call pair:

Team lead or manager for major outages
Subject matter experts for specific systems
Executive notification for customer-impacting incidents
Cross-team coordination for dependencies

Document these thresholds explicitly. “Call the CTO” shouldn’t be a judgment call; it should be triggered by objective severity criteria.

Avoiding Common Pitfalls

Treating Secondary as Free Labor

Problem: Managers expect secondary to “keep an eye on things” even when not paged.

Solution: Secondary responds when paged, period. No ambient monitoring, no proactive dashboard checking. If continuous monitoring is required, schedule dedicated shifts and compensate accordingly.

Creating Primary/Secondary Skill Gaps

Problem: Senior engineers always take primary, junior engineers always get secondary, creating permanent skill tiers.

Solution: Rotate everyone through both positions. Junior engineers develop confidence by handling primary. Senior engineers practice backup support and learn to trust teammates.

Unclear Escalation Triggers

Problem: Primary hesitates to page secondary, unsure whether situation “deserves” escalation.

Solution: Define objective escalation criteria. Severity level 1-2 incidents automatically include secondary. Incidents lasting longer than X minutes trigger escalation. Remove judgment and guilt from the decision.
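Objective criteria like these are easy to encode so that nobody has to debate them mid-incident. The sketch below assumes lower severity numbers mean more serious incidents and leaves the time limit as a required parameter, because the right value of X depends on your SLAs; the function and parameter names are hypothetical.

```python
def should_page_secondary(
    severity: int, minutes_open: float, max_solo_minutes: float
) -> bool:
    """Decide whether the secondary should be paged.

    Assumes severity 1 is the most serious. Severity 1-2 incidents always
    include the secondary; anything open longer than the limit escalates.
    """
    return severity <= 2 or minutes_open > max_solo_minutes
```

With a 30-minute limit, for instance, a severity 3 incident open for 45 minutes would escalate, while the same incident at 10 minutes would not.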

Doubling Compensation Burden

Problem: Two-tier coverage doubles the number of people on call, increasing team burden and compensation costs.

Solution: Two-tier coverage replaces single-tier coverage rather than supplementing it. The same pool of engineers rotates through primary and secondary positions. Team size stays constant; the shifts are just structured differently.

Measuring Two-Tier Effectiveness

Track metrics to verify backup coverage actually improves outcomes:

Secondary Page Rate

What: Percentage of incidents that escalate to secondary

Target: 10-20% for well-tuned systems

Interpretation:

Below 5%: Secondary rarely needed, consider single-tier coverage
Above 30%: Primary consistently overwhelmed or undertrained
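As a rough illustration, the rate and its interpretation bands can be computed as follows. The function names are hypothetical, and the label for rates that fall between the stated bands (5-10% and 20-30%) is an assumption, since those ranges are not classified above.

```python
def secondary_page_rate(total_incidents: int, escalated: int) -> float:
    """Percentage of incidents that escalated to the secondary."""
    if total_incidents <= 0:
        raise ValueError("need at least one incident")
    return 100.0 * escalated / total_incidents


def interpret_page_rate(rate_pct: float) -> str:
    """Map a page rate to the interpretation bands described above."""
    if rate_pct < 5:
        return "secondary rarely needed; consider single-tier coverage"
    if rate_pct > 30:
        return "primary consistently overwhelmed or undertrained"
    if 10 <= rate_pct <= 20:
        return "within target"
    return "outside target; keep watching"  # assumed label for the gaps


rate = secondary_page_rate(total_incidents=40, escalated=6)  # 15.0
```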

Primary Acknowledgment Time

What: How quickly primary acknowledges alerts

Target: Under 3 minutes P95

Interpretation: Fast acknowledgment means primary is appropriately available. Slow times suggest either alert fatigue or inadequate notification channels.
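A P95 figure can be computed from raw acknowledgment times in a few lines. The sketch below uses the nearest-rank method, which is one of several percentile definitions; monitoring tools may interpolate differently. The sample data and names are made up for illustration.

```python
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical acknowledgment times in seconds.
ack_seconds = [30, 45, 60, 75, 90, 120, 150, 170, 200, 400]
meets_target = p95(ack_seconds) < 180  # target: under 3 minutes
```

Note how a single slow acknowledgment dominates P95 in a small sample, so the metric is more meaningful over a few weeks of alerts than over a handful.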

Incident Resolution by Tier

What: How many incidents resolve at primary vs requiring secondary

Target: 70-80% resolved by primary alone

Interpretation: Most routine issues should resolve at primary level. Heavy secondary involvement suggests either insufficient primary training or a genuinely complex system requiring multiple engineers.

Team Satisfaction

What: Engineer feedback on two-tier model

Target: Positive sentiment, reduced stress reports

Interpretation: The model should reduce anxiety (knowing backup exists) without feeling like a doubled burden. If satisfaction drops, investigate whether expectations for the secondary role are creeping beyond documented responsibilities.

Tools and Configuration

On-call management platforms support primary/secondary through concurrent user configuration. Look for:

Concurrent shift assignment: Ability to assign 2+ users to single shift simultaneously

Automated escalation rules: Time-based progression from primary to secondary to management

Role-based notification: Different notification methods for primary (immediate) vs secondary (escalation-only)

Fair rotation algorithms: Both positions advance through team equitably over time

Override flexibility: Temporary substitutions for either role without disrupting overall rotation

Platforms like Upstat provide concurrent user assignment with configurable rotation strategies, automated escalation policies based on acknowledgment timeouts, multi-timezone support ensuring global teams can implement follow-the-sun coverage with primary/secondary pairs in each region, and flexible override management allowing temporary coverage adjustments without permanent rotation changes.

Dedicated tooling eliminates manual coordination overhead that makes two-tier coverage impractical using spreadsheets or shared calendars.

Integration with Broader On-Call Strategy

Primary/secondary coverage works best as part of comprehensive on-call design:

Fair rotation algorithms ensure both roles distribute equitably across team members over time.

Holiday and exclusion management maintains two-tier coverage even when individuals take time off.

Follow-the-sun coordination enables primary/secondary pairs in each region rather than overnight global coverage.

Alert quality improvement reduces unnecessary pages that burn out both primary and secondary responders.

Runbook maintenance helps primary resolve routine issues without requiring secondary escalation.

Two-tier coverage is infrastructure, not culture. The infrastructure works when paired with operational maturity.

Final Thoughts

Primary and secondary on-call provides redundancy where it matters: ensuring someone responds when systems fail. This model prevents single points of failure, reduces responder stress, and enables complex incident coordination without hunting for available engineers.

But redundancy has a cost. Two people on call instead of one increases total team burden. Use this model for systems where that cost is justified by business impact or technical complexity. For everything else, single-tier coverage with documented escalation paths provides adequate reliability without doubling on-call load.

When implementing two-tier coverage, maintain equal expectations for both roles. Secondary means “ready to respond when paged,” not “lighter duty” or “passive monitoring.” Both positions require equivalent availability commitments and deserve equivalent compensation.

The goal is reliable incident response that respects team capacity. Primary/secondary achieves this when thoughtfully designed and appropriately applied to systems that genuinely need it.
