When You Need On-Call
On-call becomes necessary when your team needs clear ownership of incident response outside normal working hours. If customers depend on your service 24/7, someone must be responsible for responding when things break at 2 AM. On-call formalizes that responsibility so it’s predictable, fair, and sustainable.
Not every team needs on-call. If your service only operates during business hours, or if brief downtime is acceptable, you might not need formal coverage yet. But once you have paying customers, SLAs, or systems that must stay available, ad-hoc “whoever notices first” response stops working.
The sign you need on-call: you’ve had incidents where nobody knew they were responsible, or where the same person always responds because they happen to check email at night.
The Minimum Viable Rotation
Your first rotation doesn’t need to be perfect. It needs to establish clear ownership and be sustainable enough that your team doesn’t burn out before you can iterate.
Essential Decisions
Team size: You need at least three engineers for sustainable on-call. Fewer than three creates an unsustainable burden: with two people, each person is on call half the time. With three, you can cover vacations and sick days and still leave reasonable gaps between shifts.
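The arithmetic behind the three-person minimum can be sketched as follows. This is an illustrative calculation that assumes an evenly split rotation of week-long shifts; the function names are hypothetical.

```python
from fractions import Fraction


def on_call_share(team_size: int) -> Fraction:
    """Share of weeks each engineer is on call in an evenly split rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return Fraction(1, team_size)


def weeks_off_between_shifts(team_size: int) -> int:
    """Full weeks off between one engineer's consecutive week-long shifts."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return team_size - 1


for size in (2, 3, 4):
    print(f"{size} engineers: on call {on_call_share(size)} of the time, "
          f"{weeks_off_between_shifts(size)} week(s) off between shifts")
```

With two engineers there is only one week off between shifts, and that gap disappears as soon as either person is on vacation or sick.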
Coverage hours: Decide whether you need 24/7 coverage or just extended hours. Many teams start with “business hours plus evenings” rather than full overnight coverage. Be honest about what your service actually requires.
Shift duration: Start with week-long shifts. Daily shifts create too many handoffs and make it hard to plan around on-call. Two-week shifts risk burnout. Week-long (Monday to Monday) provides predictability while limiting individual exposure.
Rotation order: For your first rotation, simple sequential works fine—Person A, then Person B, then Person C, repeat. You can optimize for fairness later.
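A minimal sketch of these defaults (week-long shifts, Monday to Monday, simple sequential order) might look like this. The function name and the example start date are assumptions chosen for illustration.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def sequential_weeks(engineers, first_monday, weeks):
    """Return (start, end, person) tuples for week-long, Monday-to-Monday shifts."""
    if not engineers:
        raise ValueError("need at least one engineer")
    if first_monday.weekday() != 0:
        raise ValueError("first_monday must be a Monday")
    schedule = []
    # cycle() repeats A, B, C, A, B, C ...; islice() stops after `weeks` entries.
    for index, person in enumerate(islice(cycle(engineers), weeks)):
        start = first_monday + timedelta(weeks=index)
        schedule.append((start, start + timedelta(weeks=1), person))
    return schedule


for start, end, person in sequential_weeks(["A", "B", "C"], date(2025, 1, 6), 4):
    print(f"{start} to {end}: {person}")
```

Each shift ends on the Monday the next one starts, so the handoff always falls on the same day.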
Choosing Your First Rotation Type
Three rotation strategies work for most teams. Pick based on your priorities.
Sequential Rotation
Users rotate in a fixed order: A, B, C, A, B, C. This is the simplest option, and it works well for small teams where everyone’s availability is similar.
Best for: Teams under five people where simplicity matters more than perfect fairness.
Trade-off: Users always get the same day of the week. If rotation starts on Monday, Person A always has Monday, Person B always has Tuesday. This creates permanent inconveniences for some team members.
Weekly Rotation
Each user’s position advances by one day each week. If Person A has Monday this week, they get Tuesday next week and Wednesday the week after. Over time, everyone experiences every day, including weekends.
Best for: Teams that want fair weekend distribution without complex algorithms.
Trade-off: Slightly more cognitive overhead since the schedule changes weekly.
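The difference between sequential and weekly rotation is easier to see side by side. The sketch below assumes daily shifts and a sequential order that restarts every Monday; both function names are hypothetical, and this is not any particular tool's implementation.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def sequential_week(team):
    """Fixed order that restarts every Monday (assumed), so days never move."""
    return {day: team[i % len(team)] for i, day in enumerate(DAYS)}


def weekly_rotation_week(team, week):
    """Same order, but each person's slot moves one day later every week."""
    return {day: team[(i - week) % len(team)] for i, day in enumerate(DAYS)}


team = ["A", "B", "C", "D", "E", "F", "G"]
print("sequential:", sequential_week(team))
for week in range(3):
    print(f"weekly, week {week}:", weekly_rotation_week(team, week))
```

In the sequential version Person F holds every Saturday. In the weekly version the Saturday slot passes to a different person each week.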
Fair Distribution
Shifts spread evenly across scheduled days, maximizing time between each user’s shifts. The algorithm optimizes for balanced workload rather than simple ordering.
Best for: Larger teams or those with variable schedules who want to maximize recovery time between shifts.
Trade-off: Less predictable than fixed ordering. Users can’t easily calculate when their next shift is.
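One way to approximate fair distribution is a greedy pass: give each shift to the available person with the fewest shifts so far, and break ties in favor of whoever has rested longest. This is a simplified heuristic under those assumptions, not the algorithm any specific product uses; the `unavailable` parameter is a hypothetical way to model variable schedules.

```python
def fair_distribution(team, num_shifts, unavailable=None):
    """Greedy sketch: fewest shifts first, then longest time since last shift."""
    unavailable = unavailable or {}
    counts = {person: 0 for person in team}
    last_shift = {person: -1 for person in team}  # -1 means "not on call yet"
    schedule = []
    for slot in range(num_shifts):
        candidates = [p for p in team if slot not in unavailable.get(p, set())]
        if not candidates:
            raise ValueError(f"nobody is available for shift {slot}")
        chosen = min(candidates, key=lambda p: (counts[p], last_shift[p]))
        counts[chosen] += 1
        last_shift[chosen] = slot
        schedule.append(chosen)
    return schedule


# Hypothetical example: A cannot take the first shift.
schedule = fair_distribution(["A", "B", "C"], 6, unavailable={"A": {0}})
print(schedule)
```

The result is balanced (two shifts each here), but the order depends on availability, which is why users cannot easily predict their next shift.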
Recommendation: Start with sequential for your first rotation. It’s the easiest to understand and explain to your team. Once you’ve run for a few months and understand your actual patterns, consider switching to weekly rotation for fairer weekend distribution.
Configuring Shift Coverage
Two decisions shape how your rotation generates shifts.
Shift Duration
How many hours is each shift? Common options:
- 24 hours: One person covers an entire day. Simplest scheduling, but means overnight responsibility.
- 12 hours: Split day and night shifts. Better for teams wanting to separate daytime and overnight.
- 8 hours: Three shifts per day. More handoffs but limits individual exposure.
For your first rotation, start with 24-hour shifts, so whoever is on call covers the full day instead of handing off between day and night. They’re the simplest to schedule and understand. You can refine to shorter shifts later if overnight burden becomes a problem.
Coverage Days
Which days of the week need coverage? Options range from weekdays only (Monday through Friday) to full seven-day coverage.
Start with what your service actually needs. If weekend incidents are rare and can wait until Monday, don’t burden your team with weekend coverage just because “real on-call” includes weekends. You can expand coverage as your service grows.
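The two settings above, shift duration and coverage days, are enough to generate the empty shift slots that a rotation then fills. The sketch below is an assumption about how that could work; `generate_shifts` and the two constants are hypothetical names.

```python
from datetime import date, datetime, time, timedelta

WEEKDAYS_ONLY = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4
ALL_DAYS = set(range(7))


def generate_shifts(start, days, shift_hours, coverage_days):
    """Return (begin, end) datetimes for every shift on a covered day."""
    if shift_hours <= 0 or 24 % shift_hours != 0:
        raise ValueError("shift_hours must divide 24 evenly (e.g. 8, 12, 24)")
    shifts = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        if day.weekday() not in coverage_days:
            continue  # no coverage needed on this day
        midnight = datetime.combine(day, time.min)
        for hour in range(0, 24, shift_hours):
            begin = midnight + timedelta(hours=hour)
            shifts.append((begin, begin + timedelta(hours=shift_hours)))
    return shifts


week = generate_shifts(date(2025, 1, 6), 7, 24, WEEKDAYS_ONLY)
print(len(week), "shifts for weekday-only, 24-hour coverage")
```

Switching to 8-hour shifts with seven-day coverage produces 21 slots per week instead of 5, which shows how quickly handoffs multiply.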
Building Your First Schedule
With decisions made, here’s how to create the actual schedule.
Step 1: List Your Team
Identify who will participate in on-call. For your first rotation, include only engineers who can actually respond to incidents—they need system access, relevant knowledge, and authority to make changes.
Don’t include: New hires who haven’t completed onboarding, engineers unfamiliar with the systems, or people who’ll consistently need to escalate every incident.
Step 2: Set the Start Date
Pick a Monday for your rotation to begin. Give everyone at least a week’s notice before their first shift. People need time to understand what’s expected and arrange their lives around on-call responsibility.
Step 3: Define Escalation
Who gets paged if the primary on-call doesn’t respond? At minimum, establish a secondary contact or escalation path. This might be the next person in rotation, a manager, or a senior engineer who can always be reached.
Keep escalation simple for your first rotation: page the primary on-call, then escalate to the secondary if there is no response after 10 to 15 minutes. You can build more sophisticated escalation policies later.
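A two-step escalation policy like this can be modeled in a few lines. The class and function names here are hypothetical, and the 15-minute timeout is just one value from the suggested range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    after_minutes: int  # minutes without acknowledgement before paging


def contacts_paged(policy, minutes_unacknowledged):
    """Return everyone who should have been paged by now, in order."""
    return [step.contact for step in policy
            if minutes_unacknowledged >= step.after_minutes]


policy = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 15),
]
print(contacts_paged(policy, 5))
print(contacts_paged(policy, 20))
```

Adding a third tier later means appending one more step to the list rather than redesigning the policy.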
Step 4: Document Everything
Write down:
- Who’s in the rotation and in what order
- What hours each shift covers
- How to get paged (which tool, which channel)
- When to escalate and to whom
- How to hand off to the next person
This documentation prevents confusion when someone joins the team or when you’re too tired at 3 AM to remember the process.
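If the documentation lives in a structured file, a small check can flag sections that are still empty. The sketch below assumes a plain dictionary with one key per item in the list above; the key names and example values are hypothetical.

```python
REQUIRED_SECTIONS = {
    "members": "who is in the rotation and in what order",
    "shift_hours": "what hours each shift covers",
    "paging": "how to get paged (which tool, which channel)",
    "escalation": "when to escalate and to whom",
    "handoff": "how to hand off to the next person",
}


def missing_sections(doc: dict) -> list[str]:
    """Return the required sections that are absent or empty."""
    return [key for key in REQUIRED_SECTIONS if not doc.get(key)]


# Hypothetical draft with two sections still unwritten.
draft = {
    "members": ["A", "B", "C"],
    "shift_hours": "Monday 09:00 to the following Monday 09:00",
    "paging": "",
    "escalation": "secondary after 15 minutes",
}
for key in missing_sections(draft):
    print(f"Missing: {REQUIRED_SECTIONS[key]}")
```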
Common First-Rotation Mistakes
Starting Too Small
Two-person rotations seem reasonable: you each take half the weeks. In practice, this creates an unsustainable burden. One vacation means the other person covers for two weeks straight. One sick day means scrambling for coverage.
Fix: Wait until you have at least three people, or accept that your “on-call” is really just one person with informal backup.
Overcomplicating Early
Your first rotation doesn’t need primary and secondary coverage, follow-the-sun across timezones, tiered escalation with different policies per service, and automated runbook integration. It needs someone to answer when things break.
Fix: Start simple. Add complexity only when you hit real problems that simpler approaches can’t solve.
Ignoring Recovery Time
Assigning the same person to on-call right after they handled a major incident guarantees burnout. Being woken at 3 AM, resolving an incident, and then starting a new on-call shift the next day isn’t sustainable.
Fix: Build in recovery. If someone handled a significant incident during their shift, consider skipping their next rotation or providing a day off.
No Handoff Process
When shifts change with no communication, context gets lost. The new on-call discovers ongoing issues by getting paged about them again. Time gets wasted re-investigating problems the previous person already understood.
Fix: Establish handoff expectations. Even a brief Slack message listing “things to watch” and “ongoing issues” dramatically improves continuity.
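A handoff message can be as small as a template. The helper below is a hypothetical sketch that formats the two lists into plain text for posting in a chat channel; the example issue is invented.

```python
def handoff_note(outgoing, incoming, ongoing_issues, things_to_watch):
    """Format a short plain-text handoff message."""
    def bullets(items):
        # Say "none" explicitly so an empty section is not mistaken for an omission.
        return [f"- {item}" for item in items] or ["- none"]

    lines = [f"On-call handoff: {outgoing} -> {incoming}", "Ongoing issues:"]
    lines += bullets(ongoing_issues)
    lines.append("Things to watch:")
    lines += bullets(things_to_watch)
    return "\n".join(lines)


print(handoff_note("A", "B", [], ["disk usage on db-1"]))
```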
Permanent Weekend Assignment
Sequential rotation starting on the same day each week means the same person always gets weekends. This creates resentment and drives people away from teams.
Fix: Use weekly rotation that advances each person’s position, or manually adjust the rotation start occasionally to distribute weekend burden.
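Counting weekend shifts per person makes the imbalance visible before it causes resentment. This sketch assumes daily shifts and a schedule stored as a date-to-person mapping; the names and the seven-person team are illustrative.

```python
from collections import Counter
from datetime import date, timedelta


def weekend_counts(assignments):
    """Count Saturday and Sunday shifts per person."""
    return Counter(person for day, person in assignments.items()
                   if day.weekday() >= 5)


team = ["A", "B", "C", "D", "E", "F", "G"]
start = date(2025, 1, 6)  # a Monday

# Fixed order: the same person holds the same weekday every week.
fixed = {start + timedelta(days=i): team[i % 7] for i in range(28)}
# Advancing order: each person's slot moves one day later every week.
rotating = {start + timedelta(days=i): team[(i % 7 - i // 7) % 7]
            for i in range(28)}

print("fixed:   ", weekend_counts(fixed))
print("rotating:", weekend_counts(rotating))
```

Over four weeks the fixed order gives all eight weekend shifts to two people, while the advancing order spreads them across five.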
When to Use Tooling
Your first rotation can run on a shared calendar or spreadsheet. But limitations appear quickly.
Signs you need dedicated tooling:
- Manual schedule management takes significant time
- Overrides and swaps create confusion about who’s actually on-call
- Integrating with alerting tools requires manual configuration
- You’re making mistakes about who’s on-call during incidents
- Team growth means schedule complexity outpaces manual tracking
On-call management tools like Upstat automate rotation generation, handle overrides cleanly, integrate with alerting systems, and provide visibility into current coverage. They also support features like holiday exclusions, timezone handling, and fair distribution algorithms that would be impractical to manage manually.
Start manual and switch when the overhead becomes a problem. For a three-person team with simple sequential rotation, a shared calendar works fine. Once you have six people, multiple services, and override requests every week, tooling pays for itself in time saved.
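Part of what makes overrides confusing in a spreadsheet is that "who is on call right now" depends on two sources. The lookup below is a minimal sketch of that rule, assuming overrides always take precedence over the base schedule; the `Shift` type and function name are hypothetical and do not reflect any tool's API.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Shift(NamedTuple):
    start: datetime
    end: datetime
    person: str


def current_on_call(now, schedule, overrides=()) -> Optional[str]:
    """Return who is on call at `now`; overrides win over the base schedule."""
    for shift in list(overrides) + list(schedule):
        if shift.start <= now < shift.end:
            return shift.person
    return None  # a coverage gap


schedule = [Shift(datetime(2025, 1, 6), datetime(2025, 1, 13), "A")]
overrides = [Shift(datetime(2025, 1, 8, 18), datetime(2025, 1, 9, 6), "B")]
print(current_on_call(datetime(2025, 1, 8, 20), schedule, overrides))
```

Returning `None` for a gap is deliberate: a missing answer is easier to alert on than a stale one.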
Your First Week
After launching your rotation, pay attention to these signals.
Watch for Alert Volume
Is the on-call person getting paged constantly, or sitting quietly? High alert volume suggests monitoring needs tuning. Zero alerts might mean your monitoring isn’t catching real issues.
Healthy first week: A few actionable alerts, mostly during business hours, with clear paths to resolution.
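A quick tally of when alerts fired shows whether the week matched that picture. The sketch assumes you can export alert timestamps from your alerting tool; the 09:00 to 17:00 weekday window is an assumed definition of business hours.

```python
from datetime import datetime


def alert_summary(alert_times, business_start=9, business_end=17):
    """Split alerts into business-hours and off-hours counts."""
    total = len(alert_times)
    in_hours = sum(
        1 for t in alert_times
        if t.weekday() < 5 and business_start <= t.hour < business_end
    )
    return {"total": total, "business_hours": in_hours,
            "off_hours": total - in_hours}


alerts = [
    datetime(2025, 1, 6, 10, 0),   # Monday morning
    datetime(2025, 1, 6, 3, 0),    # Monday, 3 AM
    datetime(2025, 1, 11, 12, 0),  # Saturday noon
]
print(alert_summary(alerts))
```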
Check Response Times
When alerts fire, how quickly does the on-call respond? Delays might indicate confusion about process, inability to access systems, or alerts going to the wrong place.
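Response time can be measured as the gap between an alert firing and someone acknowledging it. The sketch below assumes you have both timestamps for each alert; the 15-minute threshold for a "slow" response is an assumption chosen to match a typical escalation timeout.

```python
from datetime import datetime
from statistics import median


def ack_minutes(alerts):
    """Minutes between firing and acknowledgement for each (fired, acked) pair."""
    return [(acked - fired).total_seconds() / 60 for fired, acked in alerts]


def response_report(alerts, slow_after_minutes=15):
    """Median acknowledgement delay and the number of slow responses."""
    delays = ack_minutes(alerts)
    return {
        "median_minutes": median(delays) if delays else None,
        "slow": sum(1 for d in delays if d > slow_after_minutes),
    }


history = [
    (datetime(2025, 1, 6, 10, 0), datetime(2025, 1, 6, 10, 2)),
    (datetime(2025, 1, 7, 3, 0), datetime(2025, 1, 7, 3, 20)),
    (datetime(2025, 1, 8, 14, 0), datetime(2025, 1, 8, 14, 5)),
]
print(response_report(history))
```

The median is used rather than the mean so that one missed overnight page does not hide an otherwise healthy week.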
Gather Feedback
After the first complete rotation cycle, ask each participant what worked and what didn’t. Where did they get stuck? What documentation was missing? What felt unfair or unsustainable?
Use this feedback to improve before problems become entrenched.
Growing Your Rotation
Once basic on-call is working, common next steps include:
Adding secondary coverage: A backup person who gets paged if the primary doesn’t respond within a timeout. This adds safety without significantly increasing burden.
Shadow programs: New engineers join on-call shifts as observers before taking primary responsibility. They learn the process while experienced engineers handle actual incidents.
Service-specific rotations: As your platform grows, different teams might own different services with separate on-call schedules rather than one rotation covering everything.
Follow-the-sun coverage: Teams distributed across timezones hand off coverage so nobody takes overnight shifts. Requires coordination but dramatically improves quality of life.
Each evolution builds on the foundation of your first rotation. Get the basics working, gather feedback, and iterate.
Final Thoughts
Your first on-call rotation establishes the foundation for incident response culture. It doesn’t need to be sophisticated—it needs to be clear, fair, and sustainable.
Start with the minimum: three people, week-long shifts, sequential rotation, simple escalation. Document everything. Gather feedback after each cycle. Add complexity only when real problems require it.
The teams that succeed at on-call aren’t the ones with the most sophisticated scheduling algorithms. They’re the ones that treat on-call as an ongoing practice to improve rather than a one-time configuration to set and forget.
Build your first rotation, see what works, and iterate. The goal is reliable coverage that your team can sustain for years—not perfection on day one.
