
Toil in SRE: Identification and Reduction

Toil is the repetitive, manual work that scales with your service but adds no lasting value. Learn how SRE teams identify toil, measure its impact, and systematically eliminate it through automation and process improvement.

August 15, 2025
sre

Introduction

After five years of steady decline, operational toil is rising again, climbing to 30% from 25% in 2024. For Site Reliability Engineers, this is alarming. Toil is the repetitive, manual, automatable work that scales linearly with your service but produces no lasting value. It steals time from the engineering work that actually improves reliability, scalability, and incident response.

This post breaks down how to identify toil in your operations, measure its impact, and systematically eliminate it through automation and process improvements.


What Exactly Is Toil?

In SRE, toil has a specific definition. It’s operational work that is:

  • Manual - Requires human intervention, not automated
  • Repetitive - Same tasks performed regularly
  • Automatable - Could be handled by a machine
  • Tactical - Interrupt-driven and reactive, not strategic
  • No Enduring Value - Doesn’t create permanent improvements
  • Scales Linearly - More service growth equals more of this work

Importantly, not all operational work is toil. Incident response during an actual outage isn’t toil—it requires judgment and adaptability. But manually restarting the same service every day because of a memory leak? That’s pure toil.


Why Toil Matters

Google’s SRE teams aim to keep toil below 50% of each engineer’s time. The remaining 50% should be spent on engineering work: automation, tooling, and system improvements that compound over time.

When toil exceeds this threshold, teams enter a vicious cycle. Engineers have no time to build automation, so toil increases further. Technical debt accumulates. Burnout rises. Innovation stalls.

The impact extends beyond individual productivity. High toil creates:

  • Slower incident response - Engineers buried in manual tasks miss critical alerts
  • Delayed feature development - Teams can’t invest in reliability improvements
  • Increased on-call burden - Responders spend nights handling repetitive tasks
  • Higher error rates - Manual processes introduce human mistakes

How to Identify Toil

Track Your Time

Run a 5-day toil log during both on-call shifts and routine operations. Write down every repetitive task. Look for patterns in:

  • Daily standup notes and ticket queues
  • On-call incident logs and chat transcripts
  • Monitoring alert history and response actions
  • Manual deployment and configuration tasks
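Once the log exists, a few lines of code can surface the patterns. The sketch below assumes a hypothetical log schema (engineer, task, minutes) and made-up sample entries; it simply totals time and occurrences per task so the biggest repeat offenders rise to the top.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToilEntry:
    """One line of a hand-kept toil log (hypothetical schema)."""
    engineer: str
    task: str
    minutes: int


def summarize(entries):
    """Return (task, total_minutes, occurrences) rows, largest total first."""
    totals = defaultdict(lambda: [0, 0])  # task -> [minutes, occurrences]
    for entry in entries:
        totals[entry.task][0] += entry.minutes
        totals[entry.task][1] += 1
    rows = [(task, mins, count) for task, (mins, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative entries only; a real log would come from your own notes.
log = [
    ToilEntry("ana", "restart leaky service", 10),
    ToilEntry("ben", "manual deploy", 30),
    ToilEntry("ana", "restart leaky service", 10),
    ToilEntry("ana", "manual deploy", 30),
    ToilEntry("ben", "rotate credentials", 20),
]

for task, minutes, count in summarize(log):
    print(f"{task}: {minutes} min across {count} occurrences")
```

A spreadsheet does the same job; the point is to count occurrences as well as minutes, since a short task that recurs constantly is often the better automation target.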

Apply the Toil Checklist

For each task, ask these questions:

  1. Does this task scale with service size, traffic, or users?
  2. Would a machine do this the same way every time?
  3. Does completing this task leave no lasting improvement behind?
  4. Is this reactive work triggered by alerts or requests?
  5. Does this require minimal engineering judgment?

If you answer yes to most of these, you’ve identified toil.
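The checklist can be turned into a rough scoring helper. This is a minimal sketch, not a standard formula: the field names are hypothetical, and "most of these" is interpreted here as at least three of the five signals. Note that the field `leaves_lasting_improvement` is negated, because toil is work that leaves nothing lasting behind.

```python
from dataclasses import dataclass


@dataclass
class ChecklistAnswers:
    """Answers to the five checklist questions (hypothetical field names)."""
    scales_with_service: bool         # question 1
    machine_repeatable: bool          # question 2
    leaves_lasting_improvement: bool  # question 3 (True counts against toil)
    reactive: bool                    # question 4
    minimal_judgment: bool            # question 5


def toil_signals(answers):
    """Count how many of the five answers point toward toil."""
    return sum([
        answers.scales_with_service,
        answers.machine_repeatable,
        not answers.leaves_lasting_improvement,
        answers.reactive,
        answers.minimal_judgment,
    ])


def is_toil(answers, threshold=3):
    """Assumption: 'most' means at least 3 of 5 signals."""
    return toil_signals(answers) >= threshold


# Restarting a leaky service every day: every signal points to toil.
daily_restart = ChecklistAnswers(True, True, False, True, True)

# Responding to a novel outage: reactive, but needs judgment.
outage_response = ChecklistAnswers(False, False, False, True, False)

print(is_toil(daily_restart))    # True
print(is_toil(outage_response))  # False
```

The score is a conversation starter rather than a verdict; borderline tasks still deserve a human look.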

Calculate the Cost

Aggregate the human-hours spent on each category of toil. Multiply by engineer compensation. Compare against the engineering time needed to automate or eliminate that work.

Not all toil is worth automating immediately. A task that takes 5 minutes quarterly may not justify a week of engineering effort. But a daily 30-minute manual deployment affecting three engineers? That’s 1.5 hours per day, or roughly 45 hours over a 30-day month, which is probably worth automating.
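The comparison can be sketched as a break-even calculation. The example below works in hours rather than dollars, since the same compensation rate applies to both sides and cancels out. The 40-hour figure for "a week of engineering effort" and the optional maintenance parameter are assumptions for illustration.

```python
def monthly_toil_hours(minutes_per_run, runs_per_month, engineers=1):
    """Total human-hours per month spent on one manual task."""
    return minutes_per_run * runs_per_month * engineers / 60


def break_even_months(automation_hours, hours_saved_per_month,
                      maintenance_hours_per_month=0.0):
    """Months until automation pays for itself, or None if it never does."""
    net_saving = hours_saved_per_month - maintenance_hours_per_month
    if net_saving <= 0:
        return None
    return automation_hours / net_saving


WEEK_OF_EFFORT = 40  # assumed engineering hours to build the automation

# Daily 30-minute deployment, three engineers, 30-day month.
deploy_hours = monthly_toil_hours(30, 30, engineers=3)
print(deploy_hours)  # 45.0
print(break_even_months(WEEK_OF_EFFORT, deploy_hours))

# 5-minute task done once a quarter (one third of a run per month).
quarterly_hours = monthly_toil_hours(5, 1 / 3)
print(break_even_months(WEEK_OF_EFFORT, quarterly_hours))
```

Under these assumptions the deployment automation pays back in under a month, while the quarterly task would take on the order of a thousand months to break even.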


Strategies to Reduce Toil

1. Eliminate Before You Automate

Sometimes the best solution is to stop doing the work entirely. Before building automation, ask:

  • Does this task actually need to happen?
  • What would it cost if we simply stopped doing it?
  • Can we change the system to make this unnecessary?

Removing the root cause beats automating around it.

2. Automate High-Impact Tasks First

Don’t try to automate everything at once. Prioritize based on:

  • Frequency - Daily toil before monthly toil
  • Time cost - Tasks consuming the most aggregate hours
  • Error rate - Manual processes prone to mistakes
  • Interrupt severity - Work that disrupts on-call or focus time

Start with a few high-priority items. Use the time gained to improve your automation and tackle the next batch.
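One way to rank candidates is a simple score built from the four criteria above. The weighting here is entirely hypothetical (aggregate hours, inflated by error rate, with a 1.5x multiplier for interrupt-driven work), and the sample tasks and numbers are invented; tune both to your own team.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    runs_per_month: float   # frequency
    minutes_per_run: float  # time cost per occurrence
    error_rate: float       # fraction of runs that go wrong, 0.0 to 1.0
    interrupts_focus: bool  # pages on-call or breaks focus time


def priority_score(task):
    """Hypothetical heuristic: monthly hours, weighted by errors and interrupts."""
    hours = task.runs_per_month * task.minutes_per_run / 60
    score = hours * (1 + task.error_rate)
    if task.interrupts_focus:
        score *= 1.5
    return score


tasks = [
    ToilTask("quarterly cert rotation", 1 / 3, 60, 0.20, False),
    ToilTask("restart leaky service", 30, 10, 0.02, True),
    ToilTask("manual deploy", 30, 30, 0.10, False),
]

ranked = sorted(tasks, key=priority_score, reverse=True)
for task in ranked:
    print(f"{task.name}: {priority_score(task):.2f}")
```

With these sample numbers the manual deploy ranks first, the daily restart second, and the quarterly rotation last, which matches the "daily before monthly" intuition.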

3. Batch and Delay Toil

Not all toil can be eliminated immediately. In the meantime:

  • Batch processing - Accumulate similar tasks and handle them together in one pass
  • Scheduled windows - Handle routine maintenance during designated times
  • Self-service tools - Let developers run deployments or config changes themselves

Reducing interrupts preserves engineering focus, even when toil remains.

4. Measure and Set Targets

Track toil metrics quarterly:

  • Percentage of time spent on toil per engineer
  • Mean Time to Repair (MTTR) for common issues
  • Number of manual interventions per incident
  • Alert volume and false positive rates

Set a target threshold—typically 50% or less of engineering time. If you exceed it, freeze new feature work until toil is reduced.
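Checking the first metric against the target takes very little code. This sketch assumes you already have toil hours and total hours per engineer for the quarter; the names and numbers are made up, and the 50% default mirrors the threshold mentioned above.

```python
def toil_percentage(toil_hours, total_hours):
    """Share of working time spent on toil, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * toil_hours / total_hours


def over_budget(hours_by_engineer, threshold_pct=50.0):
    """Return {engineer: toil %} for everyone above the threshold.

    hours_by_engineer maps a name to a (toil_hours, total_hours) pair.
    """
    result = {}
    for name, (toil, total) in hours_by_engineer.items():
        pct = toil_percentage(toil, total)
        if pct > threshold_pct:
            result[name] = pct
    return result


# Illustrative numbers for one reporting period.
hours = {
    "ana": (90, 160),
    "ben": (60, 160),
}

print(over_budget(hours))  # {'ana': 56.25}
```

Tracking the figure per engineer, and not only as a team average, helps catch cases where one person quietly absorbs most of the manual work.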

5. Improve Systems, Not Just Scripts

External automation (scripts, bots, runbooks) helps, but internal improvements are more effective:

  • Fix memory leaks instead of automating service restarts
  • Add health checks instead of manually monitoring logs
  • Implement graceful degradation instead of manual failovers

Engineering work that prevents toil has permanent value. Scripts that handle toil don’t fix the underlying problem.

6. Standardize Platforms

A lack of standardization creates complexity, which breeds toil. Minimize the number of deployment pipelines, monitoring stacks, and infrastructure platforms in use. Consolidation reduces the surface area for manual intervention.


Common Pitfalls

Over-Automating Too Soon

Building complex automation for rare events can create more maintenance toil than it eliminates. Validate the cost-benefit before investing weeks of engineering time.

Ignoring Automation Maintenance

Automation scripts need updates as systems evolve. Budget ongoing maintenance time, or your automation becomes technical debt.

Accepting Toil as “Just Part of the Job”

If your team treats constant manual work as inevitable, toil will never improve. Make toil reduction an explicit engineering goal with quarterly reviews.


Conclusion

Toil is rising again in 2025, but it doesn’t have to define your operations. By identifying repetitive manual work, measuring its impact, and systematically eliminating it through automation and process improvement, SRE teams can reclaim engineering time for work that actually improves reliability.

Start with a toil audit. Pick one high-impact task to automate this quarter. Track your progress. Over time, reducing toil creates a compounding effect—more engineering capacity leads to better systems, which generate less toil.

Tools like Upstat help teams reduce operational toil by automating incident response workflows, streamlining runbook execution, and centralizing alert routing to minimize manual intervention.
