What is toil in SRE?

Toil is operational work that is manual, repetitive, automatable, tactical, provides no enduring value, and scales linearly with service growth. Examples include manually restarting services, running the same diagnostic commands repeatedly, or manually provisioning infrastructure. It's work that could be automated but hasn't been yet.

How is toil different from necessary operational work?

Not all operational work is toil. Incident response during novel outages requires judgment and isn't toil. Strategic planning and engineering work that improves systems aren't toil. Toil is specifically repetitive, manual tasks that don't provide lasting value—like manually running the same deployment steps every release instead of automating them.

How much toil is acceptable for SRE teams?

Google's SRE teams target keeping toil below 50% of each engineer's time, with the remaining 50% spent on engineering work like automation and system improvements. When toil exceeds 50%, teams risk burning out and failing to make lasting improvements to reliability.

How do you measure toil?

Measure toil by tracking time spent on repetitive manual tasks, counting how often the same procedures are executed, identifying tasks that scale with service growth, and surveying teams about what work feels wasteful. Time tracking and task categorization help quantify toil as a percentage of total effort.

Toil in SRE: How to Identify and Reduce Operational Toil

Introduction

After five years of steady decline, operational toil is on the rise again, climbing to 30% in 2025 from 25% in 2024. For Site Reliability Engineers, this is alarming. Toil is the repetitive, manual, automatable work that scales linearly with your service but produces no lasting value. It steals time from engineering work that actually improves reliability, scalability, and incident response.

This post breaks down how to identify toil in your operations, measure its impact, and systematically eliminate it through automation and process improvements.

What Exactly Is Toil?

In SRE, toil has a specific definition. It’s operational work that is:

Manual - Requires human intervention, not automated
Repetitive - Same tasks performed regularly
Automatable - Could be handled by a machine
Tactical - Interrupt-driven and reactive, not strategic
No Enduring Value - Doesn’t create permanent improvements
Scales Linearly - More service growth equals more of this work

Importantly, not all operational work is toil. Incident response during an actual outage isn’t toil—it requires judgment and adaptability. But manually restarting the same service every day because of a memory leak? That’s pure toil.

Why Toil Matters

Google’s SRE teams aim to keep toil below 50% of each engineer’s time. The remaining 50% should be spent on engineering work: automation, tooling, and system improvements that compound over time.

When toil exceeds this threshold, teams enter a vicious cycle. Engineers have no time to build automation, so toil increases further. Technical debt accumulates. Burnout rises. Innovation stalls.

The impact extends beyond individual productivity. High toil creates:

Slower incident response - Engineers buried in manual tasks miss critical alerts
Delayed feature development - Teams can’t invest in reliability improvements
Increased on-call burden - Responders spend nights handling repetitive tasks
Higher error rates - Manual processes introduce human mistakes

How to Identify Toil

Track Your Time

Run a 5-day toil log during both on-call shifts and routine operations. Write down every repetitive task. Look for patterns in:

Daily standup notes and ticket queues
On-call incident logs and chat transcripts
Monitoring alert history and response actions
Manual deployment and configuration tasks
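
A toil log does not need special tooling. As a minimal sketch, assuming each entry records the day, a task label, where the task came from, and the minutes spent, a few lines of Python can surface the patterns. The `ToilEntry` fields and the sample tasks are hypothetical, chosen only for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToilEntry:
    """One row of a hypothetical toil log."""
    day: str       # e.g. "Mon"
    task: str      # short label for the repetitive task
    source: str    # where it came from: "ticket", "alert", "deploy", ...
    minutes: int   # time spent on this occurrence


def summarize(entries):
    """Return {task: (occurrences, total_minutes)}, largest total first."""
    totals = defaultdict(lambda: [0, 0])
    for entry in entries:
        totals[entry.task][0] += 1
        totals[entry.task][1] += entry.minutes
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return {task: (count, minutes) for task, (count, minutes) in ranked}


# Sample entries, invented for this example.
log = [
    ToilEntry("Mon", "restart cache service", "alert", 15),
    ToilEntry("Mon", "rotate credentials by hand", "ticket", 40),
    ToilEntry("Tue", "restart cache service", "alert", 15),
    ToilEntry("Wed", "restart cache service", "alert", 20),
]

summary = summarize(log)
for task, (count, minutes) in summary.items():
    print(f"{task}: {count}x, {minutes} min")
```

Sorting by total minutes rather than by count keeps a rare but slow task from hiding behind a frequent but trivial one.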

Apply the Toil Checklist

For each task, ask these questions:

Does this task scale with service size, traffic, or users?
Would a machine do this the same way every time?
Does completing this task leave the system no better than it was before?
Is this reactive work triggered by alerts or requests?
Does this require minimal engineering judgment?

If you answer yes to most of these, you’ve identified toil.
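
The checklist can be turned into a rough score. The sketch below is one possible encoding, not a standard: each field is True when the answer points toward toil (so the lasting-improvement question counts when the task leaves nothing lasting behind), and "most" is assumed to mean at least three of five. The class name, field names, and threshold are all hypothetical:

```python
from dataclasses import dataclass, fields


@dataclass
class ToilChecklist:
    """Each field is True when the answer points toward toil."""
    scales_with_service: bool      # grows with size, traffic, or users
    machine_could_do_it: bool      # same steps every time
    no_lasting_improvement: bool   # nothing permanent left behind
    reactive: bool                 # triggered by alerts or requests
    minimal_judgment: bool         # little engineering judgment needed


def looks_like_toil(answers: ToilChecklist, threshold: int = 3) -> bool:
    """Assumed rule: 'most' means at least `threshold` of the five answers."""
    score = sum(getattr(answers, f.name) for f in fields(answers))
    return score >= threshold


# Restarting the same leaky service every day ticks every box.
daily_restart = ToilChecklist(True, True, True, True, True)
# A novel outage is reactive, but needs judgment and is not repeatable.
novel_outage = ToilChecklist(False, False, False, True, False)

print(looks_like_toil(daily_restart))
print(looks_like_toil(novel_outage))
```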

Calculate the Cost

Aggregate the human-hours spent on each category of toil. Multiply by the hourly cost of an engineer. Compare the result against the engineering time needed to automate or eliminate that work.

Not all toil is worth automating immediately. A task that takes 5 minutes quarterly may not justify a week of engineering effort. But a daily 30-minute manual deployment affecting three engineers? That’s about 45 engineer-hours in a 30-day month (roughly 33 if it only runs on working days), which is probably worth automating.
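
The arithmetic behind that comparison is simple enough to script. This is a sketch under stated assumptions: a "week of engineering effort" is taken as 40 hours, a month as 30 days (or 22 working days), and both function names are made up for this example:

```python
def monthly_toil_hours(minutes_per_run, runs_per_month, engineers):
    """Engineer-hours per month spent on one recurring task."""
    return minutes_per_run * runs_per_month * engineers / 60


def payback_months(automation_hours, hours_saved_per_month):
    """Months until the automation effort is repaid by the hours it saves."""
    return automation_hours / hours_saved_per_month


WEEK_OF_EFFORT = 40  # assumed hours in one engineering week

# Daily 30-minute deployment involving three engineers.
daily_deploy = monthly_toil_hours(30, 30, 3)     # every calendar day
workday_deploy = monthly_toil_hours(30, 22, 3)   # working days only

# 5-minute task done once a quarter, i.e. one third of a run per month.
quarterly_task = monthly_toil_hours(5, 1 / 3, 1)

print(daily_deploy, workday_deploy)
print(round(payback_months(WEEK_OF_EFFORT, daily_deploy), 2))
print(round(payback_months(WEEK_OF_EFFORT, quarterly_task)))
```

Under these assumptions the deployment automation pays for itself in under a month, while the quarterly task would take over a century, which is why it can wait.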

Strategies to Reduce Toil

1. Eliminate Before You Automate

Sometimes the best solution is to stop doing the work entirely. Before building automation, ask:

Does this task actually need to happen?
What would it cost to simply not do this work?
Can we change the system to make this unnecessary?

Removing the root cause beats automating around it.

2. Automate High-Impact Tasks First

Don’t try to automate everything at once. Prioritize based on:

Frequency - Daily toil before monthly toil
Time cost - Tasks consuming the most aggregate hours
Error rate - Manual processes prone to mistakes
Interrupt severity - Work that disrupts on-call or focus time

Start with a few high-priority items. Use the time gained to improve your automation and tackle the next batch.
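
One way to make those criteria comparable is a single score per task. The weighting below is an assumption for illustration, not an established formula: monthly hours are scaled up by the error rate, and tasks that interrupt on-call get an arbitrary 1.5x multiplier. `ToilTask` and `priority` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    runs_per_month: float
    minutes_per_run: float
    error_rate: float          # fraction of runs that go wrong, 0.0 to 1.0
    interrupts_on_call: bool


def priority(task: ToilTask) -> float:
    """Higher score means automate sooner (weights are assumptions)."""
    hours = task.runs_per_month * task.minutes_per_run / 60
    score = hours * (1 + task.error_rate)
    if task.interrupts_on_call:
        score *= 1.5
    return score


# Sample tasks, invented for this example.
tasks = [
    ToilTask("quarterly cert renewal", 1 / 3, 60, 0.25, False),
    ToilTask("restart leaky service", 30, 10, 0.0, True),
    ToilTask("manual deploy", 30, 30, 0.10, False),
]

ranked = sorted(tasks, key=priority, reverse=True)
for task in ranked:
    print(f"{task.name}: {priority(task):.1f}")
```

The exact weights matter less than agreeing on them as a team, so the ranking is not relitigated every quarter.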

3. Batch and Delay Toil

Not all toil can be eliminated immediately. In the meantime:

Batch processing - Accumulate similar tasks and handle them together in one pass
Scheduled windows - Handle routine maintenance during designated times
Self-service tools - Let developers run deployments or config changes themselves

Reducing interrupts preserves engineering focus, even when toil remains.
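
Batching can be as simple as grouping queued requests by kind and working through each group in a scheduled window. A minimal sketch, assuming requests arrive as (kind, target) pairs; the request kinds and targets here are made up:

```python
from collections import defaultdict


def batch_requests(requests):
    """Group (kind, target) requests by kind, preserving arrival order."""
    batches = defaultdict(list)
    for kind, target in requests:
        batches[kind].append(target)
    return dict(batches)


# Hypothetical requests collected since the last maintenance window.
queue = [
    ("add_dns_record", "api.example.internal"),
    ("grant_access", "alice"),
    ("add_dns_record", "cache.example.internal"),
]

batches = batch_requests(queue)
for kind, targets in batches.items():
    print(f"{kind}: {len(targets)} request(s)")
```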

4. Measure and Set Targets

Track toil metrics quarterly:

Percentage of time spent on toil per engineer
Mean Time to Repair (MTTR) for common issues
Number of manual interventions per incident
Alert volume and false positive rates

Set a target threshold—typically 50% or less of engineering time. If you exceed it, freeze new feature work until toil is reduced.
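
Checking the first metric against the target takes only a few lines once hours are tracked. The sketch below assumes you already have toil hours and total hours per engineer for the quarter. The names and figures are invented:

```python
def toil_percentage(toil_hours, total_hours):
    """Share of an engineer's time spent on toil, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100 * toil_hours / total_hours


def over_target(hours_by_engineer, target_pct=50.0):
    """Return the engineers whose toil share exceeds target_pct."""
    return [
        name
        for name, (toil, total) in hours_by_engineer.items()
        if toil_percentage(toil, total) > target_pct
    ]


# Hypothetical quarter: (toil hours, total hours) per engineer.
quarter = {"ana": (260, 480), "ben": (120, 480), "chi": (240, 480)}

print(over_target(quarter))
```

Reporting per engineer, not just as a team average, matters here because an average under 50% can hide one person carrying most of the toil.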

5. Improve Systems, Not Just Scripts

External automation (scripts, bots, runbooks) helps, but internal improvements are more effective:

Fix memory leaks instead of automating service restarts
Add health checks instead of manually monitoring logs
Implement graceful degradation instead of manual failovers

Engineering work that prevents toil has permanent value. Scripts that handle toil don’t fix the underlying problem.
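
As an illustration of the health-check item, here is a minimal sketch of a health endpoint built on Python’s standard `http.server` module. The `/healthz` path, the port, and the contents of `CHECKS` are assumptions. Real checks would probe whatever the service actually depends on:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_status(checks):
    """Run each named check; return (http_status, json_body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as failing.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results, sort_keys=True)


# Hypothetical checks; replace with real probes (database, queue, disk, ...).
CHECKS = {"database": lambda: True, "queue": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_status(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the health endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A load balancer or monitor can then poll it and act on a 503, instead of someone reading logs by hand.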

6. Standardize Platforms

A lack of standardization creates complexity, which breeds toil. Minimize the number of deployment pipelines, monitoring stacks, and infrastructure platforms in use. Consolidation reduces the surface area for manual intervention.

Common Pitfalls

Over-Automating Too Soon

Building complex automation for rare events can create more maintenance toil than it eliminates. Validate the cost-benefit before investing weeks of engineering time.

Ignoring Automation Maintenance

Automation scripts need updates as systems evolve. Budget ongoing maintenance time, or your automation becomes technical debt.

Accepting Toil as “Just Part of the Job”

If your team treats constant manual work as inevitable, toil will never improve. Make toil reduction an explicit engineering goal with quarterly reviews.

Conclusion

Toil is rising again in 2025, but it doesn’t have to define your operations. By identifying repetitive manual work, measuring its impact, and systematically eliminating it through automation and process improvement, SRE teams can reclaim engineering time for work that actually improves reliability.

Start with a toil audit. Pick one high-impact task to automate this quarter. Track your progress. Over time, reducing toil creates a compounding effect—more engineering capacity leads to better systems, which generate less toil.

Tools like Upstat help teams reduce operational toil by automating incident response workflows, streamlining runbook execution, and centralizing alert routing to minimize manual intervention.

Citations

Eliminating Toil - Google Site Reliability Engineering Book
Accelerate State of DevOps Report 2024 - DORA / Google Cloud, 2024
