What Is MTTR?
Mean Time to Recovery (MTTR) is a metric that measures the average time between when an incident is detected and when service is fully restored. It captures how quickly teams identify problems, coordinate responses, implement fixes, and verify that systems work correctly again.
MTTR matters because it directly measures customer impact. When your API goes down, MTTR tells you how long users experienced the outage. When database performance degrades, MTTR shows how quickly you restored normal response times. The metric connects technical response effectiveness to business outcomes in a single number.
Unlike metrics that measure individual phases of incident response, MTTR captures the complete resolution cycle. It includes detection, acknowledgment, investigation, remediation, and verification. This comprehensive view makes MTTR the most common metric for evaluating overall incident response effectiveness.
The Three MTTR Variations
The acronym MTTR appears everywhere in incident management, but it doesn’t always mean the same thing. Three variations exist, each measuring slightly different endpoints:
Mean Time to Recovery measures from incident detection until service is restored to users. This is the most customer-centric definition because it focuses on user-facing impact. Service is “recovered” when users can access normal functionality, even if underlying fixes are temporary.
Mean Time to Repair measures from incident detection until the root cause is fixed. This definition is more technically rigorous because it requires addressing the actual problem, not just restoring service through workarounds. A rollback might recover service quickly, but repair happens when you deploy the actual fix.
Mean Time to Resolve measures from incident detection until the incident is officially closed. This includes recovery, any necessary cleanup, documentation, and formal closure. Resolve captures the complete incident lifecycle but may include activities that don’t affect user experience.
For most teams, Mean Time to Recovery provides the most useful operational signal because it measures what customers experience. When evaluating your MTTR numbers, clarify which definition your team uses to ensure consistent measurement.
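To make the endpoints concrete, here is a minimal sketch in Python. The incident field names are hypothetical, not taken from any particular platform:

from datetime import datetime

# One incident, three possible end timestamps (illustrative values)
incident = {
    "detected_at":         datetime(2024, 3, 1, 14, 15),
    "service_restored_at": datetime(2024, 3, 1, 14, 45),  # rollback restored service
    "root_cause_fixed_at": datetime(2024, 3, 2, 10, 30),  # real fix shipped the next day
    "closed_at":           datetime(2024, 3, 2, 16, 0),   # postmortem done, ticket closed
}

start = incident["detected_at"]
time_to_recovery = incident["service_restored_at"] - start  # 30 minutes: user impact ends
time_to_repair   = incident["root_cause_fixed_at"] - start  # about 20 hours: root cause fixed
time_to_resolve  = incident["closed_at"] - start            # about 26 hours: formally closed

print(time_to_recovery, time_to_repair, time_to_resolve)

All three definitions share the same start time; only the endpoint changes, which is why the three numbers can diverge widely for a single incident.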
How to Calculate MTTR
The basic MTTR formula is straightforward:
MTTR = Total Resolution Time / Number of Incidents

If your team resolved 10 incidents last month with a combined resolution time of 450 minutes, your MTTR equals 45 minutes.
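Expressed as a quick sketch, with illustrative per-incident durations:

resolution_minutes = [30, 45, 20, 60, 90, 15, 40, 55, 50, 45]  # 10 incidents, 450 minutes total

mttr = sum(resolution_minutes) / len(resolution_minutes)
print(f"MTTR: {mttr:.0f} minutes")  # MTTR: 45 minutes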
The complexity lies in defining when resolution time starts and ends.
Start time should be when monitoring detects the issue, not when a human first notices it. If your database started returning errors at 2:00 PM but monitoring didn’t alert until 2:15 PM, resolution time starts at 2:15 PM. The 15-minute gap represents detection time (MTTD), which is a separate metric.
End time should be when service is restored and verified, not when the fix is deployed. If you deploy a fix at 3:00 PM but don’t confirm service restoration until 3:10 PM, resolution time ends at 3:10 PM.
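As a sketch, using the timestamps from the two examples above:

from datetime import datetime

errors_began = datetime(2024, 3, 1, 14, 0)   # 2:00 PM, not the start of resolution time
alert_fired  = datetime(2024, 3, 1, 14, 15)  # 2:15 PM, resolution clock starts
fix_deployed = datetime(2024, 3, 1, 15, 0)   # 3:00 PM, not the end
verified     = datetime(2024, 3, 1, 15, 10)  # 3:10 PM, resolution clock stops

resolution_time = verified - alert_fired      # 55 minutes, counts toward MTTR
detection_time  = alert_fired - errors_began  # 15 minutes, counts toward MTTD instead

print(resolution_time, detection_time)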
For incidents with multiple severity changes or extended monitoring periods, define clear rules about what constitutes resolution. Most teams consider an incident resolved when the initial user impact ends, even if follow-up work continues.
Why MTTR Matters
MTTR connects directly to business outcomes in ways that other technical metrics often don’t.
Customer experience suffers during incidents. Every minute of downtime means users can’t access features they depend on. For e-commerce platforms, this translates directly to lost revenue. For B2B services, it erodes customer trust. MTTR quantifies this impact duration.
SLA compliance often includes resolution time requirements. Contracts might specify that critical incidents must be resolved within 4 hours or that monthly availability must exceed 99.9%. High MTTR puts these commitments at risk, potentially triggering financial penalties or contract renegotiations.
Team effectiveness becomes visible through MTTR trends. When MTTR decreases over time, it signals that runbooks are improving, coordination is tightening, and teams are learning from past incidents. When MTTR increases, something in the response process needs attention.
Resource allocation decisions benefit from MTTR data. If certain incident types consistently show high MTTR, investing in automation, training, or architectural improvements for those areas may deliver significant returns.
MTTR Benchmarks and Context
Industry benchmarks for MTTR vary widely because incident types differ dramatically in complexity. A misconfigured feature flag might take 5 minutes to resolve. A corrupted database might take 5 hours.
Rather than chasing universal benchmarks, focus on context-appropriate targets:
By severity level, critical incidents affecting all users should resolve faster than minor issues affecting only a few. Many teams target under 1 hour for critical incidents, under 4 hours for high-priority issues, and same-day resolution for medium-priority problems.
By incident type, infrastructure failures often resolve faster than complex application bugs. Set different expectations for different categories rather than applying a single target universally.
By your own baseline, the most actionable benchmark is your own historical performance. If your current MTTR averages 90 minutes, targeting 75 minutes represents meaningful improvement. Comparing yourself to companies with different architectures, team sizes, and incident profiles provides less useful guidance.
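One way to operationalize these targets is a simple severity lookup. The values below mirror the examples above; they are illustrative, not universal benchmarks:

# Target resolution times in minutes, assuming an 8-hour "same day"
TARGET_MINUTES = {"critical": 60, "high": 240, "medium": 480}

def breached(severity, resolution_minutes):
    return resolution_minutes > TARGET_MINUTES[severity]

print(breached("critical", 45))  # False: within the 1-hour target
print(breached("high", 300))     # True: over the 4-hour target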
Track percentiles alongside averages. An MTTR of 45 minutes sounds acceptable until you discover that your 95th percentile is 4 hours. A few extended incidents can dominate customer experience even when most resolve quickly.
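Computing both from the same durations shows how the mean hides the tail. The numbers below are illustrative:

import statistics

# Per-incident resolution minutes: most resolve quickly, one drags on for 4 hours
minutes = [20, 25, 30, 30, 35, 35, 40, 45, 50, 240]

avg = statistics.mean(minutes)               # 55, pulled up by the outlier
cuts = statistics.quantiles(minutes, n=100)  # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]                # median stays low; p95 exposes the tail

print(f"mean={avg:.0f}m  p50={p50:.0f}m  p95={p95:.0f}m")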
Using MTTR Effectively
MTTR becomes powerful when analyzed thoughtfully rather than treated as a single number.
Segment by severity to understand where resolution time actually goes. If P1 incidents average 30 minutes but P3 incidents average 3 hours, you might have different problems: P1 response might be well-optimized while P3 issues receive insufficient attention.
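A minimal segmentation sketch, assuming illustrative (severity, duration) pairs:

from collections import defaultdict

# (severity, resolution_minutes) for recent incidents (illustrative)
incidents = [
    ("P1", 25), ("P1", 35), ("P1", 30),
    ("P2", 70), ("P2", 95),
    ("P3", 150), ("P3", 210), ("P3", 180),
]

by_severity = defaultdict(list)
for severity, duration in incidents:
    by_severity[severity].append(duration)

for severity in sorted(by_severity):
    durations = by_severity[severity]
    mttr = sum(durations) / len(durations)
    print(f"{severity}: MTTR {mttr:.0f}m across {len(durations)} incidents")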
Track trends over time rather than obsessing over individual incidents. One 4-hour incident doesn’t indicate systemic problems. Three consecutive months of increasing MTTR suggests something is degrading in your response process.
Investigate outliers to find improvement opportunities. When an incident takes three times longer than average, examine why. Was it an unusual problem type? Did coordination break down? Was expertise unavailable? Outliers often reveal gaps that affect other incidents less visibly.
Avoid gaming the metric by closing incidents prematurely or classifying issues as lower severity to hit targets. When teams face pressure to reduce MTTR numbers rather than actual resolution time, they find ways to manipulate data while customer experience stays unchanged. Measure what matters and address the underlying process, not the number itself.
MTTR in the Broader Metrics Picture
MTTR doesn’t exist in isolation. It works alongside other incident metrics to provide complete visibility into response effectiveness.
Mean Time to Detect (MTTD) measures how quickly monitoring identifies problems. Poor MTTD inflates apparent MTTR because you can’t start resolving what you haven’t detected.
Mean Time to Acknowledge (MTTA) measures how quickly responders engage after alerts fire. High MTTA suggests notification or on-call problems that delay the start of actual resolution work.
Mean Time Between Failures (MTBF) measures how often incidents occur. Even excellent MTTR becomes exhausting if incidents happen constantly.
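All four metrics fall out of the same per-incident event timestamps. A sketch with hypothetical field names:

from datetime import datetime
from statistics import mean

def minutes(delta):
    return delta.total_seconds() / 60

# Hypothetical event timestamps for two incidents
incidents = [
    {"failed_at":    datetime(2024, 3, 1, 14, 0),
     "detected_at":  datetime(2024, 3, 1, 14, 15),
     "acked_at":     datetime(2024, 3, 1, 14, 20),
     "recovered_at": datetime(2024, 3, 1, 15, 10)},
    {"failed_at":    datetime(2024, 3, 5, 9, 0),
     "detected_at":  datetime(2024, 3, 5, 9, 5),
     "acked_at":     datetime(2024, 3, 5, 9, 12),
     "recovered_at": datetime(2024, 3, 5, 9, 40)},
]

mttd = mean(minutes(i["detected_at"] - i["failed_at"]) for i in incidents)     # monitoring lag
mtta = mean(minutes(i["acked_at"] - i["detected_at"]) for i in incidents)      # responder lag
mttr = mean(minutes(i["recovered_at"] - i["detected_at"]) for i in incidents)  # resolution time

# MTBF: average gap between the starts of successive failures
gaps = [minutes(b["failed_at"] - a["failed_at"]) for a, b in zip(incidents, incidents[1:])]
mtbf_hours = mean(gaps) / 60

print(f"MTTD={mttd:.0f}m  MTTA={mtta:.0f}m  MTTR={mttr:.0f}m  MTBF={mtbf_hours:.0f}h")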
For detailed comparisons of how these metrics interact and when to focus on each, see MTTR vs MTTD vs MTTA Explained. Understanding the complete metrics picture helps identify whether to invest in detection, response, or prevention.
Tracking MTTR in Practice
Manual MTTR tracking fails because it requires someone to record timestamps during stressful incidents. When engineers focus on resolving problems at 3 AM, documenting start and end times accurately becomes an afterthought.
Modern incident management platforms track these metrics automatically. When incidents are created, the system records start time. When status changes to resolved, it captures end time. Duration calculations happen without manual intervention, ensuring accurate data even during chaotic responses.
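Conceptually the derivation is simple. A sketch using a hypothetical event log (not any specific platform's actual data model):

from datetime import datetime

# Status-transition events recorded automatically as the incident progresses
events = [
    ("created",       datetime(2024, 3, 1, 14, 15)),
    ("investigating", datetime(2024, 3, 1, 14, 22)),
    ("identified",    datetime(2024, 3, 1, 14, 40)),
    ("resolved",      datetime(2024, 3, 1, 15, 10)),
]

timestamps = dict(events)
duration = timestamps["resolved"] - timestamps["created"]
print(f"resolution time: {duration.total_seconds() / 60:.0f} minutes")  # 55 minutes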
Platforms like Upstat provide built-in MTTR analytics including automatic duration tracking based on incident status transitions, percentile breakdowns showing P50 and P95 resolution times, severity-based analysis revealing which incident types take longest, and daily trend visualization highlighting whether response is improving or degrading.
Automated tracking transforms MTTR from a theoretical metric into practical operational intelligence. Teams can identify patterns, measure improvement initiatives, and demonstrate response effectiveness with data that requires no manual collection effort.
Moving From Measurement to Improvement
Understanding MTTR is the starting point, not the destination. Once you know your baseline, the real work begins: systematically reducing resolution time without sacrificing quality or burning out your team.
For specific strategies on improving MTTR through better coordination, clearer runbooks, and smarter automation, see Reducing Mean Time to Resolution.
The goal isn’t achieving the lowest possible MTTR number. It’s building incident response capability that restores service quickly, learns from failures consistently, and improves sustainably over time. MTTR provides the measurement framework for tracking that journey.
Explore in Upstat
Track MTTR automatically with built-in duration tracking, percentile analysis, and severity breakdowns that show exactly where resolution time goes.
