
Operational Readiness Reviews Explained

An Operational Readiness Review (ORR) is a structured assessment process that validates whether a service meets operational standards before launch. This guide explains how to conduct ORRs, who should participate, what areas to evaluate, and how the review process prevents incidents by catching gaps before they affect customers.


Three weeks before launch, the team confidently reports their new payment service is ready for production. Monitoring exists. Documentation is written. On-call coverage is assigned. Two days into production, a midnight outage reveals the monitoring only covers availability—not the latency degradation affecting customers. The runbook references commands that require permissions the on-call engineer lacks. Nobody documented which upstream services could cause cascading failures.

The service passed informal review. It would not have passed a formal Operational Readiness Review.

What Is an Operational Readiness Review?

An Operational Readiness Review (ORR) is a structured assessment process that validates whether a service meets operational standards before accepting production traffic. Unlike informal readiness checks or deployment checklists, an ORR brings together reviewers with operational expertise to systematically evaluate preparedness across monitoring, incident response, documentation, and team capabilities.

The ORR concept originated at Amazon Web Services, where it evolved from post-incident learnings into a proactive prevention mechanism. Rather than discovering operational gaps during outages, ORRs identify them before customers are affected.

An ORR is two things: a process and a set of criteria. The process defines who reviews, when reviews happen, and how decisions are made. The criteria define what must be validated—monitoring coverage, runbook completeness, escalation paths, and similar operational requirements.

Why ORRs Prevent Incidents

Traditional software development validates functional correctness: Does the code work? Do tests pass? Can it deploy successfully? These validations miss operational questions: Can teams detect when it fails? Can they respond effectively? Can they communicate status to stakeholders?

ORRs address this gap by validating operational infrastructure before production exposure. The review process surfaces gaps that informal checks miss because reviewers bring external perspective and operational experience that service teams may lack.

From Reactive to Proactive

Most organizations discover operational gaps reactively—during incidents. The monitoring dashboard missing critical metrics only becomes apparent when an outage goes undetected. The outdated runbook only reveals itself when midnight responders follow incorrect procedures.

ORRs flip this model. By systematically reviewing operational preparedness before launch, teams discover gaps during controlled reviews rather than chaotic incidents. This prevents the compounding effect where operational gaps make incidents worse, which then reveals more gaps under pressure.

External Perspective Matters

Service teams often develop blind spots about their own systems. They know the workarounds, the tribal knowledge, and the informal procedures that documentation omits. External reviewers see what’s actually documented versus what’s assumed.

A reviewer asking “Where is the runbook for database connection exhaustion?” reveals that the team handles this scenario from memory—knowledge that disappears when the experienced engineer is unavailable.

When to Conduct ORRs

ORRs serve as gates at specific points in the service lifecycle where operational risk changes significantly.

Before Initial Production Launch

Every new service should pass an ORR before accepting production traffic. This is the primary gate that validates the service is ready for real-world operations, not just development and staging environments.

Schedule ORRs early enough to address discovered gaps. An ORR two days before launch creates pressure to skip remediation. An ORR two weeks before launch allows time to fix issues without delaying release.

After Major Architectural Changes

Significant changes to service architecture alter operational characteristics. A migration from monolith to microservices changes monitoring requirements, dependency patterns, and incident response procedures. Major version upgrades may invalidate existing runbooks.

Re-run ORRs after changes that affect how the service operates, not just what it does functionally.

Before Significant Traffic Increases

Planned growth—new customer onboarding, marketing campaigns, seasonal peaks—increases operational risk. Systems that operate reliably at current scale may fail under increased load. ORRs before traffic increases validate that monitoring, capacity planning, and response procedures match anticipated demands.

Periodic Reviews for Mature Services

Services drift over time. Team members change. Infrastructure evolves. Documentation becomes outdated. Periodic ORRs—quarterly or semi-annually for critical services—catch this drift before it causes incidents.

Periodic reviews often reveal that services that passed initial ORRs have degraded operationally as maintenance attention shifted elsewhere.

ORR Participants and Roles

Effective ORRs require the right participants with clear responsibilities.

Service Owner

The team responsible for the service presents their readiness. They prepare documentation, demonstrate monitoring coverage, and explain incident response procedures. They answer reviewer questions and accept or negotiate findings.

Service owners should prepare for ORRs like presentations: organized evidence, clear explanations, and honest acknowledgment of known gaps.

ORR Reviewers

Reviewers are typically SRE team members, platform engineers, or experienced operators who bring external perspective and operational expertise. They evaluate presented evidence against established criteria, ask probing questions, and identify gaps the service team may have missed.

Good reviewers balance thoroughness with pragmatism. They distinguish critical gaps that block launch from recommendations that can be addressed post-launch.

Architecture Representatives

For new services or major changes, architecture review ensures the design supports operational requirements. Can the service be monitored effectively? Does the architecture enable graceful degradation? Are failure modes understood?

Dependent Team Representatives

If other services depend on the reviewed service, representatives from those teams validate that dependency documentation is accurate and integration points are operationally sound.

Core ORR Evaluation Areas

ORRs evaluate operational preparedness across several domains. The specific criteria vary by organization, but common areas include:

Monitoring and Observability

Can teams detect when the service is unhealthy? Effective monitoring covers availability, performance, and business metrics—not just whether the service responds, but whether it responds correctly and quickly enough.

Reviewers validate that monitoring exists for critical paths, that alert thresholds are appropriate, and that dashboards provide actionable information during incidents. Multi-region health checks validate availability from the customer's perspective, not just from internal network views.
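
To make this concrete, here is a minimal sketch of a probe that measures latency as well as availability. The probe URLs, target service, and 300 ms budget are all hypothetical; real multi-region checks would run from agents in each region rather than from a single host.

```python
import time
import urllib.request

# Hypothetical probe endpoints, one per region.
REGION_PROBES = {
    "us-east": "https://us-east.probe.example.com/check?target=payments",
    "eu-west": "https://eu-west.probe.example.com/check?target=payments",
    "ap-south": "https://ap-south.probe.example.com/check?target=payments",
}

LATENCY_BUDGET_MS = 300  # assumed latency SLO for this example

def check_region(region: str, url: str) -> dict:
    """Measure availability AND latency, not just an HTTP 200."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            available = resp.status == 200
    except OSError:
        available = False
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "region": region,
        "available": available,
        "latency_ms": round(latency_ms, 1),
        "degraded": available and latency_ms > LATENCY_BUDGET_MS,
    }

for region, url in REGION_PROBES.items():
    print(check_region(region, url))
```

The "degraded" flag is the point: a service can be fully available and still failing its users.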

Incident Response Preparedness

Can teams respond effectively when issues occur? This includes defined severity levels, escalation policies that route alerts to appropriate responders, and communication procedures for stakeholder updates.

Reviewers check that on-call schedules exist with no coverage gaps, that escalation paths reach responsive engineers, and that response procedures are documented and accessible.
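
The coverage check can be mechanical. A minimal sketch, assuming shifts are available as (start, end, responder) records exported from whatever scheduling tool is in use:

```python
from datetime import datetime

# Hypothetical shift data; a real schedule would come from the
# scheduling tool's API.
shifts = [
    (datetime(2024, 6, 1, 0), datetime(2024, 6, 1, 12), "alice"),
    (datetime(2024, 6, 1, 12), datetime(2024, 6, 2, 0), "bob"),
    (datetime(2024, 6, 2, 8), datetime(2024, 6, 3, 0), "carol"),
]

def find_coverage_gaps(shifts):
    """Return (gap_start, gap_end) windows where nobody is on call."""
    ordered = sorted(shifts, key=lambda shift: shift[0])
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start))
    return gaps

for gap_start, gap_end in find_coverage_gaps(shifts):
    print(f"Coverage gap: {gap_start} -> {gap_end}")
# Coverage gap: 2024-06-02 00:00:00 -> 2024-06-02 08:00:00
```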

Runbooks and Documentation

Can responders find and follow operational procedures? Runbooks should cover common failure scenarios with diagnostic steps and remediation actions. Documentation should explain system architecture, dependencies, and operational context.

Reviewers validate that runbooks are current, that commands work with appropriate permissions, and that procedures can be executed by engineers without deep system expertise.
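
Staleness is one of the few runbook properties that can be checked automatically before the review. A minimal sketch, assuming each runbook records a last-reviewed date and the organization sets a 90-day freshness policy:

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumed freshness policy for this sketch

@dataclass
class Runbook:
    title: str
    last_reviewed: date

runbooks = [
    Runbook("Database failover", last_reviewed=date(2024, 5, 20)),
    Runbook("Connection pool exhaustion", last_reviewed=date(2023, 11, 2)),
]

def stale_runbooks(runbooks, today: date):
    """Return runbooks whose last review is older than the policy allows."""
    return [rb for rb in runbooks if today - rb.last_reviewed > MAX_AGE]

for rb in stale_runbooks(runbooks, today=date(2024, 6, 15)):
    print(f"STALE: '{rb.title}' last reviewed {rb.last_reviewed}")
# STALE: 'Connection pool exhaustion' last reviewed 2023-11-02
```

Validating that commands actually work with on-call permissions still requires a human walkthrough; automation only narrows where to look.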

Service Catalog and Dependencies

Do teams understand what the service depends on and what depends on it? Service catalogs with dependency mapping enable impact assessment during incidents. Missing dependency documentation leads to surprises when upstream services fail.

Reviewers validate that ownership is clear, dependencies are documented, and impact paths are understood.
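
With a dependency map in hand, impact analysis is a graph walk. The sketch below assumes the catalog can export a reverse-dependency map (service to the services that call it); the service names are invented:

```python
# Hypothetical reverse-dependency map: service -> services that call it.
dependents = {
    "postgres": ["payments-api", "billing-worker"],
    "payments-api": ["checkout-web", "mobile-gateway"],
    "billing-worker": [],
    "checkout-web": [],
    "mobile-gateway": [],
}

def impacted_services(failed: str) -> set[str]:
    """Walk the graph to find everything transitively affected."""
    impacted: set[str] = set()
    stack = [failed]
    while stack:
        service = stack.pop()
        for downstream in dependents.get(service, []):
            if downstream not in impacted:
                impacted.add(downstream)
                stack.append(downstream)
    return impacted

print(impacted_services("postgres"))
# {'billing-worker', 'payments-api', 'checkout-web', 'mobile-gateway'}
# (set ordering varies)
```

During an incident, this answer needs to exist in seconds, which is why reviewers check that it lives in the catalog rather than in someone's head.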

Status Communication

Can teams communicate service status to customers and stakeholders? Status pages, communication templates, and update procedures enable transparent incident communication.

Reviewers validate that status page configuration reflects the service, that communication channels exist, and that teams know how to publish updates.
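
Some teams keep update templates in code so the first customer-facing message is a fill-in exercise rather than a blank page. A hypothetical sketch; the fields and wording would follow your organization's communication guidelines:

```python
from datetime import datetime, timezone

# Hypothetical template; adjust fields to your guidelines.
UPDATE_TEMPLATE = (
    "[{time}] {component}: {status}. Impact: {impact}. "
    "Next update within {next_update_minutes} minutes."
)

def draft_status_update(component: str, status: str, impact: str,
                        next_update_minutes: int = 30) -> str:
    """Fill the template so responders never start from scratch."""
    return UPDATE_TEMPLATE.format(
        time=datetime.now(timezone.utc).strftime("%H:%M UTC"),
        component=component,
        status=status,
        impact=impact,
        next_update_minutes=next_update_minutes,
    )

print(draft_status_update("Payments API", "degraded",
                          "elevated checkout latency"))
```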

Structuring the ORR Process

The ORR process typically follows a structured format that balances thoroughness with efficiency.

Pre-Review Preparation

Service owners complete self-assessment against ORR criteria before the review meeting. This preparation identifies obvious gaps for remediation and organizes evidence for efficient review.

Provide reviewers with documentation, monitoring dashboards, and runbook locations in advance. Reviews proceed faster when participants arrive prepared.

Review Meeting

The review meeting walks through each evaluation area systematically. Service owners present evidence of readiness. Reviewers ask clarifying questions and probe areas of concern.

Keep meetings focused—typically 60 to 90 minutes for initial ORRs. Longer sessions indicate either insufficient preparation or scope beyond a single review.

Findings Documentation

Document findings with clear categorization:

Blockers prevent production launch. These are critical gaps that must be addressed before proceeding—missing monitoring for business-critical paths, no on-call coverage, or absent incident response procedures.

Recommendations should be addressed but don’t block launch. These are improvements that reduce operational risk—additional runbooks for edge cases, enhanced monitoring for secondary paths, or documentation improvements.

Remediation Tracking

Assign owners and deadlines for each finding. Track remediation progress and validate completion before launch approval. Blockers require verification that fixes are in place, not merely a commitment to fix them.
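
Findings lend themselves to lightweight structured records. A sketch of one way to capture severity, ownership, and verification status, with a launch gate that only opens once every blocker has a verified fix (field names are illustrative):

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Severity(Enum):
    BLOCKER = "blocker"                # must be fixed before launch
    RECOMMENDATION = "recommendation"  # should be fixed, does not block

@dataclass
class Finding:
    summary: str
    severity: Severity
    owner: str
    due: date
    verified: bool = False  # blockers need verified fixes, not promises

findings = [
    Finding("No latency alerting on checkout path",
            Severity.BLOCKER, owner="alice", due=date(2024, 6, 10)),
    Finding("Runbook missing for cache eviction storm",
            Severity.RECOMMENDATION, owner="bob", due=date(2024, 7, 1)),
]

def launch_approved(findings) -> bool:
    """Approve launch only when every blocker has a verified fix."""
    return all(f.verified for f in findings if f.severity is Severity.BLOCKER)

print(launch_approved(findings))  # False until the blocker is verified
```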

Launch Approval

Once blockers are resolved, the ORR provides formal approval for production deployment. This approval represents collective confidence that the service meets operational standards.

Common ORR Findings

Certain gaps appear repeatedly across ORRs, reflecting common blind spots in service development.

Monitoring Gaps

Services often monitor availability without performance. The API responds, but latency has doubled. Users experience degradation while dashboards show green.

Reviewers frequently find missing monitoring for error rates, missing regional coverage, or thresholds set so high that alerts only fire during complete outages.
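
A reviewer probing for this failure mode can compare configured thresholds against observed baselines. A sketch with illustrative numbers:

```python
# Illustrative numbers: a healthy p99 baseline versus a configured alert
# threshold far above it. Such a threshold only fires once the service
# is effectively down.
observed_p99_ms = 180
alert_threshold_ms = 5000

def threshold_is_meaningful(baseline_ms: float, threshold_ms: float,
                            max_ratio: float = 3.0) -> bool:
    """Flag thresholds so far above baseline they miss real degradation."""
    return threshold_ms <= baseline_ms * max_ratio

if not threshold_is_meaningful(observed_p99_ms, alert_threshold_ms):
    print(f"Alert threshold {alert_threshold_ms} ms is more than 3x the "
          f"{observed_p99_ms} ms baseline; latency can double or triple "
          "without paging anyone.")
```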

Documentation Staleness

Initial documentation becomes outdated as services evolve. Commands reference deprecated tools. Architecture diagrams miss recent changes. Runbooks describe procedures that no longer work.

Unclear Ownership

Services sometimes lack clear ownership assignment, especially after team reorganizations. When incidents occur, responders waste time identifying who should be involved.

Escalation Path Gaps

Escalation policies may route to individuals rather than roles, breaking when team members change. Policies may lack secondary escalation when primary responders are unavailable.

Dependency Blind Spots

Teams understand their service deeply but may not understand dependencies thoroughly. When upstream services fail, impact analysis requires understanding that only exists informally.

ORRs and Tooling Integration

While ORRs are fundamentally a process, tooling supports effective reviews and remediation tracking.

Platforms like Upstat provide capabilities that ORR reviewers validate: comprehensive monitoring with multi-region health checks and SSL certificate tracking, runbook management with execution tracking, service catalog with dependency mapping and ownership documentation, on-call scheduling with coverage verification, and status pages for stakeholder communication. Purpose-built platforms consolidate the operational infrastructure that ORRs assess.

Integrated tooling also simplifies remediation. When an ORR identifies missing monitoring, teams can configure new health checks immediately. When escalation gaps are found, on-call schedules can be updated before launch.

Building ORR Culture

Effective ORRs require organizational commitment beyond individual reviews.

ORRs as Enablers, Not Gates

Position ORRs as enablers of safe launches rather than bureaucratic barriers. Teams should view ORRs as opportunities to validate readiness and catch gaps, not obstacles to overcome.

When ORRs find issues, celebrate the prevention of potential incidents rather than criticizing preparation gaps.

Continuous Improvement

Use findings patterns to improve organizational practices. If monitoring gaps appear repeatedly, enhance monitoring standards or templates. If documentation staleness is common, implement freshness checks.

ORR findings aggregate into organizational learning that prevents future gaps across all services.

Scaling ORRs

As organizations grow, ORR processes must scale. Train reviewers across teams so expertise distributes. Standardize criteria and documentation templates so reviews proceed consistently. Automate checklist validation where possible to focus human review on areas requiring judgment.
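
Automation here can be as simple as pairing each criterion with a machine-checkable predicate where one exists, and explicitly marking the rest for human judgment. A sketch; the check functions are hypothetical stand-ins for queries against real monitoring and scheduling systems:

```python
def has_latency_alerts(service: str) -> bool:
    """Stub: in practice, query the alerting configuration."""
    return service in {"payments-api"}

def has_oncall_coverage(service: str) -> bool:
    """Stub: in practice, query the on-call schedule for gaps."""
    return True

# Each criterion pairs a requirement with an automated predicate where
# one exists; None marks criteria that need human judgment.
CRITERIA = [
    ("Latency alerting on critical paths", has_latency_alerts),
    ("On-call schedule with no coverage gaps", has_oncall_coverage),
    ("Runbooks reviewed in the last quarter", None),
]

def automated_precheck(service: str) -> None:
    for requirement, check in CRITERIA:
        if check is None:
            status = "MANUAL"
        else:
            status = "PASS" if check(service) else "FAIL"
        print(f"{status:6}  {requirement}")

automated_precheck("payments-api")
```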

Final Thoughts

Operational Readiness Reviews transform production launches from hopeful deployments into validated transitions. By systematically assessing monitoring, incident response, documentation, and team preparedness, ORRs catch gaps before they cause customer-impacting incidents.

The ORR process requires investment—preparation time, reviewer expertise, remediation effort. That investment pays for itself through prevented incidents, faster response when issues do occur, and reduced operational chaos.

Start by defining core ORR criteria matching your organization’s operational standards. Establish the review process with clear roles and decision authority. Run initial ORRs for critical services, learning from each review to refine the process.

The goal is not perfection before every launch. The goal is sufficient operational readiness to support safe production operations. ORRs provide the structured assessment that validates this readiness systematically rather than hopefully.

Services that pass rigorous ORRs enter production with confidence. Teams know monitoring works, runbooks are valid, and response procedures are tested. When incidents eventually occur, they’re detected quickly, responded to effectively, and resolved with documented procedures. That’s the value of operational readiness validated before launch.

Explore In Upstat

Validate operational readiness with integrated monitoring, runbook tracking, service catalog with dependency mapping, and on-call scheduling that ensures teams are prepared for production.