
How to Write Better Alerts

Bad alerts wake engineers for non-issues or provide no context for action. Good alerts include severity, impact, context, and next steps that enable fast response. This guide covers alert writing practices that reduce noise, prevent fatigue, and transform notifications from interruptions into actionable signals.

October 9, 2025 · 7 min read · monitoring

The Alert Nobody Can Act On

Your monitoring system sends an alert: “API latency high.” An engineer wakes at 2 AM, checks dashboards, sees 200ms response times—well within acceptable limits. The alert fired because one endpoint occasionally spikes. No action required. The engineer goes back to sleep frustrated. Next week, the same alert fires during an actual outage, but the engineer hesitates before responding, remembering last week’s false alarm.

This pattern repeats across engineering teams daily. Monitoring systems generate thousands of alerts. Most require no action. Teams become desensitized. When real problems occur, alerts get ignored or responses get delayed because noise has eroded trust.

The problem is not monitoring systems—it is how teams write alerts. Most alerts lack essential information, trigger on non-actionable conditions, or provide no guidance for response. Better alert writing transforms notifications from noise into reliable signals that enable fast incident response.

What Makes an Alert Good

Effective alerts share characteristics that distinguish them from notification noise.

Actionable: Good alerts fire when someone needs to do something. If no action is possible or required, no alert should fire. Alerts are not logging. Alerts demand attention and response.

Specific impact: Alerts communicate what broke and who is affected. “Database connection pool exhausted” beats “system error.” The first explains impact. The second creates confusion.

Contains context: Alerts include enough information to start troubleshooting without checking multiple systems. Current values, thresholds, affected services, and recent changes provide context immediately.

Provides next steps: Alerts link to runbooks, dashboards, or procedures. Engineers should know what to check first rather than guessing where to start investigating.

Appropriate urgency: Critical alerts page engineers immediately. Warnings notify during business hours. Informational events log without interruption. Severity must match actual impact.

Bad alerts violate these principles. They fire when nothing is wrong, provide vague messages, offer no context, include no guidance, and interrupt inappropriately. Each violation wastes time and erodes trust.
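
To make the remaining characteristics concrete (actionability is about when an alert fires, not what it contains), here is a minimal sketch of an alert payload that carries impact, context, next steps, and urgency. The field names are illustrative and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    """Minimal alert payload built around the checklist above (field names are illustrative)."""
    severity: str                        # appropriate urgency: "critical", "warning", or "info"
    summary: str                         # specific impact: what broke and who is affected
    current_value: float                 # context: the metric value that triggered the alert
    threshold: float                     # context: the limit that was exceeded
    affected_service: str                # context: which service, region, or segment
    runbook_url: Optional[str] = None    # next steps: where to start investigating
    dashboard_url: Optional[str] = None  # next steps: pre-filtered investigation view
```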

Alert Message Structure

Well-written alert messages follow consistent patterns that communicate essential information efficiently.

Lead with impact: Start messages with what failed and how it affects users or services. “API authentication failing—users cannot log in” immediately communicates severity and scope.

Include current state: Show the metric value that triggered the alert and the threshold it exceeded. “Response time 850ms (threshold 500ms)” provides context for assessing severity.

Add time context: Indicate duration when relevant. “High error rate for 15 minutes” differs from “High error rate for 2 hours.” Duration signals whether problems are transient or sustained.

Reference affected components: Name specific services, regions, or customer segments. “Payment processing down in US-EAST-1” focuses investigation immediately.

Avoid jargon: Alert messages reach on-call engineers who might not work on the affected system daily. Clear language beats technical precision when waking someone at 3 AM.

Link to resources: Include links to relevant dashboards, runbooks, or logs. Engineers should access investigation tools with one click rather than searching.

Example of weak alert message: “High latency detected.”

Example of strong alert message: “API response time 1,200ms (threshold 500ms) for 10 minutes. Users experiencing slow page loads. Runbook: [link]”

The difference between these two messages determines whether engineers can respond effectively or must waste time gathering information the alert should have provided.
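
One way to enforce this structure is to generate messages from structured fields rather than writing them by hand. A minimal sketch that reproduces the strong example above; the runbook URL is a placeholder.

```python
def format_alert_message(metric: str, current_ms: float, threshold_ms: float,
                         duration_min: int, impact: str, runbook_url: str) -> str:
    """Build a message with the failing metric, current value, threshold, duration, impact, and a runbook link."""
    return (
        f"{metric} {current_ms:,.0f}ms (threshold {threshold_ms:,.0f}ms) "
        f"for {duration_min} minutes. {impact} Runbook: {runbook_url}"
    )

# Reproduces the strong example above:
print(format_alert_message("API response time", 1200, 500, 10,
                           "Users experiencing slow page loads.",
                           "https://example.com/runbooks/api-latency"))
```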

Severity Levels and Routing

Alert severity determines who gets notified and how urgently. Most teams benefit from three to five severity levels.

Critical/P0: Production completely broken. Customers cannot use core functionality. Page on-call engineers immediately through phone calls. Wake people up. These alerts demand urgent action.

High/P1: Degraded service or approaching critical thresholds. Customers experiencing poor performance but core functions work. Send push notifications or SMS. Interrupt engineers but less urgently than critical.

Medium/P2: Issues affecting some users or non-critical features. Send to team chat channels. Notify during business hours. Do not interrupt off-hours unless problems escalate.

Low/P3: Informational events or approaching warning thresholds. Log to ticketing systems. Review weekly. No immediate notification needed.

Info: Deployment completions, scheduled maintenance, or system state changes. No action required. Log for audit trails and correlation during investigations.

Common mistakes include using only two severity levels (everything is either critical or ignored), paging engineers for low-priority issues, or sending informational events through the same channels as critical alerts. Proper severity classification and routing reduce alert fatigue dramatically.
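
As a rough sketch, severity-to-channel routing can be expressed as a simple lookup. The channel names below are purely illustrative.

```python
# Illustrative mapping from severity to notification channels; channel names are hypothetical.
SEVERITY_ROUTES = {
    "critical": ["phone_call", "push"],  # P0: page on-call immediately
    "high":     ["push", "sms"],         # P1: interrupt, but less aggressively
    "medium":   ["team_chat"],           # P2: business-hours attention
    "low":      ["ticket_queue"],        # P3: weekly review
    "info":     ["audit_log"],           # no action required
}

def route_alert(severity: str) -> list[str]:
    # Unknown severities default to the least disruptive channel rather than paging anyone.
    return SEVERITY_ROUTES.get(severity, ["ticket_queue"])
```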

Platforms like Upstat support configurable alert routing based on severity, sending critical monitor failures to on-call engineers while directing lower-severity notifications to appropriate team channels without unnecessary interruptions.

Threshold Configuration

Alert thresholds determine when notifications fire. Poorly configured thresholds cause most alert noise.

Use statistical baselines: Set thresholds based on historical data rather than guessing. If API latency normally ranges from 100 to 200ms, a threshold of 150ms fires constantly. A threshold of 400ms provides meaningful signal.

Account for variance: Systems fluctuate. Single-point thresholds trigger on normal variations. Use sustained threshold violations (metric above limit for 5+ minutes) rather than instant triggers.
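
A minimal sketch of a sustained-threshold check, assuming metric samples arrive every few seconds; the class and parameter names are illustrative.

```python
from collections import deque
import time

class SustainedThreshold:
    """Fire only when the metric has stayed above the limit for the whole window (illustrative sketch)."""

    def __init__(self, limit: float, window_seconds: int = 300):
        self.limit = limit
        self.window_seconds = window_seconds
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, value) pairs

    def observe(self, value: float, now: float | None = None) -> bool:
        """Record a sample and return True if the alert should fire."""
        now = time.time() if now is None else now
        self.samples.append((now, value))
        # Drop samples that have aged out of the evaluation window.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()
        # Require (nearly) a full window of data, all of it above the limit, before firing.
        window_covered = now - self.samples[0][0] >= self.window_seconds * 0.9
        return window_covered and all(v > self.limit for _, v in self.samples)
```

A check like this turns a momentary spike into a non-event while still firing on genuine, sustained degradation.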

Time-based thresholds: Different hours have different normal behavior. API traffic at 2 AM differs from 2 PM. Thresholds should adjust for expected patterns rather than using fixed values.

Percentile metrics: Average latency hides outliers. Monitor 95th or 99th percentile to detect problems affecting some users without false positives from occasional slow requests.
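
Percentiles are cheap to compute over a window of samples. A nearest-rank sketch showing how p95 surfaces an outlier that the average hides:

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (e.g., pct=95 for p95) over a window of latency samples."""
    ordered = sorted(samples)
    index = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[index]

latencies = [120, 130, 110, 140, 135, 900, 125, 115, 138, 122]
print(percentile(latencies, 95))  # 900ms: the outlier that an average of ~204ms would hide
```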

Multiple thresholds: Configure warning thresholds that notify without paging and critical thresholds that demand immediate action. Gradual escalation prevents alert storms while maintaining visibility.

Dead-man switches: Some alerts fire when metrics stop reporting entirely. These catch monitoring system failures rather than application problems.
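
A dead-man check can be as simple as comparing the last report time against a maximum allowed silence; a quick sketch:

```python
import time

def dead_man_check(last_report_time: float, max_silence_seconds: int = 600) -> bool:
    """Return True when a metric has stopped reporting for too long,
    which usually points at the monitoring pipeline rather than the application (sketch)."""
    return time.time() - last_report_time > max_silence_seconds
```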

Testing thresholds requires iteration. After setting initial values, monitor alert frequency. Thresholds firing multiple times daily without requiring action need adjustment. Thresholds that never fire might be too permissive.

Context and Enrichment

Alert messages should include information that helps engineers understand and respond to problems without checking other systems first.

Current metric values: Show the value that triggered the alert alongside normal ranges or historical context.

Affected scope: List services, regions, customers, or instances experiencing problems.

Recent changes: Include recent deployments, configuration updates, or scaling events that correlate with alert timing.

Dependency status: Reference status of dependencies. An alert for database errors should indicate whether the database itself is healthy or experiencing issues.

Runbook links: Direct links to troubleshooting procedures specific to this alert. Engineers should access guidance immediately.

Dashboard links: Pre-filtered dashboards showing relevant metrics and logs. One click should surface investigation tools.

Related alerts: Show other active alerts that might indicate broader system problems versus isolated failures.
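
As a sketch of how enrichment might look in code, the helpers below stand in for whatever deploy logs, health checks, and alert stores a team already has. All names and URLs are hypothetical.

```python
# Hypothetical lookups; in practice these would query a deploy log, health checks, and the alert store.
def get_recent_deploys(service: str, hours: int) -> list[str]:
    return []

def get_dependency_health(service: str) -> dict:
    return {}

def get_active_alerts(service: str) -> list[str]:
    return []

RUNBOOKS = {"api_latency_high": "https://example.com/runbooks/api-latency"}  # hypothetical name-to-runbook map

def enrich_alert(alert: dict) -> dict:
    """Attach investigation context to a raw alert before it is sent (illustrative sketch)."""
    service = alert["service"]
    alert["context"] = {
        "recent_deploys": get_recent_deploys(service, hours=2),
        "dependency_status": get_dependency_health(service),
        "related_alerts": get_active_alerts(service),
        "runbook_url": RUNBOOKS.get(alert["name"]),
        "dashboard_url": f"https://dashboards.example.com/{service}?window=1h",
    }
    return alert

enriched = enrich_alert({"name": "api_latency_high", "service": "checkout-api", "value": 1200})
```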

Rich context transforms alerts from simple notifications into investigation starting points. Engineers spend less time gathering information and more time resolving problems.

Alert Aggregation and Deduplication

Single failures often trigger multiple alerts across different monitoring systems. Without aggregation, engineers receive notification storms that obscure root causes.

Group related alerts: When a database fails, applications depending on it also fail. Alert grouping should recognize cascading failures and notify about root causes rather than symptoms.

Deduplicate identical alerts: If five API instances all detect the same problem, send one alert rather than five. Multiple notifications provide no additional value.

Time windows: Aggregate alerts firing within short periods. If latency spikes across services within minutes, group notifications rather than interrupting repeatedly.
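
Deduplication within a time window often hinges on a fingerprint built from the alert's identity. A minimal sketch, with the fingerprint fields chosen purely for illustration:

```python
import hashlib
import time

_recently_notified: dict[str, float] = {}  # fingerprint -> last notification time

def should_notify(alert_name: str, service: str, window_seconds: int = 300) -> bool:
    """Suppress alerts whose fingerprint already fired within the time window (illustrative sketch)."""
    fingerprint = hashlib.sha1(f"{alert_name}:{service}".encode()).hexdigest()
    now = time.time()
    last = _recently_notified.get(fingerprint)
    if last is not None and now - last < window_seconds:
        return False  # duplicate within the window; drop it
    _recently_notified[fingerprint] = now
    return True
```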

Auto-resolve duplicates: When root cause alerts resolve, automatically close related symptom alerts rather than requiring manual cleanup.

Escalation only on persistence: Some problems resolve quickly. Send initial notifications to low-urgency channels. Escalate to high-urgency channels only if problems persist beyond defined periods.

Effective aggregation reduces notification volume dramatically while preserving signal quality. Engineers see coherent incident summaries rather than fragmented alert lists.

Actionable Guidance

The best alerts tell engineers what to do next rather than just reporting problems.

Link runbooks automatically: Every alert should reference documented procedures. Engineers should not hunt for troubleshooting steps.

Suggest first steps: Include immediate actions in alert messages. “Check database connection pool status” or “Verify recent deployment health” focuses investigation.

Provide rollback information: If alerts fire after deployments, include rollback commands or links to revert procedures.

Show similar incidents: Reference recent similar alerts and their resolutions. Engineers learn from previous responses.

Include contact information: List service owners, on-call engineers, or teams responsible for affected systems. Engineers should know who to engage for help.

Offer self-service actions: Link to safe remediation actions engineers can execute directly—restart services, scale resources, or toggle feature flags.
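
A sketch of how self-service actions might be attached to alerts; the action names and URLs are hypothetical.

```python
# Hypothetical mapping from alert name to a safe, pre-approved remediation action.
SAFE_ACTIONS = {
    "api_latency_high":     "https://ops.example.com/actions/scale-api?replicas=+2",
    "worker_queue_backlog": "https://ops.example.com/actions/restart-workers",
}

def self_service_link(alert_name: str) -> str | None:
    """Return a one-click remediation link to embed in the alert message, if one exists."""
    return SAFE_ACTIONS.get(alert_name)
```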

Alerts with actionable guidance reduce mean time to resolution significantly. Engineers start fixing problems immediately rather than spending time figuring out what to do.

Testing and Refinement

Alert quality requires continuous improvement based on actual incident response experience.

Track acknowledgment time: Measure how long alerts wait before engineers acknowledge them. Long waits suggest low-priority alerts interrupting inappropriately.

Monitor false positive rate: Track alerts that close without action. High rates indicate threshold tuning needed.
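
False positive rate is easy to track if alert history records whether each alert led to action. A sketch, assuming a hypothetical resolved_without_action flag in that history:

```python
from dataclasses import dataclass

@dataclass
class AlertRecord:
    """Hypothetical shape of an alert-history entry."""
    name: str
    resolved_without_action: bool

def false_positive_rate(history: list[AlertRecord]) -> float:
    """Fraction of fired alerts that closed without anyone acting on them."""
    if not history:
        return 0.0
    noise = sum(1 for record in history if record.resolved_without_action)
    return noise / len(history)
```

Alerts whose rate stays high over several weeks are strong candidates for threshold or severity changes.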

Review alert-to-incident correlation: Determine which alerts actually predict or detect real incidents versus noise.

Post-incident analysis: After resolving incidents, evaluate whether alerts helped or hindered response. Improve alerts that lacked context or delayed detection.

Scheduled alert reviews: Regularly audit alerts that fire frequently without requiring action. Adjust their thresholds or severity, or disable them entirely.

Feedback loops: Create channels for on-call engineers to report alert quality issues. Front-line responders know which alerts waste time.

Alert systems are never perfect initially. Effective teams treat alert configuration as living documentation that evolves based on operational experience.

Common Pitfalls

Teams writing alerts encounter predictable problems.

Alert overload: Configuring alerts for every possible metric creates noise rather than signal. Focus on conditions that actually require human intervention.

Vague messages: “System error” tells engineers nothing. Specific, detailed messages enable effective response.

Missing severity classification: Treating all alerts as equally urgent causes either constant interruption or ignoring everything.

No suppression during known issues: Alerting repeatedly for active incidents wastes attention. Suppress subsequent related alerts until root causes resolve.

Ignoring on-call experience: Alerts that make sense to the team writing them might confuse on-call engineers unfamiliar with specific systems. Test alerts with people outside the immediate team.

Alert sprawl: Multiple monitoring tools create overlapping alerts. Consolidate or deduplicate them to present a coherent incident view.

Getting Started

Improving alert quality does not require rewriting everything simultaneously.

Audit current alerts: Review alerts from the past month. Identify which required action and which generated noise. Focus improvement efforts on high-noise offenders.

Fix top irritants: Address the three alerts that waste the most time. This provides quick wins demonstrating value.

Add context incrementally: For each critical alert, add runbook links and suggested first steps. This adds value immediately.

Standardize message format: Create templates for alert messages ensuring consistent structure across all alerts.

Implement severity levels: If currently using one severity level, add at least critical and warning levels with appropriate routing.

Review after incidents: After each incident, evaluate alert quality. Did alerts provide needed information? Did false positives delay response? Improve based on findings.

Start improving alert quality today by auditing recent alerts and fixing the most problematic ones. Incremental improvement compounds into substantially better incident response over time.

Final Thoughts

Alert quality determines whether monitoring systems help or hinder incident response. Good alerts are actionable, specific, contextual, instructive, and appropriately urgent. Bad alerts are vague, noisy, and missing context, and they interrupt inappropriately.

Writing better alerts requires understanding what makes notifications actionable, structuring messages to communicate impact and next steps, configuring appropriate severity levels and routing, setting thresholds based on statistical baselines, enriching alerts with investigation context, aggregating related notifications, and providing actionable guidance.

Alert quality is not one-time configuration—it requires continuous refinement based on operational experience. Track false positive rates, measure response times, review alerts after incidents, and incorporate feedback from on-call engineers.

Most teams tolerate noisy alerts because improving them seems overwhelming. Start small. Fix the three worst alerts this week. Add runbook links next week. Adjust severity levels the following week. Small improvements accumulate into substantially better incident response.

Better alerts reduce mean time to resolution, prevent alert fatigue, improve on-call engineer experience, and help teams catch problems before customers notice them.

Write alerts today assuming an unfamiliar engineer will receive them at 2 AM. Include everything they need to understand the problem and start fixing it. That standard transforms notifications from interruptions into investigation accelerators.

Alert quality is not overhead—it is the difference between teams that respond effectively to incidents and teams that struggle to distinguish signal from noise.

Explore In Upstat

Configure multi-channel alert delivery with severity-based routing that sends critical alerts to on-call engineers while keeping informational notifications in team channels.