When your monitoring system checks if a service is healthy, what does “healthy” actually mean? For many teams, health checks are afterthoughts—simple endpoints that return “OK” without meaningful validation. Then production breaks in ways the health check never detected.
Effective health checks require intentional design. They must balance comprehensive validation against performance overhead, provide actionable signals without creating alert fatigue, and fail meaningfully when problems occur.
This guide covers the practical patterns that separate health checks that work from health checks that lie.
Understanding Health Check Types
Not all health checks serve the same purpose. Different contexts require different validation levels.
Liveness Checks
Liveness checks answer the simplest question: Is this process still running?
These checks validate that the application process is alive and can respond to requests. They should not check external dependencies, database connectivity, or complex business logic. Liveness is binary—the process either responds or does not.
When to use: Container orchestration platforms like Kubernetes restart containers when liveness checks fail. Keep these checks extremely lightweight to avoid false positives that trigger unnecessary restarts.
Example endpoint: /health/live or /liveness
What to check:
- Process can handle HTTP requests
- Event loop is not blocked
- Basic internal state is valid
What NOT to check:
- Database connectivity
- External API availability
- Disk space or memory thresholds
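A minimal liveness handler might look like the sketch below. It assumes an existing Express app; the event-loop delay monitor is Node's built-in perf_hooks API, and the 200ms threshold is an illustrative cutoff, not a standard value.

import { monitorEventLoopDelay } from 'perf_hooks';

// Track event-loop delay continuously in the background (Node.js built-in)
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

app.get('/health/live', (req, res) => {
  // Histogram values are in nanoseconds; convert the mean to milliseconds
  const meanDelayMs = loopDelay.mean / 1e6;
  loopDelay.reset(); // start a fresh window so each check reflects recent behavior

  // Illustrative threshold: a heavily blocked event loop means the process
  // is alive but cannot usefully serve requests
  if (meanDelayMs > 200) {
    return res.status(503).json({ status: 'unhealthy', reason: 'event loop blocked' });
  }

  res.status(200).json({ status: 'alive' });
});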
Readiness Checks
Readiness checks determine: Is this service ready to accept traffic?
These checks validate that the service is both alive and capable of processing requests successfully. A service might be alive but not ready if it has not completed initialization, cannot reach required dependencies, or is overwhelmed by existing load.
When to use: Load balancers and service meshes use readiness to decide whether to route traffic to an instance. Failing readiness removes the instance from rotation without restarting it.
Example endpoint: /health/ready or /readiness
What to check:
- Database connection pools are established
- Required configuration is loaded
- Critical dependencies are accessible
- Cache connections are established
What NOT to check:
- Non-critical dependency health
- Long-running operations
- Business logic validation
Startup Checks
Startup checks answer: Has initialization completed?
Some services require extended initialization periods—loading large datasets, warming caches, or establishing connections to slow dependencies. Startup checks provide additional time before liveness checking begins.
When to use: Services with slow initialization that would fail premature liveness checks.
Example endpoint: /health/startup or /startup
What to check:
- Initial configuration loaded successfully
- Required data preloaded into memory
- Slow dependency connections established
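A common pattern, sketched below, is to flip an in-memory flag once initialization finishes and have the startup endpoint report it. The initializeService function and the three steps inside it are placeholders for whatever your service actually loads at boot.

let startupComplete = false;

async function initializeService() {
  await loadConfiguration();        // placeholder: load required configuration
  await preloadReferenceData();     // placeholder: warm large in-memory datasets
  await connectSlowDependencies();  // placeholder: establish slow connections
  startupComplete = true;
}

app.get('/health/startup', (req, res) => {
  if (startupComplete) {
    return res.status(200).json({ status: 'started' });
  }
  res.status(503).json({ status: 'starting' });
});

// Start initialization without blocking the HTTP server from binding,
// so the startup endpoint can answer while initialization is still running
initializeService().catch((err) => {
  console.error('Initialization failed', err);
});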
Deep Health Checks
Deep health checks provide comprehensive validation: Is every component functioning correctly?
These checks validate all dependencies, system resources, and critical functionality. They are expensive operations meant for debugging and operational dashboards, not automated monitoring that runs every 30 seconds.
When to use: Manual investigation, administrative dashboards, pre-deployment validation.
Example endpoint: /health/deep or /health/full
What to check:
- All external dependency connectivity
- Database query execution
- Message queue operations
- File system access
- Resource utilization levels
Designing Effective Endpoints
Health check endpoints need consistent structure that monitoring systems can interpret reliably.
Endpoint Naming Conventions
Establish clear naming that indicates check purpose:
/health # Basic health status
/health/live # Liveness only
/health/ready # Readiness validation
/health/startup # Initialization status
/health/deep # Comprehensive check
Avoid ambiguous paths like /status, /ping, or /healthz unless your ecosystem has established conventions around them.
HTTP Status Codes
Use HTTP status codes to communicate health state:
200 OK: Service is healthy and operating normally. All critical components are functioning as expected.
503 Service Unavailable: Service is unhealthy. Do not route traffic here. Dependencies are failing, resources are exhausted, or critical functionality is impaired.
429 Too Many Requests: Service is alive but overwhelmed. Useful when implementing backpressure signals.
500 Internal Server Error: Health check itself failed due to implementation error. Distinct from intentional service unavailability.
Do not return 200 with error details in the response body. Monitoring systems make routing decisions based on status codes, not body parsing.
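In practice that means computing the health state first and deriving the status code from it, never the other way around. A minimal sketch, where computeReadiness stands in for whatever aggregation logic your service uses:

app.get('/health/ready', async (req, res) => {
  // computeReadiness is a placeholder that aggregates the critical checks
  const result = await computeReadiness();

  // The status code carries the routing decision; the body is supporting detail only
  res.status(result.healthy ? 200 : 503).json({
    status: result.healthy ? 'healthy' : 'unhealthy',
    dependencies: result.dependencies
  });
});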
Response Format
Structure responses with actionable information:
{
  "status": "healthy",
  "timestamp": "2025-09-19T10:15:30Z",
  "uptime": 86400,
  "dependencies": {
    "database": {
      "status": "healthy",
      "responseTime": 12
    },
    "cache": {
      "status": "healthy",
      "responseTime": 3
    },
    "payment_api": {
      "status": "healthy",
      "responseTime": 145
    }
  },
  "system": {
    "memory": {
      "used": 512000000,
      "total": 1073741824,
      "percentage": 47.7
    },
    "connections": {
      "active": 42,
      "max": 200
    }
  }
}
Include enough detail for debugging without exposing sensitive internal architecture to unauthorized observers.
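If you want that structure to stay consistent across services, a small set of shared types helps. The sketch below mirrors the JSON above; the field names are this article's convention rather than any standard, and individual checks may add extra fields as needed.

interface DependencyStatus {
  status: 'healthy' | 'unhealthy';
  responseTime?: number;  // milliseconds
  reason?: string;        // populated only when unhealthy
}

interface HealthCheckResponse {
  status: 'healthy' | 'unhealthy';
  dependencies: Record<string, DependencyStatus>;
  timestamp?: string;     // ISO 8601
  uptime?: number;        // seconds
  system?: {
    memory: { used: number; total: number; percentage: number };
    connections: { active: number; max: number };
  };
}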
Response Time Requirements
Health checks must respond quickly. Slow health checks create several problems:
Monitoring overhead: Checks that take 5 seconds to complete limit check frequency and increase monitoring costs.
False positives: Timeout-based monitoring interprets slow responses as failures, triggering unnecessary alerts.
Resource consumption: Expensive health checks consume resources needed for actual work.
Target response times:
- Liveness: under 100ms
- Readiness: under 500ms
- Startup: under 2 seconds
- Deep: under 5 seconds
Implement background checking patterns for expensive validation rather than performing it synchronously in the health check handler.
Handling Dependency Checks
Checking dependency health is where most health check implementations go wrong.
The Cascade Failure Problem
When services check dependencies synchronously during health checks, failures cascade. Service A checks Service B. Service B checks Service C. Service C is slow. Now Services A, B, and C all report unhealthy, and your load balancer removes all three from rotation.
A single slow dependency has now taken three services out of rotation. A service should remain available even when non-critical dependencies fail, relying on circuit breakers, fallbacks, and graceful degradation instead of reporting itself unhealthy.
Critical vs Non-Critical Dependencies
Not all dependencies are equally important. Categorize them:
Critical dependencies: Required for core functionality. If these fail, the service cannot fulfill its primary purpose.
- Primary database for read/write operations
- Authentication service for protected endpoints
- Core business logic services
Non-critical dependencies: Support secondary features. Service can degrade gracefully when these fail.
- Analytics tracking services
- Recommendation engines
- Notification delivery systems
Health checks should only fail for critical dependency problems. Non-critical failures should log warnings but not affect health status.
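One way to encode that rule, sketched below, is to tag each check as critical or not and let only critical failures flip the overall status. The dependency names, the checkAnalyticsHealth helper, and the critical flags are illustrative.

interface DependencyCheck {
  name: string;
  critical: boolean;
  check: () => Promise<DependencyStatus>;
}

const dependencyChecks: DependencyCheck[] = [
  { name: 'database', critical: true, check: checkDatabaseHealth },
  { name: 'analytics', critical: false, check: checkAnalyticsHealth }  // illustrative non-critical check
];

async function aggregateHealth() {
  const dependencies: Record<string, DependencyStatus> = {};
  let healthy = true;

  for (const dep of dependencyChecks) {
    const result = await dep.check()
      .catch((error): DependencyStatus => ({ status: 'unhealthy', reason: error.message }));
    dependencies[dep.name] = result;

    if (result.status !== 'healthy') {
      if (dep.critical) {
        healthy = false;  // critical failure fails the whole health check
      } else {
        console.warn(`Non-critical dependency ${dep.name} is unhealthy`);  // warn, but stay healthy
      }
    }
  }

  return { healthy, dependencies };
}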
Circuit Breaker Integration
Integrate circuit breaker state into health checks:
async function checkDatabaseHealth(): Promise<DependencyStatus> {
  const circuit = circuitBreakerRegistry.get('database');

  if (circuit.isOpen()) {
    return {
      status: 'unhealthy',
      reason: 'Circuit breaker open after repeated failures',
      lastFailure: circuit.getLastFailureTime()
    };
  }

  // Circuit closed, attempt connection check
  try {
    const startTime = Date.now();
    await database.ping();
    const responseTime = Date.now() - startTime;

    return {
      status: 'healthy',
      responseTime
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      reason: error.message
    };
  }
}
This prevents health checks from repeatedly hammering failing dependencies.
Background Health Checking
For expensive dependency validation, check health in the background and cache results:
class HealthCheckManager {
  private healthStatus: Map<string, DependencyStatus> = new Map();

  constructor() {
    // Check every 10 seconds in background
    setInterval(() => this.performHealthChecks(), 10000);
  }

  private async performHealthChecks() {
    const checks = [
      this.checkDatabase(),
      this.checkCache(),
      this.checkPaymentAPI()
    ];

    const results = await Promise.allSettled(checks);

    results.forEach((result, index) => {
      const key = ['database', 'cache', 'payment_api'][index];
      this.healthStatus.set(key, result.status === 'fulfilled' ? result.value : {
        status: 'unhealthy',
        reason: 'Check failed'
      });
    });
  }

  public getHealthStatus(): HealthCheckResponse {
    return {
      status: this.isHealthy() ? 'healthy' : 'unhealthy',
      dependencies: Object.fromEntries(this.healthStatus)
    };
  }

  private isHealthy(): boolean {
    // Only critical dependencies affect overall health
    return this.healthStatus.get('database')?.status === 'healthy';
  }
}
Health check endpoints return cached status instantly. Background tasks handle expensive validation asynchronously.
Security Considerations
Health check endpoints present security and operational challenges.
Authentication Requirements
Should health checks require authentication?
Public health checks: No authentication. Load balancers and external monitoring need access without credentials. Return minimal information—overall status without internal details.
Detailed health checks: Require authentication. Exposing dependency names, response times, resource utilization, and internal architecture aids attackers.
Provide both endpoints:
/health # Public, minimal information
/health/detailed # Authenticated, full diagnostics
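Wiring that up can be as simple as serving two views of the same underlying status. The sketch below reuses the HealthCheckManager shown earlier (or any equivalent aggregator), and requireAuth is a placeholder for whatever authentication middleware your stack uses.

const healthManager = new HealthCheckManager();  // background checker shown earlier

// Public: status code plus a one-word body, nothing else
app.get('/health', (req, res) => {
  const report = healthManager.getHealthStatus();
  const healthy = report.status === 'healthy';
  res.status(healthy ? 200 : 503).json({ status: report.status });
});

// Authenticated: full diagnostics behind auth middleware (requireAuth is a placeholder)
app.get('/health/detailed', requireAuth, (req, res) => {
  const report = healthManager.getHealthStatus();
  res.status(report.status === 'healthy' ? 200 : 503).json(report);
});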
Information Disclosure
Avoid exposing sensitive details in public health checks:
❌ Bad:
{
  "database": {
    "host": "prod-db-01.internal.company.com",
    "user": "app_service",
    "status": "unhealthy",
    "error": "Connection refused to 10.0.1.42:5432"
  }
}
✅ Good:
{
  "database": {
    "status": "unhealthy"
  }
}
Detailed error messages belong in logs, not public responses.
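The two needs are not in conflict: log the full error for operators and return only the status publicly. A short sketch, assuming a structured logger of your choice:

app.get('/health', async (req, res) => {
  try {
    await db.ping();
    res.status(200).json({ status: 'healthy' });
  } catch (error) {
    // Full detail goes to logs for operators...
    logger.error('Health check failed: database unreachable', { error });
    // ...while the public response stays minimal
    res.status(503).json({ status: 'unhealthy' });
  }
});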
Rate Limiting
Health checks can become denial-of-service vectors. Implement rate limiting:
const healthCheckLimiter = rateLimit({
  windowMs: 10000, // 10 seconds
  max: 100,        // Max 100 requests per window
  message: 'Health check rate limit exceeded'
});
app.get('/health', healthCheckLimiter, healthCheckHandler);
Legitimate monitoring typically checks every 30-60 seconds, so sustained traffic far above a limit like this points to misconfiguration or abuse.
Common Implementation Mistakes
Avoid these patterns that undermine health check effectiveness:
Hardcoded “OK” Responses
Health checks that always return success are worse than no health check. They create false confidence:
// ❌ Useless health check
app.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});
This tells monitoring systems nothing. The service could be crashing, the database could be down, and the health check would still return success.
Checking Internal Dependent Services
Microservices should not check the health of the services they depend on as part of their own health checks. Doing so creates cascade failures and tight coupling:
// ❌ Bad - creates cascade failure
async function checkHealth() {
  const orderServiceHealth = await fetch('http://orders/health');
  const paymentServiceHealth = await fetch('http://payments/health');

  return orderServiceHealth.ok && paymentServiceHealth.ok;
}
Instead, verify your service can reach dependencies (connection pools, circuit breakers) without making health check requests to them.
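The healthier version of the example above inspects local state, such as circuit breaker status, instead of fanning out requests to downstream /health endpoints. A sketch reusing the circuitBreakerRegistry from earlier:

// ✅ Better - inspect local state instead of calling other services' health endpoints
function checkHealth(): boolean {
  // Circuit breakers already track recent failures to downstream services,
  // so no extra network calls happen during the health check
  const ordersReachable = !circuitBreakerRegistry.get('orders').isOpen();
  const paymentsReachable = !circuitBreakerRegistry.get('payments').isOpen();

  return ordersReachable && paymentsReachable;
}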
Synchronous Database Queries
Running expensive database queries synchronously during health checks creates performance problems:
// ❌ Bad - expensive synchronous check
app.get('/health', async (req, res) => {
  try {
    const result = await db.query('SELECT COUNT(*) FROM users');
    res.json({ status: 'ok', userCount: result.rows[0].count });
  } catch (error) {
    res.status(503).json({ status: 'error' });
  }
});
Use lightweight connection validation instead:
// ✅ Good - fast connection check
app.get('/health', async (req, res) => {
  try {
    await db.ping(); // Simple connectivity test
    res.json({ status: 'healthy' });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy' });
  }
});
Ignoring Health Check Results
Some teams implement health checks but never configure monitoring to use them. The endpoint exists but serves no purpose.
Ensure your health checks integrate with:
- Load balancer health checking
- Container orchestration (Kubernetes probes)
- External monitoring systems
- Alerting pipelines
Over-Detailed Public Responses
Exposing full dependency details, internal metrics, and system architecture through public health checks provides attackers with reconnaissance data:
// ❌ Too much information
{
  "status": "healthy",
  "version": "2.4.1",
  "build": "a3f2b9c",
  "dependencies": {
    "postgres_primary": "10.0.1.42:5432",
    "postgres_replica": "10.0.1.43:5432",
    "redis_cluster": ["10.0.2.10:6379", "10.0.2.11:6379"],
    "internal_api": "http://api.internal:8080"
  }
}
Reserve detailed responses for authenticated administrative endpoints.
Health Check Patterns for Specific Technologies
Different platforms require different health check implementations.
HTTP/HTTPS Services
Standard REST APIs and web services:
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

app.get('/health/ready', async (req, res) => {
  try {
    await Promise.race([
      databasePool.ping(),
      new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000))
    ]);
    res.status(200).json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not ready', reason: error.message });
  }
});
Database Health
Database connection health checking:
async function checkDatabaseHealth() {
  const pool = getDatabasePool();

  try {
    // Check connection availability
    const connection = await pool.getConnection();

    // Verify connectivity with lightweight query
    await connection.query('SELECT 1');
    connection.release();

    return {
      status: 'healthy',
      activeConnections: pool.getActiveCount(),
      totalConnections: pool.getTotalCount()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      reason: error.message
    };
  }
}
Message Queue Health
Message queue connectivity validation:
async function checkQueueHealth() {
  try {
    // Verify connection without consuming messages
    const queueDepth = await messageQueue.getQueueDepth();

    return {
      status: 'healthy',
      queueDepth,
      consumers: await messageQueue.getActiveConsumers()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      reason: 'Cannot connect to message queue'
    };
  }
}
Cache Health
Cache system validation:
async function checkCacheHealth() {
  try {
    const testKey = '__health_check__';
    const testValue = Date.now().toString();

    // Write test value
    await cache.set(testKey, testValue, { ttl: 10 });

    // Read test value
    const retrieved = await cache.get(testKey);

    if (retrieved !== testValue) {
      return { status: 'unhealthy', reason: 'Cache read/write validation failed' };
    }

    // Clean up
    await cache.del(testKey);

    return { status: 'healthy' };
  } catch (error) {
    return { status: 'unhealthy', reason: error.message };
  }
}
Monitoring Integration
Health checks only provide value when monitoring systems use them effectively.
External Monitoring Services
Configure external monitoring to check health endpoints from multiple geographic regions:
Upstat monitors HTTP and HTTPS endpoints including health check URLs, tracking DNS resolution time, TCP connection establishment, TLS handshake duration, and time-to-first-byte metrics. Multi-region checking differentiates between local network issues and true service unavailability.
Health check monitoring should track both availability (is the endpoint responding) and response quality (status code, response time, response content).
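Whatever tool runs the check, the probe itself reduces to something like the sketch below, which uses Node 18+'s built-in fetch; the timeout and latency thresholds are example values.

async function probe(url: string) {
  const started = Date.now();
  try {
    // Abort if the endpoint does not answer within 5 seconds
    const response = await fetch(url, { signal: AbortSignal.timeout(5000) });
    const body = await response.json().catch(() => ({}));
    const responseTime = Date.now() - started;

    return {
      available: true,
      statusOk: response.status === 200,      // response quality: status code
      responseTimeOk: responseTime < 2000,    // response quality: latency
      contentOk: body.status === 'healthy'    // response quality: body content
    };
  } catch {
    return { available: false };              // availability: endpoint not responding
  }
}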
Load Balancer Integration
Configure load balancers to use readiness endpoints:
upstream backend {
    server backend1:8080 max_fails=3 fail_timeout=30s;
    server backend2:8080 max_fails=3 fail_timeout=30s;
}

location / {
    proxy_pass http://backend;

    # Active health checks (health_check/match) require NGINX Plus;
    # open source NGINX relies on the passive max_fails/fail_timeout above
    health_check interval=5s
                 fails=2
                 passes=2
                 uri=/health/ready
                 match=health_ok;
}

match health_ok {
    status 200;
    header Content-Type = "application/json";
}
Failed health checks remove instances from rotation without restarting them. Passing health checks restore instances to the pool.
Container Orchestration
Kubernetes probe configuration:
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: app:latest
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health/startup
          port: 8080
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30
Kubernetes uses these probes to manage container lifecycle automatically.
Testing Health Check Implementation
Validate that health checks work correctly before production deployment.
Unit Testing
Test health check logic in isolation:
describe('Health Check Endpoint', () => {
  it('returns 200 when all dependencies are healthy', async () => {
    mockDatabase.ping.mockResolvedValue(true);
    mockCache.ping.mockResolvedValue(true);

    const response = await request(app).get('/health/ready');

    expect(response.status).toBe(200);
    expect(response.body.status).toBe('ready');
  });

  it('returns 503 when database is unavailable', async () => {
    mockDatabase.ping.mockRejectedValue(new Error('Connection failed'));

    const response = await request(app).get('/health/ready');

    expect(response.status).toBe(503);
    expect(response.body.status).toBe('not ready');
  });
});
Integration Testing
Validate health checks against real dependencies in test environments:
describe('Health Check Integration', () => {
  it('detects actual database connectivity issues', async () => {
    // Start service with database stopped
    await stopDatabase();

    const response = await request(app).get('/health/ready');
    expect(response.status).toBe(503);

    // Start database
    await startDatabase();
    await waitForDatabaseReady();

    const healthyResponse = await request(app).get('/health/ready');
    expect(healthyResponse.status).toBe(200);
  });
});
Chaos Engineering
Intentionally fail dependencies to verify health checks detect problems:
# Pause the database container to simulate an outage
docker pause postgres-container
# Verify health check fails
curl http://localhost:8080/health/ready
# Expected: 503 Service Unavailable
# Restore database
docker unpause postgres-container
# Verify health check recovers
curl http://localhost:8080/health/ready
# Expected: 200 OK
Conclusion
Effective health checks form the foundation of reliable monitoring. They provide monitoring systems with accurate signals about service health, enable load balancers to route traffic intelligently, and help operations teams detect problems faster.
The difference between useful health checks and misleading ones comes down to thoughtful design: separating liveness from readiness, handling dependencies without cascade failures, responding quickly without expensive validation, and providing actionable information without security risks.
Start with simple liveness checks that validate basic process health. Add readiness validation that checks critical dependencies using background checking patterns. Integrate health endpoints with monitoring systems and load balancers. Test failure scenarios to ensure health checks detect real problems.
Health checks are not afterthoughts—they are operational contracts between your service and the infrastructure that keeps it running. Invest the time to implement them correctly, and your monitoring systems will reward you with faster detection, fewer false positives, and more reliable operations.
Explore In Upstat
Monitor service health with automated HTTP/HTTPS checks that track DNS resolution, TCP connection, TLS handshake, and response time metrics across multiple regions.