Health Check Implementation Guide

Health checks are the foundation of reliable monitoring, but implementing them effectively requires thoughtful design. This guide covers endpoint patterns, response formats, dependency handling, and failure modes that enable accurate monitoring without creating alert fatigue or performance overhead.

September 19, 2025
monitoring

When your monitoring system checks if a service is healthy, what does “healthy” actually mean? For many teams, health checks are afterthoughts—simple endpoints that return “OK” without meaningful validation. Then production breaks in ways the health check never detected.

Effective health checks require intentional design. They must balance comprehensive validation against performance overhead, provide actionable signals without creating alert fatigue, and fail meaningfully when problems occur.

This guide covers the practical patterns that separate health checks that work from health checks that lie.

Understanding Health Check Types

Not all health checks serve the same purpose. Different contexts require different validation levels.

Liveness Checks

Liveness checks answer the simplest question: Is this process still running?

These checks validate that the application process is alive and can respond to requests. They should not check external dependencies, database connectivity, or complex business logic. Liveness is binary—the process either responds or does not.

When to use: Container orchestration platforms like Kubernetes restart containers when liveness checks fail. Keep these checks extremely lightweight to avoid false positives that trigger unnecessary restarts.

Example endpoint: /health/live or /liveness

What to check:

  • Process can handle HTTP requests
  • Event loop is not blocked
  • Basic internal state is valid

What NOT to check:

  • Database connectivity
  • External API availability
  • Disk space or memory thresholds
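
One way to cover "basic internal state is valid" without touching any external dependency is a heartbeat: the service's main work loop records a timestamp on each iteration, and the liveness handler only compares that timestamp to the clock. The sketch below is illustrative; the `Heartbeat` class, `livenessStatusCode` helper, and the silence threshold are assumptions, not part of any library.

```typescript
// Hypothetical heartbeat tracker; names and thresholds are illustrative.
class Heartbeat {
  private lastBeatMs: number;
  private readonly maxSilenceMs: number;

  constructor(maxSilenceMs: number, nowMs: number = Date.now()) {
    this.maxSilenceMs = maxSilenceMs;
    this.lastBeatMs = nowMs;
  }

  // Called by the work loop on every iteration.
  beat(nowMs: number = Date.now()): void {
    this.lastBeatMs = nowMs;
  }

  // Called by the liveness handler: an in-memory comparison, no I/O.
  isAlive(nowMs: number = Date.now()): boolean {
    return nowMs - this.lastBeatMs <= this.maxSilenceMs;
  }
}

// Maps the heartbeat to the HTTP status a /health/live handler would send.
function livenessStatusCode(heartbeat: Heartbeat, nowMs: number = Date.now()): 200 | 503 {
  return heartbeat.isAlive(nowMs) ? 200 : 503;
}
```

Because the handler never performs I/O, it stays well inside a tight probe timeout; choose the silence threshold generously so a single slow iteration does not trigger a restart.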

Readiness Checks

Readiness checks determine: Is this service ready to accept traffic?

These checks validate that the service is both alive and capable of processing requests successfully. A service might be alive but not ready if it has not completed initialization, cannot reach required dependencies, or is overwhelmed by existing load.

When to use: Load balancers and service meshes use readiness to decide whether to route traffic to an instance. Failing readiness removes the instance from rotation without restarting it.

Example endpoint: /health/ready or /readiness

What to check:

  • Database connection pools are established
  • Required configuration is loaded
  • Critical dependencies are accessible
  • Cache connections are established

What NOT to check:

  • Non-critical dependency health
  • Long-running operations
  • Business logic validation

Startup Checks

Startup checks answer: Has initialization completed?

Some services require extended initialization periods—loading large datasets, warming caches, or establishing connections to slow dependencies. Startup checks provide additional time before liveness checking begins.

When to use: Services with slow initialization that would fail premature liveness checks.

Example endpoint: /health/startup or /startup

What to check:

  • Initial configuration loaded successfully
  • Required data preloaded into memory
  • Slow dependency connections established
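
A startup endpoint can be backed by a small tracker that knows which initialization steps are still outstanding. This is a minimal sketch; `StartupTracker` and the step names are hypothetical.

```typescript
// Hypothetical startup tracker: /health/startup would return 200 only
// once every registered initialization step has been marked done.
class StartupTracker {
  private readonly pending = new Set<string>();

  constructor(steps: string[]) {
    for (const step of steps) {
      this.pending.add(step);
    }
  }

  // Called by each initialization routine when it finishes.
  markDone(step: string): void {
    this.pending.delete(step);
  }

  isComplete(): boolean {
    return this.pending.size === 0;
  }

  // Steps still running, sorted for stable log output.
  remaining(): string[] {
    return Array.from(this.pending).sort();
  }
}
```

The `remaining()` list is meant for logs or authenticated diagnostics rather than a public response.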

Deep Health Checks

Deep health checks provide comprehensive validation: Is every component functioning correctly?

These checks validate all dependencies, system resources, and critical functionality. They are expensive operations meant for debugging and operational dashboards, not automated monitoring that runs every 30 seconds.

When to use: Manual investigation, administrative dashboards, pre-deployment validation.

Example endpoint: /health/deep or /health/full

What to check:

  • All external dependency connectivity
  • Database query execution
  • Message queue operations
  • File system access
  • Resource utilization levels

Designing Effective Endpoints

Health check endpoints need consistent structure that monitoring systems can interpret reliably.

Endpoint Naming Conventions

Establish clear naming that indicates check purpose:

/health              # Basic health status
/health/live         # Liveness only
/health/ready        # Readiness validation
/health/startup      # Initialization status
/health/deep         # Comprehensive check

Avoid ambiguous paths like /status, /ping, or /healthz unless your ecosystem has established conventions around them.

HTTP Status Codes

Use HTTP status codes to communicate health state:

200 OK: Service is healthy and operating normally. All critical components are functioning as expected.

503 Service Unavailable: Service is unhealthy. Do not route traffic here. Dependencies are failing, resources are exhausted, or critical functionality is impaired.

429 Too Many Requests: Service is alive but overwhelmed. Useful when implementing backpressure signals.

500 Internal Server Error: Health check itself failed due to implementation error. Distinct from intentional service unavailability.

Do not return 200 with error details in the response body. Load balancers and orchestrators act on the status code, and most never parse the body.
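
The mapping above can be kept in one place so every handler reports state the same way. A minimal sketch, assuming a hypothetical three-value `HealthState` type and `evaluate` wrapper:

```typescript
// Hypothetical health states; the names are illustrative.
type HealthState = 'healthy' | 'unhealthy' | 'overloaded';

// Translates a health state into the status codes described above.
function statusCodeFor(state: HealthState): 200 | 429 | 503 {
  switch (state) {
    case 'healthy':
      return 200;
    case 'overloaded':
      return 429;
    case 'unhealthy':
      return 503;
  }
}

// Runs a check and builds the response. If the check itself throws,
// report 500 so it is distinguishable from an intentional 503.
function evaluate(check: () => HealthState): { code: number; body: { status: string } } {
  try {
    const state = check();
    return { code: statusCodeFor(state), body: { status: state } };
  } catch {
    return { code: 500, body: { status: 'error' } };
  }
}
```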

Response Format

Structure responses with actionable information:

{
  "status": "healthy",
  "timestamp": "2025-09-19T10:15:30Z",
  "uptime": 86400,
  "dependencies": {
    "database": {
      "status": "healthy",
      "responseTime": 12
    },
    "cache": {
      "status": "healthy",
      "responseTime": 3
    },
    "payment_api": {
      "status": "healthy",
      "responseTime": 145
    }
  },
  "system": {
    "memory": {
      "used": 512000000,
      "total": 1073741824,
      "percentage": 47.7
    },
    "connections": {
      "active": 42,
      "max": 200
    }
  }
}

Include enough detail for debugging without exposing sensitive internal architecture to unauthorized observers.

Response Time Requirements

Health checks must respond quickly. Slow health checks create several problems:

Monitoring overhead: Checks that take 5 seconds to complete limit check frequency and increase monitoring costs.

False positives: Timeout-based monitoring interprets slow responses as failures, triggering unnecessary alerts.

Resource consumption: Expensive health checks consume resources needed for actual work.

Target response times:

  • Liveness: under 100ms
  • Readiness: under 500ms
  • Startup: under 2 seconds
  • Deep: under 5 seconds

Implement background checking patterns for expensive validation rather than performing it synchronously in the health check handler.
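
The targets listed above can be encoded as a small table so that handlers or tests can flag checks that run too long. This is a sketch; `RESPONSE_BUDGET_MS` and `isOverBudget` are hypothetical names.

```typescript
type CheckKind = 'liveness' | 'readiness' | 'startup' | 'deep';

// Budgets in milliseconds, copied from the target response times above.
const RESPONSE_BUDGET_MS: Record<CheckKind, number> = {
  liveness: 100,
  readiness: 500,
  startup: 2000,
  deep: 5000,
};

// True when a measured duration misses the "under N ms" target.
function isOverBudget(kind: CheckKind, elapsedMs: number): boolean {
  return elapsedMs >= RESPONSE_BUDGET_MS[kind];
}
```

A handler could record `Date.now()` before and after its work and log a warning when `isOverBudget` returns true, which surfaces slow checks before they start timing out.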

Handling Dependency Checks

Checking dependency health is where most health check implementations go wrong.

The Cascade Failure Problem

When services check dependencies synchronously during health checks, failures cascade. Service A checks Service B. Service B checks Service C. Service C is slow. Now Services A, B, and C all report unhealthy, and your load balancer removes all three from rotation.

The result is a self-inflicted outage: one slow dependency takes otherwise healthy instances out of rotation. A service should remain available even when non-critical dependencies fail, using circuit breakers, fallbacks, and graceful degradation.

Critical vs Non-Critical Dependencies

Not all dependencies are equally important. Categorize them:

Critical dependencies: Required for core functionality. If these fail, the service cannot fulfill its primary purpose.

  • Primary database for read/write operations
  • Authentication service for protected endpoints
  • Core business logic services

Non-critical dependencies: Support secondary features. Service can degrade gracefully when these fail.

  • Analytics tracking services
  • Recommendation engines
  • Notification delivery systems

Health checks should only fail for critical dependency problems. Non-critical failures should log warnings but not affect health status.
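
The rule above reduces to a short aggregation step. A minimal sketch, assuming each dependency result carries a hypothetical `critical` flag:

```typescript
type DepStatus = 'healthy' | 'unhealthy';

interface DependencyResult {
  name: string;
  critical: boolean; // assumption: set by whoever registers the dependency
  status: DepStatus;
}

interface Summary {
  status: DepStatus;
  warnings: string[];
}

// Overall status fails only when a critical dependency is unhealthy;
// non-critical failures are reported as warnings for logging.
function summarize(results: DependencyResult[]): Summary {
  const failed = results.filter((r) => r.status === 'unhealthy');
  const criticalFailed = failed.some((r) => r.critical);
  const warnings = failed
    .filter((r) => !r.critical)
    .map((r) => `${r.name} is degraded`);
  return { status: criticalFailed ? 'unhealthy' : 'healthy', warnings };
}
```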

Circuit Breaker Integration

Integrate circuit breaker state into health checks:

async function checkDatabaseHealth(): Promise<DependencyStatus> {
  const circuit = circuitBreakerRegistry.get('database');

  if (circuit.isOpen()) {
    return {
      status: 'unhealthy',
      reason: 'Circuit breaker open after repeated failures',
      lastFailure: circuit.getLastFailureTime()
    };
  }

  // Circuit closed, attempt connection check
  try {
    const startTime = Date.now();
    await database.ping();
    const responseTime = Date.now() - startTime;

    return {
      status: 'healthy',
      responseTime
    };
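  } catch (error) {
    return {
      status: 'unhealthy',
      // A caught value is not guaranteed to be an Error instance
      reason: error instanceof Error ? error.message : String(error)
    };
  }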
}

This prevents health checks from repeatedly hammering failing dependencies.
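
The snippet above assumes a `circuitBreakerRegistry` whose breakers expose `isOpen()` and `getLastFailureTime()`; neither is defined in this guide. A minimal sketch of such a breaker, with hypothetical thresholds and an injectable clock for testing, might look like this:

```typescript
// Hypothetical breaker: opens after N consecutive failures and allows a
// trial call again once a cool-down period has passed.
class SimpleCircuitBreaker {
  private consecutiveFailures = 0;
  private lastFailureMs: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;

  constructor(failureThreshold: number, resetAfterMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(nowMs: number = Date.now()): void {
    this.consecutiveFailures += 1;
    this.lastFailureMs = nowMs;
  }

  isOpen(nowMs: number = Date.now()): boolean {
    if (this.lastFailureMs === null || this.consecutiveFailures < this.failureThreshold) {
      return false;
    }
    // Still inside the cool-down window: keep rejecting calls.
    return nowMs - this.lastFailureMs < this.resetAfterMs;
  }

  getLastFailureTime(): number | null {
    return this.lastFailureMs;
  }
}
```

Production breakers usually add an explicit half-open state that limits how many trial calls get through; this sketch simply lets calls resume after the cool-down.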

Background Health Checking

For expensive dependency validation, check health in the background and cache results:

class HealthCheckManager {
  private healthStatus: Map<string, DependencyStatus> = new Map();

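  constructor() {
    // Run once immediately so status is populated before the first request,
    // then re-check every 10 seconds in the background
    void this.performHealthChecks();
    setInterval(() => void this.performHealthChecks(), 10000);
  }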

  private async performHealthChecks() {
    const checks = [
      this.checkDatabase(),
      this.checkCache(),
      this.checkPaymentAPI()
    ];

    const results = await Promise.allSettled(checks);
    results.forEach((result, index) => {
      const key = ['database', 'cache', 'payment_api'][index];
      this.healthStatus.set(key, result.status === 'fulfilled' ? result.value : {
        status: 'unhealthy',
        reason: 'Check failed'
      });
    });
  }

  public getHealthStatus(): HealthCheckResponse {
    return {
      status: this.isHealthy() ? 'healthy' : 'unhealthy',
      dependencies: Object.fromEntries(this.healthStatus)
    };
  }

  private isHealthy(): boolean {
    // Only critical dependencies affect overall health
    return this.healthStatus.get('database')?.status === 'healthy';
  }
}

Health check endpoints return cached status instantly. Background tasks handle expensive validation asynchronously.
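
One risk with cached status is staleness: if the background loop stalls, the endpoint keeps serving the last good result. A possible guard, sketched here under the assumption that each cached entry also records when it was checked (the `checkedAtMs` field and `effectiveStatus` function are hypothetical additions, not part of the class above):

```typescript
interface CachedStatus {
  status: 'healthy' | 'unhealthy';
  checkedAtMs: number; // assumption: recorded when the background check finished
}

// Treats missing or stale results as unhealthy instead of trusting old data.
function effectiveStatus(
  cached: CachedStatus | undefined,
  nowMs: number,
  maxAgeMs: number
): 'healthy' | 'unhealthy' {
  if (cached === undefined) {
    return 'unhealthy'; // no background check has completed yet
  }
  if (nowMs - cached.checkedAtMs > maxAgeMs) {
    return 'unhealthy'; // the background loop appears to have stalled
  }
  return cached.status;
}
```

A maximum age of two or three check intervals tolerates one slow cycle without letting a dead loop go unnoticed.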

Security Considerations

Health check endpoints present security and operational challenges.

Authentication Requirements

Should health checks require authentication?

Public health checks: No authentication. Load balancers and external monitoring need access without credentials. Return minimal information—overall status without internal details.

Detailed health checks: Require authentication. Exposing dependency names, response times, resource utilization, and internal architecture aids attackers.

Provide both endpoints:

/health          # Public, minimal information
/health/detailed # Authenticated, full diagnostics

Information Disclosure

Avoid exposing sensitive details in public health checks:

❌ Bad:

{
  "database": {
    "host": "prod-db-01.internal.company.com",
    "user": "app_service",
    "status": "unhealthy",
    "error": "Connection refused to 10.0.1.42:5432"
  }
}

✅ Good:

{
  "database": {
    "status": "unhealthy"
  }
}

Detailed error messages belong in logs, not public responses.
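
Rather than maintaining two separate reports, the public response can be derived from the detailed one by keeping only the status field. A minimal sketch; the type names and `toPublicReport` are hypothetical.

```typescript
// Detailed entries may carry arbitrary diagnostic fields.
interface DetailedDependency {
  status: string;
  [detail: string]: unknown;
}

type DetailedReport = Record<string, DetailedDependency>;
type PublicReport = Record<string, { status: string }>;

// Copies only the status of each dependency; hosts, users, and error
// text never reach the public response.
function toPublicReport(detailed: DetailedReport): PublicReport {
  const publicReport: PublicReport = {};
  for (const [name, dependency] of Object.entries(detailed)) {
    publicReport[name] = { status: dependency.status };
  }
  return publicReport;
}
```

Building the public view as an allow-list (copy what is safe) is less error-prone than a deny-list (delete what is sensitive), because new diagnostic fields stay private by default.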

Rate Limiting

Health checks can become denial-of-service vectors. Implement rate limiting:

const healthCheckLimiter = rateLimit({
  windowMs: 10000, // 10 seconds
  max: 100, // Max 100 requests per window
  message: 'Health check rate limit exceeded'
});

app.get('/health', healthCheckLimiter, healthCheckHandler);

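Legitimate monitoring typically checks every 30-60 seconds, and even aggressive orchestrator probes run only every few seconds per instance. Request rates far above that suggest misconfiguration or abuse.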
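
The idea behind a window-based limit like the one configured above is a fixed-window counter. The sketch below is a simplified, dependency-free illustration (a single global counter rather than one per client); `FixedWindowLimiter` is a hypothetical name, not the middleware's implementation.

```typescript
// Hypothetical fixed-window limiter: allows at most `max` requests per
// `windowMs`, then rejects until the next window begins.
class FixedWindowLimiter {
  private windowStartMs = 0;
  private count = 0;
  private readonly windowMs: number;
  private readonly max: number;

  constructor(windowMs: number, max: number) {
    this.windowMs = windowMs;
    this.max = max;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  allow(nowMs: number = Date.now()): boolean {
    if (nowMs - this.windowStartMs >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStartMs = nowMs;
      this.count = 0;
    }
    if (this.count >= this.max) {
      return false;
    }
    this.count += 1;
    return true;
  }
}
```

A real deployment would normally key the counter by client address and make sure the load balancer's own probes are never throttled.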

Common Implementation Mistakes

Avoid these patterns that undermine health check effectiveness:

Hardcoded “OK” Responses

Health checks that always return success are worse than no health check. They create false confidence:

// ❌ Useless health check
app.get('/health', (req, res) => {
  res.json({ status: 'ok' });
});

This tells monitoring systems nothing beyond the fact that the process can answer HTTP. The service could be failing every real request, the database could be down, and the health check would still return success.

Checking Internal Dependent Services

Microservices should not check health of services they depend on during their own health checks. This creates cascade failures and coupling:

// ❌ Bad - creates cascade failure
async function checkHealth() {
  const orderServiceHealth = await fetch('http://orders/health');
  const paymentServiceHealth = await fetch('http://payments/health');

  return orderServiceHealth.ok && paymentServiceHealth.ok;
}

Instead, verify your service can reach dependencies (connection pools, circuit breakers) without making health check requests to them.

Synchronous Database Queries

Running expensive database queries synchronously during health checks creates performance problems:

// ❌ Bad - expensive synchronous check
app.get('/health', async (req, res) => {
  try {
    const result = await db.query('SELECT COUNT(*) FROM users');
    res.json({ status: 'ok', userCount: result.rows[0].count });
  } catch (error) {
    res.status(503).json({ status: 'error' });
  }
});

Use lightweight connection validation instead:

// ✅ Good - fast connection check
app.get('/health', async (req, res) => {
  try {
    await db.ping(); // Simple connectivity test
    res.json({ status: 'healthy' });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy' });
  }
});

Ignoring Health Check Results

Some teams implement health checks but never configure monitoring to use them. The endpoint exists but serves no purpose.

Ensure your health checks integrate with:

  • Load balancer health checking
  • Container orchestration (Kubernetes probes)
  • External monitoring systems
  • Alerting pipelines

Over-Detailed Public Responses

Exposing full dependency details, internal metrics, and system architecture through public health checks provides attackers with reconnaissance data:

// ❌ Too much information
{
  "status": "healthy",
  "version": "2.4.1",
  "build": "a3f2b9c",
  "dependencies": {
    "postgres_primary": "10.0.1.42:5432",
    "postgres_replica": "10.0.1.43:5432",
    "redis_cluster": ["10.0.2.10:6379", "10.0.2.11:6379"],
    "internal_api": "http://api.internal:8080"
  }
}

Reserve detailed responses for authenticated administrative endpoints.

Health Check Patterns for Specific Technologies

Different platforms require different health check implementations.

HTTP/HTTPS Services

Standard REST APIs and web services:

app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

app.get('/health/ready', async (req, res) => {
  try {
    await Promise.race([
      databasePool.ping(),
      new Promise((_, reject) => setTimeout(() => reject(new Error('Timeout')), 2000))
    ]);

    res.status(200).json({ status: 'ready' });
  } catch (error) {
    res.status(503).json({ status: 'not ready', reason: error.message });
  }
});

Database Health

Database connection health checking:

async function checkDatabaseHealth() {
  const pool = getDatabasePool();
  let connection;

  try {
    // Check connection availability
    connection = await pool.getConnection();

    // Verify connectivity with lightweight query
    await connection.query('SELECT 1');

    return {
      status: 'healthy',
      activeConnections: pool.getActiveCount(),
      totalConnections: pool.getTotalCount()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      reason: error.message
    };
  } finally {
    // Always return the connection to the pool, even if the query failed
    if (connection) connection.release();
  }
}

Message Queue Health

Message queue connectivity validation:

async function checkQueueHealth() {
  try {
    // Verify connection without consuming messages
    const queueDepth = await messageQueue.getQueueDepth();

    return {
      status: 'healthy',
      queueDepth,
      consumers: await messageQueue.getActiveConsumers()
    };
  } catch (error) {
    return {
      status: 'unhealthy',
      reason: 'Cannot connect to message queue'
    };
  }
}

Cache Health

Cache system validation:

async function checkCacheHealth() {
  try {
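    // Unique key per check so concurrent checks from other instances cannot overwrite it
    const testKey = `__health_check__:${Math.random().toString(36).slice(2)}`;
    const testValue = Date.now().toString();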

    // Write test value
    await cache.set(testKey, testValue, { ttl: 10 });

    // Read test value
    const retrieved = await cache.get(testKey);

    if (retrieved !== testValue) {
      return { status: 'unhealthy', reason: 'Cache read/write validation failed' };
    }

    // Clean up
    await cache.del(testKey);

    return { status: 'healthy' };
  } catch (error) {
    return { status: 'unhealthy', reason: error.message };
  }
}

Monitoring Integration

Health checks only provide value when monitoring systems use them effectively.

External Monitoring Services

Configure external monitoring to check health endpoints from multiple geographic regions.

Upstat monitors HTTP and HTTPS endpoints including health check URLs, tracking DNS resolution time, TCP connection establishment, TLS handshake duration, and time-to-first-byte metrics. Multi-region checking differentiates between local network issues and true service unavailability.

Health check monitoring should track both availability (is the endpoint responding) and response quality (status code, response time, response content).
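
Telling a regional network problem apart from a real outage comes down to comparing results across regions. A minimal sketch of that comparison; the `RegionResult` shape, the verdict names, and the region identifiers in any usage are assumptions rather than any monitoring product's API.

```typescript
interface RegionResult {
  region: string; // hypothetical region identifier
  ok: boolean;    // true if the check from this region succeeded
}

type Verdict = 'up' | 'down' | 'partial';

// 'down' only when every region fails; a mix of results points to a
// network problem between some regions and the service.
function classify(results: RegionResult[]): Verdict {
  if (results.length === 0) {
    throw new Error('no region results to classify');
  }
  const failures = results.filter((r) => !r.ok).length;
  if (failures === 0) {
    return 'up';
  }
  if (failures === results.length) {
    return 'down';
  }
  return 'partial';
}
```

Paging only on `down` and opening a lower-priority alert on `partial` is one way to keep regional blips from waking anyone up.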

Load Balancer Integration

Configure load balancers to use readiness endpoints:

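# Note: the health_check directive and match block require NGINX Plus
upstream backend {
  zone backend 64k;  # shared memory zone, required for active health checks
  server backend1:8080 max_fails=3 fail_timeout=30s;
  server backend2:8080 max_fails=3 fail_timeout=30s;
}

location / {
  proxy_pass http://backend;

  # Health check configuration
  health_check interval=5s
               fails=2
               passes=2
               uri=/health/ready
               match=health_ok;
}

match health_ok {
  status 200;
  # Regex match, so "application/json; charset=utf-8" also passes
  header Content-Type ~ "application/json";
}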

Failed health checks remove instances from rotation without restarting them. Passing health checks restore instances to the pool.
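
The `fails=2` and `passes=2` settings add hysteresis: a single bad or good probe does not change an instance's state. The logic can be sketched as a small state machine (an illustration of the idea, not the load balancer's actual implementation; `RotationState` is a hypothetical name):

```typescript
// Tracks whether an instance is in rotation, flipping state only after
// enough consecutive probe results disagree with the current state.
class RotationState {
  private inRotation = true;
  private streak = 0; // consecutive results that contradict the current state
  private readonly fails: number;
  private readonly passes: number;

  constructor(fails: number, passes: number) {
    this.fails = fails;
    this.passes = passes;
  }

  // Records one probe result and returns whether the instance is in rotation.
  record(probeOk: boolean): boolean {
    const contradicts = probeOk !== this.inRotation;
    this.streak = contradicts ? this.streak + 1 : 0;
    const needed = this.inRotation ? this.fails : this.passes;
    if (this.streak >= needed) {
      this.inRotation = !this.inRotation;
      this.streak = 0;
    }
    return this.inRotation;
  }
}
```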

Container Orchestration

Kubernetes probe configuration:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: app:latest
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 1
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 30

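Kubernetes uses these probes to manage the container lifecycle: liveness and readiness probes do not start until the startup probe succeeds, a failed liveness probe restarts the container, and a failed readiness probe stops traffic to the Pod without restarting it.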
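
It helps to work out how long each probe tolerates failure before Kubernetes acts. A rough estimate is `periodSeconds × failureThreshold`; the sketch below applies that to the values in the manifest above. The `failureWindowSeconds` helper is hypothetical, and the result ignores `timeoutSeconds` and scheduling jitter.

```typescript
interface ProbeTiming {
  periodSeconds?: number;
  failureThreshold?: number;
}

// Approximate seconds of consecutive failure before Kubernetes reacts.
// Falls back to the Kubernetes defaults (period 10s, threshold 3).
function failureWindowSeconds(probe: ProbeTiming): number {
  const period = probe.periodSeconds ?? 10;
  const threshold = probe.failureThreshold ?? 3;
  return period * threshold;
}

// Values taken from the Pod spec above.
const startupWindow = failureWindowSeconds({ periodSeconds: 5, failureThreshold: 30 });
const livenessWindow = failureWindowSeconds({ periodSeconds: 10, failureThreshold: 3 });
const readinessWindow = failureWindowSeconds({ periodSeconds: 5, failureThreshold: 3 });
```

With these settings the service has roughly 150 seconds to finish starting, a hung container is restarted after about 30 seconds, and an instance that stops being ready is removed from traffic after about 15 seconds.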

Testing Health Check Implementation

Validate that health checks work correctly before production deployment.

Unit Testing

Test health check logic in isolation:

describe('Health Check Endpoint', () => {
  it('returns 200 when all dependencies are healthy', async () => {
    mockDatabase.ping.mockResolvedValue(true);
    mockCache.ping.mockResolvedValue(true);

    const response = await request(app).get('/health/ready');

    expect(response.status).toBe(200);
    expect(response.body.status).toBe('ready');
  });

  it('returns 503 when database is unavailable', async () => {
    mockDatabase.ping.mockRejectedValue(new Error('Connection failed'));

    const response = await request(app).get('/health/ready');

    expect(response.status).toBe(503);
    expect(response.body.status).toBe('not ready');
  });
});

Integration Testing

Validate health checks against real dependencies in test environments:

describe('Health Check Integration', () => {
  it('detects actual database connectivity issues', async () => {
    // Start service with database stopped
    await stopDatabase();

    const response = await request(app).get('/health/ready');
    expect(response.status).toBe(503);

    // Start database
    await startDatabase();
    await waitForDatabaseReady();

    const healthyResponse = await request(app).get('/health/ready');
    expect(healthyResponse.status).toBe(200);
  });
});

Chaos Engineering

Intentionally fail dependencies to verify health checks detect problems:

# Freeze the database container (connections hang rather than being refused)
docker pause postgres-container

# Verify health check fails (-i prints the HTTP status line)
curl -i http://localhost:8080/health/ready
# Expected: 503 Service Unavailable

# Restore database
docker unpause postgres-container

# Verify health check recovers
curl -i http://localhost:8080/health/ready
# Expected: 200 OK

Conclusion

Effective health checks form the foundation of reliable monitoring. They provide monitoring systems with accurate signals about service health, enable load balancers to route traffic intelligently, and help operations teams detect problems faster.

The difference between useful health checks and misleading ones comes down to thoughtful design: separating liveness from readiness, handling dependencies without cascade failures, responding quickly without expensive validation, and providing actionable information without security risks.

Start with simple liveness checks that validate basic process health. Add readiness validation that checks critical dependencies using background checking patterns. Integrate health endpoints with monitoring systems and load balancers. Test failure scenarios to ensure health checks detect real problems.

Health checks are not afterthoughts—they are operational contracts between your service and the infrastructure that keeps it running. Invest the time to implement them correctly, and your monitoring systems will reward you with faster detection, fewer false positives, and more reliable operations.
