March 12, 2026 · 9 min read

Kubernetes Liveness, Readiness, and Startup Probes — Complete Guide

Kubernetes liveness, readiness, and startup probes explained: probe types, timing config, common misconfigurations, gRPC health checks, and debug tips.


Kubernetes health probes are one of the most impactful reliability features in Kubernetes — and one of the most commonly misconfigured. Get them right and Kubernetes automatically recovers from application failures, prevents traffic from reaching pods that aren’t ready, and handles slow-starting applications gracefully. Get them wrong and you’ll see cascading CrashLoopBackOff incidents, phantom traffic black holes, and production outages from pods that weren’t actually ready.

This guide covers how each probe type works, when each fires, probe configuration parameters, common misconfiguration patterns, and how to debug probe failures.


The Three Probe Types

Liveness Probe — determines if a container is running correctly. If it fails, Kubernetes kills and restarts the container. Use this to detect deadlocks, infinite loops, or other states where the application is running but permanently broken.

When it fires: after initialDelaySeconds (or startup probe passes), then every periodSeconds

Failure action: container is killed and restarted (respects restartPolicy)

Readiness Probe — determines if a container is ready to accept traffic. If it fails, the pod is removed from Service endpoints. Traffic stops routing to it. The container is not restarted.

When it fires: after initialDelaySeconds, then continuously throughout the pod’s life

Failure action: pod removed from Service endpoints (no traffic), restored when probe passes again

Startup Probe — determines if a container’s application has started. While it’s failing, liveness and readiness probes are disabled. Use this for slow-starting applications to prevent premature liveness failures during startup.

When it fires: from container start, every periodSeconds, until it passes or exceeds failureThreshold

Failure action: if it never passes within failureThreshold × periodSeconds, container is killed


Probe Types (HTTP, TCP, Exec, gRPC)

Each probe type has different mechanisms for checking health:

HTTP GET Probe — makes an HTTP GET request to the specified path and port. Response codes 200-399 indicate success.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:              # Optional custom headers
    - name: Authorization
      value: Bearer <token>

TCP Socket Probe — attempts to open a TCP connection. Success if the port accepts connections.

livenessProbe:
  tcpSocket:
    port: 3306    # Good for databases that don't expose HTTP health endpoints

Exec Probe — executes a command inside the container. Exit code 0 = success, any other code = failure.

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - "redis-cli ping | grep PONG"

gRPC Probe — checks a gRPC health checking protocol endpoint (stable since Kubernetes 1.27):

livenessProbe:
  grpc:
    port: 9090
    service: ""   # Empty string checks the overall gRPC server health

The gRPC health check requires the server to implement the gRPC Health Checking Protocol. Most gRPC frameworks have built-in support.


Probe Configuration Parameters

Understanding these parameters is essential for tuning probes correctly:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30    # Wait 30s after container starts before first probe
  periodSeconds: 10           # Check every 10 seconds
  timeoutSeconds: 5           # Probe fails if response takes >5 seconds
  successThreshold: 1         # 1 success to mark as healthy (liveness must be 1)
  failureThreshold: 3         # 3 consecutive failures = unhealthy

initialDelaySeconds: How long to wait after container start before the first probe. Set this to slightly longer than your average startup time. Too short → premature failures during startup. Too long → slow recovery from crashes.

periodSeconds: How often to probe. Default is 10s. For critical services, consider 5s. Don’t probe faster than your health endpoint can respond.

timeoutSeconds: How long to wait for a probe response before counting it as a failure. Keep it below periodSeconds so probes don't overlap. Default is 1s, which is often too short for endpoints that do real work.

failureThreshold: How many consecutive failures before taking action. Default is 3. Higher values reduce false positives but slow down failure detection.

successThreshold: How many consecutive successes to transition from failed to healthy. Must be 1 for liveness and startup probes. Can be higher for readiness (useful for services that need sustained health before receiving traffic).
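As an illustration, a readiness probe that demands two consecutive passes before restoring traffic might look like this (values are illustrative, not a universal recommendation):

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  successThreshold: 2    # Needs 2 consecutive passes (10s of sustained health) to rejoin endpoints
  failureThreshold: 3    # 3 consecutive failures (15s) to leave endpoints
```

This asymmetry lets a flapping pod leave rotation quickly but rejoin only once it has been stably healthy.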


Probe Lifecycle: How They Work Together

A typical pod startup sequence with all three probes:

t=0s:   Container starts
        → Startup probe begins
        → Liveness/readiness probes: PAUSED

t=0-60s: Startup probe running
         • App loading models, warming up caches, running migrations
         • If the startup probe still hasn't succeeded by t=120s (failureThreshold 12 × periodSeconds 10s): container killed

t=60s:  Startup probe SUCCESS
        → Startup probe stops
        → Liveness probe: STARTS
        → Readiness probe: STARTS
          (initialDelaySeconds counts from container start, so it has usually already elapsed)

t=70s:  Readiness probe passes
        → Pod added to Service endpoints
        → Traffic begins routing to pod

t=ongoing: Both liveness and readiness probe continuously
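The timeline above corresponds to a container spec roughly like the following (paths, ports, and image name are illustrative):

```yaml
containers:
- name: app
  image: registry.example.com/app:1.0   # illustrative
  ports:
  - containerPort: 8080
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 12     # 120s startup budget
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3      # gated until startup probe passes
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3      # also gated until startup probe passes
```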

Common Misconfigurations and Their Consequences

Misconfiguration 1: Liveness probe too aggressive → CrashLoopBackOff

The most common probe mistake. Setting failureThreshold: 1 or periodSeconds: 2 with timeoutSeconds: 1 means any transient slowness in your health endpoint causes a container restart. Under load, health endpoints can be slow, and an overly aggressive liveness probe turns that slowness into cascading restarts that make the problem worse.

# BAD: Will cause CrashLoopBackOff under any load spike
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 2
  timeoutSeconds: 1
  failureThreshold: 1    # ONE failure kills the container

# GOOD: Tolerates some transient failures
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3    # Needs 3 consecutive failures (30 seconds)

Misconfiguration 2: No readiness probe → traffic black holes

Without a readiness probe, Kubernetes routes traffic to pods as soon as the container starts — before the application is actually ready to handle requests. The pod shows as Running but returns 502/503 errors until the application finishes starting.

Always configure a readiness probe for any service that receives external traffic.

Misconfiguration 3: Readiness probe checks same endpoint as liveness

If readiness and liveness both check /healthz, they behave identically. Use separate endpoints:

  • /healthz — for liveness (is the process alive and not deadlocked?)
  • /readyz — for readiness (is the service ready to handle traffic, including dependency checks?)

Your readiness probe can check downstream dependencies (database connectivity, cache availability) while your liveness probe checks only the process health.

livenessProbe:
  httpGet:
    path: /healthz    # Simple: is the process running?
    port: 8080

readinessProbe:
  httpGet:
    path: /readyz     # Complex: is the service ready for traffic?
    port: 8080
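On the server side, the split can be sketched with Go's standard net/http. This is a minimal illustration, not a prescribed implementation; checkDependencies is a hypothetical stand-in for real DB/cache checks:

```go
// Sketch: separate liveness (/healthz) and readiness (/readyz) endpoints.
package main

import (
	"fmt"
	"net/http"
)

// readyStatus maps dependency health to the HTTP status the probe sees.
func readyStatus(depsHealthy bool) int {
	if depsHealthy {
		return http.StatusOK // 200: pod stays in Service endpoints
	}
	return http.StatusServiceUnavailable // 503: pod removed from endpoints
}

// checkDependencies would verify DB connectivity, cache availability, etc.
// Stubbed here as a placeholder.
func checkDependencies() bool { return true }

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: cheap, process-only check, no dependencies.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: includes dependency checks.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(readyStatus(checkDependencies()))
	})
	return mux
}

func main() {
	fmt.Println(readyStatus(checkDependencies())) // 200 while deps are healthy
	// In a real service: http.ListenAndServe(":8080", newMux())
	_ = newMux()
}
```

The key design point: /healthz never touches dependencies, so a broken database never triggers restarts, only removal from traffic via /readyz.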

Misconfiguration 4: No startup probe for slow-starting applications

Java applications with large classpaths, ML applications loading model weights, or applications that run database migrations at startup may take 60-300 seconds to start. If initialDelaySeconds is set shorter than the actual startup time, the liveness probe fails and kills a healthy (but still starting) container.

# For a Java app that takes up to 3 minutes to start
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30    # 30 failures × 10s = 300s (5 minutes) allowed for startup
  periodSeconds: 10

# These only activate after startup probe passes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

Misconfiguration 5: Health endpoint does expensive work

If your /healthz endpoint queries a database, calls external services, or performs other expensive operations, it will be called every periodSeconds indefinitely. This adds noticeable load at scale (100 pods × 6 probes/minute = 600 DB queries/minute from health checks alone).

Health endpoints should return quickly (< 100ms) and do minimal work. For readiness, check connection pool state rather than running a query.


Probe Tuning for Slow-Start Applications

A framework for setting probe parameters based on application startup characteristics:

Step 1: Measure actual startup time. Run your container and measure from start to “ready to handle first request.” Do this 5-10 times and take the worst case as a rough proxy for p99 (a true p99 needs far more samples).

Step 2: Set startup probe budget. failureThreshold × periodSeconds must exceed your p99 startup time.

p99 startup = 90 seconds
failureThreshold = 12, periodSeconds = 10 → budget = 120 seconds ✓

Step 3: Set liveness probe. After startup, how long should it take to detect a truly dead process? 30-60 seconds is usually right.

failureThreshold = 3, periodSeconds = 10 → detection time = 30 seconds

Step 4: Set readiness probe. How quickly should traffic stop routing to a temporarily unhealthy pod? For stateless services, 10-20 seconds is typically right.
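For a stateless service, a readiness probe tuned for roughly 15-second detection might look like this (values illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3    # 3 × 5s = 15s until traffic stops routing to the pod
```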


gRPC Health Probes in Depth

gRPC services require special handling. Before native gRPC probe support landed (stable in Kubernetes 1.27), you had to bundle the grpc-health-probe binary into your image and call it from an exec probe. Now it’s built in.

Implementing gRPC health checking on the server side (Go example):

import (
    "google.golang.org/grpc/health"
    "google.golang.org/grpc/health/grpc_health_v1"
)

// Register the health service on an existing *grpc.Server
healthServer := health.NewServer()
grpc_health_v1.RegisterHealthServer(grpcServer, healthServer)

// Mark the overall server and a specific service as SERVING
healthServer.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)
healthServer.SetServingStatus("my.Service", grpc_health_v1.HealthCheckResponse_SERVING)

Kubernetes probe configuration:

livenessProbe:
  grpc:
    port: 9090
    service: "my.Service"   # Check specific service, or empty string for overall

readinessProbe:
  grpc:
    port: 9090
    service: "my.Service"
  initialDelaySeconds: 5
  periodSeconds: 10

Debugging Probe Failures

When a probe fails, the pod’s events are the first place to look:

# Check pod events for probe failures
kubectl describe pod <pod-name> -n <namespace>
# Look for "Liveness probe failed" or "Readiness probe failed" in Events section

# Check recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep <pod-name>

Debugging a specific probe:

# Manually run the probe command to test it
kubectl exec -it <pod-name> -- curl -v http://localhost:8080/healthz

# Check if the port is listening inside the container
kubectl exec -it <pod-name> -- ss -tlnp | grep 8080

# Check container logs for health endpoint errors
kubectl logs <pod-name> --since=5m | grep healthz

Common probe failure root causes:

  1. Port mismatch — probe checks port 8080 but app listens on 3000. Always verify with ss -tlnp inside the container.
  2. Path doesn’t exist — the probe gets a 404 from /healthz. Implement the endpoint.
  3. Authentication required — health endpoint requires auth headers the probe doesn’t send. Use a separate unauthenticated health endpoint.
  4. Timeout too short — health endpoint takes 2 seconds, timeoutSeconds: 1. Increase timeout.
  5. App not ready at startup — initialDelaySeconds is shorter than the real startup time, so the liveness probe fires while the app is still booting. Raise the delay or add a startup probe.

Production Probe Checklist

  • Every service has a readiness probe configured
  • Liveness probe failureThreshold ≥ 3 (no hair-trigger restarts)
  • Startup probe set for any application with startup time >30 seconds
  • Liveness and readiness use separate endpoints with different logic
  • Health endpoints return <100ms (no DB queries, no external calls)
  • timeoutSeconds > typical health endpoint response time
  • Probes tested manually before deploying to production
  • Probe failures monitored with alerts (not just relied on for silent self-healing)

Get Your K8s Reliability Right

Probe misconfiguration is one of the leading causes of unexpected production restarts and traffic disruptions in Kubernetes. Getting it right requires testing failure scenarios — not just configuring probes and hoping.

Our K8s Health Assessment service at kubernetes.ae audits your probe configuration alongside HPA settings, resource limits, and PDB policies to build a comprehensive reliability picture of your cluster.

Get Expert Kubernetes Help

Talk to a certified Kubernetes expert. Free 30-minute consultation — actionable findings within days.
