How to Configure Kubernetes Readiness Probes Without Breaking Deploys

Readiness probes look simple until one bad check stalls a rollout, removes every pod from a Service, or turns a small dependency blip into a production outage. Teams usually add them under pressure because traffic is reaching pods too early. The fix is correct, but an overstrict probe can break deploys just as fast as having no probe at all.

A good readiness probe answers one narrow question: can this container receive traffic right now? It should not prove that every downstream system is healthy. It should give Kubernetes a stable signal for routing and rolling updates.

What readiness probes actually control

The kubelet runs a container's readiness probe and uses the result to decide whether the container is marked ready. When a pod is not ready, Kubernetes removes it from Service load balancing by updating Endpoints or EndpointSlices. For a Deployment, readiness also affects rollout progress because new replicas must become ready before old replicas can safely go away.

This creates two important behaviors:

Readiness is traffic gating. A failing readiness probe stops new traffic from reaching that pod through a Service.
Readiness is rollout gating. A Deployment can stall if new pods never become ready.

Readiness probes are different from liveness probes. A liveness probe answers, “should Kubernetes restart this container?” A readiness probe answers, “should this pod receive traffic?” Mixing those responsibilities is one of the fastest ways to create restart loops or stuck rollouts.

Start with the right readiness contract

Design the readiness endpoint before tuning probe numbers. The endpoint should match your app’s real serving state.

For a typical HTTP API, readiness should usually check:

The process has finished booting.
The HTTP server is accepting requests.
Required local initialization is complete, such as loading config or warming critical in-memory state.
The app can serve at least a minimal request path without crashing.

Readiness should usually avoid:

Hard failures on optional dependencies.
Slow checks against external systems on every probe.
Expensive database queries.
Calling another service that may call back into the same workload.
Checking the entire platform health from inside one pod.

For example, if your service needs a database to serve every request, you may include a lightweight database connectivity check. If the database is temporarily unavailable, removing pods from traffic may be correct. But if all pods share the same database and every pod fails readiness at once, your Service has zero ready endpoints. That may be worse than serving partial failures, depending on how your clients handle errors.
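
If you do include a dependency check, one way to keep it lightweight is to cache the result for a few seconds so most probes read a stored value instead of opening a connection. The following is a minimal sketch of that idea, not a prescribed implementation. CachedCheck, ttl_seconds, and the injected check callable are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Cache the result of a slow dependency check for ttl_seconds.

    Hypothetical helper: `check` is any zero-argument callable that returns
    True when the dependency is usable (for example, a thin ping wrapper).
    """

    def __init__(self, check: Callable[[], bool], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_result = False
        self._expires_at: Optional[float] = None  # None means "never checked"

    def ok(self) -> bool:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            try:
                self._last_result = bool(self._check())
            except Exception:
                # Treat any error raised by the check as "dependency unavailable".
                self._last_result = False
            self._expires_at = now + self._ttl
        return self._last_result
```

The check still runs inline whenever the cached value expires, so it needs its own short timeout. Refreshing the value from a background thread would avoid even that occasional slow probe.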

A practical readiness endpoint often looks like this at the application level:

GET /readyz

Return 200 when:
- HTTP server is running
- app initialization has completed
- required local resources are available

Return 503 when:
- startup is still in progress
- the app cannot safely accept requests
- the app is draining before shutdown

Keep the endpoint fast. A common target is under 100 ms inside the cluster. If it regularly takes seconds, the probe itself can become a source of failure.
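
The contract above can be sketched with nothing but the Python standard library. This is an illustrative outline rather than a production server. ReadinessState, STATE, and serve are hypothetical names, and the "required local resources" condition is left out for brevity.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ReadinessState:
    """Thread-safe flags behind /readyz (all names are illustrative)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._initialized = False  # True once startup work has completed
        self._draining = False     # True once shutdown has begun

    def mark_initialized(self) -> None:
        with self._lock:
            self._initialized = True

    def start_draining(self) -> None:
        with self._lock:
            self._draining = True

    def status_code(self) -> int:
        # 200 only when startup is done and the app is not draining.
        with self._lock:
            return 200 if self._initialized and not self._draining else 503


STATE = ReadinessState()
BODIES = {200: b"ok\n", 503: b"not ready\n"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        code = STATE.status_code() if self.path == "/readyz" else 404
        body = BODIES.get(code, b"not found\n")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    # Bind to all interfaces so the kubelet can reach the pod IP.
    ThreadingHTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Call STATE.mark_initialized() at the end of startup and STATE.start_draining() when shutdown begins. The handler only reads in-memory flags, which keeps the response time far below any reasonable probe timeout.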

Use sane defaults, then tune from real rollout behavior

A safe baseline for many HTTP services looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 10
  progressDeadlineSeconds: 300
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.2.3
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 2
            successThreshold: 1
            failureThreshold: 3

These values mean:

initialDelaySeconds: 5 waits 5 seconds after the container starts before the first readiness check.
periodSeconds: 5 checks readiness every 5 seconds.
timeoutSeconds: 2 fails a check if the endpoint does not respond within 2 seconds.
failureThreshold: 3 requires three consecutive failed checks, roughly 10 to 15 seconds at this period, before marking the container unready.
successThreshold: 1 marks the container ready after one successful check.
minReadySeconds: 10 requires the pod to stay ready for 10 seconds before the Deployment treats it as available.
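
The timing these fields imply can be estimated with simple arithmetic. The helpers below are a rough sketch with hypothetical function names. They ignore kubelet scheduling jitter, so treat the results as approximations rather than guarantees.

```python
from typing import Tuple


def unready_window_seconds(period: int, failure_threshold: int,
                           timeout: int) -> Tuple[int, int]:
    """Estimate (best, worst) seconds between an app failing and being marked unready.

    Best case: the app fails just before a probe fires and every probe fails fast.
    Worst case: the app fails just after a successful probe and every probe
    runs until it times out.
    """
    best = period * (failure_threshold - 1)
    worst = period * failure_threshold + timeout
    return best, worst


def earliest_available_seconds(initial_delay: int, period: int,
                               success_threshold: int,
                               min_ready_seconds: int) -> int:
    """Estimate the earliest time a new pod can count as available.

    Assumes the app can already pass the first probe when it fires.
    """
    first_ready = initial_delay + period * (success_threshold - 1)
    return first_ready + min_ready_seconds


print(unready_window_seconds(period=5, failure_threshold=3, timeout=2))  # (10, 17)
print(earliest_available_seconds(5, 5, 1, 10))  # 15
```

With the baseline values, a failing pod keeps receiving traffic for somewhere between 10 and 17 seconds, and a healthy new pod cannot count as available before about 15 seconds.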

For slower services, increase the startup path separately instead of making readiness very forgiving. If your app takes 60 seconds to start, use a startup probe:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.2.3
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /startupz
              port: 8080
            periodSeconds: 5
            failureThreshold: 24
            timeoutSeconds: 2
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /livez
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3

With this configuration, the startup probe gives the container up to about 120 seconds to start. Once startup succeeds, Kubernetes begins normal readiness and liveness checks. This avoids the common mistake of using a huge readiness delay to hide slow startup.
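
The startup budget is periodSeconds multiplied by failureThreshold, plus any initialDelaySeconds. A small sketch with hypothetical helper names shows how to go from a target budget back to a failureThreshold value:

```python
import math


def startup_budget_seconds(period: int, failure_threshold: int,
                           initial_delay: int = 0) -> int:
    """Approximate time a startup probe allows before the container is restarted."""
    return initial_delay + period * failure_threshold


def failure_threshold_for(budget_seconds: int, period: int) -> int:
    """Smallest failureThreshold that covers budget_seconds at the given period."""
    return math.ceil(budget_seconds / period)


print(startup_budget_seconds(period=5, failure_threshold=24))  # 120
print(failure_threshold_for(budget_seconds=120, period=5))     # 24
```

For an app that normally starts in 60 seconds, a 120 second budget leaves room for a slow node or a cold image cache. How much headroom you need is an assumption to check against your own startup measurements.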

Pick the probe type that matches the workload

Kubernetes supports several probe types. Choose the simplest one that proves the pod can receive traffic.

HTTP readiness probe

Use this for HTTP services. It is explicit, easy to debug, and maps well to real request handling.

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3

Prefer a dedicated readiness route over checking the root path. The root path may redirect, render templates, hit dependencies, or change for product reasons.

TCP readiness probe

Use this when the only useful check is whether a port is accepting connections. This works for simple TCP services, but it is weaker than an HTTP probe because it cannot prove application readiness.

readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3

A TCP socket can be open while the app is still loading data or rejecting requests. Use it when that tradeoff is acceptable.

Exec readiness probe

Use this when readiness depends on a local command inside the container. Keep it fast and deterministic.

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -f /tmp/ready
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3

Be careful with exec probes in minimal images. If your image does not include /bin/sh, curl, or other tools, the probe fails even when the application is healthy. Prefer direct HTTP probes when possible.

Configure rollouts so readiness failures do not take capacity to zero

Readiness probes and Deployment strategy settings must fit together. A probe can be correct and still cause downtime if the rollout settings remove too much old capacity before new pods are ready.

For a user-facing service, start with:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
minReadySeconds: 10
progressDeadlineSeconds: 300

This tells Kubernetes to keep all existing replicas available while adding one extra pod during the rollout. After the new pod becomes ready and stays ready for minReadySeconds, Kubernetes can terminate an old pod.

Check these rollout settings together:

replicas: If you run one replica, a rolling update can still create a visible availability gap during startup or termination. Use at least two replicas for services that need continuous availability.
maxUnavailable: Use 0 when you cannot afford to drop below current ready capacity during deploys.
maxSurge: Use at least 1 so Kubernetes can start a replacement before removing an old pod.
minReadySeconds: Use this to avoid counting a pod as available after a single lucky readiness success.
progressDeadlineSeconds: Set this long enough for normal startup, but short enough to fail a broken rollout quickly.
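
To see how these settings bound capacity during a rollout, you can compute the minimum number of available pods and the maximum total pods. The sketch below uses hypothetical function names. It follows the Kubernetes rule that a percentage maxUnavailable rounds down and a percentage maxSurge rounds up, and it uses the default of 25% for both.

```python
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a "25%" style string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer ceiling or floor division avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int,
                   max_unavailable: IntOrPercent = "25%",
                   max_surge: IntOrPercent = "25%") -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout."""
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    surge = _resolve(max_surge, replicas, round_up=True)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3, max_unavailable=0, max_surge=1))  # (3, 4)
print(rollout_bounds(10))                                 # (8, 13)
```

With three replicas, maxUnavailable: 0, and maxSurge: 1, the rollout never drops below three available pods and never runs more than four.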

If you deploy Kubernetes resources using Terraform, keep these strategy settings close to the probe configuration so reviewers see the rollout behavior as one unit and probe changes produce predictable diffs.

Handle shutdown and draining deliberately

Readiness is not only for startup. It should also help during shutdown.

When Kubernetes terminates a pod, it marks the pod as terminating, runs any preStop hook, and then sends SIGTERM to the container. The whole sequence must finish within terminationGracePeriodSeconds. During this period, the pod may still receive traffic for a short time while endpoint updates propagate. Your app should stop accepting new work, return unready, and finish in-flight requests.

A practical shutdown pattern:

Receive SIGTERM.
Flip an internal readiness flag to false.
Keep the process alive.
Stop accepting new requests or return connection-draining responses.
Finish in-flight requests.
Exit before terminationGracePeriodSeconds expires.
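
The shutdown pattern above can be sketched as a small helper that tracks readiness and in-flight requests. This is an outline under stated assumptions, not a drop-in implementation. Drainer and install are hypothetical names, and the request handlers are assumed to call begin_request and end_request around each unit of work.

```python
import signal
import threading
import time


class Drainer:
    """Tracks readiness and in-flight requests during shutdown (illustrative)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self.ready = True

    def begin_request(self) -> bool:
        """Return False if the app is draining and should refuse new work."""
        with self._cond:
            if not self.ready:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def on_sigterm(self, signum=None, frame=None) -> None:
        # Only flip a flag here; signal handlers should do minimal work.
        self.ready = False

    def wait_for_drain(self, timeout: float) -> bool:
        """Block until in-flight requests finish; False if timeout elapsed first."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True


def install(drainer: Drainer) -> None:
    # Must be called from the main thread, as signal.signal requires.
    signal.signal(signal.SIGTERM, drainer.on_sigterm)
```

The readiness endpoint would return 503 whenever drainer.ready is False. After SIGTERM, the main thread calls wait_for_drain with a timeout shorter than the remaining grace period and then exits.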

Example pod settings:

spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api
      image: example/api:1.2.3
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - sleep 10
      readinessProbe:
        httpGet:
          path: /readyz
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3

The preStop sleep gives Kubernetes time to remove the pod from Service endpoints before the process receives SIGTERM. The sleep counts against terminationGracePeriodSeconds, so in this example the app has about 20 seconds left to finish in-flight requests after the hook completes. This is useful for HTTP services behind load balancers where endpoint propagation is not instant. Do not use an excessive sleep. If your normal request duration is under 2 seconds, a 5 to 10 second drain window may be enough. If you serve long requests, tune it based on real request duration.

If you are deploying workloads on managed Kubernetes, test readiness behavior with the actual ingress controller, service mesh, or load balancer path you use in production. A pod can be removed from Kubernetes endpoints quickly while an external load balancer still has a stale target for a short period. For a concrete workload example on Amazon Elastic Kubernetes Service (EKS), the same readiness and rollout mechanics apply when you deploy Apache Airflow on AWS EKS, even though the application shape differs from a simple API.

Debug a rollout blocked by readiness probes

When a rollout hangs, do not guess. Inspect the Deployment, ReplicaSets, pods, events, and endpoints in that order, then read the container logs.

kubectl rollout status deployment/api

kubectl get deployment api -o wide

kubectl describe deployment api

kubectl get rs -l app=api

kubectl get pods -l app=api -o wide

kubectl describe pod <pod-name>

kubectl get endpointslice -l kubernetes.io/service-name=api

kubectl logs <pod-name> -c api
kubectl logs <pod-name> -c api --previous   # only works if the container has restarted

Look for these signals:

Readiness probe failed: The kubelet is running the probe and getting a non-success response, timeout, connection refusal, or command failure.
ContainersReady is true but Ready is false: A readiness gate or pod condition may be blocking readiness.
Zero endpoints: The Service has no ready pods behind it.
ProgressDeadlineExceeded: The Deployment waited too long for new pods to become available.
Old ReplicaSet scaled down too early: Check maxUnavailable and replica count.
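
The pod-level signals can be read from the JSON printed by kubectl get pod <pod-name> -o json. The sketch below uses a hypothetical diagnose function that classifies a pod from its status.conditions and status.containerStatuses fields. The returned strings are illustrative labels, not Kubernetes output.

```python
from typing import Any, Dict


def diagnose(pod: Dict[str, Any]) -> str:
    """Classify a pod's readiness from its parsed JSON representation."""
    status = pod.get("status", {})
    conditions = {c["type"]: c["status"] for c in status.get("conditions", [])}

    if conditions.get("Ready") == "True":
        return "ready"
    if conditions.get("ContainersReady") == "True":
        # Containers pass their probes, so something else blocks the pod.
        return "containers ready but pod not ready: check readiness gates"

    not_ready = [cs["name"] for cs in status.get("containerStatuses", [])
                 if not cs.get("ready", False)]
    if not_ready:
        return "containers not ready: " + ", ".join(not_ready)
    return "not ready: no container status reported yet"
```

To use it, load the kubectl output with json.load and pass the resulting dict. A "containers not ready" result points at the probe or the app, while "containers ready but pod not ready" points at a readiness gate.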

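You can test the readiness endpoint from inside the cluster. This request goes through the api Service, so it only reaches pods that are already ready. To test an unready pod, replace api with the pod IP shown by kubectl get pods -o wide: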

kubectl run probe-debug \
  --rm -it \
  --restart=Never \
  --image=curlimages/curl:8.5.0 \
  -- curl -v http://api:8080/readyz

You can also port-forward to one pod and test the endpoint directly:

kubectl port-forward pod/<pod-name> 8080:8080

curl -v http://127.0.0.1:8080/readyz

If the endpoint works through port-forward but fails in the probe, check the configured port, path, scheme, host header assumptions, and whether the app binds to 127.0.0.1 instead of 0.0.0.0.
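
To reproduce what the kubelet sees, it can help to run a check that applies the same success rule: an HTTP probe succeeds when the status code is at least 200 and below 400, and fails on timeouts and connection errors. The following sketch approximates that rule with the Python standard library. The function names are hypothetical, and redirect handling may differ in detail from the kubelet.

```python
import urllib.error
import urllib.request

# Skip any proxy settings from the environment; a probe talks to the pod directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_succeeded(status_code: int) -> bool:
    """Apply the HTTP probe rule: 200 through 399 counts as success."""
    return 200 <= status_code < 400


def http_probe(url: str, timeout_seconds: float = 2.0) -> bool:
    """Send one GET request and report whether it would pass as a probe."""
    try:
        with _OPENER.open(url, timeout=timeout_seconds) as resp:
            return probe_succeeded(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for error statuses; classify the code the same way.
        return probe_succeeded(err.code)
    except OSError:
        # Covers connection refused, DNS failure, and socket timeouts.
        return False
```

Run it from a debug pod against the pod IP and the configured port and path, for example http_probe("http://<pod-ip>:8080/readyz"). A False result for an app that answers on 127.0.0.1 through port-forward is a strong hint that the bind address is the problem.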

Common readiness probe mistakes

Most readiness incidents come from a small set of configuration and application design errors.

Using the same endpoint for liveness and readiness. Readiness can fail during dependency issues without requiring a restart. Liveness should be much more conservative and fail only when a restart would actually help.
Checking too many dependencies. If every pod fails readiness when one shared dependency has a short outage, you may remove all serving capacity.
Setting timeoutSeconds too low. A 1 second timeout may be fine for a local health flag. It may be too aggressive for an endpoint that loads app code or checks a local queue.
Setting failureThreshold too low. One transient timeout should not usually remove a pod from traffic.
Using initialDelaySeconds to solve slow startup. Use a startup probe for slow startup. Keep readiness focused on traffic eligibility.
Forgetting minReadySeconds. Without it, Kubernetes can treat a pod as available after one successful readiness check.
Adding probes without adjusting rollout strategy. A strict probe plus maxUnavailable: 1 can reduce capacity during deploys.
Using shell-based exec probes in distroless images. The shell may not exist.
Making the probe endpoint require authentication. Kubelet probes do not automatically include your app’s normal auth context.

For platform teams building reusable deployment templates, encode these defaults into the base chart, Kustomize component, or internal module. If your platform also provisions cloud dependencies through Kubernetes APIs, keep the same review discipline you would use when you deploy AWS resources using Crossplane on Kubernetes: define safe defaults, make exceptions explicit, and test failure behavior before production.

A safe rollout checklist

Before merging a readiness probe change, run through this checklist:

Confirm the endpoint contract. It should answer whether the pod can receive traffic now.
Separate startup, readiness, and liveness. Do not make one endpoint carry all three meanings.
Set rollout strategy intentionally. For most production APIs, start with maxUnavailable: 0 and maxSurge: 1.
Use minReadySeconds. Require a pod to stay ready briefly before counting it as available.
Test a normal rollout. Watch kubectl rollout status and endpoint changes.
Test a broken rollout. Deploy an image or config that fails readiness and confirm old pods keep serving.
Test shutdown. Send traffic during pod termination and verify clients do not see avoidable connection failures.
Document dependency behavior. Decide which dependency failures should mark pods unready and which should return application-level errors.
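
Parts of this checklist can be automated in review. The sketch below is a hypothetical lint function that inspects a Deployment parsed into a dict, for example from kubectl get deployment api -o json. The specific thresholds, such as flagging an initialDelaySeconds above 30 without a startup probe, are assumptions to adapt and not Kubernetes rules.

```python
from typing import Any, Dict, List


def lint_deployment(dep: Dict[str, Any]) -> List[str]:
    """Return checklist warnings for a Deployment given as a parsed dict."""
    warnings: List[str] = []
    spec = dep.get("spec", {})

    rolling = spec.get("strategy", {}).get("rollingUpdate", {})
    # Kubernetes defaults maxUnavailable to 25% when it is not set.
    if rolling.get("maxUnavailable", "25%") not in (0, "0", "0%"):
        warnings.append("maxUnavailable is not 0")
    if not spec.get("minReadySeconds"):
        warnings.append("minReadySeconds is unset or 0")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for c in containers:
        name = c.get("name", "?")
        readiness = c.get("readinessProbe")
        liveness = c.get("livenessProbe") or {}
        if readiness is None:
            warnings.append(f"{name}: no readinessProbe")
            continue
        http = readiness.get("httpGet")
        if http is not None and http == liveness.get("httpGet"):
            warnings.append(f"{name}: readiness and liveness share an endpoint")
        # Assumed threshold: a long initial delay suggests a startupProbe is needed.
        if readiness.get("initialDelaySeconds", 0) > 30 and "startupProbe" not in c:
            warnings.append(f"{name}: long initialDelaySeconds without startupProbe")
        # failureThreshold defaults to 3 when omitted.
        if readiness.get("failureThreshold", 3) < 2:
            warnings.append(f"{name}: failureThreshold below 2")
    return warnings
```

An empty list means the manifest passes these static checks. It does not replace the rollout and shutdown tests, which need a running cluster.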

A readiness probe should make deploys safer, not more fragile. Keep the check narrow, tune it with your Deployment strategy, and test both success and failure paths. If you do that, Kubernetes can route traffic only to pods that are ready while still keeping rollouts stable when a new version is broken.