How to Deploy Zero-Downtime Kubernetes Releases With Helm and Argo Rollouts

Kubernetes releases can put teams under pressure when they need to ship quickly without disrupting live traffic. A standard Deployment with a rolling update is often enough, but it can fall short when you need controlled traffic shifts, automated health checks, fast rollback, or a pause before exposing a new version to every user.

Helm and Argo Rollouts solve different parts of that problem. Helm packages and applies your Kubernetes manifests. Argo Rollouts manages progressive delivery: you replace the standard Deployment with a Rollout custom resource, and the Argo Rollouts controller runs it as a canary or blue-green release.

The useful pattern is simple: use Helm to define the release, and use Argo Rollouts to control how traffic reaches the new version. The hard part is making sure the application, probes, services, and rollback path are ready before you call the release “zero downtime.”

What Helm and Argo Rollouts should each own

Helm should own packaging, configuration, and repeatable installation. Your chart can include the Rollout, Service, ConfigMap, Secret references, ingress objects, and analysis templates. The chart gives you one versioned unit to promote through environments.

Argo Rollouts should own progressive delivery behavior. It decides when to scale the new ReplicaSet, when to pause, when to run checks, when to promote, and when to abort.

A clean split looks like this:

Helm values: image tag, replica count, resource requests, rollout strategy, analysis settings, service names, environment-specific configuration.
Argo Rollouts: canary steps, blue-green promotion, pauses, analysis runs, rollback behavior, rollout status.
Continuous integration and delivery pipeline: chart linting, image publishing, helm upgrade, rollout status checks, and promotion rules.

If your team manages manifests with another tool, the same ownership questions still apply. For example, teams that deploy Kubernetes resources using Terraform still need a clear boundary between manifest delivery and rollout control.

Zero downtime starts before the rollout

Argo Rollouts cannot protect users if the application is not safe to roll. Before you change the controller, check the basics that decide whether a pod can enter and leave traffic cleanly.

Readiness probes must be accurate. A pod should not receive traffic until it can serve real requests. A shallow TCP check may pass while the application is still loading caches, connecting to a database, or warming routes.
Liveness probes should not restart slow pods too aggressively. A bad liveness probe can turn a normal startup delay into a crash loop during release.
Old pods need time to drain. Use a sensible terminationGracePeriodSeconds. Add a preStop hook if the application needs a short delay before shutdown.
Capacity must cover overlap. Canary and blue-green releases often run old and new versions at the same time. If the cluster has no spare capacity, the rollout may stall or push other workloads into pressure.
Database changes must be backward compatible. A canary is risky if the new version writes data that the old version cannot read after rollback.
Pod Disruption Budgets should match reality. A strict PodDisruptionBudget can protect availability, but it can also block node drains and upgrades if replica counts are too low.
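The checklist above maps to a few pod spec fields. The sketch below assumes an HTTP service on port 8080, the /ready endpoint used in the chart example later in this article plus a hypothetical /healthz endpoint, and an image that includes sh and sleep for the preStop delay. The timings and resource requests are placeholders to tune against your own startup and drain measurements.

```yaml
# templates/rollout.yaml (excerpt): keys under the Rollout's spec.template.spec
terminationGracePeriodSeconds: 45   # covers the preStop delay plus request drain
containers:
  - name: app
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
    ports:
      - containerPort: 8080
    resources:
      requests:                     # the scheduler reserves this for old and new pods alike
        cpu: 250m
        memory: 256Mi
    startupProbe:                   # up to 30 x 5s = 150s to start before liveness applies
      httpGet:
        path: /healthz              # hypothetical endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 30
    readinessProbe:                 # should exercise real dependencies, not just the port
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:                  # lenient: restart only a process that is truly stuck
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 6
    lifecycle:
      preStop:
        exec:
          # Short pause so endpoints stop routing here before the app gets SIGTERM.
          command: ["sh", "-c", "sleep 10"]
---
# templates/pdb.yaml (hypothetical file name)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "app.fullname" . }}
spec:
  maxUnavailable: 1                 # allow voluntary evictions one pod at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "app.name" . }}
```

With four replicas, a budget of minAvailable: 4 would allow zero voluntary disruptions and block node drains, which is the over-strict budget the checklist warns about.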

For a web API, “zero downtime” usually means new pods become ready before traffic moves, old pods keep serving while they drain, and any failed canary can return traffic to the stable version quickly. For a worker, batch job, or scheduler, the release problem may be different. You may need idempotent jobs, queue visibility timeouts, or leader election checks instead of traffic splitting.

Build the Helm chart around the rollout path

A practical Helm chart for Argo Rollouts usually includes a Rollout resource and at least one stable Service. Canary strategies may also use a canary service and a traffic routing integration. Without traffic routing, Argo Rollouts approximates each canary weight by scaling canary replicas, so a Service that selects both versions spreads requests across whatever pods are ready rather than sending an exact traffic percentage.
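With a traffic provider, each setWeight becomes an actual share of requests instead of a replica ratio. A sketch using the NGINX Ingress integration, assuming the NGINX Ingress Controller is installed and that the chart renders an Ingress with the same name as the stable Service (a naming choice made here for illustration). Other providers, such as Istio, use different fields under trafficRouting.

```yaml
# templates/rollout.yaml (excerpt): spec.strategy with NGINX traffic routing
strategy:
  canary:
    stableService: {{ include "app.fullname" . }}
    canaryService: {{ include "app.fullname" . }}-canary
    trafficRouting:
      nginx:
        # Existing Ingress that routes to the stable Service. Argo Rollouts creates
        # a canary Ingress from it and updates the NGINX canary weight per step.
        stableIngress: {{ include "app.fullname" . }}
    steps:
      - setWeight: 10
      - pause: { duration: 5m }
```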

A simplified Helm template for a canary rollout could look like this:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ include "app.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  strategy:
    canary:
      stableService: {{ include "app.fullname" . }}
      canaryService: {{ include "app.fullname" . }}-canary
      steps:
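        - setWeight: {{ .Values.rollout.firstWeight }}
        - pause: { duration: {{ .Values.rollout.firstPause }} }
        - setWeight: {{ .Values.rollout.secondWeight }}
        - pause: { duration: {{ .Values.rollout.secondPause }} }
        - setWeight: 100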
  selector:
    matchLabels:
      app.kubernetes.io/name: {{ include "app.name" . }}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ include "app.name" . }}
    spec:
      containers:
        - name: app
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
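The canary template references two Services by name, so the chart has to render both. A minimal sketch, assuming clients reach the app on port 80; the file name is a hypothetical choice.

```yaml
# templates/services.yaml (hypothetical file name)
# Both Services start with the same selector. Argo Rollouts adds a
# rollouts-pod-template-hash selector to each Service it manages, pointing the
# stable Service at stable pods and the canary Service at canary pods.
apiVersion: v1
kind: Service
metadata:
  name: {{ include "app.fullname" . }}
spec:
  selector:
    app.kubernetes.io/name: {{ include "app.name" . }}
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: {{ include "app.fullname" . }}-canary
spec:
  selector:
    app.kubernetes.io/name: {{ include "app.name" . }}
  ports:
    - port: 80
      targetPort: 8080
```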

The values file should make the release behavior easy to review:

replicaCount: 4

image:
  repository: example/app
  tag: "1.2.3"

rollout:
  strategy: canary
  firstWeight: 10
  firstPause: 5m
  secondWeight: 50
  secondPause: 10m
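The helm upgrade command in the next section layers a values-production.yaml file over these defaults. A hypothetical override, with numbers picked only for illustration:

```yaml
# values-production.yaml (hypothetical overrides layered on values.yaml)
# Only keys that differ in production belong here; everything else is inherited.
replicaCount: 6

rollout:
  firstWeight: 5      # smaller first slice for a higher-traffic environment
  firstPause: 10m
  secondWeight: 25
  secondPause: 30m
```

Because --set is applied after --values files, the pipeline still controls the image tag while this file controls release pacing.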

Keep the chart boring. Avoid hiding risky behavior in complex templates. A reviewer should be able to answer these questions quickly:

Which image tag will run?
How many replicas will exist during the rollout?
Which service receives stable traffic?
Where does canary traffic go?
What pauses or checks must pass before promotion?
How does the team abort if the release is bad?

If the application depends on cloud resources, keep that dependency explicit. A release should not quietly create or mutate critical infrastructure unless your team has agreed to that workflow. Some teams separate application rollout from infrastructure provisioning, such as when they deploy AWS resources using Crossplane on Kubernetes.

Run the release with checks and a clear rollback path

The basic deployment flow is usually straightforward:

Build and push the application image.
Update the Helm values with the new image tag.
Run chart validation in the pipeline.
Apply the release with helm upgrade --install.
Wait for Argo Rollouts to report a healthy rollout.
Promote or abort based on checks.
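The flow above can be sketched as a hypothetical GitHub Actions workflow. The CI system is an assumption (any pipeline that runs shell steps works the same way), as are the 30-minute timeout and the use of timed pauses only, so the job never waits on a human promote. It also assumes kubectl argo rollouts status exits non-zero when the rollout degrades or the timeout expires.

```yaml
# .github/workflows/deploy.yaml (hypothetical GitHub Actions workflow)
# Assumes the runner already has helm, kubectl, the Argo Rollouts kubectl plugin,
# and credentials for the production cluster, and that the image is already pushed.
name: deploy
on:
  workflow_dispatch:
    inputs:
      tag:
        description: Image tag to release
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      TAG: ${{ inputs.tag }}   # passed through env instead of inlined into scripts
    steps:
      - uses: actions/checkout@v4
      - name: Validate chart
        run: |
          helm lint ./chart --values values-production.yaml --set image.tag="$TAG"
          helm template app ./chart --values values-production.yaml --set image.tag="$TAG" > /dev/null
      - name: Apply release
        run: |
          helm upgrade --install app ./chart \
            --namespace production \
            --values values-production.yaml \
            --set image.tag="$TAG"
      - name: Wait for rollout
        id: wait
        # helm upgrade returns once manifests are applied; this step blocks until
        # Argo Rollouts reports a finished rollout or the timeout expires.
        run: kubectl argo rollouts status app --namespace production --timeout 30m
      - name: Roll back
        # Runs only if the wait step failed, so a lint failure never rolls back production.
        if: failure() && steps.wait.outcome == 'failure'
        run: |
          kubectl argo rollouts abort app --namespace production
          helm rollback app --namespace production
```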

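A typical command sequence looks like this. The kubectl commands assume the Rollout renders with the name app, for example because the chart is also named app: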

helm upgrade --install app ./chart \
  --namespace production \
  --values values-production.yaml \
  --set image.tag=1.2.3

kubectl argo rollouts get rollout app \
  --namespace production

kubectl argo rollouts status app \
  --namespace production

For a manual promotion step, add a pause with no duration after a small canary weight. The rollout waits there while the team inspects the service, and promote resumes it:

kubectl argo rollouts promote app \
  --namespace production
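The promote command resumes a rollout that is waiting at a pause step. A sketch of canary steps with one manual gate after the first weight, where a pause with no duration waits indefinitely:

```yaml
# templates/rollout.yaml (excerpt): spec.strategy.canary.steps with a manual gate
steps:
  - setWeight: 10
  - pause: {}                  # no duration: waits here until someone runs promote
  - setWeight: 50
  - pause: { duration: 10m }   # timed pause: continues on its own
  - setWeight: 100
```

Running promote with --full skips the remaining steps, pauses, and analysis, so keep it for cases where the team has already decided to finish the release immediately.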

If the release fails, abort it:

kubectl argo rollouts abort app \
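  --namespace production

Aborting reverts the canary steps and makes the previous ReplicaSet active again, but the Rollout's spec.template still describes the new version. To finish the rollback, change the template back to the previous version, for example with helm rollback or a helm upgrade that sets the previous image tag. Otherwise, once the Rollout leaves the aborted state, it will try to move to the new version again.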

Automated analysis can reduce guesswork, but the checks need to match the risk. Useful checks often include request error rate, latency, health endpoint success, queue depth, or application-specific signals. A weak check can mark a bad release healthy. A noisy check can block good releases. Start with a small set of signals that the team already trusts during incidents.
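A success-rate check is a common starting point. The sketch below is an AnalysisTemplate run inline as a canary step. It assumes a Prometheus server at the address shown and an application counter named http_requests_total with service and code labels; the address, metric, labels, and 99% threshold are all placeholders. Because Argo Rollouts uses its own double-brace placeholders for args, the chart emits them as Helm string literals so Helm does not try to render them.

```yaml
# templates/analysis-success-rate.yaml (hypothetical file name)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: {{ include "app.fullname" . }}-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5                 # five measurements, one per minute
      failureLimit: 1          # tolerate one failed measurement; a second fails the run
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          # Each args placeholder is wrapped in a Helm string literal so that
          # Argo Rollouts, not Helm, substitutes it.
          query: |
            sum(rate(http_requests_total{service="{{ "{{args.service-name}}" }}",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{ "{{args.service-name}}" }}"}[2m]))
---
# templates/rollout.yaml (excerpt): a canary step that runs the template inline.
# A failed analysis aborts the rollout and returns traffic to the stable version.
steps:
  - setWeight: 10
  - analysis:
      templates:
        - templateName: {{ include "app.fullname" . }}-success-rate
      args:
        - name: service-name
          value: {{ include "app.fullname" . }}-canary
```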

For platform workloads, rollout planning may need extra care. For example, when you deploy Apache Airflow on Amazon Elastic Kubernetes Service (EKS), the webserver, scheduler, workers, and metadata database do not all behave like a stateless HTTP service. Each component needs its own release and rollback assumptions.

Choose canary or blue-green based on the failure you expect

Canary and blue-green releases both reduce risk, but they solve different problems.

Use canary when gradual exposure matters

A canary release sends a small portion of traffic to the new version first. This works well when you want to detect problems under real traffic before the full rollout.

Canary is a good fit when:

The application is stateless or mostly stateless.
Old and new versions can run at the same time.
You can measure health during partial traffic exposure.
You want to stop at 5%, 10%, or 50% before full promotion.

Canary is weaker when the risk is tied to background processing, database writes, or tenant-specific behavior that may not appear in a small traffic slice.

Use blue-green when fast switching matters

A blue-green release runs the new version beside the old version, then switches traffic when the new version is ready. It can make rollback simple if both versions remain available and compatible.

Blue-green is a good fit when:

You need the full new stack ready before traffic moves.
You want a fast switch between stable and preview versions.
You have enough capacity to run both versions at once.
Your testing process works well against a preview service.

Blue-green can cost more during the release because it may require duplicate capacity. It also does not remove the need for backward-compatible data changes.
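The values file already carries a rollout.strategy key. A sketch of how the Rollout template could branch on it, assuming a hypothetical value of blueGreen selects the blue-green path, the active Service reuses the stable Service name, and the chart also renders a -preview Service:

```yaml
# templates/rollout.yaml (excerpt): spec.strategy chosen by .Values.rollout.strategy
strategy:
{{- if eq .Values.rollout.strategy "blueGreen" }}
  blueGreen:
    activeService: {{ include "app.fullname" . }}
    previewService: {{ include "app.fullname" . }}-preview
    autoPromotionEnabled: false   # hold before switching until someone runs promote
    scaleDownDelaySeconds: 30     # delay before the old ReplicaSet scales down after the switch
{{- else }}
  canary:
    stableService: {{ include "app.fullname" . }}
    canaryService: {{ include "app.fullname" . }}-canary
    steps:
      - setWeight: {{ .Values.rollout.firstWeight }}
      - pause: { duration: {{ .Values.rollout.firstPause }} }
{{- end }}
```

An environment's values file can then switch release style without touching the template, and reviewers see the choice in one key.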

Failure modes to plan for

Most rollout problems come from assumptions that were never tested. Plan for these before production:

The readiness probe lies. The pod becomes ready before the app can serve real traffic. Fix the probe before tuning rollout steps.
The canary passes because it receives too little or unrepresentative traffic. If you need exact traffic weights, confirm that your ingress, service mesh, or traffic provider is configured correctly.
The rollback works at the pod level but fails at the data level. Use expand-and-contract database migrations so old and new versions can run together.
The release stalls because the cluster lacks capacity. Check resource requests, cluster autoscaling behavior, and disruption budgets before using blue-green or high-surge canary steps.
Helm reports success while the rollout is still progressing. Make your pipeline wait for Argo Rollouts status, not only for the Helm command to finish.
Manual approval becomes unclear during an incident. Define who can promote, pause, or abort before the release window starts.
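Two of these failure modes, stalled releases and surge pressure, map to Rollout fields. A sketch, assuming a basic replica-based canary and values chosen only for illustration; size them from your resource requests and node headroom.

```yaml
# templates/rollout.yaml (excerpt): bound surge capacity and fail stalled releases
spec:
  replicas: {{ .Values.replicaCount }}
  progressDeadlineSeconds: 600   # no progress for 10 minutes marks the rollout degraded
  progressDeadlineAbort: true    # abort back to stable instead of leaving it stuck
  strategy:
    canary:
      maxSurge: "25%"            # at most 25% extra pods above replicas during the update
      maxUnavailable: 0          # keep the desired count of available pods throughout
```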

If your release also creates application-adjacent infrastructure, treat that as a separate risk. A pattern such as deploying a Kubernetes app and an AWS resource using Crossplane can be useful, but you still need to decide whether infrastructure changes should happen in the same promotion path as application traffic.

Takeaway

Helm gives you a repeatable release package. Argo Rollouts gives you safer traffic control. Together, they can reduce deployment disruption, but only when the application is ready for progressive delivery.

Start with one service that already has reliable probes, clear metrics, and a known rollback path. Convert its Deployment to a Rollout, use Helm values to keep the strategy reviewable, and make your pipeline wait for rollout status. Once that path works under real release pressure, standardize it for the services that need controlled promotion instead of applying it everywhere by default.
