Progressive Delivery on Kubernetes with Argo Rollouts: Canary, Analysis, and Automated Rollback

A Kubernetes Deployment will happily roll a broken image to 100% of your traffic as fast as readiness probes allow, because a passing readiness probe says nothing about your error rate or p99 latency. Progressive delivery closes that gap: you shift a small slice of traffic, measure real SLOs against the canary, and let a controller promote or roll back on the evidence. This article shows how to do that end to end with Argo Rollouts – converting a Deployment, shaping traffic, wiring Prometheus-backed analysis, and gating CI.

1. Why rolling updates are not enough

A rolling update is a mechanical strategy: new pods become Ready before old ones are removed, respecting maxSurge and maxUnavailable. What it cannot answer is the only question that matters during a release – is the new version actually serving requests correctly? Readiness checks that a process is up, not that it returns 200s, not that latency is within budget, not that a downstream dependency still resolves under the new code path.

Progressive delivery adds a feedback loop. Promotion becomes conditional on metrics, and rollback is automatic when they regress. The unit of progress is a traffic weight, not a pod count, so blast radius is bounded by the percentage of users exposed rather than by how fast pods schedule.

Argo Rollouts is a drop-in replacement for the Deployment controller, with a Rollout custom resource that adds canary and blue-green strategies, native AnalysisRun evaluation, and an abort path back to the last stable ReplicaSet. It is a CNCF Graduated project and integrates cleanly with Argo CD.

2. Install the controller and the kubectl plugin

Install the controller into its own namespace, then add the kubectl plugin so you can drive and observe rollouts from the CLI.

# Controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

kubectl rollout status deploy/argo-rollouts -n argo-rollouts

# kubectl plugin (macOS via Homebrew)
brew install argoproj/tap/kubectl-argo-rollouts

# Or download the binary directly (Linux amd64 shown)
curl -fsSLO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x ./kubectl-argo-rollouts-linux-amd64
sudo mv ./kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

kubectl argo rollouts version

For a real cluster, pin to a specific release tag instead of latest so the controller version is reproducible, and manage the manifest through your GitOps repo rather than kubectl apply from a laptop.
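One way to pin the version in a GitOps repo is a small Kustomize overlay. This is a minimal sketch, assuming Kustomize is how you manage cluster add-ons; the release tag shown is illustrative, so substitute the version you have actually validated.

```yaml
# kustomization.yaml (sketch): pins the controller manifest to one release.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo-rollouts
resources:
  # Illustrative tag; replace with the release you have tested.
  - https://github.com/argoproj/argo-rollouts/releases/download/v1.7.2/install.yaml
```

Upgrading the controller then becomes a reviewed one-line change to the tag rather than whatever latest resolved to on the day someone ran kubectl apply.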

3. Convert a Deployment to a Rollout

A Rollout is intentionally similar to a Deployment: the spec.template is identical and most fields carry over verbatim. The differences are that kind becomes Rollout, apiVersion becomes argoproj.io/v1alpha1, and a strategy.canary (or strategy.blueGreen) block replaces the rolling-update strategy.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: shop
spec:
  replicas: 8
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout:1.8.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          resources:
            requests: { cpu: 200m, memory: 256Mi }
  strategy:
    canary:
      maxSurge: "25%"          # defaults to 25% if omitted
      maxUnavailable: 0         # keep stable capacity intact during the roll
      steps:
        - setWeight: 10
        - pause: { duration: 5m }

Do not run both a Deployment and a Rollout with the same selector. If you are migrating an existing app, either rename the workload or use the workloadRef field on the Rollout to reference the existing Deployment. The Rollout then reuses that Deployment's pod template instead of duplicating it; the Deployment itself is not modified, so scale it down to 0 once the Rollout's pods are healthy.
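A workloadRef migration might look like the following. This is a sketch that assumes the existing Deployment is also named checkout in the shop namespace; the Rollout carries no spec.template of its own.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: shop
spec:
  replicas: 8
  selector:
    matchLabels:
      app: checkout
  # Reuse the pod template of the existing Deployment (assumed name: checkout)
  # instead of copying it into the Rollout.
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
```

Once the Rollout's pods are serving, scale the referenced Deployment to 0 replicas so the two controllers are not both running pods behind the same selector.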

Apply it and watch the canary surface in the plugin’s tree view:

kubectl apply -f checkout-rollout.yaml
kubectl argo rollouts get rollout checkout -n shop --watch

4. Canary steps, traffic weights, and pause conditions

The steps list is a state machine the controller walks top to bottom. The step types you will use constantly:

Step	Effect
`setWeight: N`	Route N% of traffic to the canary (requires a traffic provider; see section 5)
`pause: { duration: 10m }`	Wait a fixed time, then continue automatically
`pause: {}`	Pause indefinitely until a human runs `promote`
`setCanaryScale`	Decouple canary replica count from traffic weight
`analysis`	Run an inline `AnalysisRun` that must pass before proceeding (see section 6)

A practical production canary looks like this:

strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    steps:
      - setWeight: 5
      - pause: { duration: 2m }
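      - analysis:
          templates:
            - templateName: success-rate-latency
          args:                     # the template in section 6 declares these without defaults
            - { name: service, value: checkout }
            - { name: namespace, value: shop }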
      - setWeight: 25
      - pause: { duration: 5m }
      - setWeight: 50
      - pause: { duration: 5m }
      - setWeight: 100

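Two refinements worth knowing. setCanaryScale decouples the canary's replica count from its traffic weight, so you can run a small canary while sending it little traffic (saving cost during a long bake) or pin an explicit count regardless of weight. It only takes effect when a traffic provider is configured:

- setCanaryScale:
    replicas: 2             # pin an explicit canary replica count
# or
- setCanaryScale:
    weight: 25              # scale canary to 25% of spec.replicas
# or
- setCanaryScale:
    matchTrafficWeight: true   # back to the default: replicas track weight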

And dynamicStableScale: true scales the stable ReplicaSet down as the canary weight rises, so you are not paying for double capacity at 50/50. It requires a traffic provider, and it has a rollback cost: because stable has been scaled down, an abort can only shift traffic back as stable pods scale up again, so plan for a slower rollback rather than an instant one.
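A sketch of dynamicStableScale alongside a traffic provider, using the NGINX names from this article. The 600-second abortScaleDownDelaySeconds value is an illustrative assumption, not a recommendation.

```yaml
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    dynamicStableScale: true          # shrink stable as canary weight grows
    abortScaleDownDelaySeconds: 600   # illustrative: keep canary pods around after an abort
    trafficRouting:
      nginx:
        stableIngress: checkout
    steps:
      - setWeight: 25
      - pause: { duration: 5m }
      - setWeight: 50
      - pause: { duration: 5m }
```

The trade-off is cost versus rollback speed: with stable kept at full scale (the default) an abort is a pure traffic flip, whereas here it also waits on pod scheduling.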

5. Traffic shaping: NGINX, Istio, or SMI

Without a trafficRouting provider, setWeight is approximated by replica ratio – the controller scales canary vs stable pods so the proportion roughly matches the weight. That is coarse (you cannot do 5% with 8 replicas) and couples traffic to scaling. For real percentage control, plug in an ingress or mesh provider. The controller manipulates that provider’s native objects on each step.

NGINX Ingress. You provide a primary Ingress plus two Services; Rollouts creates and manages a shadow canary Ingress with the nginx.ingress.kubernetes.io/canary annotations, adjusting canary-weight per step.

strategy:
  canary:
    canaryService: checkout-canary   # required
    stableService: checkout-stable   # required
    trafficRouting:
      nginx:
        stableIngress: checkout       # your existing Ingress, backend = stableService
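The objects you provide for the NGINX case could be sketched as below. The hostname and port numbers are assumptions for illustration. Both Services start with the same selector; the controller adds a pod-template-hash selector to each so they point at the right ReplicaSet.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: checkout-stable
  namespace: shop
spec:
  selector:
    app: checkout          # controller narrows this to the stable ReplicaSet
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-canary
  namespace: shop
spec:
  selector:
    app: checkout          # controller narrows this to the canary ReplicaSet
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  namespace: shop
spec:
  ingressClassName: nginx
  rules:
    - host: checkout.example.com   # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-stable   # must match stableService
                port:
                  number: 80
```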

Istio. You own a VirtualService and a DestinationRule with named subsets; Rollouts rewrites the route weights and the subset pod-hash labels. Subset-level splitting needs only a single Service.

strategy:
  canary:
    trafficRouting:
      istio:
        virtualService:
          name: checkout-vsvc
          routes:
            - primary                 # the HTTP route name to manage
        destinationRule:
          name: checkout-destrule
          canarySubsetName: canary
          stableSubsetName: stable
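The Istio objects referenced above might start out like this. It is a sketch assuming a single Service named checkout; the route name and subset names match the Rollout snippet, and the controller rewrites the weights and adds the pod-hash label to each subset during a rollout.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-vsvc
  namespace: shop
spec:
  hosts:
    - checkout                 # assumed Service name
  http:
    - name: primary            # must match the route listed in the Rollout
      route:
        - destination:
            host: checkout
            subset: stable
          weight: 100
        - destination:
            host: checkout
            subset: canary
          weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-destrule
  namespace: shop
spec:
  host: checkout
  subsets:
    - name: canary             # matches canarySubsetName
      labels:
        app: checkout
    - name: stable             # matches stableSubsetName
      labels:
        app: checkout
```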

SMI (Service Mesh Interface). For meshes that implement SMI (e.g., Linkerd), the smi provider manages a TrafficSplit object. The mechanics mirror the above: you supply canary and stable Services and the controller adjusts the split weights.
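For completeness, an SMI configuration could be sketched as follows. The rootService and trafficSplitName values are hypothetical names chosen for illustration.

```yaml
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    trafficRouting:
      smi:
        rootService: checkout               # hypothetical Service that clients call
        trafficSplitName: checkout-split    # hypothetical name for the managed TrafficSplit
```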

The provider choice does not change your steps. That is the point of the abstraction: the same canary definition runs on NGINX in staging and Istio in production, with only the trafficRouting block differing. Verify your provider actually honors small weights – some ingress controllers round aggressively at low percentages.

6. AnalysisTemplates: success rate, latency, and error budgets

This is where progressive delivery earns its name. An AnalysisTemplate (namespaced) or ClusterAnalysisTemplate (cluster-wide, reusable) declares one or more metrics. Each runs a query on a schedule and evaluates the result against a successCondition and/or failureCondition. The aggregate outcome is Successful, Failed, Error, or Inconclusive. The fields that govern the verdict:

Field	Meaning
`interval`	How often to sample (e.g., `1m`)
`count`	Total number of measurements to take
`successCondition`	Expression that, when true, marks a measurement a success
`failureCondition`	Expression that, when true, marks a measurement a failure
`failureLimit`	How many failed measurements are tolerated before the run fails (default 0)
`inconclusiveLimit`	How many inconclusive measurements before the run is inconclusive
`consecutiveErrorLimit`	Provider/query errors tolerated in a row before the run errors (default 4)

A template that gates on both success rate and p99 latency, parameterized so it works for any service:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-latency
  namespace: shop
spec:
  args:
    - name: service
    - name: namespace
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.995
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",namespace="{{args.namespace}}",
              code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",namespace="{{args.namespace}}"}[2m]))
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] <= 0.4
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service}}",namespace="{{args.namespace}}"}[2m]))
              by (le))

result is an array; result[0] is the first (and for these queries, only) returned series value. Pass the args from the Rollout step:

- analysis:
    templates:
      - templateName: success-rate-latency
    args:
      - name: service
        value: checkout
      - name: namespace
        value: shop
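If the same gate is shared across namespaces, the template can be defined once as a ClusterAnalysisTemplate and referenced with clusterScope. A sketch, assuming a cluster-scoped template of the same name exists:

```yaml
# Step in the Rollout: reference a cluster-scoped template.
- analysis:
    templates:
      - templateName: success-rate-latency
        clusterScope: true      # look up a ClusterAnalysisTemplate, not a namespaced one
    args:
      - name: service
        value: checkout
      - name: namespace
        value: shop
```

The template body is unchanged apart from kind: ClusterAnalysisTemplate and the absence of metadata.namespace.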

Inline (in steps) vs background analysis matters. An inline analysis step blocks progression until the run completes. A background analysis runs alongside the whole canary from a chosen step and aborts the moment it fails – ideal for continuous error-budget monitoring across every weight:

strategy:
  canary:
    analysis:                  # background: runs concurrently
      templates:
        - templateName: success-rate-latency
      args:
        - { name: service, value: checkout }
        - { name: namespace, value: shop }
      startingStep: 2          # begin once traffic is non-trivial
    steps:
      - setWeight: 5
      - pause: { duration: 2m }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 100

For error budgets, encode the SLO directly: a 99.9% target becomes a successCondition of result[0] >= 0.999 over a rolling window, enforced per release. Secrets for authenticated providers (Datadog, New Relic, a secured Prometheus) come from valueFrom.secretKeyRef on an arg, never inlined.
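Both ideas can be combined in one template. The sketch below assumes a hypothetical Secret named prometheus-auth with a token key, a hypothetical secured Prometheus address, and a 5m window as the rolling window; the headers field is used here as I understand the Prometheus provider, so confirm it against your controller version.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-authenticated    # hypothetical name
  namespace: shop
spec:
  args:
    - name: service
    - name: prom-token
      valueFrom:
        secretKeyRef:
          name: prometheus-auth       # hypothetical Secret
          key: token
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.999   # the 99.9% SLO, encoded directly
      failureLimit: 1
      provider:
        prometheus:
          address: https://prometheus.example.com   # hypothetical secured endpoint
          headers:
            - key: Authorization
              value: "Bearer {{args.prom-token}}"
          query: |
            sum(rate(http_requests_total{service="{{args.service}}",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service}}"}[5m]))
```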

7. Automated rollback, abort thresholds, and inconclusive runs

The verdict drives the outcome automatically:

Failed (failures exceed failureLimit) -> the rollout is aborted. The controller shifts traffic back to the stable ReplicaSet and the canary scales down. No human action required.
Inconclusive (inconclusive measurements exceed inconclusiveLimit, or no condition matched) -> the rollout pauses at its current step and waits for a human to promote or abort. This is the correct default for “I genuinely cannot tell” – low traffic at 3am yields too few samples to judge, so the system stops rather than guesses.
Error (the provider itself fails, e.g., Prometheus unreachable, beyond consecutiveErrorLimit) -> treated as a failure path; do not let a broken metrics pipeline silently green-light a release.

Tune the thresholds to your traffic shape:

metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.995
    failureLimit: 1            # tolerate one bad minute; a second failure aborts
    inconclusiveLimit: 2       # tolerate two thin-data windows (only reachable if a failureCondition is also set)
    consecutiveErrorLimit: 3   # more than 3 query errors in a row = Error

Set a failureCondition as well as a successCondition when “not clearly good” should not automatically mean “bad.” With only a successCondition, every non-passing measurement counts as a failure. With both, a measurement that satisfies neither is inconclusive – which pauses for a human instead of aborting on noise.
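A sketch of a metric with both conditions set. The 0.99 failure threshold is an illustrative assumption; the gap between the two thresholds is the inconclusive band.

```yaml
metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.995   # clearly good
    failureCondition: result[0] < 0.99     # clearly bad (illustrative threshold)
    # 0.99 <= result[0] < 0.995 satisfies neither condition: inconclusive
    failureLimit: 1
    inconclusiveLimit: 2
```

With this shape, a noisy minute at 0.993 pauses the rollout for a human instead of aborting it, while a drop to 0.98 still counts as a failure.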

Drive and inspect rollbacks from the plugin:

kubectl argo rollouts get rollout checkout -n shop --watch   # live tree + analysis
kubectl argo rollouts promote checkout -n shop               # advance one step / past a pause
kubectl argo rollouts promote checkout -n shop --full        # skip remaining steps + analysis
kubectl argo rollouts abort  checkout -n shop                # force rollback to stable
kubectl argo rollouts retry  rollout checkout -n shop        # resume an aborted rollout
kubectl argo rollouts undo   checkout -n shop                # roll back to a prior revision

A manual abort is sticky: the Rollout stays in Degraded until you retry or push a new revision, so an aborted release will not silently re-promote on the next reconcile.

8. Argo CD health checks and CI gates

If you run Argo CD, the integration is automatic. Argo CD ships a Lua health check for the argoproj.io/Rollout resource, so a Rollout reports as:

Progressing while a canary is advancing,
Suspended while waiting on a pause or inconclusive analysis,
Healthy once fully promoted,
Degraded when aborted or failed.

That means an aborted canary turns its Argo CD Application red, and selfHeal will not “fix” it by re-applying, because the manifest in Git is already what is deployed – the failure is runtime, not drift. Surface it in your sync/health gates rather than treating it as config noise.

For pipeline gating outside Argo CD, block the job on the rollout reaching a terminal-good state. The plugin’s status command exits non-zero on failure and supports a timeout, which is exactly what a CI step needs:

# Start the rollout by setting the new image, then wait for the canary to fully succeed.
kubectl argo rollouts set image checkout \
  checkout=ghcr.io/acme/checkout:1.9.0 -n shop

# Blocks until Healthy; non-zero exit on Degraded/abort fails the pipeline.
kubectl argo rollouts status checkout -n shop --watch --timeout 900s

# GitHub Actions gate
- name: Wait for canary to succeed
  run: |
    kubectl argo rollouts status checkout -n shop --watch --timeout 900s
- name: Roll back on failure
  if: failure()
  run: kubectl argo rollouts abort checkout -n shop

This makes a regressed SLO a failed build. The deploy job goes red, the canary self-aborts, and traffic is already back on stable before an engineer opens the logs.

Enterprise scenario

A payments team had a textbook canary: 5% weight, with a background analysis gating on success rate over http_requests_total. A release that introduced a slow database query sailed through to 100% green, then paged on p99 latency twenty minutes later. The analysis ran correctly and proved nothing, for two reasons. First, the query was not scoped to the canary: it averaged over both ReplicaSets, so at 5% weight the stable baseline dominated the ratio. Second, the sampling was misaligned with the data: the query used rate(...[2m]), Prometheus scraped and evaluated every 60s, and Argo Rollouts sampled at interval: 30s, so consecutive measurements re-read largely the same under-populated window and the canary's first ~90s of real traffic never moved the result.

The fix was to scope every query to the canary pod hash and to stop sampling faster than data arrives. Argo Rollouts substitutes {{args.*}} placeholders into the query, so they passed the canary's pod-template-hash as an arg and matched on it:

metrics:
  - name: canary-success-rate
    interval: 1m            # match scrape, never sample faster than data arrives
    count: 8
    successCondition: result[0] >= 0.995
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus-operated.monitoring:9090
        # Keep the rate() window at >= 2x the 60s scrape interval; a [1m] window
        # would usually hold a single sample and return no data.
        query: |
          sum(rate(http_requests_total{
            app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}",
            code!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{
            app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}"}[2m]))

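The controller sets the rollouts-pod-template-hash label on every pod it manages; it appears on your series as rollouts_pod_template_hash only if your scrape configuration maps pod labels onto metrics. The lesson is that an analysis query is only as honest as its label scope and its window. Validate both in the Prometheus UI against the canary hash before you trust the gate.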
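The canary-hash arg used in that query does not need to be hard-coded. A sketch of passing it from the Rollout, using the podTemplateHashValue source, where Latest resolves to the canary ReplicaSet's hash; the template name here is hypothetical.

```yaml
strategy:
  canary:
    analysis:
      templates:
        - templateName: canary-success-rate   # hypothetical template holding the metric above
      args:
        - name: canary-hash
          valueFrom:
            podTemplateHashValue: Latest       # hash of the new (canary) ReplicaSet
```

Using Stable instead of Latest yields the stable ReplicaSet's hash, which is handy if you want a second metric that compares canary against baseline.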

Verify

Confirm the controller, the traffic split, and the analysis are all doing what you think.

# Controller is up
kubectl get pods -n argo-rollouts

# Rollout state, current step, and traffic weight
kubectl argo rollouts get rollout checkout -n shop

# Both ReplicaSets exist mid-canary; note the pod-template-hash
kubectl get rs -n shop -l app=checkout

# Analysis runs and their verdicts
kubectl get analysisrun -n shop
kubectl describe analysisrun -n shop <name>   # shows each measurement + value

# Provider objects are being rewritten (NGINX example)
kubectl get ingress -n shop                   # a managed -canary ingress appears
# Istio example
kubectl get virtualservice,destinationrule -n shop -o yaml | grep -A2 weight

# Sanity-check the query in Prometheus directly before trusting the gate
# (run the same PromQL in the Prometheus UI / API and confirm it returns a value)

A healthy canary shows a single setWeight reflected in the provider object, AnalysisRuns in Successful phase with sampled values that match what Prometheus returns, and the Rollout advancing through steps. An aborted one shows traffic back at the stable Service, the canary ReplicaSet scaled to 0, and a Degraded phase.

Checklist

Pitfalls and next steps

The failures I see most often are not controller bugs. They are analysis that never had a chance to be right: queries scoped to the wrong label so they measure the whole fleet instead of the canary; thresholds copied from a high-traffic service onto a low-traffic one, so every release goes inconclusive; or a setWeight that does nothing because no trafficRouting provider is configured and replica-ratio rounding cannot express the intended percentage. Always run your PromQL in the Prometheus UI first, and always confirm the provider object actually changed weight after the first step.

The second class is operational blindness. A canary is a runtime event, so treat it like one. Scrape the controller’s metrics (the rollout_info series carries a phase label; the endpoint is on port 8090) into a Grafana dashboard, fire a notification on any Rollout entering Degraded, and write a one-page runbook covering how to read analysis output, when to promote past an inconclusive run versus abort, and how retry differs from pushing a new revision. From there, extend the same analysis templates to your blue-green deployments – the metric gate is identical, only the traffic flip differs.
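The Degraded notification could be sketched as a Prometheus alerting rule. This assumes the Prometheus Operator's PrometheusRule CRD is installed, that the controller's metrics are already being scraped, and that rollout_info exposes name and namespace labels alongside phase; check the label names on your own series first.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rollout-degraded          # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: argo-rollouts
      rules:
        - alert: RolloutDegraded
          # Fires for any Rollout whose current phase is Degraded.
          expr: rollout_info{phase="Degraded"} == 1
          for: 1m
          labels:
            severity: page
          annotations:
            summary: "Rollout {{ $labels.namespace }}/{{ $labels.name }} is Degraded"
```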

Written by Vinod
