A Kubernetes Deployment will happily roll a broken image to 100% of your traffic as fast as readiness probes allow, because a passing readiness probe says nothing about your error rate or p99 latency. Progressive delivery closes that gap: you shift a small slice of traffic, measure real SLOs against the canary, and let a controller promote or roll back on the evidence. This article shows how to do that end to end with Argo Rollouts – converting a Deployment, shaping traffic, wiring Prometheus-backed analysis, and gating CI.
1. Why rolling updates are not enough
A rolling update is a mechanical strategy: new pods become Ready before old ones are removed, respecting maxSurge and maxUnavailable. What it cannot answer is the only question that matters during a release – is the new version actually serving requests correctly? Readiness checks that a process is up, not that it returns 200s, not that latency is within budget, not that a downstream dependency still resolves under the new code path.
Progressive delivery adds a feedback loop. Promotion becomes conditional on metrics, and rollback is automatic when they regress. The unit of progress is a traffic weight, not a pod count, so blast radius is bounded by the percentage of users exposed rather than by how fast pods schedule.
Argo Rollouts is a drop-in replacement for the Deployment controller, with a Rollout custom resource that adds canary and blue-green strategies, native AnalysisRun evaluation, and an abort path back to the last stable ReplicaSet. It is a CNCF Graduated project and integrates cleanly with Argo CD.
2. Install the controller and the kubectl plugin
Install the controller into its own namespace, then add the kubectl plugin so you can drive and observe rollouts from the CLI.
# Controller
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
-f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl rollout status deploy/argo-rollouts -n argo-rollouts
# kubectl plugin (macOS via Homebrew)
brew install argoproj/tap/kubectl-argo-rollouts
# Or download the binary directly (Linux amd64 shown)
curl -fsSLO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x ./kubectl-argo-rollouts-linux-amd64
sudo mv ./kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
kubectl argo rollouts version
For a real cluster, pin to a specific release tag instead of latest so the controller version is reproducible, and manage the manifest through your GitOps repo rather than kubectl apply from a laptop.
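One way to pin the controller is a Kustomize overlay in the GitOps repo that points at a tagged release manifest. The sketch below assumes Kustomize is what your repo uses; the tag v1.7.2 is only a placeholder for whichever release you have validated.

```yaml
# kustomization.yaml (illustrative; the version tag is a placeholder)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo-rollouts
resources:
  # Tagged release URL instead of .../releases/latest/...
  - https://github.com/argoproj/argo-rollouts/releases/download/v1.7.2/install.yaml
```

Upgrading the controller then becomes a reviewed one-line change to the tag rather than a side effect of whenever someone last ran kubectl apply.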
3. Convert a Deployment to a Rollout
A Rollout is intentionally similar to a Deployment: the spec.template is identical and most fields carry over verbatim. The differences are that kind becomes Rollout, apiVersion becomes argoproj.io/v1alpha1, and a strategy.canary (or strategy.blueGreen) block replaces the rolling-update strategy.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: shop
spec:
  replicas: 8
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout:1.8.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          resources:
            requests: { cpu: 200m, memory: 256Mi }
  strategy:
    canary:
      maxSurge: "25%"      # defaults to 25% if omitted
      maxUnavailable: 0    # keep stable capacity intact during the roll
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
Do not run both a Deployment and a Rollout with the same selector. If you are migrating an existing app, either rename the workload or use the workloadRef field on the Rollout to reference the existing Deployment, so the Rollout reuses that Deployment's pod template instead of duplicating it.
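A minimal sketch of the workloadRef form, assuming the existing Deployment is also named checkout in the shop namespace (an assumption for illustration). The Rollout carries no selector or template of its own because both come from the referenced Deployment.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: shop
spec:
  replicas: 8
  workloadRef:              # pod template is read from this Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: checkout          # assumed name of the existing Deployment
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
```

Once the Rollout's pods are serving, scale the referenced Deployment to zero replicas so the two controllers are not both running pods from the same template.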
Apply it and watch the canary surface in the plugin’s tree view:
kubectl apply -f checkout-rollout.yaml
kubectl argo rollouts get rollout checkout -n shop --watch
4. Canary steps, traffic weights, and pause conditions
The steps list is a state machine the controller walks top to bottom. The primitives you will use constantly:
| Step | Effect |
|---|---|
| setWeight: N | Route N% of traffic to the canary (exact only with a traffic provider; see section 5) |
| pause: { duration: 10m } | Wait a fixed time, then continue automatically |
| pause: {} | Pause indefinitely until a human runs promote |
| setCanaryScale | Decouple canary replica count from traffic weight |
| analysis | Run an inline AnalysisRun that must pass before proceeding (see section 6) |
A practical production canary looks like this:
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    steps:
      - setWeight: 5
      - pause: { duration: 2m }
      - analysis:
          templates:
            - templateName: success-rate-latency
      - setWeight: 25
      - pause: { duration: 5m }
      - setWeight: 50
      - pause: { duration: 5m }
      - setWeight: 100
Two refinements worth knowing. setCanaryScale decouples the canary's replica count from its traffic weight: you can run a small canary while sending little traffic (saving cost during a long bake), size it as a percentage of spec.replicas, or pin an absolute count with the replicas field, then hand control back to the weight when you are done:
- setCanaryScale:
    weight: 25                # scale canary to 25% of spec.replicas
# or
- setCanaryScale:
    matchTrafficWeight: true  # default behavior: replicas track weight
And dynamicStableScale: true scales the stable ReplicaSet down as the canary weight rises, so you are not paying for double capacity at 50/50. Use it only when you have a traffic provider; without one, abort cannot instantly shift traffic back and you risk a capacity gap on rollback.
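As a sketch of how that fits together, the fragment below enables dynamicStableScale alongside the NGINX provider used elsewhere in this article. The 60-second abortScaleDownDelaySeconds value is an illustrative assumption, not a recommendation.

```yaml
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    dynamicStableScale: true          # stable shrinks as canary weight grows
    abortScaleDownDelaySeconds: 60    # assumed value: keep canary pods briefly after an abort
    trafficRouting:
      nginx:
        stableIngress: checkout
    steps:
      - setWeight: 25
      - pause: { duration: 5m }
      - setWeight: 50
      - pause: { duration: 5m }
```

The delay matters because on abort the stable ReplicaSet has to scale back up before it can absorb full traffic again; keeping the canary pods around for a short window gives it time to do so.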
5. Traffic shaping: NGINX, Istio, or SMI
Without a trafficRouting provider, setWeight is approximated by replica ratio – the controller scales canary vs stable pods so the proportion roughly matches the weight. That is coarse (you cannot do 5% with 8 replicas) and couples traffic to scaling. For real percentage control, plug in an ingress or mesh provider. The controller manipulates that provider’s native objects on each step.
NGINX Ingress. You provide a primary Ingress plus two Services; Rollouts creates and manages a shadow canary Ingress with the nginx.ingress.kubernetes.io/canary annotations, adjusting canary-weight per step.
strategy:
  canary:
    canaryService: checkout-canary   # required
    stableService: checkout-stable   # required
    trafficRouting:
      nginx:
        stableIngress: checkout      # your existing Ingress, backend = stableService
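For reference, the objects you supply for the NGINX provider might look like the sketch below. The hostname checkout.example.com and Service port 80 are assumptions for illustration. Both Services start with the same selector; the controller then adds a pod-template-hash selector to each so they point at different ReplicaSets.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: checkout-stable
  namespace: shop
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-canary
  namespace: shop
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  namespace: shop
spec:
  ingressClassName: nginx
  rules:
    - host: checkout.example.com      # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-stable # must match stableService
                port:
                  number: 80
```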
Istio. You own a VirtualService and a DestinationRule with named subsets; Rollouts rewrites the route weights and the subset pod-hash labels. Subset-level splitting needs only a single Service.
strategy:
  canary:
    trafficRouting:
      istio:
        virtualService:
          name: checkout-vsvc
          routes:
            - primary              # the HTTP route name to manage
        destinationRule:
          name: checkout-destrule
          canarySubsetName: canary
          stableSubsetName: stable
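The Istio objects that the Rollout above refers to could be sketched as follows. The host checkout (a single Service in the shop namespace) and the v1beta1 API version are assumptions; use whichever Istio API version your mesh serves. You set the starting weights and subset labels, and the controller rewrites them as the canary advances.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-vsvc
  namespace: shop
spec:
  hosts:
    - checkout
  http:
    - name: primary                 # matches routes[0] in the Rollout
      route:
        - destination: { host: checkout, subset: stable }
          weight: 100
        - destination: { host: checkout, subset: canary }
          weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-destrule
  namespace: shop
spec:
  host: checkout
  subsets:
    - name: stable
      labels: { app: checkout }     # controller adds the pod-hash label
    - name: canary
      labels: { app: checkout }
```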
SMI (Service Mesh Interface). For meshes that implement SMI (e.g., Linkerd), the smi provider manages a TrafficSplit object. The mechanics mirror the above: you supply canary and stable Services and the controller adjusts the split weights.
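A minimal sketch of the SMI form, assuming a root Service named checkout that clients call and a TrafficSplit name of checkout-split (both names are hypothetical):

```yaml
strategy:
  canary:
    canaryService: checkout-canary
    stableService: checkout-stable
    trafficRouting:
      smi:
        rootService: checkout             # assumed: the Service clients address
        trafficSplitName: checkout-split  # assumed name for the managed TrafficSplit
```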
The provider choice does not change your steps. That is the point of the abstraction: the same canary definition runs on NGINX in staging and Istio in production, with only the trafficRouting block differing. Verify your provider actually honors small weights – some ingress controllers round aggressively at low percentages.
6. AnalysisTemplates: success rate, latency, and error budgets
This is where progressive delivery earns its name. An AnalysisTemplate (namespaced) or ClusterAnalysisTemplate (cluster-wide, reusable) declares one or more metrics. Each runs a query on a schedule and evaluates the result against a successCondition and/or failureCondition. The aggregate outcome is Successful, Failed, Error, or Inconclusive. The fields that govern the verdict:
| Field | Meaning |
|---|---|
| interval | How often to sample (e.g., 1m) |
| count | Total number of measurements to take |
| successCondition | Expression that, when true, marks a measurement a success |
| failureCondition | Expression that, when true, marks a measurement a failure |
| failureLimit | How many failed measurements are tolerated before the run fails (default 0) |
| inconclusiveLimit | How many inconclusive measurements before the run is inconclusive |
| consecutiveErrorLimit | Provider/query errors tolerated in a row before the run errors (default 4) |
A template that gates on both success rate and p99 latency, parameterized so it works for any service:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-latency
  namespace: shop
spec:
  args:
    - name: service
    - name: namespace
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.995
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",namespace="{{args.namespace}}",
              code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",namespace="{{args.namespace}}"}[2m]))
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] <= 0.4
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus-operated.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service}}",namespace="{{args.namespace}}"}[2m]))
              by (le))
result is an array; result[0] is the first (and for these queries, only) returned series value. Pass the args from the Rollout step:
- analysis:
    templates:
      - templateName: success-rate-latency
    args:
      - name: service
        value: checkout
      - name: namespace
        value: shop
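If the same template were instead registered cluster-wide as a ClusterAnalysisTemplate, the step would reference it with clusterScope set. This sketch assumes the template keeps the name success-rate-latency:

```yaml
- analysis:
    templates:
      - templateName: success-rate-latency
        clusterScope: true    # resolve against ClusterAnalysisTemplate, not the namespace
    args:
      - name: service
        value: checkout
      - name: namespace
        value: shop
```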
Inline (in steps) vs background analysis matters. An inline analysis step blocks progression until the run completes. A background analysis runs alongside the whole canary from a chosen step and aborts the moment it fails – ideal for continuous error-budget monitoring across every weight:
strategy:
  canary:
    analysis:                    # background: runs concurrently
      templates:
        - templateName: success-rate-latency
      args:
        - { name: service, value: checkout }
        - { name: namespace, value: shop }
      startingStep: 2            # begin once traffic is non-trivial
    steps:
      - setWeight: 5
      - pause: { duration: 2m }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 100
For error budgets, encode the SLO directly: a 99.9% target becomes a successCondition of result[0] >= 0.999 over a rolling window, enforced per release. Secrets for authenticated providers (Datadog, New Relic, a secured Prometheus) come from valueFrom.secretKeyRef on an arg, never inlined.
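A sketch of the secret-backed arg pattern, using the generic web provider as a stand-in for an authenticated metrics API. The Secret name metrics-api-token, its key, the URL, and the JSON shape of the response are all hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: authenticated-check
  namespace: shop
spec:
  args:
    - name: service
    - name: api-token
      valueFrom:
        secretKeyRef:
          name: metrics-api-token   # hypothetical Secret in the same namespace
          key: token
  metrics:
    - name: slo-ok
      interval: 1m
      count: 5
      successCondition: result == true
      provider:
        web:
          url: "https://metrics.example.com/api/v1/slo?service={{ args.service }}"  # hypothetical endpoint
          headers:
            - key: Authorization
              value: "Bearer {{ args.api-token }}"
          jsonPath: "{$.data.ok}"   # assumed response field
```

The token never appears in the template or in Git; it is resolved from the Secret when the AnalysisRun is created.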
7. Automated rollback, abort thresholds, and inconclusive runs
The verdict drives the outcome automatically:
- Failed (failures exceed failureLimit) -> the rollout is aborted. The controller shifts traffic back to the stable ReplicaSet and the canary scales down. No human action required.
- Inconclusive (inconclusive measurements exceed inconclusiveLimit, or no condition matched) -> the rollout pauses at its current step and waits for a human to promote or abort. This is the correct default for “I genuinely cannot tell” – low traffic at 3am yields too few samples to judge, so the system stops rather than guesses.
- Error (the provider itself fails, e.g., Prometheus unreachable, beyond consecutiveErrorLimit) -> treated as a failure path; do not let a broken metrics pipeline silently green-light a release.
Tune the thresholds to your traffic shape:
metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.995
    failureLimit: 1              # tolerate one bad minute; a second one aborts
    inconclusiveLimit: 2         # tolerate two thin-data windows before giving up
    consecutiveErrorLimit: 3     # more than 3 provider errors in a row = Error
Set a failureCondition as well as a successCondition when “not clearly good” should not automatically mean “bad.” With only a successCondition, every non-passing measurement counts as a failure. With both, a measurement that satisfies neither is inconclusive – which pauses for a human instead of aborting on noise.
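A sketch of a metric with both conditions. The 0.98 lower bound is an illustrative assumption: a measurement between 0.98 and 0.995 satisfies neither expression and is therefore recorded as inconclusive.

```yaml
metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.995   # clearly good
    failureCondition: result[0] < 0.98     # clearly bad (assumed threshold)
    failureLimit: 1
    inconclusiveLimit: 2
    provider:
      prometheus:
        address: http://prometheus-operated.monitoring:9090
        query: |
          sum(rate(http_requests_total{
            service="{{args.service}}",namespace="{{args.namespace}}",
            code!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{
            service="{{args.service}}",namespace="{{args.namespace}}"}[2m]))
```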
Drive and inspect rollbacks from the plugin:
kubectl argo rollouts get rollout checkout -n shop --watch # live tree + analysis
kubectl argo rollouts promote checkout -n shop # advance one step / past a pause
kubectl argo rollouts promote checkout -n shop --full # skip remaining steps + analysis
kubectl argo rollouts abort checkout -n shop # force rollback to stable
kubectl argo rollouts retry rollout checkout -n shop # resume an aborted rollout
kubectl argo rollouts undo checkout -n shop # roll back to a prior revision
A manual abort is sticky: the Rollout stays in Degraded until you retry or push a new revision, so an aborted release will not silently re-promote on the next reconcile.
8. Argo CD health checks and CI gates
If you run Argo CD, the integration is automatic. Argo CD ships a Lua health check for the argoproj.io/Rollout resource, so a Rollout reports as:
- Progressing while a canary is advancing,
- Suspended while waiting on a pause or inconclusive analysis (the Rollout's own phase reads Paused),
- Healthy once fully promoted,
- Degraded when aborted or failed.
That means an aborted canary turns its Argo CD Application red, and selfHeal will not “fix” it by re-applying, because the manifest in Git is already what is deployed – the failure is runtime, not drift. Surface it in your sync/health gates rather than treating it as config noise.
For pipeline gating outside Argo CD, block the job on the rollout reaching a terminal-good state. The plugin’s status command exits non-zero on failure and supports a timeout, which is exactly what a CI step needs:
# Start the release by setting the new image, then wait for the canary to fully succeed.
kubectl argo rollouts set image checkout \
checkout=ghcr.io/acme/checkout:1.9.0 -n shop
# Blocks until Healthy; non-zero exit on Degraded/abort fails the pipeline.
kubectl argo rollouts status checkout -n shop --watch --timeout 900s
# GitHub Actions gate
- name: Wait for canary to succeed
  run: |
    kubectl argo rollouts status checkout -n shop --watch --timeout 900s
- name: Roll back on failure
  if: failure()
  run: kubectl argo rollouts abort checkout -n shop
This makes a regressed SLO a failed build. The deploy job goes red, the canary self-aborts, and traffic is already back on stable before an engineer opens the logs.
Enterprise scenario
A payments team had a textbook canary: 5% weight, a background AnalysisTemplate gating on success-rate over http_requests_total. A release that introduced a slow database query sailed through to 100% green, then paged on p99 latency twenty minutes later. The analysis was right and useless – their query used rate(...[2m]) but the Prometheus evaluation_interval and scrape were both 60s, and Argo Rollouts sampled at interval: 30s. Each measurement re-read the same under-populated 2m window, so the canary’s first ~90s of real traffic never accumulated enough samples to move the ratio off the stable baseline that was still dominating the series. The metric was an average over both ReplicaSets, not the canary.
The fix was to scope every query to the canary pod hash and to sample no faster than Prometheus scrapes, while keeping the lookback at two scrape intervals so that rate() always has at least two samples to work with. Argo Rollouts substitutes {{args.*}} into the query, so they passed the rollout’s pod-template-hash as an arg and matched on it:
metrics:
  - name: canary-success-rate
    interval: 1m      # match scrape, never sample faster than data arrives
    count: 8
    successCondition: result[0] >= 0.995
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus-operated.monitoring:9090
        query: |
          sum(rate(http_requests_total{
            app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}",
            code!~"5.."}[2m]))
          /
          sum(rate(http_requests_total{
            app="checkout",
            rollouts_pod_template_hash="{{args.canary-hash}}"}[2m]))
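The rollouts-pod-template-hash label is set automatically on canary pods, but it only shows up in Prometheus as rollouts_pod_template_hash if your scrape configuration maps pod labels onto series. The lesson is that an analysis query is only as honest as its label scope and its window. Validate both in the Prometheus UI against the canary hash before you trust the gate.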
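One way to supply the canary hash is to let the controller fill it in through valueFrom on the analysis args, as sketched below. The template name canary-success-rate is an assumption (the scenario only names the metric), and the template itself must declare canary-hash under spec.args.

```yaml
- analysis:
    templates:
      - templateName: canary-success-rate   # assumed template name
    args:
      - name: canary-hash
        valueFrom:
          podTemplateHashValue: Latest       # hash of the canary ReplicaSet; Stable gives the baseline
```

Passing a second arg with podTemplateHashValue: Stable makes it possible to compare canary and baseline in the same template rather than judging the canary against a fixed threshold alone.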
Verify
Confirm the controller, the traffic split, and the analysis are all doing what you think.
# Controller is up
kubectl get pods -n argo-rollouts
# Rollout state, current step, and traffic weight
kubectl argo rollouts get rollout checkout -n shop
# Both ReplicaSets exist mid-canary; note the pod-template-hash
kubectl get rs -n shop -l app=checkout
# Analysis runs and their verdicts
kubectl get analysisrun -n shop
kubectl describe analysisrun -n shop <name> # shows each measurement + value
# Provider objects are being rewritten (NGINX example)
kubectl get ingress -n shop # a managed -canary ingress appears
# Istio example
kubectl get virtualservice,destinationrule -n shop -o yaml | grep -A2 weight
# Sanity-check the query in Prometheus directly before trusting the gate
# (run the same PromQL in the Prometheus UI / API and confirm it returns a value)
A healthy canary shows a single setWeight reflected in the provider object, AnalysisRuns in Successful phase with sampled values that match what Prometheus returns, and the Rollout advancing through steps. An aborted one shows traffic back at the stable Service, the canary ReplicaSet scaled to 0, and a Degraded phase.
Pitfalls and next steps
The failures I see most often are not controller bugs. They are analysis that never had a chance to be right: queries scoped to the wrong label so they measure the whole fleet instead of the canary; thresholds copied from a high-traffic service onto a low-traffic one, so every release goes inconclusive; or a setWeight that does nothing because no trafficRouting provider is configured and replica-ratio rounding cannot express the intended percentage. Always run your PromQL in the Prometheus UI first, and always confirm the provider object actually changed weight after the first step.
The second class is operational blindness. A canary is a runtime event, so treat it like one. Scrape the controller’s metrics (the rollout_info series carries a phase label; the endpoint is on port 8090) into a Grafana dashboard, fire a notification on any Rollout entering Degraded, and write a one-page runbook covering how to read analysis output, when to promote past an inconclusive run versus abort, and how retry differs from pushing a new revision. From there, extend the same analysis templates to your blue-green deployments – the metric gate is identical, only the traffic flip differs.
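A sketch of the Degraded notification as a Prometheus Operator alert rule. It assumes the operator's PrometheusRule CRD is installed, that the controller's metrics are already being scraped, and that rollout_info exposes namespace and name labels alongside phase; check the label names on your own series before relying on it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rollout-degraded
  namespace: monitoring
spec:
  groups:
    - name: argo-rollouts
      rules:
        - alert: RolloutDegraded
          # Assumes rollout_info carries namespace/name labels in addition to phase
          expr: max by (namespace, name) (rollout_info{phase="Degraded"}) > 0
          for: 1m
          labels:
            severity: page
          annotations:
            summary: "Rollout {{ $labels.namespace }}/{{ $labels.name }} is Degraded"
```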