AKS Day-2 Operations: Cluster Upgrades, Node Lifecycle, and Fleet Management

Standing up an AKS cluster is a solved problem. Keeping a fleet of clusters patched, on a supported Kubernetes version, and upgraded without paging anyone is where most platform teams quietly accumulate risk. This runbook covers the upgrade machinery end to end: control-plane vs. node-pool upgrades, node-image vs. Kubernetes patches, surge and PodDisruptionBudget tuning, maintenance windows, blue-green node pools, and fleet-scale rollouts with Azure Kubernetes Fleet Manager.

The Day-2 problem

A cluster you provisioned six months ago is already drifting. AKS supports a Kubernetes minor version for roughly 12 months from its GA on the platform, enforcing an N-2 window — you can run the latest minor and the two behind it. Miss it and the cluster lands on a platform-supported tier (best-effort, no control-plane SLA) and eventually gets force-upgraded. Node images move faster still: Microsoft ships new images weekly with OS CVE fixes, kubelet patches, and containerd updates. A cluster that is “fine” is usually just one nobody has looked at.

The job, then, is to make upgrades boring: scheduled, surge-tuned, observable, and reversible. Start by seeing the gap.

# What's available vs. what you're running
az aks get-upgrades \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --output table

# Per-node-pool view (control plane and pools can differ)
az aks nodepool get-upgrades \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name system \
  --output table

Treat get-upgrades output as a service-level indicator. If the control plane is more than one minor behind the latest GA version, you are burning your N-2 budget and should already have a change ticket open.
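One way to turn that rule into an alert is to compute the minor-version gap directly. This is a minimal sketch, assuming both versions have already been extracted from the get-upgrades output as plain MAJOR.MINOR.PATCH strings; minor_gap is a hypothetical helper, not part of the az CLI, and the versions in the example are made up.

```shell
# minor_gap CURRENT LATEST
# Prints how many minor versions CURRENT trails LATEST (0 if equal or ahead).
# Assumes both arguments are MAJOR.MINOR.PATCH strings with the same MAJOR.
minor_gap() {
  local current_minor latest_minor gap
  current_minor=$(echo "$1" | cut -d. -f2)
  latest_minor=$(echo "$2" | cut -d. -f2)
  gap=$(( latest_minor - current_minor ))
  if [ "$gap" -lt 0 ]; then gap=0; fi
  echo "$gap"
}

# Example with hard-coded, hypothetical versions
gap=$(minor_gap 1.30.4 1.32.0)
if [ "$gap" -gt 1 ]; then
  echo "control plane is $gap minors behind: open a change ticket"
fi
```

Wired into a scheduled job, the same comparison could feed a dashboard or fail a pipeline step instead of printing a message.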

Upgrade anatomy: control plane vs. node pools

An AKS upgrade is two distinct operations that people conflate at their peril:

  1. Control-plane upgrade — the managed API server, scheduler, and controller-manager. Fast, Microsoft-managed, and the only part that gates the cluster’s reported version.
  2. Node-pool upgrade — every VM in a pool is reimaged to the new Kubernetes (kubelet) version, one surge batch at a time, with cordon-and-drain.

The control plane must be upgraded before or together with node pools. A pool can never run a newer version than the control plane, and it may trail it only by a bounded skew (the AKS node-pool rules allow up to two minor versions, not an open-ended gap). Decoupling the two operations is the single most important Day-2 technique, because it lets you take the cheap, low-risk control-plane bump immediately and schedule the expensive, workload-disrupting node reimaging for a maintenance window.

# Upgrade ONLY the control plane (Kubernetes 1.31.x -> 1.32.x)
az aks upgrade \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --kubernetes-version 1.32.0 \
  --control-plane-only \
  --yes

# Later, in a maintenance window, bring each node pool up
az aks nodepool upgrade \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name system \
  --kubernetes-version 1.32.0

A bare az aks upgrade --kubernetes-version 1.32.0 (no --control-plane-only) upgrades the control plane and every node pool in one long-running operation. That is fine for non-prod; in production you almost always want to split them.
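The split can be scripted so the ordering is explicit and reviewable. The following is a sketch rather than a hardened tool: run and upgrade_split are hypothetical helpers, the resource names are the ones used in the examples above, and with DRY_RUN left at its default of 1 the script only prints the az commands it would execute.

```shell
# Print commands instead of executing them unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# upgrade_split RG CLUSTER VERSION POOL...
# Control plane first, then each node pool in the order given.
upgrade_split() {
  local rg="$1" cluster="$2" version="$3" pool
  shift 3
  run az aks upgrade --resource-group "$rg" --name "$cluster" \
    --kubernetes-version "$version" --control-plane-only --yes || return 1
  for pool in "$@"; do
    run az aks nodepool upgrade --resource-group "$rg" \
      --cluster-name "$cluster" --nodepool-name "$pool" \
      --kubernetes-version "$version" || return 1
  done
}

# Dry run: prints one control-plane command and one command per pool
upgrade_split rg-platform aks-prod-eastus 1.32.0 system workloads
```

In practice the control-plane step and the node-pool loop would probably run as separate pipeline stages, with the second one pinned to the maintenance window.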

Node image vs. Kubernetes upgrades

These are different cadences and you should automate them differently.

| | Node-image upgrade | Kubernetes upgrade |
|---|---|---|
| What changes | OS packages, containerd, kubelet patch, security fixes | Kubernetes minor/patch version (API surface) |
| Frequency | Weekly images from Microsoft | Per minor release (~quarterly upstream) |
| Risk | Low: same K8s version, reimage only | Higher: API deprecations, behavior changes |
| Recommended channel | NodeImage | patch (auto) or manual minor bumps |

A node-image-only upgrade keeps the Kubernetes version fixed and just reimages nodes onto the latest weekly image — this is how you stay on top of OS CVEs without touching the API surface.

# Patch OS/kubelet without changing the Kubernetes version
az aks nodepool upgrade \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name system \
  --node-image-only

Auto-upgrade channels

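AKS has two independent auto-upgrade settings. Do not confuse them: the cluster auto-upgrade channel (--auto-upgrade-channel) governs the Kubernetes version, while the node OS upgrade channel (--node-os-upgrade-channel) governs the node image and OS patches. Both are set with az aks update: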

az aks update \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --auto-upgrade-channel patch \
  --node-os-upgrade-channel NodeImage

My default for production: cluster channel patch and node OS channel NodeImage, both bound to a maintenance window (next section) so they never fire mid-business-day. Reserve manual control over the minor version bumps — those deserve a human reading the release notes.

Tuning the rollout: max surge, PDBs, and draining

When a node pool upgrades, AKS adds surge nodes, then cordons and drains existing nodes one batch at a time. Two knobs decide whether this is invisible or an outage.

Max surge controls batch size and is set per node pool. The default is one node (an absolute value), which is safe but glacial on a 100-node pool. Bump it to a percentage to parallelize.

# 33% surge: upgrade roughly a third of the pool per batch
az aks nodepool update \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name workloads \
  --max-surge 33%

Higher surge = faster upgrade but more transient capacity (and cost) and more simultaneous pod evictions. For latency-sensitive workloads I use 33%; for large batch/stateless pools 50% is fine. Avoid 100% in production — it doubles the pool and gives you no blast-radius control if a new node image is bad.
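The trade-off is easier to judge with numbers. Below is a rough sizing sketch, assuming percentage values round up to a whole node (as the AKS max-surge documentation describes) and that a pool is worked through in batches of that size. surge_nodes and upgrade_batches are hypothetical helper names, and the batch count is an estimate, not something AKS reports.

```shell
# surge_nodes POOL_SIZE MAX_SURGE
# MAX_SURGE is an absolute count ("1") or a percentage ("33%").
# Percentages are rounded up to a whole node.
surge_nodes() {
  local size="$1" surge="$2" pct
  case $surge in
    *%)
      pct=${surge%\%}
      echo $(( (size * pct + 99) / 100 ))   # ceiling division
      ;;
    *)
      echo "$surge"
      ;;
  esac
}

# upgrade_batches POOL_SIZE MAX_SURGE
# Estimated number of batches needed to cover the whole pool.
upgrade_batches() {
  local size="$1" per_batch
  per_batch=$(surge_nodes "$1" "$2")
  echo $(( (size + per_batch - 1) / per_batch ))
}

echo "100 nodes at 1:   $(upgrade_batches 100 1) batches"
echo "100 nodes at 33%: $(surge_nodes 100 33%) surge nodes, $(upgrade_batches 100 33%) batches"
```

For a 100-node pool this prints 100 batches at the one-node setting versus 4 batches at 33%, which is the difference between an overnight upgrade and one that fits a maintenance window.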

PodDisruptionBudgets are what make drains respect your SLOs. During drain, the eviction API honors PDBs; if evicting a pod would take the workload below minAvailable, the eviction is refused and the drain keeps retrying until enough replicas are Ready elsewhere.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: payments
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: checkout

There is a sharp edge here: a PDB that can never be satisfied (e.g. minAvailable: 100% on a single-replica Deployment) will stall the drain indefinitely, turning a clean rolling upgrade into a hung operation. I enforce PDB rules via policy, and I bound every drain at the node-pool level with a drain timeout and a node soak duration (both in minutes):

az aks nodepool update \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name workloads \
  --max-surge 33% \
  --drain-timeout 30 \
  --node-soak-duration 5

Maintenance windows and planned maintenance

Auto-upgrade without a maintenance window means Azure can reimage your nodes whenever it likes. Planned Maintenance binds all upgrade activity to schedules you control, which is how you keep upgrades inside a change-freeze policy.

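There are three configurable schedules: default (a basic window for all AKS-initiated maintenance), aksManagedAutoUpgradeSchedule (cluster Kubernetes auto-upgrades), and aksManagedNodeOSUpgradeSchedule (node OS auto-upgrades). The two dedicated schedules are the ones to set: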

# Kubernetes auto-upgrades: Sundays 02:00, 4-hour window, US Eastern
# (weekly schedules need --interval-weeks; the "=" keeps the negative
# offset from being parsed as a flag)
az aks maintenanceconfiguration add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name aksManagedAutoUpgradeSchedule \
  --schedule-type Weekly \
  --interval-weeks 1 \
  --day-of-week Sunday \
  --start-time 02:00 \
  --duration 4 \
  --utc-offset=-05:00

# Node OS/image upgrades: nightly 03:00, 4-hour window
az aks maintenanceconfiguration add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name aksManagedNodeOSUpgradeSchedule \
  --schedule-type Daily \
  --interval-days 1 \
  --start-time 03:00 \
  --duration 4 \
  --utc-offset=-05:00

For change freezes (quarter-end, peak shopping season), use the --config-file form, which supports notAllowedDates — date ranges where no maintenance may start even if it falls inside the recurring window.

{
  "maintenanceWindow": {
    "schedule": { "weekly": { "intervalWeeks": 1, "dayOfWeek": "Sunday" } },
    "durationHours": 4,
    "utcOffset": "-05:00",
    "startTime": "02:00",
    "notAllowedDates": [
      { "start": "2026-11-20", "end": "2026-12-02" }
    ]
  }
}

# Save the JSON above as ./freeze-window.json, then apply it
az aks maintenanceconfiguration add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name aksManagedAutoUpgradeSchedule \
  --config-file ./freeze-window.json
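notAllowedDates only constrains AKS-initiated maintenance; a human running az aks upgrade by hand can still land inside the freeze. A small guard in the manual-upgrade script can apply the same dates. This is a sketch: in_freeze is a hypothetical helper, the range is treated as inclusive on both ends (an assumption), and the dates are the ones from the JSON above.

```shell
# in_freeze DATE START END   (all YYYY-MM-DD; range treated as inclusive)
# Succeeds when DATE falls inside the freeze range.
in_freeze() {
  local day start end
  day=$(echo "$1" | tr -d '-')      # 2026-11-25 -> 20261125
  start=$(echo "$2" | tr -d '-')
  end=$(echo "$3" | tr -d '-')
  [ "$day" -ge "$start" ] && [ "$day" -le "$end" ]
}

today=2026-11-25   # hypothetical; in practice: today=$(date +%F)
if in_freeze "$today" 2026-11-20 2026-12-02; then
  echo "change freeze in effect: skipping manual upgrade"
fi
```

Stripping the hyphens turns ISO dates into integers that compare correctly, which avoids depending on any particular date utility.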

Blue-green at the node-pool level

For high-risk upgrades — a major OS family change, a kernel-sensitive workload, or a node SKU swap — in-place surge is not enough control. Stand up a parallel node pool on the new version, shift workloads, and keep the old pool as a rollback.

# 1. New pool on the target version (note the new name)
az aks nodepool add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name workloads2 \
  --kubernetes-version 1.32.0 \
  --node-count 5 \
  --mode User \
  --labels pool=workloads2

# 2. Cordon every node in the OLD pool so nothing new schedules there
kubectl cordon -l agentpool=workloads

# 3. Drain the old pool; PDBs gate the pace, pods reschedule onto workloads2
kubectl drain -l agentpool=workloads \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=600s

# 4. Validate. If healthy, delete the old pool. If not, uncordon and roll back.
az aks nodepool delete \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name workloads

The agentpool label is applied automatically to every node by AKS, so -l agentpool=<name> reliably targets exactly one pool. Keep the old pool until smoke tests pass — deleting it is the point of no return.

This costs double capacity for the migration window, but converts a multi-hour, irreversible reimage into a controlled cutover with a one-command rollback (kubectl uncordon -l agentpool=workloads).
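Before cordoning the old pool, it is worth confirming the new pool is actually able to take the load. The sketch below gates on node readiness only; pool_ready is a hypothetical helper that reads kubectl get nodes --no-headers output on stdin, and the node names in the sample are invented.

```shell
# pool_ready EXPECTED
# Reads `kubectl get nodes --no-headers` rows on stdin (NAME STATUS ...).
# Succeeds only if at least EXPECTED nodes report exactly "Ready"
# (so "NotReady" and "Ready,SchedulingDisabled" do not count).
pool_ready() {
  local expected="$1" ready
  ready=$(awk '$2 == "Ready" { n++ } END { print n + 0 }')
  [ "$ready" -ge "$expected" ]
}

# Real usage (assumes the workloads2 pool from the steps above):
#   kubectl get nodes -l agentpool=workloads2 --no-headers | pool_ready 5

# Self-contained demo with invented sample rows
sample='aks-workloads2-000000 Ready <none> 6m v1.32.0
aks-workloads2-000001 NotReady <none> 1m v1.32.0'
if echo "$sample" | pool_ready 2; then
  echo "new pool ready: safe to cordon the old pool"
else
  echo "new pool not ready yet: do not drain"
fi
```

Node readiness is a necessary condition, not a sufficient one; the smoke tests against real user paths still decide whether the old pool can be deleted.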

Fleet-scale upgrades with Azure Kubernetes Fleet Manager

One cluster is a runbook; fifty clusters is a coordination problem. Azure Kubernetes Fleet Manager orchestrates upgrades across many AKS clusters with update runs that march through ordered stages and groups — dev before staging before prod, with bake time between each.

az extension add --name fleet

# Create a fleet (hub-less is fine for update orchestration only)
az fleet create \
  --resource-group rg-fleet \
  --name fleet-platform \
  --location eastus

# Join member clusters and assign each to an update group
az fleet member create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name aks-dev-eastus \
  --member-cluster-id "$DEV_CLUSTER_ID" \
  --update-group dev

az fleet member create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name aks-prod-eastus \
  --member-cluster-id "$PROD_CLUSTER_ID" \
  --update-group prod

An update strategy defines the stage order and the wait between stages; an update run executes it. Define the strategy once and reuse it.

# A strategy: dev first, soak 1 hour, then prod.
# --stages takes the path to a JSON file, so write the stages out first.
cat > ring-rollout-stages.json <<'EOF'
{
  "stages": [
    { "name": "dev",  "groups": [{ "name": "dev"  }], "afterStageWaitInSeconds": 3600 },
    { "name": "prod", "groups": [{ "name": "prod" }] }
  ]
}
EOF

az fleet updatestrategy create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name ring-rollout \
  --stages ring-rollout-stages.json

# An update run that rolls the latest node image through the strategy's stages
az fleet updaterun create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name run-2026-05 \
  --update-strategy-name ring-rollout \
  --upgrade-type NodeImageOnly

az fleet updaterun start \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name run-2026-05

--upgrade-type accepts Full (Kubernetes + node image), ControlPlaneOnly, and NodeImageOnly. The afterStageWaitInSeconds between stages is your fleet-wide soak: dev takes the new image, you watch dashboards for an hour, and only then does prod proceed. A failed stage halts the run, so a regression caught in dev never reaches prod.
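A pipeline that starts an update run usually needs to block until the run reaches a terminal state. The polling loop below is a generic sketch: wait_for_run is a hypothetical helper that calls whatever command you give it to fetch the current state, and the state names it matches (Completed, Failed, Stopped) are assumptions about what the Fleet API reports, so check them against real output before relying on this. The demo uses a stub in place of the real az query.

```shell
# wait_for_run INTERVAL_SECONDS MAX_POLLS STATE_COMMAND...
# Calls STATE_COMMAND repeatedly until it prints a terminal state.
# Returns 0 on Completed, 1 on Failed/Stopped, 2 if polling runs out.
wait_for_run() {
  local interval="$1" max_polls="$2" state="" i=0
  shift 2
  while [ "$i" -lt "$max_polls" ]; do
    state=$("$@")
    case $state in
      Completed) return 0 ;;
      Failed|Stopped) echo "run ended in state: $state" >&2; return 1 ;;
    esac
    sleep "$interval"
    i=$(( i + 1 ))
  done
  echo "gave up after $max_polls polls (last state: $state)" >&2
  return 2
}

# Stub standing in for the real state query (hypothetical)
fake_state() { echo "Completed"; }

if wait_for_run 0 3 fake_state; then
  echo "rollout finished"
fi
```

In real use the state command would wrap az fleet updaterun show with a --query that returns just the state, and the interval would be a minute or more rather than zero.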

Validating an upgrade

Upgrades fail in two ways: removed APIs and behavioral regressions. Check both — before and after.

Deprecated/removed API detection. Each Kubernetes minor removes APIs. AKS will warn (and can block) an upgrade if in-cluster objects or recent API traffic use APIs slated for removal in the target version. Surface these ahead of time:

# AKS-side: deprecation warnings reported by the control plane,
# including API usage seen in the last ~12h of audit logs
az aks get-upgrades \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --output table

# Cluster-side: the API server counts requests to deprecated APIs in this metric
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Microsoft Defender for Cloud also raises recommendations for
# clusters running deprecated Kubernetes API versions — check those
# in the Defender for Cloud recommendations blade before upgrading.

Removed-API breakage is the most common cause of a “successful upgrade, broken app.” Run a static check (e.g. pluto or kubent) against your manifests and Helm releases in CI, and gate the upgrade PR on it.
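For a sense of what such a gate does, here is a deliberately crude grep-based sketch. It is not a substitute for pluto or kubent: it only sees plain, unquoted apiVersion lines in raw manifests, and the table covers a small hand-picked subset of removals from the upstream deprecation guide. removed_api_hits is a hypothetical helper.

```shell
# removed_api_hits TARGET_MINOR FILE...
# Prints manifest lines whose apiVersion was removed at or before
# Kubernetes 1.TARGET_MINOR. Returns 0 if anything was found, 1 otherwise.
removed_api_hits() {
  local target="$1" entry api removed_in found=1
  shift
  for entry in \
    "policy/v1beta1:25" \
    "batch/v1beta1:25" \
    "autoscaling/v2beta1:25" \
    "autoscaling/v2beta2:26" \
    "flowcontrol.apiserver.k8s.io/v1beta2:29" \
    "flowcontrol.apiserver.k8s.io/v1beta3:32"; do
    api=${entry%:*}            # API group/version
    removed_in=${entry##*:}    # minor version that removed it
    if [ "$removed_in" -le "$target" ]; then
      if grep -HnE "apiVersion:[[:space:]]*${api}[[:space:]]*\$" "$@"; then
        found=0
      fi
    fi
  done
  return "$found"
}

# Demo against a throwaway manifest
demo=$(mktemp)
printf 'apiVersion: policy/v1beta1\nkind: PodDisruptionBudget\n' > "$demo"
if removed_api_hits 32 "$demo"; then
  echo "removed APIs found: fix these before upgrading"
fi
rm -f "$demo"
```

The return convention mirrors grep (0 means a match), so a CI step would fail the build when the function succeeds.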

Smoke tests. After the control plane and at least one node pool are on the new version, run synthetic checks against real user paths, not just kubectl get nodes.

# Nodes Ready and on the expected version
kubectl get nodes -o wide

# No pods stuck after the reimage (Succeeded pods are finished Jobs, so exclude them)
kubectl get pods -A --no-headers \
  --field-selector=status.phase!=Running,status.phase!=Succeeded \
  | grep . || echo "all pods healthy"

# Hit a real ingress path end to end
curl -fsS https://api.kloudvin.example/healthz && echo OK
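Right after a reimage, endpoints can flap for a short while as pods reschedule, so a single curl tends to produce false alarms. A bounded retry wrapper is one way to handle that; retry is a hypothetical helper, and the attempt count and delay are placeholders to tune.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$(( i + 1 ))
  done
  return 1
}

# Real usage, with the health endpoint from the example above:
#   retry 5 10 curl -fsS https://api.kloudvin.example/healthz

# Self-contained demo using a command that always succeeds
if retry 3 0 true; then
  echo "smoke check passed"
fi
```

Keeping the retry budget small matters: a check that retries for ten minutes hides exactly the regression it is supposed to catch.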

Enterprise scenario

A payments platform team ran 30+ AKS clusters under a Fleet Manager update run with --upgrade-type Full, dev-then-prod with an hour of bake time. The dev stage went green. Prod stalled: every node-pool upgrade hung in Upgrading, never finishing, never failing. The Fleet run sat blocked for hours.

The cause was a PDB nobody connected to upgrades. A platform DaemonSet-adjacent service — a Deployment of a regional rate-limiter — ran exactly 3 replicas behind an anti-affinity rule (one per zone) with minAvailable: 100%. In dev the pool had spare zones, so a replacement scheduled and the drain proceeded. Prod pools were packed to capacity in all three zones, so when AKS cordoned a node, the evicted rate-limiter pod had nowhere to land that satisfied anti-affinity, and minAvailable: 100% refused to drop below 3. The eviction API blocked indefinitely, and with it the whole batch.

The fix was twofold. Immediately, relax the budget so the drain could breathe:

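# minAvailable: 2 lets one of the three replicas move at a time.
# (Percentages round up, so 67% of 3 would still require all 3 pods.)
kubectl patch pdb ratelimiter-pdb -n edge \
  --type merge -p '{"spec":{"minAvailable":2}}'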

That unblocked the in-flight batch within minutes. The durable fix was a node-pool --drain-timeout 30 so a stuck eviction surfaces as a failed batch — which halts the Fleet stage before prod — instead of an invisible hang, plus an OPA Gatekeeper policy rejecting any PDB whose minAvailable equals the workload’s replica count. Lesson: validate PDBs against real prod headroom, not dev’s spare capacity.

Verify

After any upgrade run, confirm the cluster is where you intended:

# Control-plane version
az aks show -g rg-platform -n aks-prod-eastus \
  --query "currentKubernetesVersion" -o tsv

# Per-pool Kubernetes version AND node image version
az aks nodepool list -g rg-platform --cluster-name aks-prod-eastus \
  --query "[].{name:name, k8s:currentOrchestratorVersion, image:nodeImageVersion}" \
  -o table

# Fleet run completed across all stages
az fleet updaterun show -g rg-fleet --fleet-name fleet-platform \
  --name run-2026-05 --query "status.status.state" -o tsv

The nodeImageVersion field is the one teams forget — a pool can be on the right Kubernetes version but a stale image, meaning OS CVEs are still open.
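That check can be made mechanical by comparing each pool against the versions you expected to land. A minimal sketch, assuming the pool list has been exported as tab-separated name, Kubernetes version, and image version; pools_off_target is a hypothetical helper and the image string in the demo is invented.

```shell
# pools_off_target K8S_VERSION IMAGE_VERSION
# Reads "name<TAB>k8s<TAB>image" rows on stdin, prints every pool that is
# not on both expected versions, and exits non-zero if any were printed.
pools_off_target() {
  awk -F'\t' -v k8s="$1" -v image="$2" \
    '$2 != k8s || $3 != image { print $1; bad = 1 } END { exit bad }'
}

# Real usage (same fields as the az aks nodepool list query above, as TSV):
#   az aks nodepool list -g rg-platform --cluster-name aks-prod-eastus \
#     --query "[].[name, currentOrchestratorVersion, nodeImageVersion]" -o tsv \
#     | pools_off_target 1.32.0 "$EXPECTED_IMAGE"

# Self-contained demo with invented rows
expected_image="AKSUbuntu-2204gen2containerd-202505.14.0"
if ! printf 'system\t1.32.0\t%s\nworkloads\t1.32.0\tAKSUbuntu-2204gen2containerd-202503.02.0\n' \
     "$expected_image" | pools_off_target 1.32.0 "$expected_image"; then
  echo "the pools listed above are off target"
fi
```

In the demo the workloads pool is on the right Kubernetes version but an older image, which is exactly the case a version-only check would miss.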

Day-2 upgrade checklist

Pitfalls

Next, wire the deprecated-API scan and a post-upgrade smoke-test job into the same pipeline that runs your Fleet update runs, so the gates and the rollout live in one place — and a red check actually stops the rollout instead of just emailing someone.


// part 2 of 2 · AKS in Production
