
AKS Day-2 Operations: Cluster Upgrades, Node Lifecycle, and Fleet Management

Standing up an AKS cluster is a solved problem. Keeping a fleet of clusters patched, on a supported Kubernetes version, and upgraded without paging anyone is where most platform teams quietly accumulate risk. A cluster you provisioned six months ago is already drifting: Azure Kubernetes Service supports a Kubernetes minor version for roughly twelve months from its GA on the platform, enforcing an N-2 window — you may run the latest minor and the two behind it. Miss that window and the cluster lands on a platform-supported tier (best-effort, no control-plane SLA) and is eventually force-upgraded on Microsoft’s timetable, not yours. Node images move faster still: Microsoft ships new images weekly carrying OS CVE fixes, kubelet patches, and containerd updates. A cluster that is “fine” is usually just one nobody has looked at.

This is the Day-2 runbook, and it treats the upgrade as what it actually is: not one button but a pipeline — control-plane bump, node-pool surge-and-drain under PodDisruptionBudgets, fleet-wide staged rollout, and verification — where a failure at any stage stalls or breaks everything downstream. You will learn to decouple the cheap control-plane upgrade from the expensive node reimaging, tune max surge and PDBs so a drain is invisible instead of an outage, bind every upgrade to a maintenance window so it never fires mid-business-day, stand up blue-green node pools for high-risk changes with a one-command rollback, and coordinate dozens of clusters through Azure Kubernetes Fleet Manager update runs with bake time between rings. Every operation gets both an az CLI invocation and a Bicep/JSON snippet, and because this is a reference you return to mid-change, the version skew rules, the channel matrix, the surge math, the removed-API list, and the failure modes are all laid out as scannable tables — read the prose once, then keep the tables open at change-window time.

By the end you will stop treating upgrades as scary. When the N-2 clock runs down you will know exactly which of the two upgrade operations to run, what each one disrupts, which knob gates the blast radius, and how to catch a bad node image in dev before it ever reaches prod. Knowing which move to make — and in what order — is what separates a boring, scheduled patch from a multi-hour, all-hands stall.

What problem this solves

AKS hides the control plane so you can kubectl apply and have a running cluster. That abstraction is a gift until version support lapses, and then it becomes a wall: a force-upgrade you did not schedule, on a date you did not choose, against workloads whose PDBs you never validated against real headroom. The platform will keep you “supported” only if you keep moving, and the cadence is relentless: a new minor roughly every four months, a new node image weekly. The work is not whether to upgrade; it is making the upgrade boring, scheduled, surge-tuned, observable, and reversible.

What breaks without this discipline: a team auto-upgrades the Kubernetes version but leaves the node OS channel None, so they are patched on the API and exposed on the OS — open CVEs on every node, invisible because the cluster reports a healthy version. Or a node-pool upgrade hangs in Upgrading forever because a single-replica Deployment with minAvailable: 1 makes the eviction API refuse to drain, turning a clean rolling upgrade into a stuck operation that never finishes and never fails. Or a Fleet update run with zero bake time between dev and prod breaks every cluster at once — a slower way to take an outage, not a safer one. Each of these is perfectly diagnosable and entirely preventable; the failure is always procedural, not mysterious.

Who hits this: every team running more than one AKS cluster, and every team running one for more than a year. It bites hardest on platform teams managing a fleet (the coordination problem dwarfs the single-cluster runbook), on latency-sensitive workloads (where an over-aggressive surge or an unsatisfiable PDB is the difference between invisible and an incident), and on anyone who validated their disruption budgets against dev’s spare capacity instead of prod’s packed-to-the-zone reality. The fix is almost never “open a support ticket” — it is “decouple the operations, gate the blast radius, and bake between rings.”

To frame the whole field before the deep dive, here is every upgrade operation this runbook covers, the question it forces, and where it bites:

| Operation | What it changes | First question | Primary risk | Where you control it |
|---|---|---|---|---|
| Control-plane upgrade | API server, scheduler, controller-manager | Am I inside N-2? | Removed-API breakage | az aks upgrade --control-plane-only |
| Node-pool K8s upgrade | kubelet version; full reimage | Will the drain respect SLOs? | PDB stall / blast radius | az aks nodepool upgrade |
| Node-image upgrade | OS, containerd, kubelet patch (same K8s) | Are OS CVEs open? | Stale image left behind | --node-image-only / NodeImage channel |
| Auto-upgrade channels | Who pulls the trigger, and when | Is it bound to a window? | Mid-day reimage | --auto-upgrade-channel + maintenance config |
| Blue-green pool | A parallel pool on the new version | Is this change reversible? | Double capacity cost | az aks nodepool add + cordon/drain |
| Fleet update run | Many clusters, in ordered rings | Did dev bake before prod? | All clusters break together | az fleet updaterun + afterStageWaitInSeconds |

The job, then, is to make upgrades boring: scheduled, surge-tuned, observable, and reversible. Start by seeing the gap.

# What's available vs. what you're running
az aks get-upgrades \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --output table

# Per-node-pool view (control plane and pools can differ)
az aks nodepool get-upgrades \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name system \
  --output table

Treat get-upgrades output as a service-level indicator. If the control plane is more than one minor behind the latest GA version, you are burning your N-2 budget and should already have a change ticket open.
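
To turn that rule of thumb into a check, a minimal Python sketch can compare the version you run against the newest GA minor. The function names and status labels here are hypothetical, and the thresholds simply encode the guidance above (more than one minor behind means a change ticket); the version strings are whatever you read from get-upgrades.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version such as '1.31.4' or '1.31'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def minors_behind(current: str, latest_ga: str) -> int:
    """How many minor versions `current` trails `latest_ga` (never negative)."""
    cur_major, cur_minor = parse_minor(current)
    new_major, new_minor = parse_minor(latest_ga)
    if cur_major != new_major:
        raise ValueError("major versions differ; compare manually")
    return max(0, new_minor - cur_minor)


def n2_status(current: str, latest_ga: str) -> str:
    """Classify the gap. The labels are illustrative, not AKS terminology."""
    gap = minors_behind(current, latest_ga)
    if gap <= 1:
        return "ok"                    # on N or N-1
    if gap == 2:
        return "open-change-ticket"    # on N-2: the next GA release pushes it out
    return "outside-n2"                # N-3 or older


print(n2_status("1.30.4", "1.32.0"))
```

Run on a schedule, a check like this makes the N-2 budget visible before the platform starts enforcing it.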

Learning objectives

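By the end of this article you can decouple the control-plane upgrade from node reimaging, tune max surge and PDBs so a drain stays invisible, bind every upgrade to a maintenance window, stand up blue-green node pools with a one-command rollback, and coordinate a fleet through Azure Kubernetes Fleet Manager update runs with bake time between rings.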

Prerequisites & where this fits

You should already be comfortable provisioning an AKS cluster and running kubectl and az against it — node pools (system vs user mode), Deployments and ReplicaSets, and the basic shape of the control plane. You should know that AKS is a managed Kubernetes service where Microsoft runs the control plane and you own the node pools, and you should understand the difference between the Kubernetes (kubelet) version a node runs and the node image it boots from. Familiarity with PodDisruptionBudgets, cordon/drain, and Helm helps; comfort reading JSON and YAML is assumed.

This sits in the AKS in Production track as the Day-2 operations runbook. It assumes the managed-Kubernetes fundamentals from Understanding Managed Kubernetes: AKS vs EKS vs GKE and the production networking and observability baseline in Production AKS: Networking & Observability. It pairs tightly with Kubernetes Production Readiness: Day-2 Operations Checklist, since upgrades are the highest-stakes recurring Day-2 task, and with Azure Monitor: Managed Prometheus & Managed Grafana for AKS, because an upgrade you cannot observe is one you cannot safely roll back. The fleet-orchestration half complements Azure Arc-enabled Kubernetes: GitOps, Policy & Fleet Management; if you also run EKS, the same mechanics in another cloud are in EKS Cluster Upgrades: Version Lifecycle & Fleet Operations.

A quick map of who owns what during an upgrade, so you call the right person fast:

| Layer | What lives here | Who usually owns it | What it can stall / break |
|---|---|---|---|
| Control plane | API server, scheduler, etcd | Microsoft (managed) | Removed-API rejection on upgrade |
| Node pool | VMs, kubelet, OS image | Platform team | PDB stall, surge blast radius, image drift |
| Workload | Deployments, PDBs, probes | App / dev team | Unsatisfiable PDB hangs the drain |
| Maintenance config | Windows, freeze dates | Platform / change mgmt | Mid-day reimage if unset |
| Fleet | Update runs, strategies | Platform / SRE | All clusters break together with no bake |
| CI / policy | pluto/kubent, OPA Gatekeeper | DevOps / security | Removed API or bad PDB reaches prod |

Core concepts

Five mental models make every later decision obvious.

An upgrade is two operations, not one. A control-plane upgrade moves the managed API server, scheduler, and controller-manager — fast, Microsoft-managed, and the only part that gates the cluster’s reported version. A node-pool upgrade reimages every VM in a pool to the new kubelet version, one surge batch at a time, with cordon-and-drain. People conflate them at their peril. The control plane must be upgraded before or together with node pools, and node pools may trail the control plane by at most one minor version. Decoupling them lets you take the cheap, low-risk control-plane bump immediately and schedule the expensive, workload-disrupting node reimaging for a window.

Two cadences, two channels. The Kubernetes version changes roughly every four months (upstream ships about three minors a year) and touches the API surface (deprecations, behavior changes), so it carries the higher risk. The node image changes weekly and carries OS/kubelet/containerd patches at the same Kubernetes version: low risk, reimage only. These are governed by two independent auto-upgrade settings (the cluster channel and the node-OS channel) that people constantly confuse. Automate them separately; reserve a human for minor-version bumps that deserve release-note reading.

Surge and PDBs decide whether a drain is invisible or an outage. When a pool upgrades, AKS adds surge nodes, then cordons and drains existing nodes one batch at a time. Max surge sets the batch size (default one node — safe but glacial on a 100-node pool). PodDisruptionBudgets are what make the drain respect your SLOs: during drain, the eviction API honors PDBs, blocking eviction that would violate minAvailable until a replacement pod is Ready elsewhere. The sharp edge: a PDB that can never be satisfied (e.g. minAvailable: 100% on a single-replica Deployment, or an anti-affinity rule with no spare zone) stalls the drain indefinitely.

The maintenance window is your change-freeze enforcement. Auto-upgrade without a maintenance window means Azure can reimage your nodes whenever it likes. Planned Maintenance binds all upgrade activity to schedules you control, and notAllowedDates carves out change freezes (quarter-end, peak season) where no maintenance starts even inside the recurring window. Three named configs govern the three activity classes; unset them and you have ceded the timing decision to the platform.

The fleet is a coordination problem, not a bigger cluster. One cluster is a runbook; fifty clusters is choreography. Azure Kubernetes Fleet Manager marches an upgrade through ordered stages and groups — dev before staging before prod — with bake time (afterStageWaitInSeconds) between each. A failed stage halts the run, so a regression caught in dev never reaches prod. The bake time is the safety mechanism: zero wait is just a slower way to break everything simultaneously.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to an upgrade |
|---|---|---|---|
| Control plane | Managed API server / scheduler / etcd | Microsoft-managed | Gates the cluster’s reported version |
| Node pool | A group of identical VMs (kubelet + image) | Your subscription | Reimaged batch-by-batch on upgrade |
| N-2 support window | Latest minor + two behind it | Platform policy | Lapse → force-upgrade off your schedule |
| Max surge | Extra nodes added per upgrade batch | Per node pool | Sets batch size / blast radius |
| PodDisruptionBudget | Floor of pods that must stay up | Per workload | Gates / stalls the drain |
| Drain timeout | How long a node waits to evict | Per node pool | A stuck eviction fails loudly vs hangs |
| Node soak | Delay after a node is up before next batch | Per node pool | Lets a bad image surface in monitoring |
| Auto-upgrade channel | Who bumps the K8s version, and to what | Cluster setting | patch/stable/rapid/none |
| Node-OS channel | Who bumps the node image | Cluster setting | NodeImage/SecurityPatch/etc. |
| Maintenance window | When upgrades may run | Maintenance config | Change-freeze enforcement |
| Blue-green pool | A parallel pool on the new version | On the cluster | Reversible high-risk upgrades |
| Fleet Manager | Multi-cluster upgrade orchestrator | rg-fleet | Staged, baked, fail-halts-run rollouts |
| Update run / strategy | The executed rollout / its definition | Fleet resource | Stage order + bake between rings |

Version skew and the N-2 support window

Before you touch anything, understand the rules that constrain what you can upgrade to and by how much. Kubernetes itself enforces a version-skew policy between components; AKS layers its support window on top. Violate either and the upgrade is rejected or unsupported.

The component skew rules that govern any single upgrade step:

| Component pair | Allowed skew | What this means in practice | Violation symptom |
|---|---|---|---|
| Node pool vs control plane | At most 1 minor behind | A pool on 1.30 needs the CP on 1.30 or 1.31, never 1.32 | Upgrade refused; “node pool too far behind” |
| Control plane minor jump | 1 minor at a time | 1.30 → 1.31 → 1.32, never 1.30 → 1.32 directly | get-upgrades won’t offer the skip |
| kubelet vs API server | Up to 3 minors older (upstream) | AKS tightens this to 1 via the rule above | N/A on AKS (CP-skew rule is stricter) |
| Two node pools to each other | Independent within the CP rule | Pools can sit on different minors, each ≤1 behind CP | Mixed-version pools are legal |
| Patch within a minor | Any patch, freely | 1.31.3 → 1.31.8 is always allowed | None |

The AKS support tiers — where a version lands as it ages, and what you lose:

| Tier | Which versions | Control-plane SLA | What you get | What you lose |
|---|---|---|---|---|
| GA / supported | Latest 3 minors (N, N-1, N-2) | Full uptime SLA (with SLA tier) | CVE patches, support, upgrades | Nothing |
| Platform support | One minor past N-2 | Best-effort only | Cluster keeps running | No K8s patches, no CVE fixes, limited support |
| Out of support | Older than platform support | None |  | Force-upgrade scheduled by Microsoft |
| Preview / alpha | Pre-GA minors | None | Early features | No support; not for prod |
| LTS (premium tier) | Designated minor, ~2 yr | Full (premium add-on) | Extended support window | Higher cost; specific minors only |

The upgrade-step math, so you can plan a multi-minor catch-up:

| Starting from | Target | Steps required | Why | Rough wall-clock |
|---|---|---|---|---|
| 1.31 (N-1) | 1.32 (N) | 1 hop | Single minor | CP minutes + node reimage |
| 1.30 (N-2) | 1.32 (N) | 2 hops | One minor at a time | Two full cycles |
| 1.29 (N-3, platform support) | 1.32 | 3 hops | Sequential minors | Long; do in a window |
| 1.31.3 | 1.31.8 | 1 hop (patch) | Same minor | CP patch + node reimage, fast |
| Any | Same + new image | Node-image upgrade | No version change | Reimage only, lowest risk |

# See the upgrade path the platform will allow (it enforces one-minor hops)
az aks get-upgrades -g rg-platform -n aks-prod-eastus \
  --query "controlPlaneProfile.upgrades[].kubernetesVersion" -o tsv

# Show the running version and SKU tier; compare against the list above to see how far behind GA you are
az aks show -g rg-platform -n aks-prod-eastus \
  --query "{current:currentKubernetesVersion, sku:sku.tier}" -o table
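
Because the platform only offers one-minor hops, a catch-up plan can be derived mechanically. The sketch below is illustrative Python, not an AKS API: upgrade_hops is a hypothetical helper that lists the minors the control plane has to pass through, assuming versions are plain major.minor[.patch] strings.

```python
def upgrade_hops(current: str, target: str) -> list[str]:
    """List each minor the control plane must land on, in order.

    An empty list means no minor hop is needed (patch or node-image
    upgrade only).
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major versions differ")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current; downgrades are not offered")
    # One entry per minor between current (exclusive) and target (inclusive).
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


for hop in upgrade_hops("1.29.7", "1.32.0"):
    print("upgrade control plane to", hop)
```

Each entry in the returned list is one full cycle (control plane, then node pools), which is why a three-hop catch-up belongs in a planned window.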

Upgrade anatomy: control plane vs node pools

An AKS upgrade is two distinct operations that people conflate at their peril:

  1. Control-plane upgrade — the managed API server, scheduler, and controller-manager. Fast, Microsoft-managed, and the only part that gates the cluster’s reported version.
  2. Node-pool upgrade — every VM in a pool is reimaged to the new Kubernetes (kubelet) version, one surge batch at a time, with cordon-and-drain.

The control plane must be upgraded before or together with node pools, and node pools may trail the control plane by at most one minor version. Decoupling them is the single most important Day-2 technique, because it lets you take the cheap, low-risk control-plane bump immediately and schedule the expensive, workload-disrupting node reimaging for a maintenance window.

# Upgrade ONLY the control plane (Kubernetes 1.31.x -> 1.32.x)
az aks upgrade \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --kubernetes-version 1.32.0 \
  --control-plane-only \
  --yes

# Later, in a maintenance window, bring each node pool up
az aks nodepool upgrade \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name system \
  --kubernetes-version 1.32.0

A bare az aks upgrade --kubernetes-version 1.32.0 (no --control-plane-only) upgrades the control plane and every node pool in one long-running operation. That is fine for non-prod; in production you almost always want to split them.

The two operations side by side — internalise this table and most upgrade decisions make themselves:

| Dimension | Control-plane upgrade | Node-pool upgrade |
|---|---|---|
| What moves | API server, scheduler, controller-manager | Every node’s kubelet + OS image |
| Who runs it | Microsoft (managed) | AKS, surge-batch by batch |
| Duration | Minutes | Minutes to hours (scales with pool size ÷ surge) |
| Workload disruption | None (control plane is HA) | Pods evicted as nodes drain |
| Gates the cluster version? | Yes | No (pools can trail by 1 minor) |
| Reversible? | No (roll forward only) | Blue-green pool gives a rollback |
| Right cadence | As soon as available, low-risk | Scheduled in a maintenance window |
| The flag | --control-plane-only | --nodepool-name <pool> |

The CLI verbs you will actually use, and exactly what each touches:

| Command | Scope | Changes version? | Reimages nodes? | When to reach for it |
|---|---|---|---|---|
| az aks upgrade --control-plane-only | Control plane | Yes (CP) | No | Take the cheap bump now |
| az aks upgrade (no flag) | CP + all pools | Yes (CP + pools) | Yes (all) | Non-prod, or a full window |
| az aks nodepool upgrade --kubernetes-version | One pool | Yes (pool) | Yes (that pool) | Bring a pool to the CP version |
| az aks nodepool upgrade --node-image-only | One pool | No | Yes (that pool) | OS/CVE patch, same K8s |
| az aks upgrade --node-image-only | All pools | No | Yes (all) | Cluster-wide image refresh |
| az aks nodepool get-upgrades | One pool | No (read-only) | No (read-only) | See what a pool can move to |
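
Before taking a control-plane-only bump, it can help to check that no pool would fall outside the skew rule afterwards. A minimal sketch, assuming the at-most-one-minor rule stated in this article (exposed as max_skew so it can be adjusted to whatever the platform documents for your version); the function and pool names are hypothetical.

```python
def minor_of(version: str) -> int:
    """Extract the minor number from 'major.minor[.patch]'."""
    return int(version.split(".")[1])


def pool_skew_problems(control_plane: str, pools: dict[str, str],
                       max_skew: int = 1) -> list[str]:
    """Describe every pool that would violate the skew rule.

    `control_plane` is the version the control plane will be on AFTER the
    planned bump; `pools` maps pool name to its current version.
    """
    problems = []
    cp_minor = minor_of(control_plane)
    for name, version in sorted(pools.items()):
        gap = cp_minor - minor_of(version)
        if gap < 0:
            problems.append(f"{name}: ahead of control plane")
        elif gap > max_skew:
            problems.append(f"{name}: {gap} minors behind (limit {max_skew})")
    return problems


# Planning a control-plane bump to 1.32 while one pool is still on 1.30:
print(pool_skew_problems("1.32.0", {"system": "1.31.2", "workloads": "1.30.5"}))
```

A non-empty result means those pools need their own node-pool upgrade before the control plane moves again.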

Node image vs Kubernetes upgrades

These are different cadences and you should automate them differently.

| Dimension | Node-image upgrade | Kubernetes upgrade |
|---|---|---|
| What changes | OS packages, containerd, kubelet patch, security fixes | Kubernetes minor/patch version (API surface) |
| Frequency | Weekly images from Microsoft | Per minor release (about three a year upstream) |
| Risk | Low: same K8s version, reimage only | Higher: API deprecations, behavior changes |
| Recommended channel | NodeImage | patch (auto) or manual minor bumps |
| Rollback | Re-pin to prior image is not supported; roll forward | Blue-green pool or roll forward |
| What it fixes | OS CVEs, kubelet/containerd bugs | New APIs, upstream features, behavior fixes |

A node-image-only upgrade keeps the Kubernetes version fixed and just reimages nodes onto the latest weekly image — this is how you stay on top of OS CVEs without touching the API surface.

# Patch OS/kubelet without changing the Kubernetes version
az aks nodepool upgrade \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name system \
  --node-image-only

Auto-upgrade channels

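AKS has two independent auto-upgrade settings: the cluster channel, which moves the Kubernetes version, and the node-OS channel, which moves the node image. Do not confuse them. One command sets both: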

az aks update \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --auto-upgrade-channel patch \
  --node-os-upgrade-channel NodeImage

The cluster (Kubernetes-version) channel — every value, what it does, and the trade-off:

| Channel | What it does | Cadence | Best for | Risk / gotcha |
|---|---|---|---|---|
| none | No automatic K8s upgrades | Never | Strict manual control | You own the N-2 clock entirely |
| patch | Latest patch of your current minor | As patches ship | Production default | Stays on your minor; you still drive minor bumps |
| stable | Latest patch of N-1 (one minor behind newest) | Per minor + patch | Conservative auto-minor | Lags the newest minor by design |
| rapid | Latest supported patch of N (newest minor) | Aggressive | Dev / fast-moving | Pulls minor bumps automatically, before anyone reads the release notes |
| node-image (legacy alias) | Node image only | Weekly | Superseded by node-OS channel | Prefer the dedicated node-OS channel |

The node-OS (image) channel — every value:

| Node-OS channel | What it does | Reboot? | Best for | Gotcha |
|---|---|---|---|---|
| None | No automatic OS updates | No | You manage it explicitly | Leaves OS CVEs open if you forget |
| Unmanaged | OS’s own update mechanism handles it | Maybe | Legacy / special images | AKS doesn’t coordinate it; uneven |
| SecurityPatch | Azure applies OS security patches, live where possible | Sometimes | Patch CVEs without a full image swap | Not every fix is patchable live |
| NodeImage | Move to the latest weekly node image | Yes (reimage) | Production default | Reimages nodes; bind to a window |
My default for production: cluster channel patch and node OS channel NodeImage, both bound to a maintenance window (next section) so they never fire mid-business-day. Reserve manual control over the minor version bumps — those deserve a human reading the release notes.

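// Assumes the deployment targets the cluster's resource group.
param location string = resourceGroup().location

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: 'aks-prod-eastus'
  location: location
  properties: {
    // Other required cluster settings are omitted here for brevity.
    autoUpgradeProfile: {
      upgradeChannel: 'patch'          // Kubernetes version channel
      nodeOSUpgradeChannel: 'NodeImage' // node image channel
    }
  }
}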

A decision table for picking the channel pair by environment:

| Environment | Cluster channel | Node-OS channel | Bound to window? | Rationale |
|---|---|---|---|---|
| Production (regulated) | patch | NodeImage | Yes (both) | Auto-patch + auto-image, human owns minors, freeze-aware |
| Production (fast-moving SaaS) | stable | NodeImage | Yes | Accept auto-minor one behind newest |
| Staging / pre-prod | patch | NodeImage | Loose | Mirror prod, bake new images first |
| Dev / sandbox | rapid | NodeImage | No | Surface breakage early, cheaply |
| Pinned / compliance-locked | none | SecurityPatch | Yes | Manual K8s, but still patch CVEs |
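
The "patched on the API, exposed on the OS" trap described earlier is easy to lint for. The sketch below is a hypothetical Python check over the autoUpgradeProfile object (the same two keys the Bicep above sets); the finding codes and the has_window flag are assumptions made for illustration, not anything AKS emits.

```python
def channel_findings(auto_upgrade_profile: dict, has_window: bool) -> list[str]:
    """Return finding codes for risky channel combinations on a prod cluster."""
    cluster = str(auto_upgrade_profile.get("upgradeChannel") or "none").lower()
    node_os = str(auto_upgrade_profile.get("nodeOSUpgradeChannel") or "None").lower()

    findings = []
    # Kubernetes version moves automatically while the OS image never does.
    if cluster != "none" and node_os == "none":
        findings.append("k8s-auto-but-os-unpatched")
    # Any automatic channel without a maintenance window can fire mid-day.
    automatic = cluster != "none" or node_os in ("nodeimage", "securitypatch")
    if automatic and not has_window:
        findings.append("auto-upgrade-without-window")
    # rapid takes minor bumps without a human reading the release notes.
    if cluster == "rapid":
        findings.append("rapid-pulls-minors-automatically")
    return findings


profile = {"upgradeChannel": "patch", "nodeOSUpgradeChannel": "None"}
print(channel_findings(profile, has_window=True))
```

For dev and sandbox clusters the last two findings are expected per the table, so a real policy would key the check on environment.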

Tuning the rollout: max surge, PDBs, and draining

When a node pool upgrades, AKS adds surge nodes, then cordons and drains existing nodes one batch at a time. Two knobs decide whether this is invisible or an outage.

Max surge controls batch size and is set per node pool. The default is one node (an absolute value), which is safe but glacial on a 100-node pool. Bump it to a percentage to parallelize.

# 33% surge: upgrade roughly a third of the pool per batch
az aks nodepool update \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name workloads \
  --max-surge 33%

Higher surge = faster upgrade but more transient capacity (and cost) and more simultaneous pod evictions. For latency-sensitive workloads I use 33%; for large batch/stateless pools 50% is fine. Avoid 100% in production — it doubles the pool and gives you no blast-radius control if a new node image is bad.

The surge spectrum — what each setting buys and costs, on a 30-node pool:

| --max-surge | Nodes per batch (≈) | Extra nodes provisioned | Batches | Blast radius if image is bad | Use for |
|---|---|---|---|---|---|
| 1 (default) | 1 | +1 | 30 | Tiny: one node | Tiny pools; ultra-cautious |
| 10% | 3 | +3 | 10 | Small | Cautious prod, large pools |
| 33% | 10 | +10 | 3 | Moderate | Latency-sensitive default |
| 50% | 15 | +15 | 2 | Half the pool per batch | Stateless / batch pools |
| 100% | 30 | +30 (doubles pool) | 1 | Whole pool at once | Avoid in prod; no blast control |
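
The table's arithmetic can be reproduced in a few lines for any pool size. This sketch assumes a percentage surge rounds up to a whole node (consistent with 33% of 30 giving 10 in the table) and that each batch replaces as many nodes as the surge provides; surge_nodes and batch_count are hypothetical helper names.

```python
def surge_nodes(pool_size: int, max_surge: str) -> int:
    """Nodes added per batch for a --max-surge value such as '1' or '33%'."""
    if max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        nodes = (pool_size * percent + 99) // 100  # integer ceiling
    else:
        nodes = int(max_surge)
    return max(1, nodes)


def batch_count(pool_size: int, max_surge: str) -> int:
    """Number of surge batches needed to replace every node in the pool."""
    per_batch = surge_nodes(pool_size, max_surge)
    return (pool_size + per_batch - 1) // per_batch  # integer ceiling


for setting in ("1", "10%", "33%", "50%", "100%"):
    print(setting, surge_nodes(30, setting), batch_count(30, setting))
```

Running it against your real pool sizes before a change window shows how many nodes move at once, which is the blast radius if the new image is bad.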

PodDisruptionBudgets are what make drains respect your SLOs. During drain, the eviction API honors PDBs; if evicting a pod would violate minAvailable, the drain blocks until the replacement pod is Ready elsewhere.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: payments
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: checkout

There is a sharp edge here: a PDB that can never be satisfied (e.g. minAvailable: 100% on a single-replica Deployment) will stall the drain indefinitely, turning a clean rolling upgrade into a hung operation. The PDB field matrix — what each setting does and where it bites:

| PDB field | What it means | Safe value | Dangerous value | Failure mode of the dangerous value |
|---|---|---|---|---|
| minAvailable (integer) | At least N pods must stay up | 2 on a 4-replica app | 1 on a 1-replica app | Eviction blocked → drain hangs |
| minAvailable (percent) | At least X% must stay up | 80% | 100% | No pod may ever be evicted → hang |
| maxUnavailable (integer) | At most N may be down | 1 on a 4-replica app | 0 | Equivalent to 100% available → hang |
| maxUnavailable (percent) | At most X% may be down | 25% | 0% | Same trap as above |
| selector | Which pods the PDB covers | Matches the Deployment | Matches nothing / too broad | Silently protects the wrong pods |
| (replica count) | Pods behind the budget | 2+ with topology spread | 1 | Single replica + any PDB = hang risk |
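
Whether a budget can ever be satisfied is a small calculation, so it can be checked before the drain rather than discovered during it. The following is a minimal sketch, assuming every replica is currently healthy and that percentage values round up to a whole pod; allowed_disruptions is a hypothetical helper, not a Kubernetes API.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def _scaled(value: IntOrPercent, replicas: int) -> int:
    """Resolve an int-or-percent value against the replica count (round up)."""
    if isinstance(value, str):
        percent = int(value.rstrip("%"))
        return (replicas * percent + 99) // 100  # integer ceiling
    return value


def allowed_disruptions(replicas: int,
                        min_available: Optional[IntOrPercent] = None,
                        max_unavailable: Optional[IntOrPercent] = None) -> int:
    """Pods that may be evicted right now; 0 means the drain will block."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        must_stay_up = _scaled(min_available, replicas)
    else:
        must_stay_up = replicas - _scaled(max_unavailable, replicas)
    return max(0, replicas - must_stay_up)


# 80% looks safe, but on 3 replicas it rounds up to "all 3 must stay up".
print(allowed_disruptions(5, min_available="80%"))
print(allowed_disruptions(3, min_available="80%"))
```

The second call is the sharp edge in miniature: a percentage that is safe at one replica count can silently become a blocking budget at another.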

Rules I enforce via policy:

az aks nodepool update \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --nodepool-name workloads \
  --max-surge 33% \
  --drain-timeout 30 \
  --node-soak-duration 5

The node-pool upgrade knobs in one place — defaults, ranges, and when to change each:

| Knob | What it does | Default | Range | When to change |
|---|---|---|---|---|
| --max-surge | Extra nodes per batch (abs or %) | 1 | 1…pool size, or 1–100% | Speed up large pools; cap blast radius |
| --drain-timeout | Minutes a node waits to evict before failing | 30 (platform) | minutes | Lower so a stuck PDB fails loudly, not hangs |
| --node-soak-duration | Minutes after a node is Ready before next batch | 0 | 0–30 min | Raise so a bad image surfaces between batches |
| --max-unavailable | Nodes that may be unavailable per batch | unset | nodes or % | Pair with surge for tighter control |
| PDB minAvailable | Workload floor honored during drain | per workload | int or % | Set on every prod workload; never = replicas |
| Readiness probe | Gates when a replacement counts as “Ready” | your config | n/a | Honest readiness so drains don’t proceed early |

resource pool 'Microsoft.ContainerService/managedClusters/agentPools@2024-09-01' = {
  name: '${aks.name}/workloads'
  properties: {
    upgradeSettings: {
      maxSurge: '33%'
      drainTimeoutInMinutes: 30
      nodeSoakDurationInMinutes: 5
    }
  }
}
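
Drain timeout and soak also put a ceiling on how long a pool upgrade can run, which is worth knowing before you size the maintenance window. The sketch below is a rough upper bound only: it assumes nodes within a batch drain in parallel, that every batch uses its full drain timeout, and it ignores node provisioning time. worst_case_minutes is a hypothetical helper.

```python
def worst_case_minutes(pool_size: int, nodes_per_batch: int,
                       drain_timeout_min: int = 30, soak_min: int = 5) -> int:
    """Upper bound on upgrade time if every batch drains for the full timeout."""
    batches = -(-pool_size // nodes_per_batch)  # ceiling division
    return batches * (drain_timeout_min + soak_min)


# 30-node pool, 10 nodes per batch (33% surge), settings from the command above.
print(worst_case_minutes(30, 10))
# Same pool at the default surge of one node per batch.
print(worst_case_minutes(30, 1))
```

If the bound for a pool exceeds the window you plan to give it, raise the surge or lower the drain timeout rather than hoping the drains finish early.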

Maintenance windows and planned maintenance

Auto-upgrade without a maintenance window means Azure can reimage your nodes whenever it likes. Planned Maintenance binds all upgrade activity to schedules you control, which is how you keep upgrades inside a change-freeze policy.

There are three configurable schedules. The table lists what each governs and a sane starting schedule:

| Config name | Governs | Recommended schedule | Min window | Notes |
|---|---|---|---|---|
| aksManagedAutoUpgradeSchedule | Kubernetes auto-upgrades (cluster channel) | Weekly, Sun 02:00, 4h | 4 hours | Pair with the cluster channel |
| aksManagedNodeOSUpgradeSchedule | Node-image / OS upgrades (node-OS channel) | Daily/nightly 03:00, 4h | 4 hours | Images ship weekly; daily catches them |
| default | Legacy weekly AKS-initiated maintenance | Weekly off-hours | 4 hours | Superseded by the two named configs |

# Kubernetes auto-upgrades: Sundays 02:00, 4-hour window, US Eastern
az aks maintenanceconfiguration add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name aksManagedAutoUpgradeSchedule \
  --schedule-type Weekly \
  --interval-weeks 1 \
  --day-of-week Sunday \
  --start-time 02:00 \
  --duration 4 \
  --utc-offset -05:00

# Node OS/image upgrades: nightly 03:00, 4-hour window
az aks maintenanceconfiguration add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name aksManagedNodeOSUpgradeSchedule \
  --schedule-type Daily \
  --interval-days 1 \
  --start-time 03:00 \
  --duration 4 \
  --utc-offset -05:00

The schedule fields you set, and their constraints:

| Field | What it controls | Values | Constraint |
|---|---|---|---|
| --schedule-type | Recurrence kind | Weekly, AbsoluteMonthly, RelativeMonthly, Daily | Daily only for node-OS schedule |
| --day-of-week | Day for weekly schedules | Sunday through Saturday | Weekly types only |
| --start-time | Window open time | HH:MM (24h) | Local to --utc-offset |
| --duration | Window length in hours | integer ≥ 4 | Minimum 4 hours |
| --utc-offset | Timezone offset | ±HH:MM | Make it match your change window |
| --interval-weeks / --interval-days | Recurrence interval | integer | Spread to every 2nd/4th week if needed |
| notAllowedDates (config-file) | Freeze ranges | array of start/end dates | Blocks even in-window starts |

For change freezes (quarter-end, peak shopping season), use the --config-file form, which supports notAllowedDates — date ranges where no maintenance may start even if it falls inside the recurring window.

{
  "properties": {
    "maintenanceWindow": {
      "schedule": { "weekly": { "intervalWeeks": 1, "dayOfWeek": "Sunday" } },
      "durationHours": 4,
      "utcOffset": "-05:00",
      "startTime": "02:00",
      "notAllowedDates": [
        { "start": "2026-11-20", "end": "2026-12-02" }
      ]
    }
  }
}

az aks maintenanceconfiguration add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name aksManagedAutoUpgradeSchedule \
  --config-file ./freeze-window.json
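
To sanity-check a freeze calendar before applying it, the notAllowedDates logic can be mirrored locally. This is a minimal sketch, assuming both ends of each range are inclusive (an assumption worth confirming against the platform's behavior); in_freeze and allowed_starts are hypothetical helpers operating on the same start/end shape as the JSON above.

```python
from datetime import date


def in_freeze(day: date, not_allowed_dates: list[dict]) -> bool:
    """True if `day` falls inside any freeze range (ends treated as inclusive)."""
    return any(
        date.fromisoformat(r["start"]) <= day <= date.fromisoformat(r["end"])
        for r in not_allowed_dates
    )


def allowed_starts(candidates: list[date], not_allowed_dates: list[dict]) -> list[date]:
    """Filter candidate window-start dates down to those outside every freeze."""
    return [d for d in candidates if not in_freeze(d, not_allowed_dates)]


FREEZE = [{"start": "2026-11-20", "end": "2026-12-02"}]
weekly = [date(2026, 11, 15), date(2026, 11, 22), date(2026, 11, 29), date(2026, 12, 6)]
print(allowed_starts(weekly, FREEZE))
```

Listing the surviving dates makes it obvious how long a freeze actually defers patching, which matters when node images ship weekly.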

Blue-green at the node-pool level

For high-risk upgrades — a major OS family change, a kernel-sensitive workload, or a node SKU swap — in-place surge is not enough control. Stand up a parallel node pool on the new version, shift workloads, and keep the old pool as a rollback.

# 1. New pool on the target version (note the new name)
az aks nodepool add \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name workloads2 \
  --kubernetes-version 1.32.0 \
  --node-count 5 \
  --mode User \
  --labels pool=workloads2

# 2. Cordon every node in the OLD pool so nothing new schedules there
kubectl cordon -l agentpool=workloads

# 3. Drain the old pool; PDBs gate the pace, pods reschedule onto workloads2
kubectl drain -l agentpool=workloads \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=600s

# 4. Validate. If healthy, delete the old pool. If not, uncordon and roll back.
az aks nodepool delete \
  --resource-group rg-platform \
  --cluster-name aks-prod-eastus \
  --name workloads

The agentpool label is applied automatically to every node by AKS, so -l agentpool=<name> reliably targets exactly one pool. Keep the old pool until smoke tests pass — deleting it is the point of no return.

This costs double capacity for the migration window, but converts a multi-hour, irreversible reimage into a controlled cutover with a one-command rollback (kubectl uncordon -l agentpool=workloads).

The two strategies head to head — pick by how much rollback control the change demands:

| Dimension | In-place surge upgrade | Blue-green node pool |
|---|---|---|
| Extra capacity | Surge % only (e.g. +33%) | Full second pool (≈ +100%) |
| Cost during migration | Modest, transient | Double, for the window |
| Rollback | None mid-upgrade (roll forward) | kubectl uncordon the old pool |
| Blast radius control | Surge % + soak | Total: validate before cutover |
| Operational complexity | One command | Add pool, cordon, drain, validate, delete |
| Best for | Routine K8s/image upgrades | OS-family swap, SKU change, kernel-sensitive |
| Point of no return | When all batches reimaged | When you delete the old pool |
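
The capacity difference in the first two rows is worth computing against your quota before choosing a strategy. A minimal sketch, assuming the green pool is sized equal to the blue pool and that percentage surge rounds up to a whole node; peak_nodes is a hypothetical helper.

```python
def peak_nodes(pool_size: int, strategy: str, max_surge_pct: int = 33) -> int:
    """Peak node count during the migration for either strategy."""
    if strategy == "surge":
        extra = (pool_size * max_surge_pct + 99) // 100  # integer ceiling
    elif strategy == "blue-green":
        extra = pool_size  # assumes a full-size parallel pool
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return pool_size + extra


print(peak_nodes(30, "surge"))
print(peak_nodes(30, "blue-green"))
```

If the blue-green peak does not fit your subscription's core quota or subnet size, the cutover will stall at step 1, so check it before the window opens.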

The blue-green runbook as a checklist table:

| Step | Command | What it does | Rollback at this point |
|---|---|---|---|
| 1 | az aks nodepool add --name workloads2 ... | New pool on target version | Delete workloads2 |
| 2 | kubectl cordon -l agentpool=workloads | Stop scheduling on old pool | kubectl uncordon -l agentpool=workloads |
| 3 | kubectl drain -l agentpool=workloads ... | Move pods to new pool, PDB-paced | Uncordon; pods stay where rescheduled |
| 4a (healthy) | az aks nodepool delete --name workloads | Remove old pool | None: point of no return |
| 4b (unhealthy) | kubectl uncordon -l agentpool=workloads | Restore old pool to rotation | This is the rollback |

Fleet-scale upgrades with Azure Kubernetes Fleet Manager

One cluster is a runbook; fifty clusters is a coordination problem. Azure Kubernetes Fleet Manager orchestrates upgrades across many AKS clusters with update runs that march through ordered stages and groups — dev before staging before prod, with bake time between each.

az extension add --name fleet

# Create a fleet (hub-less is fine for update orchestration only)
az fleet create \
  --resource-group rg-fleet \
  --name fleet-platform \
  --location eastus

# Join member clusters and assign each to an update group
az fleet member create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name aks-dev-eastus \
  --member-cluster-id "$DEV_CLUSTER_ID" \
  --update-group dev

az fleet member create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name aks-prod-eastus \
  --member-cluster-id "$PROD_CLUSTER_ID" \
  --update-group prod

The Fleet Manager object model — the nouns you compose into a rollout:

| Object | What it is | Created with | Holds |
|---|---|---|---|
| Fleet | The top-level container | az fleet create | Members, strategies, runs |
| Member | A joined AKS cluster | az fleet member create | Cluster id + update-group label |
| Update group | A label grouping members | --update-group on the member | Clusters that upgrade together |
| Stage | An ordered ring of groups + bake | In the strategy definition | Groups + afterStageWaitInSeconds |
| Update strategy | The reusable stage order | az fleet updatestrategy create | Ordered stages |
| Update run | One execution of a strategy | az fleet updaterun create | Strategy ref + upgrade type |

An update strategy defines the stage order and the wait between stages; an update run executes it. Define the strategy once and reuse it.

# A strategy: dev first, soak 1 hour, then prod.
# --stages takes the path to a JSON file, so write the stage list out first.
cat > ring-rollout.json <<'EOF'
{
  "stages": [
    { "name": "dev",  "groups": [{ "name": "dev"  }], "afterStageWaitInSeconds": 3600 },
    { "name": "prod", "groups": [{ "name": "prod" }] }
  ]
}
EOF

az fleet updatestrategy create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name ring-rollout \
  --stages ring-rollout.json

# An update run that refreshes only the node image on every member cluster
az fleet updaterun create \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name run-2026-05 \
  --update-strategy-name ring-rollout \
  --upgrade-type NodeImageOnly

az fleet updaterun start \
  --resource-group rg-fleet \
  --fleet-name fleet-platform \
  --name run-2026-05

--upgrade-type accepts the same conceptual split as a single cluster, applied fleet-wide:

| --upgrade-type | What it upgrades | Risk | Use for | Maps to single-cluster |
|---|---|---|---|---|
| Full | Kubernetes version + node image | Highest | Coordinated minor rollout | az aks upgrade (no flag) |
| ControlPlaneOnly | Control plane only | Low | Take the cheap bump fleet-wide | --control-plane-only |
| NodeImageOnly | Node image only | Lowest | Weekly CVE refresh across clusters | --node-image-only |
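If the same pipeline drives both fleet runs and one-off single-cluster upgrades, the mapping in the table can live in one small function. A minimal sketch; single_cluster_flag is a hypothetical helper name, and it only translates strings, it does not call az.

```shell
# Translate a Fleet --upgrade-type into the matching az aks upgrade flag.
# Full maps to an empty string because az aks upgrade needs no extra flag.
single_cluster_flag() {
  case "$1" in
    Full)             echo "" ;;
    ControlPlaneOnly) echo "--control-plane-only" ;;
    NodeImageOnly)    echo "--node-image-only" ;;
    *) echo "unknown upgrade type: $1" >&2; return 1 ;;
  esac
}

for t in Full ControlPlaneOnly NodeImageOnly; do
  printf '%-17s -> az aks upgrade %s\n' "$t" "$(single_cluster_flag "$t")"
done
```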

The afterStageWaitInSeconds between stages is your fleet-wide soak: dev takes the new image, you watch dashboards for an hour, and only then does prod proceed. A failed stage halts the run, so a regression caught in dev never reaches prod.

A sane multi-ring strategy and what each ring buys you:

| Stage (ring) | Groups | Bake (afterStageWaitInSeconds) | Purpose | Halt-on-failure effect |
|---|---|---|---|---|
| dev | dev clusters | 3600 (1h) | Catch obvious breakage cheaply | Stops before staging |
| staging | staging clusters | 14400 (4h) | Soak under prod-like load | Stops before prod |
| prod-canary | one prod cluster | 7200 (2h) | Real traffic, limited blast | Stops before full prod |
| prod | remaining prod | none (last stage) | Full rollout | Run completes |
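The four-ring table can be generated rather than hand-edited, which also makes the total bake time explicit. This is a sketch under stated assumptions: stages_json and total_bake_seconds are hypothetical helpers, each stage is assumed to hold a single update group named after the stage, and the top-level "stages" key is an assumed file shape that you should check against your CLI version.

```shell
# Build a stages document from "name:bakeSeconds" pairs (0 means no wait)
# and add up the bake time the strategy imposes between rings.
stages_json() {
  local pair name bake first=1
  printf '{ "stages": [\n'
  for pair in "$@"; do
    name="${pair%%:*}"; bake="${pair##*:}"
    [ "$first" -eq 1 ] || printf ',\n'     # comma before every stage but the first
    first=0
    printf '    { "name": "%s", "groups": [{ "name": "%s" }]' "$name" "$name"
    [ "$bake" -gt 0 ] && printf ', "afterStageWaitInSeconds": %d' "$bake"
    printf ' }'
  done
  printf '\n  ] }\n'
}

total_bake_seconds() {
  local pair total=0
  for pair in "$@"; do total=$(( total + ${pair##*:} )); done
  echo "$total"
}

set -- dev:3600 staging:14400 prod-canary:7200 prod:0
stages_json "$@"
echo "bake time between first and last ring: $(( $(total_bake_seconds "$@") / 3600 ))h"
```

For the table above the bake time sums to 7 hours, which is the minimum wall-clock length of a run before upgrade time itself is counted.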

Validating an upgrade

Upgrades fail in two ways: removed APIs and behavioral regressions. Check both — before and after.

Deprecated/removed API detection. Each Kubernetes minor removes APIs. AKS will warn (and can block) an upgrade if in-cluster objects or recent API traffic use APIs slated for removal in the target version. Surface these ahead of time:

# AKS-side: deprecation warnings reported by the control plane,
# including API usage seen in the last ~12h of audit logs
az aks get-upgrades \
  --resource-group rg-platform \
  --name aks-prod-eastus \
  --output table

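# Cluster-side: the API server counts requests to deprecated APIs in a metric
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Cluster-side: any Deprecated-reason events your tooling emits
kubectl get events -A --field-selector reason=Deprecated 2>/dev/null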

Removed-API breakage is the most common cause of a “successful upgrade, broken app.” Run a static check (e.g. pluto or kubent) against your manifests and Helm releases in CI, and gate the upgrade PR on it.

A reference of notable API removals to scan for before a minor bump — confirm against the release notes for your exact target, but these are the ones that bite teams most often:

| Removed API (old) | Replacement (current) | Removed around | What uses it | How to find it |
|---|---|---|---|---|
| policy/v1beta1 PodDisruptionBudget | policy/v1 | 1.25 | Old PDB manifests | kubent; kubectl get pdb -o yaml |
| batch/v1beta1 CronJob | batch/v1 | 1.25 | Legacy CronJobs | pluto detect-files |
| networking.k8s.io/v1beta1 Ingress | networking.k8s.io/v1 | 1.22 | Old Ingress objects | kubent; controller logs |
| policy/v1beta1 PodSecurityPolicy | Pod Security Admission | 1.25 | PSP (fully removed) | Replace with PSA labels |
| autoscaling/v2beta2 HPA | autoscaling/v2 | 1.26 | Old HPA manifests | pluto; kubectl get hpa -o yaml |
| flowcontrol.apiserver.k8s.io/v1beta2 | …/v1 | 1.29 | APF config | Cluster-internal; rare in user manifests |
| certificates.k8s.io/v1beta1 CertificateSigningRequest | certificates.k8s.io/v1 | 1.22 | Old cert workflows | kubent |
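As a first line of defence before pluto or kubent is wired into CI, the removals above can be approximated with a plain grep over the manifest tree. This is a crude sketch, not a replacement for those tools: scan_removed_apis is a hypothetical helper, it only sees unquoted apiVersion lines in .yaml/.yml files, and it knows nothing about Helm releases or live objects.

```shell
# Flag manifests that still declare an apiVersion from the removal list.
# Prints file:line:match for each hit; exit status 0 means something was found.
scan_removed_apis() {   # scan_removed_apis <directory>
  grep -rnE --include='*.yaml' --include='*.yml' \
    '^[[:space:]]*apiVersion:[[:space:]]*(policy/v1beta1|batch/v1beta1|networking\.k8s\.io/v1beta1|autoscaling/v2beta2|certificates\.k8s\.io/v1beta1)[[:space:]]*$' \
    "$1"
}

# Demo against a throwaway directory with one stale and one current manifest.
demo="$(mktemp -d)"
printf 'apiVersion: batch/v1beta1\nkind: CronJob\n' > "$demo/old-cron.yaml"
printf 'apiVersion: batch/v1\nkind: CronJob\n'      > "$demo/new-cron.yaml"
if scan_removed_apis "$demo"; then
  echo "removed APIs declared: fail the PR"
else
  echo "no removed APIs declared"
fi
rm -rf "$demo"
```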

Where each detection signal comes from, and what it catches:

| Signal | Source | Catches | Run it |
|---|---|---|---|
| az aks get-upgrades warnings | AKS control plane (audit ~12h) | APIs recently called in-cluster | Before the upgrade |
| pluto detect-files | Static scan of manifests/Helm | APIs declared in YAML/charts | In CI, on the PR |
| kubent (kube-no-trouble) | Live cluster + manifests | Both live objects and files | Pre-upgrade + CI |
| kubectl get events reason=Deprecated | API server warnings | Deprecation warnings already emitted | Spot check |
| Microsoft Defender for Cloud | Defender recommendations blade | Clusters on deprecated API versions | Continuous posture |

Smoke tests. After the control plane and at least one node pool are on the new version, run synthetic checks against real user paths, not just kubectl get nodes.

# Nodes Ready and on the expected version
kubectl get nodes -o wide

# No pods stuck after the reimage (Succeeded pods are completed Jobs, which are fine)
kubectl get pods -A --no-headers \
  --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null \
  | grep . || echo "all pods healthy"

# Hit a real ingress path end to end
curl -fsS https://api.kloudvin.example/healthz && echo OK
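Right after a reimage, an ingress path can fail once or twice while endpoints settle, so a smoke check is usually wrapped in a bounded retry rather than run once. A minimal sketch; retry is a hypothetical helper, and the attempt count and pause are values you would tune.

```shell
# Run a check up to <attempts> times, pausing between tries.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {   # retry <attempts> <pause_seconds> <command...>
  local attempts="$1" pause="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
    [ "$i" -lt "$attempts" ] && sleep "$pause"
    i=$(( i + 1 ))
  done
  return 1
}

# Real use: retry 5 10 curl -fsS https://api.kloudvin.example/healthz
# Offline demo with a command that always succeeds:
retry 3 0 true && echo "check passed"
```

A bounded retry keeps the gate honest: a path that is still failing after the last attempt fails the validation job instead of being retried forever.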

The post-upgrade validation matrix — what to check, the command, and the pass criterion:

| Check | Command | Pass criterion | Fails when |
|---|---|---|---|
| Control-plane version | az aks show --query currentKubernetesVersion | Equals target | CP upgrade didn’t complete |
| Pool K8s version | az aks nodepool list --query "[].currentOrchestratorVersion" | Equals target | Pool not yet upgraded |
| Pool node-image version | az aks nodepool list --query "[].nodeImageVersion" | Latest weekly | Stale image, CVEs open |
| Nodes Ready | kubectl get nodes | All Ready, right version | Node stuck NotReady post-reimage |
| Pods healthy | kubectl get pods -A (non-Running) | None stuck | Pod won’t reschedule (PDB/affinity) |
| Ingress path | curl -fsS https://.../healthz | 200 OK | App regressed on the new version |
| Fleet run state | az fleet updaterun show --query status.state | Completed | A stage halted on failure |
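The version rows of the matrix reduce to "does each observed value equal the target", which is easy to turn into a gate for the validation job. A sketch under assumptions: check_versions is a hypothetical helper, it takes plain name=version strings so it can be fed from the --query output above or from a fixture, and the versions in the demo are made up.

```shell
# Report every component against the target; return non-zero if any differs.
check_versions() {   # check_versions <target> <name=version>...
  local target="$1" item name ver rc=0
  shift
  for item in "$@"; do
    name="${item%%=*}"; ver="${item#*=}"
    if [ "$ver" = "$target" ]; then
      echo "PASS $name on $ver"
    else
      echo "FAIL $name on $ver, expected $target"
      rc=1
    fi
  done
  return "$rc"
}

# Fixture values standing in for real query output (hypothetical versions).
check_versions 1.32.3 control-plane=1.32.3 workloads=1.32.3 system=1.31.7 \
  || echo "upgrade not finished: at least one component trails the target"
```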

Architecture at a glance

The diagram traces an upgrade as it actually flows, left to right, and pins each failure class onto the exact stage where it bites. Read it as a pipeline. On the left, the trigger and gates: an operator or CI job issues az aks upgrade (with a removed-API gate already green from pluto/kubent), and a maintenance window decides whether the work may even start now. The request moves to the control plane, where the managed API server bumps one minor (1.31 → 1.32) and runs its own API-removal checks against ~12 hours of audit logs. From there it fans into the node pools: a surge batch (max-surge 33%) adds capacity, a PDB/drain gate (minAvailable 80%, drain-timeout 30m) paces the eviction, and the node image carries the weekly OS CVE fixes. The fleet orchestration zone wraps many clusters: Fleet Manager runs an update run through ordered stages, and a stage gate holds dev for an hour of bake time before prod proceeds. On the right, verify — smoke tests and observability confirm nodes are Ready on the expected image — with a blue-green pool offering the kubectl uncordon rollback that arcs back to the node pools.

Notice the five numbered badges, each on the stage where a day-2 upgrade most often stalls or breaks: (1) removed-API breakage at the control-plane checks (a “successful” upgrade with a broken app); (2) a PDB stalling the drain when minAvailable equals the replica count and there is no headroom; (3) surge blast radius when 100% reimages the whole pool before a bad image shows in monitoring; (4) a stale node image left behind when the OS channel is None; and (5) a fleet stage that never bakes because afterStageWaitInSeconds is zero, so dev and prod break together. The whole method is in the legend: localise the symptom to a stage, read the cause, run the named confirm command, apply the fix. The first question on every stalled upgrade is “which stage is it stuck in — and is it hung or failing?” The badge you land on tells you which knob to reach for.

Figure: AKS upgrade control path drawn left to right as a pipeline. An operator or CI trigger (gated by a removed-API check) and a maintenance window start the upgrade; the managed control plane bumps one Kubernetes minor (1.31 to 1.32) and runs removed-API checks against recent audit logs; node pools then surge a batch at 33 percent, drain under a PodDisruptionBudget with a 30-minute drain timeout, and pick up the weekly node image with OS CVE fixes; Azure Kubernetes Fleet Manager runs an update run through ordered stages with a stage gate that bakes dev for an hour before prod; verification runs smoke tests and observability, with a blue-green parallel pool offering a kubectl uncordon rollback. Five numbered badges mark removed-API breakage at the control-plane checks, a PodDisruptionBudget stalling the drain, surge blast radius at 100 percent, a stale node image left behind, and a fleet stage that never bakes.

Real-world scenario

Meridian Pay, a fictional but representative payments platform, ran 30+ AKS clusters across three regions under a single Fleet Manager. The platform team was six engineers; the workloads were latency-sensitive payment APIs with hard SLOs and a quarter-end change freeze. Their monthly AKS spend across the fleet was about ₹46 lakh, and their upgrade posture was, on paper, mature: cluster channel patch, node-OS channel NodeImage, a Fleet update strategy with dev-then-prod and an hour of bake time.

The incident began on a routine Tuesday. The team kicked off a Fleet update run with --upgrade-type Full to move the fleet from 1.31 to 1.32, dev first. The dev stage went green in forty minutes — nodes reimaged, pods rescheduled, smoke tests passed. After the hour of bake, prod began and stalled: every prod cluster’s node-pool upgrade hung in Upgrading, never finishing, never failing. The Fleet run sat blocked for hours with no error, just a status that would not advance. The on-call engineer’s first instinct — re-run the stage — did nothing, because the operation was not failed, it was stuck.

The cause was a PDB nobody connected to upgrades. A platform service — a Deployment of a regional rate-limiter — ran exactly 3 replicas behind an anti-affinity rule (one per zone) with minAvailable: 100%. In dev the pools had spare zones, so a replacement scheduled and the drain proceeded. Prod pools were packed to capacity in all three zones, so when AKS cordoned a node, the evicted rate-limiter pod had nowhere to land that satisfied anti-affinity, and minAvailable: 100% refused to drop below 3. The eviction API blocked indefinitely, and with it the whole batch. The diagnosis was one command: kubectl get pdb -n edge showed ALLOWED DISRUPTIONS: 0, and kubectl get nodes showed a node stuck SchedulingDisabled for over an hour.

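The fix was twofold. Immediately, relax the budget so the drain could breathe. Kubernetes rounds a percentage minAvailable up to a whole pod, so 67% of 3 replicas still resolves to 3; 66% resolves to 2 and frees one disruption:

kubectl patch pdb ratelimiter-pdb -n edge \
  --type merge -p '{"spec":{"minAvailable":"66%"}}'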

That unblocked the in-flight batch within minutes. The durable fix was a node-pool --drain-timeout 30 so a stuck eviction surfaces as a failed batch — which halts the Fleet stage before prod — instead of an invisible hang, plus an OPA Gatekeeper policy rejecting any PDB whose minAvailable equals the workload’s replica count, applied fleet-wide via the GitOps pipeline. They also added a --node-soak-duration 5 so a bad image would surface in Grafana between batches, and a synthetic smoke test on a real payment path into the post-upgrade validation job. The next quarter’s 1.32 → 1.33 run completed across all 30 clusters with zero stalls. The lesson on the wall: “Validate PDBs against real prod headroom, not dev’s spare capacity — and make a stuck drain fail loudly, not silently.”

The incident as a timeline, because the order of moves is the lesson:

| Time | Symptom | Action taken | Effect | What it should have been |
|---|---|---|---|---|
| T+0 | Dev stage green | Bake 1h, then prod begins | | (correct so far) |
| T+1h05 | Prod pools hang Upgrading | Wait (maybe it’s slow) | No progress | Ask: hung or failing? |
| T+1h40 | Still stuck, no error | Re-run the stage | Nothing (not failed) | Don’t re-run a stuck op |
| T+2h | Root cause hunt | kubectl get pdb -n edge → ALLOWED DISRUPTIONS 0 | Cause found | This was the breakthrough |
| T+2h10 | Mitigated | kubectl patch pdb … minAvailable 66% | Batch unblocks in minutes | Correct night-of fix |
| +1 day | Durable fix | --drain-timeout 30 + --node-soak-duration 5 + Gatekeeper PDB policy | Stuck drains now fail loudly | The actual fix is procedural |
| +1 quarter | Validated | 1.32 → 1.33 across 30 clusters | Zero stalls | Boring upgrade achieved |

Advantages and disadvantages

The managed-control-plane, surge-and-drain, fleet-orchestrated model both enables safe upgrades and hides sharp edges. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
|---|---|
| Microsoft runs the control-plane upgrade: fast, HA, no workload disruption | You can’t roll the control plane back; it’s forward-only |
| Decoupling CP from node pools lets you take the cheap bump now, schedule the costly one | Easy to forget the node pools entirely and run a stale image |
| Surge + PDB make a node-pool upgrade invisible when tuned right | An unsatisfiable PDB stalls the drain forever with no error |
| Auto-upgrade channels keep you patched without manual toil | Two independent channels are constantly confused; one left None = open CVEs |
| Maintenance windows + notAllowedDates enforce change freezes | Unset windows cede the timing decision to the platform |
| Fleet Manager bakes dev before prod; a failed stage halts the run | Zero bake time breaks every cluster at once: slower, not safer |
| Blue-green pools give a one-command rollback for high-risk changes | Double capacity cost for the migration window |
| N-2 support is a clear, predictable contract | Lapse it and you get a force-upgrade on Microsoft’s schedule, not yours |

The model is right for any team running AKS at scale that wants patched, supported clusters without hand-rolling upgrade tooling — and the built-in surge, channel, window, and fleet controls cover the vast majority of cases. It bites hardest on teams that validate PDBs against dev headroom, leave the node-OS channel None, or run a fleet with zero bake time. Every disadvantage is manageable — but only if you know it exists, which is the point of this runbook.

Hands-on lab

Reproduce a stalled node-pool upgrade caused by an unsatisfiable PDB, watch it hang, then fix it — all on a small, cheap cluster you delete at the end. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-aks-day2-lab
LOC=eastus
AKS=aks-day2-$RANDOM
az group create -n $RG -l $LOC -o table

Step 2 — Create a small cluster one minor behind the latest (so you have something to upgrade to).

# Pick the second-newest GA version so an upgrade target exists
PREV=$(az aks get-versions -l $LOC --query "values[?isPreview==null].patchVersions | [1].keys(@) | [0]" -o tsv 2>/dev/null || echo 1.31.0)
az aks create -g $RG -n $AKS --node-count 2 --kubernetes-version "$PREV" \
  --node-vm-size Standard_B2s --generate-ssh-keys -o table
az aks get-credentials -g $RG -n $AKS --overwrite-existing

Expected: a cluster on $PREV, two nodes Ready.

Step 3 — Deploy a single-replica app with an unsatisfiable PDB (reproduce the trap).

kubectl create deployment doomed --image=nginx:1.27 --replicas=1
kubectl create poddisruptionbudget doomed-pdb --selector=app=doomed --min-available=1
kubectl get pdb doomed-pdb   # ALLOWED DISRUPTIONS should be 0 — the smoking gun

A 1-replica Deployment with minAvailable: 1 can never tolerate an eviction.
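The arithmetic behind ALLOWED DISRUPTIONS is worth having at your fingertips, because percentage budgets round up to a whole pod and that rounding is easy to get wrong. A minimal sketch, assuming every replica is healthy; allowed_disruptions is a hypothetical helper that mirrors the calculation, it does not query a cluster.

```shell
# Allowed disruptions for a minAvailable budget when all replicas are healthy.
# A percentage is rounded UP to a whole pod before it is compared.
allowed_disruptions() {   # allowed_disruptions <replicas> <minAvailable: N or N%>
  local replicas="$1" min="$2" pct required allowed
  case "$min" in
    *%) pct="${min%?}"                                   # strip the trailing %
        required=$(( (replicas * pct + 99) / 100 )) ;;   # ceil(replicas * pct / 100)
    *)  required="$min" ;;
  esac
  allowed=$(( replicas - required ))
  [ "$allowed" -lt 0 ] && allowed=0
  echo "$allowed"
}

allowed_disruptions 1 1      # the lab's doomed app
allowed_disruptions 3 100%   # three replicas, 100% budget
allowed_disruptions 3 67%    # 2.01 rounds up to 3 required
allowed_disruptions 3 66%    # 1.98 rounds up to 2 required
allowed_disruptions 2 50%    # two replicas, half must stay up
```

The first three cases leave nothing to evict, so a drain blocks; only the last two free one pod. Checking a budget this way against real replica counts is cheaper than discovering the answer mid-upgrade.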

Step 4 — See what you can upgrade to, then start a node-pool upgrade.

az aks get-upgrades -g $RG -n $AKS -o table
TARGET=$(az aks get-upgrades -g $RG -n $AKS \
  --query "controlPlaneProfile.upgrades[-1].kubernetesVersion" -o tsv)

# Set a short drain timeout so the stuck drain FAILS instead of hanging forever
az aks nodepool update -g $RG --cluster-name $AKS --nodepool-name nodepool1 \
  --drain-timeout 5 2>/dev/null || true

# Kick the upgrade (run it; it will struggle to drain the doomed pod)
az aks upgrade -g $RG -n $AKS --kubernetes-version "$TARGET" --yes

Step 5 — Watch it stall on the eviction. In a second Cloud Shell tab:

watch -n 5 'kubectl get nodes; echo; kubectl get pdb doomed-pdb; echo; \
  kubectl get events -A --sort-by=.lastTimestamp 2>/dev/null | tail -5'
# A node goes SchedulingDisabled; the doomed pod won't evict (ALLOWED DISRUPTIONS 0)

With a long drain timeout the stall reads as a hang with no error; with --drain-timeout 5 the batch fails once the timeout expires, which is the point: a stuck drain should fail loudly.

Step 6 — Fix the PDB so the drain can proceed.

# Either scale the app to 2+ replicas, or relax the budget
kubectl scale deployment doomed --replicas=2
kubectl patch pdb doomed-pdb --type merge -p '{"spec":{"minAvailable":"50%"}}'
kubectl get pdb doomed-pdb   # ALLOWED DISRUPTIONS now > 0

# Re-run the upgrade; the drain now proceeds
az aks upgrade -g $RG -n $AKS --kubernetes-version "$TARGET" --yes

Step 7 — Verify the upgrade landed, including the node image.

az aks show -g $RG -n $AKS --query currentKubernetesVersion -o tsv
az aks nodepool list -g $RG --cluster-name $AKS \
  --query "[].{name:name, k8s:currentOrchestratorVersion, image:nodeImageVersion}" -o table
kubectl get nodes -o wide

Expected: control plane and pool on $TARGET, nodes Ready, a recent nodeImageVersion.

Validation checklist. You reproduced an upgrade stall purely from an unsatisfiable PDB, confirmed it with ALLOWED DISRUPTIONS: 0 and a SchedulingDisabled node, made it fail instead of hang with --drain-timeout, and fixed it by giving the workload headroom. No control-plane magic — exactly the point. The lab steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 3 | 1-replica app + minAvailable: 1 PDB | The unsatisfiable-PDB trap is real | Single-replica services in prod |
| 4 | --drain-timeout 5 then upgrade | A stuck drain can be made to fail, not hang | Meridian Pay’s durable fix |
| 5 | Watch ALLOWED DISRUPTIONS 0 | The exact confirming signal exists | The 2-minute diagnosis |
| 6 | Scale to 2 + relax PDB | The fix is workload headroom, not platform | The actual production fix |
| 7 | Check nodeImageVersion | The field teams forget | Closing OS CVEs |

Cleanup (avoid lingering charges).

az group delete -n $RG --yes --no-wait

Cost note. Two Standard_B2s nodes plus a Free-tier control plane is a few rupees per hour; an hour of this lab is well under ₹100, and deleting the resource group stops everything.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read mid-change, then the same entries with the full confirm-command detail underneath.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | Node-pool upgrade hangs in Upgrading, never finishes or fails | Unsatisfiable PDB (minAvailable = replicas, or 100% with no headroom) | kubectl get pdb -A shows ALLOWED DISRUPTIONS: 0; node SchedulingDisabled | 2+ replicas; minAvailable as %; --drain-timeout 30 so it fails loudly |
| 2 | “Successful upgrade,” app now broken (404s/CrashLoop) | Removed API the manifests still use | az aks get-upgrades warnings; pluto detect-files; kubent | Migrate manifests/Helm to the current API; gate the PR |
| 3 | Whole pool NotReady right after one batch | --max-surge 100% reimaged everything before a bad image showed | kubectl get nodes all new + NotReady; no soak between batches | 33–50% surge + --node-soak-duration so a bad image surfaces |
| 4 | Cluster on the right K8s version but OS CVEs flagged | Node image stale; OS channel left None | az aks nodepool list --query "[].nodeImageVersion" lags latest | Set --node-os-upgrade-channel NodeImage; run --node-image-only |
| 5 | Fleet run stuck; one stage never advances | A member cluster’s drain hung → stage can’t complete | az fleet updaterun show --query status.state; inspect the member cluster | Fix the member’s PDB/drain; --drain-timeout so the stage fails not hangs |
| 6 | Upgrade refused: “node pool version too far behind” | Node pool > 1 minor behind the control plane | az aks nodepool list --query "[].currentOrchestratorVersion" vs CP | Upgrade the pool one minor at a time to catch up |
| 7 | get-upgrades offers no newer version | Already on latest GA, or on a preview/unsupported minor | az aks show --query "{v:currentKubernetesVersion,sku:sku.tier}" | Nothing to do, or move off preview to a GA minor |
| 8 | Auto-upgrade fired mid-business-day | No maintenance window bound to the channel | az aks maintenanceconfiguration list returns empty | Add aksManagedAutoUpgradeSchedule + node-OS schedule |
| 9 | Upgrade ran during a change freeze | notAllowedDates not configured | Activity log shows an upgrade in the freeze window | Use --config-file with notAllowedDates ranges |
| 10 | Pods stuck Pending during upgrade | Surge nodes not provisioning (quota / SKU unavailable) | kubectl get events; az vm list-usage for the region quota | Raise vCPU quota; smaller surge; alternate SKU |
| 11 | Drain blocked on a DaemonSet | Forgot --ignore-daemonsets (manual blue-green only) | kubectl drain error names the DaemonSet pod | Add --ignore-daemonsets (and --delete-emptydir-data) |
| 12 | Stateful pods lose data on reimage | emptyDir / local disk drained without persistence | Data gone after node replaced; --delete-emptydir-data used | Use PVCs / StatefulSets; never store state on the node |
| 13 | Upgrade slow to a crawl on a big pool | --max-surge left at default (1 node) | One batch at a time on a 100-node pool | Raise to 33–50% (within capacity/quota) |
| 14 | Control plane upgraded but cluster “version” unchanged in a tool | Tool reads pool version, not CP version | az aks show --query currentKubernetesVersion (CP) | Pools trail by 1 minor legitimately; upgrade pools in the window |

The expanded form, with the full reasoning for the entries that bite hardest:

1. Node-pool upgrade hangs in Upgrading, never finishing and never failing. Root cause: An unsatisfiable PodDisruptionBudget (minAvailable equal to the replica count, a 100% budget, or an anti-affinity rule with no spare zone in a packed pool), so the eviction API refuses the eviction and the batch blocks indefinitely. Confirm: kubectl get pdb -A shows ALLOWED DISRUPTIONS: 0 for the offending PDB; kubectl get nodes shows a node stuck SchedulingDisabled for far longer than a batch should take. Fix: Give the workload headroom (2+ replicas, and a minAvailable that never resolves to the full replica count) and set --drain-timeout 30 on the pool so a genuinely stuck drain surfaces as a failed batch (which halts a Fleet stage) instead of an invisible hang.

2. The upgrade reports success but the application is now broken. Root cause: The target minor removed an API your manifests or Helm charts still declare (e.g. policy/v1beta1 PDB, batch/v1beta1 CronJob, an old Ingress version), so those objects silently stop being served. Confirm: az aks get-upgrades surfaces deprecation warnings for APIs called in the last ~12h; pluto detect-files and kubent scan your YAML and live objects for removed versions. Fix: Migrate every object to the current API version before the upgrade and gate the upgrade PR on a green pluto/kubent run in CI.

3. A whole pool goes NotReady immediately after the first batch. Root cause: --max-surge 100% reimaged the entire pool in one batch, so a bad node image (or an incompatible kernel module) took down every node before monitoring could catch it. Confirm: kubectl get nodes shows all-new nodes, all NotReady, with no healthy old nodes left; the pool had no soak between batches. Fix: Drop surge to 33–50% and set --node-soak-duration so a bad image surfaces in dashboards between batches; for truly risky images, use a blue-green pool you can abandon.

4. The cluster is on the right Kubernetes version but security flags open OS CVEs. Root cause: The node image is stale — the Kubernetes version was upgraded (or auto-upgraded) but the node-OS channel was left None, so nodes never picked up the weekly image with OS fixes. Confirm: az aks nodepool list --query "[].{name:name,image:nodeImageVersion}" shows an image version well behind the latest weekly. Fix: Set --node-os-upgrade-channel NodeImage (bound to a window) and run a --node-image-only upgrade now to close the gap.

5. A Fleet update run is stuck and one stage never advances. Root cause: A member cluster’s node-pool drain hung (usually a PDB, per #1), and because a stage only completes when all its members do, the whole stage — and the run — stalls. Confirm: az fleet updaterun show --query status.state shows the run in progress on a stage that never completes; drilling into the member cluster reveals the hung pool. Fix: Fix the member’s PDB/drain; set --drain-timeout on member pools so a stuck drain fails the stage (halting the run before prod) instead of hanging it forever.

6. The upgrade is refused with “node pool version too far behind.” Root cause: A node pool is more than one minor behind the control plane (the AKS skew rule allows at most one), often because the CP was upgraded twice while the pool was left alone. Confirm: Compare az aks nodepool list --query "[].currentOrchestratorVersion" against az aks show --query currentKubernetesVersion. Fix: Upgrade the lagging pool one minor at a time until it is within one minor of the control plane.

10. Pods stuck Pending during the upgrade because surge nodes won’t provision. Root cause: AKS tried to add surge nodes but hit a regional vCPU quota or SKU-unavailable condition, so the new capacity never appeared and evicted pods have nowhere to go. Confirm: kubectl get events shows FailedScheduling; az vm list-usage -l <region> shows the vCPU family at its limit. Fix: Request a quota increase for the node SKU’s vCPU family, lower --max-surge so fewer surge nodes are needed at once, or temporarily use an alternate available SKU for the pool.
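Entries 3, 10, and 13 all come down to the same surge arithmetic: how many extra nodes a percentage asks for, and how many batches the pool then needs. A small sketch, assuming the surge percentage rounds up to a whole node and is greater than zero; surge_nodes and batches are hypothetical helper names.

```shell
# Surge nodes requested by a percentage, and batches needed to roll the pool.
surge_nodes() {   # surge_nodes <pool_size> <surge_percent>
  echo $(( ($1 * $2 + 99) / 100 ))          # ceil(pool * percent / 100)
}
batches() {       # batches <pool_size> <surge_percent>
  local n
  n="$(surge_nodes "$1" "$2")"
  echo $(( ($1 + n - 1) / n ))              # ceil(pool / surge_nodes)
}

for pct in 10 33 50 100; do
  echo "100 nodes at ${pct}% surge: $(surge_nodes 100 "$pct") per batch, $(batches 100 "$pct") batches"
done
```

The per-batch figure is also the extra node count to check against regional vCPU quota before the upgrade starts, and a single batch at 100% is exactly the case where a bad image reaches every node at once.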

Best practices

A quick decision table — match the situation to the move:

| If you need to… | Do this | Not this |
|---|---|---|
| Stay current with minimal risk now | --control-plane-only bump, schedule pools | Full upgrade mid-day |
| Close OS CVEs without API risk | --node-image-only / NodeImage channel | A full K8s minor bump |
| Upgrade a kernel-sensitive workload | Blue-green pool with rollback | In-place surge |
| Roll a fleet safely | Update strategy with bake time | A single run with zero wait |
| Catch removed APIs | pluto/kubent gate in CI | Trusting “upgrade succeeded” |
| Prevent a hung drain | 2+ replicas, % PDB, --drain-timeout | minAvailable: 1 on 1 replica |

Security notes

The security controls that also make upgrades safer — they pull in the same direction:

| Control | Mechanism | Secures against | Also prevents |
|---|---|---|---|
| Node-OS channel NodeImage/SecurityPatch | Auto OS/CVE patching | Unpatched node CVEs | “Forgot the image” CVE gap |
| Stay within N-2 | Support-window discipline | Unpatched API-server CVEs | Force-upgrade off your schedule |
| Scoped upgrade identity | AKS Contributor / custom role | Over-broad upgrade rights | Accidental destructive ops |
| Gatekeeper / Kyverno PDB+API policy | Admission control | Bad PDBs, removed APIs | Hung drains, broken upgrades |
| Defender for Containers | Posture recommendations | Deprecated APIs, stale images | Upgrading into known breakage |
| Activity-log / SIEM audit | Control-plane logging | Unauthorised upgrades | Untracked change |

Cost & sizing

A rough monthly picture for a mid-size fleet: the cost drivers for upgrades and what each one buys you:

| Cost driver | What you pay for | Rough INR / month | What it fixes / enables | Watch-out |
|---|---|---|---|---|
| Control-plane SLA (Standard tier) | Uptime SLA per cluster | ~₹7,000–8,000 / cluster | Control-plane uptime guarantee | Free tier has no SLA |
| Surge nodes (transient) | Extra node-hours during upgrade | A few hundred per upgrade | Faster, batched node reimaging | Peaks against vCPU quota |
| Blue-green second pool | Double pool for the window | 1× pool cost, hours–days | Reversible high-risk upgrade | Delete old pool after validation |
| LTS (extended support) | Premium add-on | Premium over Standard | Stay on a minor ~2 years | Specific minors only; costlier |
| Fleet Manager (hub-less) | Orchestration | Negligible | Staged, baked fleet rollouts | You pay for members anyway |
| Defender for Containers | Per-vCPU posture/runtime | ~₹1,000–2,000 / node-ish | Deprecated-API + image flags | Scales with node count |

Interview & exam questions

1. Why decouple the control-plane upgrade from the node-pool upgrade in production? The control-plane upgrade is fast, Microsoft-managed, and causes no workload disruption, while the node-pool upgrade reimages every VM with cordon-and-drain and can take hours. Decoupling (--control-plane-only) lets you take the cheap, low-risk bump immediately to stay current, and schedule the expensive, disruptive node reimaging for a maintenance window. Node pools may trail the control plane by at most one minor version.

2. What is the AKS N-2 support window, and what happens if you miss it? AKS supports the latest Kubernetes minor and the two behind it (N, N-1, N-2) for roughly twelve months from GA. Past N-2 a version drops to platform support (best-effort, no control-plane SLA, no K8s patches), and eventually the cluster is force-upgraded by Microsoft on its schedule. Keep a standing change ticket whenever the control plane falls more than one minor behind GA.

3. A node-pool upgrade hangs in Upgrading and never finishes or fails. What’s the most likely cause and how do you confirm it? An unsatisfiable PodDisruptionBudget (minAvailable equal to the replica count, a 100% budget, or anti-affinity with no spare capacity), so the eviction API refuses to evict pods from the node. Confirm with kubectl get pdb -A showing ALLOWED DISRUPTIONS: 0 and a node stuck SchedulingDisabled. Fix with 2+ replicas, a minAvailable that leaves at least one pod free to move, and --drain-timeout so a stuck drain fails loudly instead of hanging.

4. Difference between a node-image upgrade and a Kubernetes upgrade? A node-image upgrade reimages nodes onto the latest weekly image (OS, containerd, kubelet patch) at the same Kubernetes version — low risk, reimage only, for closing OS CVEs. A Kubernetes upgrade changes the minor/patch version and the API surface (deprecations, behavior changes) — higher risk. They have different cadences (weekly vs ~quarterly) and different auto-upgrade channels (NodeImage vs patch/stable/rapid).

5. What do the two auto-upgrade channels control, and what’s a safe production pair? --auto-upgrade-channel governs the Kubernetes version (none/patch/stable/rapid); --node-os-upgrade-channel governs the node image (None/Unmanaged/SecurityPatch/NodeImage). A safe production default is cluster channel patch and node-OS channel NodeImage, both bound to a maintenance window, with a human owning minor-version bumps. Leaving either at none/None patches one axis and exposes the other.

6. How does max surge affect an upgrade, and why avoid 100% in production? Max surge sets the batch size — how many nodes are added and reimaged per batch. Higher surge is faster but provisions more transient capacity and evicts more pods at once. 100% reimages the whole pool in a single batch, removing your ability to catch a bad node image before it has rolled every node. Use 33–50% in prod with --node-soak-duration so a bad image surfaces between batches.

7. What is a maintenance window for, and how do you enforce a change freeze? Planned Maintenance binds auto-upgrade activity to schedules you control (aksManagedAutoUpgradeSchedule for Kubernetes, aksManagedNodeOSUpgradeSchedule for the node image), each at least four hours. For a change freeze, use the --config-file form with notAllowedDates — date ranges where no maintenance starts even if it falls inside the recurring window.

8. When would you use a blue-green node pool instead of an in-place surge upgrade? For high-risk changes — a major OS-family change, a kernel-sensitive workload, or a node SKU swap — where you want a real rollback. You stand up a parallel pool on the new version, cordon and drain the old one so pods reschedule, validate, and either delete the old pool (point of no return) or kubectl uncordon it to roll back. It costs double capacity for the window but converts an irreversible reimage into a controlled cutover.
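The cutover in answer 8 can be sketched as three small steps. This assumes AKS labels each node with `agentpool=<pool name>` and that the control plane is already on the target version. The pool names, version, node count, and resource names are hypothetical. Commands are only printed by default (`DRY_RUN=1`).

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical names and versions; replace with your own.
RG=rg-platform
CLUSTER=aks-prod-01
OLD_POOL=blue
NEW_POOL=green
TARGET_VERSION=1.32

# Stand up the green pool, then move workloads off blue.
cutover() {
  run az aks nodepool add --resource-group "$RG" --cluster-name "$CLUSTER" \
    --name "$NEW_POOL" --kubernetes-version "$TARGET_VERSION" --node-count 3
  run kubectl cordon -l "agentpool=$OLD_POOL"
  run kubectl drain -l "agentpool=$OLD_POOL" --ignore-daemonsets --delete-emptydir-data
}

# Roll back: make the old pool schedulable again.
rollback() {
  run kubectl uncordon -l "agentpool=$OLD_POOL"
}

# Commit (point of no return): delete the old pool.
commit() {
  run az aks nodepool delete --resource-group "$RG" --cluster-name "$CLUSTER" \
    --name "$OLD_POOL"
}

# In dry-run mode, print the whole plan. For a real change, call cutover,
# validate, then choose exactly ONE of rollback or commit.
if [ "${DRY_RUN:-1}" = "1" ]; then cutover; rollback; commit; fi
```

Keeping rollback and commit as separate functions mirrors the decision point in the answer: until `commit` runs, the old pool still exists and `rollback` stays available.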

9. How does Azure Kubernetes Fleet Manager prevent a regression from reaching prod? Fleet Manager runs an update run through ordered stages of groups (dev → staging → prod) with bake time (afterStageWaitInSeconds) between each. A failed stage halts the run, so a regression caught in dev never proceeds to prod. Zero bake time defeats the safety — it’s just a slower way to break every cluster at once.
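A staged run with bake time, as described in answer 9, might be defined like this. It is a sketch that assumes the `fleet` CLI extension is installed and that the member clusters are already assigned to update groups named `dev`, `staging`, and `prod`. The group names, bake durations, version, and resource names are hypothetical. Commands are only printed by default (`DRY_RUN=1`).

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical names; replace with your own.
RG=rg-fleet
FLEET=platform-fleet

# Three ordered stages. The waits (1 hour, 24 hours) are illustrative
# bake times, not recommendations.
stages=$(mktemp)
cat > "$stages" <<'EOF'
{
  "stages": [
    { "name": "dev",     "groups": [ { "name": "dev" } ],     "afterStageWaitInSeconds": 3600 },
    { "name": "staging", "groups": [ { "name": "staging" } ], "afterStageWaitInSeconds": 86400 },
    { "name": "prod",    "groups": [ { "name": "prod" } ] }
  ]
}
EOF

# Create the update run from the stage definition.
run az fleet updaterun create \
  --resource-group "$RG" --fleet-name "$FLEET" \
  --name upgrade-to-1-32 \
  --upgrade-type Full --kubernetes-version 1.32.0 \
  --stages "$stages"
```

The final stage carries no wait because nothing follows it; the safety comes from the non-zero waits after dev and staging.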

10. After an upgrade reports success the app breaks. What’s the usual cause and how do you prevent it? The target minor removed an API the manifests/Helm charts still use (e.g. policy/v1beta1 PDB, batch/v1beta1 CronJob, an old Ingress version). Detect it ahead of time with az aks get-upgrades warnings (recent API usage), and gate the upgrade PR on pluto/kubent static scans in CI. Migrate every object to the current API version before upgrading.

11. You upgraded a cluster but security still flags open OS CVEs. Why? The node image is stale — the Kubernetes version moved but the node-OS channel was None, so nodes never picked up the weekly image with OS fixes. Confirm with az aks nodepool list --query "[].nodeImageVersion" lagging the latest. Fix by setting the node-OS channel to NodeImage and running a --node-image-only upgrade.

12. What does --drain-timeout buy you on a node pool? It bounds how long a node waits to evict pods before the batch fails. Without it, a stuck eviction (e.g. an unsatisfiable PDB) hangs the upgrade indefinitely with no error — and in a Fleet run, blocks the whole stage. With it, a genuinely stuck drain surfaces as a failed batch, which halts a Fleet stage before prod and gives you a signal to act on.

These map to CKA (cluster upgrades, the kubeadm/managed upgrade flow, PDBs and drains), AZ-104 / AZ-305 (AKS lifecycle and operations on Azure), and the AKS-specialty knowledge in the Azure Kubernetes learning paths. A compact cert-mapping for revision:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Cluster upgrade flow, version skew | CKA | Cluster maintenance & upgrades |
| PDBs, cordon/drain, disruptions | CKA / CKAD | Workloads & scheduling; disruptions |
| AKS channels, windows, Fleet | AZ-104 / AZ-305 | Manage & operate AKS |
| Node images, CVEs, posture | AZ-500 | Secure compute; container security |
| Removed APIs, deprecations | CKA | API lifecycle; upgrade readiness |

Quick check

  1. You’re on Kubernetes 1.30 (N-2) and want to reach 1.32 (N). How many upgrade hops does AKS require, and why?
  2. A node-pool upgrade has been stuck in Upgrading for an hour with no error. What single kubectl command confirms the most likely cause, and what does a healthy value look like?
  3. True or false: setting --max-surge 100% is the safest way to upgrade a production node pool quickly.
  4. Your cluster reports Kubernetes 1.32 but a CVE scan flags open OS vulnerabilities on the nodes. What setting was almost certainly wrong, and how do you fix it?
  5. A Fleet update run completed dev but then broke prod with the same regression. What field in the update strategy would have caught it, and what does it do?

Answers

  1. Two hops — 1.30 → 1.31 → 1.32. AKS (and Kubernetes) only allow a one-minor jump at a time, so a multi-minor catch-up is sequential; az aks get-upgrades will not offer the skip.
  2. kubectl get pdb -A — look at ALLOWED DISRUPTIONS. A stuck drain almost always shows ALLOWED DISRUPTIONS: 0 on some PDB (an unsatisfiable budget); a healthy value is ≥ 1, meaning a pod can be evicted so the drain can proceed. (kubectl get nodes will also show a node SchedulingDisabled.)
  3. False. 100% reimages the entire pool in one batch, so a bad node image takes down every node before monitoring catches it — you lose all blast-radius control. Use 33–50% with --node-soak-duration so a bad image surfaces between batches.
  4. The node-OS upgrade channel was left None, so the node image went stale while the Kubernetes version advanced. Confirm with az aks nodepool list --query "[].nodeImageVersion" lagging the latest; fix by setting --node-os-upgrade-channel NodeImage and running a --node-image-only upgrade.
  5. afterStageWaitInSeconds (bake time) between the dev and prod stages. It holds the run after dev so you can watch dashboards before prod proceeds; a failed stage halts the run, so a regression caught in dev never reaches prod. Zero bake means dev and prod break together.
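The sequential-hop rule from answer 1 can be sketched as a tiny planner. It assumes versions are given as `major.minor` with no patch component. The function name `upgrade_hops` is a hypothetical helper, not an az command.

```shell
# upgrade_hops FROM TO -> one target version per line, one minor at a time.
# Assumes "major.minor" input (no patch component) and the same major.
upgrade_hops() {
  major=${1%%.*}
  from_minor=${1#*.}
  to_minor=${2#*.}
  minor=$((from_minor + 1))
  while [ "$minor" -le "$to_minor" ]; do
    echo "$major.$minor"
    minor=$((minor + 1))
  done
}

# Plan the catch-up from the quick-check question.
upgrade_hops 1.30 1.32
```

Each printed line is one upgrade to run and verify before starting the next, so the 1.30 to 1.32 case yields two lines and therefore two hops.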

Glossary

Next steps

You can now make any AKS upgrade boring — decoupled, surge-tuned, windowed, and reversible across a fleet. Build outward:
