
VM Scale Sets with Flexible Orchestration: Azure Image Builder, Compute Gallery, and Automatic Rolling Upgrades

Most teams that run IaaS at scale on Azure are still operating VM Scale Sets the way they did in 2019: Uniform orchestration, a Marketplace image plus a 400-line cloud-init that re-runs on every boot, and “upgrades” that mean tearing the whole set down on a Friday. That model fights you on three fronts. Boot is slow and non-deterministic because you build the box every time it starts. You have no immutable, versioned artifact to roll back to. And you have no safe, health-gated mechanism to push a new image without a maintenance window.

The modern shape fixes all three. Flexible orchestration gives you the placement control of an availability set with the scale machinery of a VMSS. Azure Image Builder bakes a golden image once, in a pipeline, so boot is fast and identical. Azure Compute Gallery versions and replicates that image like any other artifact. And rolling upgrades, gated by the Application Health extension, replace instances batch by batch and stop the moment health regresses. This is the principal-level walkthrough of wiring those four together correctly — every setting, every limit, every failure mode laid out as a table you keep open while you build.

By the end you will stop hand-waving about “we’ll bake an image and roll it out.” You will know which orchestration mode to pick and why it is irreversible, how to scope the build identity so a compromised pipeline cannot wreck your subscription, how replication geography becomes your canary lever, the exact difference between the upgrade policy mode and automatic OS upgrade that everyone conflates, and how to design a health probe that means “can actually serve” rather than “process is up.” The prose explains the mechanism; the tables enumerate every option end to end so you can scan the right one mid-change.

What problem this solves

Running stateful or stateless fleets on raw VMs at scale produces three chronic pains, and Flexible-orchestration-plus-pipeline kills all three.

Slow, non-deterministic boot: a Marketplace image plus boot-time configuration means every instance rebuilds itself on start (apt-get upgrade, package installs, hardening scripts), taking 4–6 minutes and varying with mirror health and load. New instances arriving during an autoscale event are the slowest exactly when you need them fastest.

No rollback artifact: if the box is assembled at boot, there is nothing immutable to revert to; “rollback” means reverting a script and praying it re-runs cleanly.

No safe rollout: pushing a kernel patch or a new agent by re-running cloud-init across the fleet has no health gate, so a bad change rolls to 100% of instances before anyone notices.

What breaks without this: a CVE patch ships by re-running configuration management across the set, a script regression slips through, and a third of the fleet comes back unable to reach a downstream (an HSM, a database, a license server). There is no automatic halt and no automatic rollback, so recovery is a frantic manual reimage during an incident bridge. Meanwhile boot latency makes autoscale lag demand, and the lack of an immutable artifact makes every audit (“what exactly was running last Tuesday?”) an archaeology project.

Who hits this: anyone running IaaS fleets — web/app tiers that outgrew App Service, self-hosted CI runners, packaged ISV software that only ships as a VM, GPU inference nodes, regulated workloads that must run a CIS-hardened, monthly-patched base. The fix is not “a better cloud-init”; it is moving configuration left into a baked, versioned image and replacing reboots with health-gated instance replacement.

To frame the whole field before the deep dive, here is every moving part this article wires together, the pain it removes, and where it sits in the flow:

| Building block | Azure resource type | Pain it removes | Where it sits in the pipeline |
| --- | --- | --- | --- |
| Flexible orchestration | Microsoft.Compute/virtualMachineScaleSets | Awkward per-VM ops; opaque instances | The runtime fleet |
| Azure Image Builder | Microsoft.VirtualMachineImages/imageTemplates | Slow, non-deterministic boot | Build (control plane) |
| Compute Gallery | Microsoft.Compute/galleries | No immutable, versioned artifact | The artifact registry |
| Application Health extension | VM extension | No health signal for rollout | Gate on the fleet |
| Rolling upgrade policy | upgradePolicy.rollingUpgradePolicy | No safe, batched rollout | The rollout safety envelope |
| Automatic instance repair | automaticRepairsPolicy | Stuck-unhealthy instances persist | Continuous remediation |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the VM fundamentals: what an Azure VM, OS disk, NIC, NSG and availability zone are, and how az works in Cloud Shell with JSON output. Read Azure Virtual Machines: Every Setting That Matters for the per-VM anatomy and Azure VM Availability & Resilience Deep Dive for fault domains, availability zones, and the resilience model that VMSS placement builds on. Familiarity with Azure Load Balancer helps, since a Standard LB usually fronts the set, and Azure Compute: Dedicated Hosts, Spot, Confidential, HPC & Batch covers the Spot mechanics this article uses.

This sits in the Compute track, one level up from single-VM operations. It is the bridge between “I can run a VM” and “I run a self-healing, immutably-versioned fleet.” It pairs with Azure Monitor & Application Insights for Observability (you watch rollouts and health there), Azure Container Registry: Secure Supply Chain (the same supply-chain discipline, for containers), and Azure Key Vault: Secrets, Keys & Certificates (where the build identity and any baked-in secrets are governed).

A quick map of who owns what during a rollout, so you escalate to the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Build identity / RBAC | Managed identity, role scope | Platform / security | Build can’t publish; over-privileged pipeline |
| Image template (AIB) | Source, customizers, distribute | Platform / app | Build fails; broken image shipped |
| Compute Gallery | Definitions, versions, replication | Platform | Version not seen in region; bad latest |
| VMSS model | Orchestration, SKU, upgrade policy | Platform / app | Rollout halts; FD misconfig |
| Health probe (app) | /healthz semantics | App / dev team | Greenlights broken batch; repair loop |
| Autoscale / Spot | Rules, base count, eviction | Platform / FinOps | Flap; capacity loss on reclamation |

Core concepts

Six mental models make every later decision obvious.

Orchestration mode is the shape of the fleet, chosen once. Uniform treats instances as identical, fungible cattle behind a virtualMachineScaleSets/virtualMachines proxy; Flexible makes each instance a real Microsoft.Compute/virtualMachines resource that happens to be a member. The mode is set at creation and is irreversible. You pick Flexible for operability and Uniform only for very large homogeneous fleets or Service Fabric.

The image is an immutable, versioned artifact — not a recipe run at boot. Azure Image Builder bakes the box once; Compute Gallery stores it as gallery → image definition → image version. The definition is the unchanging contract (OS type, generation, security type, publisher/offer/sku); versions are the artifacts. A rollback is “point at the previous version,” not “revert a script.”

Replication gates regional rollout. A scale set in a region only sees a new image version once that version has finished replicating to that region. This is not a limitation — it is your canary lever: replicate to one region, prove it, then fan out.

Two upgrade knobs, constantly confused. The upgrade-policy mode (Manual/Automatic/Rolling) controls what happens to existing instances when you change the scale set model. Automatic OS image upgrade controls what happens when a new image version appears, and it always uses the rolling-upgrade policy regardless of mode. They are configured separately and mean different things.

A rolling upgrade is only as safe as its health signal. With Flexible orchestration the Application Health extension is required — there is no load-balancer-probe fallback. The platform replaces a batch, waits for the new instances to report healthy via this signal, and proceeds only if they do; otherwise it halts and restores. A probe that checks “process up” instead of “can serve” will happily greenlight a broken batch.

Exactly one health source. A scale set may have one health source. If you configure both an Application Health extension and a Load Balancer health probe, orchestration features (automatic OS upgrade, instance repair) will not work until you remove one.

The vocabulary in one table

Before the deep sections, pin down every moving part side by side. The glossary at the end repeats these for lookup:

| Term | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Flexible orchestration | Instances are real VM resources in a set | VMSS model | Operability; default for new fleets |
| Uniform orchestration | Instances proxied behind the set | VMSS model | Max scale ceiling; Service Fabric |
| Fault domain (FD) | Rack-level failure boundary | Placement | Spreading → availability |
| platformFaultDomainCount | How many FDs to spread across | Set at create (irreversible) | 1 = max spread |
| Azure Image Builder (AIB) | Managed Packer that bakes images | imageTemplates resource | Fast, identical boot |
| Image definition | Immutable contract for an image | Compute Gallery | OS/gen/security type |
| Image version | A concrete baked artifact | Compute Gallery | The thing you roll out |
| excludeFromLatest | Hide a version from latest | Version publishing profile | Kill switch for a bad bake |
| Replication | Copy a version to target regions | Gallery | Gates regional rollout |
| Upgrade policy mode | Action on model change | VMSS upgrade policy | Manual/Automatic/Rolling |
| Automatic OS upgrade | Action on new image version | automaticOSUpgradePolicy | Always uses rolling policy |
| Application Health extension | In-guest health probe | VM extension | The rollout gate |
| Automatic instance repair | Replace stuck-unhealthy instances | automaticRepairsPolicy | Continuous self-heal |
| Spot Priority Mix | Regular floor + Spot above it | VMSS (Flex) | Cost with protected capacity |

1. Uniform vs Flexible orchestration, and fault-domain placement

Uniform orchestration treats instances as identical, fungible cattle managed through a single VMSS model. It is still the right choice for very large, homogeneous fleets (thousands of instances) and for Service Fabric. But it hides the underlying VMs behind a virtualMachineScaleSets/virtualMachines proxy, so per-instance operations and standard VM tooling are awkward.

Flexible orchestration is the default for almost every new workload. Instances are real Microsoft.Compute/virtualMachines resources that happen to be members of a scale set. That means each instance shows up in the portal as a normal VM, takes VM extensions the normal way, can be attached or detached individually, and works with anything that expects a real VM resource. You trade some of Uniform’s raw scale ceiling for operability, and for most fleets that is the correct trade.

Here is the decision laid out attribute by attribute — read your requirements down the rows:

| Attribute | Uniform | Flexible | Which to pick |
| --- | --- | --- | --- |
| Instance resource type | VMSS/virtualMachines proxy | Real Microsoft.Compute/virtualMachines | Flex for tooling/operability |
| Max instances (typical) | Up to ~1,000 per set | Up to ~1,000 per set (multi-zone) | Either for most fleets |
| Per-instance operations | Awkward (proxy) | Native VM operations | Flex |
| Mixed VM sizes in one set | No | Yes | Flex |
| Spot Priority Mix | No | Yes | Flex |
| Service Fabric support | Yes | No | Uniform for Service Fabric |
| Automatic OS upgrade | GA | Preview | Uniform if you need GA today |
| Default for new workloads | Legacy | Yes | Flex |
| Changeable after create | No | No | Pick once |

The placement decision that matters on day one is fault-domain spreading. Set it at creation; you cannot change it later.

| platformFaultDomainCount | Behaviour | Use when | Gotcha |
| --- | --- | --- | --- |
| 1 (max spreading) | Azure spreads instances across as many fault domains as the region allows, best-effort | Default. Best availability for most stateless fleets | Instance FD not guaranteed fixed |
| 2–3 (fixed spreading) | Instances pinned across exactly N fault domains; request fails if it cannot satisfy N | Quorum systems that need a known, fixed FD count | Allocation can fail in a constrained region |
| 5 (legacy max, regional) | Fixed 5 FDs, regional (non-zonal) deployments | Legacy parity with availability sets | Not valid with zonal --zones |

Microsoft’s own guidance is to use max spreading (platformFaultDomainCount = 1) for most scale sets. It gives the broadest distribution and avoids allocation failures when a region is constrained. Combine that with Availability Zones for the strongest posture. The interaction between zones and FD count is worth one more table, because the combination is what determines your blast radius:

| Zones | FD count | Effective resilience | When to use |
| --- | --- | --- | --- |
| None (regional) | 1 (max) | Spread across racks in one DC | Dev / single-region, cost-sensitive |
| None (regional) | 2–3 (fixed) | Known FD quorum, one DC | Quorum systems without zones |
| 1 2 3 (zonal) | 1 (max) | Spread across 3 AZs + racks | Production default |
| 1 2 3 (zonal) | fixed | Rejected by the platform | Not supported together |

LOC=eastus
RG=rg-vmss-prod

az group create --name $RG --location $LOC

# Flexible scale set, zonal + max fault-domain spreading.
# --orchestration-mode flexible and --platform-fault-domain-count 1
# are the two flags that define the shape.
az vmss create \
  --resource-group $RG --name vmss-app \
  --orchestration-mode flexible \
  --zones 1 2 3 \
  --platform-fault-domain-count 1 \
  --instance-count 3 \
  --vm-sku Standard_D2as_v5 \
  --image Ubuntu2204 \
  --upgrade-policy-mode Manual \
  --admin-username azureuser \
  --generate-ssh-keys \
  --single-placement-group false

The Bicep equivalent — this is the form you actually keep in source control, where the irreversible fields are reviewed in a PR:

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2024-07-01' = {
  name: 'vmss-app'
  location: location
  zones: [ '1', '2', '3' ]
  sku: { name: 'Standard_D2as_v5', capacity: 3 }
  properties: {
    orchestrationMode: 'Flexible'        // irreversible
    platformFaultDomainCount: 1          // max spread; irreversible
    singlePlacementGroup: false
    upgradePolicy: { mode: 'Manual' }    // promote to Rolling only after health is green
  }
}

Create the set in Manual upgrade mode first. Confirm the Application Health extension is reporting green before you ever switch to Rolling; flipping to Rolling with a misconfigured health signal is how people brick a fleet.
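Because the placement fields cannot be changed after creation, it can help to run a pre-flight check in CI before anyone creates a set. The sketch below encodes only the zone and fault-domain combinations tabulated earlier in this section; the function name validate_placement is hypothetical, and the rules are this article's table, not an authoritative statement of what the platform accepts in every region.

```python
from typing import List, Sequence


def validate_placement(zones: Sequence[str], fault_domain_count: int) -> List[str]:
    """Return a list of problems; an empty list means the combination
    matches one of the supported rows described in this article."""
    problems: List[str] = []
    zonal = len(zones) > 0

    if fault_domain_count < 1:
        problems.append("platformFaultDomainCount must be at least 1")
    elif zonal and fault_domain_count != 1:
        # Per the article's table: zonal sets only work with max spreading.
        problems.append("zonal sets require platformFaultDomainCount = 1 (max spreading)")
    elif not zonal and fault_domain_count not in (1, 2, 3, 5):
        problems.append("regional sets use 1 (max), 2-3 (fixed) or 5 (legacy)")

    return problems


if __name__ == "__main__":
    print(validate_placement(["1", "2", "3"], 1))  # production default
    print(validate_placement(["1", "2", "3"], 3))  # rejected combination
```

An empty list means the request matches a supported row; anything else is worth failing the pipeline on, since the fix after creation is a new scale set.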

The flags on az vmss create that carry irreversible or load-bearing decisions deserve their own reference, because getting one wrong means recreating the set:

| Flag | Purpose | Default | Reversible? | Gotcha |
| --- | --- | --- | --- | --- |
| --orchestration-mode | Flexible vs Uniform | Flexible (new CLI) | No | The whole shape of the fleet |
| --platform-fault-domain-count | FD spreading | varies by region | No | 1 = max spread |
| --zones | Zonal placement | none (regional) | No | Can’t add zones later |
| --single-placement-group | Single vs multiple PGs | true (Uniform) | No | Set false for Flex large sets |
| --vm-sku | Instance size | | Yes (model update) | Mixed sizes allowed on Flex |
| --instance-count | Initial capacity | | Yes (autoscale) | Don’t pin if autoscaling |
| --upgrade-policy-mode | Manual/Automatic/Rolling | Manual | Yes | Start Manual |

2. The golden-image pipeline with Azure Image Builder

Azure Image Builder (AIB) is a managed wrapper over HashiCorp Packer. You describe a source image, a set of customizers, and one or more distribution targets in a Microsoft.VirtualMachineImages/imageTemplates resource. AIB spins up a transient build VM in a staging resource group, runs your customizers, generalizes the result, and publishes it to your target — here, a Compute Gallery image version.

First, the identity. AIB runs as a user-assigned managed identity that needs rights to write image versions into your gallery. Grant it a role scoped to the gallery resource group.

IDENTITY=id-aib
az identity create --resource-group $RG --name $IDENTITY
AIB_PRINCIPAL=$(az identity show -g $RG -n $IDENTITY --query principalId -o tsv)
AIB_ID=$(az identity show -g $RG -n $IDENTITY --query id -o tsv)
SUB=$(az account show --query id -o tsv)

# AIB needs to write image versions into the gallery RG.
az role assignment create \
  --assignee-object-id $AIB_PRINCIPAL \
  --assignee-principal-type ServicePrincipal \
  --role "Contributor" \
  --scope /subscriptions/$SUB/resourceGroups/$RG

Contributor on the resource group is the simple path. In a hardened estate, replace it with a custom role that grants only the image-version and disk actions AIB needs, scoped to the gallery and the staging RG. Never grant subscription-level Contributor to a build identity.

The least-privilege custom role for AIB is a small, well-known action set. Enumerate exactly what it needs rather than reaching for Contributor:

| Action | Why AIB needs it | Scope |
| --- | --- | --- |
| Microsoft.Compute/galleries/images/versions/write | Publish the new image version | Gallery RG |
| Microsoft.Compute/galleries/images/read | Read the target image definition | Gallery RG |
| Microsoft.Compute/images/write / read | Manage the intermediate managed image | Staging RG |
| Microsoft.Compute/disks/write | Create the build VM’s disk | Staging RG |
| Microsoft.Storage/storageAccounts/blobServices/.../read | Pull scriptUri customizers from blob | Script storage |
| Microsoft.Network/virtualNetworks/subnets/join/action | Build inside your VNet (if used) | VNet RG |
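One way to keep that action list reviewable is to generate the role definition from code. The sketch below builds the JSON document that az role definition create --role-definition accepts. The role name, the helper name build_aib_role and the scope arguments are hypothetical, and only the Compute actions from the table are included; the blob-read and subnet-join rows apply to other scopes and are left for separate assignments.

```python
import json
from typing import Dict, List

# Compute actions taken from the table above; nothing broader.
AIB_COMPUTE_ACTIONS: List[str] = [
    "Microsoft.Compute/galleries/images/versions/write",
    "Microsoft.Compute/galleries/images/read",
    "Microsoft.Compute/images/write",
    "Microsoft.Compute/images/read",
    "Microsoft.Compute/disks/write",
]


def build_aib_role(subscription_id: str, resource_groups: List[str]) -> Dict[str, object]:
    """Build a custom role definition scoped to specific resource groups."""
    if not resource_groups:
        raise ValueError("scope the role to at least one resource group")
    return {
        "Name": "AIB Image Publisher (example)",  # hypothetical role name
        "IsCustom": True,
        "Description": "Least-privilege role for an Image Builder identity.",
        "Actions": list(AIB_COMPUTE_ACTIONS),
        "NotActions": [],
        # Resource-group scopes only; never the whole subscription.
        "AssignableScopes": [
            f"/subscriptions/{subscription_id}/resourceGroups/{rg}"
            for rg in resource_groups
        ],
    }


if __name__ == "__main__":
    role = build_aib_role("00000000-0000-0000-0000-000000000000", ["rg-vmss-prod"])
    print(json.dumps(role, indent=2))
```

Saving the printed JSON to a file and passing it to az role definition create gives you a role you can assign in place of Contributor. Expect to adjust the action list after a first build, since a real bake may need a few more read or delete actions than the table lists.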

The build itself is shaped by the template’s top-level knobs. These govern time, size and cost of every bake — set them deliberately:

| Template field | What it controls | Default / typical | When to change |
| --- | --- | --- | --- |
| buildTimeoutInMinutes | Hard cap on the whole build | 0 (uses the service default, about 240 min) | Raise for heavy installs; lower to fail fast |
| vmProfile.vmSize | Build VM size | Standard_D1_v2 | Bigger for faster compiles/installs |
| vmProfile.osDiskSizeGB | Build OS disk size | source size | Raise if customizers need space |
| vmProfile.vnetConfig | Build inside a VNet | none (AIB-managed) | Required to reach private sources |
| stagingResourceGroup | Where the transient build lives | AIB-generated IT_* RG | Pin it for RBAC/cleanup control |
| errorHandling.onCustomizerError | Clean up or keep the build VM on failure | cleanup | Set abort to keep the build VM and debug a failed bake |

Now the template. The two load-bearing sections are source (a PlatformImage) and distribute (a SharedImage pointing at a gallery image definition). Note version: latest on the source — because latest is resolved at build time, you can rerun the same template monthly and always rebake on top of the newest patched base image.

{
  "type": "Microsoft.VirtualMachineImages/imageTemplates",
  "apiVersion": "2024-02-01",
  "location": "eastus",
  "identity": {
    "type": "UserAssigned",
    "userAssignedIdentities": {
      "<AIB_ID resource id>": {}
    }
  },
  "properties": {
    "buildTimeoutInMinutes": 60,
    "vmProfile": {
      "vmSize": "Standard_D2as_v5",
      "osDiskSizeGB": 30
    },
    "source": {
      "type": "PlatformImage",
      "publisher": "Canonical",
      "offer": "0001-com-ubuntu-server-jammy",
      "sku": "22_04-lts-gen2",
      "version": "latest"
    },
    "customize": [
      {
        "type": "Shell",
        "name": "harden-and-install",
        "inline": [
          "set -euo pipefail",
          "sudo apt-get update && sudo apt-get -y upgrade",
          "sudo apt-get -y install nginx jq",
          "sudo systemctl enable nginx",
          "echo \"baked $(date -u +%FT%TZ)\" | sudo tee /etc/image-build-stamp"
        ]
      },
      {
        "type": "Shell",
        "name": "cis-baseline",
        "scriptUri": "https://stbuildscripts.blob.core.windows.net/scripts/cis-baseline.sh"
      }
    ],
    "distribute": [
      {
        "type": "SharedImage",
        "galleryImageId": "/subscriptions/<sub>/resourceGroups/rg-vmss-prod/providers/Microsoft.Compute/galleries/galProd/images/ubuntu-app",
        "runOutputName": "ubuntu-app-out",
        "artifactTags": { "source": "aib", "baseline": "cis" },
        "targetRegions": [
          { "name": "eastus", "replicaCount": 3, "storageAccountType": "Standard_ZRS" },
          { "name": "westus3", "replicaCount": 2, "storageAccountType": "Standard_ZRS" }
        ]
      }
    ]
  }
}

AIB supports several customizer types; pick by what the step has to do, not habit. Each has a failure mode worth knowing:

| Customizer type | What it does | Use for | Gotcha |
| --- | --- | --- | --- |
| Shell (inline) | Run inline Linux commands | Small installs, hardening | Put set -euo pipefail first |
| Shell (scriptUri) | Run a script from a URL/blob | Reusable baselines (CIS) | Identity needs blob read |
| PowerShell | Run Windows commands/scripts | Windows images | runElevated for admin tasks |
| WindowsRestart | Reboot mid-build | After driver/agent installs | Set a sane restartTimeout |
| WindowsUpdate | Apply Windows Updates | Patch Windows base | Long; raise buildTimeoutInMinutes |
| File | Copy a file onto the image | Config, certs, binaries | Source must be reachable |

Distribution targets are not limited to a gallery; know the options even though SharedImage is the right default:

| Distribute type | Output | When to use | Note |
| --- | --- | --- | --- |
| SharedImage | Compute Gallery image version | Default: versioned, replicated | Used by VMSS/auto-OS upgrade |
| ManagedImage | A standalone managed image | Legacy / single-region | No versioning or replication |
| VHD | A VHD in a storage account | Export / off-Azure use | You manage lifecycle yourself |

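Deploy the template, then invoke the build. AIB templates are submitted as ARM resources: the resource above goes inside the resources array of a deployment template, with its name set to aib-ubuntu-app to match the commands below. The build itself is a separate Run action.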

# Submit the template resource (validates and creates the build pipeline).
az deployment group create \
  --resource-group $RG \
  --template-file aib-ubuntu-app.json

# Kick off the actual image build (long-running).
az image builder run \
  --resource-group $RG \
  --name aib-ubuntu-app

# Watch the build; lastRunStatus.runState goes Running -> Succeeded.
az image builder show \
  --resource-group $RG --name aib-ubuntu-app \
  --query "lastRunStatus" -o jsonc

customize is fail-fast: if any single customizer fails, the whole build fails. Put set -euo pipefail at the top of every Shell block so a silent error inside a script actually surfaces as a build failure instead of shipping a broken image.

The lastRunStatus.runState values tell you where a build is and what to do next:

| runState | Meaning | Typical duration | Next action |
| --- | --- | --- | --- |
| Running | Build VM up, customizers executing | 10–60+ min | Wait; tail customizer logs |
| Succeeded | Version published to the gallery | | Confirm replication, then roll out |
| Failed | A customizer or distribute step failed | | Read lastRunStatus.message; keep staging RG to debug |
| Canceled | Build canceled (timeout or manual) | | Raise buildTimeoutInMinutes; rerun |
| PartiallySucceeded | Some target regions failed | | Re-run distribute; check region quota |
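In a pipeline, the run states above usually drive a polling loop that blocks the rollout stage until the build is terminal. The following is a minimal sketch under stated assumptions: wait_for_build and ok_to_roll_out are hypothetical helpers, and the state lookup is injected as a callable so the loop can be exercised without Azure. In practice that callable might shell out to the az image builder show command shown earlier.

```python
import time
from typing import Callable

# Terminal values from the runState table above.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled", "PartiallySucceeded"}


def wait_for_build(fetch_state: Callable[[], str],
                   poll_seconds: float = 30.0,
                   max_polls: int = 240,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the build reaches a terminal runState and return it.

    fetch_state is injected so the loop can be tested without Azure.
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("build did not reach a terminal state in time")


def ok_to_roll_out(state: str) -> bool:
    """Only a fully successful build should proceed to rollout."""
    return state == "Succeeded"
```

Treating PartiallySucceeded as a stop is a deliberate choice in this sketch: some regions have the version and some do not, which is easier to reason about before a rollout starts than in the middle of one.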

3. Compute Gallery versioning, replication, and targeting

The Compute Gallery is the registry for your images. The hierarchy is gallery → image definition → image version. The definition is the immutable contract (OS type, generation, security type, publisher/offer/sku triple). Versions are the artifacts AIB writes into it.

GAL=galProd

az sig create --resource-group $RG --gallery-name $GAL

# Image definition. Hyper-V Gen2 + TrustedLaunchSupported is the modern
# default; it lets you boot the image on either Standard or TrustedLaunch VMs.
az sig image-definition create \
  --resource-group $RG --gallery-name $GAL \
  --gallery-image-definition ubuntu-app \
  --publisher kloudvin --offer ubuntu --sku app-jammy \
  --os-type Linux --os-state Generalized \
  --hyper-v-generation V2 \
  --features SecurityType=TrustedLaunchSupported

The image-definition fields are the immutable contract — you cannot change them on an existing definition, so choosing wrong means a new definition. Enumerate them:

| Definition field | Values | Default | Changeable? | Gotcha |
| --- | --- | --- | --- | --- |
| --os-type | Linux / Windows | | No | Must match the source |
| --os-state | Generalized / Specialized | Generalized | No | AIB outputs Generalized |
| --hyper-v-generation | V1 / V2 | V1 | No | V2 for TrustedLaunch/CVM |
| SecurityType | Standard / TrustedLaunch / TrustedLaunchSupported / ConfidentialVM(Supported) | Standard | No | *Supported boots on either |
| --publisher/--offer/--sku | Your triple | | No | The image’s identity; pick a scheme |
| --end-of-life-date | A date | none | Yes | Informational; doesn’t block |

The SecurityType choice is consequential enough to compare head to head — it decides which VM SKUs can boot your image:

| SecurityType | Boots on Standard VMs | Boots on TrustedLaunch VMs | Boots on Confidential VMs | Use when |
| --- | --- | --- | --- | --- |
| Standard | Yes | No | No | Legacy; avoid for new |
| TrustedLaunchSupported | Yes | Yes | No | Modern default: flexible |
| TrustedLaunch | No | Yes | No | Mandate Secure Boot + vTPM |
| ConfidentialVMSupported | Yes | Yes | Yes | May target CVM SKUs |

A few facts that bite people:

The version-level publishing knobs control replication, redundancy and rollout eligibility — the levers you actually pull during a release:

| Version field | What it does | Default | When to change | Limit / gotcha |
| --- | --- | --- | --- | --- |
| targetRegions[].name | Where the version replicates | source region only | Add every consuming region | Region not listed → not visible there |
| targetRegions[].replicaCount | Replicas per region | 1 | 2–3 in high-throughput regions | Up to ~50 per region; more = faster mass-create |
| storageAccountType | Replica redundancy | Standard_LRS | Standard_ZRS in zoned regions | ZRS survives one-zone storage outage |
| excludeFromLatest | Hide from latest | false | Set true to kill a bad version | Auto-OS upgrade skips excluded |
| endOfLifeDate | Mark a version EOL | none | Lifecycle hygiene | Informational; doesn’t block boot |
| replicationMode | Full vs shallow | Full | Shallow for fast test images | Shallow not for production scale-out |

# Demote a bad version without deleting it; latest then resolves to the
# newest version that is not excluded.
az sig image-version update \
  --resource-group $RG --gallery-name $GAL \
  --gallery-image-definition ubuntu-app \
  --gallery-image-version 1.4.0 \
  --set publishingProfile.excludeFromLatest=true
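Since a region only sees a version once replication there has finished, a rollout script can check that before touching a scale set. The sketch below assumes its input is the JSON from az sig image-version show with --expand ReplicationStatus, and that the result carries a replicationStatus.summary list whose entries have region and state fields. Treat those field names and the "Completed" value as assumptions to confirm against your CLI output; replicated_to is a hypothetical helper.

```python
from typing import Any, Dict


def _norm(region: str) -> str:
    # "East US" and "eastus" should compare equal.
    return region.replace(" ", "").lower()


def replicated_to(version: Dict[str, Any], region: str) -> bool:
    """True if the version reports completed replication in the region."""
    summary = (version.get("replicationStatus") or {}).get("summary") or []
    for entry in summary:
        if _norm(entry.get("region", "")) == _norm(region):
            # Assumed state value; anything else means not ready yet.
            return entry.get("state") == "Completed"
    return False  # region not listed: the version is not visible there
```

Gating each regional rollout on this check is what turns replication into the canary lever: replicate to one region, roll out and soak there, then add the next target region.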

When the scale set should always track the newest version, reference the definition without a version. That /latest-style reference (omit the version segment) is exactly what automatic OS upgrade keys off. The three ways to reference an image from a VMSS, and what each does:

| Reference form | Example tail | Behaviour | Use for |
| --- | --- | --- | --- |
| Definition, no version | .../images/ubuntu-app | Resolves to latest non-excluded version | Auto-OS upgrade tracking |
| Definition + explicit version | .../images/ubuntu-app/versions/1.7.0 | Pinned to that exact version | Reproducible / pinned fleets |
| latest keyword | .../versions/latest | Resolves at create time only | One-off creates, not auto-upgrade |
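The three reference forms differ only in the tail of the resource ID, so a review bot or pre-deploy check can classify them mechanically. This is an illustrative sketch; classify_image_reference and its three return labels are hypothetical names, not anything the platform defines.

```python
def classify_image_reference(resource_id: str) -> str:
    """Classify a gallery image reference by its trailing segments."""
    parts = [p for p in resource_id.strip("/").split("/") if p]
    lowered = [p.lower() for p in parts]
    if "images" not in lowered:
        raise ValueError("not a gallery image reference")
    if "versions" not in lowered:
        return "tracks-latest"  # definition only, no version segment
    idx = lowered.index("versions")
    if idx + 1 >= len(parts):
        raise ValueError("versions segment without a version")
    version = parts[idx + 1]
    return "latest-keyword" if version.lower() == "latest" else "pinned"


if __name__ == "__main__":
    base = ("/subscriptions/<sub>/resourceGroups/rg-vmss-prod/providers/"
            "Microsoft.Compute/galleries/galProd/images/ubuntu-app")
    print(classify_image_reference(base))
    print(classify_image_reference(base + "/versions/1.7.0"))
```

A fleet that is meant to follow automatic OS upgrade should classify as tracks-latest; a pinned result on such a fleet is a configuration drift worth flagging in review.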

Gallery replication has real numbers worth keeping in front of you when you plan a large rollout:

| Limit / quantity | Approximate value | Why it matters |
| --- | --- | --- |
| Image versions per definition | Up to ~10,000 | Long retention is fine |
| Replicas per region per version | Up to ~50 | Each replica serves a slice of concurrent creates |
| Target regions per version | Up to ~50 | Global fleets in one version |
| Concurrent instance creates per replica | ~20 (rule of thumb) | Add replicas to mass-create faster |
| Replication time | Minutes to tens of minutes | Gates when a region can roll out |
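The rule of thumb in the table turns into simple arithmetic when sizing replicaCount for a region. The sketch below uses the table's approximate figures (about 20 concurrent creates per replica, about 50 replicas per region) as constants; both are planning estimates rather than guaranteed limits, and replicas_needed is a hypothetical helper.

```python
import math

CREATES_PER_REPLICA = 20       # rule of thumb from the table above
MAX_REPLICAS_PER_REGION = 50   # approximate ceiling from the table above


def replicas_needed(concurrent_creates: int) -> int:
    """Replica count for a region, given the peak concurrent instance creates."""
    if concurrent_creates < 0:
        raise ValueError("concurrent_creates must be non-negative")
    wanted = max(1, math.ceil(concurrent_creates / CREATES_PER_REPLICA))
    return min(wanted, MAX_REPLICAS_PER_REGION)


if __name__ == "__main__":
    # A scale-out that creates 300 instances at once in one region.
    print(replicas_needed(300))
```

Size for the burst you expect during a scale-out or a surge upgrade, not for steady-state capacity, since replicas only matter while instances are being created.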

4. Automatic OS image upgrades and rolling-upgrade health policies

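There are two distinct, separately configured knobs that people constantly conflate:

Upgrade policy mode (upgradePolicy.mode: Manual, Automatic or Rolling) controls what happens to existing instances when you change the scale set model.

Automatic OS image upgrade (automaticOSUpgradePolicy) controls what happens when a new image version appears, and it always uses the rolling-upgrade policy regardless of mode.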

The three mode values, side by side — this is the table that ends the confusion:

| mode | What it does on a model change | Touches instances when image changes? | Health gate? | Use when |
| --- | --- | --- | --- | --- |
| Manual | Nothing; you upgrade instances yourself | No | No | Bring-up; full manual control |
| Automatic | Upgrades all instances at once, no batching | No (that’s auto-OS upgrade) | No | Rarely; risky for prod |
| Rolling | Upgrades in health-gated batches | No (that’s auto-OS upgrade) | Yes | Production model changes |

Heads-up on lifecycle: automatic OS image upgrade for VMSS Flex is in preview (it has been GA for Uniform for years). For Flex, the instance image version must be set to latest, the Application Health extension version on the instance must match the model, and — importantly — MaxSurge cannot be combined with automatic OS upgrade on Flex. Validate it in non-prod and confirm current regional availability before you depend on it in production.

The prerequisites for automatic OS upgrade on Flex are strict; miss one and it silently does nothing. Treat this as a checklist table:

| Prerequisite | Why | How to confirm |
| --- | --- | --- |
| Image referenced as latest (no version) | Upgrade needs a moving target | az vmss show --query virtualMachineProfile.storageProfile.imageReference |
| Application Health extension present | The required health source on Flex | az vmss extension list shows ApplicationHealth* |
| Extension version matches the model | Drift blocks orchestration | az vmss get-instance-view vs model |
| Single health source only | One probe rule | No LB health-probe duplicate configured |
| enableAutomaticOSUpgrade=true | The feature switch | az vmss show --query upgradePolicy.automaticOSUpgradePolicy |
| No MaxSurge | Unsupported with auto-OS on Flex | rollingUpgradePolicy.maxSurge unset/false |
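Because a missed prerequisite fails silently, the checklist is a good candidate for an automated gate. The sketch below works on a small, hypothetical summary object (FlexAutoOsFacts) that a pipeline would fill in from the az commands in the table. It deliberately avoids assuming the exact JSON shape of az vmss show, and the checks mirror the table rows rather than any official validation logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlexAutoOsFacts:
    """Hypothetical summary a pipeline would fill in from az output."""
    image_reference_id: str
    has_app_health_extension: bool
    extension_version_matches_model: bool
    has_lb_health_probe: bool
    auto_os_upgrade_enabled: bool
    max_surge: bool


def unmet_prerequisites(f: FlexAutoOsFacts) -> List[str]:
    """Return the checklist rows that are not satisfied."""
    unmet: List[str] = []
    if "/versions/" in f.image_reference_id.lower():
        unmet.append("image reference names a version; reference the definition only")
    if not f.has_app_health_extension:
        unmet.append("Application Health extension missing")
    elif not f.extension_version_matches_model:
        unmet.append("Application Health extension version drifts from the model")
    if f.has_app_health_extension and f.has_lb_health_probe:
        unmet.append("two health sources configured; keep exactly one")
    if not f.auto_os_upgrade_enabled:
        unmet.append("enableAutomaticOSUpgrade is not true")
    if f.max_surge:
        unmet.append("maxSurge is set; not supported with auto-OS upgrade on Flex")
    return unmet
```

Running this as a scheduled check, not only at deploy time, catches drift such as an extension version that falls behind the model after a partial update.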

The rolling upgrade policy is the safety envelope. These are the real fields and their meanings:

| Field (az flag) | Meaning | Typical prod value | Gotcha |
| --- | --- | --- | --- |
| maxBatchInstancePercent (--max-batch-instance-percent) | Max % of instances upgraded in one batch | 20 | Smaller batch = safer, slower |
| maxUnhealthyInstancePercent (--max-unhealthy-instance-percent) | If more than this % of the whole set is unhealthy, the upgrade halts | 20 | Counts unrelated unhealthy too |
| maxUnhealthyUpgradedInstancePercent (--max-unhealthy-upgraded-instance-percent) | If more than this % of already-upgraded instances go unhealthy, the upgrade is cancelled | 20 | The real rollback trigger |
| pauseTimeBetweenBatches (--pause-time-between-batches) | ISO-8601 soak time between batches, e.g. PT2M | PT2M–PT5M | Long enough to catch slow failures |
| prioritizeUnhealthyInstances (--prioritize-unhealthy-instances) | Upgrade already-unhealthy instances first | true | Heals the sick first |
| maxSurge (--max-surge) | Create new instances before deleting old | false on Flex auto-OS | Not with auto-OS on Flex |
| rollbackFailedInstancesOnPolicyBreach | Roll failed instances back on breach | true | Restores previous OS disk |

Set the policy and enable automatic OS upgrade:

# 1) Tighten the rolling envelope and switch to Rolling mode.
az vmss update \
  --resource-group $RG --name vmss-app \
  --set upgradePolicy.mode=Rolling \
  --max-batch-instance-percent 20 \
  --max-unhealthy-instance-percent 20 \
  --max-unhealthy-upgraded-instance-percent 20 \
  --prioritize-unhealthy-instances true \
  --pause-time-between-batches PT2M

# 2) Enable automatic OS image upgrade (keys off the gallery 'latest').
az vmss update \
  --resource-group $RG --name vmss-app \
  --enable-auto-os-upgrade true

In Bicep, the policy lives under upgradePolicy and is reviewed like any other config:

properties: {
  upgradePolicy: {
    mode: 'Rolling'
    rollingUpgradePolicy: {
      maxBatchInstancePercent: 20
      maxUnhealthyInstancePercent: 20
      maxUnhealthyUpgradedInstancePercent: 20
      pauseTimeBetweenBatches: 'PT2M'
      prioritizeUnhealthyInstances: true
    }
    automaticOSUpgradePolicy: {
      enableAutomaticOSUpgrade: true
      disableAutomaticRollback: false
    }
  }
}

The orchestrator never upgrades more than 20% of the set at once, waits for each upgraded instance to report healthy, and restores the previous OS disk if an instance does not recover in time. If overall unhealthy instances cross your threshold mid-flight — even from unrelated maintenance — it stops at the end of the current batch. The conditions that halt or roll back an upgrade, and the exact status you will see, are worth tabulating because they are what you read during an incident:

Condition Status / error you see What the platform does Your move
Upgraded instances exceed maxUnhealthyUpgradedInstancePercent MaxUnhealthyUpgradedInstancePercentExceededInRollingUpgrade Cancels upgrade; rolls failed instances back Demote bad version with excludeFromLatest; fix; re-bake
Whole-set unhealthy exceeds maxUnhealthyInstancePercent MaxUnhealthyInstancePercentExceededInRollingUpgrade Halts at end of current batch Resolve unrelated unhealthy; resume
Instance won’t report healthy in time per-instance failure in latest result Restores previous OS disk for that instance Check probe semantics + grace period
Health extension missing/mismatched upgrade does not start No-op Add/align the extension version
Manual cancel Cancelled Stops; leaves mixed versions Re-run after fix
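
The batch and threshold arithmetic behind that table can be sketched in a few lines. This is a rough illustration, not the platform's algorithm: the function names are hypothetical, and the rounding rule (floor, with a minimum of one instance per batch) is an assumption that Azure's actual behavior may not match.

```python
def plan_batches(instance_count, max_batch_percent):
    """Split a fleet into upgrade batches no larger than max_batch_percent.

    Assumption: batch size is floor(count * percent / 100), minimum 1.
    """
    if instance_count <= 0:
        return []
    size = max(1, instance_count * max_batch_percent // 100)
    batches = []
    remaining = instance_count
    while remaining > 0:
        step = min(size, remaining)
        batches.append(step)
        remaining -= step
    return batches


def breaches(unhealthy, total, max_percent):
    """True when the unhealthy share strictly exceeds the threshold."""
    return total > 0 and unhealthy * 100 > max_percent * total


# A 10-instance set at 20% upgrades in five batches of two.
print(plan_batches(10, 20))
# One unhealthy instance in a batch of two is 50%, well past a 20% limit.
print(breaches(1, 2, 20))
```

Under these assumptions a small set is far more sensitive than a large one: with two upgraded instances, a single failure already breaches any threshold below 50%.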

5. Application Health extension and graceful instance replacement

A rolling upgrade is only as safe as its health signal. With Flexible orchestration and a rolling policy, the Application Health extension is required — there is no load-balancer-probe fallback the way there is for Uniform. The platform uses this signal to decide whether a freshly-upgraded instance is healthy before touching the next batch.

Critical constraint: a scale set may have exactly one health source. If you have both an Application Health extension and a Load Balancer health probe configured, you must remove one before orchestration features (automatic OS upgrade, instance repair) will work.

Add the extension to the model. It probes a local endpoint your app owns — make that endpoint mean “I can actually serve traffic,” not just “the process is up.”

az vmss extension set \
  --resource-group $RG --vmss-name vmss-app \
  --name ApplicationHealthLinux \
  --publisher Microsoft.ManagedServices \
  --version 2.0 \
  --settings '{
    "protocol": "http",
    "port": 8080,
    "requestPath": "/healthz",
    "intervalInSeconds": 5,
    "numberOfProbes": 1,
    "gracePeriod": 600
  }'

# Make sure the extension change is rolled to existing instances.
az vmss update-instances \
  --resource-group $RG --name vmss-app --instance-ids '*'

Every setting on the extension shapes how fast and how forgivingly it flips an instance unhealthy. Enumerate them — these are the knobs that cause both premature halts and missed-failure greenlights:

Setting | What it does | Default | Range / values | When to change
protocol | Probe protocol | - | http / https / tcp | https for TLS endpoints; tcp for non-HTTP
port | Port probed in-guest | - | 1–65535 | Match your readiness listener
requestPath | Path for http/https | - | any path | Point at a real readiness route
intervalInSeconds | Probe frequency | 5 s | 5–60 s | Lower = faster detection, more noise
numberOfProbes | Consecutive results to flip state | 1 | 1–24 | Raise to ride transient blips
gracePeriod | Grace after boot before counting | 600 s | 0–7200 s | Cover boot + warm-up
intervalInSeconds × numberOfProbes | Effective detection window | derived | - | This is your real reaction time
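
The derived row is worth computing before you commit a setting. A minimal sketch, assuming the ranges listed in the table above; the function name is hypothetical:

```python
def detection_window_seconds(interval_s, number_of_probes):
    """Seconds of consecutive results needed before the health state flips.

    The range checks mirror the table above (5-60 s, 1-24 probes).
    """
    if not 5 <= interval_s <= 60:
        raise ValueError("intervalInSeconds must be between 5 and 60")
    if not 1 <= number_of_probes <= 24:
        raise ValueError("numberOfProbes must be between 1 and 24")
    return interval_s * number_of_probes


# The defaults flip state after a single 5-second probe.
print(detection_window_seconds(5, 1))
# Three probes at 10 s tolerate a blip of up to 30 seconds.
print(detection_window_seconds(10, 3))
```

The trade-off is direct: every extra probe you add to ride out blips is also extra time a genuinely broken instance keeps counting as healthy.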

The health states the extension reports, and what each means for orchestration:

Reported state Meaning Effect during rolling upgrade Effect on instance repair
Healthy Probe returns success Batch proceeds Instance left alone
Unhealthy Probe fails past numberOfProbes Counts against thresholds; may halt Eligible for repair after grace
Unknown No signal yet (within grace) Waits, doesn’t fail Not repaired during grace
(no extension) No health source Auto-OS upgrade won’t run Repair won’t run

The HTTP probe contract is specific — design /healthz to the rule, not by guesswork:

Probe returns… | Instance is… | Include in the check | Never include
200 OK | Healthy / can serve | In-process readiness (config loaded, pools primed) | A slow downstream report query
2xx other than 200 (http/https) | Treated as unhealthy | - | Redirects, 204; return a clean 200
Non-2xx / timeout | Unhealthy | A fast, required-dependency check | An optional cache/search call
TCP connect (tcp mode) | Healthy if port accepts | A real listener bound to the port | A port that’s up before the app can serve
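
As an illustration of that contract, here is a minimal readiness endpoint using only the Python standard library. It is a sketch under stated assumptions, not a production server: `is_ready` is a hypothetical callback standing in for your real dependency check, binding to 127.0.0.1 assumes the extension probes from inside the guest, and the `ApplicationHealthState` JSON body is the shape I understand version 2.0's rich health states to expect, so verify that field name against the extension documentation.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen


def make_handler(is_ready):
    """Build a handler whose /healthz reflects is_ready() (hypothetical check)."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            ready = is_ready()
            # Clean 200 only when the instance can serve; 503 otherwise.
            status = 200 if ready else 503
            state = "Healthy" if ready else "Unhealthy"
            body = json.dumps({"ApplicationHealthState": state}).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler


def serve(is_ready, port=0):
    """Start the endpoint on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(is_ready))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(port, path="/healthz"):
    """Return the HTTP status a local probe would see."""
    try:
        with urlopen(f"http://127.0.0.1:{port}{path}", timeout=2) as resp:
            return resp.status
    except HTTPError as err:
        return err.code
```

In use, `serve(check_hsm_connection, port=8080)` would expose the route, where `check_hsm_connection` is whatever fast, required-dependency check your service needs. Keep that check cheap: it runs on every probe interval.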

Pair this with automatic instance repair, which uses the same health signal to replace an instance that stays unhealthy outside of any upgrade. The grace period must be long enough to cover boot plus app warm-up, or you will fight a repair loop.

az vmss update \
  --resource-group $RG --name vmss-app \
  --enable-automatic-repairs true \
  --automatic-repairs-grace-period PT30M

Automatic instance repair has its own small set of knobs; the grace period and repair action are the two that bite:

Setting | What it does | Default | Gotcha
enableAutomaticRepairs | Turn repair on | false | Needs the single health source
gracePeriod | Wait after a state change before repairing | PT30M (range ~PT10M–PT90M) | Too short → repair loop on slow boot
repairAction | What “repair” does | Replace | Reimage / Restart are cheaper but less thorough
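
A quick sanity check on the grace period catches the repair loop before it starts. The sketch below parses the ISO 8601 durations the CLI takes and compares them with measured boot and warm-up times. The function names and the 1.5× safety margin are assumptions for illustration, not platform rules.

```python
import re


def iso_minutes(duration):
    """Convert a simple ISO 8601 duration such as PT30M or PT1H30M to minutes."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def grace_covers_warmup(grace, boot_s, warmup_s, margin=1.5):
    """True when the grace period exceeds boot + warm-up by the chosen margin."""
    return iso_minutes(grace) * 60 >= (boot_s + warmup_s) * margin


# 5 min boot + 4 min warm-up needs more than a 10-minute grace period.
print(grace_covers_warmup("PT10M", boot_s=300, warmup_s=240))
print(grace_covers_warmup("PT30M", boot_s=300, warmup_s=240))
```

Measure boot and warm-up on a cold instance under load, not on a warm test box, or the margin will be optimistic.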

6. Autoscale rules, predictive scaling, and scale-in protection

Scaling is configured against the scale set as the target resource. Build rules on a real saturation signal, and always pair a scale-out rule with a scale-in rule plus a cooldown so you do not flap.

az monitor autoscale create \
  --resource-group $RG \
  --resource vmss-app \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name autoscale-app \
  --min-count 3 --max-count 30 --count 3

# Scale out on sustained CPU, scale in conservatively.
az monitor autoscale rule create \
  --resource-group $RG --autoscale-name autoscale-app \
  --condition "Percentage CPU > 70 avg 5m" \
  --scale out 2 --cooldown 5

az monitor autoscale rule create \
  --resource-group $RG --autoscale-name autoscale-app \
  --condition "Percentage CPU < 30 avg 10m" \
  --scale in 1 --cooldown 10

The autoscale rule fields determine whether you respond smoothly or flap. Enumerate the dials and their sane production values:

Rule field | What it controls | Typical out value | Typical in value | Gotcha
Metric | The saturation signal | Percentage CPU | Percentage CPU | Use a real bottleneck (CPU, queue depth)
Operator / threshold | Trigger point | > 70 | < 30 | Wide gap avoids oscillation
Time aggregation | avg/max/min over window | avg | avg | max reacts to spikes harder
Time window | Smoothing period | 5m | 10m | Longer = steadier, slower
Scale action | Count / percent change | out 2 | in 1 | Out faster than in
Cooldown | Wait before next action | 5m | 10m | Must exceed boot + warm-up
Min / max / default | Capacity bounds | - | - | Max must clear peak demand
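
The "wide gap avoids oscillation" rule can be checked with simple arithmetic. The sketch below assumes total demand stays constant and spreads evenly across instances, which is a simplification, and the function names are hypothetical. It estimates per-instance CPU after a scale-in and flags the rule pair if that estimate would immediately trip the scale-out threshold.

```python
def projected_cpu_after_scale_in(current_cpu, count, remove):
    """Per-instance CPU once `remove` instances are gone, at constant demand."""
    if remove >= count:
        raise ValueError("cannot remove every instance")
    return current_cpu * count / (count - remove)


def scale_in_would_flap(in_threshold, out_threshold, count, remove):
    """True if scaling in at in_threshold lands at or above out_threshold."""
    projected = projected_cpu_after_scale_in(in_threshold, count, remove)
    return projected >= out_threshold


# The rules above: in at 30%, out at 70%, 3 instances, remove 1 -> 45%, safe.
print(scale_in_would_flap(30, 70, count=3, remove=1))
# A narrow gap (in at 50%) lands at 75% and would scale straight back out.
print(scale_in_would_flap(50, 70, count=3, remove=1))
```

The worst case is at the minimum count, where removing one instance is the largest proportional change, so check the pair there.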

Pick the metric to the workload — CPU is the default, but it is often the wrong signal:

Workload Better scale metric Why Source
Web/app tier Percentage CPU Compute-bound request handling Host metric
Queue worker Queue length / messages Backlog, not CPU, is the demand Storage/Service Bus metric
Memory-bound service Available memory (guest) CPU may be idle while RAM saturates Guest metric via agent
Connection-bound Active connections / LB metric Sockets, not CPU, are the ceiling LB metric
Predictable daily shape Schedule + predictive Provision ahead of the curve Recurrence profile

Two refinements separate a production config from a demo: per-instance scale-in protection and predictive autoscale. Protection is set on a single instance:

az vmss update \
  --resource-group $RG --name vmss-app \
  --instance-id 3 \
  --protect-from-scale-in true \
  --protect-from-scale-set-actions false

The two protection flags are easy to confuse — they protect against different actions:

Protection flag Protects against Leaves allowed Use for
protect-from-scale-in Autoscale removing this instance Manual delete, upgrades, repair An instance running a long job
protect-from-scale-set-actions Set-wide actions (incl. upgrades) on this instance Autoscale scale-in (if other flag off) Pin a special-purpose instance

Predictive autoscale has two modes; never jump straight to enforcing it:

Predictive mode What it does Risk When to use
ForecastOnly Computes and charts a forecast, takes no action None First — validate the model
Enabled Provisions ahead of the forecast Over/under-provision if model is off After ForecastOnly looks right
Disabled Reactive scaling only Lags spiky demand Workloads with no daily shape

7. Spot instances, eviction handling, and mixed capacity

Flexible orchestration unlocks Spot Priority Mix (GA for Flex), which runs a guaranteed floor of regular VMs alongside Spot VMs in one scale set. You set a base count of regular instances that is never evicted, plus a percentage of regular instances among everything above that base. The rest are Spot, evicted (and optionally deallocated) when Azure reclaims capacity.

# Floor of 3 regular VMs; above that, 50% regular / 50% Spot.
# Eviction policy 'Deallocate' keeps the disk so the instance can return.
az vmss create \
  --resource-group $RG --name vmss-batch \
  --orchestration-mode flexible \
  --platform-fault-domain-count 1 \
  --instance-count 10 \
  --vm-sku Standard_D4as_v5 \
  --image Ubuntu2204 \
  --priority Spot \
  --eviction-policy Deallocate \
  --regular-priority-count 3 \
  --regular-priority-percentage 50 \
  --single-placement-group false

The Spot Priority Mix parameters decide how much guaranteed capacity you keep. Enumerate them and their effect on a 10-instance set:

Parameter What it sets Example (cap 10) Result
regular-priority-count Floor of never-evicted regular VMs 3 First 3 are always regular
regular-priority-percentage % regular among instances above the floor 50 Of the remaining 7, ~3–4 regular
priority Default priority for the set Spot Above-floor non-regular are Spot
eviction-policy What happens on reclaim Deallocate Disk kept; instance can return
max-price Max hourly price you’ll pay -1 (any, up to on-demand) -1 = only evicted by capacity
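
The "~3–4 regular" figure in the table falls out of a short calculation. Because the platform's rounding is not stated here, this sketch returns a lower and an upper bound rather than a single number. The function name is hypothetical.

```python
def regular_bounds(capacity, base_regular, regular_percent_above_base):
    """Return (min, max) regular-priority instances for a Spot Priority Mix.

    The floor is filled first; the percentage applies only above it.
    Rounding of the fractional share is left open, hence two bounds.
    """
    base = min(base_regular, capacity)
    above = capacity - base
    low = above * regular_percent_above_base // 100
    high = -(-above * regular_percent_above_base // 100)  # ceiling division
    return base + low, base + high


# Capacity 10, floor 3, 50% above the floor: 6 to 7 regular, 3 to 4 Spot.
low, high = regular_bounds(10, 3, 50)
print(low, high, 10 - high, 10 - low)
```

Sizing the floor from the lower bound is the conservative choice: it is the capacity you can count on when every Spot instance is reclaimed at once.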

Eviction policy is a real fork — pick by whether the instance needs to come back:

eviction-policy On reclaim Disk Cost while evicted Use for
Deallocate Stop + deallocate Kept Disk storage only Work that resumes; stateful-ish
Delete Delete the instance Removed None Pure stateless; recreate fresh

Handle eviction gracefully from inside the instance. Spot eviction is delivered through Azure Scheduled Events on the Instance Metadata Service; poll it and drain on a Preempt signal.

# Poll IMDS for a Preempt event and drain before the 30s window closes.
curl -s -H "Metadata:true" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" \
  | jq '.Events[] | select(.EventType=="Preempt")'

The Scheduled Events you should handle, and the window you get for each:

EventType Trigger Notice window What to do
Preempt Spot eviction ~30 s Drain connections, checkpoint, deregister from LB
Terminate Configured VM delete configurable (e.g. 5–15 min) Graceful shutdown
Reboot Planned host reboot minutes Flush state, expect a restart
Redeploy Host migration minutes Re-establish ephemeral state on new host
Freeze Brief host pause seconds Tolerate a short stall
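
Inside the guest, the polling loop reduces to filtering the Scheduled Events document for events that target this instance. The sketch below works on an already-fetched document so it runs anywhere. The sample payload is illustrative (all IDs and names are made up), and `pending_events` is a hypothetical helper, not part of any SDK.

```python
import json

# Illustrative document in the shape Scheduled Events returns; values are made up.
SAMPLE = json.loads("""
{
  "DocumentIncarnation": 7,
  "Events": [
    {"EventId": "evt-1", "EventType": "Freeze",
     "Resources": ["vmss-batch_a1"], "EventStatus": "Scheduled"},
    {"EventId": "evt-2", "EventType": "Preempt",
     "Resources": ["vmss-batch_b2"], "EventStatus": "Scheduled"}
  ]
}
""")


def pending_events(document, resource_name, event_types=("Preempt", "Terminate")):
    """Return events of the given types that target resource_name."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventType") in event_types
        and resource_name in event.get("Resources", [])
    ]


for event in pending_events(SAMPLE, "vmss-batch_b2"):
    # In a real agent this is where draining and checkpointing would start.
    print("drain now:", event["EventId"], event["EventType"])
```

With roughly 30 seconds of notice for Preempt, the poll interval has to be a small fraction of that window, and the drain path has to be something you have timed.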

Spot is for interruptible work: batch, CI runners, stateless stream processors, dev. The base regular count is your insurance that core capacity survives a region-wide Spot reclamation. For a tier-1 synchronous API, keep Spot out of the path or set the base high enough to carry full load alone.

Architecture at a glance

The diagram traces the image as it actually flows, left to right, from a build pipeline to a self-healing fleet. On the far left is the build control plane: the AIB user-assigned identity (scoped to the gallery RG, not the subscription), the Image Builder template that runs Packer with a 60-minute timeout, and the transient build VM that does the work and is then thrown away. AIB publishes the result into the artifact zone — a Compute Gallery image definition (Gen2, TrustedLaunch-supported) holding an immutable image version (1.7.1, ZRS-replicated, with excludeFromLatest available as a kill switch). That version then replicates into the regions zone — eastus as the canary with three replicas, westus3 with two — and only a region that has finished replicating can roll the new image out.

On the right, the VMSS Flex zone is the runtime fleet: a Standard Load Balancer probing port 8080, the Flexible scale set created with platformFaultDomainCount=1 (max spreading) across zones 1, 2, and 3, and the Application Health extension probing /healthz and gating the rolling upgrade to 20% batches. Everything converges on the observe zone: automatic OS-upgrade history and the repair/autoscale loop (3→30 instances). Read the five numbered badges as the five places a rollout breaks: a denied build identity, an excluded/stale version, unfinished replication, a halted rolling upgrade from a failing probe, and a repair/scale-in loop. The legend narrates each as symptom · confirm command · fix. The mental model is one sentence: the image flows left to right, and each badge is a gate that, misconfigured, stops the flow exactly there.

Figure: Azure golden-image pipeline feeding a health-gated Flexible VM Scale Set. A build control plane publishes an immutable image version into a Compute Gallery, which replicates to eastus (canary) and westus3 and rolls out to the scale set; five numbered badges mark the failure points described above.

Real-world scenario

A payments platform team ran a tier-1 authorization service on a Uniform VMSS with a Marketplace Ubuntu image and a 350-line cloud-init. Two problems converged. First, their security team mandated a CIS-hardened, monthly-patched base image with a sub-90-second boot SLO; cloud-init at boot took 4–6 minutes and was non-deterministic under load. Second, a routine kernel CVE patch the previous quarter had been pushed by re-running cloud-init across the fleet, a bad script slipped through, and ~30% of instances came back unable to reach the HSM. There was no health gate and no rollback — the bad config rolled to the entire set before anyone noticed, and recovery was a frantic manual reimage.

They rebuilt on Flexible orchestration with an AIB + Compute Gallery pipeline. The CIS baseline and all packages moved into the image (boot dropped to ~50 seconds). The Application Health extension probed a /ready endpoint that returned 200 only after a successful test connection to the HSM — so “healthy” meant “can actually authorize.” Crucially, they kept maxUnhealthyUpgradedInstancePercent tight and replicated each new gallery version to a single canary region first.

The next CVE patch told the story. A new image version went to the canary region; rolling upgrade replaced the first 20% batch; the new instances failed the HSM connectivity probe; the orchestrator restored their previous OS disks and halted at MaxUnhealthyUpgradedInstancePercentExceededInRollingUpgrade after the first batch. Blast radius: a handful of instances in one region, auto-rolled-back, zero customer impact. The fix (a firewall rule the new baseline had tightened) shipped as 1.7.1; 1.7.0 was demoted with excludeFromLatest rather than deleted.
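
Why demoting 1.7.0 was enough can be shown with a small model of how latest resolves. This is a sketch under an assumption: that latest means the highest Major.Minor.Patch among versions not flagged excludeFromLatest. It ignores replication state, and `resolve_latest` is a hypothetical helper.

```python
def resolve_latest(versions):
    """Pick the highest version not excluded from latest.

    `versions` is a list of (version_string, exclude_from_latest) pairs.
    Assumption: ordering is numeric on Major.Minor.Patch.
    """
    eligible = [name for name, excluded in versions if not excluded]
    if not eligible:
        return None
    return max(eligible, key=lambda name: tuple(int(p) for p in name.split(".")))


gallery = [("1.6.9", False), ("1.7.0", True), ("1.7.1", False)]
# The demoted 1.7.0 is skipped but still exists for forensics.
print(resolve_latest(gallery))
```

Before 1.7.1 shipped, the same resolution would have fallen back to 1.6.9, which is why demotion is a safe first move during an incident.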

The health endpoint design was the whole game — a probe that only checks the process would have happily greenlit the broken batch:

az vmss extension set \
  --resource-group rg-pay-prod --vmss-name vmss-authz \
  --name ApplicationHealthLinux \
  --publisher Microsoft.ManagedServices --version 2.0 \
  --settings '{
    "protocol": "https",
    "port": 8443,
    "requestPath": "/ready",
    "intervalInSeconds": 5,
    "gracePeriod": 900
  }'

The before/after, as numbers, because the business case is in the deltas:

Metric Before (Uniform + cloud-init) After (Flex + AIB + Gallery) Driver of the change
Boot time 4–6 min, variable ~50 s, deterministic Config baked into the image
CVE-patch blast radius ~30% of fleet, all regions A handful, one canary region Rolling upgrade + canary replication
Rollback time Manual reimage, ~hours Automatic, in-flight OS-disk restore on policy breach
“What ran last Tuesday?” Script archaeology Image version number Immutable versioned artifact
Health meaning “Process is up” “Can reach the HSM” /ready semantics
Customer impact (last patch) Outage Zero The health gate held

Advantages and disadvantages

The baked-image-plus-health-gated-rollout model both removes the chronic IaaS pains and adds new moving parts you must operate. Weigh it honestly:

Advantages (why this model helps you) Disadvantages (why it bites)
Boot is fast and identical — config is baked, not run at boot A build pipeline is now infrastructure you must own and debug
Immutable, versioned artifact — rollback is “point at the prior version” More resources to govern: identity, template, gallery, replication
Health-gated rolling upgrade halts and rolls back automatically A bad health probe greenlights a broken batch — the probe is the whole game
Replication geography gives you a built-in canary lever Replication adds latency before a region can roll out
Flexible instances are real VMs — standard tooling just works Auto-OS upgrade on Flex is preview; constraints (no MaxSurge) apply
Automatic instance repair self-heals stuck instances A too-short repair grace period causes a repair loop
Spot Priority Mix cuts cost with a protected regular floor Spot eviction must be handled in-guest or you lose work
excludeFromLatest is a clean kill switch for a bad bake Pinning the image reference to a specific version, instead of the bare definition, disables auto-upgrade tracking

The model is right for fleets that need fast, deterministic boot, an audit-grade artifact, and safe rollout — regulated workloads, large web/app tiers, self-hosted runners. It is overkill for a single VM or a tiny static set you rarely change. The disadvantages are all operational discipline, not fundamental flaws: own the pipeline, design the probe honestly, size the grace periods to boot+warm-up, and handle eviction — then the model pays for itself the first time a bad patch halts itself in a canary region.

Hands-on lab

Stand up a Flexible scale set, attach an honest health probe, and watch a rolling change replace instances batch by batch — free-tier-conscious (small SKU, low count; delete at the end). Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-vmss-lab
LOC=eastus
az group create -n $RG -l $LOC -o table

Step 2 — Create a Flexible scale set, zonal + max FD spread.

az vmss create -g $RG -n vmss-lab \
  --orchestration-mode flexible \
  --zones 1 2 3 --platform-fault-domain-count 1 \
  --instance-count 3 --vm-sku Standard_B2s \
  --image Ubuntu2204 --upgrade-policy-mode Manual \
  --admin-username azureuser --generate-ssh-keys \
  --single-placement-group false -o table

Expected: a scale set with orchestrationMode: Flexible and three instances spread across zones.

Step 3 — Install a tiny health endpoint via custom data, then add the Application Health extension. Use a trivial /healthz on port 8080:

az vmss extension set -g $RG --vmss-name vmss-lab \
  --name ApplicationHealthLinux \
  --publisher Microsoft.ManagedServices --version 2.0 \
  --settings '{"protocol":"http","port":8080,"requestPath":"/healthz","intervalInSeconds":5,"numberOfProbes":1,"gracePeriod":300}'

az vmss update-instances -g $RG -n vmss-lab --instance-ids '*'

Note that the commands above only attach the extension. They do not install the endpoint, so something must already be listening on port 8080 and answering /healthz on each instance (for example, a service started from custom data at create time). Without it, the instances never report Healthy and the rolling change in Step 6 will not pass its gate.

Step 4 — Confirm every instance reports a health state, not just a power state.

az vmss get-instance-view -g $RG -n vmss-lab --query "statuses" -o table
# Look for HealthState/Healthy (or Unknown during grace), not only PowerState/running

Step 5 — Switch to Rolling and tighten the envelope.

az vmss update -g $RG -n vmss-lab \
  --set upgradePolicy.mode=Rolling \
  --max-batch-instance-percent 34 \
  --max-unhealthy-instance-percent 34 \
  --max-unhealthy-upgraded-instance-percent 34 \
  --pause-time-between-batches PT1M

Step 6 — Trigger a model change and watch the rolling replacement. Change the SKU to force instance replacement, then watch the rolling-upgrade result:

az vmss update -g $RG -n vmss-lab --set sku.name=Standard_B2ms
az vmss rolling-upgrade get-latest -g $RG -n vmss-lab -o jsonc
# runningStatus.code progresses; failedInstanceCount should stay 0

Expected: instances upgrade in batches of ~1 (34% of 3), pausing a minute between, with failedInstanceCount: 0.
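
When you script this check, it helps to reduce the status document to a single verdict. The sketch below assumes the JSON from `get-latest` exposes `runningStatus.code` and a `progress.failedInstanceCount` field. The first is named in the step above; the nesting of the second is an assumption to confirm against your own output. The function name and the sample inputs are hypothetical.

```python
def summarize_rolling_upgrade(status):
    """Reduce a rolling-upgrade status document to one triage verdict."""
    code = status.get("runningStatus", {}).get("code", "Unknown")
    failed = status.get("progress", {}).get("failedInstanceCount", 0)
    if code in ("Cancelled", "Faulted") or failed > 0:
        return f"investigate: status={code}, failed={failed}"
    if code == "Completed":
        return "done: all batches upgraded"
    return f"in progress: status={code}"


# Illustrative inputs, not captured output.
print(summarize_rolling_upgrade(
    {"runningStatus": {"code": "RollingForward"},
     "progress": {"failedInstanceCount": 0}}))
print(summarize_rolling_upgrade(
    {"runningStatus": {"code": "Cancelled"},
     "progress": {"failedInstanceCount": 1}}))
```

To feed it real data, save the command's `-o json` output and load it with `json.load`.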

Step 7 — Enable automatic instance repair.

az vmss update -g $RG -n vmss-lab \
  --enable-automatic-repairs true \
  --automatic-repairs-grace-period PT30M

Validation checklist. You created a real Flexible set, attached the required single health source, promoted it to Rolling only after health was green, and observed a health-gated batch replacement with zero failed instances. The steps mapped to what each proves:

Step What you did What it proves Real-world analogue
2 Flexible set, FD=1, zonal The shape is set once, correctly Every new production fleet
3 Add Application Health extension Flex needs the single health source Wiring the rollout gate
4 Read HealthState Health ≠ power state Pre-rollout green check
5 Promote to Rolling Tightened envelope before risk Production model changes
6 SKU change → batch replace Rolling upgrade actually batches + gates A real image/config rollout
7 Enable repair Continuous self-heal Day-2 operations

Cleanup (avoid lingering charges).

az group delete -n $RG --yes --no-wait

Cost note. Three B2s/B2ms instances for an hour is a few tens of rupees; deleting the resource group stops everything. Keep the count low and delete promptly.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest expanded with the full confirm path.

# Symptom Root cause Confirm (exact cmd / portal path) Fix
1 AIB build fails immediately, can’t publish Build identity lacks rights on gallery/staging RG az image builder show -g $RG -n <tmpl> --query lastRunStatus.message Grant the user-assigned identity image-version + disk actions scoped to the RGs
2 New version exists but a region’s instances don’t get it Version not finished replicating to that region az sig image-version show ... --query publishingProfile.targetRegions Add the region to targetRegions; wait for Succeeded
3 New version published but auto-OS upgrade never fires VMSS references a pinned version, not the definition az vmss show --query virtualMachineProfile.storageProfile.imageReference Reference the definition without a version segment
4 Rolling upgrade halts after first batch New instances fail the health probe az vmss rolling-upgrade get-latest -g $RG -n vmss-app Fix the app/baseline; demote bad version with excludeFromLatest; re-bake
5 Auto-OS upgrade silently does nothing on Flex Missing prerequisite (no health ext / version mismatch / MaxSurge set) az vmss show --query upgradePolicy; az vmss extension list Add/align Application Health ext; remove MaxSurge; set image to latest
6 Orchestration features won’t engage at all Two health sources (extension and LB probe) Inspect model: extension list + --query ...loadBalancerConfigurations Remove one source; keep exactly one
7 Instances churn/replace constantly outside any upgrade Repair grace period shorter than boot + warm-up az vmss get-instance-view + activity log repair events Raise --automatic-repairs-grace-period to cover warm-up
8 Probe shows Unhealthy though the app “works” Probe returns a non-200 2xx (redirect/204) or wrong port curl -i http://<instance>:<port><path> from a peer Return a clean 200; match port/path to the listener
9 Rolling upgrade cancels from unrelated maintenance maxUnhealthyInstancePercent counts all unhealthy az vmss rolling-upgrade get-latest (whole-set threshold) Resolve the unrelated unhealthy; resume; widen threshold if appropriate
10 Spot fleet loses too much capacity on a reclaim Regular floor too low / no eviction handling az vmss show --query ...spotRestorePolicy / priorityMixPolicy Raise regular-priority-count; handle Preempt Scheduled Events
11 Can’t change FD count / zones after create Both are set once at creation az vmss show --query "{fd:platformFaultDomainCount,zones:zones}" Recreate the set with the correct values
12 Build ships a broken image but reports Succeeded A failing customizer didn’t fail the build Inspect the customizer log; missing set -euo pipefail Add set -euo pipefail; assert post-conditions in the script

The expanded form for the entries that cause the longest outages:

1. AIB build fails immediately and can’t publish. Root cause: the user-assigned identity lacks the image-version/disk actions on the gallery and staging resource groups (or you scoped it to the wrong RG). Confirm: az image builder show -g $RG -n <template> --query lastRunStatus.message returns an authorization error naming the action. Fix: grant a custom role (or Contributor on the gallery/staging RG only) with galleries/images/versions/write, images/write, disks/write. Never subscription Contributor.

4. Rolling upgrade halts after the first batch. Root cause: the freshly-upgraded instances fail the Application Health probe — the new image/baseline broke a dependency (a tightened firewall rule, a missing package, a wrong port). Confirm: az vmss rolling-upgrade get-latest -g $RG -n vmss-app shows MaxUnhealthyUpgradedInstancePercentExceededInRollingUpgrade; the previous OS disks were restored. Fix: fix the app or baseline; demote the bad version with excludeFromLatest=true; re-bake as a new version. The halt is the system working — the canary held the blast radius.

5. Automatic OS upgrade silently does nothing on Flex. Root cause: a missing preview prerequisite — the image is pinned to a version instead of latest, the Application Health extension is absent or version-mismatched against the model, or MaxSurge is set (unsupported with auto-OS on Flex). Confirm: az vmss show --query upgradePolicy.automaticOSUpgradePolicy; az vmss extension list; check the image reference has no version. Fix: reference the definition without a version, add/align the health extension, remove MaxSurge, then re-enable.

6. Orchestration features won’t engage at all. Root cause: two health sources are configured, both an Application Health extension and a Load Balancer health probe. Confirm: the model shows an ApplicationHealth* extension and a networkProfile...loadBalancerConfigurations health probe. Fix: remove one; a scale set may have exactly one health source.

7. Instances churn outside any upgrade. Root cause: automatic instance repair grace period is shorter than boot plus app warm-up, so an instance is judged unhealthy and replaced before it ever becomes ready — a self-perpetuating loop. Confirm: az vmss get-instance-view shows repeated repairs; the activity log lists back-to-back repair operations. Fix: raise --automatic-repairs-grace-period (and the extension gracePeriod) to comfortably exceed boot + warm-up.

Best practices

Security notes

The security controls and what each buys you. Security and resilience pull in the same direction here:

Control Mechanism Secures against Also prevents
Least-privilege build identity Custom role on gallery/staging RG Pipeline compromise → subscription damage Accidental broad writes
No baked secrets Managed identity + Key Vault at boot Secret sprawl across every instance Rotation breaking a baked value
Trusted Launch Gen2 + Secure Boot + vTPM Boot-chain tampering, rootkits Unmeasured boot drift
Staging RG lockdown RBAC + VNet build Build-time exposure Orphaned build artifacts
Controlled script source Access-restricted blob Supply-chain injection Unknown scripts in the image
Monthly rebake version: latest source Unpatched CVEs in base Image rot / drift

Cost & sizing

The bill drivers, a rough monthly picture of each, and what each one buys:

Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out
D2as_v5 (baseline) | Three always-on instances | ~₹18,000–24,000 | The steady-state fleet | Don’t over-set min
Autoscale to 30 at peak | Extra instances during spikes | + per-hour at peak only | Demand headroom | max must clear real peak
Spot above a floor of 3 | Discounted interruptible capacity | −60–90% on above-floor | Cheap scale for batch | Eviction handling required
Gallery replicas (2 regions) | Stored image versions | ~₹1,000–3,000 | Fast multi-region create | More replicas = more cost
AIB build (per bake) | Build VM runtime + disk | a few ₹ per build | The golden image | Keep timeout realistic
Ephemeral OS disk | (no disk charge) | ₹0 disk | Faster reimage, lower cost | Data is lost on reimage

For deeper cost governance across many such fleets, see Azure FinOps & Cost Management at Scale.

Interview & exam questions

1. Uniform vs Flexible orchestration — when do you pick each, and what’s irreversible? Flexible is the default: instances are real Microsoft.Compute/virtualMachines resources, so standard tooling and per-instance ops work, and it unlocks mixed sizes and Spot Priority Mix. Pick Uniform only for very large homogeneous fleets or Service Fabric. The orchestration mode (and platformFaultDomainCount, and zones) is set at creation and cannot be changed — recreating the set is the only way to change it.

2. What does platformFaultDomainCount=1 mean and why is it the recommended default? It requests max spreading — Azure distributes instances across as many fault domains as the region allows, best-effort. It gives the broadest availability and avoids allocation failures in constrained regions. Combine it with Availability Zones for the strongest posture; use a fixed 2–3 only for quorum systems that need a known FD count.

3. Why scope the AIB build identity tightly, and to what? A build identity that can publish images can, if over-privileged, be a subscription-wide compromise vector. Grant a custom role (or Contributor on just the gallery and staging RGs) with the image-version and disk actions AIB needs — galleries/images/versions/write, images/write, disks/write — never subscription Contributor.

4. How does replication gate a regional rollout, and how do you exploit it? A scale set in a region only sees a new image version once it has finished replicating there. You exploit this as a canary lever: replicate a new version to one region first, prove it with a health-gated rolling upgrade, then add the remaining regions to targetRegions to fan out.

5. Distinguish upgrade-policy mode from automatic OS image upgrade. The mode (Manual/Automatic/Rolling) governs what happens to existing instances when you change the scale set model. Automatic OS image upgrade governs what happens when a new image version appears, and it always uses the rolling-upgrade policy regardless of mode. They are configured independently and are constantly conflated.

6. What are the three thresholds in a rolling upgrade policy and which one actually triggers rollback? maxBatchInstancePercent caps batch size; maxUnhealthyInstancePercent halts if too much of the whole set is unhealthy; maxUnhealthyUpgradedInstancePercent cancels and rolls back if too many already-upgraded instances go unhealthy. The last one is the true rollback trigger for a bad bake — keep it tight.

7. Why is the Application Health extension required on Flex, and what’s the “exactly one health source” rule? On Flexible orchestration there is no load-balancer-probe fallback, so the Application Health extension is the required signal that gates rolling upgrades and instance repair. A scale set may have exactly one health source; configuring both the extension and an LB health probe disables orchestration features until you remove one.

8. How should a health probe be designed, and what’s the classic mistake? It must return a clean 200 only when the instance can do real work (e.g. reach its required downstream), not merely when the process is up. The classic mistake is a shallow “process alive” probe that greenlights a broken batch — the payments scenario’s /ready returning 200 only after a successful HSM connection is the correct shape.

9. What causes an automatic instance repair loop and how do you stop it? A repair grace period shorter than boot + warm-up: the instance is judged unhealthy and replaced before it ever becomes ready, repeatedly. Stop it by raising --automatic-repairs-grace-period (and the extension gracePeriod) to comfortably exceed warm-up.

10. Explain Spot Priority Mix and when it’s appropriate. It runs a guaranteed floor of regular VMs (regular-priority-count, never evicted) plus a percentage of regular instances above that floor, with the rest as Spot — evicted on capacity reclaim. It’s appropriate for interruptible work (batch, CI, stateless processors) where the regular floor protects core capacity; keep Spot out of a tier-1 synchronous path or set the floor to carry full load.
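The split between regular and Spot capacity can be worked through with a short Python sketch. It models the floor-plus-percentage rule described above; `priority_split` is a hypothetical helper, and rounding the regular share down is an assumption about a detail the text does not specify.

```python
def priority_split(capacity: int, base_regular: int,
                   regular_pct_above_base: int) -> tuple[int, int]:
    """Return (regular, spot) instance counts for a given total capacity."""
    # The floor is filled with regular VMs first and is never Spot.
    regular_floor = min(capacity, base_regular)
    above_floor = capacity - regular_floor
    # Assumption: the regular share above the floor rounds down.
    regular_above = above_floor * regular_pct_above_base // 100
    regular = regular_floor + regular_above
    return regular, capacity - regular
```

With a floor of 10 and 50% regular above it, a capacity of 20 gives 15 regular and 5 Spot instances. Setting the floor at or above the capacity needed for full load is what keeps an eviction wave from touching a tier-1 path.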

11. Why reference the gallery definition without a version, and what’s excludeFromLatest for? Referencing the definition without a version makes the set track latest, which is exactly what automatic OS upgrade keys off. excludeFromLatest removes a version from latest resolution — the kill switch that stops auto-upgrade from rolling out a bad bake, without deleting the version.
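How latest resolution, regional replication, and excludeFromLatest combine can be modeled in Python. This is a simplified sketch, not the gallery's implementation: `ImageVersion` and `resolve_latest` are hypothetical names, and it treats exclusion as a single global flag.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ImageVersion:
    name: str                      # "major.minor.patch"
    replicated_regions: frozenset  # regions where replication has finished
    exclude_from_latest: bool = False


def version_key(name: str) -> tuple:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in name.split("."))


def resolve_latest(versions: Iterable[ImageVersion], region: str) -> Optional[str]:
    """Highest version visible in `region` that is not excluded from latest."""
    candidates = [
        v for v in versions
        if not v.exclude_from_latest and region in v.replicated_regions
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: version_key(v.name)).name
```

In this model a version replicated only to eastus is invisible to a set in westus3, which is the canary lever, and flipping `exclude_from_latest` on a bad version makes resolution fall back to the previous good one without deleting anything.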

12. Why is version: latest on the AIB source useful? Because latest is resolved at build time, the same template rerun monthly always bakes on top of the newest patched base image — so a scheduled rebake keeps the golden image current instead of rotting on an old base.

These map to AZ-104 (Administrator: deploy and manage Azure compute resources, VM Scale Sets, scaling, and images) and AZ-305 (Solutions Architect Expert: design infrastructure solutions, compute resilience, and update strategy). The image-supply-chain and identity angles touch AZ-500. A compact cert mapping for revision:

Question theme | Primary cert | Exam objective area
Orchestration mode, FD/zones | AZ-104 | Deploy & manage VM Scale Sets
AIB + Compute Gallery pipeline | AZ-104 / AZ-305 | Manage images; design compute
Rolling upgrade & health gate | AZ-305 | Design for resilience & updates
Autoscale & predictive scaling | AZ-104 | Configure scaling
Spot Priority Mix & cost | AZ-305 | Cost-optimized compute design
Build identity & Trusted Launch | AZ-500 | Secure compute & supply chain

Quick check

  1. You created a Flexible scale set with platformFaultDomainCount=2 and now want max spreading. What’s the only way to change it, and why?
  2. A new gallery image version exists, but instances in westus3 aren’t picking it up while eastus did. What’s the most likely cause and how do you confirm?
  3. True or false: setting upgradePolicy.mode=Rolling is what makes a new image version roll out automatically.
  4. Your rolling upgrade halted with MaxUnhealthyUpgradedInstancePercentExceededInRollingUpgrade after the first batch. Is this a failure of the system, and what do you do with the bad version?
  5. Instances are being replaced every few minutes even though no upgrade is running. Name the most likely cause and the fix.

Answers

  1. Recreate the set. platformFaultDomainCount (like orchestration mode and zones) is fixed at creation and cannot be changed on an existing scale set; you must deploy a new set with platformFaultDomainCount=1.
  2. The version hasn’t finished replicating to westus3 (or westus3 isn’t in targetRegions). Confirm with az sig image-version show ... --query publishingProfile.targetRegions and check each region’s provisioningState/regionalReplicaCount; add the region and wait for Succeeded.
  3. False. mode=Rolling governs what happens when you change the model. A new image version rolling out automatically is automatic OS image upgrade (enableAutomaticOSUpgrade=true), which uses the rolling policy but is a separate switch, and requires the definition referenced without a version.
  4. No — it’s the system working as designed. The health gate caught a bad bake, rolled the failed instances back to their previous OS disk, and halted with a tiny blast radius. Demote the bad version with excludeFromLatest=true (don’t delete it), fix the root cause, and ship a new version.
  5. The automatic-instance-repair grace period is shorter than boot + warm-up, so instances are judged unhealthy and replaced before they become ready — a repair loop. Raise --automatic-repairs-grace-period (and the extension gracePeriod) to exceed warm-up.

Glossary

Next steps

You can now run an immutably-versioned, self-healing IaaS fleet with safe rollouts. Build outward:

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
