
Azure DevOps Scale Set Agents: Ephemeral Pools, Autoscaling, and Pipeline Hardening

A standing pool of Microsoft-hosted or hand-built VM agents is a quiet liability. Hosted agents give you no network control and a fixed toolchain; persistent self-hosted agents accrue state between jobs, sit idle and billed overnight, and let one poisoned pull request contaminate the next build. Azure DevOps scale set agents sit in the middle: you own the VM Scale Set (VMSS) and its image, network, and identity, but Azure DevOps manages the agent lifecycle - provisioning VMs on demand, recycling each one after a job, and scaling the fleet to match queue depth. This guide builds that pool end to end and hardens it against the failure modes that matter at principal scale.

1. Why ephemeral agents beat persistent ones

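Three problems with long-lived agents are worth naming precisely, because each maps to a concrete control later in this guide: state that accrues between jobs, capacity that sits idle and billed overnight, and a poisoned pull request that contaminates the next build.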

The trade-off is cold-start latency - provisioning a VM and starting the agent takes longer than dispatching to a warm one - and the discipline of baking everything into an image instead of apt install-ing it in the pipeline. Both are solvable, and the rest of this guide does so.

Scale set agents are not the same as running the agent container on AKS. There is no Kubernetes here. Azure DevOps talks directly to the VMSS resource and manages instance count and recycling for you. That makes this the right pattern when builds need a full VM (nested virtualization, Docker-in-Docker, kernel modules) rather than a pod.

2. Provision the VM Scale Set

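The VMSS for a scale set agent pool has specific requirements that trip people up: overprovisioning off, a manual upgrade policy, no load balancer, and no Azure autoscale rules of its own, because Azure DevOps owns the instance count.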

Create a resource group, a network, and the scale set. Start from a Microsoft-published Ubuntu image; we replace it with a custom image in step 5.

RG=rg-ado-agents
LOC=eastus2
VMSS=vmss-ado-linux

az group create --name "$RG" --location "$LOC"

# Dedicated VNet/subnet for build traffic (locked down in step 6)
az network vnet create \
 --resource-group "$RG" --name vnet-ado \
 --address-prefix 10.40.0.0/16 \
 --subnet-name snet-agents --subnet-prefix 10.40.1.0/24

az vmss create \
 --resource-group "$RG" --name "$VMSS" \
 --image Ubuntu2204 \
 --vm-sku Standard_D4ds_v5 \
 --storage-sku Premium_LRS \
 --vnet-name vnet-ado --subnet snet-agents \
 --instance-count 0 \
 --upgrade-policy-mode manual \
 --load-balancer "" \
 --public-ip-address "" \
 --orchestration-mode Uniform \
 --disable-overprovision \
 --assign-identity '[system]' \
 --admin-username azureuser \
 --generate-ssh-keys

A few choices are deliberate. --disable-overprovision stops Azure from creating surplus instances, so every VM that Azure DevOps requests maps to exactly one agent. An empty --public-ip-address value gives the instances no public IP - egress is controlled in step 6. --assign-identity '[system]' attaches a system-assigned managed identity so jobs can reach Azure resources without static secrets. Premium_LRS matters: build IO is the common bottleneck, and ephemeral OS disks or premium SSD shave minutes off image-heavy jobs.

Use a dedicated VMSS per pool. Do not point the agent pool at a scale set you also use for other workloads - Azure DevOps will resize and recycle instances out from under anything else running there.
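A quick preflight can catch a misconfigured scale set before the pool is registered. The sketch below is a hypothetical helper (vmss_preflight is not part of the Azure CLI) that checks two of the values set above; the --query expression in the usage comment is an assumption about the output shape, so verify it against your CLI version.

```shell
# Hypothetical helper: checks two VMSS properties passed as arguments.
# Values are lowercased first because CLI output casing varies by format.
vmss_preflight() (
  overprovision=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  upgrade_mode=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  rc=0
  if [ "$overprovision" != "false" ]; then
    echo "overprovision should be false, got: $1" >&2
    rc=1
  fi
  if [ "$upgrade_mode" != "manual" ]; then
    echo "upgrade policy should be manual, got: $2" >&2
    rc=1
  fi
  exit "$rc"
)

# Usage (assumed query shape, not verified here):
#   vmss_preflight $(az vmss show -g "$RG" -n "$VMSS" \
#     --query "[overprovision, upgradePolicy.mode]" -o tsv)
```

The function body runs in a subshell, so its variables do not leak into the calling script and a non-zero exit status can gate the registration step.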

3. Register the scale set as an elastic agent pool

The pool is created against the organization, then shared into projects. You can do this in the portal (Organization settings -> Agent pools -> Add pool -> Azure virtual machine scale set) or via the REST API. The REST shape is the source of truth and worth showing, because it exposes every autoscale knob you will tune in step 4.

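First, the identity behind the ARM service connection needs the Owner or Contributor role on the VMSS so Azure DevOps can manage instances. A PAT only authorizes the REST call to Azure DevOps; it carries no Azure role of its own. Then create the elastic pool: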

ORG=https://dev.azure.com/contoso
SUB=00000000-0000-0000-0000-000000000000

# Unquoted heredoc delimiter so ${SUB} expands into the resource ID.
cat > pool.json <<JSON
{
  "serviceEndpointId": "11111111-1111-1111-1111-111111111111",
  "serviceEndpointScope": "22222222-2222-2222-2222-222222222222",
  "azureId": "/subscriptions/${SUB}/resourceGroups/rg-ado-agents/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-ado-linux",
  "maxCapacity": 20,
  "desiredIdle": 1,
  "recycleAfterEachUse": true,
  "maxSavedNodeCount": 0,
  "timeToLiveMinutes": 30,
  "agentInteractiveUI": false,
  "osType": "linux"
}
JSON

# serviceEndpointId/Scope refer to an ARM service connection in the project.
# poolName is a query-string parameter of the elastic pools endpoint.
az devops invoke \
 --area distributedtask --resource elasticpools \
 --http-method POST --api-version 7.1 \
 --query-parameters poolName=ado-linux-ephemeral \
 --in-file pool.json --org "$ORG"

The fields that govern behaviour:

| Field | Meaning | Hardening note |
|---|---|---|
| recycleAfterEachUse | Destroy the VM after one job | Always true for untrusted code. This is what makes the pool ephemeral. |
| desiredIdle | Warm VMs kept ready for the next job | Higher = lower queue time, higher idle cost. |
| maxCapacity | Hard ceiling on instances | Caps blast radius and spend. |
| timeToLiveMinutes | Idle VM lifetime before scale-down | Lower trims cost; too low causes thrash. |
| maxSavedNodeCount | Failed VMs kept for debugging | Keep 0 in prod so nothing lingers with build state. |

recycleAfterEachUse: true plus maxSavedNodeCount: 0 is the core security posture: every job gets a brand-new VM, and nothing survives it.
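That posture is easy to lose in a later edit of the pool definition, so it may be worth linting the file before it is posted. The helper below is a hypothetical sketch: it is grep-based rather than a real JSON parser, and it assumes each field sits on its own line as in pool.json above.

```shell
# Hypothetical lint: fail unless the pool definition keeps the ephemeral posture.
# grep-based on purpose (no JSON parser assumed); expects one field per line.
lint_pool_json() (
  file=$1
  rc=0
  grep -Eq '"recycleAfterEachUse"[[:space:]]*:[[:space:]]*true' "$file" || {
    echo "recycleAfterEachUse must be true" >&2
    rc=1
  }
  grep -Eq '"maxSavedNodeCount"[[:space:]]*:[[:space:]]*0[[:space:]]*(,|\}|$)' "$file" || {
    echo "maxSavedNodeCount must be 0" >&2
    rc=1
  }
  exit "$rc"
)

# Usage: lint_pool_json pool.json && echo "posture ok"
```

Running it as a gate in whatever script posts pool.json keeps a casual "save one node for debugging" change from reaching production unnoticed.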

4. Tune autoscale: sizing, min/max, sampling, scale-down

Azure DevOps samples the pool every five minutes and reconciles instance count toward demand. Your levers are the pool fields above; the art is balancing queue time against idle spend.
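A back-of-the-envelope model helps when reasoning about those levers. The sketch below assumes the service aims for busy + queued + desiredIdle instances, capped at maxCapacity; that is a planning simplification, not the documented algorithm, and both function names are hypothetical.

```shell
# Simplified planning model (an assumption, not Azure DevOps's actual algorithm):
# target = busy + queued + desiredIdle, capped at maxCapacity.
target_size() (
  busy=$1
  queued=$2
  desired_idle=$3
  max_capacity=$4
  want=$((busy + queued + desired_idle))
  if [ "$want" -gt "$max_capacity" ]; then
    want=$max_capacity
  fi
  echo "$want"
)

# VM-hours per day spent on the warm floor if desiredIdle VMs never scale in.
idle_vm_hours_per_day() (
  echo $(( $1 * 24 ))
)

# Usage:
#   target_size 3 2 1 20        # 3 busy, 2 queued, desiredIdle 1, maxCapacity 20
#   idle_vm_hours_per_day 3     # warm floor of 3 VMs
```

Multiplying the idle VM-hours by your SKU's hourly rate gives the price of each extra warm agent, which is the number to weigh against measured queue time.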

Update an existing pool’s sizing without recreating it:

# Look up the elastic pool id first, then PATCH its sizing.
POOL_ID=$(az devops invoke \
 --area distributedtask --resource elasticpools \
 --http-method GET --api-version 7.1 --org "$ORG" \
 --query "value[?azureId.ends_with(@,'vmss-ado-linux')].poolId | [0]" -o tsv)

cat > resize.json <<'JSON'
{ "maxCapacity": 30, "desiredIdle": 3, "timeToLiveMinutes": 20 }
JSON

az devops invoke \
 --area distributedtask --resource elasticpools \
 --http-method PATCH --api-version 7.1 \
 --route-parameters poolId="$POOL_ID" \
 --in-file resize.json --org "$ORG"

Watch for the provisioning-failure feedback loop. If the image or extension is broken, Azure DevOps provisions a VM, the agent never comes online, the VM is recycled, and it tries again - burning money while every job sits queued. Alert on a non-zero “offline agents” count for the pool (the enterprise scenario below shows one such alert) so a bad image rollout pages you instead of draining the budget overnight.
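A single offline agent can be a transient blip, so an alert usually wants a streak rather than one sample. The sketch below is a hypothetical helper that reads one offline-agent count per sampling interval on stdin, oldest first, and signals when the most recent samples are all non-zero. Where the counts come from is left open; that part depends on your monitoring setup.

```shell
# Returns 0 (page someone) when the last N samples all show offline agents.
# Reads one offline-agent count per line on stdin, oldest first.
should_page() (
  needed=$1
  streak=0
  while read -r count; do
    if [ "$count" -gt 0 ]; then
      streak=$((streak + 1))
    else
      streak=0   # a healthy sample resets the streak
    fi
  done
  [ "$streak" -ge "$needed" ]
)

# Usage: three consecutive bad samples trigger the page.
#   printf '0\n2\n3\n4\n' | should_page 3 && echo "page on-call"
```

With a five-minute sample, a streak of three means roughly fifteen minutes of failed provisioning before anyone is paged, which bounds the spend a broken image can cause.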

5. Bake a hardened agent image with Packer

Installing toolchains in the pipeline (apt install, nvm install, …) defeats the purpose: it is slow, non-reproducible, and reaches out to the internet from inside an untrusted job. Bake a custom managed image with everything pre-installed and use it as the VMSS source. Packer’s azure-arm builder produces a managed image (or an Azure Compute Gallery version, formerly Shared Image Gallery) you can pin by version.

# agent-image.pkr.hcl
packer {
  required_plugins {
    azure = {
      source  = "github.com/hashicorp/azure"
      version = ">= 2.0.0"
    }
  }
}

source "azure-arm" "ubuntu_agent" {
  use_azure_cli_auth = true

  managed_image_resource_group_name = "rg-ado-images"
  managed_image_name                = "ado-agent-ubuntu2204-{{timestamp}}"

  os_type         = "Linux"
  image_publisher = "canonical"
  image_offer     = "0001-com-ubuntu-server-jammy"
  image_sku       = "22_04-lts-gen2"

  location = "eastus2"
  vm_size  = "Standard_D4ds_v5"
}

build {
  sources = ["source.azure-arm.ubuntu_agent"]

  provisioner "shell" {
    # Run under bash: "set -o pipefail" is not supported by Ubuntu's /bin/sh (dash).
    inline_shebang = "/bin/bash -e"
    inline = [
      "set -euxo pipefail",
      "sudo apt-get update",
      "sudo apt-get install -y --no-install-recommends git curl jq unzip ca-certificates",
      # Pin toolchains to a major version; pin exact versions for strict reproducibility
      "curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -",
      "sudo apt-get install -y nodejs=20.*",
      "curl -sSL https://aka.ms/InstallAzureCLIDeb | sudo bash",
      # Trivy for in-image scanning of build outputs
      "curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sudo sh -s -- -b /usr/local/bin",
      # CIS-style hardening: disable root SSH login
      "sudo sed -i 's/^#\\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config",
      "sudo apt-get -y autoremove && sudo apt-get -y clean"
    ]
  }

  # MANDATORY final step for Azure managed images: deprovision with the Azure Linux agent.
  provisioner "shell" {
    execute_command = "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'"
    inline          = ["/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync"]
    inline_shebang  = "/bin/sh -x"
  }
}

The final waagent -force -deprovision+user step is non-negotiable for Azure images - it strips the user, SSH host keys, and machine-specific state so every VMSS instance comes up clean. Skipping it bakes one machine’s identity into every agent.

Point the VMSS at the new image and let Azure DevOps roll it out by recycling instances:

IMG=$(az image show -g rg-ado-images -n ado-agent-ubuntu2204-1717000000 --query id -o tsv)

az vmss update \
 --resource-group "$RG" --name "$VMSS" \
 --set virtualMachineProfile.storageProfile.imageReference.id="$IMG"

Do not pre-install the Azure DevOps agent into the image. The scale set agent feature injects and configures the agent via a VM extension at provision time, so it always matches your organization. Bake the tools, not the agent.
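Because each Packer build stamps the image name with a timestamp suffix, choosing the image to roll out (or roll back to) can be scripted instead of copied by hand. The helpers below are a hypothetical sketch that assumes the ado-agent-ubuntu2204-<timestamp> naming used above, with image names arriving one per line on stdin.

```shell
# Picks the image name with the highest numeric suffix after the last "-".
latest_image() {
  awk -F- '{ print $NF "\t" $0 }' | sort -n | tail -n 1 | cut -f2
}

# Picks the second-newest image, useful as a rollback target.
# Assumes at least two image names are supplied.
previous_image() {
  awk -F- '{ print $NF "\t" $0 }' | sort -n | tail -n 2 | head -n 1 | cut -f2
}

# Usage (assumed query shape):
#   az image list -g rg-ado-images --query "[].name" -o tsv | latest_image
```

Keeping the previous image around gives a fast exit from a bad rollout: point the VMSS back at it and let recycling replace the broken instances.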

6. Network isolation and egress control

Build traffic is a top exfiltration path: a compromised dependency in a PR can phone home or pull a second-stage payload. Because the VMSS lives in your VNet, you control its egress - something hosted agents cannot offer.

The agent needs outbound HTTPS to Azure DevOps and your package feeds, and nothing else. Lock egress with an NSG that denies general internet and a route table that forces traffic through Azure Firewall (or another egress proxy that supports FQDN filtering; a NAT gateway alone supplies outbound addresses, not filtering).

# Deny-by-default egress, then allow only what builds legitimately need.
az network nsg create -g "$RG" -n nsg-agents

az network nsg rule create -g "$RG" --nsg-name nsg-agents \
 --name allow-azure-devops --priority 100 --direction Outbound \
 --access Allow --protocol Tcp --destination-port-ranges 443 \
 --destination-address-prefixes AzureDevOps

az network nsg rule create -g "$RG" --nsg-name nsg-agents \
 --name deny-all-egress --priority 4096 --direction Outbound \
 --access Deny --protocol "*" --destination-port-ranges "*" \
 --destination-address-prefixes Internet

az network vnet subnet update -g "$RG" --vnet-name vnet-ado \
 --name snet-agents --network-security-group nsg-agents

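AzureDevOps is a service tag that Microsoft keeps current with the platform’s IP ranges, so you do not hand-maintain a CIDR list. Microsoft’s documentation describes the tag mainly in terms of connections that originate from Azure DevOps, so confirm it covers agent egress in your environment; if it does not, allow the documented Azure DevOps domains through your firewall’s FQDN rules instead. For package registries, prefer Azure Artifacts upstream sources reached over a Private Endpoint, or add explicit firewall FQDN rules for *.nuget.org / registry.npmjs.org rather than opening the internet. For Azure resources the build touches (storage, Key Vault, ACR), use Private Endpoints into the same VNet so that traffic never leaves the Microsoft backbone.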

If you self-host the agent behind a deny-all egress, allow the service tag, not a static IP list. Azure DevOps rotates ranges; a hardcoded allow-list will eventually black-hole your whole fleet at 2 a.m.
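Egress rules drift over time, so it can help to audit them mechanically. The sketch below is a hypothetical check that reads outbound rules as "priority access destination" lines and fails if there is no Deny for Internet, if any Allow targets Internet or *, or if an Allow is numbered after the deny. The az query in the usage comment is an assumption about the output shape.

```shell
# Reads "priority access destination" per line (whitespace-separated) on stdin.
# Exit status 0 means the outbound rule set looks deny-by-default.
audit_egress() {
  awk '
    $2 == "Deny" && $3 == "Internet" {
      if (deny == 0 || $1 + 0 < deny) deny = $1 + 0
    }
    $2 == "Allow" {
      if ($3 == "Internet" || $3 == "*") {
        print "broad allow rule at priority " $1
        fail = 1
      }
      if ($1 + 0 > maxallow) maxallow = $1 + 0
    }
    END {
      if (deny == 0) {
        print "no Deny rule for Internet"
        fail = 1
      } else if (maxallow > deny) {
        print "allow rule at priority " maxallow " is numbered after the deny"
        fail = 1
      }
      exit fail + 0
    }'
}

# Usage (assumed query shape):
#   az network nsg rule list -g "$RG" --nsg-name nsg-agents \
#     --query "[?direction=='Outbound'].[priority, access, destinationAddressPrefix]" \
#     -o tsv | audit_egress
```

NSG rules are evaluated from the lowest priority number upward, which is why the deny sits at the highest number and every allow must come before it.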

7. Pipeline permissions, approvals, and checks on the pool

An ephemeral pool with open access is still dangerous: any pipeline that can target it can run arbitrary code on a VM with a managed identity inside your VNet. Treat the agent pool as a protected resource and gate it the same way you gate an Environment.

First, turn off open access so a new pipeline cannot silently use the pool. In Project settings -> Agent pools -> {pool} -> Security, remove “Open access” and grant only the specific pipelines that should run here. Then attach approvals and checks to the pool itself, so they evaluate before a VM is provisioned.

Branch control as a check is the highest-leverage gate, because it stops untrusted PR branches from ever reaching the pool:

# Conceptual: a Branch control check on the agent pool resource,
# configured under pool Security -> Checks (not in pipeline YAML).
allowedBranches: "refs/heads/main,refs/heads/release/*"
ensureProtectionOfBranch: true # require branch protection policies to exist

In the pipeline that consumes the pool, reference it by name and pin the demands so a job lands only on agents that advertise the right capabilities:

pool:
  name: ado-linux-ephemeral
  demands:
    - Agent.OS -equals Linux
    - docker # capability published by the image
Checks on a pool run before the agent is acquired, exactly like checks on an Environment. A blocked approval or a branch that fails branch control costs you zero agent minutes - the VM is never provisioned.
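To build intuition for which refs an allow-list such as refs/heads/main,refs/heads/release/* admits, the matching can be modelled locally. The function below is a hypothetical sketch using shell glob matching; it approximates the idea of branch control and is not the service's own implementation.

```shell
# Returns 0 if REF matches any comma-separated pattern in PATTERNS.
# "*" matches any run of characters, including "/".
branch_allowed() (
  ref=$1
  patterns=$2
  set -f     # keep "*" in patterns from expanding against local files
  IFS=,
  for p in $patterns; do
    case $ref in
      $p) exit 0 ;;
    esac
  done
  exit 1
)

# Usage:
#   branch_allowed "refs/heads/release/2024.06" "refs/heads/main,refs/heads/release/*"
```

Note what falls outside the list: pull request merge refs and feature branches never match, which is exactly the property that keeps untrusted code off the pool.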

8. Per-job credential scoping and log hygiene

Ephemeral VMs remove cross-job credential carry-over, but a job can still leak a secret into its own logs or pull more privilege than it needs. Close both gaps.

Use Workload Identity Federation, not stored keys. Configure your ARM service connection for OIDC so the pipeline mints a short-lived federated token instead of holding a client secret. Combined with the VMSS managed identity, builds reach Azure with credentials that expire in minutes and never sit on disk.

steps:
  - task: AzureCLI@2
    inputs:
      azureSubscription: sc-workload-identity # OIDC-based service connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        # Token is federated and short-lived; no secret is stored anywhere.
        az storage blob upload --account-name builddrops \
          --container-name artifacts --file ./out/app.zip --auth-mode login

Scope variable groups per stage. Do not pour every secret into one group attached to the whole pipeline. Link a Key Vault-backed variable group only to the stage that needs it, so a build job cannot read deploy-time production secrets.

Mask aggressively and fail on leaks. Secret variables are masked in logs automatically, but values that arrive from a script, a file, or string interpolation are not. Register them explicitly and avoid echoing environment dumps:

steps:
  - script: |
      TOKEN=$(./scripts/mint-token.sh)
      echo "##vso[task.setsecret]$TOKEN" # mask before any later use
      ./deploy --token "$TOKEN"
    displayName: Deploy with masked token

##vso[task.setsecret] masks a value computed at runtime. Reach for it whenever a secret originates inside the job (a generated token, a decrypted file) rather than from a pre-declared secret variable, because only declared secrets are masked for you.
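Masking covers what the agent prints; failing on leaks means checking output the job captures itself before it is published. The helper below is a hypothetical sketch that scans text on stdin for a literal secret value. It matches inside the shell rather than passing the secret to an external command, and it will not catch encoded or transformed copies such as base64.

```shell
# Reads captured output on stdin; returns non-zero if the literal secret appears.
# Matching happens in-shell so the secret never lands in another process's argv.
log_is_clean() (
  secret=$1
  [ -n "$secret" ] || exit 0   # nothing to look for
  log=$(cat)
  case $log in
    *"$secret"*)
      echo "secret value found in captured output" >&2
      exit 1
      ;;
  esac
  exit 0
)

# Usage (deploy.log is a hypothetical file the job wrote itself):
#   log_is_clean "$TOKEN" < deploy.log || exit 1
```

Run it on any file the job attaches as an artifact, since attached files bypass the log masking that applies to console output.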

Verify

Confirm the pool is genuinely ephemeral, isolated, and gated before you route real traffic to it.

# 1. Pool is registered and elastic. desiredIdle/maxCapacity should match step 4.
az devops invoke --area distributedtask --resource elasticpools \
 --http-method GET --api-version 7.1 --org "$ORG" \
 --query "value[?azureId.ends_with(@,'vmss-ado-linux')].{idle:desiredIdle,max:maxCapacity,recycle:recycleAfterEachUse}"

# 2. VMSS has no public IPs and no autoscale rules of its own.
az vmss list-instance-public-ips -g "$RG" -n "$VMSS" -o table # expect empty
az monitor autoscale list -g "$RG" -o table # expect empty

# 3. Egress is deny-by-default with only AzureDevOps allowed.
az network nsg rule list -g "$RG" --nsg-name nsg-agents \
 --query "[?direction=='Outbound'].{name:name,access:access,dst:destinationAddressPrefix}" -o table

Then run a canary pipeline and check three behaviours:

  1. Ephemerality. Run a job that writes a marker file to $HOME, then run a second job. The marker must be gone - different VM, clean room.
  2. Scale-down. After a burst, watch instance count fall back toward desiredIdle within timeToLiveMinutes: az vmss list-instances -g "$RG" -n "$VMSS" -o table.
  3. Gating. Push the pipeline from a non-allowed branch. The branch-control check must block it before a VM is provisioned (instance count stays flat).

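For queue-time and cost monitoring, the pipeline analytics view exposes wait time, and Azure Monitor metrics on the VMSS give you instance-count over time. A KQL query against your cost export surfaces idle spend; the table and column names below assume the export is ingested under these names, so adjust them to your schema:

// Daily compute cost attributed to the agent VMSS, last 30 days.
// Usage stands in for whatever table holds your ingested cost export.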
Usage
| where TimeGenerated > ago(30d)
| where ResourceId has "virtualMachineScaleSets/vmss-ado-linux"
| summarize CostUSD = sum(CostInBillingCurrency) by bin(TimeGenerated, 1d)
| order by TimeGenerated asc

Enterprise scenario

A payments platform team ran a 24-agent persistent Linux pool for a monorepo of ~140 services. Two constraints collided. First, PCI segmentation: their auditor flagged that build agents with broad outbound internet access sat one hop from the cardholder-data environment, and any compromised npm or NuGet dependency could exfiltrate or pivot. Second, cost: the pool was sized for the 9-11 a.m. merge storm but billed flat around the clock, running near 22% average utilization.

They moved to a scale set pool with recycleAfterEachUse: true, desiredIdle: 4 (their measured merge-storm floor), and maxCapacity: 40. The VMSS went into a dedicated, internet-egress-denied subnet; the only allowed outbound was the AzureDevOps service tag plus a Private Endpoint to Azure Artifacts upstream feeds, which proxied and cached the public registries. That single change satisfied the segmentation finding - builds could no longer reach the open internet - while the cached feeds actually sped up restores.

The subtle bug they hit: with desiredIdle: 0 (their first attempt at maximum savings), the merge storm produced 3-4 minute queue times because every one of the dozen simultaneous builds cold-started a VM. Worse, a flaky image-extension rollout one night entered a provision-fail-recycle loop and quietly burned a weekend of compute. The fix was twofold - raise desiredIdle to cover the concurrent floor, and alert on offline agents so a bad image pages someone:

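// Alert: agents that provisioned but never came online (broken image/extension).
// Table, category, and column names are illustrative; they depend on how agent status reaches your workspace.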
AzureDiagnostics
| where Category == "ScaleSetAgentPool"
| where Status_s == "offline"
| summarize OfflineAgents = dcount(AgentName_s) by bin(TimeGenerated, 5m), PoolName_s
| where OfflineAgents > 2

Net result: a 34% drop in monthly CI compute spend, a clean-room VM per job that closed the cross-build contamination finding, and an egress posture the auditor signed off on - all without changing a line of pipeline YAML beyond the pool: reference.

Hardening checklist
