A standing pool of Microsoft-hosted or hand-built VM agents is a quiet liability. Hosted agents give you no network control and a fixed toolchain; persistent self-hosted agents accrue state between jobs, sit idle and billed overnight, and let one poisoned pull request contaminate the next build. Azure DevOps scale set agents sit in the middle: you own the VM Scale Set (VMSS) and its image, network, and identity, but Azure DevOps manages the agent lifecycle - provisioning VMs on demand, recycling each one after a job, and scaling the fleet to match queue depth. This guide builds that pool end to end and hardens it against the failure modes that matter at principal scale.
1. Why ephemeral agents beat persistent ones
Three problems with long-lived agents are worth naming precisely, because each maps to a concrete control later in this guide.
- Cross-job contamination. A persistent agent carries dependency caches, cloned repos, leftover credentials, and Docker layers from job to job. A malicious or buggy build can plant a backdoor in ~/.npmrc, a Git hook, or a cached base image that the next job silently inherits. A fresh VM per job is a clean room - Azure DevOps tears the VM down after the job and provisions a new one from the image, so there is no carry-over to poison.
- The cost of idle. Persistent fleets are sized for peak and billed 24/7. Most CI fleets run at 15-30% utilization, so 70-85% of the spend buys capacity that does nothing. A scale set pool scales down to a configurable floor (often zero) overnight.
- Drift. Hand-patched agents diverge from each other and from “what we think is installed.” When the image is the only durable artifact and VMs are disposable, “what ran my build” is answerable by an image version, not by SSH-ing in to check.
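The idle-cost point is easy to check with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 24-agent fleet, and the $0.25 hourly rate are assumptions made up for the example, not Azure prices.

```python
def idle_spend_fraction(utilization: float) -> float:
    """Share of an always-on fleet's spend that buys idle capacity."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return 1.0 - utilization


def monthly_idle_cost(agents: int, hourly_rate: float, utilization: float,
                      hours_per_month: float = 730.0) -> float:
    """Monthly spend on idle capacity for a fleet billed around the clock."""
    total = agents * hourly_rate * hours_per_month
    return total * idle_spend_fraction(utilization)


if __name__ == "__main__":
    for utilization in (0.15, 0.30):
        idle = idle_spend_fraction(utilization)
        print(f"utilization {utilization:.0%}: {idle:.0%} of spend is idle")
    # Hypothetical fleet: 24 agents at an assumed $0.25 per VM-hour.
    print(f"idle cost: ${monthly_idle_cost(24, 0.25, 0.22):,.2f} per month")
```

Plugging in your own VM rate and measured utilization gives a first-order estimate of what a scale-to-floor pool could recover, before accounting for warm-pool cost.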
The trade-off is cold-start latency - provisioning a VM and starting the agent takes longer than dispatching to a warm one - and the discipline of baking everything into an image instead of apt install-ing it in the pipeline. Both are solvable, and the rest of this guide does so.
Scale set agents are not the same as running the agent container on AKS. There is no Kubernetes here. Azure DevOps talks directly to the VMSS resource and manages instance count and recycling for you. That makes this the right pattern when builds need a full VM (nested virtualization, Docker-in-Docker, kernel modules) rather than a pod.
2. Provision the VM Scale Set
The VMSS for a scale set agent pool has specific requirements that trip people up:
- No autoscale rules of your own. Azure DevOps owns the instance count. If a VMSS autoscale profile exists, the service fights it. Set --instance-count 0 and do not attach Azure Monitor autoscale.
- Overprovisioning off. Overprovisioning causes Azure to create extra VMs and delete the slowest, which breaks the agent’s 1:1 VM-to-agent assumption.
- Manual upgrade policy. The service decides when to cycle instances onto a new image.
Create a resource group, a network, and the scale set. Start from a Microsoft-published Ubuntu image; we replace it with a custom image in step 5.
RG=rg-ado-agents
LOC=eastus2
VMSS=vmss-ado-linux
az group create --name "$RG" --location "$LOC"
# Dedicated VNet/subnet for build traffic (locked down in step 6)
az network vnet create \
--resource-group "$RG" --name vnet-ado \
--address-prefix 10.40.0.0/16 \
--subnet-name snet-agents --subnet-prefix 10.40.1.0/24
az vmss create \
--resource-group "$RG" --name "$VMSS" \
--image Ubuntu2204 \
--vm-sku Standard_D4ds_v5 \
--storage-sku Premium_LRS \
--vnet-name vnet-ado --subnet snet-agents \
--instance-count 0 \
--upgrade-policy-mode manual \
--load-balancer '""' \
--public-ip-address '""' \
--orchestration-mode Uniform \
--disable-overprovision \
--assign-identity '[system]' \
--admin-username azureuser \
--generate-ssh-keys
A few choices are deliberate. --disable-overprovision enforces the 1:1 assumption. --public-ip-address '""' gives the instances no public IP - egress is controlled in step 6. --assign-identity '[system]' attaches a system-assigned managed identity so jobs can reach Azure resources without static secrets. Premium_LRS matters: build IO is the common bottleneck, and ephemeral OS disks or premium SSD shave minutes off image-heavy jobs.
Use a dedicated VMSS per pool. Do not point the agent pool at a scale set you also use for other workloads - Azure DevOps will resize and recycle instances out from under anything else running there.
3. Register the scale set as an elastic agent pool
The pool is created against the organization, then shared into projects. You can do this in the portal (Organization settings -> Agent pools -> Add pool -> Azure virtual machine scale set) or via the REST API. The REST shape is the source of truth and worth showing, because it exposes every autoscale knob you will tune in step 4.
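First, the identity behind the Azure Resource Manager service connection (typically a service principal) needs the Owner or Contributor role on the VMSS so Azure DevOps can manage instances. A PAT only authenticates the REST call to Azure DevOps; it carries no Azure role assignments. Then create the elastic pool: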
ORG=https://dev.azure.com/contoso
SUB=00000000-0000-0000-0000-000000000000
# Unquoted heredoc delimiter so the shell expands ${SUB} into the resource ID.
cat > pool.json <<JSON
{
"serviceEndpointId": "11111111-1111-1111-1111-111111111111",
"serviceEndpointScope": "22222222-2222-2222-2222-222222222222",
"azureId": "/subscriptions/${SUB}/resourceGroups/rg-ado-agents/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-ado-linux",
"maxCapacity": 20,
"desiredIdle": 1,
"recycleAfterEachUse": true,
"maxSavedNodeCount": 0,
"timeToLiveMinutes": 30,
"agentInteractiveUI": false,
"sizing": "automatic"
}
JSON
# serviceEndpointId/Scope refer to an ARM service connection in the project.
az devops invoke \
--area distributedtask --resource elasticpools \
--http-method POST --api-version 7.1 \
--query-parameters poolName=ado-linux-ephemeral \
--in-file pool.json --org "$ORG"
The fields that govern behaviour:
| Field | Meaning | Hardening note |
|---|---|---|
| recycleAfterEachUse | Destroy the VM after one job | Always true for untrusted code. This is what makes the pool ephemeral. |
| desiredIdle | Warm VMs kept ready for the next job | Higher = lower queue time, higher idle cost. |
| maxCapacity | Hard ceiling on instances | Caps blast radius and spend. |
| timeToLiveMinutes | Idle VM lifetime before scale-down | Lower trims cost; too low causes thrash. |
| maxSavedNodeCount | Failed VMs kept for debugging | Keep 0 in prod so nothing lingers with build state. |
recycleAfterEachUse: true plus maxSavedNodeCount: 0 is the core security posture: every job gets a brand-new VM, and nothing survives it.
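The hardening notes in the table can be turned into a small pre-flight check that runs before a pool definition is posted. This is a minimal sketch: `hardening_findings` is a hypothetical helper, and the only field names it relies on are the ones used in pool.json above.

```python
import json


def hardening_findings(pool: dict) -> list[str]:
    """Return human-readable problems with an elastic pool definition."""
    findings = []
    if pool.get("recycleAfterEachUse") is not True:
        findings.append("recycleAfterEachUse should be true so every job gets a fresh VM")
    if pool.get("maxSavedNodeCount", 0) != 0:
        findings.append("maxSavedNodeCount should be 0 so failed VMs do not linger")
    max_capacity = pool.get("maxCapacity", 0)
    if max_capacity <= 0:
        findings.append("maxCapacity should be a positive ceiling")
    if pool.get("desiredIdle", 0) > max_capacity:
        findings.append("desiredIdle should not exceed maxCapacity")
    return findings


if __name__ == "__main__":
    # Inline sample that mirrors the sizing fields of pool.json.
    sample = json.loads("""
    {"maxCapacity": 20, "desiredIdle": 1, "recycleAfterEachUse": true,
     "maxSavedNodeCount": 0, "timeToLiveMinutes": 30}
    """)
    problems = hardening_findings(sample)
    print("OK" if not problems else "\n".join(problems))
```

Running a check like this in the same script that calls the REST API keeps a well-meaning "save failed nodes for debugging" change from quietly weakening the pool.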
4. Tune autoscale: sizing, min/max, sampling, scale-down
Azure DevOps samples the state of the pool and the scale set every five minutes and reconciles instance count toward demand. Your levers are the pool fields above; the art is balancing queue time against idle spend.
- desiredIdle (the warm pool). This is the single biggest lever on developer experience. desiredIdle: 0 means every job pays full cold-start (provision VM + boot + start agent), often 2-4 minutes. Set it to your typical concurrent-burst floor - for a team that routinely fires 3-4 builds at once on a push, desiredIdle: 3 keeps p50 queue time near zero.
- maxCapacity (the ceiling). Size it from observed peak concurrency plus headroom, not from hope. A ceiling that is too low silently serializes builds during busy hours; too high invites a runaway fan-out (or a fork-bomb pipeline) to spin up dozens of billed VMs.
- timeToLiveMinutes (scale-down delay). After a burst, idle VMs above desiredIdle are torn down once they have been idle this long. Set it longer than your inter-build gap so you do not thrash - provisioning then immediately destroying VMs wastes both time and money. 30 minutes is a sane default for interactive teams; drop to 10-15 for batch-heavy, bursty patterns.
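One way to turn "typical burst floor" and "peak plus headroom" into numbers is sketched below. It assumes you have already collected the size of each observed burst (jobs queued together); `suggest_sizing`, the use of the lower median as the "typical" burst, and the 25% headroom are all illustrative choices, not anything Azure DevOps prescribes.

```python
import math
import statistics


def suggest_sizing(burst_sizes: list[int], headroom: float = 0.25) -> dict:
    """Suggest desiredIdle and maxCapacity from observed concurrent-burst sizes."""
    if not burst_sizes:
        raise ValueError("need at least one observed burst")
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    # Typical burst: lower median, so one outlier storm does not inflate the warm pool.
    desired_idle = statistics.median_low(burst_sizes)
    # Ceiling: observed peak plus headroom, rounded up to a whole VM.
    max_capacity = math.ceil(max(burst_sizes) * (1 + headroom))
    return {"desiredIdle": desired_idle, "maxCapacity": max_capacity}


if __name__ == "__main__":
    # Hypothetical week of bursts: mostly 3-4 builds, one 12-build merge storm.
    print(suggest_sizing([3, 4, 3, 12, 4]))
```

The output keys match the PATCH body fields, so the result can be written straight into resize.json once someone has reviewed the numbers.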
Update an existing pool’s sizing without recreating it:
# Look up the elastic pool id first, then PATCH its sizing.
POOL_ID=$(az devops invoke \
--area distributedtask --resource elasticpools \
--http-method GET --api-version 7.1 --org "$ORG" \
--query "value[?azureId.ends_with(@,'vmss-ado-linux')].poolId | [0]" -o tsv)
cat > resize.json <<'JSON'
{ "maxCapacity": 30, "desiredIdle": 3, "timeToLiveMinutes": 20 }
JSON
az devops invoke \
--area distributedtask --resource elasticpools \
--http-method PATCH --api-version 7.1 \
--route-parameters poolId="$POOL_ID" \
--in-file resize.json --org "$ORG"
Watch for the provisioning-failure feedback loop. If the image or extension is broken, Azure DevOps provisions a VM, the agent never comes online, the VM is recycled, and it tries again - burning money while every job sits queued. Alert on a non-zero “offline agents” count for the pool (the enterprise scenario at the end of this guide shows one such alert) so a bad image rollout pages you instead of draining the budget overnight.
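The loop has a simple signature: several provisioning attempts in a row where the agent never comes online. The sketch below shows only the detection logic. It assumes you can obtain a chronological list of per-VM outcomes from your own monitoring; `in_failure_loop` and the threshold of three are hypothetical, and no Azure DevOps API is implied.

```python
def in_failure_loop(outcomes: list[bool], threshold: int = 3) -> bool:
    """True when the most recent `threshold` provisioning attempts all failed.

    `outcomes` is chronological; True means the agent came online.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    recent = outcomes[-threshold:]
    # Require a full window so one early failure does not page anyone.
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    healthy = [True, False, True, True]
    broken_image = [True, True, False, False, False]
    print("healthy pool loops:", in_failure_loop(healthy))
    print("broken image loops:", in_failure_loop(broken_image))
```

A consecutive-failure window tolerates the occasional transient provisioning error while still firing within a few attempts of a bad image rollout.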
5. Bake a hardened agent image with Packer
Installing toolchains in the pipeline (apt install, nvm install, …) defeats the purpose: it is slow, non-reproducible, and reaches out to the internet from inside an untrusted job. Bake a custom managed image with everything pre-installed and use it as the VMSS source. Packer’s azure-arm builder produces a managed image (or a Shared Image Gallery version) you can pin by version.
# agent-image.pkr.hcl
packer {
  required_plugins {
    azure = {
      source  = "github.com/hashicorp/azure"
      version = ">= 2.0.0"
    }
  }
}

source "azure-arm" "ubuntu_agent" {
  use_azure_cli_auth                = true
  managed_image_resource_group_name = "rg-ado-images"
  managed_image_name                = "ado-agent-ubuntu2204-{{timestamp}}"
  os_type                           = "Linux"
  image_publisher                   = "canonical"
  image_offer                       = "0001-com-ubuntu-server-jammy"
  image_sku                         = "22_04-lts-gen2"
  location                          = "eastus2"
  vm_size                           = "Standard_D4ds_v5"
}

build {
  sources = ["source.azure-arm.ubuntu_agent"]

  provisioner "shell" {
    # Packer's default inline shebang is /bin/sh, which is dash on Ubuntu and
    # rejects `set -o pipefail`; run the inline script under bash instead.
    inline_shebang = "/bin/bash -e"
    inline = [
      "set -euxo pipefail",
      "sudo apt-get update",
      "sudo apt-get install -y --no-install-recommends git curl jq unzip ca-certificates",
      # Pin Node.js to the 20.x major line for reproducibility
      "curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -",
      "sudo apt-get install -y nodejs=20.*",
      "curl -sSL https://aka.ms/InstallAzureCLIDeb | sudo bash",
      # Trivy for in-image scanning of build outputs
      "curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sudo sh -s -- -b /usr/local/bin",
      # Baseline SSH hardening: disable root login
      "sudo sed -i 's/^#\\?PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config",
      "sudo apt-get -y autoremove && sudo apt-get -y clean"
    ]
  }

  # MANDATORY final step for Azure managed images: generalize the VM with waagent.
  provisioner "shell" {
    execute_command = "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'"
    inline          = ["/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync"]
    inline_shebang  = "/bin/sh -x"
  }
}
The final waagent -force -deprovision+user step is non-negotiable for Azure images - it strips the user, SSH host keys, and machine-specific state so every VMSS instance comes up clean. Skipping it bakes one machine’s identity into every agent.
Point the VMSS at the new image and let Azure DevOps roll it out by recycling instances:
IMG=$(az image show -g rg-ado-images -n ado-agent-ubuntu2204-1717000000 --query id -o tsv)
az vmss update \
--resource-group "$RG" --name "$VMSS" \
--set virtualMachineProfile.storageProfile.imageReference.id="$IMG"
Do not pre-install the Azure DevOps agent into the image. The scale set agent feature injects and configures the agent via a VM extension at provision time, so it always matches your organization. Bake the tools, not the agent.
6. Network isolation and egress control
Build traffic is a top exfiltration path: a compromised dependency in a PR can phone home or pull a second-stage payload. Because the VMSS lives in your VNet, you control its egress - something hosted agents cannot offer.
The agent needs outbound HTTPS to Azure DevOps and your package feeds, and nothing else. Lock egress with an NSG that denies general internet and a route table that forces traffic through Azure Firewall or another egress proxy that can filter by FQDN. A NAT gateway on its own only provides outbound address translation; it does not filter destinations.
# Deny-by-default egress, then allow only what builds legitimately need.
az network nsg create -g "$RG" -n nsg-agents
az network nsg rule create -g "$RG" --nsg-name nsg-agents \
--name allow-azure-devops --priority 100 --direction Outbound \
--access Allow --protocol Tcp --destination-port-ranges 443 \
--destination-address-prefixes AzureDevOps
az network nsg rule create -g "$RG" --nsg-name nsg-agents \
--name deny-all-egress --priority 4096 --direction Outbound \
--access Deny --protocol "*" --destination-port-ranges "*" \
--destination-address-prefixes Internet
az network vnet subnet update -g "$RG" --vnet-name vnet-ado \
--name snet-agents --network-security-group nsg-agents
AzureDevOps is a service tag that Microsoft keeps current with the platform’s IP ranges, so you do not hand-maintain a CIDR list. For package registries, prefer Azure Artifacts upstream sources reached over a Private Endpoint, or allow your firewall’s FQDN rules for *.nuget.org / registry.npmjs.org explicitly rather than opening the internet. For Azure resources the build touches (storage, Key Vault, ACR), use Private Endpoints into the same VNet so that traffic never leaves the Microsoft backbone.
If you self-host the agent behind a deny-all egress, allow the service tag, not a static IP list. Azure DevOps rotates ranges; a hardcoded allow-list will eventually black-hole your whole fleet at 2 a.m.
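A deny-by-default posture is easy to erode one "temporary" allow rule at a time, so it helps to audit the rule set mechanically. The sketch below assumes the rules are dicts shaped like the JSON output of az network nsg rule list (name, direction, access, destinationAddressPrefix); `audit_egress` and the allow-list are hypothetical names for illustration.

```python
def audit_egress(rules: list[dict], allowed_destinations: set[str]) -> list[str]:
    """Flag outbound allow rules outside the allow-list and a missing deny rule."""
    outbound = [r for r in rules if r.get("direction") == "Outbound"]
    problems = []
    for rule in outbound:
        destination = rule.get("destinationAddressPrefix")
        if rule.get("access") == "Allow" and destination not in allowed_destinations:
            problems.append(f"unexpected allow rule {rule.get('name')} -> {destination}")
    has_deny = any(
        rule.get("access") == "Deny"
        and rule.get("destinationAddressPrefix") in ("Internet", "*")
        for rule in outbound
    )
    if not has_deny:
        problems.append("no outbound deny rule covering Internet")
    return problems


if __name__ == "__main__":
    rules = [
        {"name": "allow-azure-devops", "direction": "Outbound", "access": "Allow",
         "priority": 100, "destinationAddressPrefix": "AzureDevOps"},
        {"name": "deny-all-egress", "direction": "Outbound", "access": "Deny",
         "priority": 4096, "destinationAddressPrefix": "Internet"},
    ]
    print(audit_egress(rules, {"AzureDevOps"}) or "egress posture OK")
```

The check deliberately ignores rule priority; it only answers "is there anything allowed that should not be, and is the final deny present", which is the drift most likely to appear over time.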
7. Pipeline permissions, approvals, and checks on the pool
An ephemeral pool with open access is still dangerous: any pipeline that can target it can run arbitrary code on a VM with a managed identity inside your VNet. Treat the agent pool as a protected resource and gate it the same way you gate an Environment.
First, turn off open access so a new pipeline cannot silently use the pool. In Project settings -> Agent pools -> {pool} -> Security, remove “Open access” and grant only the specific pipelines that should run here. Then attach checks to the pool itself, so they evaluate before a VM is provisioned:
- Approvals - a human gate before any job runs on the pool. Use this for pools that touch production networks.
- Branch control - restrict the pool to pipelines running from protected branches (e.g. refs/heads/main), so a fork or feature branch cannot grab a privileged VM.
- Business hours / exclusive lock - serialize access so an attacker cannot fan out across the whole fleet at once.
Branch control as a check is the highest-leverage gate, because it stops untrusted PR branches from ever reaching the pool:
# Conceptual: a Branch control check on the agent pool resource,
# configured under pool Security -> Checks (not in pipeline YAML).
allowedBranches: "refs/heads/main,refs/heads/release/*"
ensureProtectionOfBranch: true # require branch protection policies to exist
In the pipeline that consumes the pool, reference it by name and pin the demands so a job lands only on agents that advertise the right capabilities:
pool:
  name: ado-linux-ephemeral
  demands:
    - Agent.OS -equals Linux
    - docker # capability published by the image
Checks on a pool run before the agent is acquired, exactly like checks on an Environment. A blocked approval or a branch that fails branch control costs you zero agent minutes - the VM is never provisioned.
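To reason about which refs an allow-list such as refs/heads/main,refs/heads/release/* would admit, a small local model is useful. This is only a sketch using Python's fnmatch for the wildcard; Azure DevOps's own matching rules may differ in detail, and `branch_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def branch_allowed(ref: str, allowed_branches: str) -> bool:
    """Check a full ref against a comma-separated list of branch patterns."""
    patterns = [p.strip() for p in allowed_branches.split(",") if p.strip()]
    # fnmatchcase is case-sensitive; note that its * also matches "/" characters.
    return any(fnmatchcase(ref, pattern) for pattern in patterns)


if __name__ == "__main__":
    allow_list = "refs/heads/main,refs/heads/release/*"
    for ref in ("refs/heads/main", "refs/heads/release/2024.06",
                "refs/heads/feature/login", "refs/pull/42/merge"):
        print(f"{ref}: {'allowed' if branch_allowed(ref, allow_list) else 'blocked'}")
```

Listing the refs you expect to be blocked, pull request merge refs in particular, and confirming the model rejects them is a cheap sanity check before configuring the real check on the pool.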
8. Per-job credential scoping and log hygiene
Ephemeral VMs remove cross-job credential carry-over, but a job can still leak a secret into its own logs or pull more privilege than it needs. Close both gaps.
Use Workload Identity Federation, not stored keys. Configure your ARM service connection for OIDC so the pipeline mints a short-lived federated token instead of holding a client secret. Combined with the VMSS managed identity, builds reach Azure with credentials that expire in minutes and never sit on disk.
steps:
  - task: AzureCLI@2
    inputs:
      azureSubscription: sc-workload-identity # OIDC-based service connection
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        # Token is federated and short-lived; no secret is stored anywhere.
        az storage blob upload --account-name builddrops \
          --container-name artifacts --file ./out/app.zip --auth-mode login
Scope variable groups per stage. Do not pour every secret into one group attached to the whole pipeline. Link a Key Vault-backed variable group only to the stage that needs it, so a build job cannot read deploy-time production secrets.
Mask aggressively and fail on leaks. Secret variables are masked in logs automatically, but values that arrive from a script, a file, or string interpolation are not. Register them explicitly and avoid echoing environment dumps:
steps:
  - script: |
      TOKEN=$(./scripts/mint-token.sh)
      echo "##vso[task.setsecret]$TOKEN" # mask before any later use
      ./deploy --token "$TOKEN"
    displayName: Deploy with masked token
##vso[task.setsecret] masks a value computed at runtime. Reach for it whenever a secret originates inside the job (a generated token, a decrypted file) rather than from a pre-declared secret variable, because only declared secrets are masked for you.
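The reason registration matters is easiest to see with a toy model of log masking: the masker can only replace values it has been told about. The sketch below is not how Azure DevOps implements masking; `redact` and the sample values are hypothetical and exist only to illustrate the behaviour described above.

```python
def redact(line: str, registered_secrets: set[str]) -> str:
    """Replace every registered secret value in a log line with ***."""
    # Longest first, so a secret that contains a shorter one is masked whole.
    for secret in sorted(registered_secrets, key=len, reverse=True):
        if secret:
            line = line.replace(secret, "***")
    return line


if __name__ == "__main__":
    registered = {"declared-secret-value"}
    runtime_token = "minted-at-runtime"

    # A declared secret is masked; a runtime value the masker never saw leaks.
    print(redact("deploy --token declared-secret-value", registered))
    print(redact(f"deploy --token {runtime_token}", registered))

    # Registering the runtime value (the role task.setsecret plays) closes the gap.
    registered.add(runtime_token)
    print(redact(f"deploy --token {runtime_token}", registered))
```

The same model also shows why registration has to happen before the first use: any line written before the value is registered has already gone out unmasked.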
Verify
Confirm the pool is genuinely ephemeral, isolated, and gated before you route real traffic to it.
# 1. Pool is registered and elastic. desiredIdle/maxCapacity should match step 4.
az devops invoke --area distributedtask --resource elasticpools \
--http-method GET --api-version 7.1 --org "$ORG" \
--query "value[?azureId.ends_with(@,'vmss-ado-linux')].{idle:desiredIdle,max:maxCapacity,recycle:recycleAfterEachUse}"
# 2. VMSS has no public IPs and no autoscale rules of its own.
az vmss list-instance-public-ips -g "$RG" -n "$VMSS" -o table # expect empty
az monitor autoscale list -g "$RG" -o table # expect empty
# 3. Egress is deny-by-default with only AzureDevOps allowed.
az network nsg rule list -g "$RG" --nsg-name nsg-agents \
--query "[?direction=='Outbound'].{name:name,access:access,dst:destinationAddressPrefix}" -o table
Then run a canary pipeline and check three behaviours:
- Ephemerality. Run a job that writes a marker file to $HOME, then run a second job. The marker must be gone - different VM, clean room.
- Scale-down. After a burst, watch instance count fall back toward desiredIdle within timeToLiveMinutes: az vmss list-instances -g "$RG" -n "$VMSS" -o table.
- Gating. Push the pipeline from a non-allowed branch. The branch-control check must block it before a VM is provisioned (instance count stays flat).
For queue-time and cost monitoring, the pipeline analytics view exposes wait time, and Azure Monitor metrics on the VMSS give you instance-count over time. A KQL query against your cost export surfaces idle spend; the table and column names below depend on how you ingest that export, so adjust them to your schema:
// Daily compute cost attributed to the agent VMSS, last 30 days.
Usage
| where TimeGenerated > ago(30d)
| where ResourceId has "virtualMachineScaleSets/vmss-ado-linux"
| summarize CostUSD = sum(CostInBillingCurrency) by bin(TimeGenerated, 1d)
| order by TimeGenerated asc
Enterprise scenario
A payments platform team ran a 24-agent persistent Linux pool for a monorepo of ~140 services. Two constraints collided. First, PCI segmentation: their auditor flagged that build agents with broad outbound internet access sat one hop from the cardholder-data environment, and any compromised npm or NuGet dependency could exfiltrate or pivot. Second, cost: the pool was sized for the 9-11 a.m. merge storm but billed flat around the clock, running near 22% average utilization.
They moved to a scale set pool with recycleAfterEachUse: true, desiredIdle: 4 (their measured merge-storm floor), and maxCapacity: 40. The VMSS went into a dedicated, internet-egress-denied subnet; the only allowed outbound was the AzureDevOps service tag plus a Private Endpoint to Azure Artifacts upstream feeds, which proxied and cached the public registries. That single change satisfied the segmentation finding - builds could no longer reach the open internet - while the cached feeds actually sped up restores.
The subtle bug they hit: with desiredIdle: 0 (their first attempt at maximum savings), the merge storm produced 3-4 minute queue times because every one of the dozen simultaneous builds cold-started a VM. Worse, a flaky image-extension rollout one night entered a provision-fail-recycle loop and quietly burned a weekend of compute. The fix was twofold - raise desiredIdle to cover the concurrent floor, and alert on offline agents so a bad image pages someone:
// Alert: agents that provisioned but never came online (broken image/extension).
AzureDiagnostics
| where Category == "ScaleSetAgentPool"
| where Status_s == "offline"
| summarize OfflineAgents = dcount(AgentName_s) by bin(TimeGenerated, 5m), PoolName_s
| where OfflineAgents > 2
Net result: a 34% drop in monthly CI compute spend, a clean-room VM per job that closed the cross-build contamination finding, and an egress posture the auditor signed off on - all without changing a line of pipeline YAML beyond the pool: reference.