
Securing Azure Container Registry: Private Endpoints, ACR Tasks, Content Trust, and Geo-Replication

A container registry is the single most concentrated point of supply-chain risk in a platform. Every node in every cluster pulls from it, the images it serves run with whatever privileges the workload grants, and a compromised or stale image propagates silently across the fleet. Yet most ACR deployments I inherit are a Standard SKU with the admin user enabled, a long-lived password in a pipeline variable, public network access wide open, and no idea whether the latest tag is the thing that was scanned three months ago. Azure Container Registry (ACR) — the managed, OCI-compliant registry that stores your Docker and Helm artifacts, signatures, and SBOMs — can be the opposite of that: a hardened distribution point that proves what it serves and refuses to serve anything unproven.

This article builds that hardened registry end to end: a Premium registry locked behind private endpoints, with repository-scoped tokens instead of admin credentials, automated multi-step ACR Tasks that build inside the security boundary, Notation signatures gated by quarantine-on-push, geo-replicated zone-redundant distribution, Defender for Containers scanning, retention and purge to keep the surface small, and keyless OIDC CI/CD so the last long-lived secret disappears. Each control gets the exact az/Bicep to apply it, the exact command to verify it is in force, and a table that enumerates every option, default, and gotcha so you can pick correctly the first time.

Most of this posture requires the Premium tier. Private Link, tokens and scope maps, geo-replication, customer-managed keys, content-trust workflows, soft delete, and connected registries are all Premium-only. If you are on Basic or Standard, the first move is az acr update --sku Premium; the Premium-only controls do not apply until you do. By the end you will be able to stand up a registry that an auditor signs off on and an SRE trusts at 02:00, and you will know precisely which command tells you each guarantee is real rather than merely configured.

RG=rg-platform-acr
ACR=kvacrprod          # globally unique, alphanumeric, 5-50 chars
LOC=australiaeast

az group create -n $RG -l $LOC
az acr create -n $ACR -g $RG --sku Premium \
  --admin-enabled false
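
The naming rule in the comment above (alphanumeric only, 5 to 50 characters) is easy to trip over in a pipeline, so it can be worth checking before az acr create is ever called. The following is a minimal sketch in POSIX sh; valid_acr_name is a hypothetical helper, not part of the az CLI, and it checks only the format, not global uniqueness.

```shell
# Sketch: check a candidate registry name against the rule quoted above
# (alphanumeric only, 5-50 characters). valid_acr_name is a hypothetical helper.
valid_acr_name() {
  name=$1
  # Reject empty names and anything containing a non-alphanumeric character
  case $name in
    ''|*[!A-Za-z0-9]*) return 1 ;;
  esac
  # Enforce the 5-50 character length window
  [ "${#name}" -ge 5 ] && [ "${#name}" -le 50 ]
}

valid_acr_name kvacrprod && echo "kvacrprod: ok"
valid_acr_name kv-acr-prod || echo "kv-acr-prod: rejected (hyphen)"
```

Uniqueness still has to be confirmed against Azure itself; az acr check-name -n $ACR is the usual way to do that before creating the resource.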

What problem this solves

The pain is concrete and it shows up in three forms. The first is credential sprawl: the admin user is enabled, its password is pasted into a Kubernetes imagePullSecret and three pipelines, and it has not rotated in over a year. Anyone with read access to any of those locations has full read/write to every repository in the registry — and there is no way to scope, expire, or attribute that access. The second is provenance blindness: the cluster pulls myapp:latest, but nobody can prove which commit built it, whether it was scanned, or whether the bytes on disk are the bytes the build produced. A tampered or back-doored image is indistinguishable from a legitimate one. The third is availability and locality: a single-region registry with public access means cross-region pulls pay egress and latency on every cold start, and a regional outage takes the registry — and therefore every deployment in every region — down with it.

What breaks without this: a leaked admin password is a full registry compromise with no blast-radius limit; an unsigned image with a critical CVE deploys to production because nothing gated it; a flash-sale scale-out in another region stalls because every pod pulls layers across the ocean; and a zone or region failure that should have been survivable instead halts CI/CD platform-wide. None of these are exotic — they are the default posture of a registry created with the portal “next-next-finish” flow.

Who hits this: every platform team running AKS, Container Apps, or App Service for Containers at any real scale, every team subject to a supply-chain or compliance audit (SLSA, SOC 2, the US EO 14028 SBOM mandate), and every multi-region workload that cannot tolerate a single-region dependency. The fix is not one feature — it is a layered posture, and this article is the layer-by-layer build. To frame the whole field before the deep dive, here is every control class, the risk it removes, and the single command that proves it:

| Control class | Risk it removes | Premium-only? | One-command proof |
|---|---|---|---|
| Private endpoints + firewall | Public exposure of registry & layers | Yes | az acr show --query publicNetworkAccess returns Disabled |
| Disable admin user | Single shared full-access credential | No | az acr show --query adminUserEnabled returns false |
| Tokens + scope maps | Unscoped, non-expiring creds | Yes | az acr scope-map list shows least-privilege maps |
| Managed-identity pulls | Static imagePullSecret in clusters | No | az aks check-acr succeeds, no secret in cluster |
| ACR Tasks (build inside) | Source/build on dev laptops | No | az acr task list shows Git-triggered tasks |
| Quarantine-on-push | Unscanned image is pullable | Yes | Push then pull a new image → denied until passed |
| Notation signing | Unprovable image provenance | Yes | notation verify passes; tamper → fails |
| Geo-replication + AZ | Region/zone outage halts pulls | Yes | az acr replication list shows ≥2 Ready replicas |
| Defender for Containers | CVEs ship undetected | No (subscription plan) | az security pricing show -n Containers returns Standard |
| Retention + purge + soft delete | Storage bloat, irrecoverable deletes | Yes | az acr config retention show returns enabled |
| OIDC federated CI/CD | Long-lived pipeline secret | No | No clientSecret/token password in pipeline vars |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the basics of OCI registries and Docker: an image is a set of content-addressable layers referenced by a manifest, a tag is a mutable human label pointing at a manifest digest, and a pull authenticates against the registry endpoint then downloads layer blobs from a data endpoint. You should be comfortable running az in Cloud Shell, reading JSON output, and you should know what a managed identity, an Entra-ID role assignment, and a private endpoint are at a conceptual level. Familiarity with AKS or Container Apps as the consumer helps but is not required.

This sits in the Security & Supply Chain track and leans on several adjacent topics. The networking lockdown reuses everything in Azure Private Link and Private DNS for PaaS and the decision in Private Endpoint vs Service Endpoint. The signing and quarantine flow stores its certificate in Azure Key Vault: Secrets, Keys & Certificates. Geo-replication and zone redundancy build directly on Azure Regions and Availability Zones Explained and feed a Multi-Region Active-Active Design. The consuming compute is usually one of the platforms in Azure App Service vs Container Apps vs AKS. A quick map of who owns and confirms each layer during a supply-chain review:

| Layer | What lives here | Who usually owns it | What it gates |
|---|---|---|---|
| Registry endpoint | *.azurecr.io Docker v2 API + auth | Platform team | Login, manifest read/write |
| Data endpoints | *.<region>.data.azurecr.io layer blobs | Platform team | Layer pull/push; per-replica |
| Private Link / DNS | Private IPs, privatelink.azurecr.io zone | Network team | Whether pulls leave the VNet |
| Identity & RBAC | Tokens, scope maps, AcrPull/AcrPush | Security / IAM | Who can do what, on which repos |
| Tasks compute | ACR-managed build agents | Platform / dev | Where images are built |
| Content trust | Notation certs, trust policy, Ratify | Security | Whether unsigned images admit |
| Scanning | Defender for Containers, quarantine | Security | Whether vulnerable images ship |

Core concepts

Six mental models make every later decision obvious.

The registry has two endpoint classes, and lockdown must cover both. The registry endpoint (<name>.azurecr.io) serves the Docker v2 API and authentication. The data endpoints (<name>.<region>.data.azurecr.io, one per region with geo-replication) serve the actual layer blobs. When you restrict networking, the registry private-endpoint group ID projects both into your VNet — but if your DNS only resolves the control endpoint, pulls succeed on auth and then hang on layer download. This split is the source of most “the firewall half-works” tickets.

Identity is the new perimeter, and there are three credential models. The admin user is one shared username/password with full read/write — disable it always. Tokens bound to scope maps are credentials scoped to named actions on specific repositories, with optional expiry — use them where Entra ID is impossible (a third-party appliance). Entra-ID identities with AcrPull/AcrPush role assignments are the strong default: a managed identity pulls with no stored secret at all. The progression from admin → token → managed identity is a progression from “everyone with the password owns everything” to “this exact workload can pull these exact repos.”

Build provenance starts at the build location. An image built on a developer laptop or a generic CI agent has touched untrusted compute before it reaches the registry. ACR Tasks run the build on ACR-managed compute inside the registry’s boundary, so source never lands on a laptop and the image is born where it will live. A multi-step task (build → cmd → push, gated by when) lets you test the freshly built image before it is pushed: a build-test-push gate in one unit.

Trust is two independent gates: quarantine and signatures. Quarantine-on-push makes every pushed image invisible to normal pulls until a process explicitly marks it good — turning “push” into “push to staging.” Notation signatures attach a cryptographic proof of provenance and integrity that a consumer (Ratify at the AKS admission gate) verifies against a trust policy. Quarantine answers “has this been checked?”; signatures answer “is this the thing we checked, signed by who we trust?” You want both.

Resilience is layered and platform-driven. Zone redundancy (now default) spreads each replica’s storage across availability zones, surviving a zone failure. Geo-replication makes the registry one logical resource with storage in multiple regions behind one login server, surviving a region failure and serving pulls from the nearest replica. Failover is health-aware and automatic — there is no customer failover button. Your job is capacity planning and ensuring each consuming region has a nearby replica.

The surface must be actively shrunk. A CI pipeline tagging every build by run ID accumulates thousands of manifests, bloating storage and scan scope. Untagged-manifest retention auto-deletes orphaned manifests; purge tasks delete tags on a schedule; soft delete keeps deleted artifacts recoverable for a window so a bad filter is not a catastrophe. Cleanup is a security control, not just housekeeping — fewer artifacts means a smaller attack surface and a cheaper, faster scan.

Almost every control in this article is gated on the SKU, so the very first decision is the tier. What each SKU includes — and why this posture is Premium-only:

| Capability | Basic | Standard | Premium |
|---|---|---|---|
| Included storage | ~10 GB | ~100 GB | ~500 GB |
| Private endpoints / Private Link | No | No | Yes |
| Public-access disable + IP firewall | No | No | Yes |
| Tokens + scope maps | No | No | Yes |
| Geo-replication | No | No | Yes |
| Zone redundancy | No | No | Yes (default) |
| Quarantine-on-push | No | No | Yes |
| Customer-managed keys (CMK) | No | No | Yes |
| Soft delete | No | No | Yes |
| ACR Tasks (build/cmd/push) | Yes | Yes | Yes |
| AcrPull/AcrPush RBAC + admin-off | Yes | Yes | Yes |
| Image signing artifacts (Notation) | Yes* | Yes* | Yes |

(*Notation can push signature artifacts to any tier, but quarantine gating and private distribution — the parts that make signing enforceable end to end — are Premium.)

The vocabulary in one table

Before the deep sections, pin every moving part. The glossary repeats these for lookup; this table is the model side by side:

| Concept | One-line definition | Where it lives | Why it matters to the supply chain |
|---|---|---|---|
| Registry endpoint | Docker v2 API + auth (*.azurecr.io) | Per registry | Login and manifest ops; lock with PE |
| Data endpoint | Layer-blob host (*.<region>.data.*) | Per replica | Pull/push of bytes; DNS must cover it |
| Admin user | Shared full-access credential | Registry property | Disable: single point of compromise |
| Scope map | Named action-set on repositories | Registry | Least-privilege policy for a token |
| Token | Credential bound to a scope map | Registry | Scoped, expirable non-Entra access |
| AcrPull / AcrPush | Entra-ID RBAC roles | Role assignment | Keyless pull/push via managed identity |
| ACR Task | Build/cmd/push on ACR compute | Registry | Builds inside the boundary; triggers |
| Base-image trigger | Rebuild when FROM digest moves | Task property | Auto-patch derived images vs CVEs |
| Quarantine | Image invisible until promoted | Policy | Gate before anything is pullable |
| Notation signature | Crypto provenance/integrity proof | Artifact on the manifest | Prove what you pull |
| Trust policy | Which signer is trusted for which repo | Notation config | Enforce signer identity |
| Geo-replica | Live writable copy in another region | Replica resource | Region failover + pull locality |
| Zone redundancy | Storage spread across AZs | Replica property (default) | Survive a zone outage |
| Retention / purge | Auto-delete untagged / old tags | Policy + task | Shrink surface and cost |
| Soft delete | Recoverable deleted artifacts | Policy | Safety net for bad purges |
| OIDC federation | Short-lived token from CI to Entra | Federated credential | Removes stored pipeline secrets |

Premium architecture: private endpoints, firewall, and trusted services

The data plane of ACR has two endpoint classes: the registry endpoint (<name>.azurecr.io, used for the Docker v2 API and auth) and the data endpoints that serve the actual layer blobs. With geo-replication, each region gets its own data endpoint (<name>.<region>.data.azurecr.io). When you lock down networking, you must account for both, or pulls succeed on auth and then hang on layer download.

Start by disabling public access and attaching a private endpoint. The private endpoint projects the registry into your VNet with a private IP, and Private Link automatically wires up the per-region data endpoints behind it.

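# Disable public network access entirely. On a registry that is already serving
# pulls, run this last, after the private endpoint and DNS below are verified.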
az acr update -n $ACR --public-network-enabled false

PE_SUBNET=/subscriptions/<sub>/resourceGroups/rg-net/providers/Microsoft.Network/virtualNetworks/vnet-hub/subnets/snet-pe
ACR_ID=$(az acr show -n $ACR -g $RG --query id -o tsv)

az network private-endpoint create \
  -g $RG -n pe-$ACR \
  --subnet $PE_SUBNET \
  --private-connection-resource-id $ACR_ID \
  --group-id registry \
  --connection-name pe-$ACR-conn

The registry group ID covers both the control endpoint and all data endpoints — you do not create a separate private endpoint per region. Now wire the private DNS zone so <name>.azurecr.io and <name>.<region>.data.azurecr.io resolve to private IPs inside the VNet:

az network private-dns zone create -g rg-net -n privatelink.azurecr.io
az network private-dns link vnet create \
  -g rg-net -n link-acr \
  -z privatelink.azurecr.io \
  -v vnet-hub --registration-enabled false

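# The zone lives in rg-net, not $RG, so pass its full resource ID
ZONE_ID=$(az network private-dns zone show -g rg-net \
  -n privatelink.azurecr.io --query id -o tsv)

az network private-endpoint dns-zone-group create \
  -g $RG --endpoint-name pe-$ACR -n acr-zone-group \
  --private-dns-zone $ZONE_ID --zone-name registry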

The DNS zone group auto-populates A records for the registry and every replica data endpoint, so when you add a geo-replica later the record appears without manual intervention. Verify with az network private-dns record-set a list -g rg-net -z privatelink.azurecr.io -o table: you should see one record for the registry endpoint plus one data-endpoint record per region.
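
That check can be automated by comparing the record names in the zone against the names the registry should have: one for the registry endpoint and one <name>.<region>.data per region. This is a sketch only; expected_acr_records and missing_acr_records are hypothetical helpers, and the second region in the usage note is an assumed example.

```shell
# expected_acr_records NAME REGION...: print the A record names that should
# exist in privatelink.azurecr.io for this registry (hypothetical helper).
expected_acr_records() {
  name=$1; shift
  echo "$name"                      # registry/control endpoint
  for region in "$@"; do
    echo "$name.$region.data"       # one data endpoint per region
  done
}

# missing_acr_records FILE NAME REGION...: print every expected record name
# that does not appear as a whole line in FILE (one record name per line).
missing_acr_records() {
  actual=$1; shift
  expected_acr_records "$@" | while read -r rec; do
    grep -qxF "$rec" "$actual" || echo "$rec"
  done
}
```

A possible use: dump the zone with az network private-dns record-set a list -g rg-net -z privatelink.azurecr.io --query "[].name" -o tsv > records.txt, then run missing_acr_records records.txt kvacrprod australiaeast australiasoutheast. Empty output means every expected record is present.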

Knowing exactly which A records should exist in the zone is how you spot a half-wired private path before it pages you. The expected records for a two-region registry:

| A record (in privatelink.azurecr.io) | Resolves | Created by | Missing → symptom |
|---|---|---|---|
| <name> | Registry/control endpoint | Zone group (always) | Login itself fails / public IP returned |
| <name>.<homeRegion>.data | Home-region data endpoint | Zone group | Auth ok, home pulls hang on layers |
| <name>.<replicaRegion>.data | Replica data endpoint | Zone group on replica add | Auth ok, replica-region pulls hang |
| <name>.<region>.data (new replica) | Newly added replica | Auto on replication create | New region pulls hang until record appears |

The network-access surface has more knobs than public-network-enabled, and getting the combination right is what separates “locked down” from “looks locked but a CI agent still reaches it over the internet.” Every networking control, end to end:

| Setting | Values | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| publicNetworkAccess | Enabled / Disabled | Enabled | Disable once PE + DNS are live | Disable before PE exists → you lock yourself out |
| Private endpoint --group-id | registry | n/a | Always (single PE for all endpoints) | Wrong group ID → data endpoints unreachable |
| Private DNS zone | privatelink.azurecr.io | none | Always with PE | Missing zone → auth works, layer pull hangs |
| --default-action (IP rules) | Allow / Deny | Allow | Deny to make the firewall default-deny | Public still on unless you also disable it |
| IP network rule | CIDR allow-list | none | Allow a specific NAT/egress IP | Premium-only; max ~100 rules |
| networkRuleBypassOptions | AzureServices / None | AzureServices | Keep AzureServices for Defender/Tasks | None blocks trusted-service scanning |
| --allow-trusted-services | true / false | true | Keep true with public off | false breaks Defender, Tasks reach-back |
| dataEndpointEnabled | true / false | false | true for dedicated data endpoints | Needed for tight firewall egress allow-listing |
| zoneRedundancy (home) | Enabled / Disabled | Enabled* | Leave on in AZ regions | *Default in supporting regions; free |
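
The first and fourth rows interact: a default-deny firewall still leaves the public endpoint on unless publicNetworkAccess is also disabled. A small classifier makes that combination explicit. This is a sketch; acr_exposure is a hypothetical helper, and the property paths in the usage note are assumed to match what az acr show returns for your CLI version.

```shell
# acr_exposure PUBLIC_ACCESS DEFAULT_ACTION: classify the combination of
# publicNetworkAccess and the firewall default action (hypothetical helper).
acr_exposure() {
  case "$1/$2" in
    Disabled/*)    echo "private-only" ;;    # public endpoint off; private endpoint is the only path
    Enabled/Deny)  echo "ip-allow-list" ;;   # public endpoint on, firewall is default-deny
    Enabled/Allow) echo "open" ;;            # reachable from any network
    *)             echo "unknown"; return 1 ;;
  esac
}
```

One way to feed it, assuming those property names: acr_exposure $(az acr show -n $ACR --query "[publicNetworkAccess, networkRuleSet.defaultAction]" -o tsv). Anything other than private-only means some traffic can still arrive over the public endpoint.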

Trusted services bypass

With public access disabled, platform services that legitimately need to reach the registry — Defender for Cloud scanning, ACR Tasks, Container Apps, the AKS image-cleaner — cannot traverse your private endpoint. ACR exposes a trusted services bypass for exactly this. It is not a blanket “allow Microsoft”; the trusted service must authenticate with its own managed identity that holds an AcrPull (or finer) role.

az acr update -n $ACR --allow-trusted-services true

A subtle failure mode: az acr build and az acr task run on ACR’s own compute, which is a trusted service, so they bypass the firewall. But az acr import from a network-restricted source, or a docker push from a self-hosted agent, is not trusted — that agent must sit inside the VNet or reach a private endpoint. Most “my firewall blocks ACR Tasks” tickets are actually about the source registry on an import, not the task itself.

Which callers are trusted and which are not is the exact knowledge that resolves those tickets. The reach-back matrix:

| Caller | Trusted-service bypass? | How it must reach a locked registry | Common failure |
|---|---|---|---|
| ACR Tasks (az acr build/task) | Yes | Bypasses firewall on ACR compute | None, but the source registry on import is not trusted |
| Defender for Containers scanner | Yes (with AzureServices) | Bypass + its managed identity | networkRuleBypassOptions=None blocks it |
| Container Apps environment | Yes (system MI) | Bypass + AcrPull on the env MI | MI missing AcrPull → image pull error |
| AKS kubelet identity | No (data-plane pull) | Private endpoint / private DNS in the cluster VNet | DNS not linked to cluster VNet → pull hangs |
| App Service for Containers | No | VNet integration + private endpoint | No VNet integration → cannot resolve private IP |
| Self-hosted pipeline agent | No | Agent inside the VNet or via PE | Public off + agent outside VNet → denied |
| az acr import (source side) | No (source registry) | Source reachable; target via trusted reach-back | Network-restricted source → import times out |
| GitHub-hosted Actions runner | No | OIDC + public on, or self-hosted in VNet | Public off + hosted runner → cannot reach registry |

Token and scope-map repository-scoped access without the admin user

The admin user is a single shared credential with full read/write to the entire registry. Disable it (we did, at creation) and use tokens scoped by scope maps instead. A scope map is an IAM policy for the registry: it grants a named set of actions on specific repositories. A token binds credentials to a scope map.

The valid actions are content/read, content/write, content/delete, metadata/read, and metadata/write. A pull-only CI consumer needs content/read plus metadata/read; a build agent that pushes needs content/write added.

# A pull-only scope map for the payments team's repos
az acr scope-map create -r $ACR -n payments-pull \
  --repository payments/api    content/read metadata/read \
  --repository payments/worker content/read metadata/read \
  --description "Pull-only access to payments images"

# Token bound to that scope map
az acr token create -r $ACR -n k8s-payments-puller \
  --scope-map payments-pull

Wildcards make this scale. samples/* matches every repository under that prefix, and wildcard grants are additive with exact-match grants, so a CD service account can be given broad pull and narrow push in one map:

az acr scope-map create -r $ACR -n cd-pipeline \
  --repository 'apps/*'        content/read metadata/read \
  --repository apps/checkout   content/read content/write metadata/read metadata/write

Tokens carry passwords (two for rotation), but the strong pattern is to skip token passwords entirely and let Entra-ID identities pull via AcrPull role assignments with managed identity — covered in the CI/CD section. Use scope-map tokens where you genuinely cannot use Entra ID (a third party, an appliance), and rotate them:

az acr token credential generate -r $ACR -n k8s-payments-puller \
  --password1 --expiration-in-days 90 -o json

The five scope-map actions are the entire vocabulary of token permissions — knowing exactly what each gates (and what it does not) is how you grant the minimum. The action reference:

| Action | Grants | Does NOT grant | Typical consumer |
|---|---|---|---|
| content/read | Pull image layers + manifests | List repos/tags, push, delete | Any puller (AKS, CI consumer) |
| content/write | Push image layers + manifests | Delete, read others’ repos | Build/CD agent |
| content/delete | Delete images/manifests | Push, read | Purge/cleanup automation |
| metadata/read | List tags, read manifest metadata | Pull layer bytes | Catalog/UI, dependency scanners |
| metadata/write | Update tag/manifest attributes | Pull/push content | Promotion tooling (lock tags) |
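
Because those five actions are the whole vocabulary, a script that generates scope maps can reject a typo before it reaches az acr scope-map create. The following is a minimal sketch; valid_scope_actions is a hypothetical helper and knows only the five actions listed in this article.

```shell
# valid_scope_actions ACTION...: succeed only if every argument is one of the
# five scope-map actions listed above (hypothetical helper).
valid_scope_actions() {
  for action in "$@"; do
    case $action in
      content/read|content/write|content/delete|metadata/read|metadata/write) ;;
      *) echo "invalid action: $action" >&2; return 1 ;;
    esac
  done
}

valid_scope_actions content/read metadata/read && echo "pull-only set: ok"
valid_scope_actions content/pull 2>/dev/null || echo "content/pull: rejected"
```

Gating the create call on this check keeps a misspelled action from failing halfway through a multi-repository scope map.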

The registry exposes several credential models at once; choosing the wrong one is how an audit finding is born. The full comparison:

| Credential model | Scope | Expiry | Entra-aware | Best for | Worst for |
|---|---|---|---|---|---|
| Admin user | Whole registry, read+write | Never | No | Nothing in production | Everything; disable it |
| Scope-map token | Named repos + actions | Optional (--expiration-in-days) | No | 3rd-party appliance, non-Entra consumer | Workloads that can use MI |
| Service principal + secret | RBAC role on registry | Secret expiry | Yes | Legacy automation | New work (a secret to rotate) |
| System-assigned MI | RBAC role, tied to one resource | n/a (keyless) | Yes | AKS kubelet, Container Apps | Cross-resource reuse |
| User-assigned MI | RBAC role, reusable | n/a (keyless) | Yes | Shared pipeline identity | When you need per-resource isolation |
| OIDC federated cred | RBAC via short-lived token | Minutes (token TTL) | Yes | GitHub/ADO pipelines | Inside-cluster pulls |

The four built-in Entra roles cover almost every case without a custom role; reach for a custom role only when you must scope push to specific repositories. The RBAC role reference:

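| Role | Pull | Push | Delete | Manage registry | When to assign |
|---|---|---|---|---|---|
| AcrPull | Yes | No | No | No | AKS kubelet, any read-only consumer |
| AcrPush | Yes | Yes | No | No | CI/CD build-push identity |
| AcrDelete | No | No | Yes | No | Purge/retention automation |
| AcrImageSigner | No | No | No | Sign images | Docker Content Trust signing identity; a Notation signer needs pull/push (AcrPull + AcrPush) instead |
| Owner / Contributor | Yes | Yes | Yes | Yes | Humans (PIM-elevated), never a workload |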

ACR Tasks: multi-step builds, base-image triggers, and cache

ACR Tasks run builds on ACR-managed compute, so source never touches a developer laptop and the resulting image is born inside the security boundary. A multi-step task is defined in acr-task.yaml with three step types — build, cmd, and push — and a when property to express dependencies. Critically, unlike az acr build, a multi-step build step does not auto-push; you only push after validation passes. That gives you a build-test-push gate in a single task.

# acr-task.yaml
version: v1.1.0
steps:
  - id: build
    build: -t $Registry/payments/api:$ID -f Dockerfile .
  # Run the freshly built image through tests before it is pushed
  - id: unit-tests
    cmd: $Registry/payments/api:$ID pytest -q
    when: ["build"]
  # Only push if tests succeeded
  - id: push
    push:
      - $Registry/payments/api:$ID
      - $Registry/payments/api:latest
    when: ["unit-tests"]

$Registry expands at runtime to the executing registry’s login server, and $ID is the unique run ID — using it as the immutable tag means every build is independently addressable. Create the task with a Git trigger so a commit to main builds automatically:

az acr task create -r $ACR -n payments-api-ci \
  --file acr-task.yaml \
  --context https://github.com/org/payments.git#main \
  --git-access-token $GH_PAT \
  --commit-trigger-enabled true \
  --base-image-trigger-enabled true \
  --base-image-trigger-type Runtime

The base-image trigger is the feature that earns ACR Tasks its keep. When the base image your FROM line references is updated, whether that is an upstream mcr.microsoft.com/dotnet/aspnet digest or a hardened internal base you maintain, the task re-runs and rebuilds your application image with the patched layers. This is how you keep thousands of derived images current against CVEs without anyone manually rebuilding. The trigger requires your Dockerfile to name an explicit base tag (not an untagged reference, and ideally not latest); ACR tracks the digest behind that tag and fires when it moves.
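
Since the trigger depends on how the FROM lines are written, a lint step can flag base references that carry no tag or use latest before the task is created. This is a rough sketch using awk; unpinned_bases is a hypothetical helper, it assumes simple single-line FROM instructions, and it does not expand ARG variables.

```shell
# unpinned_bases DOCKERFILE: print each FROM image that has no tag or uses
# :latest. Skips scratch, digest-pinned images, and earlier build stages.
unpinned_bases() {
  awk 'toupper($1) == "FROM" {
         i = 2
         if ($i ~ /^--platform=/) i++            # skip an optional --platform flag
         img = $i
         if (toupper($(i+1)) == "AS") stages[$(i+2)] = 1   # remember stage aliases
         if (img == "scratch" || (img in stages) || img ~ /@sha256:/) next
         n = split(img, parts, "/")              # only the last path segment can hold a tag
         if (parts[n] !~ /:/ || parts[n] ~ /:latest$/) print img
       }' "$1"
}
```

Running unpinned_bases Dockerfile in CI and failing the job on any output is one way to keep every derived image eligible for base-image rebuilds.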

For an internal base-image chain, point a task at the base repo and let the derived task’s Runtime trigger cascade:

# Base image task — its push moves the digest behind myorg/base:1.0
az acr task create -r $ACR -n base-image \
  --image myorg/base:1.0 \
  --context https://github.com/org/base.git#main \
  --git-access-token $GH_PAT \
  --commit-trigger-enabled true

ACR Tasks caches layers between runs automatically, and BuildKit can be enabled by setting DOCKER_BUILDKIT=1 in the task env for better cache behavior and secret mounts. The task model has several variants and a handful of trigger types; picking the wrong combination is why some pipelines “don’t rebuild on a CVE.” The task-type and trigger matrix:

| Task type | Defined by | Triggers supported | Auto-push? | Use for |
|---|---|---|---|---|
| Quick task (az acr build) | One-off CLI invocation | None (manual) | Yes | Ad-hoc / CI-driven builds |
| Multi-step (--file) | acr-task.yaml | Commit, base-image, schedule, manual | No (explicit push) | Build-test-push gate |
| Single-image (--image) | --image + Dockerfile | Commit, base-image, schedule, manual | Yes | Simple derived-image rebuilds |
| Scheduled (--schedule) | Cron timer | Timer only | Depends on steps | Nightly purge, periodic rebuild |

| Trigger | Flag | Fires when | Requires | Gotcha |
|---|---|---|---|---|
| Commit | --commit-trigger-enabled true | Push to the tracked branch | Git context + PAT/OAuth | PAT scope must include repo + webhook |
| Pull request | --pull-request-trigger-enabled true | PR opened/updated | Git context | Builds untrusted PR code; scope carefully |
| Base image (Runtime) | --base-image-trigger-type Runtime | FROM digest moves | Pinned base tag | latest/unpinned base won’t track cleanly |
| Base image (All) | --base-image-trigger-type All | Buildtime + runtime base changes | Pinned base | Noisier; more rebuilds |
| Schedule | --schedule "0 2 * * *" | Cron time (UTC) | | Cron is UTC; mind your TZ |
| Manual | az acr task run | You invoke it | | No automation; for testing |
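
The UTC gotcha on the schedule trigger is simple arithmetic, and it is easy to get backwards. As a sketch, the conversion for whole-hour offsets looks like this; local_to_utc_hour is a hypothetical helper, and it ignores daylight saving and half-hour zones.

```shell
# local_to_utc_hour HOUR OFFSET: HOUR is the local hour (0-23); OFFSET is the
# zone's whole-hour offset from UTC, for example 10 for UTC+10 or -5 for UTC-5.
local_to_utc_hour() {
  # Add 24 before the final modulo so a negative difference wraps correctly
  echo $(( ( ($1 - $2) % 24 + 24 ) % 24 ))
}

# 02:00 local in a UTC+10 zone is 16:00 UTC on the previous calendar day
echo "0 $(local_to_utc_hour 2 10) * * *"
```

When the hour wraps across midnight, any day-of-month or day-of-week field in the cron expression shifts by a day as well, so a schedule that must run on a specific local day needs that field adjusted too.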

The task YAML exposes more than three step types’ worth of behaviour; the runtime variables and step properties below are what make a task portable across registries:

| Token / property | Expands to / does | Example |
|---|---|---|
| $Registry | Executing registry login server | $Registry/payments/api:$ID |
| $ID | Unique run ID (immutable tag) | payments/api:cf3a1 |
| $Date / $Commit | Run date / source commit SHA | Tag by commit for traceability |
| when: ["step-id"] | Run only after named step(s) succeed | Gate push on unit-tests |
| env: | Per-step environment variables | DOCKER_BUILDKIT=1 |
| secret: (Key Vault) | Mount a KV secret into a step | Inject a registry/login secret |
| --platform | Target OS/arch | linux/arm64 for multi-arch |
| --no-push | Suppress auto-push on a quick task | Validate before publishing |

A task run moves through a small set of statuses; reading them (az acr task list-runs -r $ACR -o table) is how you tell a flaky build from a triggering problem:

| Run status | Meaning | Likely next step | If stuck here |
|---|---|---|---|
| Queued | Waiting for build agent | Starts shortly | Long queue → concurrency/region capacity |
| Running | Build/test/push in progress | Completes or fails | Hang → check the step log live |
| Succeeded | All steps passed; image pushed | Image available (or quarantined) | |
| Failed | A step returned non-zero | Inspect az acr task logs | Test step failing → push correctly gated off |
| Canceled | Manually or superseded | Re-run if needed | Superseded by a newer commit |
| Error | Task infra/config problem | Fix YAML/context/credentials | Bad Git PAT or unreachable source |

Image signing with Notation and quarantine-on-push gating

Two independent controls combine here. Notation attaches a cryptographic signature to an image so consumers can prove provenance and integrity. Quarantine holds every pushed image invisible until a process explicitly marks it good — turning “push” into “push to staging” and forcing a gate before anything is pullable.

Quarantine on push

Quarantine is configured through the management policy API. Once enabled, a freshly pushed image is visible only to identities with quarantine-reader permission; normal pulls fail until the image is marked passed. Your scanner subscribes to the quarantine webhook, scans, and promotes.

ID=$(az acr show -n $ACR --query id -o tsv)
az resource update --ids $ID \
  --set properties.policies.quarantinePolicy.status=enabled

Enabling quarantine is a breaking change to existing workflows: any image not explicitly marked good is blocked for pull. Roll it out per registry with the consuming teams aware, and make sure your promotion automation is live before you flip it, or every deployment stalls.

The quarantine lifecycle has a small number of states and transitions; knowing them is how you debug “my CI pushed but AKS can’t pull.” The state machine:

| State | Set by | Pullable by normal identity? | Next transition |
|---|---|---|---|
| Quarantined (on push) | Platform (policy enabled) | No | Scanner reads via quarantine permission |
| Passed | Promotion automation | Yes | Image is generally available |
| Failed | Promotion automation | No | Stays blocked; purge or re-build |
| (policy disabled) | Admin | Yes immediately | No gate; every push is live |

Signing with Notation and Azure Key Vault

Notation signs with a certificate stored in Key Vault via the azure-kv plugin. Install the CLI and plugin, pinning versions (the releases below were current at the time of writing):

curl -Lo notation.tar.gz \
  https://github.com/notaryproject/notation/releases/download/v1.3.2/notation_1.3.2_linux_amd64.tar.gz
tar xzf notation.tar.gz && cp ./notation /usr/local/bin

notation plugin install --url \
  https://github.com/Azure/notation-azure-kv/releases/download/v1.2.1/notation-azure-kv_1.2.1_linux_amd64.tar.gz \
  --sha256sum 67c5ccaaf28dd44d2b6572684d84e344a02c2258af1d65ead3910b3156d3eaf5

The signing identity needs Key Vault Certificates Officer and Key Vault Crypto User on the vault (RBAC mode), plus pull/push on the registry. Always sign by digest, never by tag — tags are mutable, and a signature must bind to immutable content:

KEY_ID=$(az keyvault certificate show -n signing-cert \
  --vault-name kv-signing --query 'kid' -o tsv)

DIGEST=$(az acr build -r $ACR -t $ACR.azurecr.io/payments/api:v1 \
  https://github.com/org/payments.git#main \
  --no-logs --query "outputImages[0].digest" -o tsv)
IMAGE=$ACR.azurecr.io/payments/api@$DIGEST

notation sign --signature-format cose \
  --id $KEY_ID --plugin azure-kv \
  --plugin-config self_signed=true \
  $IMAGE
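
Because a signature must bind to immutable content, it can help to refuse to call notation sign unless the reference really is a digest. The following is a minimal sketch; is_digest_ref is a hypothetical helper that accepts only references ending in @sha256: followed by 64 lowercase hex characters.

```shell
# is_digest_ref REF: succeed only if REF is pinned by a sha256 digest
# (hypothetical helper; other digest algorithms are deliberately rejected).
is_digest_ref() {
  case $1 in
    *@sha256:*) digest=${1##*@sha256:} ;;   # keep everything after the last @sha256:
    *) return 1 ;;                          # tag-only or bare reference
  esac
  [ "${#digest}" -eq 64 ] || return 1       # sha256 is 64 hex characters
  case $digest in
    *[!0-9a-f]*) return 1 ;;                # any non-hex character fails
  esac
  return 0
}
```

In a signing script this might guard the call as is_digest_ref "$IMAGE" || exit 1, so a tag such as payments/api:v1 can never be signed by accident.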

Verification is policy-driven. Add the certificate to a named trust store, then import a trust policy that scopes which signers are trusted for which repositories:

az keyvault certificate download -n signing-cert --vault-name kv-signing -f cert.pem
notation cert add --type ca --store payments-ca cert.pem

Save the following as trustpolicy.json:

{
  "version": "1.0",
  "trustPolicies": [
    {
      "name": "payments-images",
      "registryScopes": [ "kvacrprod.azurecr.io/payments/api" ],
      "signatureVerification": { "level": "strict" },
      "trustStores": [ "ca:payments-ca" ],
      "trustedIdentities": [
        "x509.subject: CN=payments.org,O=Platform,L=Sydney,ST=NSW,C=AU"
      ]
    }
  ]
}

notation policy import ./trustpolicy.json
notation verify $IMAGE

At the cluster, enforcement is done by Ratify plus an Azure Policy / Gatekeeper constraint that admits only images whose Notation signature validates against this trust policy. That closes the loop: ACR signs, AKS refuses anything unsigned or signed by the wrong identity. (Note Notation v1.2+ also supports RFC 3161 timestamping so signatures stay verifiable after the signing cert expires — essential with short-lived certs.)

The trust policy’s signatureVerification.level is the single most consequential knob — it decides what a verification failure actually does. The verification-level matrix:

| Level | Signature required? | Expiry enforced? | Revocation checked? | Use for |
| --- | --- | --- | --- | --- |
| strict | Yes | Yes (hard fail) | Yes (hard fail) | Production: full enforcement |
| permissive | Yes | Warn only | Warn only | Rollout/grace period |
| audit | No (logs result) | Logged | Logged | Observe before enforcing |
| skip | No | No | No | Explicitly trusted scope (rare) |
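Ramping a policy through these levels usually means editing one field and re-importing. The sketch below is one way to script that, assuming a POSIX shell with sed -E; set_verification_level is a hypothetical helper, and it rewrites every "level" value in the file, so it only suits a policy file whose statements should all move together.

```shell
# set_verification_level LEVEL < trustpolicy.json > trustpolicy.next.json
# Rewrites the "level" value(s) in a Notation trust policy read from stdin.
# Hypothetical helper; rejects anything outside the four documented levels.
set_verification_level() {
  case "$1" in
    strict|permissive|audit|skip) ;;
    *) echo "unknown level: $1" >&2; return 1 ;;
  esac
  sed -E 's/("level"[[:space:]]*:[[:space:]]*")[a-z]+(")/\1'"$1"'\2/'
}

# Example ramp step (not executed here):
#   set_verification_level permissive < trustpolicy.json > trustpolicy.next.json
#   notation policy import ./trustpolicy.next.json
```

Keeping each step as a separate file makes the ramp reviewable: the diff between two policy files is exactly one word.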

Quarantine and signing answer different questions and fail in different ways; conflating them is how teams think they have “supply-chain security” with only half of it. The two gates side by side:

| Dimension | Quarantine-on-push | Notation signing |
| --- | --- | --- |
| Question answered | “Has this image been checked?” | “Is this the checked image, from a trusted signer?” |
| Gate point | Registry (pull blocked until passed) | Admission (Ratify at AKS) + notation verify |
| Protects against | Pulling an unscanned image | Tampering, wrong-signer, provenance forgery |
| Breaking-change risk | High (blocks all pulls until promoted) | Low (audit → permissive → strict ramp) |
| Premium-only | Yes | Signing artifacts work on any tier; enforcement is yours |
| Failure mode if misconfigured | Deployments stall (no promotion) | Images admit unsigned (level too loose) |

Geo-replication, zone redundancy, and regional failover

Geo-replication makes the registry a single logical resource with image storage in multiple regions, served through one login server (<name>.azurecr.io). Pulls from a region are served by the nearest replica’s data endpoint, which cuts egress cost and latency for multi-region clusters. The registry also survives a regional outage, because the global endpoint routes around an unhealthy replica.

az acr replication create -r $ACR -l southeastasia
az acr replication create -r $ACR -l westus2
az acr replication list -r $ACR -o table

Zone redundancy is now on by default for every replica (and for the home region in AZ-supporting regions) at no extra cost — ACR spreads each replica’s storage across availability zones automatically. The --zone-redundancy flag still exists for backward compatibility but you no longer need to set it. The practical upshot: a single replica already survives a zone failure; geo-replication is what you add for region failure and pull locality.

Failover is platform-managed and health-aware. ACR continuously checks each replica and reroutes the global endpoint away from a replica that cannot serve reliably. There is no customer-invocable failover button and no DNS change on your side — pushes, pulls, and deletes continue through the surviving replicas. Your job is capacity planning (enough replicas that losing one does not overload the rest) and ensuring each consuming region actually has a nearby replica.

| Concern | Mechanism | Who triggers it | Customer action |
| --- | --- | --- | --- |
| Zone outage | Zone-redundant replica storage (default) | Platform, automatic | None beyond confirming an AZ region |
| Region outage | Geo-replication, health-aware routing | Platform, automatic | Add a replica per consuming region |
| Pull latency / egress | Regional data endpoint nearest the client | Routing, automatic | Place a replica near each cluster |
| Disaster recovery copy | Replica acts as a live, writable copy | You, by adding the replica | Decide topology + capacity |
| Replica capacity loss | Surviving replicas absorb load | Platform routing | Size for N-1 (lose one, survive) |

The resilience features overlap in name but protect against different blast radii; this is the table that settles “do we need geo-replication if we already have zone redundancy?” (yes — they cover different failures):

| Feature | Blast radius covered | Default? | Extra cost | Single-replica enough? |
| --- | --- | --- | --- | --- |
| Zone redundancy | One availability zone | Yes (AZ regions) | None | Yes, for zone failure |
| Geo-replication | An entire region | No (you add replicas) | Per-replica Premium unit | No, need ≥2 regions |
| Health-aware routing | Unhealthy replica | Yes (with replicas) | Included | n/a, needs ≥2 replicas |
| Soft delete | Accidental/malicious delete | No (opt-in) | Storage of deleted items | Independent of replicas |
| Customer-managed key | Key compromise / BYOK control | No (opt-in) | Key Vault + ops overhead | Independent of replicas |

Replica state and the per-region data endpoint are what you actually monitor; the lifecycle of a replica:

| Replica state | Meaning | Serves pulls? | Action |
| --- | --- | --- | --- |
| Creating | Initial sync in progress | Partial (syncing) | Wait; don’t depend on it yet |
| Ready | Synced, serving locally | Yes | Normal operation |
| Syncing | Catching up after a write | Yes (may lag briefly) | Normal; eventual consistency |
| Unhealthy | Cannot serve reliably | No (routed around) | Platform reroutes; investigate region |
| Deleting | Removal in progress | No | Ensure no region depends on it |
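A rollout script should not point a cluster at a replica that is still in Creating. Below is a minimal polling sketch, assuming a POSIX shell; wait_until is a hypothetical helper that takes the status command as arguments so it can be exercised without Azure. The status.displayStatus query path in the usage comment is an assumption about the replication resource's output shape; confirm it against az acr replication show on your registry.

```shell
# wait_until EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until its stdout equals EXPECTED, or gives up.
# Hypothetical helper; the command being polled is injected by the caller.
wait_until() {
  expected=$1; attempts=$2; delay=$3
  shift 3
  i=0
  state=""
  while [ "$i" -lt "$attempts" ]; do
    state=$("$@" 2>/dev/null)
    if [ "$state" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up: last state '$state', wanted '$expected'" >&2
  return 1
}

# Example (not executed here; the query path is an assumption):
#   wait_until Ready 30 20 \
#     az acr replication show -r "$ACR" -n southeastasia \
#       --query status.displayStatus -o tsv
```

Bounding the attempts keeps a stuck replica from hanging the pipeline; the non-zero exit lets the caller decide whether to retry or page someone.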

Vulnerability scanning with Defender for Containers

Microsoft Defender for Containers scans images in ACR on push, on pull, and continuously (re-scanning already-pushed images as new CVE definitions land, for images pulled in the last 30 days). Enable the plan at the subscription level:

az security pricing create -n Containers --tier Standard

Because we disabled public access, Defender’s scanner reaches the registry through the trusted-services bypass — which is precisely why --allow-trusted-services true is not optional once you turn on scanning. Findings surface in Defender for Cloud and can be queried in Azure Resource Graph to drive a fail-the-build or block-the-pull gate:

securityresources
| where type == "microsoft.security/assessments/subassessments"
| where id contains "containerRegistryVulnerability"
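// Cast dynamic properties to string so they can be filtered and sorted
| extend sev = tostring(properties.status.severity),
         cve = tostring(properties.id),
         repo = tostring(properties.additionalData.repositoryName),
         digest = tostring(properties.additionalData.imageDigest)
| where sev in ("High", "Critical")
| project repo, digest, cve, sev, description = tostring(properties.description)
// Ascending puts "Critical" ahead of "High"
| order by sev asc, repo asc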

Wire that query into a scheduled check or an Azure Monitor alert so a Critical finding on an in-use image pages the owning team, rather than sitting in a portal blade nobody opens.

Defender scans at three recurring triggers, plus a one-time baseline sweep when the plan is first enabled. Each has its own coverage window and cost model; knowing which trigger catches what tells you whether a gap is a config miss or a feature limit:

| Scan trigger | When it runs | Coverage window | Catches | Limit / note |
| --- | --- | --- | --- | --- |
| On push | Image pushed to ACR | The new image | New CVEs at publish time | Per-image billing event |
| On pull | Image pulled | The pulled image | Drift if scanned long ago | Only images actually pulled |
| Continuous | New CVE definitions land | Images pulled in last 30 days | Newly disclosed CVEs in running images | Beyond 30 days, not re-scanned |
| Registry baseline | Plan enabled | Existing images | Backlog of known CVEs | One-time sweep on enablement |
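Because Defender reports findings but does not block anything by itself, the fail-the-build gate has to be written by you. The sketch below shows the gating half only, assuming a POSIX shell with awk and tab-separated rows of repo, digest, CVE, and severity on stdin; count_blocking and gate_on_findings are hypothetical names, and how the rows are produced (for example from the Resource Graph query) is left to the caller.

```shell
# count_blocking
# Reads TSV rows (repo, digest, cve, severity) on stdin and prints how many
# carry a High or Critical severity. Hypothetical helper.
count_blocking() {
  awk -F '\t' '$4 == "High" || $4 == "Critical" { n++ } END { print n + 0 }'
}

# gate_on_findings
# Exits non-zero when any blocking finding is present, so a CI step fails.
gate_on_findings() {
  blocking=$(count_blocking)
  if [ "$blocking" -gt 0 ]; then
    echo "blocking findings: $blocking" >&2
    return 1
  fi
  return 0
}
```

Piping the query result into gate_on_findings turns the report into a fail-closed step: the job stops when the count is above zero.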

Where signing/quarantine/scanning each fit in the supply-chain gate sequence — they are complementary, not interchangeable:

| Gate stage | Control | Blocks what | Fail-closed by default? |
| --- | --- | --- | --- |
| Build | ACR Tasks build-test-push | Unverified build output | Yes (push gated on test) |
| Push | Quarantine policy | Unscanned image becoming pullable | Yes (when enabled) |
| Scan | Defender for Containers | Known High/Critical CVEs | No, you wire the gate |
| Sign | Notation + Key Vault | Unsigned artifacts (post-sign) | No, signing is additive |
| Admit | Ratify + Gatekeeper | Unsigned/wrong-signer at AKS | Yes (with strict + deny policy) |

Purge tasks, retention policies, and untagged manifest cleanup

A busy CI pipeline tagging every build by run ID will accumulate thousands of manifests and bloat storage and scan scope. Two complementary tools clean up: a retention policy for untagged manifests, and a purge task for tags.

The retention policy auto-deletes untagged manifests after N days. Untagged manifests are typically the orphans left when a tag is overwritten:

az acr config retention update -r $ACR \
  --status enabled --days 14 --type UntaggedManifests

For tag-level cleanup on a schedule, ACR ships a containerized acr purge command you run as a scheduled task. This deletes tags older than a duration matching a filter, and --untagged then removes the now-unreferenced manifests:

PURGE_CMD="acr purge \
  --filter 'payments/api:.*' \
  --filter 'payments/worker:.*' \
  --ago 30d --untagged"

az acr task create -r $ACR -n nightly-purge \
  --cmd "$PURGE_CMD" \
  --schedule "0 2 * * *" \
  --context /dev/null

Two sharp edges. First, acr purge --untagged can delete manifests that belong to multi-arch images or signatures if you are not careful with filters — anything referenced only by digest (signatures, SBOMs, multi-arch child manifests) looks “untagged.” Test filters with a --dry-run (supported by the purge command) before scheduling. Second, deleted image data is unrecoverable unless soft delete is enabled, which keeps deleted artifacts recoverable for a retention window — turn it on first if you want a safety net.

az acr config soft-delete update -r $ACR --status enabled --days 7

The cleanup tools overlap and interact; running a purge before soft delete is on is the classic “we deleted a signed prod image and couldn’t get it back” incident. The cleanup-mechanism matrix:

| Mechanism | Deletes | Scheduled? | Reversible? | Key flag | Sharp edge |
| --- | --- | --- | --- | --- | --- |
| Untagged retention | Orphaned (untagged) manifests | Auto after N days | Only with soft delete | --type UntaggedManifests --days | Signatures/SBOMs are “untagged” |
| Purge task (acr purge) | Tags older than --ago + their manifests | Yes (cron) | Only with soft delete | --filter, --ago, --untagged | Greedy filters delete multi-arch children |
| Manual delete | A specific tag/manifest | No | Only with soft delete | az acr repository delete | No undo without soft delete |
| Soft delete | (recovery layer) | n/a | Yes (within window) | --status enabled --days | Counts toward storage while retained |

acr purge has enough flags that a wrong combination is destructive; the flag reference, with the safe defaults highlighted:

| Flag | Effect | Safe default | Danger if misused |
| --- | --- | --- | --- |
| --filter 'repo:regex' | Which repo:tags are in scope | Narrow, per-repo regex | .*:.* matches the whole registry |
| --ago 30d | Only tags older than this | Generous window | 0d deletes everything matched |
| --untagged | Also delete now-orphaned manifests | Off until tested | Removes signatures/multi-arch children |
| --keep N | Retain the N most recent matching tags | --keep 3+ for prod | Omitting it keeps none beyond --ago |
| --dry-run | Print what would be deleted, delete nothing | Always run first | Skipping it = blind destructive run |
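One way to keep those safe defaults from being skipped is to generate the purge command instead of typing it. This is a minimal sketch, assuming a POSIX shell; build_purge_cmd is a hypothetical helper that only assembles a string from the flags in the table, defaults to --dry-run, and deliberately leaves --untagged out until the filters have been tested.

```shell
# build_purge_cmd MODE AGO KEEP FILTER [FILTER...]
# MODE is "dry-run" (default behaviour) or "delete".
# Prints an acr purge command line; it does not run anything.
# Hypothetical helper for illustration.
build_purge_cmd() {
  mode=$1; ago=$2; keep=$3
  shift 3
  if [ "$#" -eq 0 ]; then
    echo "at least one filter is required" >&2
    return 1
  fi
  cmd="acr purge --ago $ago --keep $keep"
  for f in "$@"; do
    case "$f" in
      '.*:.*') echo "refusing registry-wide filter: $f" >&2; return 1 ;;
    esac
    cmd="$cmd --filter '$f'"
  done
  # Anything other than an explicit "delete" stays a dry run.
  [ "$mode" = "delete" ] || cmd="$cmd --dry-run"
  printf '%s\n' "$cmd"
}
```

The output can be passed to az acr task create via --cmd, as in the nightly-purge example; switching MODE to delete is then a deliberate, reviewable change.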

CI/CD wiring with managed identity and OIDC keyless push

The final piece removes the last long-lived secret. Instead of a token password or service-principal secret in the pipeline, use OIDC federated credentials: GitHub Actions (or Azure DevOps) presents a short-lived OIDC token, Entra ID validates it against a federated credential on a user-assigned managed identity, and the pipeline gets a transient access token. Nothing persistent is stored.

# User-assigned identity the pipeline will assume
az identity create -g $RG -n id-payments-cicd
APP_ID=$(az identity show -g $RG -n id-payments-cicd --query clientId -o tsv)
OID=$(az identity show -g $RG -n id-payments-cicd --query principalId -o tsv)

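# Push rights to the registry (use a custom role / scope-map for least privilege).
# AcrPush covers image push and pull only; queuing `az acr build` runs needs
# additional task permissions (for example Contributor, or a custom role that
# allows scheduling runs on the registry).
az role assignment create --assignee $OID --role AcrPush --scope $ACR_ID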

# Federate to a specific repo + branch — subject must match exactly
az identity federated-credential create \
  -g $RG --identity-name id-payments-cicd \
  -n gh-payments-main \
  --issuer https://token.actions.githubusercontent.com \
  --subject repo:org/payments:ref:refs/heads/main \
  --audiences api://AzureADTokenExchange

The workflow requests id-token: write, logs in with no secret, and pushes:

permissions:
  id-token: write   # required to fetch the OIDC token
  contents: read

jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
      - name: Build and push via ACR Tasks
        run: |
          az acr login --name kvacrprod
          az acr build -r kvacrprod -t kvacrprod.azurecr.io/payments/api:${{ github.sha }} .

Granting id-token: write only allows the job to request an OIDC token; it confers no resource access by itself. All authorization flows from the federated-credential subject match and the role assignment, so scope both tightly — federate per repo and branch, and assign push on the narrowest scope (a custom role limited to specific repositories beats AcrPush across the registry).

The federated-credential subject is the security boundary, and an over-broad subject is the difference between “main of this repo can push” and “any branch or fork can push.” The subject-pattern reference:

| Scenario | --subject pattern | Scope granted | Risk if loosened |
| --- | --- | --- | --- |
| Specific branch | repo:org/repo:ref:refs/heads/main | Only main of that repo | Any-branch push if you broaden it |
| Specific tag | repo:org/repo:ref:refs/tags/v1.0.0 | That exact release tag | The subject is matched exactly, so a v* wildcard will not match; use an environment subject for rolling releases |
| Pull request | repo:org/repo:pull_request | PR-triggered runs | Untrusted fork code can push |
| Environment | repo:org/repo:environment:prod | Jobs targeting prod env | Gate the env with reviewers |
| Azure DevOps | sc://org/project/connection | A specific service connection | Connection reuse across pipelines |
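Because the subject has to match the token's sub claim character for character, generating it is safer than typing it. The sketch below is an illustration, assuming a POSIX shell; gh_subject is a hypothetical helper covering the GitHub rows of the table, and it rejects wildcards on the assumption that the credential is matched exactly.

```shell
# gh_subject ORG REPO KIND [VALUE]
# KIND is one of: branch, tag, environment, pull_request.
# Prints the federated-credential subject string for GitHub Actions.
# Hypothetical helper for illustration.
gh_subject() {
  value=${4-}
  case "$value" in
    *'*'*) echo "wildcards are not matched: $value" >&2; return 1 ;;
  esac
  case "$3" in
    branch)       printf 'repo:%s/%s:ref:refs/heads/%s\n' "$1" "$2" "$value" ;;
    tag)          printf 'repo:%s/%s:ref:refs/tags/%s\n' "$1" "$2" "$value" ;;
    environment)  printf 'repo:%s/%s:environment:%s\n' "$1" "$2" "$value" ;;
    pull_request) printf 'repo:%s/%s:pull_request\n' "$1" "$2" ;;
    *) echo "unknown kind: $3" >&2; return 1 ;;
  esac
}

# Example (not executed here):
#   az identity federated-credential create ... \
#     --subject "$(gh_subject org payments branch main)"
```

Using the same helper in the provisioning script and in a troubleshooting step gives you a string to compare directly against the sub claim when a login fails.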

The CI authentication options to a locked-down registry, ranked from worst to best, so a reviewer can say exactly why a PR’s choice is or isn’t acceptable:

| Auth method | Stored secret? | Rotation burden | Reach locked registry? | Verdict |
| --- | --- | --- | --- | --- |
| Admin user password | Yes (long-lived) | Manual, high | Needs public on or VNet | Reject, never |
| Scope-map token password | Yes (expirable) | Scheduled rotation | Same | Last resort (non-Entra) |
| SP client secret | Yes (expirable) | Scheduled rotation | Yes (Entra) | Legacy only |
| Self-hosted runner + MI | No (keyless) | None | Yes (in VNet) | Good for private-only registries |
| OIDC federated credential | No (keyless) | None | Public on, or self-hosted in VNet | Best for hosted runners |

Architecture at a glance

Trace a single image from a developer’s commit to a running pod, and every control in this article lines up on one left-to-right path. On the far left, a commit to main fires the ACR Task — but the build does not run on the developer’s machine or a generic CI agent; it runs on ACR-managed compute inside the registry boundary, authenticated by an OIDC federated credential so no secret is stored anywhere. The task builds, runs unit tests against the freshly built image, and only then pushes by digest. The push lands the image in a quarantined state: invisible to normal pulls. Defender for Containers scans it; a Notation signature is attached using a certificate held in Key Vault; and once the promotion automation marks it passed, the image becomes pullable.

From there the request path inverts. The registry itself sits behind a private endpoint with public access disabled, so its *.azurecr.io control endpoint and every *.<region>.data.azurecr.io data endpoint resolve to private IPs inside the VNet via the privatelink.azurecr.io zone. The registry is geo-replicated and zone-redundant: a copy lives in each consuming region, each spread across availability zones, with health-aware routing in front. When an AKS cluster pulls, its kubelet managed identity authenticates with AcrPull (no imagePullSecret), the global login server routes it to the nearest replica’s data endpoint, and Ratify at the admission gate refuses the image unless its Notation signature validates against the trust policy. The numbered badges below mark the five hops where this most commonly breaks — read the legend as symptom · confirm · fix.

[Figure: architecture diagram tracing an image from a Git commit through an OIDC-authenticated ACR Task build-test-push, into a quarantined-then-scanned-then-Notation-signed Premium registry behind a private endpoint with private DNS, geo-replicated and zone-redundant across two regions, pulled by an AKS kubelet managed identity via the nearest data endpoint and admitted only after Ratify validates the signature. Five numbered failure badges mark the build credential, the port/DNS resolution, quarantine gating, signature verification, and replica failover hops.]

Real-world scenario

A fintech platform team ran a single Standard ACR in Australia East feeding AKS clusters in Australia East and Southeast Asia. Two problems surfaced in the same quarter. First, a security review flagged that the registry’s admin user was enabled and its password lived in a Kubernetes imagePullSecret that had not rotated in 14 months — and the same secret was pasted into three pipelines. Second, the Southeast Asia clusters were pulling every layer cross-region on cold starts, adding seconds to pod startup during scale-out and racking up inter-region egress on every deployment.

They upgraded to Premium and made three coordinated changes. For credentials, they killed the admin user, moved AKS to managed-identity pulls by attaching the registry to each cluster (az aks update --attach-acr), which assigns AcrPull to the kubelet identity — no secret in the cluster at all — and moved pipelines to OIDC federated credentials. For locality, they added a geo-replica in Southeast Asia. Because the global login server is unchanged, no manifests, Helm charts, or pipelines needed editing; the Southeast Asia kubelets simply began resolving to the local data endpoint and pulling within region. For provenance, they enabled quarantine-on-push and Notation signing, ramping enforcement from audit to permissive to strict over three sprints so a missed signature degraded gracefully instead of blocking deploys on day one.

# Replica colocated with the SEA clusters — single command, zero manifest changes
az acr replication create -r kvacrprod -l southeastasia

# Each AKS cluster pulls with its kubelet managed identity, no imagePullSecret
az aks update -g rg-aks-sea -n aks-sea --attach-acr kvacrprod

The measurable outcomes: cold-start pull time in Southeast Asia dropped because layers no longer crossed the region boundary, inter-region egress on deploys went to near zero, and the credential audit finding closed because there were no static registry secrets left to rotate. The replica also gave them an unplanned benefit during a later Australia East zone disruption: the SEA replica kept serving pulls while the home region recovered, with no failover action on their part.

The rollout was not free of friction. The first attempt to enable quarantine on day one stalled every deployment because the promotion automation was not yet live, which is exactly why the second attempt sequenced the automation first. The team took away two lessons. Geo-replication is sold as DR, but the day-to-day wins are pull locality and the fact that one login server lets you change the topology underneath without touching a single workload manifest. And any breaking gate (quarantine, strict signing) must have its promotion path live before you flip it. The before-and-after summary:

| Change | Before | After | Mechanism |
| --- | --- | --- | --- |
| Registry credential | Admin password in 3 pipelines + cluster | Zero stored secrets | MI pulls + OIDC |
| Credential audit finding | Open (14-month-old secret) | Closed | No static creds to rotate |
| SEA cold-start pull | Cross-region, seconds added | In-region | Local geo-replica |
| Inter-region egress on deploy | Per-layer, per-deploy | ~Zero | Nearest data endpoint |
| Unsigned image admission | Allowed | Denied (strict) | Notation + Ratify |
| Australia East zone disruption | Would halt pulls | SEA replica served through it | Health-aware routing |

Advantages and disadvantages

The hardened posture is not free — it trades operational simplicity for security, resilience, and locality. The explicit two-column view:

| Advantages | Disadvantages |
| --- | --- |
| No standing secret to leak or rotate (MI + OIDC) | Requires Premium SKU (higher floor cost) |
| Blast radius scoped per repo (scope maps / RBAC) | More moving parts to operate and monitor |
| Provable provenance (signing) blocks tampering | Signing/quarantine add a learning curve + ramp risk |
| Unscanned/unsigned images cannot ship (gates) | Breaking gates stall deploys if promotion isn’t live |
| Survives zone and region failure, automatically | Each replica is a billable Premium unit |
| Pull locality cuts egress + cold-start latency | Eventual consistency: a just-pushed tag may lag a replica briefly |
| Registry never internet-reachable (private endpoint) | DNS/PE misconfig can lock you (or CI) out |
| Smaller, cheaper, faster scans (retention/purge) | Aggressive purge without soft delete is irrecoverable |

Where each advantage actually matters: the keyless story matters most to teams with audit obligations or a history of leaked credentials — it removes an entire class of finding. Geo-replication matters to genuinely multi-region workloads; for a single-region app it is pure cost with no benefit, so do not add replicas you do not pull from. Quarantine and signing matter most where a compromised image is catastrophic (anything handling money or PII) and least where you are iterating on an internal dev tool — there, the strict ramp is overhead you can defer. The private endpoint matters whenever the registry would otherwise be one leaked credential away from full public exposure, which is to say almost always. Read the disadvantages as a sequencing guide, not a deterrent: every one of them is mitigated by rolling the breaking controls out after their safety nets (promotion automation, soft delete, a permissive signing ramp) are live.

Hands-on lab

This builds a hardened Premium registry, proves the admin user is gone, signs an image, and tears it all down. It uses real commands; the Premium registry and a single geo-replica accrue cost while they exist, so do the teardown. Run it in Cloud Shell.

1. Create the resource group and a Premium registry with the admin user disabled.

RG=rg-acr-lab
ACR=kvacrlab$RANDOM          # must be globally unique
LOC=australiaeast
az group create -n $RG -l $LOC
az acr create -n $ACR -g $RG --sku Premium --admin-enabled false

Expected: a registry resource with "adminUserEnabled": false and "sku": { "name": "Premium" }.

2. Confirm the admin user is actually off.

az acr show -n $ACR --query adminUserEnabled -o tsv     # expect: false
az acr credential show -n $ACR 2>&1 | head -1           # expect: an error — admin disabled

3. Build an image inside the registry with a quick task (no Docker daemon needed).

cat > Dockerfile <<'EOF'
FROM mcr.microsoft.com/cbl-mariner/busybox:2.0
CMD ["echo", "hello from a registry-built image"]
EOF
az acr build -r $ACR -t demo/hello:v1 .

Expected: a remote build log ending with the pushed image and its digest.

4. Create a least-privilege scope map and a pull-only token.

az acr scope-map create -r $ACR -n demo-pull \
  --repository demo/hello content/read metadata/read \
  --description "Pull-only for the demo repo"
az acr token create -r $ACR -n demo-puller --scope-map demo-pull -o json \
  --query "{name:name, status:status}"

Expected: a token in enabled status bound to demo-pull.

5. Turn on untagged retention and soft delete (the safety nets).

az acr config retention update -r $ACR --status enabled --days 7 --type UntaggedManifests
az acr config soft-delete update -r $ACR --status enabled --days 7
az acr config retention show -r $ACR -o table
az acr config soft-delete show -r $ACR -o table

Expected: both policies report enabled.

6. Add a geo-replica and watch it reach Ready.

az acr replication create -r $ACR -l southeastasia
az acr replication list -r $ACR -o table     # status goes Creating -> Ready

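7. (Optional) Sign by digest with Notation + Key Vault. If you have a Key Vault with a signing certificate and the azure-kv plugin installed, sign the digest from step 3. KEY_ID is the certificate key ID obtained as in the signing section, and notation verify only succeeds once a trust policy whose registryScopes covers this lab repository has been imported: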

DIGEST=$(az acr repository show -n $ACR -t demo/hello:v1 --query digest -o tsv)
IMAGE=$ACR.azurecr.io/demo/hello@$DIGEST
notation sign --signature-format cose --id $KEY_ID --plugin azure-kv \
  --plugin-config self_signed=true $IMAGE
notation verify $IMAGE     # expect: verification succeeded

8. Tear it all down so nothing accrues cost:

az group delete -n $RG --yes --no-wait

Expected commands at each step and what a healthy result looks like:

| Step | Command (core) | Healthy result | If it fails |
| --- | --- | --- | --- |
| 1 | az acr create --sku Premium --admin-enabled false | Premium registry, admin off | Name not unique → choose another |
| 2 | az acr credential show | Error (admin disabled) | If it returns creds, admin is still on |
| 3 | az acr build -t demo/hello:v1 . | Remote build + digest | Quota/region issue → retry, check SKU |
| 4 | az acr token create --scope-map demo-pull | Token enabled | Scope map missing → create it first |
| 5 | az acr config retention/soft-delete update | Both enabled | Basic/Standard → Premium-only feature |
| 6 | az acr replication create -l southeastasia | Replica Ready | Region not AZ-capable → pick another |
| 7 | notation verify | Verification succeeded | Plugin/cert missing → install/grant KV |
| 8 | az group delete --yes | RG removed | Locks present → remove resource locks |

Common mistakes & troubleshooting

The failures below are the ones that actually page people. Each is symptom → root cause → confirm (exact command/path) → fix. Scan the playbook table first, then read the detail for the row that matches.

| # | Symptom | Root cause | Confirm | Fix |
| --- | --- | --- | --- | --- |
| 1 | Login works, layer pull hangs/times out | DNS resolves control endpoint but not data endpoints | nslookup $ACR.<region>.data.azurecr.io returns public IP | Add private DNS zone group; verify per-region A records |
| 2 | docker login/pull → denied from CI | Public off + agent outside VNet, not trusted | az acr show --query publicNetworkAccess = Disabled | Self-hosted runner in VNet, or OIDC + scoped allow |
| 3 | Defender shows no scan results | networkRuleBypassOptions=None or plan off | az acr show --query networkRuleBypassOptions; az security pricing show -n Containers | Set bypass AzureServices; enable plan Standard |
| 4 | ACR Task fails on import, not on build | Source registry is network-restricted (not trusted) | Task log shows timeout pulling source, not pushing | Make source reachable; run import from inside VNet |
| 5 | Base-image trigger never fires on a CVE | Dockerfile FROM unpinned or latest | az acr task show --query "...baseImageTrigger" | Pin base to a specific tag; --base-image-trigger-type Runtime |
| 6 | Every deploy stalls after enabling quarantine | Promotion automation not live; images stuck quarantined | az acr manifest list-metadata shows quarantine state | Promote/disable; bring scanner+promotion live first |
| 7 | notation verify fails for a legit image | Verified by tag (mutable) or wrong trust identity | Re-run notation verify against the digest | Sign+verify by digest; fix trustedIdentities/store |
| 8 | AKS won’t pull: unauthorized/forbidden | Kubelet MI lacks AcrPull, or no --attach-acr | az aks check-acr -g <rg> -n <aks> --acr $ACR.azurecr.io | az aks update --attach-acr; assign AcrPull |
| 9 | Purge deleted a signed/multi-arch image | --untagged removed digest-only referrers | Soft-delete blade shows the deleted manifest | Restore from soft delete; narrow filter; --dry-run first |
| 10 | Pull returns a stale tag in one region | Replica still Syncing (eventual consistency) | az acr replication list shows Syncing | Wait for Ready; pin by digest for determinism |
| 11 | OIDC login fails: AADSTS70021 no matching FIC | Federated-credential subject mismatch | Compare workflow sub claim vs --subject | Align subject exactly (repo:branch/tag/env) |
| 12 | Locked out of the registry after lockdown | Disabled public access before PE/DNS were ready | az acr show --query publicNetworkAccess from outside | Temporarily re-enable public via an allowed network; fix PE/DNS |

Detail on the highest-frequency failures

#1 — Auth succeeds, layers hang. This is the canonical two-endpoint mistake. Your DNS zone group registered the registry group but the data-endpoint A records never populated (often because the zone wasn’t linked to the pulling VNet, only the hub). Confirm by resolving the data endpoint from inside the consuming VNet — a public IP means the private path isn’t wired. Fix by ensuring the private DNS zone is linked to every VNet that pulls, and that the zone group used --zone-name registry so the data records auto-populate.
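The confirm step for #1 is easy to script so it checks both endpoint classes at once. The following is a minimal sketch, assuming a POSIX shell; is_private_ipv4 and check_endpoints are hypothetical helpers that classify already-resolved addresses against the RFC 1918 ranges, and the resolution command in the usage comment is one possible choice rather than a requirement.

```shell
# is_private_ipv4 ADDR
# Succeeds for RFC 1918 addresses (10/8, 172.16/12, 192.168/16).
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_endpoints
# Reads "hostname address" pairs on stdin, prints a verdict per line, and
# returns non-zero if any endpoint is unresolved or resolves publicly.
check_endpoints() {
  bad=0
  while read -r host addr; do
    if [ -z "$addr" ]; then
      echo "UNRESOLVED  $host"
      bad=1
    elif is_private_ipv4 "$addr"; then
      echo "ok          $host -> $addr"
    else
      echo "PUBLIC      $host -> $addr"
      bad=1
    fi
  done
  return "$bad"
}

# Example, run from inside the consuming VNet (not executed here):
#   for h in "$ACR.azurecr.io" "$ACR.australiaeast.data.azurecr.io"; do
#     echo "$h $(getent hosts "$h" | awk '{print $1; exit}')"
#   done | check_endpoints
```

A login server that reports ok next to a data endpoint that reports PUBLIC is exactly the two-endpoint mistake described above.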

#6 — Quarantine stalls everything. Quarantine is a breaking change: with the policy on, nothing new is pullable until promoted. If you flip it before the scanner-and-promotion loop is live, every deployment of a new image stalls. Confirm with the manifest metadata showing images stuck in the quarantined state. The fix in an incident is to promote the stuck images (or disable the policy), then re-enable only after the promotion automation is proven. This is the single most common self-inflicted ACR outage.

#7 — Signatures verify by tag. A signature binds to immutable content (a digest). If you notation sign or verify against a tag, a later overwrite of that tag breaks the binding and verification fails for reasons that look mysterious. Always operate on repo@sha256:.... The second cause is a trustedIdentities/trust-store mismatch — the cert in the store doesn’t match the signer’s x509.subject. Re-download the cert into the named store and confirm the subject string matches exactly.

#8 — AKS can’t pull. The cluster’s kubelet identity needs AcrPull. az aks check-acr is the purpose-built diagnostic — it tells you whether the cluster can authenticate and resolve the registry. If it reports an auth failure, run az aks update --attach-acr; if it reports a DNS/network failure, the private endpoint isn’t reachable from the cluster VNet (see #1).

Best practices

Security notes

The security controls and exactly what each one prevents — secure and resilient pull in the same direction here:

| Control | Setting / mechanism | Prevents | Also helps |
| --- | --- | --- | --- |
| Disable admin user | --admin-enabled false | Shared full-access credential leak | Forces per-consumer identity |
| Private endpoint + DNS | registry group + privatelink.azurecr.io | Public exposure of registry/layers | Cuts egress (private path) |
| Trusted-services bypass | --allow-trusted-services true + MI | Over-broad firewall holes | Lets Defender/Tasks reach in safely |
| Scope maps / RBAC | content/* actions, AcrPull/AcrPush | Unscoped over-privileged access | Per-repo blast-radius limit |
| Quarantine-on-push | quarantinePolicy.status=enabled | Pulling an unscanned image | Forces a scan gate |
| Notation + Ratify | Sign by digest + strict trust policy | Tampered/wrong-signer admission | Provable provenance |
| Defender for Containers | Plan Standard | CVEs shipping undetected | Continuous re-scan of in-use images |
| Soft delete | --status enabled --days | Irrecoverable accidental/malicious delete | Recovery from a bad purge |
| CMK encryption | Key Vault key + registry encryption | Loss of BYOK key control | Compliance (BYOK) |
| Diagnostic logging | ContainerRegistry*Events → LA | Unattributable data-plane actions | Forensics, audit |

Cost & sizing

The bill is driven by the Premium SKU daily price, the number of geo-replicas (each a Premium unit), storage beyond the included allowance, outbound data transfer, ACR Tasks compute (per CPU-second, with a free monthly grant), and Defender for Containers (per image scanned). The Premium tier is a fixed daily charge that includes a large storage allowance and the full feature set; the variable costs are replicas, overage storage, egress, and scan volume.

Right-sizing is mostly about replica placement and surface size. The cost drivers and what each one buys:

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
|---|---|---|---|---|
| Premium SKU (home) | Fixed daily + bundled storage + all features | ~₹40,000–45,000 | Private endpoints, tokens, signing, replicas | Required floor; no cheaper path to these features |
| Geo-replica (each) | One additional Premium unit | ~₹40,000–45,000 each | Region failover + pull locality | Don’t replicate where you don’t pull |
| Storage overage | Per GB-month beyond allowance | Variable (per GB) | Capacity for many tags/artifacts | Retention/purge to keep it down |
| Outbound data transfer | Per GB egress (cross-region pulls) | Variable (per GB) | Pulls served to far regions | A local replica eliminates most of it |
| ACR Tasks compute | Per CPU-second (after free grant) | Variable (usage) | Builds inside the boundary | Big images/heavy CI exceed the grant |
| Defender for Containers | Per image scanned | Variable (per image) | CVE scanning push/pull/continuous | Many churning tags = more scans |
| Soft delete retention | Storage of deleted items in window | Marginal | Recovery net | Counts toward storage while retained |

A rough monthly picture for a two-region fintech registry: home + one replica (~₹80,000–90,000 in SKU), modest storage overage and egress (now small thanks to the local replica), Tasks within the free grant for a handful of services, and Defender scanning a few hundred images (~low thousands of ₹). The dominant line is always the Premium units; everything else is rounding by comparison, which is why the single biggest cost decision is how many regions you genuinely pull from.
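
The SKU line of that estimate is plain multiplication. The sketch below uses the rough per-unit range from the table above as a stated assumption, not a quoted price, so swap in current figures from the pricing page before budgeting:

```shell
# Rough monthly INR range per Premium unit, taken from the table above.
UNIT_LOW=40000
UNIT_HIGH=45000

# Prints the SKU cost range for a home registry plus N geo-replicas.
sku_range() {
  units=$((1 + $1))
  echo "$((units * UNIT_LOW))-$((units * UNIT_HIGH))"
}

sku_range 1   # home + one replica    -> 80000-90000
sku_range 3   # home + three replicas -> 160000-180000
```

Each added region costs one more full Premium unit, so the total grows linearly with the number of replicas.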

Interview & exam questions

1. Why must a network lockdown of ACR account for two endpoint classes, and what breaks if it doesn’t? ACR has a registry endpoint (*.azurecr.io, Docker v2 API + auth) and per-region data endpoints (*.<region>.data.azurecr.io, layer blobs). The registry private-endpoint group ID projects both into the VNet, but if private DNS only resolves the control endpoint, login succeeds and layer pulls hang/time out. You must link the privatelink.azurecr.io zone to every pulling VNet so the data-endpoint A records resolve privately.

2. The admin user is enabled and its password is in three pipelines. Walk through the remediation. Disable the admin user (--admin-enabled false), then move each consumer to an identity-based model: AKS to managed-identity pulls via az aks update --attach-acr (assigns AcrPull to the kubelet identity, no imagePullSecret), and pipelines to OIDC federated credentials scoped per repo/branch. Where a consumer truly cannot use Entra (a third-party appliance), issue a scope-map token with an expiry and the minimum actions. Net result: zero standing secrets.

3. What does a multi-step ACR Task give you that az acr build does not? A multi-step task’s build step does not auto-push; combined with cmd (run tests against the freshly built image) and a gated push (when: ["unit-tests"]), it is a build-test-push gate inside the registry boundary, so the image is published only if it passes. az acr build has no such gate: it builds and, unless you pass --no-push, pushes whatever builds successfully. The task also supports commit and base-image triggers.

4. Explain the base-image trigger and what it requires. When the digest behind your Dockerfile’s FROM tag moves (an upstream or internal base is rebuilt), a base-image trigger (--base-image-trigger-type Runtime) re-runs the task and rebuilds your image with the patched layers — auto-patching derived images against CVEs at scale. It requires the base to be pinned to a specific tag (not unpinned, ideally not latest) so ACR can track the digest behind it.

5. Quarantine-on-push vs Notation signing — what does each guarantee, and why have both? Quarantine makes a pushed image invisible to normal pulls until promoted, forcing a scan gate (“has this been checked?”). Notation signing attaches a cryptographic proof verified at admission by Ratify (“is this the checked thing, from a signer we trust?”). They protect different things — quarantine against unscanned images, signing against tampering and wrong-signer — so a complete posture uses both.

6. Why sign by digest rather than tag, and what fails if you sign by tag? A signature binds to immutable content; a tag is a mutable pointer. If you sign or verify by tag and the tag is later overwritten, the signature no longer matches the content the tag points to and verification fails for reasons that look mysterious. Always operate on repo@sha256:....

7. How does ACR survive a zone failure versus a region failure, and who triggers failover? Zone redundancy (default in AZ regions, free) spreads each replica’s storage across availability zones, surviving a zone outage. Geo-replication keeps live writable copies in multiple regions behind one login server, surviving a region outage and serving pulls from the nearest replica. Failover is platform-managed and health-aware — there is no customer failover button; the global endpoint routes around an unhealthy replica automatically.

8. A pull returns a stale tag in one region right after a push. Why, and how do you make it deterministic? Geo-replicas are eventually consistent; a replica may briefly be Syncing after a write, so it can serve the previous manifest for that tag momentarily. Confirm with az acr replication list showing Syncing. For determinism, pin by digest (repo@sha256:...) rather than by a mutable tag, or wait for the replica to reach Ready.

9. After enabling quarantine, every deployment stalls. Root cause and fix? The promotion automation was not live when quarantine was enabled, so every newly pushed image is stuck quarantined and unpullable. Confirm via manifest metadata showing the quarantined state. Fix by promoting (or disabling the policy) and only re-enabling once the scanner-and-promotion loop is proven. The lesson: enable any breaking gate after its promotion path exists.

10. An OIDC pipeline login fails with “no matching federated identity credential.” What’s wrong? The subject claim presented by the workflow doesn’t match the federated credential’s --subject. The credential federates a specific subject (e.g. repo:org/repo:ref:refs/heads/main); if the workflow runs on a different branch, tag, environment, or PR, the subject differs and Entra rejects it. Align the --subject exactly to how the pipeline runs.

11. How do you keep a busy registry small without losing signatures or multi-arch images? Use untagged-manifest retention plus a scheduled purge task, but be careful: signatures, SBOMs, and multi-arch child manifests are referenced only by digest and look “untagged.” Always --dry-run purge filters first, use --keep N for production repos, and enable soft delete so a bad filter is recoverable.

12. Which Azure roles cover pull, push, and signing, and why never put a workload on Contributor? AcrPull (pull), AcrPush (pull+push), AcrImageSigner (sign), AcrDelete (delete), each assigned at registry scope or narrowed to a subset of repositories through a custom role. A workload on Contributor/Owner has full management rights (delete the registry, change networking), far beyond pull/push, which violates least privilege and widens the blast radius catastrophically.
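
For question 11, a small sketch of the dry-run-first habit. The registry name, repository, and thresholds are hypothetical; acr purge runs as an on-demand ACR task through az acr run, and the helper adds --dry-run unless you explicitly ask it to apply:

```shell
# Hypothetical registry name.
ACR="contosoacr"

# Builds the purge command string.
# Arguments: repository, age threshold, tags to keep, optional "apply".
# Anything other than "apply" in the fourth position yields a dry run.
purge_cmd() {
  repo=$1
  ago=$2
  keep=$3
  mode=${4:-dry-run}
  cmd="acr purge --filter '${repo}:.*' --ago ${ago} --keep ${keep} --untagged"
  [ "$mode" = "apply" ] || cmd="$cmd --dry-run"
  echo "$cmd"
}

# Submits the purge to the registry as an on-demand task.
run_purge() {
  az acr run --registry "$ACR" --cmd "$(purge_cmd "$@")" /dev/null
}

# Example: list what would be deleted, review it, then apply.
# run_purge app 30d 10
# run_purge app 30d 10 apply
```

Making the destructive path opt-in means a mistyped filter costs you a log line rather than a set of signatures. Review the dry-run output for digest-only artifacts before running with apply.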

These map primarily to two exams. AZ-500 (Security Engineer) covers securing compute, storage, and registries; managing identities and access; and configuring private networking. AZ-204 (Developer) covers creating and managing container images, implementing CI/CD, and managing secrets via managed identity. The networking lockdown touches AZ-700, and the RBAC/identity material overlaps AZ-104. A compact cert mapping:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Private endpoints, DNS, firewall | AZ-500 / AZ-700 | Secure & isolate PaaS networking |
| Tokens, scope maps, RBAC roles | AZ-500 / AZ-104 | Manage access to resources |
| Managed identity & OIDC pulls/push | AZ-204 / AZ-500 | Secure app config; CI/CD |
| ACR Tasks, base-image triggers | AZ-204 | Build & manage container images |
| Quarantine, Notation, Ratify | AZ-500 | Supply-chain & content trust |
| Geo-replication, zone redundancy | AZ-104 / AZ-305 | Resilience & high availability |
| Defender for Containers | AZ-500 | Implement threat protection |

Quick check

  1. Login to a private-endpoint ACR succeeds but layer pulls hang. What single DNS thing is almost certainly missing, and how do you confirm it?
  2. You disabled the admin user and need AKS to pull with no stored secret. What one command wires the kubelet identity, and what role does it assign?
  3. True or false: enabling quarantine-on-push is a safe, non-breaking change you can flip any time.
  4. Why must you sign and verify images by digest rather than by tag?
  5. Your registry survives a zone outage automatically but you also need to survive a region outage. What do you add, and who triggers the failover?

Answers

  1. The private DNS zone group for the data endpoints is missing (or the privatelink.azurecr.io zone isn’t linked to the pulling VNet). Confirm by resolving $ACR.<region>.data.azurecr.io from inside that VNet — a public IP means the data endpoint isn’t projected privately. Fix by linking the zone to every pulling VNet and ensuring the zone group used --zone-name registry so data-endpoint A records auto-populate.
  2. az aks update --attach-acr <registry> — it assigns the AcrPull role to the cluster’s kubelet managed identity, so pods pull with no imagePullSecret.
  3. False. It is a breaking change: every newly pushed image is unpullable until promotion automation marks it passed. Enable it only after the scanner-and-promotion loop is live, or every deployment stalls.
  4. A signature binds to immutable content, and a tag is mutable. Sign/verify by tag and a later overwrite breaks the binding, so verification fails. Always use repo@sha256:....
  5. Add geo-replication (a replica in each consuming region). Failover is platform-managed and health-aware — there is no customer failover button; the global login server routes around an unhealthy replica automatically.
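
The check in answer 1 can be scripted roughly as follows. It assumes dig is available, that the VNet uses RFC 1918 address space, and hypothetical registry and region names; run it from a host inside the pulling VNet:

```shell
# Succeeds (exit 0) if the IPv4 address is in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolves a name and reports whether it lands on a private address.
check_endpoint() {
  # dig +short can print CNAMEs before the address; keep the last line.
  ip=$(dig +short "$1" | tail -n 1)
  if is_private_ipv4 "$ip"; then
    echo "private $1 -> $ip"
  else
    echo "PUBLIC  $1 -> $ip"
  fi
}

# Hypothetical registry and region:
# check_endpoint "contosoacr.azurecr.io"
# check_endpoint "contosoacr.centralindia.data.azurecr.io"
```

If the login server resolves privately but the data endpoint prints PUBLIC, you have reproduced the "login works, pulls hang" symptom at the DNS layer.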

Glossary

Next steps

You can now stand up a registry that proves what it serves and refuses what it can’t. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
