Eliminating Secrets: Key Vault and Workload Identity Federation End to End

Every stored credential is a liability with a half-life: secrets expire at the worst moment, leak into logs and .env files, and outlive the engineer who created them. This guide walks the full path to a secret-free estate — Azure Key Vault as the system of record for the few secrets you cannot avoid, managed identities for anything running inside Azure, and workload identity federation (OIDC) to extend that passwordless model to GitHub Actions and AKS. The destination is an estate where the only thing you rotate is trust, not strings.

The reason this is hard is not the vault — creating a vault takes one command. The reason is the bootstrap: to read a secret you must first authenticate, and if that authentication is itself a stored secret you have moved the problem one hop upstream, not solved it. The entire discipline in this article is closing that last gap so that no stored credential anywhere grants access to your secrets. You will learn the exact trust assertions (issuer, subject, audience), the RBAC roles that gate the data plane, the federation subjects for each platform, and — because this is operational — the precise az and portal paths to confirm why a passwordless sign-in failed, since the failure modes are subtle and the error messages are deliberately vague.

By the end you will be able to stand up Key Vault with the right authorization and network posture, attach the right flavour of managed identity to each Azure workload, federate GitHub Actions and AKS service accounts to Entra ID with no stored secret, rotate secrets with zero downtime, and prove the whole thing is secret-free with Resource Graph and audit logs. Because you will return to this mid-incident, the federation subjects, the RBAC roles, the error codes, and the failure playbook are all laid out as scannable tables — read the prose once, then keep the tables open when a deploy fails at 02:00 with AADSTS70021.

What problem this solves

Secrets do not fail loudly. They fail at 02:00 on a Saturday when a certificate expires, or six months after an engineer leaves and their personal access token is finally revoked, or the day a .env file lands in a public repo. The pain in production terms is fourfold: expiry (a rotated database password that nobody propagated takes the app down), leakage (a secret in CI logs, an image layer, a Slack message), sprawl (the same credential copied into twelve app settings, none of which you can find when you must rotate), and attribution loss (a shared service-principal secret used by forty pipelines, so a breach implicates all of them and the audit trail names one principal for every action).

What breaks without this: teams hand-roll secret rotation and it desynchronises; they store an AZURE_CREDENTIALS JSON blob in every GitHub repo and can never rotate it without coordinating forty pipelines; they grant a runtime workload Key Vault Contributor (a control-plane role) and accidentally let it grant itself more access. The instinct — “we have a vault, we are secure” — is the trap. A vault you authenticate to with a stored secret is a vault with a key under the doormat.

Who hits this: every team running workloads that need credentials — which is every team. It bites hardest on CI/CD pipelines (the long-lived deploy credential is the single most over-privileged, most-copied secret in most estates), AKS workloads (pod-managed identity is deprecated and the migration is non-obvious), and multi-repo platforms (the 20-federated-credential ceiling arrives fast when you model identity per repository). The fix is almost never “add another secret to the vault” — it is “make the platform vouch for the workload so there is no secret to store.”

To frame the whole field before the deep dive, here is every identity-bootstrap mechanism this article covers, where the trust originates, and the one failure that defines it:

| Mechanism | Where identity originates | Use it for | Defining failure mode |
| --- | --- | --- | --- |
| System-assigned managed identity | Azure platform, bound 1:1 to one resource | A standalone Azure service whose identity should die with it | Identity vanishes on resource delete; orphaned role assignments |
| User-assigned managed identity (UAMI) | Azure platform, standalone resource | Workload families that share access; survives blue/green | Forget to attach it → workload falls back to no identity |
| Workload identity federation (FIC) | External OIDC issuer (GitHub, AKS, GitLab) | Workloads outside Azure’s IMDS reach | Subject string drift → AADSTS70021 no matching FIC |
| Key Vault reference | App setting resolved by a managed identity | Injecting a vault secret into app config without code | Identity lacks Secrets User → resolves to empty → crash loop |
| CSI Secrets Store | UAMI brokered into a pod via a webhook | Mounting vault secrets as files in AKS | Missing pod label → no token → mount fails |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand Entra ID basics: a tenant, an app registration (the identity of an application), a service principal (the local instance of that app in a tenant), and Azure RBAC (role assignments on a scope). You should be comfortable running az in Cloud Shell, reading JSON output, and reading a Bicep resource. Familiarity with OIDC at the level of “an issuer mints a signed token with claims, a relying party validates it” is assumed; you do not need to know JOSE internals.

This sits at the centre of the Identity & Platform Security track. Upstream of it is Azure Entra ID Fundamentals: Tenants, Users, Groups & RBAC, which defines the principals you assign roles to, and Entra Managed Identities Deep Dive: User-Assigned, FIC & RBAC, which goes deeper on the identity objects themselves. It pairs tightly with Azure Key Vault: Secrets, Keys & Certificates (the data-plane objects you are protecting) and Azure Key Vault Secret Rotation with Managed Identity. The federation half generalises across clouds — see GitHub Actions OIDC: Keyless Deploys to Multi-Cloud and Workload Identity Federation for Secretless CI/CD.

A quick map of who owns each layer, so you escalate to the right team when a passwordless flow breaks:

| Layer | What lives here | Who usually owns it | Failure classes it causes |
| --- | --- | --- | --- |
| External OIDC issuer | GitHub/AKS token endpoint, sub claim shape | Platform / DevOps | AADSTS70021 (no matching FIC), subject drift |
| Entra ID (FIC + app/UAMI) | Trust assertions, app registration, role grants | Identity team | AADSTS700213, AADSTS700016 (app not found), missing role |
| Key Vault control plane | Vault config, networking, RBAC model | Platform team | Privilege escalation via access policies |
| Key Vault data plane | Secret values, versions, rotation | Secret-ops + app | Forbidden (no Secrets User), empty KV reference |
| Network path | Private endpoint, vault firewall, DNS | Network team | Resolution to public IP, firewall block, timeout |
| Workload runtime | IMDS / projected SA token, SDK credential | App / dev team | DefaultAzureCredential chain failures |

Core concepts

Five mental models make every later step obvious.

Secret-zero is the only hard part. To read a secret from Key Vault, a workload must authenticate to Entra ID. If that authentication relies on a stored client secret, you have only moved the problem one hop upstream. The answer is platform-issued identity: the platform a workload runs on (an Azure VM, an AKS pod, a GitHub runner) issues it a short-lived token, and Entra ID is configured to trust that platform. No secret is stored anywhere. Everything in this article is a variation on that single idea.

Managed identity is “Azure trusts itself”; federation is “Entra trusts a named external subject.” Inside Azure, the platform mints and rotates an identity bound to a resource and exposes it via IMDS (the Instance Metadata Service at 169.254.169.254). Outside Azure, an external OIDC issuer mints a token and Entra ID validates it against a configured federated identity credential (FIC). Both paths end in a normal short-lived Entra access token and zero stored secrets. The fork is purely “is the workload inside Azure’s IMDS reach?”
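To make the IMDS path concrete, here is a minimal Python sketch of the request a VM-hosted workload would send to obtain a Key Vault token. It only builds the request and never sends it, because the endpoint answers only from inside an Azure VM or scale set. The function name is illustrative, and the api-version value is an assumption taken from the commonly documented IMDS token example; App Service and Container Apps expose a different, environment-injected endpoint that this sketch does not cover.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

IMDS_TOKEN_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_imds_token_request(resource: str, client_id: Optional[str] = None) -> Request:
    """Build (but do not send) the IMDS token request for a managed identity."""
    params = {"api-version": "2018-02-01", "resource": resource}
    if client_id:
        # Select one specific user-assigned identity when several are attached.
        params["client_id"] = client_id
    # IMDS only answers requests that carry the Metadata: true header.
    return Request(f"{IMDS_TOKEN_ENDPOINT}?{urlencode(params)}", headers={"Metadata": "true"})


request = build_imds_token_request("https://vault.azure.net")
print(request.full_url)
```

Note that nothing in the request is a credential: the platform identifies the caller by where the call originates, which is why this path stores no secret.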

A FIC is a three-field trust assertion, matched exactly. A federated identity credential says: I will accept a token from this issuer, identifying this subject, for this audience. All three must match the incoming token exactly, and the comparison is a case-sensitive string match. Issuer is the OIDC issuer URL; subject is the sub claim (a repo+environment, or a Kubernetes service account); audience for Entra is always api://AzureADTokenExchange. Get one character wrong and Entra returns “no matching federated identity credential,” not “access denied,” a distinction that wastes hours if you do not know it.
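The exact-match rule is easier to reason about as code. The following Python sketch models the comparison as plain string equality on the three fields; it is an illustrative model under that assumption, not Entra ID's actual implementation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

ENTRA_AUDIENCE = "api://AzureADTokenExchange"


@dataclass(frozen=True)
class FederatedCredential:
    issuer: str
    subject: str
    audience: str = ENTRA_AUDIENCE


def mismatched_fields(fic: FederatedCredential, claims: dict) -> list:
    """Return the FIC fields that differ from the token claims (empty list means a match)."""
    aud = claims.get("aud")
    token_audiences = aud if isinstance(aud, list) else [aud]
    problems = []
    if claims.get("iss") != fic.issuer:        # exact, case-sensitive comparison
        problems.append("issuer")
    if claims.get("sub") != fic.subject:
        problems.append("subject")
    if fic.audience not in token_audiences:
        problems.append("audience")
    return problems


fic = FederatedCredential(
    issuer="https://token.actions.githubusercontent.com",
    subject="repo:contoso/orders-api:environment:prod",
)
token_claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:contoso/orders-api:environment:Prod",  # capital P: a different subject
    "aud": "api://AzureADTokenExchange",
}
print(mismatched_fields(fic, token_claims))
```

Running the same comparison by hand against a real token's claims is usually the fastest way to find which field has drifted.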

Authorization has two planes, and confusing them is the classic mistake. The control plane (manage the vault: create it, set networking, assign roles) is governed by Azure RBAC roles like Key Vault Contributor. The data plane (read/write secret values) is governed either by legacy access policies or by Azure RBAC data roles like Key Vault Secrets User. A runtime workload needs data-plane read and nothing else; giving it Contributor lets it grant itself more — a privilege-escalation path that RBAC-for-data-plane closes.

Rotation is a vault-side event consumers observe, never a coordinated deploy. The discipline is: store each secret in exactly one place (the vault), reference it versionlessly everywhere, and let resolvers follow the current version. A versioned URI or a hardcoded value anywhere reintroduces a rotation outage. Done right, rotating a secret is one operation in the vault; every consumer picks it up on its own refresh cadence.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the model side by side:

| Concept | One-line definition | Where it lives | Why it matters here |
| --- | --- | --- | --- |
| Key Vault | Managed store for secrets, keys, certs | Resource group | The system of record for unavoidable secrets |
| Secret-zero | The bootstrap credential you must not have | (nowhere, ideally) | The whole problem this article solves |
| Managed identity | Platform-minted identity for an Azure resource | Entra + resource | Passwordless auth inside Azure |
| UAMI | Standalone, reusable managed identity | Its own resource | Shared access across a workload family |
| IMDS | Metadata endpoint that issues the token | 169.254.169.254 | Where in-Azure workloads get their token |
| FIC | Federated identity credential (trust assertion) | On an app or UAMI | Lets Entra trust an external OIDC subject |
| Issuer / subject / audience | The three fields a FIC matches | In the FIC + token | All must match exactly or sign-in fails |
| Access policy | Legacy flat data-plane permission list | On the vault | The escalation-prone model to avoid |
| RBAC data role | Secrets User/Officer/Administrator | Role assignment | The recommended least-privilege model |
| Key Vault reference | @Microsoft.KeyVault(SecretUri=…) | App setting | Injects a secret without code seeing it |
| CSI Secrets Store | Mounts vault secrets as files in a pod | AKS add-on | Workload-identity-mode secret mounting |
| Versionless URI | SecretUri ending in / (no version) | Reference / config | The foundation of zero-downtime rotation |

The authorization & error reference

Before the per-step detail, here is the lookup table you scan first when a passwordless flow fails: every error you realistically see across Key Vault, managed identity, and federation, what it means, the likely cause, how to confirm it, and the fix. The non-obvious ones are the AADSTS codes (Entra token-exchange failures) and the difference between a control-plane 403 and a data-plane Forbidden.

| Code / error | Where it surfaces | Likely cause | How to confirm | First fix |
| --- | --- | --- | --- | --- |
| AADSTS70021 No matching federated identity record (generic form) | azure/login, token exchange | Token claims match no FIC; most often the sub differs | Compare workflow environment/ref to FIC subject | Fix the subject string to match exactly |
| AADSTS700213 No matching federated identity record for presented subject | Token exchange | Token sub does not match any FIC subject | az ad app federated-credential list vs token sub | Correct the FIC subject |
| AADSTS700211 No matching federated identity record for presented issuer | Token exchange | Issuer URL wrong, trailing-slash mismatch, or not configured at all | az ad app federated-credential list vs token iss | Correct the FIC issuer URL, or add the FIC |
| AADSTS700016 Application not found in directory | azure/login | Wrong client-id / SP not created | az ad sp show --id <appId> | az ad sp create --id; fix client-id |
| AADSTS7000215 Invalid client secret provided | azure/login | A secret is still being sent (not OIDC) | Workflow uses creds: JSON, not OIDC | Remove AZURE_CREDENTIALS; use id-token: write |
| Forbidden (data plane) | az keyvault secret show | Identity lacks Key Vault Secrets User | az role assignment list --assignee <pid> | Grant Secrets User at vault scope |
| 403 (control plane) | az keyvault update | Identity lacks Key Vault Contributor | Role list on the vault scope | Grant control role to the operator |
| ForbiddenByFirewall | Any data-plane call | Vault firewall blocks the caller | Vault → Networking shows “selected networks” | Add IP / private endpoint / trusted services |
| KV reference empty / app crash | App boot | Identity not enabled or lacks role; bad URI | Environment variables blade red error | Enable identity; grant role; fix SecretUri |
| SecretNotFound (404) | Resolve | Secret deleted/disabled, or wrong vault name | az keyvault secret show 404 | Restore/enable secret; correct vault |
| maximum allowed value of 20 | federated-credential create | 20-FIC ceiling on the app/UAMI | az ad app federated-credential list --id <appId> --query "length(@)" | Consolidate via env scoping / flexible FIC |
| Conflict on purge | az keyvault purge | Purge protection blocks hard-delete | Vault shows enablePurgeProtection: true | Wait out retention; this is by design |

Three reading notes that save the most time:

| Distinction | The trap | How to tell them apart |
| --- | --- | --- |
| AADSTS700213 (subject) vs 700211 (issuer) | Both say “no matching federated identity record” | 700213 names the presented subject; 700211 names the presented issuer; 70021 is the generic form, so check which field differs |
| Control-plane 403 vs data-plane Forbidden | Both look like “permission denied” | 403 on vaults/write-type ops = RBAC; Forbidden on secrets/getValue = data role/policy |
| “No matching FIC” vs “access denied” | You add a role when the subject is wrong | If the token never exchanged, it is a FIC/subject problem, not RBAC: no token reached the data plane |

Step 1 — Key Vault foundations

Before federating anything, get the vault right. Two decisions dominate: the authorization model and data protection. Both are one-way doors in practice.

RBAC over access policies. Legacy access policies are a flat list on the vault; anyone with Microsoft.KeyVault/vaults/write (Contributor, Key Vault Contributor) can grant themselves data access — a privilege-escalation path. Azure RBAC uses the standard role-assignment plane, supports scoping down to an individual secret, and is the recommended model. As of recent Key Vault API versions, RBAC is the default for newly created vaults.

az keyvault create \
  --name kv-plat-prod-001 \
  --resource-group rg-platform-prod \
  --location australiaeast \
  --enable-rbac-authorization true \
  --enable-purge-protection true \
  --retention-days 90 \
  --public-network-access Disabled \
  --sku standard

param location string = resourceGroup().location

resource kv 'Microsoft.KeyVault/vaults@2023-07-01' = {
  name: 'kv-plat-prod-001'
  location: location
  properties: {
    tenantId: subscription().tenantId
    sku: { family: 'A', name: 'standard' }
    enableRbacAuthorization: true       // RBAC data plane, not access policies
    enableSoftDelete: true              // always on; explicit for clarity
    softDeleteRetentionInDays: 90
    enablePurgeProtection: true         // irreversible — production default
    publicNetworkAccess: 'Disabled'
    networkAcls: { defaultAction: 'Deny', bypass: 'AzureServices' }
  }
}

The two authorization models, side by side — pick RBAC unless you have a specific legacy reason:

| Dimension | Access policies (legacy) | Azure RBAC (recommended) |
| --- | --- | --- |
| Granularity | Per-vault only (all secrets) | Per-vault, per-object (down to one secret) |
| Escalation risk | High: vaults/write can self-grant data | Low: data roles are separate from control |
| Where it lives | A list on the vault resource | Standard role assignments (auditable centrally) |
| Max entries | ~1024 policies per vault | RBAC role-assignment limits per scope |
| PIM / just-in-time | Not supported | Supported (eligible roles, activation) |
| Default for new vaults | Off | On (recent API versions) |
| Use it when | A legacy tool hard-codes policy APIs | Everything else |

The data-plane RBAC roles you will actually use, and who gets each:

| Role | Grants | Assign to | Never assign to |
| --- | --- | --- | --- |
| Key Vault Secrets User | Read secret values | Runtime workloads (MI, federated apps) | Humans by default |
| Key Vault Secrets Officer | Create/update/delete secrets | CI/CD that seeds secrets; secret-ops | Runtime app identities |
| Key Vault Certificates Officer | Manage certificates | PKI automation, cert-ops | Runtime app identities |
| Key Vault Crypto User | Use keys (wrap/unwrap/sign) | Apps doing envelope encryption | Anyone needing only secrets |
| Key Vault Crypto Officer | Manage keys (create/rotate/delete) | Key-ops, HSM admins | Runtime app identities |
| Key Vault Administrator | All data-plane ops | Break-glass, platform admins only | Pipelines, runtime workloads |
| Key Vault Reader | Read vault metadata (not values) | Auditors, inventory tooling | |

Assign least privilege at the secret scope where you can, and never hand a runtime workload more than Secrets User:

az role assignment create \
  --role "Key Vault Secrets User" \
  --assignee-object-id "$APP_PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --scope "/subscriptions/$SUB/resourceGroups/rg-platform-prod/providers/Microsoft.KeyVault/vaults/kv-plat-prod-001/secrets/orders-db-conn"
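The least-privilege rule above can be checked mechanically. This Python sketch audits a list of role assignments shaped like the JSON that az role assignment list returns (it assumes the roleDefinitionName and scope fields are present) and flags any Key Vault role beyond Secrets User, plus whether each grant is scoped to a single secret. The function names and the sample data are hypothetical.

```python
RUNTIME_ALLOWED_ROLES = {"Key Vault Secrets User"}


def is_secret_scoped(scope: str) -> bool:
    """True when the assignment targets one secret rather than the whole vault."""
    return "/providers/Microsoft.KeyVault/vaults/" in scope and "/secrets/" in scope


def excessive_vault_roles(assignments: list) -> list:
    """Return Key Vault role assignments that exceed read-only secret access."""
    return [
        a for a in assignments
        if a["roleDefinitionName"].startswith("Key Vault ")
        and a["roleDefinitionName"] not in RUNTIME_ALLOWED_ROLES
    ]


VAULT = ("/subscriptions/0000/resourceGroups/rg-platform-prod"
         "/providers/Microsoft.KeyVault/vaults/kv-plat-prod-001")

# Hypothetical assignments for one runtime identity.
sample = [
    {"roleDefinitionName": "Key Vault Secrets User", "scope": VAULT + "/secrets/orders-db-conn"},
    {"roleDefinitionName": "Key Vault Contributor", "scope": VAULT},
]

for finding in excessive_vault_roles(sample):
    print("over-privileged:", finding["roleDefinitionName"], "at", finding["scope"])
```

A check like this fits naturally in a scheduled pipeline, so that a role granted during an incident does not quietly become permanent.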

Soft-delete and purge protection. Soft-delete (always on) recovers a deleted vault or secret within the retention window. Purge protection blocks even a privileged actor from hard-deleting before that window elapses, defeating a ransomware-style destroy. It is irreversible once enabled — that is the point. The data-protection knobs and their trade-offs:

| Setting | Values | Default | When to change | Trade-off / gotcha |
| --- | --- | --- | --- | --- |
| enableSoftDelete | true (forced) | true | Cannot disable | Deleted objects occupy the namespace until purged |
| softDeleteRetentionInDays | 7–90 | 90 | Lower only for cost/test | Can’t reuse a soft-deleted name until purge/retention |
| enablePurgeProtection | true / (unset) | unset | Always on in prod | Irreversible; blocks redeploy that recreates the same vault name |
| enableRbacAuthorization | true / false | true (new) | Keep true | Switching mid-life requires re-granting data access |
| publicNetworkAccess | Enabled / Disabled | Enabled | Disabled in prod | Disabling without a private path locks out your own pipelines |
| networkAcls.defaultAction | Allow / Deny | Allow | Deny in prod | Deny without bypass: AzureServices breaks some integrations |
| sku.name | standard / premium | standard | premium for HSM-backed keys | Premium costs more; only needed for FIPS 140-2 L2 keys |

Network isolation. --public-network-access Disabled plus a private endpoint keeps the data plane off the internet. Pair it with a Key Vault firewall that allows trusted Azure services so platform integrations still resolve. The network options, ordered by how locked-down they are:

| Posture | What it does | Effort | Use it for | Watch-out |
| --- | --- | --- | --- | --- |
| Public, no firewall | Reachable from anywhere with RBAC | None | Dev/throwaway only | Data plane on the internet |
| Public + IP firewall | Allow-listed source IPs only | Low | Small fixed egress sets | Cloud Shell / runner IPs drift |
| Trusted services bypass | Allow Azure platform integrations | Low | App Service KV references, etc. | Broad “Azure services,” not your tenant only |
| Private endpoint | Vault gets a private IP in your VNet | Medium | Production default | Needs privatelink.vaultcore.azure.net DNS |
| Private + public disabled | Only the VNet path resolves | Medium | Strict isolation/compliance | Pipelines need a private path or self-hosted runner |

Step 2 — Managed identities, decoded

Inside Azure, you almost never need federation — you need a managed identity. There are two flavours, and choosing wrong creates real operational pain.

# A UAMI shared across a workload family
az identity create \
  --name id-orders-api \
  --resource-group rg-platform-prod \
  --location australiaeast

APP_PRINCIPAL_ID=$(az identity show -n id-orders-api -g rg-platform-prod --query principalId -o tsv)
APP_CLIENT_ID=$(az identity show -n id-orders-api -g rg-platform-prod --query clientId -o tsv)

The two flavours, decided as a table — most platforms standardise on UAMI:

| Dimension | System-assigned | User-assigned (UAMI) |
| --- | --- | --- |
| Lifecycle | Born/dies with the resource | Independent resource |
| Reuse across workloads | No (1:1) | Yes (1:many) |
| Survives blue/green replace | No (new identity each time) | Yes (re-attach the same UAMI) |
| Role-assignment churn | Re-grant on every recreate | Grant once, inherit everywhere |
| Best for | A single standalone service | A workload family / platform scale |
| Federation target | Cannot hold a FIC | Can hold FICs (AKS, external) |
| Cleanup risk | Auto-cleaned with resource | Orphaned role assignments if forgotten |
| Cost | Free | Free |

Where you can attach a managed identity, and how the token is delivered — this determines whether you even can use one:

| Host | MI support | Token delivery | Notes / limit |
| --- | --- | --- | --- |
| App Service / Functions | System + user | IMDS-like endpoint (env-injected) | Multiple UAMIs allowed; pick one for KV references |
| Virtual Machine / VMSS | System + user | IMDS 169.254.169.254 | UAMI must be assigned to the VM |
| AKS (workload identity) | UAMI via FIC | Projected SA token → exchange | Pod-managed identity is deprecated |
| Container Apps | System + user | Managed endpoint | Similar to App Service |
| Logic Apps (Standard) | System + user | Managed endpoint | Use for connectors needing KV |
| Azure DevOps / GitHub | App + FIC (federation) | External OIDC, not IMDS | No IMDS off-Azure → federation, not MI |

For an App Service, attach the UAMI and point app settings at the vault using Key Vault references — the platform resolves them at startup using the identity, so your code never sees a secret string:

az webapp identity assign \
  --name app-orders-prod --resource-group rg-platform-prod \
  --identities "/subscriptions/$SUB/resourceGroups/rg-platform-prod/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-orders-api"

az webapp config appsettings set \
  --name app-orders-prod --resource-group rg-platform-prod \
  --settings "Db__ConnString=@Microsoft.KeyVault(SecretUri=https://kv-plat-prod-001.vault.azure.net/secrets/orders-db-conn/)"

resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'app-orders-prod'
  location: location
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: { '${uami.id}': {} }
  }
  properties: {
    serverFarmId: plan.id
    keyVaultReferenceIdentity: uami.id   // which identity resolves KV references
    siteConfig: {
      appSettings: [
        {
          name: 'Db__ConnString'
          value: '@Microsoft.KeyVault(SecretUri=https://kv-plat-prod-001.vault.azure.net/secrets/orders-db-conn/)'
        }
      ]
    }
  }
}

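The SecretUri without a version (trailing /) resolves the current version. That single decision is the foundation of zero-downtime rotation in Step 6. Key Vault references resolve with the system-assigned identity by default, so whenever a UAMI should resolve them you must set keyVaultReferenceIdentity to that UAMI's resource ID, otherwise the reference fails. The Bicep above sets it; on the CLI path, follow the assign step with az webapp update --set keyVaultReferenceIdentity=<UAMI resource ID>. With multiple UAMIs attached, this setting is also how the platform knows which identity to use.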

The Key Vault reference has two syntaxes (SecretUri and VaultName/SecretName), and the SecretUri form can be versionless or pinned. Know each form and its reload behaviour:

| Reference form | Resolves | Reloads on rotation? | Use it when |
| --- | --- | --- | --- |
| SecretUri=…/secrets/<name>/ (no version) | Current version | Yes (on restart + periodic refresh) | Default; enables rotation |
| SecretUri=…/secrets/<name>/<version> | That pinned version | No (frozen) | Almost never; reintroduces rotation outages |
| VaultName=…;SecretName=… (alt syntax) | Current version | Yes | Older syntax; prefer SecretUri |
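Because a pinned version silently breaks rotation, it can be worth linting app settings before deploy. The Python sketch below parses a SecretUri-style reference and reports whether it is versionless. It is a minimal illustration that handles only the SecretUri form; the function name and the sample version string are hypothetical.

```python
import re
from urllib.parse import urlparse

_REFERENCE = re.compile(r"^@Microsoft\.KeyVault\(SecretUri=(?P<uri>[^)]+)\)$")


def reference_is_versionless(setting_value: str) -> bool:
    """Return True when a SecretUri reference follows the current version."""
    match = _REFERENCE.match(setting_value.strip())
    if not match:
        raise ValueError("not a SecretUri-style Key Vault reference")
    # The path looks like /secrets/<name>/ or /secrets/<name>/<version>.
    parts = [p for p in urlparse(match.group("uri")).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "secrets":
        raise ValueError("SecretUri does not point at a secret")
    return len(parts) == 2


current = "@Microsoft.KeyVault(SecretUri=https://kv-plat-prod-001.vault.azure.net/secrets/orders-db-conn/)"
pinned = "@Microsoft.KeyVault(SecretUri=https://kv-plat-prod-001.vault.azure.net/secrets/orders-db-conn/0123abcd)"

print(reference_is_versionless(current))
print(reference_is_versionless(pinned))
```

Failing the build on a pinned reference keeps the rotation guarantee from eroding one setting at a time.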

Step 3 — Workload identity federation: how the trust works

Federation lets Entra ID accept an OIDC token from an external issuer in exchange for an Entra access token, with no client secret involved. You configure a federated identity credential (FIC) on either an app registration or a user-assigned managed identity. A FIC is a trust assertion with three fields that must all match the incoming token: issuer, subject, and audience.

At runtime the external platform issues a short-lived OIDC token, the workload presents it to Entra ID’s token endpoint, Entra validates issuer/subject/audience against a configured FIC, and returns a normal access token. The OIDC token lives minutes; nothing durable is stored.
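The exchange itself is an ordinary OAuth 2.0 client-credentials request in which the external OIDC token takes the place of the client secret. The Python sketch below builds that request without sending it. It assumes the public-cloud login host and a Key Vault scope; the function name and the placeholder token are illustrative, and in practice azure/login or the Azure SDK performs this step for you.

```python
from urllib.parse import parse_qs, urlencode

JWT_BEARER = "urn:ietf:params:oauth:client-assertion-type:jwt-bearer"


def build_token_exchange(tenant_id: str, client_id: str, oidc_token: str,
                         scope: str = "https://vault.azure.net/.default"):
    """Return the token endpoint URL and form body for a federated exchange."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "scope": scope,
        "client_assertion_type": JWT_BEARER,
        "client_assertion": oidc_token,   # the short-lived external OIDC token
    })
    return url, body


url, body = build_token_exchange("my-tenant-id", "my-client-id", "header.payload.signature")
print(url)
print(sorted(parse_qs(body)))   # note: no client_secret field anywhere
```

The absence of a client_secret parameter is the whole point: the only proof of identity is a token the external platform minted minutes ago.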

The three fields, what each does, and the exact failure when it is wrong:

| FIC field | What it is | Example | Failure if wrong |
| --- | --- | --- | --- |
| issuer | OIDC issuer URL (must match token iss) | https://token.actions.githubusercontent.com | AADSTS700211 issuer mismatch |
| subject | Exact sub claim of the workload | repo:contoso/orders-api:environment:prod | AADSTS700213 no matching subject (AADSTS70021 in its generic form) |
| audiences | Who the token is for (Entra fixed value) | api://AzureADTokenExchange | AADSTS700212 audience mismatch |
| name | A label for the FIC (your choice) | gh-orders-prod-env | Cosmetic; must be unique on the object |

Where you can host a FIC, and the trade-off:

| FIC host | Holds FICs? | Pros | Cons |
| --- | --- | --- | --- |
| App registration | Yes | Supports flexible FICs (claims matching, wildcards) | Two objects (app + SP) to manage |
| User-assigned MI | Yes | Single object; natural for AKS SAs | Exact-match subjects only (no wildcards yet) |
| System-assigned MI | No | | Cannot federate; use a UAMI instead |

Limit: a single managed identity (or app) supports a maximum of 20 federated identity credentials. Plan subjects accordingly — one FIC per branch and per environment adds up fast. Flexible federated credentials (claims matching with wildcards) exist for GitHub/GitLab/Terraform Cloud on app objects if you outgrow exact-match.
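The ceiling is simple arithmetic, and working it through early shows how quickly a per-repository model runs out. This Python sketch assumes one exact-match FIC per repository per environment (plus optional branch subjects) on a single identity; the function names are hypothetical.

```python
FIC_LIMIT = 20  # maximum federated identity credentials on one app or UAMI


def fics_needed(repos: int, environments_per_repo: int, branches_per_repo: int = 0) -> int:
    """Exact-match FICs required when every repo/environment/branch gets its own subject."""
    return repos * (environments_per_repo + branches_per_repo)


def fits_one_identity(repos: int, environments_per_repo: int, branches_per_repo: int = 0) -> bool:
    return fics_needed(repos, environments_per_repo, branches_per_repo) <= FIC_LIMIT


# Six repos with dev/test/prod environments fit; a seventh repo does not.
print(fics_needed(6, 3), fits_one_identity(6, 3))
print(fics_needed(7, 3), fits_one_identity(7, 3))
```

When the count exceeds the limit, the options are the ones named above: fewer subjects per identity, one identity per trust boundary, or flexible credentials on an app registration.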

The federation limits you will actually hit:

| Limit | Value | Consequence | Mitigation |
| --- | --- | --- | --- |
| FICs per app / UAMI | 20 | 21st create fails | Env-scope subjects; flexible FIC; one identity per trust boundary |
| OIDC token lifetime (GitHub) | ~minutes | Long jobs may need re-issue | SDK re-requests automatically |
| Entra access-token lifetime | ~60–90 min | Token expires mid-job | SDK refreshes via the FIC |
| Subject string length / format | Issuer-defined | Mismatch → 70021 | Copy the exact sub from a token dump |
| Flexible FIC issuers | GitHub/GitLab/TF Cloud (app only) | Not on UAMI | Use an app registration for wildcards |
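A "token dump" just means reading the claims out of the JWT payload. The Python sketch below decodes the payload without verifying the signature, which is acceptable for debugging which iss, sub, and aud a platform actually sent but must never be used to make a trust decision. The helper that builds an unsigned sample token exists only so the example runs on its own; both function names are hypothetical.

```python
import base64
import json


def unverified_claims(jwt: str) -> dict:
    """Decode a JWT payload for inspection only; the signature is NOT checked."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1] + "=" * (-len(parts[1]) % 4)   # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def make_unsigned_token(claims: dict) -> str:
    """Demo helper: build a token-shaped string so the example is self-contained."""
    def b64url(data: bytes) -> str:
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")
    header = b64url(json.dumps({"alg": "none"}).encode())
    return f"{header}.{b64url(json.dumps(claims).encode())}.signature-placeholder"


sample = make_unsigned_token({
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:contoso/orders-api:environment:prod",
    "aud": "api://AzureADTokenExchange",
})
print(unverified_claims(sample)["sub"])
```

Copy the printed sub verbatim into the FIC rather than retyping it, since the match is exact.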

Step 4 — Federating GitHub Actions to Azure

This kills the AZURE_CREDENTIALS JSON secret that haunts so many pipelines. Create (or reuse) an app registration, then add a FIC whose subject pins the exact repo and ref.

APP_ID=$(az ad app create --display-name "gh-orders-deploy" --query appId -o tsv)
az ad sp create --id "$APP_ID"

The subject claim is where least privilege lives. Pin to a branch or a GitHub Environment — environment scoping is stronger because it lets you gate on approvals and environment protection rules:

# Environment-scoped: only the 'prod' environment of this repo can assume the identity
az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "gh-orders-prod-env",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:contoso/orders-api:environment:prod",
    "audiences": ["api://AzureADTokenExchange"]
  }'

Common subject formats — copy the one that matches how the workflow is triggered:

| Scenario | Subject | Strength |
| --- | --- | --- |
| Branch push | repo:ORG/REPO:ref:refs/heads/main | Medium (no approvals) |
| Tag | repo:ORG/REPO:ref:refs/tags/v1.2.3 | Medium |
| Pull request | repo:ORG/REPO:pull_request | Low (any PR) |
| Environment (preferred) | repo:ORG/REPO:environment:prod | High (approvals + protection rules) |
| Reusable workflow | repo:ORG/REPO:job_workflow_ref:ORG/REPO/.github/workflows/x.yml@ref | High (pins the workflow) |
| Org-wide (flexible FIC) | claims match repository_owner == 'ORG' | Scales to many repos |
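Since a hand-typed subject is the most common source of drift, generating it can help. This Python sketch builds the four simplest subject formats from the table; the function name is hypothetical, and it deliberately refuses ambiguous input because a job produces exactly one sub claim.

```python
from typing import Optional


def github_subject(org: str, repo: str, *, environment: Optional[str] = None,
                   branch: Optional[str] = None, tag: Optional[str] = None,
                   pull_request: bool = False) -> str:
    """Build the FIC subject for one GitHub Actions trigger shape."""
    chosen = [environment is not None, branch is not None, tag is not None, pull_request]
    if sum(chosen) != 1:
        raise ValueError("choose exactly one of environment, branch, tag, pull_request")
    base = f"repo:{org}/{repo}"
    if environment is not None:
        return f"{base}:environment:{environment}"
    if branch is not None:
        return f"{base}:ref:refs/heads/{branch}"
    if tag is not None:
        return f"{base}:ref:refs/tags/{tag}"
    return f"{base}:pull_request"


print(github_subject("contoso", "orders-api", environment="prod"))
print(github_subject("contoso", "orders-api", branch="main"))
```

The output can be passed straight into the subject field of the federated-credential parameters shown earlier.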

Grant the app’s service principal only the roles that deployment needs — scoped to the target resource group, never the subscription. Then the workflow needs the id-token: write permission and the azure/login action with no secret:

name: deploy-orders
on:
  push:
    branches: [main]

permissions:
  id-token: write        # required to request the GitHub OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: prod      # must match the FIC subject 'environment:prod'
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ vars.AZURE_CLIENT_ID }}
          tenant-id: ${{ vars.AZURE_TENANT_ID }}
          subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
      - run: az webapp deploy --name app-orders-prod --resource-group rg-platform-prod --src-path ./app.zip --type zip

Note AZURE_CLIENT_ID and friends are repository variables, not secrets — they are identifiers, not credentials, and leaking them grants nothing without the matching OIDC trust. The two workflow permissions that gate this, and what breaks without them:

| Workflow element | Purpose | If missing | Symptom |
| --- | --- | --- | --- |
| permissions: id-token: write | Lets the job request the GitHub OIDC token | No token minted | azure/login cannot get an assertion |
| permissions: contents: read | Checkout access | Checkout fails | Job fails before login |
| environment: prod | Adds environment:prod to the sub | Subject mismatch | AADSTS70021 if FIC is env-scoped |
| client-id (variable, not secret) | Identifies the app to Entra | Wrong/empty | AADSTS700016 app not found |
| azure/login@v2 | Performs the token exchange | Early v1 releases predate OIDC support | Falls back to secret-based login |

The GitHub-vs-secret comparison that justifies the migration:

| Aspect | AZURE_CREDENTIALS secret (old) | OIDC federation (new) |
| --- | --- | --- |
| Stored credential | Long-lived JSON in every repo | None |
| Rotation | Manual, coordinated across repos | Nothing to rotate |
| Blast radius if leaked | Full SP access until revoked | Identifiers only; useless without trust |
| Scoping | One SP, broad | Per repo/branch/environment subject |
| Audit attribution | Shared SP for all repos | Per-FIC, per-environment sign-in |
| Approvals gate | No | Yes (environment protection rules) |

Step 5 — AKS workload identity

Inside the cluster, pod-managed identity is deprecated; Microsoft Entra Workload ID is the model. The cluster runs an OIDC issuer, and a mutating webhook injects a projected service-account token plus the environment variables the Azure SDKs expect. Enable both:

az aks update \
  --name aks-plat-prod --resource-group rg-platform-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

OIDC_ISSUER=$(az aks show -n aks-plat-prod -g rg-platform-prod \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

Federate a UAMI to a specific Kubernetes service account. The subject is system:serviceaccount:<namespace>:<name> and the issuer is the cluster’s OIDC URL:

az identity federated-credential create \
  --name fic-orders-sa \
  --identity-name id-orders-api \
  --resource-group rg-platform-prod \
  --issuer "$OIDC_ISSUER" \
  --subject "system:serviceaccount:orders:sa-orders" \
  --audiences "api://AzureADTokenExchange"

Annotate the service account with the UAMI client ID, and label pods to opt in. The annotation tells the webhook which identity to broker; the pod label flips the workload into the webhook’s injection path.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-orders
  namespace: orders
  annotations:
    azure.workload.identity/client-id: "<APP_CLIENT_ID of id-orders-api>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: orders
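spec:
  selector:
    matchLabels:
      app: orders-api                         # apps/v1 requires a selector that matches the template labels
  template:
    metadata:
      labels:
        app: orders-api                       # illustrative label; reuse your existing app label
        azure.workload.identity/use: "true"   # opt this pod into the webhook
    spec:
      serviceAccountName: sa-orders
      containers:
        - name: orders-api
          image: acrplatprod.azurecr.io/orders-api:1.4.0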

The four moving parts of AKS workload identity, and the failure when each is missing — this is the table to keep open when a pod can’t get a token:

| Part | What it does | If missing | How to confirm |
| --- | --- | --- | --- |
| --enable-oidc-issuer | Cluster issues OIDC tokens | No issuer URL to federate | az aks show --query oidcIssuerProfile.issuerUrl empty |
| --enable-workload-identity | Installs the mutating webhook | No env vars / token injected | Webhook pods absent in kube-system |
| FIC subject = system:serviceaccount:ns:name | Entra trusts that SA | AADSTS70021 | Compare FIC subject to the pod’s SA |
| SA annotation client-id | Tells webhook which identity | Webhook can’t broker | kubectl get sa -o yaml shows no annotation |
| Pod label azure.workload.identity/use: "true" | Opts the pod in | No env vars injected | kubectl exec … env shows no AZURE_* variables |

The environment variables the webhook injects (your SDK reads these automatically):

| Variable | Value | Used by |
|---|---|---|
| AZURE_CLIENT_ID | The UAMI client ID | SDK to identify the identity |
| AZURE_TENANT_ID | Your tenant | SDK token request |
| AZURE_FEDERATED_TOKEN_FILE | Path to the projected SA token | SDK reads the assertion |
| AZURE_AUTHORITY_HOST | Entra login host | SDK token endpoint |
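A quick preflight inside the container can tell you whether the webhook injected everything before you start suspecting Entra or the vault. This is a sketch only: it assumes Bash is available in the image (many minimal images ship without it), and `check_wi_env` is a hypothetical name.

```shell
# Preflight check for the variables the workload identity webhook injects.
# check_wi_env is an illustrative name; run it inside the container.
check_wi_env() {
  local missing=0 var
  for var in AZURE_CLIENT_ID AZURE_TENANT_ID AZURE_FEDERATED_TOKEN_FILE AZURE_AUTHORITY_HOST; do
    # ${!var} is Bash indirect expansion: the value of the variable named by $var.
    if [ -z "${!var:-}" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  # The token file must exist and be readable, not merely be named.
  if [ -n "${AZURE_FEDERATED_TOKEN_FILE:-}" ] && [ ! -r "$AZURE_FEDERATED_TOKEN_FILE" ]; then
    echo "unreadable token file: $AZURE_FEDERATED_TOKEN_FILE" >&2
    missing=1
  fi
  return "$missing"
}
```

A non-zero exit with all four variables reported missing points at the pod label or the webhook; a zero exit moves the investigation on to the federated credential.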

With DefaultAzureCredential, the SDK inside the pod now authenticates with zero config. If you prefer secrets mounted as files, layer the Azure Key Vault provider for Secrets Store CSI Driver, which also works in workload-identity mode:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: spc-orders-kv
  namespace: orders
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<APP_CLIENT_ID of id-orders-api>"   # workload identity mode
    keyvaultName: "kv-plat-prod-001"
    tenantId: "<TENANT_ID>"
    objects: |
      array:
        - |
          objectName: orders-db-conn
          objectType: secret

Enable the add-on on the existing cluster, with secret rotation turned on:

az aks enable-addons \
  --addons azure-keyvault-secrets-provider \
  --name aks-plat-prod --resource-group rg-platform-prod \
  --enable-secret-rotation \
  --rotation-poll-interval 2m

The two ways an AKS pod consumes a vault secret, side by side — DefaultAzureCredential vs CSI mount:

| Aspect | SDK + DefaultAzureCredential | CSI Secrets Store mount |
|---|---|---|
| How the app gets the value | Calls Key Vault at runtime | Reads a mounted file |
| Code change | Minimal (SDK call) | None (read a file path) |
| Rotation pickup | Per call / your cache | Polled at rotation-poll-interval |
| Network path | Pod → Key Vault (needs egress/PE) | Same, via the CSI driver pod |
| K8s Secret sync | No | Optional (secretObjects) |
| Best for | Apps already using the SDK | Legacy apps that expect files |
| Failure mode | Token/role errors surface in app | Mount fails → pod stuck ContainerCreating |

Step 6 — Rotation without downtime

Rotation breaks applications when code pins a version. The discipline is to reference secrets without a version and let the resolver follow the current one.

How each consumer picks up a rotated secret, and the latency you should expect:

| Consumer | Pickup mechanism | Typical latency | App restart needed? | Gotcha |
|---|---|---|---|---|
| App Service KV reference | Restart + periodic refresh | Up to 24 hours (periodic refresh) | No (a restart forces an immediate re-resolve) | Pinned version never refreshes |
| CSI mount (per-request read) | Poll interval | rotation-poll-interval (2m default) | No | App must re-read the file |
| CSI mount (read at startup) | Poll updates file only | n/a until restart | Yes, or watch the file | Stale in-memory value |
| SDK + cached secret | Your cache TTL | Your design | No | Cache too long → stale; too short → throttle |
| Event Grid → Function | Event push | Seconds | Optional (you control) | Must build the handler |
| Hardcoded value anywhere | None | Never | Yes (redeploy) | This is the anti-pattern |

The golden rule: store the secret in exactly one place (the vault), reference it versionlessly everywhere, and treat rotation as a vault-side operation that consumers observe — never a coordinated multi-system deploy.
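The versionless rule is easy to check mechanically. The sketch below assumes secret URIs of the form https://<vault>.vault.azure.net/secrets/<name>[/<version>]; `is_versionless_secret_uri` is a hypothetical helper that you could run over app settings or Bicep parameters in CI.

```shell
# Return 0 when a Key Vault secret URI carries no version segment,
# 1 when a version is pinned, 2 when the input is not a secret URI.
# is_versionless_secret_uri is an illustrative name.
is_versionless_secret_uri() {
  local uri=${1%/}            # ignore a single trailing slash
  local path=${uri#*://*/}    # drop scheme and host, leaving secrets/<name>[/<version>]
  case $path in
    secrets/*/*) return 1 ;;  # a third segment means a pinned version
    secrets/?*)  return 0 ;;
    *)           return 2 ;;
  esac
}
```

Treating a pinned version as a build failure, rather than a review comment, keeps the rotation path from silently regressing.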

The Event Grid event types Key Vault emits, and what to wire each to:

| Event type | Fires when | Wire it to |
|---|---|---|
| SecretNewVersionCreated | A new secret version is created | Cache invalidation / rolling restart |
| SecretNearExpiry | Secret nears its expiry date | Rotation automation / alert |
| SecretExpired | Secret has expired | Page on-call; block deploys |
| CertificateNewVersionCreated | Cert renewed | Reload TLS listeners |
| CertificateNearExpiry / Expired | Cert lifecycle | PKI automation / alert |
| KeyNewVersionCreated | Key rotated | Re-wrap data-encryption keys |

Step 7 — Auditing and detecting orphaned secrets

You cannot claim “secret-free” without proving it. Two fronts: find the secrets you missed, and watch the vault you kept.

Find orphaned secrets. Sweep app settings and pipeline definitions for plaintext that should be a Key Vault reference or a federated identity:

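# App settings that look like inline secrets rather than KV references.
# The filter avoids "!" because interactive Bash treats it as history expansion inside double quotes.
az webapp config appsettings list -n app-orders-prod -g rg-platform-prod \
  --query "[?contains(value, '@Microsoft.KeyVault') == \`false\`].name" -o tsv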

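Hunt the classic offenders across the estate. Resource Graph inventories Azure resources such as web apps; app registrations that still carry password credentials (a federation candidate) are Entra objects, so they come from Microsoft Graph (az ad app list) rather than Resource Graph. The web apps inventory: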

az graph query -q "
  resources
  | where type == 'microsoft.web/sites'
  | extend kind = tostring(kind)
  | project name, resourceGroup, kind"
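For the app-registration half of the hunt, a gate along these lines could back a nightly job. It is a sketch under assumptions: a logged-in az session with permission to list applications, and `gate_password_credentials` as a hypothetical function name.

```shell
# Fail (non-zero) when any app registration still carries a password credential.
# gate_password_credentials is an illustrative name.
gate_password_credentials() {
  local offenders
  # passwordCredentials[0] is null for an empty list, so the filter keeps
  # only applications that have at least one password credential.
  offenders=$(az ad app list --all \
    --query '[?passwordCredentials[0]].displayName' -o tsv) || return 2
  if [ -n "$offenders" ]; then
    echo "App registrations with password credentials:" >&2
    echo "$offenders" >&2
    return 1
  fi
  return 0
}
```

Exit code 1 means offenders were found and 2 means the query itself failed, so a pipeline can distinguish a policy violation from an outage.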

The estate-wide checks worth scripting into a weekly job:

| What to hunt | Where | Why it matters | Action |
|---|---|---|---|
| App settings without @Microsoft.KeyVault | App Service config | Inline secret instead of a reference | Convert to a KV reference |
| App registrations with passwordCredentials | Entra (Graph) | A federation candidate / leakable secret | Add a FIC, revoke the secret |
| SP secrets nearing expiry | Entra | Imminent outage when they lapse | Federate or rotate |
| Pinned-version SecretUri | App config | Breaks rotation silently | Drop the version segment |
| Vaults with access policies (not RBAC) | Key Vault | Escalation-prone authorization | Migrate to RBAC data plane |
| Vaults with public network + no firewall | Key Vault | Data plane on the internet | Add private endpoint / firewall |
| Key Vault Administrator on a runtime identity | RBAC | Massive over-grant | Downgrade to Secrets User |

Diagnostic logs. Route Key Vault AuditEvent logs to Log Analytics so every data-plane access is queryable and retained:

az monitor diagnostic-settings create \
  --name kv-audit \
  --resource "/subscriptions/$SUB/resourceGroups/rg-platform-prod/providers/Microsoft.KeyVault/vaults/kv-plat-prod-001" \
  --logs '[{"category":"AuditEvent","enabled":true}]' \
  --workspace "/subscriptions/$SUB/resourceGroups/rg-obs/providers/Microsoft.OperationalInsights/workspaces/law-platform"

The Key Vault log categories and what each is the source of truth for:

| Category | Captures | Use it for |
|---|---|---|
| AuditEvent | Every data-plane op (get/set/delete) + caller identity | Who read which secret, and result |
| AzurePolicyEvaluationDetails | Policy evaluation on the vault | Compliance/governance audits |
| AllMetrics | Latency, availability, saturation | Health dashboards, capacity |

Alert on anomalies. A KQL alert for access from an unexpected identity or a spike in SecretGet denials catches both misconfiguration and intrusion:

AzureDiagnostics
| where ResourceType == "VAULTS" and OperationName == "SecretGet"
| where ResultType != "Success"
| summarize denials = count() by identity_claim_appid_g, bin(TimeGenerated, 15m)
| where denials > 10

The KQL you will reach for most — one query per question you ask during an incident or audit:

| Question | Operation filter | Key column | One-liner |
|---|---|---|---|
| Who is being denied secrets? | SecretGet, ResultType != Success | identity_claim_appid_g | summarize count() by appid |
| Who read this specific secret? | SecretGet, success | id_s (secret URI) | where id_s contains "orders-db-conn" |
| Sudden spike in reads (exfil)? | SecretGet | bin(TimeGenerated, 5m) | summarize count() by bin(…) |
| New/unexpected caller identity? | any | identity_claim_appid_g | distinct appid vs an allow-list |
| Secret deletions (destructive)? | SecretDelete | CallerIPAddress | where OperationName == "SecretDelete" |
| Access from outside expected IPs? | any | CallerIPAddress | where CallerIPAddress !in (…) |

Architecture at a glance

The diagram traces the credential path exactly as a workload travels it, left to right, with each numbered badge marking the precise hop where a passwordless flow fails. Read it as the secret-zero journey: a workload (a GitHub runner, an AKS pod, or an App Service) starts with no stored secret. Off-Azure, the external OIDC issuer mints a short-lived token whose sub claim names the workload; on-Azure, IMDS plays the same role. That token is presented to Entra ID, where a federated identity credential (or the managed identity itself) is matched on issuer/subject/audience and exchanged for a normal Entra access token. Only then does the workload reach the Key Vault data plane, where an RBAC data role (Key Vault Secrets User) gates whether it can read the secret value — which finally resolves the versionless reference the app consumes. The private endpoint on the right keeps that last hop off the internet.

Notice the badges cluster where trust is actually established and where it most often breaks: badge 1 on the issuer/subject (the AADSTS70021 subject-drift trap), badge 2 on the Entra FIC (issuer mismatch and the 20-FIC ceiling), badge 3 on the data-plane role grant (the Forbidden that means “no Secrets User,” not “wrong subject”), badge 4 on the versionless reference (a pinned version that silently never rotates), and badge 5 on the network path (a private endpoint whose DNS resolves to a public IP). The legend narrates each as symptom · confirm · fix — that is the whole diagnostic method: localise the failure to one hop, read the confirm command, apply the fix.

Figure: Azure secret-zero credential path, from a workload with no stored secret through the OIDC issuer or IMDS, the Entra ID federated credential match, and the Key Vault data-plane role to the versionless secret reference, with numbered badges on the five failure hops.

Real-world scenario

Meridian Retail runs a forty-service microservice platform on Azure: AKS for the runtime, App Service for a handful of legacy APIs, and GitHub Actions for every deployment. The platform team is six engineers; the estate spans three subscriptions in australiaeast. Their mandate from a post-incident review was blunt: after a contractor’s leaked personal access token was found to still hold deploy rights two months after offboarding, no long-lived deploy secret may exist anywhere in the estate within one quarter.

They started where the risk was highest — CI/CD. Every one of the forty repos carried the same AZURE_CREDENTIALS JSON secret for a single shared service principal. They federated each repo’s prod environment to one shared gh-deploy app registration, one FIC per repo. Within two weeks they hit the wall: the 21st az ad app federated-credential create failed with The number of federated identity credentials on the application has reached the maximum allowed value of 20. The instinct was to mint more app registrations — but that scatters role assignments and audit identity across dozens of principals, exactly the sprawl they were trying to kill.

The fix was to stop modelling identity per repo. They created one user-assigned managed identity per deployment tier (id-deploy-prod, id-deploy-nonprod) and adopted GitHub’s repository_owner claim instead of pinning each repo. Crucially, a plain sub match cannot express “any repo in this org,” so they switched to a flexible federated credential on an app registration, using claimsMatchingExpression against the repository_owner claim, gated on the prod environment:

az ad app federated-credential create \
  --id "$APP_ID" \
  --parameters '{
    "name": "gh-org-prod",
    "issuer": "https://token.actions.githubusercontent.com",
    "audiences": ["api://AzureADTokenExchange"],
    "claimsMatchingExpression": {
      "value": "claims['"'"'repository_owner'"'"'] eq '"'"'meridian'"'"' and claims['"'"'environment'"'"'] eq '"'"'prod'"'"'",
      "languageVersion": 1
    }
  }'

One credential now covered every repo the org owned, gated on prod so approvals still applied. Forty FICs collapsed to two, role assignments lived on two identities, and sign-in logs attributed every deploy to one auditable principal.

The AKS side had its own trap. Three teams had copied a working SecretProviderClass but their pods kept failing with secrets that mounted empty. The platform on-call traced it to two distinct causes via the failure table: two teams had omitted the azure.workload.identity/use: "true" pod label (so the webhook never injected the token — kubectl exec … env | grep AZURE_ came back empty), and one team’s FIC subject read system:serviceaccount:orders:orders-sa while the deployment used serviceAccountName: sa-orders — a one-token mismatch that produced AADSTS70021, not a Key Vault error, which is why they had spent a day staring at vault RBAC.

By quarter end: every pipeline federated, the AZURE_CREDENTIALS secret deleted from all forty repos, AKS on workload identity with CSI rotation polling every two minutes, App Service on versionless references, and a nightly Microsoft Graph check that fails the build if any app registration still carries a password credential. The contractor-token class of incident became impossible: there was no longer a stored credential to leak. The lesson on the wall: “Federation subjects map to a trust boundary, not to a repository. Model the boundary first and the credential count takes care of itself.”

The migration as a timeline, because the order of moves is the lesson:

| Week | State | Action taken | Effect | What it should have been |
|---|---|---|---|---|
| 1 | 40 repos, shared AZURE_CREDENTIALS | Federate each repo’s prod env to one app | First repos go secretless | Sound start |
| 2 | 20 FICs created | 21st federated-credential create fails | Hit the 20-FIC ceiling | Anticipate the ceiling up front |
| 2 | Ceiling hit | Plan to mint more app registrations | Identity/audit sprawl looming | Don’t; model the boundary |
| 3 | Re-modelled | UAMI per tier + flexible FIC on repository_owner | 40 FICs → 2; one principal per tier | The correct design |
| 4 | AKS migration | Copy SecretProviderClass, pods mount empty | Two causes: missing label + subject typo | Use the failure table first |
| 4 | Diagnosed | Add pod label; fix FIC subject to match SA | Pods get tokens; secrets mount | |
| 13 | Secret-free | Delete AZURE_CREDENTIALS; nightly Graph gate | Leak-class incident impossible | The destination |

Advantages and disadvantages

The passwordless model removes the credential you most fear leaking, but it relocates the complexity into trust configuration — which has its own sharp edges. Weigh it honestly:

| Advantages (why this model wins) | Disadvantages (why it bites) |
|---|---|
| No stored credential to leak: the highest-risk secret simply does not exist | Trust config (issuer/subject/audience) is exact-match and unforgiving: one typo → cryptic AADSTS70021 |
| Nothing to rotate: rotation becomes a vault-side event, not a coordinated deploy | The 20-FIC ceiling forces you to model trust boundaries, not just wire up repos |
| Per-workload, per-environment attribution in sign-in logs | Failures are vague by design: “no matching FIC” vs “access denied” confuses teams for hours |
| RBAC data plane gives least privilege down to a single secret | Two authorization planes (control vs data), so it is easy to grant the wrong one |
| Managed identity needs zero app config inside Azure (DefaultAzureCredential) | Off-Azure (CI, on-prem) you must federate; IMDS isn’t there to lean on |
| Private endpoint + RBAC keeps the data plane off the internet | Disabling public access without a private path locks out your own pipelines |
| Purge protection defeats a ransomware-style destroy | Purge protection is irreversible and blocks redeploys that recreate a vault name |

The model is right for any estate that runs workloads needing credentials — which is all of them — and especially for CI/CD and AKS where the long-lived deploy secret is the crown-jewel risk. It bites hardest on teams that model identity per repository (the FIC ceiling), teams new to the control/data-plane split (wrong-role grants), and anyone who flips network isolation before landing a private path. Every disadvantage is manageable — but only if you know it exists, which is the point of this article.

Hands-on lab

Stand up a vault with RBAC, attach a user-assigned identity, store a secret, grant least privilege, and read it back as that identity — all free-tier-friendly. Then reproduce the classic Forbidden failure and fix it. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-kv-lab
LOC=australiaeast
KV=kv-lab-$RANDOM        # globally-unique vault name
UAMI=id-kv-lab
az group create -n $RG -l $LOC -o table
SUB=$(az account show --query id -o tsv)

Step 2 — Create a vault with RBAC authorization (no access policies).

az keyvault create -n $KV -g $RG -l $LOC \
  --enable-rbac-authorization true \
  --sku standard -o table

Expected: a vault row; properties.enableRbacAuthorization = true.

Step 3 — Create a user-assigned identity and capture its principal ID.

az identity create -n $UAMI -g $RG -l $LOC -o table
PID=$(az identity show -n $UAMI -g $RG --query principalId -o tsv)
CID=$(az identity show -n $UAMI -g $RG --query clientId -o tsv)
echo "principalId=$PID clientId=$CID"

Step 4 — Seed a secret (as yourself — you need Secrets Officer). First grant yourself, then write:

ME=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee-object-id $ME --assignee-principal-type User \
  --role "Key Vault Secrets Officer" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/$KV"
# wait a few seconds for RBAC to propagate, then:
az keyvault secret set --vault-name $KV --name demo-conn --value "Server=db;Pwd=p@ss" -o table
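"Wait a few seconds" is hard to script, so a bounded retry is one way to absorb RBAC propagation delay. The helper below is a sketch, not part of the az CLI; `retry` and its argument order are assumptions made for illustration.

```shell
# Retry a command until it succeeds or attempts run out.
# retry is an illustrative helper: retry <attempts> <delay-seconds> <command...>
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    # Sleep only when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
  done
  return 1
}

# Example, wrapping the command above:
#   retry 6 10 az keyvault secret set --vault-name $KV --name demo-conn --value "Server=db;Pwd=p@ss" -o table
```

Keep the attempt count small: a role assignment that has not propagated after a minute or so usually indicates a wrong scope or principal rather than slow propagation.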

Step 5 — Reproduce the Forbidden failure. The UAMI has no data role yet. Simulate its access check:

# This lists role assignments for the UAMI on the vault — expect EMPTY (the bug)
az role assignment list --assignee $PID \
  --scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/$KV" -o table

Empty output is the root cause: a workload carrying this UAMI would get Forbidden on SecretGet, which surfaces as an empty Key Vault reference and a crash loop — not an obvious “denied” in the app.

Step 6 — Grant least privilege and confirm.

az role assignment create --assignee-object-id $PID --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets User" \
  --scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/$KV"

az role assignment list --assignee $PID \
  --scope "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.KeyVault/vaults/$KV" \
  --query "[].roleDefinitionName" -o tsv
# Expected: Key Vault Secrets User

Any workload (App Service, AKS pod) carrying this UAMI can now read demo-conn via a versionless reference — with zero stored secret.

Validation checklist. You created an RBAC vault, attached a reusable identity, hit the exact Forbidden/empty-reference failure from a missing data role, and fixed it with least privilege. No secret was stored to authenticate. The steps mapped to what each proves:

| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 2 | RBAC vault, no access policies | The escalation-safe model is one flag | Every new production vault |
| 4 | Grant yourself Secrets Officer | Control and data planes are separate | Seeding secrets from CI |
| 5 | UAMI has no role → empty result | The “empty KV reference” crash has a cause | The 02:00 crash-loop |
| 6 | Grant Secrets User, confirm | Least privilege is the fix, not Administrator | Hardening every workload identity |

Cleanup (avoid lingering charges and a soft-deleted name).

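# Delete the vault first and wait for it, so the purge below has a soft-deleted vault to act on.
az keyvault delete --name $KV --resource-group $RG
# The vault soft-deletes; purge if you want the name back immediately (no purge protection here):
az keyvault purge --name $KV --no-wait 2>/dev/null || true
az group delete -n $RG --yes --no-wait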

Cost note. A Standard vault has no hourly charge — you pay per 10,000 operations (fractions of a rupee for this lab). The UAMI is free. Deleting the resource group stops everything; the vault soft-deletes for 90 days unless purged.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you read mid-incident, then the entries that bite hardest in full.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | azure/login fails AADSTS70021 no matching FIC | Workflow environment/ref ≠ FIC subject | Compare workflow sub to az ad app federated-credential list --id <appId> | Correct the subject string to match exactly |
| 2 | AADSTS700211 no matching issuer | FIC issuer URL wrong / trailing slash | az ad app federated-credential list vs token iss | Make the issuer match the token’s iss exactly (GitHub’s has no trailing slash) |
| 3 | AADSTS7000215 invalid client secret | Workflow still sends a secret, not OIDC | Workflow uses creds: JSON; no id-token: write | Add permissions: id-token: write; remove AZURE_CREDENTIALS |
| 4 | 21st FIC create fails “maximum value of 20” | 20-FIC ceiling on the app/UAMI | az ad app federated-credential list --id <appId> --query "length(@)" | Env-scope subjects; flexible FIC; one identity per tier |
| 5 | App boots, KV-backed setting is empty, crash loop | Identity lacks Key Vault Secrets User | Env variables blade red error; az role assignment list --assignee <pid> empty | Grant Secrets User at vault scope |
| 6 | Forbidden on az keyvault secret show | No data role, or vault on access policies | az keyvault show --query properties.enableRbacAuthorization | Grant Secrets User (RBAC) or add access policy |
| 7 | KV reference resolves to public IP / times out | Vault firewall blocks, or private DNS wrong | Vault → Networking “selected networks”; nslookup <vault>.vault.azure.net | Add private endpoint + privatelink.vaultcore.azure.net zone; allow trusted services |
| 8 | AKS pod has no AZURE_* env vars | Missing pod label azure.workload.identity/use | kubectl exec … env output has no AZURE_ variables | Add the label; restart the deployment |
| 9 | AKS pod gets token but AADSTS70021 | FIC subject ≠ system:serviceaccount:ns:name | Compare FIC subject to serviceAccountName | Fix subject to match the SA exactly |
| 10 | CSI mount stuck ContainerCreating | SecretProviderClass wrong vault/secret/clientID | kubectl describe pod events; CSI driver logs | Correct keyvaultName/objectName/clientID |
| 11 | Secret rotated but app still uses the old value | Pinned-version SecretUri, or read-once-at-startup | Grep config for a version segment in the URI | Drop the version; restart or watch the file |
| 12 | Can’t recreate a vault: name “already exists” | Soft-deleted vault holds the name | az keyvault list-deleted --query "[].name" | Recover it, or purge (if not purge-protected) |
| 13 | az keyvault purge fails Conflict | Purge protection blocks hard-delete | az keyvault show --query properties.enablePurgeProtection | Wait out retention (by design, not a bug) |
| 14 | Deploy works on main but not on a tag | FIC subject pins a branch, not the tag | Token sub is ref:refs/tags/… | Add a tag-subject FIC or use a broader claim |

The expanded form, with the full reasoning for the entries that waste the most time:

1. azure/login fails with AADSTS70021 “No matching federated identity record found.” Root cause: The OIDC token’s sub claim does not match any FIC subject. Most often the workflow lacks the environment: key (so the sub is ref:… not environment:prod), or a branch/tag/environment was renamed. Confirm: Print the FIC subjects with az ad app federated-credential list --id "$APP_ID" --query "[].subject" and compare to how the workflow is actually triggered. Add a debug step to dump the token claims if unsure. Fix: Make the subject string match the token exactly, including environment:prod when the job sets environment: prod. Subjects are compared as exact, case-sensitive strings.
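To dump the claims mentioned above, the payload of the OIDC token can be decoded locally. A minimal sketch, assuming Bash with coreutils `base64`; `jwt_payload` is a hypothetical helper and it does not verify the signature, so treat it as a debugging aid only.

```shell
# Print the decoded payload (claims) of a JWT without verifying its signature.
# jwt_payload is an illustrative name; use it for debugging only.
jwt_payload() {
  local token=$1 payload
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it so base64 -d accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```

In an AKS pod, for instance, `jwt_payload "$(cat "$AZURE_FEDERATED_TOKEN_FILE")"` prints the iss, sub, and aud values to compare against the federated credential. Avoid logging the raw token itself, since it is a usable credential until it expires.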

4. The 21st az ad app federated-credential create fails: “maximum allowed value of 20.” Root cause: The 20-FIC ceiling per app/UAMI, reached because identity was modelled per repo/branch. Confirm: az ad app federated-credential list --id "$APP_ID" | jq length returns 20. Fix: Stop pinning each repo. Use environment-scoped subjects, or a flexible federated credential matching repository_owner on an app registration, or one identity per trust boundary (deployment tier) rather than per repo. Minting more app registrations scatters audit identity — avoid it.
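Hitting the ceiling mid-rollout is avoidable if the remaining headroom is visible beforehand. The sketch below assumes the 20-per-identity limit described above and a logged-in az session; `fic_headroom` and `FIC_LIMIT` are hypothetical names.

```shell
# Report how many federated credentials an app registration can still take.
# FIC_LIMIT reflects the ceiling described above; fic_headroom is illustrative.
FIC_LIMIT=20
fic_headroom() {
  local app_id=$1 count
  count=$(az ad app federated-credential list --id "$app_id" \
    --query 'length(@)' -o tsv) || return 2
  echo $(( FIC_LIMIT - count ))
}

# Example: fic_headroom "$APP_ID"
```

A shrinking number here is a prompt to revisit the trust model before the limit forces the decision.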

5. App boots but a Key Vault-backed app setting is empty and the app crash-loops. Root cause: The app’s identity has no Key Vault Secrets User role (or no identity is enabled, or the SecretUri is wrong), so the reference resolves to nothing. The app never sees “denied” — it sees an empty connection string. Confirm: Portal → Environment variables shows the reference with a red error; az webapp config appsettings list --query "[?contains(value,'KeyVault')]"; check az webapp identity show and az role assignment list --assignee <principalId>. Fix: Enable the identity; grant Key Vault Secrets User; set keyVaultReferenceIdentity if multiple UAMIs are attached; verify the secret exists/enabled and the URI (drop any pinned version).

7. The Key Vault reference resolves to a public IP or times out behind a private endpoint. Root cause: The vault is private but DNS resolves the public name, or the vault firewall blocks the caller. Confirm: nslookup kv-plat-prod-001.vault.azure.net returns a public IP instead of the private endpoint IP; the vault’s Networking blade shows “selected networks” without your path. Fix: Link the privatelink.vaultcore.azure.net private DNS zone to the VNet (group id vault); allow trusted Azure services on the firewall for App Service KV references; ensure the app’s outbound routes through the VNet.

8 & 9. AKS pod can’t authenticate. Two distinct failures that look identical from the app: 8 — no AZURE_* env vars at all: the pod is missing azure.workload.identity/use: "true", so the webhook never injected the token. Confirm with kubectl exec … env | grep AZURE_ (empty). Fix: add the label, restart. 9 — env vars present but AADSTS70021: the FIC subject does not match the pod’s service account. Confirm by comparing the FIC subject to the deployment’s serviceAccountName. Fix: align the subject to system:serviceaccount:<ns>:<name> exactly.

11. A rotated secret is ignored; the app keeps using the old value. Root cause: A pinned-version SecretUri (which never refreshes) or an app that reads the secret once at startup and caches it forever. Confirm: Grep the config/Bicep for a version segment after /secrets/<name>/; check whether the app re-reads on each use. Fix: Use a versionless SecretUri; for CSI mounts read the file per request; subscribe to SecretNewVersionCreated for an immediate signal, or restart on rotation.

12 & 13. Vault name conflicts and purge. 12 — “name already exists” on create: a soft-deleted vault still holds the name. az keyvault list-deleted to see it; recover with az keyvault recover, or az keyvault purge if it is not purge-protected. 13 — purge fails Conflict: purge protection is on and the retention window has not elapsed. This is by design — there is no override. Plan vault names so you do not need to recreate them.

Best practices

The settings worth standardising across every vault and identity, with the value you want:

| Standard | Setting / control | Target value | Why |
|---|---|---|---|
| RBAC data plane | enableRbacAuthorization | true | Escalation-safe, scoped, PIM-capable |
| Purge protection | enablePurgeProtection | true | Defeats destructive delete |
| Soft-delete retention | softDeleteRetentionInDays | 90 | Maximum recovery window |
| Network | publicNetworkAccess | Disabled (+ PE) | Data plane off the internet |
| Default network action | networkAcls.defaultAction | Deny (+ trusted bypass) | Deny-by-default with platform exceptions |
| Runtime role | data role on workload identity | Key Vault Secrets User | Least privilege |
| Reference form | SecretUri | versionless (…/) | Zero-downtime rotation |
| KV ref identity | keyVaultReferenceIdentity | the chosen UAMI | Disambiguates multi-UAMI resolution |

Security notes

The security controls and what each defends against:

| Control | Mechanism | Defends against | Also prevents |
|---|---|---|---|
| Federation / managed identity | FIC, IMDS | Leaked long-lived secret | Rotation outages |
| RBAC data plane | Secrets User/Officer roles | Privilege escalation via control plane | Over-broad data access |
| Secret-scoped assignment | Role at /secrets/<name> | One identity reading every secret | Lateral access within a vault |
| Private endpoint + firewall | publicNetworkAccess: Disabled | Data plane exposed to the internet | Exfil from outside the VNet |
| Purge protection + soft-delete | Vault data-protection flags | Malicious/accidental destroy | Irrecoverable loss |
| Environment-scoped subjects | FIC subject + GH protection rules | Untrusted repos/forks deploying | Unapproved production deploys |
| AuditEvent + alerts | Diagnostic logs → KQL | Silent abuse | Undetected misconfiguration |

Cost & sizing

The bill for this whole pattern is dominated by operations, not capacity — which is why it is one of the cheapest security wins available.

A rough monthly picture and what drives each line:

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
|---|---|---|---|---|
| Standard vault operations | Per 10k secret ops | ~₹20–200 (typical app) | The secret store itself | Per-request reads blow this up |
| Premium vault (HSM keys) | Higher per-op + per-key | ~₹400+ per HSM key | FIPS 140-2 L2 key protection | Only if you need HSM-backed keys |
| Managed identities / FICs | | ₹0 | Passwordless auth | None |
| Private endpoint | Hourly + per-GB | ~₹400–900 each | Data plane off the internet | One per vault per VNet |
| Log Analytics (AuditEvent) | Per-GB ingestion | ~₹100–1,000 | Queryable audit trail | High-traffic vaults ingest more |
| Caching layer (your design) | | ₹0 | Cuts operation count ~10–100× | Stale-vs-throttle TTL tuning |

The cache-vs-cost trade-off as a table — pick a TTL that fits the secret’s rotation cadence:

| Read pattern | Vault ops | Rotation latency | Cost | Use when |
|---|---|---|---|---|
| Per request, no cache | Very high | Instant | Highest | Almost never |
| Cache with short TTL (1–5 min) | Moderate | ≤ TTL | Low | Frequently-rotated secrets |
| KV reference / CSI poll | Low | Refresh/poll interval | Low | App Service / AKS default |
| Cache + Event Grid invalidation | Lowest | Seconds (event-driven) | Lowest | Rotation-sensitive, high-traffic |
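The gap between the first two rows is simple arithmetic. The sketch below uses entirely hypothetical traffic figures (5 requests per second, 10 instances, a 60-second TTL) and a 30-day month; the function names are illustrative.

```shell
# Compare monthly Key Vault read operations: per-request reads vs a TTL cache.
# All inputs are hypothetical; 2592000 is the number of seconds in a 30-day month.
SECONDS_PER_MONTH=2592000

# ops_per_request <requests-per-second>: every request reads the vault.
ops_per_request() { echo $(( $1 * SECONDS_PER_MONTH )); }

# ops_with_ttl <instances> <ttl-seconds>: each instance refreshes once per TTL.
ops_with_ttl() { echo $(( $1 * SECONDS_PER_MONTH / $2 )); }

ops_per_request 5     # prints 12960000
ops_with_ttl 10 60    # prints 432000
```

With these inputs the cache cuts vault reads by a factor of 30, at the price of up to 60 seconds of staleness after a rotation.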

Interview & exam questions

1. What is the secret-zero problem and how do managed identity and federation solve it? Secret-zero is the bootstrap credential: to read a secret you must authenticate, and if that authentication is itself a stored secret you have only moved the problem upstream. Managed identity (inside Azure) and workload identity federation (outside) solve it by having the platform issue a short-lived token that Entra ID trusts, so no durable credential is stored anywhere.

2. Why prefer Azure RBAC over access policies for a Key Vault data plane? Access policies are a flat per-vault list, and anyone with vaults/write (Contributor) can self-grant data access — a privilege-escalation path. RBAC separates control-plane from data-plane permissions, supports scoping to a single secret, integrates with PIM, and is the default for new vaults. A runtime workload should get Key Vault Secrets User and nothing more.

3. A GitHub Actions deploy fails with AADSTS70021. What’s wrong and how do you confirm? The OIDC token’s sub claim does not match any federated identity credential’s subject. Usually the workflow sets environment: prod but the FIC subject pins a branch (or vice versa), or something was renamed. Confirm by listing FIC subjects (az ad app federated-credential list) and comparing to how the job is triggered. Fix the subject to match exactly; subjects are compared as exact, case-sensitive strings.

4. What are the three fields of a federated identity credential, and what does each match? Issuer (the OIDC issuer URL, matched against the token’s iss), subject (the exact sub claim identifying the workload — a repo+environment or a Kubernetes service account), and audience (for Entra, always api://AzureADTokenExchange). All three must match the incoming token exactly or Entra returns “no matching federated identity,” not “access denied.”

5. When do you use a system-assigned versus a user-assigned managed identity? System-assigned when the identity should live and die with one resource (a standalone service). User-assigned (UAMI) when many workloads share access or the identity must survive blue/green replacement — you grant Key Vault RBAC to the UAMI once and every workload that carries it inherits access. AKS workload identity and external federation require a UAMI (or app), since system-assigned identities cannot hold FICs.

6. An app boots but a Key Vault-backed setting is empty and it crash-loops, with no exception. What do you check? A Key Vault reference resolved to nothing because the identity isn’t enabled, lacks Key Vault Secrets User, the vault firewall blocks it, or the SecretUri is wrong. Check the Environment variables blade for a red reference error, az webapp identity show, and az role assignment list --assignee <principalId>. With multiple UAMIs attached, also set keyVaultReferenceIdentity.

7. Why might an AKS pod fail to get a token even though workload identity is enabled? Two distinct causes: the pod is missing the azure.workload.identity/use: "true" label, so the webhook never injects the env vars/token (kubectl exec … env | grep AZURE_ is empty); or the FIC subject doesn’t match the pod’s service account (env vars present but AADSTS70021). Fix the label or align the subject to system:serviceaccount:<ns>:<name>.

8. You hit “maximum allowed value of 20” creating federated credentials. What now? You hit the 20-FIC ceiling because identity was modelled per repo/branch. Don’t mint more app registrations (that scatters audit identity). Consolidate with environment-scoped subjects, or a flexible federated credential matching repository_owner on an app registration, or one identity per deployment tier — model the trust boundary, not the repository.

9. How do you rotate a secret with zero downtime? Reference it versionlessly everywhere (a SecretUri ending in /), store it in exactly one vault, and let consumers follow the current version: App Service KV references re-resolve on restart/refresh, the CSI driver polls at rotation-poll-interval, and SecretNewVersionCreated via Event Grid can trigger immediate invalidation. Never pin a version or hardcode the value — that reintroduces a coordinated-deploy outage.

10. What does purge protection do, and what’s the catch? It blocks even a privileged actor from hard-deleting a vault or secret before the soft-delete retention window elapses, defeating a ransomware-style destroy. The catch: it is irreversible once enabled, and it blocks redeploys that try to recreate the same vault name within retention — so name vaults deliberately and don’t enable it in throwaway environments you recreate often.

11. How do you keep a Key Vault’s data plane off the internet without breaking App Service references? Set publicNetworkAccess: Disabled, add a private endpoint with the privatelink.vaultcore.azure.net DNS zone linked to the VNet, and give the app a route to it through VNet integration so its Key Vault references resolve over the private path. The firewall’s trusted Azure services bypass covers only a fixed list of platform scenarios, so do not rely on it alone for App Service KV references. Land that private path before disabling public access, or you lock out your own pipelines.

12. What’s the difference between a control-plane 403 and a data-plane Forbidden on Key Vault? A control-plane 403 (e.g. on vaults/write) means the caller lacks an RBAC management role like Key Vault Contributor. A data-plane Forbidden (e.g. on secrets/getSecret/action) means it lacks a data role or access policy, such as Key Vault Secrets User. The two planes are authorized separately, and granting a role on the wrong one is the classic mistake.
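
The exact-match rule behind questions 3, 4, and 7 can be sketched locally. This is a minimal diagnostic sketch, not Entra's implementation: the `FederatedCredential` class and the `mismatches`, `github_subject`, and `aks_subject` helpers are hypothetical names introduced here, and the claims dictionary is assumed to be an already-decoded OIDC token payload. The subject formats are the ones described above.

```python
from dataclasses import dataclass

# Audience Entra ID expects for federated tokens (public cloud, per the text above).
ENTRA_AUDIENCE = "api://AzureADTokenExchange"


@dataclass(frozen=True)
class FederatedCredential:
    """Hypothetical local model of the three FIC fields."""
    issuer: str
    subject: str
    audiences: tuple = (ENTRA_AUDIENCE,)


def github_subject(org: str, repo: str, environment: str) -> str:
    """Subject for a GitHub Actions job that sets `environment:`."""
    return f"repo:{org}/{repo}:environment:{environment}"


def aks_subject(namespace: str, service_account: str) -> str:
    """Subject for a Kubernetes service account token."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def mismatches(claims: dict, fic: FederatedCredential) -> list:
    """Return the FIC fields that fail a whole-string, case-sensitive match."""
    failed = []
    if claims.get("iss") != fic.issuer:
        failed.append("issuer")
    if claims.get("sub") != fic.subject:
        failed.append("subject")
    # A JWT `aud` claim may be a single string or a list of strings.
    aud = claims.get("aud")
    token_audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if not any(a in fic.audiences for a in token_audiences):
        failed.append("audience")
    return failed


if __name__ == "__main__":
    # FIC pinned to a branch, job triggered with an environment: subject mismatch.
    fic = FederatedCredential(
        issuer="https://token.actions.githubusercontent.com",
        subject="repo:contoso/shop:ref:refs/heads/main",  # hypothetical repo
    )
    claims = {
        "iss": "https://token.actions.githubusercontent.com",
        "sub": github_subject("contoso", "shop", "prod"),
        "aud": ENTRA_AUDIENCE,
    }
    print(mismatches(claims, fic))
```

Returning the list of failing field names, rather than a single boolean, mirrors how the incident is actually debugged: an AADSTS70021 does not say which field differed, so the comparison has to be made field by field against the output of the federated-credential list command.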

These map to AZ-500 (Security Engineer: manage Key Vault, secrets, keys, certificates; configure managed identities; workload identity) and AZ-204 (Developer: secure app configuration data, implement managed identities and Key Vault references). The federation and AKS angles touch AZ-400 and the Kubernetes specialty. A compact cert-mapping for revision:

Question theme | Primary cert | Exam objective area
Key Vault RBAC vs access policies | AZ-500 | Secure data and applications; Key Vault
Managed identity (system vs user) | AZ-500 / AZ-204 | Implement and manage identities for resources
FIC fields, GitHub OIDC | AZ-400 / AZ-500 | Secure pipelines; workload identity federation
AKS workload identity | AKS specialty / AZ-500 | Secure Kubernetes workloads
KV references, rotation | AZ-204 | Secure app configuration data
Networking, private endpoint | AZ-500 / AZ-700 | Secure the data plane; private connectivity

Quick check

  1. To read a secret from Key Vault a workload must authenticate to Entra ID. What is the name of the problem where that authentication itself needs a stored credential, and what mechanism removes it?
  2. A GitHub Actions job sets environment: prod but azure/login fails AADSTS70021. Where is the mismatch, and what one command shows you the configured value to compare against?
  3. True or false: granting a runtime web app Key Vault Contributor is the correct least-privilege way to let it read a secret.
  4. Your AKS pod has none of the AZURE_* environment variables. What single piece of Kubernetes YAML is almost certainly missing?
  5. You rotated a secret in the vault but the App Service still uses the old value. Name the most likely cause in the reference URI.

Answers

  1. The secret-zero problem. It is removed by platform-issued identity — a managed identity (inside Azure, via IMDS) or workload identity federation (outside, via an OIDC token Entra trusts) — so no durable credential is stored.
  2. The FIC subject does not match the token’s sub. The job’s sub is repo:ORG/REPO:environment:prod, so the FIC subject must be exactly that. Confirm the configured value with az ad app federated-credential list --id <appId> --query "[].subject".
  3. False. Key Vault Contributor is a control-plane role: it manages the vault but grants no read access to secret values under RBAC, and on a vault using access policies it lets the identity self-grant data access. The least-privilege data-plane role to read secret values is Key Vault Secrets User.
  4. The pod (template) label azure.workload.identity/use: "true". Without it the mutating webhook does not inject the projected token or the AZURE_* env vars, so the SDK has nothing to exchange.
  5. A pinned secret version in the SecretUri (a version segment after …/secrets/<name>/). A pinned version never refreshes; use a versionless URI ending in / so the current version is resolved.
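
The pinned-versus-versionless distinction in answer 5 is easy to check mechanically before a rotation. The following is a minimal sketch, assuming the usual URI shape of `https://<vault>.vault.azure.net/secrets/<name>[/<version>]`; the function name `secret_version` and the vault name `kv-example` are hypothetical.

```python
from urllib.parse import urlparse


def secret_version(secret_uri: str):
    """Return the pinned version segment of a secret URI, or None if versionless."""
    # Drop empty segments so a trailing "/" does not count as a version.
    parts = [p for p in urlparse(secret_uri).path.split("/") if p]
    # Expected path shape: secrets/<name>[/<version>]
    if len(parts) < 2 or parts[0] != "secrets":
        raise ValueError(f"not a Key Vault secret URI: {secret_uri!r}")
    return parts[2] if len(parts) > 2 else None


if __name__ == "__main__":
    uris = [
        "https://kv-example.vault.azure.net/secrets/db-password/",
        "https://kv-example.vault.azure.net/secrets/db-password/0123456789abcdef0123456789abcdef",
    ]
    for uri in uris:
        version = secret_version(uri)
        status = f"PINNED to {version}" if version else "versionless"
        print(f"{status}: {uri}")
```

Running a check like this over every SecretUri in app settings before rotating would flag the references that will keep serving the old value.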

Glossary

Next steps

You can now stand up a secret-free path end to end and diagnose where a passwordless flow breaks. Build outward:
