AKS Cluster Identity: Managed Identity vs Service Principal and Why It Matters for Day-2

Spin up an Azure Kubernetes Service (AKS) cluster and a question hides in plain sight: when your cluster needs to do something in Azure — pull an image, attach a disk to a pod, program a public IP on a load balancer — who is it acting as? Kubernetes has no Azure account. Something must hold an Azure identity on the cluster’s behalf, prove it to Microsoft Entra ID (formerly Azure AD), and carry the RBAC permissions those operations need. That something is the cluster identity, and you have two ways to provide it: an old service principal or a modern managed identity.

This sounds like a footnote until 3 a.m. one year later, when every new node fails to pull images, your LoadBalancer services stop getting public IPs, and new disks fail to attach with authorization errors, all because a service-principal client secret you set at cluster-creation time silently expired. No deployment changed. The cluster just stopped being able to talk to Azure. This is one of the most common day-2 AKS identity outages, and it does not exist on managed-identity clusters because there is no secret to expire. That contrast, a convenient-but-fragile credential versus a zero-credential platform identity, is the whole point of this article.

By the end you will hold a clear mental model of every identity an AKS cluster uses (there are three, and people conflate them constantly), know why managed identity is the default and recommended choice, and be able to convert a service-principal cluster across. You’ll also meet workload identity — the modern, secretless way your pods (not the cluster) get Azure access. This is a concepts article: mental models and decision tables first, then an architecture walkthrough and a short troubleshooting playbook.

What problem this solves

A Kubernetes cluster is not a passive box of containers. AKS continuously performs Azure control-plane operations for you: a PersistentVolumeClaim makes the cluster create and attach a managed disk; a Service of type LoadBalancer makes it allocate a public IP and program the load balancer; a pod whose image lives in Azure Container Registry (ACR) makes the kubelet pull it. Each is an authenticated Azure API call made as some identity with the right permissions. Get the identity wrong and these fail in ways that look like networking or storage bugs but are really authorization failures.

The historical way to provide that identity was a service principal (SP) — an Entra ID application identity with a client ID and a client secret (a password) you created, granted roles, and handed to AKS at creation. It worked, but carried a time bomb: the client secret has an expiry date (commonly 1–2 years). Nobody puts “rotate the AKS service-principal secret” on a calendar, so a year later it expires and the cluster loses its ability to authenticate to Azure. The symptom is confusing because your application code is fine — it’s the platform underneath that lost its credential.

Managed identity removes the credential entirely. The platform fetches and rotates the cluster’s tokens automatically; there is no secret you hold, store, or rotate, and none that can expire. This is why new AKS clusters use a managed identity by default, whether created from the portal, the CLI, or modern IaC, and why Microsoft recommends migrating SP clusters. Who hits the old pain: anyone who built a cluster a year or more ago with the classic --service-principal / --client-secret flags, anyone copying an old Terraform module, anyone who treats “it deployed fine” as proof it will keep working. The fix is almost never “redeploy”; it is “stop holding a secret the platform can hold for you.”

Learning objectives

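By the end of this article you can: tell apart the three identities an AKS cluster uses (cluster, kubelet, and workload); explain why managed identity is the default and recommended choice over a service principal; convert a service-principal cluster to managed identity; and describe how workload identity gives pods secretless access to Azure.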

Prerequisites & where this fits

You should know the AKS basics: a cluster has a control plane (managed by Microsoft) and a data plane of worker nodes. The deep-dive “AKS Architecture Explained: Managed Control Plane, Node Pools, and the Azure Integrations That Make It Tick” covers this; identity is the glue that makes those Azure integrations work. You should be comfortable running az in Cloud Shell and reading JSON output, and have a working idea of Azure RBAC: you grant a role (like Contributor or AcrPull) to a principal at a scope (a resource, resource group, or subscription). If RBAC scopes are fuzzy, the article “Azure Resource Hierarchy Explained: Subscriptions, Resource Groups and Resources” grounds the scope ladder this all hangs off.

This sits at the intersection of the Identity and Compute tracks, upstream of anything where your cluster touches another Azure service: pulling from Azure Container Registry, reading secrets from Azure Key Vault via the CSI driver, or emitting telemetry to Azure Monitor and Application Insights. You don’t need to have built a cluster before; you do need to accept that Kubernetes and Azure are two control planes that must trust each other, and identity is how.

One-paragraph orientation before the deep model: there is an identity for the cluster itself (control-plane components calling Azure), a separate identity for the kubelet on the nodes (pulling images from ACR), and an entirely different mechanism — workload identity — for your pods to reach Azure resources. Mixing these up is the root of most confusion, so we pin them down first.

Core concepts

Four mental models make every later decision obvious.

A cluster needs an Azure identity because it makes Azure API calls for you. AKS wires cloud-agnostic Kubernetes into Azure through the cloud controller manager (CCM). When a manifest asks for a LoadBalancer service or a PersistentVolumeClaim, the CCM translates that into Azure REST calls — allocate a public IP, configure load-balancer rules, create and attach a managed disk. Those calls run as the cluster identity, which must hold the right roles (typically Network Contributor on the node RG / subnet, plus disk rights) for them to succeed.

A service principal is an identity with a password you hold; a managed identity is one Azure holds the credential for. A service principal is an Entra ID application instance with a client ID (who it is) and a client secret or certificate (proof). You create, store, and rotate that secret before it expires. A managed identity is an Entra ID identity bound to an Azure resource where the platform creates, stores, and rotates the credential and issues short-lived tokens automatically. To RBAC both are just “a principal you grant roles to” — the only difference is who manages the secret, and that difference is the entire day-2 story.

System-assigned vs user-assigned is about lifecycle and reuse. A system-assigned managed identity is created with the cluster, deleted with it, and used by only that cluster. A user-assigned managed identity is a standalone resource that outlives the cluster and can be shared; you can pre-create it in IaC so roles exist before the cluster does, avoiding a chicken-and-egg ordering problem. AKS defaults to system-assigned for the cluster identity if you say nothing.

Three identities, three jobs — never conflate them. (1) The cluster (control-plane) identity authenticates the CCM’s Azure calls — load balancers, public IPs, disks, routes. (2) The kubelet identity is a separate user-assigned identity (auto-created on MI clusters) the kubelet uses to pull images from ACR — it needs AcrPull on the registry. (3) Workload identity is not a cluster identity at all: it’s how an individual pod gets its own Azure identity to read a Key Vault secret or write to Storage, with no secret, via a federated trust between the cluster’s OIDC issuer and a user-assigned managed identity. The cluster identity is the platform’s; workload identity is your application’s.

The three identities side by side

Before the deep sections, pin down each identity, what it authenticates, and what it needs:

| Identity | Whose calls it makes | What it’s for | Needs (RBAC) | Created how |
|---|---|---|---|---|
| Cluster / control-plane | Cloud controller manager | LBs, public IPs, disks, routes | Network/Contributor on node RG & subnet | System-assigned (default) or user-assigned |
| Kubelet identity | Kubelet on each node | Pull images from ACR | AcrPull on the registry | User-assigned MI, auto-created |
| Workload identity (per pod) | Your application pods | App access (Key Vault, Storage, SQL) | App’s role (e.g. KV Secrets User) | User-assigned MI + federated credential |

The trap to internalise: ImagePullBackOff from ACR is the kubelet identity; a LoadBalancer stuck pending is the cluster identity; a pod that can’t read a Key Vault secret is workload identity. Same word, three different principals.
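One way to keep the three principals apart during an incident is a small lookup from symptom to identity. The sketch below is a hypothetical helper: the function name and the symptom keywords are illustrative and not part of any Azure or Kubernetes tooling. It only encodes the mapping described above.

```shell
# Hypothetical triage helper: map an observed symptom to the identity to inspect.
# The keywords are illustrative; extend the patterns to match your own alerts.
identity_for_symptom() {
  case "$1" in
    *ImagePullBackOff*|*ErrImagePull*) echo "kubelet identity" ;;
    *pending*|*LoadBalancer*)          echo "cluster identity" ;;
    *KeyVault*|*"Key Vault"*)          echo "workload identity" ;;
    *)                                 echo "unknown" ;;
  esac
}

identity_for_symptom "ImagePullBackOff from acrshopprod.azurecr.io"
identity_for_symptom "LoadBalancer EXTERNAL-IP <pending>"
identity_for_symptom "403 reading Key Vault secret"
```

The patterns are checked top to bottom, so a message that mentions several symptoms resolves to the first match; treat the result as a starting point for investigation rather than a diagnosis.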

Service principal vs managed identity: the core comparison

This is the heart of the article. Both are Entra ID principals you grant Azure roles to; the difference is operational, and decisive. A service principal makes you manage a credential — create a client secret with an expiry, store it where AKS can read it, rotate it before it lapses. Forget (almost everyone does, because nothing reminds you) and the cluster loses its ability to authenticate to Azure. A managed identity has no secret you ever see: the platform issues short-lived tokens via the instance metadata endpoint and rotates the underlying credential automatically. Nothing to expire, leak, or rotate.

The side-by-side, dimension by dimension:

| Dimension | Service principal (SP) | Managed identity (MI) |
|---|---|---|
| What it is | Entra ID app: client ID + secret | Entra ID identity bound to a resource |
| Credential | Secret/cert you hold | Platform-managed; none you handle |
| Expiry / rotation | Expires (1–2 yr); you rotate | None to expire or rotate |
| Day-2 outage risk | High: expired secret breaks auth | Effectively none |
| Where the secret lives | Config / secret store (leak surface) | Nowhere you manage |
| Setup effort | Create SP, secret, roles, pass to AKS | Default; AKS wires it |
| Cross-tenant use | Possible | Bound to tenant resources |
| AKS support | Legacy; not recommended | Default and recommended |
| Cost | Free | Free |

Three reading notes that prevent the usual mistakes:

| Distinction | The trap | How to think about it |
|---|---|---|
| “Managed identity = no secret” | Thinking tokens don’t exist | Tokens do; the platform fetches/rotates them, and you never hold a long-lived one |
| SP secret vs SP certificate | Certs seem exempt | Certificates also expire: same trap, longer fuse |
| “It deployed, so it’s fine” | Creation = durability | An SP secret valid at create-time expires later with zero deploy changes |
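Since neither secrets nor certificates announce their own expiry, a scheduled check can do it for you on any cluster still using a service principal. The sketch below is one possible approach under stated assumptions: days_until is a hypothetical helper, the commented az ad app credential list query assumes the returned credential objects expose an endDateTime field, and the date parsing relies on GNU date -d as found in Cloud Shell.

```shell
# Hypothetical helper: whole days from now_epoch until expiry_epoch.
# A result of 0 or less means the credential has expired or expires within a day.
days_until() {
  expiry_epoch=$1
  now_epoch=$2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example with fixed epochs: a credential expiring 30 days after "now".
days_until 2592000 0

# Against a real service principal (assumptions: APP_ID is set, GNU date is available):
#   END=$(az ad app credential list --id "$APP_ID" --query "[0].endDateTime" -o tsv)
#   days_until "$(date -d "$END" +%s)" "$(date +%s)"
```

Wiring the result into an alert (for example, warn below 30 days) turns the invisible dependency into a visible one until the cluster is migrated.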

When does a service principal still legitimately appear? Rarely — some older automation and cross-tenant scenarios historically expected one. But for the cluster identity of a new AKS cluster, managed identity is correct in essentially every greenfield case, and Microsoft’s guidance is to migrate existing SP clusters.

| Scenario | SP? | MI? | Why |
|---|---|---|---|
| New cluster, control-plane identity | No | Yes | Default; no secret to expire |
| Kubelet identity for ACR pulls | No | Yes | Auto-created MI; grant AcrPull |
| Pods accessing Azure (KV, Storage) | No | Yes (workload id) | Secretless; replaces aad-pod-identity |
| Inherited SP cluster | Migrate | Yes | az aks update --enable-managed-identity |
| Niche legacy/cross-tenant automation | Maybe | Prefer MI | Only if a tool genuinely requires an SP |

Why managed identity is the default — stated plainly

If you remember one paragraph, remember this. A service principal puts the credential lifecycle on you — create the secret, store it, rotate it, eat the outage when it expires. A managed identity puts it on Azure — the platform owns the lifecycle end to end, so the failure mode doesn’t exist. There is no scenario where holding a long-lived secret yourself is more secure or reliable than letting the platform hold a short-lived one. That’s why AKS defaults to managed identity, why the portal no longer pushes the SP path, and why “we’re still on a service principal” is a finding in any AKS review.

System-assigned vs user-assigned managed identity

Once you’ve chosen managed identity (you have), there’s a second, smaller decision: system-assigned or user-assigned. Both are secretless; they differ in lifecycle and reuse. A system-assigned identity is born with the cluster and dies with it. It is the simplest path (AKS creates it, you do nothing) and fine for a single self-contained cluster, but you can’t grant it roles before the cluster exists or share it. A user-assigned identity is a standalone resource you create first; because it exists before the cluster you can pre-grant its roles in IaC (so the cluster boots with grants such as AcrPull and Network Contributor already in place) and reuse it across many clusters, at the cost of one more resource to manage.

| Aspect | System-assigned MI | User-assigned MI |
|---|---|---|
| Lifecycle | Created/deleted with the cluster | Independent; outlives the cluster |
| Reuse | One cluster only | Shareable across resources |
| Pre-grant roles | No (doesn’t exist yet) | Yes, before the cluster exists |
| IaC ordering | Role grant comes after | Clean: MI → roles → cluster |
| Best for | A single, simple cluster | Fleets, strict IaC, shared identity |
| Default for | Cluster identity (unspecified) | Kubelet identity (auto-created) |

In practice, system-assigned suits a standalone cluster; user-assigned is cleaner for fleets or strict IaC ordering. The kubelet identity is user-assigned regardless — AKS creates one (named like <clustername>-agentpool) for image pulls so it persists independently of control-plane operations.
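The clean ordering for a user-assigned cluster identity (identity, then roles, then cluster) can be sketched with the CLI. The helper below is hypothetical and only prints the commands in dependency order so they can be reviewed first; every resource name and the subnet ID are placeholders, and the choice of Network Contributor on a subnet is an assumption for a bring-your-own-VNet setup.

```shell
# Sketch: print the az commands for "identity first, roles second, cluster last".
# Nothing is executed here; all names passed in are hypothetical placeholders.
plan_user_assigned_cluster() {
  rg=$1; identity=$2; cluster=$3; subnet_id=$4
  # Step 1: the standalone user-assigned identity.
  echo "az identity create --resource-group $rg --name $identity"
  # Step 2: grant the role before any cluster exists (principalId is looked up at run time).
  echo "az role assignment create --assignee-object-id \$(az identity show -g $rg -n $identity --query principalId -o tsv) --assignee-principal-type ServicePrincipal --role 'Network Contributor' --scope $subnet_id"
  # Step 3: create the cluster with that identity as its control-plane identity.
  echo "az aks create --resource-group $rg --name $cluster --enable-managed-identity --assign-identity \$(az identity show -g $rg -n $identity --query id -o tsv) --vnet-subnet-id $subnet_id --generate-ssh-keys"
}

plan_user_assigned_cluster rg-shop-prod id-aks-shop aks-shop-prod \
  "/subscriptions/<sub-id>/resourceGroups/rg-net/providers/Microsoft.Network/virtualNetworks/vnet-shop/subnets/snet-aks"
```

Printing rather than executing keeps the sketch safe to run anywhere; in a pipeline the same three steps would typically live in Bicep or Terraform with explicit dependencies between them.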

The kubelet identity and ACR pulls

The most common identity task you’ll actually perform is making ACR pulls work. The kubelet — the per-node agent that starts containers — pulls images using the kubelet identity (a user-assigned MI AKS creates), and for a private ACR that identity needs the AcrPull role on the registry. The clean way to grant it is az aks update --attach-acr, which assigns AcrPull to the kubelet identity for you — no registry username/password or imagePullSecret in your YAML.

# Attach an ACR to the cluster: grants AcrPull to the kubelet identity. No secret in YAML.
az aks update \
  --name aks-shop-prod \
  --resource-group rg-shop-prod \
  --attach-acr acrshopprod

Under the hood this is just an RBAC grant; you could do it manually, but --attach-acr finds the right kubelet identity and scope for you. To verify the pull path end to end, AKS ships a built-in check:

# Validates the kubelet identity actually has pull access to the registry
az aks check-acr \
  --name aks-shop-prod \
  --resource-group rg-shop-prod \
  --acr acrshopprod.azurecr.io

If you prefer the explicit grant (or need it in a pipeline), assign AcrPull to the kubelet object ID directly:

# Find the kubelet identity's object (principal) ID
KUBELET_OBJ=$(az aks show -n aks-shop-prod -g rg-shop-prod \
  --query identityProfile.kubeletidentity.objectId -o tsv)

# Grant AcrPull at the registry scope
ACR_ID=$(az acr show -n acrshopprod -g rg-shop-prod --query id -o tsv)
az role assignment create \
  --assignee-object-id "$KUBELET_OBJ" \
  --assignee-principal-type ServicePrincipal \
  --role AcrPull \
  --scope "$ACR_ID"

The roles that matter most for cluster and kubelet identities, with scope and what fails without them:

| Role | Assigned to | Scope | Enables | Failure if missing |
|---|---|---|---|---|
| AcrPull | Kubelet identity | The ACR | Pull private images | ImagePullBackOff / 401 |
| Network Contributor | Cluster identity | Node RG / subnet | LB, public IPs, routes | LoadBalancer <pending> |
| Contributor (node RG) | Cluster identity | Managed node RG | Disks, NICs, scale-set ops | PVC attach / scale failures |
| Managed Identity Operator | Cluster identity | Kubelet/user identities | Assign identities to cluster | Identity assign error at create |
| Key Vault Secrets User | Workload identity | The Key Vault | Pod reads a secret | CSI/SDK 403 on secret |

--attach-acr is recommended because it scopes the grant precisely to the kubelet identity and the one registry, embeds no secret, and survives rotation — there’s no credential to rotate.

Workload identity: secretless access for your pods

So far we’ve covered how the cluster talks to Azure. Workload identity is how your pods do — a different mechanism, not to be confused with the cluster identity. The old pod-managed identity (aad-pod-identity) is deprecated; Microsoft Entra Workload ID is the modern, secretless replacement.

The mental model: AKS exposes an OIDC issuer (a public endpoint that signs tokens for the cluster’s service accounts). You create a user-assigned managed identity for the workload, then a federated identity credential saying “trust tokens from this OIDC issuer for this Kubernetes service account.” A pod using that service account gets a projected token, exchanges it with Entra ID for an Azure access token, and calls Azure as the managed identity — with no client secret anywhere. The trust is cryptographic and federated, not a stored password.

| Aspect | aad-pod-identity | Workload identity (Entra Workload ID) |
|---|---|---|
| Status | Deprecated | Current, recommended |
| Mechanism | NMI pod intercepts IMDS | OIDC federation + projected SA token |
| Secret | None, but brittle interception | None; clean token exchange |
| Reliability | Known issues at scale | Robust, Kubernetes-native |
| You create | AzureIdentity / binding CRDs | User-assigned MI + federated credential |

Enabling it is two flags — the OIDC issuer and the workload-identity webhook:

# Enable the OIDC issuer and workload identity on the cluster
az aks update \
  --name aks-shop-prod \
  --resource-group rg-shop-prod \
  --enable-oidc-issuer \
  --enable-workload-identity

The federated credential ties a specific service account to a specific managed identity:

# Get the cluster's OIDC issuer URL
ISSUER=$(az aks show -n aks-shop-prod -g rg-shop-prod \
  --query oidcIssuerProfile.issuerUrl -o tsv)

# Federate: trust tokens for service account 'sa-orders' in namespace 'orders'
az identity federated-credential create \
  --name fic-orders \
  --identity-name id-orders-workload \
  --resource-group rg-shop-prod \
  --issuer "$ISSUER" \
  --subject "system:serviceaccount:orders:sa-orders" \
  --audience api://AzureADTokenExchange

Workload identity is the pod-level analogue of the cluster’s managed identity: the same principle — let Azure hold the credential, you hold no secret — applied one layer up, to your application. If you’re stuffing a Key Vault client secret into a Kubernetes Secret, this is the secretless answer, and it pairs naturally with the Azure Key Vault Secrets Store CSI driver.
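On the Kubernetes side, the federated credential only matches if the service account's namespace and name line up exactly with the subject, and the service account carries the managed identity's client ID as an annotation. The sketch below shows both pieces; fic_subject and render_service_account are hypothetical helper names, and the all-zero client ID is a placeholder.

```shell
# Hypothetical helpers for wiring a workload identity; names are illustrative.

# Build the federated-credential subject for a namespace and service account.
fic_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Render a ServiceAccount manifest annotated with the managed identity's client ID.
render_service_account() {
  ns=$1; sa=$2; client_id=$3
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: $sa
  namespace: $ns
  annotations:
    azure.workload.identity/client-id: "$client_id"
EOF
}

fic_subject orders sa-orders
render_service_account orders sa-orders "00000000-0000-0000-0000-000000000000"
```

The rendered manifest can be piped to kubectl apply -f -. Pods that use the service account also need the label azure.workload.identity/use: "true" so the webhook injects the projected token.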

Architecture at a glance

Read the diagram left to right. Operators and CI/CD run kubectl apply — authenticated by your Entra ID identity, separate from the cluster’s own. They hit the Microsoft-managed control plane, where the cloud controller manager turns manifests into Azure API calls: asked for a LoadBalancer or PersistentVolumeClaim, it authenticates as the cluster identity (a system- or user-assigned MI) and calls Azure Resource Manager to program a Standard Load Balancer, allocate a public IP, or attach a managed disk. Badge 1 is the day-2 trap: on a service-principal cluster, the secret behind this identity can expire and every one of these calls fails at once.

Down in the data plane (your node pools, your VNet), two more identities work. The kubelet uses the kubelet identity to pull images from ACR — badge 2 is the ImagePullBackOff when it lacks AcrPull. Separately, your pods use workload identity: a projected service-account token is exchanged via the cluster’s OIDC issuer for an Entra ID token, letting a pod read a Key Vault secret as a user-assigned identity with no stored secret (badge 3). Badge 4 sits on the cluster identity: lacking Network Contributor on the subnet, load-balancer programming fails and the Service hangs on <pending>. The takeaway: three identities, three planes, one principle — the platform holds the credentials so you don’t.

Figure: Left-to-right AKS cluster-identity architecture. Operators and CI/CD authenticate with kubectl to the Microsoft-managed control plane, where the cloud controller manager acts as the cluster managed identity to program a Standard Load Balancer, public IP and managed disks via Azure Resource Manager. The data-plane node pools in your VNet use a kubelet managed identity to pull images from Azure Container Registry with AcrPull, while pods use workload identity through the cluster OIDC issuer to read secrets from Azure Key Vault. Numbered badges mark (1) the expired service-principal secret outage, (2) ImagePullBackOff from missing AcrPull, (3) the secretless workload-identity token exchange, and (4) a LoadBalancer stuck pending from a missing Network Contributor role.

Real-world scenario

Northwind Retail runs its e-commerce API on a single AKS cluster, aks-nw-prod, provisioned in early 2024 by a contractor (since rolled off) whose old Terraform module used a service principal with a client secret, expiry defaulted to two years. It checked in cleanly, ran beautifully, and nobody thought about the identity again — no reminder, no alert, no runbook note. The secret was an invisible dependency.

Two years later, during an autoscale event on a flash-sale Saturday, the cluster tried to add two nodes. The new nodes came up but their pods stuck in ImagePullBackOff, so the site limped rather than fell over. On-call assumed an ACR outage, checked ACR, and found it healthy. Then a new LoadBalancer service hung on <pending>. Two unrelated-looking failures, image pulls and load-balancer programming, had one common cause: the service-principal secret had expired three days earlier, and the cluster could no longer authenticate to Azure for any control-plane operation. Running pods survived (they don’t re-auth to stay up); anything needing a new Azure call failed.

az aks show revealed the cluster was service-principal-based, and the SP’s credentials in Entra ID showed a past expiry. The immediate stop-gap was to reset the credential so the site could breathe:

# Emergency: reset the expired SP credential (buys time; not the real fix)
az aks update-credentials \
  --name aks-nw-prod \
  --resource-group rg-nw-prod \
  --reset-service-principal \
  --service-principal "$APP_ID" \
  --client-secret "$NEW_SECRET"

Image pulls and load-balancer programming recovered within minutes. But the team treated this as a near-miss, not a fix — a new secret just resets the same two-year fuse. The durable remediation, applied the following Tuesday in a change window, was to convert the cluster to a managed identity, eliminating the secret entirely:

# The real fix: move to managed identity — no secret to ever expire again
az aks update \
  --name aks-nw-prod \
  --resource-group rg-nw-prod \
  --enable-managed-identity

After conversion they upgraded the node pools (az aks nodepool upgrade --node-image-only) so the kubelet stopped using the service principal and picked up the new kubelet identity, re-attached the registry (az aks update --attach-acr acrnwprod) so that identity held AcrPull, verified with az aks check-acr, and deleted the orphaned service principal and its secret from their secret store and Terraform. The runbook lesson: any cluster on a service principal is a scheduled outage waiting for its secret to expire; managed identity removes the failure class. Avoidable impact: about 40 minutes of degraded checkout at peak, entirely preventable by a default newer clusters get for free.

Advantages and disadvantages

The trade-off is lopsided in favour of managed identity, but state it honestly:

|  | Managed identity | Service principal |
|---|---|---|
| Pros | No secret to expire/leak/rotate; default & recommended; platform-rotated tokens; no day-2 credential outage; simplest setup | Works across tenants/some legacy tooling; familiar to teams with old automation; portable identity model |
| Cons | Bound to Azure resources (less portable cross-tenant); user-assigned adds one resource to manage | Secret expires → cluster↔Azure auth breaks; you own rotation; secret is a leak surface; legacy / not recommended |

Managed identity matters for essentially every AKS cluster — eliminating the secret-expiry failure class is worth more than any flexibility a service principal offers, and setup is simpler, not harder. A service principal’s portability only counts in genuine cross-tenant or legacy-tooling edge cases a greenfield cluster doesn’t have. Choosing today, you choose managed identity; inheriting a service principal, you convert. Its only “disadvantage” — slightly less cross-tenant portability — is irrelevant to a cluster living in one tenant and subscription, which is the overwhelming majority.

Hands-on lab

This walk-through creates a managed-identity AKS cluster, attaches an ACR so pulls work with no secret, inspects the three identities, and tears everything down. Small SKUs keep it cheap; run it in Cloud Shell where az and kubectl are preinstalled.

1. Set variables and create a resource group.

RG=rg-aks-id-lab
LOC=eastus
AKS=aks-id-lab
ACR=acridlab$RANDOM   # ACR name must be globally unique, alphanumeric

az group create --name $RG --location $LOC

2. Create a registry (Basic tier — cheapest).

az acr create --name $ACR --resource-group $RG --sku Basic

3. Create a cluster with a managed identity. --enable-managed-identity is now the default, but state it explicitly so the intent is clear. One small node keeps cost down.

az aks create \
  --name $AKS \
  --resource-group $RG \
  --enable-managed-identity \
  --node-count 1 \
  --node-vm-size Standard_B2s \
  --generate-ssh-keys

Expected: the cluster provisions in a few minutes. No client secret was created or requested — that’s the point.

4. Inspect the three identities. Confirm the cluster identity type, then the kubelet identity:

# Cluster (control-plane) identity — should be 'SystemAssigned'
az aks show -n $AKS -g $RG --query "identity.type" -o tsv

# Kubelet identity (used for ACR pulls) — note its clientId/objectId
az aks show -n $AKS -g $RG \
  --query "identityProfile.kubeletidentity.{clientId:clientId, objectId:objectId, resourceId:resourceId}" -o jsonc

Expected: SystemAssigned for the cluster and a populated kubelet block — proof the kubelet has its own user-assigned identity, distinct from the cluster’s.

5. Attach the ACR (grant AcrPull to the kubelet identity).

az aks update --name $AKS --resource-group $RG --attach-acr $ACR

6. Verify the pull path.

az aks check-acr --name "$AKS" --resource-group "$RG" --acr "${ACR}.azurecr.io"

Expected: a success message confirming the kubelet identity can authenticate and pull from the registry. No imagePullSecret, no registry password anywhere.

7. (Optional) Deploy a public image to see the cluster identity program a LoadBalancer.

az aks get-credentials --name $AKS --resource-group $RG --overwrite-existing
kubectl create deployment web --image=mcr.microsoft.com/azuredocs/aks-helloworld:v1
kubectl expose deployment web --type=LoadBalancer --port=80 --target-port=80
kubectl get service web --watch   # EXTERNAL-IP moves from <pending> to a real IP

When EXTERNAL-IP flips from <pending> to an address, you’ve just watched the cluster identity call Azure to allocate a public IP and program the Standard Load Balancer. Press Ctrl-C to stop.

8. Tear down — delete the resource group to remove everything (cluster, ACR, node RG, system-assigned identity).

az group delete --name $RG --yes --no-wait

The system-assigned cluster identity is deleted with the cluster (that’s its lifecycle); a user-assigned identity, had you used one, would survive and need separate cleanup.

The same cluster in Bicep, showing the managed-identity declaration explicitly:

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: 'aks-id-lab'
  location: resourceGroup().location
  identity: {
    type: 'SystemAssigned'   // managed identity for the cluster; no secret
  }
  properties: {
    dnsPrefix: 'aksidlab'
    agentPoolProfiles: [
      {
        name: 'systempool'
        count: 1
        vmSize: 'Standard_B2s'
        mode: 'System'
      }
    ]
    // For a user-assigned cluster identity instead, set identity.type to
    // 'UserAssigned' and supply identity.userAssignedIdentities.
  }
}

Common mistakes & troubleshooting

The identity failures you’ll actually meet, each as symptom → root cause → confirm → fix. Scan the table, then read the detail for your row.

| # | Symptom | Root cause | Confirm | Fix |
|---|---|---|---|---|
| 1 | Cluster-wide ops fail (pulls + LB + disks) after months | SP secret expired | az aks show --query servicePrincipalProfile + Entra creds | az aks update --enable-managed-identity |
| 2 | ImagePullBackOff, 401 from ACR | Kubelet identity lacks AcrPull | az aks check-acr; kubectl describe pod | az aks update --attach-acr <acr> |
| 3 | LoadBalancer stuck <pending> | Cluster identity lacks Network Contributor | kubectl describe svc; check role assignments | Network Contributor on the subnet |
| 4 | PVC Pending, disk attach error | Can’t manage disks in node RG | kubectl describe pvc; node-RG roles | Contributor on the managed node RG |
| 5 | Pod can’t read KV secret (403) | Workload identity unwired / MI lacks role | kubectl describe pod; check fed cred + role | Create fed credential; grant KV Secrets User |
| 6 | aad-pod-identity breaks at scale | Pod-managed identity deprecated | Check for AzureIdentity CRDs | Migrate to workload identity |
| 7 | Roles vanish after recreating cluster | System-assigned ID changed | Compare identity objectId before/after | Use a user-assigned identity |

#1 — The expired service-principal secret (the big one)

Everything dies at once — pulls, load balancers, disks — while existing pods keep running, because they don’t re-authenticate to stay up. The cluster is service-principal-based and its client secret expired, so the CCM can no longer authenticate to Entra ID for any Azure call. Confirm whether it’s even SP-based:

# A real GUID here = service principal; 'msi' = already managed identity (not your problem)
az aks show -n aks-nw-prod -g rg-nw-prod \
  --query "servicePrincipalProfile.clientId" -o tsv

A GUID means an SP — check that app’s secret expiry in Entra ID. The durable fix is converting to managed identity; resetting the credential (az aks update-credentials --reset-service-principal) is an emergency stop-gap that just rearms the same fuse.
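The msi-versus-GUID check is easy to wrap for use across several clusters. The sketch below assumes only what the query above returns; classify_cluster_identity is a hypothetical helper name, and the GUID shown is a placeholder.

```shell
# Hypothetical helper: interpret the value of servicePrincipalProfile.clientId.
classify_cluster_identity() {
  case "$1" in
    msi) echo "managed-identity" ;;
    "")  echo "unknown" ;;
    *)   echo "service-principal" ;;
  esac
}

classify_cluster_identity "msi"
classify_cluster_identity "11111111-2222-3333-4444-555555555555"

# Against a real cluster (assumes az is logged in):
#   classify_cluster_identity "$(az aks show -n aks-nw-prod -g rg-nw-prod \
#     --query servicePrincipalProfile.clientId -o tsv)"
```

Any cluster that comes back as service-principal is a candidate for the managed-identity conversion described above.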

#2 — ImagePullBackOff from ACR

Pods stick in ImagePullBackOff with a 401 from *.azurecr.io because the kubelet identity lacks AcrPull — usually the ACR was never attached, or a rebuilt cluster’s new kubelet identity wasn’t re-granted. az aks check-acr confirms it directly; the fix is az aks update --attach-acr <acr>. This is the kubelet identity, not the cluster identity — the most common mix-up in this whole area.

#3 — LoadBalancer stuck on <pending>

EXTERNAL-IP hangs on <pending> because the cluster identity lacks Network Contributor on the subnet — common with a custom (BYO) VNet AKS doesn’t own. kubectl describe svc <name> shows the Azure error in Events. Fix: grant Network Contributor to the cluster identity scoped to the subnet (or the VNet’s RG).
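A minimal sketch of that grant follows, assuming a bring-your-own VNet. The helper names subnet_scope and grant_network_contributor are hypothetical, as are the resource names; the grant function is defined but not called, so running the block only prints the scope string.

```shell
# Hypothetical helper: assemble the resource ID of a subnet to use as an RBAC scope.
subnet_scope() {
  sub=$1; rg=$2; vnet=$3; subnet=$4
  echo "/subscriptions/$sub/resourceGroups/$rg/providers/Microsoft.Network/virtualNetworks/$vnet/subnets/$subnet"
}

# Grant Network Contributor on that scope to a principal (defined, not run here).
grant_network_contributor() {
  principal_id=$1; scope=$2
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Network Contributor" \
    --scope "$scope"
}

subnet_scope "<sub-id>" rg-net vnet-shop snet-aks

# For a system-assigned cluster identity the principal can be read with:
#   az aks show -n aks-shop-prod -g rg-shop-prod --query identity.principalId -o tsv
# A user-assigned cluster identity lists its principal under identity.userAssignedIdentities instead.
```

Scoping to the single subnet rather than the VNet's resource group keeps the grant as narrow as the load-balancer operations need.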

#4 — PVC Pending / disk attach errors

A PersistentVolumeClaim stays Pending with an authorization error because the cluster identity can’t create/attach managed disks in the managed node resource group (MC_*). kubectl describe pvc <name> shows it; ensure the cluster identity can manage disks in that RG (Contributor at node-RG scope covers the default case).

#5 — Pod can’t read a Key Vault secret

A 403 reading a Key Vault secret is workload identity, not the cluster identity: the federated credential is missing/mismatched, or the user-assigned MI lacks the role. Verify the federated credential subject exactly matches system:serviceaccount:<ns>:<sa>, then grant the MI Key Vault Secrets User on the vault.

#6 — Still on aad-pod-identity

Intermittent identity failures under load with the old add-on mean you’re on the deprecated pod-managed-identity (look for AzureIdentity CRDs). Migrate to Entra Workload ID — enable the OIDC issuer and workload identity, then create federated credentials.

#7 — Roles vanish after a cluster rebuild

A system-assigned identity gets a new principal ID on recreation, so grants to the old object ID no longer apply. Use a user-assigned identity — a stable, independent principal whose role grants survive cluster lifecycle changes.

Best practices

Security notes

Managed identity is the more secure choice because there is no long-lived secret to leak. A service-principal client secret lives in your config or secret store, can be copied, can end up in git history, and grants whatever the SP can do until rotated or revoked — exactly the leaked-credential exposure that turns a small mistake into a breach. A managed identity hands out only short-lived, platform-rotated tokens via the instance metadata endpoint; there is no static secret to exfiltrate.

Beyond that, apply ordinary RBAC hygiene. Grant least privilege at the tightest scope: AcrPull on the specific registry (not the RG), Network Contributor on the specific subnet (not the VNet or subscription), and resist granting Contributor at subscription scope to silence an authorization error. For pods, workload identity keeps application credentials out of the cluster entirely — no Key Vault secret in a Kubernetes Secret, no password baked into an image — scoped to exactly the vault, Storage account, or database it needs. Finally, treat the kubelet and cluster identities as privileged principals: whoever can change their role assignments changes what the cluster can do in Azure, so guard those grants with production-grade change control.
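The least-privilege rule above can be turned into a rough audit. This sketch classifies a role-assignment scope by its ARM path and flags Owner or Contributor grants at subscription level or wider; the role list, the level names, and the helper names are illustrative assumptions rather than an official policy.

```python
BROAD_ROLES = {"Owner", "Contributor"}


def scope_level(scope: str) -> str:
    """Classify an ARM scope string by how wide it is."""
    parts = [p.lower() for p in scope.strip("/").split("/") if p]
    if not parts or parts[0] != "subscriptions":
        # Root ("/") and management-group scopes sit above any subscription.
        return "above_subscription"
    if "providers" in parts:
        return "resource"
    if "resourcegroups" in parts:
        return "resource_group"
    return "subscription"


def broad_grants(assignments):
    """Return assignments that give a broad role at subscription scope or wider."""
    wide = {"subscription", "above_subscription"}
    return [
        a for a in assignments
        if a["roleDefinitionName"] in BROAD_ROLES and scope_level(a["scope"]) in wide
    ]
```

Anything the audit returns is a candidate for replacement with a narrower grant, such as AcrPull on one registry or Network Contributor on one subnet.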

Cost & sizing

The good news: identity itself is free. Service principals and managed identities (system- or user-assigned) incur no charge — you pay for nodes, the load balancer, disks, ACR, not the principals. So SP vs MI is driven entirely by reliability and security, never cost. The “cost” of a service principal is the outage when its secret expires — real money in lost availability, but not a line item on your bill.

Component | Cost driver | Rough figure | Note
Managed identity (system/user) | None | Free | No per-identity charge
Service principal | None | Free | “Cost” is the expiry outage, not a bill
AKS (Free tier control plane) | Nodes only | Node VM hourly | Control plane free; pay for nodes
AKS Standard tier (SLA) | Per cluster | ~₹8/hr (~$0.10/hr) | Adds a backed SLA; identity unchanged
Lab nodes (B2s ×1) | VM hours | ~₹8–12/hr (~$0.10–0.15/hr) | Delete the RG when done
Standard Load Balancer | Rules + data | Small hourly + per-GB | Allocated by a LoadBalancer svc

The identity decision does not change with cluster size or budget: choose managed identity regardless. The only sizing-adjacent call is system-assigned vs user-assigned, which is a management decision (lifecycle and reuse), not a billing one; user-assigned is still free, just one more resource to track. For the lab, the dominant cost is the single B2s node; deleting the resource group (step 8) stops all charges, and the system-assigned identity is cleaned up automatically with no orphan left behind.
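As a rough worked example of the figures in the table, the sketch below multiplies an hourly rate by running hours. The rates are the table's approximate numbers and the three-hour lab length is an assumption for illustration; the identity contributes nothing to the total.

```python
IDENTITY_HOURLY_USD = 0.0  # managed identities and service principals carry no charge


def lab_cost_usd(node_hourly_usd: float, hours: float, nodes: int = 1) -> float:
    """Estimate lab spend: node hours plus the (zero) identity charge."""
    node_cost = node_hourly_usd * hours * nodes
    identity_cost = IDENTITY_HOURLY_USD * hours
    return round(node_cost + identity_cost, 2)


# Assumed three-hour lab on one B2s node, at the low and high rough rates.
low = lab_cost_usd(0.10, 3)
high = lab_cost_usd(0.15, 3)
```

Under those assumptions the lab costs well under a dollar, and the figure is the same whichever identity model the cluster uses.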

Interview & exam questions

Q1. What is the difference between a service principal and a managed identity in AKS? Both are Entra ID principals you grant Azure roles to. A service principal has a client ID and a client secret you create, store, and must rotate before it expires. A managed identity has no secret you handle — the platform creates, rotates, and hands out short-lived tokens automatically. Managed identity is the default and recommended cluster identity. (AZ-104, AZ-500.)

Q2. Why is managed identity recommended over a service principal? Because a service principal’s secret expires (commonly 1–2 years), and when it does the cluster can’t authenticate to Azure — pulls, load balancers, disks fail at once with no code change. Managed identity removes that failure class: no secret to expire, leak, or rotate.

Q3. Name the three identities an AKS cluster uses and what each is for. The cluster (control-plane) identity for the CCM’s Azure calls (load balancers, public IPs, disks, routes); the kubelet identity for pulling images from ACR (needs AcrPull); and workload identity for individual pods to reach Azure (Key Vault, Storage) without a secret, via OIDC federation.

Q4. How do you let an AKS cluster pull images from a private ACR without a registry secret? Run az aks update --attach-acr <registry>, which grants the AcrPull role to the cluster’s kubelet identity scoped to that registry. No username/password or imagePullSecret goes into your YAML. Verify with az aks check-acr.

Q5. What is the difference between system-assigned and user-assigned managed identity? A system-assigned identity is created and deleted with the cluster and used by only that cluster. A user-assigned identity is a standalone resource that outlives the cluster, can be shared across resources, and can have roles granted before the cluster exists — which makes IaC ordering clean and gives you a stable principal across rebuilds.

Q6. A LoadBalancer service is stuck on <pending>. What identity problem might cause this? The cluster identity likely lacks Network Contributor on the subnet or node resource group, so the cloud controller manager can’t program the Standard Load Balancer or allocate the public IP. Common in custom (BYO) VNets. Confirm with kubectl describe svc and check role assignments on the subnet.

Q7. How do you convert an existing service-principal cluster to managed identity? Run az aks update --enable-managed-identity on the cluster. This switches the control plane to a managed identity; the kubelet keeps using the service principal until the node pools are upgraded (a node-image upgrade is enough), so upgrade them next. Then re-attach any ACR (so the new kubelet identity gets AcrPull) and remove the now-unused service principal and its secret from your secret store and IaC.

Q8. What is workload identity and what did it replace? Microsoft Entra Workload ID lets a pod authenticate to Azure with no secret: the pod’s Kubernetes service account token, signed by the cluster’s OIDC issuer, is exchanged for an Entra token through a federated credential on a user-assigned managed identity. It replaces the deprecated aad-pod-identity, which intercepted IMDS calls and had reliability problems at scale.

Q9. If a cluster shows servicePrincipalProfile.clientId as msi, what does that mean? It means the cluster is using a managed identity, not a service principal — msi is the sentinel value. A real GUID there would indicate a service-principal-based cluster (and a potential secret-expiry exposure).

Q10. Why might role assignments stop working after you recreate a cluster? If the cluster used a system-assigned identity, recreation produces a new principal with a new object ID, so roles granted to the old object ID no longer apply. Using a user-assigned identity gives a stable principal whose role grants survive cluster lifecycle changes.

Q11. Does choosing managed identity over a service principal cost more? No. Both managed identities and service principals are free; you pay for nodes, load balancers, disks, and ACR, not for the identity. The decision is about reliability and security, never cost — the only “cost” of a service principal is the outage when its secret expires.

Q12. Where would an ImagePullBackOff from ACR send you — cluster identity or kubelet identity? The kubelet identity. Image pulls are performed by the kubelet, so a missing AcrPull grant on the kubelet identity is the cause. Fix with az aks update --attach-acr. The cluster identity governs load balancers and disks, not pulls — a common mix-up.
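The sentinel check from Q9 is simple to script. This sketch assumes you have the JSON text that az aks show prints for the cluster and reads servicePrincipalProfile.clientId from it; the function name and the returned labels are hypothetical.

```python
import json


def cluster_identity_kind(aks_show_json: str) -> str:
    """Classify a cluster as managed-identity or service-principal from az aks show JSON."""
    cluster = json.loads(aks_show_json)
    client_id = (cluster.get("servicePrincipalProfile") or {}).get("clientId")
    if client_id is None:
        return "unknown"
    # "msi" is the sentinel value; anything else is treated as a real client ID.
    if client_id.lower() == "msi":
        return "managed-identity"
    return "service-principal"
```

A "service-principal" result is the cue to check when that principal’s secret expires and to plan the conversion.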

Quick check

  1. Which cluster identity model has a credential that can expire and cause a day-2 outage?
  2. Which of the three AKS identities needs the AcrPull role, and on what scope?
  3. What single az aks command converts a service-principal cluster to managed identity?
  4. System-assigned or user-assigned: which gives you a stable principal whose role grants survive a cluster rebuild?
  5. What modern feature lets a pod access Azure with no secret, and what deprecated thing does it replace?

Answers

  1. The service principal — its client secret expires (typically after 1–2 years), breaking the cluster’s ability to authenticate to Azure. Managed identity has no such secret.
  2. The kubelet identity, scoped to the target Azure Container Registry. Grant it with az aks update --attach-acr (or AcrPull on the registry directly).
  3. az aks update --enable-managed-identity. Afterwards, re-attach any ACR so the new kubelet identity holds AcrPull.
  4. User-assigned — it’s an independent resource with a fixed principal ID, so role grants persist across cluster lifecycle changes. A system-assigned identity gets a new ID on recreation.
  5. Workload identity (Microsoft Entra Workload ID) — federating the cluster’s OIDC issuer token to a user-assigned managed identity. It replaces the deprecated pod-managed identity (aad-pod-identity).

Glossary

Next steps


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
