Migrating EKS Workloads from IRSA to EKS Pod Identity: Mechanics, Trust, and Rollout

IRSA was the right answer for six years. You stood up an OIDC provider per cluster, annotated a service account with a role ARN, and the AWS SDK exchanged a projected token for credentials. It works. But every cluster you create is a new IAM identity provider, every role’s trust policy hard-codes a specific cluster’s OIDC issuer URL, and reusing one role across three clusters means a StringEquals condition that grows a line per cluster. EKS Pod Identity collapses that: one service principal (pods.eks.amazonaws.com), one trust policy, and the cluster/namespace/service-account binding managed entirely in the EKS API as an association resource. This is the migration I run for platform teams who have outgrown the OIDC sprawl — written to be incremental and fully reversible at every step.

The reason this migration is safe to attempt is that IRSA and Pod Identity coexist on the same role. A role can trust both sts:AssumeRoleWithWebIdentity (the OIDC federation IRSA uses) and sts:AssumeRole/sts:TagSession from pods.eks.amazonaws.com (Pod Identity) at the same time, and a pod’s effective credential source is decided at pod start by which environment variables EKS injects. So you can flip one namespace, watch CloudTrail, and roll back with a single kubectl rollout restart if anything looks wrong. Nothing is destructive until you deliberately retire the OIDC trust statement at the end.

By the end of this article you will know exactly which of the three moving parts — the Pod Identity Agent DaemonSet, the association resource, and the trust policy — is responsible for each failure you hit, and you will be able to read aws sts get-caller-identity from inside a pod and tell in one line whether the whole credential path is working. Because you will return to this mid-migration, the trust models, the session tags, the failure modes, the CLI flags and the cost deltas are all laid out as scannable tables — read the prose once, then keep the tables open during the rollout.

What problem this solves

IRSA’s trust anchor is an IAM OIDC identity provider that points at your cluster’s issuer URL. The role trust policy looks like this:

{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:sub": "system:serviceaccount:payments:checkout",
      "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E:aud": "sts.amazonaws.com"
    }
  }
}

Three structural problems show up at scale.

Per-cluster identity providers: each cluster has a unique OIDC issuer, so a role meant to be shared across clusters needs every issuer registered as a provider and every sub/aud condition repeated. Recreate a cluster and the issuer changes, breaking every role that trusted it.

Coupled ownership: the IAM team owns OIDC providers while the platform team owns clusters, so standing up a new cluster requires an IAM change ticket before any workload can assume a role.

Condition-key sprawl: multi-cluster, multi-namespace reuse turns the trust policy into a maintenance liability that few people fully understand.

Pod Identity replaces the federation anchor with a single AWS service principal, pods.eks.amazonaws.com, and moves the cluster/namespace/service-account binding out of IAM and into an EKS association resource. The result: the role trust policy is identical across every cluster and never edited per cluster, and creating a cluster is a platform-team operation with no IAM ticket.
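
As a rough illustration of that difference, the Python sketch below builds both trust documents. The helper names and the second issuer ID are hypothetical; the statement shapes follow the IRSA example above and the service principal named in the text.

```python
def irsa_statement(account_id, issuer, namespace, service_account):
    """One IRSA trust statement; a shared role needs one per cluster issuer."""
    return {
        "Effect": "Allow",
        "Principal": {
            "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
        },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringEquals": {
                f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{issuer}:aud": "sts.amazonaws.com",
            }
        },
    }


def irsa_trust(account_id, issuers, namespace, service_account):
    """IRSA trust document: grows by one statement per cluster issuer."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            irsa_statement(account_id, issuer, namespace, service_account)
            for issuer in issuers
        ],
    }


def pod_identity_trust():
    """Pod Identity trust document: the same regardless of cluster count."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "pods.eks.amazonaws.com"},
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


issuers = [
    "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E",
    # Hypothetical second cluster, for illustration only.
    "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE00000000000000000000000002",
]
print(len(irsa_trust("111122223333", issuers, "payments", "checkout")["Statement"]))
print(len(pod_identity_trust()["Statement"]))
```

Adding a third cluster appends a third issuer-specific statement to the IRSA document, while the Pod Identity document does not change.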

Who hits this pain hardest: fleets of 5+ clusters (blue/green, per-tenant, per-region), teams that recycle clusters frequently (the issuer churns), and any role shared across clusters or accounts. To frame the whole field before the deep dive, here is what each mechanism costs you and where Pod Identity wins:

| Concern | IRSA (OIDC) | EKS Pod Identity | Why it matters at scale |
| --- | --- | --- | --- |
| Trust anchor | One OIDC provider per cluster | One service principal, all clusters | N clusters → N providers to register and trust |
| Where the SA binding lives | IAM trust-policy Condition | EKS association (API resource) | Binding owned by platform team, not IAM |
| Reuse a role across clusters | New provider + sub condition each | Same role, new association | Trust policy never grows per cluster |
| Recreate a cluster | Issuer changes → trust breaks | Association recreated, trust untouched | Cluster rebuild becomes a non-event |
| Credential exchange | SDK calls STS in every pod | EKS Auth assumes once per node/role | Fewer STS calls, less throttling at scale |
| Cross-namespace scoping | Hand-rolled per-namespace conditions | Built-in session tags | One role serves many namespaces safely |
| Cross-account access | SDK role-chaining hack in app config | First-class --target-role-arn | No app-side config; auditable in EKS |
| Who must change to onboard a cluster | IAM team (provider) + platform | Platform team only | No cross-team ticket on the critical path |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand IAM roles and trust policies (a role’s AssumeRolePolicyDocument versus its permission policies), STS assume-role mechanics, and the basics of Kubernetes service accounts and how a pod references one. You need an EKS cluster you can administer, the AWS CLI configured, kubectl context set, and (for the IaC paths) Terraform. Familiarity with how IRSA works today — the OIDC provider, the eks.amazonaws.com/role-arn annotation, and the projected token — is assumed, because this is a migration, not a from-scratch setup.

This sits in the EKS identity & security track. It builds directly on AWS IAM Fundamentals: Users, Groups, Roles, Policies & the Evaluation Logic and Kubernetes RBAC & Service Accounts, In Depth. It pairs with Running EKS at Scale: Pod Identity, Karpenter Autoscaling, and VPC CNI Networking for the fleet picture, and the cross-account patterns extend Secure Cross-Account Access: Assume-Role Patterns, External ID, Confused Deputy, and Session Policies. For the comparable mechanism on other clouds, see GKE Workload Identity Deep Dive.

A quick map of who owns what during the migration, so you route changes to the right team:

| Layer | What lives here | Who usually owns it | What it can break during migration |
| --- | --- | --- | --- |
| Service account (K8s) | The SA the pod uses, annotation (IRSA) | App / platform team | Pod uses wrong SA → no association match |
| Pod Identity Agent | DaemonSet on every node, link-local endpoint | Platform team | Missing/unhealthy → node role served instead |
| Association (EKS) | (cluster, namespace, SA) → role mapping | Platform team | Wrong SA/role → AccessDenied or wrong identity |
| Role trust policy (IAM) | pods.eks.amazonaws.com + sts:TagSession | IAM / security team | Missing TagSession → every assume denied |
| Role permission policy (IAM) | What the role can actually do | IAM / security team | Namespace-scoped conditions stop matching if tags off |
| Network / proxy | Egress proxy, NO_PROXY | Platform / network | Link-local routed to proxy → credential fetch fails |

Core concepts

Five mental models make every later step and every failure obvious.

Pod Identity has exactly three moving parts. The Pod Identity Agent is a DaemonSet that serves credentials over a link-local endpoint on each node. The association is an EKS API resource that maps (cluster, namespace, service account) → IAM role. The trust policy on that role trusts the EKS service principal instead of an OIDC issuer. Every problem you will hit belongs to exactly one of these three — that is the diagnostic frame.

The credential source is decided at pod start, not at association time. Creating an association changes nothing about running pods. When a pod using an associated SA starts, EKS injects AWS_CONTAINER_CREDENTIALS_FULL_URI and AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE. The SDK’s default credential provider chain reads them and fetches credentials from the agent. So the unit of cutover is a pod restart — which is exactly why a kubectl rollout restart is both the apply mechanism and the rollback mechanism.

The assume is “once per node per role”, not “once per pod”. With IRSA, every pod calls STS itself (AssumeRoleWithWebIdentity). With Pod Identity, the agent calls the EKS Auth API (AssumeRoleForPodIdentity) and caches credentials per node per role. On a node running twenty pods of the same role, that is one assume, not twenty — the scalability win, and the reason Pod Identity throttles STS far less at fleet scale.

Session tags are the scoping lever. Every Pod Identity assume attaches six session tags (cluster ARN/name, namespace, SA, pod name, pod UID). Because of that, sts:TagSession is required in the trust policy — without it the assume is denied. Those tags let one role serve many namespaces safely: scope the permission policy with aws:PrincipalTag/kubernetes-namespace and the same role assumed from analytics is denied what payments is allowed.

Dual-trust makes it reversible. A role can carry both the OIDC AssumeRoleWithWebIdentity statement and the pods.eks.amazonaws.com statement simultaneously. During cutover you keep both live; once an association exists and the pod has restarted, the pod uses Pod Identity because EKS injects the container-credentials variables for the associated service account. Roll back by deleting the association and restarting: the pod falls back to the still-present IRSA annotation. Nothing is destroyed until you remove the OIDC statement at the very end.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to the migration |
| --- | --- | --- | --- |
| OIDC provider | IAM identity provider for a cluster’s issuer | IAM (per cluster) | The IRSA anchor you are retiring |
| Service principal | pods.eks.amazonaws.com | Role trust policy | The single Pod Identity anchor |
| Association | (cluster, ns, SA) → role mapping | EKS API resource | Replaces the trust-policy Condition |
| Pod Identity Agent | DaemonSet serving creds on a node | kube-system | No agent → node role served instead |
| Link-local endpoint | 169.254.170.23:80 / :2703 | Each node | Where the SDK fetches credentials |
| FULL_URI var | AWS_CONTAINER_CREDENTIALS_FULL_URI | Injected into every container | Presence ⇒ pod is on Pod Identity |
| sts:TagSession | Permission to attach session tags | Trust policy action | Missing ⇒ every assume denied |
| Session tag | kubernetes-namespace, etc. | On the assumed session | The per-namespace scoping lever |
| --target-role-arn | Chains to a role in another account | Association field | First-class cross-account access |
| Dual-trust role | Trusts OIDC and pods.eks | Role trust policy | Makes cutover reversible |
| AssumeRoleForPodIdentity | The EKS Auth assume call | CloudTrail event | Proof Pod Identity is being used |

How Pod Identity works: the agent and the credential path

There are three moving parts; here is each in the order the credential travels.

1 — The Pod Identity Agent. It runs as a DaemonSet (eks-pod-identity-agent), one pod per node, on the node’s hostNetwork. It listens on a link-local address, 169.254.170.23 (and [fd00:ec2::23] for IPv6), on ports 80 and 2703. Install it as a managed add-on; EKS Auto Mode clusters already have it.

2 — The association. An EKS resource mapping (cluster, namespace, service account) → IAM role. You create it with the EKS API; nothing in Kubernetes changes except that the pod must use that service account.

3 — Credential delivery. When a pod using an associated service account starts, EKS injects AWS_CONTAINER_CREDENTIALS_FULL_URI and AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE into every container. The SDK’s default credential provider chain reads them and fetches credentials from the agent over the link-local endpoint. The agent calls the EKS Auth API (AssumeRoleForPodIdentity), which validates the association and returns temporary credentials — once per node per role.
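
To make the delivery step less abstract, here is a minimal Python sketch of what an SDK-style client does with the two injected variables: read the token file, then send an HTTP GET to the full URI with the token in the Authorization header. The function names are hypothetical, and the response field names in the comment are an assumption based on the standard container credentials format; in real workloads the AWS SDK does all of this for you.

```python
import json
import os
import urllib.request


def build_credentials_request(env):
    """Build the request a client would send to the Pod Identity Agent."""
    uri = env["AWS_CONTAINER_CREDENTIALS_FULL_URI"]
    token_path = env["AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE"]
    # The token is re-read on every call because the projected file rotates.
    with open(token_path, encoding="utf-8") as handle:
        token = handle.read().strip()
    return urllib.request.Request(uri, headers={"Authorization": token})


def fetch_credentials(env=None, timeout=2.0):
    """Fetch temporary credentials from the link-local endpoint.

    Assumed response shape: a JSON object with AccessKeyId,
    SecretAccessKey, Token and Expiration fields.
    """
    request = build_credentials_request(os.environ if env is None else env)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_credentials() only works from inside a pod that uses an associated service account on a node with a healthy agent; anywhere else the environment lookup or the connection fails, which is itself a useful diagnostic.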

Install the agent and confirm it is healthy:

aws eks create-addon \
  --cluster-name platform-prod \
  --addon-name eks-pod-identity-agent

kubectl get daemonset eks-pod-identity-agent -n kube-system
kubectl get pods -n kube-system -l app.kubernetes.io/name=eks-pod-identity-agent

The same add-on in Terraform:

resource "aws_eks_addon" "pod_identity_agent" {
  cluster_name = "platform-prod"
  addon_name   = "eks-pod-identity-agent"
}

If your cluster runs an HTTP proxy, add 169.254.170.23 and [fd00:ec2::23] to NO_PROXY in your workloads, or the SDK’s credential request is routed to the proxy and fails. This is the single most common Pod Identity bring-up failure.
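
A small pre-flight check can catch this before a rollout. The sketch below is a hypothetical helper that looks for the link-local addresses in a workload's NO_PROXY value. It assumes exact comma-separated entries; real proxy clients differ in how they interpret NO_PROXY (suffix matching, CIDR ranges), so treat a clean result as a hint and not proof.

```python
LINK_LOCAL_IPV4 = "169.254.170.23"
LINK_LOCAL_IPV6 = "[fd00:ec2::23]"


def missing_no_proxy_entries(env, ipv6=False):
    """Return the Pod Identity addresses absent from NO_PROXY / no_proxy."""
    raw = env.get("NO_PROXY") or env.get("no_proxy") or ""
    entries = {item.strip().lower() for item in raw.split(",") if item.strip()}
    required = [LINK_LOCAL_IPV4]
    if ipv6:
        required.append(LINK_LOCAL_IPV6)
    return [host for host in required if host.lower() not in entries]


workload_env = {"HTTPS_PROXY": "http://proxy.internal:3128", "NO_PROXY": "localhost,.svc"}
print(missing_no_proxy_entries(workload_env))
```

An empty list means both addresses the workload needs are already exempt from the proxy.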

The three moving parts, what each is responsible for, and the one command that proves it is healthy:

| Moving part | Responsible for | Lives in | Confirm it’s healthy with | Failure if absent/wrong |
| --- | --- | --- | --- | --- |
| Pod Identity Agent | Serving creds on the node | kube-system DaemonSet | kubectl get ds eks-pod-identity-agent -n kube-system | Node role served; pod gets node perms |
| Association | The SA→role binding | EKS API | aws eks list-pod-identity-associations | No injection; pod uses IRSA or nothing |
| Trust policy | Allowing the assume + tags | IAM role | aws iam get-role --role-name <r> | AccessDenied on AssumeRoleForPodIdentity |
| Injected env vars | Telling the SDK where to fetch | The container | kubectl exec ... env \| grep AWS_CONTAINER | SDK falls through to node role |
| NO_PROXY | Bypassing the proxy for link-local | Workload env | kubectl exec ... env \| grep -i no_proxy | Cred request hits proxy → fails |

The two link-local endpoints and ports the agent uses — pin these in firewall rules and NO_PROXY:

| Endpoint | Protocol / port | Family | Used for | Must be in NO_PROXY |
| --- | --- | --- | --- | --- |
| 169.254.170.23 | HTTP :80 | IPv4 | SDK credential fetch | Yes |
| 169.254.170.23 | TCP :2703 | IPv4 | Agent internal | Yes (same IP) |
| [fd00:ec2::23] | HTTP :80 | IPv6 | SDK credential fetch (IPv6) | Yes, on IPv6 clusters |
| 169.254.169.254 | HTTP :80 | IPv4 | IMDS (not Pod Identity) | Separate concern (IMDSv2) |

The two environment variables EKS injects, and the IRSA ones they supersede — knowing which a pod carries tells you its credential source instantly:

| Variable | Injected by | Value (example) | Meaning |
| --- | --- | --- | --- |
| AWS_CONTAINER_CREDENTIALS_FULL_URI | Pod Identity | http://169.254.170.23/v1/credentials | Pod is on Pod Identity |
| AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE | Pod Identity | /var/run/secrets/pods.eks.amazonaws.com/... | Token the agent validates |
| AWS_WEB_IDENTITY_TOKEN_FILE | IRSA | /var/run/secrets/eks.amazonaws.com/... | Pod still has IRSA available |
| AWS_ROLE_ARN | IRSA | arn:aws:iam::...:role/... | The IRSA role (from annotation) |
| AWS_REGION / AWS_DEFAULT_REGION | Either | us-east-1 | Region for STS/EKS Auth |
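
The table lends itself to a quick scripted check. The following Python sketch (function names are hypothetical) parses the output of an env listing from a container and reports which mechanisms have their variables present. It deliberately reports presence only and makes no claim about which source a given SDK will pick.

```python
def parse_env_output(text):
    """Turn NAME=value lines, as printed by `env`, into a dict."""
    env = {}
    for line in text.splitlines():
        name, separator, value = line.partition("=")
        if separator:
            env[name] = value
    return env


def credential_mechanisms(env):
    """Return the set of credential mechanisms whose variables are present."""
    mechanisms = set()
    if env.get("AWS_CONTAINER_CREDENTIALS_FULL_URI") and env.get(
        "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE"
    ):
        mechanisms.add("pod-identity")
    if env.get("AWS_WEB_IDENTITY_TOKEN_FILE") and env.get("AWS_ROLE_ARN"):
        mechanisms.add("irsa")
    return mechanisms


sample = """\
AWS_REGION=us-east-1
AWS_CONTAINER_CREDENTIALS_FULL_URI=http://169.254.170.23/v1/credentials
AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE=/var/run/secrets/example/token
"""
print(credential_mechanisms(parse_env_output(sample)))
```

The token path in the sample is a placeholder. An empty set on a pod that should have credentials points at a missing association or a pod that has not restarted since the association was created.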

Error and limit reference

The errors and messages you will actually see during a migration, what each really means, how to confirm it, and the fix. The non-obvious ones are the blanket AccessDenied (almost always the missing sts:TagSession) and the silent node-role fallthrough (the SDK never errors — it just uses the wrong identity):

| Signal / error | Where it surfaces | What it really means | How to confirm | Fix |
| --- | --- | --- | --- | --- |
| Caller is the node role (no error) | sts get-caller-identity in pod | SDK fell through to the node’s instance profile | ARN is assumed-role/<node-role>/<instance-id> | Fix proxy/NO_PROXY, agent, or SA match |
| AccessDenied on AssumeRoleForPodIdentity | CloudTrail | Trust missing sts:TagSession (usually) | CloudTrail event errorCode | Add sts:TagSession to trust |
| AccessDenied on the target call | App logs / CloudTrail (account B) | Cross-account chain not trusted both ways | CloudTrail in B shows the denied assume | A allow sts:AssumeRole on B; B trusts A |
| AccessDenied on a namespace-scoped action | App logs | PrincipalTag condition not matching | Worked before --disable-session-tags | Restore tags or scope via --policy |
| No AWS_CONTAINER_* vars in pod | kubectl exec ... env | Pod not restarted / no association | env \| grep AWS_CONTAINER empty | Create association + rollout restart |
| DaemonSet 0/N ready | kubectl get ds -n kube-system | Agent not scheduled (taints/add-on) | kubectl describe ds eks-pod-identity-agent -n kube-system | Install add-on; add tolerations |
| ResourceInUseException | create-pod-identity-association | Association already exists for pair | list-pod-identity-associations | Reuse/update existing; don’t duplicate |
| ThrottlingException from STS | CloudWatch / app retries | Per-pod IRSA assumes at scale | STS metric spikes during churn | Complete Pod Identity cutover |
| Old IRSA role in caller identity | sts get-caller-identity | Annotation present, no PI var injected | No FULL_URI in env | rollout restart to inject PI vars |
| Credentials expire mid-job | Long-running pod logs | SDK not refreshing from the endpoint | Check SDK version supports container creds | Upgrade SDK; it refreshes automatically |
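
Because the node-role fallthrough raises no error, it helps to classify the caller ARN mechanically. This Python sketch is one possible heuristic, not an official check: it assumes the ARN comes from the Arn field of aws sts get-caller-identity, and it relies on the fact that an EC2 instance role session is named after the instance ID. The function name and return labels are hypothetical.

```python
import re

ASSUMED_ROLE_ARN = re.compile(r"^arn:[^:]+:sts::\d{12}:assumed-role/([^/]+)/(.+)$")
INSTANCE_ID = re.compile(r"^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def classify_caller(arn, expected_role_name):
    """Label a caller ARN as expected, unexpected, or node-role fallthrough."""
    match = ASSUMED_ROLE_ARN.match(arn)
    if match is None:
        return "not-an-assumed-role"
    role_name, session_name = match.groups()
    # Instance-profile credentials use the EC2 instance ID as session name.
    if INSTANCE_ID.match(session_name):
        return "node-role-fallthrough"
    if role_name == expected_role_name:
        return "expected-role"
    return "unexpected-role"


print(classify_caller(
    "arn:aws:sts::111122223333:assumed-role/eks-node-role/i-0123456789abcdef0",
    "payments-checkout",
))
```

Anything other than "expected-role" after a cutover is a reason to stop and check the agent, the association and NO_PROXY before proceeding to the next namespace.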

The known limits and quotas worth pinning before you design a fleet rollout — real numbers where they are fixed, the mechanism where they are not:

| Limit / quota | Value | Scope | Why it matters |
| --- | --- | --- | --- |
| Agent pods per node | 1 (DaemonSet) | Per node | No scaling knob; size node headroom |
| Session tags injected per assume | 6 (fixed) | Per assume | All count toward the STS session-tag ceiling |
| Credential cache | Once per node per role | Per node | The scalability win over per-pod IRSA |
| Association binding granularity | Exact (cluster, ns, SA) | Per association | One association per SA per cluster |
| Associations per account/cluster | No practical per-association charge | Per cluster | Design for clarity, not to minimize count |
| Eventual-consistency window | Seconds after create | Per association | Wait before rollout restart |
| Link-local ports | 80, 2703 | Per node | Must be reachable + in NO_PROXY |
| Cross-account hops | 1 (--target-role-arn) | Per association | A→B chain; not arbitrary depth |

The CLI surface

Every aws eks ... pod-identity and supporting command you will run, grouped by phase — keep this open as your command palette during the migration:

| Phase | Command | Purpose |
| --- | --- | --- |
| Setup | aws eks create-addon --cluster-name <c> --addon-name eks-pod-identity-agent | Install the agent DaemonSet |
| Setup | kubectl get ds eks-pod-identity-agent -n kube-system | Confirm the agent is ready on all nodes |
| Inventory | kubectl get sa -A -o json \| jq ... | List IRSA service accounts to migrate |
| Create | aws eks create-pod-identity-association ... | Bind (ns, SA) → role |
| Create | aws eks create-pod-identity-association --target-role-arn ... | Cross-account binding |
| Inspect | aws eks list-pod-identity-associations --cluster-name <c> | List all associations on a cluster |
| Inspect | aws eks describe-pod-identity-association --cluster-name <c> --association-id <id> | Full detail of one association |
| Cutover | kubectl rollout restart deployment -n <ns> | Switch pods to Pod Identity |
| Verify | kubectl exec ... -- aws sts get-caller-identity | Prove the effective identity |
| Verify | aws cloudtrail lookup-events ... AssumeRoleForPodIdentity | Confirm assumes + session tags |
| Update | aws eks update-pod-identity-association --cluster-name <c> --association-id <id> ... | Change role/target on an association |
| Rollback | aws eks delete-pod-identity-association --cluster-name <c> --association-id <id> | Remove the binding (falls back to IRSA) |

Trust and session tags: one policy, many namespaces

The role’s trust policy no longer references any OIDC issuer. It trusts the EKS service principal and grants two actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowEksPodIdentity",
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": [ "sts:AssumeRole", "sts:TagSession" ]
    }
  ]
}

sts:TagSession is required, not optional. EKS Pod Identity attaches a set of session tags on every assume, and without sts:TagSession the assume is denied. The six tags EKS injects:

| Session tag key | Value | Transitive? | Typical use in a condition |
| --- | --- | --- | --- |
| eks-cluster-arn | Full ARN of the cluster | Yes | Restrict a role to one cluster |
| eks-cluster-name | Cluster name | Yes | Human-readable cluster scoping |
| kubernetes-namespace | Pod’s namespace | Yes | Per-namespace permission scoping |
| kubernetes-service-account | Service account name | Yes | Per-SA scoping within a namespace |
| kubernetes-pod-name | Pod name | Yes | Forensics / fine-grained audit |
| kubernetes-pod-uid | Pod UID | Yes | Unique per-pod correlation in logs |

These tags are the lever that lets one role serve many workloads safely. Scope the permission policy per namespace with aws:PrincipalTag:

{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::tenant-data/*",
  "Condition": {
    "StringEquals": { "aws:PrincipalTag/kubernetes-namespace": "payments" }
  }
}

The same role assumed from a pod in analytics gets a different kubernetes-namespace tag and is denied. With IRSA you would have needed two roles and two trust conditions; here it is one role and a tag comparison. The trust policy is identical across all clusters — you never edit it per cluster, which is the operational point.
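
The payments-versus-analytics outcome can be sketched as a tiny evaluator. This is a deliberately simplified model of the StringEquals comparison, written in Python with hypothetical names; real IAM evaluation also handles other operators, explicit denies and policy variables.

```python
PRINCIPAL_TAG_PREFIX = "aws:PrincipalTag/"


def string_equals_matches(condition, session_tags):
    """Return True when every StringEquals entry matches the session tags.

    Simplified on purpose: only aws:PrincipalTag/* keys are understood,
    and any other key makes the condition fail closed.
    """
    for key, expected in condition.get("StringEquals", {}).items():
        if not key.startswith(PRINCIPAL_TAG_PREFIX):
            return False
        allowed = expected if isinstance(expected, list) else [expected]
        tag_name = key[len(PRINCIPAL_TAG_PREFIX):]
        if session_tags.get(tag_name) not in allowed:
            return False
    return True


condition = {"StringEquals": {"aws:PrincipalTag/kubernetes-namespace": "payments"}}
payments_pod = {"kubernetes-namespace": "payments", "kubernetes-service-account": "checkout"}
analytics_pod = {"kubernetes-namespace": "analytics", "kubernetes-service-account": "reports"}

print(string_equals_matches(condition, payments_pod))   # True
print(string_equals_matches(condition, analytics_pod))  # False
```

The role and the policy are identical in both calls; only the session tags attached at assume time differ.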

A cookbook of the session-tag conditions you will actually write in permission policies — copy the Condition shape for the scoping you need:

| Goal | Condition key | Operator | Example value | Effect |
| --- | --- | --- | --- | --- |
| One namespace only | aws:PrincipalTag/kubernetes-namespace | StringEquals | payments | Allow only from payments pods |
| One SA in a namespace | aws:PrincipalTag/kubernetes-service-account | StringEquals | checkout | Allow only the checkout SA |
| A set of namespaces | aws:PrincipalTag/kubernetes-namespace | StringEquals (list) | ["payments","ledger"] | Allow from either namespace |
| One cluster only | aws:PrincipalTag/eks-cluster-name | StringEquals | platform-prod-use1 | Pin a role to a single cluster |
| Namespace prefix (tenant) | aws:PrincipalTag/kubernetes-namespace | StringLike | tenant-a-* | Allow any tenant-a- namespace |
| Path-scope S3 by namespace | s3:prefix + aws:PrincipalTag/... | StringEquals | key prefix == namespace | Each namespace reads its own prefix |
| Deny a namespace explicitly | aws:PrincipalTag/kubernetes-namespace | StringEquals (in Deny) | sandbox | Hard-block a namespace regardless |

The two trust models, attribute by attribute — this is the heart of what changes:

| Attribute | IRSA trust policy | Pod Identity trust policy |
| --- | --- | --- |
| Principal | Federated: OIDC provider ARN | Service: pods.eks.amazonaws.com |
| Action | sts:AssumeRoleWithWebIdentity | sts:AssumeRole + sts:TagSession |
| Condition keys | <issuer>:sub, <issuer>:aud | none required (binding is the association) |
| Per-cluster edits | Yes (issuer is in the condition) | No (identical everywhere) |
| Who scopes the SA | The trust Condition | The EKS association |
| Namespace scoping | Hand-rolled sub string match | Built-in kubernetes-namespace tag |
| Breaks on cluster rebuild | Yes (issuer changes) | No |

The IAM actions involved, where each appears, and what omitting it does:

| Action | On which policy | Granted to | Effect if omitted |
| --- | --- | --- | --- |
| sts:AssumeRole | Role trust | pods.eks.amazonaws.com | No assume at all → AccessDenied |
| sts:TagSession | Role trust | pods.eks.amazonaws.com | Tagged assume denied → AccessDenied |
| sts:AssumeRole | Account-A pod role permission | The pod role | Cross-account chain to B fails |
| sts:TagSession | Account-B target trust (optional) | Account-A role | Tags don’t propagate cross-account |
| eks:CreatePodIdentityAssociation | IAM (operator) | Platform engineer/CI | Cannot create associations |
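
The first two rows are easy to lint before a rollout. Below is a minimal Python sketch (the helper names are hypothetical) that reports which of the two required actions a trust document fails to grant to pods.eks.amazonaws.com. It assumes the document is the parsed JSON of a role's trust policy, and it ignores wildcards such as sts:* and any Condition blocks.

```python
REQUIRED_ACTIONS = {"sts:AssumeRole", "sts:TagSession"}
POD_IDENTITY_PRINCIPAL = "pods.eks.amazonaws.com"


def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def missing_pod_identity_actions(trust_policy):
    """Return required actions not granted to the Pod Identity principal."""
    granted = set()
    for statement in _as_list(trust_policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. "Principal": "*", which this sketch does not model
        if POD_IDENTITY_PRINCIPAL not in _as_list(principal.get("Service", [])):
            continue
        granted.update(_as_list(statement.get("Action", [])))
    return sorted(REQUIRED_ACTIONS - granted)


incomplete = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(missing_pod_identity_actions(incomplete))
```

Run against every role in the inventory, a non-empty result flags the blanket AccessDenied case before any pod restarts.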

A subtle but important distinction — what scopes the binding versus what scopes the permissions:

| Layer | IRSA | Pod Identity | Owned by |
| --- | --- | --- | --- |
| Which SA may assume | Trust Condition sub | Association (ns, SA) | Platform (assoc) / IAM (trust) |
| What the role may do | Permission policy | Permission policy | IAM / security |
| Per-namespace limits | More roles or sub matches | aws:PrincipalTag/... conditions | IAM / security |

Step 1 — Map your IRSA service accounts to associations

Before changing anything, enumerate what you have. Every IRSA service account carries the eks.amazonaws.com/role-arn annotation:

kubectl get sa --all-namespaces -o json \
| jq -r '.items[]
  | select(.metadata.annotations["eks.amazonaws.com/role-arn"] != null)
  | [.metadata.namespace, .metadata.name,
     .metadata.annotations["eks.amazonaws.com/role-arn"]]
  | @tsv'

That gives you the exact (namespace, service account, role ARN) tuples to migrate. For each one you create an association — the role can stay the same; only its trust policy changes.

For a single service account:

aws eks create-pod-identity-association \
  --cluster-name platform-prod \
  --namespace payments \
  --service-account checkout \
  --role-arn arn:aws:iam::111122223333:role/payments-checkout
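
For more than a handful of service accounts, the inventory can drive the commands. The Python sketch below is a hypothetical helper that turns the tab-separated output of the jq query above into one create-pod-identity-association argument list per tuple. It only builds and prints the commands; running them is left to you.

```python
import shlex


def association_commands(tsv_text, cluster_name):
    """Build one CLI argument list per (namespace, SA, role ARN) line."""
    commands = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        namespace, service_account, role_arn = line.split("\t")
        commands.append([
            "aws", "eks", "create-pod-identity-association",
            "--cluster-name", cluster_name,
            "--namespace", namespace,
            "--service-account", service_account,
            "--role-arn", role_arn,
        ])
    return commands


inventory = "payments\tcheckout\tarn:aws:iam::111122223333:role/payments-checkout\n"
for argv in association_commands(inventory, "platform-prod"):
    print(shlex.join(argv))
```

Keeping the commands as argument lists makes it straightforward to pass them to subprocess.run later without shell quoting problems.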

In practice you want this in IaC. Terraform:

resource "aws_eks_pod_identity_association" "checkout" {
  cluster_name    = "platform-prod"
  namespace       = "payments"
  service_account = "checkout"
  role_arn        = aws_iam_role.payments_checkout.arn
}

data "aws_iam_policy_document" "pod_identity_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole", "sts:TagSession"]
    principals {
      type        = "Service"
      identifiers = ["pods.eks.amazonaws.com"]
    }
  }
}

Update each migrated role’s assume_role_policy to include data.aws_iam_policy_document.pod_identity_trust.json. If you keep the OIDC AssumeRoleWithWebIdentity statement and add the pods.eks.amazonaws.com statement, the role works under both mechanisms simultaneously — exactly what you want during cutover. See Terraform Module: AWS IAM Role and Terraform Module: AWS EKS Cluster for hardened module patterns.

Do not delete the IRSA annotation in the same change that creates the association. Pod Identity and IRSA can coexist on a role; keeping both live gives you a clean rollback.
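
If you manage trust documents outside Terraform, the dual-trust edit can be sketched as a pure function. This Python example (the names are hypothetical) appends the Pod Identity statement to an existing trust document, leaves the OIDC statement untouched, and does nothing if the principal is already trusted.

```python
import copy

POD_IDENTITY_PRINCIPAL = "pods.eks.amazonaws.com"


def _as_list(value):
    """Normalize a single value or a list to a list."""
    return value if isinstance(value, list) else [value]


def trusts_pod_identity(statement):
    """True if the statement names the Pod Identity service principal."""
    principal = statement.get("Principal", {})
    if not isinstance(principal, dict):
        return False
    return POD_IDENTITY_PRINCIPAL in _as_list(principal.get("Service", []))


def add_pod_identity_trust(trust_policy):
    """Return a copy of the trust policy that also trusts Pod Identity."""
    merged = copy.deepcopy(trust_policy)  # never mutate the caller's document
    statements = _as_list(merged.get("Statement", []))
    if not any(trusts_pod_identity(s) for s in statements):
        statements.append({
            "Sid": "AllowEksPodIdentity",
            "Effect": "Allow",
            "Principal": {"Service": POD_IDENTITY_PRINCIPAL},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        })
    merged["Statement"] = statements
    return merged
```

The function only checks whether the principal is present. It does not verify that an existing Pod Identity statement grants both actions, so pair it with a separate check if roles were edited by hand.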

Build the inventory as a table per service account — this is your migration tracker. Columns map one-to-one to what you need for each association:

| Namespace | Service account | Current IRSA role | Risk tier | Cutover wave | Cross-account? |
| --- | --- | --- | --- | --- | --- |
| kube-system | cluster-autoscaler | eks-cluster-autoscaler | Low | Wave 1 | No |
| observability | telemetry-shipper | telemetry-firehose | Low | Wave 1 | Yes (central) |
| internal-tools | backstage | backstage-readonly | Low | Wave 1 | No |
| data-pipeline | ingest | pod-id-ingest | Medium | Wave 2 | Yes |
| search | indexer | opensearch-writer | Medium | Wave 2 | No |
| payments | checkout | payments-checkout | High | Wave 3 | No |
| payments | ledger | payments-ledger | High | Wave 3 | No |

The create-pod-identity-association arguments, what each does, and whether it is required:

| Argument | Required | What it sets | Notes / gotcha |
| --- | --- | --- | --- |
| --cluster-name | Yes | Which cluster the binding applies to | Per-cluster; reuse the role across clusters |
| --namespace | Yes | Pod namespace half of the binding | Must exactly match the pod’s namespace |
| --service-account | Yes | SA name half of the binding | Must exactly match (typos → no match) |
| --role-arn | Yes | The role the pod assumes (account A) | Trust must include pods.eks + TagSession |
| --target-role-arn | No | A role in another account to chain to | Enables native cross-account |
| --disable-session-tags | No | Turns off the six session tags | Required when using --policy |
| --policy | No | Inline session policy to further scope | Cannot combine with session tags |
| --tags | No | Tags on the association resource itself | For your own inventory/cost tags |

Step 2 — Incremental rollout: per-namespace cutover

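The credential source a pod actually uses is decided at pod start. IRSA injects AWS_WEB_IDENTITY_TOKEN_FILE; Pod Identity injects AWS_CONTAINER_CREDENTIALS_FULL_URI. When a service account has both an IRSA annotation and an association, the intended result is that restarted pods use Pod Identity. Confirm that from the pod’s environment and from get-caller-identity instead of relying on SDK ordering alone: several SDK default chains (boto3 and the AWS SDK for Java 2.x, for example) evaluate web identity before container credentials when both sets of variables are present. So the cutover sequence per namespace is: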

  1. Create the association for every service account in the namespace.
  2. Add the pods.eks.amazonaws.com statement to each role’s trust policy (keep the OIDC statement).
  3. Roll the workloads so new pods pick up the injected variables:
     kubectl rollout restart deployment -n payments
  4. Confirm the pods now carry Pod Identity variables and that AWS calls still succeed (see Verify). Watch CloudTrail for AssumeRoleForPodIdentity events from the namespace.
  5. Only after a soak period, remove the eks.amazonaws.com/role-arn annotation and the OIDC trust statement.

Pick a low-risk namespace first — internal tooling, not payments. Because the association is an EKS resource and not a pod mutation, creating it has zero effect until pods restart, so you control the blast radius entirely through rollout restart.

Associations are eventually consistent — allow several seconds after create-pod-identity-association before restarting workloads, and never create associations inside a hot, high-availability code path. Do it in setup/init flows.

The exact credential-provider precedence, so you can predict which source a pod uses at any point in the cutover:

| Pod has IRSA vars | Pod has Pod Identity vars | SDK uses | State in migration |
| --- | --- | --- | --- |
| Yes | No | IRSA (web identity) | Before cutover (baseline) |
| Yes | Yes | Pod Identity (intended cutover state; confirm with get-caller-identity) | During cutover (dual-trust) |
| No | Yes | Pod Identity | After annotation removed |
| No | No | Node instance role (or fails) | Misconfigured: agent/assoc missing |

The order of operations and why each step is sequenced where it is — get the order wrong and you either break traffic or lose your rollback:

| # | Step | Effect on running pods | Reversible by | Why this order |
| --- | --- | --- | --- | --- |
| 1 | Install agent add-on | None | Remove add-on | Endpoint must exist before any cutover |
| 2 | Create association | None until restart | Delete association | Pre-stage binding with zero blast radius |
| 3 | Add pods.eks to trust (keep OIDC) | None | Remove statement | Role must accept the assume before restart |
| 4 | rollout restart namespace | Pods switch to Pod Identity | rollout restart after deleting assoc | The actual cutover; controlled per namespace |
| 5 | Soak + watch CloudTrail | None | n/a | Prove it before removing the safety net |
| 6 | Remove SA annotation | New pods lose IRSA fallback | Re-add annotation + restart | Only after soak; this reduces reversibility |
| 7 | Remove OIDC trust statement | None to pods; OIDC now dead | Re-add statement | Final, deliberate; do last |
| 8 | Retire OIDC provider (when unused) | None | Recreate provider | Cleanup once no role uses it |

The rollout restart verbs you will use per workload type — not everything is a Deployment:

| Workload type | Restart command | Notes |
| --- | --- | --- |
| Deployment | kubectl rollout restart deployment -n <ns> | Rolling, respects surge/unavailable |
| StatefulSet | kubectl rollout restart statefulset -n <ns> | Ordered; slower, watch readiness |
| DaemonSet | kubectl rollout restart daemonset -n <ns> | One per node; e.g. telemetry shippers |
| CronJob | (next scheduled run picks it up) | New pods get the vars automatically |
| Bare Pod (no controller) | kubectl delete pod (it must be recreated) | Anti-pattern; prefer a controller |

Step 3 — Cross-account and multi-cluster access patterns

Two patterns cover almost everything.

Multi-cluster, same role. This is where Pod Identity shines. Create the identical association in each cluster pointing at the same role; the trust policy needs no edits because no issuer is referenced. Same Terraform module, different cluster_name:

resource "aws_eks_pod_identity_association" "checkout" {
  for_each        = toset(["platform-prod-use1", "platform-prod-euw1"])
  cluster_name    = each.value
  namespace       = "payments"
  service_account = "checkout"
  role_arn        = aws_iam_role.payments_checkout.arn
}

Cross-account. The cluster is in account A; the workload needs a role in account B. Pod Identity supports this natively with --target-role-arn: the association’s role-arn (in account A) is assumed first, then it assumes the target role in account B, and the target’s credentials are injected into the pod.

aws eks create-pod-identity-association \
  --cluster-name platform-prod \
  --namespace data-pipeline \
  --service-account ingest \
  --role-arn arn:aws:iam::111122223333:role/pod-id-ingest \
  --target-role-arn arn:aws:iam::444455556666:role/cross-acct-ingest

The account-A role trusts pods.eks.amazonaws.com as above and must be allowed to sts:AssumeRole on the account-B role. The account-B target role’s trust policy then trusts the account-A role ARN. This replaces the IRSA “role chaining via SDK config” hack with a first-class flag, and the chain is auditable in EKS rather than buried in app config — the deeper assume-role hygiene (External ID, confused-deputy) is covered in Secure Cross-Account Access.
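
The two IAM documents that make the chain work can be sketched as follows. This is an illustrative Python helper with hypothetical names, not a prescribed layout: it produces the permission policy for the account-A role and the trust policy for the account-B role. Including sts:TagSession on both sides is an assumption made here so that session tags can carry across the hop; drop it with propagate_tags=False if you do not want that.

```python
def cross_account_policies(pod_role_arn, target_role_arn, propagate_tags=True):
    """Return (account-A permission policy, account-B trust policy)."""
    actions = ["sts:AssumeRole"]
    if propagate_tags:
        actions.append("sts:TagSession")  # assumption: lets session tags cross accounts

    pod_role_permission = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": target_role_arn}
        ],
    }
    target_role_trust = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": pod_role_arn},
                "Action": list(actions),
            }
        ],
    }
    return pod_role_permission, target_role_trust


permission, trust = cross_account_policies(
    "arn:aws:iam::111122223333:role/pod-id-ingest",
    "arn:aws:iam::444455556666:role/cross-acct-ingest",
)
print(permission["Statement"][0]["Resource"])
print(trust["Statement"][0]["Principal"]["AWS"])
```

The account-A role still needs its own trust policy for pods.eks.amazonaws.com; this helper covers only the second hop.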

You can also attach a session policy that further restricts the injected credentials with --policy. When you use --policy you must pass --disable-session-tags, because a session policy and EKS session tags cannot be combined on the same assume:

aws eks create-pod-identity-association \
  --cluster-name platform-prod \
  --namespace data-pipeline \
  --service-account ingest \
  --role-arn arn:aws:iam::111122223333:role/pod-id-ingest \
  --disable-session-tags \
  --policy '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"s3:GetObject","Resource":"arn:aws:s3:::ingest-bucket/*"}]}'

Be deliberate here: disabling session tags removes the kubernetes-namespace lever, so any namespace-scoped conditions on the role stop matching. Use --policy only when you intend to scope through the inline policy instead.
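The pairing rule can be checked before any API call is made. The following is a minimal sketch, assuming a hypothetical wrapper function named assoc_flags_ok (it is not part of the AWS CLI) that a pipeline runs over the arguments it is about to pass to create-pod-identity-association:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not an AWS CLI feature): reject an
# association that sets --policy without also disabling session tags.
assoc_flags_ok() {
  local has_policy=0 tags_disabled=0 arg
  for arg in "$@"; do
    case "$arg" in
      --policy|--policy=*)    has_policy=1 ;;
      --disable-session-tags) tags_disabled=1 ;;
    esac
  done
  if (( has_policy && ! tags_disabled )); then
    echo "--policy requires --disable-session-tags" >&2
    return 1
  fi
  return 0
}
```

A pipeline could call assoc_flags_ok "$@" and only run the real aws eks command when it returns 0. The check looks at flags only; it does not validate the policy JSON itself.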

The access patterns side by side — pick the row that matches your topology:

| Pattern | Association shape | Trust on the assumed role | Scoping mechanism | When to use |
| --- | --- | --- | --- | --- |
| Single cluster, one role per SA | role-arn only | pods.eks + TagSession | Permission policy | The default case |
| Multi-cluster, shared role | Same role-arn, N associations | pods.eks + TagSession | eks-cluster-arn tag | Fleets, blue/green, per-region |
| One role, many namespaces | role-arn only | pods.eks + TagSession | kubernetes-namespace tag | Tenant isolation on one role |
| Cross-account | role-arn (A) + --target-role-arn (B) | A: pods.eks; B: trusts A’s ARN | Target’s permission policy | Central account owns the data |
| Hard-scoped, no tags | role-arn + --policy + --disable-session-tags | pods.eks + TagSession | Inline session policy | Extra least-privilege per assoc |

What --policy and --disable-session-tags cost you — the trade-off you are accepting:

| You enable | You gain | You lose | Net advice |
| --- | --- | --- | --- |
| Session tags (default) | kubernetes-namespace scoping, rich audit | Cannot use --policy on same assoc | Keep for most workloads |
| --policy (needs tags off) | Per-association least-privilege ceiling | All namespace-tag conditions stop matching | Use for narrow, single-namespace roles |
| --target-role-arn | Native cross-account, auditable in EKS | One extra assume hop (negligible latency) | Preferred over SDK role-chaining |

Verify

Confirm the migration end to end, from association down to an actual signed AWS call.

List associations and confirm the binding:

aws eks list-pod-identity-associations --cluster-name platform-prod
aws eks describe-pod-identity-association \
  --cluster-name platform-prod --association-id a-abc123def456

Confirm the pod received Pod Identity variables (not IRSA’s):

kubectl exec -n payments deploy/checkout -- env | grep AWS_CONTAINER
# AWS_CONTAINER_CREDENTIALS_FULL_URI=http://169.254.170.23/v1/credentials
# AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE=/var/run/secrets/pods.eks.amazonaws.com/serviceaccount/eks-pod-identity-token

Verify effective permissions from inside the pod — this is the only check that proves the whole path works:

kubectl exec -n payments deploy/checkout -- aws sts get-caller-identity

The returned Arn should be an assumed-role session of the associated role (an arn:aws:sts::...:assumed-role/... value), not the node instance role. If you see the node role, the agent is not serving credentials — check the proxy/NO_PROXY settings and that the pod’s service account name exactly matches the association.

Cross-check the source of truth in CloudTrail. Pod Identity assumes surface as AssumeRoleForPodIdentity calls by the EKS Auth service; the session tags appear in the event, letting you confirm the namespace and service account that triggered each assume — see AWS CloudTrail and Config: Audit and Compliance at Scale for wiring this into an org trail.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleForPodIdentity \
  --max-results 10

The verification checklist as a table — each check, the exact command, and the pass/fail you are looking for:

| # | Check | Command | Pass looks like | Fail looks like |
| --- | --- | --- | --- | --- |
| 1 | Agent DaemonSet ready | kubectl get ds eks-pod-identity-agent -n kube-system | DESIRED == READY on all nodes | 0 ready / not found |
| 2 | Association exists | aws eks list-pod-identity-associations --cluster-name <c> | Row with your ns/SA/role | Empty / wrong SA |
| 3 | Pod has PI vars | kubectl exec ... env \| grep AWS_CONTAINER | FULL_URI present | Only AWS_WEB_IDENTITY... |
| 4 | Effective identity | kubectl exec ... aws sts get-caller-identity | assumed-role/<your-role>/... | assumed-role/<node-role>/<instance-id> (node role) |
| 5 | Real API call | kubectl exec ... aws s3 ls s3://<bucket> | Lists objects | AccessDenied |
| 6 | CloudTrail event | aws cloudtrail lookup-events ... AssumeRoleForPodIdentity | Events with session tags | None / AccessDenied |
| 7 | No throttling | CloudWatch STS/EKS Auth metrics | Flat error rate | ThrottlingException spikes |

What the get-caller-identity ARN tells you in each state — read the ARN shape, not just success:

| Arn you see | Means | Action |
| --- | --- | --- |
| arn:aws:sts::A:assumed-role/payments-checkout/... | Pod Identity working, expected role | Done; soak and proceed |
| arn:aws:sts::A:assumed-role/<old-IRSA-role>/... | IRSA still winning (annotation present, no PI var) | Check association + restart |
| arn:aws:sts::B:assumed-role/cross-acct-ingest/... | Cross-account chain working | Verify target permissions |
| arn:aws:sts::A:assumed-role/<node-role>/... | Agent not serving creds | Fix NO_PROXY / agent / SA name |
| arn:aws:iam::A:user/... | Not on a role at all | Wrong credential source entirely |
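Reading the ARN shape can be scripted so the same check runs identically on every namespace you cut over. This is a minimal sketch, assuming you pass in the expected pod role name and the node instance role name yourself; the function name and the output labels are illustrative, and the patterns assume the standard aws partition:

```shell
#!/usr/bin/env bash
# Classify the Arn returned by `aws sts get-caller-identity`.
# $1 = Arn, $2 = expected pod role name, $3 = node instance role name.
# Nothing here calls AWS; it only pattern-matches the string.
classify_caller_arn() {
  local arn=$1 expected_role=$2 node_role=$3
  case "$arn" in
    arn:aws:sts::*:assumed-role/"$expected_role"/*) echo "pod-identity-ok" ;;
    arn:aws:sts::*:assumed-role/"$node_role"/*)     echo "node-role-fallthrough" ;;
    arn:aws:sts::*:assumed-role/*)                  echo "other-role" ;;
    arn:aws:iam::*:user/*)                          echo "iam-user" ;;
    *)                                              echo "unrecognised" ;;
  esac
}
```

In practice the first argument would come from something like kubectl exec -n payments deploy/checkout -- aws sts get-caller-identity --query Arn --output text. The other-role result covers both the old IRSA role and a cross-account target, so compare the account ID and role name by eye when you see it.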

Architecture at a glance

The diagram traces the credential path a migrated pod takes, left to right, and maps each migration failure to the exact hop where it bites. Read it as a pipeline. A pod in payments using SA checkout (no IRSA annotation once cutover completes) starts, and the SDK reads AWS_CONTAINER_CREDENTIALS_FULL_URI and asks the Pod Identity Agent — a DaemonSet on the node’s hostNetwork listening on the link-local 169.254.170.23:80/:2703. That request must bypass any HTTP proxy, which is why NO_PROXY carries the link-local address. The agent calls EKS Auth, which looks up the association (cluster, namespace, SA) → role, attaches the six session tags (including kubernetes-namespace), and performs the assume — which requires sts:TagSession on the role’s trust policy. STS returns credentials for the pod role (and, for cross-account, chains via --target-role-arn to a target role in account B), and the pod makes a signed call to S3 or Firehose, scoped by aws:PrincipalTag/kubernetes-namespace. Every assume is recorded in CloudTrail as AssumeRoleForPodIdentity with the tags attached.

The numbered badges are the five places this path breaks during a migration, and the legend narrates each as symptom → confirm → fix. Notice they cluster on the agent and trust hops: badge 1 is the proxy swallowing the link-local request (the single most common bring-up failure); badge 2 is the agent simply not on the node; badge 3 is the missing sts:TagSession that denies every tagged assume; badge 4 is a dual-source mix-up where the old IRSA role wins or the SA name mismatches; badge 5 is the session-tag-versus---policy clash that silently breaks namespace scoping. The diagnostic method is the same every time: read aws sts get-caller-identity from inside the pod, see which role (or the node role) you got, and that tells you which hop failed.

Figure: EKS Pod Identity credential path after migrating from IRSA (pod → Pod Identity Agent → EKS Auth → STS → signed AWS call, with every assume recorded in CloudTrail), with five numbered badges marking the migration failure points.

Real-world scenario

Northwind Pay, a fintech platform team, ran 11 EKS clusters across two regions for blue/green and tenant isolation. A shared “telemetry shipper” DaemonSet on every cluster needed firehose:PutRecordBatch to a central account. Under IRSA, that meant 11 OIDC providers registered as trusted in the central account’s role, and an 11-clause StringEquals block in the trust policy keyed on each cluster’s issuer URL. Every cluster rebuild changed an issuer and silently broke shipping until someone updated the trust policy — they had been paged for it twice, and the second incident lost 40 minutes of telemetry during a release.

The constraint: they could not coordinate an IAM change every time the platform team recycled a cluster, and security would not approve a wildcard trust. The team’s first instinct was to script the trust-policy update into the cluster-rebuild pipeline, but security rejected it — a pipeline with iam:UpdateAssumeRolePolicy on a cross-account role was a bigger risk than the problem. Pod Identity removed the need entirely.

The fix was Pod Identity with a single cross-account target role and one association per cluster, all generated from the same module. The central role’s trust policy stopped referencing any cluster at all:

resource "aws_eks_pod_identity_association" "telemetry" {
  for_each        = toset(var.cluster_names) # all 11
  cluster_name    = each.value
  namespace       = "observability"
  service_account = "telemetry-shipper"
  role_arn        = aws_iam_role.pod_id_telemetry.arn          # local per-account
  target_role_arn = "arn:aws:iam::999988887777:role/firehose-writer"
}

The firehose-writer role in the central account trusts only the per-account pod_id_telemetry role ARN — a single, static principal — and scopes writes to the namespace using aws:PrincipalTag/kubernetes-namespace. They ran the cutover one cluster at a time: created the association, added pods.eks to the local role’s trust (keeping OIDC), rollout restarted the DaemonSet, and watched CloudTrail for AssumeRoleForPodIdentity from observability before touching the next cluster. The whole fleet took three afternoons.

Cluster rebuilds became a non-event: the new cluster’s association is created by the same for_each, the trust policy never changes, and security reviews one static cross-account trust instead of an issuer list. The 11-clause condition block went to zero, the pipeline lost its dangerous IAM permission, and the telemetry-loss pages stopped. The lesson on the wall: “If your trust policy has a line per cluster, you are one rebuild away from an outage — move the binding out of IAM.”

The migration as a timeline, because the order of moves is the lesson:

| Stage | Action | Effect | Reversible by |
| --- | --- | --- | --- |
| Before | 11 OIDC providers + 11-clause trust | Pages on every rebuild | n/a (the problem) |
| Day 1 | Install agent add-on on all 11 | Endpoint ready; no pod change | Remove add-on |
| Day 1 | Add pods.eks to local roles (keep OIDC) | Roles accept either assume | Remove statement |
| Day 2 | Create associations via for_each | No effect until restart | Delete associations |
| Day 2 | rollout restart DaemonSet, cluster by cluster | Shippers switch to Pod Identity | rollout restart after deleting assoc |
| Day 3 | Soak + CloudTrail confirms all 11 | Telemetry flowing on PI | n/a |
| +2 weeks | Remove SA annotations + OIDC trust | OIDC retired | Re-add (kept in git) |
| +2 weeks | Delete 11 OIDC providers | Sprawl gone | Recreate from IaC |

Advantages and disadvantages

Pod Identity is the right default for new clusters and the right destination for most IRSA fleets, but it is not free of trade-offs. Weigh it honestly:

| Advantages (why to migrate) | Disadvantages (what it costs / where it bites) |
| --- | --- |
| One trust policy across all clusters, never edited per cluster | Adds a DaemonSet to operate, patch, and monitor on every node |
| Cluster rebuild no longer breaks trust (no issuer in the policy) | A new failure mode: proxy/NO_PROXY swallowing the link-local request |
| Platform team onboards clusters with no IAM ticket | sts:TagSession is mandatory and easy to forget → silent AccessDenied |
| Built-in kubernetes-namespace session tag → one role, many namespaces | Session tags and --policy are mutually exclusive on an assume |
| Assume is once-per-node-per-role → far less STS throttling at scale | Older SDK versions may not read the container-credentials vars |
| Native cross-account via --target-role-arn, auditable in EKS | Cross-account adds an extra assume hop to reason about |
| Fully reversible during cutover (dual-trust + rollout restart) | Reversibility ends once you remove the annotation/OIDC statement |
| Associations are first-class API/IaC resources, easy to inventory | Eventually consistent: must wait before restarting workloads |

A head-to-head decision matrix — for each situation, which mechanism wins and why:

| Situation | Choose | Why |
| --- | --- | --- |
| New (greenfield) cluster | Pod Identity | No OIDC provider to stand up; simpler from day one |
| Single long-lived cluster, static roles | Either (no urgency) | IRSA is set-and-forget; migrate opportunistically |
| 5+ clusters | Pod Identity | One trust policy beats N issuer registrations |
| Clusters recycled frequently | Pod Identity | Issuer churn breaks IRSA trust on every rebuild |
| Role shared across clusters | Pod Identity | Same role, new association; no trust edits |
| Role shared across accounts | Pod Identity | Native --target-role-arn, auditable in EKS |
| Many namespaces, one role | Pod Identity | kubernetes-namespace session tag scoping |
| Platform team must self-serve identity | Pod Identity | Association needs no IAM ticket |
| Very large fleet hitting STS throttling | Pod Identity | Once-per-node assume cuts STS calls |
| Air-gapped / no agent allowed on nodes | IRSA | Pod Identity requires the agent DaemonSet |
| SDK too old to read container creds | IRSA (until upgraded) | Pod Identity needs container-credentials support |
| Need zero added node components | IRSA | No DaemonSet; OIDC is control-plane only |

Pod Identity is the right choice when you run more than a handful of clusters, recycle clusters often, share roles across clusters or accounts, or want the platform team to own workload identity without IAM tickets. IRSA remains acceptable for a single, long-lived cluster with a small, static set of roles where the OIDC provider is set-and-forget — there is no urgency to migrate a stable single cluster, though new clusters should default to Pod Identity. The disadvantages are all operational and knowable: run the agent, remember sts:TagSession, and respect the session-tag/--policy rule, and none of them surprises you.

Hands-on lab

Migrate one service account from IRSA to Pod Identity end to end on an existing cluster, prove it in CloudTrail, then roll back — all using a low-cost S3-read role. Run in a shell with aws, kubectl, and jq. Assumes a cluster platform-prod with at least one IRSA service account; adjust names.

Step 1 — Environment and inventory.

CLUSTER=platform-prod
NS=internal-tools
SA=backstage
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# Find the IRSA role this SA uses today
ROLE_ARN=$(kubectl get sa $SA -n $NS -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
echo "Migrating $NS/$SA -> $ROLE_ARN"

Expected: the role ARN prints. If empty, that SA is not IRSA-backed — pick another.

Step 2 — Install the Pod Identity Agent add-on (idempotent).

aws eks create-addon --cluster-name $CLUSTER --addon-name eks-pod-identity-agent 2>/dev/null || true
kubectl rollout status daemonset eks-pod-identity-agent -n kube-system --timeout=120s

Expected: daemon set "eks-pod-identity-agent" successfully rolled out.

Step 3 — Add the Pod Identity trust statement to the existing role (keep OIDC).

ROLE_NAME=$(echo $ROLE_ARN | awk -F/ '{print $NF}')
# Append a pods.eks statement; in real life merge with the existing OIDC statement in IaC
cat > /tmp/pi-trust.json <<'EOF'
{ "Version":"2012-10-17","Statement":[
  {"Effect":"Allow","Principal":{"Service":"pods.eks.amazonaws.com"},
   "Action":["sts:AssumeRole","sts:TagSession"]} ]}
EOF
echo "Merge /tmp/pi-trust.json into $ROLE_NAME's trust policy (keep the OIDC statement)."

In production this is a reviewed Terraform change; for the lab, edit the role’s trust policy in the console to add the statement above alongside the existing OIDC one.
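For orientation, the merged dual-trust document might look like the sketch below. The account ID, OIDC issuer path, namespace, and service account passed in at the bottom are placeholders, and the OIDC statement shown is the common IRSA shape rather than a copy of your role's existing one; in a real merge, keep your existing OIDC statement verbatim and only append the pods.eks statement.

```shell
#!/usr/bin/env bash
# Print a dual-trust policy: an IRSA (OIDC) statement plus the Pod Identity
# statement. All arguments are illustrative placeholders.
# $1 = account ID, $2 = OIDC issuer without https://, $3 = namespace, $4 = SA
render_dual_trust() {
  local account=$1 issuer=$2 ns=$3 sa=$4
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::${account}:oidc-provider/${issuer}" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": { "StringEquals": {
        "${issuer}:sub": "system:serviceaccount:${ns}:${sa}",
        "${issuer}:aud": "sts.amazonaws.com"
      } }
    },
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
EOF
}

# Example call with placeholder values; prints the JSON to stdout.
render_dual_trust 111122223333 \
  oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE \
  internal-tools backstage
```

If you prefer the CLI to the console for the lab, the output can be saved to a file and applied with aws iam update-assume-role-policy --role-name "$ROLE_NAME" --policy-document file://<path>. That call replaces the whole trust policy, which is why the existing OIDC statement has to be in the file.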

Step 4 — Create the association.

aws eks create-pod-identity-association \
  --cluster-name $CLUSTER --namespace $NS --service-account $SA \
  --role-arn $ROLE_ARN
aws eks list-pod-identity-associations --cluster-name $CLUSTER \
  --query "associations[?namespace=='$NS' && serviceAccount=='$SA']"

Expected: one association row with your namespace, SA, and role.

Step 5 — Roll the workload and verify the switch.

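sleep 10   # associations are eventually consistent; give the binding time to propagate
# Restarts every Deployment in $NS; name one (deployment/<name>) if the namespace is shared
kubectl rollout restart deployment -n $NS
kubectl rollout status deployment -n $NS --timeout=120s

# Assumes the workload's pods carry the label app=$SA and the image ships the aws CLI
POD=$(kubectl get pod -n $NS -l app=$SA -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n $NS $POD -- env | grep AWS_CONTAINER      # expect the FULL_URI variable
kubectl exec -n $NS $POD -- aws sts get-caller-identity   # expect the associated role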

Expected: AWS_CONTAINER_CREDENTIALS_FULL_URI is present, and get-caller-identity returns arn:aws:sts::<account>:assumed-role/<role>/<session>, not the node role.

Step 6 — Confirm in CloudTrail.

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRoleForPodIdentity \
  --max-results 5 --query "Events[].CloudTrailEvent" --output text | head

Expected: events from the EKS Auth service; the session tags include kubernetes-namespace=internal-tools.

Step 7 — Roll back (prove reversibility), then clean up.

ASSOC_ID=$(aws eks list-pod-identity-associations --cluster-name $CLUSTER \
  --query "associations[?namespace=='$NS' && serviceAccount=='$SA'].associationId" --output text)
aws eks delete-pod-identity-association --cluster-name $CLUSTER --association-id $ASSOC_ID
kubectl rollout restart deployment -n $NS   # falls back to IRSA (annotation still present)
kubectl rollout status deployment -n $NS --timeout=120s   # wait so the exec below lands on a new pod
kubectl exec -n $NS $(kubectl get pod -n $NS -l app=$SA -o jsonpath='{.items[0].metadata.name}') \
  -- aws sts get-caller-identity

Expected after rollback: get-caller-identity again shows the IRSA role via web identity — proving the migration is reversible. Leave the role trust as-is for re-runs, or remove the pods.eks statement to fully restore the original state.

Common mistakes & troubleshooting

This is the section you will return to mid-migration. Eight real failure modes — each as symptom → root cause → how to confirm → fix. Scan the playbook table, then read the detail for the row that matches.

| # | Symptom | Root cause | Confirm (exact command) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Pod gets node role, not the associated role | Agent request routed to HTTP proxy | kubectl exec ... aws sts get-caller-identity → node role; env \| grep -i proxy | Add 169.254.170.23,[fd00:ec2::23] to NO_PROXY |
| 2 | Same as #1, no proxy in play | Agent DaemonSet missing/not ready | kubectl get ds eks-pod-identity-agent -n kube-system → 0 ready | aws eks create-addon --addon-name eks-pod-identity-agent |
| 3 | AccessDenied on every call | Trust policy omits sts:TagSession | CloudTrail AssumeRoleForPodIdentity = AccessDenied | Add sts:TagSession to the pods.eks trust statement |
| 4 | Pod still uses the old IRSA role | Pod not restarted after association | env \| grep AWS_CONTAINER shows no FULL_URI | kubectl rollout restart the workload |
| 5 | Association exists but no effect | SA name/namespace mismatch | aws eks describe-pod-identity-association ... vs pod’s SA | Recreate association with exact (ns, SA) |
| 6 | Namespace-scoped call denied after adding --policy | Session tags disabled, PrincipalTag no longer set | Call worked before --policy; aws:PrincipalTag/... condition now fails | Scope via the inline policy, or drop --policy |
| 7 | Cross-account call AccessDenied | Account-A role can’t assume B, or B doesn’t trust A | CloudTrail in B shows no/AccessDenied assume | Allow A sts:AssumeRole on B; B trusts A’s ARN |
| 8 | Intermittent ThrottlingException from STS at scale | Still on IRSA per-pod assumes on huge fleet | CloudWatch STS ThrottlingException count rising | Finish Pod Identity cutover (once-per-node assume) |

A faster triage table — start from what you observe and jump to the likely cause and first move:

| If you see… | It’s probably… | Do this first |
| --- | --- | --- |
| Node-role ARN in the pod, proxy vars set | Proxy swallowing link-local | Add link-local to NO_PROXY |
| Node-role ARN, no proxy, agent 0/N | Agent not on node | Install/repair the add-on |
| AccessDenied on every call, fresh setup | Missing sts:TagSession | Add sts:TagSession to trust |
| No FULL_URI env var in the pod | Pod not restarted | rollout restart the workload |
| FULL_URI present but old role in caller | Association SA mismatch | Recreate with exact (ns, SA) |
| Worked, then broke after tightening | --policy disabled session tags | Scope via inline policy or restore tags |
| Cross-account call denied, local fine | One side of the chain untrusted | Fix A→B allow and B-trusts-A |
| STS throttling under scale events | IRSA per-pod assumes | Finish the Pod Identity cutover |
| Creds expire on a long job | SDK too old to refresh | Upgrade the AWS SDK |

1 — The proxy swallows the link-local request

The single most common bring-up failure. The SDK reads AWS_CONTAINER_CREDENTIALS_FULL_URI=http://169.254.170.23/... and issues an HTTP request — which your cluster-wide HTTP_PROXY/HTTPS_PROXY env then routes to the egress proxy, which has no idea what 169.254.170.23 is. The request fails, the SDK falls through to the node instance role, and your pod silently gets the node’s permissions.

Confirm. aws sts get-caller-identity inside the pod returns the node role’s assumed-role ARN, and env | grep -i proxy shows HTTP_PROXY/HTTPS_PROXY set without the link-local in NO_PROXY. Fix. Add both addresses to NO_PROXY everywhere proxy vars are set:

# In the workload's env (Deployment spec, ConfigMap, or base image)
NO_PROXY=169.254.170.23,[fd00:ec2::23],169.254.169.254,localhost,127.0.0.1,.svc,.cluster.local
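A quick way to audit this across workloads is to test the NO_PROXY string itself. The following is a minimal sketch, assuming exact-match entries only (it does not understand CIDR ranges or wildcard suffixes that some proxies accept); the function name is illustrative:

```shell
#!/usr/bin/env bash
# Return 0 if a NO_PROXY-style comma list contains both Pod Identity Agent
# link-local addresses; otherwise report what is missing on stderr.
# Exact string match only: CIDR or wildcard entries are not interpreted.
no_proxy_covers_agent() {
  local list=",${1//[[:space:]]/}," missing=() addr
  for addr in "169.254.170.23" "[fd00:ec2::23]"; do
    [[ "$list" == *",$addr,"* ]] || missing+=("$addr")
  done
  if (( ${#missing[@]} )); then
    echo "NO_PROXY is missing: ${missing[*]}" >&2
    return 1
  fi
}
```

Inside a pod you might feed it the live value, for example no_proxy_covers_agent "$(kubectl exec -n payments deploy/checkout -- printenv NO_PROXY)". Some runtimes read the lowercase no_proxy instead, so check both if the result surprises you.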

2 — The agent is not on the node

If the eks-pod-identity-agent add-on was never installed (or the DaemonSet failed to schedule on some nodes), there is no link-local endpoint to answer, and you get the same node-role fallthrough as #1 — but without a proxy in the picture.

Confirm. kubectl get ds eks-pod-identity-agent -n kube-system shows 0 ready or not found; kubectl describe ds reveals scheduling problems (taints, node selectors). Fix. Install the add-on and wait for the DaemonSet to roll out on every node; if some nodes are tainted, ensure the agent tolerates them.

3 — Missing sts:TagSession denies every assume

You added pods.eks.amazonaws.com to the trust policy with sts:AssumeRole but forgot sts:TagSession. Because every Pod Identity assume is tagged, the assume is denied — and the error is a blanket AccessDenied, not “you forgot TagSession”, so it looks like a permissions problem on the permission policy.

Confirm. CloudTrail shows AssumeRoleForPodIdentity with errorCode: AccessDenied. Fix. Add sts:TagSession alongside sts:AssumeRole in the pods.eks trust statement. This is the most common “I set everything up and it still won’t work” cause.

4 — The pod was never restarted

Creating an association does nothing to running pods. If you create the association and check immediately, the still-running pod has only the IRSA variables and keeps using IRSA — or, if the annotation was already removed, gets the node role.

Confirm. kubectl exec ... env | grep AWS_CONTAINER returns nothing (no FULL_URI). Fix. kubectl rollout restart the workload; new pods get the injected variables. Remember associations are eventually consistent — wait several seconds after creating before restarting.
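Instead of a fixed sleep, the wait can be a bounded poll. The sketch below assumes a generic helper, wait_until, plus a hypothetical probe, association_visible, that treats "the association is listed" as a stand-in for "the association has propagated"; that is an approximation, so a short extra pause before the restart is still reasonable.

```shell
#!/usr/bin/env bash
# Poll a command until it succeeds or a deadline passes.
# $1 = timeout in seconds, $2 = interval in seconds, rest = command to run.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical probe: succeeds once the association shows up in the list.
# Relies on CLUSTER, NS and SA being set, as in the lab.
association_visible() {
  local n
  n=$(aws eks list-pod-identity-associations --cluster-name "$CLUSTER" \
        --namespace "$NS" --service-account "$SA" \
        --query 'length(associations)' --output text) || return 1
  [ "$n" -ge 1 ]
}
```

Usage would be along the lines of wait_until 60 3 association_visible && kubectl rollout restart deployment -n "$NS", which fails loudly after a minute rather than restarting into a binding that does not exist yet.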

5 — Service-account name or namespace mismatch

The association binds an exact (namespace, service account) pair. A typo, a pod using a different SA than you assumed, or the wrong namespace means no association matches and the pod gets the node role.

Confirm. Compare aws eks describe-pod-identity-association output against the pod’s actual SA: kubectl get pod <p> -n <ns> -o jsonpath='{.spec.serviceAccountName}'. Fix. Recreate the association with the exact pair, or fix the pod spec to use the SA the association names.

6 — Session-tag scoping broke after adding --policy

You added --policy to tighten an association and (correctly) paired it with --disable-session-tags — but the role’s permission policy scopes access with aws:PrincipalTag/kubernetes-namespace. With tags disabled, that tag is no longer present, so the namespace condition never matches and the call is denied.

Confirm. The exact call worked before the --policy change; the role’s permission policy contains an aws:PrincipalTag/kubernetes-namespace condition. Fix. Either drop --policy and rely on session tags, or move the scoping into the inline --policy itself (it is already namespace-specific because it is attached to one association).
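For reference, this is the kind of permission policy that stops matching once tags are disabled. It is a minimal sketch with a hypothetical bucket name and prefix layout; the only part taken from the text is the aws:PrincipalTag/kubernetes-namespace key.

```shell
#!/usr/bin/env bash
# Print a permission policy that scopes S3 reads by the Pod Identity
# namespace session tag. The bucket name is a hypothetical placeholder.
# The heredoc delimiter is quoted so ${aws:PrincipalTag/...} stays literal
# for IAM to resolve, instead of being expanded by the shell.
render_namespace_scoped_policy() {
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-tenant-data/${aws:PrincipalTag/kubernetes-namespace}/*"
    }
  ]
}
EOF
}

render_namespace_scoped_policy
```

With session tags on, a pod in payments resolves the resource to the payments/ prefix. With --disable-session-tags the variable has no value to resolve, so the statement matches nothing and the call is denied, which is exactly the symptom above.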

7 — Cross-account chain denied

With --target-role-arn, two trusts must line up: the account-A pod role must be allowed to sts:AssumeRole the account-B target, and the account-B target’s trust policy must trust the account-A role ARN. Miss either and you get AccessDenied on the chained assume.

Confirm. CloudTrail in account B shows either no assume or AccessDenied. Fix. Add an sts:AssumeRole allow on the B role to A’s permission policy, and add A’s role ARN as a trusted principal in B’s trust policy.
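The two halves of the chain can be written down side by side. This is a sketch under stated assumptions: the role ARNs are the example values used earlier in this article, and sts:TagSession is included on the assumption that session tags travel along the chain when they are not disabled, so confirm that against the current EKS documentation for your setup.

```shell
#!/usr/bin/env bash
# Print the two policy documents a --target-role-arn chain depends on.
# sts:TagSession is an assumption (tags carried along the chain); drop it
# if you have verified it is not needed in your configuration.

# Permission policy for the account-A pod role. $1 = account-B target role ARN.
render_a_permission_policy() {
  cat <<EOF
{ "Version": "2012-10-17", "Statement": [
  { "Effect": "Allow",
    "Action": ["sts:AssumeRole", "sts:TagSession"],
    "Resource": "$1" } ] }
EOF
}

# Trust policy for the account-B target role. $1 = account-A pod role ARN.
render_b_trust_policy() {
  cat <<EOF
{ "Version": "2012-10-17", "Statement": [
  { "Effect": "Allow",
    "Principal": { "AWS": "$1" },
    "Action": ["sts:AssumeRole", "sts:TagSession"] } ] }
EOF
}

A_ROLE=arn:aws:iam::111122223333:role/pod-id-ingest
B_ROLE=arn:aws:iam::444455556666:role/cross-acct-ingest
render_a_permission_policy "$B_ROLE"
render_b_trust_policy "$A_ROLE"
```

When debugging, diff what is actually attached in each account against these two shapes: a missing Resource match on the A side and a missing Principal on the B side produce the same AccessDenied.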

8 — STS throttling at fleet scale (the reason to finish migrating)

On a very large fleet still on IRSA, every pod assumes via STS independently; under churn (mass restarts, scale events) STS can throttle. This is not a Pod Identity bug — it is the IRSA model you are leaving.

Confirm. CloudWatch shows STS ThrottlingException climbing during scale events. Fix. Completing the Pod Identity cutover collapses per-pod assumes into once-per-node-per-role, sharply cutting STS call volume.
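The difference is easy to see with back-of-envelope arithmetic. The sketch below is a model only, using the simplification from the text (IRSA assumes once per pod, Pod Identity once per node for a given role); the node and pod counts are illustrative inputs, not measurements.

```shell
#!/usr/bin/env bash
# Rough assume counts for ONE role during a full fleet restart.
# Model: IRSA = one STS assume per pod; Pod Identity = one per node
# that runs the role. $1 = nodes, $2 = pods of that role per node.
assume_counts() {
  local nodes=$1 pods_per_node=$2
  echo "irsa=$(( nodes * pods_per_node )) pod_identity=$nodes"
}

assume_counts 50 20   # prints: irsa=1000 pod_identity=50
```

Real numbers also depend on credential refresh intervals and how many distinct roles each node serves, so treat the ratio, not the absolute values, as the takeaway.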

Best practices

Security notes

The security-relevant controls and how Pod Identity changes them versus IRSA:

| Control | IRSA | Pod Identity | Net effect |
| --- | --- | --- | --- |
| Standing credentials | None (STS) | None (STS) | Same: both keyless |
| Trust surface | Per-issuer, per-sub | Whole pods.eks principal + association | Broader trust, tighter binding elsewhere |
| Binding enforcement | Trust Condition | Association + session tags | Moves to EKS/IAM tags |
| Cross-account trust | Issuer list or role-chain | Single static role ARN | Smaller, auditable surface |
| Audit signal | AssumeRoleWithWebIdentity | AssumeRoleForPodIdentity + tags | Richer (namespace/SA in event) |
| “Grant a pod a role” permission | Edit trust policy (IAM) | eks:CreatePodIdentityAssociation | New permission to govern |

The AssumeRoleForPodIdentity CloudTrail fields worth alerting on, and the rule to write for each:

| CloudTrail field | What it tells you | Alert / detection rule |
| --- | --- | --- |
| eventName | The assume call itself | Baseline volume; spike = mass restart/scale |
| errorCode = AccessDenied | A denied assume | Alert on any sustained AccessDenied (misconfig) |
| requestParameters (session tags) | namespace / SA / pod | Alert on assumes from unexpected namespaces |
| resources (role ARN) | Which role was assumed | Alert if a sensitive role is assumed unexpectedly |
| sourceIPAddress | EKS Auth service | Should be the service; anomalies are suspicious |
| recipientAccountId | Account the assume landed in | Cross-account assumes into B you didn’t expect |
| userIdentity | The EKS service principal | Confirms it’s Pod Identity, not a human/role |
| eventTime clustering | Timing of assumes | Bursts correlate with deploys/scale events |

Cost & sizing

Both IRSA and Pod Identity are free AWS features — you pay for neither the OIDC provider nor the associations nor the EKS Auth calls. The cost deltas are indirect and small, and Pod Identity is generally the cheaper, lower-toil option at fleet scale.

A rough picture for a 50-node, 11-cluster fleet:

| Cost driver | IRSA | Pod Identity | Rough delta |
| --- | --- | --- | --- |
| AWS feature charge | ₹0 | ₹0 | None |
| Agent DaemonSet compute | n/a | ~50 nodes × few mCPU/tens MB | Tiny (absorb in node headroom) |
| STS call volume at scale | Per-pod assumes | Per-node/role assumes | Lower (fewer calls, less throttling) |
| OIDC provider management | 11 providers to track | 0 | Lower toil |
| Trust-policy maintenance | Per-cluster edits, rebuild pages | Zero per-cluster edits | Much lower toil |
| Incident cost (rebuild breakage) | Real (paged twice) | ~Zero | Removes a toil class |

Sizing guidance: the agent needs no tuning for typical fleets; ensure it tolerates any node taints so it schedules everywhere, and confirm your node groups have the few mCPU of headroom. There is no “scale the agent” knob — it is one pod per node by design.

Interview & exam questions

1. Why does IRSA become painful at fleet scale, and how does Pod Identity fix it? Each cluster has a unique OIDC issuer, so a shared role needs every issuer registered as a provider and a sub condition per cluster, and a cluster rebuild changes the issuer and breaks trust. Pod Identity replaces the per-cluster OIDC anchor with one service principal (pods.eks.amazonaws.com) and moves the SA binding into an EKS association, so the trust policy is identical across clusters and never edited per cluster.

2. What are the three moving parts of Pod Identity? The Pod Identity Agent DaemonSet (serves credentials over the node’s link-local 169.254.170.23), the association resource (maps (cluster, namespace, SA) → role), and the trust policy that trusts pods.eks.amazonaws.com. Every failure traces to exactly one of these.

3. Why is sts:TagSession required and what happens if you omit it? Every Pod Identity assume attaches six session tags, so the trust policy must allow sts:TagSession in addition to sts:AssumeRole. Omit it and every tagged assume is denied with a blanket AccessDenied — the most common “set up correctly but still broken” cause.

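4. During cutover, a service account has both an IRSA annotation and a Pod Identity association. Which wins, and why does that matter? The choice is made when the pod starts, by which environment variables EKS injects: when an association matches, the new pod receives the container-credentials variables (AWS_CONTAINER_CREDENTIALS_FULL_URI) and the SDK uses Pod Identity. Do not rely on SDK ordering to break a tie: most SDK default credential chains check web identity (AWS_WEB_IDENTITY_TOKEN_FILE, IRSA) before container credentials, so a pod that carried both sets, for example because they were hard-coded in the pod spec, would stay on IRSA. The pod-start decision is what makes the cutover safe and reversible: keep the annotation and OIDC trust live, the restarted pod uses Pod Identity, and deleting the association + restart drops back to IRSA.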

5. How do you make one role serve many namespaces safely under Pod Identity? Use the kubernetes-namespace session tag: write one role whose permission policy scopes resources with aws:PrincipalTag/kubernetes-namespace. The same role assumed from another namespace gets a different tag value and is denied — no extra roles or trust conditions needed.

6. A migrated pod returns the node instance role from aws sts get-caller-identity. Name three causes. (a) A proxy is swallowing the link-local request because 169.254.170.23 is not in NO_PROXY; (b) the eks-pod-identity-agent DaemonSet is missing or unhealthy on that node; (c) the association’s (namespace, SA) does not match the pod’s actual service account.

7. How does Pod Identity handle cross-account access, and how is it better than the IRSA approach? Natively, via --target-role-arn: the account-A association role is assumed first, then it assumes the account-B target, whose credentials are injected. The B role trusts a single static A-role ARN. This replaces IRSA’s SDK role-chaining hack with a first-class, EKS-auditable flag and a smaller trust surface.

8. When must you pass --disable-session-tags, and what is the consequence? When you attach an inline session policy with --policy, because session tags and a session policy cannot be combined on the same assume. The consequence is that the six session tags (including kubernetes-namespace) are gone, so any permission-policy conditions on aws:PrincipalTag/... stop matching — scope via the inline policy instead.

9. Why is the Pod Identity assume more scalable than IRSA’s? IRSA has every pod call STS itself (AssumeRoleWithWebIdentity), so STS call volume scales with pod count. Pod Identity has the agent call EKS Auth (AssumeRoleForPodIdentity) and cache credentials once per node per role, so a node running twenty pods of one role does one assume — far less STS pressure and throttling at scale.

10. What is the safe rollback if a namespace’s cutover goes wrong? Delete the association and kubectl rollout restart the workload; because you kept the IRSA annotation and OIDC trust statement during cutover, the pod falls back to IRSA. Reversibility holds right up until you deliberately remove the annotation and OIDC statement after a soak.

11. How do you confirm Pod Identity is actually being used, not just configured? Two checks: aws sts get-caller-identity from inside the pod must return arn:aws:sts::...:assumed-role/<your-role>/... (not the node role), and CloudTrail must show AssumeRoleForPodIdentity events carrying the expected kubernetes-namespace/kubernetes-service-account session tags.

12. What new IAM-equivalent permission does Pod Identity introduce that you must govern? eks:CreatePodIdentityAssociation — creating an association effectively grants a service account an IAM role, so it must be restricted to the platform pipeline and reviewed like an IAM trust change.
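The tag-scoped policy from question 5 can be sketched as follows. This is a minimal illustration, not the article's exact policy: the bucket name example-shared-bucket, the per-namespace prefix layout, and the chosen S3 actions are all assumptions made for the example. The only part taken from the text is the aws:PrincipalTag/kubernetes-namespace key, used here as an IAM policy variable in the resource ARN.

```python
import json


def namespace_scoped_s3_policy(bucket: str) -> dict:
    """Build a permission policy whose S3 prefix follows the caller's namespace tag.

    Assumption: objects are laid out as <bucket>/<namespace>/..., so the
    kubernetes-namespace session tag decides which prefix a pod can touch.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ObjectsUnderOwnNamespacePrefix",
                "Effect": "Allow",
                # Hypothetical action set; narrow it to what the workload needs.
                "Action": ["s3:GetObject", "s3:PutObject"],
                # ${...} is an IAM policy variable, resolved per session by IAM.
                "Resource": (
                    f"arn:aws:s3:::{bucket}/"
                    "${aws:PrincipalTag/kubernetes-namespace}/*"
                ),
            }
        ],
    }


policy = namespace_scoped_s3_policy("example-shared-bucket")
print(json.dumps(policy, indent=2))
```

With a layout like this, a pod in payments and a pod in analytics assume the same role but resolve to different prefixes. If the association is created with --disable-session-tags, the variable has no value to resolve and the statement stops matching, which is the failure described in question 8.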

These map to the AWS Certified Security – Specialty (identity federation, least privilege, cross-account access) and the Certified Kubernetes Security Specialist (CKS) (workload identity, secrets-free credentials) domains. A compact cert-mapping for revision:

| Question theme | Primary cert | Domain area |
| --- | --- | --- |
| OIDC vs service-principal trust | AWS Security Specialty | Identity & Access Management |
| Session tags, PrincipalTag scoping | AWS Security Specialty | Fine-grained authorization |
| Cross-account --target-role-arn | AWS Security Specialty | Cross-account access patterns |
| Workload identity (keyless creds) | CKS | Cluster hardening / supply chain |
| Reversible rollout, dual-trust | (architecture) | Migration & operational safety |
| CloudTrail audit of assumes | AWS Security Specialty | Logging & monitoring |

Quick check

  1. A migrated pod’s aws sts get-caller-identity returns the node instance role. Name the single most common cause and the exact fix.
  2. You added pods.eks.amazonaws.com to the role’s trust with sts:AssumeRole and still get AccessDenied on every call. What did you forget?
  3. During cutover a pod has both IRSA and Pod Identity environment variables. Which credential source does the SDK use, and why is that the desired behaviour?
  4. You want one role to serve payments and analytics with different S3 access. What Pod Identity feature makes this possible without two roles?
  5. You attach --policy to an association and a namespace-scoped call that used to work now returns AccessDenied. What happened?

Answers

  1. An HTTP proxy is swallowing the link-local request to 169.254.170.23, so the SDK falls through to the node role. Fix: add 169.254.170.23 and [fd00:ec2::23] to NO_PROXY wherever proxy variables are set.
  2. sts:TagSession. Every Pod Identity assume is tagged, so the trust policy must allow sts:TagSession alongside sts:AssumeRole; without it the tagged assume is denied with a blanket AccessDenied.
  3. The SDK uses Pod Identity — the container-credentials variables (AWS_CONTAINER_CREDENTIALS_FULL_URI) win over web identity in the default provider chain. This is desirable because it lets you keep IRSA live as a fallback, making the cutover reversible with a single rollout restart after deleting the association.
  4. Session tags — specifically kubernetes-namespace. Write one role and scope its permission policy with aws:PrincipalTag/kubernetes-namespace; the same role assumed from each namespace carries a different tag value and is allowed/denied accordingly.
  5. Using --policy requires --disable-session-tags, so the six session tags are gone. The role’s permission policy scopes with aws:PrincipalTag/kubernetes-namespace, which no longer matches because the tag is absent. Scope via the inline policy instead, or drop --policy and rely on tags.
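The first two answers lend themselves to small offline checks you could run during a rollout. The sketch below is one possible shape, under stated assumptions: the helper names are hypothetical, the inputs are strings and dicts you have already fetched (for example the Arn field from aws sts get-caller-identity, the pod's NO_PROXY value, and the role's trust policy document), and nothing here calls AWS.

```python
import re

# Link-local agent endpoints named in answer 1.
POD_IDENTITY_ENDPOINTS = ("169.254.170.23", "[fd00:ec2::23]")
# Actions the Pod Identity trust statement must allow (answer 2).
REQUIRED_ACTIONS = {"sts:AssumeRole", "sts:TagSession"}


def assumed_role_name(arn: str):
    """Return the role name from an STS assumed-role ARN, or None if it is not one."""
    match = re.fullmatch(r"arn:aws[\w-]*:sts::\d{12}:assumed-role/([^/]+)/.+", arn)
    return match.group(1) if match else None


def uses_expected_role(arn: str, expected_role: str) -> bool:
    """The node role is also an assumed-role ARN, so compare the role name itself."""
    return assumed_role_name(arn) == expected_role


def missing_no_proxy_entries(no_proxy: str) -> list:
    """List the agent endpoints that are absent from a NO_PROXY value."""
    entries = {item.strip() for item in no_proxy.split(",") if item.strip()}
    return [ep for ep in POD_IDENTITY_ENDPOINTS if ep not in entries]


def _as_list(value):
    """IAM allows a single string where a list is expected; normalise to a list."""
    return value if isinstance(value, list) else [value]


def trust_allows_pod_identity(trust_policy: dict) -> bool:
    """True if one Allow statement grants pods.eks.amazonaws.com both required actions.

    Simplification: both actions must sit in the same statement.
    """
    for stmt in _as_list(trust_policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue
        if "pods.eks.amazonaws.com" not in _as_list(principal.get("Service", [])):
            continue
        if REQUIRED_ACTIONS.issubset(_as_list(stmt.get("Action", []))):
            return True
    return False
```

A typical use is to feed uses_expected_role the ARN printed inside the pod together with the role name from the association; a False result sends you back to the three causes in question 6. The role and account values you pass in are your own, and none are implied by this sketch.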

Glossary

Next steps

You can now migrate any cluster’s workload identity from IRSA to Pod Identity safely and reversibly. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
