Exported service account keys are among the most common credential-leak vectors on Google Cloud, and on GKE you do not need them at all. Workload Identity Federation for GKE lets a Kubernetes service account (KSA) either impersonate a Google service account (GSA) or act directly as an IAM principal in its own right, with short-lived tokens minted on demand by the cluster metadata server. This is a deep dive into how that machinery actually works, how to wire it up correctly, and how to debug it when a pod starts throwing 403s from the metadata path.
1. The internals: metadata server, KSA-to-GSA mapping, and token minting
When code inside a pod calls a Google API through a client library, the library looks for Application Default Credentials. With Workload Identity enabled, ADC resolves to the GKE metadata server reachable at http://metadata.google.internal (the link-local address 169.254.169.254). This is not the raw GCE metadata server; on Workload Identity node pools it is a per-node gke-metadata-server DaemonSet pod that intercepts metadata traffic and scopes it to the calling pod’s KSA.
The flow when a pod requests a token:
pod app -> client library (ADC)
  -> GET metadata.google.internal/.../token
  -> gke-metadata-server (per node)
       1. identifies the calling pod + its KSA
       2. exchanges the KSA token at the STS endpoint
          for a federated access token
       3. (optional) impersonates a GSA via IAM Credentials API
  -> returns a short-lived OAuth2 access token
The KSA identity is expressed as a federated principal of the form:
serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]
That string PROJECT_ID.svc.id.goog is the workload identity pool automatically provisioned for the cluster’s project. Every pod running under a given KSA in a given namespace federates to exactly that principal. There are two models for what happens next:
- Impersonation model (classic): the federated principal is granted roles/iam.workloadIdentityUser on a Google service account (GSA), and the metadata server impersonates that GSA. The KSA is annotated to point at the GSA.
- Direct model (KSA-only): you grant IAM roles directly to the federated principal. No GSA, no annotation, no impersonation hop. This is the newer pattern and the one to prefer for greenfield work.
Callout: tokens are short-lived (minted per request and cached briefly). There is nothing on disk to rotate, exfiltrate, or forget about. That is the entire security win.
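To make "short-lived" concrete: the token endpoint answers with a small JSON document that carries the token and its remaining lifetime in seconds. The sketch below pulls the expires_in field out of such a response. The helper name token_ttl_seconds and the sample response are illustrative assumptions, and the sed-based parsing is a convenience for images without jq, not a general JSON parser.

```shell
# Hypothetical helper: print the remaining lifetime (in seconds) from a
# metadata-server token response read on stdin.
token_ttl_seconds() {
  sed -n 's/.*"expires_in"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Illustrative response only; a real access_token is much longer.
sample='{"access_token":"ya29.EXAMPLE","expires_in":3599,"token_type":"Bearer"}'
printf '%s\n' "$sample" | token_ttl_seconds
```

Inside a pod, the same function can be fed from a curl call against the token endpoint instead of the sample string.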
2. Enable Workload Identity on the cluster and node pools
Workload Identity is both a cluster-level setting and a node-pool-level setting. Enabling it on the cluster alone is the number-one cause of “I configured everything and it still doesn’t work.”
Enable it on the cluster (sets the workload pool):
gcloud container clusters update CLUSTER_NAME \
--location=REGION \
--workload-pool=PROJECT_ID.svc.id.goog
Then enable the metadata server on each node pool. New node pools should set it at creation; existing ones need an update (which recreates nodes):
# Existing node pool
gcloud container node-pools update NODE_POOL \
--cluster=CLUSTER_NAME \
--location=REGION \
--workload-metadata=GKE_METADATA
# New node pool
gcloud container node-pools create NODE_POOL \
--cluster=CLUSTER_NAME \
--location=REGION \
--workload-metadata=GKE_METADATA
In Terraform, both halves are explicit:
resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
  # ... remaining cluster config
}

resource "google_container_node_pool" "primary" {
  name     = "primary"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  node_config {
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }
}
On Autopilot clusters, Workload Identity is enabled by default and the node-pool step does not apply. The KSA/IAM wiring below is identical.
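On a Standard cluster it can help to read both settings back rather than trust that the updates landed. The sketch below is a hypothetical helper (wi_config_status is not a gcloud command) that compares the two values against what Workload Identity needs. It assumes the describe output exposes the fields workloadIdentityConfig.workloadPool and config.workloadMetadataConfig.mode; the gcloud calls are kept in comments so the snippet itself has no side effects.

```shell
# Hypothetical helper: report which half of the setup is missing.
#   $1 = project ID, $2 = cluster workload pool, $3 = node-pool metadata mode
wi_config_status() {
  if [ "$2" != "$1.svc.id.goog" ]; then
    echo "cluster: workload pool is '$2', expected '$1.svc.id.goog'"
    return 1
  fi
  if [ "$3" != "GKE_METADATA" ]; then
    echo "node pool: metadata mode is '$3', expected 'GKE_METADATA'"
    return 1
  fi
  echo "ok"
}

# Usage sketch (assumed field names; run where gcloud is authenticated):
#   pool=$(gcloud container clusters describe CLUSTER_NAME --location=REGION \
#            --format='value(workloadIdentityConfig.workloadPool)')
#   mode=$(gcloud container node-pools describe NODE_POOL --cluster=CLUSTER_NAME \
#            --location=REGION --format='value(config.workloadMetadataConfig.mode)')
#   wi_config_status PROJECT_ID "$pool" "$mode"
```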
3. Link a KSA to a GSA with IAM bindings (impersonation model)
This is the classic pattern you will meet in most existing clusters. Three pieces must line up: a GSA, an IAM policy binding granting the KSA principal workloadIdentityUser on that GSA, and an annotation on the KSA.
Create the GSA and grant it whatever application roles it needs (example: read objects from a bucket):
gcloud iam service-accounts create app-gsa \
--display-name="App workload identity GSA"
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:app-gsa@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
Bind the federated KSA principal to the GSA via workloadIdentityUser:
gcloud iam service-accounts add-iam-policy-binding \
app-gsa@PROJECT_ID.iam.gserviceaccount.com \
--role="roles/iam.workloadIdentityUser" \
--member="serviceAccount:PROJECT_ID.svc.id.goog[apps/app-ksa]"
Create the KSA and annotate it to point at the GSA:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-ksa
  namespace: apps
  annotations:
    iam.gke.io/gcp-service-account: app-gsa@PROJECT_ID.iam.gserviceaccount.com
kubectl apply -f ksa.yaml
Finally, make pods actually use the KSA. A pod that omits serviceAccountName runs as the namespace default KSA, not yours:
spec:
  serviceAccountName: app-ksa
  containers:
    - name: app
      image: REGION-docker.pkg.dev/PROJECT_ID/repo/app:1.0
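Because the namespace and KSA name in the IAM member must match the manifest exactly, one option is to build the member string from the same variables that render the manifest. A minimal sketch, where wi_member is a hypothetical helper and my-project is a placeholder project ID:

```shell
# Hypothetical helper: build the impersonation-model member string.
#   $1 = project ID, $2 = namespace, $3 = KSA name
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Example with placeholder values.
wi_member my-project apps app-ksa
```

The result can then be passed as --member="$(wi_member "$PROJECT_ID" apps app-ksa)" in the workloadIdentityUser binding, so a rename only has to happen in one place.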
4. Fine-grained access with IAM conditions and custom roles
Predefined roles like roles/storage.objectViewer, when granted at the project level, apply to every matching resource in the project and are almost always too broad. Tighten them two ways.
IAM Conditions scope a grant to specific resources or contexts using CEL. For example, restrict object reads to one bucket:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:app-gsa@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer" \
--condition='expression=resource.name.startsWith("projects/_/buckets/my-app-bucket"),title=only-app-bucket'
Custom roles pare permissions down to the exact set of API calls a workload makes. Define them in YAML and create at project or org level:
title: "App Object Reader"
stage: "GA"
includedPermissions:
- storage.objects.get
- storage.objects.list
gcloud iam roles create appObjectReader \
--project=PROJECT_ID \
--file=role.yaml
Conditions are evaluated on the GSA’s access in the impersonation model, and on the federated principal directly in the KSA-only model. Either way, condition the application grant, not the workloadIdentityUser grant.
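One caveat about prefix conditions like the bucket example earlier in this section: startsWith on projects/_/buckets/my-app-bucket also matches any bucket whose name merely begins with my-app-bucket. A tighter sketch matches the bucket itself or objects under it. The helper name is hypothetical, and the expression assumes Cloud Storage object resource names of the form projects/_/buckets/BUCKET/objects/OBJECT; check it against the permissions your role actually uses before relying on it.

```shell
# Hypothetical helper: build a CEL expression scoped to exactly one bucket.
#   $1 = bucket name
bucket_condition_expr() {
  printf 'resource.name == "projects/_/buckets/%s" || resource.name.startsWith("projects/_/buckets/%s/objects/")\n' "$1" "$1"
}

# Example with the bucket name used in this section.
bucket_condition_expr my-app-bucket
```

The expression contains no commas, so it can be dropped into the existing flag as --condition="expression=$(bucket_condition_expr my-app-bucket),title=only-app-bucket".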
5. The KSA-only federation model (no GSA)
Newer GKE versions let you skip the GSA entirely and grant IAM roles straight to the federated principal. This removes the impersonation hop, the extra identity to manage, and the annotation. For a KSA app-ksa in namespace apps:
gcloud projects add-iam-policy-binding PROJECT_ID \
--role="roles/storage.objectViewer" \
--member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/apps/sa/app-ksa"
Note the differences from the impersonation binding:
| Aspect | Impersonation (GSA) | Direct (KSA-only) |
|---|---|---|
| Member format | serviceAccount:PROJECT_ID.svc.id.goog[ns/ksa] | principal://.../subject/ns/NS/sa/KSA |
| Uses PROJECT_NUMBER | No | Yes (in the principal path) |
| KSA annotation | Required | Not required |
| Extra GSA to manage | Yes | No |
With the direct model there is no annotation on the KSA. The pod still needs serviceAccountName: app-ksa, and that is the whole configuration on the Kubernetes side. Some Google API client paths still expect a GSA email (and a handful of integrations require one), so verify your specific APIs, but for the common cases the KSA-only model is cleaner and is the right default going forward.
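The direct-model member is long enough that assembling it from parts is less error-prone than typing it. A minimal sketch, assuming a hypothetical helper wi_principal and placeholder values for the project number and ID:

```shell
# Hypothetical helper: build the direct-model (KSA-only) member string.
#   $1 = project number, $2 = project ID, $3 = namespace, $4 = KSA name
wi_principal() {
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$1" "$2" "$3" "$4"
}

# Example with placeholder values.
wi_principal 123456789012 my-project apps app-ksa
```

Note that the project number and the project ID both appear in the path, which is exactly where hand-edited strings tend to go wrong.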
6. Per-namespace isolation and the default SA trap
Treat the namespace as your identity boundary. Each team/app gets a dedicated KSA in its own namespace, bound to its own least-privilege IAM. Never bind sensitive roles to a default KSA, because every pod that forgets serviceAccountName silently inherits it.
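Defang the default KSA in each namespace so that a pod which omits serviceAccountName gets nothing rather than ambient access. Setting automountServiceAccountToken: false stops the Kubernetes API token from being mounted automatically; pair it with the rule above (no IAM grants on the default KSA) so such a pod has no Google Cloud access either: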
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: apps
automountServiceAccountToken: false
Then audit which pods run as which KSA:
kubectl get pods -A \
-o custom-columns='NS:.metadata.namespace,POD:.metadata.name,KSA:.spec.serviceAccountName'
Anything showing KSA: default (or <none>) is a pod that is not using a scoped identity. Fix it before it ships.
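On a busy cluster the audit output is long, so it may be worth filtering it down to the offenders. The sketch below is a hypothetical awk filter over the three-column output of the command above; the sample rows are made up for illustration.

```shell
# Hypothetical filter: read the custom-columns output on stdin and print
# NAMESPACE/POD for rows whose KSA column is "default" or "<none>".
pods_without_scoped_ksa() {
  awk 'NR > 1 && ($3 == "default" || $3 == "<none>") { print $1 "/" $2 }'
}

# Illustrative input; in practice, pipe the kubectl command into the function.
pods_without_scoped_ksa <<'EOF'
NS     POD      KSA
apps   web-1    app-ksa
apps   batch-7  default
tools  debug-0  <none>
EOF
```

System namespaces will typically show up in the results as well, so expect to exclude those you do not own.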
7. Debugging: metadata 403s, missing annotations, DNS and firewall
When access fails, work the path in order from the most common failure to the rarest. First, run an interactive pod as the target KSA and probe the metadata server:
kubectl run -it --rm wi-debug \
--image=google/cloud-sdk:slim \
--namespace=apps \
--overrides='{"spec":{"serviceAccountName":"app-ksa"}}' \
-- bash
Inside the pod, confirm which identity the metadata server reports and that a token can be minted:
# Which identity does the pod actually resolve to?
curl -s -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
# Can it mint a token? (200 = good; 403/404 = misconfig)
curl -s -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"
# What gcloud sees
gcloud auth list
Map the symptom to the cause:
| Symptom | Most likely cause |
|---|---|
| Email returns the default GCE SA, not your GSA | Node pool not on GKE_METADATA, or pod using wrong KSA |
| Email correct but token request 403 | Missing/incorrect workloadIdentityUser binding |
| curl: could not resolve metadata.google.internal | DNS / NetworkPolicy blocking the metadata server |
| API call 403 but token mints fine | Token works; the GSA/principal lacks the application role |
| Annotation present but ignored | Typo in annotation key iam.gke.io/gcp-service-account |
Specific gotchas to check:
- Annotation key and value. The key is exactly iam.gke.io/gcp-service-account and the value is the full GSA email, not the short name. A trailing space or wrong project here fails silently.
- Member string mismatch. In the impersonation binding, namespace and KSA name inside the brackets must match the pod’s actual namespace and serviceAccountName character-for-character.
- NetworkPolicy / firewall. A default-deny egress NetworkPolicy will block 169.254.169.254. Egress to the metadata server must be allowed. Likewise, hardened nodes that block link-local traffic break the metadata path.
- Node-pool propagation. After --workload-metadata=GKE_METADATA, nodes are recreated. Pods scheduled on old nodes during rollout still hit the GCE metadata server. Confirm the pod’s node is new.
- Eventual consistency. Fresh IAM bindings can take a short while to propagate; a 403 that resolves itself in a minute or two is propagation, not your config.
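The annotation-value gotcha is easy to check mechanically. The sketch below is a deliberately loose pattern check, not an authoritative validator of GSA naming rules: it only confirms that the value looks like a full NAME@PROJECT.iam.gserviceaccount.com address with no stray whitespace. The helper name is hypothetical.

```shell
# Hypothetical helper: succeed only if $1 looks like a full GSA email.
# Loose pattern by design; it catches short names and trailing spaces.
is_full_gsa_email() {
  printf '%s' "$1" |
    grep -Eq '^[a-z][a-z0-9-]*@[a-z][a-z0-9.-]*\.iam\.gserviceaccount\.com$'
}

# Usage sketch (jsonpath escaping of the dotted key is assumed to work as shown):
#   value=$(kubectl get sa app-ksa -n apps \
#     -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}')
#   is_full_gsa_email "$value" || echo "annotation value is not a full GSA email"
```

It cannot tell you whether the project in the address is the right one; for that, compare against the GSA you actually created.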
8. Auditing effective permissions
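Verify what an identity can actually do, not what you think you granted. Start by inspecting the policies in play: the IAM policies inherited from the project’s ancestors, and the policy on the GSA itself, which shows who can impersonate it: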
gcloud projects get-ancestors-iam-policy PROJECT_ID # context
gcloud iam service-accounts get-iam-policy \
app-gsa@PROJECT_ID.iam.gserviceaccount.com # who can impersonate
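To test one specific permission against one resource, Policy Troubleshooter is a possible next step for the impersonation model, since it takes a principal email. It expects a full resource name, which for a bucket is assumed here to take the form //storage.googleapis.com/projects/_/buckets/BUCKET. The helper below is hypothetical and only builds that name; the gcloud call is shown as a commented usage sketch.

```shell
# Hypothetical helper: full resource name for a Cloud Storage bucket.
#   $1 = bucket name
bucket_full_resource_name() {
  printf '//storage.googleapis.com/projects/_/buckets/%s\n' "$1"
}

# Usage sketch (requires gcloud and access to Policy Troubleshooter):
#   gcloud policy-troubleshoot iam "$(bucket_full_resource_name my-app-bucket)" \
#     --principal-email="app-gsa@PROJECT_ID.iam.gserviceaccount.com" \
#     --permission="storage.objects.get"
```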
Use Policy Analyzer to ask “who has access to what” across the resource hierarchy:
gcloud asset analyze-iam-policy \
--organization=ORG_ID \
--identity="serviceAccount:app-gsa@PROJECT_ID.iam.gserviceaccount.com"
Then confirm real usage in Cloud Audit Logs. Data Access logs show the impersonation and the downstream API calls carrying the GSA (or federated principal) as the authentication info:
gcloud logging read \
'protoPayload.authenticationInfo.principalEmail="app-gsa@PROJECT_ID.iam.gserviceaccount.com"' \
--limit=20 \
--format='table(timestamp, protoPayload.methodName, resource.type)'
Enterprise scenario
A payments platform team ran a multi-tenant GKE cluster where each tenant got its own namespace, and they’d standardized on the KSA-only direct model. A new tenant’s pods could mint a token (/token returned 200) but every Cloud Storage call came back 403 PERMISSION_DENIED, even though the principal:// binding looked identical to working tenants. The grant had been applied with the literal string PROJECT_ID.svc.id.goog instead of the cluster’s actual workload pool, and worse, the principal path embedded the wrong PROJECT_NUMBER (the team had copy-pasted from a sibling project). Because the direct-model member is an opaque string, IAM accepts a malformed principal happily and silently grants nothing.
The fix was to derive both values programmatically instead of hand-editing them, then re-apply:
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
--role="roles/storage.objectViewer" \
--member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/tenant-42/sa/app-ksa" \
--condition='expression=resource.name.startsWith("projects/_/buckets/tenant-42-data"),title=tenant-42-bucket'
They then added a CI guard that rejects any IAM diff whose principal:// path doesn’t resolve to the live project number. The broader lesson: in the KSA-only model there is no annotation and no GSA email to typo-check against, so the principal string itself becomes the single point of failure. Generate it; never type it.
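The team's actual guard is not shown here. As a rough sketch of what such a check might look like, the hypothetical function below accepts a member string only if it embeds the expected project number and workload pool; in CI, the expected values would come from the live project rather than from the diff.

```shell
# Hypothetical guard: succeed only if the principal:// member embeds the
# expected project number and workload pool.
#   $1 = member string, $2 = expected project number, $3 = expected project ID
principal_matches_project() {
  prefix="principal://iam.googleapis.com/projects/$2/locations/global/workloadIdentityPools/$3.svc.id.goog/subject/ns/"
  case "$1" in
    "$prefix"*/sa/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch:
#   principal_matches_project "$MEMBER" "$PROJECT_NUMBER" "$PROJECT_ID" \
#     || { echo "principal does not match live project"; exit 1; }
```

A literal PROJECT_ID left in the path, or a project number copied from a sibling project, fails the prefix match.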
Verify
Run this end-to-end check from a pod bound to the KSA. All four should succeed:
# 1. Metadata server returns the intended identity
curl -s -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
# 2. A token mints (HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" -H "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"
# 3. gcloud auth shows the active principal
gcloud auth list --filter=status:ACTIVE --format="value(account)"
# 4. A real, least-privilege API call succeeds (and nothing more does)
gcloud storage ls gs://my-app-bucket
If all four pass and an out-of-scope call (for example, listing a different bucket) is denied, your least-privilege wiring is correct.
Pitfalls and next steps
The two failure modes that account for most lost hours are forgetting the node-pool GKE_METADATA flag and a namespace/KSA mismatch in the IAM member string. Both produce confident-looking config that simply does not work, so when in doubt, exec into a pod and read the identity straight from the metadata server rather than reasoning about it.
For next steps: migrate any remaining mounted SA-key secrets off the cluster (search for secretKeyRef entries feeding GOOGLE_APPLICATION_CREDENTIALS), prefer the KSA-only direct model for new workloads, and disable service account key creation at the org level with the iam.disableServiceAccountKeyCreation constraint so the leak vector cannot reappear. Workload Identity is only as strong as the least-privilege IAM behind it, so treat each KSA as a first-class principal with its own scoped, audited grants.
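As a starting point for the key-migration search, a sketch that lists manifests still mentioning GOOGLE_APPLICATION_CREDENTIALS. It assumes manifests live as .yaml or .yml files under a directory you pass in, and the helper name is hypothetical; a hit is only a lead to review, not proof that a key is mounted.

```shell
# Hypothetical helper: list manifest files that reference
# GOOGLE_APPLICATION_CREDENTIALS.
#   $1 = directory containing Kubernetes manifests
find_key_based_manifests() {
  grep -rl --include='*.yaml' --include='*.yml' \
    'GOOGLE_APPLICATION_CREDENTIALS' "$1"
}

# Usage sketch:
#   find_key_based_manifests ./k8s
```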