GKE Workload Identity Deep Dive: Secure Pod-to-Google-API Access Without Keys

Exported service account keys are one of the most common credential-leak vectors on Google Cloud, and on GKE you do not need them at all. Workload Identity Federation for GKE lets a Kubernetes service account (KSA) impersonate or directly act as a Google IAM principal, with short-lived tokens minted on demand by the cluster metadata server. This is a deep dive into how that machinery actually works, how to wire it up correctly, and how to debug it when a pod starts throwing 403s from the metadata path.

1. The internals: metadata server, KSA-to-GSA mapping, and token minting

When code inside a pod calls a Google API through a client library, the library looks for Application Default Credentials. With Workload Identity enabled, ADC resolves to the GKE metadata server reachable at http://metadata.google.internal (the link-local address 169.254.169.254). This is not the raw GCE metadata server; on Workload Identity node pools it is a per-node gke-metadata-server DaemonSet pod that intercepts metadata traffic and scopes it to the calling pod’s KSA.

The flow when a pod requests a token:

pod app -> client library (ADC)
        -> GET metadata.google.internal/.../token
        -> gke-metadata-server (per node)
           1. identifies the calling pod + its KSA
           2. exchanges the KSA token at the STS endpoint
              for a federated access token
           3. (optional) impersonates a GSA via IAM Credentials API
        -> returns a short-lived OAuth2 access token

The KSA identity is expressed as a federated principal of the form:

serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]

That string PROJECT_ID.svc.id.goog is the workload identity pool automatically provisioned for the cluster’s project. Every pod running under a given KSA in a given namespace federates to exactly that principal. There are two models for what happens next:

Impersonation model (classic): the federated principal is granted roles/iam.workloadIdentityUser on a Google service account (GSA), and the metadata server impersonates that GSA. The KSA is annotated to point at the GSA.
Direct model (KSA-only): you grant IAM roles directly to the federated principal. No GSA, no annotation, no impersonation hop. This is the newer pattern and the one to prefer for greenfield work.

Callout: tokens are short-lived (minted per request and cached briefly). There is nothing on disk to rotate, exfiltrate, or forget about. That is the entire security win.
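To make the short-lived token concrete, here is a minimal Python sketch of what a client library does on your behalf: request a token from the metadata server, record when it expires, and refresh shortly before that. The function names and the 60-second refresh margin are illustrative assumptions, not part of any Google library; in real code, prefer the official client libraries, which handle this for you.

```python
import json
import time
import urllib.request

# Token endpoint on the metadata server (same path used in the debugging section).
TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def parse_token_response(body: str, now: float) -> dict:
    """Extract the token and turn the relative expires_in into an absolute time."""
    data = json.loads(body)
    return {
        "access_token": data["access_token"],
        "token_type": data.get("token_type", "Bearer"),
        "expires_at": now + int(data["expires_in"]),
    }


def fetch_token(timeout: float = 5.0) -> dict:
    """Ask the metadata server for a token. Only works from inside a pod or VM."""
    req = urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_token_response(resp.read().decode("utf-8"), time.time())


def needs_refresh(token: dict, now: float, skew_seconds: int = 60) -> bool:
    """True once the token is within skew_seconds of expiring (assumed margin)."""
    return now >= token["expires_at"] - skew_seconds
```

Nothing here touches disk: the token lives in memory and is simply re-requested when it gets close to expiry.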

2. Enable Workload Identity on the cluster and node pools

Workload Identity is a cluster-level setting and a node-pool-level setting. Enabling it on the cluster alone is the number-one reason “I configured everything and it still doesn’t work.”

Enable it on the cluster (sets the workload pool):

gcloud container clusters update CLUSTER_NAME \
  --location=REGION \
  --workload-pool=PROJECT_ID.svc.id.goog

Then enable the metadata server on each node pool. New node pools should set it at creation; existing ones need an update (which recreates nodes):

# Existing node pool
gcloud container node-pools update NODE_POOL \
  --cluster=CLUSTER_NAME \
  --location=REGION \
  --workload-metadata=GKE_METADATA

# New node pool
gcloud container node-pools create NODE_POOL \
  --cluster=CLUSTER_NAME \
  --location=REGION \
  --workload-metadata=GKE_METADATA

In Terraform, both halves are explicit:

resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
  # ... remaining cluster config
}

resource "google_container_node_pool" "primary" {
  name     = "primary"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  node_config {
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }
}

On Autopilot clusters, Workload Identity is enabled by default and the node-pool step does not apply. The KSA/IAM wiring below is identical.
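Because the cluster-only mistake is so common on Standard clusters, it can be worth scanning node pools automatically. The sketch below assumes you have captured the output of gcloud container node-pools list --cluster=CLUSTER_NAME --location=REGION --format=json, and that each entry exposes the mode at config.workloadMetadataConfig.mode (an assumption based on the GKE API's node pool resource; check your own output). The function name is hypothetical.

```python
import json


def pools_missing_gke_metadata(node_pools_json: str) -> list:
    """Return names of node pools whose workload metadata mode is not GKE_METADATA."""
    missing = []
    for pool in json.loads(node_pools_json):
        # Assumed field path: config.workloadMetadataConfig.mode
        mode = (pool.get("config", {})
                    .get("workloadMetadataConfig", {})
                    .get("mode"))
        if mode != "GKE_METADATA":
            missing.append(pool.get("name", "<unnamed>"))
    return missing
```

An empty list means every node pool is serving the GKE metadata server; any name it returns is a pool where pods will still see the node's identity.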

3. Link a KSA to a GSA with IAM bindings (impersonation model)

This is the classic pattern you will meet in most existing clusters. Three pieces must line up: a GSA, an IAM policy binding granting the KSA principal workloadIdentityUser on that GSA, and an annotation on the KSA.

Create the GSA and grant it whatever application roles it needs (example: read objects from a bucket):

gcloud iam service-accounts create app-gsa \
  --display-name="App workload identity GSA"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app-gsa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

Bind the federated KSA principal to the GSA via workloadIdentityUser:

gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:PROJECT_ID.svc.id.goog[apps/app-ksa]"

Create the KSA and annotate it to point at the GSA:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-ksa
  namespace: apps
  annotations:
    iam.gke.io/gcp-service-account: app-gsa@PROJECT_ID.iam.gserviceaccount.com

kubectl apply -f ksa.yaml

Finally, make pods actually use the KSA. A pod that omits serviceAccountName runs as the namespace default KSA, not yours:

spec:
  serviceAccountName: app-ksa
  containers:
    - name: app
      image: REGION-docker.pkg.dev/PROJECT_ID/repo/app:1.0
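Since the three pieces (GSA, binding, annotation) are plain strings that must agree, a small consistency check can catch mismatches before deployment. This is a sketch under stated assumptions: the KSA manifest is already parsed into a dict, binding_members is the list of members holding roles/iam.workloadIdentityUser on the GSA, and all function names are hypothetical.

```python
ANNOTATION_KEY = "iam.gke.io/gcp-service-account"


def expected_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Build the federated member string used in the impersonation binding."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def check_impersonation_wiring(ksa_manifest: dict, gsa_email: str,
                               binding_members: list, project_id: str) -> list:
    """Return a list of problems; an empty list means the pieces line up."""
    problems = []
    meta = ksa_manifest.get("metadata", {})
    namespace = meta.get("namespace", "default")
    name = meta.get("name", "")

    # The annotation must exist and carry the full GSA email.
    annotation = (meta.get("annotations") or {}).get(ANNOTATION_KEY)
    if annotation is None:
        problems.append(f"KSA is missing the {ANNOTATION_KEY} annotation")
    elif annotation != gsa_email:
        problems.append(f"annotation points at {annotation!r}, expected {gsa_email!r}")

    # The binding member must match namespace and KSA name exactly.
    member = expected_member(project_id, namespace, name)
    if member not in binding_members:
        problems.append(f"no workloadIdentityUser binding for {member}")
    return problems
```

The check is deliberately strict string equality, because that is how the namespace and KSA name in the member string are matched.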

4. Fine-grained access with IAM conditions and custom roles

A predefined role such as roles/storage.objectViewer, granted at the project level, applies to every bucket in the project and is almost always too broad. Tighten the grant in two ways.

IAM Conditions scope a grant to specific resources or contexts using CEL. For example, restrict object reads to one bucket:

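# Match the bucket itself or objects under it; a bare prefix would also match my-app-bucket-2
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:app-gsa@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer" \
  --condition='expression=resource.name == "projects/_/buckets/my-app-bucket" || resource.name.startsWith("projects/_/buckets/my-app-bucket/"),title=only-app-bucket'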
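One subtlety with prefix-style conditions is that a prefix without a trailing delimiter also matches sibling buckets whose names merely start with the same characters. The sketch below uses plain Python string operations, not a CEL evaluator, to illustrate the difference; it assumes CEL's startsWith behaves as an ordinary string-prefix test, and the function names are illustrative.

```python
def bare_prefix_match(resource_name: str, bucket: str) -> bool:
    """Mimics startsWith on the bucket path with no trailing delimiter."""
    return resource_name.startswith(f"projects/_/buckets/{bucket}")


def bucket_scoped_match(resource_name: str, bucket: str) -> bool:
    """Matches the bucket itself or anything under it, but not sibling buckets."""
    base = f"projects/_/buckets/{bucket}"
    return resource_name == base or resource_name.startswith(base + "/")
```

With bucket "my-app-bucket", the bare prefix also accepts an object in "my-app-bucket-backup", while the scoped check rejects it and still accepts the bucket and its own objects.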

Custom roles pare permissions down to the exact set of API calls a workload makes. Define them in YAML and create at project or org level:

title: "App Object Reader"
stage: "GA"
includedPermissions:
  - storage.objects.get
  - storage.objects.list

gcloud iam roles create appObjectReader \
  --project=PROJECT_ID \
  --file=role.yaml

Conditions are evaluated on the GSA’s access in the impersonation model, and on the federated principal directly in the KSA-only model. Either way, condition the application grant, not the workloadIdentityUser grant.

5. The KSA-only federation model (no GSA)

Newer GKE versions let you skip the GSA entirely and grant IAM roles straight to the federated principal. This removes the impersonation hop, the extra identity to manage, and the annotation. For a KSA app-ksa in namespace apps:

gcloud projects add-iam-policy-binding PROJECT_ID \
  --role="roles/storage.objectViewer" \
  --member="principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/apps/sa/app-ksa"

Note the differences from the impersonation binding:

Aspect	Impersonation (GSA)	Direct (KSA-only)
Member format	`serviceAccount:PROJECT_ID.svc.id.goog[ns/ksa]`	`principal://.../subject/ns/NS/sa/KSA`
Uses `PROJECT_NUMBER`	No	Yes (in the principal path)
KSA annotation	Required	Not required
Extra GSA to manage	Yes	No

With the direct model there is no annotation on the KSA. The pod still needs serviceAccountName: app-ksa, and that is the whole configuration on the Kubernetes side. Some Google API client paths still expect a GSA email (and a handful of integrations require one), so verify your specific APIs, but for the common cases the KSA-only model is cleaner and is the right default going forward.
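Because the direct-model member mixes the project number and the project ID in one long path, it is easy to get wrong by hand. A minimal sketch of a builder follows; the function name and its validation rules are assumptions for illustration, and the output format is the principal:// path shown above.

```python
def direct_principal(project_number: str, project_id: str,
                     namespace: str, ksa_name: str) -> str:
    """Build the principal:// member for a KSA in the direct (KSA-only) model."""
    # The first path segment takes the numeric project number, not the ID.
    if not project_number.isdigit():
        raise ValueError("project_number must be the numeric project number")
    for label, value in (("namespace", namespace), ("ksa_name", ksa_name)):
        if not value or "/" in value:
            raise ValueError(f"{label} must be non-empty and contain no '/'")
    return (f"principal://iam.googleapis.com/projects/{project_number}"
            f"/locations/global/workloadIdentityPools/{project_id}.svc.id.goog"
            f"/subject/ns/{namespace}/sa/{ksa_name}")
```

Passing a project ID where the number belongs raises immediately, which is cheaper than discovering the mistake as a 403 at runtime.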

6. Per-namespace isolation and the default SA trap

Treat the namespace as your identity boundary. Each team/app gets a dedicated KSA in its own namespace, bound to its own least-privilege IAM. Never bind sensitive roles to a default KSA, because every pod that forgets serviceAccountName silently inherits it.

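Defang the default KSA in each namespace so a pod that omits serviceAccountName gets no ambient Kubernetes API credentials. Note that disabling token automount does not stop the GKE metadata server from minting Google tokens for such a pod, because the metadata server identifies the pod itself rather than reading the mounted token. The Google-side defense is still to grant no IAM roles to the default KSA's principal: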

apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: apps
automountServiceAccountToken: false

Then audit which pods run as which KSA:

kubectl get pods -A \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,KSA:.spec.serviceAccountName'

Anything showing KSA: default (or <none>) is a pod that is not using a scoped identity. Fix it before it ships.
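For a cluster with many pods, the same audit can be scripted. This sketch assumes you have saved the output of kubectl get pods -A -o json, which is a list object with an items array; the function name is hypothetical.

```python
import json


def pods_using_default_ksa(pods_json: str) -> list:
    """Return (namespace, pod) pairs running as the default KSA or with none set."""
    flagged = []
    for item in json.loads(pods_json).get("items", []):
        # A missing or empty serviceAccountName means the pod runs as "default".
        ksa = item.get("spec", {}).get("serviceAccountName") or "default"
        if ksa == "default":
            meta = item.get("metadata", {})
            flagged.append((meta.get("namespace", ""), meta.get("name", "")))
    return flagged
```

Running it in CI against a staging cluster gives a list you can fail the build on, rather than a table someone has to read.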

7. Debugging: metadata 403s, missing annotations, DNS and firewall

When access fails, work the path in order from the most common failure to the rarest. First, run an interactive pod as the target KSA and probe the metadata server:

kubectl run -it --rm wi-debug \
  --image=google/cloud-sdk:slim \
  --namespace=apps \
  --overrides='{"spec":{"serviceAccountName":"app-ksa"}}' \
  -- bash

Inside the pod, confirm which identity the metadata server reports and that a token can be minted:

# Which identity does the pod actually resolve to?
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"

# Can it mint a token? (JSON containing access_token = good; an error body = misconfig)
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

# What gcloud sees
gcloud auth list

Map the symptom to the cause:

Symptom	Most likely cause
Email returns the default GCE SA, not your GSA	Node pool not on `GKE_METADATA`, or pod using wrong KSA
`email` correct but token request 403	Missing/incorrect `workloadIdentityUser` binding
`curl: (6) Could not resolve host: metadata.google.internal`	DNS / NetworkPolicy blocking the metadata server
API call 403 but token mints fine	Token works; the GSA/principal lacks the application role
Annotation present but ignored	Typo in annotation key `iam.gke.io/gcp-service-account`

Specific gotchas to check:

Annotation key and value. The key is exactly iam.gke.io/gcp-service-account and the value is the full GSA email, not the short name. A trailing space or wrong project here fails silently.
Member string mismatch. In the impersonation binding, namespace and KSA name inside the brackets must match the pod’s actual namespace and serviceAccountName character-for-character.
NetworkPolicy / firewall. A default-deny egress NetworkPolicy will block 169.254.169.254. Egress to the metadata server must be allowed. Likewise, hardened nodes that block link-local traffic break the metadata path.
Node-pool propagation. After --workload-metadata=GKE_METADATA, nodes are recreated. Pods scheduled on old nodes during rollout still hit the GCE metadata server. Confirm the pod’s node is new.
Eventual consistency. Fresh IAM bindings can take a short while to propagate; a 403 that resolves itself in a minute or two is propagation, not your config.
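To tell propagation delay apart from a real misconfiguration, automated checks can retry for a bounded time before failing. The following is a minimal sketch of a retry loop with exponential backoff; check is any zero-argument callable you supply (for example, one that attempts the API call and returns True on success), and the attempt count and delays are arbitrary assumptions.

```python
import time


def wait_for_access(check, attempts: int = 6, initial_delay: float = 5.0,
                    sleep=time.sleep) -> bool:
    """Call check() until it returns True, doubling the delay between tries."""
    delay = initial_delay
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return False
```

If the check still fails after the last attempt, treat it as a configuration problem and go back through the list above.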

8. Auditing effective permissions

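Verify what an identity can actually do, not what you think you granted. Start by reading the policies in play: the IAM policies inherited from the project's ancestors, and the GSA's own policy, which shows who can impersonate it: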

gcloud projects get-ancestors-iam-policy PROJECT_ID  # context
gcloud iam service-accounts get-iam-policy \
  app-gsa@PROJECT_ID.iam.gserviceaccount.com  # who can impersonate

Use Policy Analyzer to ask “who has access to what” across the resource hierarchy:

gcloud asset analyze-iam-policy \
  --organization=ORG_ID \
  --identity="serviceAccount:app-gsa@PROJECT_ID.iam.gserviceaccount.com"

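Then confirm real usage in Cloud Audit Logs. Data Access audit logs are disabled by default for most services, so enable them first for the APIs you care about. Once enabled, they show the impersonation and the downstream API calls carrying the GSA (or federated principal) as the authentication info: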

gcloud logging read \
  'protoPayload.authenticationInfo.principalEmail="app-gsa@PROJECT_ID.iam.gserviceaccount.com"' \
  --limit=20 \
  --format='table(timestamp, protoPayload.methodName, resource.type)'
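If you export the same query with --format=json instead of a table, the entries can be summarized in a few lines. This sketch assumes the export is a JSON array of log entries with the protoPayload fields used in the filter above; the function name is hypothetical.

```python
import json
from collections import Counter


def methods_by_principal(entries_json: str, principal: str) -> Counter:
    """Count API method names called by one principal in exported log entries."""
    counts = Counter()
    for entry in json.loads(entries_json):
        payload = entry.get("protoPayload", {})
        email = payload.get("authenticationInfo", {}).get("principalEmail")
        if email == principal:
            counts[payload.get("methodName", "<unknown>")] += 1
    return counts
```

Comparing the methods that actually appear against the permissions you granted is a quick way to spot roles that can be narrowed further.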

Enterprise scenario

A payments platform team ran a multi-tenant GKE cluster where each tenant got its own namespace, and they’d standardized on the KSA-only direct model. A new tenant’s pods could mint a token (/token returned 200) but every Cloud Storage call came back 403 PERMISSION_DENIED, even though the principal:// binding looked identical to working tenants. The grant had been applied with the literal string PROJECT_ID.svc.id.goog instead of the cluster’s actual workload pool, and worse, the principal path embedded the wrong PROJECT_NUMBER (the team had copy-pasted from a sibling project). Because the direct-model member is an opaque string, IAM accepts a malformed principal happily and silently grants nothing.

The fix was to derive both values programmatically instead of hand-editing them, then re-apply:

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --role="roles/storage.objectViewer" \
  --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/tenant-42/sa/app-ksa" \
  --condition='expression=resource.name == "projects/_/buckets/tenant-42-data" || resource.name.startsWith("projects/_/buckets/tenant-42-data/"),title=tenant-42-bucket'

They then added a CI guard that rejects any IAM diff whose principal:// path doesn’t resolve to the live project number. The broader lesson: in the KSA-only model there is no annotation and no GSA email to typo-check against, so the principal string itself becomes the single point of failure. Generate it; never type it.
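The team's actual guard is not shown here, but a check in that spirit might look like the following sketch. It parses a principal:// member with a regular expression and compares the embedded project number and workload pool against values looked up from the live project. The function name, the regex, and the error messages are all hypothetical.

```python
import re

# Shape of a GKE direct-model principal, as used in the binding above.
PRINCIPAL_RE = re.compile(
    r"^principal://iam\.googleapis\.com/projects/(?P<number>\d+)"
    r"/locations/global/workloadIdentityPools/(?P<pool>[^/]+)"
    r"/subject/ns/(?P<namespace>[^/]+)/sa/(?P<ksa>[^/]+)$"
)


def check_principal(member: str, live_project_number: str,
                    live_project_id: str) -> list:
    """Return a list of problems with a principal:// member; empty means OK."""
    match = PRINCIPAL_RE.match(member)
    if match is None:
        return ["member is not a well-formed GKE principal:// string"]
    problems = []
    if match["number"] != live_project_number:
        problems.append(f"project number {match['number']} does not match "
                        f"live project number {live_project_number}")
    expected_pool = f"{live_project_id}.svc.id.goog"
    if match["pool"] != expected_pool:
        problems.append(f"workload pool {match['pool']} does not match {expected_pool}")
    return problems
```

Both mistakes from the scenario are caught: an unsubstituted PROJECT_ID.svc.id.goog fails the pool comparison, and a project number copied from a sibling project fails the number comparison.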

Verify

Run this end-to-end check from a pod bound to the KSA. All four should succeed:

# 1. Metadata server returns the intended identity
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"

# 2. A token mints (HTTP 200)
curl -s -o /dev/null -w "%{http_code}\n" -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"

# 3. gcloud auth shows the active principal
gcloud auth list --filter=status:ACTIVE --format="value(account)"

# 4. A real, least-privilege API call succeeds (and nothing more does)
gcloud storage ls gs://my-app-bucket

If all four pass and an out-of-scope call (for example, listing a different bucket) is denied, your least-privilege wiring is correct.

Checklist

Cluster has --workload-pool=PROJECT_ID.svc.id.goog
Every node pool runs with --workload-metadata=GKE_METADATA
KSA exists in the workload’s own namespace (not default)
Impersonation: KSA principal granted roles/iam.workloadIdentityUser on the GSA, KSA annotated with the GSA email
Direct model: federated principal:// granted application roles, no annotation
Application roles scoped with IAM Conditions or custom roles
Pods set serviceAccountName; default KSA token automount disabled
Metadata token-mint check passes from a pod bound to the KSA
Effective access reviewed in Policy Analyzer and confirmed in Audit Logs

Pitfalls and next steps

The two failure modes that account for most lost hours are forgetting the node-pool GKE_METADATA flag and a namespace/KSA mismatch in the IAM member string. Both produce confident-looking config that simply does not work, so when in doubt, exec into a pod and read the identity straight from the metadata server rather than reasoning about it.

For next steps: migrate any remaining mounted SA-key secrets off the cluster (search for secretKeyRef entries feeding GOOGLE_APPLICATION_CREDENTIALS), prefer the KSA-only direct model for new workloads, and disable service account key creation at the org level with the iam.disableServiceAccountKeyCreation constraint so the leak vector cannot reappear. Workload Identity is only as strong as the least-privilege IAM behind it, so treat each KSA as a first-class principal with its own scoped, audited grants.

Written by Vinod
