eksctl create cluster gives you a control plane and some nodes. It does not give you a platform. The gap between a demo cluster and one that runs hundreds of services across thousands of pods comes down to four decisions you make early and rarely revisit cheaply: how identity flows to workloads, how the data plane allocates IPs, how nodes appear and disappear, and how you keep the whole thing current. This guide walks each one with the commands and manifests I actually ship.
Beyond eksctl create: the four decisions
A production EKS platform on AWS lives or dies on these:
| Decision | Legacy default | What scales |
|---|---|---|
| Cluster auth | aws-auth ConfigMap | Access entries (EKS access-management API) |
| Workload identity | IRSA (OIDC + per-SA role) | EKS Pod Identity (association API) |
| Pod networking | One secondary IP per pod, low pod density | VPC CNI prefix delegation |
| Node lifecycle | Managed node groups + Cluster Autoscaler | Karpenter with consolidation |
None of these are exotic. They are the boring, correct defaults for a cluster you intend to operate for years. Assume EKS 1.31+ throughout.
Step 1 — Cluster provisioning with access entries
The aws-auth ConfigMap was the original way to map IAM principals to Kubernetes RBAC. It is a single YAML blob with no validation: one bad edit locks every admin out of the cluster. The access-management API replaces it with first-class AWS resources you manage via the API, CLI, or IaC.
Create the cluster with the API-based authentication mode. With eksctl:
# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: platform-prod
  region: us-east-1
  version: "1.31"
accessConfig:
  authenticationMode: API_AND_CONFIG_MAP
  bootstrapClusterCreatorAdminPermissions: true
vpc:
  clusterEndpoints:
    publicAccess: true
    privateAccess: true
addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
  - name: eks-pod-identity-agent
eksctl create cluster -f cluster.yaml
API_AND_CONFIG_MAP lets both mechanisms coexist while you migrate; flip to API once nothing reads the ConfigMap. Grant a role cluster-admin via an access entry plus an access policy association:
aws eks create-access-entry \
--cluster-name platform-prod \
--principal-arn arn:aws:iam::111122223333:role/platform-admins
aws eks associate-access-policy \
--cluster-name platform-prod \
--principal-arn arn:aws:iam::111122223333:role/platform-admins \
--policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
--access-scope type=cluster
AWS-managed access policies (AmazonEKSClusterAdminPolicy, AmazonEKSAdminPolicy, AmazonEKSViewPolicy, and others) map to predictable RBAC. For namespace-scoped grants, set --access-scope type=namespace,namespaces=team-a,team-b. For anything bespoke, create an access entry of type STANDARD and bind your own RBAC by Kubernetes group.
The payoff: access is auditable in CloudTrail, expressible in Terraform (aws_eks_access_entry / aws_eks_access_policy_association), and a typo returns an API error instead of bricking RBAC.
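The namespace-scoped grant described above can be sketched as a small helper. This is a minimal sketch, not a definitive implementation: the run wrapper, the grant_namespace_view function, and the team-a-developers role are hypothetical names introduced for illustration. By default it only prints the commands, so it can be reviewed before anything touches AWS.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create an access entry for a principal and attach the AWS-managed view
# policy, scoped to a comma-separated list of namespaces.
# The subshell body "( ... )" keeps the variables local to the function.
grant_namespace_view() (
  cluster=$1
  principal_arn=$2
  namespaces=$3
  run aws eks create-access-entry \
    --cluster-name "$cluster" \
    --principal-arn "$principal_arn"
  run aws eks associate-access-policy \
    --cluster-name "$cluster" \
    --principal-arn "$principal_arn" \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
    --access-scope "type=namespace,namespaces=$namespaces"
)

# Hypothetical role name, used only as an example.
grant_namespace_view platform-prod \
  arn:aws:iam::111122223333:role/team-a-developers team-a,team-b
```

When run for real, create-access-entry is expected to fail if the entry already exists, so a production version would probably check for the entry first.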
Step 2 — Workload identity: IRSA to EKS Pod Identity
IRSA works: annotate a ServiceAccount with a role ARN, the pod gets a projected token, and the SDK exchanges it via the cluster’s OIDC provider. The operational cost shows up at scale. Every cluster needs its own IAM OIDC provider, and every role’s trust policy hardcodes that provider’s URL plus the SA sub. Replicate a workload across ten clusters and each role’s trust policy carries ten provider-specific statements (or you maintain ten copies of the role).
EKS Pod Identity removes the OIDC plumbing. A node-level agent (the eks-pod-identity-agent add-on) vends credentials, and a single API call associates a role with a (namespace, ServiceAccount) pair. The role’s trust policy points at the EKS service, not a cluster-specific OIDC URL.
The trust policy is identical across every cluster:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "pods.eks.amazonaws.com" },
"Action": ["sts:AssumeRole", "sts:TagSession"]
}
]
}
Create the association:
aws eks create-pod-identity-association \
--cluster-name platform-prod \
--namespace payments \
--service-account checkout-sa \
--role-arn arn:aws:iam::111122223333:role/checkout-app
The ServiceAccount needs no annotation — the binding lives in EKS, not on the SA. Application code is unchanged: the AWS SDK (a recent version) resolves Pod Identity credentials transparently.
A practical migration sequence:
- Install the eks-pod-identity-agent add-on.
- For one workload, retarget its IAM role trust policy to pods.eks.amazonaws.com and create the association.
- Roll the pods, confirm AWS calls still succeed, then remove the IRSA SA annotation.
- Repeat per workload; decommission the IAM OIDC provider only after the last IRSA consumer is gone.
Keep IRSA where you genuinely need cross-account sts:AssumeRole chains or non-EKS consumers of the same role. For in-cluster workloads, Pod Identity is the lower-maintenance default.
Step 3 — VPC CNI tuning: prefix delegation and beyond
The AWS VPC CNI gives every pod a routable VPC IP — great for native security groups and flow logs, brutal for IP exhaustion. By default each ENI carries one IP per pod, so pod density per node is capped by ENI/IP limits, and large nodes burn through a /24 fast.
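Prefix delegation assigns /28 prefixes (16 IPs each) to an ENI's address slots instead of single IPs, multiplying pod density and slashing EC2 API calls during scale-up. It requires Nitro-based instances. Enable it on the CNI's aws-node DaemonSet (if vpc-cni runs as a managed add-on, set the same variables through the add-on's configuration values so an add-on update cannot revert them):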
kubectl set env daemonset aws-node -n kube-system \
ENABLE_PREFIX_DELEGATION=true
# Warm capacity so pod scheduling never blocks on a slow ENI attach
kubectl set env daemonset aws-node -n kube-system \
WARM_PREFIX_TARGET=1
Prefix delegation also changes how you size the --max-pods value on each node — derive it from the instance’s ENI and prefix limits rather than leaving the old per-IP default. AWS publishes a max-pods-calculator helper for this; bake the result into your node bootstrap.
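To make the arithmetic behind that recalculation concrete, here is a minimal sketch of the formula as I understand it from AWS's calculator script. The function names are hypothetical, and the ENI, IP, and vCPU figures have to come from the EC2 instance-type documentation; the example uses m5.large (3 ENIs, 10 IPv4 addresses per ENI, 2 vCPUs). The 110/250 cap reflects the AWS recommendation of 110 pods below 30 vCPUs and 250 otherwise.

```shell
# Upper bound on pod IPs the CNI can address on one node.
# One slot per ENI is the ENI's own primary IP; each remaining slot holds a
# single IP, or a /28 prefix (16 IPs) when prefix delegation is on.
# The +2 accounts for host-network pods (aws-node, kube-proxy).
raw_max_pods() (
  enis=$1
  ips_per_eni=$2
  prefix_delegation=$3
  per_slot=1
  if [ "$prefix_delegation" = "true" ]; then
    per_slot=16
  fi
  echo $(( enis * (ips_per_eni - 1) * per_slot + 2 ))
)

# Apply the recommended ceiling: 110 below 30 vCPUs, 250 otherwise.
recommended_max_pods() (
  enis=$1
  ips_per_eni=$2
  vcpus=$3
  prefix_delegation=$4
  raw=$(raw_max_pods "$enis" "$ips_per_eni" "$prefix_delegation")
  cap=110
  if [ "$vcpus" -ge 30 ]; then
    cap=250
  fi
  if [ "$raw" -lt "$cap" ]; then
    echo "$raw"
  else
    echo "$cap"
  fi
)

raw_max_pods 3 10 false           # prints 29
raw_max_pods 3 10 true            # prints 434
recommended_max_pods 3 10 2 true  # prints 110
```

The jump from 29 to 434 addressable pods is why the old per-IP value should not survive the switch: the real ceiling becomes the recommended cap, not the ENI arithmetic.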
Two adjacent features worth knowing:
- Custom networking places pods on a different subnet (and security group) than the node’s primary ENI, via ENIConfig CRDs. Reach for it when your node subnets are small and you want pods in a separate, larger CIDR, often a secondary VPC CIDR like 100.64.0.0/16.
- Security groups for pods lets you attach EC2 security groups directly to pods through a SecurityGroupPolicy, so database access rules target the pod, not the whole node. It requires ENABLE_POD_ENI=true on the CNI and is supported on a subset of (mostly Nitro) instance types.
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payments-db-access
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: checkout
  securityGroups:
    groupIds:
      - sg-0abc123def4567890
Prefix delegation is the one almost everyone needs; custom networking and security-groups-for-pods are situational. Turn them on only when a real constraint demands it — each adds moving parts to the data plane.
Step 4 — Node lifecycle with Karpenter
Cluster Autoscaler scales node groups you predefine: it can only add nodes of a shape you already declared, and it bin-packs poorly across many instance types. Karpenter watches for unschedulable pods and provisions right-sized nodes directly against EC2, picks instance types from a broad pool, and consolidates — replacing or removing nodes when workloads no longer justify them.
Two CRDs drive it. EC2NodeClass is the AWS-specific template (AMI, subnets, security groups, IAM role). NodePool is the scheduling policy and constraints.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-platform-prod"
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "platform-prod"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "platform-prod"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidationAfter: 1m
  limits:
    cpu: "1000"
Design notes from running this in anger:
- Let the pool be wide. Listing many instance families and both spot and on-demand gives Karpenter room to bin-pack cheaply and to ride out Spot interruptions by falling back to on-demand. Constrain only what the workload actually requires (arch, GPU, local NVMe).
- WhenEmptyOrUnderutilized is where the savings live. Karpenter will proactively replace a lightly-loaded node with a smaller/cheaper one. Protect pods that must not be evicted with the karpenter.sh/do-not-disrupt: "true" annotation and rely on PodDisruptionBudgets.
- Spot is safe for stateless tiers. With its interruption queue configured, Karpenter consumes the EC2 interruption signal and cordons/drains ahead of reclamation. Keep stateful or long-running jobs on on-demand via a separate NodePool.
- Use limits as a guardrail. A runaway controller creating pods can otherwise provision unbounded capacity; a CPU cap on the pool is your circuit breaker.
Install Karpenter via its Helm chart, ensuring the controller has its own IAM permissions (a Pod Identity association is the clean way) and that the node role is registered as an EKS access entry of type EC2_LINUX so nodes can join.
Step 5 — Managing core add-ons and the upgrade cadence
CoreDNS, kube-proxy, the VPC CNI, and the EBS CSI driver are EKS managed add-ons — version them through EKS rather than as loose manifests, so the control plane tracks compatibility.
List what an add-on supports for your cluster version, then update:
aws eks describe-addon-versions \
--addon-name aws-ebs-csi-driver \
--kubernetes-version 1.31 \
--query 'addons[].addonVersions[].addonVersion'
aws eks update-addon \
--cluster-name platform-prod \
--addon-name aws-ebs-csi-driver \
--addon-version v1.35.0-eksbuild.1 \
--resolve-conflicts PRESERVE
--resolve-conflicts PRESERVE keeps your field-level customizations (replica counts, tolerations) instead of clobbering them with add-on defaults. Use OVERWRITE deliberately, when you want to reset to defaults.
The EBS CSI driver needs IAM permissions to manage volumes — wire it with a Pod Identity association to its controller ServiceAccount rather than node-instance-profile permissions, so the blast radius stays narrow.
Upgrade cadence: upstream Kubernetes ships a new minor roughly every four months and EKS follows; each version has a standard support window (about 14 months on EKS), after which extended support charges apply. Schedule an upgrade for each new minor rather than a panicked annual jump across three versions. Control-plane upgrades move one minor at a time; you cannot skip a version.
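Because the control plane moves one minor at a time, a cluster that has fallen behind needs a sequence of upgrades rather than one. The sketch below lists that sequence; upgrade_path is a hypothetical helper, the 1.29 starting version is only an example, and it assumes both versions share the same major number.

```shell
# Print every minor version the control plane must pass through to get from
# the current version to the target, in order (current itself is excluded).
upgrade_path() (
  major=${1%%.*}
  current_minor=${1#*.}
  target_minor=${2#*.}
  path=""
  minor=$(( current_minor + 1 ))
  while [ "$minor" -le "$target_minor" ]; do
    # ${path:+ } adds a separating space only once path is non-empty.
    path="${path}${path:+ }${major}.${minor}"
    minor=$(( minor + 1 ))
  done
  echo "$path"
)

# Print (not run) the control-plane command for each hop.
for version in $(upgrade_path 1.29 1.32); do
  echo "aws eks update-cluster-version --name platform-prod --kubernetes-version $version"
done
```

Each hop still needs its own add-on and data-plane pass before the next one starts, which is the practical argument for not letting the gap grow.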
Step 6 — Ingress with the AWS Load Balancer Controller
The AWS Load Balancer Controller reconciles Kubernetes Ingress objects into ALBs and Service type: LoadBalancer into NLBs, with target-type ip registering pod IPs directly (no extra node hop). Give its controller an IAM role via Pod Identity, then drive everything with annotations:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout
  namespace: payments
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:111122223333:certificate/abcd-1234
    alb.ingress.kubernetes.io/healthcheck-path: /healthz
spec:
  ingressClassName: alb
  rules:
    - host: checkout.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout
                port:
                  number: 80
Use IngressGroups (alb.ingress.kubernetes.io/group.name) to merge multiple Ingress resources onto one shared ALB — otherwise every Ingress spins up its own load balancer and the bill (and ENI consumption) climbs fast.
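As a sketch of putting several Ingress resources on one shared ALB, the helper below sets the group annotation on existing Ingress objects. The run wrapper, join_ingress_group, the platform-shared group name, and the refunds Ingress are hypothetical; in a GitOps setup you would more likely add the annotation to the manifests themselves. It prints the commands unless DRY_RUN=0.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Put an Ingress into a named IngressGroup so it shares that group's ALB.
join_ingress_group() (
  namespace=$1
  ingress=$2
  group=$3
  run kubectl annotate ingress "$ingress" -n "$namespace" \
    "alb.ingress.kubernetes.io/group.name=$group" --overwrite
)

join_ingress_group payments checkout platform-shared
join_ingress_group payments refunds platform-shared   # hypothetical second Ingress
```

Changing the group of a live Ingress can move it to a different load balancer, so plan for the address to change and update DNS accordingly.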
Enterprise scenario
A fintech platform team ran 40+ services on a single EKS cluster and started seeing pods stuck Pending during morning traffic ramps, but only on their m6i.4xlarge nodes, never the smaller ones. The constraint wasn’t compute; CPU and memory sat at 50%. It was IP exhaustion masked by a subtle interaction: they had enabled ENABLE_PREFIX_DELEGATION=true on the VPC CNI but never recalculated --max-pods, which Karpenter was still deriving from the old per-IP ENI formula. So a node advertised capacity for ~110 pods, but the CNI could only attach enough /28 prefixes for ~58 before hitting the per-instance ENI limit. The scheduler kept binding pods to the node; the CNI kept failing IP allocation, leaving pods wedged.
The fix was to make Karpenter compute --max-pods consistently with prefix delegation by setting maxPods explicitly in the EC2NodeClass kubelet config, derived from AWS’s max-pods-calculator --cni-version 1.x --instance-type m6i.4xlarge --cni-prefix-delegation-enabled:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # role, amiSelectorTerms, and selector terms stay as defined in Step 4
  kubelet:
    maxPods: 110
After applying it, Karpenter drifted the old nodes out under PDBs and the Pending storm disappeared. The lesson: prefix delegation and --max-pods are one decision, not two — and Karpenter’s advertised capacity must agree with what the CNI can physically allocate, or the scheduler will happily overcommit IPs you don’t have.
Verify
Confirm each layer before declaring the platform ready:
# Auth: access entries resolve, no stale aws-auth dependency
aws eks list-access-entries --cluster-name platform-prod
# Pod Identity: agent running, associations present
kubectl get daemonset eks-pod-identity-agent -n kube-system
aws eks list-pod-identity-associations --cluster-name platform-prod
# VPC CNI: prefix delegation active
kubectl get daemonset aws-node -n kube-system -o yaml | grep -i ENABLE_PREFIX_DELEGATION
# Karpenter: pools healthy, nodes claimed
kubectl get nodepool,ec2nodeclass
kubectl get nodeclaim
# Add-ons: all ACTIVE on compatible versions
aws eks list-addons --cluster-name platform-prod
aws eks describe-addon --cluster-name platform-prod --addon-name vpc-cni \
--query 'addon.{v:addonVersion,status:status}'
# Ingress: ALB provisioned and address assigned
kubectl get ingress -A
A fast end-to-end identity smoke test: schedule a debug pod under a Pod-Identity-bound ServiceAccount and call STS.
kubectl run sts-check --rm -it --restart=Never \
--image=public.ecr.aws/aws-cli/aws-cli \
--overrides='{"spec":{"serviceAccountName":"checkout-sa"}}' \
-n payments -- sts get-caller-identity
The returned ARN should be the assumed role you associated — proof the credential chain works without any SA annotation.
Production checklist
Cost visibility, scaling limits, and the upgrade runbook
Cost visibility. Enable Split Cost Allocation Data for EKS in the billing console to attribute shared node cost down to pods by namespace and label — this is what turns “the cluster costs $X” into per-team chargeback. Tag NodePool-provisioned instances (via EC2NodeClass tags) so Cost Explorer can group by team. Karpenter consolidation is the single biggest lever on the compute line item; measure node utilization before and after enabling it.
Scaling limits to respect. IP space is the usual wall — even with prefix delegation, plan VPC CIDRs (and secondary CIDRs / custom networking) for peak pod count, not today’s. Watch per-node --max-pods, per-ENI prefix limits, and Karpenter’s own controller throughput when scaling thousands of nodes. Service quotas (ENIs per region, EBS volumes, ELBs) bite at the data-plane edges before the control plane does.
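A rough way to size pod subnets for peak load is to count /28 prefixes instead of single addresses. The sketch below is a back-of-the-envelope estimate under stated assumptions: the function names are hypothetical, the /20 subnet and 110 pods per node are example inputs, and it ignores the primary IP each ENI consumes, the addresses AWS reserves in every subnet, and fragmentation that can leave no contiguous /28 free. Treat the result as a ceiling, not a guarantee.

```shell
# /28 prefixes one node holds at a given pod count, including the warm
# spares kept by WARM_PREFIX_TARGET (rounding pods up to whole prefixes).
prefixes_per_node() (
  pods=$1
  warm_prefixes=$2
  echo $(( (pods + 15) / 16 + warm_prefixes ))
)

# Upper bound on /28 blocks inside a subnet with the given mask length
# (mask must be 28 or shorter).
prefixes_in_subnet() (
  mask=$1
  echo $(( 1 << (28 - mask) ))
)

# Rough ceiling on fully packed nodes that one subnet can back.
nodes_per_subnet() (
  mask=$1
  pods=$2
  warm_prefixes=$3
  echo $(( $(prefixes_in_subnet "$mask") / $(prefixes_per_node "$pods" "$warm_prefixes") ))
)

prefixes_per_node 110 1    # prints 8
prefixes_in_subnet 20      # prints 256
nodes_per_subnet 20 110 1  # prints 32
```

Under these assumptions a /20 backs at most 32 nodes at 110 pods each, which is the kind of number to compare against Karpenter's NodePool limits and your peak forecast.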
Cluster-upgrade runbook (one minor at a time):
- Read the EKS release notes and Kubernetes deprecation guide for the target minor; scan workloads for removed APIs (kubectl deprecation warnings, or a tool like pluto).
- Upgrade add-ons first to versions compatible with the target Kubernetes version.
- Upgrade the control plane: aws eks update-cluster-version --name platform-prod --kubernetes-version 1.32.
- Roll the data plane: for Karpenter, bump the EC2NodeClass AMI alias and let consolidation/drift recycle nodes gracefully under PDBs; for managed node groups, do a rolling update.
- Re-run the Verify section end to end.
- Confirm no workload is pinned to a now-removed API and that HPA/Karpenter still react to load.
Pitfalls to avoid
- Flipping authenticationMode to API too early. Anything still reading aws-auth (some older controllers, bootstrap scripts) loses access. Migrate, verify, then drop CONFIG_MAP.
- Leaving IRSA annotations alongside a Pod Identity association. Mixed signals on the same SA cause confusing credential precedence. Pick one per workload.
- Skipping the --max-pods recalculation after prefix delegation. You either under-utilize big nodes or oversubscribe IPs and stall scheduling.
- Per-Ingress ALBs. Without IngressGroups, dozens of load balancers appear silently and dominate the bill.
- Karpenter with no limits and no PDBs. One is a cost safety net, the other prevents consolidation from evicting pods that can’t tolerate it.
Get identity, networking, node lifecycle, and add-on hygiene right at the start, and EKS becomes a platform your teams build on without thinking about it — which is exactly the point.