Autoscaling on Kubernetes is three independent control loops stacked on top of each other, and most outages happen at the seams between them. This guide wires up all three — pod-level HPA on custom/external metrics, KEDA for event-driven and scale-to-zero workloads, and node autoscaling with both Cluster Autoscaler and Karpenter — then tunes them so they cooperate instead of fight.
The three layers, and why the order matters
| Layer | Controller | Scales | Reacts to |
|---|---|---|---|
| Pod replicas | HPA / KEDA | replica count of a Deployment | CPU, memory, custom, external metrics |
| Pod requests | VPA | per-pod CPU/memory requests | historical usage |
| Nodes | Cluster Autoscaler / Karpenter | the node count / shape | unschedulable (Pending) pods |
The causal chain runs top-down: a metric crosses a threshold, the HPA (or KEDA-managed HPA) adds replicas, those replicas go Pending because the cluster is full, and only then does the node autoscaler add capacity. Your end-to-end scale-up latency is the sum of all three loops — typically HPA sync (15s default) + scheduler + node provisioning (30s–several minutes). Internalizing that sum is the whole game.
Prerequisite: metrics-server must be running for any CPU/memory HPA. AKS and GKE ship it managed; on EKS it is not deployed by default, so install it yourself. Verify with kubectl top nodes returning numbers, not an error.
1. HPA beyond CPU: memory, custom, and external metrics
The v2 HPA API (autoscaling/v2) takes a list of metrics and scales to satisfy the most demanding one. Start with the two built-in resource metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization        # % of the pod's CPU *request*
          averageUtilization: 65
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue       # absolute, not %, for memory
          averageValue: 600Mi
Utilization targets are a percentage of the resource request, not the limit. If your requests are wrong, your HPA math is wrong. This is the single most common HPA misconfiguration.
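To see why the request matters so much, it may help to work the HPA's documented replica formula by hand. The inputs below (10 replicas each using 400m of CPU, against the 65% target above) are illustrative assumptions, not measurements from a real workload:

```latex
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentUtilization}}{\text{targetUtilization}} \right\rceil

% Request set to 500m: utilization = 400m / 500m = 80\%
\left\lceil 10 \times \tfrac{80}{65} \right\rceil = \lceil 12.31 \rceil = 13

% Request set to 1000m: utilization = 400m / 1000m = 40\%
\left\lceil 10 \times \tfrac{40}{65} \right\rceil = \lceil 6.15 \rceil = 7
```

Same pods, same real usage: one request value makes the HPA scale up to 13, the other makes it scale down to 7.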
CPU and memory rarely correlate with what users actually feel. To scale on a real signal — requests-per-second, p95 latency, queue depth — you need the custom metrics or external metrics API, served by an adapter. The canonical choice is the Prometheus Adapter.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  -n monitoring --create-namespace \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=80
The adapter exposes a rule-defined PromQL series as a Kubernetes metric. Scale on per-pod RPS:
# adapter rule (values.yaml -> rules.custom)
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: { resource: namespace }
          pod: { resource: pod }
      name:
        matches: "http_requests_total"
        as: "http_requests_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

# the HPA consuming it
metrics:
  - type: Pods
    pods:
      metric: { name: http_requests_per_second }
      target:
        type: AverageValue
        averageValue: "50"   # aim for ~50 rps per pod
Use type: Pods when the metric is per-replica (HPA divides total by replica count for you). Use type: External for a metric that is not attached to your pods — a cloud queue length, a third-party SLO — where the adapter (or KEDA, below) talks to the source directly.
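For comparison, an External metric entry in the same HPA might look roughly like the sketch below. The metric name queue_messages_ready and the queue: orders label are hypothetical; the real names depend entirely on what your adapter exposes through the external metrics API.

```yaml
# Sketch only: metric name and selector labels are assumptions,
# not names any particular adapter is guaranteed to expose.
metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready      # hypothetical external metric
        selector:
          matchLabels:
            queue: orders               # hypothetical label
      target:
        type: AverageValue              # total metric value divided by replica count
        averageValue: "30"              # aim for ~30 queued messages per replica
```

With AverageValue the HPA divides the single external value by the current replica count; use type: Value instead when the target should apply to the raw total.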
2. Event-driven scaling with KEDA
HPA is a closed loop on a steady-state metric. KEDA is the right tool when work arrives as discrete events — a queue backlog, Kafka consumer lag, a cron window — and especially when you want scale-to-zero, which a plain HPA cannot do (HPA minReplicas is >= 1).
KEDA installs an operator plus a metrics adapter. Under the hood it creates and manages an HPA for you from a ScaledObject; you do not write the HPA by hand.
helm repo add kedacore https://kedacore.github.io/charts
helm upgrade --install keda kedacore/keda -n keda --create-namespace
A queue-driven worker that idles at zero and bursts on backlog (Azure Service Bus shown; the pattern is identical for SQS, Pub/Sub, RabbitMQ):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker
  namespace: shop
spec:
  scaleTargetRef:
    name: order-worker          # the Deployment
  minReplicaCount: 0            # scale to zero when idle
  maxReplicaCount: 100
  pollingInterval: 15           # how often KEDA checks the source (s)
  cooldownPeriod: 120           # wait before scaling back to zero (s)
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "20"      # target backlog per replica
      authenticationRef:
        name: sb-auth           # TriggerAuthentication (workload identity / secret)
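The sb-auth object referenced above is not shown in the original. A minimal secret-based sketch could look like this, assuming a Kubernetes Secret named servicebus-secret with a connection-string key already exists in the namespace (both names are hypothetical):

```yaml
# Sketch only: the Secret name and key are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: sb-auth
  namespace: shop
spec:
  secretTargetRef:
    - parameter: connection         # the Service Bus scaler's connection-string parameter
      name: servicebus-secret       # hypothetical Secret
      key: connection-string        # hypothetical key inside that Secret
```

Workload identity avoids storing a connection string at all and is usually preferable where the platform supports it; the secret form is shown only because it is the shortest to read.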
Kafka consumer lag is the other workhorse trigger:
triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.svc:9092
      consumerGroup: order-consumers
      topic: orders
      lagThreshold: "100"   # desired max lag per replica
Two more KEDA patterns worth knowing:
- cron trigger to pre-warm before a known peak (market open, batch window) instead of reacting after latency already spiked.
- ScaledJob instead of ScaledObject when each message should map to a finite Job run rather than a long-lived Deployment replica; ideal for non-idempotent batch processing.
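As a rough illustration of the ScaledJob pattern, the sketch below launches one Job per queued message. The object name, container name, and image are hypothetical, and it reuses the orders queue and sb-auth reference from the ScaledObject example:

```yaml
# Sketch only: name, image, and limits are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: order-batch                 # hypothetical
  namespace: shop
spec:
  jobTargetRef:                     # a regular batch/v1 Job spec
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: registry.example.com/order-batch:1.0   # hypothetical image
  pollingInterval: 15               # seconds between queue checks
  maxReplicaCount: 50               # cap on concurrently running Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "1"           # roughly one Job per pending message
      authenticationRef:
        name: sb-auth
```

Each Job runs to completion and exits, so there is no long-lived replica for a scale-down to interrupt mid-message.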
Scale-to-zero cuts cost but adds cold-start latency: the first event must wait for a node (maybe), an image pull, and app start. For latency-sensitive paths keep minReplicaCount: 1. Reserve zero for genuinely bursty, latency-tolerant work.
3. Tuning behavior: stabilization, policies, no flapping
Default HPA behavior scales up fast and down slow (a 300s downscale stabilization window). That asymmetry is deliberate — over-provisioning briefly is cheap; thrashing is expensive. Tune it explicitly via spec.behavior:
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react immediately on the way up
      policies:
        - type: Percent
          value: 100                     # at most double
          periodSeconds: 30
        - type: Pods
          value: 8                       # ...or +8 pods
          periodSeconds: 30
      selectPolicy: Max                  # take the more aggressive of the two
    scaleDown:
      stabilizationWindowSeconds: 300    # consider the last 5 min of recommendations
      policies:
        - type: Percent
          value: 20                      # shed at most 20% per minute
          periodSeconds: 60
The downscale stabilization window makes the HPA pick the highest recommendation it computed over the window before acting; that is what kills flapping. If your traffic is spiky and pods still oscillate, widen scaleDown.stabilizationWindowSeconds and lower the per-period Percent before you touch thresholds. KEDA passes an advanced.horizontalPodAutoscalerConfig.behavior block straight through to the HPA it manages, so the same knobs apply to event-driven workloads.
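On the KEDA side, that pass-through might look like the following sketch, which applies the same scale-down settings to the order-worker ScaledObject from section 2 (the trigger is repeated only so the object is complete):

```yaml
# Sketch only: reuses the earlier order-worker example; values are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker
  namespace: shop
spec:
  scaleTargetRef:
    name: order-worker
  minReplicaCount: 0
  maxReplicaCount: 100
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:                            # copied into the generated HPA's spec.behavior
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 20
              periodSeconds: 60
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "20"
      authenticationRef:
        name: sb-auth
```

The behavior block governs scaling between 1 and N replicas; the final step from 1 to 0 is still controlled by KEDA's own cooldownPeriod.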
4. Node autoscaling: Cluster Autoscaler vs Karpenter
Both react to the same trigger — Pending pods the scheduler cannot place — but they differ fundamentally in how they pick capacity.
| | Cluster Autoscaler (CA) | Karpenter |
|---|---|---|
| Unit of scaling | a node group (ASG / VMSS / MIG) you pre-define | individual nodes, instance type chosen at provision time |
| Instance selection | fixed per group | from a flexible set; picks cheapest that fits |
| Speed | slower (group scale, then schedule) | faster (provisions the node the pod needs) |
| Bin-packing | limited | active consolidation built in |
| Availability | every managed K8s | EKS first-class; expanding to others |
Cluster Autoscaler is the universal default. On AKS it’s a cluster toggle; the autoscaler watches your node pools’ min/max:
az aks nodepool update -g rg-shop --cluster-name aks-shop -n apps \
  --enable-cluster-autoscaler --min-count 3 --max-count 30
CA only scales node groups it owns and assumes all nodes in a group are interchangeable, so it works best with a handful of well-sized, single-instance-type pools. For node consolidation it removes a node only when its pods can be rescheduled elsewhere and it has sat under-utilized past --scale-down-unneeded-time.
Karpenter discards the node-group abstraction. You declare constraints (a NodePool) and a provisioning template (EC2NodeClass on AWS); Karpenter computes the cheapest instance(s) that satisfy pending pods and launches them directly.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]     # prefer spot, fall back
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]        # let it pick Graviton when it fits
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                             # hard ceiling across this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidationAfter: 1m
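The NodePool points at an EC2NodeClass named default that the original does not show. A minimal sketch of one is below; the IAM role name and the karpenter.sh/discovery tag value are hypothetical and must match whatever your cluster setup actually created:

```yaml
# Sketch only: role name and discovery tag values are assumptions.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-shop            # hypothetical IAM role for the nodes
  amiSelectorTerms:
    - alias: al2023@latest                # Amazon Linux 2023; pin a version in production
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: shop      # hypothetical tag on the cluster's subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: shop      # hypothetical tag on the node security group
```

The NodePool decides what kind of node is acceptable; the EC2NodeClass decides how that node is built (AMI, network placement, IAM).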
5. Bin-packing, consolidation, and Spot safety
Karpenter’s real value is consolidation: it continuously re-evaluates whether the current fleet is the cheapest way to host current pods, and will replace several small nodes with one larger node, or swap an on-demand node for a cheaper instance, draining the old one safely. That is bin-packing as a live process, not a one-time placement.
Two guardrails make this safe in production:
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  budgets:
    - nodes: "10%"                  # never voluntarily disrupt >10% of nodes at once
    - nodes: "0"                    # ...and zero during business hours
      schedule: "0 9 * * mon-fri"
      duration: 8h
And on the workload side, a PodDisruptionBudget is non-negotiable once you run Spot or enable consolidation — it is what stops a node drain from taking your service below quorum:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: checkout, namespace: shop }
spec:
  minAvailable: 2
  selector: { matchLabels: { app: checkout } }
Spot capacity can be reclaimed on short notice (roughly 30s on Azure and GCP, two minutes on AWS) or evaporate entirely (no capacity). Mitigate by: spreading across many instance types (let Karpenter choose), keeping critical singletons on on-demand, setting PDBs, and using topology spread so a single AZ/instance-type pull can’t drain a whole tier.
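The topology-spread part of that advice could be expressed on the checkout Deployment roughly as follows. The image and the maxSkew values are illustrative assumptions; the app: checkout label matches the PodDisruptionBudget selector above:

```yaml
# Sketch only: image and skew values are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: shop
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                     # keep zones within one pod of each other
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway              # prefer spread, never block scheduling
          labelSelector:
            matchLabels:
              app: checkout
        - maxSkew: 2
          topologyKey: node.kubernetes.io/instance-type  # avoid piling onto one instance type
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: checkout
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0       # hypothetical image
```

spec.replicas is left out on purpose so the HPA owns the replica count. Switching whenUnsatisfiable to DoNotSchedule makes the spread a hard rule, at the cost of Pending pods when a zone is short on capacity.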
6. Combining VPA with HPA safely
The Vertical Pod Autoscaler right-sizes requests; the HPA scales replica count. They collide when both act on the same resource: VPA raises the CPU request, which lowers CPU utilization (same usage / bigger request), which tells the HPA to scale down — a feedback loop that defeats both.
Rules that keep them from fighting:
- Never let VPA and HPA control the same metric. If the HPA scales on CPU, do not let VPA manage CPU.
- The clean split: HPA scales horizontally on a custom/external metric (RPS, queue depth); VPA right-sizes CPU and memory underneath it. No shared dimension, no loop.
- The narrower split: HPA on CPU; VPA on memory only, in recommendation or auto mode for memory, while leaving CPU to the HPA.
- Run VPA in updateMode: "Off" first to observe recommendations before letting it evict pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: checkout, namespace: shop }
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Initial"   # set requests at pod creation; don't evict running pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # HPA owns CPU; VPA owns memory only
7. Load-test the whole stack and read the timeline
A config that looks right on paper means nothing until you watch all three loops fire under load. Drive synthetic traffic and observe.
# generate load (k6 is convenient; hey/wrk/vegeta all work)
# -i without -t: the script arrives on stdin via the heredoc, so no TTY is available
kubectl run k6 --rm -i --restart=Never --image=grafana/k6 -- run - <<'EOF'
import http from 'k6/http';
export const options = { stages: [
  { duration: '2m', target: 200 },   // ramp
  { duration: '5m', target: 800 },   // sustained peak
  { duration: '3m', target: 0 },     // drain -> watch scale-down
]};
export default function () { http.get('https://checkout.shop.svc/health'); }
EOF
In separate panes, watch each loop and timestamp the transitions:
kubectl get hpa checkout -n shop -w # metric vs target, replica deltas
kubectl get pods -n shop -w # Pending -> ContainerCreating -> Running
kubectl get nodes -w # new nodes joining
kubectl get events -n shop --sort-by=.lastTimestamp | tail -30
kubectl describe hpa checkout -n shop # the why behind each decision
Reading the timeline end to end, you should be able to attribute every second: metric crossed at T+0 → HPA bumped replicas at T+~15s → pods Pending at T+18s → node autoscaler reacted → node Ready → pods Running → metric back under target. If a stage is slow, you now know exactly which loop to tune.
Enterprise scenario
A payments platform ran KEDA scale-to-zero on its settlement-batch workers (SQS-driven) backed by a Karpenter Spot pool. Every weekday at 17:00 a fan-out job dumped ~40k messages into the queue. KEDA correctly scaled the Deployment from 0 to ~120 replicas, but p99 settlement time blew past the SLA on the first few thousand messages. The cause was additive cold-start, not throughput: 0→1 forced a Karpenter node launch, a 1.2 GB image pull, and JVM warmup, and because SQS ApproximateNumberOfMessages lags by ~20–30s, KEDA itself reacted late. The Spot pool made it worse: diversified instance types meant variable boot times, and one launch hit InsufficientInstanceCapacity.
The fix was to stop reacting and start pre-warming. They added a second KEDA trigger with a cron schedule so capacity was in place before the 17:00 dump, while keeping the queue trigger for actual backlog:
minReplicaCount: 0
triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "55 16 * * 1-5"      # warm up at 16:55
      end: "30 18 * * 1-5"
      desiredReplicas: "30"       # floor during the window
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.../settlements
      queueLength: "50"
They also pinned the first 30 replicas to on-demand via a separate NodePool (Spot only above the floor) and pre-pulled the image with a DaemonSet. End-to-end p99 dropped back under SLA, and Spot still covered the long tail.
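The original does not show the pre-pull DaemonSet. One common shape for it is sketched below, where an init container references the worker image so every node caches it and a tiny pause container then keeps the pod alive. The names and the image are hypothetical, and the exit command assumes the image ships a shell:

```yaml
# Sketch only: names and image are assumptions; the init command assumes
# the worker image contains "sh".
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: settlement-prepull            # hypothetical
  namespace: shop
spec:
  selector:
    matchLabels:
      app: settlement-prepull
  template:
    metadata:
      labels:
        app: settlement-prepull
    spec:
      initContainers:
        - name: pull
          image: registry.example.com/settlement-worker:1.0   # hypothetical worker image
          command: ["sh", "-c", "exit 0"]                     # pull the image, then exit at once
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9                    # minimal container to keep the pod running
```

A DaemonSet only warms nodes that already exist, which is why it pairs with the cron trigger: the nodes launched at 16:55 get the image pulled before the 17:00 backlog arrives.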
Verify
kubectl top nodes # metrics-server returns data
kubectl get apiservices | grep metrics # custom/external metrics API registered
kubectl get hpa -A # TARGETS column shows current/target, not <unknown>
kubectl get scaledobject -A # KEDA objects; READY/ACTIVE = True
kubectl get hpa -A | grep keda-hpa # the HPAs KEDA generated exist (named keda-hpa-<scaledobject>, in the workload's namespace)
kubectl get nodepool,nodeclaim # Karpenter intent + provisioned nodes
kubectl get pdb -A # disruption budgets present for critical apps
A <unknown> in the HPA TARGETS column means the metrics pipeline is broken (adapter down, bad PromQL, or wrong label overrides) — fix that before tuning anything else, because the HPA is flying blind.
Production checklist
Pitfalls
- Wrong requests poison everything. Utilization HPAs and bin-packing both key off requests. Get them right (use VPA in observe mode to find them) before trusting any autoscaler.
- Forgetting the latency is additive. Three loops in series. If scale-up feels slow, profile each hop rather than blindly lowering thresholds.
- No PDB + Spot/consolidation = self-inflicted outage. The node layer will happily drain you to zero healthy replicas if nothing stops it.
- maxReplicas as a silent ceiling. Hitting the cap looks identical to “scaling is broken.” Alert on it.
- VPA evicting under load. updateMode: Auto can evict pods mid-spike. Prefer Initial/Off for anything user-facing until you trust the recommendations.
Get the three loops cooperating and the cluster becomes self-managing: it absorbs traffic spikes, drains queues to zero cost, and packs nodes tightly — without a human in the loop. The work is almost entirely in the tuning and the testing, not the YAML.