Right-Sizing Kubernetes Workloads: Vertical Pod Autoscaler, Resource Recommendations, and Bin-Packing Efficiency

Most Kubernetes clusters are simultaneously over-provisioned and unreliable: aggregate node utilization sits at 25-35% on a typical billing dashboard, yet pods still get OOMKilled and throttled. Both symptoms have the same root cause — requests and limits set by copy-paste, never by measurement. This guide fixes that with the Vertical Pod Autoscaler (VPA): how to gather recommendations safely, how to read the three numbers it produces, where it conflicts with the HPA, and how right-sizing feeds directly into better bin-packing and a smaller bill.

1. Requests vs limits, QoS, and the two failure modes

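Before touching VPA, internalize what the scheduler and kubelet actually do with these numbers, because VPA computes only one of them: the request. (By default it then scales any existing limit proportionally, preserving the request-to-limit ratio you chose.)

Requests: what the scheduler reserves on a node when it places the pod. They are a bookkeeping figure; the scheduler does not look at actual usage.
Limits: what the kubelet and kernel enforce at runtime. Exceeding a memory limit gets the container OOMKilled; reaching a CPU limit gets it throttled.

That asymmetry produces the two failure modes: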

| Mis-set value | Consequence | Who pays |
| --- | --- | --- |
| Requests too high | Capacity reserved but idle; nodes fill on paper at 30% real use | The bill |
| Requests too low | Pods crammed onto nodes, then evicted under node pressure | Reliability |
| Memory limit too low | OOMKill, restart, CrashLoopBackOff | Reliability |
| CPU limit too low | Silent CFS throttling, latency spikes | Latency SLOs |
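The CPU-limit row is the least intuitive of the four, so a rough model may help. The sketch below assumes the default 100 ms CFS period, a single-threaded burst that starts at the beginning of a period, and no other contention on the node. The function name and parameters are illustrative and not part of any Kubernetes API.

```python
def wall_time_ms(cpu_needed_ms: float, limit_millicores: int,
                 period_ms: float = 100.0) -> float:
    """Approximate wall-clock time for a single-threaded CPU burst under a CFS quota."""
    if cpu_needed_ms <= 0:
        return 0.0
    # A limit of 500m grants 50 ms of CPU time per 100 ms period.
    quota_ms = limit_millicores / 1000.0 * period_ms
    # One thread cannot consume more CPU time than the period itself.
    runnable_ms = min(quota_ms, period_ms)
    full_periods, remainder = divmod(cpu_needed_ms, runnable_ms)
    if remainder == 0:
        # The burst finishes exactly as the quota of its last period runs out.
        return (full_periods - 1) * period_ms + runnable_ms
    # Each exhausted quota costs a full period of wall time before work resumes.
    return full_periods * period_ms + remainder


print(wall_time_ms(80, 500))   # 80 ms of CPU work under a 500m limit
print(wall_time_ms(80, 1000))  # the same work under a 1000m limit
```

Under these assumptions, 80 ms of CPU work takes 130 ms of wall time with a 500m limit and 80 ms with a 1000m limit, even though the node may have had idle cores the whole time.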

QoS class is derived from these values and decides eviction order when a node runs out of memory:

| QoS class | Condition | Eviction priority |
| --- | --- | --- |
| Guaranteed | requests == limits for every container, CPU and memory | Evicted last |
| Burstable | at least one request set, but not Guaranteed | Middle |
| BestEffort | no requests or limits at all | Evicted first |

The single most important rule on this page: set memory requests == memory limits for anything you care about. It removes the burst headroom that lures you into OOMKills and makes memory behavior under node pressure predictable, because the pod can never use more memory than was reserved for it. For CPU, leave the limit off or set it generously: CPU is compressible, and a low CPU limit throttles you for no capacity benefit. Be aware that the Guaranteed QoS class requires CPU requests == limits as well, so a pod with no CPU limit is reported as Burstable even when its memory is pinned. VPA’s job is to find the right request number; you decide the request/limit relationship.
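The QoS rules in the table can be sketched in a few lines. This is a simplified model, not the kubelet's code: it compares quantity strings literally (so "1" and "1000m" would count as different), and the dict shape is an assumption chosen for illustration.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources.

    containers: list of dicts such as
        {"requests": {"cpu": "250m", "memory": "410Mi"}, "limits": {"memory": "410Mi"}}
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests") or {}
        limits = container.get("limits") or {}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit is present.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pinned_memory_no_cpu_limit = [{
    "requests": {"cpu": "250m", "memory": "410Mi"},
    "limits": {"memory": "410Mi"},
}]
print(qos_class(pinned_memory_no_cpu_limit))
```

The example pod, with memory pinned and the CPU limit left off, classifies as Burstable; it becomes Guaranteed only once a CPU limit equal to the CPU request is added.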

2. VPA architecture: recommender, updater, admission controller

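VPA is not built into Kubernetes. You install it from the autoscaler repo, and it ships as three independent components plus its CRDs:

Recommender: watches usage history and computes the recommended requests for each container.
Updater: evicts running pods whose requests have drifted too far from the recommendation so they can be recreated with new values.
Admission controller: a mutating webhook that rewrites the requests on pods as they are created.

Install the released manifests: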

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
# Generates webhook TLS certs and applies recommender + updater + admission controller
./hack/vpa-up.sh

kubectl get pods -n kube-system | grep vpa
# vpa-recommender-...           1/1   Running
# vpa-updater-...               1/1   Running
# vpa-admission-controller-...  1/1   Running

Prerequisite: metrics-server must be healthy — kubectl top pods has to return numbers. The recommender also benefits from a Prometheus history source for cold-start accuracy, but the default in-cluster checkpoint store works out of the box.

3. Run VPA in Off mode first — recommendation only

Never start with Auto. Deploy the VPA object in updateMode: "Off" so the recommender observes and reports, but nothing evicts or mutates your pods. This is pure, zero-risk telemetry.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"          # recommend only; do not touch pods

Let it run across at least one full traffic cycle — a week is sensible so it sees weekday peaks, the weekend trough, and any batch jobs. The recommender keeps a decaying histogram of usage, so longer is strictly better for the first pass.
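To build intuition for what a decaying histogram does, here is a simplified sketch. It weights each sample by an exponential decay and takes a weighted percentile. The 24-hour half-life is an assumption for illustration, and the real recommender works on bucketed histograms rather than raw samples, so treat this as a model of the idea only.

```python
def decayed_weight(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Weight of a sample that is age_hours old; halves every half_life_hours."""
    return 0.5 ** (age_hours / half_life_hours)


def weighted_percentile(samples, q: float, half_life_hours: float = 24.0):
    """samples: list of (age_hours, value). Returns the decayed q-quantile value."""
    if not samples:
        raise ValueError("need at least one sample")
    weighted = sorted(
        (value, decayed_weight(age, half_life_hours)) for age, value in samples
    )
    threshold = q * sum(weight for _, weight in weighted)
    running = 0.0
    for value, weight in weighted:
        running += weight
        if running >= threshold:
            return value
    return weighted[-1][0]


# A spike of 1000 from ten days ago barely counts against two fresh samples.
print(weighted_percentile([(0, 100), (0, 200), (240, 1000)], q=0.9))
# The same spike, if it happened just now, dominates the 90th percentile.
print(weighted_percentile([(0, 100), (0, 200), (0, 1000)], q=0.9))
```

The first call returns 200 and the second returns 1000, which is why the observation window should include a recent instance of every peak you care about.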

4. Reading target, lowerBound, upperBound

After it has data, the recommendation lives in status:

kubectl get vpa checkout -n shop -o yaml
status:
  recommendation:
    containerRecommendations:
    - containerName: checkout
      lowerBound:
        cpu: 110m
        memory: 262144k
      target:
        cpu: 250m
        memory: 410Mi
      uncappedTarget:
        cpu: 250m
        memory: 410Mi
      upperBound:
        cpu: 1200m
        memory: 980Mi

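Read these precisely. They are not the min, typical, and max of raw usage; they are percentile estimates with a safety margin:

target: the request the recommender would set. This is the number to use.
lowerBound: the request below which the container is likely under-provisioned.
upperBound: the request above which the reservation is likely wasted. It starts very wide and narrows as history accumulates.
uncappedTarget: the target before minAllowed and maxAllowed from the resourcePolicy are applied. If it differs from target, your policy is clamping the recommendation.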

The decision rule for a manual first pass: set your request to target, set memory limit == memory request, drop the CPU limit.
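The decision rule is mechanical enough to write down. The helper below is a hypothetical illustration (its name and the dict shape are not from any library): it turns a VPA target into the resources block you would put in the container spec.

```python
import json


def resources_from_target(target: dict) -> dict:
    """Build a container resources block from a VPA target.

    target: e.g. {"cpu": "250m", "memory": "410Mi"}
    """
    return {
        # Requests take the recommended target as-is.
        "requests": {"cpu": target["cpu"], "memory": target["memory"]},
        # Memory limit is pinned to the request; the CPU limit is omitted on purpose.
        "limits": {"memory": target["memory"]},
    }


print(json.dumps(resources_from_target({"cpu": "250m", "memory": "410Mi"}), indent=2))
```

Fed the target from the status output above, it yields requests of 250m CPU and 410Mi memory, a 410Mi memory limit, and no CPU limit.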

You can constrain the recommender with a resourcePolicy so it never proposes something absurd — essential for sidecars and JVMs that need a memory floor:

spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: checkout
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
    - containerName: istio-proxy
      mode: "Off"               # never right-size the sidecar
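The effect of minAllowed and maxAllowed is a plain clamp on the uncapped recommendation. The sketch below models that, with small parsers that handle only the quantity suffixes appearing in this guide; it is not a complete Kubernetes quantity parser, and all function names are illustrative.

```python
MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9,     # decimal suffixes
}


def parse_memory(quantity: str) -> int:
    """Return bytes for quantities such as '256Mi', '2Gi' or '262144k'."""
    for suffix in ("Ki", "Mi", "Gi", "k", "M", "G"):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * MEMORY_SUFFIXES[suffix]
    return int(quantity)


def parse_cpu(quantity: str) -> int:
    """Return millicores for quantities such as '100m' or '2'."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def clamp(value: int, min_allowed: int, max_allowed: int) -> int:
    """Cap a recommendation to the [min_allowed, max_allowed] range."""
    return max(min_allowed, min(value, max_allowed))


# An uncapped memory recommendation of 64Mi is raised to the 256Mi floor.
floor_applied = clamp(parse_memory("64Mi"), parse_memory("256Mi"), parse_memory("2Gi"))
print(floor_applied // 2**20, "Mi")
# An uncapped CPU recommendation of 3500m is lowered to the 2-core ceiling.
print(clamp(parse_cpu("3500m"), parse_cpu("100m"), parse_cpu("2")), "m")
```

With the policy values above, a 64Mi memory recommendation is raised to 256Mi and a 3500m CPU recommendation is lowered to 2000m, which is the kind of gap you would see between uncappedTarget and target.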

Tune the target percentile only if you have evidence. The recommender’s defaults (memory target near peak, CPU near p90) are deliberately conservative because under-sizing memory kills pods. Lowering the memory target percentile to save money is how teams reintroduce the OOMKills they just fixed.

5. Update modes — and the hard HPA conflict

VPA supports four updateMode values:

| Mode | Behavior |
| --- | --- |
| Off | Recommend only. Never mutates pods. |
| Initial | Applies recommendations only at pod creation. No eviction of running pods. |
| Recreate | Evicts and recreates pods whenever requests drift out of [lowerBound, upperBound]. |
| Auto | Currently behaves like Recreate; intended to use in-place resize as it matures. |

Initial is the underrated safe default for production: new pods get right-sized requests, but you never suffer surprise mid-day evictions. You pick up correct values naturally on every rollout.

Now the rule you cannot violate:

Do not run VPA in Auto/Recreate mode and an HPA on the same resource metric for the same workload. If the HPA scales replicas on CPU utilization while VPA simultaneously rewrites the CPU request, they enter a feedback loop — VPA raises the request, which lowers measured utilization, which makes the HPA scale down, and the controllers fight. The official guidance is explicit: VPA must not be used with the HPA on CPU or memory.

This is the most common way teams break themselves with VPA. Memorize it.
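The feedback loop is easy to reproduce with the HPA's documented replica formula, desired = ceil(current * currentUtilization / targetUtilization), where utilization is usage divided by request. The numbers below are made up for illustration, and the sketch models only the default 10% tolerance, not stabilization windows or scaling policies.

```python
import math


def hpa_desired_replicas(current_replicas: int, usage_m: float, request_m: float,
                         target_utilization: float, tolerance: float = 0.1) -> int:
    """Replica count the HPA would ask for on a CPU utilization target."""
    utilization = usage_m / request_m          # utilization is relative to the request
    ratio = utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas                # within tolerance: no scaling
    return math.ceil(current_replicas * ratio)


# Ten pods each using 200m against a 250m request sit exactly on an 80% target.
before = hpa_desired_replicas(10, usage_m=200, request_m=250, target_utilization=0.8)
# VPA raises the request to 400m; actual usage has not changed at all.
after = hpa_desired_replicas(10, usage_m=200, request_m=400, target_utilization=0.8)
print(before, after)
```

The HPA holds at 10 replicas before the change and asks for 7 afterwards, purely because the denominator moved. The same load then lands on fewer pods, per-pod usage rises, and VPA has a reason to raise the request again.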

6. Combining VPA and HPA correctly

You can absolutely use both — just keep them on different signals. Let the HPA scale replicas on a custom or external metric (queue depth, requests-per-second, p95 latency) and let VPA own CPU/memory requests. They no longer overlap.

HPA on a custom metric (replicas only — no CPU/memory resource metric here):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"

VPA owning requests, scoped to CPU and memory only:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
    - containerName: checkout
      controlledResources: ["cpu", "memory"]

The HPA decides how many pods; VPA decides how big each one is — orthogonal axes, no feedback loop.

7. Right-sizing is half the battle: fix bin-packing too

Correct requests only pay off if the scheduler can pack them densely. Three levers:

Scheduler scoring. The default kube-scheduler NodeResourcesFit plugin uses LeastAllocated scoring, which spreads pods for resilience. For cost-driven node pools, switch to MostAllocated so the scheduler fills nodes before opening new ones — this is what makes the autoscaler able to drain and remove a node:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
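The difference between the two scoring strategies comes down to one sign. The sketch below is a simplified model of the scoring, assuming a 0-100 scale and requested totals that already include the pod being placed; the node figures are invented for illustration.

```python
def allocation_score(requested: dict, allocatable: dict, weights: dict,
                     strategy: str) -> float:
    """Weighted node score on a 0-100 scale; higher means more preferred."""
    total_weight = sum(weights.values())
    score = 0.0
    for resource, weight in weights.items():
        if strategy == "MostAllocated":
            # Fuller nodes score higher, so pods pack together.
            resource_score = requested[resource] * 100 / allocatable[resource]
        elif strategy == "LeastAllocated":
            # Emptier nodes score higher, so pods spread out.
            free = allocatable[resource] - requested[resource]
            resource_score = free * 100 / allocatable[resource]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        score += weight * resource_score
    return score / total_weight


allocatable = {"cpu": 4000, "memory": 16000}   # millicores, MiB
busy_node = {"cpu": 2800, "memory": 11200}     # 70% requested after placement
idle_node = {"cpu": 800, "memory": 3200}       # 20% requested after placement
weights = {"cpu": 1, "memory": 1}

for strategy in ("LeastAllocated", "MostAllocated"):
    print(strategy,
          allocation_score(busy_node, allocatable, weights, strategy),
          allocation_score(idle_node, allocatable, weights, strategy))
```

LeastAllocated scores the idle node higher (80 against 30), while MostAllocated scores the busy node higher (70 against 20), which is what leaves other nodes empty enough to remove.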

Node sizing. Bin-packing is a geometry problem. If your largest pod requests 6 GiB and your nodes are 8 GiB, you waste the remainder on every node. Match node shape to the request distribution you measured in step 4. On Karpenter, let it choose instance types from the actual pending-pod requirements rather than pinning one family:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["4", "8", "16"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
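The node-sizing geometry can be checked with simple arithmetic before choosing instance sizes. This sketch considers memory only and treats the node size as fully allocatable, which is an optimistic assumption: real allocatable memory is lower once system reservations are subtracted.

```python
def packing(node_allocatable_gib: float, pod_request_gib: float):
    """Return (pods per node, stranded GiB, stranded fraction) for identical pods."""
    pods = int(node_allocatable_gib // pod_request_gib)
    stranded = node_allocatable_gib - pods * pod_request_gib
    return pods, stranded, stranded / node_allocatable_gib


for node_gib in (8, 12, 16):
    pods, stranded, fraction = packing(node_gib, 6)
    print(f"{node_gib} GiB node: {pods} pod(s), {stranded} GiB stranded ({fraction:.0%})")
```

For a 6 GiB pod, an 8 GiB node strands 2 GiB (25%), a 16 GiB node strands 4 GiB (also 25%), and a 12 GiB node strands nothing, so a bigger node is not automatically a better fit.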

Consolidation. Karpenter’s WhenEmptyOrUnderutilized policy actively recomputes whether the current pods would fit on fewer or cheaper nodes and replaces them when they would. Right-sized requests are the input that makes consolidation aggressive — shrink the requests and Karpenter discovers it can delete nodes. Cluster Autoscaler offers a weaker version via --scale-down-utilization-threshold.
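To see why smaller requests let a consolidator remove nodes, a first-fit-decreasing packing estimate is enough. This is a generic bin-packing heuristic over a single resource with identical nodes, used here as a stand-in; it is not Karpenter's actual algorithm, and the pod counts are invented for illustration.

```python
def nodes_needed(pod_requests, node_capacity):
    """Estimate node count with first-fit decreasing over one resource."""
    free = []  # remaining capacity on each node opened so far
    for request in sorted(pod_requests, reverse=True):
        if request > node_capacity:
            raise ValueError(f"request {request} does not fit on any node")
        for i, room in enumerate(free):
            if request <= room:
                free[i] = room - request   # place on the first node with room
                break
        else:
            free.append(node_capacity - request)  # open a new node
    return len(free)


# Twelve pods at a copy-pasted 2048 MiB request on 8192 MiB nodes...
print(nodes_needed([2048] * 12, 8192))
# ...versus the same pods right-sized to 410 MiB.
print(nodes_needed([410] * 12, 8192))
```

In this toy case the copy-pasted requests need three nodes and the right-sized ones fit on one. The workload did not change; only the reservations did.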

Verify

Prove the change end to end rather than trusting the dashboard.

# 1. Recommendations exist and have stabilized (upperBound no longer huge)
kubectl describe vpa checkout -n shop | sed -n '/Recommendation/,/Events/p'

# 2. Pods actually picked up the new requests after a rollout
kubectl get pods -n shop -l app=checkout \
  -o custom-columns=NAME:.metadata.name,\
CPU_REQ:.spec.containers[0].resources.requests.cpu,\
MEM_REQ:.spec.containers[0].resources.requests.memory

# 3. QoS class matches intent (Guaranteed needs CPU and memory requests == limits; no CPU limit means Burstable)
kubectl get pod -n shop -l app=checkout \
  -o jsonpath='{.items[*].status.qosClass}{"\n"}'

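# 4. No new OOMKills since the change (node-level OOM events, when present, are not
#    recorded in the workload namespace, so check each container's last termination reason)
kubectl get pods -n shop -l app=checkout \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'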

# 5. Real allocation vs capacity per node (the bin-packing payoff)
kubectl describe nodes | grep -A6 "Allocated resources"

A successful right-sizing shows: requests near target, the QoS class you intended (Guaranteed only where CPU requests also equal limits, Burstable otherwise), zero fresh OOMKills, and node “Allocated resources” climbing toward 70-80% of allocatable while node count drops.

Enterprise scenario

A payments platform team ran ~140 microservices on EKS, every Deployment copied from one Helm template with requests.cpu: 1 and requests.memory: 2Gi. Cluster cost was roughly 38,000 USD/month at 22% average CPU utilization — and despite that slack, three JVM services OOMKilled nightly because 2Gi was below their actual heap-plus-metaspace peak. Classic dual failure: massively over-provisioned on aggregate, under-provisioned where it mattered.

The constraint: those same services already ran HPAs on CPU utilization, so they could not simply flip VPA to Auto — that would have pitted the two controllers against each other on the CPU metric.

The fix, staged over three weeks:

  1. Deployed VPA in updateMode: "Off" fleet-wide for one week to collect recommendations with zero production risk.
  2. Re-platformed the HPAs off CPU onto a Prometheus custom metric (in-flight requests per pod via the Prometheus Adapter), freeing CPU/memory for VPA to own.
  3. Switched VPA to updateMode: "Initial" so requests right-sized on each rollout without surprise evictions, with a minAllowed.memory floor on the JVM services so the recommender never proposed below their measured heap peak.
  4. Set the cost node pool’s scheduler profile to MostAllocated and enabled Karpenter WhenEmptyOrUnderutilized consolidation.

The result over the next billing cycle: average CPU utilization rose from 22% to 61%, node count fell by 44%, monthly spend dropped from ~38,000 to ~21,000 USD, and the nightly OOMKills went to zero because the JVM services finally got memory requests and limits sized to their real footprint. The custom-metric HPA snippet that unblocked everything:

metrics:
  - type: Pods
    pods:
      metric:
        name: http_inflight_requests
      target:
        type: AverageValue
        averageValue: "50"

The non-obvious lesson: the savings did not come from VPA alone. VPA produced correct requests, but the money only materialized once MostAllocated scheduling plus Karpenter consolidation could act on those smaller requests and physically delete nodes. Right-sizing without consolidation just leaves the freed capacity stranded.
