Migrating to Graviton: arm64 Builds, Multi-Arch Pipelines, and Performance Benchmarking

Graviton is the cheapest performance win most AWS estates are leaving on the table. The pitch — “up to ~40% better price-performance over comparable x86 instances” — is real for a large class of workloads, but it is not a checkbox. arm64 (the 64-bit Arm instruction set, also written aarch64) is a different ISA from x86-64, and the migration risk lives in the long tail: a native Python wheel with no aarch64 build, an EDR agent your security team mandates that only ships x86, a base image that silently pulls linux/amd64 and runs your service under QEMU emulation at a third of the throughput. The status you think you’re in — “we’re on Graviton, we’re saving money” — and the status you’re actually in — “half the fleet is emulating x86 and burning the gain” — can differ for weeks if you never run uname -m under load.

This is the migration runbook I actually use, written as a reference you keep open during the cutover. We treat the migration not as one flip but as a sequence of gates, each with a confirming command: audit portability, build honest multi-arch images, stand up arm64 CI on real silicon, roll out on EC2/EKS/Lambda with mixed-architecture scheduling, and prove the win with controlled benchmarks before you commit production traffic. Every decision — instance family, build strategy, scheduling affinity, rollback trigger — is laid out as a scannable table next to the prose and the aws/Terraform/YAML that implements it, because at 02:00 during a canary ramp you want the matrix, not a paragraph.

By the end you will stop migrating on faith. You will know which workloads are Graviton candidates and which need a benchmark first; how to find the single x86-only dependency that can veto an entire tier before week three; how to build one Dockerfile that produces an architecture-correct manifest list; how to keep the x86 path alive so rollback is a scheduling change, not a rebuild; and how to report price-performance (sustained throughput per dollar at your latency SLO) instead of a misleading raw-speed number. The decisive discipline is the same one that separates a clean migration from a stalled one: treat every agent, sidecar, and native binary as a first-class migration dependency, audited up front, not discovered in production.

What problem this solves

The pain is concrete and financial. Compute is the largest line on most AWS bills, and Graviton offers a roughly 20% lower hourly price for comparable capacity plus, on throughput-bound workloads, more work per core — compounding into a price-performance gap that lands on the CFO’s spreadsheet. A platform team told to cut compute spend by a third has, in Graviton, a lever that does not require re-architecting a single service. But the lever has a catch the pitch deck omits: arm64 is a real ISA boundary, and anything with compiled code must have been built for it.

What breaks without a disciplined migration: a team flips an instance type to m7g, the launch “works,” and nobody notices the container image only published linux/amd64, so it runs under QEMU at 30-40% of native throughput — the bill went up (more instances to carry the load) while everyone celebrates the “Graviton win.” Or the EDR DaemonSet that security mandates has no certified arm64 build at the exact version policy requires, and the migration is vetoed in week three after the API tier is already half-ported. Or a single internal library still pulls an x86-only .so for a legacy client, and the service segfaults on an arm64 node in a way that looks like a random crash loop. Each of these is diagnosable in minutes and preventable in the audit — if you know to look.

Who hits this: anyone running more than a handful of EC2 instances, EKS nodes, ECS tasks, Lambda functions, or managed-service nodes (RDS/Aurora, ElastiCache, OpenSearch) and wanting the savings. It bites hardest on native-heavy stacks (Python with C extensions, Node with native addons, anything with hand-written x86 intrinsics), agent-laden fleets (mandated EDR/observability sidecars), and container shops where a wrong base image silently downgrades you to emulation. The fix is almost never “abandon Graviton” — it’s “find the one dependency that isn’t ported, and decide on it deliberately.”

To frame the whole field before the deep dive, here is every migration surface this article covers, the risk that lives there, and the one check that tells you the truth:

Migration surface	What the status report says	First question to ask	First place to look	Most common single blocker
Native dependencies	“everything compiles, ship it”	Does every compiled package have an `aarch64` build?	Lockfile audit (`pip download --platform manylinux2014_aarch64`)	One wheel with no aarch64 tag
Container images	“the image runs”	Is it native arm64 or QEMU-emulated?	`docker buildx imagetools inspect`; `uname -m` in-container	Base image only publishes `linux/amd64`
Agents & sidecars	“the agent is installed”	Is the exact mandated version GA on arm64?	Vendor release notes; canary node group	EDR/security sensor not certified
CI / build farm	“CI is green”	Is arm64 built on real silicon or emulated?	CodeBuild `ARM_CONTAINER`; GHA `*-arm` runner	Emulated builds too slow, adoption stalls
Managed services	“we changed the class”	Did you benchmark on a clone before failover?	`describe-db-instances` class; blue/green	Engine/version doesn’t offer the Graviton class
Benchmark / cutover	“Graviton is faster”	Faster, or better price-performance at SLO?	RPS-at-SLO ÷ on-demand price, like-for-like	Comparing raw speed, not throughput/$

Learning objectives

By the end of this article you can:

Decide whether a given workload is a Graviton candidate or needs a benchmark first, using a portability and price-performance screen rather than the marketing number.
Audit native dependencies, language runtimes, agents, and sidecars for aarch64 support and produce a portability matrix you can gate the migration on.
Build a single Dockerfile that produces a multi-arch manifest list covering linux/amd64 and linux/arm64, using $TARGETPLATFORM/$BUILDPLATFORM so cross-builds are explicit and never fall back to QEMU emulation by accident.
Stand up arm64 CI on native silicon (CodeBuild ARM_CONTAINER, GitHub Actions ubuntu-*-arm runners) and stitch a manifest list from per-arch digests.
Roll out on EC2 (arm64 AMIs), EKS (mixed-architecture node groups, nodeAffinity, Karpenter NodePools), and Lambda (architectures = ["arm64"]) with not-yet-ported workloads safely pinned to x86.
Migrate managed services (RDS/Aurora, ElastiCache, OpenSearch) to Graviton classes via clone/blue-green with a tested rollback.
Design and run a controlled benchmark that reports price-performance (sustained RPS at your latency SLO per on-demand dollar), like-for-like sizes, and read the result correctly.
Run a phased, canary-gated cutover where rollback is a nodeAffinity/target-weight flip with no rebuild, and confirm at every layer that you are running native arm64, not emulation.

Prerequisites & where this fits

You should already be comfortable with the AWS compute building blocks: an EC2 instance type names a family + generation + size (m7g.xlarge = general-purpose, 7th-generation family running on Graviton3, 4 vCPU); an AMI is architecture-specific; ECR stores container images and can hold a multi-arch image index under one tag; EKS schedules pods onto nodes and exposes the well-known kubernetes.io/arch label; and Lambda runs a function on a managed x86_64 or arm64 execution environment. You should know how to run the aws CLI, read JSON output, write a basic Dockerfile, and apply a Terraform resource. Familiarity with docker buildx, Kubernetes nodeAffinity, and a load tool (k6, wrk, vegeta) helps.

This sits in the Compute & Cost-Optimization track. It assumes the compute fundamentals from AWS Compute: EC2 vs Lambda vs ECS vs EKS and the EC2 mechanics in Amazon EC2 Deep Dive: Instances, AMIs, EBS, User Data & IMDS. It pairs tightly with EC2 Spot, Mixed Instances & Capacity-Optimized ASGs (Graviton + Spot is the deepest discount stack) and Deploy Karpenter on EKS: Consolidation, Spot & Disruption Budgets (Karpenter provisions Graviton on demand). The container-build half builds on Docker Container Images for CI/CD: Dockerfiles & Registries and the CI on GitHub Actions Fundamentals: Workflows, Jobs, Runners & Secrets.

A quick map of who owns which migration surface, so you pull the right person into the cutover bridge:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Application code	Compiled binaries, native intrinsics	App / dev team	Segfault on arm64; no aarch64 build
Container image	Base image, build args, manifest	Platform / build team	Emulation, wrong-arch pull, slow start
Agents / sidecars	EDR, observability, mesh proxy	Security + SRE	Migration veto; sidecar crash on arm64
Scheduling	AMI, node groups, affinity, NodePool	Platform / SRE	Capacity stall; pod stranded on wrong arch
Managed data services	RDS/Aurora, ElastiCache, OpenSearch	DBA / data team	Class unavailable; failover risk
CI/CD	Build farm, runners, registry push	DevOps / platform	Slow emulated builds; manifest not assembled
FinOps	Pricing, savings plans, benchmark sign-off	FinOps + leadership	Wrong metric; savings overstated

Core concepts

Six mental models make every later decision obvious.

arm64 is a real ISA boundary, not a flag. x86-64 and arm64 (aarch64) are different instruction sets. Source code in Go, Rust, Java, .NET, Node, or Python is portable because the toolchain (Go, Rust) or the runtime (JVM, .NET, Node, Python) targets the architecture for you. But anything already compiled to native machine code (a C extension, a .so, a statically-linked Go binary, a prebuilt npm addon) exists per-architecture and must have been built for arm64. The migration’s entire risk surface is “which compiled things do I depend on, and does each have an aarch64 build?”
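One way to answer that question for a single artifact is to read the architecture straight out of its ELF header, which is the same field `file ./binary` reports. The sketch below is illustrative rather than part of the runbook: `elf_arch` and `file_arch` are hypothetical helper names, and the lookup table covers only the four machine codes relevant here.

```python
import struct

# e_machine values defined by the ELF specification (subset).
ELF_MACHINES = {3: "x86", 40: "arm", 62: "x86-64", 183: "aarch64"}


def elf_arch(header: bytes) -> str:
    """Return the architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5) gives the byte order: 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown({machine})")


def file_arch(path: str) -> str:
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as f:
        return elf_arch(f.read(20))
```

Pointing `file_arch` at every `.so` under a site-packages or node_modules tree would flag an x86-64 object before it ever reaches an arm64 node.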

Graviton competes on throughput-per-dollar, not single-thread clock. Graviton cores (Neoverse-based) are not faster per core than the latest x86 at single-threaded, latency-bound work tuned for x86. They win on aggregate throughput per dollar: more cores at a lower price, strong memory bandwidth, excellent scaling for horizontally parallel work. The decision metric is therefore price-performance (sustained throughput at your SLO ÷ price), never raw latency of one request. A workload that scales out cleanly and runs more than one instance is a candidate; a single fat box tuned for x86 single-thread is not, until proven.
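The metric is simple enough to write down. A minimal sketch, assuming you already have a sustained RPS figure measured at your latency SLO and the on-demand hourly price; the function names and every number below are hypothetical, chosen only to show the arithmetic.

```python
def price_performance(rps_at_slo: float, hourly_price_usd: float) -> float:
    """Sustained requests per second at the latency SLO, per dollar-hour."""
    if hourly_price_usd <= 0:
        raise ValueError("hourly price must be positive")
    return rps_at_slo / hourly_price_usd


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.25 means 25% better)."""
    return candidate / baseline - 1.0


# Hypothetical inputs for illustration only, not measured results:
# identical throughput, with the arm64 instance priced 20% lower.
x86 = price_performance(rps_at_slo=1000.0, hourly_price_usd=0.20)
arm = price_performance(rps_at_slo=1000.0, hourly_price_usd=0.16)
print(f"x86:   {x86:.0f} RPS per dollar-hour")
print(f"arm64: {arm:.0f} RPS per dollar-hour")
print(f"gain:  {relative_gain(arm, x86):.0%}")
```

Note what the arithmetic implies: equal throughput at a 20% lower price is a 25% price-performance gain, because the price sits in the denominator. Any extra throughput per instance compounds on top of that.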

A multi-arch image is one manifest list, not two tags. A correct container artifact is a manifest list (OCI image index): one tag (app:1.4.0) pointing at per-architecture manifests. docker pull and Kubernetes resolve the matching architecture automatically. The failure mode is shipping a single-arch image (only linux/amd64) to an arm64 node. If QEMU binfmt handlers are registered on that node, the container starts and runs under user-mode emulation: functionally correct, but at roughly a third of native throughput, silently burning the price-performance gain. Without those handlers it dies with exec format error, which is at least loud. “It runs” is not “it runs native.”

Cross-compilation beats emulated building. There are three ways to build an arm64 image: emulate the arm64 environment on an x86 builder via QEMU (correct, slow), cross-compile from the builder’s native arch to the target arch, or build on native arm64 hardware. For compiled languages (Go, Rust) cross-compilation via $TARGETARCH is fast and clean. For interpreted, native-heavy stacks (Python wheels, Node addons) cross-compiling is painful, so build that arch on a native arm64 runner (CodeBuild ARM_CONTAINER, GHA ubuntu-24.04-arm) and stitch the manifest from digests. Emulated builds are the fallback, not the default, because slow CI kills adoption.

The scheduler decides the architecture, so the scheduler is how you control and roll back. On EKS the kubelet sets kubernetes.io/arch on every node automatically. Your image being multi-arch means a pod scheduled to either arch pulls the right layer. nodeAffinity on kubernetes.io/arch is how you pin a not-yet-ported workload to amd64 (so it never lands on Graviton) and how you roll back instantly (flip the affinity, pods reschedule, no rebuild). Karpenter expresses the same intent in a NodePool’s requirements. This is why keeping the x86 node group alive makes rollback trivial.
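To make the pin-and-flip idea concrete, here is a small sketch that generates the affinity stanza as JSON. The field names are the standard Kubernetes pod-spec ones; the helper names `arch_affinity` and `affinity_patch` are hypothetical, and wrapping the stanza as a Deployment patch is one possible way to apply it, not the only one.

```python
import json

ARCH_LABEL = "kubernetes.io/arch"
KNOWN_ARCHES = {"amd64", "arm64"}


def arch_affinity(*archs: str) -> dict:
    """Build a required nodeAffinity stanza limiting pods to the given arches."""
    if not archs or set(archs) - KNOWN_ARCHES:
        raise ValueError(f"expected one or more of {sorted(KNOWN_ARCHES)}")
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": ARCH_LABEL, "operator": "In", "values": list(archs)}
                        ]
                    }
                ]
            }
        }
    }


def affinity_patch(*archs: str) -> dict:
    """Wrap the stanza as a patch body for a Deployment's pod template."""
    return {"spec": {"template": {"spec": {"affinity": arch_affinity(*archs)}}}}


# Pin a not-yet-ported workload to x86. Passing "arm64" instead moves it to
# Graviton, and passing both lets the scheduler choose.
print(json.dumps(affinity_patch("amd64"), indent=2))
```

The printed JSON could be handed to something like `kubectl patch deployment <name> --type merge -p '<json>'`. Because the image is a manifest list, changing the `values` list is the whole rollback; nothing is rebuilt.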

The bill driver is the instance-hour, and Graviton lowers it two ways. You pay per instance-hour. Graviton lowers the bill through a ~20% lower hourly price for comparable capacity and, on suitable workloads, more throughput per instance (so you run fewer of them). On Lambda you pay per GB-second and arm64 is priced lower per GB-second — often the lowest-risk, highest-ROI flip in the whole program. The savings are only real if you’re running native; emulation can erase them by needing more instances.
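The "emulation can erase them" point is easiest to see as fleet arithmetic. A rough sketch with hypothetical numbers: a fixed traffic target, a per-instance throughput figure, and an hourly price. The emulated row assumes about a third of native throughput, matching the figure used elsewhere in this article.

```python
import math

HOURS_PER_MONTH = 730  # common billing approximation


def instances_needed(target_rps: float, rps_per_instance: float) -> int:
    """Smallest whole number of instances that carries the target load."""
    if rps_per_instance <= 0:
        raise ValueError("per-instance throughput must be positive")
    return math.ceil(target_rps / rps_per_instance)


def monthly_cost(target_rps: float, rps_per_instance: float, hourly_price_usd: float) -> float:
    """Instance count times hourly price times hours in a month."""
    return instances_needed(target_rps, rps_per_instance) * hourly_price_usd * HOURS_PER_MONTH


# Hypothetical scenarios: (RPS per instance at SLO, hourly price in USD).
scenarios = {
    "x86 native": (1000.0, 0.20),
    "arm64 native": (1000.0, 0.16),
    "arm64 emulated": (333.0, 0.16),
}
for name, (rps, price) in scenarios.items():
    count = instances_needed(10_000, rps)
    cost = monthly_cost(10_000, rps, price)
    print(f"{name:>15}: {count:3d} instances, ${cost:,.2f}/month")
```

With these inputs the emulated fleet needs roughly three times the instances of either native fleet, so the lower hourly price is swamped and the bill ends up above the x86 baseline.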

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to the migration
arm64 / aarch64	The 64-bit Arm instruction set	The whole stack	The ISA boundary; compiled code is per-arch
Graviton	AWS-designed Arm Neoverse server CPU	`*g` instance families	The thing you’re migrating to
Manifest list	One tag → per-arch image manifests	ECR / registry	Right layer pulled per node arch
QEMU emulation	Running x86 binaries on arm64 (or vice-versa)	Container runtime / build	Correct but slow; silent gain-killer
`$TARGETPLATFORM` / `$BUILDPLATFORM`	buildx args naming target vs builder arch	Dockerfile	Makes cross-builds explicit
`kubernetes.io/arch`	Auto-set node label (`amd64`/`arm64`)	Every EKS node	Scheduling key for affinity
`nodeAffinity`	Rule pinning pods to matching nodes	Pod spec	Pin not-ported pods; instant rollback
Karpenter NodePool	Just-in-time node provisioning intent	EKS cluster	Provisions Graviton on demand
CodeBuild `ARM_CONTAINER`	Native Arm build compute	CI	Builds arm64 on real silicon
Price-performance	Throughput at SLO ÷ price	The benchmark	The only honest migration metric
arm64 AMI	Architecture-matched machine image	EC2 launch template	Wrong-arch AMI → launch fails
Native addon / wheel	Compiled dependency artifact	Lockfile	Needs an aarch64 build or it breaks

The architecture & error reference

Before the per-surface detail, here is the lookup table you scan first: every error, symptom, or limit you realistically hit during a Graviton migration, what it means, the likely cause, how to confirm it, and the fix. The non-obvious ones are the silent failures — emulation and wrong-arch pulls that “work” while destroying the gain.

Symptom / error	What it means	Likely cause	How to confirm	First fix
`exec format error`	Wrong-arch binary executed	x86 binary on arm64 node, no emulation installed	`uname -m` on host; `file ./binary`	Build/pull the arm64 artifact; install qemu only as stopgap
Throughput ~⅓ of expected on arm64	Running under QEMU emulation	Single-arch image pulled to arm64 node	`docker buildx imagetools inspect` (one platform only)	Publish a multi-arch manifest list
`no matching manifest for linux/arm64`	Registry has no arm64 variant	Image pushed amd64-only	`docker manifest inspect <tag>`	Rebuild with `--platform linux/amd64,linux/arm64`
`ERROR: No matching distribution found for <pkg>` (pip)	No aarch64 wheel	Native package x86-only or pin too old	`pip download --platform manylinux2014_aarch64`	Unpin / source-build with toolchain / swap package
Pod `Pending`, `node(s) didn't match node affinity`	No node of required arch	`required` affinity `amd64` but only arm64 nodes (or vice-versa)	`kubectl get nodes -L kubernetes.io/arch`	Add matching node group; or fix affinity
ASG “capacity stall,” instances never `InService`	Launch fails silently	x86 AMI on arm64 instance type	Activity history; `describe-images` Architecture	Use an arm64 AMI (AL2023/Bottlerocket/Ubuntu)
Container segfaults / `SIGILL` on arm64 only	Illegal instruction	Hand-written x86 intrinsics / AVX path	Crash on arm64, fine on amd64; `dmesg`	Use a portable build flag / arm64 codepath / library
EDR/agent DaemonSet `CrashLoopBackOff` on Graviton nodes	Agent not arm64-ready	Sensor version lacks aarch64 build	`kubectl logs`; vendor matrix	Pin certified arm64 build; canary one node group
Node native addon `Error: ... invalid ELF header`	x86 prebuilt addon under arm64	`node_modules` baked on x86, copied to arm64	`npm rebuild` on target; check addon arch	Rebuild on arm64 runner; multi-arch image
Lambda cold-start failure on arm64	Bundled binary is x86	Layer/zip native dep compiled for x86	`aws lambda get-function-configuration` Architectures	Rebuild bundled native dep on arm64
Slower than x86 even when native	Genuinely x86-favored hot path	No Arm-optimized library; single-thread bound	Benchmark native vs native	Profile; swap library; keep on x86 if it loses
`docker manifest inspect` returns 1 entry	Image is single-arch	`--load` used, or `--platform` had one arch	Inspect the tag’s manifests	Rebuild with both platforms; `--push`
ECS task stuck `PROVISIONING`/stops	Task def `runtimePlatform` arch mismatch	`cpuArchitecture` set to wrong arch for the capacity	`describe-task-definition` runtimePlatform	Set `cpuArchitecture: ARM64`; arm64 capacity provider
Spot interruptions spike on `*g`	Narrow Graviton instance-type pool	Too few instance types in the pool	Spot allocation; capacity-optimized	Broaden the `*g` type list; mixed sizes

Three reading notes that save the most time, because the silent failures cost the most:

Distinction	The trap	How to tell them apart
Native arm64 vs QEMU-emulated	“It runs” hides roughly two-thirds lost throughput	In-container `uname -m` = `aarch64` AND throughput matches benchmark; host-level `uname -m` reads `aarch64` even when the container is emulated, so it proves nothing on its own
Single-arch pull vs multi-arch image	A working pod that’s secretly emulated	`docker buildx imagetools inspect` shows BOTH `linux/amd64` and `linux/arm64`, not one
Launch failure vs capacity shortage	Wrong-arch AMI looks like Spot/capacity stall	ASG activity says the launch failed (bad AMI) vs no capacity; the AMI’s Architecture field is the tell
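The first distinction can also be checked from inside the process at startup. The sketch below is an assumption-laden illustration, not a guarantee: it relies on Python's `platform.machine()`, which reflects what the process sees, and a process running under QEMU user-mode emulation typically sees the emulated architecture rather than the host's. The names `normalize_arch` and `check_native` are hypothetical.

```python
import platform
from typing import Optional


def normalize_arch(machine: str) -> str:
    """Map kernel-style names onto the amd64/arm64 names Kubernetes uses."""
    m = machine.lower()
    if m in ("aarch64", "arm64"):
        return "arm64"
    if m in ("x86_64", "amd64"):
        return "amd64"
    return m


def check_native(expected: str, machine: Optional[str] = None) -> bool:
    """True when the process architecture matches the expected one."""
    actual = machine if machine is not None else platform.machine()
    return normalize_arch(actual) == normalize_arch(expected)


if __name__ == "__main__":
    # Report what this process sees, so a mismatch shows up in startup logs.
    print(f"process arch: {normalize_arch(platform.machine())}")
```

Comparing that value against the node's `kubernetes.io/arch` label (for example, exposed to the pod through an environment variable) would turn a silent emulated pod into a log line. It complements the throughput check; it does not replace it.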

Surface 1 — Assess portability before you touch infrastructure

The migration fails or succeeds in the dependency audit. Everything compiled must have an aarch64 build, and the one thing that doesn’t will veto a tier in week three if you find it late. Inventory three layers and gate on a matrix.

Native dependencies — audit the lockfile, not the requirements

Anything with compiled code needs an aarch64 build. Audit your lockfiles (resolved, pinned versions), not your top-level requirements.txt/package.json, because a transitive native dependency is exactly what bites.

# Python: find wheels that are x86-only (no aarch64/universal tag)
pip download -r requirements.txt -d /tmp/wheels --only-binary=:all: \
  --platform manylinux2014_aarch64 --python-version 312 --implementation cp \
  --abi cp312 2>&1 | tee /tmp/aarch64-audit.log
# Any package that errors "no matching distribution" needs a source build or a swap.

# Node: native addons surface as prebuilt binaries or node-gyp rebuilds
npm ls --all 2>/dev/null | grep -Ei 'sharp|bcrypt|grpc|canvas|node-sass|re2|argon2'

# Go/Rust: confirm the target triple builds clean
GOARCH=arm64 GOOS=linux go build ./...           # Go: trivial cross-compile
cargo build --target aarch64-unknown-linux-gnu   # Rust: add the target first
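A complementary check applies when you already have a directory of wheels, for instance a wheelhouse baked on an x86 builder: the platform is encoded in each wheel's filename, so you can classify them without installing anything. A minimal sketch, assuming standard wheel filenames; `wheel_platform_tags` and `runs_on_linux_aarch64` are hypothetical helper names.

```python
from typing import List


def wheel_platform_tags(filename: str) -> List[str]:
    """Platform tags from a wheel filename: name-ver[-build]-py-abi-platform.whl."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    stem = filename[: -len(".whl")]
    # The platform tag is the last dash-separated field; a compressed tag set
    # joins several platforms with dots.
    return stem.split("-")[-1].split(".")


def runs_on_linux_aarch64(filename: str) -> bool:
    """True for pure-Python wheels ('any') and Linux aarch64 wheels."""
    return any(
        tag == "any" or (tag.endswith("_aarch64") and "linux" in tag)
        for tag in wheel_platform_tags(filename)
    )


wheels = [
    "requests-2.31.0-py3-none-any.whl",
    "numpy-1.26.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
    "example_native-1.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]
for w in wheels:
    print(("ok      " if runs_on_linux_aarch64(w) else "x86-only") + "  " + w)
```

The third filename is a made-up package included only to show a failing row. Anything flagged this way goes into the portability matrix as a FIX.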

The per-language portability picture, because the audit command and the fix differ by ecosystem:

Ecosystem	How native code surfaces	aarch64 status (2026)	Audit command	If missing, fix
Go	Static binary; rare cgo	First-class (`GOARCH=arm64`)	`GOARCH=arm64 go build ./...`	Cross-compile; avoid cgo or build native
Rust	Native binary; some `-sys` crates	First-class (`aarch64-unknown-linux-gnu`)	`cargo build --target aarch64-...`	Add target; native build for C-linked crates
Java / JVM	JIT; rare JNI libs	First-class (Corretto ships aarch64)	`java -XshowSettings:properties` os.arch	Use current OpenJDK/Corretto; rebuild JNI
.NET	JIT; rare native interop	First-class (arm64 runtime)	`dotnet --info` RID	Target `linux-arm64`; rebuild native interop
Node.js	Prebuilt addons / node-gyp	Mostly GA; check addons	`npm rebuild` on arm64	`npm rebuild` on arm64; multi-arch image
Python	C extensions as wheels	Most major wheels GA	`pip download --platform manylinux2014_aarch64`	Unpin to a version with a wheel; source-build

Common native offenders and their typical resolution — the packages that show up in real audits:

Package	Ecosystem	Why it’s native	Typical resolution
`grpcio`	Python	C++ core	Pin to a version with an aarch64 manylinux wheel
`cryptography`	Python	Rust/OpenSSL	Unpin old pins; modern versions ship aarch64 wheels
`numpy` / `scipy` / `pandas`	Python	BLAS/LAPACK	aarch64 wheels GA; ensure recent versions
`psycopg2`	Python	libpq	Use `psycopg2-binary` aarch64 wheel or build libpq
`sharp`	Node	libvips	aarch64 prebuilt available; `npm rebuild` on arm64
`bcrypt` / `argon2`	Node	C crypto	`npm rebuild` on arm64 runner
`re2` / `grpc` (legacy native package)	Node	C++	Rebuild on arm64; prefer pure-JS (`@grpc/grpc-js`) where viable
`lxml` / `Pillow`	Python	libxml2 / libjpeg	aarch64 wheels GA; ensure recent versions
`confluent-kafka`	Python	librdkafka C	Use a version with an aarch64 wheel; or build librdkafka
Legacy HSM/PKCS#11 `.so`	Any	Vendor C lib	Get vendor aarch64 build; or keep tier on x86

Language runtimes and toolchains

The major managed runtimes are first-class on arm64: Go (GOARCH=arm64), Rust (aarch64-unknown-linux-gnu), Java (a current OpenJDK; Amazon Corretto ships aarch64), .NET (arm64 runtime), Node, and Python. The traps are pinned old runtimes (an ancient JDK or Python with no arm64 build at that exact patch) and base images that only publish linux/amd64. The runtime decision table:

Runtime decision	x86-only risk	Recommended arm64 path	Gotcha
Old pinned JDK 8u-early	Some early arm64 gaps	Corretto 11/17/21 aarch64	Match the exact build your app needs
Python 3.7 EOL	Fewer aarch64 wheels	Move to 3.11/3.12 (rich wheels)	Bumping Python is the real work
Node 16 EOL	Older prebuilt addons	Node 20/22 LTS	Some addons need `npm rebuild`
Distroless/Alpine base	Tag may be amd64-only	Use a multi-arch base tag	Verify the base publishes arm64
Self-managed toolchain image	Built amd64-only	Rebuild toolchain image multi-arch	Build farm itself must be multi-arch

ISV, agents, and sidecars — where production migrations actually stall

This is where it dies if you find it late. Confirm aarch64 support for everything that runs next to your app, at the exact version your policy mandates:

Sidecar / agent class	Examples	arm64 readiness (verify version!)	How to validate before fleet-wide
Observability agent	Datadog, Dynatrace, New Relic, OTel Collector	GA on arm64	Deploy to a single canary node group
Security / EDR sensor	CrowdStrike Falcon, SentinelOne, etc.	GA — but pin the mandated build	Security sign-off on certified arm64 version
Service mesh sidecar	Envoy/App Mesh, Istio, Linkerd	GA on arm64	Confirm proxy image is multi-arch
Log shipper	Fluent Bit, Vector	GA on arm64	Multi-arch DaemonSet image
Init / secrets sidecar	Vault agent, ESO, secrets-store CSI	GA on arm64	Multi-arch; test secret injection
Vendor licensing/HSM agent	PKCS#11 daemons, license managers	Often the laggard	Vendor matrix; may gate the tier

One mandated x86-only agent can veto an entire tier. Find it in week one with a single canary node group, not in week three with half the API tier ported. The EDR sensor is the most common single blocker — treat it as a first-class dependency with explicit security sign-off on the certified arm64 build.

The portability matrix you gate on

Produce a simple matrix per service and refuse to start the rollout until every row is green or has an explicit waiver:

Layer	Component	aarch64 status	Action	Owner	Gate
Runtime	Go 1.22	Native	none	dev	PASS
Native dep	`grpcio` 1.x	Wheel available	pin ≥ version with aarch64 wheel	dev	PASS
Native dep	legacy `cryptography` pin	No aarch64 wheel at pin	unpin / source-build w/ Rust toolchain	dev	FIX
Agent	EDR sensor	Vendor GA on arm64	validate mandated version; security sign-off	security	GATE
Sidecar	Envoy	Native	none	platform	PASS
Base image	distroless:nonroot	Multi-arch tag	confirm arm64 manifest present	platform	PASS
Internal lib	HSM client `.so`	x86-only	rebuild w/ aarch64 toolchain	dev	FIX
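The "refuse to start" rule is worth encoding so it cannot be argued with at the start of a rollout. A minimal sketch, assuming the matrix is available as structured rows; the `Row` type, the `waived` flag, and the function names are hypothetical, and the sample rows mirror the table above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    component: str
    gate: str  # "PASS", "FIX", or "GATE", as in the matrix
    waived: bool = False  # an explicit, recorded waiver


def blockers(rows: List[Row]) -> List[str]:
    """Components that are neither green nor explicitly waived."""
    return [r.component for r in rows if r.gate != "PASS" and not r.waived]


def rollout_allowed(rows: List[Row]) -> bool:
    """The rollout may start only when nothing blocks it."""
    return not blockers(rows)


matrix = [
    Row("Go 1.22", "PASS"),
    Row("grpcio 1.x", "PASS"),
    Row("legacy cryptography pin", "FIX"),
    Row("EDR sensor", "GATE"),
    Row("Envoy", "PASS"),
    Row("distroless:nonroot", "PASS"),
    Row("HSM client .so", "FIX"),
]
print("blockers:", blockers(matrix))
print("rollout allowed:", rollout_allowed(matrix))
```

Run as a CI step, a check like this fails the pipeline until each FIX is resolved and each GATE has its sign-off recorded, either by flipping the row to PASS or by setting the waiver.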

Surface 2 — Build multi-arch container images with buildx and ECR

Do not maintain two Dockerfiles. Build one image as a multi-arch manifest list so docker pull / Kubernetes resolves the right architecture automatically. The correctness rule: use $TARGETPLATFORM/$BUILDPLATFORM and $TARGETARCH so cross-builds are explicit, never accidental emulation.

# syntax=docker/dockerfile:1
FROM --platform=$BUILDPLATFORM golang:1.22 AS build
ARG TARGETOS TARGETARCH
WORKDIR /src
COPY . .
# Cross-compile from the builder's native arch to the target arch (fast, no QEMU)
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app ./cmd/app

FROM public.ecr.aws/docker/library/alpine:3.20
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]

Create a builder, log in to ECR, then build and push a manifest list covering both architectures with a single buildx build:

# One-time: a buildx builder backed by the docker-container driver
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap

aws ecr get-login-password --region ap-south-1 \
  | docker login --username AWS --password-stdin \
    111122223333.dkr.ecr.ap-south-1.amazonaws.com

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag 111122223333.dkr.ecr.ap-south-1.amazonaws.com/app:1.4.0 \
  --provenance=false \
  --push .

ECR stores this as a single tag pointing at an image index. Verify both platforms are present — this is the check that catches the silent emulation trap:

docker buildx imagetools inspect \
  111122223333.dkr.ecr.ap-south-1.amazonaws.com/app:1.4.0
# Expect Platform: linux/amd64 AND linux/arm64 in the output.
# If only one appears, every node of the other arch will emulate or fail.
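Eyeballing that output does not scale to a pipeline, so the same check can be automated. The sketch below assumes you feed it the raw image index JSON, such as the output of `docker buildx imagetools inspect --raw <tag>`, and that attestation entries carry an `unknown` platform; `index_platforms` and `missing_platforms` are hypothetical names, and the sample document is hand-written with digests omitted.

```python
import json
from typing import Set

REQUIRED = {"linux/amd64", "linux/arm64"}


def index_platforms(raw_index: str) -> Set[str]:
    """Collect os/architecture pairs from an image index document."""
    doc = json.loads(raw_index)
    found = set()
    # A single-arch image manifest has no "manifests" key, so it yields nothing.
    for entry in doc.get("manifests", []):
        plat = entry.get("platform", {})
        os_name, arch = plat.get("os"), plat.get("architecture")
        # Skip entries without a real platform (for example attestations).
        if os_name and arch and "unknown" not in (os_name, arch):
            found.add(f"{os_name}/{arch}")
    return found


def missing_platforms(raw_index: str) -> Set[str]:
    """Required platforms that the index does not provide."""
    return REQUIRED - index_platforms(raw_index)


sample = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"os": "linux", "architecture": "amd64"}},
        {"platform": {"os": "unknown", "architecture": "unknown"}},
    ],
})
print("missing:", sorted(missing_platforms(sample)))
```

Failing the build whenever `missing_platforms` is non-empty keeps an amd64-only tag from ever reaching a Graviton node.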

The build-strategy decision — the single most consequential choice, because it sets your CI speed and correctness:

Strategy	How it works	Speed	Best for	Trade-off / gotcha
Cross-compile (`$TARGETARCH`)	Builder’s native arch compiles for target	Fast	Go, Rust, static binaries	Painful for native-heavy interpreted stacks
Native arm64 runner	Build arm64 on Graviton CI	Fast	Python wheels, Node addons	Needs an arm64 runner / fleet
QEMU emulation (`buildx` default cross)	Emulate target arch on x86 builder	Slow (2-10×)	Last resort, rare arch	Slow CI erodes adoption; CPU-heavy
Per-arch + manifest merge	Build each arch on its silicon, merge digests	Fast	Mixed/heavy stacks	Two jobs + a merge step

The buildx flags that matter and what each controls:

Flag / arg	What it does	Default	When to set
`--platform linux/amd64,linux/arm64`	Targets both arches → manifest list	builder arch	Always, for multi-arch
`$BUILDPLATFORM`	The builder’s native platform	auto	`FROM --platform=$BUILDPLATFORM` for cross-builds
`$TARGETPLATFORM` / `$TARGETARCH`	The platform being built	auto	Drive `GOARCH`/conditional steps
`--provenance=false`	Skip SLSA provenance attestation	true (newer)	Avoid an extra unexpected manifest entry
`--push`	Push the manifest list to the registry	off	Publish (vs `--load`, single-arch local)
`--cache-to/from type=registry`	Layer cache in the registry	off	Speed repeat multi-arch builds
`push-by-digest=true`	Push by digest only (no tag)	off	Per-arch jobs that a merge step assembles

Common multi-arch build failures and their cause:

Build symptom	Cause	Fix
Only `linux/amd64` in `imagetools inspect`	Forgot `--platform` arm64, or `--load` used	Add arm64 to `--platform`; use `--push`
Build extremely slow on one arch	QEMU emulating that arch	Cross-compile or use a native runner
Extra unexpected manifest entries	Provenance/SBOM attestations	`--provenance=false --sbom=false` if undesired
`npm rebuild` fails in cross-build	Native addon can’t cross-compile	Build that arch on a native arm64 runner
Image pulls but `exec format error`	Manifest list wrong / single-arch	Verify both platforms; rebuild
Cache never hits across arches	Per-arch layers, no registry cache	`--cache-to/from type=registry`
Push denied to ECR	CI role lacks push permissions (`ecr:PutImage` plus the layer-upload actions)	Grant push actions scoped to the repository on the OIDC role

Surface 3 — arm64 CI: native runners and cross-compilation

Emulated arm64 builds under QEMU are correct but slow, and slow CI erodes adoption. Build arm64 artifacts on arm64 hardware.

CodeBuild native Arm compute

CodeBuild offers native Arm compute. Select an ARM_CONTAINER environment with an aarch64 image:

# buildspec.yml -- runs natively on an ARM_CONTAINER compute fleet
version: 0.2
phases:
  pre_build:
    commands:
      - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REPO_HOST
  build:
    commands:
      - docker build --platform linux/arm64 -t $REPO_URI:$IMAGE_TAG-arm64 .
      - docker push $REPO_URI:$IMAGE_TAG-arm64

resource "aws_codebuild_project" "app_arm" {
  name         = "app-arm64"
  service_role = aws_iam_role.codebuild.arn

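  # With a CODEPIPELINE source, the artifacts type must also be CODEPIPELINE.
  artifacts { type = "CODEPIPELINE" }
  source { type = "CODEPIPELINE" } # for GITHUB / CODECOMMIT, set a location and use NO_ARTIFACTS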

  environment {
    type            = "ARM_CONTAINER"
    compute_type    = "BUILD_GENERAL1_LARGE"
    image           = "aws/codebuild/amazonlinux2-aarch64-standard:3.0"
    privileged_mode = true # required for docker build
  }
}

The CodeBuild Arm knobs and how to reason about each:

Setting	What it controls	Values / note	When to change
`type`	Compute platform	`ARM_CONTAINER` for native Arm	Always, for native arm64 builds
`image`	Build image arch	`-aarch64-standard:`	Match arm64; x86 image would emulate
`compute_type`	vCPU/RAM size	`BUILD_GENERAL1_SMALL`→`BUILD_GENERAL1_2XLARGE`	Larger for heavy native compiles
`privileged_mode`	Docker-in-Docker	`true` for `docker build`	Required to build images
Reserved-capacity fleet	Dedicated warm Arm capacity	optional	Cut cold-start build latency at scale

GitHub Actions native arm64 runners

GitHub Actions provides Linux arm64 hosted runners; build each architecture on native hardware and stitch the manifest from the digests. A clean pattern is a matrix that pushes per-arch digests, then a merge job:

jobs:
  build:
    permissions:
      id-token: write   # required for the OIDC role assumption below
      contents: read
    strategy:
      matrix:
        include:
          - platform: linux/amd64
            runner: ubuntu-24.04
          - platform: linux/arm64
            runner: ubuntu-24.04-arm     # native arm64 runner
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/gha-ecr-push
          aws-region: ap-south-1
      - uses: aws-actions/amazon-ecr-login@v2
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          platforms: ${{ matrix.platform }}
          # Push by digest only; the merge job assembles the manifest list
          outputs: type=image,name=111122223333.dkr.ecr.ap-south-1.amazonaws.com/app,push-by-digest=true,name-canonical=true,push=true

The merge job then runs docker buildx imagetools create -t <repo>:<tag> <digest-amd64> <digest-arm64> to publish the final manifest list. The CI-platform options compared:

CI platform	Native arm64 path	Auth to ECR	Notes
CodeBuild	`ARM_CONTAINER` fleet	Service-role IAM	Tight AWS integration; reserved capacity
GitHub Actions	`ubuntu-*-arm` hosted runner	OIDC → `configure-aws-credentials`	Matrix + merge job pattern
GitLab CI	`saas-linux-*-arm64` runner / self-hosted	OIDC / role	Per-arch jobs, manifest merge
Self-hosted on EC2	Graviton runner host	Instance profile	Cheapest at high volume; you operate it
Jenkins	Graviton agent label	Instance profile / creds	Label-route arm64 builds to Arm agents

The two ways to assemble the final image, side by side:

Assembly method	Command	When it fits	Trade-off
Single buildx build	`buildx build --platform a,b --push`	One runner, cross-compile or QEMU	Simplest; emulation if not cross-compiling
Per-arch digests + merge	`imagetools create -t tag d1 d2`	Native runner per arch	Architecture-correct, fast; two jobs + merge
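
The per-arch digests + merge row can be sketched as a small helper that assembles the imagetools create invocation. This is an illustrative sketch under assumptions, not the workflow's actual merge job: the function name merge_command and the digests passed to it are hypothetical, and it references each source as <repo>@<digest> so the merge never depends on a mutable tag.

```python
import re

# A content digest as printed by the per-arch build jobs
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def merge_command(repo: str, tag: str, digests: list[str]) -> list[str]:
    """Build the argv for `docker buildx imagetools create` (hypothetical helper)."""
    if len(digests) < 2:
        raise ValueError("need one digest per architecture (at least two)")
    for digest in digests:
        if not DIGEST_RE.match(digest):
            raise ValueError(f"not a sha256 digest: {digest!r}")
    # Reference every per-arch image by digest inside the same repository
    sources = [f"{repo}@{digest}" for digest in digests]
    return ["docker", "buildx", "imagetools", "create", "-t", f"{repo}:{tag}", *sources]


if __name__ == "__main__":
    repo = "111122223333.dkr.ecr.ap-south-1.amazonaws.com/app"
    # Placeholder digests; real ones come from the two build jobs
    demo = merge_command(repo, "1.4.0", ["sha256:" + "a" * 64, "sha256:" + "b" * 64])
    print(" ".join(demo))
```

In a merge job the returned list could be passed to subprocess.run(cmd, check=True). Validating the digests first makes the job fail fast when one of the build jobs exported nothing.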

Surface 4 — Roll out on EC2, EKS, and Lambda

With portable images in ECR and arm64 CI, the rollout is a scheduling and instance-type exercise. Match the AMI/runtime to the arch, keep not-yet-ported workloads on x86, and let the scheduler place pods.

EC2 — the arm64 AMI is the whole trap

On EC2 the change is the instance type plus an arm64 AMI (Amazon Linux 2023, Ubuntu, Bottlerocket all publish aarch64). The trap is pulling an x86 AMI for an arm64 instance type — the launch fails, but in an ASG it can look like a capacity stall.

# Resolve the LATEST arm64 AL2023 AMI from SSM Parameter Store (never hardcode)
aws ssm get-parameter --region ap-south-1 \
  --name /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64 \
  --query 'Parameter.Value' --output text
# The x86 equivalent ends in -x86_64; using it on a *g instance type fails the launch.

data "aws_ssm_parameter" "al2023_arm64" {
  name = "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64"
}

resource "aws_launch_template" "graviton" {
  name_prefix   = "graviton-"
  image_id      = data.aws_ssm_parameter.al2023_arm64.value
  instance_type = "m7g.xlarge"
}

The arm64 AMI sources and how to pick:

AMI family	arm64 availability	How to resolve	Best for
Amazon Linux 2023	Yes	SSM `.../al2023-...-arm64`	General EC2 workloads
Bottlerocket	Yes	SSM `.../bottlerocket/.../arm64/...`	EKS nodes, minimal/immutable
Ubuntu	Yes	Canonical SSM / AMI lookup	Familiar tooling, broad packages
EKS-optimized AL2023	Yes	SSM EKS AMI param (arm64)	Self-managed EKS node groups
Windows	Not on Graviton	n/a	Keep Windows workloads on x86

EKS — mixed-architecture node groups and affinity

On EKS, run mixed-architecture node groups during the transition and let the scheduler place pods on matching nodes. Two non-negotiables: (1) your images must be multi-arch manifest lists so a pod on either arch pulls the right layer; (2) pods that are not yet arm64-clean must be pinned to x86 with nodeAffinity so they never land on a Graviton node.

apiVersion: apps/v1
kind: Deployment
metadata: { name: app }
spec:
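  replicas: 6
  selector:
    matchLabels: { app: app }   # required; must match the template labels below
  template:
    metadata:
      labels: { app: app }
    spec: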
      affinity:
        nodeAffinity:
          # Prefer arm64 once the image is validated; flip to required to enforce
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ["arm64"]
      containers:
        - name: app
          image: 111122223333.dkr.ecr.ap-south-1.amazonaws.com/app:1.4.0

For a workload still pinned to x86, invert it with a required affinity on kubernetes.io/arch: amd64. With Karpenter, express the same intent in the NodePool so it provisions Graviton capacity on demand:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata: { name: graviton }
spec:
  template:
    spec:
      nodeClassRef:               # required in karpenter.sh/v1
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default             # assumed name; point at your own EC2NodeClass
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7g.xlarge", "m7g.xlarge", "r7g.xlarge"]

The scheduling-control matrix — the exact knob for each intent and how to roll it back:

Intent	Mechanism	Rollback move	Gotcha
Prefer arm64, allow x86	`preferred...` `nodeAffinity` weight	Lower/remove weight	Soft only: with no arm64 capacity the pod lands on x86 instead of `Pending`; it does not rescue a failed image pull
Force arm64 only	`required...` `nodeAffinity` In `arm64`	Flip to `amd64`	No arm64 nodes → pod `Pending`
Keep a pod on x86	`required...` In `amd64`	Remove once ported	Must exist while the pod isn’t ported
Provision Graviton on demand	Karpenter NodePool `arch In arm64`	Disable/scale NodePool	Mind instance-type list breadth
Taint Graviton nodes	`taint` + pod `tolerations`	Remove taint	Opt-in migration per workload
Spread across arches	Two node groups, no affinity	Drain one	Image MUST be multi-arch
Weighted traffic canary	ELB target-group weights	Shift weight to x86	Independent of pod scheduling

The well-known label kubernetes.io/arch is set automatically by the kubelet on every node, so you can rely on it without custom labeling.
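
The first three rows of the matrix differ only in which affinity field is set and which architecture it names. A minimal sketch of that mapping, assuming a hypothetical helper arch_affinity that emits the pod-spec fragment as a dictionary (kubectl accepts JSON as well as YAML):

```python
import json

ARCH_LABEL = "kubernetes.io/arch"


def arch_affinity(arch: str, required: bool, weight: int = 80) -> dict:
    """Return the `affinity` fragment for one scheduling intent (hypothetical helper)."""
    if arch not in ("arm64", "amd64"):
        raise ValueError(f"unexpected architecture: {arch!r}")
    if not 1 <= weight <= 100:
        raise ValueError("preferred-affinity weight must be between 1 and 100")
    term = {"matchExpressions": [{"key": ARCH_LABEL, "operator": "In", "values": [arch]}]}
    if required:
        # Hard rule: the pod stays Pending if no node of this arch exists
        node_affinity = {
            "requiredDuringSchedulingIgnoredDuringExecution": {"nodeSelectorTerms": [term]}
        }
    else:
        # Soft rule: the scheduler favours this arch but may place the pod elsewhere
        node_affinity = {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": weight, "preference": term}
            ]
        }
    return {"affinity": {"nodeAffinity": node_affinity}}


if __name__ == "__main__":
    # Pin a not-yet-ported workload to x86
    print(json.dumps(arch_affinity("amd64", required=True), indent=2))
```

Rolling back a forced-arm64 workload is then a one-argument change, arch_affinity("amd64", required=True), which is the same flip the Rollback move column describes.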

Managed services — modify the class, but benchmark on a clone first

Most managed services let you flip to Graviton by changing the instance/node class — the heavy lifting is benchmarking, not plumbing:

Service	Graviton class examples	Migration mechanism	Risk / rollback
RDS / Aurora	`db.r7g.*`, `db.r8g.*`, `db.m7g.*`	Modify instance class → failover	Low; storage untouched; test on a clone
Aurora (blue/green)	same	Blue/green deployment switchover	Reversible; validate green first
ElastiCache (Redis/Valkey)	`cache.r7g.*`, `cache.m7g.*`	Scale / node-type change	Validate with real key/value sizes
OpenSearch	`r7g.*.search`, `m7g.*.search`	Blue/green domain update	Rolls nodes; watch shard rebalancing
Lambda	`architectures = ["arm64"]`	Set the architecture	Lowest risk if bundled deps are aarch64
MSK / others	Graviton broker types where offered	Rolling broker update	Per-engine availability varies

resource "aws_lambda_function" "worker" {
  function_name = "worker"
  role          = aws_iam_role.lambda.arn
  package_type  = "Image"
  image_uri     = "111122223333.dkr.ecr.ap-south-1.amazonaws.com/worker:1.4.0"
  architectures = ["arm64"] # the entire migration for a packaged-correctly function
  memory_size   = 1024
  timeout       = 30
}

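For zip-based Lambdas, the only requirement is that any bundled native dependency is an aarch64 build. Layer-packaged binaries compiled for x86 fail at cold start, so rebuild them on arm64. For container-image Lambdas like the one above, image_uri must resolve to a single-architecture arm64 image: Lambda does not accept multi-arch manifest lists, so publish a dedicated arm64 tag for the function rather than reusing the manifest-list tag your EKS workloads pull. For RDS/Aurora, always rehearse on a clone or the green side of a blue/green deployment before you fail production over.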

The Graviton instance-family landscape, so you pick the right family per workload profile:

Family	Profile	Graviton gens	Typical workload
`C*g` (`c7g`, `c8g`)	Compute-optimized	G3, G4	CPU-bound services, encoding, gaming servers
`M*g` (`m6g`, `m7g`, `m8g`)	General-purpose	G2, G3, G4	Web/API tiers, microservices, app servers
`R*g` (`r6g`, `r7g`, `r8g`)	Memory-optimized	G2, G3, G4	Caches, in-memory DBs, large heaps
`*gd` suffix	+ local NVMe	per gen	Local-storage-heavy workloads
`*gn` suffix	+ enhanced network	per gen	Network-bound, high-PPS workloads
`X2g*`	Extra-large memory	G2	SAP HANA-class, very large in-memory
`T4g`	Burstable	G2	Dev, low-traffic, free-trial-eligible
`Im4gn` / `Is4gen`	Storage + dense local NVMe	G2	Storage-dense, high-IOPS local
`Hpc7g`	HPC-optimized	G3	Tightly-coupled HPC

Surface 5 — Benchmarking methodology

Never migrate on faith. Run a controlled comparison and report price-performance, not raw speed.

Identical software, different arch. Same image (multi-arch), same config, same data set. The only variable is instance family — compare like-for-like sizes (m6i.xlarge vs m7g.xlarge).
Representative load. Replay production-shaped traffic, not synthetic hello-world. Measure at a fixed, sustained request rate and report p50/p95/p99 latency and max sustained throughput before SLO breach.
Warm and steady. Discard warm-up; let JITs compile and caches fill. Run long enough to see GC/compaction behaviour.
Compute the ratio that matters. Price-performance = sustained RPS at your latency SLO ÷ the On-Demand hourly price of each instance. Compare the ratios, not the raw RPS.

# Constant-arrival-rate load (k6): the rate, duration, and VU pool are set by a
# `constant-arrival-rate` scenario in load.js. Passing --vus/--duration instead
# runs a fixed-VU, closed-loop test.
k6 run -e TARGET=https://app.internal/api/checkout load.js

# price-perf = sustained_rps_at_SLO / on_demand_price_per_hour
# Compare the m7g (Graviton) ratio against the m6i (x86) ratio.

# Confirm you are NOT emulating before trusting any number:
ssh ec2-user@<arm-node> 'uname -m'   # expect: aarch64

The benchmark controls — what to hold fixed and why, because an uncontrolled benchmark lies:

Control	Hold fixed	Why	Failure if you don’t
Image	Same multi-arch tag	Only arch should vary	Comparing two different builds
Instance size	Like-for-like (`m6i` vs `m7g`, same size)	Fair vCPU/RAM	Apples-to-oranges sizing
Load model	Constant arrival rate (open-loop)	Stable comparison point	Closed-loop (fixed VUs) hides tail latency
Warm-up	Discard first N minutes	JIT/caches must settle	Cold numbers favour neither fairly
Duration	Long enough for GC/compaction	See steady state	Misses periodic stalls
Native check	`uname -m` = aarch64	Rule out emulation	Benchmarking QEMU, not Graviton
Metric	RPS-at-SLO ÷ price	Price-performance	Raw speed misleads the decision

How to read the result — the decision table:

Benchmark result	It means	Do this
Graviton higher RPS, lower price	Clear price-perf win	Migrate; ramp the canary
Similar RPS, ~20% lower price	Price-performance win	Migrate; the savings are in the price
Lower RPS but cheaper, ratio still wins	Net price-perf win	Migrate on the ratio, not the latency
Lower RPS, ratio loses, native confirmed	Genuinely x86-favored hot path	Profile; swap library; or keep on x86
“Slow” but `uname -m` ≠ aarch64	You benchmarked QEMU	Fix the image; re-run native

A correct result reads: “m7g.xlarge sustained 9,400 RPS at p99 < 120 ms vs 7,800 RPS on m6i.xlarge, at ~20% lower hourly price: ~50% better price-performance” (9,400 ÷ 7,800 ≈ 1.21× the throughput, divided by 0.80× the price, ≈ 1.51×). If Graviton loses while confirmed native, you have found a workload that needs profiling (often a hot path with no Arm-optimized library), not a reason to abandon the program.
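
The ratio is simple enough to compute by hand, but scripting it keeps the report honest across many tiers. A minimal sketch, assuming prices normalized so the x86 instance costs 1.00 per hour and the Graviton instance 0.80 (substitute your region's actual On-Demand prices); the function names are illustrative:

```python
def price_performance(sustained_rps: float, hourly_price: float) -> float:
    """Sustained RPS at the latency SLO per unit of hourly price."""
    if hourly_price <= 0:
        raise ValueError("hourly price must be positive")
    return sustained_rps / hourly_price


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of the candidate ratio over the baseline ratio."""
    return candidate / baseline - 1.0


# Throughput figures from the example result; prices are normalized assumptions
x86 = price_performance(sustained_rps=7_800, hourly_price=1.00)
graviton = price_performance(sustained_rps=9_400, hourly_price=0.80)

gain = relative_gain(graviton, x86)
print(f"{gain:.1%}")  # prints: 50.6%
```

Because both throughput and price enter the ratio, a tier with identical RPS on both architectures still shows a 25% gain at a 20% lower price (1 ÷ 0.80 = 1.25), which is the "Similar RPS" row of the decision table.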

Surface 6 — Phased cutover, canary, and rollback

Migrate one tier at a time, in increasing order of blast radius: batch/async consumers and dev environments first, then stateless API tiers, then anything stateful. For each tier, run a canary on Graviton behind the same load balancer / service and watch SLOs.

The cutover order and why it’s sequenced this way:

Phase	Tier	Why this order	Rollback cost
1	Dev / staging	Catch build & agent issues cheaply	Trivial
2	Batch / async consumers (SQS, jobs)	No user-facing latency SLO	Re-queue; restart on x86
3	Lambda functions	One-flag change, lowest risk	Set `architectures` back
4	Stateless API tier	Bulk of the savings; canary-gated	`nodeAffinity`/weight flip
5	Caches (ElastiCache)	Validated K/V sizes	Node-type revert
6	Databases (RDS/Aurora)	Highest blast radius; blue/green	Switch back to x86 (blue)

The canary ramp and the SLO gate at each step:

Step	Graviton traffic share	Watch for a full traffic cycle	Promote if	Abort if
1	5-10%	p99, error rate, saturation	within x86 baseline	p99 drift > threshold
2	25%	+ GC/compaction behaviour	stable across peak	error-rate spike
3	50%	+ cost/throughput trend	price-perf confirmed	any SLO breach
4	100%	full peak soak	clean for one business cycle	regression at scale
5	Drain x86	residual emulation / stragglers	zero x86 pods needed	keep x86 if unsure
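
The promote/abort columns can be encoded as an explicit gate so the decision at each step is mechanical rather than a judgment call at 02:00. This is a sketch under assumptions: the thresholds (5% p99 drift, 0.1 percentage points of extra error rate) and the names Window and canary_decision are hypothetical starting points, not values from a real SLO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Window:
    """Metrics for one fleet over the same observation window."""
    p99_ms: float
    error_rate: float  # fraction of requests, 0.0 to 1.0


def canary_decision(
    baseline: Window,
    canary: Window,
    max_p99_drift: float = 0.05,       # assumed threshold: 5% above the x86 baseline
    max_error_delta: float = 0.001,    # assumed threshold: +0.1 percentage points
    slo_p99_ms: Optional[float] = None,
) -> str:
    """Return "promote" or "abort" for one ramp step."""
    # Any outright SLO breach aborts regardless of the baseline
    if slo_p99_ms is not None and canary.p99_ms > slo_p99_ms:
        return "abort"
    drift = canary.p99_ms / baseline.p99_ms - 1.0
    if drift > max_p99_drift:
        return "abort"
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "abort"
    return "promote"


if __name__ == "__main__":
    x86 = Window(p99_ms=100.0, error_rate=0.001)
    graviton = Window(p99_ms=104.0, error_rate=0.001)
    print(canary_decision(x86, graviton, slo_p99_ms=120.0))
```

Feeding the gate from the same dashboard queries at every step keeps the 5%, 25%, 50%, and 100% decisions comparable.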

Rollback is trivial when you keep the x86 path alive. Because the image is multi-arch and the x86 node group still exists, rollback is a scheduling change: flip nodeAffinity back to amd64 (or shift target-group weights), and pods reschedule onto x86 with no rebuild and no image change. Keep both node groups until a tier has soaked at 100% Graviton for at least one full business cycle. The rollback triggers and the corresponding move:

Rollback trigger	Signal	Rollback move	Time to safe
p99 regression on canary	Latency dashboard vs baseline	Flip affinity/weight to `amd64`	Seconds (reschedule)
Error-rate spike	5xx / app errors climb	Shift ELB target weight to x86	Seconds
Agent crash-loop on Graviton	DaemonSet `CrashLoopBackOff`	Cordon Graviton nodes; pin to x86	Minutes
Emulation discovered	`uname -m` ≠ aarch64 under load	Fix image; meanwhile pin to x86	Minutes
DB failover regression	Aurora metrics degrade	Blue/green switch back to blue	Minutes

Architecture at a glance

The diagram traces the migration as it actually flows, left to right, as a pipeline from source to running fleet, with the failure point on each hop marked. Read it as four zones. In SOURCE & AUDIT, your repository and lockfiles go through the portability audit — the gate that catches a missing aarch64 wheel or an x86-only agent before anything is built (badge 1). In BUILD (multi-arch), the buildx builder cross-compiles or uses a native arm64 runner and pushes a manifest list to ECR; the failure here is a single-arch image that will silently emulate downstream (badge 2). The SCHEDULE & PLACE zone is where EKS (with nodeAffinity on kubernetes.io/arch) and Karpenter place pods onto Graviton or x86 nodes, and where an arm64 instance launched with an x86 AMI stalls (badge 3) or a not-yet-ported pod lands on Graviton and emulates (badge 4). Finally RUN & PROVE is the canary behind the load balancer, benchmarked for price-performance, with the x86 path kept alive for instant rollback (badge 5).

Notice the spine running through every zone: the same kubernetes.io/arch label and the same multi-arch manifest are what make placement correct and rollback a scheduling flip rather than a rebuild. The first question on every step is the same one that governs the whole migration — “am I running native arm64, or did something quietly fall back to emulation?” — and the diagram marks the exact hop where each silent fallback bites.

Real-world scenario

Paykit, a fintech platform team, ran a Java (Spring Boot) payments API on ~200 m6i.xlarge instances across three EKS clusters in ap-south-1 and wanted Graviton’s savings to hit a board-level cost target: cut platform compute spend by a third. Monthly EKS compute was roughly ₹52 lakh. The platform team was six engineers; the constraint was non-negotiable: a mandated EDR agent ran as a DaemonSet on every node, and the security team would not approve the migration until that exact sensor version was certified on arm64. They also suspected, but had not confirmed, that one internal library still pulled an x86-only native .so for a legacy HSM client.

They sequenced it deliberately, gating on the portability matrix. Week one’s audit caught both blockers, exactly as designed. The Maven dependency scan (the JVM counterpart of a pip/npm lockfile audit) flagged the HSM client’s native .so as x86-only at the pinned version; the agent matrix showed that the EDR sensor’s GA arm64 build was two patch versions ahead of the mandated one. Finding these in week one, not week three, was the whole point: they opened a vendor ticket for the certified EDR build and rebuilt the HSM client with an aarch64-unknown-linux-gnu toolchain, then published the service as a multi-arch manifest list and verified both platforms with docker buildx imagetools inspect.

The build farm was the next obstacle. Their existing CodeBuild project built amd64-only, and the first attempt to add arm64 via QEMU emulation made the image build take 22 minutes — unacceptable for a team that deployed a dozen times a day. They switched to a CodeBuild ARM_CONTAINER fleet building arm64 natively and a small merge step (imagetools create) to assemble the manifest from two digests; build time dropped back to under 5 minutes per arch in parallel. Slow CI would have stalled adoption regardless of how good Graviton looked on paper.

For the rollout they stood up a Graviton Karpenter NodePool alongside the existing x86 one and started with a 5% weighted canary, using preferred (not required) nodeAffinity so a shortage of Graviton capacity could never strand a pod Pending:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 90
        preference:
          matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values: ["arm64"]

Before trusting a single number they confirmed native execution — kubectl exec deploy/app -- uname -m returned aarch64 on the canary pods, ruling out the silent-emulation trap. The canary held p99 within 4% of the x86 baseline across a full peak cycle, so they ramped 5 → 25 → 50 → 100% over two weeks, draining the x86 node group last and keeping it alive until the API tier had soaked at 100% for a full business week. Benchmarking the API tier showed ~43% better price-performance; combined with a parallel flip of their async workers to arm64 Lambda and the Aurora reader fleet to db.r7g (rehearsed on a blue/green green-side first), the program cut the platform’s monthly compute bill by roughly a third, from ₹52 lakh toward the board target.

The decisive move was treating the EDR agent and the HSM .so as first-class migration dependencies caught by a gating audit, not afterthoughts discovered in production — either one, found late, would have blocked the whole effort after the tier was half-ported. The timeline, because the order of moves is the lesson:

Week	Step	Action	Effect	What it would have been if skipped
1	Portability audit	Scan lockfiles + agent matrix	Caught EDR + HSM `.so` blockers	Discovered in prod, week-3 veto
1	Vendor tickets	Request certified EDR arm64 build	Unblocked security sign-off	Tier stalled awaiting approval
2	Rebuild deps	aarch64 HSM client; multi-arch image	`imagetools inspect` shows both	Segfault on first arm64 node
2	Build farm	QEMU build = 22 min → reject	Ruled out adoption-killing CI	Slow CI erodes the rollout
3	Native CI	CodeBuild `ARM_CONTAINER` + merge	< 5 min/arch parallel	Teams avoid arm64 builds
4	Canary 5%	Karpenter NodePool, `preferred` affinity	`uname -m` = aarch64; p99 within 4%	No arm64 capacity strands a pod `Pending` (if `required`)
5-6	Ramp 25→100%	SLO-gated weighted ramp	Clean through peak	Big-bang risk, hard rollback
6	Adjacent flips	Lambda arm64 + Aurora `db.r7g`	~⅓ bill cut	Savings left on the table

Advantages and disadvantages

The Graviton migration model — portable artifacts placed by the scheduler with the x86 path kept alive — both delivers the savings and contains the risk. Weigh it honestly:

Advantages (why this approach works)	Disadvantages (why it bites)
~20% lower hourly price + more throughput/$ on suitable workloads — savings land without re-architecting	The headline ~40% is workload-dependent; single-thread-bound x86-tuned code may lose
A multi-arch manifest means one tag serves both arches; the scheduler picks correctly	Ship a single-arch image and it silently emulates: “it runs” hides a 40-70% throughput loss
Rollback is a `nodeAffinity`/weight flip — no rebuild, seconds to safe	You must keep the x86 node group alive (extra cost) for the soak window
Lambda arm64 is a one-flag change, lowest-risk highest-ROI flip	One x86-only bundled binary fails at cold start with a confusing error
Managed services flip by class with low-risk blue/green / clone testing	A class may be unavailable for your exact engine/version
The portability audit catches the one blocking dependency up front	Skip the audit and a mandated x86-only agent vetoes a half-ported tier in week three
Graviton + Spot stacks the deepest discount on interruption-tolerant tiers	Native-heavy stacks (Python/Node addons) need native-runner CI, not cross-compile

The approach is right for any horizontally-scaled, throughput-bound estate — web/API tiers, microservices, caches, queue consumers, JIT/managed runtimes — where the audit is done and CI builds on real silicon. It is wrong, or needs a benchmark-first posture, for single-thread-latency-bound code tuned for x86, hand-written x86 intrinsics/AVX-512 paths, and anything gated by a dependency with no aarch64 build. The disadvantages are all manageable — but only if you treat the audit and the native-execution check (uname -m) as gates, not optional steps.

Hands-on lab

Build a real multi-arch image, push it to ECR, run it on a Graviton instance, and prove it’s running native arm64 — free-tier-friendly where possible (we use a t4g Graviton instance, which has a free-trial allowance; delete at the end). Run from a workstation with Docker + buildx and the aws CLI configured.

Step 1 — Variables and an ECR repository.

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-south-1
REPO=graviton-lab
REPO_URI=$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO
aws ecr create-repository --repository-name $REPO --region $REGION \
  --query 'repository.repositoryUri' --output text

Expected: the repository URI prints.

Step 2 — A tiny multi-arch app and Dockerfile.

cat > main.go <<'EOF'
package main
import ("fmt"; "net/http"; "runtime")
func main() {
  http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "hello from %s/%s\n", runtime.GOOS, runtime.GOARCH)
  })
  http.ListenAndServe(":8080", nil)
}
EOF
cat > Dockerfile <<'EOF'
# syntax=docker/dockerfile:1
FROM --platform=$BUILDPLATFORM golang:1.22 AS build
ARG TARGETOS TARGETARCH
WORKDIR /src
COPY main.go .
RUN go mod init lab && CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app .
FROM public.ecr.aws/docker/library/alpine:3.20
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
EOF

Step 3 — Build and push a manifest list for both arches.

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT.dkr.ecr.$REGION.amazonaws.com
docker buildx create --name multiarch --driver docker-container --use 2>/dev/null || docker buildx use multiarch
docker buildx build --platform linux/amd64,linux/arm64 \
  --tag $REPO_URI:1.0.0 --provenance=false --push .

Step 4 — Prove the image is genuinely multi-arch (the key check).

docker buildx imagetools inspect $REPO_URI:1.0.0
# Expect BOTH:  Platform: linux/amd64  AND  Platform: linux/arm64

If only one platform appears, the build was single-arch and any arm64 node would emulate or fail — that is the trap this lab teaches you to catch.
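
The same check can run as a CI gate rather than by eye. A minimal sketch, assuming the raw index JSON is captured from docker buildx imagetools inspect --raw <tag> and passed in as a string; the function names are illustrative:

```python
import json


def platforms(index_json: str) -> set:
    """Return the os/arch pairs listed in a manifest list or OCI image index."""
    index = json.loads(index_json)
    found = set()
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform") or {}
        os_name = platform.get("os")
        arch = platform.get("architecture")
        # Attestation entries are recorded as unknown/unknown; they are not runnable images
        if not os_name or not arch or "unknown" in (os_name, arch):
            continue
        found.add(f"{os_name}/{arch}")
    return found


def missing_platforms(index_json: str, required=("linux/amd64", "linux/arm64")) -> set:
    """Return the required platforms that the tag does not provide."""
    return set(required) - platforms(index_json)


if __name__ == "__main__":
    sample = json.dumps({"manifests": [
        {"platform": {"os": "linux", "architecture": "amd64"}},
        {"platform": {"os": "linux", "architecture": "arm64"}},
    ]})
    gaps = missing_platforms(sample)
    print("multi-arch OK" if not gaps else f"missing: {sorted(gaps)}")
```

A single-arch tag returns a plain image manifest with no manifests array, so the sketch reports every required platform as missing. That fails the gate, which is the desired outcome, but the message overstates what is absent.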

Step 5 — Launch a Graviton instance and run the image natively. Launch a t4g.micro with an arm64 AL2023 AMI (resolved from SSM, never hardcoded), then on the instance:

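# On the Graviton instance (Docker installed; the instance profile must allow ECR pulls,
# e.g. via the AmazonEC2ContainerRegistryReadOnly managed policy):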
uname -m                                  # expect: aarch64  (you are on Graviton)
aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin <acct>.dkr.ecr.ap-south-1.amazonaws.com
docker run --rm -p 8080:8080 -d <acct>.dkr.ecr.ap-south-1.amazonaws.com/graviton-lab:1.0.0
curl localhost:8080                       # expect: hello from linux/arm64

The pair that proves success: uname -m returns aarch64 and the app reports linux/arm64 — native Graviton, not emulation.

Step 6 — (Optional) Confirm the x86 variant exists too. On any x86 host, docker run --rm <repo>:1.0.0 prints hello from linux/amd64 from the same tag — one manifest list, both arches, the scheduler picks correctly.

Validation checklist. You built one Dockerfile into a multi-arch manifest list, verified both platforms with imagetools inspect, ran it native on Graviton confirmed by uname -m + GOARCH, and saw the same tag serve x86. The lab steps mapped to what each proves:

Step	What you did	What it proves	Real-world analogue
3	`buildx --platform amd64,arm64 --push`	One build → manifest list	Your production image build
4	`imagetools inspect` shows both	The anti-emulation gate	The check that catches silent QEMU
5	`uname -m`=aarch64 + `GOARCH`=arm64	Native Graviton, not emulation	The pre-benchmark sanity check
6	Same tag on x86 prints amd64	One tag, both arches	Mixed-arch fleet during transition

Cleanup (avoid lingering charges).

# Terminate the t4g instance from the console/CLI, then:
aws ecr delete-repository --repository-name graviton-lab --region ap-south-1 --force
docker buildx rm multiarch

Cost note. A t4g.micro is among the cheapest Graviton instances (only t4g.nano is smaller). Where the T4g free trial is still offered it covers t4g.small, so check which size your account's allowance applies to; otherwise expect well under a rupee per hour. An hour of this lab is well under ₹20, and terminating the instance + deleting the repo stops everything.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark during a cutover. First as a scannable table, then the same entries with the full confirm-command detail underneath.

#	Symptom	Root cause	Confirm (exact cmd / path)	Fix
1	Throughput ~⅓ expected on Graviton; “Graviton is slow”	Single-arch image running under QEMU	`docker buildx imagetools inspect <tag>` (one platform); `uname -m` vs `GOARCH`	Publish a multi-arch manifest list; redeploy
2	`exec format error` on an arm64 node	Wrong-arch binary executed	`file ./binary`; `uname -m` on host	Build/pull arm64 artifact (don’t rely on qemu)
3	`pip install` fails: “no matching distribution”	No aarch64 wheel at the pinned version	`pip download --platform manylinux2014_aarch64 ...`	Unpin / source-build w/ toolchain / swap package
4	ASG instances never `InService`, “capacity stall”	x86 AMI on an arm64 instance type	ASG activity = launch failed; `describe-images` Architecture	Use an arm64 AMI (resolve via SSM param)
5	Pod `Pending`: “didn’t match node affinity”	`required` `amd64` affinity but only arm64 nodes (or vice-versa)	`kubectl get nodes -L kubernetes.io/arch`; describe pod	Add matching node group; or relax to `preferred`
6	Container `SIGILL`/segfault on arm64 only	Hand-written x86 intrinsics / AVX path	Crashes arm64, fine amd64; `dmesg`	Portable build flag / arm64 codepath / library
7	EDR/agent DaemonSet `CrashLoopBackOff` on Graviton	Agent version not arm64-certified	`kubectl logs ds/<agent>`; vendor matrix	Pin certified arm64 build; canary one node group
8	Node native addon: “invalid ELF header”	x86 prebuilt addon baked, run on arm64	`npm rebuild` on arm64; check addon arch	Rebuild on arm64 runner; multi-arch image
9	Lambda fails at cold start on arm64	Bundled native binary is x86	`aws lambda get-function-configuration --function-name <fn> --query Architectures`	Rebuild the bundled dep on arm64
10	Slower than x86 even when confirmed native	Genuinely x86-favored hot path	Native-vs-native benchmark; `uname -m`=aarch64	Profile; swap library; or keep tier on x86
11	Multi-arch build takes 20+ min	QEMU emulating the other arch	Build log shows qemu; one slow arch	Cross-compile (Go/Rust) or native arm64 runner
12	RDS modify to `db.r7g` fails	Class unavailable for engine/version	`describe-orderable-db-instance-options`	Upgrade engine version; pick an available class
13	Some pods on arm64, some on x86, inconsistent	No affinity + only one arch ported	`kubectl get pods -o wide`; node arch labels	Pin not-ported pods to `amd64` until validated
14	“It’s on Graviton” but bill went up	Emulation needs more instances to carry load	`uname -m` across fleet; throughput per instance	Fix to native; re-right-size instance count

The expanded form, for the entries that bite hardest:

1. Throughput is a third of expected and the team concludes “Graviton is slow.” Root cause: A single-arch (linux/amd64-only) image was pulled to an arm64 node and is running under QEMU emulation: correct output, 30-60% of native throughput. Confirm: docker buildx imagetools inspect <tag> shows only linux/amd64; the node reports arm64 (kubectl get nodes -L kubernetes.io/arch) while kubectl exec <pod> -- uname -m returns x86_64, because QEMU user-mode emulation reports the emulated architecture to the process. The pair (arm64 node, x86_64 inside the pod) is the signature of emulation. Fix: Rebuild and push a multi-arch manifest list (--platform linux/amd64,linux/arm64), redeploy, re-verify with imagetools inspect.

2. exec format error when the container or binary starts on arm64. Root cause: A wrong-architecture binary is being executed directly (no emulation layer present). Confirm: file ./binary reports x86-64; uname -m on the host is aarch64. Fix: Build/pull the arm64 artifact. Installing qemu-user-static makes it run but is a slow stopgap, not a fix — produce the native binary.
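
What file reports here comes from the ELF header, and the same field can be read directly when the file utility is not installed in a minimal image. A minimal sketch, assuming a hypothetical helper elf_machine that only distinguishes the two architectures this migration cares about:

```python
import platform
import struct

# e_machine values from the ELF specification
EM_X86_64 = 62
EM_AARCH64 = 183
_MACHINES = {EM_X86_64: "x86_64", EM_AARCH64: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the target architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, after e_ident (16 bytes) and e_type (2 bytes)
    (e_machine,) = struct.unpack_from(endian + "H", header, 18)
    return _MACHINES.get(e_machine, f"unknown({e_machine})")


def is_native(path: str) -> bool:
    """True when the binary's architecture matches the machine running this script."""
    with open(path, "rb") as handle:
        return elf_machine(handle.read(20)) == platform.machine()
```

On an arm64 node, is_native returning False for a service binary or a bundled .so identifies the artifact to rebuild before the node ever tries to execute it.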

3. pip install (or npm install) fails with “no matching distribution found.” Root cause: A native package has no aarch64 wheel/prebuilt at the pinned version. Confirm: pip download -r requirements.txt --platform manylinux2014_aarch64 --only-binary=:all: ... errors on that package. Fix: Unpin to a version that ships an aarch64 wheel, source-build it with the appropriate toolchain (e.g. Rust for cryptography), or swap the package. Build on a native arm64 runner so the source build is fast.
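
When auditing a directory of downloaded wheels, the platform tag in each filename already says whether the wheel can run on Graviton. A minimal sketch, assuming standard wheel filenames; the helper names and the example filenames are illustrative:

```python
def wheel_platform_tags(filename: str) -> list:
    """Split the platform tag set out of a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    # name-version[-build]-python-abi-platform.whl
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    # Compressed tag sets join several platforms with "."
    return parts[-1].split(".")


def runs_on_aarch64(filename: str) -> bool:
    """True for pure-Python wheels and for wheels built for Linux aarch64."""
    return any(
        tag == "any" or tag.endswith("_aarch64")
        for tag in wheel_platform_tags(filename)
    )


if __name__ == "__main__":
    for name in (
        "requests-2.31.0-py3-none-any.whl",
        "cryptography-42.0.5-cp39-abi3-manylinux_2_28_aarch64.whl",
        "numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    ):
        print(name, runs_on_aarch64(name))
```

Any wheel for which runs_on_aarch64 is False is a candidate blocker for the portability matrix; source distributions (.tar.gz) are out of scope for this check and need a build test instead.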

4. ASG instances never reach InService; it looks like a Spot/capacity stall. Root cause: The launch template references an x86 AMI on an arm64 (*g) instance type, so every launch fails. Confirm: ASG Activity history says the launch failed (not “no capacity”); aws ec2 describe-images --image-ids <ami> --query 'Images[].Architecture' returns x86_64. Fix: Resolve and use an arm64 AMI from SSM (.../al2023-ami-...-arm64); never hardcode an AMI ID across arches.

7. The EDR/observability DaemonSet crash-loops on Graviton nodes. Root cause: The agent version deployed has no certified arm64 build (or the wrong build was pulled). Confirm: kubectl logs ds/<agent> -n <ns> shows an arch/format error; the vendor’s support matrix lists a different arm64-GA version. Fix: Pin the certified arm64 build with security sign-off; validate on a single canary node group before fleet-wide. This is the classic week-three veto — catch it in the audit.

10. Genuinely slower than x86 even after confirming native execution. Root cause: A real x86-favored hot path — single-thread-bound code, an x86-only optimized library, or hand-tuned intrinsics with no Arm equivalent. Confirm: A native-vs-native benchmark (uname -m = aarch64 on both runs) shows Graviton losing on price-performance, not just raw speed. Fix: Profile the hot path; swap in an Arm-optimized library; or accept that this specific tier stays on x86. A loss here is a data point, not a program failure.

14. The fleet is “on Graviton” but the bill went up. Root cause: Widespread emulation — single-arch images carrying the load under QEMU at a fraction of throughput, so you provisioned more instances to compensate. Confirm: uname -m and per-instance throughput across the fleet; imagetools inspect on the deployed tags. Fix: Make every image native multi-arch, redeploy, then re-right-size the instance count to the real (higher) native throughput. The savings reappear once you’re native.
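
The arithmetic behind the rising bill is worth making explicit when you re-right-size. A minimal sketch with illustrative numbers: the 100,000 RPS peak and 30% headroom are assumptions, the native per-instance figure reuses the benchmark example, and emulation is modelled at a third of native throughput.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, headroom: float = 0.3) -> int:
    """Instances required to carry the peak with the given fractional headroom."""
    if per_instance_rps <= 0:
        raise ValueError("per-instance throughput must be positive")
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)


PEAK_RPS = 100_000           # assumed fleet peak
NATIVE_RPS = 9_400           # per-instance throughput when running native arm64
EMULATED_RPS = NATIVE_RPS / 3  # the same image under QEMU, at roughly a third

native_fleet = instances_needed(PEAK_RPS, NATIVE_RPS)
emulated_fleet = instances_needed(PEAK_RPS, EMULATED_RPS)
print(native_fleet, emulated_fleet)  # prints: 14 42
```

Three times the instances at a ~20% lower unit price is still a much larger bill, which is why the fix order is native first, then instance count.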

Best practices

Gate the rollout on a portability matrix. No tier starts migrating until every native dep, runtime, agent, and sidecar is confirmed aarch64 or has an explicit waiver. The audit is the cheapest insurance in the program.
Audit lockfiles, not requirements. The blocking dependency is almost always transitive and compiled — scan the resolved, pinned graph.
Validate every mandated agent at the exact policy version. “Has an arm64 build” is not “the version security mandates has an arm64 build.” Get explicit sign-off.
One Dockerfile, one manifest list. Build --platform linux/amd64,linux/arm64 and verify both with docker buildx imagetools inspect on every release. Never maintain two Dockerfiles.
Cross-compile compiled languages; native-build the rest. Go/Rust cross-compile cleanly via $TARGETARCH; build Python/Node native-heavy stacks on a native arm64 runner, not under QEMU.
Build arm64 CI on real silicon. CodeBuild ARM_CONTAINER or GitHub ubuntu-*-arm runners. Slow emulated builds quietly kill adoption.
Prove native before you benchmark. uname -m = aarch64 (and GOARCH/os.arch = arm64) is the gate before trusting any performance number — it rules out the silent-emulation trap.
Report price-performance, not raw speed. Sustained RPS at your latency SLO ÷ on-demand price, like-for-like sizes. The decision lives in the ratio.
Keep not-yet-ported pods pinned to x86. A required nodeAffinity on amd64 ensures a half-ported workload never lands on Graviton and emulates.
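Use preferred affinity for the canary. If Graviton capacity is missing or slow to provision, the pod schedules onto x86 instead of sitting Pending. It does not rescue a failed image pull, which is why the manifest check comes first.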
Keep the x86 path alive through the soak. Rollback is then a nodeAffinity/weight flip with no rebuild; drain x86 only after a full business cycle clean at 100%.
Stack Graviton with Spot on interruption-tolerant tiers for the deepest discount, with capacity-optimized allocation and interruption handling.
Start with Lambda and async workers. Lowest risk, fastest ROI; they build organizational confidence before you touch the API tier.

The signals worth watching before and during a cutover — leading indicators, not the lagging “it’s slow”:

Watch	Signal	Threshold (starting point)	Why it’s leading
Native execution	`uname -m` per node/pod	any `x86_64` on a `*g` node	Catches emulation before benchmarking
Manifest completeness	`imagetools inspect` platforms	< 2 platforms on a deployed tag	Catches single-arch ship before deploy
Canary p99 drift	p99 Graviton vs x86 baseline	> a few % sustained	Promote/abort decision input
Agent health	DaemonSet ready on Graviton nodes	any `CrashLoopBackOff`	The week-three veto, early
Per-instance throughput	RPS/instance vs benchmark	well below native number	Emulation or wrong sizing
Bill trend	Compute $ per unit work	rising during “migration”	Emulation needing more instances
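
The manifest-completeness signal in the table can be checked mechanically. The sketch below assumes the JSON comes from `docker buildx imagetools inspect --raw <tag>` and follows the OCI image index layout (a top-level `manifests` array with a `platform` object per entry); the function names and the sample document are illustrative.

```python
import json

# Platforms a deployed tag is expected to carry, per the table above.
REQUIRED = frozenset({("linux", "amd64"), ("linux", "arm64")})


def platforms(index_json: str) -> set[tuple[str, str]]:
    """Extract (os, architecture) pairs from an image index / manifest list."""
    index = json.loads(index_json)
    found: set[tuple[str, str]] = set()
    # A plain single-arch manifest has no "manifests" array, so it yields an
    # empty set and fails the gate, which is the behaviour we want here.
    for manifest in index.get("manifests", []):
        plat = manifest.get("platform", {})
        os_name, arch = plat.get("os"), plat.get("architecture")
        # Attestation entries are recorded as unknown/unknown; skip them.
        if os_name and arch and "unknown" not in (os_name, arch):
            found.add((os_name, arch))
    return found


def missing_platforms(index_json: str, required=REQUIRED) -> set[tuple[str, str]]:
    """Return the required platforms that the tag does not provide."""
    return set(required) - platforms(index_json)


if __name__ == "__main__":
    # Illustrative index: amd64 plus an attestation entry, no arm64.
    sample = json.dumps({
        "manifests": [
            {"platform": {"os": "linux", "architecture": "amd64"}},
            {"platform": {"os": "unknown", "architecture": "unknown"}},
        ]
    })
    print("missing:", sorted(missing_platforms(sample)))
```

Failing the release when `missing_platforms()` is non-empty turns the "< 2 platforms on a deployed tag" threshold into a hard gate rather than a dashboard check.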

Security notes

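Least-privilege CI push. The CI role that pushes to ECR should be scoped to the specific repository and to the push actions only (ecr:BatchCheckLayerAvailability, ecr:InitiateLayerUpload, ecr:UploadLayerPart, ecr:CompleteLayerUpload, ecr:PutImage), plus ecr:GetAuthorizationToken, which only accepts Resource "*". Avoid wildcards such as ecr:Put* or ecr:Batch*: they also grant policy changes and ecr:BatchDeleteImage. Assume the role via OIDC (configure-aws-credentials) or an instance profile; never keep long-lived keys in the runner.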
Pin image digests, not floating tags. A floating :latest can flip architectures or content under you; reference the manifest-list digest so what you scanned is what you run.
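Scan both architectures. Vulnerability scanning must cover the arm64 image too, because an arm64 base image can carry different package versions than its amd64 sibling. With ECR enhanced scanning, confirm that findings exist for each per-architecture image digest behind the tag, not only for the tag itself.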
Treat the EDR/security agent as a security gate, not a checkbox. The arm64 build must be the version your policy certifies; a downgrade to “any arm64 build” to unblock the migration is a security regression. Get explicit sign-off.
Verify the base image source. Pull base images from a trusted registry (ECR Public, your private ECR) and confirm the arm64 variant is the genuine multi-arch tag, not a typo-squatted or stale mirror.
Keep the rollback path authenticated. The x86 node group and its launch template/AMI access must remain valid through the soak window — a rollback that fails because the x86 path’s permissions lapsed is a worse incident than the regression it was meant to fix.
Don’t let qemu-user-static linger in production images. Installing emulation to “make it run” leaves an x86 binary path in a supposedly-arm64 image — a correctness and supply-chain smell. Produce native binaries.

The security controls that also de-risk the migration — secure and correct pull the same way here:

Control	Mechanism	Secures against	Also prevents
OIDC CI role to ECR	`configure-aws-credentials` + scoped policy	Leaked long-lived keys	Unscoped pushes to wrong repos
Digest pinning	manifest-list digest in deploy	Tag-flip / supply-chain swap	Accidental single-arch / arch flip
Scan the index	ECR enhanced scanning (both arches)	arm64-specific CVEs	Shipping an unscanned arm64 layer
Certified agent version	Security sign-off on arm64 build	Downgraded EDR coverage	Agent crash-loop on Graviton
Trusted base registry	Private ECR / ECR Public	Tampered base image	Stale/typo-squatted arm64 base
No lingering qemu	Native-only images	Hidden x86 execution path	Silent emulation in “arm64” image
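
The scoped CI push policy from the first row can be generated rather than hand-written. This is a sketch under assumptions: the repository ARN and account ID are placeholders, and the two read actions are included only because a manifest-merge step typically needs to read the per-arch manifests it stitches together; drop them if your pipeline pushes the index in a single build.

```python
import json

# Actions required to push layers and manifests to one repository.
PUSH_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
    "ecr:PutImage",
]
# Assumption: needed only when a merge step reads per-arch manifests back.
READ_ACTIONS = ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"]


def ecr_push_policy(repo_arn: str, allow_read: bool = True) -> dict:
    """Build an IAM policy document scoped to a single ECR repository."""
    actions = PUSH_ACTIONS + (READ_ACTIONS if allow_read else [])
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # GetAuthorizationToken has no resource-level scoping.
                "Sid": "EcrLogin",
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                "Sid": "PushOneRepo",
                "Effect": "Allow",
                "Action": actions,
                "Resource": repo_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute your own account, region, and repository.
    arn = "arn:aws:ecr:us-east-1:111122223333:repository/my-service"
    print(json.dumps(ecr_push_policy(arn), indent=2))
```

Listing actions explicitly keeps destructive calls such as `ecr:BatchDeleteImage` out of the CI role, which a `ecr:Batch*` wildcard would silently include.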

Cost & sizing

The bill drivers and how they interact with the migration:

Instance-hours dominate, and Graviton lowers them two ways: a roughly 20% lower hourly price for comparable capacity, and — on throughput-bound workloads — more work per instance, so you run fewer of them. The compounded effect is the headline “up to ~40% better price-performance,” but only the price part is guaranteed; the throughput part is workload-dependent and must be benchmarked.
Emulation can erase the savings. A single-arch image under QEMU at a third of native throughput forces you to provision ~3× the instances — the bill goes up while the dashboard says “Graviton.” This is why uname -m and imagetools inspect are cost controls, not just correctness checks.
Lambda arm64 is priced lower per GB-second and often runs faster — frequently the best ROI in the program for the least risk. Flip eligible functions early.
The x86 soak path is a temporary cost. Keeping the x86 node group alive through the canary and one business cycle is duplicate capacity — real money, but cheap insurance for instant rollback. Drain it on schedule so it doesn’t become permanent.
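Graviton, Spot, and Savings Plans combine, but not on the same instance-hours. On interruption-tolerant tiers, Graviton Spot is the deepest discount. Compute Savings Plans apply across instance families including Graviton, but they cover on-demand usage, not Spot. Put the steady baseline under a Savings Plan and the interruptible remainder on Spot; the architecture discount applies to both.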

A rough monthly picture for a mid-size service: an x86 baseline of ₹4 lakh for ~80 m6i.xlarge-equivalents, migrating to m7g.xlarge at ~20% lower price and ~15% higher throughput, lands around ₹2.7-2.9 lakh once native and right-sized (4 × 0.80 ÷ 1.15 ≈ 2.8), roughly 30% off. The cost levers and what each buys:

Cost lever	What you pay for / save	Rough effect	What it fixes	Watch-out
Graviton hourly price	~20% lower per comparable instance	-20% on the rate	The guaranteed part of the win	Only if running native
Throughput per instance	Fewer instances for same load	-0 to -30% on count	The benchmark-dependent part	Workload must scale out
Lambda arm64	Lower per-GB-second	~20% on eligible fns	Lowest-risk savings	Bundled deps must be aarch64
Graviton Spot	Deep discount on interruptible tiers	up to ~70-90% off on-demand	Batch/async/stateless	Needs interruption handling
Compute Savings Plan	Commitment discount incl. Graviton	stacks with the Graviton rate; does not cover Spot	Predictable baseline	Right-size before committing
x86 soak duplicate	Temporary dual capacity	+ short-term cost	Instant rollback safety	Drain on schedule; don’t leave it
Karpenter consolidation	Bin-pack + drop idle Graviton nodes	further -10 to -30%	Over-provisioned node count	Mind disruption budgets
Emulation tax (anti-lever)	More instances under QEMU	bill rises	(nothing — it’s the bug)	`uname -m` to detect
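
The arithmetic behind the monthly picture and the emulation tax is worth making explicit. The sketch below uses only the figures quoted in this section (₹4 lakh baseline, ~20% lower price, ~15% higher throughput, emulation at a third of native throughput) and assumes instance count scales inversely with per-instance throughput; the function names are illustrative.

```python
import math


def native_monthly_cost(baseline: float, price_discount: float,
                        throughput_gain: float) -> float:
    """Cost after migration when the workload runs native and is right-sized.

    Cheaper rate (1 - price_discount) times fewer instances (1 / (1 + gain)).
    """
    return baseline * (1 - price_discount) / (1 + throughput_gain)


def emulated_monthly_cost(baseline: float, price_discount: float,
                          throughput_fraction: float) -> float:
    """Cost when the image runs under emulation at a fraction of native speed.

    The rate is still cheaper, but the instance count grows by 1 / fraction.
    """
    return baseline * (1 - price_discount) / throughput_fraction


def instances_needed(baseline_count: int, throughput_gain: float) -> int:
    """Instances required for the same load, rounded up."""
    return math.ceil(baseline_count / (1 + throughput_gain))


if __name__ == "__main__":
    baseline_lakh = 4.0
    native = native_monthly_cost(baseline_lakh, 0.20, 0.15)
    emulated = emulated_monthly_cost(baseline_lakh, 0.20, 1 / 3)
    print(f"native:   {native:.2f} lakh ({1 - native / baseline_lakh:.0%} off)")
    print(f"emulated: {emulated:.2f} lakh ({emulated / baseline_lakh:.1f}x baseline)")
    print("instances:", instances_needed(80, 0.15))
```

The two functions differ only in the divisor, which is the point: the same 20% rate discount yields a saving when throughput rises and a bill 2.4 times the baseline when emulation cuts throughput to a third.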

Interview & exam questions

1. What is the single biggest silent risk in a Graviton migration, and how do you detect it? Shipping a single-arch (linux/amd64-only) image to an arm64 node, which runs under QEMU emulation: correct output at roughly 30-60% of native throughput, silently erasing the price-performance gain. Detect it with docker buildx imagetools inspect <tag> (must show both platforms) and uname -m inside the pod (must return aarch64), then confirm with a throughput number that matches the native benchmark.

2. Why does Graviton compete on price-performance rather than raw speed, and what metric should a benchmark report? Graviton cores aren’t faster per core than the latest x86 at single-threaded, latency-bound work; they win on aggregate throughput per dollar (more cores at a lower price, strong memory bandwidth). The benchmark must report price-performance = sustained RPS at your latency SLO ÷ on-demand hourly price, like-for-like sizes — not the raw latency of one request.

3. How do you build a single image that runs on both x86 and arm64? Build a multi-arch manifest list with docker buildx build --platform linux/amd64,linux/arm64 --push, using $TARGETPLATFORM/$BUILDPLATFORM/$TARGETARCH so cross-builds are explicit. The result is one tag pointing at per-arch manifests; docker pull/Kubernetes resolves the matching architecture automatically.

4. A workload is native-heavy Python. Why prefer a native arm64 CI runner over cross-compiling? Cross-compiling C extensions and native wheels for a different arch is painful and error-prone, and emulated (QEMU) building is slow. Building on a native arm64 runner (CodeBuild ARM_CONTAINER, GHA ubuntu-24.04-arm) compiles the native dependencies on real silicon quickly and correctly, then a merge step stitches the manifest from per-arch digests.

5. How does the EKS scheduler know a node’s architecture, and how do you keep a not-yet-ported pod off Graviton? The kubelet sets the well-known label kubernetes.io/arch (amd64/arm64) on every node automatically. Pin the not-yet-ported pod with a required nodeAffinity matching kubernetes.io/arch: amd64, so it never schedules onto a Graviton node and accidentally emulates.

6. Why use preferred rather than required nodeAffinity during a canary? required makes the pod unschedulable if no node of that arch is available (it goes Pending); preferred lets the scheduler fall back to x86 when no arm64 node is schedulable, so a transient capacity issue cannot strand a pod. Affinity acts only at scheduling time, so it does not rescue a pod from a bad image pull; the manifest-list check covers that. Once the image is validated you can tighten to required.

7. What makes rollback trivial in a well-run Graviton migration? Keeping the x86 node group alive plus shipping a multi-arch image means rollback is a scheduling change, not a rebuild: flip nodeAffinity back to amd64 (or shift ELB target-group weights) and pods reschedule onto x86 with no image change. You drain x86 only after a full business cycle clean at 100% Graviton.

8. An ASG of arm64 instances never reaches InService and looks like a capacity stall. Most likely cause? The launch template references an x86 AMI on an arm64 (*g) instance type, so every launch fails (not “no capacity”). Confirm via ASG Activity history (launch failed) and describe-images Architecture = x86_64. Fix by resolving an arm64 AMI from SSM Parameter Store.

9. Which migration surface most often vetoes a tier in week three, and how do you prevent it? A mandated agent (commonly EDR) with no certified arm64 build at the policy-required version. Prevent it by treating agents and sidecars as first-class, gated dependencies in the week-one portability audit, validating the exact mandated version on a single canary node group with security sign-off before fleet-wide.

10. Why is Lambda usually the first thing you migrate to arm64? It’s the lowest-risk, highest-ROI flip: set architectures = ["arm64"] and Lambda charges less per GB-second while many functions also run faster. The only requirement is that any bundled native dependency is an aarch64 build — packaged-correctly functions migrate with a one-line change.

11. The fleet is “on Graviton” but the bill went up. What happened? Widespread emulation — single-arch images carrying load under QEMU at a fraction of throughput, so the team provisioned more instances to compensate. The price-per-instance dropped but the count rose more. Fix: make every image native multi-arch, redeploy, and re-right-size the instance count to the real native throughput.

12. How do you decide whether a workload is a Graviton candidate before benchmarking? Screen on two axes: portability (does every compiled dependency have an aarch64 build?) and scaling profile (is it throughput-bound and horizontally scalable across more than one instance, rather than single-thread-latency-bound or tuned with x86 intrinsics?). Candidates that pass both go straight to a canary; single-thread-bound or intrinsic-heavy code gets a benchmark-first posture.
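
Questions 5 to 7 turn on the difference between required and preferred nodeAffinity. The sketch below builds both `affinity` stanzas as plain dictionaries and prints them as JSON, which Kubernetes accepts alongside YAML. The helper names are hypothetical; the field names follow the Kubernetes pod spec.

```python
import json

ARCH_LABEL = "kubernetes.io/arch"  # well-known node label set by the kubelet


def _arch_term(arch: str) -> dict:
    """A node selector term matching nodes of one architecture."""
    return {
        "matchExpressions": [
            {"key": ARCH_LABEL, "operator": "In", "values": [arch]}
        ]
    }


def pin_required(arch: str) -> dict:
    """Hard pin: the pod schedules only onto nodes of this architecture."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [_arch_term(arch)]
            }
        }
    }


def prefer(arch: str, weight: int = 100) -> dict:
    """Soft preference: favour this architecture, fall back to any other."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": weight, "preference": _arch_term(arch)}
            ]
        }
    }


if __name__ == "__main__":
    # Not-yet-ported workload: never allowed onto Graviton.
    print(json.dumps({"affinity": pin_required("amd64")}, indent=2))
    # Canary: lean toward Graviton, keep x86 as the fallback.
    print(json.dumps({"affinity": prefer("arm64")}, indent=2))
```

Rollback in this model is swapping `prefer("arm64")` for `pin_required("amd64")` in the pod template; the image reference does not change because the tag is a multi-arch manifest list.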

These map to AWS Certified Solutions Architect – Associate (SAA-C03), which covers cost-optimized, resilient compute selection, and AWS Certified DevOps Engineer – Professional (DOP-C02), which covers CI/CD for multi-arch artifacts, deployment strategies, and safe rollout. The FinOps/price-performance angle touches the Cloud Practitioner (CLF-C02) billing-and-pricing domain and the Well-Architected Cost Optimization pillar. A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Price-performance, instance selection	SAA-C03	Design cost-optimized, resilient compute
Multi-arch build / CI strategy	DOP-C02	CI/CD; artifact management
Canary, rollback, deployment safety	DOP-C02	Deployment strategies; resilience
nodeAffinity, Karpenter, EKS scheduling	SAA-C03 / DOP-C02	Container orchestration
Lambda arm64, cost levers	CLF-C02	Cloud economics; pricing
Spot + Graviton + Savings Plans	SAA-C03	Cost-optimized purchasing options

Quick check

You deploy to Graviton and throughput is a third of what the benchmark promised. What is the most likely cause and the two commands that confirm it?
What metric must a Graviton benchmark report, and why is raw request latency the wrong one?
True or false: building under QEMU emulation on an x86 builder is the recommended way to produce arm64 images for a native-heavy Python service.
Your async-worker pod must stay on x86 for now. What exact Kubernetes mechanism keeps it off Graviton nodes, and why use required rather than preferred here?
An ASG of m7g instances never reaches InService and looks like a capacity shortage. What’s the real cause, and how do you confirm it?

Answers

A single-arch image running under QEMU emulation on the arm64 node. Confirm with docker buildx imagetools inspect <tag> (it will show only linux/amd64, not both platforms) and uname -m: the node reports aarch64 while the same command inside the emulated pod reports x86_64, because QEMU user-mode presents the emulated architecture to the process. That aarch64-node/x86_64-container pair is the signature of emulation. Fix by publishing a multi-arch manifest list and redeploying.
Price-performance = sustained RPS at your latency SLO ÷ on-demand hourly price, like-for-like sizes. Raw latency is wrong because Graviton competes on throughput per dollar, not single-thread speed; a workload can have similar or slightly higher per-request latency yet win decisively on the ratio because the instance is ~20% cheaper and scales out better.
False. Emulated (QEMU) builds are correct but slow, and slow CI kills adoption. For native-heavy Python, build the arm64 variant on a native arm64 runner (CodeBuild ARM_CONTAINER / GHA ubuntu-24.04-arm) so native wheels compile on real silicon, then merge the manifest from per-arch digests.
A required nodeAffinity matching kubernetes.io/arch: amd64. Use required (not preferred) because a not-yet-ported workload must never land on Graviton and silently emulate or crash — preferred would allow it onto an arm64 node if x86 capacity were tight, which is exactly the outcome you’re preventing.
The launch template references an x86 AMI on an arm64 instance type, so every launch fails — it only looks like a capacity stall. Confirm via the ASG Activity history (the entry says the launch failed, not “insufficient capacity”) and aws ec2 describe-images --image-ids <ami> --query 'Images[].Architecture' returning x86_64. Fix by resolving an arm64 AMI from SSM Parameter Store.
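
Answers 1 and 2 combine naturally: refuse to compute a price-performance number unless the run was native, then compare ratios rather than raw speed. This is a sketch with made-up inputs. The RPS and hourly prices below are placeholders chosen to mirror the ~20% price and ~15% throughput figures used in this guide, not measured or published values, and `BenchResult` is a hypothetical record type.

```python
import platform
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    """One benchmark run (field names are illustrative)."""
    instance_type: str
    machine: str          # what platform.machine() reported during the run
    rps_at_slo: float     # sustained requests/second while meeting the SLO
    hourly_price: float   # on-demand price in your currency


def price_performance(result: BenchResult, expect_machine: str) -> float:
    """Sustained RPS at SLO per unit of hourly price, for native runs only."""
    if result.machine != expect_machine:
        raise ValueError(
            f"{result.instance_type}: ran as {result.machine}, "
            f"expected {expect_machine}; number is not trustworthy"
        )
    return result.rps_at_slo / result.hourly_price


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.25 means +25%)."""
    return candidate / baseline - 1


if __name__ == "__main__":
    print("this host reports:", platform.machine())
    # Placeholder numbers, not measurements.
    x86 = BenchResult("m6i.xlarge", "x86_64", rps_at_slo=1000, hourly_price=0.20)
    arm = BenchResult("m7g.xlarge", "aarch64", rps_at_slo=1150, hourly_price=0.16)
    base = price_performance(x86, "x86_64")
    cand = price_performance(arm, "aarch64")
    print(f"price-performance gain: {relative_gain(cand, base):+.1%}")
```

Recording `platform.machine()` alongside each result means an emulated run, which reports x86_64 on a Graviton node, raises instead of producing a plausible-looking but meaningless ratio.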

Glossary

arm64 / aarch64 — the 64-bit Arm instruction set; the ISA boundary the migration crosses. Compiled code exists per-architecture.
Graviton — AWS-designed Arm Neoverse server processor (Graviton2 powers *6g, Graviton3 *7g, Graviton4 *8g); the target of the migration.
Manifest list / image index — one container tag pointing at per-architecture image manifests; lets docker pull/Kubernetes resolve the matching arch automatically.
QEMU emulation — running x86 binaries on arm64 (or vice-versa) via user-mode emulation; correct, but at a fraction of native throughput (roughly 30-60% in this guide's framing), the silent killer of the price-performance gain.
$TARGETPLATFORM / $BUILDPLATFORM / $TARGETARCH — buildx build args naming the target vs the builder’s architecture, used to make cross-builds explicit.
Cross-compilation — building for a different target arch from the builder’s native arch (clean for Go/Rust); contrasted with emulated building and native-runner building.
kubernetes.io/arch — the well-known node label (amd64/arm64) set automatically by the kubelet; the key for architecture-based scheduling.
nodeAffinity — a pod scheduling rule; required enforces an arch, preferred weights toward it with x86 fallback. The pin-and-rollback mechanism.
Karpenter NodePool — just-in-time node provisioning intent for EKS; expresses arch (arm64), capacity type (spot/on-demand), and instance types.
CodeBuild ARM_CONTAINER — native Arm build compute in CodeBuild for building arm64 artifacts on real silicon.
arm64 AMI — an architecture-matched machine image (AL2023, Bottlerocket, Ubuntu publish aarch64); an x86 AMI on a *g instance type fails the launch.
Native addon / wheel — a compiled dependency artifact (Node addon, Python wheel) that must have an aarch64 build or it breaks on arm64.
Price-performance — sustained throughput at your latency SLO divided by on-demand price; the only honest migration decision metric.
exec format error — the error when a wrong-architecture binary is executed directly with no emulation present.
Graviton + Spot — running Graviton on Spot capacity for the deepest discount on interruption-tolerant tiers; Compute Savings Plans do not cover Spot usage, so they apply to the on-demand Graviton baseline alongside it.

Next steps

You can now run a portability-gated, benchmark-proven Graviton migration with instant rollback. Build outward:

Next: EC2 Spot, Mixed Instances & Capacity-Optimized ASGs — stack Graviton with Spot for the deepest discount on interruption-tolerant tiers.
Related: Deploy Karpenter on EKS: Consolidation, Spot & Disruption Budgets — provision Graviton just-in-time and consolidate for further savings.
Related: Docker Container Images for CI/CD: Dockerfiles & Registries — the image-build foundation under the multi-arch manifest list.
Related: GitHub Actions Fundamentals: Workflows, Jobs, Runners & Secrets — wire the arm64 native-runner CI that keeps builds fast.
Related: Amazon EC2 Deep Dive: Instances, AMIs, EBS, User Data & IMDS — the instance-type and AMI mechanics behind the rollout.
Related: FinOps Showback & Chargeback Platform on AWS — attribute and prove the price-performance savings the migration delivers.