AWS Regions and Availability Zones: Resiliency from the Ground Up

Quick take: An AWS Region is a geographic area (like ap-south-1); an Availability Zone is one or more physically separate datacenters inside that Region with independent power, cooling and network. Spreading a workload across AZs survives a single datacenter failure for almost no extra money. Spreading across Regions survives a whole-area disaster — at real cost and complexity. Do the first before you pay for the second.

A media startup ran its entire platform in one AWS Region, inside a single Availability Zone, to keep the bill down. A power event took that AZ offline and the site was dark for four hours. The post-mortem was brutal in its simplicity: the application was stateless, the database was a managed service, and the whole thing could have spread across three AZs for almost the same money — the compute capacity would have been identical, just placed in three buildings instead of one. Nobody had told them that an AZ is an independent failure domain, that a subnet lives in exactly one AZ, or that an Auto Scaling group listing one subnet is a single-AZ deployment wearing a multi-AZ costume.

This article is the reference that prevents that outage. We treat Regions and AZs not as trivia for the cloud-practitioner exam but as the resiliency hierarchy every production design rests on: pick the right failure domains, place each tier across them correctly, and know precisely which failures each layer does and does not survive. You will learn what a Region, an AZ, an AZ ID, a Local Zone, a Wavelength Zone and an edge location actually are; how Multi-AZ works for EC2, ALB, RDS, Aurora, EFS, S3 and DynamoDB; how an AZ failure is detected and drained; what cross-AZ traffic costs you; and the handful of misconfigurations that quietly turn “highly available” into “single point of failure.” Every concept comes with the exact aws CLI to confirm it and the CloudFormation/Terraform to set it, and — because this is operational — a symptom → cause → confirm → fix playbook you can open mid-incident.

By the end you will stop guessing about placement. You will know why three AZs beat two for quorum systems, why ap-south-1a in your account may be a different physical building than ap-south-1a in mine, why a 60 to 120 second RDS failover is normal, what a NAT gateway per AZ saves you when an AZ dies, and when multi-Region is genuinely warranted versus cargo-culted. Resiliency is a hierarchy: master Multi-AZ first, add multi-Region only when the business case is real.

What problem this solves

Every datacenter fails eventually — power, cooling, a network device, a fibre cut, a bad deploy of the facility’s own software. If your whole application lives in one building, that building’s worst day is your worst day. Regions and AZs exist to give you a menu of failure domains so a single failure stops being a single point of failure.

What breaks without this knowledge is depressingly common. A team launches in the default AZ the console happened to pick, scales “out” by adding instances that all land in that same AZ, and ships to production believing the load balancer makes them resilient. Then one AZ degrades: the ALB has no healthy targets, the single-AZ RDS has no standby to promote, the NAT gateway that all egress depended on is gone, and the “highly available” system is down hard. The fix was free and structural — list three subnets instead of one — but nobody designed for the failure domain because nobody understood it was there.

The pain shows up in three distinct shapes, and the whole article maps to them:

A single AZ fails (the common case, a few times a year somewhere in the fleet). You survive it by spreading every tier across two or three AZs in the same Region. This costs essentially nothing extra and is non-negotiable for production.
A whole Region fails (rare, but real, and total when it happens). You survive it only by replicating to another Region — at meaningful cost and operational complexity, justified by RTO/RPO requirements, not fashion.
Users are far from your Region (latency, not availability). You improve it with edge locations (CloudFront, Global Accelerator), Local Zones for metro-proximity compute, and ultimately a second Region near them.

Who hits this: essentially everyone running anything on AWS. It bites hardest on teams who came from a single on-prem datacenter (where “the server room” was one failure domain and that was just life), cost-sensitive startups who read “single AZ is cheaper” and missed that multi-AZ compute is usually the same price, and anyone who built their VPC by clicking “next” without noticing the subnet-to-AZ mapping. The remedy is rarely “spend more” — it is “place what you already pay for across the failure domains that already exist.”

To frame the whole field before the deep dive, here is the resiliency hierarchy as a single table — each tier, the failure it survives, the rough cost delta, and the one thing people get wrong:

Failure domain	What it survives	What it does NOT survive	Typical cost delta	The classic mistake
Single AZ (1 datacenter)	Nothing — it is the blast radius	Any AZ event	Baseline	Running prod here “to save money”
Multi-AZ (2 AZs)	One AZ failure	Two-AZ or Region event; loses quorum	~0% on compute; data-transfer + standby	Two AZs for a 3-node quorum system
Multi-AZ (3 AZs)	One AZ failure with quorum intact	Region event	~0% on compute; more cross-AZ transfer	NAT gateway in one AZ only
Multi-Region (active/passive)	A whole-Region disaster	Global control-plane edge cases	High (duplicate stacks + replication)	Building this before Multi-AZ is solid
Multi-Region (active/active)	Region disaster + serves global users	Data-consistency complexity you now own	Highest	Underestimating split-brain/conflict handling
Edge (CloudFront / GA / Local Zones)	Latency for far users; absorbs L3/4 DDoS	It is not a DR strategy	Per-GB / per-hour	Treating a CDN as availability

Learning objectives

By the end of this article you can:

Define a Region, an Availability Zone, an AZ ID, a Local Zone, a Wavelength Zone and an edge location, and explain how each maps to a failure domain and a latency profile.
Explain why a subnet maps to exactly one AZ, why AZ names are randomized per account (and when to use AZ IDs instead), and how that drives VPC design.
Place each tier — ALB/NLB, EC2 Auto Scaling, RDS/Aurora, EFS, ElastiCache, NAT gateways, S3, DynamoDB — across AZs correctly, and say exactly what Multi-AZ buys for each.
Decide two AZs vs three on real grounds (quorum, cost, capacity headroom) rather than habit.
Diagnose the common “fake HA” failures — single-subnet ALB, ASG pinned to one AZ, single-AZ RDS, single-AZ NAT gateway — with the exact describe-* command that confirms each.
Quantify and reduce cross-AZ data-transfer cost, and know which traffic is free, which is billed, and how to keep chatty paths same-AZ where it matters.
Decide when multi-Region is genuinely warranted, choose active/passive vs active/active, and pick a Region against the four real criteria (latency, service availability, data residency, cost).
Map all of this to the relevant certifications (Cloud Practitioner, Solutions Architect Associate, SysOps) and answer the questions examiners actually ask.

Prerequisites & where this fits

You should be comfortable with the AWS console and aws CLI, understand that a VPC is your private network in a Region and that it is carved into subnets, and know roughly what EC2, an Auto Scaling group (ASG), an Application Load Balancer (ALB) and RDS are. Familiarity with basic networking (CIDR, route tables) and HTTP health checks helps. You do not need prior HA experience — building it correctly is what this article teaches.

This is a foundations piece in the AWS fundamentals track and sits upstream of almost everything else. The networking detail it assumes is covered in Amazon VPC, Subnets and Security Groups Explained — that is where subnets, route tables and the per-AZ structure live. The data-tier resiliency it references in depth is in Amazon RDS vs DynamoDB vs Aurora Compared. The cross-Region story it points at is the subject of AWS Backup and Disaster Recovery Strategies. The compute choices that determine what you are spreading across AZs are in AWS Compute: EC2 vs Lambda vs ECS vs EKS. Where this article ends — “you survived the AZ, now what about the Region?” — those three pick up.

A quick map of who owns what during a placement decision or an incident, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure class it can cause
Global edge (Route 53, CloudFront)	DNS routing, CDN, anycast	Frontend / SRE	Misrouting, stale TTL; not an AZ outage
Region selection	Latency, residency, service set	Architect / compliance	Wrong Region → latency or legal exposure
VPC / subnets	CIDR, AZ-to-subnet mapping	Network team	Single-AZ subnet design → fake HA
Load balancing	ALB/NLB subnet attachment, cross-zone	Platform / network	Single-subnet LB → blackhole on AZ loss
Compute	ASG AZ list, instance distribution	App / platform	ASG pinned to one AZ → outage
Data tier	RDS Multi-AZ, Aurora, replication	DBA / platform	Single-AZ DB → no failover
Egress	NAT gateway per AZ, IGW	Network team	Single-AZ NAT → egress dead on AZ loss

Core concepts

Five mental models make every later decision obvious.

Before the five models, fix one distinction that trips up every beginner: every AWS resource has a scope — global, regional, or zonal (AZ-bound) — and that scope decides what fails with what. Knowing a resource’s scope tells you instantly whether an AZ event can touch it:

Scope	What it means	Examples	What an AZ failure does to it
Global	Exists outside any single Region	IAM, Route 53, CloudFront, Organizations, WAF (for CloudFront)	Unaffected by an AZ (or single-Region) event
Regional	Lives in one Region, spans its AZs	S3, DynamoDB, SQS, SNS, Lambda, ECR, ELB (the service)	Survives one AZ — the service spreads across them
Zonal (AZ-bound)	Pinned to a single AZ	EC2 instance, EBS volume, subnet, NAT gateway, RDS instance	Dies with its AZ unless you placed peers elsewhere

The reading: you achieve HA by taking zonal resources and deploying copies across multiple AZs; regional services do that for you; global services are above the whole concern. Now the five models.

A Region is a geography; an AZ is a building (or buildings) you can fail independently. A Region is a named geographic area, such as ap-south-1 (Mumbai), us-east-1 (N. Virginia) or eu-west-1 (Ireland), and each one is a fully independent island with its own copy of regional services. Inside a Region are Availability Zones: distinct locations, each one or more physically separate datacenters with independent power, cooling, physical security and network, far enough apart that a fire, flood or power event in one will not take out another, yet close enough (typically within ~100 km, single-digit-millisecond latency) that synchronous replication between them is practical. Most Regions have three AZs; some have four to six. AZs are interconnected by high-bandwidth, low-latency private fibre, and that link is what makes Multi-AZ synchronous databases feasible.

A subnet lives in exactly one AZ — this is the rule that governs all VPC HA. When you create a subnet you choose its AZ, and it never spans two. Therefore “spread across AZs” concretely means “create a subnet in each AZ and place resources in each subnet.” An ALB attached to one subnet is single-AZ. An ASG listing one subnet is single-AZ. A NAT gateway lives in one subnet, hence one AZ. Every resilience decision in a VPC reduces to which subnets (and therefore which AZs) did I list?

AZ names are randomized per account; AZ IDs are stable. The friendly name ap-south-1a can map to a different physical AZ in different accounts: AWS shuffles the name-to-hardware mapping so load spreads evenly and two accounts don’t both pile onto “the first one.” The stable identifier is the AZ ID (e.g. aps1-az1), which refers to the same physical location in every account. This matters when you share resources across accounts (a shared VPC, a cross-account peering connection, a PrivateLink endpoint) and need them in the same physical AZ to avoid cross-AZ charges. In that case you coordinate on the AZ ID, not the name.

Failure domains nest, and you choose how deep to go. A single instance can fail (host issue). A single AZ can fail (datacenter event). A single Region can fail (rare, area-wide). A global edge service can have a control-plane issue (rarer still). Each level up costs more to defend and removes a larger class of failure. The discipline is to defend the level your risk and budget justify — always Multi-AZ for production, multi-Region only when RTO/RPO demands it — and to know what you have not defended, rather than discover it during an incident.

Multi-AZ is mostly free on compute and cheap on data; multi-Region is expensive. Running three EC2 instances across three AZs costs the same as three in one AZ — you pay per instance, not per AZ. The Multi-AZ costs are subtler: cross-AZ data transfer (billed per GB each way), a standby for Multi-AZ RDS (you pay for the standby instance), and a NAT gateway per AZ if you want egress to survive an AZ loss. Multi-Region, by contrast, duplicates whole stacks and adds cross-Region replication bandwidth — a different order of cost. This asymmetry is the entire reason “Multi-AZ first” is the rule.
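
To put a number on the "cheap on data" part, here is a rough Python sketch of the cross-AZ transfer line item. The rate of $0.01 per GB in each direction is an assumption (a commonly published figure for many Regions; check current pricing for yours), and the function name is purely illustrative.

```python
def cross_az_transfer_cost(gb_per_month: float, rate_per_gb_each_way: float = 0.01) -> float:
    """Estimate the monthly cross-AZ data-transfer charge in USD.

    Traffic that crosses an AZ boundary is billed once on the way out of the
    source AZ and once on the way into the destination AZ, so every GB is
    effectively charged twice. The default rate is an assumption, not a quote.
    """
    if gb_per_month < 0:
        raise ValueError("gb_per_month must be non-negative")
    return round(gb_per_month * rate_per_gb_each_way * 2, 2)
```

Under that assumed rate, `cross_az_transfer_cost(5000)` (5 TB of chatty service-to-service traffic per month) returns `100.0`: a real line item, but small next to duplicating a whole stack in a second Region.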

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Failure domain	Why it matters
Region	A geographic area of independent infrastructure	Whole area	Latency, residency, service availability, cost
Availability Zone (AZ)	1+ physically separate datacenters in a Region	One datacenter cluster	The unit you spread across for HA
AZ ID	Stable cross-account ID for a physical AZ (`aps1-az1`)	N/A	Align shared resources to the same physical AZ
Subnet	A CIDR range bound to exactly one AZ	One AZ	“Spread across AZs” = a subnet per AZ
Local Zone	AWS compute placed in a metro, tied to a parent Region	Metro extension	Single-digit-ms latency to a specific city
Wavelength Zone	Compute embedded in a telco 5G network	Carrier edge	Ultra-low latency for mobile/5G apps
Edge location / PoP	CloudFront / Global Accelerator point of presence	Global anycast	CDN caching, DDoS absorption, fast TLS
Multi-AZ	Resources running in 2+ AZs in one Region	Survives 1 AZ	The baseline for production HA
Multi-Region	Resources running in 2+ Regions	Survives 1 Region	DR and global low-latency serving
Quorum	Majority needed for a consistent decision	Spans AZs	Why 3 AZs beats 2 for consensus systems
Cross-AZ data transfer	Bytes moving between AZs	—	Billed per GB each way; a real cost line

The beliefs that cause outages — a decision table

Most single-AZ disasters trace to a handful of wrong beliefs. Map what you think you have to what you actually have, and what to do:

If you believe…	It’s actually…	Do this
“I have an ALB, so I’m highly available”	True only if it’s attached to subnets in ≥2 AZs	`describe-load-balancers` — confirm multiple AZs
“Auto Scaling makes me multi-AZ”	True only if `VPCZoneIdentifier` lists multiple subnets	List one subnet per AZ; set `min ≥ 2–3`
“Managed RDS is automatically resilient”	False unless `MultiAZ=true`	Enable Multi-AZ; subnet group spans ≥2 AZs
“Single AZ is cheaper”	Compute is the same price; you only save a standby/NAT	Spread compute (free); pay only for real HA costs
“Two AZs is enough for everything”	False for quorum systems (no majority after one loss)	Use three AZs for stateful/consensus tiers
“My CDN makes me available”	False — CloudFront is latency/DDoS, not DR	Fix the origin’s AZ spread
“`ap-south-1a` is the same AZ everywhere”	False — names are per-account randomized	Coordinate on the AZ ID
“Multi-Region is the responsible default”	Premature if Multi-AZ isn’t solid first	Nail Multi-AZ, then justify multi-Region by RTO/RPO

Regions: what they are and how to choose one

A Region is the largest unit of isolation AWS gives you and the first decision you make. Regions are fully independent — ap-south-1 and eu-west-1 share no failure domain, and most regional services (EC2, RDS, SQS, etc.) are scoped to one Region; a resource in one Region is invisible to another unless you explicitly replicate. A handful of services are global (IAM, Route 53, CloudFront, WAF for CloudFront, Organizations) and exist outside any single Region.

List the Regions enabled for your account and inspect one:

# Regions enabled for your account (add --all-regions to also list opt-in Regions you have not enabled)
aws ec2 describe-regions \
  --query "Regions[].{Region:RegionName, Endpoint:Endpoint, OptIn:OptInStatus}" \
  --output table

# How many AZs does a Region have, and what are their stable IDs?
aws ec2 describe-availability-zones --region ap-south-1 \
  --query "AvailabilityZones[].{Name:ZoneName, Id:ZoneId, State:State}" \
  --output table

Choosing a Region — the four real criteria

You choose a Region on four axes, roughly in this priority order. Never default to whatever the console shows (often us-east-1):

Criterion	Why it matters	How to evaluate	Common trap
Latency to users	Round-trip time dominates UX	Measure from user geographies; closer Region wins	Picking `us-east-1` for an all-India user base
Service availability	Not every service/feature is in every Region	Check the AWS regional services list before committing	Designing for a service the Region lacks
Data residency / compliance	Law may require data stay in-country	Map regulatory requirement to Region geography	Storing regulated PII in the wrong jurisdiction
Cost	Per-unit prices vary by Region	Compare the same SKU across candidate Regions	Assuming all Regions cost the same

A worked comparison makes the four criteria concrete. Suppose Streamly (an India-first product) is choosing among three Regions for its primary stack — the grid that drives the call:

Candidate Region	Latency to Indian users	AZ count	Data residency fit	Relative cost	Verdict
`ap-south-1` (Mumbai)	Lowest (in-country)	3	In-India (meets local rules)	Baseline	Chosen — latency + residency
`ap-southeast-1` (Singapore)	Higher (sea hop)	3	Out-of-country	Slightly higher	DR Region candidate
`us-east-1` (N. Virginia)	Highest (intercontinental)	6	Out-of-country	Often cheapest	Rejected for primary (latency/residency)

The lesson the grid teaches: us-east-1 being cheapest and having the most AZs does not make it right — latency to the actual user base and data-residency law decided it. Pick the Region for your users, not the console default.
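
The grid's logic can be sketched in a few lines of Python. This is one possible encoding, not a standard method: it assumes data residency is a hard requirement (a filter), then prefers latency, then cost. The ranks are simply the qualitative ordering from the table above (1 = best), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    region: str
    latency_rank: int   # 1 = lowest latency to the user base
    in_country: bool    # satisfies the data-residency requirement
    cost_rank: int      # 1 = cheapest


def choose_primary(candidates, residency_required=True):
    """Drop Regions that fail residency, then prefer latency, then cost."""
    eligible = [c for c in candidates if c.in_country or not residency_required]
    if not eligible:
        raise ValueError("no candidate Region satisfies the residency requirement")
    return min(eligible, key=lambda c: (c.latency_rank, c.cost_rank)).region


# Qualitative ordering taken from the comparison grid
GRID = [
    Candidate("ap-south-1", latency_rank=1, in_country=True, cost_rank=2),
    Candidate("ap-southeast-1", latency_rank=2, in_country=False, cost_rank=3),
    Candidate("us-east-1", latency_rank=3, in_country=False, cost_rank=1),
]
```

`choose_primary(GRID)` returns `"ap-south-1"`, and it still does with `residency_required=False`, because latency outranks cost in the sort key. A cost-first ordering would pick `us-east-1`, which is exactly the trap the grid warns about.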

Region types and their quirks

Not all Regions behave identically. Some are opt-in (disabled by default; you must enable them and they have separate STS endpoints), some are isolated for government/sovereign use, and us-east-1 has a special role as the home of some global control planes:

Region type	Examples	Enabled by default?	Notable quirk
Standard (commercial)	`ap-south-1`, `eu-west-1`, `us-west-2`	Yes (most)	The normal case
Opt-in	newer Regions (e.g. some EU/ME/AF Regions)	No (must enable)	Session tokens from the global STS endpoint are not accepted there by default; use the regional STS endpoint
`us-east-1` (special)	N. Virginia	Yes	Home of IAM/Route53/CloudFront control planes; some global ops only here
GovCloud	`us-gov-east-1`, `us-gov-west-1`	No (separate accounts)	Physically/logically isolated; US-person access controls
China	`cn-north-1`, `cn-northwest-1`	No (separate partition)	Operated by local partners; separate accounts/credentials
Wavelength (region-attached)	carrier 5G zones	No — must opt in	Compute inside a telco network; ultra-low latency
Local Zones (region-attached)	`us-west-2-lax-1a` etc.	No — must enable	Metro compute tied to a parent Region

The practical reading notes that save time:

Distinction	The trap	How to tell them apart
Region vs AZ	Treating “Mumbai” as a single failure domain	A Region contains multiple independent AZs; design across the AZs
Regional vs global service	Expecting EC2 in `us-east-1` to appear in `eu-west-1`	Regional services are Region-scoped; only IAM/Route53/CloudFront/Org are global
`us-east-1` outage scope	Assuming a `us-east-1` blip is “just one Region”	Some global control planes are anchored there; design global services with that in mind
Opt-in Region surprises	API calls fail with auth errors in a new Region	Opt-in Regions need enabling and a regional STS endpoint

Availability Zones: the failure domain that does the work

An Availability Zone is the unit you actually engineer around. Each AZ is isolated — its own power, cooling, networking and physical security — so a failure in one is contained. AZs in a Region are connected by redundant, low-latency private fibre (typically <1–2 ms between them), which is what lets a database in AZ-a synchronously commit to a standby in AZ-b without crippling write latency. AWS designs Regions so that AZs are meaningfully far apart (different flood plains, power grids, often kilometres of separation) while staying close enough for synchronous replication.

Why a subnet is one AZ — and what that forces

Because a subnet is bound to exactly one AZ, your VPC’s AZ coverage is literally the set of subnets you create. The canonical production VPC has, per AZ, a public subnet (for the ALB and NAT gateway) and one or more private subnets (for compute and data). To go Multi-AZ you replicate that subnet set into each AZ you want to use:

# Subnets in a VPC and the AZ each one lives in — your AZ coverage at a glance
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0abc123" \
  --query "Subnets[].{Subnet:SubnetId, AZ:AvailabilityZone, AZId:AvailabilityZoneId, CIDR:CidrBlock, Public:MapPublicIpOnLaunch}" \
  --output table

# Terraform: a subnet per AZ, driven off the Region's AZ list — the idiomatic Multi-AZ VPC
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = { Name = "private-${data.aws_availability_zones.available.names[count.index]}" }
}
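
If the `cidrsubnet()` call above looks opaque, the following Python sketch reproduces the same arithmetic with the standard `ipaddress` module. The VPC range `10.0.0.0/16` is an assumption chosen for illustration; the Terraform snippet does not state one.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Mimic Terraform's cidrsubnet(): lengthen the prefix by `newbits`
    and return the `netnum`-th subnet of the resulting size."""
    network = ipaddress.ip_network(prefix)
    subnets = list(network.subnets(prefixlen_diff=newbits))
    return str(subnets[netnum])


VPC_CIDR = "10.0.0.0/16"  # assumed VPC range, for illustration only

# One private subnet per AZ, mirroring count = 3 in the Terraform above
PRIVATE_CIDRS = [cidrsubnet(VPC_CIDR, 4, index) for index in range(3)]
# -> ['10.0.0.0/20', '10.0.16.0/20', '10.0.32.0/20']
```

Adding 4 bits to a /16 yields sixteen /20 blocks, so three AZs use only the first three and leave room for public and data subnets in the remaining ranges.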

Two AZs or three? Decide on quorum and capacity, not habit

The two-vs-three question has real answers. Two AZs survives one AZ failure for stateless tiers and is the minimum for production. But quorum systems — anything using a majority vote for consistency (etcd/control planes, ZooKeeper, Aurora’s storage, many distributed databases) — need three AZs so that losing one still leaves a majority (2 of 3) and the cluster keeps writing. There is also a capacity argument: when one of two AZs fails, the survivor must absorb 100% of load (each AZ must be sized for 2×); with three AZs, a single failure shifts load to two survivors (each sized for 1.5×), which is cheaper headroom.

Factor	2 AZs	3 AZs
Survives one AZ failure	Yes (stateless)	Yes
Quorum after one AZ loss	No (1 of 2 = no majority)	Yes (2 of 3 = majority)
Spare capacity each AZ must hold	100% (size for 2×)	50% (size for 1.5×)
Cross-AZ data transfer	Lower	Slightly higher
Cost of idle headroom	Higher per AZ	Lower per AZ
Recommended for	Simple stateless web/app tiers	Quorum systems, critical prod, most real workloads

The rule: default to three AZs for anything stateful or critical; two is acceptable only for simple stateless tiers where you’ve accepted the 2× sizing cost.
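
The quorum and headroom arithmetic behind that rule is small enough to check directly. The sketch below assumes one voting node per AZ and load spread evenly across AZs; the function names are illustrative.

```python
from fractions import Fraction


def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
    """True if the survivors still form a strict majority of the original cluster."""
    survivors = total_nodes - failed_nodes
    return survivors > total_nodes / 2


def sizing_factor(az_count: int) -> Fraction:
    """Multiple of its normal share each AZ must be able to carry so the
    remaining AZs can absorb the full load after one AZ is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ loss")
    return Fraction(az_count, az_count - 1)
```

`has_quorum(2, 1)` is `False` (1 of 2 is not a majority) while `has_quorum(3, 1)` is `True`; `sizing_factor(2)` is 2 and `sizing_factor(3)` is 3/2, matching the 2× and 1.5× figures in the table.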

What actually happens, second by second, when an AZ fails

For a correctly built Multi-AZ stack, an AZ loss is a sequence of automatic events, not a manual scramble. Knowing the timeline tells you what to expect (and what not to touch):

~Time after failure	What happens	Which mechanism	Your action
0 s	An AZ’s power/cooling/network faults; its hosts stop responding	(the event)	None — don’t restart blind
0–30 s	ALB health checks start failing for that AZ’s targets	ELB health checks	Watch per-AZ `HealthyHostCount`
~30 s	ALB stops routing to the dead AZ; serves from survivors	ALB target draining	Confirm survivors absorb load
60–120 s	RDS Multi-AZ promotes the standby in another AZ; DNS endpoint flips	RDS automatic failover	Confirm app reconnected (it retries)
1–3 min	ASG marks the AZ’s instances unhealthy, launches replacements in healthy AZs	Auto Scaling + ELB health	Verify capacity recovers
Minutes–hours	AWS recovers the AZ	(AWS)	Watch the AWS Health Dashboard (formerly Personal Health Dashboard)
On recovery	ASG AZ Rebalance re-spreads instances back across all AZs	AZ Rebalance	Watch for brief extra instances

The single most important row is the first: on a correctly designed stack there is nothing to fix during the event — the platform drains, fails over and replaces automatically. Blind restarts and panic scaling (as Streamly’s first incident showed) only help when the design was wrong to begin with.

AZ IDs and cross-account alignment

When two accounts share infrastructure, the friendly AZ names lie: ap-south-1a can be a different building in each. Align on the AZ ID (e.g. aps1-az1). The common case is a shared/peered VPC or a PrivateLink endpoint you want to keep same-AZ with the producer to avoid cross-AZ data-transfer charges:

# Map names to stable IDs in THIS account — coordinate cross-account on the Id, not the Name
aws ec2 describe-availability-zones --region ap-south-1 \
  --query "AvailabilityZones[].{Name:ZoneName, Id:ZoneId}" --output table
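
Once each account has run that command, lining them up is a join on the ID. A minimal Python sketch follows; the two name-to-ID mappings are hypothetical sample data, not real account output.

```python
def same_physical_az(account_a: dict, account_b: dict) -> dict:
    """Map each AZ name in account A to the AZ name in account B
    that shares its AZ ID, i.e. the same physical location."""
    id_to_name_b = {zone_id: name for name, zone_id in account_b.items()}
    return {
        name_a: id_to_name_b[zone_id]
        for name_a, zone_id in account_a.items()
        if zone_id in id_to_name_b
    }


# Hypothetical name -> AZ ID mappings as each account might report them
ACCOUNT_A = {"ap-south-1a": "aps1-az1", "ap-south-1b": "aps1-az3", "ap-south-1c": "aps1-az2"}
ACCOUNT_B = {"ap-south-1a": "aps1-az3", "ap-south-1b": "aps1-az2", "ap-south-1c": "aps1-az1"}
```

With this sample data, `same_physical_az(ACCOUNT_A, ACCOUNT_B)` maps account A's `ap-south-1a` to account B's `ap-south-1c`: same building, different name.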

The settings that govern AZ behaviour, end to end:

Setting / control	What it does	Default	When to change	Gotcha
Subnet AZ	Binds a subnet to one AZ	Chosen at create	One subnet per AZ you use	Cannot be changed after creation
AZ ID vs name	Stable physical ID vs per-account name	Name shown in console	Cross-account same-AZ alignment	Names are randomized per account
ALB subnets	Which AZs the ALB serves	You choose ≥2	Always list ≥2 (ideally 3)	One subnet = single-AZ ALB
Cross-zone load balancing	LB spreads to targets in all AZs	On for ALB, off for NLB	Turn on for NLB to balance evenly	NLB cross-zone billed as cross-AZ transfer
ASG subnets/AZs	Which AZs compute spreads across	You list them	Always list ≥2–3	One subnet = single-AZ ASG
ASG AZ Rebalance	Re-spreads capacity after an AZ recovers	On (managed)	Leave on	Brief extra instances during rebalance

The numbers and quotas that bound AZ design

Real limits shape what you can build across AZs. The figures that matter (defaults; many are raisable via Service Quotas, some are hard):

Limit / quota	Typical value	Raisable?	Why it matters for AZ design
AZs per Region	3–6 (varies; some have 3, a few up to 6)	No (physical)	Caps how wide you can spread in one Region
Subnets per VPC	200 (default)	Yes	Plenty for a subnet-per-AZ-per-tier design
VPCs per Region	5 (default)	Yes	Multi-VPC architectures need an increase
NAT gateways per AZ	5 (default)	Yes	One-per-AZ for HA is well within this
Elastic IPs per Region	5 (default)	Yes	Per-AZ NAT each needs an EIP
Route tables per VPC	200 (default)	Yes	Per-AZ route tables are cheap on this budget
RDS DB instances per Region	40 (default)	Yes	Multi-AZ standby counts toward usage
Cross-AZ latency	<1–2 ms typical	No (physics)	Why synchronous Multi-AZ replication works
S3 / DynamoDB multi-AZ storage	Data kept across ≥3 AZs (S3 designed for 11 nines durability)	No (design)	The free multi-AZ baseline you build on
RDS Multi-AZ failover	60–120 s (instance mode)	No	Budget this into RTO and client retries
Instance boot for ASG replacement	1–3 min typical	No	Survivor capacity must cover the gap

The takeaways: the platform limits almost never constrain a sane Multi-AZ design (subnets, route tables and NAT quotas are generous), but the physical ones do — you cannot have more AZs than the Region offers, and you cannot make an RDS failover instantaneous. Design within the physics, raise the soft quotas as needed.

Multi-AZ for each tier: what it actually buys

“Multi-AZ” means something slightly different for every service. Knowing exactly what each one gives you — automatic or not, synchronous or not, free or not — is the difference between a design that survives an AZ loss and one that merely looks like it does.

Load balancers — the front door must face every AZ

A load balancer is a regional, AZ-aware service, but only for the AZs whose subnets you attach. An NLB attached to one subnet blackholes all inbound traffic when that AZ has a blip; an ALB must be created with subnets in at least two AZs, but it is still effectively single-AZ if every target sits in one of them. Attach the load balancer to subnets in every AZ your targets live in and it routes around a dead AZ automatically via target health checks. Cross-zone load balancing (on by default for ALB, off for NLB) lets the LB send traffic to healthy targets in any AZ, smoothing load when AZs hold uneven capacity.

# Confirm the load balancer faces multiple AZs — one entry here is a single point of failure
aws elbv2 describe-load-balancers --names web-alb \
  --query "LoadBalancers[].{Name:LoadBalancerName, Scheme:Scheme, AZs:AvailabilityZones[].ZoneName}" \
  --output json

# CloudFormation: an ALB explicitly across three subnets in three AZs
WebALB:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Type: application
    Scheme: internet-facing
    Subnets:
      - !Ref PublicSubnetAZa
      - !Ref PublicSubnetAZb
      - !Ref PublicSubnetAZc   # three AZs — never one
    SecurityGroups: [ !Ref AlbSecurityGroup ]

Compute — Auto Scaling across AZs is the whole game

An Auto Scaling group distributes instances across the subnets (AZs) you list and, on an AZ failure, marks instances there unhealthy, launches replacements in the survivors, and (via AZ Rebalance) re-spreads once the AZ recovers. List one subnet and your “auto-scaled, highly available” tier is a single-AZ deployment that dies with its AZ.

# Does the ASG actually span AZs? One AZ here is the classic fake-HA failure
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names web-asg \
  --query "AutoScalingGroups[].{Name:AutoScalingGroupName, Min:MinSize, Desired:DesiredCapacity, AZs:AvailabilityZones, Subnets:VPCZoneIdentifier}" \
  --output json
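
If you run that check across many groups, a few lines of Python can flag the offenders. This sketch assumes the JSON shape produced by the `--query` above (a list of objects with `Name` and `AZs` keys); the sample document is hypothetical.

```python
import json


def single_az_groups(describe_output: str) -> list:
    """Return the names of Auto Scaling groups that span fewer than two AZs."""
    groups = json.loads(describe_output)
    return [g["Name"] for g in groups if len(set(g.get("AZs") or [])) < 2]


# Hypothetical sample output, for illustration only
SAMPLE = """
[
  {"Name": "web-asg", "Min": 3, "Desired": 6,
   "AZs": ["ap-south-1a", "ap-south-1b", "ap-south-1c"],
   "Subnets": "subnet-aaa,subnet-bbb,subnet-ccc"},
  {"Name": "batch-asg", "Min": 1, "Desired": 2,
   "AZs": ["ap-south-1a"],
   "Subnets": "subnet-aaa"}
]
"""
```

`single_az_groups(SAMPLE)` returns `["batch-asg"]`, the group that looks auto-scaled but dies with its one AZ.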

# Terraform: ASG across three private subnets (three AZs), min sized for quorum/headroom
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 3
  max_size            = 9
  desired_capacity    = 6
  vpc_zone_identifier = aws_subnet.private[*].id   # all three AZ subnets
  health_check_type   = "ELB"
  target_group_arns   = [aws_lb_target_group.web.arn]

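  # An ASG needs a launch template (or a mixed_instances_policy);
  # aws_launch_template.web is assumed to be defined elsewhere
  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }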
}

Data tier — Multi-AZ semantics differ sharply by service

This is where the “what does Multi-AZ mean” question has the most variety. RDS Multi-AZ is a synchronous standby you pay for, with automatic failover. Aurora replicates storage across three AZs inherently. S3 and DynamoDB are multi-AZ by default, for free, with no knob to turn. EFS is multi-AZ when you mount it via per-AZ targets. ElastiCache needs Multi-AZ explicitly enabled. Get this table wrong and you put a single-AZ database behind a three-AZ app:

Service	Multi-AZ mechanism	Automatic on AZ loss?	Synchronous?	Cost of Multi-AZ	The gotcha
EC2	You place instances per AZ (ASG)	Via ASG/ELB health	N/A	~0% (same instance count)	One subnet = single AZ
ALB / NLB	Attach subnets per AZ	Yes (target health)	N/A	Cross-zone transfer (NLB billed)	One subnet = blackhole
RDS (Multi-AZ instance)	Synchronous standby in 2nd AZ	Yes (60–120 s failover)	Yes	~2× DB instance cost	Standby can’t serve reads (instance mode)
RDS Multi-AZ DB cluster	2 readable standbys across 3 AZs	Yes (faster failover)	Yes	~3 instances	Engine/Region support varies
Aurora	Storage replicated across 3 AZs	Yes (replica promotion)	Storage-level	Per-replica compute	No reader = slower failover
DynamoDB	Replicated across ≥3 AZs by default	Yes (transparent)	Yes	Free (built in)	Nothing to configure — don’t overthink it
S3 (Standard)	≥3 AZ object replication by default	Yes (transparent)	Yes	Free (built in)	One-Zone-IA is the single-AZ exception
EFS (Standard)	Data across multiple AZs; per-AZ mount targets	Yes	Yes	Standard storage price	Need a mount target per AZ to survive
ElastiCache (Redis)	Replicas across AZs + Multi-AZ failover	Only if Multi-AZ enabled	Async	Replica node cost	Off by default — enable it
NAT gateway	One per AZ (not automatic)	No — manual design	N/A	~per-AZ hourly + per-GB	One NAT = egress dies with its AZ

Recovery is not instantaneous, and the time differs sharply by service. Budget for it in your RTO and your client retry logic:

Service	Recovery mechanism on AZ loss	Typical recovery time	What the client experiences
ALB targets	Health-check draining to survivors	Seconds (health-check interval)	A few failed/retried requests
EC2 via ASG	Relaunch in healthy AZs	1–3 min (boot + warm)	Reduced capacity briefly
RDS Multi-AZ instance	Standby promotion + DNS flip	60–120 s	Connection drop; reconnect succeeds
RDS Multi-AZ DB cluster	Promote a readable standby	~35 s or less	Shorter blip than instance mode
Aurora	Promote a reader (or rebuild from storage)	~30 s with a reader; longer without	Faster with a provisioned reader
DynamoDB	Transparent (multi-AZ internally)	~0 (no visible failover)	Nothing
S3	Transparent (multi-AZ internally)	~0	Nothing
ElastiCache (Multi-AZ on)	Replica promotion	Tens of seconds	Brief cache unavailability

Two design consequences fall out of this table: always provision an Aurora reader (failover with no reader is much slower), and make clients retry with backoff — a 60–120 s RDS failover is invisible to a user only if the app reconnects rather than erroring out.
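The retry advice can be sketched in a few lines. This is a minimal illustration, not a production client: `connect` is a hypothetical callable standing in for whatever opens your database connection, and the delay values are assumptions chosen so the schedule can span a 60–120 s failover.

```python
import random
import time


def backoff_delays(base=1.0, cap=20.0, attempts=10):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def call_with_retry(connect, base=1.0, cap=20.0, attempts=10,
                    sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, sleeping with full jitter between tries.

    connect is a hypothetical zero-argument callable that raises
    ConnectionError while the database endpoint is failing over.
    """
    last_error = None
    for delay in backoff_delays(base, cap, attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Full jitter: sleep a random fraction of the capped delay so
            # many clients do not reconnect in lockstep after the DNS flip.
            sleep(delay * rng())
    raise last_error
```

With the defaults the un-jittered delays sum to 131 s. Full jitter roughly halves the expected wait, so raise `attempts` or `cap` if the client must ride out the slowest failovers.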

RDS Multi-AZ in CLI and CloudFormation, because it is the single most commonly missed data-tier control:

# Turn on Multi-AZ (synchronous standby + automatic failover) for an existing instance
aws rds modify-db-instance --db-instance-identifier prod-db \
  --multi-az --apply-immediately

# Confirm it stuck
aws rds describe-db-instances --db-instance-identifier prod-db \
  --query "DBInstances[].{Id:DBInstanceIdentifier, MultiAZ:MultiAZ, AZ:AvailabilityZone, Secondary:SecondaryAvailabilityZone}" \
  --output table

# CloudFormation: Multi-AZ RDS from the start (the only correct default for prod)
ProdDB:
  Type: AWS::RDS::DBInstance
  Properties:
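    Engine: postgres
    MasterUsername: appadmin          # hypothetical name; any valid master user works
    ManageMasterUserPassword: true    # RDS generates the password and stores it in Secrets Manager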
    DBInstanceClass: db.r6g.large
    MultiAZ: true                 # synchronous standby in a second AZ
    AllocatedStorage: 100
    DBSubnetGroupName: !Ref DbSubnetGroup   # the subnet group must list ≥2 AZ subnets

A DB subnet group must contain subnets in at least two AZs or RDS refuses to enable Multi-AZ — a frequent first-time error.

Egress — the NAT gateway trap

A NAT gateway is zonal: it lives in one subnet, hence one AZ. If all your private subnets route egress through a single NAT gateway and that AZ fails, every instance in every AZ loses outbound internet — an AZ failure in one zone takes down egress for all of them. The fix is one NAT gateway per AZ, with each AZ’s private route table pointing at its own zone’s NAT:

# How many NAT gateways, and in which AZs? Fewer than your AZ count = a hidden SPOF
aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=vpc-0abc123" \
  --query "NatGateways[].{Id:NatGatewayId, Subnet:SubnetId, State:State}" \
  --output table

The trade-off is cost: a NAT gateway per AZ multiplies the hourly charge, but it also keeps NAT-bound traffic inside its own AZ, which avoids cross-AZ transfer charges and removes the shared failure domain. For dev environments a single NAT gateway is a reasonable cost saving; for production it is a false economy.
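The per-AZ routing rule is easy to check mechanically once you have exported the subnet, NAT gateway and route-table data. The sketch below assumes you have already reduced that data to three plain dictionaries; the function name and all IDs are hypothetical sample values.

```python
def find_nat_spofs(subnet_az, nat_az, subnet_nat):
    """Return findings for private subnets whose egress depends on another AZ.

    subnet_az:  {subnet_id: az}               private subnets and their AZs
    nat_az:     {nat_gateway_id: az}          NAT gateways and their AZs
    subnet_nat: {subnet_id: nat_gateway_id}   default-route target per subnet
    """
    findings = []
    for subnet, az in sorted(subnet_az.items()):
        nat = subnet_nat.get(subnet)
        if nat is None:
            findings.append((subnet, "no NAT route"))
        elif nat_az[nat] != az:
            # Egress for this subnet dies if the NAT gateway's AZ fails.
            findings.append((subnet, f"routes via {nat} in {nat_az[nat]}"))
    return findings
```

An empty result means every private subnet egresses through a NAT gateway in its own AZ. Any finding is a subnet that loses outbound internet when a different AZ fails.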

Availability math: what each design actually promises

Resilience choices translate into hard availability numbers, and the AWS SLA depends on how you deploy. The figures that frame the conversation (illustrative SLA tiers and the downtime they imply):

Design	Representative availability target	Approx. downtime / year	What it survives
Single instance, single AZ	~99.5% (instance-level)	~1.8 days	Nothing structural
Single-AZ, redundant instances	~99.9%	~8.8 hours	Instance/host failures only
Multi-AZ (2–3 AZs)	~99.95–99.99%	~4.4 h–53 min	One AZ failure
Multi-Region (active/passive)	~99.99%+	~53 min or less	One Region failure
Multi-Region (active/active)	~99.999% achievable	~5 min	Region failure + global serving

Two honest caveats: these are design targets, not guarantees your app will hit (your code and dependencies matter), and AWS publishes specific SLAs per service — EC2, RDS, S3 each have their own. The point of the table is directional: each rung up the hierarchy removes roughly an order of magnitude of downtime, at increasing cost. Multi-AZ is the rung with the best return.
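The downtime column is plain arithmetic, and the same arithmetic shows why chaining components lowers availability while redundancy raises it. A minimal sketch follows; the function names are illustrative, and `parallel` assumes failures are fully independent, which real AZs only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def serial(*availabilities_pct):
    """Availability of components that must ALL be up (e.g. app AND database)."""
    result = 1.0
    for a in availabilities_pct:
        result *= a / 100
    return result * 100


def parallel(availability_pct, copies):
    """Availability of N independent redundant copies where ANY one suffices."""
    return (1 - (1 - availability_pct / 100) ** copies) * 100
```

For example, `downtime_hours_per_year(99.9)` is about 8.76 hours, matching the table, and `serial(99.99, 99.95)` comes out near 99.94%: a stack is never more available than its weakest required tier.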

Who guarantees what — the shared-responsibility split

AWS runs the physical AZ/Region infrastructure; you decide how to deploy across it. Outages happen when teams assume AWS covers a layer they actually own:

Layer	AWS provides	You own
Physical AZ (power/cooling/network)	Independent, redundant facilities	Choosing to use more than one
Inter-AZ network	Low-latency redundant fibre	Sending traffic across it (and the bill)
Regional service durability	S3/DynamoDB multi-AZ by default	Not picking the single-AZ option (One Zone-IA)
Managed-service failover	RDS Multi-AZ mechanism	Enabling Multi-AZ; subnet group across AZs
Compute placement	Capacity across AZs	Listing multiple subnets on ASG/ALB
DNS failover	Route 53 health checks/policies	Configuring failover routing + low TTL
Multi-Region	The second Region’s infrastructure	Replicating data, mirroring controls, cutover

The recurring theme of every outage in this article lives in the right-hand column: AWS gives you independent failure domains; using them is your job.

Cross-AZ data transfer: the cost nobody budgets for

Multi-AZ is cheap, not free, and the line item people miss is data transfer between AZs. AWS bills traffic that crosses an AZ boundary — typically per GB in each direction — while same-AZ traffic (by private IP) and most traffic to/from regional services is free. On a chatty microservice mesh or a high-throughput app↔DB path, cross-AZ bytes become a real monthly number.

What’s free, what’s billed — the rules that actually matter:

Traffic path	Billed as cross-AZ?	Notes
EC2 ↔ EC2, same AZ, via private IP	Free	Use private IPs; public-IP hops can route oddly
EC2 ↔ EC2, different AZ, private IP	Billed (per GB each way)	The main cross-AZ cost driver
EC2 ↔ EC2 via public IP / Elastic IP (same Region)	Billed	Avoid public IPs for internal traffic
App ↔ RDS in a different AZ	Billed	Multi-AZ failover may move the active to another AZ
App ↔ RDS in the same AZ	Free	Place read replicas in the same AZ as the instances that read from them where possible
App ↔ ElastiCache in a different AZ	Billed	Hot cache paths benefit from same-AZ placement
Inter-AZ replication (RDS sync, Aurora storage)	Included in service cost	Not a separate transfer line you control
Traffic to S3 / DynamoDB in-Region (via gateway endpoint)	Free	Use a gateway VPC endpoint to also avoid NAT cost
Traffic to S3/DynamoDB in-Region without an endpoint (via NAT)	NAT data-processing charge	The hidden cost a gateway endpoint removes
Traffic through an interface VPC endpoint (PrivateLink)	Hourly + per-GB	Cheaper than NAT for many AWS-service calls
NLB cross-zone load balancing	Billed as cross-AZ	Why NLB cross-zone is off by default
ALB cross-zone	Not separately billed	On by default; effectively free to you
Data into AWS from the internet	Free	Ingress is generally free
Data out to the internet	Billed (per-GB, tiered)	The other big transfer line beyond cross-AZ
Cross-Region transfer	Billed (higher rate)	A different, larger cost than cross-AZ

How to keep the bill sane without sacrificing resilience:

Technique	What it saves	Trade-off
Use private IPs for internal traffic	Avoids public-IP routing surcharges	None — just discipline
Gateway VPC endpoints for S3/DynamoDB	Removes NAT data-processing + makes S3/DDB free	Endpoint setup; route-table entries
Keep chatty app↔cache same-AZ where viable	Cuts cross-AZ GB on hot paths	Reduced AZ spread for that link; balance vs HA
AZ-aware service routing (where the app supports it)	Prefers same-AZ targets	App/mesh complexity
Right-size 3 AZs vs 2 for actual traffic	Fewer cross-AZ hops if traffic is heavy	Quorum needs may force 3 anyway
NAT per AZ	Keeps NAT traffic same-AZ (cheaper + resilient)	More NAT gateways to pay for
Interface endpoints for chatty AWS-service calls	Avoids NAT data-processing for those calls	Per-endpoint hourly cost
VPC peering / Transit Gateway intra-Region over private IP	Avoids public-IP routing surcharges	TGW per-attachment + data cost

The honest tension: spreading across AZs adds cross-AZ transfer, and minimizing transfer pushes toward fewer AZs — but never sacrifice the AZ spread your availability target needs to shave a transfer bill. Optimize the paths (private IPs, endpoints, same-AZ for the hottest links), not the AZ count.
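To turn "a real monthly number" into an estimate, a rough sketch is enough. The per-GB rate below is an assumption (a commonly quoted figure, charged in each direction); check current pricing for your Region before relying on it. The function names are illustrative.

```python
def expected_cross_az_fraction(az_count):
    """Share of requests that cross an AZ boundary when targets are spread
    evenly and routing is not AZ-aware: (N - 1) / N."""
    return (az_count - 1) / az_count


def cross_az_monthly_cost(gb_per_day, cross_az_fraction,
                          rate_per_gb_each_way=0.01, days=30):
    """Estimate monthly cross-AZ transfer cost in USD.

    rate_per_gb_each_way is an ASSUMED rate. Each cross-AZ GB is charged once
    leaving the source AZ and once entering the destination AZ, hence the 2.
    """
    cross_gb = gb_per_day * days * cross_az_fraction
    return cross_gb * rate_per_gb_each_way * 2
```

Under these assumptions, 1,000 GB/day of internal traffic spread evenly over three AZs works out to roughly 400 USD a month, which is why the hottest paths are the ones worth keeping same-AZ.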

Edge, Local Zones and Wavelength: latency, not availability

Three constructs sit in front of or beside Regions and are routinely confused with HA. They are about latency and proximity, not surviving an AZ or Region failure. Using a CDN does not make you available; it makes you fast and absorbs some attacks.

Construct	What it is	Primary purpose	Failure-domain role	Example
Edge location / PoP	CloudFront / Global Accelerator point of presence	Cache content, terminate TLS near users, anycast routing	Absorbs L3/4 DDoS; not a DR strategy	CloudFront cache for static assets
Regional edge cache	Larger mid-tier cache behind PoPs	Improve cache-hit ratio	None (performance only)	CloudFront origin shielding
Local Zone	AWS compute/storage in a metro, tied to a parent Region	Single-digit-ms latency to a specific city	Metro extension of a Region; not separate DR	`us-west-2-lax-1` for LA media workloads
Wavelength Zone	Compute embedded in a telco’s 5G network	Ultra-low latency for mobile/5G/edge apps	Carrier-edge; tied to a Region	AR/VR, real-time mobile gaming
Outposts	AWS racks in your datacenter	AWS APIs on-prem (residency, latency)	On-prem extension; you own the building’s risk	Low-latency factory floor

A decision table for which proximity construct fits a given need:

If you need…	Reach for	Not for
Cache static/dynamic content near global users	CloudFront	Compute or DR
Lowest-latency, static-anycast entry + L3/4 DDoS	Global Accelerator	Caching (use CloudFront)
Single-digit-ms compute latency to a specific city	Local Zone	Surviving a Region failure
Ultra-low latency inside a 5G/mobile network	Wavelength Zone	General workloads
AWS APIs running in your datacenter (residency/latency)	Outposts	Offloading building risk to AWS
Survive one datacenter failing	Multi-AZ (not any of the above)	—
Survive a whole Region failing	Multi-Region (not any of the above)	—

The reading note: CloudFront/Global Accelerator improve latency and DDoS posture; Local Zones/Wavelength improve metro/carrier latency; none of them is a substitute for Multi-AZ or multi-Region resilience. If someone says “we’re safe, we have a CDN,” they have confused performance with availability.

Multi-Region: when it’s actually warranted

Multi-AZ survives the common failure. Multi-Region survives the rare, total one — a whole-Region event — and lets you serve users on another continent with low latency. It is also a large step up in cost and operational burden: duplicate stacks, cross-Region replication, DNS failover, and (for active/active) data-consistency problems you now own. The senior judgment is to add it only when RTO/RPO requirements or a global user base demand it — not because it sounds robust.

The patterns, in increasing cost/complexity:

Pattern	What runs where	RTO	RPO	Cost	When to choose
Backup & restore	Backups copied cross-Region; rebuild on disaster	Hours–day	Hours	Low	Non-critical; tight budget
Pilot light	Core (DB replica) warm in DR; rest off	~Tens of min	Minutes	Low–medium	Important but cost-sensitive
Warm standby	Scaled-down full stack always running in DR	Minutes	Seconds–minutes	Medium–high	Critical workloads, fast recovery
Active/active (multi-site)	Full stack live in both, traffic split	~Zero	~Zero (with global data)	Highest	Global low-latency + near-zero RTO

A side-by-side that maps each pattern to the AWS building blocks:

Concern	Backup & restore	Pilot light	Warm standby	Active/active
DB replication	Snapshot copy	Cross-Region read replica	Read replica (promotable)	Global table / Aurora Global DB
Compute in DR	None	Minimal	Scaled-down, running	Full, serving traffic
DNS strategy	Manual repoint	Route 53 failover	Route 53 failover/health	Route 53 latency/geolocation
Data store fit	Any	RDS/Aurora	RDS/Aurora	DynamoDB Global Tables, Aurora Global
Main risk	Long RTO	Promotion + scale-out time	Cost of idle stack	Split-brain / conflict resolution

The DNS layer is what actually steers traffic between Regions, and Route 53 routing policies are the lever. Picking the wrong one is a common multi-Region failure (traffic that won’t fail over, or that ignores latency):

Routing policy	What it does	Use it for	Watch-out
Failover	Primary; switch to secondary when a health check fails	Active/passive DR	Health check must probe a real endpoint; low TTL
Latency	Routes each user to the lowest-latency Region	Active/active global serving	Needs a healthy stack in each Region
Geolocation	Routes by the user’s location	Residency / localized content	Define a default for unmatched locations
Geoproximity	Routes by geography with a bias “shift”	Tuning traffic between Regions	More complex; needs Traffic Flow
Weighted	Splits traffic by assigned weights	Canary / gradual cutover	Weights are coarse; not health-aware alone
Multivalue answer	Returns multiple healthy records	Simple client-side spread	Not a substitute for a real LB
Simple	One record, no health logic	Single-Region only	No failover — wrong for multi-Region

The pairing that matters: active/passive DR uses Failover routing with a health check and a low TTL (60 s), while active/active global serving uses Latency or Geolocation with a live stack in every Region. A high record TTL (e.g. 3600 s) will pin resolvers to a dead Region for an hour — set it to 60 s for anything that needs to fail over.
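The effect of TTL on failover time can be approximated with one line of arithmetic. The sketch below assumes a 30 s health-check request interval and a failure threshold of 3; treat both as assumptions and substitute the values from your own health-check configuration. It ignores resolvers that disregard TTLs and the time the standby stack needs to take load.

```python
def worst_case_failover_seconds(check_interval=30, failure_threshold=3, record_ttl=60):
    """Rough upper bound on how long clients keep resolving to a dead Region.

    detection = check_interval * failure_threshold  (consecutive failed checks)
    staleness = record_ttl                           (resolvers serving a cached answer)
    """
    return check_interval * failure_threshold + record_ttl
```

With a 60 s TTL the bound is 150 s; with a 3600 s TTL it becomes 3,690 s, so the TTL rather than the health check dominates the outage.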

The hard rule and a pointer onward: get Multi-AZ rock-solid first. A multi-Region design built on top of single-AZ tiers is theatre — you’ll fail to the DR Region for an AZ incident you should have absorbed locally, paying a huge RTO for a small failure. The cross-Region mechanics (snapshot copy, Vault Lock, Route 53 cutover, RTO/RPO tiers) are covered in depth in AWS Backup and Disaster Recovery Strategies.

Architecture at a glance

The diagram traces one request through the resiliency hierarchy, left to right, and marks the controls that — set wrong — collapse a failure domain. Start at the global edge: Route 53 applies latency and health-based routing, and CloudFront serves cached content from edge PoPs (with AWS Shield absorbing L3/4 attacks) — both global, in front of every Region. The request lands in Region ap-south-1, enters the VPC (10.0.0.0/16, three AZ subnets) and hits a single regional ALB on port 443 with cross-zone load balancing on. That one ALB fans the request across three independent Availability Zones (aps1-az1/2/3) — each a physically separate datacenter with its own power, cooling and network — where an EC2 Auto Scaling group spans all three (min 3, desired 6, one subnet per AZ). When an AZ’s power/cooling/network faults, it becomes an isolated failure domain; the ALB’s health checks drain the unhealthy AZ and capacity shifts to the survivors.

Behind compute sits the data tier: RDS Multi-AZ with a synchronous standby in a second AZ (60–120 s automatic failover), DynamoDB replicating across three AZs for free, and S3 writing every object to ≥3 AZs at eleven-nines design durability, also free. Finally, a second Region (ap-southeast-1) holds a warm-standby passive stack plus cross-Region replication (an RDS read replica and S3 CRR) — the only thing in the picture that survives a whole-Region event. Each numbered badge marks a control that, misconfigured, breaks the chain: a single-subnet ALB (1), an ASG pinned to one AZ (2), the AZ failure domain itself (3), a database that isn’t Multi-AZ (4), and the absence of any region-failure plan (5). Read the legend as symptom · how to confirm · fix for each.

Real-world scenario

Streamly is a fictional but realistic Indian video-streaming startup: a three-tier app (React front end on CloudFront/S3, a Node API on EC2, PostgreSQL on RDS) serving ~250,000 daily users, almost all in India, out of ap-south-1 (Mumbai). The platform team is five engineers; the monthly AWS bill is about ₹6,80,000. To launch fast and cheap they had done what many do: clicked through the console, accepted the default AZ, and built everything in ap-south-1a — one public subnet, one private subnet, one NAT gateway, a single-AZ RDS instance, and an Auto Scaling group whose VPCZoneIdentifier listed exactly one subnet. On paper they had “an ALB, Auto Scaling, and managed RDS” — the vocabulary of high availability. In reality every tier shared one failure domain.

The incident hit on a Saturday evening during a cricket-final livestream — peak traffic. At 20:42 a power/cooling event degraded aps1-az1 (the physical AZ their ap-south-1a mapped to). The symptoms cascaded in seconds. The ALB, attached to one subnet, had no healthy targets and started returning 503. The Auto Scaling group tried to launch replacement instances — into the same dead AZ, because that was the only subnet it knew — and they failed to come up. The RDS instance, single-AZ, had no standby to promote; the database was simply gone. And because the lone NAT gateway lived in that AZ, outbound internet access was gone as well. The “highly available” stack was fully down. The on-call engineer’s reflexes — restart the app, scale the ASG — did nothing, because every lever pointed back into the failed AZ.

The breakthrough was diagnostic, not heroic. The senior on-call ran aws elbv2 describe-load-balancers and saw a single entry under AvailabilityZones. aws autoscaling describe-auto-scaling-groups showed one AZ and one subnet. aws rds describe-db-instances showed MultiAZ: false. The picture was unmistakable: this was never a multi-AZ system. There was no fast in-incident fix — you cannot conjure a standby database or new subnets mid-outage — so they waited for AWS to recover aps1-az1, which took about two hours and forty minutes. Total downtime during their highest-revenue window of the quarter.

The remediation, done deliberately over the next two weeks, was almost entirely structural and barely moved the bill. They added public and private subnets in aps1-az2 and aps1-az3, re-attached the ALB to all three public subnets, set the ASG’s VPCZoneIdentifier to all three private subnets with min_size = 3 and AZ Rebalance on, converted RDS to Multi-AZ (a synchronous standby in a second AZ), and deployed a NAT gateway per AZ with per-AZ route tables. The compute cost was unchanged — the same six instances, now spread across three buildings instead of stacked in one. The genuine new costs were the RDS standby (~+₹38,000/mo), two extra NAT gateways (~+₹9,000/mo), and a modest rise in cross-AZ data transfer (~₹12,000/mo) — together under 9% on a bill that had just eaten a multi-hour outage during the cricket final.

Six weeks later, a different AZ in ap-south-1 had a brief network event. This time describe-target-health showed the affected AZ’s targets draining, the ASG launched replacements in the two healthy AZs within minutes, RDS executed an automatic failover to its standby in about 70 seconds, and users saw a blip in buffering, not an outage. The line the team pinned to the wall: “Auto Scaling and Multi-AZ are not features you enable — they are subnets you list. We had the words without the wiring.”

Advantages and disadvantages

Multi-AZ as the default production posture is overwhelmingly the right call, but weigh it honestly:

Advantages (why Multi-AZ is the baseline)	Disadvantages (the costs and limits)
Survives the common failure (a single AZ) — by far the most likely datacenter incident	Does not survive a whole-Region event — multi-Region is a separate, costlier effort
Essentially free on compute — same instance count, spread across buildings	Cross-AZ data transfer is billed per GB each way; chatty paths add up
Managed services do it for you — S3/DynamoDB are multi-AZ by default, RDS Multi-AZ failover is automatic	A Multi-AZ RDS standby is a paid, idle instance (instance mode can’t even serve reads)
Operationally simple — identity, networking and most services stay within one Region	A NAT gateway per AZ for resilient egress multiplies that hourly charge
Low-latency HA — AZs are close enough for synchronous replication, so failover is fast and consistent	Quorum systems need three AZs, so two-AZ designs can still lose availability on one failure
Health-based draining routes around a dead AZ automatically when wired correctly	Capacity headroom: each AZ must hold spare for a neighbour’s failure (2× for two AZs)
AZ Rebalance re-spreads capacity automatically once an AZ recovers	Defaults are unsafe — console-default single-subnet placement looks HA but isn’t

The model is right for essentially every production workload: you want the common failure absorbed for almost no money, and managed services hand you most of the resilience. It bites when (a) you skip it and run single-AZ “to save money” — the costliest false economy in this article, (b) you build two AZs for a three-AZ quorum need, or (c) you minimize AZ spread to shave a data-transfer bill and reintroduce a single point of failure. Each disadvantage is manageable once you know it exists — which is the whole point.
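Two of the disadvantages above, capacity headroom and quorum, reduce to small formulas. The sketch below is illustrative (the function names are not from any library) and assumes load is spread evenly across AZs and that exactly one AZ fails at a time.

```python
import math


def instances_per_az(peak_instances, az_count):
    """Instances each AZ must run so the survivors carry peak load after one AZ fails."""
    if az_count < 2:
        raise ValueError("one AZ cannot absorb its own failure")
    return math.ceil(peak_instances / (az_count - 1))


def overprovision_factor(az_count):
    """Total provisioned capacity as a multiple of peak: N / (N - 1)."""
    if az_count < 2:
        raise ValueError("one AZ cannot absorb its own failure")
    return az_count / (az_count - 1)


def keeps_quorum(members_per_az, failed_az):
    """True if a strict majority of voting members remains after one AZ is lost.

    members_per_az maps an AZ name to its number of voting members.
    """
    total = sum(members_per_az.values())
    remaining = total - members_per_az[failed_az]
    return remaining > total // 2
```

For a peak of six instances, two AZs need six each (2× overprovisioning) while three AZs need three each (1.5×). A cluster with one voter in each of two AZs loses quorum when either fails; one voter in each of three AZs keeps it.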

Hands-on lab

Build a genuinely Multi-AZ web tier — VPC across three AZs, ALB across three subnets, an Auto Scaling group spanning all three — then prove it by draining an AZ’s targets and watching traffic survive. Free-tier-friendly where possible (t3.micro); delete everything at the end to avoid lingering ALB and instance charges. Run in CloudShell or any shell with the CLI configured for a three-AZ Region (we use ap-south-1).

Step 1 — Variables and the three AZs.

REGION=ap-south-1
VPC_CIDR=10.20.0.0/16
# Grab the first three AZ names in the Region
mapfile -t AZS < <(aws ec2 describe-availability-zones --region $REGION \
  --query "AvailabilityZones[?State=='available'].ZoneName" --output text | tr '\t' '\n' | head -3)
echo "Using AZs: ${AZS[@]}"   # expect three, e.g. ap-south-1a ap-south-1b ap-south-1c

Step 2 — VPC and an Internet Gateway.

VPC_ID=$(aws ec2 create-vpc --cidr-block $VPC_CIDR --region $REGION \
  --query Vpc.VpcId --output text)
IGW_ID=$(aws ec2 create-internet-gateway --region $REGION \
  --query InternetGateway.InternetGatewayId --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID --region $REGION
echo "VPC=$VPC_ID IGW=$IGW_ID"

Step 3 — One public subnet per AZ (this is the Multi-AZ part).

SUBNETS=()
for i in 0 1 2; do
  SID=$(aws ec2 create-subnet --vpc-id $VPC_ID --region $REGION \
    --cidr-block 10.20.$((i*10)).0/24 --availability-zone ${AZS[$i]} \
    --query Subnet.SubnetId --output text)
  aws ec2 modify-subnet-attribute --subnet-id $SID --map-public-ip-on-launch --region $REGION
  SUBNETS+=($SID)
  echo "Subnet in ${AZS[$i]} = $SID"
done

Expected: three subnet IDs, one per AZ. This is your AZ coverage.

Step 4 — Route the subnets to the internet.

RT_ID=$(aws ec2 create-route-table --vpc-id $VPC_ID --region $REGION \
  --query RouteTable.RouteTableId --output text)
aws ec2 create-route --route-table-id $RT_ID --destination-cidr-block 0.0.0.0/0 \
  --gateway-id $IGW_ID --region $REGION
for SID in "${SUBNETS[@]}"; do
  aws ec2 associate-route-table --route-table-id $RT_ID --subnet-id $SID --region $REGION
done

Step 5 — Security group, then an ALB across all three subnets.

SG_ID=$(aws ec2 create-security-group --group-name lab-alb-sg --description "lab alb" \
  --vpc-id $VPC_ID --region $REGION --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 80 \
  --cidr 0.0.0.0/0 --region $REGION

ALB_ARN=$(aws elbv2 create-load-balancer --name lab-alb --type application \
  --subnets "${SUBNETS[@]}" --security-groups $SG_ID --region $REGION \
  --query "LoadBalancers[0].LoadBalancerArn" --output text)

# Confirm it faces THREE AZs — the whole point of the lab
aws elbv2 describe-load-balancers --load-balancer-arns $ALB_ARN --region $REGION \
  --query "LoadBalancers[0].AvailabilityZones[].ZoneName" --output table

Expected: a table with three AZ rows. One row would mean a single-AZ ALB — the failure we’re avoiding.

Step 6 — A target group and an Auto Scaling group across all three subnets.

TG_ARN=$(aws elbv2 create-target-group --name lab-tg --protocol HTTP --port 80 \
  --vpc-id $VPC_ID --target-type instance --region $REGION \
  --query "TargetGroups[0].TargetGroupArn" --output text)
aws elbv2 create-listener --load-balancer-arn $ALB_ARN --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=$TG_ARN --region $REGION

# Minimal launch template that serves an AZ-aware hello page
AMI=$(aws ssm get-parameter --region $REGION \
  --name /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
  --query Parameter.Value --output text)
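USERDATA=$(base64 -w0 <<'EOF'
#!/bin/bash
dnf install -y httpd
# AL2023 requires IMDSv2: fetch a session token before reading instance metadata
TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
AZ=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone)
echo "OK from $AZ" > /var/www/html/index.html
systemctl enable --now httpd
EOF
)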
LT_ID=$(aws ec2 create-launch-template --launch-template-name lab-lt --region $REGION \
  --launch-template-data "{\"ImageId\":\"$AMI\",\"InstanceType\":\"t3.micro\",\"SecurityGroupIds\":[\"$SG_ID\"],\"UserData\":\"$USERDATA\"}" \
  --query "LaunchTemplate.LaunchTemplateId" --output text)

SUBNET_CSV=$(IFS=,; echo "${SUBNETS[*]}")
aws autoscaling create-auto-scaling-group --auto-scaling-group-name lab-asg \
  --launch-template "LaunchTemplateId=$LT_ID" --min-size 3 --max-size 6 --desired-capacity 3 \
  --vpc-zone-identifier "$SUBNET_CSV" --target-group-arns $TG_ARN \
  --health-check-type ELB --region $REGION

# Prove the ASG spans three AZs
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names lab-asg --region $REGION \
  --query "AutoScalingGroups[0].{AZs:AvailabilityZones, Subnets:VPCZoneIdentifier}" --output json

Expected: AZs lists three zones. After a couple of minutes, curl http://<ALB-DNS>/ (from describe-load-balancers --query LoadBalancers[0].DNSName) returns OK from ap-south-1a/b/c, varying by which instance answered.

Step 7 — Simulate an AZ loss and watch survival. Pull the instances of one AZ out of the target group (a stand-in for that AZ failing) and confirm the ALB keeps serving from the other two:

# Watch target health; deregister one AZ's targets; traffic should continue from the rest
aws elbv2 describe-target-health --target-group-arn $TG_ARN --region $REGION \
  --query "TargetHealthDescriptions[].{Id:Target.Id, AZ:Target.AvailabilityZone, State:TargetHealth.State}" \
  --output table
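# 'Fail' the first AZ: deregister the ASG's instances in it, then re-run a curl loop against the ALB DNS; it stays up
IDS=$(aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names lab-asg --region $REGION \
  --query "AutoScalingGroups[0].Instances[?AvailabilityZone=='${AZS[0]}'].InstanceId" --output text)
aws elbv2 deregister-targets --target-group-arn $TG_ARN --region $REGION \
  --targets $(printf 'Id=%s ' $IDS)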

Validation checklist. You built a VPC across three AZs, attached the ALB to three subnets (confirmed three AvailabilityZones), spread an ASG across all three (min 3), and demonstrated that removing one AZ’s targets does not take the service down. No application code resilience was involved — the availability came entirely from which subnets you listed.

Step	What you did	What it proves	Real-world analogue
3	One subnet per AZ	AZ coverage = the subnets you create	The structural HA decision
5	ALB across 3 subnets	The front door faces every AZ	Avoiding the single-subnet blackhole
6	ASG `VPCZoneIdentifier` = 3 subnets	Compute spreads across AZs	The fix for “fake HA” Auto Scaling
7	Drain one AZ’s targets	Service survives one AZ loss	An actual AZ incident
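To make the step 7 curl loop easier to read, the responses can be tallied per AZ. This is a small sketch rather than part of the lab itself: `fetch` is a hypothetical callable you supply, and the parsing assumes the page body has the lab's "OK from <az>" shape.

```python
from collections import Counter


def tally_by_az(fetch, requests=30):
    """Call fetch() repeatedly and count responses per AZ.

    fetch is a hypothetical zero-argument callable that returns the page body
    (e.g. "OK from ap-south-1a") or raises OSError on a failed request.
    """
    prefix = "OK from "
    counts = Counter()
    for _ in range(requests):
        try:
            body = fetch().strip()
        except OSError:
            counts["FAILED"] += 1
            continue
        # Anything that is not the lab's hello page is counted separately.
        key = body[len(prefix):] if body.startswith(prefix) else "UNEXPECTED"
        counts[key] += 1
    return counts
```

One way to wire it up is `fetch = lambda: urllib.request.urlopen(f"http://{alb_dns}/", timeout=3).read().decode()`, where `alb_dns` holds the DNS name from step 6. After deregistering one AZ's targets, that AZ should disappear from the tally while `FAILED` stays at or near zero.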

Cleanup (avoid ALB/instance charges).

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name lab-asg --force-delete --region $REGION
aws elbv2 delete-listener --listener-arn $(aws elbv2 describe-listeners --load-balancer-arn $ALB_ARN --region $REGION --query "Listeners[0].ListenerArn" --output text) --region $REGION
aws elbv2 delete-load-balancer --load-balancer-arn $ALB_ARN --region $REGION
aws elbv2 wait load-balancers-deleted --load-balancer-arns $ALB_ARN --region $REGION
aws elbv2 delete-target-group --target-group-arn $TG_ARN --region $REGION
aws ec2 delete-launch-template --launch-template-id $LT_ID --region $REGION
# Wait for the ASG's instances to terminate; the IGW, subnets and SG cannot be removed while they exist
aws ec2 wait instance-terminated --region $REGION \
  --filters "Name=tag:aws:autoscaling:groupName,Values=lab-asg"
# Then detach/delete the IGW, subnets, route table, SG, and finally the VPC (re-run on DependencyViolation)
aws ec2 detach-internet-gateway --internet-gateway-id $IGW_ID --vpc-id $VPC_ID --region $REGION
aws ec2 delete-internet-gateway --internet-gateway-id $IGW_ID --region $REGION
for SID in "${SUBNETS[@]}"; do aws ec2 delete-subnet --subnet-id $SID --region $REGION; done
aws ec2 delete-route-table --route-table-id $RT_ID --region $REGION
aws ec2 delete-security-group --group-id $SG_ID --region $REGION
aws ec2 delete-vpc --vpc-id $VPC_ID --region $REGION

Cost note. An ALB and three t3.micro instances for an hour run a few tens of rupees; we used no NAT gateway in the lab (public subnets) precisely to keep it cheap. Delete promptly — an ALB left running is the main lingering charge.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First a scannable table you can read mid-incident, then the full reasoning for the entries that bite hardest. Every “fake HA” failure has the same shape: a resource that looks spread across AZs but is listed against one.

#	Symptom	Root cause	Confirm (exact command)	Fix
1	One AZ degrades and the whole site 503s despite an ALB	ALB attached to a single subnet (one AZ)	`aws elbv2 describe-load-balancers --query "LoadBalancers[].AvailabilityZones"` shows one entry	Attach the ALB to subnets in ≥2 (ideally 3) AZs; turn on cross-zone
2	“Auto-scaled” tier dies entirely with one AZ; replacements won’t launch	ASG `VPCZoneIdentifier` lists one subnet	`aws autoscaling describe-auto-scaling-groups --query "...AvailabilityZones"` shows one AZ	List a subnet per AZ; set `min_size ≥ 2–3`; enable AZ Rebalance
3	AZ loss takes the database down, no failover	RDS is single-AZ (`MultiAZ=false`)	`aws rds describe-db-instances --query "...MultiAZ"` returns `false`	`modify-db-instance --multi-az` (subnet group must list ≥2 AZs)
4	One AZ fails and ALL instances (every AZ) lose internet egress	Single NAT gateway in the failed AZ	`aws ec2 describe-nat-gateways` shows fewer NAT GWs than AZs	One NAT gateway per AZ + per-AZ route tables
5	Shared/peered resources cost more or can’t co-locate across accounts	Aligned on AZ name not AZ ID	Compare `ZoneId` across accounts via `describe-availability-zones`	Coordinate on the AZ ID (`aps1-az1`), not `ap-south-1a`
6	NLB traffic piles onto one AZ; targets in others idle	NLB cross-zone load balancing is off (default)	`aws elbv2 describe-load-balancer-attributes` shows `load_balancing.cross_zone.enabled` = `false`	Enable cross-zone on the NLB (note: billed as cross-AZ transfer)
7	Multi-AZ RDS enabled but reads still hammer one instance	Multi-AZ instance standby can’t serve reads	`describe-db-instances` shows MultiAZ true but one endpoint	Use Multi-AZ DB cluster or read replicas for read scaling
8	API call fails with auth/endpoint error in a new Region	Region is opt-in and not enabled / wrong STS endpoint	`aws ec2 describe-regions --query "...OptInStatus"`	Enable the Region; use its regional STS endpoint
9	Surprise data-transfer bill after going Multi-AZ	Heavy cross-AZ traffic (app↔DB, mesh) over private IP	Cost Explorer → data transfer; check inter-AZ GB	Gateway endpoints for S3/DDB; keep hottest paths same-AZ
10	App still down after failover; can’t reach the internet	NAT per AZ exists but route tables not per-AZ	`aws ec2 describe-route-tables` — all subnets point at one NAT	Give each AZ’s private subnet its own route to its AZ’s NAT
11	“We have a CDN, we’re highly available” — but an AZ event still broke us	Confusing edge/latency with availability	Architecture review: where do origins actually live?	CloudFront is performance/DDoS, not DR; fix origin AZ spread
12	Two-AZ quorum cluster loses availability when one AZ fails	Quorum needs majority; 1 of 2 is no majority	Cluster shows no quorum / read-only after one AZ down	Run quorum systems across three AZs
13	One-Zone-IA S3 data lost after an AZ event	`S3 One Zone-IA` stores in a single AZ	`aws s3api get-bucket...` / object storage class is `ONEZONE_IA`	Use S3 Standard (≥3 AZ) for anything not reproducible
14	EFS-backed app loses access in one AZ	No mount target in the surviving AZs	`aws efs describe-mount-targets` shows targets in fewer AZs than the app runs in	Create an EFS mount target in every AZ the app runs in
15	ElastiCache cluster has no failover when its AZ dies	Multi-AZ not enabled on the replication group	`aws elasticache describe-replication-groups` → `AutomaticFailover` `disabled`	Enable Multi-AZ + automatic failover; add a replica in another AZ
16	RDS won’t enable Multi-AZ (“subnet group must cover 2 AZs”)	DB subnet group lists subnets in only one AZ	`aws rds describe-db-subnet-groups --query "...Subnets[].SubnetAvailabilityZone"`	Add a subnet from a second AZ to the DB subnet group
17	Spot/On-Demand capacity error when one AZ is constrained	All capacity requested in a single AZ	ASG activity history shows `InsufficientInstanceCapacity` in one AZ	Spread the ASG across 3 AZs; use mixed instances/capacity-optimized
18	“Highly available” stack on EBS won’t recover in another AZ	EBS volume is AZ-bound; can’t attach across AZs	`aws ec2 describe-volumes --query "...AvailabilityZone"`	Don’t rely on a single EBS volume for HA; use snapshots/EFS/managed stores

The expanded form, for the failures that cost the most:

1. One AZ degrades and the whole site 503s despite “having an ALB.” Root cause: The ALB is attached to a single subnet (one AZ). When that AZ’s targets go unhealthy, the ALB has nowhere to route. The presence of an ALB created the illusion of HA. Confirm: aws elbv2 describe-load-balancers --names <alb> --query "LoadBalancers[].AvailabilityZones[].ZoneName" returns one zone. Fix: Attach the ALB to subnets in ≥2 (ideally 3) AZs; ensure cross-zone load balancing is on (default for ALB). Re-run the confirm command and expect multiple zones.

2. The “auto-scaled, highly available” tier dies entirely with one AZ; replacements never come up. Root cause: The ASG’s VPCZoneIdentifier lists one subnet, so it can only ever launch into one AZ — including the replacements it tries to launch during that AZ’s failure. Confirm: aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg> --query "AutoScalingGroups[].{AZs:AvailabilityZones, Subnets:VPCZoneIdentifier}" shows a single AZ. Fix: Set VPCZoneIdentifier to one subnet per AZ, min_size at least equal to the AZ count (≥3 for quorum/headroom), and leave AZ Rebalance on so capacity re-spreads after recovery.

3. An AZ loss takes the database down with no failover. Root cause: Single-AZ RDS — there is no standby to promote, so the database simply disappears with its AZ. Confirm: aws rds describe-db-instances --db-instance-identifier <db> --query "DBInstances[].MultiAZ" returns false. Fix: aws rds modify-db-instance --db-instance-identifier <db> --multi-az --apply-immediately. The DB subnet group must contain subnets in ≥2 AZs or the change is rejected. Expect a 60–120 s automatic failover on a real AZ event thereafter.

4. One AZ fails and every instance in every AZ loses outbound internet. Root cause: A single NAT gateway in the failed AZ that all private subnets route through; when its AZ dies, egress dies fleet-wide, turning an AZ failure into a VPC-wide egress outage. Confirm: aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=<vpc>" shows fewer NAT gateways than AZs; aws ec2 describe-route-tables shows multiple AZ subnets pointing at the same NAT. Fix: Deploy one NAT gateway per AZ and give each AZ’s private subnet its own route table pointing at its own zone’s NAT (see also #10).

5. Shared resources across accounts cost more or can’t be co-located. Root cause: You aligned on the AZ name (ap-south-1a), which maps to different physical AZs in different accounts, so “same AZ” wasn’t actually the same, which adds cross-AZ charges or breaks same-AZ placement assumptions. Confirm: aws ec2 describe-availability-zones --query "AvailabilityZones[].{Name:ZoneName,Id:ZoneId}" in each account and compare the ZoneId. Fix: Coordinate cross-account placement on the AZ ID (aps1-az1 for ap-south-1), not the name.

9. A surprise data-transfer bill after going Multi-AZ. Root cause: Correctly spreading across AZs added cross-AZ data transfer on chatty paths (app↔DB, service mesh), billed per GB each way. Confirm: Cost Explorer → filter on data-transfer usage types for inter-AZ; correlate with your hottest internal paths. Fix: Add gateway VPC endpoints for S3/DynamoDB (makes that traffic free and removes NAT processing), keep the hottest links same-AZ where availability allows, and use private IPs. Do not collapse AZ spread to save transfer — optimize paths, not the AZ count.
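The first three confirm commands above can be folded into one offline check. The following is a minimal sketch, assuming you have saved the `--output json` results of `describe-load-balancers`, `describe-auto-scaling-groups` and `describe-db-instances` and parsed them into dictionaries. The function name and the sample documents are hypothetical; only the field names (`AvailabilityZones`, `ZoneName`, `MultiAZ` and the identifiers) come from the CLI output.

```python
def single_az_findings(albs: dict, asgs: dict, dbs: dict) -> list[str]:
    """Flag tiers pinned to one AZ, given parsed `--output json` documents."""
    findings = []
    for lb in albs.get("LoadBalancers", []):
        zones = {z["ZoneName"] for z in lb.get("AvailabilityZones", [])}
        if len(zones) < 2:
            findings.append(f"ALB {lb['LoadBalancerName']}: {len(zones)} AZ")
    for group in asgs.get("AutoScalingGroups", []):
        zones = set(group.get("AvailabilityZones", []))
        if len(zones) < 2:
            findings.append(f"ASG {group['AutoScalingGroupName']}: {len(zones)} AZ")
    for db in dbs.get("DBInstances", []):
        if not db.get("MultiAZ", False):
            findings.append(f"RDS {db['DBInstanceIdentifier']}: MultiAZ is false")
    return findings


# Hypothetical sample documents, trimmed to the fields the function reads.
sample_albs = {"LoadBalancers": [
    {"LoadBalancerName": "web", "AvailabilityZones": [{"ZoneName": "ap-south-1a"}]},
]}
sample_asgs = {"AutoScalingGroups": [
    {"AutoScalingGroupName": "app",
     "AvailabilityZones": ["ap-south-1a", "ap-south-1b", "ap-south-1c"]},
]}
sample_dbs = {"DBInstances": [
    {"DBInstanceIdentifier": "orders", "MultiAZ": False},
]}

for finding in single_az_findings(sample_albs, sample_asgs, sample_dbs):
    print(finding)
```

An empty result does not prove the stack is resilient (NAT routing and EBS placement are not covered here), but any finding is a tier that shares one failure domain.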

Best practices

Spread every production tier across at least two AZs; use three for anything stateful or quorum-based. Two AZs cannot hold a majority after one fails; three can. Default to three.
Remember the rule: HA is which subnets you list. ALB subnets, ASG VPCZoneIdentifier, and the DB subnet group must each enumerate multiple AZs. A single subnet anywhere is a single point of failure.
Use Multi-AZ RDS or Aurora for every production database. A single-AZ database has no failover; the standby/replicas across AZs are what give you a ~60–120 s recovery instead of a multi-hour outage.
Keep state in managed services that are multi-AZ by default — S3 Standard (≥3 AZ), DynamoDB (≥3 AZ) — and avoid the single-AZ exceptions (S3 One Zone-IA) for anything you can’t reproduce.
Deploy a NAT gateway per AZ with per-AZ route tables in production, so an AZ failure can’t take out egress for the whole VPC.
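Coordinate cross-account placement on AZ IDs, not names. aps1-az1 refers to the same physical AZ in every account; ap-south-1a does not.
Size each AZ for a neighbour’s failure. With two AZs, each must be able to carry the full load alone (2× its steady-state share); with three, each survivor must carry half the load (1.5× its share). Bake the headroom into Auto Scaling minimums.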
Choose the Region on latency, service availability, data residency and cost — never the console default. Measure latency from your real user geographies.
Treat CloudFront/Global Accelerator and Local Zones as latency/DDoS tools, not DR. They do not replace Multi-AZ or multi-Region.
Make Multi-AZ rock-solid before investing in multi-Region. Failing to the DR Region for an AZ incident is an expensive way to handle a small failure.
Monitor at the AZ granularity. Per-AZ target health, the AWS Health Dashboard (formerly the Personal Health Dashboard), and EC2 status checks tell you which AZ is the problem.
Build and rehearse an AZ-failure runbook. Know in advance what drains, what fails over automatically (RDS), and what you’d touch manually.
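The "size each AZ for a neighbour’s failure" rule is plain arithmetic: if one of N AZs disappears, the remaining N − 1 must carry the whole load. A minimal sketch, assuming load is spread evenly and measured in instances; the function names are illustrative, not part of any AWS API.

```python
import math


def per_az_capacity(steady_state_instances: int, az_count: int) -> int:
    """Instances each AZ must run so the survivors carry the full load after one AZ fails."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    return math.ceil(steady_state_instances / (az_count - 1))


def overprovision_factor(az_count: int) -> float:
    """Total provisioned capacity relative to steady-state need: N / (N - 1)."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    return az_count / (az_count - 1)


# A workload that needs 12 instances at steady state:
for azs in (2, 3):
    each = per_az_capacity(12, azs)
    print(f"{azs} AZs: {each} per AZ, {each * azs} total, factor {overprovision_factor(azs)}")
```

With 12 instances of steady-state demand this gives 12 per AZ (24 total) for two AZs and 6 per AZ (18 total) for three, which is where the 2× and 1.5× figures come from.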

The alerts and signals worth wiring before the next AZ event — the leading indicators, not just “site down”:

Watch	Signal / source	What it tells you	Action
Per-AZ target health	ALB `HealthyHostCount` by AZ	An AZ’s targets draining	Confirm AZ event; verify survivors absorb load
AWS health events	AWS Health Dashboard (formerly Personal Health Dashboard)	AWS-reported AZ/Region issues	Match symptoms; avoid blind restarts
RDS failover	RDS events / `AvailabilityZone` change on the instance	Standby promoted in another AZ	Confirm app reconnected to the new active
Egress reachability	NAT gateway metrics / synthetic egress check	An AZ’s NAT down	Confirm per-AZ NAT routing held
Cross-AZ transfer	Cost Explorer inter-AZ usage	Spend creeping on chatty paths	Add endpoints; review hot links
Capacity headroom	ASG desired vs max, per-AZ spread	Whether survivors can absorb load	Raise minimums / max if too tight
Spot interruptions by AZ	Spot interruption notices	Capacity stress in an AZ	Diversify instance types/AZs
Per-AZ error rate	ALB target 5xx by AZ	An AZ degrading before full failure	Pre-emptively drain / investigate
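The "confirm per-AZ NAT routing held" action in the table can also be verified ahead of time from saved CLI output. This is a rough sketch, assuming parsed JSON from `describe-route-tables`, `describe-nat-gateways` and `describe-subnets`; the function name and the sample IDs are hypothetical. It only inspects explicit subnet associations, so subnets that fall back to the VPC’s main route table are not covered.

```python
def cross_az_nat_routes(route_tables: dict, nat_gateways: dict, subnets: dict) -> list[tuple]:
    """Return (subnet, subnet AZ, NAT, NAT AZ) for default routes that leave the subnet's AZ."""
    subnet_az = {s["SubnetId"]: s["AvailabilityZone"] for s in subnets["Subnets"]}
    nat_az = {n["NatGatewayId"]: subnet_az.get(n["SubnetId"])
              for n in nat_gateways["NatGateways"]}
    problems = []
    for table in route_tables["RouteTables"]:
        nat_ids = [r["NatGatewayId"] for r in table.get("Routes", [])
                   if r.get("DestinationCidrBlock") == "0.0.0.0/0" and "NatGatewayId" in r]
        for assoc in table.get("Associations", []):
            subnet_id = assoc.get("SubnetId")
            if subnet_id is None:  # main-table association carries no SubnetId
                continue
            for nat_id in nat_ids:
                if nat_az.get(nat_id) != subnet_az.get(subnet_id):
                    problems.append((subnet_id, subnet_az.get(subnet_id),
                                     nat_id, nat_az.get(nat_id)))
    return problems


# Hypothetical VPC: one NAT in ap-south-1a shared by private subnets in 1a and 1b.
sample_subnets = {"Subnets": [
    {"SubnetId": "subnet-pub-a", "AvailabilityZone": "ap-south-1a"},
    {"SubnetId": "subnet-priv-a", "AvailabilityZone": "ap-south-1a"},
    {"SubnetId": "subnet-priv-b", "AvailabilityZone": "ap-south-1b"},
]}
sample_nats = {"NatGateways": [{"NatGatewayId": "nat-a", "SubnetId": "subnet-pub-a"}]}
sample_tables = {"RouteTables": [{
    "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-a"}],
    "Associations": [{"SubnetId": "subnet-priv-a"}, {"SubnetId": "subnet-priv-b"}],
}]}

print(cross_az_nat_routes(sample_tables, sample_nats, sample_subnets))
```

Every tuple returned is a private subnet whose egress depends on another AZ staying up, which is exactly the condition behind playbook rows 4 and 10.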

Security notes

Resilience and security overlap more than they look — placement decisions are also blast-radius decisions.

Isolate tiers with subnet + security-group design across AZs. Public subnets (ALB, NAT) and private subnets (app, data) per AZ keep the data tier off the internet while staying Multi-AZ; the VPC, subnets and security groups model is the foundation.
Don’t expose the data tier directly. RDS, ElastiCache and internal services belong in private subnets with no route to an IGW; the ALB in public subnets is the only internet-facing hop.
Use VPC endpoints (gateway for S3/DynamoDB, interface for others) so internal traffic to AWS services never traverses the public internet — and as a bonus, gateway endpoints remove the cross-AZ/NAT cost for that traffic.
Encrypt cross-AZ and cross-Region replication. RDS/Aurora replication, S3 CRR and DynamoDB global tables encrypt in transit and at rest with KMS; keep keys scoped per Region (or use multi-Region keys deliberately) so a Region’s compromise doesn’t hand over another’s data.
Scope IAM and resource policies per Region/account where it limits blast radius. A credential leak should not grant cross-Region or cross-account reach by default; align on AZ/Region IDs in conditions when you constrain placement.
For multi-Region, mirror your security controls, not just your data. WAF rules, security groups, GuardDuty, and config rules must exist in the DR Region too, or you fail over into a less-defended environment.
Treat the global control plane realistically. Some global services are anchored in us-east-1; design so a regional control-plane event degrades gracefully rather than hard-failing your global routing.

The placement-as-security controls in one view:

Control	Mechanism	Limits blast radius of	Resilience bonus
Private subnets per AZ	Route tables without IGW	Internet exposure of data tier	Multi-AZ without public exposure
Gateway VPC endpoints	S3/DynamoDB endpoints	Internet path for AWS-service traffic	Removes cross-AZ/NAT cost for that traffic
KMS per-Region keys	Region-scoped CMKs	Cross-Region key compromise	Clean per-Region encryption boundary
IAM Region/account scoping	Condition keys, SCPs	Cross-Region/account credential reach	Forces deliberate multi-Region grants
Mirrored controls in DR	WAF/SG/GuardDuty in both Regions	A weakly-defended failover target	DR Region is production-equivalent

Cost & sizing

The bill drivers and how they interact with resilience:

Compute is unchanged by AZ spread. Three instances across three AZs cost the same as three in one AZ — you pay per instance-hour, not per AZ. There is no compute reason to run single-AZ.
Cross-AZ data transfer is the line item Multi-AZ adds: billed per GB each way on traffic crossing an AZ boundary. On low-traffic apps it’s negligible; on chatty meshes or heavy app↔DB paths it’s real — mitigate with gateway endpoints and same-AZ hot paths, not by collapsing AZ spread.
Multi-AZ RDS doubles the database instance cost (you pay for the synchronous standby), and a Multi-AZ DB cluster runs ~3 instances. This is the genuine cost of automatic database failover — and far cheaper than a multi-hour outage.
NAT gateways per AZ multiply the hourly + per-GB NAT charge. For dev, one NAT gateway is a fine saving; for production, per-AZ NAT is resilient and keeps NAT traffic same-AZ (cheaper transfer).
Multi-Region is a different order of cost — duplicate stacks plus cross-Region replication bandwidth (a higher per-GB rate than cross-AZ). Justify it with RTO/RPO, not instinct.
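To get a feel for the cross-AZ line item before it appears on the bill, a back-of-envelope estimate is enough. The sketch below assumes callers pick targets uniformly across AZs (so a fraction (N − 1)/N of internal traffic crosses an AZ boundary) and takes the per-GB rate as an input. The rate in the example is a placeholder to replace with the current figure from the AWS pricing page for your Region; the function names are illustrative.

```python
def cross_az_fraction(az_count: int) -> float:
    """Share of requests that leave the caller's AZ when targets are picked uniformly."""
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return (az_count - 1) / az_count


def monthly_cross_az_cost(internal_gb: float, az_count: int,
                          rate_per_gb_each_way: float) -> float:
    """Estimate the monthly cross-AZ charge for a given volume of internal traffic."""
    crossing_gb = internal_gb * cross_az_fraction(az_count)
    # Billed per GB in each direction, so every crossing GB is charged twice.
    return crossing_gb * rate_per_gb_each_way * 2


# Placeholder inputs: 3,000 GB/month of internal traffic, 3 AZs, assumed rate 0.01 per GB.
print(round(monthly_cross_az_cost(3000, 3, 0.01), 2))
```

The uniform-spread assumption is the pessimistic case; keeping the hottest paths same-AZ lowers the crossing fraction without reducing the AZ count.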

A rough monthly picture for a small production three-tier app in ap-south-1, single-AZ vs proper Multi-AZ:

Cost driver	Single-AZ (don’t)	Multi-AZ (3 AZ)	Notes
Compute (6× app instances)	~₹X	~₹X (same)	Spread, not multiplied — identical cost
RDS	1× instance	~2× (standby)	The price of automatic failover
NAT gateways	1	3	Per-AZ NAT for resilient egress
Cross-AZ data transfer	~0	small–moderate	Per GB each way; optimize hot paths
ALB	1 (single subnet)	1 (three subnets)	Same ALB, no extra cost for more AZs
EBS volumes	per-AZ, same count	same count	Snapshots are stored at Region level and restore into any AZ
Data egress to internet	same	same	Unchanged by AZ spread
Net delta of going Multi-AZ	—	standby + 2 NAT + transfer	Typically single-digit % of total bill

The headline: Multi-AZ is cheap insurance. For Streamly above it was under 9% of the bill, against a multi-hour outage during peak revenue. The expensive choice is single-AZ, which only defers the cost to your worst day. Multi-Region is where real money appears; spend it only when the recovery objectives demand it (the patterns and their cost/RTO/RPO trade-offs are in AWS Backup and Disaster Recovery Strategies).

Interview & exam questions

1. What is the difference between an AWS Region and an Availability Zone? A Region is a geographic area (e.g. ap-south-1) of fully independent infrastructure with its own copy of regional services. An Availability Zone is one or more physically separate datacenters within a Region, with independent power, cooling and network, connected to the other AZs by low-latency private fibre. You design across AZs to survive a datacenter failure and across Regions to survive an area-wide disaster.

2. Why does a subnet belong to exactly one AZ, and why does that matter? A subnet is bound to a single AZ at creation and can’t span two. It matters because your VPC’s AZ coverage is the set of subnets you create — “spread across AZs” concretely means “a subnet per AZ, with resources in each.” An ALB or ASG listing one subnet is single-AZ no matter what else you do.

3. What does RDS Multi-AZ actually provide, and how fast is failover? Multi-AZ provisions a synchronous standby in a second AZ and fails over to it automatically on an AZ or instance failure, typically in 60–120 seconds (faster for Multi-AZ DB clusters). In the classic instance mode the standby doesn’t serve reads — it’s for availability, not read scaling; use read replicas or a Multi-AZ DB cluster for that.

4. Two AZs or three — how do you decide? Two AZs survive one AZ failure for stateless tiers but cannot hold a majority for quorum systems (1 of 2 is no quorum). Three AZs keep a majority (2 of 3) after one failure and require less spare capacity per AZ (1.5× vs 2×). Default to three for anything stateful or consensus-based; two only for simple stateless tiers.

5. Why are AZ names randomized per account, and when do you use AZ IDs? AWS maps the friendly name (ap-south-1a) to a different physical AZ per account to spread load evenly. The AZ ID (aps1-az1) is stable across accounts. Use AZ IDs when coordinating cross-account placement (shared VPCs, PrivateLink, peering) to land resources in the same physical AZ and avoid cross-AZ charges.

6. A team has an ALB, Auto Scaling and managed RDS but still went fully down in an AZ event. What happened? Almost certainly “fake HA”: the ALB was on one subnet, the ASG’s VPCZoneIdentifier listed one subnet, and RDS was single-AZ — every tier shared one AZ. Confirm with describe-load-balancers, describe-auto-scaling-groups and describe-db-instances. The fix is structural — list multiple AZ subnets everywhere and enable Multi-AZ RDS.

7. How is cross-AZ data transfer billed, and what’s free? Traffic crossing an AZ boundary is billed per GB in each direction. Same-AZ traffic by private IP is free, as is in-Region traffic to S3/DynamoDB via gateway endpoints. ALB cross-zone isn’t separately billed; NLB cross-zone is billed as cross-AZ (why it’s off by default). Optimize the hot paths, but never drop the AZ spread you need for availability.

8. Is scaling out across more AZs the fix for a single-AZ NAT gateway outage? No — the issue is that a single NAT gateway in the failed AZ carried all egress. The fix is one NAT gateway per AZ with per-AZ route tables, so each AZ’s private subnets egress through their own zone’s NAT and one AZ’s failure can’t kill fleet-wide egress.

9. When is multi-Region genuinely warranted, and what comes first? When your RTO/RPO requirements exceed what a single Region can offer, or you must serve a global user base with low latency, or meet data-residency rules in another geography. Multi-AZ must be solid first: multi-Region built on single-AZ tiers forces a full Region failover, with its much longer recovery time, for what should have been a small AZ incident.

10. What’s the difference between an edge location, a Local Zone and a Region for resilience? An edge location (CloudFront/Global Accelerator PoP) is for latency and DDoS, not availability. A Local Zone places compute in a metro tied to a parent Region for low latency — still not a separate DR domain. Only a Region (and AZs within it) is a true resilience boundary. Don’t confuse a CDN with high availability.

11. Which AWS data services are multi-AZ by default, and which need configuring? S3 Standard (≥3 AZ) and DynamoDB (≥3 AZ) are multi-AZ for free, automatically. RDS needs Multi-AZ enabled; ElastiCache needs Multi-AZ turned on; EFS needs a mount target per AZ. The single-AZ exception to watch is S3 One Zone-IA, which deliberately stores in one AZ.

12. How do you choose an AWS Region? On four criteria, roughly in order: latency to your users (measure it), service availability (not every service is in every Region), data residency / compliance (legal constraints on where data lives), and cost (per-unit prices vary by Region). Never default to us-east-1 just because the console opens there.
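The majority arithmetic behind question 4 is easy to check directly. A minimal sketch, assuming a consensus system that needs a strict majority of its voting members and an AZ failure that removes every member in that AZ; the function names are illustrative.

```python
def has_quorum(total_voters: int, surviving_voters: int) -> bool:
    """A strict majority of the original membership must still be reachable."""
    return surviving_voters > total_voters // 2


def survives_any_single_az_loss(voters_per_az: list[int]) -> bool:
    """True if quorum holds no matter which one AZ is lost."""
    if not voters_per_az:
        raise ValueError("need at least one AZ")
    total = sum(voters_per_az)
    return all(has_quorum(total, total - lost) for lost in voters_per_az)


for layout in ([1, 1], [2, 1], [1, 1, 1], [2, 2, 1]):
    print(layout, survives_any_single_az_loss(layout))
```

Note that no two-AZ layout passes: whichever AZ holds at least half the voters takes the majority with it when it fails, which is why adding members to a two-AZ cluster does not help and a third AZ does.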

These map cleanly to two certifications. For AWS Certified Cloud Practitioner (CLF-C02): global infrastructure, Regions/AZs/edge, the shared responsibility model and the reliability pillar. For Solutions Architect Associate (SAA-C03): designing resilient multi-AZ and multi-Region architectures, choosing Regions, and Multi-AZ data services. The operational confirm-and-fix material maps to SysOps Associate (SOA-C02). A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Region vs AZ vs edge, global infrastructure	CLF-C02	Cloud concepts; global infrastructure
Multi-AZ design, choosing a Region	SAA-C03	Design resilient architectures
Multi-AZ data services (RDS/Aurora/DynamoDB/S3)	SAA-C03	Design high-availability/storage solutions
Multi-Region DR patterns (pilot light/warm standby)	SAA-C03 / SAP-C02	Design for reliability and DR
AZ-failure diagnosis, per-AZ monitoring, NAT routing	SOA-C02	Reliability and business continuity; networking
Cross-AZ cost, data-transfer optimization	SAA-C03	Cost-optimized architectures

Quick check

You have an ALB, an Auto Scaling group and managed RDS, yet a single AZ event took the whole application down. Name the most likely root cause and the one command that confirms it for the ALB.
True or false: running three EC2 instances spread across three AZs costs significantly more than running three in one AZ.
Why do quorum-based systems (etcd, ZooKeeper, many distributed databases) need three AZs rather than two?
Your application loses all outbound internet access — across every AZ — when one AZ fails. What’s the cause and the fix?
Two AWS accounts both want resources in “the same AZ” to avoid cross-AZ charges, but they keep landing in different physical zones. What are they doing wrong?

Answers

Fake HA — the ALB, ASG and/or RDS are each pinned to a single AZ (one subnet). Confirm the ALB with aws elbv2 describe-load-balancers --names <alb> --query "LoadBalancers[].AvailabilityZones[].ZoneName"; one zone means single-AZ. The fix is to attach the ALB to subnets in ≥2 (ideally 3) AZs, list multiple subnets on the ASG, and enable Multi-AZ RDS.
False. You pay per instance-hour, not per AZ — three instances cost the same whether they’re in one AZ or spread across three. The only Multi-AZ costs are cross-AZ data transfer, a paid RDS standby, and (optionally) a NAT gateway per AZ; compute itself is unchanged.
Quorum needs a majority to make consistent decisions. With two AZs, losing one leaves 1 of 2 — no majority, so the cluster stops accepting writes (or goes read-only). With three AZs, losing one leaves 2 of 3 — a majority — so it keeps operating. Three AZs preserve quorum through a single AZ failure.
A single NAT gateway in the failed AZ carried egress for every private subnet, so its AZ’s failure killed fleet-wide outbound internet. Fix: deploy one NAT gateway per AZ and give each AZ’s private subnet its own route table pointing at its own zone’s NAT.
They’re aligning on the AZ name (ap-south-1a), which AWS maps to a different physical AZ in each account. They should coordinate on the stable AZ ID (aps1-az1), visible via aws ec2 describe-availability-zones --query "AvailabilityZones[].{Name:ZoneName,Id:ZoneId}".
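The last answer can be turned into a small lookup: given the `describe-availability-zones` output from two accounts, find which zone name in the other account is the same physical AZ as yours. This is a sketch with a hypothetical function name and made-up sample mappings; only the `ZoneName` and `ZoneId` fields are taken from the CLI output.

```python
from typing import Optional


def matching_zone_name(my_zones: dict, their_zones: dict, my_zone_name: str) -> Optional[str]:
    """Translate one of my AZ names into the other account's name for the same AZ ID."""
    my_ids = {z["ZoneName"]: z["ZoneId"] for z in my_zones["AvailabilityZones"]}
    their_names = {z["ZoneId"]: z["ZoneName"] for z in their_zones["AvailabilityZones"]}
    zone_id = my_ids.get(my_zone_name)
    return their_names.get(zone_id) if zone_id is not None else None


# Illustrative mappings only; real accounts will differ.
account_a = {"AvailabilityZones": [
    {"ZoneName": "ap-south-1a", "ZoneId": "aps1-az1"},
    {"ZoneName": "ap-south-1b", "ZoneId": "aps1-az3"},
]}
account_b = {"AvailabilityZones": [
    {"ZoneName": "ap-south-1a", "ZoneId": "aps1-az3"},
    {"ZoneName": "ap-south-1b", "ZoneId": "aps1-az1"},
]}

print(matching_zone_name(account_a, account_b, "ap-south-1a"))
```

In this made-up example account A’s ap-south-1a corresponds to account B’s ap-south-1b, so agreeing on the name alone would have placed the two resources in different buildings.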

Glossary

Region — a named geographic area (e.g. ap-south-1) of fully independent AWS infrastructure, containing multiple Availability Zones and a local copy of regional services.
Availability Zone (AZ) — one or more physically separate datacenters within a Region, each with independent power, cooling, network and physical security; the failure domain you spread across for HA.
AZ ID — the stable, cross-account identifier for a physical AZ (e.g. aps1-az1), as opposed to the per-account-randomized friendly name (ap-south-1a).
Subnet — a CIDR range bound to exactly one AZ; “spreading across AZs” means creating a subnet per AZ.
Multi-AZ — running a workload’s tiers across two or more AZs in one Region so a single AZ failure is survived.
Multi-Region — running across two or more Regions to survive a whole-Region disaster or serve global users with low latency.
Quorum — the majority a consensus system needs to make consistent decisions; why three AZs beat two for such systems.
Cross-zone load balancing — a load-balancer setting (on for ALB, off for NLB by default) that lets it route to healthy targets in any AZ, smoothing uneven capacity.
AZ Rebalance — Auto Scaling behaviour that re-spreads instances across AZs after a failed AZ recovers.
NAT gateway — a zonal (single-AZ) managed NAT for outbound internet from private subnets; production needs one per AZ to keep egress resilient.
Edge location / PoP — a CloudFront / Global Accelerator point of presence for CDN caching, TLS termination near users, and L3/4 DDoS absorption — a latency/security tool, not DR.
Local Zone — AWS compute/storage placed in a specific metro and tied to a parent Region for single-digit-millisecond latency.
Wavelength Zone — AWS compute embedded inside a telco’s 5G network for ultra-low-latency mobile/edge applications.
DB subnet group — the set of subnets (which must span ≥2 AZs) an RDS instance can place its primary and standby in; required for Multi-AZ.
RTO / RPO — Recovery Time Objective (how fast you must recover) and Recovery Point Objective (how much data loss is tolerable); these drive whether and how you go multi-Region.
Pilot light / warm standby / active-active — multi-Region DR patterns of increasing cost and decreasing RTO, from a minimal warm core to a fully live second site.

Next steps

You can now place every tier across AZs correctly and know exactly what each layer survives. Build outward:

Next: Amazon VPC, Subnets and Security Groups Explained — the per-AZ subnet, route-table and NAT design that is your Multi-AZ wiring.
Related: Amazon RDS vs DynamoDB vs Aurora Compared — go deep on how each data service handles Multi-AZ, failover, and cross-Region replication.
Related: AWS Backup and Disaster Recovery Strategies — the cross-Region story: backup & restore, pilot light, warm standby, RTO/RPO, and Route 53 cutover.
Related: AWS Compute: EC2 vs Lambda vs ECS vs EKS — what you’re actually spreading across AZs, and how each compute model handles AZ placement.
Related: Amazon S3 Storage Classes and Lifecycle — durability and the One Zone-IA single-AZ exception to use deliberately.
Related: ALB vs NLB vs API Gateway Compared — the load-balancing front door, cross-zone behaviour, and where each fits in a Multi-AZ design.