AWS Fundamentals

AWS Regions and Availability Zones: Resiliency from the Ground Up

Quick take: An AWS Region is a geographic area (like ap-south-1); an Availability Zone is one or more physically separate datacenters inside that Region with independent power, cooling and network. Spreading a workload across AZs survives a single datacenter failure for almost no extra money. Spreading across Regions survives a whole-area disaster — at real cost and complexity. Do the first before you pay for the second.

A media startup ran its entire platform in one AWS Region, inside a single Availability Zone, to keep the bill down. A power event took that AZ offline and the site was dark for four hours. The post-mortem was brutal in its simplicity: the application was stateless, the database was a managed service, and the whole thing could have spread across three AZs for almost the same money — the compute capacity would have been identical, just placed in three buildings instead of one. Nobody had told them that an AZ is an independent failure domain, that a subnet lives in exactly one AZ, or that an Auto Scaling group listing one subnet is a single-AZ deployment wearing a multi-AZ costume.

This article is the reference that prevents that outage. We treat Regions and AZs not as trivia for the cloud-practitioner exam but as the resiliency hierarchy every production design rests on: pick the right failure domains, place each tier across them correctly, and know precisely which failures each layer does and does not survive. You will learn what a Region, an AZ, an AZ ID, a Local Zone, a Wavelength Zone and an edge location actually are; how Multi-AZ works for EC2, ALB, RDS, Aurora, EFS, S3 and DynamoDB; how an AZ failure is detected and drained; what cross-AZ traffic costs you; and the handful of misconfigurations that quietly turn “highly available” into “single point of failure.” Every concept comes with the exact aws CLI to confirm it and the CloudFormation/Terraform to set it, and — because this is operational — a symptom → cause → confirm → fix playbook you can open mid-incident.

By the end you will stop guessing about placement. You will know why three AZs beats two for quorum systems, why ap-south-1a in your account may be a different physical building than in mine, why a 60-second RDS failover is normal, what a NAT gateway per AZ saves you when an AZ dies, and when multi-Region is genuinely warranted versus cargo-culted. Resiliency is a hierarchy: master Multi-AZ first, add multi-Region only when the business case is real.

What problem this solves

Every datacenter fails eventually — power, cooling, a network device, a fibre cut, a bad deploy of the facility’s own software. If your whole application lives in one building, that building’s worst day is your worst day. Regions and AZs exist to give you a menu of failure domains so a single failure stops being a single point of failure.

What breaks without this knowledge is depressingly common. A team launches in the default AZ the console happened to pick, scales “out” by adding instances that all land in that same AZ, and ships to production believing the load balancer makes them resilient. Then one AZ degrades: the ALB has no healthy targets, the single-AZ RDS has no standby to promote, the NAT gateway that all egress depended on is gone, and the “highly available” system is down hard. The fix was free and structural — list three subnets instead of one — but nobody designed for the failure domain because nobody understood it was there.

The pain shows up in three distinct shapes, and the whole article maps to them:

Who hits this: essentially everyone running anything on AWS. It bites hardest on teams who came from a single on-prem datacenter (where “the server room” was one failure domain and that was just life), cost-sensitive startups who read “single AZ is cheaper” and missed that multi-AZ compute is usually the same price, and anyone who built their VPC by clicking “next” without noticing the subnet-to-AZ mapping. The remedy is rarely “spend more” — it is “place what you already pay for across the failure domains that already exist.”

To frame the whole field before the deep dive, here is the resiliency hierarchy as a single table — each tier, the failure it survives, the rough cost delta, and the one thing people get wrong:

| Failure domain | What it survives | What it does NOT survive | Typical cost delta | The classic mistake |
| --- | --- | --- | --- | --- |
| Single AZ (1 datacenter) | Nothing: it is the blast radius | Any AZ event | Baseline | Running prod here “to save money” |
| Multi-AZ (2 AZs) | One AZ failure | Two-AZ or Region event; loses quorum | ~0% on compute; data-transfer + standby | Two AZs for a 3-node quorum system |
| Multi-AZ (3 AZs) | One AZ failure with quorum intact | Region event | ~0% on compute; more cross-AZ transfer | NAT gateway in one AZ only |
| Multi-Region (active/passive) | A whole-Region disaster | Global control-plane edge cases | High (duplicate stacks + replication) | Building this before Multi-AZ is solid |
| Multi-Region (active/active) | Region disaster + serves global users | Data-consistency complexity you now own | Highest | Underestimating split-brain/conflict handling |
| Edge (CloudFront / GA / Local Zones) | Latency for far users; absorbs L3/4 DDoS | It is not a DR strategy | Per-GB / per-hour | Treating a CDN as availability |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with the AWS console and aws CLI, understand that a VPC is your private network in a Region and that it is carved into subnets, and know roughly what EC2, an Auto Scaling group (ASG), an Application Load Balancer (ALB) and RDS are. Familiarity with basic networking (CIDR, route tables) and HTTP health checks helps. You do not need prior HA experience — building it correctly is what this article teaches.

This is a foundations piece in the AWS fundamentals track and sits upstream of almost everything else. The networking detail it assumes is covered in Amazon VPC, Subnets and Security Groups Explained; that is where subnets, route tables and the per-AZ structure live. The data-tier resiliency it references in depth is in Amazon RDS vs DynamoDB vs Aurora Compared. The cross-Region story it points at is the subject of AWS Backup and Disaster Recovery Strategies. The compute choices that determine what you are spreading across AZs are in AWS Compute: EC2 vs Lambda vs ECS vs EKS. Where this article ends (“you survived the AZ, now what about the Region?”), those articles pick up.

A quick map of who owns what during a placement decision or an incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure class it can cause |
| --- | --- | --- | --- |
| Global edge (Route 53, CloudFront) | DNS routing, CDN, anycast | Frontend / SRE | Misrouting, stale TTL; not an AZ outage |
| Region selection | Latency, residency, service set | Architect / compliance | Wrong Region → latency or legal exposure |
| VPC / subnets | CIDR, AZ-to-subnet mapping | Network team | Single-AZ subnet design → fake HA |
| Load balancing | ALB/NLB subnet attachment, cross-zone | Platform / network | Single-subnet LB → blackhole on AZ loss |
| Compute | ASG AZ list, instance distribution | App / platform | ASG pinned to one AZ → outage |
| Data tier | RDS Multi-AZ, Aurora, replication | DBA / platform | Single-AZ DB → no failover |
| Egress | NAT gateway per AZ, IGW | Network team | Single-AZ NAT → egress dead on AZ loss |

Core concepts

Five mental models make every later decision obvious.

Before the five models, fix one distinction that trips up every beginner: every AWS resource has a scope — global, regional, or zonal (AZ-bound) — and that scope decides what fails with what. Knowing a resource’s scope tells you instantly whether an AZ event can touch it:

| Scope | What it means | Examples | What an AZ failure does to it |
| --- | --- | --- | --- |
| Global | Exists outside any single Region | IAM, Route 53, CloudFront, Organizations, WAF (for CloudFront) | Unaffected by an AZ (or single-Region) event |
| Regional | Lives in one Region, spans its AZs | S3, DynamoDB, SQS, SNS, Lambda, ECR, ELB (the service) | Survives one AZ: the service spreads across them |
| Zonal (AZ-bound) | Pinned to a single AZ | EC2 instance, EBS volume, subnet, NAT gateway, RDS instance | Dies with its AZ unless you placed peers elsewhere |

The reading: you achieve HA by taking zonal resources and deploying copies across multiple AZs; regional services do that for you; global services are above the whole concern. Now the five models.

A Region is a geography; an AZ is a building (or buildings) you can fail independently. A Region is a named geographic area — ap-south-1 (Mumbai), us-east-1 (N. Virginia), eu-west-1 (Ireland) — each a fully independent island with its own copy of regional services. Inside a Region are Availability Zones: distinct locations, each one or more physically separate datacenters with independent power, cooling, physical security and network, far enough apart that a fire/flood/power event in one will not take another, yet close enough (typically within ~100 km, single-digit-millisecond latency) that synchronous replication between them is practical. Most Regions have three or more AZs; some have four to six. AZs are interconnected by high-bandwidth, low-latency private fibre — that link is what makes Multi-AZ synchronous databases feasible.

A subnet lives in exactly one AZ — this is the rule that governs all VPC HA. When you create a subnet you choose its AZ, and it never spans two. Therefore “spread across AZs” concretely means “create a subnet in each AZ and place resources in each subnet.” An ALB attached to one subnet is single-AZ. An ASG listing one subnet is single-AZ. A NAT gateway lives in one subnet, hence one AZ. Every resilience decision in a VPC reduces to which subnets (and therefore which AZs) did I list?

AZ names are randomized per account; AZ IDs are stable. The friendly name ap-south-1a can map to a different physical AZ in different accounts: AWS shuffles the name-to-hardware mapping so load spreads evenly and so two accounts don’t both pile onto “the first one.” The stable identifier is the AZ ID (e.g. aps1-az1 in ap-south-1), which refers to the same physical location across every account. This matters when you share resources across accounts (a shared VPC, a cross-account peering, a PrivateLink) and need them in the same physical AZ to avoid cross-AZ charges; you coordinate on the AZ ID, not the name.

Failure domains nest, and you choose how deep to go. A single instance can fail (host issue). A single AZ can fail (datacenter event). A single Region can fail (rare, area-wide). A global edge service can have a control-plane issue (rarer still). Each level up costs more to defend and removes a larger class of failure. The discipline is to defend the level your risk and budget justify — always Multi-AZ for production, multi-Region only when RTO/RPO demands it — and to know what you have not defended, rather than discover it during an incident.

Multi-AZ is mostly free on compute and cheap on data; multi-Region is expensive. Running three EC2 instances across three AZs costs the same as three in one AZ — you pay per instance, not per AZ. The Multi-AZ costs are subtler: cross-AZ data transfer (billed per GB each way), a standby for Multi-AZ RDS (you pay for the standby instance), and a NAT gateway per AZ if you want egress to survive an AZ loss. Multi-Region, by contrast, duplicates whole stacks and adds cross-Region replication bandwidth — a different order of cost. This asymmetry is the entire reason “Multi-AZ first” is the rule.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Failure domain | Why it matters |
| --- | --- | --- | --- |
| Region | A geographic area of independent infrastructure | Whole area | Latency, residency, service availability, cost |
| Availability Zone (AZ) | 1+ physically separate datacenters in a Region | One datacenter cluster | The unit you spread across for HA |
| AZ ID | Stable cross-account ID for a physical AZ (aps1-az1) | n/a | Align shared resources to the same physical AZ |
| Subnet | A CIDR range bound to exactly one AZ | One AZ | “Spread across AZs” = a subnet per AZ |
| Local Zone | AWS compute placed in a metro, tied to a parent Region | Metro extension | Single-digit-ms latency to a specific city |
| Wavelength Zone | Compute embedded in a telco 5G network | Carrier edge | Ultra-low latency for mobile/5G apps |
| Edge location / PoP | CloudFront / Global Accelerator point of presence | Global anycast | CDN caching, DDoS absorption, fast TLS |
| Multi-AZ | Resources running in 2+ AZs in one Region | Survives 1 AZ | The baseline for production HA |
| Multi-Region | Resources running in 2+ Regions | Survives 1 Region | DR and global low-latency serving |
| Quorum | Majority needed for a consistent decision | Spans AZs | Why 3 AZs beats 2 for consensus systems |
| Cross-AZ data transfer | Bytes moving between AZs | n/a | Billed per GB each way; a real cost line |

The beliefs that cause outages — a decision table

Most single-AZ disasters trace to a handful of wrong beliefs. Map what you think you have to what you actually have, and what to do:

| If you believe… | It’s actually… | Do this |
| --- | --- | --- |
| “I have an ALB, so I’m highly available” | True only if it’s attached to subnets in ≥2 AZs | Run describe-load-balancers; confirm multiple AZs |
| “Auto Scaling makes me multi-AZ” | True only if VPCZoneIdentifier lists subnets in multiple AZs | List one subnet per AZ; set min ≥ 2–3 |
| “Managed RDS is automatically resilient” | False unless MultiAZ=true | Enable Multi-AZ; subnet group spans ≥2 AZs |
| “Single AZ is cheaper” | Compute is the same price; you only save a standby/NAT | Spread compute (free); pay only for real HA costs |
| “Two AZs is enough for everything” | False for quorum systems (no majority after one loss) | Use three AZs for stateful/consensus tiers |
| “My CDN makes me available” | False: CloudFront is latency/DDoS, not DR | Fix the origin’s AZ spread |
| “ap-south-1a is the same AZ everywhere” | False: names are per-account randomized | Coordinate on the AZ ID |
| “Multi-Region is the responsible default” | Premature if Multi-AZ isn’t solid first | Nail Multi-AZ, then justify multi-Region by RTO/RPO |

Regions: what they are and how to choose one

A Region is the largest unit of isolation AWS gives you and the first decision you make. Regions are fully independent: ap-south-1 and eu-west-1 share no failure domain, and most regional services (EC2, RDS, SQS, etc.) are scoped to one Region; a resource in one Region is invisible to another unless you explicitly replicate. A handful of services are global (IAM, Route 53, CloudFront, WAF for CloudFront, Organizations) and exist outside any single Region.

List the Regions enabled for your account and inspect one:

# Regions enabled for your account (add --all-regions to include opt-in Regions you have not enabled)
aws ec2 describe-regions \
  --query "Regions[].{Region:RegionName, Endpoint:Endpoint, OptIn:OptInStatus}" \
  --output table

# How many AZs does a Region have, and what are their stable IDs?
aws ec2 describe-availability-zones --region ap-south-1 \
  --query "AvailabilityZones[].{Name:ZoneName, Id:ZoneId, State:State}" \
  --output table

Choosing a Region — the four real criteria

You choose a Region on four axes, roughly in this priority order. Never default to whatever the console shows (often us-east-1):

| Criterion | Why it matters | How to evaluate | Common trap |
| --- | --- | --- | --- |
| Latency to users | Round-trip time dominates UX | Measure from user geographies; closer Region wins | Picking us-east-1 for an all-India user base |
| Service availability | Not every service/feature is in every Region | Check the AWS regional services list before committing | Designing for a service the Region lacks |
| Data residency / compliance | Law may require data stay in-country | Map regulatory requirement to Region geography | Storing regulated PII in the wrong jurisdiction |
| Cost | Per-unit prices vary by Region | Compare the same SKU across candidate Regions | Assuming all Regions cost the same |

A worked comparison makes the four criteria concrete. Suppose Streamly (an India-first product) is choosing among three Regions for its primary stack — the grid that drives the call:

| Candidate Region | Latency to Indian users | AZ count | Data residency fit | Relative cost | Verdict |
| --- | --- | --- | --- | --- | --- |
| ap-south-1 (Mumbai) | Lowest (in-country) | 3 | In-India (meets local rules) | Baseline | Chosen: latency + residency |
| ap-southeast-1 (Singapore) | Higher (sea hop) | 3 | Out-of-country | Slightly higher | DR Region candidate |
| us-east-1 (N. Virginia) | Highest (intercontinental) | 6 | Out-of-country | Often cheapest | Rejected for primary (latency/residency) |

The lesson the grid teaches: us-east-1 being cheapest and having the most AZs does not make it right — latency to the actual user base and data-residency law decided it. Pick the Region for your users, not the console default.

Region types and their quirks

Not all Regions behave identically. Some are opt-in (disabled by default; you must enable them and they have separate STS endpoints), some are isolated for government/sovereign use, and us-east-1 has a special role as the home of some global control planes:

| Region type | Examples | Enabled by default? | Notable quirk |
| --- | --- | --- | --- |
| Standard (commercial) | ap-south-1, eu-west-1, us-west-2 | Yes (most) | The normal case |
| Opt-in | newer Regions (e.g. some EU/ME/AF Regions) | No, must enable | Separate STS endpoint; IAM must be configured |
| us-east-1 (special) | N. Virginia | Yes | Home of IAM/Route53/CloudFront control planes; some global ops only here |
| GovCloud | us-gov-east-1, us-gov-west-1 | No (separate accounts) | Physically/logically isolated; US-person access controls |
| China | cn-north-1, cn-northwest-1 | No (separate partition) | Operated by local partners; separate accounts/credentials |
| Wavelength (region-attached) | carrier 5G zones | No, must opt in | Compute inside a telco network; ultra-low latency |
| Local Zones (region-attached) | us-west-2-lax-1a etc. | No, must enable | Metro compute tied to a parent Region |

The practical distinctions that save time:

| Distinction | The trap | How to tell them apart |
| --- | --- | --- |
| Region vs AZ | Treating “Mumbai” as a single failure domain | A Region contains multiple independent AZs; design across the AZs |
| Regional vs global service | Expecting EC2 in us-east-1 to appear in eu-west-1 | Regional services are Region-scoped; only a handful (IAM, Route 53, CloudFront, Organizations) are global |
| us-east-1 outage scope | Assuming a us-east-1 blip is “just one Region” | Some global control planes are anchored there; design global services with that in mind |
| Opt-in Region surprises | API calls fail with auth errors in a new Region | Opt-in Regions need enabling and a regional STS endpoint |

Availability Zones: the failure domain that does the work

An Availability Zone is the unit you actually engineer around. Each AZ is isolated — its own power, cooling, networking and physical security — so a failure in one is contained. AZs in a Region are connected by redundant, low-latency private fibre (typically <1–2 ms between them), which is what lets a database in AZ-a synchronously commit to a standby in AZ-b without crippling write latency. AWS designs Regions so that AZs are meaningfully far apart (different flood plains, power grids, often kilometres of separation) while staying close enough for synchronous replication.

Why a subnet is one AZ — and what that forces

Because a subnet is bound to exactly one AZ, your VPC’s AZ coverage is literally the set of subnets you create. The canonical production VPC has, per AZ, a public subnet (for the ALB and NAT gateway) and one or more private subnets (for compute and data). To go Multi-AZ you replicate that subnet set into each AZ you want to use:

# Subnets in a VPC and the AZ each one lives in — your AZ coverage at a glance
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-0abc123" \
  --query "Subnets[].{Subnet:SubnetId, AZ:AvailabilityZone, AZId:AvailabilityZoneId, CIDR:CidrBlock, Public:MapPublicIpOnLaunch}" \
  --output table

# Terraform: a subnet per AZ, driven off the Region's AZ list (the idiomatic Multi-AZ VPC)
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  tags = { Name = "private-${data.aws_availability_zones.available.names[count.index]}" }
}

Two AZs or three? Decide on quorum and capacity, not habit

The two-vs-three question has real answers. Two AZs survives one AZ failure for stateless tiers and is the minimum for production. But quorum systems — anything using a majority vote for consistency (etcd/control planes, ZooKeeper, Aurora’s storage, many distributed databases) — need three AZs so that losing one still leaves a majority (2 of 3) and the cluster keeps writing. There is also a capacity argument: when one of two AZs fails, the survivor must absorb 100% of load (each AZ must be sized for 2×); with three AZs, a single failure shifts load to two survivors (each sized for 1.5×), which is cheaper headroom.

| Factor | 2 AZs | 3 AZs |
| --- | --- | --- |
| Survives one AZ failure | Yes (stateless) | Yes |
| Quorum after one AZ loss | No (1 of 2 = no majority) | Yes (2 of 3 = majority) |
| Spare capacity each AZ must hold | 100% (size for 2×) | 50% (size for 1.5×) |
| Cross-AZ data transfer | Lower | Slightly higher |
| Cost of idle headroom | Higher per AZ | Lower per AZ |
| Recommended for | Simple stateless web/app tiers | Quorum systems, critical prod, most real workloads |

The rule: default to three AZs for anything stateful or critical; two is acceptable only for simple stateless tiers where you’ve accepted the 2× sizing cost.
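
The quorum and headroom arithmetic behind that rule can be sketched in a few lines of shell. This is a minimal illustration, assuming one voter per AZ and load spread evenly across AZs; the function names are invented for this example and are not part of any AWS tool.

```shell
# Majority needed for a quorum of n voters: floor(n/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# With one voter per AZ, does the cluster still hold a majority after losing one AZ?
survives_one_az_loss() {
  local azs=$1
  if [ $(( azs - 1 )) -ge "$(quorum_size "$azs")" ]; then
    echo yes
  else
    echo no
  fi
}

# Factor by which each AZ must be sized so the survivors carry the full
# load after one AZ fails: n / (n - 1).
sizing_factor() {
  LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.2f\n", n / (n - 1) }'
}

for azs in 2 3; do
  echo "AZs=$azs quorum=$(quorum_size "$azs") keeps_quorum=$(survives_one_az_loss "$azs") size_each_az_for=$(sizing_factor "$azs")x"
done
```

Run as written, the loop reports that two AZs lose quorum and need 2.00x sizing, while three AZs keep quorum and need 1.50x, which matches the table above.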

What actually happens, second by second, when an AZ fails

For a correctly built Multi-AZ stack, an AZ loss is a sequence of automatic events, not a manual scramble. Knowing the timeline tells you what to expect (and what not to touch):

| ~Time after failure | What happens | Which mechanism | Your action |
| --- | --- | --- | --- |
| 0 s | An AZ’s power/cooling/network faults; its hosts stop responding | (the event) | None; don’t restart blind |
| 0–30 s | ALB health checks start failing for that AZ’s targets | ELB health checks | Watch per-AZ HealthyHostCount |
| ~30 s | ALB stops routing to the dead AZ; serves from survivors | ALB target draining | Confirm survivors absorb load |
| 60–120 s | RDS Multi-AZ promotes the standby in another AZ; DNS endpoint flips | RDS automatic failover | Confirm app reconnected (it retries) |
| 1–3 min | ASG marks the AZ’s instances unhealthy, launches replacements in healthy AZs | Auto Scaling + ELB health | Verify capacity recovers |
| Minutes–hours | AWS recovers the AZ | (AWS) | Watch the AWS Health Dashboard (formerly Personal Health Dashboard) |
| On recovery | ASG AZ Rebalance re-spreads instances back across all AZs | AZ Rebalance | Watch for brief extra instances |

The single most important row is the first: on a correctly designed stack there is nothing to fix during the event — the platform drains, fails over and replaces automatically. Blind restarts and panic scaling (as Streamly’s first incident showed) only help when the design was wrong to begin with.

AZ IDs and cross-account alignment

When two accounts share infrastructure, the friendly AZ names lie: ap-south-1a can be a different building in each. Align on the AZ ID (for example aps1-az1). The common case is a shared/peered VPC or a PrivateLink endpoint you want to keep same-AZ with the producer to avoid cross-AZ data-transfer charges:

# Map names to stable IDs in THIS account — coordinate cross-account on the Id, not the Name
aws ec2 describe-availability-zones --region ap-south-1 \
  --query "AvailabilityZones[].{Name:ZoneName, Id:ZoneId}" --output table
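
To make the name-versus-ID point concrete, here is a small sketch that looks up which zone name corresponds to a given AZ ID in each account. It assumes you have saved each account's mapping as "name id" lines (for example from the command above with --output text and a list-style query); the helper name and the two sample mappings are hypothetical and purely illustrative, not real account output.

```shell
# az_name_for_id MAPPING ZONE_ID
# MAPPING is text with one "zone-name zone-id" pair per line.
# Prints the zone name that maps to ZONE_ID in that account.
az_name_for_id() {
  printf '%s\n' "$1" | awk -v id="$2" '$2 == id { print $1 }'
}

# Illustrative mappings for two accounts (made up for this example).
account_a='ap-south-1a aps1-az1
ap-south-1b aps1-az3
ap-south-1c aps1-az2'

account_b='ap-south-1a aps1-az3
ap-south-1b aps1-az1
ap-south-1c aps1-az2'

echo "account A places aps1-az1 at: $(az_name_for_id "$account_a" aps1-az1)"
echo "account B places aps1-az1 at: $(az_name_for_id "$account_b" aps1-az1)"
```

With these sample mappings the same physical AZ is ap-south-1a in account A and ap-south-1b in account B, which is why the two teams should agree on the ID and let each side translate it to its own name.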

The settings that govern AZ behaviour, end to end:

| Setting / control | What it does | Default | When to change | Gotcha |
| --- | --- | --- | --- | --- |
| Subnet AZ | Binds a subnet to one AZ | Chosen at create | One subnet per AZ you use | Cannot be changed after creation |
| AZ ID vs name | Stable physical ID vs per-account name | Name shown in console | Cross-account same-AZ alignment | Names are randomized per account |
| ALB subnets | Which AZs the ALB serves | You choose ≥2 | Always list ≥2 (ideally 3) | One subnet = single-AZ ALB |
| Cross-zone load balancing | LB spreads to targets in all AZs | On for ALB, off for NLB | Turn on for NLB to balance evenly | NLB cross-zone billed as cross-AZ transfer |
| ASG subnets/AZs | Which AZs compute spreads across | You list them | Always list ≥2–3 | One subnet = single-AZ ASG |
| ASG AZ Rebalance | Re-spreads capacity after an AZ recovers | On (managed) | Leave on | Brief extra instances during rebalance |

The numbers and quotas that bound AZ design

Real limits shape what you can build across AZs. The figures that matter (defaults; many are raisable via Service Quotas, some are hard):

| Limit / quota | Typical value | Raisable? | Why it matters for AZ design |
| --- | --- | --- | --- |
| AZs per Region | 3–6 (varies; some have 3, a few up to 6) | No (physical) | Caps how wide you can spread in one Region |
| Subnets per VPC | 200 (default) | Yes | Plenty for a subnet-per-AZ-per-tier design |
| VPCs per Region | 5 (default) | Yes | Multi-VPC architectures need an increase |
| NAT gateways per AZ | 5 (default) | Yes | One-per-AZ for HA is well within this |
| Elastic IPs per Region | 5 (default) | Yes | Per-AZ NAT each needs an EIP |
| Route tables per VPC | 200 (default) | Yes | Per-AZ route tables are cheap on this budget |
| RDS DB instances per Region | 40 (default) | Yes | Multi-AZ standby counts toward usage |
| Cross-AZ latency | <1–2 ms typical | No (physics) | Why synchronous Multi-AZ replication works |
| S3 / DynamoDB durability | 11 nines for S3; both span ≥3 AZs | No (design) | The free multi-AZ baseline you build on |
| RDS Multi-AZ failover | 60–120 s (instance mode) | No | Budget this into RTO and client retries |
| Instance boot for ASG replacement | 1–3 min typical | No | Survivor capacity must cover the gap |

The takeaways: the platform limits almost never constrain a sane Multi-AZ design (subnets, route tables and NAT quotas are generous), but the physical ones do — you cannot have more AZs than the Region offers, and you cannot make an RDS failover instantaneous. Design within the physics, raise the soft quotas as needed.

Multi-AZ for each tier: what it actually buys

“Multi-AZ” means something slightly different for every service. Knowing exactly what each one gives you — automatic or not, synchronous or not, free or not — is the difference between a design that survives an AZ loss and one that merely looks like it does.

Load balancers — the front door must face every AZ

An ALB/NLB is a regional, AZ-aware service, but only for the AZs whose subnets you attach. An NLB can be attached to a single subnet, in which case an AZ blip blackholes all inbound traffic. An ALB requires subnets in at least two AZs at creation, but that alone does not save you: if every healthy target sits in one AZ, losing that AZ still takes the site down. Attach the load balancer to subnets in every AZ your targets live in and it routes around a dead AZ automatically via target health checks. Cross-zone load balancing (on by default for ALB, off for NLB) lets the LB send traffic to healthy targets in any AZ, smoothing load when AZs hold uneven capacity.

# Confirm the load balancer faces multiple AZs — one entry here is a single point of failure
aws elbv2 describe-load-balancers --names web-alb \
  --query "LoadBalancers[].{Name:LoadBalancerName, Scheme:Scheme, AZs:AvailabilityZones[].ZoneName}" \
  --output json

# CloudFormation: an ALB explicitly across three subnets in three AZs
WebALB:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Type: application
    Scheme: internet-facing
    Subnets:
      - !Ref PublicSubnetAZa
      - !Ref PublicSubnetAZb
      - !Ref PublicSubnetAZc   # three AZs — never one
    SecurityGroups: [ !Ref AlbSecurityGroup ]

Compute — Auto Scaling across AZs is the whole game

An Auto Scaling group distributes instances across the subnets (AZs) you list and, on an AZ failure, marks instances there unhealthy, launches replacements in the survivors, and (via AZ Rebalance) re-spreads once the AZ recovers. List one subnet and your “auto-scaled, highly available” tier is a single-AZ deployment that dies with its AZ.

# Does the ASG actually span AZs? One AZ here is the classic fake-HA failure
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names web-asg \
  --query "AutoScalingGroups[].{Name:AutoScalingGroupName, Min:MinSize, Desired:DesiredCapacity, AZs:AvailabilityZones, Subnets:VPCZoneIdentifier}" \
  --output json
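
# Terraform: ASG across three private subnets (three AZs), min sized for quorum/headroom
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 3
  max_size            = 9
  desired_capacity    = 6
  vpc_zone_identifier = aws_subnet.private[*].id   # all three AZ subnets
  health_check_type   = "ELB"                      # replace instances the load balancer reports unhealthy
  target_group_arns   = [aws_lb_target_group.web.arn]

  # An ASG needs a launch template (or a mixed_instances_policy) to know what to launch.
  # aws_launch_template.web is assumed to be defined elsewhere in your configuration.
  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }
}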

Data tier — Multi-AZ semantics differ sharply by service

This is where the “what does Multi-AZ mean” question has the most variety. RDS Multi-AZ is a synchronous standby you pay for, with automatic failover. Aurora replicates storage across three AZs inherently. S3 and DynamoDB are multi-AZ by default, for free, with no knob to turn. EFS is multi-AZ when you mount it via per-AZ targets. ElastiCache needs Multi-AZ explicitly enabled. Get this table wrong and you put a single-AZ database behind a three-AZ app:

| Service | Multi-AZ mechanism | Automatic on AZ loss? | Synchronous? | Cost of Multi-AZ | The gotcha |
| --- | --- | --- | --- | --- | --- |
| EC2 | You place instances per AZ (ASG) | Via ASG/ELB health | N/A | ~0% (same instance count) | One subnet = single AZ |
| ALB / NLB | Attach subnets per AZ | Yes (target health) | N/A | Cross-zone transfer (NLB billed) | One subnet = blackhole |
| RDS (Multi-AZ instance) | Synchronous standby in 2nd AZ | Yes (60–120 s failover) | Yes | ~2× DB instance cost | Standby can’t serve reads (instance mode) |
| RDS Multi-AZ DB cluster | 2 readable standbys across 3 AZs | Yes (faster failover) | Semi-synchronous | ~3 instances | Engine/Region support varies |
| Aurora | Storage replicated across 3 AZs | Yes (replica promotion) | Storage-level | Per-replica compute | No reader = slower failover |
| DynamoDB | Replicated across ≥3 AZs by default | Yes (transparent) | Yes | Free (built in) | Nothing to configure; don’t overthink it |
| S3 (Standard) | ≥3 AZ object replication by default | Yes (transparent) | Yes | Free (built in) | One-Zone-IA is the single-AZ exception |
| EFS (Standard) | Data across multiple AZs; per-AZ mount targets | Yes | Yes | Standard storage price | Need a mount target per AZ to survive |
| ElastiCache (Redis) | Replicas across AZs + Multi-AZ failover | Only if Multi-AZ enabled | Async | Replica node cost | Off by default; enable it |
| NAT gateway | One per AZ (not automatic) | No, manual design | N/A | ~per-AZ hourly + per-GB | One NAT = egress dies with its AZ |

Recovery is not instantaneous, and the time differs sharply by service. Budget for it in your RTO and your client retry logic:

| Service | Recovery mechanism on AZ loss | Typical recovery time | What the client experiences |
| --- | --- | --- | --- |
| ALB targets | Health-check draining to survivors | Seconds (health-check interval) | A few failed/retried requests |
| EC2 via ASG | Relaunch in healthy AZs | 1–3 min (boot + warm) | Reduced capacity briefly |
| RDS Multi-AZ instance | Standby promotion + DNS flip | 60–120 s | Connection drop; reconnect succeeds |
| RDS Multi-AZ DB cluster | Promote a readable standby | ~35 s or less | Shorter blip than instance mode |
| Aurora | Promote a reader (or rebuild from storage) | ~30 s with a reader; longer without | Faster with a provisioned reader |
| DynamoDB | Transparent (multi-AZ internally) | ~0 (no visible failover) | Nothing |
| S3 | Transparent (multi-AZ internally) | ~0 | Nothing |
| ElastiCache (Multi-AZ on) | Replica promotion | Tens of seconds | Brief cache unavailability |

Two design consequences fall out of this table: always provision an Aurora reader (failover with no reader is much slower), and make clients retry with backoff — a 60–120 s RDS failover is invisible to a user only if the app reconnects rather than erroring out.
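
As a rough picture of what "retry with backoff" means, here is a minimal shell sketch. The helper name retry_with_backoff is hypothetical, and real applications would normally get this behaviour from their database driver or connection pool rather than a wrapper script; the point is only the shape of the loop.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the wait after each failure.
# Returns 0 on the first success, 1 once MAX_ATTEMPTS have all failed.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))        # exponential backoff: 1, 2, 4, 8, ...
    attempt=$(( attempt + 1 ))
  done
}
```

With 8 attempts and a 1-second base delay the waits add up to 127 s (1 + 2 + 4 + ... + 64), enough to ride out the 60–120 s failover window in the table. A usage sketch for a PostgreSQL endpoint, where DB_ENDPOINT is a placeholder for your own RDS endpoint: retry_with_backoff 8 1 pg_isready -h "$DB_ENDPOINT". Production clients usually add random jitter to the delay so that many clients do not reconnect in lockstep.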

RDS Multi-AZ in CLI and CloudFormation, because it is the single most commonly missed data-tier control:

# Turn on Multi-AZ (synchronous standby + automatic failover) for an existing instance
aws rds modify-db-instance --db-instance-identifier prod-db \
  --multi-az --apply-immediately

# Confirm it stuck
aws rds describe-db-instances --db-instance-identifier prod-db \
  --query "DBInstances[].{Id:DBInstanceIdentifier, MultiAZ:MultiAZ, AZ:AvailabilityZone, Secondary:SecondaryAvailabilityZone}" \
  --output table
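
# CloudFormation: Multi-AZ RDS from the start (the only correct default for prod)
ProdDB:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: postgres
    DBInstanceClass: db.r6g.large
    MultiAZ: true                 # synchronous standby in a second AZ
    AllocatedStorage: 100
    MasterUsername: dbadmin       # illustrative name; a new instance needs master credentials
    ManageMasterUserPassword: true   # let RDS generate the password and keep it in Secrets Manager
    DBSubnetGroupName: !Ref DbSubnetGroup   # the subnet group must list ≥2 AZ subnets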

A DB subnet group must contain subnets in at least two AZs or RDS refuses to enable Multi-AZ — a frequent first-time error.
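
A small helper makes the "at least two AZs" rule checkable before you try the change. This is a sketch under assumptions: `az_spread_ok` is a hypothetical name, it only counts distinct AZ names fed to it on stdin, and the subnet group name in the usage comment is a placeholder.

```shell
# Count distinct AZ names on stdin (space-, tab- or newline-separated) and
# succeed only if there are at least MIN of them (default 2).
az_spread_ok() {
  local min=${1:-2} count
  count=$(tr -s ' \t' '\n' | sed '/^$/d' | sort -u | wc -l | tr -d ' ')
  echo "distinct AZs: $count"
  [ "$count" -ge "$min" ]
}

# Example (placeholder group name):
# aws rds describe-db-subnet-groups --db-subnet-group-name my-db-subnets \
#   --query "DBSubnetGroups[0].Subnets[].SubnetAvailabilityZone.Name" \
#   --output text | az_spread_ok 2
```

The same function works for any list of AZ names, for example the `AvailabilityZones` of an ALB or an Auto Scaling group.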

Egress — the NAT gateway trap

A NAT gateway is zonal: it lives in one subnet, hence one AZ. If all your private subnets route egress through a single NAT gateway and that AZ fails, every instance in every AZ loses outbound internet — an AZ failure in one zone takes down egress for all of them. The fix is one NAT gateway per AZ, with each AZ’s private route table pointing at its own zone’s NAT:

# How many NAT gateways, and in which subnets? Fewer gateways than AZs = a hidden SPOF
# (the output lists SubnetId; map each subnet to its AZ with describe-subnets)
aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=vpc-0abc123" \
  --query "NatGateways[].{Id:NatGatewayId, Subnet:SubnetId, State:State}" \
  --output table

The trade-off is cost: a NAT gateway per AZ multiplies the hourly charge, but it keeps each AZ’s NAT traffic inside that AZ (no cross-AZ transfer charge, and more resilient). For dev environments a single NAT gateway is a reasonable cost saving; for production it is a false economy.

Availability math: what each design actually promises

Resilience choices translate into hard availability numbers, and the AWS SLA depends on how you deploy. The figures that frame the conversation (illustrative SLA tiers and the downtime they imply):

| Design | Representative availability target | Approx. downtime / year | What it survives |
|---|---|---|---|
| Single instance, single AZ | ~99.5% (instance-level) | ~1.8 days | Nothing structural |
| Single-AZ, redundant instances | ~99.9% | ~8.8 hours | Instance/host failures only |
| Multi-AZ (2–3 AZs) | ~99.95–99.99% | ~4.4 h–53 min | One AZ failure |
| Multi-Region (active/passive) | ~99.99%+ | ~53 min or less | One Region failure |
| Multi-Region (active/active) | ~99.999% achievable | ~5 min | Region failure + global serving |

Two honest caveats: these are design targets, not guarantees your app will hit (your code and dependencies matter), and AWS publishes specific SLAs per service — EC2, RDS, S3 each have their own. The point of the table is directional: each rung up the hierarchy removes roughly an order of magnitude of downtime, at increasing cost. Multi-AZ is the rung with the best return.
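
The downtime column is plain arithmetic: the unavailable fraction multiplied by the minutes in a year. A minimal sketch, assuming a 365-day year (the function name `downtime_per_year` is illustrative):

```shell
# Convert an availability percentage into allowed downtime per year.
downtime_per_year() {
  awk -v a="$1" 'BEGIN {
    minutes = (100 - a) / 100 * 365 * 24 * 60
    printf "%s%% -> %.1f min/year (%.2f h)\n", a, minutes, minutes / 60
  }'
}

for a in 99.5 99.9 99.95 99.99 99.999; do
  downtime_per_year "$a"
done
# 99.5% -> 2628.0 min/year (43.80 h)
# 99.9% -> 525.6 min/year (8.76 h)
# 99.95% -> 262.8 min/year (4.38 h)
# 99.99% -> 52.6 min/year (0.88 h)
# 99.999% -> 5.3 min/year (0.09 h)
```

These match the rounded figures in the table: 43.8 hours is about 1.8 days, and each extra nine cuts the allowance by a factor of ten.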

Who guarantees what — the shared-responsibility split

AWS runs the physical AZ/Region infrastructure; you decide how to deploy across it. Outages happen when teams assume AWS covers a layer they actually own:

| Layer | AWS provides | You own |
|---|---|---|
| Physical AZ (power/cooling/network) | Independent, redundant facilities | Choosing to use more than one |
| Inter-AZ network | Low-latency redundant fibre | Sending traffic across it (and the bill) |
| Regional service durability | S3/DynamoDB multi-AZ by default | Not picking the single-AZ option (One Zone-IA) |
| Managed-service failover | RDS Multi-AZ mechanism | Enabling Multi-AZ; subnet group across AZs |
| Compute placement | Capacity across AZs | Listing multiple subnets on ASG/ALB |
| DNS failover | Route 53 health checks/policies | Configuring failover routing + low TTL |
| Multi-Region | The second Region’s infrastructure | Replicating data, mirroring controls, cutover |

The recurring theme of every outage in this article lives in the right-hand column: AWS gives you independent failure domains; using them is your job.

Cross-AZ data transfer: the cost nobody budgets for

Multi-AZ is cheap, not free, and the line item people miss is data transfer between AZs. AWS bills traffic that crosses an AZ boundary — typically per GB in each direction — while same-AZ traffic (by private IP) and most traffic to/from regional services is free. On a chatty microservice mesh or a high-throughput app↔DB path, cross-AZ bytes become a real monthly number.
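
To turn "per GB in each direction" into a monthly number, multiply the cross-AZ volume by the per-GB rate and by two. The sketch below treats the rate as an input because the article quotes none; the 0.01 in the example is an assumed placeholder, so check current pricing for your Region. The function name is illustrative.

```shell
# Estimate monthly cross-AZ transfer cost.
# $1 = GB per month that cross an AZ boundary
# $2 = price per GB charged in EACH direction (assumption: supply your Region's rate)
cross_az_monthly_cost() {
  awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", gb * rate * 2 }'
}

cross_az_monthly_cost 5000 0.01   # 5 TB/month at an assumed 0.01 per GB each way -> 100.00
```

The result is in whatever currency the rate is expressed in.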

What’s free, what’s billed — the rules that actually matter:

| Traffic path | Billed as cross-AZ? | Notes |
|---|---|---|
| EC2 ↔ EC2, same AZ, via private IP | Free | Use private IPs; public-IP hops can route oddly |
| EC2 ↔ EC2, different AZ, private IP | Billed (per GB each way) | The main cross-AZ cost driver |
| EC2 ↔ EC2 via public IP / Elastic IP (same Region) | Billed | Avoid public IPs for internal traffic |
| App ↔ RDS in a different AZ | Billed | Multi-AZ failover may move the active to another AZ |
| App ↔ RDS in the same AZ | Free | Keep read replicas in the same AZ as the clients that read from them where possible |
| App ↔ ElastiCache in a different AZ | Billed | Hot cache paths benefit from same-AZ placement |
| Inter-AZ replication (RDS sync, Aurora storage) | Included in service cost | Not a separate transfer line you control |
| Traffic to S3 / DynamoDB in-Region (via gateway endpoint) | Free | Use a gateway VPC endpoint to also avoid NAT cost |
| Traffic to S3/DynamoDB in-Region without an endpoint (via NAT) | NAT data-processing charge | The hidden cost a gateway endpoint removes |
| Traffic through an interface VPC endpoint (PrivateLink) | Hourly + per-GB | Cheaper than NAT for many AWS-service calls |
| NLB cross-zone load balancing | Billed as cross-AZ | Why NLB cross-zone is off by default |
| ALB cross-zone | Not separately billed | On by default; effectively free to you |
| Data into AWS from the internet | Free | Ingress is generally free |
| Data out to the internet | Billed (per-GB, tiered) | The other big transfer line beyond cross-AZ |
| Cross-Region transfer | Billed (higher rate) | A different, larger cost than cross-AZ |

How to keep the bill sane without sacrificing resilience:

| Technique | What it saves | Trade-off |
|---|---|---|
| Use private IPs for internal traffic | Avoids public-IP routing surcharges | None, just discipline |
| Gateway VPC endpoints for S3/DynamoDB | Removes NAT data-processing + makes S3/DDB free | Endpoint setup; route-table entries |
| Keep chatty app↔cache same-AZ where viable | Cuts cross-AZ GB on hot paths | Reduced AZ spread for that link; balance vs HA |
| AZ-aware service routing (where the app supports it) | Prefers same-AZ targets | App/mesh complexity |
| Right-size 3 AZs vs 2 for actual traffic | Fewer cross-AZ hops if traffic is heavy | Quorum needs may force 3 anyway |
| NAT per AZ | Keeps NAT traffic same-AZ (cheaper + resilient) | More NAT gateways to pay for |
| Interface endpoints for chatty AWS-service calls | Avoids NAT data-processing for those calls | Per-endpoint hourly cost |
| VPC peering / Transit Gateway intra-Region over private IP | Avoids public-IP routing surcharges | TGW per-attachment + data cost |

The honest tension: spreading across AZs adds cross-AZ transfer, and minimizing transfer pushes toward fewer AZs — but never sacrifice the AZ spread your availability target needs to shave a transfer bill. Optimize the paths (private IPs, endpoints, same-AZ for the hottest links), not the AZ count.

Edge, Local Zones and Wavelength: latency, not availability

Three constructs sit in front of or beside Regions and are routinely confused with HA. They are about latency and proximity, not surviving an AZ or Region failure. Using a CDN does not make you available; it makes you fast and absorbs some attacks.

| Construct | What it is | Primary purpose | Failure-domain role | Example |
|---|---|---|---|---|
| Edge location / PoP | CloudFront / Global Accelerator point of presence | Cache content, terminate TLS near users, anycast routing | Absorbs L3/4 DDoS; not a DR strategy | CloudFront cache for static assets |
| Regional edge cache | Larger mid-tier cache behind PoPs | Improve cache-hit ratio | None (performance only) | CloudFront origin shielding |
| Local Zone | AWS compute/storage in a metro, tied to a parent Region | Single-digit-ms latency to a specific city | Metro extension of a Region; not separate DR | us-west-2-lax-1 for LA media workloads |
| Wavelength Zone | Compute embedded in a telco’s 5G network | Ultra-low latency for mobile/5G/edge apps | Carrier-edge; tied to a Region | AR/VR, real-time mobile gaming |
| Outposts | AWS racks in your datacenter | AWS APIs on-prem (residency, latency) | On-prem extension; you own the building’s risk | Low-latency factory floor |

A decision table for which proximity construct fits a given need:

| If you need… | Reach for | Not for |
|---|---|---|
| Cache static/dynamic content near global users | CloudFront | Compute or DR |
| Lowest-latency, static-anycast entry + L3/4 DDoS | Global Accelerator | Caching (use CloudFront) |
| Single-digit-ms compute latency to a specific city | Local Zone | Surviving a Region failure |
| Ultra-low latency inside a 5G/mobile network | Wavelength Zone | General workloads |
| AWS APIs running in your datacenter (residency/latency) | Outposts | Offloading building risk to AWS |
| Survive one datacenter failing | Multi-AZ | (not any of the above) |
| Survive a whole Region failing | Multi-Region | (not any of the above) |

The reading note: CloudFront/Global Accelerator improve latency and DDoS posture; Local Zones/Wavelength improve metro/carrier latency; none of them is a substitute for Multi-AZ or multi-Region resilience. If someone says “we’re safe, we have a CDN,” they have confused performance with availability.

Multi-Region: when it’s actually warranted

Multi-AZ survives the common failure. Multi-Region survives the rare, total one — a whole-Region event — and lets you serve users on another continent with low latency. It is also a large step up in cost and operational burden: duplicate stacks, cross-Region replication, DNS failover, and (for active/active) data-consistency problems you now own. The senior judgment is to add it only when RTO/RPO requirements or a global user base demand it — not because it sounds robust.

The patterns, in increasing cost/complexity:

| Pattern | What runs where | RTO | RPO | Cost | When to choose |
|---|---|---|---|---|---|
| Backup & restore | Backups copied cross-Region; rebuild on disaster | Hours–day | Hours | Low | Non-critical; tight budget |
| Pilot light | Core (DB replica) warm in DR; rest off | ~Tens of min | Minutes | Low–medium | Important but cost-sensitive |
| Warm standby | Scaled-down full stack always running in DR | Minutes | Seconds–minutes | Medium–high | Critical workloads, fast recovery |
| Active/active (multi-site) | Full stack live in both, traffic split | ~Zero | ~Zero (with global data) | Highest | Global low-latency + near-zero RTO |

A side-by-side that maps each pattern to the AWS building blocks:

| Concern | Backup & restore | Pilot light | Warm standby | Active/active |
|---|---|---|---|---|
| DB replication | Snapshot copy | Cross-Region read replica | Read replica (promotable) | Global table / Aurora Global DB |
| Compute in DR | None | Minimal | Scaled-down, running | Full, serving traffic |
| DNS strategy | Manual repoint | Route 53 failover | Route 53 failover/health | Route 53 latency/geolocation |
| Data store fit | Any | RDS/Aurora | RDS/Aurora | DynamoDB Global Tables, Aurora Global |
| Main risk | Long RTO | Promotion + scale-out time | Cost of idle stack | Split-brain / conflict resolution |

The DNS layer is what actually steers traffic between Regions, and Route 53 routing policies are the lever. Picking the wrong one is a common multi-Region failure (traffic that won’t fail over, or that ignores latency):

| Routing policy | What it does | Use it for | Watch-out |
|---|---|---|---|
| Failover | Primary; switch to secondary when a health check fails | Active/passive DR | Health check must probe a real endpoint; low TTL |
| Latency | Routes each user to the lowest-latency Region | Active/active global serving | Needs a healthy stack in each Region |
| Geolocation | Routes by the user’s location | Residency / localized content | Define a default for unmatched locations |
| Geoproximity | Routes by geography with a bias “shift” | Tuning traffic between Regions | More complex; needs Traffic Flow |
| Weighted | Splits traffic by assigned weights | Canary / gradual cutover | Weights are coarse; not health-aware alone |
| Multivalue answer | Returns multiple healthy records | Simple client-side spread | Not a substitute for a real LB |
| Simple | One record, no health logic | Single-Region only | No failover; wrong for multi-Region |

The pairing that matters: active/passive DR uses Failover routing with a health check and a low TTL (60 s), while active/active global serving uses Latency or Geolocation with a live stack in every Region. A high record TTL (e.g. 3600 s) will pin resolvers to a dead Region for an hour — set it to 60 s for anything that needs to fail over.
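
The TTL point can be made concrete with a rough model of how long clients may keep hitting a dead Region: the time for the health check to fail (check interval times failure threshold) plus the record TTL. This is a simplification under assumptions: it ignores resolvers that cache beyond the TTL and application-level connection reuse, and the 30 s interval with a threshold of 3 are Route 53's standard health-check settings rather than values from this article. The function name is illustrative.

```shell
# Rough worst-case seconds before clients stop resolving to a failed Region:
# (health-check interval x failure threshold) + record TTL.
dns_failover_seconds() {
  local interval=$1 threshold=$2 ttl=$3
  echo $(( interval * threshold + ttl ))
}

dns_failover_seconds 30 3 60     # 150  -> about 2.5 minutes with a 60 s TTL
dns_failover_seconds 30 3 3600   # 3690 -> over an hour with a 3600 s TTL
```

With these inputs the TTL dominates the total, which is why it is the first thing to lower on records that must fail over.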

The hard rule and a pointer onward: get Multi-AZ rock-solid first. A multi-Region design built on top of single-AZ tiers is theatre — you’ll fail to the DR Region for an AZ incident you should have absorbed locally, paying a huge RTO for a small failure. The cross-Region mechanics (snapshot copy, Vault Lock, Route 53 cutover, RTO/RPO tiers) are covered in depth in AWS Backup and Disaster Recovery Strategies.

Architecture at a glance

The diagram traces one request through the resiliency hierarchy, left to right, and marks the controls that, set wrong, collapse a failure domain. Start at the global edge: Route 53 applies latency and health-based routing, and CloudFront serves cached content from edge PoPs (with AWS Shield absorbing L3/4 attacks); both are global and sit in front of every Region. The request lands in Region ap-south-1, enters the VPC (10.0.0.0/16, three AZ subnets) and hits a single regional ALB on port 443 with cross-zone load balancing on. That one ALB fans the request across three independent Availability Zones (aps1-az1/2/3, the AZ IDs for ap-south-1), each a physically separate datacenter with its own power, cooling and network, where an EC2 Auto Scaling group spans all three (min 3, desired 6, one subnet per AZ). When an AZ’s power/cooling/network faults, it becomes an isolated failure domain; the ALB’s health checks drain the unhealthy AZ and capacity shifts to the survivors.

Behind compute sits the data tier: RDS Multi-AZ with a synchronous standby in a second AZ (60–120 s automatic failover), DynamoDB replicating across three AZs at eleven-nines durability for free, and S3 writing every object to ≥3 AZs, also free. Finally, a second Region (ap-southeast-1) holds a warm-standby passive stack plus cross-Region replication (an RDS read replica and S3 CRR) — the only thing in the picture that survives a whole-Region event. Each numbered badge marks a control that, misconfigured, breaks the chain: a single-subnet ALB (1), an ASG pinned to one AZ (2), the AZ failure domain itself (3), a database that isn’t Multi-AZ (4), and the absence of any region-failure plan (5). Read the legend as symptom · how to confirm · fix for each.

[Figure: AWS resiliency hierarchy. One request flows from the global edge (Route 53, CloudFront edge PoPs with AWS Shield) into Region ap-south-1, through a regional ALB and an EC2 Auto Scaling group spanning three Availability Zones, to the data tier (RDS Multi-AZ, DynamoDB, S3), with a warm-standby DR Region in ap-southeast-1 and five numbered failure-point badges: single-subnet ALB, ASG pinned to one AZ, the AZ failure domain, a database not in Multi-AZ, and no region-failure plan.]

Real-world scenario

Streamly is a fictional but realistic Indian video-streaming startup: a three-tier app (React front end on CloudFront/S3, a Node API on EC2, PostgreSQL on RDS) serving ~250,000 daily users, almost all in India, out of ap-south-1 (Mumbai). The platform team is five engineers; the monthly AWS bill is about ₹6,80,000. To launch fast and cheap they had done what many do: clicked through the console, accepted the default AZ, and built everything in ap-south-1a — one public subnet, one private subnet, one NAT gateway, a single-AZ RDS instance, and an Auto Scaling group whose VPCZoneIdentifier listed exactly one subnet. On paper they had “an ALB, Auto Scaling, and managed RDS” — the vocabulary of high availability. In reality every tier shared one failure domain.

The incident hit on a Saturday evening during a cricket-final livestream, at peak traffic. At 20:42 a power/cooling event degraded aps1-az1 (the physical AZ their ap-south-1a mapped to). The symptoms cascaded in seconds. The ALB, attached to one subnet, had no healthy targets and started returning 503. The Auto Scaling group tried to launch replacement instances into the same dead AZ, because that was the only subnet it knew, and they failed to come up. The RDS instance, single-AZ, had no standby to promote; the database was simply gone. And because the lone NAT gateway lived in that AZ too, nothing in the VPC had a working egress path either. The “highly available” stack was fully down. The on-call engineer’s reflexes (restart the app, scale the ASG) did nothing, because every lever pointed back into the failed AZ.

The breakthrough was diagnostic, not heroic. The senior on-call ran aws elbv2 describe-load-balancers and saw a single entry under AvailabilityZones. aws autoscaling describe-auto-scaling-groups showed one AZ and one subnet. aws rds describe-db-instances showed MultiAZ: false. The picture was unmistakable: this was never a multi-AZ system. There was no fast in-incident fix (you cannot conjure a standby database or new subnets mid-outage), so they waited for AWS to recover aps1-az1, which took about two hours and forty minutes. All of it was downtime, during their highest-revenue window of the quarter.

The remediation, done deliberately over the next two weeks, was almost entirely structural and barely moved the bill. They added public and private subnets in aps1-az2 and aps1-az3, re-attached the ALB to all three public subnets, set the ASG’s VPCZoneIdentifier to all three private subnets with min_size = 3 and AZ Rebalance on, converted RDS to Multi-AZ (a synchronous standby in a second AZ), and deployed a NAT gateway per AZ with per-AZ route tables. The compute cost was unchanged: the same six instances, now spread across three buildings instead of stacked in one. The genuine new costs were the RDS standby (~+₹38,000/mo), two extra NAT gateways (~+₹9,000/mo), and a modest rise in cross-AZ data transfer (~₹12,000/mo), together about ₹59,000/mo, or under 9% on a ₹6,80,000 bill that had just eaten a multi-hour outage during the cricket final.

Six weeks later, a different AZ in ap-south-1 had a brief network event. This time describe-target-health showed the affected AZ’s targets draining, the ASG launched replacements in the two healthy AZs within minutes, RDS executed an automatic failover to its standby in about 70 seconds, and users saw a blip in buffering, not an outage. The line the team pinned to the wall: “Auto Scaling and Multi-AZ are not features you enable — they are subnets you list. We had the words without the wiring.”

Advantages and disadvantages

Multi-AZ as the default production posture is overwhelmingly the right call, but weigh it honestly:

| Advantages (why Multi-AZ is the baseline) | Disadvantages (the costs and limits) |
|---|---|
| Survives the common failure (a single AZ), by far the most likely datacenter incident | Does not survive a whole-Region event; multi-Region is a separate, costlier effort |
| Essentially free on compute: same instance count, spread across buildings | Cross-AZ data transfer is billed per GB each way; chatty paths add up |
| Managed services do it for you: S3/DynamoDB are multi-AZ by default, RDS Multi-AZ failover is automatic | A Multi-AZ RDS standby is a paid, idle instance (instance mode can’t even serve reads) |
| Operationally simple: identity, networking and most services stay within one Region | A NAT gateway per AZ for resilient egress multiplies that hourly charge |
| Low-latency HA: AZs are close enough for synchronous replication, so failover is fast and consistent | Quorum systems need three AZs, so two-AZ designs can still lose availability on one failure |
| Health-based draining routes around a dead AZ automatically when wired correctly | Capacity headroom: each AZ must hold spare for a neighbour’s failure (2× for two AZs) |
| AZ Rebalance re-spreads capacity automatically once an AZ recovers | Defaults are unsafe: console-default single-subnet placement looks HA but isn’t |

The model is right for essentially every production workload: you want the common failure absorbed for almost no money, and managed services hand you most of the resilience. It bites when (a) you skip it and run single-AZ “to save money” (the costliest false economy in this article), (b) you build two AZs for a three-AZ quorum need, or (c) you minimize AZ spread to shave a data-transfer bill and reintroduce a single point of failure. Each disadvantage is manageable once you know it exists, which is the whole point.
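
The capacity-headroom row is worth working through, because it is where two AZs and three AZs differ most. If the surviving AZs must still carry peak load after one AZ is lost, each AZ needs peak / (N - 1) instances, rounded up. A minimal sketch (the function name and the peak of 6 instances are illustrative):

```shell
# Instances required in each AZ so that the survivors still carry PEAK
# after any single AZ is lost: ceil(peak / (azs - 1)).
per_az_capacity() {
  local peak=$1 azs=$2
  if [ "$azs" -lt 2 ]; then
    echo "need at least 2 AZs to survive an AZ loss" >&2
    return 1
  fi
  echo $(( (peak + azs - 2) / (azs - 1) ))   # integer ceiling division
}

for n in 2 3; do
  per=$(per_az_capacity 6 "$n")
  echo "$n AZs: $per per AZ, $((per * n)) total"
done
# 2 AZs: 6 per AZ, 12 total
# 3 AZs: 3 per AZ, 9 total
```

For a peak of 6, two AZs need 12 instances (2×) while three AZs need 9 (1.5×), so the third AZ lowers the cost of full headroom as well as satisfying quorum.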

Hands-on lab

Build a genuinely Multi-AZ web tier — VPC across three AZs, ALB across three subnets, an Auto Scaling group spanning all three — then prove it by draining an AZ’s targets and watching traffic survive. Free-tier-friendly where possible (t3.micro); delete everything at the end to avoid NAT/ALB charges. Run in CloudShell or any shell with the CLI configured for a three-AZ Region (we use ap-south-1).

Step 1 — Variables and the three AZs.

REGION=ap-south-1
VPC_CIDR=10.20.0.0/16
# Grab the first three AZ names in the Region
mapfile -t AZS < <(aws ec2 describe-availability-zones --region $REGION \
  --query "AvailabilityZones[?State=='available'].ZoneName" --output text | tr '\t' '\n' | head -3)
echo "Using AZs: ${AZS[@]}"   # expect three, e.g. ap-south-1a ap-south-1b ap-south-1c

Step 2 — VPC and an Internet Gateway.

VPC_ID=$(aws ec2 create-vpc --cidr-block $VPC_CIDR --region $REGION \
  --query Vpc.VpcId --output text)
IGW_ID=$(aws ec2 create-internet-gateway --region $REGION \
  --query InternetGateway.InternetGatewayId --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID --region $REGION
echo "VPC=$VPC_ID IGW=$IGW_ID"

Step 3 — One public subnet per AZ (this is the Multi-AZ part).

SUBNETS=()
for i in 0 1 2; do
  SID=$(aws ec2 create-subnet --vpc-id $VPC_ID --region $REGION \
    --cidr-block 10.20.$((i*10)).0/24 --availability-zone ${AZS[$i]} \
    --query Subnet.SubnetId --output text)
  aws ec2 modify-subnet-attribute --subnet-id $SID --map-public-ip-on-launch --region $REGION
  SUBNETS+=($SID)
  echo "Subnet in ${AZS[$i]} = $SID"
done

Expected: three subnet IDs, one per AZ. This is your AZ coverage.

Step 4 — Route the subnets to the internet.

RT_ID=$(aws ec2 create-route-table --vpc-id $VPC_ID --region $REGION \
  --query RouteTable.RouteTableId --output text)
aws ec2 create-route --route-table-id $RT_ID --destination-cidr-block 0.0.0.0/0 \
  --gateway-id $IGW_ID --region $REGION
for SID in "${SUBNETS[@]}"; do
  aws ec2 associate-route-table --route-table-id $RT_ID --subnet-id $SID --region $REGION
done

Step 5 — Security group, then an ALB across all three subnets.

SG_ID=$(aws ec2 create-security-group --group-name lab-alb-sg --description "lab alb" \
  --vpc-id $VPC_ID --region $REGION --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port 80 \
  --cidr 0.0.0.0/0 --region $REGION

ALB_ARN=$(aws elbv2 create-load-balancer --name lab-alb --type application \
  --subnets "${SUBNETS[@]}" --security-groups $SG_ID --region $REGION \
  --query "LoadBalancers[0].LoadBalancerArn" --output text)

# Confirm it faces THREE AZs — the whole point of the lab
aws elbv2 describe-load-balancers --load-balancer-arns $ALB_ARN --region $REGION \
  --query "LoadBalancers[0].AvailabilityZones[].ZoneName" --output table

Expected: a table with three AZ rows. One row would mean a single-AZ ALB — the failure we’re avoiding.

Step 6 — A target group and an Auto Scaling group across all three subnets.

TG_ARN=$(aws elbv2 create-target-group --name lab-tg --protocol HTTP --port 80 \
  --vpc-id $VPC_ID --target-type instance --region $REGION \
  --query "TargetGroups[0].TargetGroupArn" --output text)
aws elbv2 create-listener --load-balancer-arn $ALB_ARN --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=$TG_ARN --region $REGION

# Minimal launch template that serves an AZ-aware hello page
AMI=$(aws ssm get-parameter --region $REGION \
  --name /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
  --query Parameter.Value --output text)
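# AL2023 enforces IMDSv2 by default, so fetch a session token before reading the AZ
USERDATA=$(printf '%s\n' '#!/bin/bash' 'dnf install -y httpd' \
  'TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token -H "X-aws-ec2-metadata-token-ttl-seconds: 300")' \
  'AZ=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/availability-zone)' \
  'echo "OK from $AZ" > /var/www/html/index.html' \
  'systemctl enable --now httpd' | base64 -w0)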
LT_ID=$(aws ec2 create-launch-template --launch-template-name lab-lt --region $REGION \
  --launch-template-data "{\"ImageId\":\"$AMI\",\"InstanceType\":\"t3.micro\",\"SecurityGroupIds\":[\"$SG_ID\"],\"UserData\":\"$USERDATA\"}" \
  --query "LaunchTemplate.LaunchTemplateId" --output text)

SUBNET_CSV=$(IFS=,; echo "${SUBNETS[*]}")
# The grace period gives user data time to install httpd before ELB health checks count
aws autoscaling create-auto-scaling-group --auto-scaling-group-name lab-asg \
  --launch-template "LaunchTemplateId=$LT_ID" --min-size 3 --max-size 6 --desired-capacity 3 \
  --vpc-zone-identifier "$SUBNET_CSV" --target-group-arns $TG_ARN \
  --health-check-type ELB --health-check-grace-period 120 --region $REGION

# Prove the ASG spans three AZs
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names lab-asg --region $REGION \
  --query "AutoScalingGroups[0].{AZs:AvailabilityZones, Subnets:VPCZoneIdentifier}" --output json

Expected: AZs lists three zones. After a couple of minutes, curl http://<ALB-DNS>/ (from describe-load-balancers --query LoadBalancers[0].DNSName) returns OK from ap-south-1a/b/c, varying by which instance answered.

Step 7 — Simulate an AZ loss and watch survival. Pull the instances of one AZ out of the target group (a stand-in for that AZ failing) and confirm the ALB keeps serving from the other two:

# Watch target health; deregister one AZ's targets; traffic should continue from the rest
aws elbv2 describe-target-health --target-group-arn $TG_ARN --region $REGION \
  --query "TargetHealthDescriptions[].{Id:Target.Id, AZ:Target.AvailabilityZone, State:TargetHealth.State}" \
  --output table
# (Deregister the instance(s) whose AZ you want to 'fail', then re-run a curl loop against the ALB DNS — it stays up.)
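
The curl loop mentioned in the comment above is easier to read if the responses are tallied per AZ. The helper below is a sketch: `tally_by_az` is a hypothetical name, it assumes the lab's "OK from <az>" page body, and anything else (for example a 503 error page) is counted as "other". The commented usage relies on the lab's `ALB_ARN` and `REGION` variables.

```shell
# Tally "OK from <az>" responses read from stdin, one per line.
# Blank lines are ignored; any other non-empty line counts as "other".
tally_by_az() {
  awk '/^OK from /{c[$3]++; next} NF{c["other"]++} END{for (a in c) print a, c[a]}' | sort
}

# Example against the lab ALB (run before and after removing one AZ's targets):
# ALB_DNS=$(aws elbv2 describe-load-balancers --load-balancer-arns $ALB_ARN \
#   --region $REGION --query "LoadBalancers[0].DNSName" --output text)
# for i in $(seq 1 30); do curl -s --max-time 2 "http://$ALB_DNS/"; echo; done | tally_by_az
```

Before the drain you should see all three AZ names; afterwards the removed AZ should disappear from the tally while the "other" count stays at or near zero.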

Validation checklist. You built a VPC across three AZs, attached the ALB to three subnets (confirmed three AvailabilityZones), spread an ASG across all three (min 3), and demonstrated that removing one AZ’s targets does not take the service down. No application code resilience was involved — the availability came entirely from which subnets you listed.

| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 3 | One subnet per AZ | AZ coverage = the subnets you create | The structural HA decision |
| 5 | ALB across 3 subnets | The front door faces every AZ | Avoiding the single-subnet blackhole |
| 6 | ASG VPCZoneIdentifier = 3 subnets | Compute spreads across AZs | The fix for “fake HA” Auto Scaling |
| 7 | Drain one AZ’s targets | Service survives one AZ loss | An actual AZ incident |

Cleanup (avoid ALB/instance charges).

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name lab-asg --force-delete --region $REGION
aws elbv2 delete-listener --listener-arn $(aws elbv2 describe-listeners --load-balancer-arn $ALB_ARN --region $REGION --query "Listeners[0].ListenerArn" --output text) --region $REGION
aws elbv2 delete-load-balancer --load-balancer-arn $ALB_ARN --region $REGION
aws elbv2 wait load-balancers-deleted --load-balancer-arns $ALB_ARN --region $REGION
aws elbv2 delete-target-group --target-group-arn $TG_ARN --region $REGION
aws ec2 delete-launch-template --launch-template-id $LT_ID --region $REGION
# Wait for the ASG's instances to terminate; the IGW, subnets and SG cannot be removed while they exist
aws ec2 wait instance-terminated --region $REGION \
  --filters "Name=tag:aws:autoscaling:groupName,Values=lab-asg"
# Then detach/delete IGW, subnets, route table, SG, and finally the VPC.
aws ec2 detach-internet-gateway --internet-gateway-id $IGW_ID --vpc-id $VPC_ID --region $REGION
aws ec2 delete-internet-gateway --internet-gateway-id $IGW_ID --region $REGION
for SID in "${SUBNETS[@]}"; do aws ec2 delete-subnet --subnet-id $SID --region $REGION; done
aws ec2 delete-route-table --route-table-id $RT_ID --region $REGION
aws ec2 delete-security-group --group-id $SG_ID --region $REGION
aws ec2 delete-vpc --vpc-id $VPC_ID --region $REGION

Cost note. An ALB and three t3.micro instances for an hour run a few tens of rupees; we used no NAT gateway in the lab (public subnets) precisely to keep it cheap. Delete promptly — an ALB left running is the main lingering charge.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First a scannable table you can read mid-incident, then the full reasoning for the entries that bite hardest. Every “fake HA” failure has the same shape: a resource that looks spread across AZs but is listed against one.

| # | Symptom | Root cause | Confirm (exact command) | Fix |
|---|---|---|---|---|
| 1 | One AZ degrades and the whole site 503s despite an ALB | ALB attached to a single subnet (one AZ) | aws elbv2 describe-load-balancers --query "LoadBalancers[].AvailabilityZones" shows one entry | Attach the ALB to subnets in ≥2 (ideally 3) AZs; turn on cross-zone |
| 2 | “Auto-scaled” tier dies entirely with one AZ; replacements won’t launch | ASG VPCZoneIdentifier lists one subnet | aws autoscaling describe-auto-scaling-groups --query "...AvailabilityZones" shows one AZ | List a subnet per AZ; set min_size ≥ 2–3; enable AZ Rebalance |
| 3 | AZ loss takes the database down, no failover | RDS is single-AZ (MultiAZ=false) | aws rds describe-db-instances --query "...MultiAZ" returns false | modify-db-instance --multi-az (subnet group must list ≥2 AZs) |
| 4 | One AZ fails and ALL instances (every AZ) lose internet egress | Single NAT gateway in the failed AZ | aws ec2 describe-nat-gateways shows fewer NAT GWs than AZs | One NAT gateway per AZ + per-AZ route tables |
| 5 | Shared/peered resources cost more or can’t co-locate across accounts | Aligned on AZ name not AZ ID | Compare ZoneId across accounts via describe-availability-zones | Coordinate on the AZ ID (aps1-az1), not ap-south-1a |
| 6 | NLB traffic piles onto one AZ; targets in others idle | NLB cross-zone load balancing is off (default) | aws elbv2 describe-target-group-attributes shows cross-zone false | Enable cross-zone on the NLB (note: billed as cross-AZ transfer) |
| 7 | Multi-AZ RDS enabled but reads still hammer one instance | Multi-AZ instance standby can’t serve reads | describe-db-instances shows MultiAZ true but one endpoint | Use Multi-AZ DB cluster or read replicas for read scaling |
| 8 | API call fails with auth/endpoint error in a new Region | Region is opt-in and not enabled / wrong STS endpoint | aws ec2 describe-regions --query "...OptInStatus" | Enable the Region; use its regional STS endpoint |
| 9 | Surprise data-transfer bill after going Multi-AZ | Heavy cross-AZ traffic (app↔DB, mesh) over private IP | Cost Explorer → data transfer; check inter-AZ GB | Gateway endpoints for S3/DDB; keep hottest paths same-AZ |
| 10 | App still down after failover; can’t reach the internet | NAT per AZ exists but route tables not per-AZ | aws ec2 describe-route-tables shows all subnets pointing at one NAT | Give each AZ’s private subnet its own route to its AZ’s NAT |
| 11 | “We have a CDN, we’re highly available”, but an AZ event still broke us | Confusing edge/latency with availability | Architecture review: where do origins actually live? | CloudFront is performance/DDoS, not DR; fix origin AZ spread |
| 12 | Two-AZ quorum cluster loses availability when one AZ fails | Quorum needs majority; 1 of 2 is no majority | Cluster shows no quorum / read-only after one AZ down | Run quorum systems across three AZs |
| 13 | One-Zone-IA S3 data lost after an AZ event | S3 One Zone-IA stores in a single AZ | aws s3api get-bucket... / object storage class is ONEZONE_IA | Use S3 Standard (≥3 AZ) for anything not reproducible |
| 14 | EFS-backed app loses access in one AZ | No mount target in the surviving AZs | aws efs describe-mount-targets shows targets in fewer than all AZs | Create an EFS mount target in every AZ the app runs in |
| 15 | ElastiCache cluster has no failover when its AZ dies | Multi-AZ not enabled on the replication group | aws elasticache describe-replication-groups shows AutomaticFailover disabled | Enable Multi-AZ + automatic failover; add a replica in another AZ |
| 16 | RDS won’t enable Multi-AZ (“subnet group must cover 2 AZs”) | DB subnet group lists subnets in only one AZ | aws rds describe-db-subnet-groups --query "...Subnets[].SubnetAvailabilityZone" | Add a subnet from a second AZ to the DB subnet group |
| 17 | Spot/On-Demand capacity error when one AZ is constrained | All capacity requested in a single AZ | ASG activity history shows InsufficientInstanceCapacity in one AZ | Spread the ASG across 3 AZs; use mixed instances/capacity-optimized |
| 18 | “Highly available” stack on EBS won’t recover in another AZ | EBS volume is AZ-bound; can’t attach across AZs | aws ec2 describe-volumes --query "...AvailabilityZone" | Don’t rely on a single EBS volume for HA; use snapshots/EFS/managed stores |

The expanded form, for the failures that cost the most:

1. One AZ degrades and the whole site 503s despite “having an ALB.” Root cause: The ALB is attached to a single subnet (one AZ). When that AZ’s targets go unhealthy, the ALB has nowhere to route. The presence of an ALB created the illusion of HA. Confirm: aws elbv2 describe-load-balancers --names <alb> --query "LoadBalancers[].AvailabilityZones[].ZoneName" returns one zone. Fix: Attach the ALB to subnets in ≥2 (ideally 3) AZs; ensure cross-zone load balancing is on (default for ALB). Re-run the confirm command and expect multiple zones.
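To run the same confirm step across many load balancers, the check can be scripted against saved CLI output. This is a minimal sketch, assuming the JSON comes from aws elbv2 describe-load-balancers; the sample data, the load balancer names and the single_az_load_balancers helper are hypothetical.

```python
import json

# Hypothetical sample, shaped like `aws elbv2 describe-load-balancers` output.
SAMPLE = json.loads("""
{"LoadBalancers": [
  {"LoadBalancerName": "web-alb",
   "AvailabilityZones": [
     {"ZoneName": "ap-south-1a", "SubnetId": "subnet-aaa"}]},
  {"LoadBalancerName": "api-alb",
   "AvailabilityZones": [
     {"ZoneName": "ap-south-1a", "SubnetId": "subnet-aaa"},
     {"ZoneName": "ap-south-1b", "SubnetId": "subnet-bbb"},
     {"ZoneName": "ap-south-1c", "SubnetId": "subnet-ccc"}]}
]}
""")


def single_az_load_balancers(response, min_zones=2):
    """Return names of load balancers enabled in fewer than min_zones AZs."""
    flagged = []
    for lb in response.get("LoadBalancers", []):
        zones = {az["ZoneName"] for az in lb.get("AvailabilityZones", [])}
        if len(zones) < min_zones:
            flagged.append(lb["LoadBalancerName"])
    return flagged


print(single_az_load_balancers(SAMPLE))
```

Raising min_zones to 3 turns the same function into a check for the "ideally 3" target rather than the bare minimum.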

2. The “auto-scaled, highly available” tier dies entirely with one AZ; replacements never come up. Root cause: The ASG’s VPCZoneIdentifier lists one subnet, so it can only ever launch into one AZ — including the replacements it tries to launch during that AZ’s failure. Confirm: aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg> --query "AutoScalingGroups[].{AZs:AvailabilityZones, Subnets:VPCZoneIdentifier}" shows a single AZ. Fix: Set VPCZoneIdentifier to one subnet per AZ, min_size at least equal to the AZ count (≥3 for quorum/headroom), and leave AZ Rebalance on so capacity re-spreads after recovery.
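The ASG check can be expressed the same way. The sketch below assumes one entry from the AutoScalingGroups list returned by aws autoscaling describe-auto-scaling-groups; the group name, subnet IDs and the asg_az_findings helper are hypothetical.

```python
# Hypothetical group, shaped like one entry of
# `aws autoscaling describe-auto-scaling-groups` -> AutoScalingGroups[].
GROUP = {
    "AutoScalingGroupName": "web-asg",
    "AvailabilityZones": ["ap-south-1a"],
    "VPCZoneIdentifier": "subnet-aaa",  # comma-separated subnet IDs
    "MinSize": 2,
}


def asg_az_findings(group, min_azs=2):
    """List placement problems for one Auto Scaling group."""
    zones = set(group.get("AvailabilityZones", []))
    subnets = [s for s in group.get("VPCZoneIdentifier", "").split(",") if s]
    findings = []
    if len(zones) < min_azs:
        findings.append(
            f"{len(subnets)} subnet(s) in {len(zones)} AZ(s): add one subnet per AZ"
        )
    if group.get("MinSize", 0) < len(zones):
        findings.append("MinSize is below the AZ count")
    return findings


print(asg_az_findings(GROUP))
```

An empty list means the group spans at least min_azs zones and keeps at least one instance per zone at minimum size.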

3. An AZ loss takes the database down with no failover. Root cause: Single-AZ RDS — there is no standby to promote, so the database simply disappears with its AZ. Confirm: aws rds describe-db-instances --db-instance-identifier <db> --query "DBInstances[].MultiAZ" returns false. Fix: aws rds modify-db-instance --db-instance-identifier <db> --multi-az --apply-immediately. The DB subnet group must contain subnets in ≥2 AZs or the change is rejected. Expect a 60–120 s automatic failover on a real AZ event thereafter.
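Because the modify call is rejected when the DB subnet group covers only one AZ, it can help to check the subnet group first. A minimal sketch, assuming JSON from aws rds describe-db-subnet-groups; the group name, subnet IDs and both helper functions are hypothetical.

```python
# Hypothetical sample, shaped like `aws rds describe-db-subnet-groups` output.
RESPONSE = {
    "DBSubnetGroups": [{
        "DBSubnetGroupName": "app-db",
        "Subnets": [
            {"SubnetIdentifier": "subnet-aaa",
             "SubnetAvailabilityZone": {"Name": "ap-south-1a"}},
            {"SubnetIdentifier": "subnet-aab",
             "SubnetAvailabilityZone": {"Name": "ap-south-1a"}},
        ],
    }]
}


def subnet_group_zones(response):
    """Map each DB subnet group name to the sorted AZ names it covers."""
    return {
        group["DBSubnetGroupName"]: sorted(
            {s["SubnetAvailabilityZone"]["Name"] for s in group["Subnets"]}
        )
        for group in response["DBSubnetGroups"]
    }


def blocks_multi_az(response):
    """Names of subnet groups that cover fewer than two AZs."""
    return [name for name, zones in subnet_group_zones(response).items()
            if len(zones) < 2]


print(blocks_multi_az(RESPONSE))
```

Two subnets in the same AZ, as in the sample, still count as one AZ, which is the case that usually surprises people.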

4. One AZ fails and every instance in every AZ loses outbound internet. Root cause: A single NAT gateway in the failed AZ that all private subnets route through; when its AZ dies, egress dies fleet-wide — an AZ failure amplified into a Region-wide egress outage. Confirm: aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=<vpc>" shows fewer NAT gateways than AZs; aws ec2 describe-route-tables shows multiple AZ subnets pointing at the same NAT. Fix: Deploy one NAT gateway per AZ and give each AZ’s private subnet its own route table pointing at its own zone’s NAT (see also #10).
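The two confirm commands can be joined into a single check: for every subnet, does its default route point at a NAT gateway in its own AZ? The sketch below assumes trimmed JSON from aws ec2 describe-subnets, describe-nat-gateways and describe-route-tables; all IDs and the cross_az_nat_routes helper are hypothetical, and subnets that fall back to the VPC's main route table implicitly are not covered.

```python
# Hypothetical, trimmed samples of the three describe-* outputs.
SUBNETS = [
    {"SubnetId": "subnet-pub-a", "AvailabilityZone": "ap-south-1a"},
    {"SubnetId": "subnet-priv-a", "AvailabilityZone": "ap-south-1a"},
    {"SubnetId": "subnet-priv-b", "AvailabilityZone": "ap-south-1b"},
    {"SubnetId": "subnet-priv-c", "AvailabilityZone": "ap-south-1c"},
]
NAT_GATEWAYS = [{"NatGatewayId": "nat-1", "SubnetId": "subnet-pub-a"}]
ROUTE_TABLES = [{
    "RouteTableId": "rtb-1",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-1"},
    ],
    "Associations": [
        {"SubnetId": "subnet-priv-a"},
        {"SubnetId": "subnet-priv-b"},
        {"SubnetId": "subnet-priv-c"},
    ],
}]


def cross_az_nat_routes(subnets, nat_gateways, route_tables):
    """Return (subnet, subnet AZ, NAT, NAT AZ) where egress leaves the AZ."""
    subnet_az = {s["SubnetId"]: s["AvailabilityZone"] for s in subnets}
    # A NAT gateway lives in the AZ of the subnet it was created in.
    nat_az = {n["NatGatewayId"]: subnet_az[n["SubnetId"]] for n in nat_gateways}
    findings = []
    for table in route_tables:
        nat_ids = [
            r["NatGatewayId"] for r in table.get("Routes", [])
            if r.get("DestinationCidrBlock") == "0.0.0.0/0" and "NatGatewayId" in r
        ]
        for assoc in table.get("Associations", []):
            subnet_id = assoc.get("SubnetId")
            if subnet_id is None:
                continue  # main-table association carries no SubnetId
            for nat_id in nat_ids:
                if nat_az[nat_id] != subnet_az[subnet_id]:
                    findings.append(
                        (subnet_id, subnet_az[subnet_id], nat_id, nat_az[nat_id])
                    )
    return findings


for finding in cross_az_nat_routes(SUBNETS, NAT_GATEWAYS, ROUTE_TABLES):
    print(finding)
```

In the sample, the subnets in ap-south-1b and ap-south-1c are reported because both egress through the NAT gateway in ap-south-1a.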

5. Shared resources across accounts cost more or can’t be co-located. Root cause: You aligned on the AZ name (ap-south-1a), which maps to different physical AZs in different accounts, so “same AZ” wasn’t actually the same zone; that adds cross-AZ charges or breaks same-AZ placement assumptions. Confirm: Run aws ec2 describe-availability-zones --query "AvailabilityZones[].{Name:ZoneName,Id:ZoneId}" in each account and compare the ZoneId. Fix: Coordinate cross-account placement on the AZ ID (aps1-az1 in ap-south-1), not the name.
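Comparing the two accounts by hand is error-prone, so the comparison can be sketched in a few lines. This assumes each list holds the AvailabilityZones entries from aws ec2 describe-availability-zones run in one account; the name-to-ID mappings shown are illustrative, and both helper functions are hypothetical.

```python
# Illustrative mappings for two accounts in the same Region.
ACCOUNT_A = [
    {"ZoneName": "ap-south-1a", "ZoneId": "aps1-az1"},
    {"ZoneName": "ap-south-1b", "ZoneId": "aps1-az3"},
    {"ZoneName": "ap-south-1c", "ZoneId": "aps1-az2"},
]
ACCOUNT_B = [
    {"ZoneName": "ap-south-1a", "ZoneId": "aps1-az3"},
    {"ZoneName": "ap-south-1b", "ZoneId": "aps1-az1"},
    {"ZoneName": "ap-south-1c", "ZoneId": "aps1-az2"},
]


def names_that_disagree(zones_a, zones_b):
    """AZ names that point at different physical zones in the two accounts."""
    ids_a = {z["ZoneName"]: z["ZoneId"] for z in zones_a}
    ids_b = {z["ZoneName"]: z["ZoneId"] for z in zones_b}
    return sorted(n for n in ids_a.keys() & ids_b.keys() if ids_a[n] != ids_b[n])


def zone_name_for_id(zones, zone_id):
    """The account-local AZ name for a given AZ ID, or None if absent."""
    for zone in zones:
        if zone["ZoneId"] == zone_id:
            return zone["ZoneName"]
    return None


print(names_that_disagree(ACCOUNT_A, ACCOUNT_B))
print(zone_name_for_id(ACCOUNT_A, "aps1-az1"), zone_name_for_id(ACCOUNT_B, "aps1-az1"))
```

Agree on the AZ ID first, then let each account translate it to its own local name with zone_name_for_id when creating subnets.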

9. A surprise data-transfer bill after going Multi-AZ. Root cause: Correctly spreading across AZs added cross-AZ data transfer on chatty paths (app↔DB, service mesh), billed per GB each way. Confirm: Cost Explorer → filter on data-transfer usage types for inter-AZ; correlate with your hottest internal paths. Fix: Add gateway VPC endpoints for S3/DynamoDB (makes that traffic free and removes NAT processing), keep the hottest links same-AZ where availability allows, and use private IPs. Do not collapse AZ spread to save transfer — optimize paths, not the AZ count.
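A back-of-envelope estimate shows why "per GB each way" matters on chatty paths. The sketch below assumes a rate of 0.01 USD per GB in each direction, which is an assumption to verify against the current EC2 data-transfer pricing for your Region; the function name and the traffic figure are hypothetical.

```python
def cross_az_monthly_cost(gb_per_day, rate_per_gb_each_way=0.01, days=30):
    """Estimate monthly inter-AZ transfer cost in USD.

    Each GB is charged once leaving the source AZ and once entering the
    destination AZ, so the effective price is twice the one-way rate.
    The default rate is an assumption, not a quoted price.
    """
    return round(gb_per_day * days * rate_per_gb_each_way * 2, 2)


# Hypothetical workload: 500 GB/day between app and database tiers.
print(cross_az_monthly_cost(500))
```

Running the estimate per internal path (app to DB, mesh, replication) makes it easier to see which links are worth keeping same-AZ and which are cheap enough to leave alone.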

Best practices

The alerts and signals worth wiring before the next AZ event — the leading indicators, not just “site down”:

| Watch | Signal / source | What it tells you | Action |
|---|---|---|---|
| Per-AZ target health | ALB HealthyHostCount by AZ | An AZ’s targets draining | Confirm AZ event; verify survivors absorb load |
| AWS health events | AWS Health Dashboard (formerly Personal Health Dashboard) | AWS-reported AZ/Region issues | Match symptoms; avoid blind restarts |
| RDS failover | RDS events / DB instance AvailabilityZone change | Standby promoted in another AZ | Confirm app reconnected to the new active |
| Egress reachability | NAT gateway metrics / synthetic egress check | An AZ’s NAT down | Confirm per-AZ NAT routing held |
| Cross-AZ transfer | Cost Explorer inter-AZ usage | Spend creeping on chatty paths | Add endpoints; review hot links |
| Capacity headroom | ASG desired vs max, per-AZ spread | Whether survivors can absorb load | Raise minimums / max if too tight |
| Spot interruptions by AZ | Spot interruption notices | Capacity stress in an AZ | Diversify instance types/AZs |
| Per-AZ error rate | ALB target 5xx by AZ | An AZ degrading before full failure | Pre-emptively drain / investigate |
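The per-AZ target health and capacity headroom signals combine into one question: if the largest AZ disappeared now, would the remaining healthy targets carry peak load? A minimal sketch, assuming you already collect healthy target counts per AZ (for example from the ALB HealthyHostCount metric); the function name and the numbers are hypothetical.

```python
def survives_any_single_az_loss(healthy_by_az, peak_instances_needed):
    """True if losing any one AZ still leaves enough healthy targets.

    Checking the AZ with the most targets is sufficient: if the fleet
    survives losing that one, it survives losing any smaller one.
    """
    if not healthy_by_az:
        return False
    total = sum(healthy_by_az.values())
    worst_case_remaining = total - max(healthy_by_az.values())
    return worst_case_remaining >= peak_instances_needed


# Hypothetical fleet: six targets spread evenly across three AZs.
fleet = {"ap-south-1a": 2, "ap-south-1b": 2, "ap-south-1c": 2}
print(survives_any_single_az_loss(fleet, peak_instances_needed=4))
print(survives_any_single_az_loss(fleet, peak_instances_needed=5))
```

Alerting when this check turns false gives a leading indicator well before an AZ event turns a thin fleet into an outage.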

Security notes

Resilience and security overlap more than they look — placement decisions are also blast-radius decisions.

The placement-as-security controls in one view:

| Control | Mechanism | Limits blast radius of | Resilience bonus |
|---|---|---|---|
| Private subnets per AZ | Route tables without IGW | Internet exposure of data tier | Multi-AZ without public exposure |
| Gateway VPC endpoints | S3/DynamoDB endpoints | Internet path for AWS-service traffic | Removes cross-AZ/NAT cost for that traffic |
| KMS per-Region keys | Region-scoped CMKs | Cross-Region key compromise | Clean per-Region encryption boundary |
| IAM Region/account scoping | Condition keys, SCPs | Cross-Region/account credential reach | Forces deliberate multi-Region grants |
| Mirrored controls in DR | WAF/SG/GuardDuty in both Regions | A weakly-defended failover target | DR Region is production-equivalent |

Cost & sizing

The bill drivers and how they interact with resilience:

A rough monthly picture for a small production three-tier app in ap-south-1, single-AZ vs proper Multi-AZ:

| Cost driver | Single-AZ (don’t) | Multi-AZ (3 AZ) | Notes |
|---|---|---|---|
| Compute (6× app instances) | ~₹X | ~₹X (same) | Spread, not multiplied: identical cost |
| RDS | 1× instance | ~2× (standby) | The price of automatic failover |
| NAT gateways | 1 | 3 | Per-AZ NAT for resilient egress |
| Cross-AZ data transfer | ~0 | small–moderate | Per GB each way; optimize hot paths |
| ALB | 1 (single subnet) | 1 (three subnets) | Same ALB, no extra cost for more AZs |
| EBS volumes | per-AZ, same count | same count | Snapshots are cross-AZ/Region durable |
| Data egress to internet | same | same | Unchanged by AZ spread |
| Net delta of going Multi-AZ | | standby + 2 NAT + transfer | Typically single-digit % of total bill |

The headline: Multi-AZ is cheap insurance. For Streamly above it added under 9% to the bill, against a multi-hour outage during peak revenue. The expensive choice is single-AZ, which just defers the cost to your worst day. Multi-Region is where real money appears; spend it only when the recovery objectives demand it (the patterns and their cost/RTO/RPO trade-offs are in AWS Backup and Disaster Recovery Strategies).

Interview & exam questions

1. What is the difference between an AWS Region and an Availability Zone? A Region is a geographic area (e.g. ap-south-1) of fully independent infrastructure with its own copy of regional services. An Availability Zone is one or more physically separate datacenters within a Region, with independent power, cooling and network, connected to the other AZs by low-latency private fibre. You design across AZs to survive a datacenter failure and across Regions to survive an area-wide disaster.

2. Why does a subnet belong to exactly one AZ, and why does that matter? A subnet is bound to a single AZ at creation and can’t span two. It matters because your VPC’s AZ coverage is the set of subnets you create — “spread across AZs” concretely means “a subnet per AZ, with resources in each.” An ALB or ASG listing one subnet is single-AZ no matter what else you do.

3. What does RDS Multi-AZ actually provide, and how fast is failover? Multi-AZ provisions a synchronous standby in a second AZ and fails over to it automatically on an AZ or instance failure, typically in 60–120 seconds (faster for Multi-AZ DB clusters). In the classic instance mode the standby doesn’t serve reads — it’s for availability, not read scaling; use read replicas or a Multi-AZ DB cluster for that.

4. Two AZs or three: how do you decide? Two AZs survive one AZ failure for stateless tiers but cannot hold a majority for quorum systems (1 of 2 is no quorum). Three AZs keep a majority (2 of 3) after one failure and need less spare capacity: to carry peak load after losing an AZ you provision 1.5× peak in total across three AZs versus 2× across two. Default to three for anything stateful or consensus-based; two only for simple stateless tiers.
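Both numbers in that answer fall out of simple arithmetic. The sketch below works them through, assuming an equal number of quorum members in each AZ; the function names are hypothetical.

```python
def survives_one_az_loss(az_count, members_per_az=1):
    """True if a majority of quorum members remains after losing one AZ."""
    total = az_count * members_per_az
    majority = total // 2 + 1
    remaining = total - members_per_az
    return remaining >= majority


def overprovision_factor(az_count):
    """Total capacity, as a multiple of peak, needed to lose one AZ.

    The surviving (az_count - 1) zones must carry 100% of peak, so each
    zone holds 1 / (az_count - 1) of peak and the fleet holds az_count
    times that.
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return az_count / (az_count - 1)


for azs in (2, 3):
    print(azs, survives_one_az_loss(azs), overprovision_factor(azs))
```

With two AZs the quorum check fails and the fleet needs 2× peak; with three it passes and the fleet needs 1.5×.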

5. Why are AZ names randomized per account, and when do you use AZ IDs? AWS maps the friendly name (ap-south-1a) to a different physical AZ per account to spread load evenly. The AZ ID (aps1-az1) is stable across accounts. Use AZ IDs when coordinating cross-account placement (shared VPCs, PrivateLink, peering) to land resources in the same physical AZ and avoid cross-AZ charges.

6. A team has an ALB, Auto Scaling and managed RDS but still went fully down in an AZ event. What happened? Almost certainly “fake HA”: the ALB was on one subnet, the ASG’s VPCZoneIdentifier listed one subnet, and RDS was single-AZ — every tier shared one AZ. Confirm with describe-load-balancers, describe-auto-scaling-groups and describe-db-instances. The fix is structural — list multiple AZ subnets everywhere and enable Multi-AZ RDS.

7. How is cross-AZ data transfer billed, and what’s free? Traffic crossing an AZ boundary is billed per GB in each direction. Same-AZ traffic by private IP is free, as is in-Region traffic to S3/DynamoDB via gateway endpoints. ALB cross-zone isn’t separately billed; NLB cross-zone is billed as cross-AZ (why it’s off by default). Optimize the hot paths, but never drop the AZ spread you need for availability.

8. Is scaling out across more AZs the fix for a single-AZ NAT gateway outage? No — the issue is that a single NAT gateway in the failed AZ carried all egress. The fix is one NAT gateway per AZ with per-AZ route tables, so each AZ’s private subnets egress through their own zone’s NAT and one AZ’s failure can’t kill fleet-wide egress.

9. When is multi-Region genuinely warranted, and what comes first? When your RTO/RPO requirements exceed what a single Region can offer, or you must serve a global user base with low latency, or meet data-residency rules in another geography. Multi-AZ must be solid first — multi-Region built on single-AZ tiers makes you fail over for small AZ incidents at huge RTO cost.

10. What’s the difference between an edge location, a Local Zone and a Region for resilience? An edge location (CloudFront/Global Accelerator PoP) is for latency and DDoS, not availability. A Local Zone places compute in a metro tied to a parent Region for low latency — still not a separate DR domain. Only a Region (and AZs within it) is a true resilience boundary. Don’t confuse a CDN with high availability.

11. Which AWS data services are multi-AZ by default, and which need configuring? S3 Standard (≥3 AZ) and DynamoDB (≥3 AZ) are multi-AZ for free, automatically. RDS needs Multi-AZ enabled; ElastiCache needs Multi-AZ turned on; EFS needs a mount target per AZ. The single-AZ exception to watch is S3 One Zone-IA, which deliberately stores in one AZ.

12. How do you choose an AWS Region? On four criteria, roughly in order: latency to your users (measure it), service availability (not every service is in every Region), data residency / compliance (legal constraints on where data lives), and cost (per-unit prices vary by Region). Never default to us-east-1 just because the console opens there.

These map cleanly to AWS Certified Cloud Practitioner (CLF-C02): global infrastructure, Regions/AZs/edge, the shared responsibility model and the reliability pillar. They also map to Solutions Architect Associate (SAA-C03): designing resilient, multi-AZ and multi-Region architectures, choosing Regions, and Multi-AZ data services. The operational confirm-and-fix material maps to SysOps Associate (SOA-C02). A compact cert-mapping for revision:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Region vs AZ vs edge, global infrastructure | CLF-C02 | Cloud concepts; global infrastructure |
| Multi-AZ design, choosing a Region | SAA-C03 | Design resilient architectures |
| Multi-AZ data services (RDS/Aurora/DynamoDB/S3) | SAA-C03 | Design high-availability/storage solutions |
| Multi-Region DR patterns (pilot light/warm standby) | SAA-C03 / SAP-C02 | Design for reliability and DR |
| AZ-failure diagnosis, per-AZ monitoring, NAT routing | SOA-C02 | Reliability and business continuity; networking |
| Cross-AZ cost, data-transfer optimization | SAA-C03 | Cost-optimized architectures |

Quick check

  1. You have an ALB, an Auto Scaling group and managed RDS, yet a single AZ event took the whole application down. Name the most likely root cause and the one command that confirms it for the ALB.
  2. True or false: running three EC2 instances spread across three AZs costs significantly more than running three in one AZ.
  3. Why do quorum-based systems (etcd, ZooKeeper, many distributed databases) need three AZs rather than two?
  4. Your application loses all outbound internet access — across every AZ — when one AZ fails. What’s the cause and the fix?
  5. Two AWS accounts both want resources in “the same AZ” to avoid cross-AZ charges, but they keep landing in different physical zones. What are they doing wrong?

Answers

  1. Fake HA — the ALB, ASG and/or RDS are each pinned to a single AZ (one subnet). Confirm the ALB with aws elbv2 describe-load-balancers --names <alb> --query "LoadBalancers[].AvailabilityZones[].ZoneName"; one zone means single-AZ. The fix is to attach the ALB to subnets in ≥2 (ideally 3) AZs, list multiple subnets on the ASG, and enable Multi-AZ RDS.
  2. False. You pay per instance-hour, not per AZ — three instances cost the same whether they’re in one AZ or spread across three. The only Multi-AZ costs are cross-AZ data transfer, a paid RDS standby, and (optionally) a NAT gateway per AZ; compute itself is unchanged.
  3. Quorum needs a majority to make consistent decisions. With two AZs, losing one leaves 1 of 2 — no majority, so the cluster stops accepting writes (or goes read-only). With three AZs, losing one leaves 2 of 3 — a majority — so it keeps operating. Three AZs preserve quorum through a single AZ failure.
  4. A single NAT gateway in the failed AZ carried egress for every private subnet, so its AZ’s failure killed fleet-wide outbound internet. Fix: deploy one NAT gateway per AZ and give each AZ’s private subnet its own route table pointing at its own zone’s NAT.
  5. They’re aligning on the AZ name (ap-south-1a), which AWS maps to a different physical AZ in each account. They should coordinate on the stable AZ ID (aps1-az1), visible via aws ec2 describe-availability-zones --query "AvailabilityZones[].{Name:ZoneName,Id:ZoneId}".

Glossary

Next steps

You can now place every tier across AZs correctly and know exactly what each layer survives. Build outward:
