AWS Resiliency

AWS Backup and Disaster Recovery: Protect Workloads Across Regions

Quick take: a backup is not a DR strategy. AWS Backup gives you centralized, policy-driven copies; disaster recovery is the separate discipline of deciding how fast you must recover (RTO), how much data you can lose (RPO), and how traffic cuts over when a Region burns. Most “DR plans” are a nightly job and a hope — they fail the first time they are actually needed.

A healthcare SaaS backed its RDS database up nightly, copied the dumps to S3, and told its board it had “DR covered.” When us-east-1 had a multi-hour control-plane event, the team discovered four things in the worst possible order: the restore took four hours because it rehydrated a 900 GB snapshot cold; there were no AMIs for the app tier in any other Region, so compute had to be rebuilt by hand; the security groups and IAM roles the app needed did not exist in the failover Region; and DNS failover was manual, gated on a person waking up to flip a record with a 3,600-second TTL that then pinned resolvers for an hour. Their measured recovery was nine hours. Rebuilt properly — AWS Backup with cross-Region copy actions, an RDS cross-Region read replica, pre-staged launch templates, and Route 53 health-check failover — the same outage became a 22-minute event with under a minute of data loss.

This article is the blueprint to get there. We treat backup and DR as two coupled but distinct problems. Backup answers “can I get this specific data back?” — and AWS Backup is the centralized service that schedules, encrypts, copies, retains and (critically) locks recovery points for EC2/EBS, RDS, Aurora, DynamoDB, EFS, FSx, S3 and more from one policy plane. DR answers “can I keep the business running when an entire Region or account is gone?” — and that is a spectrum of patterns (Backup & Restore → Pilot Light → Warm Standby → Active/Active) each trading cost against recovery speed. You will learn to set an RTO/RPO target with the business, pick the cheapest pattern that meets it, build the AWS Backup plan and copy chain that feeds it, harden the destination against ransomware with Vault Lock, and wire the Route 53 cutover that actually flips traffic. Every mechanism gets both an aws CLI snippet and a Terraform snippet, the real limits and error codes, and — because you will read this mid-incident — the playbook itself is a table.

By the end you will stop confusing “we have backups” with “we can recover.” You will know whether your workload needs continuous PITR or a nightly snapshot, whether it needs a warm standby or can tolerate a four-hour rebuild, exactly which KMS grant a cross-Region restore needs, and how to prove all of it with a restore test before the outage proves it for you.

What problem this solves

Backups protect against data loss: a dropped table, ransomware encryption, a bad migration, an rm -rf against the wrong bucket. Disaster recovery protects against prolonged loss of service: an Availability Zone power event, a Region-wide control-plane degradation, an account compromise, or a fat-fingered terraform destroy that takes the whole stack. They are different failure domains and they need different controls, and the classic production mistake is using one word (“backup”) to claim coverage of both.

What breaks without this discipline: a team backs up the database religiously and forgets the application is stateless-but-undeployable in a second Region (no AMIs, no launch templates, no IaC parameters for the new Region’s subnets). Or they keep all recovery points in the same account and Region as production, so a compromised root credential or a Region outage takes the backups with the primary. Or they never test a restore, so the four-hour rehydration time of a cold S3-Glacier snapshot is discovered live, blowing a one-hour RTO promise by 300%. Or in-flight transactions are lost on database promotion because the RPO was never actually measured against the replication lag. Each of these is a real incident pattern, and each is preventable with a deliberate design.

Who hits this hardest: regulated workloads (healthcare, finance, public sector) with contractual RTO/RPO and immutability mandates; cost-sensitive teams who over-build Active/Active for an internal tool that could tolerate a day of downtime, or under-build Backup & Restore for a revenue-critical checkout that cannot; and anyone who has never run a DR game day, because runbooks that are never rehearsed fail under pressure exactly when they are needed. The fix is never “buy more backup storage.” It is “decide the target, pick the matching pattern, automate the cutover, and prove it on a schedule.”

To frame the whole field before the deep dive, here is the spectrum of DR patterns, what each costs, and the recovery it buys:

| DR pattern | What runs in the second Region | Typical RTO | Typical RPO | Relative steady-state cost | Use when |
|---|---|---|---|---|---|
| Backup & Restore | Nothing: only recovery points in a vault | Hours (rehydrate + provision) | Hours (last backup) | Lowest (storage only) | Non-critical workloads; a day of downtime is tolerable |
| Pilot Light | Data replicating (DB replica, S3 CRR); compute off | Tens of minutes | Seconds–minutes | Low (data + minimal infra) | Core data must survive; compute can be scaled from zero |
| Warm Standby | Scaled-down but running full stack | Minutes | Seconds | Medium (always-on small stack) | Revenue-affecting; a few minutes’ downtime is the cap |
| Active/Active (multi-Region) | Full stack serving live traffic | Seconds (near-zero) | Near-zero | Highest (two live stacks + data sync) | Zero-downtime mandate; global low-latency |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand AWS account and Region structure — that a Region is an isolated geography of Availability Zones, and that most failures you design for are AZ-level (handled by Multi-AZ) while the rare catastrophic one is Region-level (handled by cross-Region DR). You should be comfortable running the aws CLI with named profiles, reading JSON output, and reasoning about IAM roles, KMS keys, and VPC subnets/security groups. Familiarity with CloudFormation/Terraform matters, because the single biggest DR failure — “the data was safe but there was nothing to restore it onto” — is solved by infrastructure-as-code, not by the backup service.

This sits in the Resiliency track. It assumes the Region/AZ fundamentals from AWS Regions and Availability Zones Explained and the storage-class mechanics from AWS S3 Storage Classes and Lifecycle, since lifecycle-to-cold-storage governs both backup cost and restore time. It pairs tightly with AWS RDS, DynamoDB and Aurora Compared (the database you protect dictates your RPO floor) and Aurora High Availability and Global Database (the lowest-RPO database DR option). For the centralized, org-wide version of everything here — delegated admin, StackSet vault bootstrap, air-gapped accounts — go to Org-wide AWS Backup with Vault Lock and Cross-Account Recovery. For the lift-and-shift, near-zero-RPO server DR alternative, see AWS Elastic Disaster Recovery (DRS) Cross-Region Failover.

A quick map of who owns each layer of a recovery, so you escalate to the right person fast during an incident:

| Layer | What lives here | Who usually owns it | Failure class it causes |
|---|---|---|---|
| Backup policy / schedule | AWS Backup plans, rules, copy actions | Platform / SRE | RPO miss (schedule too loose), copy never lands |
| Recovery points / vaults | Snapshots, PITR windows, Vault Lock | Platform / security | Deleted/ransomed backups; lock billing forever |
| Encryption (KMS) | Source + destination CMKs, grants | Security | Restore aborts “KMS key cannot be accessed” |
| Compute templates | AMIs, launch templates, ASGs, IaC | App / platform | RTO blowout: nothing to restore onto |
| Database failover | Replica promotion, PITR restore | DBA / platform | Lost in-flight transactions; long rehydrate |
| Traffic cutover | Route 53, health checks, TTL | Network / SRE | DNS won’t flip; resolvers pinned to dead Region |
| Orchestration | Step Functions / SSM runbook | SRE | Manual steps fail under pressure |

Core concepts

Six mental models make every later decision obvious.

RTO and RPO are business numbers, set first, that bound everything else. RTO (Recovery Time Objective) is the maximum tolerable time to restore service. RPO (Recovery Point Objective) is the maximum tolerable data loss, measured in time (e.g. “we can lose 5 minutes of writes”). You do not pick a backup frequency and discover your RPO; you agree the RPO with the business and derive the backup frequency from it. A 5-minute RPO forbids nightly snapshots — it demands continuous backups (PITR) or synchronous replication. A 1-hour RTO forbids a cold 900 GB rehydrate — it demands a warm replica or a pre-provisioned stack.

Backup ≠ replication ≠ DR. A backup is a point-in-time copy retained for restore (recoverable from corruption you only notice later). Replication continuously mirrors current state to another location (great RPO, but it faithfully replicates corruption too — a dropped table replicates instantly). DR is the orchestration that uses backups and/or replicas to restore service, including compute, network, identity and DNS. You need backups for corruption, replication for low RPO, and DR orchestration to tie them into an actual recovery. Confusing them is the root of most “we had backups but couldn’t recover” stories.

AWS Backup is a control plane, not the storage. AWS Backup orchestrates native snapshot/backup mechanisms across services from one place: a backup plan (schedule + lifecycle + retention + copy), a backup vault (the container where recovery points land, encrypted by a KMS key), resource assignments (tag- or ARN-based selection of what to protect), and copy actions (push a recovery point to another vault, Region, or account). The actual bytes are EBS snapshots, RDS snapshots, DynamoDB backups, etc. — AWS Backup schedules and governs them; it does not invent a new storage format.

The destination must exist before the disaster. Cross-Region copy writes to a vault that you must pre-create in the destination Region, encrypted by a key whose policy allows the copy. Restoring compute needs AMIs and launch templates present in the destination Region, plus the VPC, subnets, security groups and IAM roles the workload expects. None of this is created for you at failover time. The recurring catastrophic failure is a perfectly safe recovery point with nowhere to land — data without a target is not recoverable in your RTO.

Immutability is the ransomware control. A recovery point an attacker (or a careless admin) can delete is not a safe backup. AWS Backup Vault Lock makes recovery points immutable for a retention period — in compliance mode, no one, including the AWS account root and AWS itself, can delete them or shorten retention until they expire. Pair that with a separate account (so a compromise of production cannot reach the backups) and a separate Region (so a Region event cannot), and you have a true air gap. Without it, your “backups” are deletable by whoever pops your account.

You don’t have a DR plan until you’ve tested a restore. A backup you have never restored is a hypothesis. AWS Backup’s restore testing runs scheduled restores into an isolated environment and validates them; a game day rehearses the full human runbook. The metrics that matter are measured RTO/RPO from a real restore, not the theoretical ones in a slide. Untested DR is the single most common reason recoveries fail.
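To make “measured, not theoretical” concrete, here is a minimal sketch of how a game day could turn three observed timestamps into a measured RTO and RPO. The function name and the example timestamps are hypothetical; only the definitions of RTO and RPO come from the text above.

```python
from datetime import datetime, timedelta, timezone


def measured_rto_rpo(outage_start, service_restored, last_recoverable_write):
    """Return (rto, rpo) as timedeltas from three observed timestamps."""
    if service_restored < outage_start:
        raise ValueError("service_restored precedes outage_start")
    if last_recoverable_write > outage_start:
        raise ValueError("last_recoverable_write is after outage_start")
    rto = service_restored - outage_start        # how long service was down
    rpo = outage_start - last_recoverable_write  # how much recent data was lost
    return rto, rpo


# Hypothetical game-day timestamps (UTC), not taken from a real incident.
t0 = datetime(2026, 6, 22, 14, 0, 0, tzinfo=timezone.utc)
rto, rpo = measured_rto_rpo(
    outage_start=t0,
    service_restored=t0 + timedelta(minutes=22),
    last_recoverable_write=t0 - timedelta(seconds=45),
)
print(f"measured RTO = {rto}, measured RPO = {rpo}")
```

Record these two numbers after every restore test and compare them with the agreed targets; the gap between them is what the next iteration of the runbook has to close.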

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to DR |
|---|---|---|---|
| RTO | Max tolerable time to restore service | Business agreement | Caps your provisioning + rehydrate time |
| RPO | Max tolerable data loss (in time) | Business agreement | Sets backup frequency / replication mode |
| AWS Backup plan | Schedule + lifecycle + retention + copy | AWS Backup | The policy that drives every backup |
| Backup vault | KMS-encrypted container for recovery points | A Region + account | Where copies land; what you lock |
| Recovery point | One point-in-time backup of a resource | In a vault | The thing you restore from |
| Copy action | Push a recovery point to another vault/Region/account | Backup rule | Cross-Region / cross-account DR |
| Continuous backup (PITR) | Restore to any second in a window | RDS/Aurora/DynamoDB/S3 | Sub-5-min RPO |
| Vault Lock | Immutability (governance / compliance) | A vault | Ransomware / accidental-delete protection |
| Cross-Region read replica | A live DB replica in another Region | RDS/Aurora | Pilot-light / warm-standby data tier |
| Route 53 failover | DNS routing to a healthy endpoint | Global | The actual traffic cutover |
| Pilot Light / Warm Standby | DR patterns (data warm, compute off / small) | Architecture | The RTO/cost trade-off you pick |
| Restore testing | Scheduled, validated restores | AWS Backup | Proves RTO/RPO before the outage |

RTO, RPO, and choosing a DR pattern

Everything starts here. Pick the wrong target and you either over-spend on Active/Active for a back-office tool or under-build Backup & Restore for revenue-critical checkout. The four patterns are a cost-versus-speed spectrum; you choose the cheapest one that meets the agreed RTO/RPO, per workload, not per company.

The four patterns, option by option

The detailed trade-off — what each pattern actually provisions, what drives its cost, and where it breaks:

| Pattern | Data tier | Compute tier | Network/DNS | RTO driver | RPO driver | Where it bites |
|---|---|---|---|---|---|---|
| Backup & Restore | Recovery points in a DR vault | None until failover | Create/flip at failover | Provision + rehydrate time | Backup interval | Cold rehydrate is slow; IaC must exist |
| Pilot Light | Live DB replica + S3 CRR | Off (AMIs/templates staged) | Pre-created, weights flipped | Scale-from-zero + promote | Replica lag (seconds) | Scale-out cold start; capacity in DR AZ |
| Warm Standby | Live replica | Small but running | Pre-wired, low TTL | Scale-up + promote | Replica lag | Pay for idle stack; drift vs primary |
| Active/Active | Multi-Region writes (Global DB / global tables) | Full, serving traffic | Latency/geo routing | Near-zero | Near-zero | Cost; write-conflict + data consistency |

A decision table — read your constraint, get your pattern:

| If the business says… | RTO/RPO it implies | Pattern that fits | Don’t over/under-build with |
|---|---|---|---|
| “Internal tool, a day down is fine” | RTO hours, RPO hours | Backup & Restore | Warm Standby (wasted idle cost) |
| “Customer-facing, recover within ~30 min” | RTO ~30 min, RPO minutes | Pilot Light | Backup & Restore (rehydrate too slow) |
| “Revenue site, only minutes of downtime” | RTO minutes, RPO seconds | Warm Standby | Backup & Restore (misses RTO) |
| “Zero downtime, global users” | RTO ~0, RPO ~0 | Active/Active | Warm Standby (single-Region writes) |
| “We must never lose a committed write” | RPO ~0 | Synchronous Multi-AZ in-Region; Aurora Global Database cross-Region (asynchronous, typically ~1 s lag, so near-zero rather than zero) | Snapshot-only (loses in-flight) |
| “Ransomware/compliance immutability” | + immutable copy | Any + Vault Lock + isolated account | Same-account vault (no air gap) |
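The decision table can be read as “walk the patterns from cheapest to most expensive and stop at the first one that meets both targets.” A minimal sketch of that rule follows. The achievable RTO/RPO figures in it are illustrative assumptions loosely based on the tables above, not AWS guarantees, and `cheapest_pattern` is a hypothetical helper name.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    name: str
    achievable_rto_min: float  # assumed typical recovery time, in minutes
    achievable_rpo_min: float  # assumed typical data loss, in minutes


# Ordered cheapest first. The numbers are illustrative assumptions;
# replace them with values measured in your own restore tests.
PATTERNS = [
    Pattern("Backup & Restore", 240, 24 * 60),
    Pattern("Pilot Light", 30, 5),
    Pattern("Warm Standby", 5, 1),
    Pattern("Active/Active", 0, 0),
]


def cheapest_pattern(target_rto_min: float, target_rpo_min: float) -> str:
    """Return the first (cheapest) pattern that meets both targets."""
    for p in PATTERNS:
        if (p.achievable_rto_min <= target_rto_min
                and p.achievable_rpo_min <= target_rpo_min):
            return p.name
    # Only reached for nonsensical negative targets.
    return PATTERNS[-1].name


print(cheapest_pattern(target_rto_min=30, target_rpo_min=5))
```

Run it per workload, not per company: the same function returns a different answer for an internal wiki than for checkout.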

The same patterns mapped to a realistic cost multiple and the AWS building blocks that implement them:

| Pattern | Steady-state cost (relative) | Primary building blocks | Failover human steps |
|---|---|---|---|
| Backup & Restore | 1× (storage only) | AWS Backup plan + cross-Region copy; IaC | Provision stack → restore → flip DNS |
| Pilot Light | ~2–3× | + cross-Region replica; staged AMIs/LTs | Promote replica → scale ASG → flip DNS |
| Warm Standby | ~4–6× | + always-on small ASG + ALB in DR | Scale up → promote → flip DNS |
| Active/Active | ~8–12× | + Aurora Global / DynamoDB global tables | (Automatic) shift weights |
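For a first budget conversation, the multiples can be turned into a rough monthly range. This is a sketch only: it assumes the Backup & Restore cost is your 1× baseline, and the multiples are the approximate ones from the table, not a pricing model.

```python
# Relative steady-state multiples from the table above, as (low, high).
COST_MULTIPLE = {
    "Backup & Restore": (1, 1),
    "Pilot Light": (2, 3),
    "Warm Standby": (4, 6),
    "Active/Active": (8, 12),
}


def monthly_cost_range(backup_restore_cost: float, pattern: str):
    """Scale the Backup & Restore baseline by the pattern's rough multiple."""
    low, high = COST_MULTIPLE[pattern]
    return backup_restore_cost * low, backup_restore_cost * high


# Hypothetical baseline of 1,000 (any currency) per month.
print(monthly_cost_range(1000, "Warm Standby"))
```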

Translating RPO into a backup frequency

The mechanical link people miss: your backup/replication mechanism sets a floor on achievable RPO. You cannot promise a tighter RPO than your mechanism allows:

| Mechanism | RPO floor (best achievable) | How RPO is set | Cost note |
|---|---|---|---|
| Nightly snapshot (cron 1×/day) | ~24 h | Schedule interval | Cheapest; loosest |
| Hourly snapshot | ~1 h | Schedule interval | More storage, more API calls |
| RDS/Aurora continuous backup (PITR) | ~5 min (typically) | Transaction-log shipping | Included; bounded by log frequency |
| DynamoDB PITR | ~5 min | Continuous log | Per-GB charge for PITR |
| S3 versioning + CRR | Seconds–minutes | Async replication lag | Replication + storage cost |
| Aurora Global Database | Typically ~1 s | Storage-level async replication | Cross-Region replica cost |
| RDS Multi-AZ (sync, same Region) | 0 (no loss), but not cross-Region DR | Synchronous standby | Standby instance cost |

Read this twice: Multi-AZ gives RPO 0 but is not DR — it survives an AZ, not a Region. Cross-Region DR trades a small RPO (replica lag) for surviving the Region. Combine them: Multi-AZ for the common AZ failure, cross-Region replica/copy for the rare Region failure.
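The “floor” idea is easy to encode: filter the mechanisms by RPO, then filter again by whether they survive a Region. The sketch below uses the floors from the table; it assumes snapshots are paired with a cross-Region copy action, marks in-Region PITR and Multi-AZ as same-Region only, and leaves out S3 CRR because its floor is a range rather than a single number. All names are hypothetical.

```python
from typing import NamedTuple


class Mechanism(NamedTuple):
    name: str
    rpo_floor_s: int        # best achievable RPO, in seconds
    same_region_only: bool  # True when it cannot help in a Region outage


MECHANISMS = [
    Mechanism("Nightly snapshot", 24 * 3600, False),  # assumes a cross-Region copy action
    Mechanism("Hourly snapshot", 3600, False),        # assumes a cross-Region copy action
    Mechanism("RDS/Aurora PITR", 300, True),
    Mechanism("DynamoDB PITR", 300, True),
    Mechanism("Aurora Global Database", 1, False),
    Mechanism("RDS Multi-AZ", 0, True),
]


def candidates(target_rpo_s: int, need_region_dr: bool):
    """List mechanisms whose floor meets the target, in table order."""
    return [
        m.name
        for m in MECHANISMS
        if m.rpo_floor_s <= target_rpo_s
        and not (need_region_dr and m.same_region_only)
    ]


print(candidates(target_rpo_s=0, need_region_dr=False))  # AZ-level only
print(candidates(target_rpo_s=0, need_region_dr=True))   # nothing qualifies
```

The empty second result is the point of the paragraph above: with these mechanisms, a strict zero RPO and Region survival cannot both be met, so the business has to accept a small replica lag.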

Building AWS Backup plans

A backup plan is the policy engine: one or more backup rules, each with a schedule, a target vault, lifecycle (transition to cold + expiration), and optional copy actions to other vaults/Regions/accounts. Resource assignments select what the plan protects — by tag (the scalable way) or by ARN. AWS Backup then runs the native snapshot for each resource on schedule.

The plan, rule by rule

Create a plan with a daily rule that copies cross-Region, transitions to cold storage, and retains for a year:

# 1) A backup vault in the PRIMARY region (encrypted by a customer-managed KMS key)
aws backup create-backup-vault \
  --backup-vault-name prod-local-vault \
  --encryption-key-arn arn:aws:kms:us-east-1:111122223333:key/abcd-1234 \
  --region us-east-1

# 2) A plan with a daily rule: 5am UTC, cold after 30 days, expire after 365,
#    plus a cross-Region copy to us-west-2
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "prod-daily-dr",
  "Rules": [{
    "RuleName": "daily-5am-crossregion",
    "TargetBackupVaultName": "prod-local-vault",
    "ScheduleExpression": "cron(0 5 * * ? *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 180,
    "Lifecycle": { "MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365 },
    "CopyActions": [{
      "DestinationBackupVaultArn": "arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
      "Lifecycle": { "MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365 }
    }]
  }]
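}'

# Terraform: vault + plan + tag-based selection
# (aws_kms_key.backup, aws_iam_role.backup and aws_backup_vault.dr are assumed to be defined elsewhere)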
resource "aws_backup_vault" "local" {
  name        = "prod-local-vault"
  kms_key_arn = aws_kms_key.backup.arn
}

resource "aws_backup_plan" "prod" {
  name = "prod-daily-dr"
  rule {
    rule_name         = "daily-5am-crossregion"
    target_vault_name = aws_backup_vault.local.name
    schedule          = "cron(0 5 * * ? *)"
    start_window      = 60
    completion_window = 180
    lifecycle {
      cold_storage_after = 30
      delete_after       = 365
    }
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn # vault in us-west-2 provider alias
      lifecycle {
        cold_storage_after = 30
        delete_after       = 365
      }
    }
  }
}

resource "aws_backup_selection" "by_tag" {
  name         = "dr-tier-resources"
  plan_id      = aws_backup_plan.prod.id
  iam_role_arn = aws_iam_role.backup.arn
  selection_tag {
    type  = "STRINGEQUALS"
    key   = "dr-tier"
    value = "critical"
  }
}

Every field on a backup rule, what it controls, its default/limit, and when to change it:

| Rule field | What it controls | Default / limit | When to change | Gotcha |
|---|---|---|---|---|
| ScheduleExpression | When the backup runs | cron/rate; min ~1 h between runs | Tighten for lower RPO | Sub-hour RPO needs PITR, not more crons |
| StartWindowMinutes | How long AWS Backup waits to start | Optional (console default is 8 h); min 60 if set | Widen for busy schedules | Job expires if not started in window |
| CompletionWindowMinutes | Max time the job may run | Must exceed start window | Large datasets | Job fails if it overruns |
| Lifecycle.MoveToColdStorageAfterDays | Transition to cold storage | Optional; min 1 | Cut storage cost | Once cold, a recovery point must stay cold for at least 90 days |
| Lifecycle.DeleteAfterDays | Retention / expiry | Optional | Compliance retention | Must be ≥ cold + 90 days |
| CopyActions[].DestinationBackupVaultArn | Cross-Region/account copy target | None | Any real DR | Destination vault must pre-exist |
| RecoveryPointTags | Tags on the recovery point | None | Cost allocation, automation | Useful for restore-testing selection |
| EnableContinuousBackup | PITR for supported resources | false | Sub-5-min RPO (RDS/S3) | Only some resource types support it |
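Several of these limits can be checked before a plan ever reaches the API. A minimal pre-flight sketch, assuming only the constraints listed in the table (60-minute minimum start window, completion window longer than the start window, and delete at least 90 days after the cold transition); `rule_problems` is a hypothetical helper, not part of any AWS SDK.

```python
from typing import Optional

MIN_START_WINDOW_MIN = 60  # minimum stated in the table above
MIN_COLD_DAYS = 90         # a recovery point must stay cold at least this long


def rule_problems(
    start_window_min: int,
    completion_window_min: int,
    cold_after_days: Optional[int],
    delete_after_days: Optional[int],
):
    """Return a list of human-readable problems; empty means the rule looks sane."""
    problems = []
    if start_window_min < MIN_START_WINDOW_MIN:
        problems.append("StartWindowMinutes must be at least 60")
    if completion_window_min <= start_window_min:
        problems.append("CompletionWindowMinutes must exceed StartWindowMinutes")
    if cold_after_days is not None and delete_after_days is not None:
        if delete_after_days < cold_after_days + MIN_COLD_DAYS:
            problems.append("DeleteAfterDays must be >= MoveToColdStorageAfterDays + 90")
    return problems


# The rule from the plan above: 60/180-minute windows, cold at 30, delete at 365.
print(rule_problems(60, 180, 30, 365))
# A lifecycle that violates the 90-day rule.
print(rule_problems(60, 180, 30, 100))
```

Running a check like this in CI catches the lifecycle ValidationException class of failure before the plan is applied.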

The backup-vault knobs (separate from the plan), and what each is for:

| Vault setting | What it does | When to set | Limit / note |
|---|---|---|---|
| EncryptionKeyArn | KMS key encrypting recovery points | Always (use a CMK, not AWS-managed) | Cross-account/Region copy needs key-policy grants |
| Access policy (resource policy) | Who/what can use the vault | Restrict deletes; allow CopyIntoBackupVault | Required to receive cross-account copies |
| Vault Lock | Immutability | Compliance/ransomware needs | Compliance mode is irreversible |
| Notifications (SNS) | Job state events | Always (alert on failures) | Wire BACKUP_JOB_FAILED, COPY_JOB_FAILED |
| Vault type (Backup vault vs Logically air-gapped vault) | Standard vs shareable, isolated, always-immutable | High-assurance recovery | LAG vault is immutable by design, shareable via RAM |

Selecting what to protect (tags beat ARNs)

Tag-based selection scales — tag a resource dr-tier=critical and it is automatically in the plan, no plan edit on each new resource:

aws backup create-backup-selection \
  --backup-plan-id <plan-id> \
  --backup-selection '{
    "SelectionName": "dr-tier-critical",
    "IamRoleArn": "arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole",
    "ListOfTags": [
      { "ConditionType": "STRINGEQUALS", "ConditionKey": "dr-tier", "ConditionValue": "critical" }
    ]
  }'

Selection strategies compared:

| Selection method | How it scales | Best for | Risk |
|---|---|---|---|
| By tag (ListOfTags) | Automatic: new tagged resources join | Fleets, dynamic infra | A missing tag = silently unprotected |
| By ARN (explicit list) | Manual: edit on every new resource | A few critical, named resources | Drift; forgotten resources |
| By resource type + condition | Type-wide with tag filter | “All RDS tagged prod” | Broad blast radius if mis-scoped |
| Combined (type AND tag) | Precise | Compliance-scoped protection | More complex policy |

A guardrail: tag-based protection is only as good as your tagging discipline. Use an AWS Config rule (for example the managed required-tags rule) to flag resources missing dr-tier, or a tag policy or SCP to block creating them untagged; otherwise untagged critical resources silently fall out of every backup plan.
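The audit itself is a few lines once you have an inventory. The sketch below assumes you have already exported resources as dicts with `arn` and `tags` keys (a hypothetical shape; the ARNs are made up) and simply reports anything without a non-empty dr-tier tag.

```python
def missing_dr_tier(resources, tag_key="dr-tier"):
    """Return the ARNs of resources that carry no usable dr-tier tag."""
    return [
        r["arn"]
        for r in resources
        if not r.get("tags", {}).get(tag_key, "").strip()
    ]


# Hypothetical inventory; the ARNs and tag values are made up for illustration.
inventory = [
    {"arn": "arn:aws:rds:us-east-1:111122223333:db:app-db-primary",
     "tags": {"dr-tier": "critical"}},
    {"arn": "arn:aws:ec2:us-east-1:111122223333:volume/vol-0abc",
     "tags": {"env": "prod"}},
    {"arn": "arn:aws:dynamodb:us-east-1:111122223333:table/orders",
     "tags": {"dr-tier": ""}},
]
print(missing_dr_tier(inventory))
```

An empty tag value is treated as missing on purpose: a tag-based selection matching dr-tier=critical would skip it just the same.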

Backup job states and the error reference

Every backup, copy and restore job moves through a state machine; knowing the terminal states tells you instantly whether you have a recovery point. The job lifecycle:

| Job state | Meaning | What to do |
|---|---|---|
| CREATED | Job accepted, not yet started | Wait; within the start window |
| PENDING | Queued, waiting on dependencies/throttle | Wait; check service quotas if stuck |
| RUNNING | Snapshot/copy/restore in progress | Watch PercentDone |
| COMPLETED | Recovery point created/copied/restored | Verify it landed where expected |
| ABORTED | Canceled before completing (for example stopped by a user); a missed start window normally shows as EXPIRED | Check who or what stopped it; rerun if a recovery point is still needed |
| EXPIRED | Didn’t start before the window closed | Widen StartWindowMinutes |
| FAILED | Job errored | Read StatusMessage; fix per error table |
| PARTIAL | Some resources in a selection failed | Inspect per-resource job detail |
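Since only COMPLETED leaves a recovery point behind, the useful daily question is “which resources have no COMPLETED job inside the RPO?” A minimal sketch, assuming job records have been flattened into dicts with hypothetical keys `resource`, `state` and `completed_at` (with boto3 these would map from the ResourceArn, State and CompletionDate fields returned by list_backup_jobs):

```python
from datetime import datetime, timedelta, timezone


def resources_past_rpo(jobs, rpo, now):
    """Return resources whose newest COMPLETED job is older than `rpo`.

    Resources with no COMPLETED job at all are reported too.
    """
    newest = {}
    for job in jobs:
        newest.setdefault(job["resource"], None)
        if job["state"] != "COMPLETED":
            continue  # FAILED, EXPIRED, ABORTED etc. leave no recovery point
        done = job["completed_at"]
        if newest[job["resource"]] is None or done > newest[job["resource"]]:
            newest[job["resource"]] = done
    return sorted(
        res for res, done in newest.items()
        if done is None or now - done > rpo
    )


# Hypothetical job history relative to a fixed "now".
now = datetime(2026, 6, 22, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"resource": "db", "state": "COMPLETED", "completed_at": now - timedelta(hours=2)},
    {"resource": "vol", "state": "FAILED", "completed_at": None},
    {"resource": "fs", "state": "COMPLETED", "completed_at": now - timedelta(hours=30)},
    {"resource": "fs", "state": "EXPIRED", "completed_at": None},
]
print(resources_past_rpo(jobs, rpo=timedelta(hours=24), now=now))
```

Anything this returns is a resource whose RPO promise is currently broken, regardless of how green the schedule looks.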

The error/status reference — the messages you actually see, what they mean, how to confirm, and the fix. This is the table you scan first when a job is FAILED:

| Error / message fragment | Job type | Likely cause | How to confirm | Fix |
|---|---|---|---|---|
| “KMS key cannot be accessed” | Copy / Restore | Destination key policy lacks CreateGrant for the service | Job StatusMessage/AbortReason cites KMS | Add Decrypt+GenerateDataKey+CreateGrant (GrantIsForAWSResource) for backup.amazonaws.com; use an MRK |
| “Access Denied” on copy | Copy | Destination vault access policy missing source/org | list-recovery-points-by-backup-vault (dest) empty | put-backup-vault-access-policy allowing CopyIntoBackupVault |
| “vault not found” | Copy | Destination vault not pre-created | describe-backup-vault (dest) 404 | Pre-create vault via StackSet/Terraform |
| “role/insufficient permissions” | Backup / Restore | AWS Backup role lacks required policy | IAM simulate on the role | Attach AWSBackupServiceRolePolicyForBackup/...ForRestores |
| “window expired” / ABORTED | Backup | Start/completion window too short | Job state EXPIRED/ABORTED | Widen StartWindowMinutes/CompletionWindowMinutes |
| “resource is in an invalid state” | Backup | Resource modifying (e.g. RDS mid-change) | describe-db-instances Status not available | Retry when the resource is stable |
| “ThrottlingException” | Any | API rate / concurrent job limits | CloudTrail throttle events | Stagger schedules; request quota increase |
| “Lock in place — cannot delete” | Delete RP | Vault Lock (governance/compliance) blocks deletion | describe-backup-vault shows Locked | Expected for compliance; for governance use a privileged principal |
| “ValidationException: lifecycle” | Plan create | Cold transition + delete violate the 90-day rule | Plan create rejected | Ensure DeleteAfterDays ≥ cold + 90 |
| “continuous backup not supported” | Backup | EnableContinuousBackup on an unsupported type | Job rejected | Use snapshot for that type; PITR only where supported |

A quick note on concurrency and quotas: AWS Backup runs jobs against per-account, per-service limits (concurrent backup/copy jobs, snapshot counts). A wall of jobs all scheduled at cron(0 5 * * ? *) throttles itself — stagger schedules across the window, and treat ThrottlingException as a signal to spread load, not to retry harder.
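Staggering is mechanical enough to generate. The sketch below spreads a number of daily schedules evenly across a window and emits them in the same cron(...) form used in the plan above; the helper name and its defaults (05:00 UTC start, 60-minute window) are assumptions for illustration.

```python
def staggered_crons(count, start_hour_utc=5, window_minutes=60):
    """Spread `count` daily schedules evenly across a window as cron strings."""
    if count < 1:
        raise ValueError("count must be at least 1")
    step = window_minutes // count  # whole minutes between consecutive starts
    crons = []
    for i in range(count):
        offset = i * step
        hour = (start_hour_utc + offset // 60) % 24
        minute = offset % 60
        crons.append(f"cron({minute} {hour} * * ? *)")
    return crons


for expr in staggered_crons(4):
    print(expr)
```

Four plans then start at :00, :15, :30 and :45 past five instead of all at once. If the count exceeds the window in minutes the step becomes zero and schedules collide again, so widen the window rather than packing more jobs into it.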

Per-service backup mechanisms

AWS Backup orchestrates native mechanisms, but each service’s RPO floor, restore behaviour, and quirks differ. Pick the mechanism that meets the RPO, then let AWS Backup schedule and copy it.

The cross-service matrix — what each supports, its RPO floor, and how a restore behaves:

| Service | Backup mechanism | Continuous (PITR)? | RPO floor | Restore behaviour | DR copy method |
|---|---|---|---|---|---|
| EBS | Snapshot (incremental) | No | Schedule interval | New volume from snapshot | AWS Backup copy / snapshot copy |
| EC2 | AMI + EBS snapshots | No | Schedule interval | Launch from AMI | Copy AMI / AWS Backup |
| RDS | Snapshot + automated backups | Yes (PITR) | ~5 min (PITR) | New instance; PITR to a second | Cross-Region read replica / snapshot copy |
| Aurora | Snapshot + continuous | Yes | ~5 min; ~1 s with Global DB | Clone/restore; Global DB failover | Aurora Global Database |
| DynamoDB | On-demand + PITR | Yes | ~5 min | New table; PITR to a second | Global tables (multi-Region, active-active) |
| S3 | Versioning + replication + Object Lock | n/a (continuous CRR) | Seconds–minutes | Object versions; replicate | CRR / SRR; S3 backup in AWS Backup |
| EFS | AWS Backup | No | Schedule interval | New/in-place file system | AWS Backup copy |
| FSx | Snapshot / AWS Backup | No (varies) | Schedule interval | New file system | AWS Backup copy |

RDS and Aurora — the database is your RPO floor

For RDS, automated backups enable PITR within a retention window (1–35 days); a manual snapshot is a point-in-time copy you keep indefinitely. For cross-Region DR you have two levers: a cross-Region read replica (live, low-lag, promotable — Pilot Light / Warm Standby) or cross-Region snapshot copy (cheaper, slower — Backup & Restore).

# Cross-Region read replica (live DR data tier, promotable on failover)
aws rds create-db-instance-read-replica \
  --db-instance-identifier app-db-dr \
  --source-db-instance-identifier arn:aws:rds:us-east-1:111122223333:db:app-db-primary \
  --region us-west-2 \
  --kms-key-id arn:aws:kms:us-west-2:111122223333:key/dr-key

# Restore to a point in time (PITR) — to any second in the retention window
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier app-db-primary \
  --target-db-instance-identifier app-db-restored \
  --restore-time 2026-06-22T14:30:00Z

RDS/Aurora DR options compared:

| Option | RPO | RTO | Cost | Survives Region? | Best pattern |
|---|---|---|---|---|---|
| Automated backups (PITR, same Region) | ~5 min | Minutes–hours | Included | No (same Region) | In-Region recovery |
| Manual snapshot + cross-Region copy | Schedule interval | Hours (rehydrate) | Storage | Yes | Backup & Restore |
| Cross-Region read replica | Seconds (lag) | Minutes (promote) | Replica instance | Yes | Pilot Light / Warm Standby |
| Aurora Global Database | Typically ~1 s | <1 min (managed failover) | Replica + storage | Yes | Warm Standby / Active-Active |
| Multi-AZ (sync standby) | 0 | ~1–2 min (AZ failover) | Standby | No (AZ only) | HA, not DR |

The promotion gotcha: promoting a read replica breaks replication and makes it a standalone writable primary, and that is irreversible. In a real failover that is what you want; in a test, create a separate disposable replica and promote that one, because a promoted replica cannot be re-attached to its source.

DynamoDB, S3, EFS — the rest of the data estate

| Resource | Recommended DR setup | Why |
|---|---|---|
| DynamoDB | PITR on + global tables for active-active multi-Region | Global tables give near-zero RPO and a live second-Region copy |
| S3 (critical data) | Versioning + CRR to DR Region + Object Lock (WORM) | CRR is continuous; Object Lock makes objects ransomware-proof |
| S3 (backups themselves) | Object Lock in compliance mode, separate account | Immutable backup target |
| EFS | AWS Backup with cross-Region copy | Native EFS replication or Backup copy for file shares |
| FSx | AWS Backup or FSx replication | Per-file-system DR copy |

S3 replication has a critical subtlety for backups: enable Replication Time Control (RTC) if you need an SLA on replication lag (RTC is designed to replicate 99.99% of objects within 15 minutes), and turn on delete-marker replication only deliberately, because you usually do not want deletes to propagate to a backup target.

The DynamoDB protection options, side by side, because PITR and on-demand/global tables solve different problems:

| DynamoDB option | What it protects | RPO | Cross-Region? | Cost model | Use when |
|---|---|---|---|---|---|
| PITR (continuous) | Restore to any second in last 35 days | ~5 min | No (same Region) | Per-GB | Recover from a bad write/delete |
| On-demand backup | A kept point-in-time snapshot | At backup time | Via AWS Backup copy | Per-GB | Long-retention / compliance |
| Global tables | Live multi-Region replicas (active-active) | Near-zero | Yes | Replicated write/storage | Multi-Region serving + DR |
| AWS Backup (managed) | Centralized policy + cross-Region copy | At backup time | Yes | Per-GB | Org-wide policy uniformity |

The S3 backup-and-DR settings that matter, and what each is for:

| S3 setting | What it does | DR relevance | Gotcha |
|---|---|---|---|
| Versioning | Keeps every object version | Recover from overwrite/delete | Must be on before the bad event; costs per version |
| CRR (Cross-Region Replication) | Async copy to a DR-Region bucket | Region-failure survival | Replicates new writes only unless backfilled |
| Replication Time Control (RTC) | 15-min replication SLA + metrics | Bounded RPO on replication | Extra cost; needs versioning |
| Object Lock (governance) | WORM, removable by privileged | Accidental-delete protection | Requires versioning; simplest to enable at bucket creation (enabling it on an existing bucket is a newer option) |
| Object Lock (compliance) | WORM, irreversible until retention ends | Ransomware-proof backups | Cannot shorten/delete until expiry |
| Delete-marker replication | Propagate deletes to the replica | Usually off for backups | On = a source delete also hides the DR copy behind a delete marker |
| MFA Delete | Require MFA to delete versions/disable versioning | Extra delete guard | Root-only to configure; operational friction |
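Putting the replication rows together, a backup-oriented rule enables RTC and its metrics and leaves delete-marker replication disabled. The sketch below only builds the configuration document in the shape the S3 PutBucketReplication API expects (for example via boto3's put_bucket_replication); the rule ID, role ARN and bucket ARN are hypothetical placeholders, and both buckets are assumed to have versioning enabled already.

```python
def backup_replication_config(role_arn, dest_bucket_arn):
    """Build a replication configuration suited to a backup target."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "dr-backup-replication",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},       # empty prefix: every object
                # Keep source deletes from propagating to the DR copy.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    # Replication Time Control and the metrics that go with it.
                    "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                    "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
                },
            }
        ],
    }


config = backup_replication_config(
    role_arn="arn:aws:iam::111122223333:role/s3-replication",  # placeholder
    dest_bucket_arn="arn:aws:s3:::example-dr-bucket",          # placeholder
)
print(config["Rules"][0]["DeleteMarkerReplication"])
```

Keeping the document in code makes the delete-marker choice reviewable: flipping it to Enabled becomes a visible diff rather than a console click.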

And the EBS/EC2 snapshot specifics, since the most common AWS Backup target is block storage:

| EBS/EC2 fact | Detail | DR implication |
|---|---|---|
| Snapshots are incremental | Only changed blocks since the last snapshot | Cheap to snapshot often; first/full is largest |
| Restore is lazy-loaded | Volume is usable immediately, blocks fetch on demand | First-touch I/O is slow; pre-warm hot volumes |
| Fast Snapshot Restore (FSR) | Pre-initializes a snapshot in an AZ | Eliminates lazy-load latency; per-AZ-hour cost |
| AMI = snapshots + metadata | An AMI references EBS snapshots | Copy the AMI (not just the snapshot) to DR for compute |
| Cross-Region copy re-encrypts | With the destination Region’s key | Destination key must permit the copy |
| Snapshot copy is async | Completes independently of the source | Confirm it landed before relying on it |

Vaults, encryption, and Vault Lock (the immutability layer)

A recovery point an attacker can delete is not protection. The hardening stack is three layers: a vault access policy (who can touch the vault), a KMS key policy (who can decrypt and copy), and Vault Lock (immutability for a retention period).

Vault Lock — governance vs compliance

Governance mode prevents deletes/changes except by principals with explicit backup:DeleteRecoveryPoint-class permissions — a guardrail against accident and most misuse, but a sufficiently privileged admin can remove it. Compliance mode is absolute: once the cooling-off period ends, no one — not root, not AWS — can delete recovery points or shorten retention until they expire. It is irreversible.

# Governance lock (reversible by privileged principals)
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name dr-vault \
  --min-retention-days 35 \
  --max-retention-days 365

# Compliance lock (IRREVERSIBLE after the changeable window) — note --changeable-for-days
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name dr-compliance-vault \
  --min-retention-days 35 \
  --max-retention-days 2555 \
  --changeable-for-days 3

The two modes side by side — read this before you lock anything in compliance mode:

Aspect | Governance mode | Compliance mode
Who can remove the lock | Privileged IAM principals | No one (including root, AWS)
Reversible? | Yes | No (after cooling-off)
Cooling-off (--changeable-for-days) | n/a (omit) | Required; 3–N days to undo a mistake
Deletes before expiry | Allowed for privileged principals | Blocked for everyone
Use when | Operational guardrail | Regulatory WORM, ransomware air gap
Risk | A rogue admin can still delete | A bad retention value bills forever

The compliance-mode pitfalls that have cost teams real money:

Pitfall | What happens | How to avoid
No cooling-off testing | You lock a typo’d config permanently | Always use a multi-day --changeable-for-days; verify in that window
“Always”/indefinite retention + compliance lock | Recovery points bill forever, undeletable | Never combine indefinite retention with a compliance lock
min-retention-days too high | Even short-lived backups pinned for years | Match retention to the actual policy, not “max safe”
Wrong vault locked | Production churn locked at 7 years | Lock only the dedicated DR/compliance vault
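
Because a compliance lock cannot be undone once the cooling-off window closes, it may be worth gating the put-backup-vault-lock-configuration call behind a small pre-flight check. The sketch below is one way to do that in plain shell. The function name check_lock_config and the 2555-day ceiling (7 years) are assumptions for illustration, not AWS limits; the 3-day floor mirrors the cooling-off minimum described above.

```shell
# check_lock_config MIN_DAYS MAX_DAYS CHANGEABLE_DAYS [CEILING_DAYS]
# Prints "ok" and returns 0 when the values look sane; otherwise prints
# the reason and returns 1. CEILING_DAYS is a hypothetical policy cap.
check_lock_config() {
  min=$1; max=$2; changeable=$3; ceiling=${4:-2555}
  if [ "$min" -gt "$max" ]; then
    echo "min-retention-days ($min) exceeds max-retention-days ($max)"
    return 1
  fi
  if [ "$changeable" -lt 3 ]; then
    echo "changeable-for-days ($changeable) is below the 3-day cooling-off minimum"
    return 1
  fi
  if [ "$max" -gt "$ceiling" ]; then
    echo "max-retention-days ($max) exceeds the policy ceiling ($ceiling)"
    return 1
  fi
  echo "ok"
}

check_lock_config 35 2555 3           # prints: ok
check_lock_config 35 2555 1 || true   # prints the cooling-off complaint
```

Running the lock command only when this check succeeds (check_lock_config ... && aws backup put-backup-vault-lock-configuration ...) keeps an obvious typo from reaching the vault; it does not replace verifying the result during the cooling-off window.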

KMS and cross-Region copy — the #1 restore failure

A cross-Region or cross-account restore fails with “KMS key cannot be accessed” when the destination key policy does not let AWS Backup create a grant. The fix is to allow kms:CreateGrant (conditioned on kms:GrantIsForAWSResource) plus kms:Decrypt and kms:GenerateDataKey for backup.amazonaws.com, and ideally to use a multi-Region key (MRK) so the same key ID and key material exist in both Regions.

{
  "Sid": "AllowAWSBackupUseOfKey",
  "Effect": "Allow",
  "Principal": { "Service": "backup.amazonaws.com" },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
  "Resource": "*"
},
{
  "Sid": "AllowAWSBackupCrossRegionRestore",
  "Effect": "Allow",
  "Principal": { "Service": "backup.amazonaws.com" },
  "Action": "kms:CreateGrant",
  "Resource": "*",
  "Condition": { "Bool": { "kms:GrantIsForAWSResource": "true" } }
}

The encryption decisions and their consequences:

Decision | Option A | Option B | Recommendation
Key type | AWS-managed (aws/backup) | Customer-managed (CMK) | CMK: required for cross-account/Region control
Cross-Region key | Re-encrypt with a regional CMK | Multi-Region key (MRK) | MRK: same key ID in each Region, fewer grant headaches
Grant for restore | Manual per-restore | Key policy allows CreateGrant for service | Policy grant: restores just work
Key deletion window | 7 days | 30 days | Longer for DR keys: never orphan a recovery point
Cross-account | Source key only | Share/replicate key to DR account | DR account must decrypt to restore

Cross-Region and cross-account architecture

Geographic and account separation are different defenses. Cross-Region survives a Region-wide event. Cross-account survives an account compromise (a popped root credential cannot reach a vault in an account it has no access to). Real DR uses both: copy recovery points from the production account/Region to an isolated DR account in a different Region, into a Vault-Locked vault.

The separation matrix — what each axis protects against:

Separation | Protects against | Does NOT protect against | Cost
Same account, same Region | Resource deletion (with versioning) | Region outage; account compromise | Lowest
Same account, cross-Region | Region outage | Account compromise; rogue admin | + transfer + storage
Cross-account, same Region | Account compromise | Region outage | + cross-account copy
Cross-account, cross-Region | Both (true air gap) | (covered) | Highest, and worth it

To receive a cross-account copy, the destination vault needs an access policy allowing the source to copy in:

aws backup put-backup-vault-access-policy \
  --backup-vault-name dr-airgap-vault \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowOrgCopyIn",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "backup:CopyIntoBackupVault",
      "Resource": "*",
      "Condition": { "StringEquals": { "aws:PrincipalOrgID": "o-abcd1234" } }
    }]
  }'

The cross-account/Region copy checklist — every prerequisite that, if missing, fails the copy:

Prerequisite | Where | Symptom if missing | Fix
Destination vault exists | DR Region/account | Copy job fails “vault not found” | Pre-create via StackSet/Terraform
Vault access policy allows CopyIntoBackupVault | Destination vault | Copy denied | Add org/account-scoped policy
Destination KMS key grants the service | Destination key | “KMS cannot be accessed” | Add CreateGrant/Decrypt for backup.amazonaws.com
Copy action references the right ARN | Backup rule | Copy lands nowhere / errors | Fix DestinationBackupVaultArn
IAM role can copy | AWS Backup role | Job role error | AWSBackupServiceRolePolicyForBackup + copy perms
(Org) trusted access enabled | Organizations | Org-wide policy won’t apply | Enable AWS Backup trusted access

For the full org-scale build of this — delegated admin, service-managed StackSets to bootstrap vaults/roles in every account, and a logically air-gapped vault shared via RAM — see Org-wide AWS Backup with Vault Lock and Cross-Account Recovery.

Orchestrating failover with Route 53

Backups and replicas get the data to the DR Region. Failover is the orchestration that turns a recovered stack into the live one: promote the database, scale the compute, and — the step most often botched — cut traffic over with DNS. Route 53 health checks + failover routing automate the traffic flip; a too-high TTL or a health check probing the wrong path defeats it.

Route 53 failover, configured

A primary/secondary failover record set, where Route 53 serves the secondary when the primary health check fails:

# Health check on the primary origin's real health path. Probe a primary-only hostname
# (for example the primary ALB's DNS name), not the failover record name itself
aws route53 create-health-check --caller-reference dr-$(date +%s) \
  --health-check-config '{
    "Type": "HTTPS", "FullyQualifiedDomainName": "<primary-origin-hostname>",
    "Port": 443, "ResourcePath": "/healthz",
    "RequestInterval": 30, "FailureThreshold": 3
  }'

# Primary record (failover=PRIMARY). Alias records take no TTL field; they use the
# target's TTL (60 s for an ALB), so resolvers don't pin the dead Region
aws route53 change-resource-record-sets --hosted-zone-id Z123 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com", "Type": "A",
        "SetIdentifier": "primary", "Failover": "PRIMARY",
        "HealthCheckId": "<hc-id>",
        "AliasTarget": { "HostedZoneId": "<alb-zone>", "DNSName": "primary-alb...", "EvaluateTargetHealth": true }
      }
    }]
  }'

The Route 53 knobs that decide whether failover actually works:

Setting | What it controls | Recommended | Failure if wrong
Record TTL | How long resolvers cache the record | 60 s | High TTL pins users to the dead Region for the TTL
Health check Type | TCP / HTTP / HTTPS / CloudWatch alarm | HTTPS to /healthz | TCP-only check passes on a broken app
ResourcePath | What the health check probes | A real readiness path | Probing / can pass while the app is down
RequestInterval | Probe frequency (10 s / 30 s) | 30 s (10 s = faster, costs more) | Slow detection delays failover
FailureThreshold | Consecutive fails before unhealthy | 3 | Too high = slow failover; too low = flapping
Routing policy | Failover / latency / weighted / geolocation | Failover for DR | Wrong policy won’t cut over on health
EvaluateTargetHealth | Alias follows target health | true | Stale routing to an unhealthy ALB
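
As a rough sanity check on these knobs, the worst-case time before clients move can be approximated as RequestInterval × FailureThreshold (time for the health check to turn unhealthy) plus the record TTL (time for cached answers to expire). The sketch below ignores health-checker aggregation and resolvers that disregard TTLs, so treat the result as an estimate rather than a guarantee; the function name is illustrative.

```shell
# failover_seconds INTERVAL_S FAILURE_THRESHOLD TTL_S
# Approximate worst-case seconds from origin failure to clients resolving DR.
failover_seconds() {
  echo $(( $1 * $2 + $3 ))
}

failover_seconds 30 3 60     # recommended values above      -> 150
failover_seconds 10 3 60     # fast (10 s) probes            -> 90
failover_seconds 30 3 3600   # a 3,600 s TTL on the record   -> 3690
```

The third call shows why the TTL usually dominates: tightening the probe interval saves a minute, while a one-hour TTL costs an hour.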

The routing policies you can use for cross-Region traffic, and which DR pattern each suits:

Routing policy | Behaviour | Best DR pattern | Watch-out
Failover | Serve secondary only when primary HC fails | Pilot Light / Warm Standby | Needs a working health check + low TTL
Weighted | Split traffic by weight (e.g. 100/0 → flip) | Controlled cutover / canary failback | Manual weight flip unless automated
Latency | Route to the lowest-latency healthy Region | Active/Active | Both Regions must serve correctly
Geolocation / Geoproximity | Route by user location | Active/Active (data residency) | A Region loss needs a fallback record
Multivalue answer | Return multiple healthy IPs | Simple resilience | Not a true failover; client picks

The health-check types and what each can (and can’t) tell you:

HC type | Probes | Good for | Limitation
TCP | Port reachability | “Is something listening?” | Passes even if the app is broken
HTTP/HTTPS | A path returns 2xx/3xx | App-level readiness | Only as good as the path you choose
HTTPS + string match | Response body contains a string | Deep readiness signal | Slightly more setup
CloudWatch alarm | An alarm’s state | Composite/derived health | Indirect; alarm config must be right
Calculated | Combine child health checks | Multi-component health | Logic must reflect real dependency

The DR runbook (automate the human steps)

Manual runbooks fail under pressure. Codify promotion + scale-out + DNS in Step Functions or SSM Automation, triggered by a CloudWatch alarm or a human “break glass.” The canonical failover sequence:

Step | Action | Tool / API | Confirm
1 | Detect outage | CloudWatch alarm / Route 53 HC | Alarm in ALARM; HC unhealthy
2 | Promote DB replica → primary | rds promote-read-replica / Aurora failover | New writer endpoint available
3 | Restore/scale compute | ASG set-desired-capacity; restore from AMI | Instances InService behind ALB
4 | Re-point app config | SSM Parameter Store / Secrets Manager (DR values) | App reads DR endpoints
5 | Flip DNS | Route 53 failover (automatic on HC) or manual UPSERT | dig resolves to DR ALB
6 | Validate | Synthetic checks; smoke tests | Real requests succeed in DR
7 | Communicate | SNS / status page | Stakeholders notified

# Promote the cross-Region replica to a standalone writable primary (irreversible)
aws rds promote-read-replica \
  --db-instance-identifier app-db-dr --region us-west-2

# Scale the pre-staged DR Auto Scaling group out from its warm/pilot size
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name app-dr-asg --desired-capacity 6 --region us-west-2
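
Before codifying these steps in Step Functions or SSM Automation, it can help to rehearse the sequence as a script with a dry-run switch, so a game day can print exactly what would run without promoting anything. A minimal sketch, assuming the same resource names as above; the run helper, the failover function, and the DRY_RUN variable are illustrative and not part of the AWS CLI.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}
DR_REGION=us-west-2

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

failover() {
  # Step 2: promote the cross-Region replica (irreversible when executed)
  run aws rds promote-read-replica \
    --db-instance-identifier app-db-dr --region "$DR_REGION"
  # Wait for the promoted instance before scaling the app tier
  run aws rds wait db-instance-available \
    --db-instance-identifier app-db-dr --region "$DR_REGION"
  # Step 3: scale the pre-staged DR Auto Scaling group
  run aws autoscaling set-desired-capacity \
    --auto-scaling-group-name app-dr-asg --desired-capacity 6 --region "$DR_REGION"
}

failover
```

With the default DRY_RUN=1 this prints three "[dry-run] aws ..." lines and touches nothing. Because promotion is irreversible, executing with DRY_RUN=0 is best reserved for a real event or for a throwaway copy of the replica.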

Architecture at a glance

Read the diagram left to right; it traces a single recovery point from the live workload to an immutable copy in a second Region and then to a recovered stack that traffic cuts over to. On the far left, the PRIMARY (us-east-1) zone holds the live workload — an EC2 app tier behind an Auto Scaling group, an RDS Multi-AZ primary, and the S3 buckets and artifacts. The BACKUP CONTROL zone is the AWS Backup policy plane: a backup plan (cron schedule, tag-targeted to dr-tier), the local vault encrypted by a source CMK, and that customer-managed key. From there a copy_action ships the recovery point cross-Region into the DR REGION (us-west-2) zone, where it lands in a Vault-Locked DR vault, alongside a cross-Region RDS read replica (the warm data tier) and the destination multi-Region key. When disaster strikes, the RECOVERED STACK zone takes over: restore-IaC stands up pre-staged AMIs and launch templates, a Step Functions DR runbook promotes the replica and scales out, and the DR app tier comes up warm. Finally the TRAFFIC CUTOVER zone flips the world: a Route 53 health check on the primary origin fails, and DNS flips (60-second TTL, ALIAS swap) to the healthy DR stack.

The numbered badges mark the five places this path most often breaks, and the legend narrates each as symptom · how to confirm · fix: an RPO promise bigger than the actual schedule (badge 1), a cross-Region copy that never lands because the destination vault or its access policy is missing (badge 2), a restore blocked because the destination KMS key denies the grant (badge 3), a recovery that has data but no compute to land on (badge 4), and a DNS layer that won’t fail over because the health check probes the wrong path or a high TTL pins resolvers to the dead Region (badge 5). The lesson the diagram encodes: data safely copied is necessary but not sufficient — the target (compute, keys, DNS) has to exist and be correct before the disaster, or your RTO blows out exactly when it matters.

AWS Backup and cross-Region disaster-recovery architecture: a recovery point flows left to right from a primary us-east-1 workload (EC2 app tier with Auto Scaling, RDS Multi-AZ primary, S3 data) through the AWS Backup control plane (tag-targeted backup plan, local vault, source CMK) via a cross-Region copy action into a us-west-2 DR Region (Vault-Locked DR vault, cross-Region RDS read replica, destination multi-Region KMS key), then into a recovered warm-standby stack (CloudFormation-staged AMIs and launch templates, a Step Functions DR runbook that promotes the replica and scales the ASG, the DR app tier) and finally to a Route 53 traffic cutover (health check on the primary origin, 60-second-TTL DNS failover) — with five numbered failure points marked: RPO larger than the backup schedule, a cross-Region copy that never lands, a KMS-denied restore, no compute to restore onto, and DNS that won't fail over

Real-world scenario

MediCloud runs a patient-portal SaaS: a stateless app tier on EC2 (Auto Scaling, behind an ALB) in us-east-1, an RDS for PostgreSQL Multi-AZ database (900 GB), patient documents in S3, and session state in DynamoDB. They are HIPAA-regulated, with a contractual RTO of 1 hour and RPO of 15 minutes, and a board mandate for ransomware-immutable backups. Their pre-incident “DR plan”: a nightly pg_dump to S3 and a belief that S3’s durability equalled disaster recovery. Monthly AWS spend was about ₹6,80,000.

The incident was a multi-hour control-plane degradation in us-east-1 — the database was reachable intermittently, new EC2 launches failed, and the console was flaky. The on-call engineer’s first move was to “restore the latest backup” — which surfaced three compounding failures. First, the latest usable backup was the nightly dump from 02:00; it was now 14:30, so they faced 12.5 hours of data loss against a 15-minute RPO. Second, restoring a 900 GB Postgres dump into a fresh instance took ~4 hours — four times the 1-hour RTO — because it was a logical restore, not a snapshot. Third, even with data, the app tier could not come up in another Region: there were no AMIs, no launch template, and no security-group/IAM definitions for us-west-2. The “DR plan” was, in practice, nonexistent. The incident ran nine hours; the regulator was notified.

The rebuild took the patterns in this article. The team set the agreed target explicitly — RTO 1 h, RPO 15 min — and derived a Warm Standby pattern. They turned on RDS automated backups (PITR, 5-min RPO floor) and stood up a cross-Region read replica in us-west-2 (seconds of lag). They built an AWS Backup plan, tag-targeted to dr-tier=critical, with a cross-Region copy action into a DR vault in us-west-2 in a separate, isolated DR account, and locked that vault in compliance mode (35-day minimum, 3-day cooling-off) for ransomware immutability. Crucially they put the whole app tier in Terraform, pre-staging AMIs and launch templates in us-west-2 and running a small warm ASG (min=2). They wrote a Step Functions runbook to promote the replica, scale the ASG, and re-point config, and configured Route 53 failover with a 60-second TTL and a health check on /healthz.

Two failures showed up during the first game day — which is the entire point of testing. The first cross-Region restore test failed with “KMS key cannot be accessed”: the destination key policy lacked kms:CreateGrant for backup.amazonaws.com. They switched to a multi-Region key and added the grant. The second: the Route 53 health check was probing / (which returned 200 from a static page even when the API was down), so DNS would not have failed over on a real API outage — they re-pointed it at /healthz, which checks the database connection. After fixes, a full game day measured RTO 22 minutes (replica promote ~90 s, ASG scale-out ~6 min, DNS propagation ~60 s, validation buffer) and RPO under 1 minute. Steady-state DR cost rose to about ₹9,40,000/month (the warm replica, the small DR ASG, cross-Region transfer, the locked vault) — roughly 1.4× — which the board approved instantly against the prior nine-hour, regulator-notifying outage. The lesson on the wall: “A backup is a hypothesis until you’ve restored it. Test the restore, or the outage tests it for you.”

The incident as a timeline, because the order of discovery is the lesson:

Time | Symptom | Action taken | Effect | What it should have been
14:30 | us-east-1 degraded; launches failing | “Restore the latest backup” | Latest = 02:00 dump | PITR replica ready to promote
14:45 | 12.5 h data loss realized | Accept the nightly dump | RPO blown 50× | 5-min PITR / replica lag
15:00 | Restore started | Logical 900 GB restore | ~4 h ETA, RTO blown 4× | Promote replica in ~90 s
16:30 | Need app tier in DR | No AMIs/LTs/SGs exist | Rebuild by hand | Pre-staged IaC + warm ASG
+rebuild | Game day #1 | Cross-Region restore test | “KMS cannot be accessed” | MRK + CreateGrant grant
+rebuild | Game day #1 | DNS failover test | HC on / wouldn’t flip | HC on /healthz
+rebuild | Game day #2 | Full failover rehearsal | RTO 22 min, RPO <1 min | The actual, proven DR
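
The figures in this scenario can be re-derived with a few lines of shell arithmetic. The sketch below uses only numbers quoted above (the component timings, the 22-minute measured RTO, and 12.5 hours of data loss against a 15-minute RPO); the variable names are illustrative.

```shell
# Component timings from the game day, in seconds
promote=90
scale_out=$((6 * 60))
dns=60
measured_rto=$((22 * 60))

automated=$((promote + scale_out + dns))
buffer=$((measured_rto - automated))
echo "automated steps: ${automated}s"      # 510 s = 8.5 min
echo "validation buffer: ${buffer}s"       # 810 s = 13.5 min

# Pre-rebuild incident: 12.5 h of data loss against a 15-minute RPO
loss_min=$((12 * 60 + 30))
echo "RPO exceeded by: $((loss_min / 15))x"   # 750 / 15 = 50
```

Read this way, well over half of the 22 minutes is validation rather than mechanics, which suggests where a later game day could look for savings.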

Advantages and disadvantages

AWS Backup plus cross-Region/account DR is the right model for most regulated, multi-service estates — but it has real costs and sharp edges. Weigh it honestly:

Advantages (why this model helps you) | Disadvantages (why it bites)
One policy plane across EC2/EBS/RDS/Aurora/DynamoDB/EFS/FSx/S3, with no per-service backup glue | The control plane abstracts native quirks; you still must know each service’s RPO floor and restore behaviour
Cross-Region and cross-account copy gives a true air gap against Region outage and account compromise | Cross-Region copy adds storage and data-transfer cost; it is not free insurance
Vault Lock (compliance) makes recovery points genuinely immutable: real ransomware protection | A bad retention value under a compliance lock is irreversible and bills forever
Tag-based selection auto-protects new resources, with no plan edit per resource | A missing tag silently drops a critical resource from every plan
Restore testing + game days turn theoretical RTO/RPO into measured numbers | DR testing is operationally heavy and chronically neglected; most teams never do it
Continuous backups (PITR) deliver ~5-min RPO without bespoke replication | Snapshot-only/nightly patterns lose in-flight transactions; replica promotion still loses lag-window writes
Native Route 53 failover automates the traffic cutover | A high TTL or a health check on the wrong path silently defeats failover
Centralized monitoring shows backup/copy job state org-wide | Without alerting on COPY_JOB_FAILED, a silently failing copy leaves you with no DR copy

The model fits revenue-critical and regulated workloads that need provable recovery and immutability. It is over-built for an internal tool that can tolerate a day down (use simple Backup & Restore, skip the warm standby) and under-built if you stop at “we have backups” and never stage compute or test a restore. The disadvantages are all manageable — but only if you know they exist, which is the point of this article.

Hands-on lab

Stand up a minimal but real cross-Region backup: create a KMS key, two vaults (primary + DR Region), a backup plan with a cross-Region copy action, back up an EBS volume, watch the copy land in the DR Region, then restore it. Free-tier-friendly where possible (a tiny EBS volume + a few snapshots cost a few rupees); we delete everything at the end. Run in CloudShell or any shell with the CLI configured.

Step 1 — Variables.

PRIMARY=us-east-1
DR=us-west-2
ACCT=$(aws sts get-caller-identity --query Account --output text)

Step 2 — A customer-managed KMS key in the primary Region.

KEY_ID=$(aws kms create-key --region $PRIMARY \
  --description "lab-backup-key" --query KeyMetadata.KeyId --output text)
echo "Key: $KEY_ID"

Expected: a key UUID printed.

Step 3 — A vault in each Region. (The DR vault must exist before any copy.)

aws backup create-backup-vault --region $PRIMARY \
  --backup-vault-name lab-local-vault \
  --encryption-key-arn arn:aws:kms:$PRIMARY:$ACCT:key/$KEY_ID

# DR vault — use an AWS-managed key here for lab simplicity
aws backup create-backup-vault --region $DR \
  --backup-vault-name lab-dr-vault

Step 4 — A tiny EBS volume to protect, tagged for selection.

AZ=${PRIMARY}a
VOL=$(aws ec2 create-volume --region $PRIMARY --availability-zone $AZ \
  --size 1 --volume-type gp3 \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=dr-tier,Value=lab}]' \
  --query VolumeId --output text)
echo "Volume: $VOL"

Step 5 — A backup plan with a cross-Region copy action.

PLAN_ID=$(aws backup create-backup-plan --region $PRIMARY --backup-plan '{
  "BackupPlanName": "lab-dr-plan",
  "Rules": [{
    "RuleName": "hourly-copy",
    "TargetBackupVaultName": "lab-local-vault",
    "ScheduleExpression": "cron(0 * * * ? *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 120,
    "Lifecycle": { "DeleteAfterDays": 7 },
    "CopyActions": [{
      "DestinationBackupVaultArn": "arn:aws:backup:'$DR':'$ACCT':backup-vault:lab-dr-vault",
      "Lifecycle": { "DeleteAfterDays": 7 }
    }]
  }]
}' --query BackupPlanId --output text)
echo "Plan: $PLAN_ID"
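
A backup plan only runs against resources that are assigned to it, and the plan above has no selection yet, so its hourly rule would not pick up the tagged volume on its own. The sketch below prints a tag-based selection document matching the dr-tier=lab tag from Step 4. The function name and the selection name lab-dr-selection are made up for this lab, and the role ARN assumes the default service role used in the steps that follow.

```shell
# selection_json ACCOUNT_ID -> prints a tag-based BackupSelection document
selection_json() {
  cat <<EOF
{
  "SelectionName": "lab-dr-selection",
  "IamRoleArn": "arn:aws:iam::$1:role/service-role/AWSBackupDefaultServiceRole",
  "ListOfTags": [
    { "ConditionType": "STRINGEQUALS", "ConditionKey": "dr-tier", "ConditionValue": "lab" }
  ]
}
EOF
}

selection_json 111122223333   # placeholder account ID; inspect before applying

# To apply it for real (not run here):
#   aws backup create-backup-selection --region "$PRIMARY" \
#     --backup-plan-id "$PLAN_ID" --backup-selection "$(selection_json "$ACCT")"
```

If you do attach a selection, remove it with aws backup delete-backup-selection before deleting the plan during cleanup, since a plan with selections cannot be deleted.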

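Step 6 — Don’t wait for the cron; trigger an on-demand backup now. An on-demand job does not run the plan’s CopyActions, so Step 7 starts the copy explicitly.

aws backup start-backup-job --region $PRIMARY \
  --backup-vault-name lab-local-vault \
  --resource-arn arn:aws:ec2:$PRIMARY:$ACCT:volume/$VOL \
  --iam-role-arn arn:aws:iam::$ACCT:role/service-role/AWSBackupDefaultServiceRole

(If that role doesn’t exist, create the default service role via the AWS Backup console once, or attach AWSBackupServiceRolePolicyForBackup to a role.)

Step 7 — Watch the backup, then start and watch the copy job.

# Backup job state (COMPLETED expected in a few minutes)
aws backup list-backup-jobs --region $PRIMARY \
  --query "BackupJobs[?contains(ResourceArn, '$VOL')].{state:State, pct:PercentDone}" --output table

# Once COMPLETED, copy the recovery point into the DR vault, then watch the copy job
SRC_RP=$(aws backup list-recovery-points-by-backup-vault --region $PRIMARY \
  --backup-vault-name lab-local-vault --query "RecoveryPoints[0].RecoveryPointArn" --output text)
aws backup start-copy-job --region $PRIMARY --recovery-point-arn "$SRC_RP" \
  --source-backup-vault-name lab-local-vault \
  --destination-backup-vault-arn arn:aws:backup:$DR:$ACCT:backup-vault:lab-dr-vault \
  --iam-role-arn arn:aws:iam::$ACCT:role/service-role/AWSBackupDefaultServiceRole
aws backup list-copy-jobs --region $PRIMARY \
  --query "CopyJobs[].{state:State, dest:DestinationBackupVaultArn}" --output table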

Step 8 — Confirm the recovery point landed in the DR Region.

aws backup list-recovery-points-by-backup-vault --region $DR \
  --backup-vault-name lab-dr-vault \
  --query "RecoveryPoints[].{arn:RecoveryPointArn, status:Status}" --output table

Expected: at least one recovery point with Status=COMPLETED; that is your cross-Region DR copy.
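
Copy jobs are asynchronous, so a lab script or pipeline usually needs to poll rather than check once. Below is a small generic retry loop; wait_for_state is a hypothetical helper, and the aws call in the trailing comment shows how it might be pointed at the copy job from Step 7.

```shell
# wait_for_state WANTED MAX_TRIES SLEEP_S CMD...
# Runs CMD until its output equals WANTED; returns 1 after MAX_TRIES attempts.
wait_for_state() {
  wanted=$1; tries=$2; pause=$3; shift 3
  i=0
  state=""
  while [ "$i" -lt "$tries" ]; do
    # A failing command is recorded as ERROR and retried
    state=$("$@") || state="ERROR"
    if [ "$state" = "$wanted" ]; then
      echo "reached $wanted after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "gave up after $tries attempts (last state: $state)"
  return 1
}

# Example against the lab's copy job (not run here):
#   wait_for_state COMPLETED 30 20 aws backup list-copy-jobs --region "$PRIMARY" \
#     --query "CopyJobs[0].State" --output text
```

A bounded retry count matters here: a copy that will never succeed (missing vault, denied key) should surface as a failure instead of a loop that waits indefinitely.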

Step 9 — Restore it in the DR Region (creates a new EBS volume there).

RP_ARN=$(aws backup list-recovery-points-by-backup-vault --region $DR \
  --backup-vault-name lab-dr-vault --query "RecoveryPoints[0].RecoveryPointArn" --output text)

aws backup start-restore-job --region $DR \
  --recovery-point-arn "$RP_ARN" \
  --iam-role-arn arn:aws:iam::$ACCT:role/service-role/AWSBackupDefaultServiceRole \
  --metadata '{"availabilityZone":"'${DR}a'","volumeType":"gp3"}' \
  --resource-type EBS

What each lab step proves:

Step | What you did | What it proves | Real-world analogue
3 | DR vault before any copy | The destination must pre-exist | The #1 cross-Region prerequisite
5 | Plan with CopyActions | Copy is a property of the rule | Production DR plan
7–8 | Watched copy land in DR | A copy job is separate and can fail alone | Why you alert on COPY_JOB_FAILED
9 | Restored in the DR Region | The copy is actually usable | The restore test you must run

Cleanup (avoid lingering charges).

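aws backup delete-backup-plan --region $PRIMARY --backup-plan-id $PLAN_ID
aws ec2 delete-volume --region $PRIMARY --volume-id $VOL
# Delete recovery points before deleting vaults (not possible while a Vault Lock retention applies)
for RP in $(aws backup list-recovery-points-by-backup-vault --region $DR --backup-vault-name lab-dr-vault \
    --query "RecoveryPoints[].RecoveryPointArn" --output text); do
  aws backup delete-recovery-point --region $DR --backup-vault-name lab-dr-vault --recovery-point-arn "$RP"
done
aws kms schedule-key-deletion --region $PRIMARY --key-id $KEY_ID --pending-window-in-days 7
# Repeat the loop for lab-local-vault in $PRIMARY, then delete both vaults once empty, and any restored DR volume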

Cost note. A 1 GB gp3 volume plus a couple of snapshots and a cross-Region copy is well under ₹100 for the hour; deleting the resources stops everything. The KMS key has a tiny monthly charge until its scheduled deletion completes.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table for 02:14, then the high-bite entries expanded with the exact confirm command and fix.

# | Symptom | Root cause | Confirm (exact cmd / console path) | Fix
1 | Promised 5-min RPO, but recovery is hours old | Nightly snapshot schedule, no PITR | describe-db-instances shows BackupRetentionPeriod/no continuous; backup-job gaps | Enable RDS automated backups (PITR) / EnableContinuousBackup; reserve cron for loose tiers
2 | Cross-Region copy never appears in DR vault | DR vault missing, or access policy/role lacks copy perms | list-copy-jobs shows FAILED; list-recovery-points-by-backup-vault (DR) empty | Pre-create DR vault; allow backup:CopyIntoBackupVault; fix copy IAM role
3 | Restore aborts “KMS key cannot be accessed” | Destination key policy lacks kms:CreateGrant for the service | Restore job StatusMessage cites KMS | Add Decrypt+GenerateDataKey+CreateGrant (GrantIsForAWSResource) for backup.amazonaws.com; use an MRK
4 | Data recovered, but nothing to run it on | No AMIs/launch templates/SGs/IAM in DR Region | describe-images/describe-launch-templates in DR empty | Pre-stage AMIs + LTs + IaC every release (Pilot Light / Warm Standby)
5 | DNS keeps serving the dead Region | High TTL, or health check on wrong path | get-health-check-status Success on a down origin; record TTL high | TTL 60 s; failover routing; probe /healthz, not /
6 | Restore much slower than expected | Recovery point in cold storage (rehydrate) | Recovery point StorageClass=COLD/Glacier; restore time long | Keep DR-tier points warm; only archive long-retention/compliance copies
7 | Backup job canceled, never ran | Start window too short for the schedule | list-backup-jobs State=ABORTED/EXPIRED; reason “window” | Widen StartWindowMinutes/CompletionWindowMinutes
8 | Promoted replica then couldn’t undo it | Promotion is irreversible (breaks replication) | Replica now a standalone writer | In tests promote a copy; in real DR it’s intended
9 | Compliance-locked vault billing forever | Indefinite retention + compliance lock | describe-backup-vault Locked, no expiry on RPs | Never combine “Always” retention with a compliance lock; set finite retention
10 | Critical resource silently not backed up | Missing dr-tier tag; selection by tag | list-protected-resources doesn’t include it | Config rule to flag untagged criticals; add the tag
11 | Cross-account copy denied | Destination vault access policy missing PrincipalOrgID/account | Copy job FAILED “access denied”; DR vault RPs empty | put-backup-vault-access-policy allowing the source org/account
12 | Lost in-flight transactions on failover | Async replication lag at the moment of failure | Replica ReplicaLag > 0 at cutover | Aurora Global (sync-ish) for tighter RPO; accept lag-window loss otherwise
13 | Failover “worked” but app errored | DR config still points at primary endpoints | App logs show primary DB/endpoint; SSM params stale | Maintain DR-Region SSM/Secrets values; re-point in the runbook
14 | Game day passes, real outage fails | Tested restore but not the full failover path | Only restore tested; DNS/compute/identity untested | Rehearse the whole runbook end to end, not just the restore

The expanded form for the entries that bite hardest:

1. The RPO promise is bigger than the schedule. Root cause: a nightly (or hourly) snapshot cron can never beat its own interval — promising 5-minute RPO on a daily schedule is a contradiction. Confirm: aws backup list-backup-jobs shows ~24 h between completions; aws rds describe-db-instances --query "DBInstances[].BackupRetentionPeriod" is the PITR window (0 means automated backups off). Fix: enable RDS/Aurora automated backups (continuous, ~5-min RPO) or DynamoDB PITR; reserve scheduled snapshots for cheaper, looser-RPO tiers. Match the mechanism to the RPO floor table above.

2. The cross-Region copy never lands. Root cause: the backup succeeds locally but the copy_action fails — the destination vault doesn’t exist, its access policy doesn’t allow the copy, or the copy IAM role lacks permission. A backup job COMPLETED is not proof a DR copy exists. Confirm: aws backup list-copy-jobs --query "CopyJobs[?State=='FAILED']"; aws backup list-recovery-points-by-backup-vault --backup-vault-name <dr-vault> --region <dr> returns empty after a run. Fix: pre-create the DR vault (StackSet/Terraform); put-backup-vault-access-policy allowing backup:CopyIntoBackupVault; ensure the role has copy permissions. Alert on COPY_JOB_FAILED via SNS — a silent copy failure leaves you with no DR copy at all.

3. Restore aborts “KMS key cannot be accessed.” Root cause: the destination key policy doesn’t let AWS Backup create the grant it needs to decrypt/re-encrypt during a cross-Region/account restore. Confirm: the restore job’s StatusMessage/AbortReason cites KMS. Fix: on the destination CMK, allow kms:Decrypt, kms:GenerateDataKey, and kms:CreateGrant (with kms:GrantIsForAWSResource=true on the grant) for backup.amazonaws.com; prefer a multi-Region key so the same key ID and key material exist in both Regions and you avoid re-encrypt mismatches.

4. Data recovered, nothing to run it on. Root cause: the recovery point is fine, but the DR Region has no AMIs, launch templates, security groups, or IAM roles for the workload — so RTO blows out while you rebuild compute by hand. Confirm: aws ec2 describe-images --owners self --region <dr> and aws ec2 describe-launch-templates --region <dr> return nothing for the app. Fix: pre-stage AMIs + launch templates + the full IaC (VPC/subnets/SGs/roles) on every release — this is the difference between Backup & Restore (slow) and Pilot Light/Warm Standby (fast). Data without a target is not recoverable in your RTO.

5. DNS won’t fail over. Root cause: Route 53 keeps serving the dead primary — the health check passes against the wrong path (/ returns 200 from a static page even when the API is down), or a high record TTL pins resolvers to the dead Region for the TTL window. Confirm: aws route53 get-health-check-status shows Success on a down origin; the record TTL is large (e.g. 3,600). Fix: set a 60-second TTL, use failover routing, and probe a real readiness path (/healthz that checks the DB), so an unhealthy origin actually flips traffic.
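
Entry 1 comes down to simple arithmetic: with scheduled snapshots, the worst-case data loss is roughly the interval between backups (plus however long the job takes to complete), so an RPO target below that interval cannot be met by the schedule alone. A small check that could sit in a review script, with an illustrative function name:

```shell
# rpo_met_by_schedule INTERVAL_MIN TARGET_RPO_MIN
# Succeeds only if a backup every INTERVAL_MIN minutes can satisfy the target.
rpo_met_by_schedule() {
  if [ "$1" -le "$2" ]; then
    echo "ok: ${1}-minute interval fits a ${2}-minute RPO"
  else
    echo "gap: ${1}-minute interval cannot meet a ${2}-minute RPO; use continuous backup or replication"
    return 1
  fi
}

rpo_met_by_schedule 1440 5 || true   # nightly cron vs 5-minute RPO -> gap
rpo_met_by_schedule 60 240           # hourly cron vs 4-hour RPO    -> ok
```

The check is deliberately optimistic because it ignores job duration; a schedule that only just passes still deserves a closer look.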

Proving recovery: restore testing and game days

An untested restore is a hypothesis. AWS Backup restore testing runs scheduled restores from a selection of recovery points into an isolated environment and (optionally) runs a validation; a game day rehearses the full human runbook. Configure restore testing so it picks recent points, restores them, and reports pass/fail:

aws backup create-restore-testing-plan --restore-testing-plan '{
  "RestoreTestingPlanName": "weekly_dr_validation",
  "ScheduleExpression": "cron(0 6 ? * MON *)",
  "RecoveryPointSelection": {
    "Algorithm": "LATEST_WITHIN_WINDOW",
    "RecoveryPointTypes": ["CONTINUOUS", "SNAPSHOT"],
    "SelectionWindowDays": 7,
    "IncludeVaults": ["arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault"]
  }
}' --region us-west-2
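
A restore testing plan on its own restores nothing until at least one selection tells it which resource type to test and which role to use. The sketch below prints such a selection document for EBS. The selection name ebs_weekly, the wildcard resource ARN, and the 4-hour validation window are assumptions for illustration, and the role must be one AWS Backup is allowed to assume for restores; check the field names against the current API reference before relying on them.

```shell
# restore_selection_json ACCOUNT_ID -> prints a RestoreTestingSelection document
restore_selection_json() {
  cat <<EOF
{
  "RestoreTestingSelectionName": "ebs_weekly",
  "ProtectedResourceType": "EBS",
  "IamRoleArn": "arn:aws:iam::$1:role/service-role/AWSBackupDefaultServiceRole",
  "ProtectedResourceArns": ["*"],
  "ValidationWindowHours": 4
}
EOF
}

restore_selection_json 111122223333   # placeholder account ID

# To attach it (not run here); substitute the name of the plan you created:
#   aws backup create-restore-testing-selection --region us-west-2 \
#     --restore-testing-plan-name <plan-name> \
#     --restore-testing-selection "$(restore_selection_json 111122223333)"
```

The validation window is the time a restored resource is kept before AWS Backup cleans it up, so it also bounds what each test run costs.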

What restore testing covers — and, crucially, what it does not, so you don’t mistake a green restore test for proven DR:

Validates | Restore testing | Full game day
Recovery point is restorable | Yes | Yes
KMS/grant path works | Yes (restore runs) | Yes
Measured restore time (data RTO) | Yes | Yes
Compute provisioning in DR | No | Yes
Replica promotion / DB failover | No | Yes
Route 53 DNS cutover | No | Yes
App config re-pointing (SSM/Secrets) | No | Yes
End-to-end synthetic user success | No | Yes
Human runbook under time pressure | No | Yes

The cadence and ownership that keep DR honest:

| Activity | Frequency | Owner | Output |
| --- | --- | --- | --- |
| Restore testing (automated) | Weekly | Platform | Pass/fail + measured restore time |
| Backup/copy job alert review | Continuous | SRE | No silent backup/copy failures |
| DR game day (full failover) | Twice a year | SRE + app | Measured RTO/RPO + gap list |
| Runbook review/update | After each game day | SRE | Current, accurate runbook |
| RTO/RPO target review | Annually / on workload change | Business + platform | Re-validated targets |

Best practices

The alerts worth wiring before the next incident — leading indicators, not the lagging “site down”:

| Alert on | Signal / event | Threshold (starting point) | Why it’s leading |
| --- | --- | --- | --- |
| Backup failed | BACKUP_JOB_FAILED (SNS) | Any | No new recovery point this cycle |
| Copy failed | COPY_JOB_FAILED (SNS) | Any | No DR copy; invisible until disaster |
| Restore-test failed | RESTORE_JOB_FAILED (SNS) | Any | Your recovery is unproven/broken |
| Replica lag | RDS/Aurora ReplicaLag | > your RPO seconds | Predicts data loss at failover |
| Health-check status | Route 53 HC | Unhealthy 3 intervals | Failover trigger; catch a flapping origin |
| PITR window shrank | BackupRetentionPeriod | < target days | RPO/restore window quietly reduced |
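
Two of these alerts can be wired with a pair of CLI calls. This is a sketch under stated assumptions: the topic ARN, the Region, and the 300-second RPO are placeholders, and the topic's access policy must already let AWS Backup publish to it. Which event names the notification API currently emits is worth checking against the AWS Backup documentation; the sketch subscribes to the completion events plus COPY_JOB_FAILED on the assumption that a failed backup or restore job reports its state in the completion message.

```shell
# Placeholders (assumptions): SNS topic, Region, and RPO in seconds.
SNS_TOPIC_ARN="arn:aws:sns:us-west-2:111122223333:dr-alerts"
RPO_SECONDS=300

# Publish vault job events to SNS. $1 = backup vault name.
# A subscriber can filter on the job state carried in each message.
wire_vault_alerts() {
  aws backup put-backup-vault-notifications \
    --backup-vault-name "$1" \
    --sns-topic-arn "$SNS_TOPIC_ARN" \
    --backup-vault-events BACKUP_JOB_COMPLETED COPY_JOB_FAILED RESTORE_JOB_COMPLETED \
    --region us-west-2
}

# Alarm when replica lag stays above the RPO for three one-minute periods.
# $1 = DB instance identifier of the cross-Region read replica.
wire_replica_lag_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "replica-lag-over-rpo-$1" \
    --namespace AWS/RDS --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic Maximum --period 60 --evaluation-periods 3 \
    --threshold "$RPO_SECONDS" \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$SNS_TOPIC_ARN" \
    --region us-west-2
}
```

Run `wire_vault_alerts dr-vault` and `wire_replica_lag_alarm <replica-id>` once per environment. Treating missing data as breaching is a deliberate choice here: a replica that stops reporting lag is at least as worrying as one that reports too much.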

Security notes

The security controls that also improve recovery — secure and resilient pull the same direction here:

| Control | Mechanism | Secures against | Also prevents |
| --- | --- | --- | --- |
| Vault Lock (compliance) | put-backup-vault-lock-configuration | Ransomware/accidental backup deletion | A rushed admin deleting the only copy |
| Cross-account DR vault | Separate AWS account + RAM/policy | Account compromise reaching backups | Blast-radius of a bad root credential |
| CMK + key policy | Customer-managed KMS + grants | Unauthorized decrypt of recovery points | Cross-Region restore “KMS denied” (if granted right) |
| Vault access policy | CopyIntoBackupVault only, deny Delete* | Rogue principals removing recovery points | Misconfigured copies landing nowhere |
| Least-priv restore role | Scoped IAM for start-restore-job | Unauthorized data recreation | Accidental restores overwriting prod |
| SCP on DR account | Deny Delete* backup/vault APIs | Insider/compromise deleting DR | Fat-finger vault deletion |

Cost & sizing

A rough monthly picture for the MediCloud-style workload (900 GB DB + documents): Backup & Restore might be ₹40,000–80,000 (storage + transfer), Warm Standby ₹2,00,000–4,00,000 (replica + small DR ASG + transfer + locked vault) on top of production. MediCloud landed at ~1.4× production after choosing the cheapest pattern that met a 1-hour RTO, which shows the lever is the pattern, not raw backup spend. The bill drivers, what each buys, and how they interact with the pattern you chose:

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
| --- | --- | --- | --- | --- |
| Warm backup storage | Per-GB instantly-restorable | Scales with data × retention | Fast RTO from warm points | Don’t keep everything warm |
| Cold backup storage | Per-GB archived | ~⅕ of warm | Cheap long retention | 90-day min; rehydrate delay |
| Cross-Region transfer | Per-GB on each copy | Scales with change rate | Region-failure survival | Chatty data inflates it |
| Cross-Region read replica | Always-on DR DB instance | Instance price | Seconds RPO, minutes RTO | Idle cost; size for post-scale |
| Warm DR ASG (Warm Standby) | Small always-on compute | Min-size instance cost | Minutes RTO | Drift vs primary |
| KMS (CMK / MRK) | Per-key + per-request | Small | Cross-acct/Region control | MRK counts per Region |
| Restore testing | Scheduled restores | Small recurring | Proven recovery | Worth every paisa |

Interview & exam questions

1. What’s the difference between RTO and RPO, and why set them before designing DR? RTO is the maximum tolerable time to restore service; RPO is the maximum tolerable data loss, in time. You set them with the business first because they bound everything downstream — RPO dictates backup frequency/replication mode (5-min RPO forbids nightly snapshots), and RTO dictates the DR pattern (a 1-hour RTO forbids a 4-hour cold rehydrate). Designing backups first and discovering your RTO/RPO is backwards.

2. A team backs up RDS nightly to S3 and calls it DR. What’s wrong? Several things: nightly backups give a ~24-hour RPO (not DR-grade for most workloads); a logical restore of a large DB is slow (RTO blowout); and a backup is only the data — DR also needs compute (AMIs/launch templates), network/identity, and DNS failover in the recovery Region, none of which a nightly dump provides. Backups protect against data loss; DR protects against loss of service.

3. Compare the four DR patterns by RTO/RPO and cost. Backup & Restore — nothing running in DR, RTO hours, RPO hours, lowest cost. Pilot Light — data replicating, compute off, RTO tens of minutes, RPO seconds-minutes, low cost. Warm Standby — scaled-down full stack running, RTO minutes, RPO seconds, medium cost. Active/Active — full stack serving traffic in both Regions, RTO/RPO near-zero, highest cost. Choose the cheapest that meets the agreed targets.

4. What does an AWS Backup copy action do, and what must exist for a cross-Region copy to succeed? A copy action pushes a recovery point to another vault, Region, or account. For cross-Region it needs: the destination vault pre-created; the destination KMS key policy allowing AWS Backup to create a grant (kms:CreateGrant for backup.amazonaws.com); and for cross-account, the destination vault access policy allowing backup:CopyIntoBackupVault from the source. A backup completing locally does not mean the copy landed — alert on COPY_JOB_FAILED.

5. Difference between AWS Backup Vault Lock governance and compliance mode? Governance prevents deletes/changes except by sufficiently privileged IAM principals — a guardrail, but removable. Compliance is absolute and irreversible after a mandatory cooling-off period: no one, including the account root and AWS, can delete recovery points or shorten retention until they expire. Use compliance for regulatory WORM and ransomware air gaps — but never with indefinite retention, or recovery points bill forever.

6. A cross-Region restore fails with “KMS key cannot be accessed.” Cause and fix? The destination KMS key policy doesn’t permit AWS Backup to create the grant it needs to decrypt/re-encrypt during restore. Fix by adding kms:Decrypt, kms:GenerateDataKey, and kms:CreateGrant (with kms:GrantIsForAWSResource=true) for backup.amazonaws.com on the destination key — and prefer a multi-Region key so the ARN is consistent across Regions.

7. How do you achieve a sub-5-minute RPO for an RDS database in a DR Region? Enable continuous backups (PITR) for in-Region point-in-time recovery, and stand up a cross-Region read replica for the DR data tier — replica lag is typically seconds, and you promote it on failover. For the tightest RPO (~1 second) and managed cross-Region failover, use Aurora Global Database. Nightly snapshots cannot meet a 5-minute RPO regardless of how you schedule them.

8. Why is Multi-AZ not a DR solution? Multi-AZ provides synchronous replication to a standby in another Availability Zone within the same Region — it survives an AZ failure with RPO 0, but a Region-wide event takes both the primary and the standby. DR requires a copy/replica in a different Region. Use Multi-AZ for the common AZ failure and cross-Region replication/copy for the rare Region failure; they are complementary, not alternatives.

9. You promote a cross-Region read replica during a DR test and can’t undo it. Why, and what should you do in tests? Promotion makes the replica a standalone writable primary and breaks replication — it’s irreversible. In a real failover that’s exactly what you want. In a test, promote a copy (or restore a snapshot to a throwaway instance) so you don’t sever your live replication chain.

10. Route 53 is configured for failover but DNS keeps serving the dead Region. What are the two most likely causes? Either the record TTL is too high, so resolvers cache the dead endpoint for the TTL window; or the health check probes the wrong path (e.g. / returns 200 from a static page even when the API is down), so Route 53 never marks the primary unhealthy. Fix with a 60-second TTL and a health check on a real readiness path (/healthz) that fails when the app truly can’t serve.

11. What’s the danger of combining indefinite retention with a compliance-mode Vault Lock? Compliance mode makes recovery points undeletable until they expire — and indefinite retention means they never expire. The result is recovery points that bill forever and that no one, including root, can delete. Always use finite retention under a compliance lock, and verify the configuration during the mandatory cooling-off window.

12. How do you prove your DR works, and how often? Run AWS Backup restore testing (scheduled, validated restores into an isolated environment) plus DR game days that rehearse the full human runbook — promote, scale, re-point, flip DNS, validate. Do it at least twice a year, and treat the measured RTO/RPO as the real numbers. An untested restore is a hypothesis; the most common reason recoveries fail is that they were never tested.

These map to AWS Certified Solutions Architect – Associate (SAA-C03), under design resilient architectures (backup/restore, multi-Region, RTO/RPO), and Solutions Architect – Professional (SAP-C02), under design for business continuity and DR (the four patterns, cross-account/Region, failover orchestration). The data-durability and immutability angle touches AWS Certified Security – Specialty. A compact cert-mapping for revision:

| Question theme | Primary cert | Exam domain |
| --- | --- | --- |
| RTO/RPO, DR patterns | SAA-C03 / SAP-C02 | Design resilient / BC-DR architectures |
| AWS Backup plans, copy actions | SAA-C03 | Resilient, decoupled, backup architectures |
| Vault Lock, immutability, KMS | Security Specialty | Data protection; key management |
| Cross-Region replica / Aurora Global | SAP-C02 | Continuity; advanced data strategies |
| Route 53 failover, runbook automation | SAP-C02 | Failover orchestration; resilience |
| Cross-account air gap | Security Specialty / SAP-C02 | Account isolation; data protection |

Quick check

  1. Your contract says RPO 15 minutes, but your only backup is a nightly snapshot. What’s the gap, and what mechanism actually meets the target?
  2. An AWS Backup job shows COMPLETED, but the DR Region’s vault is empty. What single thing should you check, and what alert prevents this surprising you?
  3. True or false: AWS Backup Vault Lock in compliance mode can be removed by the account root in an emergency.
  4. A cross-Region restore aborts with “KMS key cannot be accessed.” Name the specific permission to add and on which key.
  5. Route 53 failover is configured but traffic never leaves the dead Region. Name the two most likely misconfigurations.

Answers

  1. A nightly snapshot has a ~24-hour RPO — it misses a 15-minute target by ~96×. The mechanism that meets it is continuous backups (PITR) for RDS/Aurora/DynamoDB and/or a cross-Region read replica (seconds of lag, promoted on failover); for ~1-second RPO use Aurora Global Database. No snapshot schedule can meet 15 minutes.
  2. Check the copy job: run aws backup list-copy-jobs and look for a FAILED state. A backup completing locally does not mean the copy landed; the destination vault may be missing or its access policy/KMS grant may be wrong. Wire an SNS alert on COPY_JOB_FAILED so a silent copy failure (which leaves you with no DR copy) pages you instead of surprising you during a disaster.
  3. False. In compliance mode, after the cooling-off period no one — including the account root and AWS — can delete recovery points or shorten retention until they expire. That irreversibility is the point (ransomware/regulatory WORM), and it’s why you must verify the config during the cooling-off window.
  4. Add kms:CreateGrant (with kms:GrantIsForAWSResource=true), along with kms:Decrypt and kms:GenerateDataKey, for the backup.amazonaws.com service principal on the destination KMS key’s policy. Prefer a multi-Region key so the ARN lines up across Regions.
  5. Either the record TTL is too high (resolvers cache the dead endpoint for the TTL window) or the health check probes the wrong path (e.g. /, which can return 200 while the API is down). Fix with a 60-second TTL, failover routing, and a health check on a real readiness path (/healthz).

Glossary

Next steps

You can now choose a DR pattern to a real RTO/RPO, build the AWS Backup plan and cross-Region/account copy chain, lock the destination, and orchestrate failover. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
