
Automate Cross-Account RDS and EBS Snapshot Copy with AWS Backup and EventBridge

A fintech’s platform team gets a finding from their auditor that lands like a brick: every production backup — the RDS Aurora clusters holding ledger data, the EBS volumes under the matching-engine hosts — lives in the same AWS account as the workloads it protects. If that account is compromised, ransomwared, or a privileged operator fat-fingers a DeleteBackupVault, the backups die with the primary. The mandate that comes down is specific: copies of every production database and volume snapshot must land, encrypted, in a separate “recovery” account that production has no write path into, on a schedule, with proof. This guide builds exactly that — an automated cross-account snapshot copy pipeline using AWS Backup as the copy engine, a dedicated KMS key for re-encryption, Vault Lock in compliance mode to make the copies immutable, and EventBridge to catch the moments AWS Backup does not surface on its own (failed copies, vault-lock drift) and turn them into tickets and pages.

The reason this is hard is not the wiring; it is the quiet failure modes. AWS Backup will happily run for weeks with a failing copy job while the source-snapshot dashboard stays green. A copy will land in the recovery account and then refuse to restore because one line is missing from a KMS key policy. A snapshot encrypted with the AWS-managed aws/ebs key cannot be copied cross-account at all, and the copy job fails with an opaque error that nobody is watching for. Compliance-mode Vault Lock is irreversible after a cooling-off window, so a learning exercise pointed at a seven-year retention becomes a vault you cannot delete for seven years. This article treats those as first-class: every setting gets its values, default, trade-off and limit; every copy-job and KMS error gets a confirm-command and a fix; and the whole operational surface is a symptom → root cause → confirm → fix playbook you keep open during an incident. It is the kind of control a SOC 2 / DORA audit actually wants to see, not a cron job someone wrote on a Friday.

By the end you will be able to stand this up in Terraform, prove a copy is restorable (not merely present), wire the silent failures to your pager, and explain to an auditor — with a single describe-backup-vault line returning Locked: true — why a full compromise of production cannot reach the copies.

What problem this solves

The pain is concrete and it shows up at the worst possible time: during a real recovery. A team that backs up in place — RDS automated backups, EBS snapshots, even a same-account AWS Backup plan — has protected itself against hardware failure and accidental deletion of a row. It has not protected itself against the failure modes that actually destroy companies: a compromised account whose attacker enumerates and deletes every recovery point, a ransomware operator who encrypts the snapshots alongside the volumes, or an insider with backup:Delete* who removes the evidence. In all three the backups and the thing they protect share a blast radius, so they die together.

What breaks without this control is the recovery itself. You discover, mid-incident, that the snapshots you were counting on are gone, or encrypted, or in the same account the attacker still controls. The secondary failure is subtler and more common: you have off-account copies, but nobody verified they restore, so the quarterly DR drill becomes the first time anyone learns the KMS key policy has a gap and the copies are undecryptable. The third failure is silent drift — a copy job that has been failing for three weeks while the source dashboard is green, because the source backup and the cross-account copy are two separate jobs and only one of them was being watched.

Who hits this: any team running regulated or revenue-critical data on AWS — fintech, health, anyone under SOC 2, PCI-DSS, DORA, ISO 27001 — where “is every production resource being copied off-account, immutably, and is that copy restorable” must be a continuously answered question, not an annual scramble. The fix is architectural, not procedural: the recovery account is built so production holds no principal that can write to or delete from it, and the only path data travels is AWS Backup’s own copy mechanism, which the recovery KMS key policy explicitly grants and nothing else.

To frame the whole field before the build, here is every failure class this control defends against, why in-place backup does not, and the mechanism that does:

| Threat class | What in-place backup does | Why it fails | What this control adds |
| --- | --- | --- | --- |
| Account compromise | Backups in the same account | Attacker deletes recovery points alongside data | Copies live in an account production cannot write to |
| Ransomware | Snapshots encrypted with data | Operator encrypts snapshots too | Re-encrypted with a recovery-owned CMK; locked |
| Malicious insider | backup:Delete* removes evidence | Privileged delete with no immutability | Vault Lock compliance mode: even root cannot delete early |
| Accidental deletion | DeleteBackupVault fat-finger | One command destroys both | Copies are separate, in a separate account, locked |
| Silent copy drift | Source job green, nothing copied | Copy job fails unwatched | EventBridge on Copy Job State Change → page |
| Undecryptable copy | N/A (single account) | Cross-account KMS policy gap | Key policy grants source role; quarterly restore drill |
| Region loss | Same-region only | AZ/region outage takes both | Optional cross-region copy on the long-retention rule |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already have, or be ready to create:

This control sits in the resilience and governance layer, downstream of your account structure and upstream of your DR runbooks. It assumes the multi-account foundations from AWS Organizations and IAM Foundations: Accounts, OUs and Roles and the guardrails of AWS Control Tower Guardrails: Building a Secure Multi-Account Foundation. It complements — does not replace — the strategy in AWS Backup and Disaster Recovery: Protect Workloads Across Regions, which covers the why and the RTO/RPO maths; this article is the how for the cross-account, immutable-copy slice. The audit trail it produces feeds AWS CloudTrail and Config: Audit and Compliance at Scale.

A quick map of who owns which layer during a build or an incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Org management account | The two cross-account Backup features | Cloud platform / landing-zone team | Every copy Access Denied if features are off |
| Source account IAM/KMS | Backup role, source CMK on resources | App + platform | Default-key snapshots skip copy; role can’t assume |
| Recovery account KMS | Destination CMK + its key policy | Recovery/security team | Copy lands but won’t restore (policy gap) |
| Recovery vault + lock | Destination vault, Vault Lock config | Recovery/security team | Lock too rigid (stuck) or absent (deletable) |
| Backup plan + selection | Schedule, lifecycle, tag selection | Platform team | Untagged resource silently unprotected |
| EventBridge + SNS | Copy/lock event rules, alert fan-out | Platform + SRE | Silent copy failure (no rule) |
| CI / Terraform | Apply pipeline, Vault creds, Wiz gate | DevOps | Drift, over-broad policy shipped unreviewed |

Core concepts

Five mental models make every later step obvious.

The copy is a second, independent job — not a property of the backup. AWS Backup takes a snapshot (a BACKUP_JOB) and writes it to a source vault; a copy_action on the plan then runs a separate COPY_JOB that re-encrypts and writes the recovery point into the destination vault. The two have separate states, separate lifecycles, and separate failure surfaces. A green source snapshot tells you nothing about whether the copy landed — you must watch Copy Job State Change, not just Backup Job State Change. This single fact is behind most “we thought we had backups” incidents.
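To make the "two separate jobs" point concrete, here is a minimal Python sketch that cross-checks the two job lists. It assumes you have already fetched the job records (for example from `aws backup list-backup-jobs` and `aws backup list-copy-jobs`) and that each record carries the `RecoveryPointArn`, `SourceRecoveryPointArn` and `State` fields; the function name and the sample ARNs are hypothetical.

```python
from typing import Iterable, List


def uncopied_recovery_points(backup_jobs: Iterable[dict],
                             copy_jobs: Iterable[dict]) -> List[str]:
    """Return source recovery points that completed but have no COMPLETED copy."""
    # Recovery points for which at least one copy job finished successfully.
    copied = {
        job["SourceRecoveryPointArn"]
        for job in copy_jobs
        if job.get("State") == "COMPLETED"
    }
    # A green backup job with no matching completed copy is the silent gap.
    return [
        job["RecoveryPointArn"]
        for job in backup_jobs
        if job.get("State") == "COMPLETED"
        and job["RecoveryPointArn"] not in copied
    ]


# Hypothetical sample records (placeholder ARNs, not real resources).
backup_jobs = [
    {"RecoveryPointArn": "arn:example:rp-1", "State": "COMPLETED"},
    {"RecoveryPointArn": "arn:example:rp-2", "State": "COMPLETED"},
]
copy_jobs = [
    {"SourceRecoveryPointArn": "arn:example:rp-1", "State": "COMPLETED"},
    {"SourceRecoveryPointArn": "arn:example:rp-2", "State": "FAILED"},
]

print(uncopied_recovery_points(backup_jobs, copy_jobs))  # ['arn:example:rp-2']
```

Both backup jobs are green here, yet one recovery point never reached the recovery account, which is exactly the case a source-only dashboard hides.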

The recovery account owns the encryption, and its key policy is the contract. The destination vault is encrypted with a customer-managed KMS key in the recovery account. For the copy to land and be restorable, that key’s policy must grant the source account’s Backup role the actions to use it (kms:Encrypt, kms:Decrypt, kms:ReEncrypt*, kms:GenerateDataKey*, kms:DescribeKey, kms:CreateGrant). Miss that and the copy writes fine but the restore fails with a KMS error — the single most-missed line in the whole build.

Immutability comes from Vault Lock, and compliance mode is one-way. A copy an attacker can delete is a delay, not a backup. Vault Lock in compliance mode makes retention immutable — once the lock’s cooling-off period (changeable_for_days) elapses, nobody, including the recovery account root, can delete a recovery point early or weaken the policy. Governance mode is the softer cousin: a principal with backup:DeleteBackupVaultLockConfiguration can remove it, so it deters mistakes but not a determined insider. Choose compliance for the copies that matter, and rehearse in throwaway accounts because there is no undo after the window.

Selection by tag, not by ARN, is what keeps the control honest over time. Hard-coding resource ARNs into a backup selection guarantees that the database someone provisions next month is silently unprotected. Select by tag (backup=daily) and enforce the tag with an SCP or a Terraform module default, so protection follows the standard rather than a human remembering to add an ARN.

The architecture is the primary control; everything else is defence in depth. Production holds no IAM principal that can write to or delete from the recovery vault. The only path data travels is AWS Backup’s copy mechanism, which the recovery KMS policy explicitly allows. Vault Lock defends against the insider who does get into the recovery account; re-encryption with a recovery-owned CMK makes the source account’s key material irrelevant to the copies; EventBridge makes silent failure loud. Each layer assumes the one before it might fail.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Backup plan | Schedule + rules + lifecycles + copy actions | Source account | Defines when/what/where to copy |
| Backup rule | One schedule + retention + copy action | Inside the plan | Daily vs weekly long-term tiers |
| Backup selection | Which resources a plan protects (by tag/ARN) | Inside the plan | Tag-based = future-proof |
| Source vault | Local store for primary recovery points | Source account | First write; local retention |
| Destination vault | Off-account store for copies | Recovery account | The immutable, isolated copy |
| COPY_JOB | The cross-account copy operation | Source-initiated | Separate state from the backup |
| Recovery CMK | KMS key encrypting the destination vault | Recovery account | Key policy must grant source role |
| Vault Lock | Immutable retention on a vault | Recovery vault | Compliance mode = no early delete |
| changeable_for_days | Cooling-off before lock is permanent | Lock config | Your only window to back out |
| Backup role | IAM role AWS Backup assumes | Source account | Needs KMS + backup/restore perms |
| Org features | Cross-account backup + monitoring flags | Management account | Off → every copy Access Denied |
| EventBridge rule | Matches aws.backup events → target | Both accounts | Turns silent failure into a page |

Step 1 — Enable the AWS Backup organization features

Cross-account copy and cross-account monitoring are Org-level features, off by default. Run these once from the Organizations management account. The first call turns AWS Backup into a trusted service so it can act across accounts; the next two flip the actual features.

# From the Organizations management account
aws organizations enable-aws-service-access \
  --service-principal backup.amazonaws.com \
  --profile mgmt

# Allow recovery points to be copied between accounts in the Org
aws backup update-global-settings \
  --global-settings isCrossAccountBackupEnabled=true \
  --profile mgmt

# (Optional but recommended) aggregate backup/copy job status Org-wide
aws backup update-global-settings \
  --global-settings isCrossAccountMonitoringEnabled=true \
  --profile mgmt

# Confirm both are "true"
aws backup describe-global-settings --profile mgmt

If isCrossAccountBackupEnabled is not true, every copy job you create later fails with Access Denied no matter how perfect your KMS and IAM are — so verify this first. The two settings, what they do, and what breaks without each:

| Org setting | What it enables | Default | Where set | Symptom if off |
| --- | --- | --- | --- | --- |
| isCrossAccountBackupEnabled | Recovery points may be copied between Org accounts | false | Management account | Every COPY_JOB fails with Access Denied |
| isCrossAccountMonitoringEnabled | Org-wide aggregation of backup/copy job status | false | Management account | No central job dashboard; per-account only |
| enable-aws-service-access (trusted access) | Backup may act across the Org | off | Management account | Features can’t be enabled; copy unsupported |
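Because this flag gates everything that follows, it is worth checking mechanically in CI rather than by eye. The sketch below parses the JSON that `aws backup describe-global-settings` prints, assuming the flag sits under a `GlobalSettings` key and is returned as the string "true" or "false"; the function name and the sample payload are illustrative.

```python
import json


def cross_account_copy_enabled(describe_output: str) -> bool:
    """Return True only if isCrossAccountBackupEnabled is reported as true."""
    settings = json.loads(describe_output).get("GlobalSettings", {})
    # The value is treated as a string flag; a missing key counts as disabled.
    value = settings.get("isCrossAccountBackupEnabled", "false")
    return str(value).lower() == "true"


# Hypothetical sample of the CLI output captured by a pipeline step.
sample = '{"GlobalSettings": {"isCrossAccountBackupEnabled": "true"}}'
print(cross_account_copy_enabled(sample))  # True
```

A pipeline can fail fast on a False result before any plan or copy action is applied.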

The prerequisites that must all be true before a single cross-account copy can succeed — a checklist you confirm in order:

| # | Prerequisite | Confirm command / path | If missing |
| --- | --- | --- | --- |
| 1 | Both accounts in the same Organization | aws organizations list-accounts | Move/invite account into the Org |
| 2 | Trusted service access for Backup | aws organizations list-aws-service-access-for-organization | enable-aws-service-access |
| 3 | isCrossAccountBackupEnabled=true | aws backup describe-global-settings | update-global-settings |
| 4 | Source resource encrypted with a CMK | aws ec2 describe-volumes / aws rds describe-db-instances | Re-encrypt with a customer-managed key |
| 5 | Recovery CMK grants the source Backup role | KMS key policy | Add AllowSourceBackupRoleUse (Step 2) |
| 6 | Destination vault exists in recovery account | aws backup describe-backup-vault --profile recovery | Create it (Step 2) |
| 7 | Source Backup role can use the recovery CMK | source IAM role inline policy | Add the cross-account KMS statement (Step 4) |

Step 2 — Create the destination vault and KMS key in the recovery account

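The recovery account owns the encryption key and the destination vault. Critically, the KMS key policy must grant the source account’s AWS Backup service role permission to use the key; without that grant, the copy lands but cannot be decrypted on restore. The destination vault also needs a vault access policy that allows backup:CopyIntoBackupVault from the source account (or the Organization), otherwise AWS Backup rejects the incoming copy. The key and vault below are defined in Terraform under a recovery provider alias; add the vault access policy (aws_backup_vault_policy) alongside them.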

# providers.tf
provider "aws" {
  alias   = "recovery"
  region  = "ap-south-1"
  profile = "recovery"   # creds injected by Vault AWS secrets engine in CI
}

# recovery_account.tf
data "aws_caller_identity" "recovery" {
  provider = aws.recovery
}

resource "aws_kms_key" "backup_copy" {
  provider                = aws.recovery
  description             = "CMK for cross-account AWS Backup copies (RDS/EBS)"
  enable_key_rotation     = true
  deletion_window_in_days = 30

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "RecoveryAccountAdmin"
        Effect    = "Allow"
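        # Derive the admin principal from the recovery provider's own identity
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.recovery.account_id}:root" }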
        Action    = "kms:*"
        Resource  = "*"
      },
      {
        # The SOURCE account's Backup role must be able to use this key
        # to write the re-encrypted copy and to decrypt it on restore.
        Sid    = "AllowSourceBackupRoleUse"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::111111111111:role/AWSBackupCrossAccountRole"
        }
        Action = [
          "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
          "kms:GenerateDataKey*", "kms:DescribeKey", "kms:CreateGrant"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_kms_alias" "backup_copy" {
  provider      = aws.recovery
  name          = "alias/backup-cross-account"
  target_key_id = aws_kms_key.backup_copy.key_id
}

resource "aws_backup_vault" "destination" {
  provider    = aws.recovery
  name        = "recovery-destination-vault"
  kms_key_arn = aws_kms_key.backup_copy.arn
}

Every KMS action the copy needs, and why

This is the table to keep open when a restore fails with a KMS error — each action maps to a moment in the copy/restore lifecycle, so a missing one tells you exactly which moment breaks:

| KMS action | Used when | Granted to | Symptom if missing |
| --- | --- | --- | --- |
| kms:GenerateDataKey* | Encrypting the copy’s data | Source Backup role | COPY_JOB fails at write |
| kms:Encrypt | Encrypting metadata/keys | Source Backup role | Copy fails / partial |
| kms:Decrypt | Restoring the copied recovery point | Restoring role (recovery acct) | Restore fails: AccessDenied on key |
| kms:ReEncrypt* | Re-encrypting source → recovery key | Source Backup role | Cross-account copy fails |
| kms:DescribeKey | Resolving key metadata | Both roles | Copy/restore can’t find key |
| kms:CreateGrant | Backup creates a grant for async ops | Source Backup role | Intermittent copy failures |
| kms:RetireGrant | Cleaning up grants | Backup service | Grant leak (rarely fatal) |

KMS key policy vs IAM policy — which controls what

Cross-account KMS access needs permission on both sides; granting only one is the classic half-fix. Where each permission must live:

| Permission location | Lives in | Grants | Without it |
| --- | --- | --- | --- |
| Recovery CMK key policy | Recovery account | The key trusts the source role to use it | Source role is never trusted by the key |
| Source role IAM policy | Source account | Source role is allowed to call KMS on that ARN | Role’s own IAM policy does not allow the call |
| Recovery restore role IAM | Recovery account | Restoring principal may call kms:Decrypt | Restore in recovery account fails |
| KMS grant (auto) | Created at runtime | Async copy operations | Copy fails mid-flight |

The rule: cross-account KMS requires the action be allowed in both the key policy (recovery side) and the caller’s IAM policy (source side). Allowing only one is the most common reason a copy “should work” but doesn’t.
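The both-sides rule can be linted before anything is applied. The following Python sketch is a deliberately simplified checker, not a full IAM evaluator: it ignores Deny statements, Conditions and Resource matching, and only asks whether each required action is covered by some Allow statement. The function name, the expansion of the wildcards into concrete action names, and the sample policies are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

# Concrete actions covered by the wildcard patterns used in the key policy.
REQUIRED_ACTIONS = [
    "kms:Encrypt", "kms:Decrypt", "kms:ReEncryptFrom", "kms:ReEncryptTo",
    "kms:GenerateDataKey", "kms:DescribeKey", "kms:CreateGrant",
]


def _as_list(value):
    """Policy fields may be a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def missing_actions(policy: dict, principal_arn: Optional[str] = None) -> List[str]:
    """Return required KMS actions that no Allow statement covers.

    Pass principal_arn for a key policy (statements naming other principals
    are skipped); omit it for an identity policy attached to the role.
    """
    granted = set()
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if principal_arn is not None:
            principal = stmt.get("Principal", {})
            arns = _as_list(principal.get("AWS", [])) if isinstance(principal, dict) else []
            if principal_arn not in arns:
                continue
        for pattern in _as_list(stmt.get("Action", [])):
            for action in REQUIRED_ACTIONS:
                # Action names are compared case-insensitively.
                if fnmatchcase(action.lower(), pattern.lower()):
                    granted.add(action)
    return [a for a in REQUIRED_ACTIONS if a not in granted]


ROLE = "arn:aws:iam::111111111111:role/AWSBackupCrossAccountRole"
key_policy = {"Statement": [{
    "Effect": "Allow", "Principal": {"AWS": ROLE},
    "Action": ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
               "kms:GenerateDataKey*", "kms:DescribeKey", "kms:CreateGrant"],
    "Resource": "*"}]}
# Hypothetical half-fix: the role's IAM policy forgot kms:CreateGrant.
iam_policy = {"Statement": [{
    "Effect": "Allow",
    "Action": ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
               "kms:GenerateDataKey*", "kms:DescribeKey"],
    "Resource": "*"}]}

print(missing_actions(key_policy, ROLE))  # []
print(missing_actions(iam_policy))        # ['kms:CreateGrant']
```

An empty list on both sides is necessary but not sufficient; a real restore drill remains the only proof that the key actually decrypts the copy.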

Step 3 — Lock the destination vault (compliance mode)

A copy that an attacker can delete is not a backup, it is a delay. Vault Lock in compliance mode makes retention immutable — once the lock’s cooling-off period (changeable_for_days) elapses, nobody, including the recovery account root, can delete a recovery point early or weaken the policy. Set the cooling-off window honestly: during it you can still back out, after it the lock is permanent.

resource "aws_backup_vault_lock_configuration" "destination" {
  provider            = aws.recovery
  backup_vault_name   = aws_backup_vault.destination.name
  changeable_for_days = 3      # cooling-off: lock becomes immutable after this
  min_retention_days  = 30     # nothing can be deleted before 30 days
  max_retention_days  = 2555   # ~7 years cap to satisfy financial retention
}

Test the entire pipeline end to end in a throwaway pair of accounts before you let changeable_for_days expire in production. Compliance-mode lock is intentionally unforgiving — there is no override, not even from the recovery root.

Governance vs compliance mode

The mode is the most consequential single choice in this build. Side by side:

| Property | Governance mode | Compliance mode |
| --- | --- | --- |
| Can an admin remove the lock? | Yes, with backup:DeleteBackupVaultLockConfiguration | No; after cooling-off, never |
| Can retention be shortened? | Yes, by an authorised principal | No |
| Cooling-off (changeable_for_days) | Not set (omitting it is what selects governance mode) | Mandatory (minimum 3 days), then permanent |
| Protects against accident | Yes | Yes |
| Protects against malicious insider | No (they can delete the lock) | Yes |
| Satisfies WORM / SEC 17a-4-style needs | No | Yes |
| Reversible | Fully | Only inside cooling-off window |
| Use it for | Soft guardrail on dev/test vaults | The copies that actually matter |

The three lock parameters

Each parameter has a distinct failure mode if set wrong — this is where a learning exercise turns into a seven-year mistake:

| Parameter | What it does | Typical value | Set too low | Set too high |
| --- | --- | --- | --- | --- |
| changeable_for_days | Cooling-off before lock is permanent | 3 (prod) / omit (test; omitting yields governance mode) | No time to back out of a mistake | Long window where insider can still delete lock |
| min_retention_days | Floor below which nothing can be deleted | 30 | Copies expire before useful | Test vault stuck for the duration |
| max_retention_days | Ceiling on any recovery point’s retention | 2555 (~7y) | Long-term rule can’t reach its target | Cost compounds; nothing forces it down |

The state machine of a lock

Understanding the lock lifecycle keeps you from doing something irreversible. The transitions:

| State | How you got here | What you can do | What you cannot do |
| --- | --- | --- | --- |
| No lock | Vault created, no lock config | Add a lock (governance or compliance) | Rely on any immutability yet |
| Locked (cooling-off) | Compliance lock set, within changeable_for_days | Delete the lock config; change params; restore | Delete a recovery point before min_retention_days |
| Locked (permanent) | Cooling-off elapsed | Add recovery points; let them expire | Delete lock; shorten retention; delete RP early |
| Governance locked | Governance lock set | Delete lock with permission; expire RPs | Rely on it for compliance-grade immutability |
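The state table can be turned into a small classifier for audit evidence or for a pre-change check. This sketch assumes a dictionary shaped like a `describe-backup-vault` response, with a boolean `Locked` field and an optional `LockDate` (the moment the lock becomes immutable) already parsed into a timezone-aware datetime; the state labels are names chosen here, not AWS terms.

```python
from datetime import datetime, timezone
from typing import Optional


def lock_state(vault: dict, now: Optional[datetime] = None) -> str:
    """Classify a vault into one of the four lock states from the table."""
    now = now or datetime.now(timezone.utc)
    if not vault.get("Locked", False):
        return "no-lock"
    lock_date = vault.get("LockDate")
    if lock_date is None:
        # Locked with no immutability date: removable, so governance mode.
        return "governance"
    # Before LockDate the lock can still be deleted; afterwards it cannot.
    return "compliance-cooling-off" if now < lock_date else "compliance-permanent"


# Hypothetical vault description with a lock that becomes permanent on 12 Jan.
vault = {"Locked": True, "LockDate": datetime(2025, 1, 12, tzinfo=timezone.utc)}
print(lock_state(vault, now=datetime(2025, 1, 10, tzinfo=timezone.utc)))
# compliance-cooling-off
```

Running this against production before the cooling-off window closes gives you a last, explicit checkpoint before the configuration becomes irreversible.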

Step 4 — Create the source backup vault and AWS Backup service role

Back in the production account, create a local vault to hold the primary recovery points and the IAM role AWS Backup assumes. Attach the two AWS-managed policies plus an inline statement granting use of the recovery account’s KMS key.

# providers.tf (source)
provider "aws" {
  alias   = "prod"
  region  = "ap-south-1"
  profile = "prod"
}

# source_account.tf
resource "aws_backup_vault" "source" {
  provider = aws.prod
  name     = "prod-source-vault"
}

resource "aws_iam_role" "backup" {
  provider = aws.prod
  name     = "AWSBackupCrossAccountRole"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "backup.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Managed policies for backup + restore of RDS/EBS
resource "aws_iam_role_policy_attachment" "backup" {
  provider   = aws.prod
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

resource "aws_iam_role_policy_attachment" "restore" {
  provider   = aws.prod
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForRestores"
}

# Use the recovery account's CMK for the copy
resource "aws_iam_role_policy" "kms_copy" {
  provider = aws.prod
  name     = "use-recovery-cmk"
  role     = aws_iam_role.backup.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
        "kms:GenerateDataKey*", "kms:DescribeKey", "kms:CreateGrant"
      ]
      Resource = aws_kms_key.backup_copy.arn   # cross-account ARN, account 222...
    }]
  })
}

The IAM policies on the Backup role

What each attachment grants and why you need it — drop the wrong one and a whole resource type silently fails to back up:

| Policy | Grants | Needed for | Omit it and… |
| --- | --- | --- | --- |
| AWSBackupServiceRolePolicyForBackup | Create snapshots of supported resources | The BACKUP_JOB itself | No snapshots taken at all |
| AWSBackupServiceRolePolicyForRestores | Restore recovery points | DR drills + real restores | Can back up but never restore |
| Inline use-recovery-cmk (KMS) | Use the recovery CMK on its ARN | The cross-account COPY_JOB | Copy Access Denied from source side |
| AWSBackupServiceRolePolicyForS3Backup | Back up S3 buckets | Only if protecting S3 | (not needed for RDS/EBS) |
| AWSBackupServiceRolePolicyForS3Restore | Restore S3 | Only if protecting S3 | (not needed for RDS/EBS) |

Source vs recovery — what each account holds

The asymmetry is the security model. A glance at what lives where makes the one-way property obvious:

| Resource | Source account (111…) | Recovery account (222…) |
| --- | --- | --- |
| Protected resources (RDS/EBS) | Yes | No (copies only) |
| Source vault | Yes (prod-source-vault) | No |
| Destination vault | No | Yes (recovery-destination-vault) |
| Backup plan + selection | Yes | No |
| Backup role (assumed by Backup) | Yes | A separate restore role |
| KMS CMK | Per-resource source key | Copy CMK (owns the copies’ encryption) |
| Vault Lock | Optional on source | Compliance lock on destination |
| Write path into recovery vault | None | Backup copy mechanism only |

Step 5 — Define the backup plan with a cross-account copy action

The plan is the schedule plus the rules. Each rule here takes a snapshot on a cron schedule, keeps it locally for a window, and — through copy_action — pushes a re-encrypted copy to the recovery vault with its own retention. The lifecycle blocks are what actually expire recovery points; AWS Backup, not you, deletes them on time.

resource "aws_backup_plan" "cross_account" {
  provider = aws.prod
  name     = "prod-cross-account-copy"

  rule {
    rule_name         = "daily-rds-ebs"
    target_vault_name = aws_backup_vault.source.name
    schedule          = "cron(0 3 * * ? *)"   # 03:00 UTC daily
    start_window      = 60                      # minutes to start
    completion_window = 300                     # minutes to finish

    lifecycle {
      delete_after = 35                         # local copy retention (days)
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.destination.arn  # in 222...
      lifecycle {
        delete_after = 90                       # recovery-account retention
      }
    }
  }

  # Weekly long-retention rule, copied with cold-storage tiering for cost
  rule {
    rule_name         = "weekly-longterm"
    target_vault_name = aws_backup_vault.source.name
    schedule          = "cron(0 4 ? * SUN *)"  # Sundays 04:00 UTC

    lifecycle {
      cold_storage_after = 30
      delete_after       = 365
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.destination.arn
      lifecycle {
        cold_storage_after = 30
        delete_after       = 2555              # ~7 years
      }
    }
  }
}

Every backup-rule setting

The plan rule is where most operational tuning happens. Each setting, its values, default and the trade-off:

| Setting | What it controls | Default | When to change | Trade-off / limit |
| --- | --- | --- | --- | --- |
| schedule | Cron for when the backup runs | none (required) | Match RPO; stagger to spread load | Sub-hour cadence increases cost/jobs |
| start_window (min) | How long a job may wait to start | 480 (8 hours) when omitted; minimum 60 | Tight windows for time-critical | Too tight → EXPIRED if capacity busy |
| completion_window (min) | Max time to finish before abort | 10080 (7 days) when omitted | Large datasets need more | Too short → ABORTED on big snapshots |
| lifecycle.delete_after (days) | Local retention | none (keep forever) | Always set; control cost | Must be ≥ cold_storage_after + 90 |
| lifecycle.cold_storage_after (days) | Move to cold tier | none | Long-retention data | Min 90 days in cold before delete |
| enable_continuous_backup | PITR for supported resources | false | RDS/Aurora PITR needs | Higher cost; not all resource types |
| recovery_point_tags | Tags on the recovery point | none | Cost allocation, search | |
| copy_action | Cross-account/region copy | none | This entire control | Each copy is a separate COPY_JOB |

Daily vs weekly long-term rules

Why two rules, and how their economics differ — the split is a cost and compliance decision, not an accident:

| Aspect | daily-rds-ebs | weekly-longterm |
| --- | --- | --- |
| Cadence | Daily 03:00 UTC | Sundays 04:00 UTC |
| Purpose | Operational recovery (recent state) | Compliance / long retention |
| Local retention | 35 days | 365 days |
| Remote retention | 90 days | 2555 days (~7y) |
| Cold storage | No (warm only) | After 30 days |
| Cost driver | Warm storage, frequent | Cold storage, sparse |
| Restore speed | Fast (warm) | Slower (cold thaw) |
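For capacity and cost planning it helps to know roughly how many recovery points each rule holds once it reaches steady state. The arithmetic below is a rough sketch that assumes every scheduled job succeeds and that retention is the only thing removing recovery points; the function name is hypothetical and the figures are the ones from the plan above.

```python
def steady_state_points(interval_days: int, delete_after_days: int) -> int:
    """Approximate number of recovery points retained at steady state."""
    # One point is created every interval_days and lives delete_after_days.
    return delete_after_days // interval_days


rules = {
    "daily-rds-ebs local":    steady_state_points(1, 35),
    "daily-rds-ebs remote":   steady_state_points(1, 90),
    "weekly-longterm local":  steady_state_points(7, 365),
    "weekly-longterm remote": steady_state_points(7, 2555),
}
for name, count in rules.items():
    print(f"{name}: ~{count} recovery points per protected resource")
```

Under those assumptions the daily rule keeps about 35 local and 90 remote points per resource, and the weekly rule about 52 local and 365 remote, which is why cold-storage tiering matters most on the weekly copies.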

The lifecycle constraints that bite

AWS Backup enforces relationships between lifecycle values; violating them makes terraform apply fail or a recovery point behave unexpectedly. The rules:

| Rule | Constraint | If violated |
| --- | --- | --- |
| Cold + delete spacing | delete_after ≥ cold_storage_after + 90 | API rejects the plan |
| Minimum cold duration | A recovery point stays ≥ 90 days in cold | Early delete blocked / billed minimum |
| Lock floor vs lifecycle | delete_after ≥ vault min_retention_days | Copy rejected by locked vault |
| Lock ceiling vs lifecycle | delete_after ≤ vault max_retention_days | Copy rejected by locked vault |
| Cold-eligible resources | Only some types support cold storage | cold_storage_after ignored/errors |
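The numeric constraints above are easy to check before `terraform apply` ever reaches the API. Here is a minimal Python sketch of such a pre-flight check; it covers only the three arithmetic rules (cold/delete spacing, lock floor, lock ceiling), and the function and parameter names are chosen for illustration.

```python
from typing import List, Optional


def lifecycle_problems(delete_after: int,
                       cold_storage_after: Optional[int] = None,
                       min_retention_days: Optional[int] = None,
                       max_retention_days: Optional[int] = None) -> List[str]:
    """Return human-readable violations of the lifecycle constraints."""
    problems = []
    if cold_storage_after is not None and delete_after < cold_storage_after + 90:
        problems.append(
            f"delete_after={delete_after} must be >= cold_storage_after + 90 "
            f"({cold_storage_after + 90})")
    if min_retention_days is not None and delete_after < min_retention_days:
        problems.append(
            f"delete_after={delete_after} is below the vault floor {min_retention_days}")
    if max_retention_days is not None and delete_after > max_retention_days:
        problems.append(
            f"delete_after={delete_after} exceeds the vault ceiling {max_retention_days}")
    return problems


# The weekly copy from the plan, against the lock from Step 3: no problems.
print(lifecycle_problems(2555, cold_storage_after=30,
                         min_retention_days=30, max_retention_days=2555))  # []
# A hypothetical misconfiguration: cold after 30 days but deleted at day 100.
print(lifecycle_problems(100, cold_storage_after=30))
```

The second call reports one violation, because a recovery point moved to cold storage on day 30 cannot be deleted before day 120.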

Cross-region as well as cross-account

For a regional-isolation requirement, the same copy_action can target a vault in a different region of the recovery account. Same-region vs cross-region copy, weighed:

| Dimension | Same-region copy | Cross-region copy |
| --- | --- | --- |
| Protects against | Account compromise, ransomware | The same, plus region/AZ-wide outage |
| Data-transfer cost | None | Inter-region egress per GB |
| Restore locality | Same region as source | Other region (failover-ready) |
| Latency of copy | Lower | Higher |
| Compliance fit | Most SOC 2 / DORA | Data-residency-sensitive cases |
| Recommended for | Default | Tier-0 workloads needing geo-DR |

Step 6 — Select resources by tag, not by ARN

Hard-coding ARNs into a selection guarantees that the database someone provisions next month is silently unprotected. Select by tag instead and make backup=daily part of your standard resource tagging (enforce it with an SCP or a Terraform module default). The selection also needs the Backup role’s ARN.

resource "aws_backup_selection" "tagged" {
  provider     = aws.prod
  name         = "rds-and-ebs-tagged"
  plan_id      = aws_backup_plan.cross_account.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "backup"
    value = "daily"
  }
}

Then tag the actual resources (or, better, set these tags in the modules that create them):

# Tag an RDS/Aurora cluster and an EBS volume for inclusion
aws rds add-tags-to-resource \
  --resource-name arn:aws:rds:ap-south-1:111111111111:cluster:ledger-prod \
  --tags Key=backup,Value=daily --profile prod

aws ec2 create-tags \
  --resources vol-0a1b2c3d4e5f6a7b8 \
  --tags Key=backup,Value=daily --profile prod

Selection methods compared

How AWS Backup can decide what to protect, and why tag-based wins for a control that must stay correct as the estate grows:

| Selection method | How it matches | Pro | Con / gotcha |
| --- | --- | --- | --- |
| Tag (STRINGEQUALS) | Resources with the tag key=value | New resources auto-included | Needs tag enforcement (SCP) |
| Tag (STRINGLIKE) | Wildcard tag value | Flexible grouping | Over-broad if pattern loose |
| Explicit ARN list | Named resources only | Precise | New resources silently skipped |
| not_resources exclusion | Everything except listed | Broad coverage | Easy to over-protect / cost |
| Resource type (via type filter) | All of a type (e.g. all RDS) | Whole-class coverage | May sweep in unintended resources |

Resource types and their cross-account copy support

Not every supported resource copies the same way; the per-type caveats are where surprises live:

| Resource type | Cross-account copy | Key gotcha |
| --- | --- | --- |
| EBS volume | Yes | Must use a CMK, not aws/ebs |
| RDS instance | Yes | Each snapshot is full (not incremental); CMK required |
| Aurora cluster | Yes | Cluster-level snapshot; CMK required |
| EC2 instance (AMI) | Yes | Copies the backing snapshots; CMK on volumes |
| EFS | Yes | Warm/cold; check region support |
| DynamoDB | Yes | Via AWS Backup, not native; CMK |
| FSx | Varies by flavour | Some flavours limited |
| S3 | Cross-Region/account via Backup | Needs the S3 backup/restore policies |
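Since the CMK requirement is the most common per-type surprise, a quick inventory check is useful. The sketch below assumes you have the `Volumes` list from `aws ec2 describe-volumes` and a lookup of key ARN to the `KeyManager` value from `aws kms describe-key` ("AWS" for AWS-managed keys, "CUSTOMER" for customer-managed ones); the function name, the reason strings and the sample data are hypothetical.

```python
from typing import Dict, List


def volumes_failing_cmk_prereq(volumes: List[dict],
                               key_managers: Dict[str, str]) -> Dict[str, str]:
    """Map volume IDs to the reason they miss the customer-managed-key prerequisite."""
    findings = {}
    for vol in volumes:
        if not vol.get("Encrypted"):
            findings[vol["VolumeId"]] = "unencrypted"
        elif key_managers.get(vol.get("KmsKeyId")) != "CUSTOMER":
            # AWS-managed keys (or keys we could not look up) are flagged.
            findings[vol["VolumeId"]] = "not a customer-managed key"
    return findings


# Hypothetical inventory: placeholder key ARNs, not real resources.
volumes = [
    {"VolumeId": "vol-aaa", "Encrypted": True, "KmsKeyId": "arn:example:key/cmk"},
    {"VolumeId": "vol-bbb", "Encrypted": True, "KmsKeyId": "arn:example:key/aws-ebs"},
    {"VolumeId": "vol-ccc", "Encrypted": False},
]
key_managers = {"arn:example:key/cmk": "CUSTOMER", "arn:example:key/aws-ebs": "AWS"}

print(volumes_failing_cmk_prereq(volumes, key_managers))
# {'vol-bbb': 'not a customer-managed key', 'vol-ccc': 'unencrypted'}
```

Anything this returns needs re-encrypting with a customer-managed key before its copies can be expected to leave the account.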

Enforcing the tag so coverage can’t silently lapse

A selection is only as honest as your tagging. The mechanisms that keep backup=daily from being optional:

| Mechanism | Where it runs | What it guarantees | Limit |
| --- | --- | --- | --- |
| SCP requiring the tag on create | Org / OU | New resources must carry it | Coverage depends on resource-type support |
| Terraform module default | IaC | Every module-made resource tagged | Console-made resources escape it |
| AWS Config rule (required-tags) | Account | Flags untagged resources | Detective, not preventive |
| Backup Audit Manager framework | Backup | Reports resources not in a plan | Reporting; needs a follow-up action |
| Tag Policy | Org | Standardises tag keys/values | Doesn’t force presence alone |
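Whichever enforcement mechanism you pick, a detective check that mirrors the selection's own matching logic is a cheap backstop. This sketch assumes a resource inventory you have already collected, with each entry holding an `Arn` and a `Tags` dictionary (that shape is an assumption, not an AWS response format), and applies the same exact, case-sensitive match as STRINGEQUALS.

```python
from typing import List


def unprotected_resources(resources: List[dict],
                          key: str = "backup",
                          value: str = "daily") -> List[str]:
    """Return ARNs whose tags would NOT match the backup selection."""
    # STRINGEQUALS is an exact match, so 'Daily' or a missing tag both miss.
    return [r["Arn"] for r in resources if r.get("Tags", {}).get(key) != value]


# Hypothetical inventory entries.
inventory = [
    {"Arn": "arn:aws:rds:ap-south-1:111111111111:cluster:ledger-prod",
     "Tags": {"backup": "daily"}},
    {"Arn": "arn:aws:rds:ap-south-1:111111111111:cluster:reports-prod",
     "Tags": {"backup": "Daily"}},   # wrong case: silently unprotected
    {"Arn": "arn:aws:ec2:ap-south-1:111111111111:volume/vol-0a1b2c3d4e5f6a7b8",
     "Tags": {}},
]

for arn in unprotected_resources(inventory):
    print("NOT in backup selection:", arn)
```

The second entry is the subtle one: a tag that looks right to a human but differs in case is skipped by the selection without any error.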

Step 7 — Catch the silent failures with EventBridge

AWS Backup will happily run for weeks with a failing copy job and never page anyone — the dashboard goes green on the source snapshot while the cross-account copy quietly errors. EventBridge is how you close that gap. Create a rule on the aws.backup source that matches failed copy jobs and vault-lock changes, and route it to SNS. This rule lives in the source account; mirror a copy-job rule in the recovery account too.

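resource "aws_cloudwatch_event_rule" "backup_failures" {
  provider    = aws.prod
  name        = "backup-copy-failure-alerts"
  description = "Alert on failed or aborted copy jobs"
  event_pattern = jsonencode({
    source      = ["aws.backup"]
    detail-type = ["Copy Job State Change"]
    detail = {
      state = ["FAILED", "ABORTED"]
    }
  })
}

# Vault changes are not reported as FAILED/ABORTED, so a state filter would
# hide them; this rule matches every vault state change instead.
resource "aws_cloudwatch_event_rule" "vault_changes" {
  provider    = aws.prod
  name        = "backup-vault-change-alerts"
  description = "Alert on any backup vault configuration change"
  event_pattern = jsonencode({
    source      = ["aws.backup"]
    detail-type = ["Backup Vault State Change"]
  })
}

# The topic also needs a policy that lets events.amazonaws.com call
# sns:Publish (an aws_sns_topic_policy, omitted here for brevity).
resource "aws_sns_topic" "backup_alerts" {
  provider = aws.prod
  name     = "backup-alerts"
}

resource "aws_cloudwatch_event_target" "to_sns" {
  provider  = aws.prod
  rule      = aws_cloudwatch_event_rule.backup_failures.name
  target_id = "sns"
  arn       = aws_sns_topic.backup_alerts.arn
}

resource "aws_cloudwatch_event_target" "vault_to_sns" {
  provider  = aws.prod
  rule      = aws_cloudwatch_event_rule.vault_changes.name
  target_id = "sns"
  arn       = aws_sns_topic.backup_alerts.arn
}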

Wire the SNS topic to the tools that already run your on-call: a subscription to the Datadog (or Dynatrace) AWS integration endpoint so a failed copy raises a monitor and shows on the reliability dashboard, and a subscription that hits a ServiceNow inbound webhook to auto-open a P2 incident with the failed job’s ARN. That way a broken backup is a ticket and a page within minutes, not a discovery during an actual restore.

The AWS Backup events worth wiring

The aws.backup source emits several detail-types; which to alert on and why:

| detail-type | Fires when | Alert on which states | Why it matters |
| --- | --- | --- | --- |
| Copy Job State Change | A COPY_JOB changes state | FAILED, ABORTED | The silent cross-account failure |
| Backup Job State Change | A BACKUP_JOB changes state | FAILED, EXPIRED, ABORTED | Source snapshot didn’t happen |
| Restore Job State Change | A restore changes state | FAILED | DR drill / real restore broke |
| Backup Vault State Change | Vault config (incl. lock) changes | any | Lock drift / tampering |
| Recovery Point State Change | RP becomes COMPLETED/PARTIAL | PARTIAL, EXPIRED | Incomplete or aged-out copy |
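To make the "which states" column concrete, here is a minimal sketch of how an exact-match event pattern is evaluated. It implements only a small subset of EventBridge matching (lists of literal values and nested objects), and the sample events are hypothetical, trimmed to the fields the patterns touch; the MODIFIED state on the vault event is an assumption used for illustration.

```python
def matches(pattern, event):
    """Return True if the event satisfies the pattern.

    Supports only literal-value lists and nested objects, which is a
    simplification of real EventBridge pattern matching.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


copy_pattern = {
    "source": ["aws.backup"],
    "detail-type": ["Copy Job State Change"],
    "detail": {"state": ["FAILED", "ABORTED"]},
}
# No state filter: any vault change should match.
vault_pattern = {
    "source": ["aws.backup"],
    "detail-type": ["Backup Vault State Change"],
}

# Hypothetical, trimmed events.
failed_copy = {
    "source": "aws.backup",
    "detail-type": "Copy Job State Change",
    "detail": {"state": "FAILED"},
}
vault_change = {
    "source": "aws.backup",
    "detail-type": "Backup Vault State Change",
    "detail": {"state": "MODIFIED"},
}

print(matches(copy_pattern, failed_copy))    # True
print(matches(copy_pattern, vault_change))   # False
print(matches(vault_pattern, vault_change))  # True
```

The second result is the one to notice: a pattern that filters on failure states never sees a vault configuration change, so the "any" row needs a pattern with no state filter.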

Where each alert should route

Not every event deserves a page; the routing matrix keeps signal high:

| Event | Severity | Route to | Page on-call? |
| --- | --- | --- | --- |
| Copy job FAILED | High | ServiceNow P2 + Datadog | Yes |
| Backup job FAILED | High | ServiceNow P2 | Yes |
| Restore job FAILED (drill) | High | ServiceNow + Slack | Yes |
| Vault-lock config changed | Critical | Security channel + SIEM | Yes (tamper) |
| Recovery point PARTIAL | Medium | Datadog monitor | Business hours |
| Backup job EXPIRED (window) | Medium | Datadog + capacity review | Business hours |
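The matrix above can be expressed as a small lookup, which is roughly what an enrichment Lambda or a routing service would hold. This is a sketch only: the function name and the dictionary shape are hypothetical, and the target names are labels copied from the table, not integration identifiers.

```python
# Routing rules transcribed from the matrix: (detail-type, state) -> route.
ROUTES = {
    ("Copy Job State Change", "FAILED"): ("High", ["ServiceNow P2", "Datadog"], True),
    ("Backup Job State Change", "FAILED"): ("High", ["ServiceNow P2"], True),
    ("Restore Job State Change", "FAILED"): ("High", ["ServiceNow", "Slack"], True),
    ("Recovery Point State Change", "PARTIAL"): ("Medium", ["Datadog monitor"], False),
    ("Backup Job State Change", "EXPIRED"): ("Medium", ["Datadog", "capacity review"], False),
}


def route(detail_type, state):
    """Return the route for an event, or None when it should not alert."""
    if detail_type == "Backup Vault State Change":
        # Any vault change is treated as possible tampering, whatever its state.
        return {"severity": "Critical", "targets": ["Security channel", "SIEM"], "page": True}
    entry = ROUTES.get((detail_type, state))
    if entry is None:
        return None
    severity, targets, page = entry
    return {"severity": severity, "targets": targets, "page": page}


print(route("Copy Job State Change", "FAILED"))
print(route("Copy Job State Change", "COMPLETED"))  # None: nothing to alert on
```

Keeping the table as data rather than branching logic means a new row in the matrix is a one-line change that is easy to review.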

SNS subscription targets

How the fan-out reaches each tool, and the auth model for each:

SNS subscription Protocol Auth / setup Result
Datadog AWS integration HTTPS endpoint Datadog-provided URL + external ID Monitor + dashboard event
Dynatrace HTTPS / webhook API token Problem + Davis correlation
ServiceNow HTTPS webhook Inbound integration user Auto-opened P2 incident
PagerDuty HTTPS / email integration Integration key Direct page
Email (fallback) email Confirm subscription Human notification
Lambda (enrichment) lambda Resource policy Add ARN/context before routing

Step 8 — Drive it all from CI with Vault-issued credentials

Run the Terraform from GitHub Actions (Jenkins works identically). The job asks HashiCorp Vault for short-lived AWS credentials via the AWS secrets engine, one lease scoped to the source account and one to the recovery account, so the pipeline never holds a static key. For app-consistent EBS snapshots on the matching-engine hosts, an Ansible play lays down fsfreeze pre/post scripts, which do on Linux what AWS Backup’s VSS integration does for you on Windows.

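# In the CI runner, before terraform: lease creds from Vault
export VAULT_ADDR="https://vault.internal:8200"
vault login -method=jwt role=backup-ci jwt="$CI_OIDC_TOKEN" >/dev/null

# Source-account lease; read once so every value comes from the same lease.
# The recovery-account lease is read the same way from its own role (not shown).
CREDS="$(vault read -format=json aws/creds/prod-backup-admin)"
export AWS_ACCESS_KEY_ID="$(jq -r '.data.access_key' <<<"$CREDS")"
export AWS_SECRET_ACCESS_KEY="$(jq -r '.data.secret_key' <<<"$CREDS")"
# STS-backed Vault roles also return a session token; export it when present
TOKEN="$(jq -r '.data.security_token // empty' <<<"$CREDS")"
if [ -n "$TOKEN" ]; then export AWS_SESSION_TOKEN="$TOKEN"; fi

terraform init
terraform plan  -out=tfplan
terraform apply -auto-approve tfplan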
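For runners where the credential step is written in Python rather than shell, the same mapping looks roughly like this. It is a sketch under assumptions: the sample lease is hypothetical, and the field names (access_key, secret_key, security_token) follow the Vault AWS secrets engine response as commonly documented, so check them against your Vault version.

```python
import json


def lease_to_env(lease_json):
    """Map a Vault AWS lease (JSON text) to the AWS_* environment variables."""
    data = json.loads(lease_json)["data"]
    env = {
        "AWS_ACCESS_KEY_ID": data["access_key"],
        "AWS_SECRET_ACCESS_KEY": data["secret_key"],
    }
    # STS-backed roles return a session token; IAM-user roles return null.
    token = data.get("security_token")
    if token:
        env["AWS_SESSION_TOKEN"] = token
    return env


# Hypothetical lease payloads, shaped like `vault read -format=json` output.
iam_user_lease = json.dumps(
    {"lease_duration": 900,
     "data": {"access_key": "AKIAEXAMPLE", "secret_key": "example", "security_token": None}}
)
sts_lease = json.dumps(
    {"lease_duration": 900,
     "data": {"access_key": "ASIAEXAMPLE", "secret_key": "example", "security_token": "tok"}}
)

print(sorted(lease_to_env(iam_user_lease)))
print(sorted(lease_to_env(sts_lease)))
```

Dropping the session token on an STS-backed lease is an easy mistake: the access key looks valid, but every AWS call is rejected.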

Gate the apply behind a pull-request review and let Wiz Code scan the Terraform in the PR — it will flag a vault with no lock, a KMS key with an over-broad policy, or public exposure before the plan is ever applied. At runtime, CrowdStrike Falcon sensors on the CI runners and the production hosts watch for tampering with the snapshot/backup agents themselves. If you already run External Secrets, the same Vault path can feed it — see Set Up External Secrets Operator to Sync Vault and AWS Secrets into Kubernetes.

How credentials reach the pipeline (and how they don’t)

The credential model is itself a control surface; the comparison shows why Vault-leased beats static keys:

Approach Lifetime Where the secret lives Blast radius if leaked
Vault AWS secrets engine (used here) Minutes (lease TTL) Nowhere persistent; minted per run Tiny — expires fast, scoped role
GitHub OIDC → IAM role Per-job token No static key; trust policy Small — scoped, short
Static IAM access key in CI secret Until rotated CI secret store Large — long-lived, often over-scoped
Hard-coded in repo Forever Git history Catastrophic — never do this
EC2 instance profile (self-hosted runner) Rotated by AWS Instance metadata Medium — scoped to instance role

The tools in the pipeline and what each guards

The supporting cast, mapped to the specific risk each removes:

Tool Stage Guards against Replaces
Terraform Build Drift, undocumented change Click-ops in the console
HashiCorp Vault Pre-apply Static long-lived keys Stored access keys
Wiz Code PR gate Misconfigured IaC shipped Post-hoc CSPM findings
GitHub Actions / Jenkins Orchestration Manual, unaudited applies Laptop terraform apply
Ansible Host config Crash-inconsistent snapshots Manual fsfreeze
CrowdStrike Falcon Runtime Agent/host tampering Trust without verification
Datadog / Dynatrace Runtime Unnoticed failures Annual audit discovery

Architecture at a glance

The flow is deliberately one-directional, and that direction is the whole point. Read the diagram left to right. In the source account (111…) the protected resources — RDS/Aurora clusters and EBS volumes, each tagged backup=daily and encrypted with a customer-managed key — are snapshotted on a schedule by a backup plan rule (cron 03:00 UTC), which writes the primary recovery point to the local source vault. A copy_action on that rule then triggers a separate COPY_JOB that re-encrypts the recovery point with the recovery account’s CMK and writes it into the destination vault in the recovery account (222…) — a vault carrying Vault Lock in compliance mode (min 30 days, max ~7 years) so even its own root cannot delete a copy early. The arrows only ever point into the recovery account; production holds no principal that can write to or delete from that vault.

The control plane runs alongside the data path. EventBridge rules in both accounts watch the aws.backup stream; a COPY_JOB that goes FAILED/ABORTED, or a Backup Vault State Change that signals lock drift, fans out through SNS to Datadog/Dynatrace (a monitor and a page) and to ServiceNow (an auto-opened P2 incident with the failed job’s ARN). The five numbered badges mark the failure points that actually bite in production: the Org features being off (every copy Access Denied), the recovery key policy missing the source-role grant (copy lands but won’t restore), a default-key snapshot silently skipping the copy, a lock that is either too rigid or absent, and the silent copy failure that no one pages on. The legend narrates each as symptom · confirm · fix — the same playbook the operational sections expand below.

Cross-account RDS and EBS snapshot copy pipeline: source account 111… with CMK-encrypted RDS/Aurora and EBS volumes tagged backup=daily and an AWS Backup role, a backup plan rule on a 03:00 UTC cron writing to a local source vault, a cross-account COPY_JOB that re-encrypts with the recovery account's KMS CMK whose key policy grants the source role, writing immutable copies into a destination vault in recovery account 222… under Vault Lock compliance mode with 30-day-to-7-year retention, and a control plane where EventBridge matches aws.backup FAILED/ABORTED copy-job and vault-lock events and fans them out through SNS to Datadog and a ServiceNow P2 incident — with five numbered failure badges for Org-feature-off, KMS key-policy gap, default-key snapshot skip, rigid-or-absent lock, and silent copy failure

Real-world scenario

Meridian Pay, a payments startup, runs its ledger on Aurora PostgreSQL and its matching engine on EC2 with EBS gp3 volumes, all in ap-south-1, all in one production account (111111111111). Two engineers, a tight INR budget, SOC 2 Type II in progress. Their backups were RDS automated backups (35-day window) plus a same-account AWS Backup plan for EBS — both in the production account. The Type II auditor’s finding was blunt: “Backups share a fate boundary with the workloads. No demonstrated off-account, immutable copy. No evidence copies are restorable.” They had ninety days to remediate.

The team spun up a dedicated recovery account (222222222222) in the same Org, and built this exact pipeline in Terraform. Week one went smoothly — Org features on, destination vault and CMK created, source role and plan applied, daily copies of the Aurora cluster and three EBS volumes flowing to the recovery vault. The dashboards were green. They almost signed it off.

The first thing that went wrong surfaced only because they ran the restore drill the brief insists on. The aws backup list-recovery-points-by-backup-vault --profile recovery call showed the copies present — but start-restore-job for the Aurora recovery point failed with a KMS AccessDenied. The recovery CMK’s key policy granted the source Backup role kms:Decrypt, but the recovery account’s restore role — a different principal — had no IAM permission to call kms:Decrypt on that key, and the key policy didn’t name it either. Cross-account KMS needs the grant on both sides; they had only the source side. Ten minutes to add a key-policy statement for the recovery restore role and an IAM allow on it, and the restore completed into a throwaway cluster ledger-restore-test, which they then dropped.

The second thing went wrong a week later and was the reason the whole EventBridge layer exists. They added a fourth EBS volume to the matching-engine fleet — and forgot it was encrypted with the default aws/ebs key, not their CMK. The daily backup job snapshotted it fine (source dashboard green), but the COPY_JOB silently failed because AWS Backup cannot copy a default-key snapshot across accounts. Nobody would have noticed for weeks — except the EventBridge rule on Copy Job State Change → FAILED fired within minutes, SNS opened a ServiceNow P2 with the failed job’s ARN, and Datadog paged. The fix was to re-encrypt the volume with their CMK (snapshot → copy with CMK → swap), after which the copy landed. That single page, on a control they’d built the week before, was what convinced the auditor the monitoring was real.

Before locking, they rehearsed the entire pipeline in a throwaway pair of accounts with changeable_for_days omitted (which leaves the lock in governance mode, removable at will) and min_retention_days = 1, deliberately, because they’d read that a compliance-mode lock is irreversible. Only once a restore from a locked test vault succeeded did they apply the production lock: changeable_for_days = 3, min_retention_days = 30, max_retention_days = 2555. The auditor screenshotted aws backup describe-backup-vault --query Locked returning true. Steady-state cost landed around ₹14,500/month, dominated by warm copy storage, with the weekly long-retention rule tiered to cold after 30 days to keep the seven-year obligation cheap. The lesson on the wall: “A copy you haven’t restored is a rumour. Drill it, watch it, then lock it, in that order.”

The remediation as a timeline, because the order of moves is the lesson:

| Time | Milestone | What they did | What it caught / cost |
| --- | --- | --- | --- |
| Week 1 | Pipeline live | Org features, CMK, vault, plan, daily copies | Copies present, dashboards green |
| Week 1 | First restore drill | start-restore-job from recovery vault | KMS gap: restore role had no kms:Decrypt |
| Week 1 | Fix the gap | Key-policy + IAM allow for recovery restore role | Restore succeeds into throwaway cluster |
| Week 2 | New volume added | Forgot it used the default aws/ebs key | COPY_JOB silently failed |
| Week 2 | EventBridge fires | Rule → SNS → ServiceNow P2 + Datadog page | Caught in minutes, not weeks |
| Week 2 | Re-encrypt + retry | Snapshot → copy with CMK → swap volume | Copy lands |
| Week 3 | Lock rehearsal | Throwaway accounts, changeable_for_days omitted | Proved restore-from-locked works |
| Week 3 | Production lock | 3 / 30 / 2555 compliance lock | Locked: true (audit evidence) |

Advantages and disadvantages

The cross-account, AWS-Backup-native, locked-copy model is the right control for regulated data — but it has sharp edges you should weigh openly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
| --- | --- |
| Copies live in an account production cannot write to, so a full source compromise can’t reach them | The asymmetry means a misconfigured KMS policy fails silently at restore, not at copy; you must drill to find it |
| Vault Lock compliance mode makes copies immutable to everyone, including recovery root | Compliance lock is irreversible after cooling-off; a test pointed at long retention is stuck for that retention |
| AWS Backup is native: no agents on RDS, one mechanism for RDS+EBS+more | The copy is a separate job; a green source snapshot hides a failing copy unless you wire EventBridge |
| Re-encryption with a recovery-owned CMK makes source key material irrelevant to copy confidentiality | Snapshots under the default aws/ebs or aws/rds key quietly fail the cross-account copy; easy to miss on new resources |
| Tag-based selection auto-protects resources nobody has provisioned yet | Only if tagging is enforced; an untagged resource is invisibly unprotected |
| Incremental EBS copies keep daily cost far below full size | RDS copies are full per snapshot; daily long-retention RDS gets expensive fast |
| Same-region copy avoids inter-region transfer charges | Same-region doesn’t survive a region outage; geo-DR needs cross-region and its egress cost |
| Whole thing is Terraform + EventBridge: auditable, reproducible, paged | Operational nuance (lock state, KMS, Org features) means it’s not “set and forget” |

The model is right whenever data is regulated or revenue-critical and “is every production resource being copied off-account, immutably, restorably” must be continuously true. It is overkill for a stateless app whose data lives entirely in a managed service with its own cross-region replication, and it is the wrong first move for a team that hasn’t yet got a single-account backup working. The disadvantages are all manageable — but only if you know they exist, which is the entire point of the operational sections.

Hands-on lab

Stand up the pipeline against a single EBS volume, force a copy, prove it landed, and tear down, all in a throwaway pair of accounts so the lock can’t trap you. We use the AWS CLI with prod and recovery profiles (plus a mgmt profile for Step 1), and the lab applies no Vault Lock at all, which is why the teardown in Step 9 works.

Step 1 — Confirm the Org features (management account).

aws backup describe-global-settings --profile mgmt
# Expect: "isCrossAccountBackupEnabled": "true"
# If not: aws backup update-global-settings --global-settings isCrossAccountBackupEnabled=true --profile mgmt

Step 2 — Create the recovery CMK and destination vault (recovery account).

KEY_ID=$(aws kms create-key --profile recovery \
  --description "lab cross-account backup CMK" \
  --query KeyMetadata.KeyId --output text)
aws kms create-alias --alias-name alias/lab-backup-xacct \
  --target-key-id "$KEY_ID" --profile recovery
# Attach a key policy granting the SOURCE backup role kms:Decrypt/Encrypt/ReEncrypt*/GenerateDataKey*/CreateGrant
aws backup create-backup-vault --backup-vault-name lab-dest-vault \
  --encryption-key-arn "arn:aws:kms:ap-south-1:222222222222:key/$KEY_ID" \
  --profile recovery
# The vault also needs an access policy that lets the source account copy in
# (backup:CopyIntoBackupVault), set with aws backup put-backup-vault-access-policy.

Expected: a vault ARN in account 222…. Confirm with aws backup describe-backup-vault --backup-vault-name lab-dest-vault --profile recovery.

Step 3 — Create the source vault + Backup role (source account).

aws backup create-backup-vault --backup-vault-name lab-source-vault --profile prod
# Role AWSBackupCrossAccountRole assumable by backup.amazonaws.com,
# with the two managed backup/restore policies + inline KMS allow on the recovery CMK ARN.

Step 4 — Tag a test EBS volume and create a plan with a copy action.

aws ec2 create-tags --resources vol-0labtestvolume00 \
  --tags Key=backup,Value=daily --profile prod
# Create a plan whose daily rule has copy_action → arn:aws:backup:ap-south-1:222222222222:backup-vault:lab-dest-vault
# and a selection matching tag backup=daily with the role ARN.

Step 5 — Force an on-demand backup (don’t wait for the schedule).

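aws backup start-backup-job \
  --backup-vault-name lab-source-vault \
  --resource-arn arn:aws:ec2:ap-south-1:111111111111:volume/vol-0labtestvolume00 \
  --iam-role-arn arn:aws:iam::111111111111:role/AWSBackupCrossAccountRole \
  --profile prod
# On-demand jobs skip the plan's copy_action: once COMPLETED, copy by hand.
aws backup start-copy-job --profile prod --source-backup-vault-name lab-source-vault \
  --recovery-point-arn <source-recovery-point-ARN> \
  --destination-backup-vault-arn arn:aws:backup:ap-south-1:222222222222:backup-vault:lab-dest-vault \
  --iam-role-arn arn:aws:iam::111111111111:role/AWSBackupCrossAccountRole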

Step 6 — Watch the copy job to COMPLETED.

aws backup list-copy-jobs --by-state RUNNING --profile prod \
  --query 'CopyJobs[].{Id:CopyJobId,State:State,Dest:DestinationBackupVaultArn}'
# Then by COMPLETED. A FAILED here on a default-key volume is the expected lesson — re-encrypt with the CMK.

Step 7 — Confirm the recovery point exists in the recovery account.

aws backup list-recovery-points-by-backup-vault \
  --backup-vault-name lab-dest-vault --profile recovery \
  --query 'RecoveryPoints[].{Arn:RecoveryPointArn,Status:Status,Created:CreationDate}'
# Expect at least one COMPLETED recovery point.

Step 8 — Prove it’s restorable.

aws backup start-restore-job \
  --recovery-point-arn <ARN-from-step-7> \
  --iam-role-arn arn:aws:iam::222222222222:role/LabRestoreRole \
  --metadata '{"volumeType":"gp3","availabilityZone":"ap-south-1a"}' \
  --resource-type EBS --profile recovery
# A COMPLETED restore here is the win. A KMS AccessDenied = the recovery restore role lacks kms:Decrypt.

Step 9 — Teardown (works because we never locked).

# Delete recovery points, then the vaults, then the plan/selection, then the CMK alias/key.
aws backup delete-recovery-point --backup-vault-name lab-dest-vault \
  --recovery-point-arn <ARN> --profile recovery
aws backup delete-backup-vault --backup-vault-name lab-dest-vault --profile recovery
aws backup delete-backup-vault --backup-vault-name lab-source-vault --profile prod
aws kms schedule-key-deletion --key-id "$KEY_ID" --pending-window-in-days 7 --profile recovery

The lab’s deliberate teaching moment is Step 6 on a default-key volume: the backup succeeds, the copy fails. That is the production trap, reproduced safely.

Common mistakes & troubleshooting

This is the differentiator. Each failure mode below is real, with the symptom, the root cause, the exact command or path to confirm it, and the fix. Scan the playbook table first, then read the detail for whichever row matches.

| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Every COPY_JOB → Access Denied | Org cross-account backup not enabled | aws backup describe-global-settings shows false | update-global-settings isCrossAccountBackupEnabled=true (mgmt) |
| 2 | Copy lands but restore fails on KMS | Recovery CMK doesn’t grant restoring role kms:Decrypt | start-restore-job → AccessDenied; read key policy | Add key-policy + IAM allow for the restore role |
| 3 | New resource’s copy silently fails | Snapshot uses default aws/ebs or aws/rds key | COPY_JOB FAILED with opaque KMS error | Re-encrypt resource with a CMK, then re-copy |
| 4 | Source dashboard green, no copies | Watching Backup Job, not Copy Job | list-copy-jobs --by-state FAILED | Add EventBridge on Copy Job State Change |
| 5 | Copy rejected by destination vault | Lifecycle delete_after < vault min_retention_days | Vault lock config vs plan lifecycle | Raise delete_after ≥ lock floor |
| 6 | Can’t delete a test vault | Compliance lock past cooling-off | describe-backup-vault → Locked: true | Wait out retention; never lock test long |
| 7 | New database unprotected | Selected by ARN, not tag | list-backup-selections; resource has no backup tag | Switch to tag selection; enforce tag via SCP |
| 8 | Job EXPIRED before starting | start_window too tight under load | describe-backup-job state EXPIRED | Increase start_window; stagger schedules |
| 9 | Job ABORTED mid-run | completion_window shorter than dataset needs | describe-backup-job state ABORTED | Increase completion_window |
| 10 | Source role can’t be assumed | Trust policy missing backup.amazonaws.com | get-role assume-role-policy | Add the service principal to the trust policy |
| 11 | EBS snapshot crash-inconsistent | No fsfreeze pre/post hook on Linux host | Restored FS needs fsck / DB recovery | Ansible fsfreeze hooks; quiesce the app |
| 12 | Cold-tier copy won’t delete on time | <90 days min in cold storage | RP still present after delete_after | Respect the 90-day cold minimum in lifecycle |

Mistake 1 — Org cross-account backup left off

Everything else is perfect — KMS, IAM, the plan — and every copy still Access Denieds. The feature is Org-level and off by default.

Confirm. aws backup describe-global-settings --profile mgmt and look for isCrossAccountBackupEnabled. Fix. aws backup update-global-settings --global-settings isCrossAccountBackupEnabled=true --profile mgmt. Step 1, always, before you debug anything else.

Mistake 2 — The KMS key-policy gap (the famous one)

The copy lands in the recovery vault and you congratulate yourself — until the restore fails with AccessDenied on the key. The recovery CMK granted the source Backup role decrypt, but the recovery account’s restore role (a different principal) was never granted, in either the key policy or its own IAM. Cross-account KMS needs both sides.

Confirm. aws backup start-restore-job ... returns a KMS AccessDenied; aws kms get-key-policy --key-id <id> --policy-name default --profile recovery shows no statement for the restore role. Fix. Add a key-policy statement granting the restore role kms:Decrypt/kms:DescribeKey, and an IAM allow on the restore role for the same. This is the single most-missed line in the whole build.

Mistake 3 — Default-key snapshots silently skip the copy

AWS Backup cannot copy a snapshot encrypted with the AWS-managed aws/ebs or aws/rds key across accounts. The backup job succeeds (green), the copy job fails with an opaque KMS error, and on a new resource nobody notices.

Confirm. aws backup list-copy-jobs --by-state FAILED --profile prod; cross-reference the resource’s key with aws ec2 describe-volumes --query 'Volumes[].KmsKeyId'. Fix. Re-encrypt the resource with a customer-managed key (snapshot → copy snapshot with the CMK → create volume → swap), then let the copy run.
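The cross-reference can be scripted once the two CLI outputs are in hand. This sketch assumes you have already collected the Volumes list from describe-volumes and built a map from each key ARN to its KeyManager value (AWS or CUSTOMER) from kms describe-key; the function name and the sample IDs are hypothetical.

```python
def volumes_on_aws_managed_keys(volumes, key_managers):
    """Return IDs of encrypted volumes whose KMS key is AWS-managed.

    volumes: list of dicts shaped like describe-volumes 'Volumes' entries.
    key_managers: dict mapping a key ARN to 'AWS' or 'CUSTOMER'.
    """
    flagged = []
    for vol in volumes:
        key = vol.get("KmsKeyId")
        if not vol.get("Encrypted") or key is None:
            continue  # unencrypted volumes are a separate finding
        if key_managers.get(key) == "AWS":
            flagged.append(vol["VolumeId"])
    return flagged


# Hypothetical inventory: one CMK volume, one default-key volume.
CMK = "arn:aws:kms:ap-south-1:111111111111:key/cmk-example"
DEFAULT = "arn:aws:kms:ap-south-1:111111111111:key/default-example"
volumes = [
    {"VolumeId": "vol-a", "Encrypted": True, "KmsKeyId": CMK},
    {"VolumeId": "vol-b", "Encrypted": True, "KmsKeyId": DEFAULT},
    {"VolumeId": "vol-c", "Encrypted": False},
]
key_managers = {CMK: "CUSTOMER", DEFAULT: "AWS"}

print(volumes_on_aws_managed_keys(volumes, key_managers))  # ['vol-b']
```

Run on a schedule against tagged volumes, a check like this finds the default-key resource before its first copy job fails.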

Mistake 4 — Treating a green source snapshot as success

The source backup and the cross-account copy are two jobs. The console’s snapshot view can be entirely green while every copy fails.

Confirm. aws backup list-copy-jobs --by-state FAILED --profile prod. Fix. Wire the EventBridge rule on Copy Job State Change → FAILED/ABORTED (Step 7) so the copy, not just the backup, is monitored and paged.
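For a quick view that does not depend on the console, the JSON from list-copy-jobs can be summarised in a few lines. This is a sketch: the sample response is hypothetical, and the field names (CopyJobs, CopyJobId, State, StatusMessage, ResourceArn) are what the CLI output is expected to contain, so verify them against your own output.

```python
from collections import Counter


def summarize_copy_jobs(response):
    """Count copy jobs by state and list the ones that failed or aborted."""
    jobs = response.get("CopyJobs", [])
    counts = Counter(job.get("State", "UNKNOWN") for job in jobs)
    failures = [
        {
            "id": job.get("CopyJobId"),
            "resource": job.get("ResourceArn"),
            "reason": job.get("StatusMessage"),
        }
        for job in jobs
        if job.get("State") in ("FAILED", "ABORTED")
    ]
    return dict(counts), failures


# Hypothetical parsed output of `aws backup list-copy-jobs --output json`.
response = {
    "CopyJobs": [
        {"CopyJobId": "job-1", "State": "COMPLETED", "ResourceArn": "arn:example:vol-a"},
        {"CopyJobId": "job-2", "State": "FAILED", "ResourceArn": "arn:example:vol-b",
         "StatusMessage": "KMS key not accessible"},
    ]
}

counts, failures = summarize_copy_jobs(response)
print(counts)    # {'COMPLETED': 1, 'FAILED': 1}
print(failures)
```

This is a spot check, not a replacement for the EventBridge rule: it only tells you about failures when someone remembers to run it.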

Mistake 5 — Lifecycle vs lock-floor conflict

A copy is rejected by the locked destination vault because the rule’s delete_after is below the vault’s min_retention_days. The lock won’t accept anything it would have to delete early.

Confirm. Compare the plan rule’s copy_action.lifecycle.delete_after against aws backup describe-backup-vault --query MinRetentionDays. Fix. Raise delete_after to ≥ the lock floor (and ≤ the ceiling). The lock’s min/max bound every copy that lands.
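The comparison is simple enough to run as a pre-apply check. A minimal sketch, with a hypothetical function name, using the production lock values from this guide (floor 30, ceiling 2555):

```python
def check_copy_lifecycle(delete_after, min_retention_days, max_retention_days):
    """Return a list of problems; an empty list means the lifecycle fits the lock."""
    problems = []
    if delete_after < min_retention_days:
        problems.append(
            f"delete_after={delete_after} is below the lock floor {min_retention_days}"
        )
    if delete_after > max_retention_days:
        problems.append(
            f"delete_after={delete_after} is above the lock ceiling {max_retention_days}"
        )
    return problems


print(check_copy_lifecycle(14, 30, 2555))    # one problem: below the floor
print(check_copy_lifecycle(90, 30, 2555))    # []
print(check_copy_lifecycle(3650, 30, 2555))  # one problem: above the ceiling
```

Wired into CI against the plan's copy_action values, this turns a rejected copy at 03:00 into a failed pull-request check.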

Mistake 6 — Locking a test vault into long retention

You point a learning exercise at min_retention_days = 30 (or worse, 2555), the cooling-off window passes, and now you cannot delete the vault or its recovery points for the full retention. Compliance mode has no override.

Confirm. aws backup describe-backup-vault --backup-vault-name <name> --query '[Locked,LockDate]' returns true with a LockDate in the past, meaning the changeable_for_days window has closed. Fix. There isn’t a fast one: wait out min_retention_days. Prevent it: in any non-prod vault, omit changeable_for_days (the lock then stays in governance mode and can be removed) or use min_retention_days = 1, and never point a lab at a production retention.

Mistake 7 — Selecting by ARN

A selection that lists ARNs protects exactly those resources and silently ignores everything provisioned afterwards.

Confirm. aws backup list-backup-selections --backup-plan-id <id> shows explicit ARNs; the new resource lacks a backup tag. Fix. Switch to a selection_tag (Step 6) and enforce the tag with an SCP or a Terraform module default so coverage can’t lapse.

Mistakes 8–9 — EXPIRED and ABORTED jobs

EXPIRED means the job never started within start_window (capacity was busy, window too tight). ABORTED means it started but couldn’t finish within completion_window (dataset too large).

Confirm. aws backup describe-backup-job --backup-job-id <id> --query State. Fix. Increase start_window (and stagger schedules to spread load) for EXPIRED; increase completion_window for ABORTED.

Mistake 10 — Trust policy missing the service principal

The Backup role can’t be assumed because its trust policy doesn’t list backup.amazonaws.com.

Confirm. aws iam get-role --role-name AWSBackupCrossAccountRole --query 'Role.AssumeRolePolicyDocument'. Fix. Add Principal.Service = backup.amazonaws.com with sts:AssumeRole.
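The trust document that get-role returns can be checked mechanically. The sketch below takes the already-parsed AssumeRolePolicyDocument and looks for an Allow statement that names backup.amazonaws.com with sts:AssumeRole; the function name and sample documents are hypothetical, and conditions are ignored.

```python
def _as_list(value):
    """Policy fields may be a single string or a list; normalise to a list."""
    return [value] if isinstance(value, str) else list(value or [])


def trusts_backup_service(trust_policy):
    """True if the trust policy lets backup.amazonaws.com assume the role."""
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        actions = _as_list(stmt.get("Action"))
        if "backup.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False


# Hypothetical trust documents: one correct, one trusting the wrong service.
good = {"Statement": [{"Effect": "Allow",
                       "Principal": {"Service": "backup.amazonaws.com"},
                       "Action": "sts:AssumeRole"}]}
bad = {"Statement": [{"Effect": "Allow",
                      "Principal": {"Service": "ec2.amazonaws.com"},
                      "Action": "sts:AssumeRole"}]}

print(trusts_backup_service(good))  # True
print(trusts_backup_service(bad))   # False
```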

Mistake 11 — Crash-inconsistent EBS snapshots

A snapshot taken while the app is mid-write captures torn state; the restored filesystem needs fsck and the database may need crash recovery. On Windows, AWS Backup uses VSS; on Linux you must quiesce yourself.

Confirm. A restored volume’s filesystem reports inconsistencies; DB starts in recovery. Fix. An Ansible play installs fsfreeze pre/post scripts (or app-level quiesce) so the snapshot is application-consistent.

Mistake 12 — Cold-tier copies that won’t expire

A recovery point tiered to cold storage has a 90-day minimum in cold before it can be deleted; a delete_after that ignores this leaves the copy lingering (and billed for the minimum).

Confirm. The recovery point is still present after its nominal delete_after. Fix. Ensure delete_after ≥ cold_storage_after + 90; AWS Backup enforces this on the plan, but a mismatch on an imported plan can surface here.
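The 90-day rule is plain arithmetic, sketched here with hypothetical function names and the 30-day cold transition this guide uses for the long-retention rule:

```python
COLD_MINIMUM_DAYS = 90  # minimum time a recovery point must spend in cold storage


def earliest_valid_delete_after(cold_storage_after):
    """Smallest delete_after that respects the cold-storage minimum."""
    return cold_storage_after + COLD_MINIMUM_DAYS


def cold_lifecycle_ok(cold_storage_after, delete_after):
    """True if the lifecycle leaves at least 90 days in cold storage."""
    return delete_after >= earliest_valid_delete_after(cold_storage_after)


print(earliest_valid_delete_after(30))  # 120
print(cold_lifecycle_ok(30, 2555))      # True
print(cold_lifecycle_ok(30, 90))        # False
```

A recovery point moved to cold on day 30 cannot leave before day 120, so a 90-day delete_after on that rule is invalid even though it looks generous.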

Best practices

Security notes

The architecture is its own primary control: production holds no IAM principal able to write to or delete from the recovery vault, so a full compromise of the source account cannot reach the copies. The only data path into the recovery vault is AWS Backup’s copy mechanism, which the recovery KMS key policy explicitly grants to one named role and nothing else.

Immutability against the insider and the ransomware operator. Vault Lock compliance mode means that even a principal who gets into the recovery account — including its root — cannot delete a recovery point before min_retention_days or weaken the lock once the cooling-off window has passed. Governance mode would let a privileged principal remove the lock; compliance mode does not. This is what makes the copies ransomware-resistant rather than merely off-box.

Encryption and key isolation. Re-encryption with a recovery-owned CMK means the source account’s key material is irrelevant to the copies’ confidentiality — compromising the source’s keys does not expose the copies. The recovery CMK’s policy is least-privilege: the source Backup role gets exactly the actions needed to write and (for source-side restore) decrypt, and the recovery restore role gets decrypt for DR drills. Nothing else is named.

Human and machine identity. Every human entry point goes through Okta → IAM Identity Center (or Entra ID) SSO with no standing keys and a full audit trail. The CI pipeline uses Vault-leased, short-lived credentials scoped per account, so no static access key exists to leak. Wiz Code gates the IaC for misconfiguration in the PR — a vault with no lock, an over-broad KMS policy, a public exposure — before the plan is ever applied, and CrowdStrike Falcon watches the hosts and CI runners at runtime for tampering with the backup/snapshot agents. The least-privilege and identity controls, at a glance:

Control Mechanism Protects against
One-way write path No source principal can write/delete recovery vault Source compromise reaching copies
Immutable retention Vault Lock compliance mode Insider / ransomware deleting copies
Key isolation Recovery-owned CMK; least-privilege policy Source key compromise exposing copies
Federated human access Okta → IAM Identity Center SSO Long-lived keys, untraceable access
Short-lived CI creds Vault AWS secrets engine Static key leak in the pipeline
Pre-apply IaC scan Wiz Code in the PR Misconfigured policy/lock shipped
Runtime host integrity CrowdStrike Falcon Agent/host tampering
Continuous evidence Backup Audit Manager + SIEM “Are we actually covered?” drift

Cost & sizing

Cross-account copy cost is dominated by two things: storage of the copies in the recovery account (warm tier) and, for the long-retention rule, the much cheaper cold-storage tier that cold_storage_after moves recovery points into after 30 days. Snapshot storage is incremental for EBS, so daily copies of slowly-changing volumes cost far less than their full size suggests; RDS copies are full per snapshot, so right-size the daily-vs-weekly split. Cross-account, same-region copy avoids inter-region data-transfer charges — keep the recovery account in the same region unless a regional-isolation requirement forces otherwise, in which case budget the transfer. Watch the long-retention delete_after: a 7-year cap you never actually need quietly compounds.

What drives the bill, and the lever for each:

| Cost driver | Roughly what it costs | Lever to reduce | Gotcha |
| --- | --- | --- | --- |
| Warm copy storage (recovery vault) | Per-GB-month, warm tier | Shorter delete_after; incremental EBS | RDS copies are full, not incremental |
| Cold copy storage | Much lower per-GB-month | cold_storage_after for long-retention | 90-day minimum; thaw cost on restore |
| Cross-region transfer | Per-GB egress | Stay same-region unless geo-DR needed | Applies to every cross-region copy |
| Restore (DR drill) | Restore + thaw (if cold) | Drill quarterly, drop fast | Cold thaw adds latency + cost |
| KMS | Per key + per-request | Reuse one CMK per purpose | Rotation is free; key count adds up |
| EventBridge + SNS | Negligible at this volume | | Don’t over-fan-out to email noise |
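The incremental-versus-full distinction is easier to feel with numbers. The sketch below is a rough model only: it treats an incremental chain as one full baseline plus the daily changed blocks, treats full copies as the whole dataset per retained snapshot, and uses a placeholder price that is not an AWS rate. Every input here is hypothetical.

```python
def monthly_copy_storage_cost(full_gb, daily_change_gb, retained_days,
                              price_per_gb_month, incremental):
    """Estimate steady-state monthly storage cost for retained daily copies."""
    if incremental:
        # One full baseline plus the changed blocks of each later day.
        stored_gb = full_gb + daily_change_gb * (retained_days - 1)
    else:
        # Every retained copy stores the full dataset.
        stored_gb = full_gb * retained_days
    return round(stored_gb * price_per_gb_month, 2)


PLACEHOLDER_PRICE = 0.05  # per GB-month; illustrative only, not an AWS price

# 100 GB dataset, 2 GB of change per day, 30 daily copies retained.
print(monthly_copy_storage_cost(100, 2, 30, PLACEHOLDER_PRICE, incremental=True))   # 7.9
print(monthly_copy_storage_cost(100, 2, 30, PLACEHOLDER_PRICE, incremental=False))  # 150.0
```

Under these assumptions the full-copy model stores 3,000 GB against 158 GB for the incremental one, which is why the daily-versus-weekly split matters far more for RDS than for EBS.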

Rough sizing for a small estate (one Aurora cluster + a handful of gp3 volumes, ap-south-1):

Scenario Approx monthly Notes
Daily copies, 90-day warm retention, same-region ₹12,000–16,000 Dominated by warm storage of copies
+ Weekly long-retention, cold after 30d, 7y +₹2,000–3,500 Cold tier keeps the 7-year obligation cheap
+ Cross-region copy on tier-0 only +egress per GB Budget transfer for the geo-DR slice
Quarterly restore drills negligible run, low thaw The cost of proving it works

There is no AWS Backup free tier for this, but the lab above — one volume, no lock, torn down in an hour — costs almost nothing. Tag the recovery vault’s spend and surface it on the same Datadog/Dynatrace cost dashboard the rest of the platform reports to, so backup storage is a line the team owns rather than a surprise on the bill.

Interview & exam questions

1. Why does cross-account AWS Backup copy require AWS Organizations, and what happens if the Org feature is off? Cross-account copy is an Org-level capability; AWS Backup must have trusted service access and isCrossAccountBackupEnabled=true set in the management account. Without it, every COPY_JOB fails with Access Denied regardless of how correct the KMS and IAM are. Maps to the Resilient Architectures domain of the AWS Solutions Architect certs.

2. The copy lands in the recovery vault but the restore fails with a KMS error. What’s the most likely cause? The recovery CMK’s policy (and/or the restoring role’s IAM) doesn’t grant the principal performing the restore kms:Decrypt on the key. Cross-account KMS needs the action allowed on both sides — key policy and caller IAM. The source Backup role is commonly granted; the recovery restore role is the one forgotten.

3. Why can a snapshot back up successfully but fail to copy across accounts? If the resource is encrypted with the AWS-managed aws/ebs or aws/rds key, AWS Backup cannot copy it across accounts. The backup job succeeds locally; the copy job fails with an opaque KMS error. The fix is to re-encrypt the resource with a customer-managed key.

4. Compare Vault Lock governance mode and compliance mode. Governance mode can be removed by a principal holding backup:DeleteBackupVaultLockConfiguration — it deters mistakes but not a determined insider. Compliance mode becomes permanent after the changeable_for_days cooling-off window: nobody, including root, can delete a recovery point early or weaken the lock. Use compliance for copies that must be WORM-immutable.

5. What is changeable_for_days, and why is getting it wrong dangerous? It’s the cooling-off window during which a compliance-mode lock can still be removed. After it elapses the lock is permanent. Setting a long min_retention_days and letting the window pass on a test vault traps it for the full retention with no override — so rehearse in throwaway accounts with short retention.

6. Why select backup resources by tag rather than ARN? A tag-based selection (backup=daily) automatically includes any resource carrying the tag, so resources provisioned later are protected the moment they exist. An ARN list silently omits everything created afterwards. Pair it with an SCP enforcing the tag so coverage can’t lapse.

7. The source-snapshot dashboard is green but no copies exist in the recovery account. What went wrong and how do you prevent recurrence? The source backup and the cross-account copy are separate jobs; the copy job has been failing while the backup job succeeds. Prevent it by wiring an EventBridge rule on Copy Job State Change → FAILED/ABORTED to SNS, so the copy is monitored and paged independently.

8. How do start_window and completion_window differ, and what states do their violations produce? start_window is how long a job may wait to start; exceeding it yields EXPIRED. completion_window is the max time to finish; exceeding it yields ABORTED. Tight start windows under capacity pressure cause EXPIRED; large datasets against a short completion window cause ABORTED.

9. Why isn’t scaling the recovery account’s storage the answer to a failing copy, and what is? The copy failure is almost always a permission or encryption problem — Org feature off, KMS policy gap, or a default-key snapshot — not capacity. The fix is the matching configuration change (enable the feature, grant the key, re-encrypt), confirmed via list-copy-jobs and describe-global-settings.

10. How does this architecture defend against ransomware specifically? Copies live in an account production cannot write to, and Vault Lock compliance mode makes them immutable even to the recovery account’s root. A ransomware operator who encrypts or deletes everything in the source account — and even one who breaches the recovery account — cannot delete the recovery points before their retention. Maps to the Security and Resilience pillars of the Well-Architected Framework.

11. What does AWS Backup do for application consistency on EBS, and what must you do on Linux? On Windows, AWS Backup integrates with VSS for application-consistent snapshots. On Linux there is no equivalent, so you must quiesce the application yourself — typically fsfreeze pre/post hooks (via Ansible) — or accept crash-consistent snapshots that may need recovery on restore.

12. Why tier long-retention copies to cold storage, and what constraint applies? Cold storage is far cheaper per GB-month, making a 7-year retention economical. The constraint is a 90-day minimum in cold storage before a recovery point can be deleted, so delete_after must be at least cold_storage_after + 90.
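
The arithmetic in question 12 is easy to check mechanically before a plan is applied. A minimal sketch in Python, assuming the 90-day cold-storage minimum stated above; the function name validate_lifecycle is hypothetical, and its parameters simply mirror the cold_storage_after and delete_after settings:

```python
from typing import Optional

# Minimum stay in cold storage, in days, as stated in question 12.
COLD_STORAGE_MIN_DAYS = 90


def validate_lifecycle(cold_storage_after: Optional[int],
                       delete_after: Optional[int]) -> None:
    """Raise ValueError if a lifecycle breaks the cold-storage minimum.

    None means "not set": no cold tiering, or no expiry at all.
    Both values are in days.
    """
    if cold_storage_after is None or delete_after is None:
        return  # nothing to compare, so the constraint cannot be violated

    if cold_storage_after < 0 or delete_after < 0:
        raise ValueError("lifecycle values must be non-negative day counts")

    minimum_delete_after = cold_storage_after + COLD_STORAGE_MIN_DAYS
    if delete_after < minimum_delete_after:
        raise ValueError(
            f"delete_after={delete_after} is below "
            f"cold_storage_after + {COLD_STORAGE_MIN_DAYS} = {minimum_delete_after}"
        )


if __name__ == "__main__":
    # Cold after 30 days, kept roughly 7 years: satisfies the constraint.
    validate_lifecycle(cold_storage_after=30, delete_after=2555)
    print("lifecycle ok")
```

Run as a pre-apply check in CI, a guard like this turns a rejected plan into a readable error message before anything reaches AWS Backup.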

Quick check

  1. Which single Org setting, if false, causes every cross-account copy job to fail with Access Denied no matter how correct your KMS and IAM are?
  2. A copy lands in the recovery vault but the restore fails with a KMS AccessDenied. Name the most-missed grant.
  3. Why does a snapshot encrypted with the aws/ebs key fail to copy across accounts?
  4. After the changeable_for_days window elapses on a compliance-mode lock, who can delete a recovery point early?
  5. Which EventBridge detail-type must you alert on to catch a silent cross-account copy failure?

Answers

  1. isCrossAccountBackupEnabled (set in the Organizations management account via update-global-settings). Confirm with aws backup describe-global-settings.
  2. The recovery account’s restore role needs kms:Decrypt on the recovery CMK — granted in both the key policy and the role’s IAM. The source Backup role is usually granted; the restore role is the one forgotten.
  3. AWS Backup cannot copy snapshots encrypted with an AWS-managed default key (aws/ebs or aws/rds) across accounts. Re-encrypt the resource with a customer-managed key first.
  4. Nobody — not even the recovery account root. Compliance mode is permanent after cooling-off; there is no override.
  5. Copy Job State Change (matching FAILED/ABORTED). Watching only Backup Job State Change leaves the copy failure invisible.
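
The rule behind answer 5 can be sketched as an event pattern. The Python below is illustrative only: the source value aws.backup and the detail field name state are assumptions about the event shape and should be verified against a real event from your account, and the matches helper is a hypothetical local stand-in for a small subset of EventBridge matching, not the service's actual engine.

```python
import json


def copy_failure_pattern() -> dict:
    """Event pattern for failed or aborted copy jobs.

    Assumption: AWS Backup events use source "aws.backup" and carry the
    job state in detail.state. Check a sample event before relying on it.
    """
    return {
        "source": ["aws.backup"],
        "detail-type": ["Copy Job State Change"],
        "detail": {"state": ["FAILED", "ABORTED"]},
    }


def matches(pattern: dict, event: dict) -> bool:
    """Tiny subset of pattern matching, for local testing only.

    Each pattern leaf is a list of allowed values; nested dicts recurse.
    Fields absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    # The JSON form is what an EventBridge rule takes as its event pattern.
    print(json.dumps(copy_failure_pattern(), indent=2))
```

Keeping the pattern in code makes the point of answer 5 testable: an event whose detail-type is Backup Job State Change does not match, so a rule that watches only backup jobs never sees the copy failure.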
