
Centralized AWS Backup with Organizations: Vault Lock, Cross-Account Copy, and Recovery Runbooks

Backups are not a feature you turn on; they are a control plane you operate. The failure mode that ends careers is not “we had no backups” — it is “we had backups, in the same account, encrypted with the same key, deletable by the same blast-radius identity that just got compromised.” Ransomware crews and fat-fingered automation both target the recovery path first, because the recovery path is the one thing standing between them and a paid ransom or a permanent data-loss incident. This guide builds that recovery path as a separate trust boundary: a delegated backup admin that authors policy, compliance-mode Vault Lock vaults that even your root user cannot empty, copies landing in a logically air-gapped account in another Region, and restore tests that prove RTO/RPO instead of asserting them. It assumes AWS Organizations with all features enabled and a multi-account landing zone underneath.

The reason this is hard is that AWS Backup is an orchestrator, not a storage service — it coordinates snapshots that physically live with each source service (EBS, RDS, DynamoDB, S3, EFS, FSx), wrapped in recovery points that a vault references and a KMS key encrypts. Every one of those layers — the policy that schedules the job, the IAM role the job assumes, the vault that holds the recovery point, the key that encrypts it, the lock that makes it immutable, the copy action that fans it cross-Region, the access policy that lets the copy land, the restore role that reads it back — is a place a real program quietly breaks. A green dashboard tells you jobs ran; it does not tell you a copy landed in the isolated account, that the lock actually hardened, or that a restore would succeed. Those are separate facts, each with its own confirming command.

By the end you will stop trusting the dashboard and start proving the recovery path. You will know exactly which two grants delegated administration takes (and why one without the other gives you a blind policy author), why a vault and role must exist before the first job runs, the one flag that separates compliance-mode Vault Lock from governance mode, the kms:CreateGrant condition that is load-bearing for cross-account restore, and how restore testing converts “we believe under four hours” into “measured 2h41m on the last 11 weekly tests.” Because this is a reference you return to mid-incident — or mid-audit — the policy keys, the lock modes, the KMS grants, the per-service restore mechanics, the limits and the failure modes are all laid out as scannable tables. Read the prose once; keep the tables open when the auditor (or the attacker) shows up.

What problem this solves

Per-account backups feel safe and fail catastrophically. The common pattern — every workload account backs itself up to a local vault, lifecycle is correct, dashboards are green — collapses the moment the account itself is the blast radius. A compromised deployment role, a poisoned CI/CD dependency, or a misconfigured trust policy gives an attacker (or a buggy automation) the same backup:DeleteRecoveryPoint and backup:DeleteBackupVault permissions your operators have. The local backups are gone in minutes, encrypted with a key the same identity can schedule for deletion, and there is no second copy anywhere an attacker cannot reach. You discover this not during a drill but during an incident, which is the worst possible time to learn that your recovery path shared a trust boundary with the thing that just got owned.

What breaks without a centralized, air-gapped program: (1) no immutability — recovery points are deletable, so ransomware deletes them; (2) no isolation — copies live in the same account/Region, so one compromise or one regional event takes both primary and backup; (3) no governance — each team configures its own backups, retention drifts, some resources are simply never protected and nobody notices until restore time; (4) no proof — RTO/RPO are slide-deck numbers, not measured facts, so the first real restore is also the first test, and it fails on a KMS grant you forgot. The financial and regulatory exposure is brutal: failing an auditor’s “prove the production copies could not have met the same fate” question can mean a finding, a fine, or a lost certification.

Who hits this: every organization past a handful of accounts. It bites hardest on regulated workloads (finance, healthcare) where a 7-year immutable retention is a legal floor, on teams that grew account-by-account without a landing zone, and on anyone who equates “backup job succeeded” with “I can recover.” The fix is not more backups — it is a recovery path that is a separate trust boundary: a delegated admin authoring policy, immutable vaults, an isolated recovery account in another Region, and restore tests that produce evidence. To frame the whole field before the deep dive, here is every control plane this article builds, the failure it prevents, and the one command that proves it works:

| Control plane | What it is | Failure it prevents | First command to prove it |
| --- | --- | --- | --- |
| Delegated backup admin | An account that authors org-wide policy | Operating consoles from the org root; blast-radius concentration | aws organizations list-delegated-administrators --service-principal backup.amazonaws.com |
| Organizations backup policy | Tag-targeted JSON attached to OUs | Per-team drift; unprotected resources | aws organizations describe-effective-policy --policy-type BACKUP_POLICY |
| Compliance-mode Vault Lock | WORM immutability until expiry | Ransomware / root deleting recovery points | aws backup describe-backup-vault --backup-vault-name central-backup-vault --query '{Locked:Locked,LockDate:LockDate}' |
| Cross-account/Region copy | copy_actions into another account/Region | Account compromise or regional event taking both copies | aws backup list-recovery-points-by-backup-vault --backup-vault-name airgap-recovery-vault |
| Dedicated CMK + key policy | Customer-managed key shareable cross-account | aws/backup key cannot be shared; restore denied | aws kms describe-key --key-id <id> --query 'KeyMetadata.MultiRegion' |
| Recovery-account SCP | Deny delete/key-deletion verbs | The exact door a compromised role walks through | aws organizations list-policies-for-target --target-id <recovery-ou-id> --filter SERVICE_CONTROL_POLICY |
| Restore testing | Scheduled real StartRestoreJob + validate | RTO/RPO asserted but never measured | aws backup list-restore-jobs --by-restore-testing-plan-arn <arn> |
| Audit Manager framework | Queryable compliance + evidence export | “Are we compliant?” answered by opinion, not data | aws backup list-frameworks |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the multi-account fundamentals: AWS Organizations with all features enabled, Organizational Units (OUs) as the attachment target for policies, Service Control Policies (SCPs) as the guardrail mechanism, and IAM roles with trust policies and cross-account assumption. You should know how to run the AWS CLI with named profiles (you will switch between the management, delegated-admin, and recovery accounts constantly), read JSON output, and reason about KMS key policies versus IAM policies. Familiarity with at least one source service’s snapshot model (EBS snapshots, RDS automated backups) helps the per-service sections land.

This sits at the top of the Reliability & DR track and assumes the governance layer beneath it. It builds directly on AWS Organizations: SCP Guardrails & Delegated Admin (the delegation and SCP mechanics this whole program rides on) and AWS Control Tower: Guardrails & Multi-Account Foundation (the landing zone that places accounts in OUs). The encryption layer is AWS KMS Deep Dive: Keys, Policies, Envelope Encryption & Rotation, and cross-Region copy specifically wants KMS Multi-Region Keys: Envelope Encryption & Key Policies. It pairs with the strategy-level AWS Backup & Disaster Recovery Strategies and the framing in High Availability vs Disaster Recovery: RTO & RPO. For the threat model that motivates immutability, see Ransomware Resilience: Immutable Backup & Isolated Recovery Environment.

A quick map of which account owns which responsibility, so you run each command from the right place and never operate from the org root by habit:

| Concern | Owning account | Why it lives there | What you must NOT do here |
| --- | --- | --- | --- |
| Trusted access, policy type, delegate registration | Management | Only the org root can enable integrations | Run day-to-day backup operations |
| Authoring/attaching backup policies | Delegated admin | Keeps the org root thin; least standing power | Hold the immutable recovery copies |
| Source backup jobs + local vaults | Each workload account | Backups run where the data is | Share a trust boundary with the air-gap copy |
| Immutable cross-Region copies | Isolated recovery | Survives a full workload-account compromise | Grant standing human access |
| KMS key administration | Each vault-owning account | Keys are account- and Region-scoped | Use the aws/backup managed key |
| Restore testing + validation | Recovery account | Exercises the air-gapped copies, not the originals | Skip it and assume the RTO |
| Compliance evidence + drift | Delegated admin | Org-wide aggregation point | Log into each account to check manually |

Core concepts

Six mental models make every later step obvious.

AWS Backup orchestrates; the data lives with the service. A backup plan is a schedule + lifecycle + target vault + selection. When a rule fires, AWS Backup assumes an IAM role in the source account and calls the source service’s snapshot API; the resulting recovery point is a pointer that lives in a backup vault and is encrypted by a KMS key. The vault is a logical container and a permission boundary — it does not “store” bytes the way S3 does; it governs who can read, copy, and delete the recovery points it references. This is why a single over-broad identity that can manage vaults is catastrophic: it controls the pointers to all your recovery data at once.

Three planes, three accounts — do not collapse them. The management account enables the integration and attaches policies to OUs; that is all it does day to day. The delegated backup admin authors policy and monitors jobs org-wide, so you never run backup consoles from the org root. The isolated recovery account receives cross-account copies into an air-gapped vault and is, ideally, in a separate OU with restrictive SCPs and no standing human access. The whole security argument is that the identity which can delete your last copy must not be the identity that just got compromised — and that requires a different account, a different Region, and an immutable lock, all at once.

Delegated administration is two grants, not one. register-delegated-administrator lets the account manage backup policies. You separately enable cross-account monitoring in the AWS Backup console (Settings) so the delegate can see jobs in member accounts. Enable both, or you get a policy author who is blind to execution — a real and common half-configured state.

The vault and role must pre-exist; AWS Backup does not create them. The vault a policy targets and the IAM role a policy references must already exist in every target account and Region before a backup job runs. Bootstrap them with a CloudFormation StackSet (service-managed permissions) targeting the OUs that receive policies, so vault + role land automatically when an account joins the OU. Forget this and describe-effective-policy renders a perfect policy while every job fails with “role/vault not found.”

Immutability is a mode, decided by one flag. Compliance-mode Vault Lock makes recovery points immutable until their lifecycle expires — no IAM principal, not the account root, not AWS, can shorten retention or delete them once the lock hardens. The mode is selected by whether you include --changeable-for-days: include it (with a grace window of ≥3 days) and you get compliance mode; omit it and you get governance mode (removable by sufficiently privileged IAM). The grace window is your only undo — once LockDate passes, the configuration is set in stone.

Restore is its own permission and KMS path. Reading a recovery point back, especially cross-account, requires the restore role to use the destination KMS key, and AWS Backup creates grants on your behalf during restore. The kms:CreateGrant permission with the kms:GrantIsForAWSResource condition is load-bearing; without it the restore fails with a “KMS key cannot be accessed” error that is maddening to debug because the recovery point is intact and the lock is fine.

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary repeats these for lookup; this is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters to recovery |
| --- | --- | --- | --- |
| Backup plan | Schedule + lifecycle + vault + selection | Account (or rendered from org policy) | The thing that runs, or silently doesn’t |
| Organizations backup policy | Tag-targeted JSON attached to an OU | Management → OU; authored in delegate | One policy protects whatever a team tags |
| Backup vault | Logical container + permission boundary | Per account/Region | Governs read/copy/delete of recovery points |
| Logically air-gapped (LAG) vault | Isolated vault type for cross-acct share/restore | Recovery account | The copy that survives a full compromise |
| Recovery point | Encrypted pointer to a snapshot | In a vault | The thing you restore (or an attacker deletes) |
| Vault Lock | WORM immutability config | On a vault | Compliance mode = even root can’t delete |
| changeable_for_days | Grace window before lock hardens | Lock config | Your only undo; min 3 days |
| copy_actions | Fan-out copy to another vault | On a rule | Cross-account/Region air-gap mechanism |
| CMK + key policy | Customer-managed key, shareable | Per vault account/Region | aws/backup can’t be shared cross-account |
| kms:CreateGrant | Permits AWS Backup to grant key use | Key policy | Load-bearing for cross-account restore |
| Restore testing plan | Scheduled real restore + teardown | Recovery account | Turns RTO/RPO into measured evidence |
| Audit Manager framework | Controls + evidence export | Delegated admin | “Compliant” becomes queryable + auditable |
| RTO / RPO | Recovery time / recovery point objective | Measured, not asserted | The numbers the business actually buys |

The account topology and where policy lives

Three planes, three accounts. Do not collapse them. The management account stays thin — it enables the integration and attaches policies; day-to-day backup operations live in the delegated admin so you are not running consoles from the org root, and the immutable copies live somewhere neither of those can reach.

| Plane | Account | Responsibility | Standing access |
| --- | --- | --- | --- |
| Policy authoring | Delegated backup admin | Owns Organizations backup policies, monitors jobs org-wide | Backup operators (least power) |
| Org control | Management account | Enables trusted access, registers the delegate, attaches policies to OUs | Org admins only; no daily ops |
| Recovery (air-gap) | Isolated recovery account | Receives cross-account copies into an air-gapped vault; separate OU, restrictive SCPs | None (break-glass only) |

The enablement itself is three calls:

# Run in the MANAGEMENT account.
# 1) Trusted access so AWS Backup can act across the org.
aws organizations enable-aws-service-access \
  --service-principal backup.amazonaws.com

# 2) Turn on the BACKUP_POLICY policy type (idempotent if already enabled).
ROOT_ID=$(aws organizations list-roots --query 'Roots[0].Id' --output text)
aws organizations enable-policy-type \
  --root-id "$ROOT_ID" \
  --policy-type BACKUP_POLICY

# 3) Register the backup admin account as delegated administrator.
aws organizations register-delegated-administrator \
  --account-id 222222222222 \
  --service-principal backup.amazonaws.com

The same enablement expressed as Terraform, so the org integration is reviewable in a PR rather than clicked once and forgotten:

resource "aws_organizations_organization" "this" {
  aws_service_access_principals = ["backup.amazonaws.com"]
  enabled_policy_types          = ["BACKUP_POLICY", "SERVICE_CONTROL_POLICY"]
  feature_set                   = "ALL"
}

resource "aws_backup_global_settings" "delegated" {
  global_settings = { "isCrossAccountBackupEnabled" = "true" }
}

Delegated administration in AWS Backup is two grants, not one. register-delegated-administrator lets the account manage backup policies; you separately enable cross-account monitoring in the AWS Backup console (Settings) so the delegate can see jobs in member accounts. Enable both, or you get a policy author who is blind to execution.

One prerequisite that bites everyone: the vault and the IAM role a policy references must already exist in every target account and Region before a backup job runs — AWS Backup does not create them. Bootstrap them with a CloudFormation StackSet (service-managed permissions) targeting the OUs that receive policies, so vault + role land automatically when an account joins.

The full enablement checklist, in dependency order — each row gates the next, and skipping one produces a specific, confusing symptom:

| # | Step | Account | Command / setting | Symptom if skipped |
| --- | --- | --- | --- | --- |
| 1 | Org all-features enabled | Management | describe-organization → FeatureSet: ALL | Policies cannot be enabled at all |
| 2 | Trusted access for Backup | Management | enable-aws-service-access | Org policies have no effect |
| 3 | BACKUP_POLICY policy type on root | Management | enable-policy-type | create-policy --type BACKUP_POLICY errors |
| 4 | Register delegated admin | Management | register-delegated-administrator | Delegate cannot author org policies |
| 5 | Cross-account backup enabled | Management/delegate | isCrossAccountBackupEnabled=true | Cross-account copies are rejected |
| 6 | Cross-account monitoring enabled | Delegate (console Settings) | toggle in AWS Backup → Settings | Delegate is blind to member-account jobs |
| 7 | Vault + role StackSet to OUs | Management → OUs | service-managed StackSet | Jobs fail “role/vault not found” |
| 8 | Attach backup policy to prod OU | Delegate / management | attach-policy | Nothing is protected |
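
Because each row gates the next, a short script can report the first unmet prerequisite along with the symptom to expect. The sketch below is illustrative only: the step keys are hypothetical labels, and the booleans are assumed to be filled in by hand or from the commands in the checklist.

```python
# Ordered enablement steps and the symptom each one produces when skipped.
# The keys are hypothetical labels; the symptoms mirror the checklist.
CHECKLIST = [
    ("org_all_features", "Policies cannot be enabled at all"),
    ("trusted_access", "Org policies have no effect"),
    ("backup_policy_type", "create-policy --type BACKUP_POLICY errors"),
    ("delegated_admin", "Delegate cannot author org policies"),
    ("cross_account_backup", "Cross-account copies are rejected"),
    ("cross_account_monitoring", "Delegate is blind to member-account jobs"),
    ("vault_and_role_stackset", "Jobs fail: role/vault not found"),
    ("policy_attached", "Nothing is protected"),
]


def first_unmet_step(state):
    """Return (step_number, key, symptom) for the first unmet step, or None."""
    for number, (key, symptom) in enumerate(CHECKLIST, start=1):
        if not state.get(key, False):
            return number, key, symptom
    return None


if __name__ == "__main__":
    observed = {key: True for key, _ in CHECKLIST}
    observed["cross_account_monitoring"] = False  # the common half-configured state
    print(first_unmet_step(observed))
```

Stopping at the first failure is deliberate: later rows cannot be trusted until the earlier gate is fixed.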

The two cross-account toggles confuse everyone because they sound identical. Keep them straight:

| Setting | What it enables | Where you set it | Without it |
| --- | --- | --- | --- |
| isCrossAccountBackupEnabled (backup copy) | Recovery points may be copied across accounts | Global settings (management/delegate) | copy_actions cross-account fails |
| Cross-account monitoring | The delegate can view jobs in member accounts | AWS Backup console → Settings | A blind policy author; jobs invisible org-wide |

Defining backup policies with tag-based selection

An Organizations backup policy is JSON with three required keys per plan — regions, rules, selections — using declarative-policy inheritance operators (@@assign sets a value; child policies merge with or override parents). You target resources by tag, not ARN, so the same policy protects whatever a team launches as long as they tag it. This is the single most important design choice in the whole program: tag-targeting means coverage scales automatically as teams ship, instead of someone remembering to add each new ARN to a selection.

Save the following as tier1-backup-policy.json. It runs daily, writes to the local vault, and fans out a copy to the recovery account in another Region. copy_actions is keyed by the destination vault ARN; $account / $region are placeholders AWS resolves per member account.

{
  "plans": {
    "Tier1_Daily": {
      "regions": { "@@assign": ["ap-south-1", "us-east-1"] },
      "rules": {
        "DailySnapshot": {
          "schedule_expression": { "@@assign": "cron(0 18 ? * * *)" },
          "start_backup_window_minutes": { "@@assign": "60" },
          "complete_backup_window_minutes": { "@@assign": "10080" },
          "target_backup_vault_name": { "@@assign": "central-backup-vault" },
          "lifecycle": {
            "move_to_cold_storage_after_days": { "@@assign": "30" },
            "delete_after_days": { "@@assign": "365" }
          },
          "copy_actions": {
            "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault": {
              "target_backup_vault_arn": {
                "@@assign": "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
              },
              "lifecycle": {
                "move_to_cold_storage_after_days": { "@@assign": "30" },
                "delete_after_days": { "@@assign": "2555" }
              }
            }
          }
        }
      },
      "selections": {
        "tags": {
          "tier1": {
            "iam_role_arn": { "@@assign": "arn:aws:iam::$account:role/AWSBackupCentralRole" },
            "tag_key": { "@@assign": "backup-tier" },
            "tag_value": { "@@assign": ["tier1"] }
          }
        }
      }
    }
  }
}

Create the policy in the delegated admin account and attach it to the OU that holds your production workloads:

POLICY_ID=$(aws organizations create-policy \
  --name "Tier1-Daily-Backup" \
  --type BACKUP_POLICY \
  --content file://tier1-backup-policy.json \
  --query 'Policy.PolicySummary.Id' --output text)

aws organizations attach-policy \
  --policy-id "$POLICY_ID" \
  --target-id ou-abcd-prodaccts   # the production OU

The same plan expressed as a per-account Terraform resource (useful for accounts outside the org policy, or to model what the org policy renders into):

resource "aws_backup_plan" "tier1" {
  name = "Tier1_Daily"
  rule {
    rule_name         = "DailySnapshot"
    target_vault_name = aws_backup_vault.central.name
    schedule          = "cron(0 18 ? * * *)"
    start_window      = 60
    completion_window = 10080
    lifecycle {
      cold_storage_after = 30
      delete_after       = 365
    }
    copy_action {
      destination_vault_arn = "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
      lifecycle {
        cold_storage_after = 30
        delete_after       = 2555
      }
    }
  }
}

Every policy key, end to end

The policy schema is small but unforgiving — a wrong key name silently no-ops, and lifecycle math has hard floors. Here is every field that matters, what it does, and the gotcha:

| Key | Scope | Values / type | Default | Gotcha / limit |
| --- | --- | --- | --- | --- |
| regions | plan | list of Region IDs | required | Policy only acts in listed Regions |
| schedule_expression | rule | cron(...) UTC | required | Cron is UTC; ? for day-of-week OR day-of-month |
| start_backup_window_minutes | rule | minutes | 60 (8h for some) | Job is cancelled if it can’t start in the window |
| complete_backup_window_minutes | rule | minutes | derived | Must exceed start window; long backups need headroom |
| target_backup_vault_name | rule | vault name (must pre-exist) | required | Name only; the account is implied (the source acct) |
| lifecycle.move_to_cold_storage_after_days | rule/copy | days | none | Cold storage is min 90-day commit; not all services |
| lifecycle.delete_after_days | rule/copy | days | none | Must be ≥ cold_storage_after + 90 |
| enable_continuous_backup | rule | true/false | false | Enables PITR; hard 35-day retention ceiling |
| copy_actions.<vaultArn> | rule | keyed by dest vault ARN | none | Independent lifecycle from the local rule |
| copy_actions.*.target_backup_vault_arn | copy | full ARN (acct+Region) | required | This is what makes it cross-account/Region |
| selections.tags.<name>.tag_key | selection | string | required | Case-sensitive; must match the resource tag exactly |
| selections.tags.<name>.tag_value | selection | list of strings | required | Any value in the list matches (OR) |
| selections.tags.<name>.iam_role_arn | selection | role ARN with $account | required | Role must exist in every target account |
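
The lifecycle floor and the window ordering are easy to lint before a policy is ever attached. The following is a minimal sketch, assuming the rule is the dict found under plans.<plan>.rules.<rule> with every value wrapped in "@@assign" as in tier1-backup-policy.json; validate_rule is a hypothetical helper, not part of any AWS SDK.

```python
def validate_rule(rule):
    """Return a list of problems found in one rule of an Organizations backup policy."""
    problems = []

    def assigned(node, key):
        # Policy values are strings wrapped as {"@@assign": "..."}.
        wrapper = node.get(key)
        return None if wrapper is None else int(wrapper["@@assign"])

    def check_lifecycle(lifecycle, where):
        cold = assigned(lifecycle, "move_to_cold_storage_after_days")
        delete = assigned(lifecycle, "delete_after_days")
        # Cold storage carries a 90-day minimum, so delete must be >= cold + 90.
        if cold is not None and delete is not None and delete < cold + 90:
            problems.append(f"{where}: delete_after_days {delete} < {cold} + 90")

    start = assigned(rule, "start_backup_window_minutes")
    complete = assigned(rule, "complete_backup_window_minutes")
    if start is not None and complete is not None and complete <= start:
        problems.append("complete window must exceed start window")

    check_lifecycle(rule.get("lifecycle", {}), "rule")
    # Each copy action has its own lifecycle, independent of the local rule.
    for arn, copy in rule.get("copy_actions", {}).items():
        check_lifecycle(copy.get("lifecycle", {}), f"copy {arn}")
    return problems
```

Run against the DailySnapshot rule above, this returns an empty list: 365 and 2555 both clear the 30 + 90 floor, and 10080 exceeds 60.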

The inheritance operators are the part people get wrong when policies stack on nested OUs. What each does:

| Operator | Effect | Use when | Trap |
| --- | --- | --- | --- |
| @@assign | Set the value (replace) | Setting a concrete value | A child @@assign overrides the parent’s |
| @@append | Add to a list (merge) | Adding Regions/tags without losing parents | Only valid on list-typed values |
| @@remove | Remove from a merged list | Excluding an inherited value | Order of evaluation matters on deep OU trees |
| (none) child key | Merge child into parent map | Adding a new rule alongside inherited ones | Duplicate rule names collide |
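
To build intuition for how the three operators combine on a list-typed value such as regions, here is a deliberately simplified model. It is not AWS's evaluator: merge_list_value is a hypothetical helper, it handles a single parent/child pair only, and the output of describe-effective-policy remains the source of truth.

```python
def merge_list_value(parent, child):
    """Combine an inherited list with a child node that uses inheritance operators.

    Simplified model: `parent` is the list inherited so far; `child` is a dict
    holding any of "@@assign", "@@append", "@@remove".
    """
    if "@@assign" in child:
        # @@assign replaces whatever was inherited.
        result = list(child["@@assign"])
    else:
        result = list(parent)
    # @@append adds values the list does not already contain.
    for value in child.get("@@append", []):
        if value not in result:
            result.append(value)
    # @@remove drops values from the merged list.
    removed = set(child.get("@@remove", []))
    return [value for value in result if value not in removed]


if __name__ == "__main__":
    inherited = ["ap-south-1", "us-east-1"]
    print(merge_list_value(inherited, {"@@append": ["eu-west-1"]}))
    print(merge_list_value(inherited, {"@@remove": ["us-east-1"]}))
    print(merge_list_value(inherited, {"@@assign": ["us-west-2"]}))
```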

A few rules that keep this correct:

Schedule-and-window sizing trips up long-running backups; the relationship between the three time fields:

| Field | Meaning | Too small → | Sensible value |
| --- | --- | --- | --- |
| schedule_expression | When the job is eligible to start | n/a | Off-peak, UTC (e.g. cron(0 18 ? * * *)) |
| start_backup_window_minutes | Grace to begin before cancel | Job cancelled under contention | 60 (raise to 480 for busy accounts) |
| complete_backup_window_minutes | Total time allowed to finish | Large RDS/EFS jobs killed mid-run | 10080 (7 days) for big datasets |

Vault Lock in compliance mode (WORM immutability)

A vault you can delete is a vault an attacker can delete. Compliance-mode Vault Lock makes recovery points immutable until their lifecycle expires — no IAM principal, not the account root, not AWS, can shorten retention or delete the recovery point once the lock hardens. This is the control that turns “we have backups” into “the backups will still be there after the breach.”

The mode is decided by one flag. Including --changeable-for-days selects compliance mode; omitting it gives you governance mode (removable by sufficiently privileged IAM). The value is the cooling-off grace time: during it you can still adjust or delete the lock to fix mistakes. Minimum is 3 days (72 hours).

# Run in EACH account/Region that owns a vault (member accounts + recovery account).
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name central-backup-vault \
  --changeable-for-days 3 \
  --min-retention-days 30 \
  --max-retention-days 2555

The same lock expressed as Terraform:

resource "aws_backup_vault_lock_configuration" "central" {
  backup_vault_name   = aws_backup_vault.central.name
  changeable_for_days = 3      # presence => COMPLIANCE mode; omit => governance
  min_retention_days  = 30
  max_retention_days  = 2555
}

Treat the 3-day grace window as your only undo. Once LockDate passes, the configuration is set in stone. The classic foot-gun: a recovery point with retention set to “Always” inside a locked compliance vault is then un-deletable forever and bills forever. Never combine indefinite retention with compliance Vault Lock.

Verify the lock actually hardened — Locked: true plus a past LockDate is the only state that matters:

aws backup describe-backup-vault \
  --backup-vault-name central-backup-vault \
  --query '{Locked:Locked, LockDate:LockDate, Min:MinRetentionDays, Max:MaxRetentionDays}'

Apply the same lock to airgap-recovery-vault in the recovery account. That vault is the one that actually has to survive a credential compromise in the workload accounts.

Compliance vs governance, decided correctly

The two modes look similar in the API and are wildly different in consequence. Side by side:

| Dimension | Governance mode | Compliance mode |
| --- | --- | --- |
| Selected by | Omit --changeable-for-days | Include --changeable-for-days (≥3) |
| Who can delete RPs early | Sufficiently privileged IAM (with backup:* overrides) | Nobody: not IAM, not root, not AWS |
| Lock removable after hardening | Yes (by privileged IAM) | No; permanent until RP expires |
| Grace window | n/a | 3+ days, then LockDate is final |
| Good for | Operational guardrail, accidental-delete prevention | Ransomware survivability, regulatory WORM |
| Foot-gun | Privileged role can still purge | “Always”-retention RP bills forever |
| Reversal cost | Change the config any time | The entire account must be closed to escape |

The lock-configuration parameters and their bounds:

| Parameter | Meaning | Min | Max | Gotcha |
| --- | --- | --- | --- | --- |
| min_retention_days | Floor on every RP’s retention | 1 | | RP retention shorter than this is rejected |
| max_retention_days | Ceiling on every RP’s retention | | 36500 | RP retention longer than this is rejected; blocks “Always” |
| changeable_for_days | Grace before hardening (compliance) | 3 | 36500 | Presence is what selects compliance mode |
| LockDate | When the lock becomes immutable | | | Past LockDate + Locked:true = done |
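
The retention bounds behave like a simple range check on each recovery point. A minimal sketch of that check, assuming retention_days=None is how you model indefinite ("Always") retention; retention_allowed is a hypothetical helper for reasoning about the rules, not an AWS API.

```python
def retention_allowed(retention_days, min_retention_days=None, max_retention_days=None):
    """Return True if a recovery point's retention fits the vault lock bounds.

    retention_days=None models indefinite ("Always") retention.
    """
    if retention_days is None:
        # Indefinite retention only fits when no ceiling is configured.
        return max_retention_days is None
    if min_retention_days is not None and retention_days < min_retention_days:
        return False
    if max_retention_days is not None and retention_days > max_retention_days:
        return False
    return True


if __name__ == "__main__":
    # Bounds from the lock configured above: min 30, max 2555.
    for days in (7, 365, 3650, None):
        print(days, retention_allowed(days, 30, 2555))
```

Setting max_retention_days is what closes the "Always" foot-gun: with a ceiling in place, an indefinite-retention recovery point no longer fits the bounds.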

The lock-state truth table — only one row means “you are actually protected”:

| Locked | LockDate | Mode | Meaning | Action |
| --- | --- | --- | --- | --- |
| false | absent | none | No lock at all | Configure a lock |
| false | future | compliance (cooling) | In grace window; still mutable | Wait, or fix now while you still can |
| true | future | compliance (cooling) | Grace running; not yet immutable | Verify config before LockDate |
| true | past | compliance | Immutable; the only safe state | Done; verify periodically |
| true | n/a | governance | Locked but IAM-removable | Acceptable only for non-ransomware use |
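
The truth table can be turned into a small classifier for periodic checks. This sketch assumes the vault description is a dict shaped like the describe-backup-vault response, with LockDate as a timezone-aware datetime (as boto3 returns it); lock_state and is_protected are hypothetical helper names.

```python
from datetime import datetime, timezone


def lock_state(vault, now=None):
    """Classify a vault description into the modes of the lock-state table."""
    now = now or datetime.now(timezone.utc)
    locked = bool(vault.get("Locked", False))
    lock_date = vault.get("LockDate")
    if lock_date is None:
        # No LockDate: either no lock at all, or a governance-mode lock.
        return "governance" if locked else "none"
    if lock_date > now:
        # Still inside the grace window; the configuration can be changed.
        return "compliance (cooling)"
    # A past LockDate with Locked false is not a row in the table.
    return "compliance" if locked else "unknown"


def is_protected(vault, now=None):
    """Only a hardened compliance lock counts as protected."""
    return lock_state(vault, now) == "compliance"
```

Passing now explicitly keeps the function deterministic in tests; in a scheduled check you would omit it.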

Cross-account, cross-Region copy into an air-gapped account

The copy_actions block in the policy already routes copies to 444444444444 in us-west-2. For that copy to land, two backstops must be in place — and when copies silently fail to appear, it is almost always one of these two.

(a) Vault access policy on the destination — a resource policy that permits the source org/account to copy in:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOrgCopyInto",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "backup:CopyIntoBackupVault",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:PrincipalOrgID": "o-exampleorgid" }
      }
    }
  ]
}

Save it as airgap-vault-policy.json and apply it in the recovery account:

aws backup put-backup-vault-access-policy \
  --backup-vault-name airgap-recovery-vault \
  --policy file://airgap-vault-policy.json   # run in account 444444444444
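
Before applying a vault access policy, it can be worth confirming that the copy permission really is scoped to your organization. A minimal sketch, assuming the policy has been loaded into a dict (for example with json.load); org_scoped_copy_allowed is a hypothetical helper and it only inspects StringEquals conditions.

```python
def org_scoped_copy_allowed(policy, org_id):
    """True if some Allow statement grants backup:CopyIntoBackupVault
    conditioned on aws:PrincipalOrgID equal to org_id."""
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if stmt.get("Effect") != "Allow" or "backup:CopyIntoBackupVault" not in actions:
            continue
        condition = stmt.get("Condition", {}).get("StringEquals", {})
        if condition.get("aws:PrincipalOrgID") == org_id:
            return True
    return False


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOrgCopyInto",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "backup:CopyIntoBackupVault",
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}},
        }],
    }
    print(org_scoped_copy_allowed(policy, "o-exampleorgid"))
```

A False result for your real org ID points at gate 3 in the copy checklist: a policy exists, but it names a different organization.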

(b) Logically air-gapped vault, not a standard vault. A logically air-gapped (LAG) vault is a distinct AWS Backup vault type supporting direct sharing for cross-account/cross-Region restore without you hand-rolling key policy on the restore side. Target it on a rule via target_logically_air_gapped_backup_vault_arn. As of 2025 these vaults support customer-managed KMS keys and can receive primary backups directly. Pair that isolation with restrictive SCPs on the recovery OU (deny backup:DeleteRecoveryPoint, backup:DeleteBackupVault, kms:ScheduleKeyDeletion) and no standing human role, and the copy survives even if every workload account is owned.

Standard vault vs logically air-gapped vault

Choosing the wrong vault type for the recovery target is a quiet design error — a standard vault works but forces you to hand-roll restore-side key policy and cross-account sharing. The comparison:

| Dimension | Standard backup vault | Logically air-gapped (LAG) vault |
| --- | --- | --- |
| Primary purpose | Local recovery points | Isolated copy/recovery target |
| Cross-account restore sharing | You hand-roll key policy + RAM | Built-in direct sharing |
| KMS | Your CMK (or aws/backup) | Customer-managed KMS (as of 2025) |
| Receives primary backups | Yes | Yes (2025+) |
| Vault Lock | Compliance/governance | Compliance/governance |
| Best role | Source-account local vault | The air-gap recovery copy |
| Indexing/search | Standard | Enhanced recovery-point search |

The copy path has multiple independent gates; when a copy does not appear, walk them in this order:

| # | Gate | Check (run where) | Failure symptom |
| --- | --- | --- | --- |
| 1 | Cross-account backup enabled | Global settings (mgmt) isCrossAccountBackupEnabled=true | Copy job rejected immediately |
| 2 | Dest vault access policy | get-backup-vault-access-policy (recovery acct) | “Access denied” on copy |
| 3 | PrincipalOrgID matches | Compare org ID in policy vs describe-organization | Copy denied despite a policy |
| 4 | Recovery SCP not over-broad | list-policies-for-target --filter SERVICE_CONTROL_POLICY (mgmt, targeting the recovery acct) | SCP blocks CopyIntoBackupVault |
| 5 | Dest KMS key usable by Backup | Dest key policy has backup.amazonaws.com use | Copy fails with KMS error |
| 6 | Destination Region matches | ARN Region in target_backup_vault_arn | Copy goes nowhere / errors |
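
Gate 6 comes down to reading the destination ARN correctly. The sketch below parses a backup vault ARN and reports whether a copy actually leaves the source account and Region; parse_vault_arn and copy_isolation are hypothetical helpers, and the source account 111111111111 in the example is a placeholder.

```python
def parse_vault_arn(arn):
    """Split arn:<partition>:backup:<region>:<account>:backup-vault:<name>."""
    parts = arn.split(":", 6)
    if len(parts) != 7 or parts[2] != "backup" or parts[5] != "backup-vault":
        raise ValueError(f"not a backup vault ARN: {arn}")
    return {"region": parts[3], "account": parts[4], "name": parts[6]}


def copy_isolation(source_account, source_region, dest_vault_arn):
    """Report whether a copy destination crosses the account and Region boundary."""
    dest = parse_vault_arn(dest_vault_arn)
    return {
        "cross_account": dest["account"] != source_account,
        "cross_region": dest["region"] != source_region,
    }


if __name__ == "__main__":
    arn = "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
    # 111111111111 is a placeholder workload account in ap-south-1.
    print(copy_isolation("111111111111", "ap-south-1", arn))
```

An air-gap copy should report True for both fields; anything else means the copy shares an account or a Region with the data it protects.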

The copy lifecycle is independent of the local lifecycle — a deliberate, important asymmetry:

| Tier | Local vault | Air-gap copy | Rationale |
| --- | --- | --- | --- |
| Hot (restore speed) | 30 days warm | 30 days warm | Fast operational restores |
| Cold (cost) | move to cold @30d | move to cold @30d | Cut storage cost on aging RPs |
| Long-term (compliance) | delete @365d | delete @2555d (7yr) | Air-gap carries the regulatory floor |
| Mode | governance OK | compliance lock | The copy must survive a compromise |

KMS keys and key policies for cross-account restore

Encrypt the central and recovery vaults with dedicated customer-managed keys — never the AWS-managed aws/backup key, which you cannot share cross-account. The restore side needs to use the key; encode that in the key policy in the recovery account. This section is where most “the backup is fine but the restore fails” incidents are born, so the key policy below is worth reading line by line.

{
  "Version": "2012-10-17",
  "Id": "airgap-backup-key",
  "Statement": [
    {
      "Sid": "KeyAdmins",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444444444444:role/KeyAdmin" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowBackupServiceUse",
      "Effect": "Allow",
      "Principal": { "Service": "backup.amazonaws.com" },
      "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey", "kms:CreateGrant"],
      "Resource": "*"
    },
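    {
      "Sid": "AllowRestoreRoleDecrypt",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444444444444:role/RecoveryRestoreRole" },
      "Action": ["kms:Decrypt", "kms:DescribeKey"],
      "Resource": "*"
    },
    {
      "Sid": "AllowRestoreRoleGrantForAWSResource",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444444444444:role/RecoveryRestoreRole" },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "Bool": { "kms:GrantIsForAWSResource": "true" }
      }
    }
  ]
}

The kms:CreateGrant with the GrantIsForAWSResource condition is load-bearing: AWS Backup creates grants on your behalf during restore, and without it the restore fails with a “KMS key cannot be accessed” error that is maddening to debug. The condition sits on its own statement because kms:GrantIsForAWSResource is present only in grant requests; attached to kms:Decrypt or kms:DescribeKey, a Bool test on it never matches, so those actions would silently not be allowed. When copying across Regions, a multi-Region KMS key avoids re-wrapping data keys on every hop.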

The KMS actions the recovery path needs, and which principal needs each

Every KMS Action here is required by a specific actor at a specific moment. Map them so you can reason about a deny precisely:

| Action | Who needs it | When | Symptom if missing |
| --- | --- | --- | --- |
| kms:GenerateDataKey | backup.amazonaws.com | Encrypting a new RP/copy | Backup/copy job fails with KMS error |
| kms:Decrypt | Backup service + restore role | Reading an RP back | Restore fails “cannot be accessed” |
| kms:DescribeKey | Backup service + restore role | Resolving key metadata | Job can’t validate the key |
| kms:CreateGrant | Backup service + restore role | Backup grants key use mid-restore | Restore fails (the classic bug) |
| kms:ReEncrypt* | Backup service (cross-Region, single-Region key) | Re-wrapping data keys per hop | Cross-Region copy slow/failing |
| kms:* (admin) | KeyAdmin role only | Key lifecycle/rotation | Cannot manage the key |
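A table like this can be turned into a quick offline check of a key policy document. The sketch below is deliberately simple and rests on stated assumptions: it matches principals by exact string, honours `*` wildcards in actions, and ignores Condition blocks entirely, so it is not a substitute for real IAM evaluation. The helper names and the sample policy are made up for illustration.

```python
import fnmatch


def _as_list(value):
    return value if isinstance(value, list) else [value]


def granted_patterns(policy, principal):
    """Collect Action patterns from Allow statements that name `principal` exactly."""
    patterns = []
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        principal_block = stmt.get("Principal", {})
        if not isinstance(principal_block, dict):
            continue  # e.g. "Principal": "*" is out of scope for this sketch
        named = []
        for value in principal_block.values():
            named.extend(_as_list(value))
        if principal in named:
            patterns.extend(_as_list(stmt.get("Action", [])))
    return patterns


def missing_actions(policy, principal, required):
    """Return the required actions that no Allow statement covers (conditions ignored)."""
    patterns = [p.lower() for p in granted_patterns(policy, principal)]
    return sorted(
        action for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    )


RESTORE_ROLE = "arn:aws:iam::444444444444:role/RecoveryRestoreRole"
sample_policy = {
    "Statement": [
        {"Effect": "Allow", "Principal": {"Service": "backup.amazonaws.com"},
         "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey", "kms:CreateGrant"]},
        # Hypothetical mistake: the restore role was never given kms:CreateGrant.
        {"Effect": "Allow", "Principal": {"AWS": RESTORE_ROLE},
         "Action": ["kms:Decrypt", "kms:DescribeKey"]},
    ]
}

print(missing_actions(sample_policy, RESTORE_ROLE,
                      {"kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant"}))
```

Run against the sample, the check reports the missing grant permission for the restore role, which is the gap the table calls the classic bug.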

Why the GrantIsForAWSResource condition matters and how to scope it safely:

| Aspect | Detail |
| --- | --- |
| What it does | Restricts CreateGrant to grants AWS services create on your behalf |
| Why it’s needed | AWS Backup creates a transient grant to decrypt during restore |
| Risk without the condition | A broad CreateGrant lets a principal grant key use arbitrarily |
| Risk with the condition | Minimal: only AWS-resource-scoped grants are permitted |
| Confirm a denial | Restore job StatusMessage/AbortReason cites KMS access |

Single-Region vs multi-Region key for cross-Region copy — the choice changes the per-hop cost and the failure surface:

| Factor | Single-Region CMK per Region | Multi-Region key (MRK) |
| --- | --- | --- |
| Cross-Region copy mechanics | Re-encrypt data key on each hop (ReEncrypt*) | Same key material; no re-wrap |
| Key policy management | Two policies to keep in sync | One logical key, replicated |
| Failure surface | More moving parts per hop | Fewer; simpler restore |
| Cost | Two keys, re-encrypt ops | Replica key per Region |
| When to choose | Strict per-Region key isolation required | Cross-Region copy/restore at scale |

The dedicated CMK vs the managed key — the single non-negotiable rule:

| Key | Shareable cross-account | Custom key policy | Use for vaults? |
| --- | --- | --- | --- |
| aws/backup (AWS-managed) | No | No | Never for the recovery program |
| Dedicated CMK (customer-managed) | Yes | Yes | Always: central and air-gap vaults |

Per-service recovery: continuous vs snapshot

Restore mechanics differ by service. Match the protection mode to the workload’s RPO — continuous/PITR for low-RPO transactional data, snapshot for everything else and for long retention.

| Service | PITR / continuous? | Restore granularity | Recovery notes |
| --- | --- | --- | --- |
| RDS / Aurora | Yes (continuous) | Any second within 35 days | enable_continuous_backup: true; else snapshot → new instance/cluster |
| DynamoDB | Yes (PITR) | Second-level within 35 days | AWS Backup also does full-table snapshots for longer retention |
| S3 | Yes (continuous) | Point-in-time + item-level | Continuous backup enables object/version-level restore |
| EFS | Snapshot | Full or item-level files | Item-level file restore from recovery points |
| EBS | Snapshot | New volume | Restores as a new volume; EC2/AMI restore blocked if AMI disabled |
| FSx | Snapshot | Full file system | Per-file-system recovery point |
| EC2 (AMI) | Snapshot | New instance | Depends on the underlying AMI being enabled |
| VMware / on-prem | Snapshot | Full VM | Via AWS Backup gateway |

To enable point-in-time recovery in a policy, set it on the rule:

"rules": {
  "ContinuousTier1": {
    "schedule_expression": { "@@assign": "cron(0 */1 ? * * *)" },
    "enable_continuous_backup": { "@@assign": "true" },
    "target_backup_vault_name": { "@@assign": "central-backup-vault" },
    "lifecycle": { "delete_after_days": { "@@assign": "35" } }
  }
}

PITR-enabled (continuous) recovery points have a hard 35-day retention ceiling. They are your low-RPO tier; the daily snapshot rule with long air-gapped retention is your long-term and ransomware tier. Run both rules in the same plan.
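Because the 35-day ceiling is a hard limit, it is worth linting policy rules for it before attaching them. The following is a minimal sketch that reads the `@@assign` leaves of a rules block like the one above. The rule names other than ContinuousTier1 are invented for the example, and the checker is not an AWS-provided tool.

```python
CONTINUOUS_MAX_DAYS = 35  # retention ceiling for continuous (PITR) recovery points


def assigned(node, default=None):
    """Read the value of an '@@assign' leaf in backup-policy syntax."""
    return node.get("@@assign", default) if isinstance(node, dict) else default


def continuous_rule_violations(rules):
    """Return names of continuous rules whose retention is missing or exceeds the ceiling."""
    bad = []
    for name, rule in rules.items():
        continuous = assigned(rule.get("enable_continuous_backup"), "false") == "true"
        delete_after = assigned(rule.get("lifecycle", {}).get("delete_after_days"))
        if continuous and (delete_after is None or int(delete_after) > CONTINUOUS_MAX_DAYS):
            bad.append(name)
    return bad


rules = {
    "ContinuousTier1": {
        "enable_continuous_backup": {"@@assign": "true"},
        "lifecycle": {"delete_after_days": {"@@assign": "35"}},
    },
    "ContinuousTooLong": {  # hypothetical misconfiguration
        "enable_continuous_backup": {"@@assign": "true"},
        "lifecycle": {"delete_after_days": {"@@assign": "90"}},
    },
    "DailySnapshot": {  # snapshot rule: long retention is fine
        "lifecycle": {"delete_after_days": {"@@assign": "2555"}},
    },
}

print(continuous_rule_violations(rules))  # ['ContinuousTooLong']
```

The snapshot rule passes untouched, which matches the advice to carry long retention on a separate rule in the same plan.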

One trap worth flagging: EC2/EBS restores fail if the underlying AMI has been disabled — the recovery point is intact and the lock holds, but the resource is not restorable. Block ec2:DisableImage via SCP on production OUs so an attacker cannot soft-brick your restore path without deleting anything.

Matching RPO to the protection mode

The protection mode you pick is your RPO floor — you cannot restore to a point finer than your backups capture. The decision grid:

| If the workload needs… | Use | RPO achieved | Retention ceiling | Cost shape |
| --- | --- | --- | --- | --- |
| Sub-minute recovery (transactional DB) | Continuous/PITR (RDS, DynamoDB, S3) | seconds | 35 days | Higher (continuous capture) |
| Hourly recovery, long retention | Snapshot every hour | ~1 hour | up to 100 yr | Per-snapshot storage |
| Daily, compliance/ransomware | Daily snapshot + air-gap copy | ~24 hours | regulatory floor | Cold storage cheapest |
| Both low-RPO and long retention | Continuous and daily snapshot in one plan | seconds + 24h | 35d + long | Two tiers, two costs |
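The grid can be read as a small decision function. This sketch encodes one possible reading of it; the one-hour and one-day thresholds and the returned labels are assumptions made for illustration, not values AWS Backup defines.

```python
CONTINUOUS_MAX_DAYS = 35
HOUR = 3600
DAY = 24 * HOUR


def protection_rules(rpo_seconds, retention_days):
    """Suggest which rule types a plan needs for a target RPO and retention."""
    rules = []
    if rpo_seconds < HOUR:
        # Finer than hourly: only continuous capture can meet it.
        rules.append("continuous")
        if retention_days > CONTINUOUS_MAX_DAYS:
            # Continuous alone cannot hold data past 35 days.
            rules.append("daily_snapshot")
    elif rpo_seconds < DAY:
        rules.append("hourly_snapshot")
    else:
        rules.append("daily_snapshot")
    return rules


print(protection_rules(30, 2555))     # ['continuous', 'daily_snapshot']
print(protection_rules(3600, 365))    # ['hourly_snapshot']
print(protection_rules(86400, 2555))  # ['daily_snapshot']
```

The first call is the last row of the grid: a ledger that needs seconds of RPO and seven years of retention ends up with both rules in one plan.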

Per-service restore gotchas that are not obvious until restore time:

| Service | Restore gotcha | Pre-empt it by |
| --- | --- | --- |
| EC2 / EBS | Fails if source AMI is disabled | SCP deny ec2:DisableImage on prod OUs |
| RDS | Restores to a new instance (new endpoint) | Plan DNS/connection-string cutover |
| Aurora | New cluster; param/option groups must exist | Pre-create groups in the recovery account |
| DynamoDB | New table name on restore | Automate rename/swap in the runbook |
| S3 | Item-level restore needs continuous backup enabled | Enable continuous, not just snapshots |
| EFS | Item-level restore lands in a new directory | Account for the restore path in validation |
| FSx | Restore creates a new file system | Re-point clients post-restore |

Restore testing: prove RTO/RPO, do not assert it

An untested backup is a hypothesis. AWS Backup restore testing runs real StartRestoreJob operations on a schedule, measures completion time, optionally validates integrity, then tears down the restored resource. Build it in the recovery account so you exercise the air-gapped copies — testing the local copies proves nothing about the path that has to survive a compromise.

It is two API calls. First the plan (cadence + which recovery points are eligible):

aws backup create-restore-testing-plan --restore-testing-plan '{
  "RestoreTestingPlanName": "weekly_tier1_dr",
  "ScheduleExpression": "cron(0 7 ? * MON *)",
  "StartWindowHours": 4,
  "RecoveryPointSelection": {
    "Algorithm": "LATEST_WITHIN_WINDOW",
    "RecoveryPointTypes": ["SNAPSHOT"],
    "SelectionWindowDays": 7,
    "IncludeVaults": ["arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"]
  }
}'

Then a selection per resource type (here, RDS), with a validation window so the resource survives long enough to integrity-check before cleanup:

aws backup create-restore-testing-selection \
  --restore-testing-plan-name weekly_tier1_dr \
  --restore-testing-selection '{
    "RestoreTestingSelectionName": "rds_tier1",
    "ProtectedResourceType": "RDS",
    "IamRoleArn": "arn:aws:iam::444444444444:role/RecoveryRestoreRole",
    "ValidationWindowHours": 4,
    "ProtectedResourceArns": ["*"],
    "ProtectedResourceConditions": {
      "StringEquals": [{ "Key": "aws:ResourceTag/backup-tier", "Value": "tier1" }]
    }
  }'

For integrity beyond “did it boot”, wire an EventBridge rule on the restore-job-completed event to a Lambda that connects to the restored DB / reads canary objects from the restored bucket / checksums an EBS volume, then writes the verdict back:

aws backup put-restore-validation-result \
  --restore-job-id "$RESTORE_JOB_ID" \
  --validation-status SUCCESSFUL \
  --validation-status-message "row-count and checksum match production canary"

The job’s measured duration is your empirical RTO; the gap between recovery-point timestamp and incident is your RPO. Track both in Audit Manager so an auditor sees evidence, not a runbook claiming a number. A validation status, once written, is immutable — compute it correctly.
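The arithmetic behind those two numbers is plain timestamp subtraction. A minimal sketch follows, using hand-typed example timestamps; in practice you would take the restore job's creation and completion times and the recovery point's creation time from the job and recovery-point descriptions.

```python
from datetime import datetime, timedelta, timezone


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as e.g. '2h41m' (seconds are truncated)."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


# Example values only; none of these come from a real job.
job_created = datetime(2025, 1, 6, 7, 0, tzinfo=timezone.utc)
job_completed = datetime(2025, 1, 6, 9, 41, tzinfo=timezone.utc)
recovery_point_created = datetime(2025, 1, 6, 5, 0, tzinfo=timezone.utc)
incident_at = datetime(2025, 1, 6, 5, 45, tzinfo=timezone.utc)

rto = job_completed - job_created           # how long the restore took
rpo = incident_at - recovery_point_created  # how much data the restore would lose

print("RTO", format_duration(rto))  # RTO 2h41m
print("RPO", format_duration(rpo))  # RPO 0h45m
```

Keeping every timestamp timezone-aware in UTC avoids off-by-hours errors when the restore and the incident are recorded in different Regions.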

Restore-testing plan options, end to end

The plan and selection schemas have several fields whose defaults are wrong for a serious DR program. Every option:

| Field | Belongs to | Values | Default | When to change |
| --- | --- | --- | --- | --- |
| ScheduleExpression | plan | cron(...) UTC | required | Weekly is the common cadence |
| StartWindowHours | plan | hours | 24 (max 168) | Bound the test to off-peak |
| RecoveryPointSelection.Algorithm | plan | LATEST_WITHIN_WINDOW / RANDOM_WITHIN_WINDOW |  | RANDOM exercises older RPs too |
| RecoveryPointTypes | plan | SNAPSHOT / CONTINUOUS |  | Match the tier you’re proving |
| SelectionWindowDays | plan | days |  | How far back RPs are eligible |
| IncludeVaults | plan | vault ARNs | all | Point at the air-gap vault |
| ProtectedResourceType | selection | RDS / EBS / DynamoDB / EFS / S3 / EC2 | required | One selection per type |
| IamRoleArn | selection | restore role ARN | required | Needs KMS + restore perms |
| ValidationWindowHours | selection | hours | 0 | >0 so the resource survives to validate |
| ProtectedResourceConditions | selection | tag conditions | none | Scope to backup-tier=tier1 |
| RestoreMetadataOverrides | selection | key/value | none | Override subnet/SG/instance class for the test env |

The recovery-point selection algorithm changes what your test actually proves:

| Algorithm | Picks | Proves | Use when |
| --- | --- | --- | --- |
| LATEST_WITHIN_WINDOW | Most recent RP in the window | Your freshest copy restores | Default: proves current recoverability |
| RANDOM_WITHIN_WINDOW | A random RP in the window | Older copies also restore | Catch silent corruption in aging RPs |
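To make the difference concrete, here is a toy model of the two algorithms over a list of recovery points. It only illustrates the semantics described in the table; how AWS Backup actually implements the selection is not documented here, and the record shape (`id`, `created`) is invented for the example.

```python
import random
from datetime import datetime, timedelta, timezone


def select_recovery_point(points, now, window_days, algorithm, rng=None):
    """Pick one recovery point created within the last `window_days`, or None."""
    cutoff = now - timedelta(days=window_days)
    eligible = [p for p in points if cutoff <= p["created"] <= now]
    if not eligible:
        return None
    if algorithm == "LATEST_WITHIN_WINDOW":
        return max(eligible, key=lambda p: p["created"])
    if algorithm == "RANDOM_WITHIN_WINDOW":
        return (rng or random).choice(eligible)
    raise ValueError(f"unknown algorithm: {algorithm}")


now = datetime(2025, 1, 13, 7, 0, tzinfo=timezone.utc)
points = [
    {"id": "rp-1d", "created": now - timedelta(days=1)},
    {"id": "rp-5d", "created": now - timedelta(days=5)},
    {"id": "rp-20d", "created": now - timedelta(days=20)},  # outside a 7-day window
]

latest = select_recovery_point(points, now, 7, "LATEST_WITHIN_WINDOW")
print(latest["id"])  # rp-1d
```

With RANDOM_WITHIN_WINDOW the same call may return rp-1d or rp-5d but never rp-20d, which is why widening SelectionWindowDays is what lets a random test reach older copies.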

The validation status values and what each means downstream:

| validation-status | Meaning | Effect on evidence |
| --- | --- | --- |
| SUCCESSFUL | Integrity check passed | Counts as a passing test (immutable) |
| FAILED | Check ran, data wrong | Flags a real recoverability gap |
| TIMED_OUT | Validation didn’t finish in window | Raise ValidationWindowHours |
| (unset) | No validation wired | RTO measured, integrity unproven |

Audit, drift, and evidence export

AWS Backup Audit Manager turns “are we compliant” into a queryable framework. It ships controls you parameterize — minimum retention enforced, resources protected by a plan, resources in a Vault-Lock-protected vault, cross-Region/cross-account copy present, last-recovery-point recency, and restore-time-meets-target (fed by restore testing).

aws backup create-framework \
  --framework-name org_dr_framework \
  --framework-controls '[
    {
      "ControlName": "BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK",
      "ControlInputParameters": [
        { "ParameterName": "requiredRetentionDays", "ParameterValue": "365" }
      ]
    },
    { "ControlName": "BACKUP_RESOURCES_PROTECTED_BY_BACKUP_VAULT_LOCK" },
    { "ControlName": "BACKUP_RECOVERY_POINT_ENCRYPTED" }
  ]'

Schedule a report plan that drops compliance evidence (CSV/JSON) into S3 on a cadence — the artifact you hand auditors and the drift signal when a resource slips out of policy. For org-wide visibility, aggregate findings in the delegated admin via AWS Config aggregators or Security Hub rather than logging into each account. (See AWS CloudTrail & Config: Audit & Compliance for the wider audit fabric.)

The Audit Manager controls worth enabling

Each control answers a specific auditor question and emits a specific piece of evidence. The set you want:

| Control | Question it answers | Parameter | Evidence emitted |
| --- | --- | --- | --- |
| BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK | Are RPs kept long enough? | requiredRetentionDays | Per-RP retention compliance |
| BACKUP_RESOURCES_PROTECTED_BY_BACKUP_PLAN | Is everything in scope backed up? | resource type/tags | List of unprotected resources |
| BACKUP_RESOURCES_PROTECTED_BY_BACKUP_VAULT_LOCK | Are RPs in a locked vault? |  | Vault-Lock coverage |
| BACKUP_RECOVERY_POINT_ENCRYPTED | Are RPs encrypted? |  | Encryption status per RP |
| BACKUP_RECOVERY_POINT_MANUAL_DELETION_DISABLED | Can RPs be hand-deleted? |  | Vault access-policy posture |
| BACKUP_LAST_RECOVERY_POINT_CREATED | Is the latest RP recent? | recoveryPointAgeValue/Unit | Freshness (catches stalled jobs) |
| RESTORE_TIME_FOR_RESOURCES_MEET_TARGET | Does RTO meet target? | maxRestoreTime | Measured restore-time compliance |
| BACKUP_RESOURCES_PROTECTED_BY_CROSS_REGION | Is there an off-Region copy? |  | Cross-Region copy presence |
| BACKUP_RESOURCES_PROTECTED_BY_CROSS_ACCOUNT | Is there an off-account copy? |  | Cross-account copy presence |

Where each compliance signal aggregates for org-wide visibility:

| Signal | Source | Aggregate in | How |
| --- | --- | --- | --- |
| Backup job success/failure | AWS Backup per account | Delegated admin | Cross-account monitoring |
| Framework control compliance | Audit Manager per account | S3 (report plan) | Scheduled CSV/JSON export |
| Config rule drift | AWS Config per account | Delegated admin | Config aggregator |
| Security findings | Security Hub | Delegated admin / security acct | Hub aggregation |
| API actions (who deleted what) | CloudTrail | Org trail → central S3 | Organization trail |

Architecture at a glance

The diagram traces the recovery path the way data actually moves through it, left to right, and drops a number on each control or failure point. Read it as a pipeline. On the far left, the org control plane (the management account) enables AWS Backup trusted access, turns on the BACKUP_POLICY type, and registers the delegate; badge 1 marks the trap that a registered delegate is still blind to jobs until you separately enable cross-account monitoring. The delegated admin authors the tag-targeted policy and runs Audit Manager, then attaches the policy to the production OU. That policy renders into every workload account, where a daily rule snapshots tagged resources into a local CMK-encrypted vault for fast restores; badge 2 marks the bootstrap gotcha that the vault and AWSBackupCentralRole must already exist or the job fails before it starts.

From the workload accounts, copy_actions fans each recovery point cross-account and cross-Region into the air-gapped recovery account in us-west-2. That is where the program earns its keep: a logically air-gapped vault under a 7-year compliance-mode Vault Lock (badge 3: the lock must actually harden, and no “Always”-retention recovery point may sit inside it), a destination CMK or multi-Region key, and a recovery-OU SCP that denies the deletion verbs (badge 4: the copy still has to land, which fails if the destination access policy or that very SCP is wrong). Finally the restore-testing plane, running in the recovery account, fires real StartRestoreJob calls weekly against the air-gapped copies, a Lambda checksums the result, and PutRestoreValidationResult writes immutable evidence back to Audit Manager; badge 5 marks the restore that fails with a KMS-denied error when the destination key policy lacks kms:CreateGrant. Follow the numbers and you have the whole method: author policy centrally, snapshot locally, copy to an immutable air-gap, and prove you can restore.

Org-wide AWS Backup recovery path rendered left to right across five zones: the management account enabling Organizations trusted access, the BACKUP_POLICY type and delegate registration; a delegated backup admin authoring a tag-targeted policy and running Audit Manager; workload accounts where a daily rule snapshots backup-tier-tagged resources into a CMK-encrypted local vault using AWSBackupCentralRole; an isolated air-gapped recovery account in us-west-2 receiving cross-account/cross-Region copies into a logically air-gapped vault under a seven-year compliance-mode Vault Lock, encrypted by a multi-Region KMS key and guarded by a recovery-OU SCP that denies delete and key-deletion verbs; and a restore-testing plane that runs weekly StartRestoreJob operations against the air-gapped copies, validates checksums with Lambda, and writes immutable RTO/RPO evidence back. Five numbered badges flag the failure points: the delegate being blind to jobs without cross-account monitoring, the vault and role not bootstrapped, the Vault Lock not hardening or holding an Always-retention recovery point, the copy failing to land on a bad access policy or SCP, and the restore failing with a KMS access-denied error from a missing CreateGrant.

Real-world scenario

A fintech platform team I worked with — call it Meridian Pay, 140 AWS accounts under a Control Tower landing zone, running card-processing on RDS PostgreSQL and a DynamoDB ledger — ran a clean per-account AWS Backup setup: every workload account backed itself up to a local vault, lifecycle was correct, dashboards were green. RPO was nominally one hour (continuous backup on the ledger), retention met their seven-year card-industry floor, and the quarterly “do we have backups?” checkbox was always ticked. The program had cost them about ₹95,000/month in storage and they considered it solved.

Then a CI/CD pipeline with an over-broad deployment role in their staging account was compromised through a poisoned npm dependency. The attacker enumerated AWS Backup and, because the same role could manage vaults, began deleting recovery points and the vault itself. The local backups in that account were gone in minutes — encrypted with a CMK the same role could also schedule for deletion. Staging was non-production, so the data loss was survivable, but the incident review asked a sharper question than the incident itself: prove the production copies could not have met the same fate. Nobody could. Production used the identical pattern; only luck (and the attacker stopping at staging) separated a bad day from a company-ending one.

The constraint made the fix non-trivial. Compliance forbade standing human or break-glass access to the immutable tier, yet they needed copies that survived a full account compromise and evidence the copies were both immutable and restorable. They could not simply “add a second vault” in the same account — that shared the trust boundary the staging incident had just walked through.

The fix was three changes, not a re-platform. First, every Tier-1 plan got a copy_actions fan-out into a logically air-gapped vault in a separate recovery account (444444444444) in us-west-2, locked in compliance mode with min-retention-days matching their seven-year floor — so even a root-equivalent compromise of a workload account could not shorten or delete those copies. Second, the recovery OU got an SCP denying the deletion verbs outright, closing the exact door the staging incident walked through:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": [
      "backup:DeleteRecoveryPoint",
      "backup:DeleteBackupVault",
      "backup:PutBackupVaultAccessPolicy",
      "kms:ScheduleKeyDeletion",
      "kms:DisableKey"
    ],
    "Resource": "*"
  }]
}

Third — the part that satisfied the auditor — a weekly restore-testing plan in the recovery account restored the latest air-gapped RDS and DynamoDB recovery points, a Lambda validated row counts and a ledger checksum against a production canary, and PutRestoreValidationResult recorded an immutable pass with a measured restore duration. The Audit Manager report plan exported that evidence to a locked S3 bucket every Monday.

When the auditor asked the sharp question, the answer was a CSV with timestamps, durations, and validation verdicts — not a slide deck. RTO went from “we believe under four hours” to “measured 2h41m on the last 11 consecutive weekly tests.” The incremental cost was about ₹40,000/month (the air-gap storage and cross-Region copy egress), which the CFO signed in one meeting once it was framed as “the difference between a finding and a clean audit.” The wall-lesson the team adopted: “A green backup dashboard is a claim. A passing restore test in an account you can’t delete from is a fact.”

The transformation as a before/after, because the gap is the lesson:

| Dimension | Before (per-account) | After (centralized air-gap) |
| --- | --- | --- |
| Trust boundary | Backups share the workload account | Separate recovery account + Region |
| Immutability | CMK + RPs deletable by the same role | Compliance Vault Lock, 7-year floor |
| Deletion door | Deploy role could manage vaults | SCP denies Delete*/key-deletion |
| RTO/RPO | “We believe under 4h” | Measured 2h41m over 11 weekly tests |
| Audit answer | A runbook claiming a number | CSV of timestamps, durations, verdicts |
| Cost | ~₹95,000/mo | ~₹135,000/mo (+air-gap copy/egress) |
| Survives account compromise? | No | Yes |

Advantages and disadvantages

Centralizing backup as a separate trust boundary buys survivability and provability at the cost of complexity and cross-account egress. Weigh it honestly:

| Advantages (why this model protects you) | Disadvantages (why it costs you) |
| --- | --- |
| Recovery path is a separate trust boundary: a compromised workload account can’t reach the immutable copies | Multi-account, multi-Region, multi-key: materially more moving parts to get right |
| Compliance Vault Lock makes copies immutable to everyone, including root and AWS, which is true ransomware survivability | The “Always”-retention-in-compliance-vault foot-gun bills forever and is unrecoverable short of closing the account |
| Tag-targeted org policy means coverage scales automatically as teams ship | Tag discipline is now load-bearing: an untagged resource is silently unprotected |
| Restore testing converts RTO/RPO from claims into measured, auditable evidence | Cross-Region copy adds egress and storage cost (the air-gap is not free) |
| One delegated admin authors policy org-wide; the management account stays thin | Half-configured delegation (policy without monitoring) is an easy, invisible mistake |
| Audit Manager + report plans hand auditors evidence instead of opinions | KMS key policy across accounts is fiddly; the CreateGrant bug is a common stumble |
| SCPs on the recovery OU close the exact deletion door incidents walk through | Over-broad SCPs can also block the legitimate copy-in path, so they must be scoped precisely |

The model is right for any organization where a backup failing during an incident is unacceptable — regulated data, ransomware-exposed workloads, anything where “prove you can recover” is a real question. It is overkill for a single throwaway dev account. It bites hardest on teams with weak tagging discipline (coverage gaps), on those who lock a vault in compliance mode without understanding the irreversibility, and on anyone who configures the copy path but never tests the restore — discovering the KMS grant bug during a real disaster instead of a Monday drill.

Hands-on lab

Stand up the core of the program in a single sandbox account — a CMK, a vault, a compliance-mode lock in its grace window, a tag-targeted plan, and an on-demand backup you watch complete. Everything here is free-tier-friendly except a few paise of EBS snapshot storage; we tear it all down at the end. Run in CloudShell (the AWS CLI is pre-authenticated).

Step 1 — Variables and a dedicated KMS key. A dedicated CMK is mandatory for a real program; we make one even in the lab.

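REGION=ap-south-1
export AWS_DEFAULT_REGION="$REGION"  # make every CLI call below use the same Region as the ARNs we build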
VAULT=lab-backup-vault
KEY_ID=$(aws kms create-key --description "lab backup CMK" \
  --query KeyMetadata.KeyId --output text)
aws kms create-alias --alias-name alias/lab-backup --target-key-id "$KEY_ID"
echo "Key: $KEY_ID"

Expected: a key ID prints; alias/lab-backup now resolves to it.

Step 2 — Create a backup vault encrypted with that key.

aws backup create-backup-vault --backup-vault-name "$VAULT" \
  --encryption-key-arn "arn:aws:kms:$REGION:$(aws sts get-caller-identity --query Account --output text):key/$KEY_ID"
aws backup describe-backup-vault --backup-vault-name "$VAULT" \
  --query '{Name:BackupVaultName, Locked:Locked, RP:NumberOfRecoveryPoints}'

Expected: Locked: false, RP: 0 — an empty, unlocked vault.

Step 3 — Tag an EBS volume so the plan can select it. Create a tiny 1 GiB volume and tag it backup-tier=tier1.

AZ=${REGION}a
VOL_ID=$(aws ec2 create-volume --availability-zone "$AZ" --size 1 \
  --volume-type gp3 --query VolumeId --output text)
aws ec2 create-tags --resources "$VOL_ID" \
  --tags Key=backup-tier,Value=tier1 Key=Name,Value=lab-backup-target
echo "Volume: $VOL_ID"

Step 4 — Create a backup plan and a tag-based selection.

PLAN_ID=$(aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "lab_tier1",
  "Rules": [{
    "RuleName": "DailySnapshot",
    "TargetBackupVaultName": "'"$VAULT"'",
    "ScheduleExpression": "cron(0 18 ? * * *)",
    "Lifecycle": { "DeleteAfterDays": 35 }
  }]
}' --query BackupPlanId --output text)

ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/service-role/AWSBackupDefaultServiceRole"
aws backup create-backup-selection --backup-plan-id "$PLAN_ID" \
  --backup-selection '{
    "SelectionName": "tier1-by-tag",
    "IamRoleArn": "'"$ROLE_ARN"'",
    "ListOfTags": [{ "ConditionType": "STRINGEQUALS", "ConditionKey": "backup-tier", "ConditionValue": "tier1" }]
  }'

(If AWSBackupDefaultServiceRole does not exist, the AWS Backup console’s first-run creates it, or create it from the AWS managed AWSBackupServiceRolePolicyForBackup policy.)

Step 5 — Take an on-demand backup and watch it complete. Don’t wait for the 18:00 cron — fire one now.

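# Give the recovery point an explicit expiry so the lab never leaves an
# indefinite-retention recovery point in the vault that Step 6 locks.
JOB_ID=$(aws backup start-backup-job \
  --backup-vault-name "$VAULT" \
  --resource-arn "arn:aws:ec2:$REGION:$(aws sts get-caller-identity --query Account --output text):volume/$VOL_ID" \
  --iam-role-arn "$ROLE_ARN" \
  --lifecycle DeleteAfterDays=7 \
  --query BackupJobId --output text)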
aws backup describe-backup-job --backup-job-id "$JOB_ID" \
  --query '{State:State, Pct:PercentDone}'
# re-run the describe until State = COMPLETED (usually a couple of minutes for 1 GiB)

Expected: State moves CREATED → RUNNING → COMPLETED; a recovery point now exists in the vault.

Step 6 — Lock the vault in compliance mode (grace window) and inspect the state. We use the minimum 3-day grace so you can still unlock before teardown.

aws backup put-backup-vault-lock-configuration --backup-vault-name "$VAULT" \
  --changeable-for-days 3 --min-retention-days 1 --max-retention-days 365
aws backup describe-backup-vault --backup-vault-name "$VAULT" \
  --query '{Locked:Locked, LockDate:LockDate, Min:MinRetentionDays, Max:MaxRetentionDays}'

Expected: Locked: true with a LockDate ~3 days in the future — the cooling-off window. Because it hasn’t hardened, you can still delete the lock in teardown.
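Reading those two fields together tells you which of four states the vault is in. The sketch below assumes the semantics described in this guide and in the AWS Backup API reference as I read it: a locked vault with no LockDate is in governance mode, a future LockDate means the compliance lock is still in its grace window, and a past LockDate means it has hardened. The function and state names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def lock_state(vault: dict, now: datetime) -> str:
    """Classify a vault from its Locked / LockDate fields (LockDate as an aware datetime)."""
    if not vault.get("Locked"):
        return "unlocked"
    lock_date = vault.get("LockDate")
    if lock_date is None:
        return "governance"            # removable by a sufficiently privileged principal
    if lock_date > now:
        return "compliance-grace"      # lock config can still be deleted
    return "compliance-hardened"       # immutable from here on


now = datetime(2025, 1, 6, 12, 0, tzinfo=timezone.utc)
print(lock_state({"Locked": False}, now))                                     # unlocked
print(lock_state({"Locked": True}, now))                                      # governance
print(lock_state({"Locked": True, "LockDate": now + timedelta(days=3)}, now)) # compliance-grace
print(lock_state({"Locked": True, "LockDate": now - timedelta(days=1)}, now)) # compliance-hardened
```

In the lab you want to see compliance-grace right after Step 6; anything reporting compliance-hardened can no longer be torn down.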

Validation checklist. You created a dedicated CMK, an encrypted vault, a tag-targeted plan, a real recovery point, and a compliance-mode lock in its grace window — the full local half of the program in one account. What each step proved:

| Step | What you did | What it proves | Real-world analogue |
| --- | --- | --- | --- |
| 1 | Dedicated CMK | The key the program must own (not aws/backup) | Every prod vault uses a CMK |
| 2 | Encrypted vault | The permission boundary for recovery points | Local vault in each account |
| 3 | Tag the resource | Coverage follows tags, not ARNs | How org policy scales |
| 4 | Plan + tag selection | One plan protects whatever is tagged | The Tier-1 org policy |
| 5 | On-demand backup completes | A recovery point is real, not theoretical | The nightly job |
| 6 | Compliance lock (grace) | Immutability is a mode + a grace window | Vault Lock on prod/air-gap vaults |

Teardown (delete the lock while still in grace, then everything else).

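# Delete the lock config (only possible because LockDate is still in the future).
aws backup delete-backup-vault-lock-configuration --backup-vault-name "$VAULT"
# Delete the recovery point. Deletion is asynchronous; if the vault delete below
# fails because the vault is not empty, wait a moment and re-run it.
RP_ARN=$(aws backup list-recovery-points-by-backup-vault --backup-vault-name "$VAULT" \
  --query 'RecoveryPoints[0].RecoveryPointArn' --output text)
aws backup delete-recovery-point --backup-vault-name "$VAULT" --recovery-point-arn "$RP_ARN"
# A plan cannot be deleted while it still has selections, so remove the selection first.
SEL_ID=$(aws backup list-backup-selections --backup-plan-id "$PLAN_ID" \
  --query 'BackupSelectionsList[0].SelectionId' --output text)
aws backup delete-backup-selection --backup-plan-id "$PLAN_ID" --selection-id "$SEL_ID"
aws backup delete-backup-plan --backup-plan-id "$PLAN_ID"
# Then the vault, the volume, and the key.
aws backup delete-backup-vault --backup-vault-name "$VAULT"
aws ec2 delete-volume --volume-id "$VOL_ID"
aws kms schedule-key-deletion --key-id "$KEY_ID" --pending-window-in-days 7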

Cost note. A 1 GiB EBS snapshot is a fraction of a rupee; the CMK is ~₹85/month prorated (deleted here after a 7-day pending window, the KMS minimum). The whole lab runs to a few rupees. Critical: never let the Step 6 grace window lapse in a sandbox you want to delete. Once LockDate passes, the compliance lock hardens, the vault and its recovery points become undeletable until they expire, and the only escape is closing the AWS account. Omitting --changeable-for-days does not make the lock harder; it creates a governance-mode lock, which is not what this lab is demonstrating.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark, because every one of these silently breaks a backup program and most produce a green dashboard while doing it. First as a scannable table, then the entries that bite hardest expanded with the full confirm-command detail.

| # | Symptom | Root cause | Confirm (exact cmd) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Policy renders perfectly but every job fails | Vault/role not bootstrapped in the target account/Region | describe-effective-policy OK, but job error “role/vault not found” | StackSet vault + AWSBackupCentralRole to every OU/Region before policies |
| 2 | Delegate authors policy but sees no jobs org-wide | Cross-account monitoring not enabled (only delegation was) | list-delegated-administrators returns acct, list-backup-jobs empty | Enable cross-account monitoring in AWS Backup → Settings |
| 3 | Cross-account copies never appear in the recovery vault | Dest vault access policy missing / wrong PrincipalOrgID, or recovery SCP blocks copy-in | list-recovery-points-by-backup-vault (airgap) empty after a job | put-backup-vault-access-policy for backup:CopyIntoBackupVault; scope SCP to Delete* only |
| 4 | Restore fails: “KMS key cannot be accessed” | Dest key policy lacks kms:CreateGrant for Backup/restore role | Restore job StatusMessage/AbortReason cites KMS | Add Decrypt+GenerateDataKey+CreateGrant (GrantIsForAWSResource) to dest CMK |
| 5 | Vault “looks locked” but is still deletable | Governance mode (no --changeable-for-days) or LockDate still future | describe-backup-vault → Locked:false, no LockDate (governance), or a future LockDate | Re-lock with --changeable-for-days; wait for LockDate to pass |
| 6 | A recovery point bills forever, can’t delete it | “Always”/indefinite retention inside a compliance-locked vault | list-recovery-points-by-backup-vault shows no CalculatedLifecycle.DeleteAt | Prevent via max-retention-days; existing one only clears by account closure |
| 7 | Some resources are simply never backed up | They aren’t tagged with the selection tag | describe-effective-policy selection tag vs get-resources by tag | Enforce tagging (Tag Policy / Config rule); add the tag |
| 8 | Backup job cancelled before it ran | start_backup_window_minutes too small under contention | Job StatusMessage: “window expired” | Raise start/complete windows; stagger schedules |
| 9 | Continuous-backup RPs vanish after ~35 days | PITR has a hard 35-day ceiling (working as designed) | RP age ~35d on a CONTINUOUS rule | Add a daily SNAPSHOT rule with long retention for the long tail |
| 10 | EC2/EBS restore fails though RP is intact | Source AMI was disabled | Restore error references a disabled image | SCP deny ec2:DisableImage on prod OUs; re-enable the AMI |
| 11 | Cross-Region copy slow / intermittently failing | Single-Region key re-wrapping per hop, or missing ReEncrypt* | Copy job latency/errors; key policy lacks ReEncrypt* | Use a multi-Region key; or add kms:ReEncrypt* |
| 12 | Audit Manager shows resources non-compliant for “protected” | Resource in scope but no plan covers its tag/type | Framework finding lists the resource ARN | Extend the policy selection; re-tag |
| 13 | create-policy --type BACKUP_POLICY errors | Policy type not enabled on the org root | list-roots → PolicyTypes lacks BACKUP_POLICY | enable-policy-type --policy-type BACKUP_POLICY |
| 14 | Restore test “passes” but data is wrong | No validation wired (only boot was checked) | Restore job ValidationStatus unset | Add EventBridge → Lambda + put-restore-validation-result |

The expanded form, with the full reasoning for the entries that bite hardest:

1. The org policy renders perfectly into a member account, yet every backup job fails. Root cause: The target vault and the IAM role the policy references do not exist in that account/Region — AWS Backup never creates them. Confirm: aws organizations describe-effective-policy --policy-type BACKUP_POLICY (run in the member) shows a correct policy; aws backup list-backup-jobs --by-state FAILED shows a StatusMessage referencing a missing role or vault. Fix: Deploy the vault and AWSBackupCentralRole to every target account/Region via a service-managed CloudFormation StackSet targeting the OUs, before attaching any policy. The role needs the AWS managed AWSBackupServiceRolePolicyForBackup (and ...ForRestores where you restore).

2. The delegated admin can author policies but sees no jobs anywhere. Root cause: Only register-delegated-administrator was done; cross-account monitoring was never enabled. They are two separate grants. Confirm: aws organizations list-delegated-administrators --service-principal backup.amazonaws.com returns the account, but aws backup list-backup-jobs from the delegate is empty even though members are backing up. Fix: Signed in to the Organizations management account, open the AWS Backup console → Settings and enable cross-account monitoring (and ensure isCrossAccountBackupEnabled is true for copies).

3. Cross-account copies never land in the recovery vault. Root cause: The destination vault access policy is missing or its aws:PrincipalOrgID doesn’t match, or the recovery OU’s SCP is so broad it blocks backup:CopyIntoBackupVault along with the deletion verbs. Confirm: aws backup list-recovery-points-by-backup-vault --backup-vault-name airgap-recovery-vault (run in 444444444444) is empty after a member job completes; aws backup get-backup-vault-access-policy shows a missing/incorrect policy; aws organizations describe-effective-policy --policy-type SERVICE_CONTROL_POLICY reveals an over-broad deny. Fix: put-backup-vault-access-policy allowing backup:CopyIntoBackupVault for PrincipalOrgID; scope the recovery SCP to only the Delete*/key-deletion verbs so copy-in still works.

4. Restore fails with “KMS key cannot be accessed” though the recovery point is intact. Root cause: The destination CMK’s key policy lacks kms:CreateGrant (with the GrantIsForAWSResource condition) for backup.amazonaws.com and/or the restore role — AWS Backup creates a transient grant to decrypt during restore. Confirm: aws backup describe-restore-job --restore-job-id <id> shows a StatusMessage/AbortReason citing KMS; the recovery point itself lists fine. Fix: Add kms:Decrypt, kms:GenerateDataKey, kms:DescribeKey, and kms:CreateGrant to the dest key policy for the Backup service and the restore role (the restore-role statement scoped by "kms:GrantIsForAWSResource": "true"). Across Regions, prefer a multi-Region key.

5. The vault “looks locked” but a privileged role can still delete recovery points. Root cause: It’s in governance mode (you omitted --changeable-for-days), or it’s compliance mode but LockDate is still in the future (the grace window). Confirm: aws backup describe-backup-vault --query '{Locked:Locked, LockDate:LockDate}' shows Locked:false (no lock at all), Locked:true with no LockDate (governance mode), or Locked:true with a future LockDate (compliance mode, still in the grace window). Fix: For true immutability, lock with --changeable-for-days 3 (compliance) and wait for LockDate to pass. Only then is it un-deletable by anyone.

6. A recovery point bills indefinitely and refuses to delete. Root cause: It was created with “Always”/indefinite retention and the vault is under a hardened compliance lock, so its retention can never be shortened and it can never be deleted. Confirm: aws backup list-recovery-points-by-backup-vault shows the RP with no CalculatedLifecycle.DeleteAt; the vault shows Locked:true with a past LockDate. Fix: Prevent it up front with max-retention-days on the lock (which rejects indefinite-retention RPs). An already-hardened case has no escape short of closing the AWS account — which is exactly why you never combine “Always” with a compliance lock.

7. Some production resources are silently never backed up. Root cause: They lack the selection tag (backup-tier=tier1), so the tag-targeted policy never selects them. Tag-targeting scales coverage and silently drops the untagged. Confirm: Compare the policy’s selection tag (from describe-effective-policy) against aws resourcegroupstaggingapi get-resources --tag-filters Key=backup-tier. Fix: Enforce tagging with an Organizations Tag Policy and an AWS Config rule that flags untagged in-scope resources; backfill the tag.

9. Continuous-backup recovery points disappear after about 35 days. Root cause: PITR/continuous backup has a hard 35-day retention ceiling — this is by design, not a bug. Confirm: The vanishing RPs are on a rule with enable_continuous_backup: true; their age clusters at ~35 days. Fix: Continuous is your low-RPO tier only. Run a daily SNAPSHOT rule (with long, air-gapped retention) in the same plan for the long-term and ransomware tier.
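Entries 5 and 6 both come down to reading two API responses correctly. The following is a minimal sketch, not AWS tooling: it assumes you have already fetched a describe-backup-vault response and a list-recovery-points-by-backup-vault page (for example with boto3) and passes them in as plain dicts. The function names and state labels are hypothetical, and the "governance" branch rests on the assumption that a lock applied without --changeable-for-days reports Locked:true with no LockDate.

```python
from datetime import datetime, timezone


def lock_state(vault: dict, now: datetime) -> str:
    """Classify a describe-backup-vault response by how deletable the vault still is."""
    if not vault.get("Locked", False):
        return "unlocked"              # no Vault Lock at all
    lock_date = vault.get("LockDate")
    if lock_date is None:
        return "governance"            # assumption: locked, but no LockDate was ever set
    if lock_date > now:
        return "compliance-grace"      # the lock can still be removed until LockDate
    return "compliance-hardened"       # LockDate has passed; the lock is permanent


def indefinite_recovery_points(recovery_points: list) -> list:
    """Return ARNs of recovery points that have no CalculatedLifecycle.DeleteAt."""
    flagged = []
    for rp in recovery_points:
        lifecycle = rp.get("CalculatedLifecycle") or {}
        if not lifecycle.get("DeleteAt"):
            flagged.append(rp["RecoveryPointArn"])
    return flagged


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(lock_state({"Locked": True}, now))
    print(indefinite_recovery_points([{"RecoveryPointArn": "arn:example:rp-1"}]))
```

Running indefinite_recovery_points against a vault while lock_state still reports "compliance-grace" is the useful combination: it is the last window in which an indefinite-retention recovery point can be cleaned up.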

Best practices

The leading indicators worth alerting on before an incident or audit — not the lagging “job failed”:

| Alert on | Signal / control | Threshold (starting point) | Why it’s leading |
| --- | --- | --- | --- |
| Stalled backups | BACKUP_LAST_RECOVERY_POINT_CREATED | RP age > 26h on a daily rule | Catches a quietly broken job before the next audit |
| Copy not landing | Recovery-point count in air-gap vault | 0 new in 26h | The air-gap is the part that must not silently fail |
| Lock not hardened | describe-backup-vault Locked/LockDate | Locked:false on a prod/air-gap vault | An “immutable” vault that isn’t |
| Coverage gap | _PROTECTED_BY_BACKUP_PLAN | any in-scope resource non-compliant | Unprotected resource discovered before restore time |
| Restore-time regression | _RESTORE_TIME_..._MEET_TARGET | measured RTO > target | RTO drifting past the SLA the business bought |
| Deletion attempt | CloudTrail DeleteRecoveryPoint/DeleteBackupVault | any in the recovery account | The SCP should deny it; an attempt is a signal |
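The first two rows and the coverage row reduce to simple comparisons once the data is in hand. As a rough sketch, assuming recovery points shaped like list-recovery-points-by-backup-vault output (each with a CreationDate) and tag mappings shaped like the ResourceTagMappingList from get-resources, the checks might look like this. The function names and the in_scope_arns inventory are hypothetical, and the 26-hour default is just the starting threshold from the table.

```python
from datetime import datetime, timedelta, timezone


def newest_recovery_point_age(recovery_points: list, now: datetime):
    """Age of the most recent recovery point, or None if the vault holds none."""
    dates = [rp["CreationDate"] for rp in recovery_points]
    return (now - max(dates)) if dates else None


def is_stalled(recovery_points: list, now: datetime,
               threshold: timedelta = timedelta(hours=26)) -> bool:
    """True when no recovery point has landed within the threshold.

    The same check covers both 'stalled backups' (source vault) and
    'copy not landing' (air-gap vault); only the input vault differs.
    """
    age = newest_recovery_point_age(recovery_points, now)
    return age is None or age > threshold


def uncovered_resources(in_scope_arns: list, tag_mapping_list: list) -> list:
    """ARNs that should be protected but do not carry the selection tag.

    in_scope_arns is a hypothetical inventory you maintain yourself;
    tag_mapping_list is the tag-filtered get-resources result.
    """
    tagged = {mapping["ResourceARN"] for mapping in tag_mapping_list}
    return sorted(set(in_scope_arns) - tagged)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    fresh = [{"CreationDate": now - timedelta(hours=3)}]
    print(is_stalled(fresh, now), is_stalled([], now))
```

Treating an empty vault as stalled is a deliberate choice in this sketch: for the air-gap vault, "nothing has ever landed" is the failure you most want to hear about.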

Security notes

The security controls and what each defends against — note that “secure” and “recoverable” pull the same direction here:

| Control | Mechanism | Defends against | Also enables |
| --- | --- | --- | --- |
| Separate recovery account | Distinct account + OU | Account-scoped compromise reaching backups | Clean blast-radius boundary |
| Compliance Vault Lock | put-backup-vault-lock-configuration | Ransomware/root deleting recovery points | Regulatory WORM evidence |
| Recovery-OU SCP | Deny Delete*/key-deletion | The exact door incidents walk through | Auditable “cannot be deleted” claim |
| Scoped KMS key policy | CreateGrant + GrantIsForAWSResource | Arbitrary key-use grants | Working cross-account restore |
| Vault access policy | CopyIntoBackupVault + PrincipalOrgID | Unauthorised copy-in / exfil | Cross-account copy landing |
| Organization CloudTrail | Org trail → locked S3 | Tampering hiding deletion attempts | Forensic timeline of the recovery path |
| ec2:DisableImage SCP | Deny on prod OUs | Soft-bricking EC2/EBS restore | Guaranteed restorable AMIs |
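Two of these controls are small JSON documents that have to agree with each other: the vault access policy must let copies in, and the recovery-OU SCP must not accidentally deny that same action. A sketch of both follows, together with a guard that catches the over-broad deny. The Sids, function names and the organization ID in the demo are placeholders, the deny list is the one this guide names for the recovery OU, and the wildcard matcher is a simplification of IAM action matching (it ignores NotAction and conditions).

```python
import json
from fnmatch import fnmatchcase

DELETION_VERBS = [
    "backup:DeleteRecoveryPoint",
    "backup:DeleteBackupVault",
    "backup:PutBackupVaultAccessPolicy",
    "kms:ScheduleKeyDeletion",
    "kms:DisableKey",
]


def vault_access_policy(org_id: str) -> dict:
    """Destination vault policy: any principal in the organization may copy in."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOrgCopyIn",  # hypothetical Sid
            "Effect": "Allow",
            "Principal": "*",
            "Action": "backup:CopyIntoBackupVault",
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
        }],
    }


def recovery_ou_scp() -> dict:
    """SCP for the recovery OU: deny only the deletion verbs, nothing broader."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyBackupAndKeyDeletion",  # hypothetical Sid
            "Effect": "Deny",
            "Action": DELETION_VERBS,
            "Resource": "*",
        }],
    }


def scp_blocks_copy_in(scp: dict) -> bool:
    """True if any Deny statement's Action matches backup:CopyIntoBackupVault."""
    target = "backup:copyintobackupvault"  # IAM action names are case-insensitive
    statements = scp["Statement"]
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Deny":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(fnmatchcase(target, action.lower()) for action in actions):
            return True
    return False


if __name__ == "__main__":
    print(json.dumps(vault_access_policy("o-exampleorgid"), indent=2))
    print(scp_blocks_copy_in(recovery_ou_scp()))
```

Running scp_blocks_copy_in in a pipeline before attaching an SCP is one cheap way to keep a well-meant `backup:*` deny from silently closing the air-gap.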

Cost & sizing

What drives the AWS Backup bill, and how each lever interacts with the architecture, shown as a rough monthly picture for a mid-size estate (say 8 TB of warm Tier-1 data, 8 TB copied cross-Region, 7-year cold air-gap tail):

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
| --- | --- | --- | --- | --- |
| Warm local storage (8 TB) | Fast-restore recovery points | ~₹35,000–55,000 | Low-RTO operational restores | Drops sharply once moved to cold |
| Cold storage (long tail) | Cheap long-retention RPs | ~₹8,000–15,000 | 7-year compliance retention | 90-day min commit; retrieval fee |
| Cross-Region copy egress | Inter-Region transfer of 8 TB | ~₹40,000–60,000 (first copy build) | The air-gap (Region isolation) | Recurs on new data, not re-copied |
| Air-gap destination storage | Second copy in recovery Region | ~₹35,000–55,000 | Survives a regional event + compromise | Doubles storage for Tier-1 |
| KMS keys | A few CMKs / one MRK | ~₹300–800 | Shareable, immutable-friendly encryption | Per-request charges at high volume |
| Restore testing | Transient restored resources | ~₹2,000–5,000 | Measured RTO/RPO evidence | Validation window compute |

Free-tier note: there is no free tier for AWS Backup storage, but the control plane (policies, vaults, locks, Audit Manager frameworks) is free — you pay for stored bytes and copied bytes, not for the program’s machinery. Right-size by tiering aggressively to cold, copying only the tiers that genuinely need the air-gap (your Tier-1, not everything), and letting the daily-snapshot retention — not the continuous tier — carry the long, cheap tail.
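Tiering decisions are constrained by two lifecycle rules that are easy to violate in a policy file: the 35-day ceiling on continuous backup and the 90-day minimum stay in cold storage. Below is a small pre-flight check, offered as a sketch. It assumes each rule has been flattened to plain values (without the Organizations policy wrappers) using the same key names the policy uses; the function name and messages are hypothetical.

```python
CONTINUOUS_MAX_DAYS = 35   # hard ceiling for continuous (PITR) retention
COLD_MIN_DAYS = 90         # minimum time a recovery point must spend in cold storage


def lifecycle_problems(rule: dict) -> list:
    """Return human-readable problems for one backup rule; empty list means OK."""
    problems = []
    delete_after = rule.get("delete_after_days")            # None = indefinite
    cold_after = rule.get("move_to_cold_storage_after_days")

    if rule.get("enable_continuous_backup"):
        # This sketch also rejects a missing retention on a continuous rule,
        # so the ceiling is always stated explicitly.
        if delete_after is None or delete_after > CONTINUOUS_MAX_DAYS:
            problems.append("continuous rule must retain for at most 35 days")

    if cold_after is not None and delete_after is not None:
        if delete_after < cold_after + COLD_MIN_DAYS:
            problems.append(
                "delete_after_days must be at least 90 days after the cold transition"
            )
    return problems


if __name__ == "__main__":
    long_tail = {"move_to_cold_storage_after_days": 30, "delete_after_days": 2555}
    low_rpo = {"enable_continuous_backup": True, "delete_after_days": 35}
    for name, rule in (("long_tail", long_tail), ("low_rpo", low_rpo)):
        print(name, lifecycle_problems(rule))
```

Neither check says anything about price; it only keeps a cost-driven retention change from producing a rule AWS Backup would reject or a continuous tier that cannot carry the long tail.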

Interview & exam questions

1. Why is a per-account backup setup dangerous even when every account backs itself up correctly? Because the backups share a trust boundary with the thing that gets compromised. A single over-broad identity (a deploy role, a poisoned pipeline) that can manage vaults can delete every recovery point and the vault itself — and if the CMK is the same account’s, schedule it for deletion too. The fix is a separate recovery account, in another Region, with compliance-mode immutability and an SCP denying deletion.

2. What are the two distinct grants delegated administration in AWS Backup requires, and what breaks with only one? register-delegated-administrator lets the account author and manage backup policies; separately, cross-account monitoring (AWS Backup console → Settings) lets it see jobs in member accounts. With only the first, you get a policy author who is blind to execution — it looks configured but can’t observe whether anything is actually backing up.

3. A backup policy renders perfectly into a member account but every job fails. Most likely cause? The vault and the IAM role the policy references don’t exist in that account/Region — AWS Backup does not create them. Confirm with describe-effective-policy (policy is fine) versus a failed job’s StatusMessage (“role/vault not found”). Fix by bootstrapping the vault and AWSBackupCentralRole to every target OU/Region via a service-managed StackSet before attaching policies.

4. What single flag selects compliance mode for Vault Lock, and why does it matter? Including --changeable-for-days (with a grace window ≥3 days) selects compliance mode — immutable to every principal including root and AWS once LockDate passes. Omitting it gives governance mode, removable by sufficiently privileged IAM. Only compliance mode survives a root-equivalent compromise; it’s the anti-ransomware control.

5. Describe the “Always-retention in a compliance vault” foot-gun. A recovery point with indefinite (“Always”) retention inside a hardened compliance-locked vault can never have its retention shortened and never be deleted — so it bills forever, with the only escape being closing the AWS account. Prevent it by setting max-retention-days on the lock (which rejects indefinite-retention RPs). Never combine indefinite retention with a compliance lock.

6. Cross-account restore fails with “KMS key cannot be accessed” though the recovery point is intact. What’s missing? The destination CMK’s key policy lacks kms:CreateGrant (with the GrantIsForAWSResource condition) for backup.amazonaws.com and/or the restore role. AWS Backup creates a transient grant to decrypt during restore; without CreateGrant it can’t. Add Decrypt, GenerateDataKey, DescribeKey, and CreateGrant to the dest key policy; across Regions use a multi-Region key.

7. Why must the recovery vault use a dedicated CMK and not the aws/backup managed key? The AWS-managed aws/backup key cannot be shared cross-account, so the restore side in a different account can’t use it — cross-account restore is impossible. A dedicated customer-managed key has a key policy you control, letting you grant the Backup service and the restore role exactly the actions they need.

8. How do you make the cross-account copy actually land in the recovery vault? Two prerequisites: (a) a vault access policy on the destination allowing backup:CopyIntoBackupVault for your aws:PrincipalOrgID, and (b) ensuring isCrossAccountBackupEnabled is on. A common failure is an over-broad recovery-OU SCP that denies CopyIntoBackupVault along with the deletion verbs. Scope the SCP to only Delete*/key-deletion so copy-in still works.

9. What’s the difference between continuous (PITR) and snapshot protection, and how do you use both? Continuous/PITR (RDS, Aurora, DynamoDB, S3) restores to any second within a hard 35-day ceiling — your low-RPO tier. Snapshots restore to discrete points and support long retention — your long-term and ransomware tier. Run both rules in the same plan: continuous for low RPO, daily snapshot with long air-gapped retention for the long tail.

10. How does restore testing turn RTO/RPO from a claim into evidence? It runs real StartRestoreJob operations on a schedule against the air-gapped copies, measures completion time (your empirical RTO), optionally validates integrity via a Lambda, writes an immutable PutRestoreValidationResult, and tears the resource down. Audit Manager aggregates the durations and verdicts so an auditor sees timestamps and measured RTO, not a runbook asserting a number.

11. An EC2/EBS restore fails although the recovery point and lock are fine. What non-obvious cause should you check, and how do you prevent it? The source AMI has been disabled — the recovery point is intact but the resource isn’t restorable. Prevent it by an SCP denying ec2:DisableImage on production OUs, so an attacker can’t soft-brick the restore path without deleting anything; re-enable the AMI to restore.

12. Which SCP closes the exact door a compromised role walks through, and why an SCP rather than IAM? An SCP on the recovery OU denying backup:DeleteRecoveryPoint, backup:DeleteBackupVault, backup:PutBackupVaultAccessPolicy, kms:ScheduleKeyDeletion, and kms:DisableKey. SCPs bound every principal in the account including root, which IAM policies cannot — so even a fully compromised identity can’t delete the copies or schedule the key for deletion.
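The "measured, not asserted" idea behind question 10 is just arithmetic over restore-job timestamps. As an illustration only, assuming each job is a dict shaped like a describe-restore-job response (Status, CreationDate, CompletionDate), a summary might be computed like this. The function name and the returned keys are hypothetical, and judging the target against the worst run rather than the median is a choice of this sketch.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def measured_rto(restore_jobs: list, target: timedelta):
    """Summarise completed restore tests; returns None when nothing completed."""
    durations = [
        job["CompletionDate"] - job["CreationDate"]
        for job in restore_jobs
        if job.get("Status") == "COMPLETED"
    ]
    if not durations:
        return None
    worst = max(durations)
    return {
        "tests": len(durations),
        "failed": len(restore_jobs) - len(durations),  # anything not COMPLETED
        "median": median(durations),
        "worst": worst,
        "meets_target": worst <= target,  # judge the SLA on the slowest run
    }


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    jobs = [
        {"Status": "COMPLETED", "CreationDate": start,
         "CompletionDate": start + timedelta(hours=2, minutes=41)},
        {"Status": "COMPLETED", "CreationDate": start,
         "CompletionDate": start + timedelta(hours=3)},
    ]
    print(measured_rto(jobs, target=timedelta(hours=4)))
```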

These map to AWS Certified Security – Specialty (data protection, key management, incident response), AWS Certified Solutions Architect – Professional (multi-account governance, DR, cross-Region/account design), and the Reliability pillar of the AWS Well-Architected Framework. A compact cert-mapping for revision:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| Air-gap trust boundary, SCP deny | Security Specialty / SA Pro | Account isolation; preventive guardrails |
| Vault Lock compliance vs governance | Security Specialty | Data protection; WORM immutability |
| KMS key policy, CreateGrant | Security Specialty | Key management; cross-account access |
| Delegated admin, Organizations policy | SA Pro | Multi-account governance |
| Continuous vs snapshot, RTO/RPO | SA Pro / Well-Architected | DR design; reliability pillar |
| Restore testing + Audit Manager | Security Specialty / SA Pro | Recovery validation; auditability |

Quick check

  1. You ran register-delegated-administrator for AWS Backup and attached a policy, but the delegate’s console shows no backup jobs from member accounts. What second grant did you miss?
  2. True or false: omitting --changeable-for-days when locking a vault gives you a stronger, immutable compliance-mode lock.
  3. A cross-account restore fails with “KMS key cannot be accessed” even though the recovery point lists fine. Name the specific KMS permission that’s almost certainly missing.
  4. Why must you bootstrap the backup vault and IAM role into a target account before attaching an Organizations backup policy to its OU?
  5. Your continuous-backup recovery points keep disappearing at about 35 days. Is this a bug, and what do you add for long-term retention?

Answers

  1. Cross-account monitoring (AWS Backup console → Settings). Delegation (register-delegated-administrator) only grants policy authoring; cross-account monitoring is the separate grant that lets the delegate see member-account jobs. Without it you have a policy author blind to execution.
  2. False. Omitting --changeable-for-days gives governance mode, which a sufficiently privileged IAM principal can still remove and use to delete recovery points. Including --changeable-for-days (≥3) selects compliance mode — the immutable one that even root can’t undo once LockDate passes.
  3. kms:CreateGrant (with the kms:GrantIsForAWSResource condition) on the destination CMK’s key policy, for backup.amazonaws.com and the restore role. AWS Backup creates a transient grant to decrypt during restore; without CreateGrant the restore is denied.
  4. Because AWS Backup does not create the vault or the role — it only references them. If they don’t exist when a job fires, the job fails (“role/vault not found”) even though describe-effective-policy shows a perfect policy. A service-managed StackSet to the OU lands them before any account starts running jobs.
  5. Not a bug — PITR/continuous backup has a hard 35-day retention ceiling by design. It’s your low-RPO tier; add a daily SNAPSHOT rule (with long, air-gapped retention) in the same plan to carry the long-term and ransomware tail.
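Answer 3 is easier to remember as a concrete key-policy fragment. The sketch below builds the restore-role statements described in this guide (use-of-key actions, plus kms:CreateGrant scoped by kms:GrantIsForAWSResource) and includes a rough checker for an existing policy. The Sids, function names and the role ARN in the demo are hypothetical, the matching statement for the backup.amazonaws.com service principal is left out, and the checker does exact string matching only, so treat a clean result as a hint rather than proof.

```python
NEEDED_ACTIONS = ("kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey", "kms:CreateGrant")


def restore_key_statements(restore_role_arn: str) -> list:
    """Key-policy statements for the restore role on the destination CMK."""
    return [
        {
            "Sid": "AllowRestoreRoleUseOfKey",       # hypothetical Sid
            "Effect": "Allow",
            "Principal": {"AWS": restore_role_arn},
            "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
            "Resource": "*",
        },
        {
            "Sid": "AllowGrantsForAWSResources",     # hypothetical Sid
            "Effect": "Allow",
            "Principal": {"AWS": restore_role_arn},
            "Action": "kms:CreateGrant",
            "Resource": "*",
            # Only grants created on behalf of an integrated AWS service are allowed.
            "Condition": {"Bool": {"kms:GrantIsForAWSResource": "true"}},
        },
    ]


def _as_list(value) -> list:
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_restore_actions(key_policy: dict, role_arn: str) -> list:
    """Needed actions that no Allow statement grants to role_arn (exact match only)."""
    granted = set()
    for statement in _as_list(key_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        if isinstance(principal, dict):
            principals = _as_list(principal.get("AWS"))
        else:
            principals = _as_list(principal)
        if role_arn in principals or "*" in principals:
            granted.update(_as_list(statement.get("Action")))
    if "kms:*" in granted:
        return []
    return [action for action in NEEDED_ACTIONS if action not in granted]


if __name__ == "__main__":
    role = "arn:aws:iam::444444444444:role/ExampleRestoreRole"  # placeholder ARN
    policy = {"Version": "2012-10-17", "Statement": restore_key_statements(role)[:1]}
    print(missing_restore_actions(policy, role))
```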

Glossary

Next steps

You can now stand up and prove an org-wide, air-gapped backup program. Build outward:
