Centralized AWS Backup with Organizations: Vault Lock, Cross-Account Copy, and Recovery Runbooks

Backups are not a feature you turn on; they are a control plane you operate. The failure mode that ends careers is not “we had no backups” — it is “we had backups, in the same account, encrypted with the same key, deletable by the same blast-radius identity that just got compromised.” Ransomware crews and fat-fingered automation both target the recovery path first, because the recovery path is the one thing standing between them and a paid ransom or a permanent data-loss incident. This guide builds that recovery path as a separate trust boundary: a delegated backup admin that authors policy, compliance-mode Vault Lock vaults that even your root user cannot empty, copies landing in a logically air-gapped account in another Region, and restore tests that prove RTO/RPO instead of asserting them. It assumes AWS Organizations with all features enabled and a multi-account landing zone underneath.

The reason this is hard is that AWS Backup is an orchestrator, not a storage service — it coordinates snapshots that physically live with each source service (EBS, RDS, DynamoDB, S3, EFS, FSx), wrapped in recovery points that a vault references and a KMS key encrypts. Every one of those layers — the policy that schedules the job, the IAM role the job assumes, the vault that holds the recovery point, the key that encrypts it, the lock that makes it immutable, the copy action that fans it cross-Region, the access policy that lets the copy land, the restore role that reads it back — is a place a real program quietly breaks. A green dashboard tells you jobs ran; it does not tell you a copy landed in the isolated account, that the lock actually hardened, or that a restore would succeed. Those are separate facts, each with its own confirming command.

By the end you will stop trusting the dashboard and start proving the recovery path. You will know exactly which two grants delegated administration takes (and why one without the other gives you a blind policy author), why a vault and role must exist before the first job runs, the one flag that separates compliance-mode Vault Lock from governance mode, the kms:CreateGrant condition that is load-bearing for cross-account restore, and how restore testing converts “we believe under four hours” into “measured 2h41m on the last 11 weekly tests.” Because this is a reference you return to mid-incident — or mid-audit — the policy keys, the lock modes, the KMS grants, the per-service restore mechanics, the limits and the failure modes are all laid out as scannable tables. Read the prose once; keep the tables open when the auditor (or the attacker) shows up.

What problem this solves

Per-account backups feel safe and fail catastrophically. The common pattern — every workload account backs itself up to a local vault, lifecycle is correct, dashboards are green — collapses the moment the account itself is the blast radius. A compromised deployment role, a poisoned CI/CD dependency, or a misconfigured trust policy gives an attacker (or a buggy automation) the same backup:DeleteRecoveryPoint and backup:DeleteBackupVault permissions your operators have. The local backups are gone in minutes, encrypted with a key the same identity can schedule for deletion, and there is no second copy anywhere an attacker cannot reach. You discover this not during a drill but during an incident, which is the worst possible time to learn that your recovery path shared a trust boundary with the thing that just got owned.

What breaks without a centralized, air-gapped program: (1) no immutability, because recovery points are deletable and ransomware deletes them; (2) no isolation, because copies live in the same account/Region and one compromise or one regional event takes both primary and backup; (3) no governance, because each team configures its own backups, retention drifts, and some resources are never protected until someone notices at restore time; (4) no proof, because RTO/RPO are slide-deck numbers rather than measured facts, so the first real restore is also the first test, and it fails on a KMS grant you forgot. The financial and regulatory exposure is brutal: failing an auditor’s “prove the backups could not have met the same fate as production” question can mean a finding, a fine, or a lost certification.

Who hits this: every organization past a handful of accounts. It bites hardest on regulated workloads (finance, healthcare) where a 7-year immutable retention is a legal floor, on teams that grew account-by-account without a landing zone, and on anyone who equates “backup job succeeded” with “I can recover.” The fix is not more backups — it is a recovery path that is a separate trust boundary: a delegated admin authoring policy, immutable vaults, an isolated recovery account in another Region, and restore tests that produce evidence. To frame the whole field before the deep dive, here is every control plane this article builds, the failure it prevents, and the one command that proves it works:

Control plane	What it is	Failure it prevents	First command to prove it
Delegated backup admin	An account that authors org-wide policy	Operating consoles from the org root; blast-radius concentration	`aws organizations list-delegated-administrators --service-principal backup.amazonaws.com`
Organizations backup policy	Tag-targeted JSON attached to OUs	Per-team drift; unprotected resources	`aws organizations describe-effective-policy --policy-type BACKUP_POLICY`
Compliance-mode Vault Lock	WORM immutability until expiry	Ransomware / root deleting recovery points	`aws backup describe-backup-vault --backup-vault-name <vault> --query '{Locked:Locked,LockDate:LockDate}'`
Cross-account/Region copy	`copy_actions` into another account/Region	Account compromise or regional event taking both copies	`aws backup list-recovery-points-by-backup-vault --backup-vault-name airgap-recovery-vault`
Dedicated CMK + key policy	Customer-managed key shareable cross-account	`aws/backup` key cannot be shared; restore denied	`aws kms describe-key --key-id <id> --query 'KeyMetadata.{Manager:KeyManager,MultiRegion:MultiRegion}'`
Recovery-account SCP	Deny delete/key-deletion verbs	The exact door a compromised role walks through	`aws organizations list-policies-for-target --target-id <recovery-account-id> --filter SERVICE_CONTROL_POLICY`
Restore testing	Scheduled real `StartRestoreJob` + validate	RTO/RPO asserted but never measured	`aws backup list-restore-jobs --by-restore-testing-plan-arn <arn>`
Audit Manager framework	Queryable compliance + evidence export	“Are we compliant?” answered by opinion, not data	`aws backup list-frameworks`

Learning objectives

By the end of this article you can:

Design the three-plane account topology (management / delegated admin / isolated recovery) and explain why collapsing any two is a mistake.
Enable AWS Backup org integration end to end: trusted access, the BACKUP_POLICY policy type, delegated administrator registration, and the separate cross-account monitoring grant — and explain what breaks if you do one without the other.
Author an Organizations backup policy in JSON with @@assign inheritance operators, tag-based selection, independent local and copy lifecycles, and a copy_actions fan-out to another account/Region.
Choose between compliance and governance Vault Lock, set --changeable-for-days / min/max retention correctly, and avoid the indefinite-retention foot-gun.
Build the cross-account/cross-Region copy path: destination vault access policy, logically air-gapped vault type, and the kms:CreateGrant with GrantIsForAWSResource condition that makes cross-account restore actually work.
Stand up restore testing (plan + selection + validation), measure empirical RTO/RPO, and export evidence via Audit Manager for an auditor.
Match the protection mode (continuous/PITR vs snapshot) to each service’s RPO, and harden the recovery account with SCPs that close the deletion door the typical incident walks through.
Run a symptom → root cause → confirm → fix playbook for the failure modes that silently break a backup program.

Prerequisites & where this fits

You should already understand the multi-account fundamentals: AWS Organizations with all features enabled, Organizational Units (OUs) as the attachment target for policies, Service Control Policies (SCPs) as the guardrail mechanism, and IAM roles with trust policies and cross-account assumption. You should know how to run the AWS CLI with named profiles (you will switch between the management, delegated-admin, and recovery accounts constantly), read JSON output, and reason about KMS key policies versus IAM policies. Familiarity with at least one source service’s snapshot model (EBS snapshots, RDS automated backups) helps the per-service sections land.

This sits at the top of the Reliability & DR track and assumes the governance layer beneath it. It builds directly on AWS Organizations: SCP Guardrails & Delegated Admin (the delegation and SCP mechanics this whole program rides on) and AWS Control Tower: Guardrails & Multi-Account Foundation (the landing zone that places accounts in OUs). The encryption layer is AWS KMS Deep Dive: Keys, Policies, Envelope Encryption & Rotation, and cross-Region copy specifically wants KMS Multi-Region Keys: Envelope Encryption & Key Policies. It pairs with the strategy-level AWS Backup & Disaster Recovery Strategies and the framing in High Availability vs Disaster Recovery: RTO & RPO. For the threat model that motivates immutability, see Ransomware Resilience: Immutable Backup & Isolated Recovery Environment.

A quick map of which account owns which responsibility, so you run each command from the right place and never operate from the org root by habit:

Concern	Owning account	Why it lives there	What you must NOT do here
Trusted access, policy type, delegate registration	Management	Only the management account can enable integrations	Run day-to-day backup operations
Authoring/attaching backup policies	Delegated admin	Keeps the org root thin; least standing power	Hold the immutable recovery copies
Source backup jobs + local vaults	Each workload account	Backups run where the data is	Share a trust boundary with the air-gap copy
Immutable cross-Region copies	Isolated recovery	Survives a full workload-account compromise	Grant standing human access
KMS key administration	Each vault-owning account	Keys are account- and Region-scoped	Use the `aws/backup` managed key
Restore testing + validation	Recovery account	Exercises the air-gapped copies, not the originals	Skip it and assume the RTO
Compliance evidence + drift	Delegated admin	Org-wide aggregation point	Log into each account to check manually

Core concepts

Six mental models make every later step obvious.

AWS Backup orchestrates; the data lives with the service. A backup plan is a schedule + lifecycle + target vault + selection. When a rule fires, AWS Backup assumes an IAM role in the source account and calls the source service’s snapshot API; the resulting recovery point is a pointer that lives in a backup vault and is encrypted by a KMS key. The vault is a logical container and a permission boundary — it does not “store” bytes the way S3 does; it governs who can read, copy, and delete the recovery points it references. This is why a single over-broad identity that can manage vaults is catastrophic: it controls the pointers to all your recovery data at once.

Three planes, three accounts — do not collapse them. The management account enables the integration and attaches policies to OUs; that is all it does day to day. The delegated backup admin authors policy and monitors jobs org-wide, so you never run backup consoles from the org root. The isolated recovery account receives cross-account copies into an air-gapped vault and is, ideally, in a separate OU with restrictive SCPs and no standing human access. The whole security argument is that the identity which can delete your last copy must not be the identity that just got compromised — and that requires a different account, a different Region, and an immutable lock, all at once.

Delegated administration is two grants, not one. register-delegated-administrator lets the account manage backup policies. You separately enable cross-account monitoring in the AWS Backup console (Settings) so the delegate can see jobs in member accounts. Enable both, or you get a policy author who is blind to execution — a real and common half-configured state.

The vault and role must pre-exist; AWS Backup does not create them. The vault a policy targets and the IAM role a policy references must already exist in every target account and Region before a backup job runs. Bootstrap them with a CloudFormation StackSet (service-managed permissions) targeting the OUs that receive policies, so vault + role land automatically when an account joins the OU. Forget this and describe-effective-policy renders a perfect policy while every job fails with “role/vault not found.”

Immutability is a mode, decided by one flag. Compliance-mode Vault Lock makes recovery points immutable until their lifecycle expires — no IAM principal, not the account root, not AWS, can shorten retention or delete them once the lock hardens. The mode is selected by whether you include --changeable-for-days: include it (with a grace window of ≥3 days) and you get compliance mode; omit it and you get governance mode (removable by sufficiently privileged IAM). The grace window is your only undo — once LockDate passes, the configuration is set in stone.

Restore is its own permission and KMS path. Reading a recovery point back — especially cross-account — requires the restore role to use the destination KMS key, and AWS Backup creates grants on your behalf during restore. The kms:CreateGrant permission with the kms:GrantIsForAWSResource condition is load-bearing; without it the restore fails with a KMS key cannot be accessed error that is maddening to debug because the recovery point is intact and the lock is fine.

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary repeats these for lookup; this is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to recovery
Backup plan	Schedule + lifecycle + vault + selection	Account (or rendered from org policy)	The thing that runs — or silently doesn’t
Organizations backup policy	Tag-targeted JSON attached to an OU	Management → OU; authored in delegate	One policy protects whatever a team tags
Backup vault	Logical container + permission boundary	Per account/Region	Governs read/copy/delete of recovery points
Logically air-gapped (LAG) vault	Isolated vault type for cross-acct share/restore	Recovery account	The copy that survives a full compromise
Recovery point	Encrypted pointer to a snapshot	In a vault	The thing you restore (or an attacker deletes)
Vault Lock	WORM immutability config	On a vault	Compliance mode = even root can’t delete
`changeable_for_days`	Grace window before lock hardens	Lock config	Your only undo; min 3 days
`copy_actions`	Fan-out copy to another vault	On a rule	Cross-account/Region air-gap mechanism
CMK + key policy	Customer-managed key, shareable	Per vault account/Region	`aws/backup` can’t be shared cross-account
`kms:CreateGrant`	Permits AWS Backup to grant key use	Key policy	Load-bearing for cross-account restore
Restore testing plan	Scheduled real restore + teardown	Recovery account	Turns RTO/RPO into measured evidence
Audit Manager framework	Controls + evidence export	Delegated admin	“Compliant” becomes queryable + auditable
RTO / RPO	Recovery time / recovery point objective	Measured, not asserted	The numbers the business actually buys

The account topology and where policy lives

Three planes, three accounts. Do not collapse them. The management account stays thin — it enables the integration and attaches policies; day-to-day backup operations live in the delegated admin so you are not running consoles from the org root, and the immutable copies live somewhere neither of those can reach.

Plane	Account	Responsibility	Standing access
Policy authoring	Delegated backup admin	Owns Organizations backup policies, monitors jobs org-wide	Backup operators (least power)
Org control	Management account	Enables trusted access, registers the delegate, attaches policies to OUs	Org admins only; no daily ops
Recovery (air-gap)	Isolated recovery account	Receives cross-account copies into an air-gapped vault; separate OU, restrictive SCPs	None (break-glass only)

# Run in the MANAGEMENT account.
# 1) Trusted access so AWS Backup can act across the org.
aws organizations enable-aws-service-access \
  --service-principal backup.amazonaws.com

# 2) Turn on the BACKUP_POLICY policy type (returns PolicyTypeAlreadyEnabledException if it is already on; safe to ignore).
ROOT_ID=$(aws organizations list-roots --query 'Roots[0].Id' --output text)
aws organizations enable-policy-type \
  --root-id "$ROOT_ID" \
  --policy-type BACKUP_POLICY

# 3) Register the backup admin account as delegated administrator.
aws organizations register-delegated-administrator \
  --account-id 222222222222 \
  --service-principal backup.amazonaws.com

The same enablement expressed as Terraform, so the org integration is reviewable in a PR rather than clicked once and forgotten:

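resource "aws_organizations_organization" "this" {
  aws_service_access_principals = ["backup.amazonaws.com"]
  enabled_policy_types          = ["BACKUP_POLICY", "SERVICE_CONTROL_POLICY"]
  feature_set                   = "ALL"
}

# Step 3 of the CLI sequence: register the backup admin account as the delegate.
resource "aws_organizations_delegated_administrator" "backup" {
  account_id        = "222222222222"
  service_principal = "backup.amazonaws.com"
}

# Cross-account backup (copy) opt-in; apply with management-account credentials.
resource "aws_backup_global_settings" "delegated" {
  global_settings = { "isCrossAccountBackupEnabled" = "true" }
}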

Delegated administration in AWS Backup is two grants, not one. register-delegated-administrator lets the account manage backup policies; you separately enable cross-account monitoring in the AWS Backup console (Settings) so the delegate can see jobs in member accounts. Enable both, or you get a policy author who is blind to execution.

One prerequisite that bites everyone: the vault and the IAM role a policy references must already exist in every target account and Region before a backup job runs — AWS Backup does not create them. Bootstrap them with a CloudFormation StackSet (service-managed permissions) targeting the OUs that receive policies, so vault + role land automatically when an account joins.
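
A minimal sketch of that pre-flight check, assuming you have already gathered an inventory of existing vaults and roles per account (for example from `aws backup list-backup-vaults` and `aws iam get-role` in each member account). The function name `missing_prereqs`, the tuple shapes, and the sample account IDs are illustrative assumptions, not part of any AWS API.

```python
def missing_prereqs(accounts, regions, vaults, roles,
                    vault_name="central-backup-vault",
                    role_name="AWSBackupCentralRole"):
    """Return (account, region, what) tuples that a backup job would trip over.

    vaults: set of (account, region, vault_name) tuples that exist.
    roles:  set of (account, role_name) tuples that exist (IAM roles are global).
    """
    gaps = []
    for account in accounts:
        # The role is checked once per account because IAM is not Region-scoped.
        if (account, role_name) not in roles:
            gaps.append((account, "*", f"role:{role_name}"))
        # The vault must exist in every Region the policy lists.
        for region in regions:
            if (account, region, vault_name) not in vaults:
                gaps.append((account, region, f"vault:{vault_name}"))
    return gaps


# Hypothetical inventory: the second account is missing its role and one vault.
example_gaps = missing_prereqs(
    accounts=["111111111111", "333333333333"],
    regions=["ap-south-1", "us-east-1"],
    vaults={
        ("111111111111", "ap-south-1", "central-backup-vault"),
        ("111111111111", "us-east-1", "central-backup-vault"),
        ("333333333333", "ap-south-1", "central-backup-vault"),
    },
    roles={("111111111111", "AWSBackupCentralRole")},
)
```

An empty result means every target account and Region has what the policy references; each tuple returned is a job that would fail until the StackSet lands there.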

The full enablement checklist, in dependency order — each row gates the next, and skipping one produces a specific, confusing symptom:

#	Step	Account	Command / setting	Symptom if skipped
1	Org all-features enabled	Management	`describe-organization` → `FeatureSet: ALL`	Policies cannot be enabled at all
2	Trusted access for Backup	Management	`enable-aws-service-access`	Org policies have no effect
3	`BACKUP_POLICY` policy type on root	Management	`enable-policy-type`	`create-policy --type BACKUP_POLICY` errors
4	Register delegated admin	Management	`register-delegated-administrator`	Delegate cannot author org policies
5	Cross-account backup enabled	Management	`isCrossAccountBackupEnabled=true`	Cross-account copies are rejected
6	Cross-account monitoring enabled	Management (console Settings)	toggle in AWS Backup → Settings	Delegate is blind to member-account jobs
7	Vault + role StackSet to OUs	Management → OUs	service-managed StackSet	Jobs fail “role/vault not found”
8	Attach backup policy to prod OU	Delegate / management	`attach-policy`	Nothing is protected
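
The dependency order in the table can be sketched as a small lookup that reports the first unmet step and the symptom to expect. The state keys (`all_features`, `trusted_access`, and so on) are hypothetical names for facts you would gather yourself with the commands in the table; nothing here calls AWS.

```python
# Steps in dependency order, paired with the symptom the table lists when skipped.
ENABLEMENT_STEPS = [
    ("all_features", "Policies cannot be enabled at all"),
    ("trusted_access", "Org policies have no effect"),
    ("backup_policy_type", "create-policy --type BACKUP_POLICY errors"),
    ("delegated_admin", "Delegate cannot author org policies"),
    ("cross_account_backup", "Cross-account copies are rejected"),
    ("cross_account_monitoring", "Delegate is blind to member-account jobs"),
    ("vault_role_stackset", "Jobs fail with role/vault not found"),
    ("policy_attached", "Nothing is protected"),
]


def first_gap(state):
    """Return (step_number, key, symptom) for the first unmet step, or None."""
    for number, (key, symptom) in enumerate(ENABLEMENT_STEPS, start=1):
        if not state.get(key, False):
            return number, key, symptom
    return None
```

Because each row gates the next, fixing the first gap and re-running is usually more productive than chasing a later symptom.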

The two cross-account toggles confuse everyone because they sound identical. Keep them straight:

Setting	What it enables	Where you set it	Without it
`isCrossAccountBackupEnabled` (backup copy)	Recovery points may be copied across accounts	Global settings (management account only)	`copy_actions` cross-account fails
Cross-account monitoring	The delegate can view jobs in member accounts	AWS Backup console → Settings (management account)	A blind policy author; jobs invisible org-wide

Defining backup policies with tag-based selection

An Organizations backup policy is JSON with three required keys per plan — regions, rules, selections — using declarative-policy inheritance operators (@@assign sets a value; child policies merge with or override parents). You target resources by tag, not ARN, so the same policy protects whatever a team launches as long as they tag it. This is the single most important design choice in the whole program: tag-targeting means coverage scales automatically as teams ship, instead of someone remembering to add each new ARN to a selection.

Save the following as tier1-backup-policy.json. It runs daily, writes to the local vault, and fans out a copy to the recovery account in another Region. copy_actions is keyed by the destination vault ARN; $account / $region are placeholders AWS resolves per member account.

{
  "plans": {
    "Tier1_Daily": {
      "regions": { "@@assign": ["ap-south-1", "us-east-1"] },
      "rules": {
        "DailySnapshot": {
          "schedule_expression": { "@@assign": "cron(0 18 ? * * *)" },
          "start_backup_window_minutes": { "@@assign": "60" },
          "complete_backup_window_minutes": { "@@assign": "10080" },
          "target_backup_vault_name": { "@@assign": "central-backup-vault" },
          "lifecycle": {
            "move_to_cold_storage_after_days": { "@@assign": "30" },
            "delete_after_days": { "@@assign": "365" }
          },
          "copy_actions": {
            "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault": {
              "target_backup_vault_arn": {
                "@@assign": "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
              },
              "lifecycle": {
                "move_to_cold_storage_after_days": { "@@assign": "30" },
                "delete_after_days": { "@@assign": "2555" }
              }
            }
          }
        }
      },
      "selections": {
        "tags": {
          "tier1": {
            "iam_role_arn": { "@@assign": "arn:aws:iam::$account:role/AWSBackupCentralRole" },
            "tag_key": { "@@assign": "backup-tier" },
            "tag_value": { "@@assign": ["tier1"] }
          }
        }
      }
    }
  }
}
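
Before uploading, a quick offline shape check can catch a missing key. The sketch below is a lint under stated assumptions, not the Organizations validator: it only confirms the three per-plan keys named above and that each `copy_actions` key equals the ARN assigned inside it. Treating a mismatch as an error is this sketch's assumption, based on the example where the two are identical. `plan_problems` is a hypothetical helper name.

```python
REQUIRED_PLAN_KEYS = ("regions", "rules", "selections")


def plan_problems(policy):
    """List human-readable problems; an empty list means the basic shape is fine."""
    problems = []
    for plan_name, plan in policy.get("plans", {}).items():
        for key in REQUIRED_PLAN_KEYS:
            if key not in plan:
                problems.append(f"{plan_name}: missing '{key}'")
        for rule_name, rule in plan.get("rules", {}).items():
            # copy_actions is keyed by the destination vault ARN; compare the
            # key with the value assigned to target_backup_vault_arn.
            for arn_key, action in rule.get("copy_actions", {}).items():
                assigned = action.get("target_backup_vault_arn", {}).get("@@assign")
                if assigned != arn_key:
                    problems.append(
                        f"{plan_name}/{rule_name}: copy_actions key != target ARN"
                    )
    return problems
```

Loading `tier1-backup-policy.json` with `json.load` and passing the result to `plan_problems` should return an empty list for the policy above.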

Create the policy in the delegated admin account and attach it to the OU that holds your production workloads:

POLICY_ID=$(aws organizations create-policy \
  --name "Tier1-Daily-Backup" \
  --description "Tier 1 daily backup with cross-Region air-gap copy" \
  --type BACKUP_POLICY \
  --content file://tier1-backup-policy.json \
  --query 'Policy.PolicySummary.Id' --output text)

aws organizations attach-policy \
  --policy-id "$POLICY_ID" \
  --target-id ou-abcd-prodaccts   # the production OU

The same plan expressed as a per-account Terraform resource (useful for accounts outside the org policy, or to model what the org policy renders into):

resource "aws_backup_plan" "tier1" {
  name = "Tier1_Daily"
  rule {
    rule_name         = "DailySnapshot"
    target_vault_name = aws_backup_vault.central.name
    schedule          = "cron(0 18 ? * * *)"
    start_window      = 60
    completion_window = 10080
    # Local lifecycle: cold after 30 days, deleted after one year.
    lifecycle {
      cold_storage_after = 30
      delete_after       = 365
    }
    copy_action {
      destination_vault_arn = "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
      # Copy lifecycle is independent: 7-year retention in the air-gap vault.
      lifecycle {
        cold_storage_after = 30
        delete_after       = 2555
      }
    }
  }
}

Every policy key, end to end

The policy schema is small but unforgiving — a wrong key name silently no-ops, and lifecycle math has hard floors. Here is every field that matters, what it does, and the gotcha:

Key	Scope	Values / type	Default	Gotcha / limit
`regions`	plan	list of Region IDs	required	Policy only acts in listed Regions
`schedule_expression`	rule	`cron(...)` UTC	required	Cron is UTC; `?` for day-of-week OR day-of-month
`start_backup_window_minutes`	rule	minutes	480 (8 h) if omitted; minimum 60	Job is cancelled if it can’t start in the window
`complete_backup_window_minutes`	rule	minutes	derived	Must exceed start window; long backups need headroom
`target_backup_vault_name`	rule	vault name (must pre-exist)	required	Name only; the account is implied (the source acct)
`lifecycle.move_to_cold_storage_after_days`	rule/copy	days	none	Cold storage is min 90-day commit; not all services
`lifecycle.delete_after_days`	rule/copy	days	none	Must be ≥ `move_to_cold_storage_after_days` + 90
`enable_continuous_backup`	rule	true/false	false	Enables PITR; hard 35-day retention ceiling
`copy_actions.<vaultArn>`	rule	keyed by dest vault ARN	none	Independent lifecycle from the local rule
`copy_actions.*.target_backup_vault_arn`	copy	full ARN (acct+Region)	required	This is what makes it cross-account/Region
`selections.tags.<name>.tag_key`	selection	string	required	Case-sensitive; must match the resource tag exactly
`selections.tags.<name>.tag_value`	selection	list of strings	required	Any value in the list matches (OR)
`selections.tags.<name>.iam_role_arn`	selection	role ARN with `$account`	required	Role must exist in every target account

The inheritance operators are the part people get wrong when policies stack on nested OUs. What each does:

Operator	Effect	Use when	Trap
`@@assign`	Set the value (replace)	Setting a concrete value	A child `@@assign` overrides the parent’s
`@@append`	Add to a list (merge)	Adding Regions/tags without losing parents	Only valid on list-typed values
`@@remove`	Remove from a merged list	Excluding an inherited value	Order of evaluation matters on deep OU trees
(none) child key	Merge child into parent map	Adding a new rule alongside inherited ones	Duplicate rule names collide
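
A simplified model of how the three operators combine on a list-typed value such as `regions`. It is a sketch for building intuition only: the real Organizations evaluation has further rules (for example, parents can restrict which operators children may use) that this does not model. `merge_list` is a hypothetical helper.

```python
def merge_list(parent, child_ops):
    """Apply a child's operators to a parent's list value.

    parent:    list inherited from the parent policy (may be empty).
    child_ops: dict with any of '@@assign', '@@append', '@@remove'.
    """
    if "@@assign" in child_ops:
        # Assign replaces whatever was inherited.
        result = list(child_ops["@@assign"])
    else:
        result = list(parent)
    # Append adds values without dropping the inherited ones.
    for value in child_ops.get("@@append", []):
        if value not in result:
            result.append(value)
    # Remove strips values from the merged list.
    removals = set(child_ops.get("@@remove", []))
    return [value for value in result if value not in removals]
```

For instance, a child OU that appends `eu-west-1` keeps both parent Regions, while a child that assigns `us-west-2` ends up with only that Region.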

A few rules that keep this correct:

Cold-storage minimum is a 90-day commitment. delete_after_days must be at least move_to_cold_storage_after_days + 90. Above, the copy retains 7 years (2555 days) — set this to your regulatory floor.
selections is either tags or resources, not both at the top level. Use the resources block with a conditions clause if you need resource-type plus tag logic.
Local-rule and copy_action lifecycles are independent — the common pattern is short local retention (fast, cheap restore) and long air-gapped retention (compliance + ransomware survivability).
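
The cold-storage rule is simple arithmetic, sketched below so it can be checked before a policy is attached. `lifecycle_ok` is a hypothetical helper; the 90-day constant is the minimum stated above.

```python
COLD_STORAGE_MIN_DAYS = 90  # minimum time a recovery point must stay in cold storage


def lifecycle_ok(delete_after_days, move_to_cold_after_days=None):
    """Check delete_after_days >= move_to_cold_storage_after_days + 90.

    With no cold-storage transition, any positive retention passes.
    """
    if move_to_cold_after_days is None:
        return delete_after_days >= 1
    return delete_after_days >= move_to_cold_after_days + COLD_STORAGE_MIN_DAYS
```

With the values in the policy above, the local rule (30 and 365) and the copy (30 and 2555) both pass, while a 100-day retention with a 30-day cold transition would not, because 30 + 90 = 120.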

Schedule-and-window sizing trips up long-running backups; the relationship between the three time fields:

Field	Meaning	Too small →	Sensible value
`schedule_expression`	When the job is eligible to start	n/a	Off-peak, UTC (e.g. `cron(0 18 ? * * *)`)
`start_backup_window_minutes`	Grace to begin before cancel	Job cancelled under contention	60 (raise to 480 for busy accounts)
`complete_backup_window_minutes`	Total time allowed to finish	Large RDS/EFS jobs killed mid-run	10080 (7 days) for big datasets

Vault Lock in compliance mode (WORM immutability)

A vault you can delete is a vault an attacker can delete. Compliance-mode Vault Lock makes recovery points immutable until their lifecycle expires — no IAM principal, not the account root, not AWS, can shorten retention or delete the recovery point once the lock hardens. This is the control that turns “we have backups” into “the backups will still be there after the breach.”

The mode is decided by one flag. Including --changeable-for-days selects compliance mode; omitting it gives you governance mode (removable by sufficiently privileged IAM). The value is the cooling-off grace time: during it you can still adjust or delete the lock to fix mistakes. Minimum is 3 days (72 hours).

# Run in EACH account/Region that owns a vault (member accounts + recovery account).
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name central-backup-vault \
  --changeable-for-days 3 \
  --min-retention-days 30 \
  --max-retention-days 2555

resource "aws_backup_vault_lock_configuration" "central" {
  backup_vault_name   = aws_backup_vault.central.name
  changeable_for_days = 3      # presence => COMPLIANCE mode; omit => governance
  min_retention_days  = 30
  max_retention_days  = 2555
}

Treat the 3-day grace window as your only undo. Once LockDate passes, the configuration is set in stone. The classic foot-gun: a recovery point with retention set to “Always” inside a locked compliance vault is then un-deletable forever and bills forever. Never combine indefinite retention with compliance Vault Lock.

Verify the lock actually hardened — Locked: true plus a past LockDate is the only state that matters:

aws backup describe-backup-vault \
  --backup-vault-name central-backup-vault \
  --query '{Locked:Locked, LockDate:LockDate, Min:MinRetentionDays, Max:MaxRetentionDays}'

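Apply the same lock to airgap-recovery-vault in the recovery account if it is a standard vault; a logically air-gapped vault is created with a compliance-mode lock already in place, with its minimum and maximum retention set at creation. That vault is the one that actually has to survive a credential compromise in the workload accounts.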

Compliance vs governance, decided correctly

The two modes look similar in the API and are wildly different in consequence. Side by side:

Dimension	Governance mode	Compliance mode
Selected by	Omit `--changeable-for-days`	Include `--changeable-for-days` (≥3)
Who can delete RPs early	Sufficiently privileged IAM (with `backup:*` overrides)	Nobody — not IAM, not root, not AWS
Lock removable after hardening	Yes (by privileged IAM)	No — permanent until RP expires
Grace window	n/a	3+ days, then `LockDate` is final
Good for	Operational guardrail, accidental-delete prevention	Ransomware survivability, regulatory WORM
Foot-gun	Privileged role can still purge	“Always”-retention RP bills forever
Reversal cost	Change the config any time	The entire account must be closed to escape

The lock-configuration parameters and their bounds:

Parameter	Meaning	Min	Max	Gotcha
`min_retention_days`	Floor on every RP’s retention	1	—	RP retention shorter than this is rejected
`max_retention_days`	Ceiling on every RP’s retention	—	36500	RP retention longer than this is rejected; blocks “Always”
`changeable_for_days`	Grace before hardening (compliance)	3	36500	Presence is what selects compliance mode
`LockDate`	When the lock becomes immutable	—	—	Past `LockDate` + `Locked:true` = done
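
The bounds in the table, and the rule that the presence of the grace window selects the mode, can be sketched as a small validator. `lock_mode` is a hypothetical helper; the check that the minimum does not exceed the maximum is a sanity check added by this sketch rather than something stated in the table.

```python
def lock_mode(min_retention_days, max_retention_days, changeable_for_days=None):
    """Return 'compliance' or 'governance'; raise ValueError on out-of-range input."""
    if min_retention_days < 1:
        raise ValueError("min_retention_days must be at least 1")
    if max_retention_days > 36500:
        raise ValueError("max_retention_days must be at most 36500")
    if min_retention_days > max_retention_days:
        raise ValueError("min_retention_days exceeds max_retention_days")
    if changeable_for_days is None:
        # No grace window supplied: the lock is governance mode.
        return "governance"
    if not 3 <= changeable_for_days <= 36500:
        raise ValueError("changeable_for_days must be between 3 and 36500")
    # Grace window supplied: the lock hardens into compliance mode.
    return "compliance"
```

Running this against a planned configuration before calling `put-backup-vault-lock-configuration` makes the mode explicit instead of implied by a missing flag.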

The lock-state truth table — only one row means “you are actually protected”:

`Locked`	`LockDate`	Mode	Meaning	Action
false	absent	none	No lock at all	Configure a lock
false	future	compliance (cooling)	In grace window; still mutable	Wait, or fix now while you still can
true	future	compliance (cooling)	Grace running; not yet immutable	Verify config before `LockDate`
true	past	compliance	Immutable — the only safe state	Done; verify periodically
true	n/a	governance	Locked but IAM-removable	Acceptable only for non-ransomware use
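
The truth table can be sketched as a classifier over the two fields. This assumes the `describe-backup-vault` output has already been parsed into a dict with `LockDate` as a timezone-aware `datetime` (the CLI returns a timestamp you would need to convert first); `lock_state` and its return labels are hypothetical names.

```python
from datetime import datetime, timezone


def lock_state(vault, now=None):
    """Map Locked/LockDate onto the meaning column of the truth table."""
    now = now or datetime.now(timezone.utc)
    locked = vault.get("Locked", False)
    lock_date = vault.get("LockDate")
    if lock_date is None:
        # No LockDate: either no lock at all, or a governance-mode lock.
        return "governance" if locked else "unlocked"
    if lock_date > now:
        # Grace window still running; the configuration is still mutable.
        return "compliance-cooling"
    # LockDate has passed; only Locked=true is the protected state.
    return "compliance-immutable" if locked else "unknown"
```

Only `compliance-immutable` corresponds to the protected row; anything else is a vault to follow up on.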

Cross-account, cross-Region copy into an air-gapped account

The copy_actions block in the policy already routes copies to 444444444444 in us-west-2. For that copy to land, two backstops must be in place — and when copies silently fail to appear, it is almost always one of these two.

(a) Vault access policy on the destination — a resource policy that permits the source org/account to copy in:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOrgCopyInto",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "backup:CopyIntoBackupVault",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:PrincipalOrgID": "o-exampleorgid" }
      }
    }
  ]
}

aws backup put-backup-vault-access-policy \
  --backup-vault-name airgap-recovery-vault \
  --policy file://airgap-vault-policy.json   # run in account 444444444444

(b) Logically air-gapped vault, not a standard vault. A logically air-gapped (LAG) vault is a distinct AWS Backup vault type supporting direct sharing for cross-account/cross-Region restore without you hand-rolling key policy on the restore side. Target it on a rule via target_logically_air_gapped_backup_vault_arn. As of 2025 these vaults support customer-managed KMS keys and can receive primary backups directly. Pair that isolation with restrictive SCPs on the recovery OU (deny backup:DeleteRecoveryPoint, backup:DeleteBackupVault, kms:ScheduleKeyDeletion) and no standing human role, and the copy survives even if every workload account is owned.

Standard vault vs logically air-gapped vault

Choosing the wrong vault type for the recovery target is a quiet design error — a standard vault works but forces you to hand-roll restore-side key policy and cross-account sharing. The comparison:

Dimension	Standard backup vault	Logically air-gapped (LAG) vault
Primary purpose	Local recovery points	Isolated copy/recovery target
Cross-account restore sharing	Copy back to the restoring account; you hand-roll key policy	Built-in direct sharing (via AWS RAM)
KMS	Your CMK (or `aws/backup`)	Customer-managed KMS (as of 2025)
Receives primary backups	Yes	Yes (2025+)
Vault Lock	Compliance/governance (you apply it)	Compliance mode built in (min/max retention set at creation)
Best role	Source-account local vault	The air-gap recovery copy
Indexing/search	Standard	Enhanced recovery-point search

The copy path has multiple independent gates; when a copy does not appear, walk them in this order:

#	Gate	Check (run where)	Failure symptom
1	Cross-account backup enabled	Global settings (mgmt) `isCrossAccountBackupEnabled=true`	Copy job rejected immediately
2	Dest vault access policy	`get-backup-vault-access-policy` (recovery acct)	“Access denied” on copy
3	`PrincipalOrgID` matches	Compare org ID in policy vs `describe-organization`	Copy denied despite a policy
4	Recovery SCP not over-broad	`list-policies-for-target --filter SERVICE_CONTROL_POLICY` (mgmt acct; target the recovery acct and its parent OUs)	SCP blocks `CopyIntoBackupVault`
5	Dest KMS key usable by Backup	Dest key policy has `backup.amazonaws.com` use	Copy fails with KMS error
6	Destination Region matches ARN	Region in `target_backup_vault_arn`	Copy goes nowhere / errors
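
Gates 2 and 3 can be checked offline against the JSON that `get-backup-vault-access-policy` returns. The sketch below is a simplification under stated assumptions: it looks only for an Allow statement with an exact `StringEquals` match on `aws:PrincipalOrgID`, and it ignores Deny statements, principal scoping, and case-insensitive condition keys. The function names are hypothetical.

```python
import json
from typing import Any


def _as_list(value: Any) -> list:
    """IAM JSON allows a scalar or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def copy_in_allowed_for_org(policy_json: str, org_id: str) -> bool:
    """True if an Allow statement grants backup:CopyIntoBackupVault
    conditioned on aws:PrincipalOrgID == org_id."""
    policy = json.loads(policy_json)
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        if "backup:CopyIntoBackupVault" not in actions and "backup:*" not in actions:
            continue
        string_equals = stmt.get("Condition", {}).get("StringEquals", {})
        if org_id in _as_list(string_equals.get("aws:PrincipalOrgID")):
            return True
    return False
```

Feed it the policy document and the ID from `describe-organization`; a `False` result points at gate 2 (no statement) or gate 3 (wrong org ID) before you start reading copy-job errors.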

The copy lifecycle is independent of the local lifecycle — a deliberate, important asymmetry:

Tier	Local vault	Air-gap copy	Rationale
Hot (restore speed)	30 days warm	30 days warm	Fast operational restores
Cold (cost)	move to cold @30d	move to cold @30d	Cut storage cost on aging RPs
Long-term (compliance)	delete @365d	delete @2555d (7yr)	Air-gap carries the regulatory floor
Mode	governance OK	compliance lock	The copy must survive a compromise
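
One constraint worth checking before a lifecycle ships: AWS Backup keeps a recovery point in cold storage for at least 90 days, so `delete_after_days` must be at least 90 days later than the move-to-cold day. A minimal validation sketch follows; the function name and message wording are hypothetical, and the values in the usage note come from the table above.

```python
from typing import List, Optional

MIN_COLD_DAYS = 90  # minimum time a recovery point must stay in cold storage


def lifecycle_problems(move_to_cold_after_days: Optional[int],
                       delete_after_days: Optional[int]) -> List[str]:
    """Return human-readable problems with a lifecycle; empty list means OK."""
    problems = []
    if delete_after_days is None:
        # No expiry is the "Always" retention foot-gun in a locked vault.
        problems.append("no delete_after_days: recovery point is retained indefinitely")
    elif (move_to_cold_after_days is not None
          and delete_after_days < move_to_cold_after_days + MIN_COLD_DAYS):
        problems.append(
            f"delete_after_days={delete_after_days} is less than "
            f"{move_to_cold_after_days} + {MIN_COLD_DAYS} days in cold storage"
        )
    return problems
```

Both rows of the table pass: `lifecycle_problems(30, 365)` and `lifecycle_problems(30, 2555)` return an empty list, while a 30-day cold transition with a 100-day delete does not.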

KMS keys and key policies for cross-account restore

Encrypt the central and recovery vaults with dedicated customer-managed keys — never the AWS-managed aws/backup key, which you cannot share cross-account. The restore side needs to use the key; encode that in the key policy in the recovery account. This section is where most “the backup is fine but the restore fails” incidents are born, so the key policy below is worth reading line by line.

{
  "Version": "2012-10-17",
  "Id": "airgap-backup-key",
  "Statement": [
    {
      "Sid": "KeyAdmins",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444444444444:role/KeyAdmin" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowBackupServiceUse",
      "Effect": "Allow",
      "Principal": { "Service": "backup.amazonaws.com" },
      "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey", "kms:CreateGrant"],
      "Resource": "*"
    },
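    {
      "Sid": "AllowRestoreRoleDecrypt",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444444444444:role/RecoveryRestoreRole" },
      "Action": ["kms:Decrypt", "kms:DescribeKey"],
      "Resource": "*"
    },
    {
      "Sid": "AllowRestoreRoleGrantsForAWSResources",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444444444444:role/RecoveryRestoreRole" },
      "Action": "kms:CreateGrant",
      "Resource": "*",
      "Condition": {
        "Bool": { "kms:GrantIsForAWSResource": "true" }
      }
    }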
  ]
}

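The kms:CreateGrant permission, scoped by the GrantIsForAWSResource condition, is load-bearing: AWS Backup creates grants on your behalf during restore, and without it the restore fails with a “KMS key cannot be accessed” error that is maddening to debug. Attach the condition only to grant actions; kms:GrantIsForAWSResource is absent from Decrypt and DescribeKey requests, so a Bool test on it never matches for them. When copying across Regions, a multi-Region KMS key avoids re-wrapping data keys on every hop.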

The KMS actions the recovery path needs, and which principal needs each

Every KMS Action here is required by a specific actor at a specific moment. Map them so you can reason about a deny precisely:

Action	Who needs it	When	Symptom if missing
`kms:GenerateDataKey`	`backup.amazonaws.com`	Encrypting a new RP/copy	Backup/copy job fails with KMS error
`kms:Decrypt`	Backup service + restore role	Reading an RP back	Restore fails “cannot be accessed”
`kms:DescribeKey`	Backup service + restore role	Resolving key metadata	Job can’t validate the key
`kms:CreateGrant`	Backup service + restore role	Backup grants key use mid-restore	Restore fails — the classic bug
`kms:ReEncrypt*`	Backup service (cross-Region, single-Region key)	Re-wrapping data keys per hop	Cross-Region copy slow/failing
`kms:*` (admin)	`KeyAdmin` role only	Key lifecycle/rotation	Cannot manage the key

Why the GrantIsForAWSResource condition matters and how to scope it safely:

Aspect	Detail
What it does	Restricts `CreateGrant` to grants AWS services create on your behalf
Why it’s needed	AWS Backup creates a transient grant to decrypt during restore
Risk without the condition	A broad `CreateGrant` lets a principal grant key use arbitrarily
Risk with the condition	Minimal — only AWS-resource-scoped grants are permitted
Confirm a denial	Restore job `StatusMessage`/`AbortReason` cites KMS access
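
Because `kms:GrantIsForAWSResource` is only present on grant requests (`CreateGrant`, `ListGrants`, `RevokeGrant`), a `Bool` condition on it silently disables every other action in the same statement. The lint below is a minimal sketch that flags that pattern in a key policy already loaded as a dict; the function name and finding format are hypothetical, and it does not evaluate the policy the way KMS does.

```python
from typing import Any, List

# Actions for which KMS populates kms:GrantIsForAWSResource.
GRANT_ACTIONS = {"kms:CreateGrant", "kms:ListGrants", "kms:RevokeGrant"}


def _as_list(value: Any) -> list:
    """IAM JSON allows a scalar or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def grant_condition_findings(key_policy: dict) -> List[str]:
    """Flag statements where the grant-only condition also gates non-grant actions."""
    findings = []
    for stmt in _as_list(key_policy.get("Statement")):
        bool_cond = stmt.get("Condition", {}).get("Bool", {})
        if "kms:GrantIsForAWSResource" not in bool_cond:
            continue
        stray = [a for a in _as_list(stmt.get("Action")) if a not in GRANT_ACTIONS]
        if stray:
            sid = stmt.get("Sid", "<no Sid>")
            findings.append(f"{sid}: condition also gates {', '.join(stray)}")
    return findings
```

An empty result means every conditioned statement contains grant actions only; any finding names the statement whose `Decrypt` or `DescribeKey` would never match.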

Single-Region vs multi-Region key for cross-Region copy — the choice changes the per-hop cost and the failure surface:

Factor	Single-Region CMK per Region	Multi-Region key (MRK)
Cross-Region copy mechanics	Re-encrypt data key on each hop (`ReEncrypt*`)	Same key material; no re-wrap
Key policy management	Two policies to keep in sync	One logical key, replicated
Failure surface	More moving parts per hop	Fewer; simpler restore
Cost	Two keys, re-encrypt ops	Replica key per Region
When to choose	Strict per-Region key isolation required	Cross-Region copy/restore at scale

The dedicated CMK vs the managed key — the single non-negotiable rule:

Key	Shareable cross-account	Custom key policy	Use for vaults?
`aws/backup` (AWS-managed)	No	No	Never for the recovery program
Dedicated CMK (customer-managed)	Yes	Yes	Always — central and air-gap vaults

Per-service recovery: continuous vs snapshot

Restore mechanics differ by service. Match the protection mode to the workload’s RPO — continuous/PITR for low-RPO transactional data, snapshot for everything else and for long retention.

Service	PITR / continuous?	Restore granularity	Recovery notes
RDS / Aurora	Yes (continuous)	Any second within 35 days	`enable_continuous_backup: true`; else snapshot → new instance/cluster
DynamoDB	Yes (PITR)	Second-level within 35 days	AWS Backup also does full-table snapshots for longer retention
S3	Yes (continuous)	Point-in-time + item-level	Continuous backup enables object/version-level restore
EFS	Snapshot	Full or item-level files	Item-level file restore from recovery points
EBS	Snapshot	New volume	Restores as a new volume; EC2/AMI restore blocked if AMI disabled
FSx	Snapshot	Full file system	Per-file-system recovery point
EC2 (AMI)	Snapshot	New instance	Depends on the underlying AMI being enabled
VMware / on-prem	Snapshot	Full VM	Via AWS Backup gateway

To enable point-in-time recovery in a policy, set it on the rule:

"rules": {
  "ContinuousTier1": {
    "schedule_expression": { "@@assign": "cron(0 */1 ? * * *)" },
    "enable_continuous_backup": { "@@assign": "true" },
    "target_backup_vault_name": { "@@assign": "central-backup-vault" },
    "lifecycle": { "delete_after_days": { "@@assign": "35" } }
  }
}

PITR-enabled (continuous) recovery points have a hard 35-day retention ceiling. They are your low-RPO tier; the daily snapshot rule with long air-gapped retention is your long-term and ransomware tier. Run both rules in the same plan.

One trap worth flagging: EC2/EBS restores fail if the underlying AMI has been disabled — the recovery point is intact and the lock holds, but the resource is not restorable. Block ec2:DisableImage via SCP on production OUs so an attacker cannot soft-brick your restore path without deleting anything.

Matching RPO to the protection mode

The protection mode you pick is your RPO floor — you cannot restore to a point finer than your backups capture. The decision grid:

If the workload needs…	Use	RPO achieved	Retention ceiling	Cost shape
Sub-minute recovery (transactional DB)	Continuous/PITR (RDS, Dynamo, S3)	seconds	35 days	Higher (continuous capture)
Hourly recovery, long retention	Snapshot every hour	~1 hour	up to 100 yr	Per-snapshot storage
Daily, compliance/ransomware	Daily snapshot + air-gap copy	~24 hours	regulatory floor	Cold storage cheapest
Both low-RPO and long retention	Continuous and daily snapshot in one plan	seconds + 24h	35d + long	Two tiers, two costs
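
The grid above reduces to a short decision function. This is a sketch only: the one-hour and 24-hour thresholds and the rule labels are assumptions chosen to mirror the four rows, not AWS-defined values; the 35-day ceiling is the PITR limit stated earlier.

```python
from typing import List

PITR_RETENTION_CEILING_DAYS = 35  # continuous recovery points cannot be kept longer


def pick_rules(rpo_seconds: int, retention_days: int) -> List[str]:
    """Suggest which rule types a plan needs for a target RPO and retention."""
    rules = []
    if rpo_seconds < 3600:
        # Finer than hourly: only continuous/PITR can deliver it.
        rules.append("continuous")
        if retention_days > PITR_RETENTION_CEILING_DAYS:
            # PITR cannot carry long retention, so pair it with snapshots.
            rules.append("daily-snapshot")
    elif rpo_seconds < 86400:
        rules.append("hourly-snapshot")
    else:
        rules.append("daily-snapshot")
    return rules
```

For example, a ledger that needs a 30-second RPO and seven-year retention comes back as `["continuous", "daily-snapshot"]`, the two-tier row of the grid.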

Per-service restore gotchas that are not obvious until restore time:

Service	Restore gotcha	Pre-empt it by
EC2 / EBS	Fails if source AMI is disabled	SCP deny `ec2:DisableImage` on prod OUs
RDS	Restores to a new instance (new endpoint)	Plan DNS/connection-string cutover
Aurora	New cluster; param/option groups must exist	Pre-create groups in the recovery account
DynamoDB	New table name on restore	Automate rename/swap in the runbook
S3	Item-level restore needs continuous backup on	Enable continuous, not just snapshots
EFS	Item-level restore lands in a new directory	Account for the restore path in validation
FSx	Restore creates a new file system	Re-point clients post-restore

Restore testing: prove RTO/RPO, do not assert it

An untested backup is a hypothesis. AWS Backup restore testing runs real StartRestoreJob operations on a schedule, measures completion time, optionally validates integrity, then tears down the restored resource. Build it in the recovery account so you exercise the air-gapped copies — testing the local copies proves nothing about the path that has to survive a compromise.

It is two API calls. First the plan (cadence + which recovery points are eligible):

aws backup create-restore-testing-plan --restore-testing-plan '{
  "RestoreTestingPlanName": "weekly_tier1_dr",
  "ScheduleExpression": "cron(0 7 ? * MON *)",
  "StartWindowHours": 4,
  "RecoveryPointSelection": {
    "Algorithm": "LATEST_WITHIN_WINDOW",
    "RecoveryPointTypes": ["SNAPSHOT"],
    "SelectionWindowDays": 7,
    "IncludeVaults": ["arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"]
  }
}'

Then a selection per resource type (here, RDS), with a validation window so the resource survives long enough to integrity-check before cleanup:

aws backup create-restore-testing-selection \
  --restore-testing-plan-name weekly_tier1_dr \
  --restore-testing-selection '{
    "RestoreTestingSelectionName": "rds_tier1",
    "ProtectedResourceType": "RDS",
    "IamRoleArn": "arn:aws:iam::444444444444:role/RecoveryRestoreRole",
    "ValidationWindowHours": 4,
    "ProtectedResourceArns": ["*"],
    "ProtectedResourceConditions": {
      "StringEquals": [{ "Key": "aws:ResourceTag/backup-tier", "Value": "tier1" }]
    }
  }'

For integrity beyond “did it boot”, wire an EventBridge rule on the restore-job-completed event to a Lambda that connects to the restored DB / reads canary objects from the restored bucket / checksums an EBS volume, then writes the verdict back:

aws backup put-restore-validation-result \
  --restore-job-id "$RESTORE_JOB_ID" \
  --validation-status SUCCESSFUL \
  --validation-status-message "row-count and checksum match production canary"

The job’s measured duration is your empirical RTO; the gap between recovery-point timestamp and incident is your RPO. Track both in Audit Manager so an auditor sees evidence, not a runbook claiming a number. A validation status, once written, is immutable — compute it correctly.
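
The two numbers are plain timestamp arithmetic. A minimal sketch, assuming you pull the restore job's creation and completion times and the recovery point's creation time from the corresponding describe calls; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def measured_rto(job_created: datetime, job_completed: datetime) -> timedelta:
    """Empirical RTO: how long the restore job actually took."""
    return job_completed - job_created


def achieved_rpo(recovery_point_created: datetime, incident_at: datetime) -> timedelta:
    """RPO: how much data age separates the recovery point from the incident."""
    return incident_at - recovery_point_created


def meets_target(measured: timedelta, target: timedelta) -> bool:
    """True when the measured value is within the stated objective."""
    return measured <= target


if __name__ == "__main__":
    start = datetime(2025, 1, 6, 7, 0, tzinfo=timezone.utc)
    done = datetime(2025, 1, 6, 9, 41, tzinfo=timezone.utc)
    rto = measured_rto(start, done)
    print(rto, meets_target(rto, timedelta(hours=4)))
```

Record the raw durations rather than a pass/fail flag, so a slow drift toward the target is visible long before a test actually fails it.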

Restore-testing plan options, end to end

The plan and selection schemas have several fields whose defaults are wrong for a serious DR program. Every option:

Field	Belongs to	Values	Default	When to change
`ScheduleExpression`	plan	`cron(...)` UTC	required	Weekly is the common cadence
`StartWindowHours`	plan	hours (max 168)	24	Bound the test to off-peak
`RecoveryPointSelection.Algorithm`	plan	`LATEST_WITHIN_WINDOW` / `RANDOM_WITHIN_WINDOW`	—	RANDOM exercises older RPs too
`RecoveryPointTypes`	plan	`SNAPSHOT` / `CONTINUOUS`	—	Match the tier you’re proving
`SelectionWindowDays`	plan	days	—	How far back RPs are eligible
`IncludeVaults`	plan	vault ARNs	all	Point at the air-gap vault
`ProtectedResourceType`	selection	RDS / EBS / DynamoDB / EFS / S3 / EC2	required	One selection per type
`IamRoleArn`	selection	restore role ARN	required	Needs KMS + restore perms
`ValidationWindowHours`	selection	hours	0	>0 so the resource survives to validate
`ProtectedResourceConditions`	selection	tag conditions	none	Scope to `backup-tier=tier1`
`RestoreMetadataOverrides`	selection	key/value	none	Override subnet/SG/instance class for the test env

The recovery-point selection algorithm changes what your test actually proves:

Algorithm	Picks	Proves	Use when
`LATEST_WITHIN_WINDOW`	Most recent RP in the window	Your freshest copy restores	The usual choice: proves current recoverability
`RANDOM_WITHIN_WINDOW`	A random RP in the window	Older copies also restore	Catch silent corruption in aging RPs
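
To see what each algorithm would exercise, the selection can be modelled over a list of recovery-point creation times. This is an illustrative model under stated assumptions, not AWS Backup's implementation: the function name is hypothetical and the window is treated as a simple look-back from `now`.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional


def select_recovery_point(created: List[datetime], algorithm: str,
                          window_days: int, now: datetime,
                          rng: Optional[random.Random] = None) -> Optional[datetime]:
    """Pick one recovery point created within the last window_days."""
    cutoff = now - timedelta(days=window_days)
    eligible = [t for t in created if cutoff <= t <= now]
    if not eligible:
        return None  # nothing in the window: the test has nothing to restore
    if algorithm == "LATEST_WITHIN_WINDOW":
        return max(eligible)
    if algorithm == "RANDOM_WITHIN_WINDOW":
        return (rng or random.Random()).choice(eligible)
    raise ValueError(f"unknown algorithm: {algorithm}")
```

Running the random variant over a few months of history shows how often an aging recovery point would actually be drawn, which is the coverage the second row buys.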

The validation status values and what each means downstream:

`validation-status`	Meaning	Effect on evidence
`SUCCESSFUL`	Integrity check passed	Counts as a passing test (immutable)
`FAILED`	Check ran, data wrong	Flags a real recoverability gap
`TIMED_OUT`	Validation didn’t finish in window	Raise `ValidationWindowHours`
(unset)	No validation wired	RTO measured, integrity unproven

Audit, drift, and evidence export

AWS Backup Audit Manager turns “are we compliant” into a queryable framework. It ships controls you parameterize — minimum retention enforced, resources protected by a plan, resources in a Vault-Lock-protected vault, cross-Region/cross-account copy present, last-recovery-point recency, and restore-time-meets-target (fed by restore testing).

aws backup create-framework \
  --framework-name org_dr_framework \
  --framework-controls '[
    {
      "ControlName": "BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK",
      "ControlInputParameters": [
        { "ParameterName": "requiredRetentionDays", "ParameterValue": "365" }
      ]
    },
    { "ControlName": "BACKUP_RESOURCES_PROTECTED_BY_BACKUP_VAULT_LOCK" },
    { "ControlName": "BACKUP_RECOVERY_POINT_ENCRYPTED" }
  ]'

Schedule a report plan that drops compliance evidence (CSV/JSON) into S3 on a cadence — the artifact you hand auditors and the drift signal when a resource slips out of policy. For org-wide visibility, aggregate findings in the delegated admin via AWS Config aggregators or Security Hub rather than logging into each account. (See AWS CloudTrail & Config: Audit & Compliance for the wider audit fabric.)

The Audit Manager controls worth enabling

Each control answers a specific auditor question and emits a specific piece of evidence. The set you want:

Control	Question it answers	Parameter	Evidence emitted
`BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK`	Are RPs kept long enough?	`requiredRetentionDays`	Per-RP retention compliance
`BACKUP_RESOURCES_PROTECTED_BY_BACKUP_PLAN`	Is everything in scope backed up?	resource type/tags	List of unprotected resources
`BACKUP_RESOURCES_PROTECTED_BY_BACKUP_VAULT_LOCK`	Are RPs in a locked vault?	—	Vault-Lock coverage
`BACKUP_RECOVERY_POINT_ENCRYPTED`	Are RPs encrypted?	—	Encryption status per RP
`BACKUP_RECOVERY_POINT_MANUAL_DELETION_DISABLED`	Can RPs be hand-deleted?	—	Vault access-policy posture
`BACKUP_LAST_RECOVERY_POINT_CREATED`	Is the latest RP recent?	`recoveryPointAgeValue/Unit`	Freshness (catches stalled jobs)
`RESTORE_TIME_FOR_RESOURCES_MEET_TARGET`	Does RTO meet target?	`maxRestoreTime`	Measured restore-time compliance
`BACKUP_RESOURCES_PROTECTED_BY_CROSS_REGION`	Is there an off-Region copy?	—	Cross-Region copy presence
`BACKUP_RESOURCES_PROTECTED_BY_CROSS_ACCOUNT`	Is there an off-account copy?	—	Cross-account copy presence

Where each compliance signal aggregates for org-wide visibility:

Signal	Source	Aggregate in	How
Backup job success/failure	AWS Backup per account	Delegated admin	Cross-account monitoring
Framework control compliance	Audit Manager per account	S3 (report plan)	Scheduled CSV/JSON export
Config rule drift	AWS Config per account	Delegated admin	Config aggregator
Security findings	Security Hub	Delegated admin / security acct	Hub aggregation
API actions (who deleted what)	CloudTrail	Org trail → central S3	Organization trail

Architecture at a glance

The diagram traces the recovery path the way data actually moves through it, left to right, and drops a number on each control or failure point. Read it as a pipeline. On the far left, the org control plane (the management account) enables AWS Backup trusted access, turns on the BACKUP_POLICY type, and registers the delegate — and ① marks the trap that a registered delegate is still blind to jobs until you separately enable cross-account monitoring. The delegated admin authors the tag-targeted policy and runs Audit Manager, then attaches the policy to the production OU. That policy renders into every workload account, where a daily rule snapshots tagged resources into a local CMK-encrypted vault for fast restores — ② marks the bootstrap gotcha that the vault and AWSBackupCentralRole must already exist or the job fails before it starts.

From the workload accounts, copy_actions fans each recovery point cross-account and cross-Region into the air-gapped recovery account in us-west-2. That is where the program earns its keep: a logically air-gapped vault under a 7-year compliance-mode Vault Lock (③ — the lock must actually harden, and no “Always”-retention recovery point may sit inside it), a destination CMK or multi-Region key, and a recovery-OU SCP that denies the deletion verbs (④ — the copy still has to land, which fails if the destination access policy or that very SCP is wrong). Finally the restore-testing plane, running in the recovery account, fires real StartRestoreJob calls weekly against the air-gapped copies, a Lambda checksums the result, and PutRestoreValidationResult writes immutable evidence back to Audit Manager — ⑤ marks the restore that fails with a KMS-denied error when the destination key policy lacks kms:CreateGrant. Follow the numbers and you have the whole method: author policy centrally, snapshot locally, copy to an immutable air-gap, and prove you can restore.

Real-world scenario

A fintech platform team I worked with — call it Meridian Pay, 140 AWS accounts under a Control Tower landing zone, running card-processing on RDS PostgreSQL and a DynamoDB ledger — ran a clean per-account AWS Backup setup: every workload account backed itself up to a local vault, lifecycle was correct, dashboards were green. RPO was nominally one hour (continuous backup on the ledger), retention met their seven-year card-industry floor, and the quarterly “do we have backups?” checkbox was always ticked. The program had cost them about ₹95,000/month in storage and they considered it solved.

Then a CI/CD pipeline with an over-broad deployment role in their staging account was compromised through a poisoned npm dependency. The attacker enumerated AWS Backup and, because the same role could manage vaults, began deleting recovery points and the vault itself. The local backups in that account were gone in minutes — encrypted with a CMK the same role could also schedule for deletion. Staging was non-production, so the data loss was survivable, but the incident review asked a sharper question than the incident itself: prove the production copies could not have met the same fate. Nobody could. Production used the identical pattern; only luck (and the attacker stopping at staging) separated a bad day from a company-ending one.

The constraint made the fix non-trivial. Compliance forbade standing human or break-glass access to the immutable tier, yet they needed copies that survived a full account compromise and evidence the copies were both immutable and restorable. They could not simply “add a second vault” in the same account — that shared the trust boundary the staging incident had just walked through.

The fix was three changes, not a re-platform. First, every Tier-1 plan got a copy_actions fan-out into a logically air-gapped vault in a separate recovery account (444444444444) in us-west-2, locked in compliance mode with min-retention-days matching their seven-year floor — so even a root-equivalent compromise of a workload account could not shorten or delete those copies. Second, the recovery OU got an SCP denying the deletion verbs outright, closing the exact door the staging incident walked through:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": [
      "backup:DeleteRecoveryPoint",
      "backup:DeleteBackupVault",
      "backup:PutBackupVaultAccessPolicy",
      "kms:ScheduleKeyDeletion",
      "kms:DisableKey"
    ],
    "Resource": "*"
  }]
}

Third — the part that satisfied the auditor — a weekly restore-testing plan in the recovery account restored the latest air-gapped RDS and DynamoDB recovery points, a Lambda validated row counts and a ledger checksum against a production canary, and PutRestoreValidationResult recorded an immutable pass with a measured restore duration. The Audit Manager report plan exported that evidence to a locked S3 bucket every Monday.

When the auditor asked the sharp question, the answer was a CSV with timestamps, durations, and validation verdicts — not a slide deck. RTO went from “we believe under four hours” to “measured 2h41m on the last 11 consecutive weekly tests.” The incremental cost was about ₹40,000/month (the air-gap storage and cross-Region copy egress), which the CFO signed in one meeting once it was framed as “the difference between a finding and a clean audit.” The wall-lesson the team adopted: “A green backup dashboard is a claim. A passing restore test in an account you can’t delete from is a fact.”

The transformation as a before/after, because the gap is the lesson:

Dimension	Before (per-account)	After (centralized air-gap)
Trust boundary	Backups share the workload account	Separate recovery account + Region
Immutability	CMK + RPs deletable by the same role	Compliance Vault Lock, 7-year floor
Deletion door	Deploy role could manage vaults	SCP denies Delete*/key-deletion
RTO/RPO	“We believe under 4h”	Measured 2h41m over 11 weekly tests
Audit answer	A runbook claiming a number	CSV of timestamps, durations, verdicts
Cost	~₹95,000/mo	~₹135,000/mo (+air-gap copy/egress)
Survives account compromise?	No	Yes

Advantages and disadvantages

Centralizing backup as a separate trust boundary buys survivability and provability at the cost of complexity and cross-account egress. Weigh it honestly:

Advantages (why this model protects you)	Disadvantages (why it costs you)
Recovery path is a separate trust boundary — a compromised workload account can’t reach the immutable copies	Multi-account, multi-Region, multi-key — materially more moving parts to get right
Compliance Vault Lock makes copies immutable to everyone, including root and AWS — true ransomware survivability	The “Always”-retention-in-compliance-vault foot-gun bills forever and is unrecoverable short of closing the account
Tag-targeted org policy means coverage scales automatically as teams ship	Tag discipline is now load-bearing — an untagged resource is silently unprotected
Restore testing converts RTO/RPO from claims into measured, auditable evidence	Cross-Region copy adds egress and storage cost (the air-gap is not free)
One delegated admin authors policy org-wide; the management account stays thin	Half-configured delegation (policy without monitoring) is an easy, invisible mistake
Audit Manager + report plans hand auditors evidence instead of opinions	KMS key policy across accounts is fiddly; the `CreateGrant` bug is a common stumble
SCPs on the recovery OU close the exact deletion door incidents walk through	Over-broad SCPs can also block the legitimate copy-in path — must be scoped precisely

The model is right for any organization where a backup failing during an incident is unacceptable — regulated data, ransomware-exposed workloads, anything where “prove you can recover” is a real question. It is overkill for a single throwaway dev account. It bites hardest on teams with weak tagging discipline (coverage gaps), on those who lock a vault in compliance mode without understanding the irreversibility, and on anyone who configures the copy path but never tests the restore — discovering the KMS grant bug during a real disaster instead of a Monday drill.

Hands-on lab

Stand up the core of the program in a single sandbox account — a CMK, a vault, a compliance-mode lock in its grace window, a tag-targeted plan, and an on-demand backup you watch complete. Everything here is free-tier-friendly except a few paise of EBS snapshot storage; we tear it all down at the end. Run in CloudShell (the AWS CLI is pre-authenticated).

Step 1 — Variables and a dedicated KMS key. A dedicated CMK is mandatory for a real program; we make one even in the lab.

REGION=ap-south-1
export AWS_DEFAULT_REGION="$REGION"   # pin every later aws call to the lab Region
VAULT=lab-backup-vault
KEY_ID=$(aws kms create-key --description "lab backup CMK" \
  --query KeyMetadata.KeyId --output text)
aws kms create-alias --alias-name alias/lab-backup --target-key-id "$KEY_ID"
echo "Key: $KEY_ID"

Expected: a key ID prints; alias/lab-backup now resolves to it.

Step 2 — Create a backup vault encrypted with that key.

aws backup create-backup-vault --backup-vault-name "$VAULT" \
  --encryption-key-arn "arn:aws:kms:$REGION:$(aws sts get-caller-identity --query Account --output text):key/$KEY_ID"
aws backup describe-backup-vault --backup-vault-name "$VAULT" \
  --query '{Name:BackupVaultName, Locked:Locked, RP:NumberOfRecoveryPoints}'

Expected: Locked: false, RP: 0 — an empty, unlocked vault.

Step 3 — Tag an EBS volume so the plan can select it. Create a tiny 1 GiB volume and tag it backup-tier=tier1.

AZ=${REGION}a
VOL_ID=$(aws ec2 create-volume --availability-zone "$AZ" --size 1 \
  --volume-type gp3 --query VolumeId --output text)
aws ec2 create-tags --resources "$VOL_ID" \
  --tags Key=backup-tier,Value=tier1 Key=Name,Value=lab-backup-target
echo "Volume: $VOL_ID"

Step 4 — Create a backup plan and a tag-based selection.

PLAN_ID=$(aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "lab_tier1",
  "Rules": [{
    "RuleName": "DailySnapshot",
    "TargetBackupVaultName": "'"$VAULT"'",
    "ScheduleExpression": "cron(0 18 ? * * *)",
    "Lifecycle": { "DeleteAfterDays": 35 }
  }]
}' --query BackupPlanId --output text)

ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/service-role/AWSBackupDefaultServiceRole"
aws backup create-backup-selection --backup-plan-id "$PLAN_ID" \
  --backup-selection '{
    "SelectionName": "tier1-by-tag",
    "IamRoleArn": "'"$ROLE_ARN"'",
    "ListOfTags": [{ "ConditionType": "STRINGEQUALS", "ConditionKey": "backup-tier", "ConditionValue": "tier1" }]
  }'

(If AWSBackupDefaultServiceRole does not exist, the AWS Backup console’s first-run creates it, or create it from the AWS managed AWSBackupServiceRolePolicyForBackup policy.)

Step 5 — Take an on-demand backup and watch it complete. Don’t wait for the 18:00 cron — fire one now.

JOB_ID=$(aws backup start-backup-job \
  --backup-vault-name "$VAULT" \
  --resource-arn "arn:aws:ec2:$REGION:$(aws sts get-caller-identity --query Account --output text):volume/$VOL_ID" \
  --iam-role-arn "$ROLE_ARN" \
  --query BackupJobId --output text)
aws backup describe-backup-job --backup-job-id "$JOB_ID" \
  --query '{State:State, Pct:PercentDone}'
# re-run the describe until State = COMPLETED (usually a couple of minutes for 1 GiB)

Expected: State moves CREATED → RUNNING → COMPLETED; a recovery point now exists in the vault.
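
Instead of re-running the describe by hand, the wait can be scripted as a bounded polling loop. The sketch below keeps the AWS call out of the function so it runs anywhere: `fetch_state` is a hypothetical callable you supply (for example, one that shells out to `aws backup describe-backup-job` and returns the `State` string), and the poll interval and attempt limit are arbitrary defaults.

```python
import time
from typing import Callable

# States after which a backup job will not change again.
TERMINAL_STATES = {"COMPLETED", "FAILED", "ABORTED", "EXPIRED", "PARTIAL"}


def wait_for_backup_job(fetch_state: Callable[[], str],
                        poll_seconds: float = 15.0,
                        max_polls: int = 80,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the job reaches a terminal state, then return that state."""
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"job still not terminal after {max_polls} polls")
```

Treat anything other than `COMPLETED` as a failure of the step; the loop returns the terminal state rather than raising so the caller can log the job's status message first.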

Step 6 — Lock the vault in compliance mode (grace window) and inspect the state. We use the minimum 3-day grace so you can still unlock before teardown.

aws backup put-backup-vault-lock-configuration --backup-vault-name "$VAULT" \
  --changeable-for-days 3 --min-retention-days 1 --max-retention-days 365
aws backup describe-backup-vault --backup-vault-name "$VAULT" \
  --query '{Locked:Locked, LockDate:LockDate, Min:MinRetentionDays, Max:MaxRetentionDays}'

Expected: Locked: true with a LockDate ~3 days in the future — the cooling-off window. Because it hasn’t hardened, you can still delete the lock in teardown.

Validation checklist. You created a dedicated CMK, an encrypted vault, a tag-targeted plan, a real recovery point, and a compliance-mode lock in its grace window — the full local half of the program in one account. What each step proved:

Step	What you did	What it proves	Real-world analogue
1	Dedicated CMK	The key the program must own (not `aws/backup`)	Every prod vault uses a CMK
2	Encrypted vault	The permission boundary for recovery points	Local vault in each account
3	Tag the resource	Coverage follows tags, not ARNs	How org policy scales
4	Plan + tag selection	One plan protects whatever is tagged	The Tier-1 org policy
5	On-demand backup completes	A recovery point is real, not theoretical	The nightly job
6	Compliance lock (grace)	Immutability is a mode + a grace window	Vault Lock on prod/air-gap vaults

Teardown (delete the lock while still in grace, then everything else).

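# Delete the lock config; only possible because LockDate is still in the future.
aws backup delete-backup-vault-lock-configuration --backup-vault-name "$VAULT"
# Delete the recovery point, then the selection, plan, vault, volume, and key.
RP_ARN=$(aws backup list-recovery-points-by-backup-vault --backup-vault-name "$VAULT" \
  --query 'RecoveryPoints[0].RecoveryPointArn' --output text)
aws backup delete-recovery-point --backup-vault-name "$VAULT" --recovery-point-arn "$RP_ARN"
# A plan cannot be deleted while it still has selections, so remove the selection first.
SEL_ID=$(aws backup list-backup-selections --backup-plan-id "$PLAN_ID" \
  --query 'BackupSelectionsList[0].SelectionId' --output text)
aws backup delete-backup-selection --backup-plan-id "$PLAN_ID" --selection-id "$SEL_ID"
aws backup delete-backup-plan --backup-plan-id "$PLAN_ID"
# Recovery-point deletion is asynchronous; if the vault is reported as not empty, retry shortly.
aws backup delete-backup-vault --backup-vault-name "$VAULT"
aws ec2 delete-volume --volume-id "$VOL_ID"
aws kms schedule-key-deletion --key-id "$KEY_ID" --pending-window-in-days 7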

Cost note. A 1 GiB EBS snapshot is a fraction of a rupee; the CMK is ~₹85/month prorated (deleted here after a 7-day pending window, the KMS minimum). The whole lab runs to a few rupees. Critical: never let the Step 6 grace window lapse in a sandbox you want to delete. Once LockDate passes, the compliance lock hardens, the vault and its recovery points become undeletable until they expire, and the only escape is closing the AWS account. (Omitting --changeable-for-days would instead create a governance-mode lock, which stays removable with the right IAM permissions.)

Common mistakes & troubleshooting

This is the playbook — the part you bookmark, because every one of these silently breaks a backup program and most produce a green dashboard while doing it. First as a scannable table, then the entries that bite hardest expanded with the full confirm-command detail.

#	Symptom	Root cause	Confirm (exact cmd)	Fix
1	Policy renders perfectly but every job fails	Vault/role not bootstrapped in the target account/Region	`describe-effective-policy` OK, but job error “role/vault not found”	StackSet vault + `AWSBackupCentralRole` to every OU/Region before policies
2	Delegate authors policy but sees no jobs org-wide	Cross-account monitoring not enabled (only delegation was)	`list-delegated-administrators` returns acct, `list-backup-jobs` empty	Enable cross-account monitoring in AWS Backup → Settings
3	Cross-account copies never appear in the recovery vault	Dest vault access policy missing / wrong `PrincipalOrgID`, or recovery SCP blocks copy-in	`list-recovery-points-by-backup-vault` (airgap) empty after a job	`put-backup-vault-access-policy` for `backup:CopyIntoBackupVault`; scope SCP to Delete* only
4	Restore fails: “KMS key cannot be accessed”	Dest key policy lacks `kms:CreateGrant` for Backup/restore role	Restore job `StatusMessage`/`AbortReason` cites KMS	Add Decrypt+GenerateDataKey+CreateGrant (`GrantIsForAWSResource`) to dest CMK
5	Vault “looks locked” but is still deletable	Governance mode (no `--changeable-for-days`) or `LockDate` still future	`describe-backup-vault` → `Locked:true` with no `LockDate` (governance), or a future `LockDate`	Re-lock with `--changeable-for-days`; wait for `LockDate` to pass
6	A recovery point bills forever, can’t delete it	“Always”/indefinite retention inside a compliance-locked vault	`list-recovery-points-by-backup-vault` shows no `CalculatedLifecycle.DeleteAt`	Prevent via `max-retention-days`; existing one only clears by account closure
7	Some resources are simply never backed up	They aren’t tagged with the selection tag	`describe-effective-policy` selection tag vs `get-resources` by tag	Enforce tagging (Tag Policy / Config rule); add the tag
8	Backup job cancelled before it ran	`start_backup_window_minutes` too small under contention	Job `StatusMessage`: “window expired”	Raise start/complete windows; stagger schedules
9	Continuous-backup RPs vanish after ~35 days	PITR has a hard 35-day ceiling (working as designed)	RP age ~35d on a `CONTINUOUS` rule	Add a daily SNAPSHOT rule with long retention for the long tail
10	EC2/EBS restore fails though RP is intact	Source AMI was disabled	Restore error references a disabled image	SCP deny `ec2:DisableImage` on prod OUs; re-enable the AMI
11	Cross-Region copy slow / intermittently failing	Single-Region key re-wrapping per hop, or missing `ReEncrypt*`	Copy job latency/errors; key policy lacks `ReEncrypt*`	Use a multi-Region key; or add `kms:ReEncrypt*`
12	Audit Manager shows resources non-compliant for “protected”	Resource in scope but no plan covers its tag/type	Framework finding lists the resource ARN	Extend the policy selection; re-tag
13	`create-policy --type BACKUP_POLICY` errors	Policy type not enabled on the org root	`aws organizations list-roots` → `PolicyTypes` lacks `BACKUP_POLICY`	`aws organizations enable-policy-type --root-id <root-id> --policy-type BACKUP_POLICY`
14	Restore test “passes” but data is wrong	No validation wired — only boot was checked	Restore job `ValidationStatus` unset	Add EventBridge→Lambda + `put-restore-validation-result`

The expanded form, with the full reasoning for the entries that bite hardest:

1. The org policy renders perfectly into a member account, yet every backup job fails. Root cause: The target vault and the IAM role the policy references do not exist in that account/Region — AWS Backup never creates them. Confirm: aws organizations describe-effective-policy --policy-type BACKUP_POLICY (run in the member) shows a correct policy; aws backup list-backup-jobs --by-state FAILED shows a StatusMessage referencing a missing role or vault. Fix: Deploy the vault and AWSBackupCentralRole to every target account/Region via a service-managed CloudFormation StackSet targeting the OUs, before attaching any policy. The role needs the AWS managed AWSBackupServiceRolePolicyForBackup (and ...ForRestores where you restore).

2. The delegated admin can author policies but sees no jobs anywhere. Root cause: Only register-delegated-administrator was done; cross-account monitoring was never enabled. They are two separate grants. Confirm: aws organizations list-delegated-administrators --service-principal backup.amazonaws.com returns the account, but aws backup list-backup-jobs from the delegate is empty even though members are backing up. Fix: In the delegate’s AWS Backup console → Settings, enable cross-account monitoring (and ensure isCrossAccountBackupEnabled is true for copies).

3. Cross-account copies never land in the recovery vault. Root cause: The destination vault access policy is missing or its aws:PrincipalOrgID doesn’t match, or the recovery OU’s SCP is so broad it blocks backup:CopyIntoBackupVault along with the deletion verbs. Confirm: aws backup list-recovery-points-by-backup-vault --backup-vault-name airgap-recovery-vault (run in 444444444444) is empty after a member job completes; aws backup get-backup-vault-access-policy shows a missing/incorrect policy; aws organizations describe-effective-policy --policy-type SERVICE_CONTROL_POLICY reveals an over-broad deny. Fix: put-backup-vault-access-policy allowing backup:CopyIntoBackupVault for PrincipalOrgID; scope the recovery SCP to only the Delete*/key-deletion verbs so copy-in still works.
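The access policy described in the fix for entry 3 can be sketched as a small Python helper that emits the JSON document you would pass to put-backup-vault-access-policy. This is a minimal sketch, not a definitive policy: the function name, the Sid, and the organization ID `o-exampleorgid` are illustrative assumptions, and the wildcard principal is only acceptable here because it is paired with a single action and the aws:PrincipalOrgID condition.

```python
import json


def copy_in_vault_policy(org_id: str) -> dict:
    """Build a vault access policy that allows copy-in from one organization only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCopyInFromOrg",  # illustrative Sid
                "Effect": "Allow",
                "Principal": "*",
                # Exactly one action: copy-in. No read, delete, or wildcard.
                "Action": "backup:CopyIntoBackupVault",
                "Resource": "*",
                # The condition is what narrows the wildcard principal to your org.
                "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
            }
        ],
    }


# "o-exampleorgid" is a placeholder; substitute your real organization ID.
policy_json = json.dumps(copy_in_vault_policy("o-exampleorgid"), indent=2)
print(policy_json)
```

The printed JSON is what goes into the policy argument for airgap-recovery-vault in the recovery account. If the copy still does not land after applying it, the SCP check in the same entry is the next thing to rule out.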

4. Restore fails with “KMS key cannot be accessed” though the recovery point is intact. Root cause: The destination CMK’s key policy lacks kms:CreateGrant (with the GrantIsForAWSResource condition) for backup.amazonaws.com and/or the restore role — AWS Backup creates a transient grant to decrypt during restore. Confirm: aws backup describe-restore-job --restore-job-id <id> shows a StatusMessage/AbortReason citing KMS; the recovery point itself lists fine. Fix: Add kms:Decrypt, kms:GenerateDataKey, kms:DescribeKey, and kms:CreateGrant to the dest key policy for the Backup service and the restore role (the restore-role statement scoped by "kms:GrantIsForAWSResource": "true"). Across Regions, prefer a multi-Region key.
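The key-policy change in entry 4 could look roughly like the statements built below. Treat it as a sketch under assumptions: the role name `BackupRestoreRole` is hypothetical (the guide does not name the restore role), the Sids are illustrative, and these statements would be merged into the destination CMK's existing key policy rather than replacing it.

```python
import json


def restore_key_statements(restore_role_arn: str) -> list:
    """Key-policy statements the restore role needs on the destination CMK."""
    return [
        {
            "Sid": "AllowRestoreRoleUseOfKey",
            "Effect": "Allow",
            "Principal": {"AWS": restore_role_arn},
            "Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey"],
            "Resource": "*",
        },
        {
            "Sid": "AllowRestoreRoleGrantsForAwsResources",
            "Effect": "Allow",
            "Principal": {"AWS": restore_role_arn},
            "Action": "kms:CreateGrant",
            "Resource": "*",
            # Only grants created on behalf of an AWS service are allowed,
            # which keeps CreateGrant from becoming an arbitrary key-sharing path.
            "Condition": {"Bool": {"kms:GrantIsForAWSResource": "true"}},
        },
    ]


# Hypothetical role ARN in the recovery account used throughout this guide.
role_arn = "arn:aws:iam::444444444444:role/BackupRestoreRole"
print(json.dumps(restore_key_statements(role_arn), indent=2))
```

Keeping CreateGrant in its own statement makes the condition easy to audit: a reviewer can confirm at a glance that no unconditioned grant permission exists on the key.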

5. The vault “looks locked” but a privileged role can still delete recovery points. Root cause: It’s in governance mode (you omitted --changeable-for-days), or it’s compliance mode but LockDate is still in the future (the grace window). Confirm: aws backup describe-backup-vault --backup-vault-name <vault> --query '{Locked:Locked, LockDate:LockDate}' shows Locked:false, Locked:true with no LockDate (a governance-mode lock), or Locked:true with a future LockDate. Fix: For true immutability, lock with --changeable-for-days 3 (compliance) and wait for LockDate to pass; only then is it un-deletable by anyone.
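The check in entry 5 is easy to get wrong by eye, so here is a minimal sketch that classifies a describe-backup-vault response. It assumes the response is a dict shaped like the one boto3 returns, with `Locked` as a boolean and `LockDate` as a timezone-aware datetime when present, and it assumes that a lock with no `LockDate` is a governance-mode lock. The function name and the four state labels are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lock_state(vault: dict, now: Optional[datetime] = None) -> str:
    """Classify a vault as unlocked, governance, compliance-grace or compliance-hardened."""
    now = now or datetime.now(timezone.utc)
    if not vault.get("Locked"):
        return "unlocked"
    lock_date = vault.get("LockDate")
    if lock_date is None:
        # Assumption: a lock without a LockDate can still be removed (governance).
        return "governance"
    # A future LockDate means the grace window is still open.
    return "compliance-hardened" if lock_date <= now else "compliance-grace"


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(lock_state({"Locked": True, "LockDate": now + timedelta(days=2)}, now))
```

Only "compliance-hardened" should count as immutable for the air-gap tier; every other return value is a finding worth an alert.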

6. A recovery point bills indefinitely and refuses to delete. Root cause: It was created with “Always”/indefinite retention and the vault is under a hardened compliance lock, so its retention can never be shortened and it can never be deleted. Confirm: aws backup list-recovery-points-by-backup-vault shows the RP with no CalculatedLifecycle.DeleteAt; the vault shows Locked:true with a past LockDate. Fix: Prevent it up front with max-retention-days on the lock (which rejects indefinite-retention RPs). An already-hardened case has no escape short of closing the AWS account — which is exactly why you never combine “Always” with a compliance lock.

7. Some production resources are silently never backed up. Root cause: They lack the selection tag (backup-tier=tier1), so the tag-targeted policy never selects them. Tag-targeting scales coverage and silently drops the untagged. Confirm: Compare the policy’s selection tag (from describe-effective-policy) against aws resourcegroupstaggingapi get-resources --tag-filters Key=backup-tier. Fix: Enforce tagging with an Organizations Tag Policy and an AWS Config rule that flags untagged in-scope resources; backfill the tag.
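The comparison in entry 7 can be sketched as a filter over the output of an unfiltered resourcegroupstaggingapi get-resources call. The sketch assumes the input is the `ResourceTagMappingList` from that response (each item carrying `ResourceARN` and a `Tags` list of `Key`/`Value` pairs); the function name and the sample ARNs are made up for illustration. One caveat to keep in mind: that API reports resources that are or have been tagged, so a resource that never carried any tag may need a separate inventory source such as AWS Config.

```python
def untagged_arns(resource_tag_mappings: list, tag_key: str = "backup-tier") -> list:
    """Return ARNs of resources that do not carry the backup selection tag."""
    missing = []
    for item in resource_tag_mappings:
        keys = {tag["Key"] for tag in item.get("Tags", [])}
        if tag_key not in keys:
            missing.append(item["ResourceARN"])
    return sorted(missing)


# Hypothetical sample data in the shape of ResourceTagMappingList.
sample = [
    {"ResourceARN": "arn:aws:rds:ap-south-1:111111111111:db:orders",
     "Tags": [{"Key": "backup-tier", "Value": "tier1"}]},
    {"ResourceARN": "arn:aws:rds:ap-south-1:111111111111:db:billing",
     "Tags": [{"Key": "env", "Value": "prod"}]},
]
print(untagged_arns(sample))
```

Anything this returns for a production account is a resource the tag-targeted policy will never select.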

9. Continuous-backup recovery points disappear after about 35 days. Root cause: PITR/continuous backup has a hard 35-day retention ceiling — this is by design, not a bug. Confirm: The vanishing RPs are on a rule with enable_continuous_backup: true; their age clusters at ~35 days. Fix: Continuous is your low-RPO tier only. Run a daily SNAPSHOT rule (with long, air-gapped retention) in the same plan for the long-term and ransomware tier.
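The two-rule plan recommended in entry 9 might be generated like this. It is a sketch under several assumptions: the key names follow my reading of the Organizations backup policy syntax and should be validated before you attach anything; the plan, rule, and selection names, the cron expressions, the Regions, and the local vault name `central-vault` are all hypothetical; and 2555 days is simply seven years expressed in days.

```python
import json


def assign(value):
    """Wrap a value in the @@assign operator used by Organizations policies."""
    return {"@@assign": value}


def two_tier_policy(region, local_vault, airgap_vault_arn):
    """Build a backup policy with a continuous rule and a daily snapshot rule."""
    continuous = {
        "schedule_expression": assign("cron(0 1 ? * * *)"),
        "enable_continuous_backup": assign(True),
        "target_backup_vault_name": assign(local_vault),
        # 35 days is the continuous-backup ceiling described above.
        "lifecycle": {"delete_after_days": assign("35")},
    }
    daily = {
        "schedule_expression": assign("cron(0 2 ? * * *)"),
        "target_backup_vault_name": assign(local_vault),
        "lifecycle": {"delete_after_days": assign("35")},
        # The copy carries its own, longer lifecycle in the air-gap vault.
        "copy_actions": {
            airgap_vault_arn: {
                "target_backup_vault_arn": assign(airgap_vault_arn),
                "lifecycle": {"delete_after_days": assign("2555")},
            }
        },
    }
    selection = {
        "iam_role_arn": assign("arn:aws:iam::$account:role/AWSBackupCentralRole"),
        "tag_key": assign("backup-tier"),
        "tag_value": assign(["tier1"]),
    }
    return {
        "plans": {
            "tier1-plan": {
                "regions": assign([region]),
                "rules": {"continuous-pitr": continuous, "daily-snapshot": daily},
                "selections": {"tags": {"tier1-by-tag": selection}},
            }
        }
    }


airgap_arn = "arn:aws:backup:eu-west-1:444444444444:backup-vault:airgap-recovery-vault"
policy = two_tier_policy("ap-south-1", "central-vault", airgap_arn)
print(json.dumps(policy, indent=2))
```

Only the snapshot rule carries copy_actions, so the air-gap vault receives discrete, long-lived recovery points while the continuous rule stays local and short.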

Best practices

Three accounts, never two. Management enables and attaches; a delegated admin authors and monitors; an isolated recovery account in another Region holds the immutable copies. Collapsing any pair re-creates the blast radius you’re trying to escape.
Enable both delegation grants. register-delegated-administrator and cross-account monitoring. A policy author blind to execution is worse than useless — it looks configured.
Bootstrap vaults + roles before policies. StackSet (service-managed) the vault and AWSBackupCentralRole into every OU/Region first, so jobs never fail “role/vault not found” when an account joins.
Target by tag, then enforce the tag. Tag-targeting scales coverage automatically — but pair it with a Tag Policy and a Config rule so an untagged resource can’t slip through unprotected.
Compliance-mode lock on the air-gap vault, always. Governance mode is an operator guardrail; only compliance mode survives a root-equivalent compromise. Verify Locked:true with a past LockDate.
Never combine “Always” retention with a compliance lock. Set max-retention-days on every lock so an indefinite-retention recovery point can’t bill forever with no escape but account closure.
Dedicated CMKs, never aws/backup. The managed key can’t be shared cross-account, so the restore side can’t use it. Use customer-managed keys and a multi-Region key when copying cross-Region.
Get kms:CreateGrant right the first time. Add it (with GrantIsForAWSResource) for backup.amazonaws.com and the restore role — it’s the single most common reason a real restore fails.
SCP the recovery OU to deny deletion, scoped tightly. Deny backup:DeleteRecoveryPoint, backup:DeleteBackupVault, and kms:ScheduleKeyDeletion, but only those, so the legitimate backup:CopyIntoBackupVault path still works.
Run two rules per plan: continuous + daily snapshot. Continuous (35-day ceiling) for low RPO; daily snapshot with long air-gapped retention for the long tail and ransomware survivability.
Test the restore, not just the backup. A scheduled restore-testing plan against the air-gap vault, with a validation Lambda, turns RTO/RPO into measured evidence. An untested backup is a hypothesis.
Export evidence continuously. Audit Manager framework + report plan to a locked S3 bucket — the artifact for auditors and the drift signal when a resource slips out of policy.
Block ec2:DisableImage on prod. An attacker can soft-brick your EC2/EBS restore path by disabling the AMI without deleting anything; close that door with an SCP.
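The "test the restore" practice above depends on a validation hook that records a verdict. A minimal sketch of the part a Lambda would own is below: it turns an EventBridge restore-job event plus the outcome of your own integrity check into the parameters for PutRestoreValidationResult. The event field `detail.restoreJobId` is an assumption about the event shape, the function name is illustrative, and the integrity check itself is left to you.

```python
def validation_result(event: dict, passed: bool, message: str = "") -> dict:
    """Build the parameters for PutRestoreValidationResult from a restore event."""
    params = {
        # Assumption: the restore-job event carries the job ID under detail.restoreJobId.
        "RestoreJobId": event["detail"]["restoreJobId"],
        "ValidationStatus": "SUCCESSFUL" if passed else "FAILED",
    }
    if message:
        params["ValidationStatusMessage"] = message
    return params


# Inside a Lambda handler the result would be passed to the Backup client, e.g.
#   boto3.client("backup").put_restore_validation_result(**params)
sample_event = {"detail": {"restoreJobId": "example-restore-job-id"}}  # hypothetical ID
print(validation_result(sample_event, passed=False, message="row count mismatch"))
```

Separating the parameter building from the API call keeps the verdict logic unit-testable without AWS credentials, which matters for a control you intend to show an auditor.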

The leading indicators worth alerting on before an incident or audit — not the lagging “job failed”:

Alert on	Signal / control	Threshold (starting point)	Why it’s leading
Stalled backups	`BACKUP_LAST_RECOVERY_POINT_CREATED`	RP age > 26h on a daily rule	Catches a quietly broken job before the next audit
Copy not landing	Recovery-point count in air-gap vault	0 new in 26h	The air-gap is the part that must not silently fail
Lock not hardened	`describe-backup-vault` Locked/LockDate	`Locked:false`, or a missing/future `LockDate`, on a prod/air-gap vault	An “immutable” vault that isn’t
Coverage gap	`_PROTECTED_BY_BACKUP_PLAN`	any in-scope resource non-compliant	Unprotected resource discovered before restore time
Restore-time regression	`_RESTORE_TIME_..._MEET_TARGET`	measured RTO > target	RTO drifting past the SLA the business bought
Deletion attempt	CloudTrail `DeleteRecoveryPoint`/`DeleteBackupVault`	any in the recovery account	The SCP should deny it — an attempt is a signal
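The first two rows of the table share one piece of arithmetic: is the newest recovery point older than 26 hours? A minimal sketch, assuming the input is a list of recovery-point dicts with a timezone-aware `CreationDate` (the shape boto3 returns from list-recovery-points-by-backup-vault); the function names are illustrative and the 26-hour default is the starting threshold from the table.

```python
from datetime import datetime, timedelta, timezone


def newest_age(recovery_points, now):
    """Return the age of the newest recovery point, or None if there are none."""
    dates = [rp["CreationDate"] for rp in recovery_points]
    return (now - max(dates)) if dates else None


def is_stalled(recovery_points, now, max_age=timedelta(hours=26)):
    """True when no recovery point was created within max_age."""
    age = newest_age(recovery_points, now)
    return age is None or age > max_age


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
points = [{"CreationDate": now - timedelta(hours=30)}]
print(is_stalled(points, now))
```

Run against the local vault it catches a stalled daily rule; run against the air-gap vault it catches a copy that stopped landing. An empty vault counts as stalled on purpose.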

Security notes

The recovery account is a trust boundary, not a folder. It must be a separate AWS account in a separate OU with no standing human access — break-glass only, via a heavily audited role assumed under MFA. The whole point is that the identity which can delete your last copy is not an identity an attacker can reach.
Compliance-mode Vault Lock is your anti-ransomware control. It makes recovery points immutable to every principal including root and AWS. Governance mode does not survive a privileged compromise; for the air-gap tier, compliance mode is the only acceptable choice.
Least privilege on the backup and restore roles. AWSBackupCentralRole gets AWSBackupServiceRolePolicyForBackup; the restore role gets ...ForRestores plus the scoped KMS actions — not backup:* or kms:*. The restore role’s CreateGrant must carry the GrantIsForAWSResource condition.
Dedicated CMKs with tightly scoped key policies. Only KeyAdmin holds kms:*; the Backup service and restore role get exactly the actions they need. Cross-Region copy uses a multi-Region key so data keys aren’t re-wrapped through extra principals.
SCP the deletion verbs at the OU. Deny backup:DeleteRecoveryPoint, backup:DeleteBackupVault, backup:PutBackupVaultAccessPolicy, kms:ScheduleKeyDeletion, and kms:DisableKey on the recovery OU, and ec2:DisableImage on production OUs. SCPs bound every principal in the member accounts they apply to, including root, which IAM policies do not.
Vault access policy is a second lock, scope it precisely. It permits backup:CopyIntoBackupVault for your PrincipalOrgID and nothing else — not a wildcard principal with broad actions. Audit it; an over-permissive resource policy is an exfiltration path.
Audit every action on the recovery path. An organization CloudTrail to a central, locked S3 bucket captures who attempted what against vaults and keys. In the recovery account, any delete attempt is itself an alarm — the SCP denies it, but the attempt is signal.
Protect the evidence bucket too. The Audit Manager report S3 bucket gets Object Lock (WORM), an SCP-protected key, and bucket-policy isolation — evidence an attacker can rewrite is not evidence.
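The two SCPs named in these notes can be sketched as plain JSON builders. The action lists come straight from the notes above; the function name and Sids are illustrative, and the sketch deliberately carries no conditions or exemptions (for example for a break-glass role), which a real deployment would have to decide on.

```python
import json

# Deletion and key-disabling verbs for the recovery OU, as listed above.
RECOVERY_OU_DENIED_ACTIONS = [
    "backup:DeleteRecoveryPoint",
    "backup:DeleteBackupVault",
    "backup:PutBackupVaultAccessPolicy",
    "kms:ScheduleKeyDeletion",
    "kms:DisableKey",
]

# Soft-brick protection for production OUs.
PROD_OU_DENIED_ACTIONS = ["ec2:DisableImage"]


def deny_scp(actions, sid):
    """Build an SCP that denies exactly the given actions and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": sid, "Effect": "Deny", "Action": list(actions), "Resource": "*"}
        ],
    }


recovery_scp = deny_scp(RECOVERY_OU_DENIED_ACTIONS, "DenyRecoveryPathDeletion")
prod_scp = deny_scp(PROD_OU_DENIED_ACTIONS, "DenyDisableImage")
print(json.dumps(recovery_scp, indent=2))
```

Because the recovery SCP denies backup:PutBackupVaultAccessPolicy, the vault access policy has to be in place before this SCP is attached. The explicit action list, with no wildcards, is what keeps backup:CopyIntoBackupVault working.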

The security controls and what each defends against — note that “secure” and “recoverable” pull the same direction here:

Control	Mechanism	Defends against	Also enables
Separate recovery account	Distinct account + OU	Account-scoped compromise reaching backups	Clean blast-radius boundary
Compliance Vault Lock	`put-backup-vault-lock-configuration`	Ransomware/root deleting recovery points	Regulatory WORM evidence
Recovery-OU SCP	Deny Delete*/key-deletion	The exact door incidents walk through	Auditable “cannot be deleted” claim
Scoped KMS key policy	`CreateGrant` + `GrantIsForAWSResource`	Arbitrary key-use grants	Working cross-account restore
Vault access policy	`CopyIntoBackupVault` + `PrincipalOrgID`	Unauthorised copy-in / exfil	Cross-account copy landing
Organization CloudTrail	Org trail → locked S3	Tampering hiding deletion attempts	Forensic timeline of the recovery path
`ec2:DisableImage` SCP	Deny on prod OUs	Soft-bricking EC2/EBS restore	Guaranteed restorable AMIs

Cost & sizing

What drives the AWS Backup bill, and how each lever interacts with the architecture:

Warm vs cold recovery-point storage dominates. Warm storage is billed per GB-month at a rate comparable to the source snapshot; cold storage is dramatically cheaper but carries a 90-day minimum commitment and a small per-restore retrieval cost. The pattern that minimizes cost without hurting RTO: short warm local retention (fast restores), then cold for the long tail, with the long compliance retention living on the cheaper cold tier.
Cross-Region copy adds inter-Region data-transfer (egress) on every copied byte plus a second copy’s storage in the destination Region. This is the single largest incremental cost of the air-gap — and the one the CFO must understand is buying survivability, not redundancy for its own sake.
Cross-account copy within a Region has no inter-Region egress, but you still pay for the destination copy’s storage. The air-gap’s Region separation is what costs; the account separation is nearly free.
KMS adds ~₹85/key/month plus per-request charges; a handful of CMKs (one per vault account/Region, or a multi-Region key) is negligible against storage.
Restore testing costs a real restore each run — transient compute/storage for the restored resource during the validation window, then it’s torn down. Weekly tests of a few resources are a rounding error against the value of measured RTO.

A rough monthly picture for a mid-size estate (say 8 TB of warm Tier-1 data, 8 TB copied cross-Region, 7-year cold air-gap tail):

Cost driver	What you pay for	Rough INR / month	What it buys	Watch-out
Warm local storage (8 TB)	Fast-restore recovery points	~₹35,000–55,000	Low-RTO operational restores	Drops sharply once moved to cold
Cold storage (long tail)	Cheap long-retention RPs	~₹8,000–15,000	7-year compliance retention	90-day min commit; retrieval fee
Cross-Region copy egress	Inter-Region transfer of 8 TB	~₹40,000–60,000 (first copy build)	The air-gap (Region isolation)	Recurs on new data, not re-copied
Air-gap destination storage	Second copy in recovery Region	~₹35,000–55,000	Survives a regional event + compromise	Doubles storage for Tier-1
KMS keys	A few CMKs / one MRK	~₹300–800	Shareable, immutable-friendly encryption	Per-request charges at high volume
Restore testing	Transient restored resources	~₹2,000–5,000	Measured RTO/RPO evidence	Validation window compute

Free-tier note: there is no free tier for AWS Backup storage. The core control plane (policies, vaults, locks) carries no separate charge, so you pay mainly for stored bytes and copied bytes, not for the program’s machinery. The exception is AWS Backup Audit Manager, which is billed per control evaluation and relies on AWS Config recording that has its own charge. Right-size by tiering aggressively to cold, copying only the tiers that genuinely need the air-gap (your Tier-1, not everything), and letting the daily-snapshot retention, not the continuous tier, carry the long, cheap tail.

Interview & exam questions

1. Why is a per-account backup setup dangerous even when every account backs itself up correctly? Because the backups share a trust boundary with the thing that gets compromised. A single over-broad identity (a deploy role, a poisoned pipeline) that can manage vaults can delete every recovery point and the vault itself — and if the CMK is the same account’s, schedule it for deletion too. The fix is a separate recovery account, in another Region, with compliance-mode immutability and an SCP denying deletion.

2. What are the two distinct grants delegated administration in AWS Backup requires, and what breaks with only one? register-delegated-administrator lets the account author and manage backup policies; separately, cross-account monitoring (AWS Backup console → Settings) lets it see jobs in member accounts. With only the first, you get a policy author who is blind to execution — it looks configured but can’t observe whether anything is actually backing up.

3. A backup policy renders perfectly into a member account but every job fails. Most likely cause? The vault and the IAM role the policy references don’t exist in that account/Region — AWS Backup does not create them. Confirm with describe-effective-policy (policy is fine) versus a failed job’s StatusMessage (“role/vault not found”). Fix by bootstrapping the vault and AWSBackupCentralRole to every target OU/Region via a service-managed StackSet before attaching policies.

4. What single flag selects compliance mode for Vault Lock, and why does it matter? Including --changeable-for-days (with a grace window ≥3 days) selects compliance mode — immutable to every principal including root and AWS once LockDate passes. Omitting it gives governance mode, removable by sufficiently privileged IAM. Only compliance mode survives a root-equivalent compromise; it’s the anti-ransomware control.

5. Describe the “Always-retention in a compliance vault” foot-gun. A recovery point with indefinite (“Always”) retention inside a hardened compliance-locked vault can never have its retention shortened and never be deleted — so it bills forever, with the only escape being closing the AWS account. Prevent it by setting max-retention-days on the lock (which rejects indefinite-retention RPs). Never combine indefinite retention with a compliance lock.

6. Cross-account restore fails with “KMS key cannot be accessed” though the recovery point is intact. What’s missing? The destination CMK’s key policy lacks kms:CreateGrant (with the GrantIsForAWSResource condition) for backup.amazonaws.com and/or the restore role. AWS Backup creates a transient grant to decrypt during restore; without CreateGrant it can’t. Add Decrypt, GenerateDataKey, DescribeKey, and CreateGrant to the dest key policy; across Regions use a multi-Region key.

7. Why must the recovery vault use a dedicated CMK and not the aws/backup managed key? The AWS-managed aws/backup key cannot be shared cross-account, so the restore side in a different account can’t use it — cross-account restore is impossible. A dedicated customer-managed key has a key policy you control, letting you grant the Backup service and the restore role exactly the actions they need.

8. How do you make the cross-account copy actually land in the recovery vault? Two prerequisites: (a) a vault access policy on the destination allowing backup:CopyIntoBackupVault for your aws:PrincipalOrgID, and (b) isCrossAccountBackupEnabled turned on. A common failure is an over-broad recovery-OU SCP that denies CopyIntoBackupVault along with the deletion verbs; scope the SCP to only Delete*/key-deletion so copy-in still works.

9. What’s the difference between continuous (PITR) and snapshot protection, and how do you use both? Continuous/PITR (RDS, Aurora, DynamoDB, S3) restores to any second within a hard 35-day ceiling — your low-RPO tier. Snapshots restore to discrete points and support long retention — your long-term and ransomware tier. Run both rules in the same plan: continuous for low RPO, daily snapshot with long air-gapped retention for the long tail.

10. How does restore testing turn RTO/RPO from a claim into evidence? It runs real StartRestoreJob operations on a schedule against the air-gapped copies, measures completion time (your empirical RTO), optionally validates integrity via a Lambda, writes an immutable PutRestoreValidationResult, and tears the resource down. Audit Manager aggregates the durations and verdicts so an auditor sees timestamps and measured RTO, not a runbook asserting a number.

11. An EC2/EBS restore fails although the recovery point and lock are fine. What non-obvious cause should you check, and how do you prevent it? The source AMI has been disabled — the recovery point is intact but the resource isn’t restorable. Prevent it by an SCP denying ec2:DisableImage on production OUs, so an attacker can’t soft-brick the restore path without deleting anything; re-enable the AMI to restore.

12. Which SCP closes the exact door a compromised role walks through, and why an SCP rather than IAM? An SCP on the recovery OU denying backup:DeleteRecoveryPoint, backup:DeleteBackupVault, backup:PutBackupVaultAccessPolicy, kms:ScheduleKeyDeletion, and kms:DisableKey. SCPs bound every principal in the account including root, which IAM policies cannot — so even a fully compromised identity can’t delete the copies or schedule the key for deletion.

These map to AWS Certified Security – Specialty (data protection, key management, incident response), AWS Certified Solutions Architect – Professional (multi-account governance, DR, cross-Region/account design), and the Reliability pillar of the AWS Well-Architected Framework. A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Air-gap trust boundary, SCP deny	Security Specialty / SA Pro	Account isolation; preventive guardrails
Vault Lock compliance vs governance	Security Specialty	Data protection; WORM immutability
KMS key policy, `CreateGrant`	Security Specialty	Key management; cross-account access
Delegated admin, Organizations policy	SA Pro	Multi-account governance
Continuous vs snapshot, RTO/RPO	SA Pro / Well-Architected	DR design; reliability pillar
Restore testing + Audit Manager	Security Specialty / SA Pro	Recovery validation; auditability

Quick check

You ran register-delegated-administrator for AWS Backup and attached a policy, but the delegate’s console shows no backup jobs from member accounts. What second grant did you miss?
True or false: omitting --changeable-for-days when locking a vault gives you a stronger, immutable compliance-mode lock.
A cross-account restore fails with “KMS key cannot be accessed” even though the recovery point lists fine. Name the specific KMS permission that’s almost certainly missing.
Why must you bootstrap the backup vault and IAM role into a target account before attaching an Organizations backup policy to its OU?
Your continuous-backup recovery points keep disappearing at about 35 days. Is this a bug, and what do you add for long-term retention?

Answers

Cross-account monitoring (AWS Backup console → Settings). Delegation (register-delegated-administrator) only grants policy authoring; cross-account monitoring is the separate grant that lets the delegate see member-account jobs. Without it you have a policy author blind to execution.
False. Omitting --changeable-for-days gives governance mode: a sufficiently privileged IAM principal can remove the lock and then delete recovery points. Including --changeable-for-days (≥3) selects compliance mode, the immutable one that even root can’t undo once LockDate passes.
kms:CreateGrant (with the kms:GrantIsForAWSResource condition) on the destination CMK’s key policy, for backup.amazonaws.com and the restore role. AWS Backup creates a transient grant to decrypt during restore; without CreateGrant the restore is denied.
Because AWS Backup does not create the vault or the role — it only references them. If they don’t exist when a job fires, the job fails (“role/vault not found”) even though describe-effective-policy shows a perfect policy. A service-managed StackSet to the OU lands them before any account starts running jobs.
Not a bug — PITR/continuous backup has a hard 35-day retention ceiling by design. It’s your low-RPO tier; add a daily SNAPSHOT rule (with long, air-gapped retention) in the same plan to carry the long-term and ransomware tail.

Glossary

AWS Backup — a managed service that orchestrates backups across AWS services; recovery points physically live with each source service, wrapped in a vault and encrypted by a KMS key.
Backup plan — schedule + lifecycle + target vault + selection; when a rule fires it assumes an IAM role and snapshots the selected resources.
Organizations backup policy — declarative JSON (with @@assign inheritance operators) attached to an OU that renders into member accounts; targets resources by tag.
Backup vault — a logical container and permission boundary for recovery points; governs who can read, copy, and delete them (not a byte store like S3).
Logically air-gapped (LAG) vault — a distinct vault type with built-in cross-account/Region sharing and restore, used for the isolated recovery copy.
Recovery point (RP) — an encrypted pointer to a snapshot of a protected resource, held in a vault; the unit you restore.
Vault Lock — a WORM immutability configuration on a vault; compliance mode is immutable to all principals (including root) until RPs expire, governance mode is IAM-removable.
changeable_for_days — the grace window (min 3 days) before a compliance lock hardens; its presence selects compliance mode, and it’s your only undo until LockDate passes.
copy_actions — a rule block that fans a recovery point to another vault (another account/Region); has a lifecycle independent of the local rule.
Customer-managed key (CMK) — a KMS key with a policy you control; mandatory for cross-account restore because the managed aws/backup key can’t be shared.
kms:CreateGrant — the key-policy permission that lets AWS Backup create a transient decrypt grant during restore; with GrantIsForAWSResource, it’s load-bearing for cross-account restore.
Multi-Region key (MRK) — a KMS key replicated across Regions with shared material, avoiding data-key re-wrapping on cross-Region copies.
Delegated administrator — the account registered to author org-wide backup policies; needs the separate cross-account-monitoring grant to also view member jobs.
Cross-account monitoring — the AWS Backup setting (console → Settings) that lets the delegate observe jobs in member accounts; distinct from delegation.
Continuous backup / PITR — point-in-time recovery (RDS, Aurora, DynamoDB, S3) restoring to any second within a hard 35-day ceiling; the low-RPO tier.
Restore testing plan — a scheduled StartRestoreJob + optional validation + teardown that measures empirical RTO and proves recoverability.
PutRestoreValidationResult — the API that records an immutable integrity verdict for a restore test; the evidence an auditor sees.
Audit Manager (for Backup) — a framework of parameterized controls (retention, Vault-Lock coverage, encryption, restore-time, cross-Region/account copy) that exports compliance evidence to S3.
Service Control Policy (SCP) — an Organizations guardrail that bounds every principal in an account (including root); used to deny deletion verbs on the recovery OU.
RTO / RPO — Recovery Time Objective (how fast you recover, measured by restore testing) and Recovery Point Objective (how much data you can lose, set by protection mode).

Next steps

You can now stand up and prove an org-wide, air-gapped backup program. Build outward:

Foundation: AWS Organizations: SCP Guardrails & Delegated Admin — the delegation and SCP mechanics this entire program rides on.
Related: AWS Control Tower: Guardrails & Multi-Account Foundation — the landing zone that places accounts in the OUs your policies attach to.
Related: AWS KMS Deep Dive: Keys, Policies, Envelope Encryption & Rotation and KMS Multi-Region Keys — get the cross-account/Region key policy right so restores never fail.
Related: Ransomware Resilience: Immutable Backup & Isolated Recovery Environment — the threat model and isolated-recovery patterns behind the air-gap.
Related: Configure AWS Elastic Disaster Recovery (DRS): Cross-Region Failover — when you need server-level continuous replication and fast failover alongside point-in-time backups.
Related: Automate Cross-Account RDS & EBS Snapshot Copy with AWS Backup — a focused drill on the copy path for the two most common data services.
Capstone: AWS Zero to Hero: Well-Architected Landing Zone — where backup, governance, identity, and network come together.