Backups are not a feature you turn on; they are a control plane you operate. The failure mode that ends careers is not “we had no backups” — it is “we had backups, in the same account, encrypted with the same key, deletable by the same blast-radius identity that just got compromised.” Ransomware crews and fat-fingered automation both target the recovery path first. This guide builds that recovery path as a separate trust boundary: a delegated backup admin that authors policy, compliance-mode vaults that even your root user cannot empty, copies landing in a logically air-gapped account in another Region, and restore tests that prove RTO/RPO instead of asserting them. It assumes AWS Organizations with all features enabled and a multi-account landing zone underneath.
1. The account topology and where policy lives
Three planes, three accounts. Do not collapse them.
| Plane | Account | Responsibility |
|---|---|---|
| Policy authoring | Delegated backup admin | Owns Organizations backup policies, monitors jobs org-wide |
| Org control | Management account | Enables trusted access, registers the delegate, attaches policies to OUs |
| Recovery (air-gap) | Isolated recovery account | Receives cross-account copies into a logically air-gapped vault; ideally a separate OU with restrictive SCPs and no human standing access |
The management account stays thin — it enables the integration and attaches policies; day-to-day backup operations live in the delegated admin so you are not running consoles from the org root.
# Run in the MANAGEMENT account.
# 1) Trusted access so AWS Backup can act across the org.
aws organizations enable-aws-service-access \
--service-principal backup.amazonaws.com
# 2) Turn on the BACKUP_POLICY policy type (idempotent if already enabled).
ROOT_ID=$(aws organizations list-roots --query 'Roots[0].Id' --output text)
aws organizations enable-policy-type \
--root-id "$ROOT_ID" \
--policy-type BACKUP_POLICY
# 3) Register the backup admin account as delegated administrator.
aws organizations register-delegated-administrator \
--account-id 222222222222 \
--service-principal backup.amazonaws.com
Delegated administration in AWS Backup is two grants, not one. register-delegated-administrator lets the account manage backup policies; you separately enable cross-account monitoring in the AWS Backup console (Settings) so the delegate can see jobs in member accounts. Enable both, or you get a policy author who is blind to execution.
One prerequisite that bites everyone: the vault a policy references must already exist in every target account and Region, and the IAM role it references must exist in every target account (IAM roles are global, so one per account is enough), before a backup job runs. AWS Backup does not create either. Bootstrap them with a CloudFormation StackSet (service-managed permissions) targeting the OUs that receive policies, so vault + role land automatically when an account joins.
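Before attaching a policy, it can help to diff the accounts and Regions you intend to target against the vaults that actually exist. The sketch below is illustrative only: the inventory rows and account IDs are hypothetical, and how you collect them (for example by looping aws backup list-backup-vaults per account and Region) is left out.

```python
from itertools import product

def missing_vaults(accounts, regions, existing, vault_name="central-backup-vault"):
    """Return the (account, region) pairs that do not yet hold vault_name."""
    have = {(acct, region) for acct, region, name in existing if name == vault_name}
    return sorted(set(product(accounts, regions)) - have)

# Hypothetical inventory rows: (account ID, Region, vault name).
inventory = [
    ("111111111111", "ap-south-1", "central-backup-vault"),
    ("111111111111", "us-east-1", "central-backup-vault"),
    ("333333333333", "ap-south-1", "central-backup-vault"),
]
gaps = missing_vaults(
    ["111111111111", "333333333333"], ["ap-south-1", "us-east-1"], inventory
)
print(gaps)  # pairs the StackSet still has to cover
```

An empty list means every target pair has its vault; anything else is a backup job that would fail on first run.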
2. Defining backup policies with tag-based selection
An Organizations backup policy is JSON with three required keys per plan — regions, rules, selections — using declarative-policy inheritance operators (@@assign sets a value; child policies merge with or override parents). You target resources by tag, not ARN, so the same policy protects whatever a team launches as long as they tag it.
Save the following as pii-backup-policy.json. It runs daily, writes to the local vault, and fans out a copy to the recovery account in another Region. copy_actions is keyed by the destination vault ARN; $account / $region are placeholders AWS resolves per member account.
{
"plans": {
"Tier1_Daily": {
"regions": { "@@assign": ["ap-south-1", "us-east-1"] },
"rules": {
"DailySnapshot": {
"schedule_expression": { "@@assign": "cron(0 18 ? * * *)" },
"start_backup_window_minutes": { "@@assign": "60" },
"complete_backup_window_minutes": { "@@assign": "10080" },
"target_backup_vault_name": { "@@assign": "central-backup-vault" },
"lifecycle": {
"move_to_cold_storage_after_days": { "@@assign": "30" },
"delete_after_days": { "@@assign": "365" }
},
"copy_actions": {
"arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault": {
"target_backup_vault_arn": {
"@@assign": "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
},
"lifecycle": {
"move_to_cold_storage_after_days": { "@@assign": "30" },
"delete_after_days": { "@@assign": "2555" }
}
}
}
}
},
"selections": {
"tags": {
"tier1": {
"iam_role_arn": { "@@assign": "arn:aws:iam::$account:role/AWSBackupCentralRole" },
"tag_key": { "@@assign": "backup-tier" },
"tag_value": { "@@assign": ["tier1"] }
}
}
}
}
}
}
Create the policy in the delegated admin account and attach it to the OU that holds your production workloads:
POLICY_ID=$(aws organizations create-policy \
--name "Tier1-Daily-Backup" \
--type BACKUP_POLICY \
--content file://pii-backup-policy.json \
--query 'Policy.PolicySummary.Id' --output text)
aws organizations attach-policy \
--policy-id "$POLICY_ID" \
--target-id ou-abcd-prodaccts # the production OU
A few rules that keep this correct:
- Cold-storage minimum is 90 days. delete_after_days must be at least move_to_cold_storage_after_days + 90. Above, the copy retains 7 years (2555 days); set this to your regulatory floor.
- selections is either tags or resources, not both at the top level. Use the resources block with a conditions clause if you need resource-type plus tag logic.
- Local-rule and copy_action lifecycles are independent. The common pattern is short local retention (fast, cheap restore) and long air-gapped retention (compliance + ransomware survivability).
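The cold-storage rule is easy to check mechanically before a policy is created. Here is a minimal sketch, assuming the policy has been loaded into a Python dict with the same shape as pii-backup-policy.json; the function names are illustrative, not part of any AWS tooling, and the sample rule is deliberately wrong.

```python
COLD_MIN_DAYS = 90  # minimum stay in cold storage before deletion

def _days(lifecycle, key):
    """Read an '@@assign' value as an int, or None if the key is absent."""
    entry = lifecycle.get(key)
    return int(entry["@@assign"]) if entry else None

def lifecycle_errors(policy):
    """Return one message per rule or copy lifecycle that breaks cold + 90."""
    errors = []
    for plan_name, plan in policy.get("plans", {}).items():
        for rule_name, rule in plan.get("rules", {}).items():
            label = f"{plan_name}/{rule_name}"
            scopes = [(label, rule.get("lifecycle", {}))]
            for arn, copy in rule.get("copy_actions", {}).items():
                scopes.append((f"{label} -> {arn}", copy.get("lifecycle", {})))
            for where, lifecycle in scopes:
                cold = _days(lifecycle, "move_to_cold_storage_after_days")
                delete = _days(lifecycle, "delete_after_days")
                if cold is not None and delete is not None and delete < cold + COLD_MIN_DAYS:
                    errors.append(f"{where}: delete_after_days {delete} < {cold} + {COLD_MIN_DAYS}")
    return errors

AIRGAP_ARN = "arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"
policy = {"plans": {"Tier1_Daily": {"rules": {"DailySnapshot": {
    "lifecycle": {
        "move_to_cold_storage_after_days": {"@@assign": "30"},
        "delete_after_days": {"@@assign": "100"},  # deliberately too short
    },
    "copy_actions": {AIRGAP_ARN: {"lifecycle": {
        "move_to_cold_storage_after_days": {"@@assign": "30"},
        "delete_after_days": {"@@assign": "2555"},
    }}},
}}}}}
for problem in lifecycle_errors(policy):
    print(problem)
```

The local rule and the copy are checked separately because their lifecycles are independent; only the local rule is flagged here.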
3. Vault Lock in compliance mode (WORM immutability)
A vault you can delete is a vault an attacker can delete. Compliance-mode Vault Lock makes recovery points immutable until their lifecycle expires — no IAM principal, not the account root, not AWS, can shorten retention or delete the recovery point once the lock hardens.
The mode is decided by one flag. Including --changeable-for-days selects compliance mode; omitting it gives you governance mode (removable by sufficiently privileged IAM). The value is the cooling-off grace time: during it you can still adjust or delete the lock to fix mistakes. Minimum is 3 days (72 hours).
# Run in EACH account/Region that owns a vault (member accounts + recovery account).
aws backup put-backup-vault-lock-configuration \
--backup-vault-name central-backup-vault \
--changeable-for-days 3 \
--min-retention-days 30 \
--max-retention-days 2555
Treat the 3-day grace window as your only undo. Once LockDate passes, the configuration is set in stone. The classic foot-gun: a recovery point with retention set to “Always” inside a locked compliance vault is then un-deletable forever and bills forever. Never combine indefinite retention with compliance Vault Lock.
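One way to catch that foot-gun early is to compare every planned retention against the lock bounds before the grace window closes. A minimal sketch, using the 30 and 2555 day bounds from the command above and modelling “Always” retention as None (a convention of this sketch, not of the AWS API):

```python
def retention_problem(delete_after_days, min_days, max_days):
    """Return why a retention clashes with the lock, or None if it fits.

    delete_after_days=None stands for indefinite ("Always") retention.
    """
    if delete_after_days is None:
        return "indefinite retention must not be combined with compliance Vault Lock"
    if delete_after_days < min_days:
        return f"{delete_after_days}d is below the lock minimum of {min_days}d"
    if delete_after_days > max_days:
        return f"{delete_after_days}d exceeds the lock maximum of {max_days}d"
    return None

# Planned retentions to review: the local rule, the copy, a too-short rule, and "Always".
for days in (365, 2555, 7, None):
    print(days, "->", retention_problem(days, min_days=30, max_days=2555) or "ok")
```

Running this over every rule and copy lifecycle during the 3-day window is cheap; fixing a mismatch after LockDate is not possible.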
Verify the lock actually hardened — Locked: true plus a past LockDate is the only state that matters:
aws backup describe-backup-vault \
--backup-vault-name central-backup-vault \
--query '{Locked:Locked, LockDate:LockDate, Min:MinRetentionDays, Max:MaxRetentionDays}'
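The output of that command maps onto four states. The sketch below classifies them, assuming LockDate arrives as an ISO 8601 string that carries a UTC offset (adjust the parsing if your CLI timestamp format differs); a locked vault with no LockDate is treated as governance mode.

```python
from datetime import datetime, timezone

def lock_state(vault, now=None):
    """Classify describe-backup-vault output: unlocked, governance, grace, or hardened."""
    now = now or datetime.now(timezone.utc)
    if not vault.get("Locked"):
        return "unlocked"
    lock_date = vault.get("LockDate")
    if lock_date is None:
        return "governance"  # lock exists but can still be removed
    hardened = datetime.fromisoformat(lock_date) <= now
    return "hardened" if hardened else "grace"

# Fixed clock so the example is reproducible.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
samples = [
    {"Locked": False},
    {"Locked": True},
    {"Locked": True, "LockDate": "2025-01-12T00:00:00+00:00"},
    {"Locked": True, "LockDate": "2025-01-04T00:00:00+00:00"},
]
states = [lock_state(vault, now) for vault in samples]
print(states)
```

Only "hardened" counts as immutable; "grace" means someone with the right permissions can still delete the lock.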
Apply the same lock to airgap-recovery-vault in the recovery account. That vault is the one that actually has to survive a credential compromise in the workload accounts.
4. Cross-account, cross-Region copy into an air-gapped account
The copy_actions block in Step 2 already routes copies to 444444444444 in us-west-2. For that copy to land, two prerequisites must be in place on the destination side.
(a) Vault access policy on the destination — a resource policy that permits the source org/account to copy in:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowOrgCopyInto",
"Effect": "Allow",
"Principal": "*",
"Action": "backup:CopyIntoBackupVault",
"Resource": "*",
"Condition": {
"StringEquals": { "aws:PrincipalOrgID": "o-exampleorgid" }
}
}
]
}
aws backup put-backup-vault-access-policy \
--backup-vault-name airgap-recovery-vault \
--policy file://airgap-vault-policy.json # run in account 444444444444
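Because the policy uses a wildcard principal, the aws:PrincipalOrgID condition is the only thing keeping the vault closed to other organizations. A small lint can guard against someone dropping or mistyping it. This is a sketch under simplifying assumptions: it only inspects StringEquals and matches the condition key with the exact casing shown above.

```python
def unscoped_allows(policy, org_id):
    """Return Sids of wildcard-principal Allow statements not pinned to org_id."""
    flagged = []
    for stmt in policy.get("Statement", []):
        wildcard = stmt.get("Principal") in ("*", {"AWS": "*"})
        if stmt.get("Effect") != "Allow" or not wildcard:
            continue
        pinned = stmt.get("Condition", {}).get("StringEquals", {}).get("aws:PrincipalOrgID")
        if pinned != org_id:
            flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged

vault_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "AllowOrgCopyInto", "Effect": "Allow", "Principal": "*",
         "Action": "backup:CopyIntoBackupVault", "Resource": "*",
         "Condition": {"StringEquals": {"aws:PrincipalOrgID": "o-exampleorgid"}}},
        # Hypothetical bad statement: same grant with the condition forgotten.
        {"Sid": "OpenToWorld", "Effect": "Allow", "Principal": "*",
         "Action": "backup:CopyIntoBackupVault", "Resource": "*"},
    ],
}
print(unscoped_allows(vault_policy, "o-exampleorgid"))
```

Running a check like this in CI before put-backup-vault-access-policy keeps the open-principal pattern safe to use.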
(b) Logically air-gapped vault, not a standard vault. A logically air-gapped vault is a distinct AWS Backup vault type supporting direct sharing for cross-account/cross-Region restore without you hand-rolling key policy on the restore side. Target it on a rule via target_logically_air_gapped_backup_vault_arn. As of 2025 these vaults support customer-managed KMS keys and can receive primary backups directly. Pair that isolation with restrictive SCPs on the recovery OU (deny backup:DeleteRecoveryPoint, backup:DeleteBackupVault, kms:ScheduleKeyDeletion) and no standing human role, and the copy survives even if every workload account is owned.
5. KMS keys and key policies for cross-account restore
Encrypt the central and recovery vaults with dedicated customer-managed keys — never the AWS-managed aws/backup key, which you cannot share cross-account. The restore side needs to use the key; encode that in the key policy in the recovery account.
{
"Version": "2012-10-17",
"Id": "airgap-backup-key",
"Statement": [
{
"Sid": "KeyAdmins",
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::444444444444:role/KeyAdmin" },
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "AllowBackupServiceUse",
"Effect": "Allow",
"Principal": { "Service": "backup.amazonaws.com" },
"Action": ["kms:Decrypt", "kms:GenerateDataKey", "kms:DescribeKey", "kms:CreateGrant"],
"Resource": "*"
},
{
"Sid": "AllowRestoreRoleDecrypt",
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::444444444444:role/RecoveryRestoreRole" },
"Action": ["kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant"],
"Resource": "*",
"Condition": {
"Bool": { "kms:GrantIsForAWSResource": "true" }
}
}
]
}
The kms:CreateGrant with the GrantIsForAWSResource condition is load-bearing: AWS Backup creates grants on your behalf during restore, and without it the restore fails with a “KMS key cannot be accessed” error that is maddening to debug. When you copy across Regions, a multi-Region KMS key avoids re-wrapping data keys on every hop.
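Since a missing grant only surfaces at restore time, it is worth asserting on the key policy itself. The following sketch checks that a given role has kms:Decrypt and kms:CreateGrant in one Allow statement gated by kms:GrantIsForAWSResource. It is a deliberately narrow check (exact action names, no wildcard expansion, no IAM-side policies), not a full policy evaluator.

```python
def _as_list(value):
    return value if isinstance(value, list) else [value]

def restore_role_ready(key_policy, role_arn):
    """True if role_arn may decrypt and create AWS-resource-only grants."""
    for stmt in key_policy.get("Statement", []):
        principal = stmt.get("Principal", {})
        principals = _as_list(principal.get("AWS", [])) if isinstance(principal, dict) else []
        actions = set(_as_list(stmt.get("Action", [])))
        gate = stmt.get("Condition", {}).get("Bool", {}).get("kms:GrantIsForAWSResource")
        if (stmt.get("Effect") == "Allow"
                and role_arn in principals
                and {"kms:Decrypt", "kms:CreateGrant"} <= actions
                and str(gate).lower() == "true"):
            return True
    return False

ROLE = "arn:aws:iam::444444444444:role/RecoveryRestoreRole"
key_policy = {"Statement": [{
    "Sid": "AllowRestoreRoleDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": ROLE},
    "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:CreateGrant"],
    "Resource": "*",
    "Condition": {"Bool": {"kms:GrantIsForAWSResource": "true"}},
}]}
print(restore_role_ready(key_policy, ROLE))
```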
6. Restore testing: prove RTO/RPO, do not assert it
An untested backup is a hypothesis. AWS Backup restore testing runs real StartRestoreJob operations on a schedule, measures completion time, optionally validates integrity, then tears down the restored resource. Build it in the recovery account so you exercise the air-gapped copies.
It is two API calls. First the plan (cadence + which recovery points are eligible):
aws backup create-restore-testing-plan --restore-testing-plan '{
"RestoreTestingPlanName": "weekly_tier1_dr",
"ScheduleExpression": "cron(0 7 ? * MON *)",
"StartWindowHours": 4,
"RecoveryPointSelection": {
"Algorithm": "LATEST_WITHIN_WINDOW",
"RecoveryPointTypes": ["SNAPSHOT"],
"SelectionWindowDays": 7,
"IncludeVaults": ["arn:aws:backup:us-west-2:444444444444:backup-vault:airgap-recovery-vault"]
}
}'
Then a selection per resource type (here, RDS), with a validation window so the resource survives long enough to integrity-check before cleanup:
aws backup create-restore-testing-selection \
--restore-testing-plan-name weekly_tier1_dr \
--restore-testing-selection '{
"RestoreTestingSelectionName": "rds_tier1",
"ProtectedResourceType": "RDS",
"IamRoleArn": "arn:aws:iam::444444444444:role/RecoveryRestoreRole",
"ValidationWindowHours": 4,
"ProtectedResourceArns": ["*"],
"ProtectedResourceConditions": {
"StringEquals": [{ "Key": "aws:ResourceTag/backup-tier", "Value": "tier1" }]
}
}'
For integrity beyond “did it boot”, wire an EventBridge rule on the restore-job-completed event to a Lambda that connects to the restored DB / reads canary objects from the restored bucket / checksums an EBS volume, then writes the verdict back:
aws backup put-restore-validation-result \
--restore-job-id "$RESTORE_JOB_ID" \
--validation-status SUCCESSFUL \
--validation-status-message "row-count and checksum match production canary"
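The Lambda side of that loop can stay small. Below is a hedged sketch of the handler logic with the AWS client injected, so it runs without credentials. The event field names (status, restoreJobId, createdResourceArn) are assumptions about the EventBridge payload and should be checked against a real event; check is a hypothetical callable you would write per resource type. In a real function, backup_client would be boto3.client("backup"), whose put_restore_validation_result takes the keyword arguments used here.

```python
def handle_restore_event(event, backup_client, check):
    """Validate a completed restore and record the verdict; return the status written."""
    detail = event["detail"]
    if detail.get("status") != "COMPLETED":
        return None  # the restore did not finish, so there is nothing to validate
    ok, message = check(detail["createdResourceArn"])
    status = "SUCCESSFUL" if ok else "FAILED"
    backup_client.put_restore_validation_result(
        RestoreJobId=detail["restoreJobId"],
        ValidationStatus=status,
        ValidationStatusMessage=message,
    )
    return status

class StubBackupClient:
    """Stand-in for boto3.client('backup') that records calls instead of sending them."""
    def __init__(self):
        self.calls = []

    def put_restore_validation_result(self, **kwargs):
        self.calls.append(kwargs)

# Hypothetical event and resource ARN, for illustration only.
sample_event = {"detail": {
    "status": "COMPLETED",
    "restoreJobId": "example-restore-job-id",
    "createdResourceArn": "arn:aws:rds:us-west-2:444444444444:db:restore-test",
}}
client = StubBackupClient()
verdict = handle_restore_event(sample_event, client, lambda arn: (True, "row count matches canary"))
print(verdict, client.calls)
```

Keeping the check as a separate callable lets one handler serve RDS, S3, and EBS validations with different probes.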
The job’s measured duration is your empirical RTO; the gap between the recovery point’s creation time and the incident is your empirical RPO. Track both in AWS Backup Audit Manager so an auditor sees evidence, not a runbook claiming a number. A validation status, once written, is immutable, so compute it correctly.
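Both numbers are plain timestamp arithmetic. A minimal sketch, assuming the timestamps are ISO 8601 strings with a UTC offset, taken from a restore job's CreationDate and CompletionDate and from the recovery point's creation time; the sample values are made up.

```python
from datetime import datetime, timedelta

def measured_rto(job):
    """Restore duration: CompletionDate minus CreationDate of one restore job."""
    started = datetime.fromisoformat(job["CreationDate"])
    finished = datetime.fromisoformat(job["CompletionDate"])
    return finished - started

def measured_rpo(recovery_point_created, incident_at):
    """Data-loss window: time between the recovery point and the incident."""
    return datetime.fromisoformat(incident_at) - datetime.fromisoformat(recovery_point_created)

job = {
    "CreationDate": "2025-01-06T07:05:00+00:00",
    "CompletionDate": "2025-01-06T09:46:00+00:00",
}
rto = measured_rto(job)
rpo = measured_rpo("2025-01-05T18:00:00+00:00", "2025-01-06T03:30:00+00:00")
print("RTO:", rto, "RPO:", rpo)
print("RTO within 4h target:", rto <= timedelta(hours=4))
```

Comparing each weekly result against the target turns the RTO from a belief into a time series.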
7. Per-service recovery: continuous vs snapshot
Restore mechanics differ by service. Match the protection mode to the workload’s RPO.
| Service | PITR / continuous? | Recovery notes |
|---|---|---|
| RDS / Aurora | Yes (continuous) | enable_continuous_backup: true → restore to any second within 35 days; otherwise snapshot restores to a new instance/cluster |
| DynamoDB | Yes (PITR) | Second-level restore within 35 days; AWS Backup also does full-table snapshots for longer retention |
| S3 | Yes (continuous) | Continuous backup enables point-in-time + item-level restore of objects/versions |
| EFS | Snapshot | Full or item-level file restore from recovery points |
| EBS | Snapshot | Restores as a new volume; AMI/EC2 restores can be blocked if the source AMI is disabled |
To enable point-in-time recovery in a policy, set it on the rule:
"rules": {
"ContinuousTier1": {
"schedule_expression": { "@@assign": "cron(0 */1 ? * * *)" },
"enable_continuous_backup": { "@@assign": "true" },
"target_backup_vault_name": { "@@assign": "central-backup-vault" },
"lifecycle": { "delete_after_days": { "@@assign": "35" } }
}
}
PITR-enabled (continuous) recovery points have a hard 35-day retention ceiling. They are your low-RPO tier; the daily snapshot rule with long air-gapped retention is your long-term and ransomware tier. Run both rules in the same plan.
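A quick guard for the 35-day ceiling, as a sketch over the rules block of a policy loaded into a dict. Flagging a continuous rule with no explicit delete_after_days is a choice made by this sketch (it forces the retention to be stated), not documented AWS behavior; BadContinuous is a hypothetical rule.

```python
PITR_MAX_DAYS = 35  # retention ceiling for continuous recovery points

def continuous_rule_errors(rules):
    """Return messages for continuous rules with missing or over-ceiling retention."""
    errors = []
    for name, rule in rules.items():
        flag = rule.get("enable_continuous_backup", {}).get("@@assign", "false")
        if str(flag).lower() != "true":
            continue  # snapshot rules are not bound by the ceiling
        days = rule.get("lifecycle", {}).get("delete_after_days", {}).get("@@assign")
        if days is None:
            errors.append(f"{name}: continuous rule has no explicit delete_after_days")
        elif int(days) > PITR_MAX_DAYS:
            errors.append(f"{name}: delete_after_days {days} exceeds {PITR_MAX_DAYS}")
    return errors

rules = {
    "ContinuousTier1": {
        "enable_continuous_backup": {"@@assign": "true"},
        "lifecycle": {"delete_after_days": {"@@assign": "35"}},
    },
    "BadContinuous": {
        "enable_continuous_backup": {"@@assign": "true"},
        "lifecycle": {"delete_after_days": {"@@assign": "90"}},
    },
    "DailySnapshot": {"lifecycle": {"delete_after_days": {"@@assign": "365"}}},
}
print(continuous_rule_errors(rules))
```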
One trap worth flagging: EC2/EBS restores fail if the underlying AMI has been disabled — the recovery point is intact and the lock holds, but the resource is not restorable. Block ec2:DisableImage via SCP on production OUs so an attacker cannot soft-brick your restore path without deleting anything.
8. Audit, drift, and evidence export
AWS Backup Audit Manager turns “are we compliant” into a queryable framework. It ships controls you parameterize — minimum retention enforced, resources protected by a plan, resources in a Vault-Lock-protected vault, cross-Region/cross-account copy present, last-recovery-point recency, and restore-time-meets-target (fed by Step 6).
aws backup create-framework \
--framework-name org_dr_framework \
--framework-controls '[
{
"ControlName": "BACKUP_RECOVERY_POINT_MINIMUM_RETENTION_CHECK",
"ControlInputParameters": [
{ "ParameterName": "requiredRetentionDays", "ParameterValue": "365" }
]
},
{ "ControlName": "BACKUP_RESOURCES_PROTECTED_BY_BACKUP_VAULT_LOCK" },
{ "ControlName": "BACKUP_RECOVERY_POINT_ENCRYPTED" }
]'
Schedule a report plan that drops compliance evidence (CSV/JSON) into S3 on a cadence — the artifact you hand auditors and the drift signal when a resource slips out of policy. For org-wide visibility, aggregate findings in the delegated admin via AWS Config aggregators or Security Hub rather than logging into each account.
Verify
Run this end-to-end before declaring victory. From the delegated admin unless noted.
# Delegate is registered for AWS Backup.
aws organizations list-delegated-administrators \
--service-principal backup.amazonaws.com \
--query 'DelegatedAdministrators[].Id'
# Effective policy actually rendered into a member account (run in that member).
aws organizations describe-effective-policy --policy-type BACKUP_POLICY
# Locks hardened (run per vault-owning account/Region).
aws backup describe-backup-vault --backup-vault-name central-backup-vault \
--query '{Locked:Locked, LockDate:LockDate}'
# Backup jobs are completing org-wide (requires cross-account monitoring enabled).
# --by-account-id "*" asks for jobs across the organization, not just the calling account.
aws backup list-backup-jobs --by-state COMPLETED --by-account-id "*" --max-results 5 \
--query 'BackupJobs[].{Res:ResourceType, Acct:AccountId, Done:CompletionDate}'
# Cross-account copies landed in the recovery account (run in 444444444444).
aws backup list-recovery-points-by-backup-vault \
--backup-vault-name airgap-recovery-vault \
--query 'RecoveryPoints[].{Arn:RecoveryPointArn, Status:Status}'
# Restore tests ran and passed. Done minus Started is the measured RTO.
aws backup list-restore-jobs --by-restore-testing-plan-arn "$PLAN_ARN" \
--query 'RestoreJobs[].{Id:RestoreJobId, Status:Status, Validation:ValidationStatus, Started:CreationDate, Done:CompletionDate}'
Locked: true with a past LockDate, recovery points present in airgap-recovery-vault, and restore jobs with ValidationStatus: SUCCESSFUL together mean the recovery path is real.
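Those three conditions can be folded into one gate for a pipeline. A sketch, assuming you have saved the JSON from describe-backup-vault, list-recovery-points-by-backup-vault, and list-restore-jobs and parsed it into Python objects; timestamps are assumed to be ISO 8601 strings with a UTC offset, and the sample data is made up.

```python
from datetime import datetime, timezone

def recovery_path_checks(vault, recovery_points, restore_jobs, now=None):
    """Return the three verification results as a dict of booleans."""
    now = now or datetime.now(timezone.utc)
    lock_date = vault.get("LockDate")
    lock_hardened = (
        bool(vault.get("Locked"))
        and lock_date is not None
        and datetime.fromisoformat(lock_date) <= now
    )
    copies_present = any(rp.get("Status") == "COMPLETED" for rp in recovery_points)
    restores_validated = bool(restore_jobs) and all(
        job.get("ValidationStatus") == "SUCCESSFUL" for job in restore_jobs
    )
    return {
        "lock_hardened": lock_hardened,
        "copies_present": copies_present,
        "restores_validated": restores_validated,
    }

now = datetime(2025, 1, 10, tzinfo=timezone.utc)
checks = recovery_path_checks(
    {"Locked": True, "LockDate": "2025-01-04T00:00:00+00:00"},
    [{"Status": "COMPLETED"}],
    [{"Status": "COMPLETED", "ValidationStatus": "SUCCESSFUL"}],
    now,
)
print(checks, "recovery path is real:", all(checks.values()))
```

Returning the individual results rather than a single boolean makes it obvious which leg failed when the gate goes red.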
Enterprise scenario
A fintech platform team I worked with ran a clean per-account AWS Backup setup: every workload account backed itself up to a local vault, lifecycle was correct, dashboards were green. Then a CI/CD pipeline with an over-broad deployment role in their staging account was compromised through a poisoned dependency. The attacker enumerated AWS Backup and, because the same role could manage vaults, began deleting recovery points and the vault itself. The local backups in that account were gone in minutes. Their auditor’s later question was sharper than the incident: prove the production copies could not have met the same fate.
The constraint: they could not put humans or break-glass roles in the recovery account (compliance forbade standing access to the immutable tier), yet they needed copies that survived a full account compromise and evidence the copies were both immutable and restorable.
The fix was three changes, not a re-platform. First, every Tier-1 plan got a copy_actions fan-out into a logically air-gapped vault in a separate recovery account in another Region, locked in compliance mode with min-retention-days matching their 7-year regulatory floor — so even a root-equivalent compromise could not shorten or delete those copies. Second, the recovery OU got an SCP denying the deletion verbs outright, closing the door the staging incident walked through:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Deny",
"Action": [
"backup:DeleteRecoveryPoint",
"backup:DeleteBackupVault",
"backup:PutBackupVaultAccessPolicy",
"kms:ScheduleKeyDeletion",
"kms:DisableKey"
],
"Resource": "*"
}]
}
Third — the part that satisfied the auditor — a weekly restore-testing plan in the recovery account restored the latest air-gapped RDS and S3 recovery points, a Lambda validated row counts and object checksums against a production canary, and PutRestoreValidationResult recorded an immutable pass with a measured restore duration. The Audit Manager report plan exported that evidence to a locked S3 bucket every Monday. When the auditor asked the sharp question, the answer was a CSV with timestamps, durations, and validation verdicts — not a slide deck. RTO went from “we believe under four hours” to “measured 2h41m on the last 11 consecutive weekly tests.”