Prevention is a probability game you will eventually lose. Recovery is the game you have to win every time. Modern ransomware crews do not encrypt and leave - they spend days living off the land, escalate to domain admin, and the first thing they do once they own the directory is find and destroy your backups. Veeam’s own incident data has shown that the backup repository is targeted in the large majority of attacks, and a meaningful fraction of victims lose some or all of their backups. If your last line of defense is reachable with the same credentials that just got compromised, you do not have a last line of defense.
This guide is about that last line: backups the attacker cannot delete, a control plane they cannot pivot into, and a clean room where you can actually restore without re-detonating the malware. I will use Azure and AWS primitives because most of you run there, but the architecture - assume the production identity plane is fully compromised, and design backwards from that - transfers to any platform.
1. Design for assume-breach: the 3-2-1-1-0 model
Start by writing down the threat model explicitly, because it changes every downstream decision: the adversary has domain admin, global admin, or the backup service’s own credentials, and they have had them for two weeks. Everything reachable from those identities is presumed gone. The job is to ensure something survives anyway.
The classic 3-2-1 rule is no longer sufficient against an active human adversary. The current bar is 3-2-1-1-0:
| Digit | Requirement | Why it matters under ransomware |
|---|---|---|
| 3 | Three copies of data | Survives correlated hardware/site loss |
| 2 | Two distinct media types | One ransomware family can’t reach both |
| 1 | One copy off-site | Survives site-wide compromise or destruction |
| 1 | One copy immutable or air-gapped | Survives an attacker with admin on the backup system |
| 0 | Zero errors on recovery verification | A backup you never restored is a hypothesis, not a backup |
The two digits that matter most here are the second 1 and the 0. Immutability defeats the deletion attack; verification defeats the silent-corruption attack (encrypted-in-place data backed up faithfully for 60 days, so your retention window is full of garbage).
The mental shift: stop asking “is my backup running” and start asking “if my backup admin account is the attacker, what still survives, and have I proven I can restore it?” If the answer to either half is uncertain, you have a reporting system, not a recovery system.
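To make that question concrete, here is a minimal sketch that scores an inventory of copies against the five digits. The `BackupCopy` fields, the copy names, and the reading of the "0" digit (at least one immutable copy has a passed restore test) are assumptions for illustration, not part of any vendor's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str              # e.g. "disk", "object-storage", "tape"
    offsite: bool           # stored outside the primary site
    immutable: bool         # locked WORM or air-gapped
    verified_restore: bool  # last restore test passed with zero errors


def check_3_2_1_1_0(copies: list[BackupCopy]) -> dict[str, bool]:
    """Return pass/fail for each digit of the 3-2-1-1-0 rule."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        # Assumption: the surviving (immutable) copy is the one that must be proven.
        "0_errors": any(c.immutable and c.verified_restore for c in copies),
    }


def survivors_if_admin_compromised(copies: list[BackupCopy]) -> list[BackupCopy]:
    """Copies that remain if the backup admin identity is the attacker."""
    return [c for c in copies if c.immutable]


# Hypothetical inventory: production counts as the first of the three copies.
inventory = [
    BackupCopy("production", "disk", offsite=False, immutable=False, verified_restore=False),
    BackupCopy("local-repo", "disk", offsite=False, immutable=False, verified_restore=True),
    BackupCopy("cyber-vault", "object-storage", offsite=True, immutable=True, verified_restore=True),
]
print(check_3_2_1_1_0(inventory))
print([c.name for c in survivors_if_admin_compromised(inventory)])
```

Removing the vault copy from the inventory fails every digit at once, which is the point: one immutable, verified copy is carrying the whole model.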
A second principle: the backup control plane must live in a separate identity and trust boundary from production. In Azure, that means a dedicated tenant or at minimum a separate management group with no inherited role assignments and its own break-glass accounts. In AWS, a separate account in a locked-down OU. If a single global admin can both run production and purge the vault, you are one identity away from total loss.
2. Implement immutable and air-gapped backups (WORM + soft-delete)
Immutability has to be enforced by the storage platform, below the layer the backup admin operates at. There are two complementary mechanisms, and you want both:
- Soft-delete: deleted data is retained, recoverable, for a window - defeats accidental and malicious delete operations.
- WORM / immutability lock: data physically cannot be modified or deleted until a time-based retention expires - defeats overwrite and re-encryption, and (in locked mode) cannot be shortened even by the storage admin.
Azure: Backup vault with immutability locked + soft-delete
On an Azure Recovery Services or Backup vault, enable immutability and then lock it. Unlocked immutability can be disabled by an attacker with vault-admin rights; locked immutability is irreversible - that irreversibility is the entire point.
# Enable immutability on a Recovery Services vault, then LOCK it.
# Unlocked = reversible (test here first). Locked = irreversible.
az backup vault update \
--resource-group rg-backup-tier0 \
--name rsv-prod-immutable \
--immutable-state Unlocked
# After validating retention/policies, escalate to Locked.
# This CANNOT be undone, including by a Global Admin or Owner.
az backup vault update \
--resource-group rg-backup-tier0 \
--name rsv-prod-immutable \
--immutable-state Locked
# Make soft-delete permanent with a 30-day window.
# (Multi-user authorization is configured separately, see step 3.)
az backup vault backup-properties set \
--resource-group rg-backup-tier0 \
--name rsv-prod-immutable \
--soft-delete-feature-state AlwaysOn \
--soft-delete-duration 30
AlwaysOn is deliberate: it means soft-delete can no longer be turned off, even by a vault admin. Combined with a locked immutability state, an attacker who fully owns the vault still cannot shorten retention or hard-delete a recovery point inside the window.
Azure: WORM on the storage account (immutability policy with legal hold or time-based)
For backups that land in blob storage (database dumps, archive tiers), use a version-level or container-level time-based retention policy and lock it:
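# Container-level policy: no special account settings are needed.
# Version-level WORM would additionally require blob versioning and
# version-level immutability support to be enabled.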
az storage container immutability-policy create \
--account-name stbackupworm \
--container-name db-archives \
--period 30 \
--allow-protected-append-writes true
# Lock the policy - after this, blobs are WORM for the retention window.
ETAG=$(az storage container immutability-policy show \
--account-name stbackupworm --container-name db-archives \
--query etag -o tsv)
az storage container immutability-policy lock \
--account-name stbackupworm --container-name db-archives \
--if-match "$ETAG"
allow-protected-append-writes lets backup software append to existing log blobs without being able to modify already-written data - useful for streaming backups, and it does not weaken the WORM guarantee on committed blocks.
AWS: S3 Object Lock in Compliance mode + MFA Delete
The AWS equivalent is S3 Object Lock in Compliance mode. Governance mode can be bypassed by a principal with s3:BypassGovernanceRetention; Compliance mode cannot be bypassed by anyone, including the root account, until retention expires. For ransomware resilience, use Compliance.
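# Enabling Object Lock at bucket creation is the simplest path; it also turns on versioning.
# (It can be enabled on an existing bucket too, but only if versioning is already on.)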
aws s3api create-bucket \
--bucket acme-cyber-recovery-vault \
--object-lock-enabled-for-bucket \
--region us-east-1
# Default retention: every new object is locked for 30 days, COMPLIANCE mode.
aws s3api put-object-lock-configuration \
--bucket acme-cyber-recovery-vault \
--object-lock-configuration '{
"ObjectLockEnabled": "Enabled",
"Rule": { "DefaultRetention": { "Mode": "COMPLIANCE", "Days": 30 } }
}'
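A configuration like this is worth re-checking on a schedule, since a bucket left in Governance mode or with a short default retention looks protected at a glance. The sketch below, offered as an assumption-laden illustration, inspects a dictionary shaped like the JSON that `aws s3api get-object-lock-configuration` returns; the function name and the 30-day floor are hypothetical choices.

```python
def object_lock_findings(response: dict, min_days: int = 30) -> list[str]:
    """Return a list of problems with a bucket's Object Lock configuration.

    `response` is expected to look like the output of
    `aws s3api get-object-lock-configuration`, parsed from JSON.
    An empty list means the configuration meets the bar.
    """
    cfg = response.get("ObjectLockConfiguration", {})
    findings = []

    if cfg.get("ObjectLockEnabled") != "Enabled":
        findings.append("Object Lock is not enabled")

    retention = cfg.get("Rule", {}).get("DefaultRetention", {})
    mode = retention.get("Mode")
    if mode != "COMPLIANCE":
        findings.append(f"default retention mode is {mode!r}, expected 'COMPLIANCE'")

    # DefaultRetention carries either Days or Years; treat a year as 365 days.
    days = retention.get("Days", 0) + 365 * retention.get("Years", 0)
    if days < min_days:
        findings.append(f"default retention is {days} days, below the {min_days}-day floor")

    return findings


good = {
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    }
}
weak = {
    "ObjectLockConfiguration": {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 7}},
    }
}
print(object_lock_findings(good))
print(object_lock_findings(weak))
```

Feeding it the parsed CLI output from a scheduled job turns "we set Compliance mode once" into something that is continuously asserted.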
The strongest pattern is air-gapped rather than merely immutable: a target with no inbound network path and no shared credentials. AWS Backup’s logically air-gapped vault is purpose-built for this - it is immutable by default, stored in an AWS-owned account outside your control plane, and shared cross-account only for restore. The principle on any platform: the backup target should be write-only from production and readable only from the recovery environment, never both at once from the same identity.
3. Harden the control plane: multi-user authorization and RBAC
Immutable storage stops deletion of data. It does nothing about an attacker who reconfigures the policy - shortens future retention, disables protection, or deletes the vault wholesale before locking takes effect. Destructive control-plane operations need a second human.
Azure Backup provides Multi-User Authorization (MUA) built on Resource Guard. You place a Resource Guard in a separate tenant or subscription that the backup admin has no access to. Critical operations (disable soft-delete, reduce retention, stop protection with delete data, modify MUA) then require a just-in-time PIM-approved role on the Resource Guard - so the backup admin alone cannot perform them.
# Resource Guard lives in a SEPARATE security subscription/tenant.
# Backup admins have ZERO standing access to this resource group.
az resource create \
--resource-group rg-security-guard \
--name rg-prod-backup-guard \
--resource-type "Microsoft.DataProtection/resourceGuards" \
--properties '{
"vaultCriticalOperationExclusionList": []
}' \
--api-version 2023-05-01
# Associate the vault with the Resource Guard. From now on, protected
# operations require a JIT role (granted via PIM) on the guard's scope.
# rsv-prod-immutable is a Recovery Services vault, so the mapping goes through
# `az backup vault`; a Backup vault uses the `az dataprotection` extension instead.
az backup vault resource-guard-mapping update \
--resource-group rg-backup-tier0 \
--name rsv-prod-immutable \
--resource-guard-id "/subscriptions/<sec-sub>/resourceGroups/rg-security-guard/providers/Microsoft.DataProtection/resourceGuards/rg-prod-backup-guard"
Empty vaultCriticalOperationExclusionList means nothing is excluded - every protected operation is gated. That is what you want for tier-0.
On the RBAC side, apply least privilege ruthlessly:
- Backup Operator can trigger backups/restores but cannot delete recovery points or change policy.
- Backup Contributor (policy/vault changes) is the dangerous role - it gets MUA-gated and PIM-eligible only, never standing.
- The recovery environment uses a separate identity with read access to the vault for restore, and no production access at all.
Treat the backup vault exactly like a tier-0 asset, because it is one. The richest target in your estate is not your database - it is the system that can simultaneously read every byte of it and delete the only way back.
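The role rules above are easy to state and easy to drift from, so they are worth checking mechanically. This is a minimal sketch over a hypothetical export of role assignments; the record fields (`principal`, `role`, `assignment`, `scope`) and the `prod/` scope prefix are assumptions for illustration and do not mirror the Azure API response shape.

```python
def rbac_violations(assignments: list[dict], recovery_principals: set[str]) -> list[str]:
    """Flag assignments that break the backup-tier least-privilege rules."""
    violations = []
    for a in assignments:
        # Rule: Backup Contributor must be PIM-eligible only, never standing.
        if a["role"] == "Backup Contributor" and a["assignment"] != "eligible":
            violations.append(
                f"{a['principal']}: standing Backup Contributor on {a['scope']}"
            )
        # Rule: the recovery identity has no production access at all.
        if a["principal"] in recovery_principals and a["scope"].startswith("prod/"):
            violations.append(
                f"{a['principal']}: recovery identity holds {a['role']} on {a['scope']}"
            )
    return violations


# Hypothetical export of role assignments.
assignments = [
    {"principal": "alice", "role": "Backup Operator", "assignment": "standing",
     "scope": "backup/rsv-prod-immutable"},
    {"principal": "bob", "role": "Backup Contributor", "assignment": "eligible",
     "scope": "backup/rsv-prod-immutable"},
    {"principal": "carol", "role": "Backup Contributor", "assignment": "standing",
     "scope": "backup/rsv-prod-immutable"},
    {"principal": "ire-restore", "role": "Reader", "assignment": "standing",
     "scope": "prod/rg-prod"},
]
for line in rbac_violations(assignments, recovery_principals={"ire-restore"}):
    print(line)
```

Here `carol` and `ire-restore` are flagged; `alice` and `bob` pass because an operator role may be standing and a contributor role is only eligible.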
4. Architect the isolated recovery environment (clean room)
Here is the failure mode that catches teams who did keep good backups: they restore a still-infected image straight back into production, the dormant payload re-detonates, and they are back to square one - now with the attacker tipped off. You need somewhere clean to restore into.
An Isolated Recovery Environment (IRE), also called a clean room or cyber recovery vault, is a network-isolated landing zone where you restore, inspect, and clean systems before promoting them back. Its non-negotiable properties:
- No inbound or outbound connectivity to production or the internet during restore (deny-all NSG/security-group, no peering, no default route).
- Its own identity plane - local accounts or a separate directory, never the production directory you assume is compromised.
- Its own forensic and AV/EDR tooling pre-staged inside, since you cannot pull it over the network mid-incident.
- Read-only restore access to the backup vault, granted just-in-time.
# Deny-all NSG for the clean-room subnet. No path to prod, no path out.
az network nsg create -g rg-ire -n nsg-cleanroom
az network nsg rule create -g rg-ire --nsg-name nsg-cleanroom \
-n deny-all-inbound --priority 4096 \
--direction Inbound --access Deny --protocol '*' \
--source-address-prefixes '*' --destination-address-prefixes '*' \
--destination-port-ranges '*'
az network nsg rule create -g rg-ire --nsg-name nsg-cleanroom \
-n deny-all-outbound --priority 4096 \
--direction Outbound --access Deny --protocol '*' \
--source-address-prefixes '*' --destination-address-prefixes '*' \
--destination-port-ranges '*'
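# The clean-room VNet has NO peering and NO route to the prod hub.
# Attach the NSG to the subnet at creation; an unattached NSG filters nothing.
az network vnet create -g rg-ire -n vnet-cleanroom \
--address-prefixes 10.250.0.0/16 \
--subnet-name snet-restore --subnet-prefixes 10.250.1.0/24 \
--network-security-group nsg-cleanroom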
Access into the clean room for operators is via a single hardened, monitored jump host (Azure Bastion or a bastion in a separate management subnet) - not a flat RDP rule from the corporate LAN. Once a system is restored, scanned, confirmed clean, and patched against the original entry vector, only then does it get promoted to a rebuilt production network. The clean room is also where you mount immutable recovery points read-only to extract just the data when the OS itself is untrustworthy.
5. Define recovery tiers, RPO/RTO, and a tier-0-first sequence
Not everything recovers at once, and trying to recover everything in parallel during a real incident guarantees you recover nothing on time. Classify applications into recovery tiers up front, and recover the dependencies before the things that depend on them.
| Tier | Examples | Target RTO | Target RPO | Backup cadence |
|---|---|---|---|---|
| Tier 0 (foundation) | AD/Entra, DNS, PKI, IPAM, the backup system itself | < 4 h | < 1 h | Continuous / hourly |
| Tier 1 (critical revenue) | Core DB, payments, primary app | < 8 h | < 1 h | Hourly + log shipping |
| Tier 2 (important) | Internal apps, reporting | < 24 h | < 4 h | 4-hourly |
| Tier 3 (deferrable) | Dev/test, archives | Best effort | 24 h | Daily |
The recovery sequence is dependency order, not business-priority order. A common mistake is restoring the revenue app first; it then can’t authenticate because Active Directory isn’t back, can’t resolve names because DNS isn’t back, and can’t validate certs because PKI isn’t back. Restore the foundation, validate it in the clean room, then layer critical workloads on top.
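Dependency order is a topological sort, and writing the dependency map down ahead of time is most of the work. The sketch below uses Python's standard `graphlib` to group systems into restore waves; the dependency map itself is hypothetical and simplified, so substitute your own.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each system lists what must be restored BEFORE it.
DEPENDS_ON = {
    "active-directory": set(),
    "dns": set(),
    "pki": {"active-directory"},
    "core-db": {"active-directory", "dns"},
    "payments-app": {"core-db", "pki", "dns"},
}


def restore_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group systems into waves; everything in a wave can restore in parallel.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(restore_waves(DEPENDS_ON), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With this map the foundation comes back in the first wave, the database and PKI in the second, and the revenue app only in the third, which matches the sequence described above. A `CycleError` at planning time is far cheaper than discovering a circular dependency mid-incident.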
Tier 0 deserves special handling: keep an AD forest recovery runbook that does not depend on any surviving production infrastructure. It should cover the System State (or Veeam AD-aware) backups of at least two DCs and the recovery sequence itself: seize the FSMO roles, reset the krbtgt password twice, and clean up the metadata of DCs that will not return. Microsoft publishes the canonical forest-recovery procedure; pre-stage it inside the IRE because you will not be able to download it when your domain is encrypted.
6. Validate backup integrity and detect tampering early
A backup you have not verified is Schrödinger’s recovery point. Two things must be automated: proving a restore works, and detecting that backups are being tampered with while there is still clean data behind the bad data.
Veeam’s SureBackup (or the equivalent in your stack) boots restored VMs in an isolated virtual lab, runs heartbeat/ping/application-level tests, and marks the recovery point verified - on a schedule, without touching production:
# Veeam SureBackup: schedule automated recovery verification in an
# isolated virtual lab. A pass = a restore you have actually performed.
# NOTE: cmdlet and parameter names differ between Veeam versions; confirm
# each one with Get-Command / Get-Help on your backup server before use.
$lab = Get-VSBVirtualLab -Name "CleanRoom-Lab"
$job = Get-VBRJob -Name "Tier1-Core-DB"
$app = New-VSBApplicationGroup -BackupJob $job
Add-VSBJob -Name "Verify-Tier1-Nightly" `
-VirtualLab $lab -ApplicationGroup $app `
-VirtualMachine $job
# Run the job. Starting it is not the same as passing it: check the
# resulting session and alert on anything other than success.
Start-VSBJob -Job (Get-VSBJob -Name "Verify-Tier1-Nightly")
For tamper detection, treat backup activity as a security signal. Sudden mass deletions, retention policy changes, soft-delete being disabled, or a spike in “backup failed” across many jobs are all early indicators that the adversary has reached the backup tier. Stream the control-plane logs to your SIEM and alert. In Azure, destructive vault operations are recorded in the subscription Activity Log, which you can export to Log Analytics and query through the AzureActivity table:
// Sentinel/Log Analytics: detect destructive operations against backup vaults.
AzureActivity
| where ResourceProviderValue in (
"MICROSOFT.RECOVERYSERVICES", "MICROSOFT.DATAPROTECTION")
| where OperationNameValue has_any (
"delete", "stopProtection", "softDelete", "immutabilitySettings",
"backupResourceGuardProxies")
| where ActivityStatusValue in ("Success", "Start")
| project TimeGenerated, Caller, OperationNameValue,
ActivityStatusValue, Resource, CallerIpAddress
| sort by TimeGenerated desc
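A query like this lists individual events; the "sudden mass deletion" signal is about rate. As an illustrative sketch, the function below flags any caller that performs a burst of destructive operations inside a sliding window. The ten-minute window, the threshold of five, and the `(timestamp, caller)` input shape are assumptions to tune against your own baseline.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def burst_callers(
    events: list[tuple[datetime, str]],
    window: timedelta = timedelta(minutes=10),
    threshold: int = 5,
) -> set[str]:
    """Return callers with at least `threshold` destructive ops within `window`.

    `events` holds (timestamp, caller) pairs that were already filtered down
    to destructive operations, e.g. by a SIEM query.
    """
    recent: dict[str, deque] = defaultdict(deque)
    flagged = set()
    for ts, caller in sorted(events):
        timestamps = recent[caller]
        timestamps.append(ts)
        # Drop events that have fallen out of the sliding window.
        while ts - timestamps[0] > window:
            timestamps.popleft()
        if len(timestamps) >= threshold:
            flagged.add(caller)
    return flagged


t0 = datetime(2026, 6, 8, 2, 0)
events = [(t0 + timedelta(minutes=i), "svc-backup") for i in range(5)]
events += [(t0, "alice"), (t0 + timedelta(hours=1), "alice")]
print(burst_callers(events))
```

In this sample the service account deleting five items in four minutes is flagged, while two deletions an hour apart by a human operator are not.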
Pair that with an immutability check that an attacker cannot alter: keep a content hash of each critical recovery point’s manifest written to a separate, append-only store (a different cloud, a WORM bucket), and reconcile periodically. If the vault claims a recovery point exists but its hash no longer matches your external ledger, you have detected tampering independent of the system being tampered with.
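One way to sketch such a ledger: hash a canonical form of each manifest, record it, and later compare against what the vault reports. The example below also chains each entry to the previous one so that edits to the ledger itself are detectable; that chaining is my addition rather than something the design above requires. The in-memory list stands in for the external append-only store, and all names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64


def manifest_hash(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the manifest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def _entry_hash(prev: str, rp_id: str, digest: str) -> str:
    return hashlib.sha256(f"{prev}|{rp_id}|{digest}".encode()).hexdigest()


def append_entry(ledger: list[dict], rp_id: str, manifest: dict) -> None:
    """Append a recovery point's manifest hash, chained to the prior entry."""
    prev = ledger[-1]["entry_hash"] if ledger else GENESIS
    digest = manifest_hash(manifest)
    ledger.append({
        "rp_id": rp_id,
        "manifest_hash": digest,
        "prev": prev,
        "entry_hash": _entry_hash(prev, rp_id, digest),
    })


def ledger_intact(ledger: list[dict]) -> bool:
    """Verify that no ledger entry was altered, removed, or reordered."""
    prev = GENESIS
    for e in ledger:
        if e["prev"] != prev:
            return False
        if e["entry_hash"] != _entry_hash(prev, e["rp_id"], e["manifest_hash"]):
            return False
        prev = e["entry_hash"]
    return True


def reconcile(ledger: list[dict], vault_manifests: dict[str, dict]) -> dict[str, str]:
    """Compare the ledger with what the vault currently reports."""
    report = {}
    for e in ledger:
        current = vault_manifests.get(e["rp_id"])
        if current is None:
            report[e["rp_id"]] = "missing"
        elif manifest_hash(current) != e["manifest_hash"]:
            report[e["rp_id"]] = "mismatch"
        else:
            report[e["rp_id"]] = "ok"
    return report


ledger: list[dict] = []
append_entry(ledger, "rp-001", {"vm": "vm-dc01", "size": 1024})
append_entry(ledger, "rp-002", {"vm": "vm-db01", "size": 4096})
vault_now = {"rp-001": {"vm": "vm-dc01", "size": 1024}}  # rp-002 has vanished
print(ledger_intact(ledger), reconcile(ledger, vault_now))
```

A "missing" or "mismatch" result is the alert condition; because the ledger lives outside the vault's trust boundary, an attacker who owns the vault cannot make both sides agree.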
7. Run recovery rehearsals and measure time-to-restore honestly
The number that matters in a board update is not “we have backups.” It is “we restored tier-0 and tier-1 in the clean room last quarter in 6 hours 40 minutes, against an 8-hour RTO.” You only have that number if you rehearse.
Run a full isolated recovery drill at least quarterly for tier-0/tier-1:
- Assume the production identity plane is gone - log in to the IRE with break-glass only.
- Restore AD/DNS/PKI from immutable points into the clean room.
- Restore the tier-1 application stack on top of the recovered foundation.
- Validate application-level health, not just “the VM booted.”
- Record wall-clock time per phase. The honest RTO includes decision time, approvals, and the inevitable stumbles - not just the restore-job duration.
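# Measure restore wall-clock per item and emit a CSV for the drill report.
# Real time-to-restore = decision + approval + restore + validate, not just this.
start=$(date +%s)
# restore-disks only submits the job and returns; capture the job name.
job=$(az backup restore restore-disks \
--resource-group rg-backup-tier0 --vault-name rsv-prod-immutable \
--container-name "IaasVMContainer;iaasvmcontainerv2;rg-prod;vm-dc01" \
--item-name "VM;iaasvmcontainerv2;rg-prod;vm-dc01" \
--rp-name "$RECOVERY_POINT" \
--storage-account stcleanroomstaging \
--target-resource-group rg-ire \
--query name -o tsv)
# Block until the job finishes so the timer covers the actual restore.
az backup job wait \
--resource-group rg-backup-tier0 --vault-name rsv-prod-immutable \
--name "$job"
end=$(date +%s)
printf 'vm-dc01,%s,%d\n' "$(date -u +%FT%TZ)" "$((end - start))" >> drill-rto.csv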
Track the trend over quarters. RTO should fall as the runbook tightens; if it is flat or rising, the rehearsal is theater. Also rehearse the ugly paths: the recovery point you wanted is corrupt and you fall back one generation; the IRE capacity is half of production and you have to triage which tier-1 systems come first.
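The gap between restore-job duration and honest time-to-restore is simple arithmetic, and it helps to compute it the same way every quarter. A minimal sketch follows; the phase breakdown and the earlier quarters' totals are made-up numbers, chosen only so that the latest drill adds up to the 6 hours 40 minutes against an 8-hour RTO used as the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillPhase:
    name: str
    minutes: int


def rto_report(phases: list[DrillPhase], rto_minutes: int) -> dict:
    """Summarize a drill: honest total, restore-only time, and RTO headroom."""
    total = sum(p.minutes for p in phases)
    restore_only = sum(p.minutes for p in phases if p.name == "restore")
    return {
        "total_minutes": total,
        "restore_only_minutes": restore_only,
        "headroom_minutes": rto_minutes - total,
        "rto_met": total <= rto_minutes,
    }


def trend_is_improving(totals_by_quarter: list[int]) -> bool:
    """True only if every quarter was strictly faster than the one before."""
    return all(b < a for a, b in zip(totals_by_quarter, totals_by_quarter[1:]))


# Hypothetical drill: 45 + 30 + 250 + 75 = 400 minutes = 6 h 40 min.
drill = [
    DrillPhase("decision", 45),
    DrillPhase("approval", 30),
    DrillPhase("restore", 250),
    DrillPhase("validate", 75),
]
print(rto_report(drill, rto_minutes=8 * 60))
print(trend_is_improving([520, 455, 400]))
```

Reporting only the 250 restore minutes would overstate headroom by two and a half hours; the report keeps both figures side by side so the difference stays visible.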
Enterprise scenario
A European logistics company running a roughly 400-VM VMware estate took a ransomware hit through a compromised VPN appliance. The crew had domain admin for nine days, and on detonation night they used the backup service account - which had vCenter admin - to delete the Veeam backup jobs and the primary repository before encrypting the VMs. Standard playbook.
What saved them was a control they had added eighteen months earlier after a tabletop exercise exposed exactly this gap: a Veeam hardened Linux repository on a dedicated box outside the Windows domain, where backup files are made immutable at the filesystem level for the retention period via the Linux immutable (i) file attribute on an XFS volume. The attacker’s domain credentials were useless against it: it had no domain trust, SSH access used single-use credentials managed out of band, and the repository service itself sets and clears the immutable flag. The account the backup server uses cannot clear that flag. Root on the box could, which is why root and SSH access to it had been locked down so tightly.
The constraint they hit during recovery was speed: restoring 400 VMs over the repository’s network link was projected at four days, blowing every RTO. They solved it by restoring in dependency-tier order into an isolated vSphere cluster (the clean room), bringing back DCs, DNS, and the four revenue-critical workloads first - about 30 VMs - and getting the business transacting in under twelve hours, then back-filling tier 2 and 3 over the following days. The immutable flag is what made the restore possible; the tiered clean-room sequence is what made it fast enough.
The single config that mattered:
# Hardened Linux repo: data made immutable at the filesystem layer.
# Veeam's repo service sets/clears the +i attr for you; the chattr line
# below only illustrates what it does. Non-root accounts cannot clear the
# flag, so root and SSH access to the box must stay locked down.
chattr +i /backups/veeam/Tier1-Core-DB/*.vbk
lsattr /backups/veeam/Tier1-Core-DB/ # ----i--------- on locked files
Post-incident, they added an external append-only hash ledger (step 6) so they would have detected the job-deletion attempt within minutes rather than at detonation, and moved the repo’s single-use credentials into a PAM vault in a separate tenant.
Verify
Before you call this resilient, confirm each guarantee holds against an admin-level adversary, not just on paper:
- Immutability is locked, not just enabled. Attempt to disable soft-delete or shorten retention as a vault admin in a test vault - it must fail. If it succeeds, you have unlocked immutability and the protection is theater.
- Deletion fails. As a principal with full vault/repo rights, try to delete a locked recovery point or object. It must be refused by the storage layer.
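# Should FAIL with AccessDenied while retention is active.
# Target a specific version: on a versioned bucket, a plain delete only adds
# a delete marker and succeeds. $VERSION_ID is a placeholder for a real version ID.
aws s3api delete-object --bucket acme-cyber-recovery-vault \
--key tier1/core-db/2026-06-08.vbk \
--version-id "$VERSION_ID"
- MUA gates destructive ops. Without a PIM-approved role on the Resource Guard, disabling protection-with-delete must be blocked.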
- The clean room is truly isolated. From a restored VM in the IRE, confirm no route to production or the internet:
Test-NetConnection <prod-dc> -Port 389
This check and an outbound HTTPS test must both fail.
- A real restore succeeded recently. SureBackup/verification shows a passed recovery for every tier-0 and tier-1 job within the last cycle. A green backup job is not a passed restore.
- Tamper detection fires. Trigger a test policy change and confirm the SIEM alert lands and pages the right rotation.
- The drill number exists. You can state last quarter’s measured tier-0+tier-1 time-to-restore and compare it to the committed RTO.