Every ransomware tabletop I have run ends at the same uncomfortable question: when the attacker has Backup Contributor on your subscription, what actually stops them from stopping backups, dropping retention to one day, and waiting you out? The honest answer for most tenants is “nothing.” Backups are the last line of defence, which makes the backup control plane the highest-value target in the blast radius. Modern attackers know this – they delete recovery points before they encrypt, because a clean restore turns a seven-figure extortion into a Tuesday-afternoon rebuild. Azure Backup – the platform service that schedules, stores and restores recovery points for Azure VMs, SQL/SAP in VMs, Blobs, Disks, PostgreSQL Flexible Server, Azure Files and AKS – ships four independent controls that together make the vault tamper-resistant even against an admin-level compromise: immutability locks the retention floor, multi-user authorization (MUA) puts destructive operations behind a second tenant’s approval, enhanced soft delete keeps deleted backups recoverable, and cross-region restore (CRR) gives you an out-of-region copy when the primary region is gone.
This is the deep dive on wiring all four correctly, in the right order, and proving they work. The emphasis throughout is sequencing and irreversibility: three of these controls have a one-way door (Locked immutability, AlwaysON soft delete, and the day-zero redundancy choice), and getting the order wrong leaves a gap an attacker walks through. We treat each control as a setting with a value matrix, a default, a “when to flip it”, a trade-off, and a gotcha – because “I enabled immutability” is not the same as “I locked it after a retention review”, and the difference is whether the lock is protection or a self-inflicted ten-year bill.
By the end you will stop trusting the configuration blade. You will know which vault type protects which workload, why GeoRedundant is a prerequisite you cannot retrofit after onboarding, exactly which operations a Resource Guard gates, how to undelete a maliciously deleted backup, and how to prove recoverability by booting from a secondary-region recovery point and timing it. Because this is a reference you will return to during an incident, every control, limit, error and recovery path is laid out as scannable tables – read the prose once, then keep the tables open when the pager goes off.
What problem this solves
The pain is concrete and it is always the same: backups are the one resource whose destruction is irreversible and whose owner – the backup admin – has exactly the standing rights an attacker wants. Standard RBAC does not save you here. Backup Contributor legitimately includes “stop protection and delete backup data”, “disable soft delete”, and “modify retention” – those are normal day-job operations. So a compromised CI service principal, a phished admin, or a malicious insider with that role can quietly demolish your recovery path before anyone notices the encryption, and standard role separation does nothing because the role is doing what it is designed to do.
What breaks without these controls: the recovery point that would have saved you is gone before you reach for it. The team discovers during the incident – the worst possible time – that “we have backups” meant “we had backups until the attacker, holding our own admin role, deleted them.” Soft delete was off or fixed at the old 14-day basic tier and was disabled in the same script. Immutability was never enabled, or was enabled-but-never-locked so the attacker disabled it first. The vault was LocallyRedundant so when the region had a real outage there was no second copy to restore from. Each of these is a five-minute configuration that nobody sequenced.
Who hits this: every platform team that centralises backup, every regulated estate (finance, health, public sector) that must demonstrate WORM (write-once-read-many) retention to an auditor, and every organisation that has done – or fears – a ransomware tabletop and asked the uncomfortable question above. It bites hardest where backup rights are inherited broadly (CI principals with Contributor at subscription scope), where the same team owns both the workload and the guard (separation of duties that is theatre), and where “protected” was assumed to mean “recoverable” without a single restore drill.
To frame the whole field before the deep dive, here is each control, the attack it defeats, the one-way-door risk, and where it is configured:
| Control | Attack it blocks | Reversible? | Configured on | Day-zero or anytime |
|---|---|---|---|---|
| Redundancy = GeoRedundant | Region-loss with no second copy | Only while 0 protected items | Vault backup-properties | Day-zero (locks after first item) |
| Cross-region restore (CRR) | “Region is down” becomes an outage | Flag toggles, needs GRS | Vault backup-properties | Day-zero (needs GRS first) |
| Immutability (Unlocked) | Delete-before-retention, retention cut | Yes (admin can disable) | Vault securitySettings | Anytime (soak here) |
| Immutability (Locked) | Same, but attacker-proof | No – irreversible | Vault securitySettings | After soak + retention review |
| Enhanced soft delete (AlwaysON) | Deleting backups to destroy recovery | No – can’t disable | Vault backup-properties | Anytime (extend only) |
| MUA via Resource Guard | Disabling any of the above | Yes (unmap the guard) | Cross-tenant guard mapping | Anytime (put guard cross-tenant) |
Learning objectives
By the end of this article you can:
- Choose the correct vault type – Recovery Services vault vs Backup vault – per workload, and explain what each protects and why they are not interchangeable.
- Set vault redundancy to GeoRedundant and enable the CRR flag before onboarding any item, and explain why this is a day-zero, can’t-retrofit decision.
- Enable vault immutability in the Unlocked soak state, soak it through a release cycle, then flip it to Locked after a retention review – understanding that Locked is irreversible.
- Configure enhanced soft delete to AlwaysON with a 14–180 day retention window, and recover a maliciously or accidentally deleted backup with undelete + resume.
- Stand up a Resource Guard in a separate tenant/subscription, map every production vault to it, and gate destructive operations behind PIM-activated, time-bound Backup Operator – so no standing access can self-approve.
- Build a defensible GFS backup policy, right-size instant-restore snapshot retention for cost, and understand why you must do this before locking the vault.
- Perform a cross-region restore from the paired region on demand, wire vault diagnostics to Log Analytics, alert on destructive operations, and prove recoverability with a timed restore drill.
Prerequisites & where this fits
You should already understand Azure Backup basics: a vault is the resource that holds backup policies and recovery points, and pricing is not a “rent the capacity” model like an App Service plan: you pay per protected instance and for the storage consumed. You should know how to run az in Cloud Shell and read JSON output, and understand that RBAC roles like Backup Contributor / Backup Operator / Backup Reader are scoped to a vault or its parent. Familiarity with RTO (recovery time objective) and RPO (recovery point objective), geo-paired regions, and the difference between LRS/ZRS/GRS storage redundancy helps a great deal.
This sits in the Backup, DR & Resilience track and is the security-hardening capstone for it. It assumes the storage-redundancy fundamentals from the Azure Storage Accounts Deep Dive and the data-protection model in Azure Blob Storage: lifecycle, immutability & soft delete. It builds directly on Azure Backup & Site Recovery Deep Dive for the protection mechanics, and the cross-region story pairs with Azure Site Recovery: zone-to-zone & region failover runbooks and the RTO/RPO framing in HA vs DR. The MUA pattern leans on Azure PIM for resources & groups and break-glass emergency access. For the broader pattern across clouds, see Ransomware resilience: immutable backup & isolated recovery environment.
A quick map of who owns and confirms each control during a hardening project, so you assign the work correctly:
| Layer | What lives here | Who usually owns it | What it defends |
|---|---|---|---|
| Vault redundancy / CRR | Storage replication, paired-region copy | Platform / backup squad | Region loss, out-of-region restore |
| Immutability | WORM retention floor | Backup squad + compliance | Delete-before-retention, retention cut |
| Soft delete | Deleted-item recovery window | Backup squad | Accidental/malicious deletion |
| Resource Guard (MUA) | Approval gate for destructive ops | Security team (separate tenant) | Insider / compromised-admin attack |
| PIM on the guard | Just-in-time Backup Operator | Identity / security team | Standing-access elimination |
| Diagnostics & alerts | Job logs, destructive-op alerts | Observability / SOC | Detection of attempted strips |
Core concepts
Five mental models make every later decision obvious.
The backup control plane is the attack surface, not the data plane. Attackers do not brute-force your encrypted recovery points; they use your own RBAC to delete them through the management API. Every control here defends the control plane: it makes a destructive management operation either impossible (immutability, soft-delete AlwaysON) or subject to out-of-band approval (MUA). The data is incidental; the operation is what you gate.
Three of these doors only open once. The day-zero redundancy choice (changeable only at zero protected items), Locked immutability, and AlwaysON soft delete are all one-way. This is deliberate – a control an admin can switch off is a control an attacker-as-admin can switch off. The cost of the one-way door is that you must get the value right before you walk through it: lock a 10-year retention by mistake and you pay for 10 years; that is the trade for tamper-resistance.
Immutability gates the destructive direction only. Vault immutability blocks operations that reduce protection of existing recovery points – deleting data before retention expires, shortening a policy’s retention, disabling soft delete. It never blocks creating new backups or extending retention. So immutability is not “freeze the vault”; it is “you can add and lengthen, you can never shorten or delete early.”
MUA is separation of duties, not a checkbox. A Resource Guard is a separate resource (Microsoft.DataProtection/resourceGuards) that you place where the backup admin has no permissions – ideally a different tenant owned by the security team. After you map a vault to it, the gated destructive operations require a just-in-time Backup Operator role on the guard, granted by the other team via PIM. If you put the guard in the same subscription the backup admin owns, the admin (or the attacker who became them) can self-approve, and you have built a speed bump, not a control.
“Protected” is not “recoverable” until you have restored. A green configuration blade and an untested restore is the oldest trap in DR. Cross-region restore in particular fails for boring reasons – the staging storage account or target resource group does not exist in the secondary region, the redundancy was never GeoRedundant, the CRR flag was never set. You only know you can recover after you have booted a VM from a secondary-region recovery point and recorded the RTO.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Recovery Services vault | Vault for VM / SQL-in-VM / SAP HANA / Files | Microsoft.RecoveryServices/vaults | The classic IaaS protection plane |
| Backup vault | Vault for Blob / Disk / PostgreSQL / AKS | Microsoft.DataProtection/backupVaults | The newer managed-data-store plane |
| Immutability | Blocks retention-reducing operations | Vault securitySettings | WORM floor; Locked is irreversible |
| Soft delete | Keeps deleted backups restorable 14–180d | Vault backup-properties | Recover from deletion; AlwaysON = can’t disable |
| Resource Guard | Approval gate for destructive ops | Microsoft.DataProtection/resourceGuards | MUA / separation of duties |
| MUA | Multi-user authorization | Vault ↔ guard mapping | Second-team approval for strips |
| Cross-region restore (CRR) | On-demand restore in paired region | Vault flag (needs GRS) | Region-loss recovery without failover |
| GeoRedundant (GRS) | 6 copies, paired-region async | Vault redundancy | Prerequisite for CRR |
| Instant restore | Local snapshot tier (1–5 days) | Backup policy | Fast same-region restore, snapshot cost |
| Recovery point (RP) | One restorable backup at a point in time | In the vault | The thing an attacker deletes |
| GFS | Grandfather-father-son retention ladder | Backup policy | Daily/weekly/monthly/yearly retention |
| Backup Contributor | RBAC role with destructive rights | Vault / parent scope | The role the attacker wants |
The hard limits and quotas worth committing to memory – the ones that shape design decisions:
| Limit | Value | Why it matters |
|---|---|---|
| Soft-delete retention | 14–180 days | Floor 14 (can’t go lower), ceiling 180 |
| Instant-restore snapshot retention | 1–5 days | Snapshot cost lever; default 2 |
| Azure VMs protectable per vault | ~1,000 | Shard large estates across vaults |
| Backup items per vault (all types) | ~2,000 | Plan vault topology for big fleets |
| Daily scheduled backups (enhanced) | up to 6/day | 4-hour minimum interval |
| Yearly retention max | 99 years | Effectively permanent once locked |
| Redundancy change | Only at 0 protected items | Day-zero decision, frozen after |
| Resource Guard protected ops | ~7 default | Some excludable, MUA-disable is not |
| Geo-replication (GRS) lag | up to several hours | Secondary RPs are not instant |
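The per-vault ceilings translate directly into a vault count for a large estate. A minimal sketch of that arithmetic follows. The function name `vaults_needed` and its parameters are hypothetical, and the ceilings are passed in rather than hard-coded, on the assumption that you confirm the current published limits before planning.

```python
import math


def vaults_needed(vm_count: int, other_items: int,
                  max_vms_per_vault: int, max_items_per_vault: int) -> int:
    """Smallest number of vaults that respects both per-vault ceilings.

    Illustrative only: the ceilings are caller-supplied assumptions,
    not values read from Azure.
    """
    if max_vms_per_vault <= 0 or max_items_per_vault <= 0:
        raise ValueError("per-vault limits must be positive")
    if vm_count < 0 or other_items < 0:
        raise ValueError("counts cannot be negative")
    # VMs count against the VM ceiling and against the all-items ceiling.
    by_vm_limit = math.ceil(vm_count / max_vms_per_vault)
    by_item_limit = math.ceil((vm_count + other_items) / max_items_per_vault)
    return max(by_vm_limit, by_item_limit)
```

Whichever ceiling binds first sets the shard count, so a fleet that is light on VMs but heavy on databases can still need several vaults.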
1. Recovery Services vault vs Backup vault, and what each protects
Azure has two vault resource types and they are not interchangeable. Picking the wrong one means re-onboarding workloads later, so get this right on day one. The split is historical: the Recovery Services vault is the original IaaS/in-guest protection plane; the Backup vault is the newer plane for managed data stores that arrived with the Data Protection API.
| Capability | Recovery Services vault | Backup vault |
|---|---|---|
| Resource type | Microsoft.RecoveryServices/vaults | Microsoft.DataProtection/backupVaults |
| Azure VMs | Yes (snapshot + vault) | No |
| SQL in Azure VM | Yes | No |
| SAP HANA in Azure VM | Yes | No |
| Azure Files | Yes (snapshot) | Yes (vaulted) |
| Azure Blobs | No | Yes (operational + vaulted) |
| Azure Managed Disks | No | Yes |
| Azure Database for PostgreSQL Flexible Server | No | Yes |
| AKS (cluster state + PV) | No | Yes |
| Immutability | Yes | Yes |
| MUA via Resource Guard | Yes | Yes |
| Enhanced soft delete | Yes | Yes |
| Cross-region restore | Yes (VM/SQL/HANA) | Yes (selected workloads) |
The rule of thumb: Recovery Services vault for the classic IaaS and in-guest workloads (VMs, SQL-in-VM, SAP HANA-in-VM, snapshot-based Azure Files), Backup vault for the newer managed-data-store estate (Blobs, Disks, PostgreSQL Flexible Server, AKS, vaulted Azure Files). Many platform teams run both, and that is expected – they are governed the same way for immutability and MUA, which is the whole point of this article. Map your estate before you create anything:
| Workload | Vault to use | Backup type | CRR available |
|---|---|---|---|
| Azure VM (Windows/Linux) | Recovery Services | Snapshot + vaulted | Yes |
| SQL Server in Azure VM | Recovery Services | Log + full/diff | Yes |
| SAP HANA in Azure VM | Recovery Services | HANA backint | Yes |
| Azure Files (snapshot) | Recovery Services | Share snapshot | No (snapshot is in-region) |
| Azure Files (vaulted) | Backup vault | Vaulted | Limited |
| Azure Blob (operational) | Backup vault | Operational (no data copy) | No |
| Azure Blob (vaulted) | Backup vault | Vaulted copy | Selected |
| Azure Managed Disk | Backup vault | Incremental snapshot | No |
| PostgreSQL Flexible Server | Backup vault | Vaulted | Selected |
| AKS | Backup vault | Cluster + PV | No |
| On-prem servers (MARS agent) | Recovery Services | File/folder + system state | No |
| On-prem VMs (MABS/DPM) | Recovery Services | Disk-to-disk-to-vault | No |
Now the day-zero properties. Create a Recovery Services vault and immediately set redundancy and CRR – storage redundancy is only changeable while the vault has zero protected items, so this is the first decision, not a later tuning step:
az backup vault create \
--resource-group rg-backup-prod \
--name rsv-prod-weu \
--location westeurope
# GeoRedundant + CrossRegionRestore enabled is the prerequisite for CRR.
# This MUST happen before you onboard the first item.
az backup vault backup-properties set \
--resource-group rg-backup-prod \
--name rsv-prod-weu \
--backup-storage-redundancy GeoRedundant \
--cross-region-restore-flag true
resource vault 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
name: 'rsv-prod-weu'
location: 'westeurope'
sku: { name: 'RS0', tier: 'Standard' }
identity: { type: 'SystemAssigned' } // for cross-tenant guard + CMK later
properties: {}
}
// Redundancy + CRR are set on the backup config sub-resource.
resource vaultConfig 'Microsoft.RecoveryServices/vaults/backupstorageconfig@2023-04-01' = {
parent: vault
name: 'vaultstorageconfig'
properties: {
storageModelType: 'GeoRedundant'
crossRegionRestoreFlag: true
}
}
The redundancy options, what they cost you, and what they protect against:
| Redundancy | Copies | Protects against | CRR support | Relative cost |
|---|---|---|---|---|
| LocallyRedundant (LRS) | 3 (one datacentre) | Disk/rack/node failure | No | Lowest |
| ZoneRedundant (ZRS) | 3 (across AZs) | Single-AZ loss in-region | No (no 2nd region) | Medium |
| GeoRedundant (GRS) | 6 (3 local + 3 paired) | Full region loss | Yes | Highest |
| Geo-Zone-Redundant (GZRS) | 6 (3 AZ-spread + 3 paired) | AZ loss and region loss | Storage-account only (not vault default) | Highest+ |
Cross-region restore requires GeoRedundant storage. It does not work with LocallyRedundant or ZoneRedundant. If you need both zone resilience and CRR, that is not a single setting – ZRS protects you within the region, GRS+CRR uses the geo-paired region. Decide which failure mode dominates your risk model before you onboard anything, because after the first protected item the redundancy is frozen.
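The day-zero rule can be expressed as a small pre-onboarding check. This is a sketch under stated assumptions: `crr_blockers` and `can_change_redundancy` are hypothetical helper names, and the inputs are values you would read from the vault yourself, not calls to any Azure SDK.

```python
def can_change_redundancy(protected_items: int) -> bool:
    """Redundancy is only changeable while the vault protects nothing."""
    return protected_items == 0


def crr_blockers(redundancy: str, crr_flag: bool, protected_items: int) -> list[str]:
    """List what still stands between this vault and a working CRR setup."""
    blockers = []
    if redundancy != "GeoRedundant":
        if can_change_redundancy(protected_items):
            blockers.append("set redundancy to GeoRedundant before onboarding")
        else:
            blockers.append("redundancy is frozen after the first protected item")
    if not crr_flag:
        blockers.append("enable the cross-region-restore flag")
    return blockers
```

An empty list means the two prerequisites described here are met. It says nothing about whether a restore actually succeeds, which only a drill can show.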
2. Immutable vaults: unlocked vs locked, and the operational trade-off
Vault immutability prevents operations that would reduce the protection of existing recovery points: deleting backup data before its retention expires, shortening retention in a policy, or disabling soft delete. It does not block creating new backups or extending retention – only the destructive direction is gated. This is the single most-misunderstood control: people enable it, feel safe, and never lock it – which means an admin (or attacker) can simply disable it and then delete.
There are two states, and the difference is whether you can ever go back:
| State | Protection active? | Can an admin disable it? | Attacker-proof? | Use it as |
|---|---|---|---|---|
| Disabled | No | n/a | No | Pre-hardening default |
| Unlocked | Yes | Yes | No | The soak / test period |
| Locked | Yes | No – irreversible | Yes | Final production state |
Enable it unlocked first via the vault’s securitySettings. With the CLI you patch the vault property:
# Step 1: enable immutability in the "Unlocked" state for a soak period.
az resource update \
--resource-group rg-backup-prod \
--name rsv-prod-weu \
--resource-type Microsoft.RecoveryServices/vaults \
--set properties.securitySettings.immutabilitySettings.state=Unlocked
Run unlocked for a release cycle or two. Confirm no automation breaks – the usual offenders are decommissioning pipelines that delete backups early, or policy-as-code that lowers retention. Once you are confident, lock it. In Bicep the locked state is explicit and intentional:
resource vault 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
name: 'rsv-prod-weu'
location: 'westeurope'
sku: { name: 'RS0', tier: 'Standard' }
properties: {
securitySettings: {
immutabilitySettings: {
// 'Locked' is irreversible. Deploy this only after soaking on 'Unlocked'.
state: 'Locked'
}
}
}
}
Exactly which operations immutability blocks once active – this is the contract, memorise it:
| Operation | Blocked by immutability? | Why |
|---|---|---|
| Create a new backup / recovery point | No | Adds protection |
| Extend retention in a policy | No | Lengthens protection |
| Stop protection, retain data | No | Data is kept |
| Delete a recovery point before retention expires | Yes | Reduces protection |
| Shorten retention duration in a policy | Yes | Reduces protection |
| Stop protection with delete data | Yes | Destroys protection |
| Disable soft delete | Yes | Removes the safety net |
| Reduce soft-delete retention | Yes | Shrinks the recovery window |
| Modify a policy to lower retention | Yes | Reduces protection of existing RPs |
| Change vault redundancy | n/a | Separately frozen after first item |
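The contract in the table reduces to one question: does the operation reduce protection? The sketch below encodes that rule so a decommissioning pipeline could be checked against it during the soak. The operation identifiers are hypothetical labels for the table rows, not Azure API operation names.

```python
from enum import Enum


class Effect(Enum):
    KEEPS = "adds or keeps protection"
    REDUCES = "reduces protection"


# Hypothetical labels for the rows of the table above.
OPERATIONS = {
    "create_recovery_point": Effect.KEEPS,
    "extend_retention": Effect.KEEPS,
    "stop_protection_retain_data": Effect.KEEPS,
    "delete_recovery_point_early": Effect.REDUCES,
    "shorten_retention": Effect.REDUCES,
    "stop_protection_delete_data": Effect.REDUCES,
    "disable_soft_delete": Effect.REDUCES,
    "reduce_soft_delete_retention": Effect.REDUCES,
}


def is_blocked(operation: str, immutability_state: str) -> bool:
    """True if immutability in the given state would reject the operation."""
    if immutability_state not in ("Disabled", "Unlocked", "Locked"):
        raise ValueError(f"unknown state: {immutability_state}")
    if immutability_state == "Disabled":
        return False
    # Unlocked and Locked block the same operations; they differ only in
    # whether the setting itself can still be switched off.
    return OPERATIONS[operation] is Effect.REDUCES
```

Running a pipeline's planned operations through `is_blocked` with the state set to Unlocked is one way to predict which automation would break before the lock is applied.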
The operational trade-off is real: once locked, you cannot shorten retention even for a legitimate cost-cutting exercise. If you set a 10-year policy by mistake and lock the vault, you pay for 10 years. Treat the lock like a production change-freeze decision – review every active policy’s retention before you flip it. The decision of when to move between states:
| If you are… | Immutability state | Because |
|---|---|---|
| Standing up a brand-new vault | Disabled → Unlocked same day | Start soaking immediately |
| Mid-soak, automation still being audited | Unlocked | You may need to disable if a pipeline breaks |
| Soaked clean, retention reviewed, compliance signed off | Locked | Now attacker-proof; one-way door accepted |
| Unsure whether a policy is over-long | Do not lock yet | Trim retention first; locking freezes it |
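The state progression and the lock gate can be modelled as a tiny state machine. This sketch follows the path recommended here, so it deliberately refuses a direct Disabled to Locked jump as a process choice. It makes no claim about which transitions the Azure API itself accepts, and the function names are hypothetical.

```python
# Transitions permitted by the process described in this article.
ALLOWED_TRANSITIONS = {
    "Disabled": {"Unlocked"},
    "Unlocked": {"Disabled", "Locked"},
    "Locked": set(),  # one-way door: nothing leads out
}


def can_transition(current: str, target: str) -> bool:
    """True if the process allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def ready_to_lock(soaked_clean: bool, retention_reviewed: bool,
                  compliance_signed_off: bool) -> bool:
    """All three gates must pass before the irreversible step."""
    return soaked_clean and retention_reviewed and compliance_signed_off
```

A deployment pipeline could refuse to apply the Locked state unless `ready_to_lock` returns true, which turns the review into an enforced step instead of a convention.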
3. Multi-user authorization with Resource Guard across tenants
Immutability stops you reducing protection on existing data. MUA stops the other class of attack: disabling soft delete, deleting the protection entirely, or removing immutability while it is still unlocked. It does this by requiring that destructive vault operations be authorized through a Resource Guard – a separate Microsoft.DataProtection/resourceGuards resource that you deliberately place where the backup admin has no permissions.
The architecture that actually resists insider compromise puts the Resource Guard in a different tenant (or at minimum a different subscription governed by a different team):
Tenant A (workload)                         Tenant B (security)
+-----------------------+                   +------------------------+
| Recovery Services     |   protected by    | Resource Guard         |
| vault                 |------------------>| (no Backup Operator    |
|                       |                   |  for Tenant A admins)  |
| Backup admin: full    |                   | Security admin: owns   |
| rights EXCEPT the     |                   | the guard, approves    |
| guard-protected ops   |                   | critical operations    |
+-----------------------+                   +------------------------+
Create the guard in the security tenant/subscription:
az dataprotection resource-guard create \
--resource-group rg-security-guards \
--name rg-prod-resourceguard \
--location westeurope
resource guard 'Microsoft.DataProtection/resourceGuards@2024-04-01' = {
name: 'rg-prod-resourceguard'
location: 'westeurope'
properties: {} // protects the default critical-operation set
}
By default the guard protects a set of critical operations. Knowing exactly which ones are gated – and which you can optionally exclude – is the difference between real MUA and a guard that protects nothing useful:
| Gated operation | Default protected? | Excludable? | Attack it stops |
|---|---|---|---|
| Disable MUA (remove the guard) | Yes | No | Attacker turning off the gate itself |
| Disable soft delete | Yes | Yes | Pre-deletion safety-net removal |
| Reduce soft-delete retention | Yes | Yes | Shrinking the recovery window |
| Disable immutability (while Unlocked) | Yes | Yes | Removing the WORM floor |
| Stop protection with delete data | Yes | Yes | Destroying recovery points |
| Modify / delete a backup policy | Yes | Yes | Retention tampering |
| Change passphrase (MARS agent) | Yes | Yes | Encryption-key theft for on-prem |
| Remove the Resource Guard mapping | Yes | No | Detaching the gate from the vault |
| Unregister a protected container | Yes | Yes | Orphaning recovery points |
Inspect and tune which operations are gated:
az dataprotection resource-guard list-protected-operations \
--resource-group rg-security-guards \
--name rg-prod-resourceguard \
--resource-type Microsoft.RecoveryServices/vaults
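When tuning exclusions, it helps to validate a proposed list against the table before touching the guard. A minimal sketch, assuming the nine rows above are the full default set. The identifiers are hypothetical labels for those rows, not the operation names the CLI returns.

```python
# Hypothetical labels for the gated operations listed in the table.
NON_EXCLUDABLE = {"disable_mua", "remove_guard_mapping"}
DEFAULT_PROTECTED = NON_EXCLUDABLE | {
    "disable_soft_delete",
    "reduce_soft_delete_retention",
    "disable_immutability",
    "stop_protection_delete_data",
    "modify_or_delete_policy",
    "change_mars_passphrase",
    "unregister_container",
}


def effective_protected(exclusions: set[str]) -> set[str]:
    """Return the operations still gated after applying the exclusions."""
    unknown = exclusions - DEFAULT_PROTECTED
    if unknown:
        raise ValueError(f"not a gated operation: {sorted(unknown)}")
    forbidden = exclusions & NON_EXCLUDABLE
    if forbidden:
        raise ValueError(f"cannot be excluded: {sorted(forbidden)}")
    return DEFAULT_PROTECTED - exclusions
```

For example, an estate with no MARS agents might exclude the passphrase change and keep everything else gated.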
Now associate the vault with the guard. The backup admin in Tenant A needs Reader on the guard (cross-tenant) to create the association, and after this is in place they can no longer perform the protected operations without a just-in-time approval from Tenant B:
# Run as the backup admin, authenticated to BOTH tenants.
az backup vault resource-guard-mapping update \
--resource-group rg-backup-prod \
--name rsv-prod-weu \
--resource-guard-id "/subscriptions/<security-sub>/resourceGroups/rg-security-guards/providers/Microsoft.DataProtection/resourceGuards/rg-prod-resourceguard" \
--tenant-id <security-tenant-id>
The operating model after association: when the backup team genuinely needs to perform a protected operation (say, retire a workload), the security team grants the backup operator’s identity a time-bound Backup Operator role on the Resource Guard via Microsoft Entra PIM (formerly Azure AD PIM), the operation is performed within the activation window, and the role expires. An attacker who has only compromised Tenant A cannot self-approve – they lack any standing access to the guard.
The roles involved, where they are assigned, and what each can do – get the scope wrong and you either break MUA or lock yourself out:
| Role | Assigned on | Held by | Purpose |
|---|---|---|---|
| Backup Contributor | Vault (Tenant A) | Backup squad (standing) | Day-job: configure, protect, restore |
| Reader | Resource Guard (Tenant B) | Backup admin (standing) | Create the vault↔guard mapping |
| Backup MUA Operator / Backup Operator | Resource Guard (Tenant B) | Backup admin (JIT via PIM only) | Approve a single destructive op in-window |
| Owner / User Access Admin | Resource Guard (Tenant B) | Security team only | Grant the JIT role; never Tenant A |
That separation of duties is the entire value of MUA. The placement decision is the whole control – if you co-locate the guard, you get nothing:
| Guard placement | Separation strength | An attacker-as-backup-admin can… | Verdict |
|---|---|---|---|
| Same subscription as vault | None | Self-grant Backup Operator on the guard | Theatre – do not do this |
| Different subscription, same tenant, same team | Weak | Escalate via tenant-level role | Better than nothing |
| Different subscription, same tenant, different team | Good | Nothing without the other team | Acceptable minimum |
| Different tenant, security team | Strong | Nothing – no cross-tenant standing access | Target architecture |
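The placement table can be applied mechanically across a fleet of vault-to-guard mappings. The sketch below is only a restatement of the four rows. `separation_strength` is a hypothetical name and the three booleans are facts you would establish from your own inventory.

```python
def separation_strength(same_tenant: bool, same_subscription: bool,
                        same_team: bool) -> str:
    """Classify a guard placement using the rows of the table above."""
    if not same_tenant:
        return "strong"  # no cross-tenant standing access
    if same_subscription:
        return "none"    # the backup admin can self-grant on the guard
    # Different subscription inside the same tenant.
    return "weak" if same_team else "good"
```

Anything that classifies below "good" is a candidate for moving the guard before the vault is treated as hardened.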
4. Enhanced soft delete and recovering from deletion
Soft delete keeps backup data retrievable after someone deletes a backup item or stops protection with “delete data.” Enhanced soft delete (the current model for Recovery Services vaults) makes the feature always-on and configurable: you set a retention between 14 and 180 days, and you can optionally make soft delete itself immutable (non-disablable). Basic soft delete was a fixed 14 days and could be turned off – enhanced is the one you want.
The two soft-delete generations side by side:
| Property | Basic soft delete | Enhanced soft delete |
|---|---|---|
| Retention | Fixed 14 days | Configurable 14–180 days |
| Can be disabled | Yes | Optional – AlwaysON makes it permanent |
| Cost during retention | Free | Free for 14 days, then charged |
| Applies to | Recovery Services vault | Recovery Services + Backup vault |
| Recommended | No | Yes |
The three soft-delete feature states and what each means operationally:
| State | Soft delete active? | Disablable? | When to use |
|---|---|---|---|
| Disable | No | n/a | Never in production |
| Enable | Yes | Yes (an admin can turn it off) | Soak period only |
| AlwaysON | Yes | No – irreversible | Production target |
Configure it:
# Configure enhanced soft delete to 30 days. AlwaysON makes it non-disablable.
az backup vault backup-properties set \
--resource-group rg-backup-prod \
--name rsv-prod-weu \
--soft-delete-feature-state AlwaysON \
--soft-delete-duration 30
resource vaultProps 'Microsoft.RecoveryServices/vaults/backupconfig@2023-04-01' = {
parent: vault
name: 'vaultconfig'
properties: {
enhancedSecurityState: 'Enabled'
softDeleteFeatureState: 'AlwaysON' // irreversible
softDeleteRetentionPeriodInDays: 30 // 14-180
}
}
AlwaysON is irreversible in the same spirit as locked immutability – you can extend the retention but never disable the feature. Combined with immutability and MUA, you now have three controls that an admin-level attacker cannot individually defeat: they cannot delete inside retention (immutability), cannot turn off soft delete (AlwaysON), and cannot disable any of it without the guard (MUA).
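One way to stop trusting the blade is to audit the vault's JSON directly. The sketch below checks a vault document for the end state described so far. The property paths (`securitySettings`, `softDeleteSettings`, `redundancySettings` and their children) are assumptions about the shape of the vault resource, so compare them against what `az backup vault show` returns in your environment before relying on the result.

```python
def hardening_gaps(vault: dict) -> list[str]:
    """Return the controls that are not yet in their final hardened state.

    The property paths below are an assumed shape of the vault resource.
    """
    props = vault.get("properties") or {}
    security = props.get("securitySettings") or {}
    immutability = security.get("immutabilitySettings") or {}
    soft_delete = security.get("softDeleteSettings") or {}
    redundancy = props.get("redundancySettings") or {}

    def matches(actual, expected: str) -> bool:
        # Case-insensitive, because casing differs between tools.
        return str(actual).lower() == expected.lower()

    gaps = []
    if not matches(immutability.get("state"), "Locked"):
        gaps.append("immutability is not Locked")
    if not matches(soft_delete.get("softDeleteState"), "AlwaysON"):
        gaps.append("soft delete is not AlwaysON")
    if not matches(security.get("multiUserAuthorization"), "Enabled"):
        gaps.append("no Resource Guard mapped (MUA)")
    if not matches(redundancy.get("standardTierStorageRedundancy"), "GeoRedundant"):
        gaps.append("redundancy is not GeoRedundant")
    if not matches(redundancy.get("crossRegionRestore"), "Enabled"):
        gaps.append("cross-region restore is not enabled")
    return gaps
```

Feeding it the parsed JSON for every production vault yields a per-vault gap list that can be tracked until it is empty.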
When a backup is deleted – maliciously or by a fat-fingered decommission script – the item moves to a soft-deleted state. Recovery is undelete-then-resume:
# List soft-deleted items.
az backup item list \
--resource-group rg-backup-prod \
--vault-name rsv-prod-weu \
--backup-management-type AzureIaasVM \
--query "[?properties.isScheduledForDeferredDelete].name" -o tsv
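# Undelete the soft-deleted item. It returns with protection stopped and data retained.
az backup protection undelete \
--resource-group rg-backup-prod \
--vault-name rsv-prod-weu \
--container-name <container> \
--item-name <vm-name> \
--backup-management-type AzureIaasVM \
--workload-type VM
# Then resume protection against a policy so scheduled backups restart.
az backup protection resume \
--resource-group rg-backup-prod \
--vault-name rsv-prod-weu \
--container-name <container> \
--item-name <vm-name> \
--policy-name policy-iaas-gfs \
--backup-management-type AzureIaasVM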
The deletion-state lifecycle, so you know what is recoverable and for how long:
| Item state | What happened | Recoverable? | Window | Action to recover |
|---|---|---|---|---|
| Protected | Normal, active backups | n/a | n/a | none |
| Stop protection, retain data | Backups paused, RPs kept | Yes | Until retention expires | Resume protection |
| Soft-deleted | Deleted but within soft-delete window | Yes | 14–180 days | undelete + resume |
| Permanently deleted | Soft-delete window expired or skipped | No | gone | Restore from CRR copy if any |
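During an incident the first question is how long the undelete window stays open. A minimal sketch of that date arithmetic follows, with hypothetical function names. It works at whole-day granularity, which is an assumption: the service's exact cut-off time within the final day is not modelled.

```python
from datetime import date, timedelta


def recoverable_until(deleted_on: date, retention_days: int) -> date:
    """First date on which the soft-deleted item is assumed to be gone."""
    if not 14 <= retention_days <= 180:
        raise ValueError("soft-delete retention must be 14-180 days")
    return deleted_on + timedelta(days=retention_days)


def is_recoverable(deleted_on: date, retention_days: int, today: date) -> bool:
    """True while the item is still inside the soft-delete window."""
    return deleted_on <= today < recoverable_until(deleted_on, retention_days)
```

With a 30-day window, an item deleted on 8 June stays recoverable through 7 July under this model, which is the deadline to put on the incident timeline.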
For Backup vaults (Blobs, Disks, PostgreSQL), the equivalent is configured through the vault’s softDeleteSettings with the same 14–180 day window, set via az dataprotection backup-vault update or the portal:
az dataprotection backup-vault update \
--resource-group rg-backup-prod \
--vault-name bv-prod-weu \
--soft-delete-state AlwaysOn \
--retention-duration-in-days 30
5. Backup policies, retention, and instant-restore snapshots
Policy is where retention lives, and retention is what immutability and MUA enforce. Build the policy deliberately. For Azure VMs, the instant restore tier keeps local snapshots (1–5 days) for fast restores that never touch vault storage, while GRS-replicated recovery points serve long-term and cross-region needs.
A defensible IaaS policy template – daily plus a weekly/monthly/yearly grandfather-father-son ladder:
{
"schedulePolicy": {
"schedulePolicyType": "SimpleSchedulePolicy",
"scheduleRunFrequency": "Daily",
"scheduleRunTimes": ["2026-06-08T01:00:00Z"]
},
"retentionPolicy": {
"retentionPolicyType": "LongTermRetentionPolicy",
"dailySchedule": { "retentionDuration": { "count": 30, "durationType": "Days" } },
"weeklySchedule": { "daysOfTheWeek": ["Sunday"], "retentionDuration": { "count": 12, "durationType": "Weeks" } },
"monthlySchedule": { "retentionScheduleFormatType": "Weekly", "retentionScheduleWeekly": { "daysOfTheWeek": ["Sunday"], "weeksOfTheMonth": ["First"] }, "retentionDuration": { "count": 36, "durationType": "Months" } },
"yearlySchedule": { "retentionScheduleFormatType": "Weekly", "monthsOfYear": ["January"], "retentionScheduleWeekly": { "daysOfTheWeek": ["Sunday"], "weeksOfTheMonth": ["First"] }, "retentionDuration": { "count": 7, "durationType": "Years" } }
},
"instantRpRetentionRangeInDays": 5,
"timeZone": "UTC"
}
az backup policy set \
--resource-group rg-backup-prod \
--vault-name rsv-prod-weu \
--name policy-iaas-gfs \
--policy @iaas-policy.json
The GFS ladder explained – each tier, its purpose, the typical count, and the cost driver:
| Tier | Frequency | Typical retention | Purpose | Cost driver |
|---|---|---|---|---|
| Instant restore | per backup | 1–5 days (snapshot) | Fast same-region restore | Snapshot storage in source sub |
| Daily | daily | 7–30 days | Operational recovery | Vault storage, churn |
| Weekly | 1/week | 4–12 weeks | Rollback past a bad week | Vault storage |
| Monthly | 1/month | 12–36 months | Monthly compliance points | Vault storage |
| Yearly | 1/year | 1–10 years | Long-term / audit WORM | Vault storage (locked = permanent) |
The retention limits and policy knobs that catch people:
| Setting | Range / default | When to change | Trade-off / gotcha |
|---|---|---|---|
| instantRpRetentionRangeInDays | 1–5, default 2 | Lower for large chatty VMs (cost) | Snapshot cost in source sub; short = slower same-region restore |
| Daily retention | up to 9999 days | Match operational RPO | Storage grows with churn × retention |
| Weekly/monthly/yearly | up to 99 years (yearly) | Compliance mandate | Locked immutability freezes this |
| Backup frequency | up to several/day (enhanced) | Tighter RPO | More RPs = more storage + snapshot cost |
| Time zone | any | Match maintenance window | Wrong TZ = backup during peak |
| Daily backups per policy (enhanced) | up to 6/day (4-hour min interval) | Tighter RPO on critical DBs | Snapshot + storage cost scales |
| Log backup frequency (SQL) | 15 min–24 h | RPO as low as 15 min for transactions | Storage churn; log chain integrity |
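The ranges in this table can be checked before a policy file ever reaches the vault. The sketch below is a hypothetical linter (`lint_policy` is not part of any Azure SDK) that reads the same JSON shape as the policy template in this section and reports values outside the documented limits. It only covers the three limits stated above, so treat it as a starting point rather than a complete validator.

```python
def lint_policy(policy):
    """Return a list of violations of the limits listed in the table above."""
    problems = []

    # Instant-restore snapshots: 1-5 days per the table.
    instant = policy.get("instantRpRetentionRangeInDays")
    if not isinstance(instant, int) or not 1 <= instant <= 5:
        problems.append(f"instantRpRetentionRangeInDays must be 1-5, got {instant!r}")

    retention = policy.get("retentionPolicy", {})

    def count_of(tier):
        # Missing tiers are allowed; they simply return None.
        return retention.get(tier, {}).get("retentionDuration", {}).get("count")

    daily = count_of("dailySchedule")
    if daily is not None and daily > 9999:
        problems.append(f"daily retention {daily} exceeds 9999 days")

    yearly = count_of("yearlySchedule")
    if yearly is not None and yearly > 99:
        problems.append(f"yearly retention {yearly} exceeds 99 years")

    return problems


sample = {
    "instantRpRetentionRangeInDays": 5,
    "retentionPolicy": {
        "dailySchedule": {"retentionDuration": {"count": 30, "durationType": "Days"}},
        "yearlySchedule": {"retentionDuration": {"count": 7, "durationType": "Years"}},
    },
}
print(lint_policy(sample))
```

Running a check like this in the pipeline that publishes policies catches out-of-range values at review time instead of at deployment time.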
Two retention facts that catch people:
- Instant restore snapshots (instantRpRetentionRangeInDays, max 5) live in the source subscription and incur snapshot storage cost. Lower it to 1–2 days for chatty, large VMs to control cost; raise it where fast same-region restore matters.
- Once the vault is locked-immutable, you cannot shorten any of these durations. Right-size the ladder against your real compliance requirement before locking, or you will overpay for years.
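A pre-lock retention review can be partly automated. The following sketch, which assumes the policy JSON shape used in this section and a hypothetical helper name, converts each tier to approximate years and reports any tier that exceeds the compliance mandate you pass in. The unit conversions are deliberately rough (365 days, 52 weeks, 12 months per year).

```python
# Approximate conversion of each durationType to years (an assumption for the audit).
UNITS_PER_YEAR = {"Days": 365, "Weeks": 52, "Months": 12, "Years": 1}


def tiers_exceeding_mandate(policy, mandate_years):
    """Return {tier_name: years} for every tier retained longer than the mandate."""
    over = {}
    for name, schedule in policy["retentionPolicy"].items():
        # Skip non-schedule entries such as "retentionPolicyType".
        if not isinstance(schedule, dict) or "retentionDuration" not in schedule:
            continue
        duration = schedule["retentionDuration"]
        years = duration["count"] / UNITS_PER_YEAR[duration["durationType"]]
        if years > mandate_years:
            over[name] = round(years, 2)
    return over


policy = {
    "retentionPolicy": {
        "retentionPolicyType": "LongTermRetentionPolicy",
        "dailySchedule": {"retentionDuration": {"count": 30, "durationType": "Days"}},
        "monthlySchedule": {"retentionDuration": {"count": 36, "durationType": "Months"}},
        "yearlySchedule": {"retentionDuration": {"count": 10, "durationType": "Years"}},
    }
}
print(tiers_exceeding_mandate(policy, mandate_years=7))
```

An empty result is the signal that the ladder fits the mandate; anything else should be trimmed while the vault is still Unlocked.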
6. Cross-region restore and zone-redundant storage
CRR lets you restore a VM, SQL-in-VM, or SAP HANA backup into the Azure-paired region without waiting for a regional failover or a Microsoft-declared outage – you choose to restore in the secondary on demand. It is the control that turns “the region is down” from an outage into a runbook. The prerequisites, in order:
| # | Prerequisite | Set where | When | If missing |
|---|---|---|---|---|
| 1 | Redundancy = GeoRedundant | Vault backup-properties | Day-zero, 0 items | No 2nd copy; can’t enable CRR |
| 2 | crossRegionRestore flag = true | Vault backup-properties | Day-zero (needs GRS) | Secondary RPs not exposed |
| 3 | Workload type supports CRR | n/a (VM/SQL/HANA only) | by design | Other types: no CRR |
| 4 | Staging storage account in secondary | Pre-provisioned | Before incident | Restore fails mid-incident |
| 5 | Target resource group in secondary | Pre-provisioned | Before incident | Nowhere to land disks |
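The prerequisite order in this table lends itself to a pre-flight check. Below is a small sketch that returns the unmet prerequisites in table order. The function name, its arguments and the workload labels (`VM`, `SQL`, `HANA`) are illustrative assumptions; how you collect the inputs (CLI, portal, inventory) is up to you.

```python
# Workload labels are hypothetical shorthand for the types the table lists as CRR-capable.
CRR_WORKLOADS = {"VM", "SQL", "HANA"}


def crr_blockers(redundancy, crr_flag, workload, staging_account_exists, target_rg_exists):
    """Return unmet CRR prerequisites, numbered as in the table above."""
    blockers = []
    if redundancy != "GeoRedundant":
        blockers.append("1: redundancy is not GeoRedundant")
    if not crr_flag:
        blockers.append("2: crossRegionRestore flag is not true")
    if workload not in CRR_WORKLOADS:
        blockers.append(f"3: workload {workload!r} does not support CRR")
    if not staging_account_exists:
        blockers.append("4: no staging storage account in the secondary region")
    if not target_rg_exists:
        blockers.append("5: no target resource group in the secondary region")
    return blockers


print(crr_blockers("GeoRedundant", True, "VM", True, True))
print(crr_blockers("ZoneRedundant", False, "Blob", False, True))
```

Because the first two items are day-zero decisions, a blocker numbered 1 or 2 on a vault that already holds items means a new vault, not a setting change.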
CRR and zone-redundant storage solve different problems and you cannot have both on one vault. The comparison that drives the day-zero choice:
| Dimension | ZoneRedundant (ZRS) | GeoRedundant (GRS) + CRR |
|---|---|---|
| Protects against | Single-AZ loss in primary region | Full primary-region loss |
| Second-region copy | None | Yes (paired region) |
| CRR (on-demand secondary restore) | No | Yes |
| Restore latency | In-region, fast | Cross-region, slower |
| Best when dominant risk is | Zone failure, low-latency in-region HA | Region outage, ransomware, compliance |
For a production vault whose dominant risk is a regional outage or ransomware, choose GeoRedundant + CRR. List the secondary-region recovery points and restore:
# Enumerate recovery points available in the SECONDARY (paired) region.
az backup recoverypoint list \
--resource-group rg-backup-prod \
--vault-name rsv-prod-weu \
--container-name <container> \
--item-name <vm-name> \
--backup-management-type AzureIaasVM \
--workload-type VM \
--use-secondary-region \
--query "[].{name:name, time:properties.recoveryPointTime}" -o table
# Restore disks into the secondary region from a secondary recovery point.
az backup restore restore-disks \
--resource-group rg-backup-prod \
--vault-name rsv-prod-weu \
--container-name <container> \
--item-name <vm-name> \
--rp-name <recovery-point-id> \
--use-secondary-region \
--target-resource-group rg-dr-northeurope \
--storage-account <staging-sa-in-secondary>
The restore lands disks in the secondary region; you then build the VM from those disks (or use the full-VM restore flow). Note the staging storage account and target resource group must already exist in the secondary region – pre-provision them as part of your DR landing zone, not during the incident. The common geo-pairs you will target:
| Primary region | Azure-paired secondary | Notes |
|---|---|---|
| West Europe | North Europe | Classic EU pair |
| North Europe | West Europe | Symmetric |
| East US | West US | US pair |
| Central India | South India | In-country pair (data residency) |
| Southeast Asia | East Asia | APAC pair |
| UK South | UK West | In-country pair |
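If your DR landing-zone tooling needs the secondary region programmatically, the table can be encoded as a lookup. This is a sketch that contains only the rows listed above, keyed by the lowercase region names the az CLI uses; the helper name is hypothetical, and any region not in the table raises rather than guessing.

```python
# Only the pairs from the table above; extend from the official pairing list as needed.
PAIRED_REGION = {
    "westeurope": "northeurope",
    "northeurope": "westeurope",
    "eastus": "westus",
    "centralindia": "southindia",
    "southeastasia": "eastasia",
    "uksouth": "ukwest",
}


def secondary_region(primary):
    """Return the paired region for a primary, accepting 'West Europe' or 'westeurope'."""
    key = primary.replace(" ", "").lower()
    if key not in PAIRED_REGION:
        raise LookupError(f"no pair recorded for region {primary!r}")
    return PAIRED_REGION[key]


print(secondary_region("West Europe"))
```

Failing loudly on an unknown region is deliberate: a silently wrong target region would only surface mid-restore.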
Verify – prove each control
Do not trust the configuration blade. Prove each control with a command and, for restore, with an actual recovery:
# 1. Immutability state is Locked.
az resource show \
--resource-group rg-backup-prod --name rsv-prod-weu \
--resource-type Microsoft.RecoveryServices/vaults \
--query "properties.securitySettings.immutabilitySettings.state" -o tsv
# Expect: Locked
# 2. Soft delete is AlwaysON with your retention; redundancy + CRR set.
az backup vault backup-properties show \
--resource-group rg-backup-prod --name rsv-prod-weu \
--query "{softDelete:softDeleteFeatureState, days:softDeleteRetentionPeriodInDays, redundancy:storageModelType, crr:crossRegionRestoreFlag}"
# Expect: AlwaysON / 30 / GeoRedundant / true
# 3. Resource Guard mapping exists.
az backup vault resource-guard-mapping show \
--resource-group rg-backup-prod --name rsv-prod-weu \
--query "properties.resourceGuardOperationDetails" -o table
// 4. In Log Analytics (vault diagnostics -> AddonAzureBackupJobs), list restore
// jobs from the last 7 days and confirm your secondary-region drill shows Completed.
AddonAzureBackupJobs
| where TimeGenerated > ago(7d)
| where BackupItemUniqueId != ""
| where JobOperation == "Restore"
| project TimeGenerated, JobStatus, JobOperation, BackupManagementType, JobUniqueId
| order by TimeGenerated desc
The fourth check is the one that matters. A green config and an untested restore is exactly the trap from the ASR world: “protected” is not “recoverable” until you have booted from a secondary-region recovery point and timed it. The verification matrix you run before signing off:
| Control | Confirm command / path | Expected | If wrong |
|---|---|---|---|
| Redundancy | backup-properties show → storageModelType | GeoRedundant | Recreate vault (frozen after items) |
| CRR flag | backup-properties show → crossRegionRestoreFlag | true | Set flag (needs GRS) |
| Immutability | resource show → immutabilitySettings.state | Locked | Soak then lock |
| Soft delete | backup-properties show → softDeleteFeatureState | AlwaysON | Set AlwaysON |
| MUA mapping | resource-guard-mapping show | Guard present | Re-map cross-tenant guard |
| Restore proof | AddonAzureBackupJobs Restore in 7d | Completed | Run a real drill |
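The matrix can also be evaluated mechanically once you have collected the observed values. The sketch below is one possible shape: the keys `immutabilityState`, `resourceGuardMapped` and `restoreDrillCompleted` are hypothetical names for values you would gather from the commands above, and the remediation strings are copied from the "If wrong" column.

```python
# control key -> (expected value, remediation from the matrix)
EXPECTED = {
    "storageModelType": ("GeoRedundant", "Recreate vault (frozen after items)"),
    "crossRegionRestoreFlag": (True, "Set flag (needs GRS)"),
    "immutabilityState": ("Locked", "Soak then lock"),
    "softDeleteFeatureState": ("AlwaysON", "Set AlwaysON"),
    "resourceGuardMapped": (True, "Re-map cross-tenant guard"),
    "restoreDrillCompleted": (True, "Run a real drill"),
}


def failed_controls(observed):
    """Return {control: remediation} for every control that is missing or wrong."""
    return {
        key: fix
        for key, (want, fix) in EXPECTED.items()
        if observed.get(key) != want
    }


observed = {
    "storageModelType": "GeoRedundant",
    "crossRegionRestoreFlag": True,
    "immutabilityState": "Unlocked",
    "softDeleteFeatureState": "AlwaysON",
    "resourceGuardMapped": True,
}
print(failed_controls(observed))
```

A missing key counts as a failure, so an unrecorded restore drill blocks sign-off in the same way as a wrong setting.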
The az command cheat-sheet for this whole posture, in one place to keep open during operations:
| Task | Command (az …) |
|---|---|
| Set redundancy + CRR | backup vault backup-properties set --backup-storage-redundancy GeoRedundant --cross-region-restore-flag true |
| Enable soft delete AlwaysON | backup vault backup-properties set --soft-delete-feature-state AlwaysON --soft-delete-duration 30 |
| Enable immutability (Unlocked) | resource update --set properties.securitySettings.immutabilitySettings.state=Unlocked |
| Lock immutability (irreversible) | resource update --set properties.securitySettings.immutabilitySettings.state=Locked |
| Create Resource Guard | dataprotection resource-guard create -g <rg> -n <guard> -l <loc> |
| Map vault to guard | backup vault resource-guard-mapping update --resource-guard-id <id> |
| List soft-deleted items | backup item list --query "[?properties.isScheduledForDeferredDelete].name" |
| Undelete an item | backup protection undelete --container-name <c> --item-name <i> |
| List secondary-region RPs | backup recoverypoint list --use-secondary-region |
| Cross-region restore disks | backup restore restore-disks --use-secondary-region --target-resource-group <dr-rg> |
| Stream diagnostics | monitor diagnostic-settings create --workspace <law-id> --logs '[{"categoryGroup":"allLogs","enabled":true}]' |
7. Monitoring with Backup center, alerts, and Backup reports
Backup center is the single pane across every vault in the tenant – jobs, alerts, policy compliance, and security posture in one place. Even with MUA gating destructive operations, you want to be told the moment one is attempted, because an attempted strip is a detection signal. Two monitoring layers matter:
- Built-in Azure Monitor alerts for Azure Backup: enable the alert rules for backup failure and destructive operations (stop protection with delete data, disable soft delete, delete backup data). These fire on the same operations that MUA gates, so a blocked attempt still raises a signal.
- Backup reports: a Log Analytics-backed workbook for trends – protected instances, storage consumed, policy adherence, jobs over time. Wire vault diagnostics to a workspace to power it:
az monitor diagnostic-settings create \
--name backup-diag \
--resource "/subscriptions/<sub>/resourceGroups/rg-backup-prod/providers/Microsoft.RecoveryServices/vaults/rsv-prod-weu" \
--workspace "/subscriptions/<sub>/resourceGroups/rg-obs/providers/Microsoft.OperationalInsights/workspaces/law-platform" \
--logs '[{"categoryGroup":"allLogs","enabled":true}]'
Alert on the security-relevant operations specifically:
CoreAzureBackup
| where TimeGenerated > ago(1d)
| where OperationName has_any ("StopProtectionWithRetainData", "StopProtectionWithDeleteData", "DisableSoftDelete")
| project TimeGenerated, OperationName, BackupItemUniqueId, State
The alerts to wire, what they catch, and their severity:
| Alert | Fires on | Severity | Route to |
|---|---|---|---|
| Backup failure | Job status = Failed | Sev 2 | Backup squad |
| Stop protection + delete data | Destructive op attempted | Sev 0 | SOC + backup squad |
| Disable soft delete | Safety-net removal attempt | Sev 0 | SOC |
| Disable / reduce immutability | WORM floor tampering | Sev 0 | SOC |
| Restore started | Any restore job | Sev 3 (informational) | Backup squad |
| Resource Guard unmap attempt | MUA being disabled | Sev 0 | Security team |
| Delete backup data | RP deletion | Sev 1 | SOC + backup squad |
| Reduce retention in policy | Retention tampering | Sev 1 | Compliance + backup squad |
| GRS replication lag high | Secondary copy falling behind | Sev 2 | Backup squad |
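The routing in this table is easy to encode so that a webhook or automation can sort incoming alerts. The sketch below assumes the alert names arrive exactly as written in the table, and it treats unknown alert names as Sev 0 routed to the SOC; that fail-closed default is a design assumption of this example, not Azure behaviour.

```python
# alert name -> (severity, route), copied from the table above
ALERTS = {
    "Backup failure": (2, "Backup squad"),
    "Stop protection + delete data": (0, "SOC + backup squad"),
    "Disable soft delete": (0, "SOC"),
    "Disable / reduce immutability": (0, "SOC"),
    "Restore started": (3, "Backup squad"),
    "Resource Guard unmap attempt": (0, "Security team"),
    "Delete backup data": (1, "SOC + backup squad"),
    "Reduce retention in policy": (1, "Compliance + backup squad"),
    "GRS replication lag high": (2, "Backup squad"),
}


def triage(names):
    """Return (severity, name, route) tuples, most urgent first."""
    rows = []
    for name in names:
        # Assumption: anything unrecognised is escalated rather than dropped.
        severity, route = ALERTS.get(name, (0, "SOC"))
        rows.append((severity, name, route))
    return sorted(rows)


for row in triage(["Restore started", "Disable soft delete", "Backup failure"]):
    print(row)
```

Sorting by severity first keeps the Sev 0 tampering signals at the top of the queue even when routine failures arrive in the same batch.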
The diagnostic log categories worth streaming and what each powers:
| Category | Contains | Powers |
|---|---|---|
| CoreAzureBackup | Vault-level operations + state | Destructive-op alerting |
| AddonAzureBackupJobs | Job success/failure/duration | Restore-drill proof, SLA |
| AddonAzureBackupPolicy | Policy associations | Compliance reporting |
| AddonAzureBackupStorage | Storage consumed | Cost trend in Backup reports |
| AddonAzureBackupProtectedInstance | Protected-instance count | Billing reconciliation |
Architecture at a glance
The diagram traces the destructive path through all four controls, left to right, exactly as an attacker would attempt it and exactly as your hardening blocks it. On the left, the workload Tenant A holds the backup admin (standing Backup Contributor) and the ~900 protected items – VMs, SQL, Blob, Disk, PostgreSQL. The admin’s normal flow is the blue “protect / backup” arrow into the primary vault (West Europe), which is GeoRedundant with the CRR flag set. That vault carries two of the four controls stacked: the immutability WORM floor (state=Locked, badge 1) and enhanced soft delete (AlwaysON, 14–180 days, badge 2). When any destructive operation is attempted – the red “destructive op → authorize” arrow – it cannot complete inside Tenant A; it must round-trip to the security Tenant B, where the Resource Guard (badge 3) gates the five destructive operations and a PIM activation grants a just-in-time, time-bound Backup MUA Operator role that flows back as the green “JIT approval” arrow. There is no standing path for an attacker-as-admin to self-approve.
The right half is recovery and proof. The vault asynchronously replicates over the teal “GRS replicate” arrow to the paired region (North Europe), where the read-only secondary recovery points live and CRR (badge 4) restores disks into a pre-provisioned DR resource group and staging storage account. Finally everything – the destructive-operation attempts especially (badge 5) – streams as diagnostics to Log Analytics (CoreAzureBackup), where the Sev-0 destructive-operation alert fires the moment someone tries a strip, even though MUA already blocked it. Read the five legend numbers as the five things that must hold: immutability floor, soft-delete net, MUA approval, cross-region copy, and the detection alert. Any one of them standing alone can be defeated, and then the attacker wins; together and locked, they hold.
Real-world scenario
A European fintech platform team – call them Helvetia Pay – ran ~900 production VMs across two GeoRedundant Recovery Services vaults (West Europe primary, North Europe pair), governed by a central backup squad holding Backup Contributor on the landing-zone subscriptions. Their CI/CD platform used a service principal that, through role inheritance at subscription scope, also held Backup Contributor – a fact nobody had registered as a risk. A scheduled red-team exercise compromised that CI service principal via a leaked pipeline secret.
The red team’s playbook was textbook ransomware: before touching any workload, destroy the recovery path. The first phase report was damning. With only immutability enabled in the Unlocked state, the attacker path was: (1) disable immutability (it was never locked, because the team feared losing the ability to shorten retention), (2) disable soft delete in the same call, then (3) stop-protection-with-delete-data on the 40 crown-jewel VMs. Every one of those operations succeeded in the lab because the compromised principal held the rights and nothing gated them. The simulated blast radius: zero recoverable backups for the payment-processing tier, against a regulatory RPO of 24 hours. Had this been real ransomware, Helvetia Pay would have been choosing between paying and going out of business.
The fix was sequencing and separation, not new technology. Over a two-week hardening sprint they:
| Step | Action | Control hardened | Why this order |
|---|---|---|---|
| 1 | Audited every policy’s retention; trimmed 3 over-long yearly schedules from 10y to 7y | Cost / pre-lock hygiene | Locking freezes retention forever |
| 2 | Set enhanced soft delete to AlwaysON / 30 days on both vaults | Soft delete | Net must exist before lock |
| 3 | Flipped immutability to Locked on both vaults | Immutability | Now attacker-proof, retention reviewed |
| 4 | Stood up a Resource Guard in a separate security tenant, mapped both vaults | MUA | Removes self-approval entirely |
| 5 | Replaced CI’s standing Backup Contributor with PIM-activated, scoped roles | Least privilege | Kills the inherited-rights path |
| 6 | Wired vault diag → Log Analytics; enabled Sev-0 destructive-op alerts | Detection | Get paged on attempts |
| 7 | Ran a timed CRR drill into North Europe; recorded RTO = 42 min | Recoverability proof | “Protected” ≠ “recoverable” |
The re-run red team, holding the same compromised Backup Contributor in the workload tenant, was fully blocked. Their first destructive call failed authorization:
# Re-run attacker, holding Backup Contributor in the workload tenant, tries to
# strip protection. With the Resource Guard mapped, this FAILS authorization
# because the identity has no Backup MUA Operator role on the guard in Tenant B.
az backup protection disable \
--resource-group rg-backup-prod --vault-name rsv-prod-weu \
--container-name <c> --item-name <vm> \
--backup-management-type AzureIaasVM --delete-backup-data true
# -> ResourceGuard: operation requires authorization on the Resource Guard.
Better still, the attempt tripped the Sev-0 alert and the SOC saw it within seconds. The lesson the team wrote into their platform standard: these four controls are only worth anything combined and locked. Any single one, left unlocked or co-located with the admin who would be the attacker, is theatre. The total spend was roughly ₹40,000/month in extra soft-deleted and GRS storage across both vaults – trivially less than one hour of the outage they avoided.
Advantages and disadvantages
The hardened posture is not free – the irreversibility that makes it attacker-proof is the same property that bites you if you mis-size before locking. The honest trade-off:
| Advantages | Disadvantages |
|---|---|
| Survives an admin-level / insider compromise | Three controls are one-way doors (Locked, AlwaysON, redundancy) |
| Satisfies WORM / regulatory retention audits | Locked immutability freezes retention – mis-sizing = years of overpay |
| Deleted backups recoverable for up to 180 days | Soft-deleted RPs cost storage after the free 14 days |
| Out-of-region copy and on-demand CRR | GRS costs more than LRS/ZRS; no ZRS+CRR combo |
| MUA removes standing destructive rights | Cross-tenant guard adds operational friction (PIM round-trip) |
| Attempted strips are detected and alerted | Requires a second team / tenant to operate the guard |
| Recoverability is provable via timed drills | Drills take effort and a pre-built DR landing zone |
When each matters: the irreversibility is your friend in any regulated or high-extortion-risk estate – it is exactly what an auditor wants to see and exactly what defeats the attacker. It is your enemy only if you skip the soak and retention review, which is why the sequencing discipline (audit → soft delete → lock → MUA) is non-negotiable. The GRS cost premium matters most for very large, high-churn estates; for them, tune instant-restore retention down and consider operational-only Blob backup where a second copy is not mandated. The cross-tenant friction matters for small teams who do not have a separate security org – for them, a different-subscription-different-team guard is the pragmatic floor.
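The sequencing discipline (audit → soft delete → lock → MUA) can be enforced in a change pipeline. Here is a minimal sketch that takes the steps in the order they were actually performed and reports any step done before, or without, one of its predecessors. The step labels and the function name are hypothetical.

```python
from itertools import combinations

# Required order from the text: audit, then soft delete, then lock, then MUA.
REQUIRED_ORDER = ["audit", "soft_delete", "lock", "mua"]


def sequencing_errors(completed):
    """completed: step names in the order they were performed."""
    position = {step: index for index, step in enumerate(completed)}
    errors = []
    for earlier, later in combinations(REQUIRED_ORDER, 2):
        if later not in position:
            continue  # the later step has not happened yet, nothing to check
        if earlier not in position:
            errors.append(f"{later} done without {earlier}")
        elif position[earlier] > position[later]:
            errors.append(f"{later} done before {earlier}")
    return errors


print(sequencing_errors(["audit", "soft_delete", "lock", "mua"]))
print(sequencing_errors(["soft_delete", "lock", "audit"]))
```

The case to block hardest is any error that mentions `lock`, because that is the step you cannot undo once the order turns out to be wrong.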
Hands-on lab
This builds a fully hardened single-VM vault end to end, proves every control, then tears it down. It uses a B1s VM and minimal storage; the soft-deleted/GRS storage cost for an afternoon is negligible. You need an Azure subscription, the az CLI, and (for the MUA step) Owner on a second subscription to host the guard.
1. Resource group and a tiny VM to protect.
az group create -n rg-bkup-lab -l westeurope
az vm create -g rg-bkup-lab -n vm-lab --image Ubuntu2204 \
--size Standard_B1s --admin-username azureuser --generate-ssh-keys
2. Create the vault and set redundancy + CRR FIRST (day-zero).
az backup vault create -g rg-bkup-lab -n rsv-bkup-lab -l westeurope
az backup vault backup-properties set -g rg-bkup-lab -n rsv-bkup-lab \
--backup-storage-redundancy GeoRedundant --cross-region-restore-flag true
3. Enable enhanced soft delete (AlwaysON) and immutability Unlocked.
az backup vault backup-properties set -g rg-bkup-lab -n rsv-bkup-lab \
--soft-delete-feature-state AlwaysON --soft-delete-duration 14
az resource update -g rg-bkup-lab -n rsv-bkup-lab \
--resource-type Microsoft.RecoveryServices/vaults \
--set properties.securitySettings.immutabilitySettings.state=Unlocked
4. Protect the VM with the default IaaS policy and run a backup now.
az backup protection enable-for-vm -g rg-bkup-lab --vault-name rsv-bkup-lab \
--vm vm-lab --policy-name DefaultPolicy
az backup protection backup-now -g rg-bkup-lab --vault-name rsv-bkup-lab \
--container-name vm-lab --item-name vm-lab \
--backup-management-type AzureIaasVM \
--retain-until 30-06-2026
# Expected: a Backup job appears; wait for Completed.
5. Create a Resource Guard (in a second subscription) and map the vault.
az account set --subscription <security-sub-id>
az group create -n rg-guard-lab -l westeurope
az dataprotection resource-guard create -g rg-guard-lab -n guard-lab -l westeurope
GUARD_ID=$(az dataprotection resource-guard show -g rg-guard-lab -n guard-lab --query id -o tsv)
az account set --subscription <workload-sub-id>
az backup vault resource-guard-mapping update -g rg-bkup-lab \
--name rsv-bkup-lab --resource-guard-id "$GUARD_ID"
6. Prove every control (the verification matrix).
az resource show -g rg-bkup-lab -n rsv-bkup-lab \
--resource-type Microsoft.RecoveryServices/vaults \
--query "properties.securitySettings.immutabilitySettings.state" -o tsv # Unlocked
az backup vault backup-properties show -g rg-bkup-lab -n rsv-bkup-lab \
--query "{sd:softDeleteFeatureState, crr:crossRegionRestoreFlag, r:storageModelType}"
# Expect: AlwaysON / true / GeoRedundant
7. Test the MUA gate – this should be authorization-blocked.
# With the guard mapped and no PIM-activated Backup MUA Operator role on it, this FAILS:
az backup protection disable -g rg-bkup-lab --vault-name rsv-bkup-lab \
--container-name vm-lab --item-name vm-lab \
--backup-management-type AzureIaasVM --delete-backup-data true
# -> Expected: ResourceGuard authorization error. The gate works.
8. (Optional) Test soft-delete recovery. Stop protection with retain, delete the item, then list soft-deleted and undelete as shown in section 4.
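9. Teardown. Unmap the guard, disable protection (retain or delete in the lab), then delete both resource groups. Note that with Locked immutability you could not delete protected data early – which is why the lab uses Unlocked.
# Unmap the guard first; while it is mapped, the delete below is blocked (step 7).
# Unmapping is itself a gated operation, so run it with rights on the guard.
az backup vault resource-guard-mapping delete -g rg-bkup-lab -n rsv-bkup-lab
# Remove protection (lab uses Unlocked immutability so this is allowed).
az backup protection disable -g rg-bkup-lab --vault-name rsv-bkup-lab \
--container-name vm-lab --item-name vm-lab \
--backup-management-type AzureIaasVM --delete-backup-data true --yes
# With AlwaysON soft delete the deleted data is held for the 14-day window,
# so the vault may not delete cleanly until that window has passed.
az group delete -n rg-bkup-lab --yes --no-wait
az group delete -n rg-guard-lab --subscription <security-sub-id> --yes --no-wait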
Common mistakes & troubleshooting
The failure modes here are operational and they cluster around the irreversible doors and the cross-tenant plumbing. This is the playbook – symptom, root cause, the exact command or portal path to confirm, and the fix:
| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | Can’t enable CRR – flag won’t set | Redundancy is LRS/ZRS, not GRS | backup-properties show → storageModelType | Set GRS before items; if items exist, new vault |
| 2 | “Redundancy change not allowed” | Vault already has protected items | az backup item list (non-empty) | Recreate vault empty; redundancy is day-zero |
| 3 | Destructive op succeeds despite “immutability on” | Immutability is Unlocked, attacker disabled it first | resource show → immutabilitySettings.state = Unlocked | Lock it after soak + retention review |
| 4 | Can’t shorten an over-long retention | Vault is Locked immutable | state = Locked | None – right-size before locking |
| 5 | MUA op blocked even for a legit change | No PIM-activated Backup MUA Operator on the guard | resource-guard-mapping show | Security team grants JIT role for the window |
| 6 | Can’t map vault to guard | Backup admin lacks Reader on the guard (cross-tenant) | RBAC on the guard resource | Grant Reader on the guard in Tenant B |
| 7 | Soft-deleted item gone before expected | Soft delete was Enable (disablable) and got turned off | backup-properties show → softDeleteFeatureState | Set AlwaysON; can’t be disabled |
| 8 | CRR restore fails: “storage account not found” | No staging SA in secondary region | check secondary RG/SA exists | Pre-provision staging SA + target RG in pair |
| 9 | Secondary recovery points empty | GRS replication lag (up to hours) or CRR flag off | recoverypoint list --use-secondary-region | Wait for replication; confirm CRR flag true |
| 10 | Backup reports / KQL empty | Diagnostics not wired to a workspace | az monitor diagnostic-settings list on vault | Create diagnostic setting → Log Analytics |
| 11 | Destructive-op alert never fired | Built-in alert rules not enabled | Backup center → Alerts config | Enable failure + destructive-op alert rules |
| 12 | Locked the wrong (10y) retention | Skipped pre-lock retention review | yearly schedule count = 10 | None – this is the cautionary tale; audit first |
| 13 | Guard “protects nothing” | Excluded all operations when creating it | list-protected-operations (short list) | Re-add the critical ops to the guard |
| 14 | MUA bypassed in incident | Guard co-located in a subscription the backup admin owns | guard subscription = vault subscription | Move guard cross-tenant/cross-team |
The decision table for “which control failed me” during a real incident:
| If you see… | It’s probably… | Do this |
|---|---|---|
| Backups deleted despite “immutability” | Immutability was Unlocked, not Locked | Lock it everywhere; this is the #1 gap |
| A destructive op went through with no approval | No MUA, or guard co-located | Map a cross-tenant Resource Guard |
| Deleted backup unrecoverable after 5 days | Basic soft delete (14d) or it was disabled | Enhanced soft delete AlwaysON |
| No copy when the region went down | Vault was LRS/ZRS, not GRS | GRS + CRR (rebuild if items exist) |
| Restore worked but took 6 hours | No pre-built DR landing zone | Pre-provision staging SA + target RG |
Best practices
- Decide redundancy on day zero. GeoRedundant + CRR flag, set before the first protected item. You cannot change it afterwards without re-creating the vault and re-onboarding.
- Match the vault type to the workload. Recovery Services for VM/SQL-in-VM/HANA/snapshot-Files; Backup vault for Blob/Disk/PostgreSQL/AKS/vaulted-Files. Run both if your estate needs both.
- Soak immutability Unlocked, then lock. Run unlocked for a release cycle or two, confirm no automation breaks, review every policy’s retention, then flip to Locked. The lock is irreversible – treat it like a change freeze.
- Right-size retention before locking. Trim over-long yearly schedules first. A 10-year policy locked by mistake is a 10-year bill.
- Use enhanced soft delete at AlwaysON. 14–180 days sized to your recovery window; AlwaysON so an attacker-as-admin cannot disable it.
- Put the Resource Guard in a separate tenant (or at minimum a different subscription owned by a different team). Co-located is theatre.
- No standing destructive rights. Gate destructive operations behind a PIM-activated, time-bound Backup MUA Operator role on the guard. Day-job rights (Backup Contributor) stay; strip rights do not.
- Tune instant-restore retention for cost. Snapshots live in the source subscription; 1–2 days for large chatty VMs, more where fast same-region restore matters.
- Stream vault diagnostics to Log Analytics and stand up the Backup reports workbook. You cannot prove compliance or trend storage from the blade.
- Enable built-in alerts for backup failure AND destructive operations. Even with MUA, you want to be paged on the attempt – it is a high-fidelity intrusion signal.
- Prove recoverability with a timed CRR drill. Boot a VM from a secondary-region recovery point, record the RTO, and repeat quarterly. A green blade is not a recovery.
- Pre-build the DR landing zone. Staging storage account and target resource group in the paired region exist before the incident, not during it.
Security notes
- Least privilege on the backup plane: the only standing role most identities need is Backup Reader (monitoring) or Backup Operator scoped tightly; Backup Contributor is broad and includes destructive rights – assign it sparingly and never inherit it accidentally at subscription scope (the Helvetia Pay CI-principal trap).
The backup RBAC roles, what each can and cannot do, and who should hold them:
| Role | Can configure | Can trigger backup/restore | Can delete data / stop protection | Typical holder |
|---|---|---|---|---|
| Backup Reader | No | No (read-only) | No | Monitoring, auditors, SOC |
| Backup Operator | No (no policy create) | Yes | No (cannot delete backup data) | Day-job operators |
| Backup Contributor | Yes (policies, protection) | Yes | Yes | Backup squad (sparingly) |
| Owner | Yes (everything) | Yes | Yes | Break-glass only |
| Reader (on guard, Tenant B) | No | No | No | Backup admin (to map guard) |
| Backup MUA Operator (on guard) | n/a | n/a | Approves a gated op in-window | Backup admin via PIM only |
- Separation of duties is the whole point of MUA: the security team owns the guard tenant and the PIM approvals; the backup team owns the vault. Neither can unilaterally destroy recovery points. Audit cross-tenant role assignments on the guard quarterly.
- Encryption: vaults encrypt at rest with platform-managed keys by default; for regulated estates use customer-managed keys (CMK) via the vault’s system-assigned identity and Key Vault – see Azure encryption at rest with CMK & double encryption and manage the keys per Azure Key Vault: secrets, keys & certificates.
- Network isolation: restrict vault and Backup-vault management to private endpoints where the workload demands it, so backup traffic and management stay off the public internet.
- Identity for cross-tenant: the vault’s system-assigned managed identity authenticates to the guard and (with CMK) to Key Vault – never use a shared secret or a service principal with a long-lived credential for the guard mapping.
- Detection-in-depth: route the Sev-0 destructive-operation alerts to your SIEM. Pair with Microsoft Sentinel for correlation – a destructive-op attempt alongside an anomalous sign-in is a far stronger ransomware signal than either alone.
- Break-glass for the guard: ensure a break-glass emergency-access path exists for the security tenant so a locked-out approver does not become a single point of failure during a real recovery.
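The inherited-rights trap from the Helvetia Pay scenario can be screened for with a simple pass over role assignments. The sketch below assumes records shaped like the JSON from `az role assignment list` (fields `principalName`, `roleDefinitionName`, `scope`); the helper name and the choice to flag only subscription-scope assignments are assumptions of this example.

```python
# Roles that the RBAC table above marks as able to delete data / stop protection.
DESTRUCTIVE_ROLES = {"Backup Contributor", "Owner"}


def risky_assignments(assignments):
    """Return assignments that grant a destructive role at subscription scope."""
    flagged = []
    for assignment in assignments:
        parts = assignment["scope"].strip("/").split("/")
        # A subscription-level scope has exactly two segments: subscriptions/<id>.
        at_subscription = len(parts) == 2 and parts[0].lower() == "subscriptions"
        if assignment["roleDefinitionName"] in DESTRUCTIVE_ROLES and at_subscription:
            flagged.append(assignment)
    return flagged


assignments = [
    {"principalName": "ci-pipeline-sp", "roleDefinitionName": "Backup Contributor",
     "scope": "/subscriptions/0000"},
    {"principalName": "backup-squad", "roleDefinitionName": "Backup Contributor",
     "scope": "/subscriptions/0000/resourceGroups/rg-backup-prod"},
    {"principalName": "soc-reader", "roleDefinitionName": "Backup Reader",
     "scope": "/subscriptions/0000"},
]
for hit in risky_assignments(assignments):
    print(hit["principalName"], hit["roleDefinitionName"])
```

Anything this flags is a candidate for replacement with a narrower scope or a PIM-activated role.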
Cost & sizing
What drives the Azure Backup bill, in rough order of impact – price points are indicative (≈₹/USD, vary by region and commitment):
| Cost driver | What it is | Rough indicative cost | How to right-size |
|---|---|---|---|
| Protected-instance fee | Per protected VM/DB/etc. per month | ~₹400–₹800 / $5–$10 per instance/mo (size-banded) | Decommission stale items; consolidate |
| Vault storage (GRS) | Backup data stored, geo-replicated | ~₹2/GB/mo LRS; GRS ≈ 2× | Right-size retention; LRS where 2nd copy not mandated |
| Instant-restore snapshots | Local snapshots in source sub (1–5d) | Snapshot storage rate × churn | Lower to 1–2 days for large chatty VMs |
| Soft-deleted storage | Deleted RPs kept past free 14 days | Same as vault storage rate | Right-size soft-delete window (14–180) |
| CRR / cross-region egress | Geo-replication + restore data movement | Per-GB egress on restore | Drill cost is one-off; replication is in GRS price |
| Log Analytics ingestion | Diagnostic logs for reports/alerts | Billed per GB ingested | Filter categories; cap retention |
Sizing guidance: the protected-instance fee dominates for fleets of small VMs; the storage dominates for a few large, high-churn machines with long retention. The single biggest lever you control is retention × churn – a 7-year yearly point on a high-churn VM is expensive, and once the vault is locked you cannot reduce it, so size it before locking. GRS doubles storage versus LRS; pay it where a second-region copy or CRR is a real requirement, and consider LRS/ZRS for non-critical or operational-only protection. There is no free tier for production Azure Backup, but the lab in this article (one B1s VM, one afternoon) costs a few rupees. For the broader cost-conscious DR pattern on small teams, see Disaster recovery on a budget: backup & restore for small teams.
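To see which driver dominates for a given estate, a back-of-the-envelope calculator is enough. The sketch below uses the indicative figures from the table as defaults (a mid-band ₹600 per instance, about ₹2/GB/month for LRS, GRS at roughly 2×); these are assumptions for illustration, not quoted prices, and the function ignores snapshots, soft-deleted storage, egress and log ingestion.

```python
# Indicative multiplier from the table: GRS is roughly twice the LRS rate.
REDUNDANCY_MULTIPLIER = {"LocallyRedundant": 1.0, "GeoRedundant": 2.0}


def monthly_estimate_inr(instances, stored_gb, redundancy="GeoRedundant",
                         instance_fee=600.0, lrs_rate_per_gb=2.0):
    """Rough monthly cost split into the two largest drivers."""
    instance_cost = instances * instance_fee
    storage_cost = stored_gb * lrs_rate_per_gb * REDUNDANCY_MULTIPLIER[redundancy]
    return {
        "instances": instance_cost,
        "storage": storage_cost,
        "total": instance_cost + storage_cost,
    }


fleet = monthly_estimate_inr(instances=900, stored_gb=50_000)
print(fleet)
```

Swapping `redundancy` to `LocallyRedundant` for the same inputs shows the size of the GRS premium, which is the number to weigh against the requirement for a second-region copy.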
Interview & exam questions
These map to AZ-104 (Azure Administrator), AZ-305 (Solutions Architect), and SC-100 (Cybersecurity Architect) backup/resilience objectives.
- Why can’t you change vault redundancy after onboarding the first item? Redundancy determines where backup data is physically stored; changing it would require re-replicating all existing recovery points, so Azure freezes it once any item is protected. It is a day-zero decision – set `GeoRedundant` before onboarding if you want CRR.
- What is the difference between Unlocked and Locked immutability? Both block retention-reducing operations on existing recovery points. `Unlocked` can be disabled by an admin (a soak state); `Locked` is irreversible – not even the subscription owner or Microsoft support can disable it, which is what makes it attacker-proof and WORM-compliant.
- What exactly does immutability NOT block? Creating new backups and extending retention. It only gates the destructive direction: delete-before-retention, shortening retention, and disabling soft delete.
- What is a Resource Guard and why place it in a different tenant? A `Microsoft.DataProtection/resourceGuards` resource that gates destructive vault operations behind a second authorization (MUA). Placing it in a tenant the backup admin does not control means a compromised backup admin cannot self-approve a destructive operation – that is the separation of duties.
- Which operations does a Resource Guard gate by default? Disabling MUA itself, disabling/reducing soft delete, disabling immutability, stop-protection-with-delete-data, and modifying/deleting backup policies (and the MARS passphrase change). Several can be optionally excluded.
- Basic vs enhanced soft delete? Basic is a fixed 14 days and can be turned off. Enhanced is configurable 14–180 days and can be made `AlwaysON` (non-disablable). Enhanced is the current recommended model.
- Prerequisites for cross-region restore? Vault redundancy = `GeoRedundant`, the `crossRegionRestore` flag enabled, a workload type that supports CRR (Azure VM, SQL-in-VM, SAP HANA-in-VM), and a pre-provisioned staging storage account + target resource group in the paired region.
- Can a vault have both ZRS and CRR? No. `ZoneRedundant` protects within the primary region against a single-AZ loss but has no second-region copy. CRR reads from the `GeoRedundant` paired-region copy. Choose based on whether zone-loss or region-loss dominates your risk.
- How do you recover a maliciously deleted backup? If enhanced soft delete is on, the item is soft-deleted (14–180 day window). List soft-deleted items and run `az backup protection undelete`, then resume protection. After the window, only a CRR/secondary copy can help.
- An attacker holds Backup Contributor. Which single control stops them from stopping protection and deleting data? MUA via a cross-tenant Resource Guard – it requires an approval they cannot grant. Immutability (Locked) independently stops early deletion; combined, neither can be defeated. The exam answer for “stop the operation entirely” is the Resource Guard.
- Why is `instantRpRetentionRangeInDays` a cost lever? Instant-restore snapshots live in the source subscription and incur snapshot storage; lowering the range (1–5 days) reduces that cost at the price of slower same-region restore once snapshots age out.
- How do you prove a vault is recoverable, not just protected? Run an actual cross-region restore drill: enumerate secondary-region recovery points, restore disks into the paired region, boot the VM, and record the RTO. A green configuration blade is not proof.
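Several of the answers above reduce to a checklist you can run against a vault's settings. The sketch below is one hedged way to express that checklist; the dictionary keys (`redundancy`, `cross_region_restore`, `immutability`, `soft_delete`, `soft_delete_retention_days`, `vault_tenant`, `resource_guard_tenant`) are hypothetical names chosen for illustration, not the property names Azure returns, so you would need to map real vault properties onto them yourself.

```python
def audit_vault_hardening(vault: dict) -> list[str]:
    """Return human-readable gaps for one vault; an empty list means none found."""
    findings = []

    # Redundancy is a day-zero choice and CRR depends on it.
    if vault.get("redundancy") != "GeoRedundant":
        findings.append("redundancy is not GeoRedundant: no CRR without a new vault")
    elif not vault.get("cross_region_restore"):
        findings.append("GeoRedundant but the cross-region restore flag is off")

    # Unlocked immutability is a soak state an admin can still disable.
    immutability = vault.get("immutability")
    if immutability == "Unlocked":
        findings.append("immutability is Unlocked: lock it after a retention review")
    elif immutability != "Locked":
        findings.append("immutability is not enabled")

    # Soft delete must be non-disablable and inside the 14-180 day range.
    if vault.get("soft_delete") != "AlwaysON":
        findings.append("soft delete is not AlwaysON and can be turned off")
    days = vault.get("soft_delete_retention_days", 0)
    if not 14 <= days <= 180:
        findings.append("soft delete retention is outside 14-180 days")

    # MUA only resists a compromised admin if the guard lives elsewhere.
    guard_tenant = vault.get("resource_guard_tenant")
    if not guard_tenant:
        findings.append("no Resource Guard associated: MUA is not in force")
    elif guard_tenant == vault.get("vault_tenant"):
        findings.append("Resource Guard is in the vault's own tenant")

    return findings


# Hypothetical vault description used only to exercise the function.
example = {
    "redundancy": "GeoRedundant", "cross_region_restore": True,
    "immutability": "Unlocked", "soft_delete": "AlwaysON",
    "soft_delete_retention_days": 30,
    "vault_tenant": "tenant-a", "resource_guard_tenant": "tenant-a",
}
for finding in audit_vault_hardening(example):
    print("GAP:", finding)
```

A clean result here only says the configuration looks right; it is no substitute for the restore drill described in the last answer.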
Quick check
- Your vault is `LocallyRedundant` and already protecting 50 VMs. The CISO now wants cross-region restore. What must you do?
- Immutability is enabled but a red team still deleted backups. What state was it in, and what is the fix?
- You set a 10-year yearly retention, locked immutability, then realised it should be 3 years. Can you fix it? Why or why not?
- Where must the Resource Guard live for MUA to actually resist a compromised backup admin?
- A backup item was deleted and is unrecoverable after 6 days. What was misconfigured?
Answers
- Recreate the vault as `GeoRedundant` and re-onboard. Redundancy is frozen once any item is protected, so you cannot convert in place – create a new GRS vault, enable the CRR flag, and re-protect the VMs.
- It was `Unlocked`, so an admin (the red team) disabled immutability first and then deleted. The fix is to `Lock` it after a retention review – `Locked` is irreversible and cannot be disabled by anyone.
- No. Locked immutability blocks shortening retention. You will pay for the 10-year retention for its full duration. This is why you right-size and review retention before locking.
- In a separate tenant (or at minimum a different subscription owned by a different team) where the backup admin has no permissions – so a compromised admin cannot self-approve the destructive operation.
- Soft delete was not in force: either basic soft delete had been disabled, or enhanced soft delete was set to `Enable` and then turned off before the deletion. Any active soft delete keeps the item for at least 14 days, so an unrecoverable item at day 6 means there was no recovery window at all. Enhanced soft delete at `AlwaysON` with a 14–180 day window prevents this.
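The reasoning in the last answer is simple date arithmetic, sketched below under stated assumptions. The state strings `"Off"`, `"On"` and `"AlwaysON"` and the function names are hypothetical labels for this illustration, and treating the final day of the window as still recoverable is an assumption rather than documented behaviour.

```python
from datetime import date, timedelta
from typing import Optional


def undelete_deadline(deleted_on: date, state: str,
                      retention_days: int = 14) -> Optional[date]:
    """Last day a soft-deleted item can be undeleted, or None if soft delete was off."""
    if state == "Off":
        return None  # nothing was retained, so there is no window at all
    if not 14 <= retention_days <= 180:
        raise ValueError("soft delete retention must be 14-180 days")
    return deleted_on + timedelta(days=retention_days)


def is_recoverable(deleted_on: date, today: date, state: str,
                   retention_days: int = 14) -> bool:
    """True while the item is still inside its soft-delete window."""
    deadline = undelete_deadline(deleted_on, state, retention_days)
    return deadline is not None and today <= deadline


# The quick-check scenario: deleted six days ago.
deleted = date(2024, 1, 1)
six_days_later = date(2024, 1, 7)
print(is_recoverable(deleted, six_days_later, "Off"))
print(is_recoverable(deleted, six_days_later, "AlwaysON"))
```

With any enabled state the shortest possible window is 14 days, which is why an unrecoverable item at day 6 points to soft delete having been switched off rather than to a window that was too short.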
Glossary
- Recovery Services vault: The vault resource (`Microsoft.RecoveryServices/vaults`) for Azure VMs, SQL/SAP-HANA-in-VM, and snapshot-based Azure Files protection.
- Backup vault: The newer vault resource (`Microsoft.DataProtection/backupVaults`) for Blobs, Managed Disks, PostgreSQL Flexible Server, AKS, and vaulted Azure Files.
- Immutability: A vault security setting that blocks operations reducing the protection of existing recovery points; `Unlocked` is reversible, `Locked` is irreversible (WORM).
- WORM: Write-once-read-many; the compliance property that recovery points cannot be altered or deleted before retention expires – delivered by `Locked` immutability.
- Soft delete: Keeps deleted backup data restorable for a retention window (enhanced: 14–180 days); `AlwaysON` makes the feature non-disablable.
- Resource Guard: A separate resource (`Microsoft.DataProtection/resourceGuards`) that gates destructive vault operations behind a second authorization (MUA).
- Multi-user authorization (MUA): The pattern of requiring destructive operations to be approved through a Resource Guard the backup admin does not control.
- Cross-region restore (CRR): On-demand restore of a backup into the Azure-paired region without waiting for a Microsoft-declared regional failover; requires GRS.
- GeoRedundant (GRS): Storage redundancy that keeps six copies – three local plus three in the geo-paired region; the prerequisite for CRR.
- Instant restore: A backup-policy tier keeping local snapshots (1–5 days) in the source subscription for fast same-region restores.
- GFS: Grandfather-father-son – the daily/weekly/monthly/yearly retention ladder in a backup policy.
- Recovery point (RP): A single restorable backup captured at a point in time; the artefact an attacker deletes to destroy your recovery path.
- PIM (Privileged Identity Management): Entra capability for time-bound, just-in-time role activation – used to grant Backup Operator on the guard only during an approved window.
- Backup Contributor: The broad RBAC role that includes destructive backup operations; the role an attacker most wants on the backup plane.
Next steps
- Azure Backup & Site Recovery Deep Dive – the protection-mechanics foundation this hardening sits on top of.
- Azure Site Recovery: zone-to-zone & region failover runbooks – replication-based DR for workloads where backup-restore RTO is too slow.
- Azure PIM for resources & groups: JIT elevation – the just-in-time mechanism that powers MUA approvals on the Resource Guard.
- Ransomware resilience: immutable backup & isolated recovery environment – the cross-cloud pattern and the isolated-recovery-environment concept.
- Azure encryption at rest with CMK & double encryption – bring-your-own-key for the vault when platform-managed keys are not enough.