Azure Backup and Site Recovery: Protecting Workloads from Loss

The worst phone call of a career is not “the site is down.” It is “the site is down and the backups don’t restore.” A manufacturing client of mine backed up their file servers religiously for three years and never once tried to recover a whole machine. When ransomware encrypted their production estate on a Tuesday night, the backup jobs were green, the data was intact — and it still took eighteen hours to get the application serving customers, because nobody had ever built or rehearsed an orchestrated recovery. The data came back; the service did not, for the better part of a day, because restoring four hundred files is a different problem from standing up an application tier. That gap — between “I have the bytes” and “I can run the app” — is exactly the gap that Azure Backup and Azure Site Recovery (ASR) fill, and they fill different halves of it.

Azure Backup answers one question: can I get my data back to how it was at a point in time? It takes application-consistent point-in-time copies of VMs, files, SQL Server, SAP HANA, Azure Files and blobs into a hardened Recovery Services vault (or, for newer workloads, a Backup vault), and lets you restore a single file, a single disk, or an entire machine. Azure Site Recovery answers a completely different question: can I run my whole application somewhere else, fast, in a defined order? It continuously replicates a VM’s disks to a paired region, and on failover it powers on the replicas, attaches networking, and walks a recovery plan you authored — web tier, then app tier, then database, with scripts in between. Backup is your time machine against deletion and corruption; Site Recovery is your second site against a regional outage. You need both, and confusing them is how you end up with eighteen-hour Tuesdays.

This article is the production playbook for both. You will learn the vault model and why soft delete plus immutability is the only thing standing between you and a ransomware operator who got domain admin; every backup policy knob (frequency, retention tiers, instant-restore snapshots) and what each costs; how ASR replication actually works (the appliance, the cache storage account, crash- vs app-consistent recovery points, the RPO/RTO you can realistically promise); how to build and — the part everyone skips — test a recovery plan without touching production; and a structured failure→cause→confirm→fix table for the dozen ways these jobs break in the real world. Every operation gets the exact az command, a Bicep equivalent, and a KQL query where the answer lives in logs. The prose explains the why; the tables — there are many — are the reference you keep open at 02:00 when the vault is throwing UserErrorGuestAgentStatusUnavailable and the CFO is on the bridge.

What problem this solves

Data and applications die from causes that look nothing alike, and a single mechanism cannot defend against all of them. An engineer fat-fingers a DROP TABLE or deletes the wrong resource group. A bad deployment corrupts data subtly for six hours before anyone notices. A ransomware operator encrypts every reachable disk and then hunts down and deletes the backups, because they know that intact backups are the only thing that lets you refuse the ransom. An Azure region has a storage incident and your entire workload — perfectly healthy code — is unreachable for hours. Each of these is a different attack surface, and “we have backups” is a meaningless statement until you say which failure mode you mean.

What breaks without a real strategy is not the backup — it is the recovery. Teams discover, mid-incident, that their retention was 7 days and the corruption started on day 9; that the backups were in the same region that just failed; that the vault had no soft delete so the ransomware deleted the recovery points along with the data; that nobody knew the order to bring tiers up, or that the application needs a connection-string rewrite and a DNS swap that lives only in one person’s head — and that person is on a flight. The cruel truth of disaster recovery is that an untested recovery plan is a hypothesis, not a capability. DR plans decay silently: an IP changes, a dependency is added, a script rots, and the plan that worked at the last audit fails at the real incident.

Who hits this: everyone running production in the cloud, but it bites hardest on teams that treat backup as a checkbox rather than a tested capability. The finance and healthcare teams who must prove recoverability to auditors. The lean startups who set up daily VM backups, feel safe, and never once run a test restore. The enterprises with sprawling estates where backup coverage (is every new VM actually protected?) silently drifts. And anyone who thinks a snapshot is a backup: snapshots live next to the thing they protect and die with it, which makes them useless against ransomware or a region loss.

To frame the whole field before the deep dive, here is the threat model: every loss event this article defends against, which service answers it, and the one control that actually saves you.

Loss event	What’s lost	Primary defence	The control that saves you
Accidental delete (file/VM/RG)	Specific objects	Azure Backup	Soft delete on the vault + retention ≥ 14 days
Slow data corruption	Recent good state	Azure Backup	Long-enough retention + point-in-time choice
Ransomware encrypts disks	All reachable data	Azure Backup	Immutable vault + soft delete + MUA
Ransomware deletes backups	Your only recovery	Azure Backup	Immutability lock + Multi-User Authorization
Single-VM crash / OS rot	One machine	Azure Backup (restore VM)	Tested whole-VM restore, not just file restore
Availability-zone failure	One zone	Zone-redundant design	ZRS storage / zonal redundancy (high availability rather than DR)
Regional outage	The whole workload	Azure Site Recovery	Replication to a paired region + tested failover
Region loss + no order to recover	Time and sanity	ASR recovery plan	Sequenced, script-driven, rehearsed plan

Learning objectives

By the end of this article you can:

Decide, for any workload, whether you need Azure Backup, Azure Site Recovery, or both — and articulate the difference between “get my data back” and “run my app elsewhere.”
Stand up a Recovery Services vault and a Backup vault, choose the right redundancy (LRS / ZRS / GRS), and enable soft delete and immutability so backups survive a ransomware operator with admin rights.
Author an Azure Backup policy with the right frequency, retention tiers (daily/weekly/monthly/yearly) and instant-restore window — and know what each setting costs and where it bites.
Configure Azure Site Recovery replication for an Azure VM, read crash- vs application-consistent recovery points, and set an RPO/RTO you can actually defend to the business.
Build a multi-tier recovery plan with sequenced groups and pre/post scripts, then run a non-disruptive test failover in an isolated network — and clean it up.
Drive the core operations fluently with az backup, az site-recovery / classic ASR cmdlets, Bicep, and KQL over the vault’s diagnostic logs.
Diagnose the dozen common failures — agent unreachable, restore-point gap, replication lag, failover stuck, soft-deleted item — with the exact command/portal path to confirm and the precise fix.
Right-size the bill: understand what drives protected-instance, storage, snapshot and ASR replication charges, and where the free allowances and cheap tiers actually are.

Prerequisites & where this fits

You should already be comfortable with the Azure resource model — subscriptions, resource groups, regions and Azure paired regions — and able to run az in Cloud Shell, read JSON output, and reason about managed disks and VNets. Helpful but not required: a working mental model of RTO (how long until the app is back) and RPO (how much data you can afford to lose), and a passing familiarity with managed identities and RBAC, because the vault’s access model leans on both. If those resilience terms are fuzzy, the conceptual groundwork lives in High Availability vs Disaster Recovery: RTO and RPO Explained; the region/zone substrate these services replicate across is covered in Azure Regions and Availability Zones: Designing for Resilience.

This sits in the Resiliency & Business Continuity track. It is downstream of basic compute and storage and upstream of full multi-region architecture. Backup and ASR are components of a resilience strategy, not the whole thing: they pair with active-active patterns from Azure Multi-Region Active-Active Architecture: Designing for Zero-Downtime (for workloads that cannot tolerate even a short failover), with the storage redundancy concepts in Azure Storage Account Fundamentals: Blobs, Files, Queues and Tables (LRS/ZRS/GRS, which also govern vault redundancy), and with the secret-protection discipline in Azure Key Vault: Secrets, Keys and Certificates Done Right (because customer-managed keys and the keys your failed-over app needs both live there). The hardest ransomware variant of this topic — air-gapped, immutable, isolated recovery — gets its own deep treatment in Ransomware Resilience: Immutable Backups, Recovery Vaults, and Isolated Recovery Environments.

A quick map of who owns what during a recovery, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it causes
Source workload (VM/DB/files)	The data being protected	App / DBA team	Agent down → backup fails; app inconsistency
In-guest agent (MARS / VM ext)	Snapshot coordination	Platform + app	Extension unhealthy → job fails
Recovery Services / Backup vault	Recovery points, policy, soft delete	Backup / platform team	Misconfigured retention; no immutability
Vault redundancy (LRS/ZRS/GRS)	Where copies physically live	Platform / architecture	Same-region copy lost in a regional outage
ASR replication path	Disk replication + recovery points	Platform / network	Replication lag → RPO breach
Recovery plan + scripts	Failover order, automation	App + platform	Wrong order, stale script → long RTO
DNS / networking / identity	Cutover plumbing	Network + identity	App up but unreachable; auth broken

Core concepts

Six mental models make every later decision obvious.

Backup protects state; Site Recovery protects service. This is the master distinction and it drives everything. Azure Backup captures point-in-time copies of data so you can roll an object back to how it was — a file, a disk, a database, a whole VM. Azure Site Recovery captures a continuously updated replica of a running machine so you can power it on elsewhere. Backup’s unit of value is a recovery point (a moment you can return to); ASR’s unit of value is a failover (the act of running the workload in the secondary site). Backup defends against deletion and corruption, which are time problems; ASR defends against unavailability, which is a location problem. A VM can need both: Backup to undo a bad change, ASR to survive a region outage.

The vault is the trust boundary — and its hardening is the whole game against ransomware. Recovery points live in a vault (Recovery Services vault for the classic estate; the newer Backup vault for blobs, disks, Azure Database for PostgreSQL flexible server, AKS and more). The vault is a control-plane object with its own RBAC, its own redundancy setting, and — critically — its own data-protection controls: soft delete (deleted recovery points are retained, recoverable, for a window rather than purged immediately), immutability (recovery points cannot be deleted or shortened before expiry), and Multi-User Authorization (MUA) (destructive operations require a second approver via a Resource Guard). A modern ransomware playbook is encrypt the data, then delete the backups; these three controls are specifically what defeat the second step. A vault without them is a backup that an attacker with your credentials can erase.

Crash-consistent is not application-consistent, and the difference is your data integrity. When Backup or ASR captures a recovery point, that point has one of three consistency levels. A crash-consistent point is “as if you pulled the power cord”: disks captured at an instant, in-flight writes possibly torn; it boots, but a database may need crash recovery and could lose the last transactions. A file-system-consistent point flushes the OS file cache so on-disk files are coherent; this is what a Linux VM gets when no pre/post scripts are configured. An application-consistent point uses VSS on Windows (or pre/post scripts on Linux) to quiesce the application (flush database buffers, freeze writers) so the recovery point is a clean, transactionally consistent moment. For databases and stateful apps you want application-consistent points; ASR creates them on a configurable cadence, and Backup uses VSS by default for Windows VMs. If you only have crash-consistent points, plan for extra recovery time and possible loss of the last seconds of data.

RPO and RTO are promises with prices, not aspirations. RPO (Recovery Point Objective) is the maximum data loss you accept, measured in time: “we can lose at most 15 minutes.” It is governed by how often you create recovery points: backup frequency for Backup (hourly to daily), and continuous replication for ASR (RPO often a few minutes, with app-consistent points on a configurable cadence measured in hours; check the value in your replication policy rather than assuming a default). RTO (Recovery Time Objective) is the maximum time to restore service: “we are back within 2 hours.” It is governed by how fast you can restore or fail over and re-plumb: restoring a 2 TB VM from backup takes real time; an ASR failover boots a replica in minutes but DNS, identity and dependency cutover add to it. Tighter RPO/RTO costs more (more frequent points, hot replicas, more automation). The discipline is to set them from business impact, not ambition, and then test that you meet them.
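To make that arithmetic concrete, here is a minimal Python sketch. The function names, the 200 MB/s restore throughput and the 30-minute cutover are all assumptions chosen for illustration, not Azure-published figures; measure your own restore rate in a test before promising anything.

```python
def meets_rpo(point_interval_min: float, rpo_target_min: float) -> bool:
    """Worst-case loss is one full interval: failure just before the next point."""
    return point_interval_min <= rpo_target_min


def transfer_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours to move size_gb at a sustained throughput (an assumed figure)."""
    return (size_gb * 1024) / throughput_mb_s / 3600


def estimated_rto_hours(data_hours: float, cutover_hours: float) -> float:
    """RTO = moving the data + DNS, identity and dependency cutover."""
    return data_hours + cutover_hours


# Daily backups against a 15-minute RPO target: the promise cannot be met.
daily_ok = meets_rpo(point_interval_min=24 * 60, rpo_target_min=15)

# 2 TB VM restored at an assumed 200 MB/s, plus an assumed 30 minutes of cutover.
restore = transfer_hours(size_gb=2048, throughput_mb_s=200)
rto = estimated_rto_hours(restore, cutover_hours=0.5)

print(f"daily backup meets 15-min RPO: {daily_ok}")
print(f"restore ~{restore:.1f} h, RTO ~{rto:.1f} h against a 2 h target")
```

The point of the sketch is the shape of the calculation: data movement alone can exceed a 2-hour RTO for a large VM, before any cutover work is counted, which is why the numbers have to come from a real test restore.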

Restore is a spectrum, not a button. Azure Backup does not just “restore the VM.” It offers, from cheapest/fastest to most complete: file-level restore (mount a recovery point and copy individual files), disk restore (recover specific managed disks and attach them), replace existing (overwrite the source VM’s disks), and create new VM (build a fresh VM from the recovery point). Instant restore uses snapshots retained in the source region for a configurable window (1–5 days on the Standard policy, up to 30 days on the Enhanced policy) so recent restores are near-instant and don’t pull from vault storage. Choosing the right restore type for the incident, one file versus a whole machine, is the difference between a five-minute fix and an hour-long rebuild.

Failover has phases, and “test” is the most important one. ASR failover is not a single act. A test failover spins up the replica in an isolated network with no impact to production or replication — this is your rehearsal and your audit evidence, and you should run it quarterly. A planned failover (zero data loss, for a controlled migration) shuts the source down cleanly first. An unplanned failover (the real disaster) runs from the latest available recovery point because the source is gone. After the dust settles you commit the failover (finalising it) and, when the primary region returns, re-protect and fail back. The lifecycle — replicate → test → fail over → commit → re-protect → fail back — is the thing you must understand, because skipping “test” is how the eighteen-hour Tuesday happens.
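The lifecycle above can be read as a small state machine. The sketch below is a simplification under stated assumptions: the state names are this example's own labels (not ASR API values), and it models only the happy path described in the paragraph.

```python
# Allowed transitions in the lifecycle: replicate -> test -> fail over ->
# commit -> re-protect -> fail back. State names are illustrative labels.
TRANSITIONS = {
    "replicating": {"test_failover", "failed_over"},
    "test_failover": {"replicating"},   # cleaning up the test returns to steady state
    "failed_over": {"committed"},
    "committed": {"reprotected"},
    "reprotected": {"failed_back"},
    "failed_back": {"replicating"},     # simplified: protection resumes on the primary
}


def advance(state: str, target: str) -> str:
    """Move to target if the lifecycle allows it, otherwise raise ValueError."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target


def run(path, start: str = "replicating") -> str:
    """Walk a sequence of target states and return the final state."""
    state = start
    for target in path:
        state = advance(state, target)
    return state


# A rehearsal followed by a real incident and a return to the primary region.
final = run([
    "test_failover", "replicating",
    "failed_over", "committed", "reprotected", "failed_back", "replicating",
])
print(final)
```

Encoding the order this way makes the rule explicit: you cannot commit something that was never failed over, and a test failover always ends back in steady-state replication.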

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Term	One-line definition	Which service	Why it matters
Recovery Services vault	Classic vault for VM/file/SQL/SAP backup + ASR	Both	The hardened store; RBAC + redundancy + soft delete live here
Backup vault	Newer vault for blobs, disks, AKS, flexible-server DBs	Backup	Where modern workload backups go; supports immutability + MUA
Recovery point	A point-in-time copy you can restore to	Backup / ASR	The unit of “how far back can I go”
Backup policy	Schedule + retention rules attached to items	Backup	Defines RPO and how long you keep history
Soft delete	Deleted points retained, recoverable, for a window	Both	Defeats “attacker deletes the backups”
Immutability	Points can’t be deleted/shortened before expiry	Both	The lock ransomware can’t pick
MUA / Resource Guard	Destructive ops need a second approver	Both	Stops a single compromised admin
RPO	Max acceptable data loss (in time)	Both	Set by frequency / replication cadence
RTO	Max acceptable time to restore service	Both	Set by restore/failover speed + plumbing
Crash- vs app-consistent	Power-pull vs quiesced (VSS) recovery point	Both	Data integrity of what you restore
Instant restore	Snapshot-backed fast restore in source region	Backup	Speeds recent restores; costs snapshot storage
Replication	Continuous disk copy to a target region	ASR	The mechanism behind regional DR
Recovery plan	Sequenced failover with groups + scripts	ASR	Turns a pile of VMs into a recoverable app
Test failover	Failover into an isolated network, no impact	ASR	The rehearsal that makes DR real
Failback / re-protect	Return to primary after it recovers	ASR	Closing the loop post-incident

Backup vs Site Recovery: choosing the right tool

The single most expensive mistake in this space is reaching for the wrong tool — backing up a workload that needed replication, or replicating one that just needed a longer retention. They are complements, not substitutes. Here is the head-to-head that settles it:

Dimension	Azure Backup	Azure Site Recovery
Question it answers	“Can I get my data back?”	“Can I run my app elsewhere?”
Protects	VMs, files, SQL, SAP HANA, Azure Files, blobs, disks	VMs (Azure + on-prem VMware/Hyper-V/physical)
Unit of value	Recovery point (point-in-time)	Failover (running replica)
Typical RPO	Hours to a day (per schedule)	Seconds to minutes (continuous)
Typical RTO	Minutes to hours (restore time)	Minutes (boot replica) + cutover
Defends against	Delete, corruption, ransomware	Regional / large-scale outage
Storage cost model	Vault storage for retained points	Continuous replica storage + cache
Granularity	File / disk / DB / whole VM	Whole VM (and its dependencies)
History retained	Days to years (LTR)	Hours of recovery points (e.g. 24–72h)
Orchestration	Restore (manual/automated)	Recovery plans (sequenced, scripted)

The decision rule, as a table — match the workload requirement to the tool:

If the requirement is…	Use	Why
Undo an accidental delete weeks later	Backup (long retention)	ASR keeps only hours of points
Recover one corrupted file	Backup (file-level restore)	ASR is whole-VM only
Keep 7 years of monthly recovery points for audit	Backup (monthly/yearly retention, LTR)	Compliance archive is Backup’s job
Survive a full region outage in minutes	Site Recovery	Continuous replica, fast failover
Sequence web→app→DB failover with scripts	Site Recovery (recovery plan)	Orchestration is ASR’s job
Protect against ransomware deleting backups	Backup (immutable vault + MUA)	Hardened vault controls
Both: undo bad changes AND survive region loss	Both	They cover different failure modes
Near-zero downtime, no failover step at all	Neither alone → active-active	DR ≠ HA; see multi-region design

A blunt rule I give every team: Backup is mandatory for anything that holds data; Site Recovery is for the subset that cannot tolerate a prolonged regional outage. Most estates over-buy ASR (it is the more expensive, more operationally demanding service) and under-invest in testing Backup. Protect everything with Backup; reserve ASR for the tier-1 workloads where minutes of regional downtime translate to real money or real harm.
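That rule is simple enough to write down as code. This is a minimal sketch: the function and its parameter names are hypothetical, and the three yes/no inputs are a deliberate simplification of the decision table above.

```python
from typing import List


def choose_protection(holds_data: bool,
                      must_survive_region_outage: bool,
                      tolerates_failover_step: bool = True) -> List[str]:
    """Map three yes/no workload requirements to the tools the article recommends."""
    tools: List[str] = []
    if holds_data:
        # Backup is mandatory for anything that holds data.
        tools.append("Azure Backup")
    if must_survive_region_outage:
        if tolerates_failover_step:
            tools.append("Azure Site Recovery")
        else:
            # DR is not HA: near-zero downtime needs a multi-region design.
            tools.append("active-active design")
    return tools


print(choose_protection(holds_data=True, must_survive_region_outage=False))
print(choose_protection(holds_data=True, must_survive_region_outage=True))
print(choose_protection(True, True, tolerates_failover_step=False))
```

Most workloads land on the first line of output; only the tier-1 subset should reach the second or third.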

A snapshot is not a backup — and three other myths

The most dangerous belief in this space is “we take snapshots, so we’re covered.” A snapshot lives next to the thing it protects, shares its fate, and offers no immutability — it is a convenience, not a recovery strategy. Here is each common myth against the reality:

Common belief	Reality	Why it bites
“Disk snapshots are our backup”	Snapshots sit in the same subscription/region and have no immutability	Ransomware/region loss takes them with the data
“RAID / ZRS protects our data”	That’s hardware/zone availability, not point-in-time recovery	A `DROP TABLE` or corruption replicates instantly to every copy
“GRS storage means we have DR”	GRS is async replication, not a tested failover capability	No orchestration, no restore test, surprise RTO
“Backups are green, so we’re safe”	A green job proves capture, not recoverability	Untested restores fail when you finally need them
“ASR replaces Backup”	ASR keeps only hours of points and is whole-VM only	Can’t undo a file delete from three weeks ago
“We can change vault redundancy later”	Redundancy is immutable once an item is protected	Stuck on LRS during a regional outage

The vault: Recovery Services vs Backup vault

Everything Backup and ASR do is anchored to a vault. There are two kinds today, and picking the wrong one wastes a day of rework because you cannot migrate items between them.

Recovery Services vault is the long-standing vault. It backs up Azure VMs, on-prem files/folders (via the MARS agent), SQL Server in Azure VMs, SAP HANA in Azure VMs, and Azure File Shares — and it is the control plane for Azure Site Recovery. If you are protecting VMs or running ASR, this is your vault.

Backup vault is the newer model for workloads the Recovery Services vault never covered: Azure Blobs (operational + vaulted backup), Azure Managed Disks, Azure Database for PostgreSQL (including flexible server), Azure Database for MySQL, and AKS (cluster state plus persistent volumes). It has a cleaner data-protection model (native immutability, MUA) and is where Microsoft is investing for cloud-native workloads. It does not do VMs or ASR.

Here is which vault each workload belongs to — get this right before you create anything:

Workload	Vault type	Backup style	Notes
Azure VM	Recovery Services	Snapshot + vault	VSS app-consistent by default (Windows)
On-prem files/folders (MARS)	Recovery Services	Agent → vault	The MARS agent, scheduled
SQL Server in Azure VM	Recovery Services	Stream (log/diff/full)	15-min log RPO possible
SAP HANA in Azure VM	Recovery Services	Backint stream	Certified Backint integration
Azure File Share	Recovery Services	Snapshot-based	Snapshots managed by the vault
Azure Blob	Backup vault	Operational + vaulted	Point-in-time / continuous
Azure Managed Disk	Backup vault	Incremental snapshot	Snapshot in a resource group
PostgreSQL flexible server	Backup vault	Vaulted	Long-term retention beyond service default
AKS	Backup vault	Cluster + PV (via extension)	Backup extension + trusted access
Azure VM replication (DR)	Recovery Services	ASR replication	Not “backup” — it’s DR

Create a Recovery Services vault and immediately set its storage redundancy (you can only change it before the first protected item exists):

# Recovery Services vault with GRS (cross-region) redundancy
az backup vault create \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --location centralindia
# Set redundancy BEFORE protecting anything (GeoRedundant / LocallyRedundant / ZoneRedundant)
az backup vault backup-properties set \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --backup-storage-redundancy GeoRedundant \
  --cross-region-restore-flag true

The Bicep equivalent, with soft delete and cross-region restore baked in:

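param location string = resourceGroup().location

resource rsv 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
  name: 'rsv-prod-cin'
  location: location
  sku: { name: 'RS0', tier: 'Standard' }
  identity: { type: 'SystemAssigned' }
  properties: {}
}

// Soft delete and the security features live on the backupconfig child resource
resource rsvConfig 'Microsoft.RecoveryServices/vaults/backupconfig@2024-04-01' = {
  parent: rsv
  name: 'vaultconfig'
  properties: {
    enhancedSecurityState: 'Enabled'
    softDeleteFeatureState: 'Enabled'
  }
}

// Redundancy and cross-region restore live on backupstorageconfig.
// Assumption: this child type is available at the same API version; verify against the ARM reference.
resource rsvStorage 'Microsoft.RecoveryServices/vaults/backupstorageconfig@2024-04-01' = {
  parent: rsv
  name: 'vaultstorageconfig'
  properties: {
    storageModelType: 'GeoRedundant'
    crossRegionRestoreFlag: true
  }
}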

The vault redundancy choice is the same LRS/ZRS/GRS decision as a storage account, and it is consequential for DR — an LRS vault keeps every copy in one region, so a regional disaster takes the backups with the data. Match redundancy to the threat:

Redundancy	Copies kept	Survives	When to use	Cost
LRS (Locally redundant)	3 copies, one datacenter	Disk/rack/node failure	Dev/test; data with a separate regional copy	Lowest
ZRS (Zone redundant)	3 copies across AZs	A whole availability zone	Prod where region loss is covered elsewhere	Medium
GRS (Geo redundant)	LRS + async copy to paired region	A whole region	Production default for backups	Higher
GRS + Cross-Region Restore	GRS, restorable from secondary on demand	Region loss, with self-service restore	Tier-1; restore without waiting for failover	Higher + restore I/O

A note that has cost people their backups: redundancy is immutable once the vault holds a protected item. If you create an LRS vault, protect 200 VMs, then realise during a regional incident that you needed GRS — you cannot change it without deleting all protected items first. Decide redundancy at creation. For anything production, default to GRS with Cross-Region Restore enabled, which lets you restore in the paired region on your schedule rather than waiting for Microsoft to declare a failover.
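The redundancy table and the one-way rule reduce to two small checks. This is a sketch only: the function names are hypothetical, and the item count is something you would obtain yourself (for example from the vault's protected-item list) before attempting a change.

```python
def pick_vault_redundancy(survive_zone_loss: bool,
                          survive_region_loss: bool,
                          self_service_secondary_restore: bool = False) -> str:
    """Choose the cheapest redundancy that survives the stated threat."""
    if survive_region_loss:
        if self_service_secondary_restore:
            return "GRS + Cross-Region Restore"
        return "GRS"
    if survive_zone_loss:
        return "ZRS"
    return "LRS"


def can_change_redundancy(protected_item_count: int) -> bool:
    """Redundancy is only changeable while the vault protects nothing."""
    return protected_item_count == 0


print(pick_vault_redundancy(survive_zone_loss=True, survive_region_loss=True,
                            self_service_secondary_restore=True))
print(can_change_redundancy(protected_item_count=200))
```

Running the second check as a pre-flight step in your provisioning pipeline is one way to make sure the redundancy decision is made while it can still be changed.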

Hardening the vault: soft delete, immutability and MUA

This is the section that matters most and that most teams skip. A backup an attacker can delete is not a backup. Three layered controls turn the vault from a convenience into a genuine ransomware defence, and they compound: soft delete buys you a recovery window, immutability removes the delete capability entirely, and MUA ensures no single compromised admin can disable either.

Soft delete retains deleted backup data for 14 days by default, configurable up to 180 days, so that an accidental or malicious “delete this backup item” can be undone. The first 14 days of soft-deleted retention are free; retention beyond that is billed like ordinary backup storage. With enhanced soft delete you can make it always-on, which is irreversible: it cannot be turned off, closing the loophole where an attacker simply disables soft delete first. Check and configure it:

# Inspect soft-delete state and retention
az backup vault backup-properties show \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --query "[?properties.softDeleteFeatureState].properties.{soft:softDeleteFeatureState, days:softDeleteRetentionPeriodInDays}" -o table

# Set enhanced soft delete to always-on (irreversible) with 30-day retention
az backup vault backup-properties set \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --soft-delete-feature-state AlwaysOn \
  --soft-delete-duration 30

Immutability makes recovery points un-deletable and un-shortenable before their expiry. You enable it on the vault and then optionally lock it. Unlocked immutability can be turned off (good while you pilot); a locked immutable vault is irreversible — not even Microsoft support can delete a recovery point before it expires. That irreversibility is the entire point: it is the property a ransomware operator cannot defeat with stolen credentials.

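// Shown standalone for clarity; in a real template, set securitySettings on the
// existing vault resource rather than declaring the same vault twice.
resource rsvImmutability 'Microsoft.RecoveryServices/vaults@2024-04-01' = {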
  name: 'rsv-prod-cin'
  location: location
  sku: { name: 'RS0', tier: 'Standard' }
  properties: {
    securitySettings: {
      immutabilitySettings: {
        state: 'Locked'   // 'Unlocked' while piloting; 'Locked' is irreversible
      }
    }
  }
}

Multi-User Authorization (MUA) protects operations, not just data: critical actions (disable soft delete, reduce retention, delete a backup item, stop protection with delete) require approval through a Resource Guard held in a different subscription or tenant, governed by a security team the workload admins don’t control. So even a fully compromised backup admin cannot quietly weaken the vault; the destructive operation stalls awaiting a second party. Create the guard first:

# Create the Resource Guard (run by the security team, ideally in a separate subscription or tenant)
az dataprotection resource-guard create \
  --resource-group rg-security --name rg-guard-prod --location centralindia
# Then link the vault to the guard and choose which operations it protects (portal or Bicep)

These controls layer; understand what each stops and its escape hatch (or lack of one):

Control	What it stops	Default	Can an admin disable it?	Recommended prod setting
Soft delete (basic)	Permanent loss on accidental/malicious delete	On, 14 days	Yes (then 14-day window still applies)	Enable, ≥ 30 days
Enhanced soft delete (Always-on)	Attacker disabling soft delete first	Off	No (irreversible)	Enable, Always-on
Immutability (Unlocked)	Deleting/shortening points before expiry	Off	Yes	Enable while piloting
Immutability (Locked)	Same, irreversibly	Off	No (irreversible)	Enable + Lock for tier-1
Multi-User Authorization	A single compromised admin weakening the vault	Off	Only with the second approver	Enable for prod vaults
RBAC least privilege	Over-broad backup/restore rights	—	—	Backup Operator, not Owner

And the ransomware kill-chain, mapped to the control that breaks each step — this is why you layer them:

Attacker step	Without hardening	Control that breaks it
Gains admin via phishing	Full control of estate	(out of scope — identity hardening)
Encrypts production disks	Data unusable	Backup itself (restore clean points)
Deletes backup items	No recovery → pay ransom	Soft delete (retains them)
Disables soft delete, then deletes	Soft delete bypassed	Enhanced soft delete (Always-on)
Shortens retention to expire points	Points vanish “legitimately”	Immutability (Locked)
Uses one stolen admin to do all above	One credential = total loss	MUA / Resource Guard

These controls have states with one-way doors — the irreversible transitions are deliberate (an attacker can’t undo them either), so understand them before you flip the switch:

State / transition	Reversible?	Effect	When to choose
Soft delete → Off	Yes	No retention of deleted points	Never on prod
Soft delete → On (basic)	Yes	14-day recovery window (up to 180 days with enhanced soft delete)	Minimum baseline
Soft delete → Always-on	No (one-way)	Can’t be disabled by anyone	Production hardening
Immutability → Unlocked	Yes	Points protected, can be turned off	While piloting immutability
Immutability → Locked	No (one-way)	Points immutable, irreversibly	Tier-1, once confident
MUA → Enabled	Yes (via approver)	Destructive ops need 2nd party	All production vaults

If you take one thing from this article: for any vault holding production backups, enable enhanced soft delete (always-on), immutability (locked) once you are confident, and MUA. Those three turn “we have backups” into “we have backups an attacker cannot erase.”
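A check like that is easy to automate. The sketch below assumes you have already flattened the vault's settings into a plain dictionary; the key names (`soft_delete`, `immutability`, `mua_enabled`) are hypothetical, not fields returned by any Azure API, and the baseline reflects the tier-1 recommendation above.

```python
# Tier-1 baseline from the recommendation above. Keys are this sketch's own names.
RECOMMENDED = {
    "soft_delete": "AlwaysOn",
    "immutability": "Locked",
    "mua_enabled": True,
}


def audit_vault(settings: dict) -> list:
    """Return one finding per control that differs from the baseline."""
    findings = []
    for key, wanted in RECOMMENDED.items():
        actual = settings.get(key)  # a missing key counts as a finding
        if actual != wanted:
            findings.append(f"{key}: expected {wanted!r}, found {actual!r}")
    return findings


hardened = {"soft_delete": "AlwaysOn", "immutability": "Locked", "mua_enabled": True}
piloting = {"soft_delete": "Enabled", "immutability": "Unlocked"}

print(audit_vault(hardened))
for finding in audit_vault(piloting):
    print(finding)
```

An unlocked immutability setting is a legitimate state while piloting, so in practice you might keep a second, looser baseline for non-tier-1 vaults rather than treating every finding as a failure.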

Azure Backup policy: every knob that sets your RPO and cost

A backup policy is the schedule-plus-retention contract attached to your protected items. It is where you set RPO (how often) and how much history you keep (retention), and it is the single biggest lever on both your recoverability and your bill. The Azure VM policy has these moving parts.

Backup frequency sets your RPO. The Standard policy is daily (one recovery point per day). The Enhanced policy (for VMs) supports multiple backups per day, on a schedule of every 4, 6, 8 or 12 hours, which tightens RPO; it is also the policy type to use for Trusted Launch and larger VMs. SQL-in-VM goes far tighter: transaction-log backups as frequently as every 15 minutes.

Retention is tiered — daily, weekly, monthly and yearly points kept for different durations, the classic grandfather-father-son scheme. You keep many recent daily points and a few long-lived yearly points, balancing recoverability against storage cost. Azure Backup supports retention up to 99 years for long-term archival.

Instant-restore snapshot retention controls how many days snapshots are kept in the source region for near-instant restores (1–5 days on the Standard policy, up to 30 on the Enhanced policy) before the data is only in vault storage. A longer instant-restore window means faster recent restores but more snapshot storage cost.

Inspect the default policy and protect a VM with it:

# Show the default policy, then protect a VM with it
az backup policy show --vault-name rsv-prod-cin --resource-group rg-resiliency \
  --name DefaultPolicy -o json

# Enable backup for a VM under a named policy
az backup protection enable-for-vm \
  --vault-name rsv-prod-cin --resource-group rg-resiliency \
  --vm $(az vm show -g rg-app -n vm-web-01 --query id -o tsv) \
  --policy-name DefaultPolicy

A custom policy in Bicep — daily at 02:00 UTC, 30 daily / 12 weekly / 12 monthly / 7 yearly points, 5-day instant restore:

resource vmPolicy 'Microsoft.RecoveryServices/vaults/backupPolicies@2024-04-01' = {
  name: '${rsv.name}/pol-vm-prod'
  properties: {
    backupManagementType: 'AzureIaasVM'
    instantRpRetentionRangeInDays: 5
    schedulePolicy: {
      schedulePolicyType: 'SimpleSchedulePolicy'
      scheduleRunFrequency: 'Daily'
      scheduleRunTimes: [ '2026-06-23T02:00:00Z' ]
    }
    retentionPolicy: {
      retentionPolicyType: 'LongTermRetentionPolicy'
      dailySchedule:   { retentionTimes: ['2026-06-23T02:00:00Z'], retentionDuration: { count: 30,  durationType: 'Days'   } }
      weeklySchedule:  { daysOfTheWeek: ['Sunday'], retentionTimes: ['2026-06-23T02:00:00Z'], retentionDuration: { count: 12, durationType: 'Weeks' } }
      monthlySchedule: { retentionScheduleFormatType: 'Weekly', retentionScheduleWeekly: { daysOfTheWeek:['Sunday'], weeksOfTheMonth:['First'] }, retentionTimes:['2026-06-23T02:00:00Z'], retentionDuration: { count: 12, durationType: 'Months' } }
      yearlySchedule:  { retentionScheduleFormatType: 'Weekly', monthsOfYear:['January'], retentionScheduleWeekly: { daysOfTheWeek:['Sunday'], weeksOfTheMonth:['First'] }, retentionTimes:['2026-06-23T02:00:00Z'], retentionDuration: { count: 7, durationType: 'Years' } }
    }
  }
}

Every policy setting, its default, when to change it, and the trade-off — this is the option matrix to keep open while you design:

Setting	Values	Default	When to change	Trade-off / gotcha
Policy type (VM)	Standard / Enhanced	Standard	Need hourly RPO, Trusted Launch, larger VMs	Enhanced costs more; some regions/SKUs only
Backup frequency	Daily / Hourly (4–12h)	Daily	Tighter RPO than a day	More points = more storage + snapshot churn
Daily retention	7–9999 days	30 days	Longer corruption-detection window	Storage grows with retention
Weekly retention	1–5163 weeks	off	Keep weekly checkpoints	More long-lived points
Monthly retention	1–1188 months	off	Compliance / monthly archive	Long-term storage cost
Yearly retention	1–99 years	off	Audit / legal hold	Cheapest per-point but accumulates
Instant-restore days	1–5	2	Faster recent restores	Snapshot storage cost in source region
Time zone	Any TZ	UTC	Align backup window to off-peak local	Mis-set window can hit business hours
SQL log frequency	15 min–24 h	— (when SQL)	Tight DB RPO	More log backups, more storage

The retention-tier mental model, with the cost intuition for each tier:

Tier	Typical retention	Recovers from	Cost intuition
Daily	7–30 days	Recent accidents, fast corruption	Most points; bulk of recent storage
Weekly	4–12 weeks	Slow corruption noticed weeks later	Fewer points, modest cost
Monthly	6–36 months	Compliance “show me last quarter”	Long-lived, accumulates
Yearly	1–10 (up to 99) years	Audit, legal hold	Cheap per-point but never-ending

Two real-world rules: set daily retention to at least 14–30 days so corruption noticed a week or two late is still recoverable (7 days is a common, painful choice that loses you the good copy, even though the policy default is 30); and keep instant-restore at 5 days for production VMs so the restores you actually run during an incident are fast rather than pulling slowly from vault tiers.
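The first rule reduces to a comparison you can run before an incident does it for you: retention has to outlast the time it takes you to notice corruption. A minimal sketch, assuming a hypothetical helper named retention_covers and whole-day inputs; the strict "greater than" is a conservative assumption, and the 9-day detection delay is only an illustrative figure.

```shell
# Hypothetical helper: does daily retention outlast detection latency?
# Both arguments are whole days. A point taken just before the corruption
# must still exist on the day you notice, so we require retention > latency.
retention_covers() {
  local retention_days=$1 detection_days=$2
  if (( retention_days > detection_days )); then
    echo "covered: $(( retention_days - detection_days )) day(s) of margin"
  else
    echo "exposed: short by $(( detection_days - retention_days + 1 )) day(s)"
  fi
}

retention_covers 7 9    # 7-day retention, corruption noticed on day 9
retention_covers 30 9   # 30-day retention, same incident
```

The first call reports the policy as exposed and the second as covered, which is the whole argument for treating 7-day retention as dev-only.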

Restore is a spectrum — pick the cheapest type that solves the incident

A policy creates recovery points; a restore uses one, and Backup gives you several restore types ranging from “copy one file” to “rebuild the whole machine.” Reaching for “create new VM” when the incident was a single deleted file wastes an hour. Match the restore type to the failure:

Restore type	What it does	Speed	Use when	Gotcha
File-level (item) restore	Mount the RP, copy individual files	Fast (no full restore)	One or few files lost/corrupted	Mounts via iSCSI; unmount after, or it lingers
Disk restore	Recover specific managed disks, attach	Medium	One disk corrupted; need data, not OS	You attach + reconfigure the VM
Replace existing	Overwrite the source VM’s disks from RP	Medium	Whole VM corrupted, same identity wanted	Original disks swapped; brief downtime
Create new VM	Build a fresh VM from the RP	Slowest (full copy)	Source gone; want a clean rebuild	New name/IP; re-plumb networking/DNS
Instant restore (snapshot)	Restore from source-region snapshot	Near-instant	Recent point within instant-restore window	Only covers the 1–5 day snapshot window
Cross-Region Restore	Restore in the paired region from GRS copy	Medium	Primary region unavailable	GRS vault + CRR flag only; egress cost

The restore-type decision as a quick lookup:

If you need to recover…	Use this restore type
A handful of files from last week	File-level restore
One corrupted data disk, keep the OS	Disk restore
The same VM rolled back in place	Replace existing
A clean machine because the original is wrecked	Create new VM
A recent point, as fast as possible	Instant restore (snapshot)
Anything while the primary region is down	Cross-Region Restore
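If you want the lookup inside a runbook rather than in someone's head, it fits in a case statement. This is only a sketch: the function name pick_restore_type and the incident keywords are hypothetical labels chosen for illustration, not Azure terms.

```shell
# Hypothetical lookup: map an incident keyword to the restore type
# from the table above. Unknown keywords fail loudly.
pick_restore_type() {
  case "$1" in
    files)       echo "File-level restore" ;;
    data-disk)   echo "Disk restore" ;;
    rollback)    echo "Replace existing" ;;
    rebuild)     echo "Create new VM" ;;
    recent)      echo "Instant restore (snapshot)" ;;
    region-down) echo "Cross-Region Restore" ;;
    *)           echo "unknown incident: $1" >&2; return 1 ;;
  esac
}

pick_restore_type files
pick_restore_type region-down
```

Failing on an unknown keyword is deliberate: a runbook that guesses a restore type is worse than one that stops and asks.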

Workload-specific backup: SQL, SAP HANA and Azure Files

VMs are the common case, but the Recovery Services vault protects database and file workloads with their own mechanisms and far tighter RPOs than daily VM snapshots. Know the model per workload:

Workload	Backup mechanism	Tightest RPO	Restore granularity	Key requirement
Azure VM	VM snapshot + vault	4 h (Enhanced)	File / disk / whole VM	Healthy VM Agent
SQL Server in Azure VM	Full + differential + log stream	15 min (log)	Point-in-time to the second	SQL extension, `sysadmin` for the backup service account
SAP HANA in Azure VM	Backint full + log stream	15 min (log)	Point-in-time	Certified Backint config
Azure File Share	Vault-managed snapshots	Per schedule (hourly+)	Individual files / full share	Share registered to vault
Azure Blob (Backup vault)	Operational + vaulted	Continuous (operational)	Point-in-time within window	Backup vault, not RSV
Azure Managed Disk (Backup vault)	Incremental snapshot	Per schedule	Whole disk	Snapshot resource group

For SQL-in-VM specifically, the three backup types compose into point-in-time recovery — and missing the log backups is the usual reason “we can only restore to last midnight”:

SQL backup type	What it captures	Typical frequency	Role in PITR
Full	Entire database	Daily/weekly	The base to restore from
Differential	Changes since last full	Daily	Speeds restore, less log replay
Transaction log	Every committed transaction	Every 15 min	Rolls forward to any point in time
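The way the three types compose is easiest to see as arithmetic. The sketch below assumes a simplified schedule: one full backup at minute 0, differentials at a fixed interval after it, and log backups at a fixed interval that divides the differential interval evenly. The function name pitr_chain and the sample numbers are hypothetical.

```shell
# Hypothetical helper: given a target time (minutes after the full backup),
# report which differential to restore, how many log backups to replay,
# and how much you lose if log backups were never configured.
pitr_chain() {
  local target=$1 diff_every=$2 log_every=$3
  # Latest differential at or before the target (0 means "full only").
  local diff_at=$(( target / diff_every * diff_every ))
  # Log backups needed to roll forward from that differential to the target.
  local logs=$(( (target - diff_at + log_every - 1) / log_every ))
  echo "diff_at=${diff_at} logs_to_replay=${logs} loss_without_logs_min=$(( target - diff_at ))"
}

# Target is 2 days, 10 h 7 min after the full; daily differentials; 15-min logs.
pitr_chain 3487 1440 15
```

The last field is the "we can only restore to last midnight" problem in numbers: without the log chain, everything after the most recent differential is gone.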

Azure Site Recovery: how replication actually works

Site Recovery’s job is to keep a bootable replica of your VM in another region, continuously, so you can fail over fast. Understanding the mechanism removes the mystery from the failure modes later.

For an Azure-to-Azure scenario (the common case), enabling replication on a source VM sets up the Site Recovery Mobility extension inside the VM, which intercepts disk writes and ships them to a cache storage account in the source region; ASR then asynchronously replicates that to target-region managed disks that form the replica. ASR continuously builds crash-consistent recovery points (typically every 5 minutes) and application-consistent recovery points on a configurable cadence (default every hour, using VSS on Windows / pre-post scripts on Linux). The result: an RPO usually in the single-digit minutes, and a menu of recovery points to fail over to. For on-premises sources (VMware, Hyper-V, physical), the architecture adds a configuration/process server appliance that aggregates and forwards replication, but the recovery-point concepts are identical.

ASR supports several source/target scenarios, and the moving parts differ — know which architecture you’re running before you debug it:

Scenario	Source → target	Extra infrastructure	Typical use
Azure-to-Azure (A2A)	Azure VM → another Azure region	None (Mobility ext + cache SA only)	Regional DR for cloud VMs
VMware → Azure	On-prem VMware → Azure region	Configuration + process server appliance	Migrating/DR’ing VMware estates
Hyper-V → Azure	On-prem Hyper-V → Azure region	Provider on host (+ VMM if used)	DR for Hyper-V workloads
Physical → Azure	Bare-metal server → Azure region	Process server appliance	DR for legacy physical servers
Azure-to-Azure (zonal)	VM in one AZ → another AZ	None	Intra-region zone resilience

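Enable replication for an Azure VM from the CLI (the site-recovery az extension). The A2A-specific inputs travel in --provider-details, and flag names have shifted between extension versions, so confirm them with az site-recovery protected-item create --help:

# Replicate an Azure VM to a target region via an ASR-enabled Recovery Services vault
VM_ID=$(az vm show -g rg-app -n vm-web-01 --query id -o tsv)
TARGET_RG_ID=$(az group show -n rg-dr-southindia --query id -o tsv)

# Extend the a2a block with vm-managed-disks (cache storage account per disk)
# and the recovery network settings your target requires.
az site-recovery protected-item create \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --fabric-name asr-cin --protection-container pc-cin \
  --name vm-web-01 \
  --policy-id "<replication-policy-id>" \
  --provider-details "{a2a:{fabric-object-id:$VM_ID,recovery-container-id:<target-container-id>,recovery-resource-group-id:$TARGET_RG_ID}}"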

The replication policy itself controls the consistency cadence and how many points you keep:

resource asrPolicy 'Microsoft.RecoveryServices/vaults/replicationPolicies@2024-04-01' = {
  name: '${rsv.name}/pol-asr-a2a'
  properties: {
    providerSpecificInput: {
      instanceType: 'A2A'
      recoveryPointHistory: 1440           // minutes of recovery points retained (24h)
      appConsistentFrequencyInMinutes: 60  // app-consistent point cadence
      crashConsistentFrequencyInMinutes: 5 // crash-consistent point cadence
      multiVmSyncStatus: 'Enable'
    }
  }
}

The replication-policy knobs and their trade-offs:

Setting	Values	Default	When to change	Trade-off
Recovery-point retention	0–72 hours (A2A)	24 h	More points to choose from	More cache + storage
App-consistent frequency	1 min–12 h (or off)	60 min	Tighter clean-restore granularity	VSS overhead in the guest
Crash-consistent frequency	5 min (fixed for A2A)	5 min	—	—
Multi-VM consistency	On / Off	Off	App spans VMs needing same instant	Groups VMs; shared replication group
Target region	Any paired/allowed region	paired	Compliance / latency	Egress + capacity in target
Target disk type	Standard/Premium SSD	match source	Cost vs failover IOPS	Cheaper disk = slower failover perf

The three consistency levels, side by side — know which one your restore needs:

Consistency level	How it’s captured	Data integrity on restore	Best for	Cost/overhead
Crash-consistent	Disk state at an instant (no quiesce)	Boots; DB may run crash recovery, lose last writes	Stateless tiers; tight RPO	Lowest (every 5 min)
File-system-consistent	OS cache flushed (Linux)	Files coherent on disk	General Linux servers	Low
Application-consistent	VSS / scripts quiesce the app	Transactionally clean moment	Databases, stateful apps	Higher (VSS pauses writers)

The honest RPO/RTO you can promise — and what each tier actually costs in effort and money:

Approach	Realistic RPO	Realistic RTO	Cost	When it’s the right call
Daily Backup only	Up to 24 h	Hours (restore time)	Low	Non-critical; data, not uptime
Hourly Backup (Enhanced)	4–12 h	Hours	Low-medium	Important data, lax uptime
ASR replication	Minutes	Minutes + cutover	Medium	Tier-1 needing fast regional DR
ASR + automated recovery plan	Minutes	Tighter, repeatable	Medium-high	Multi-tier apps, audited RTO
Active-active multi-region	~Zero	~Zero (no failover)	Highest	Can’t tolerate any failover gap

A blunt truth about ASR RTO: the boot is fast (minutes), but your real RTO includes DNS propagation, identity/dependency cutover, and any manual verification. Teams that promise “15-minute RTO” because the VM boots in 15 minutes get a nasty surprise when DNS TTLs and a forgotten connection-string change add an hour. Measure RTO end-to-end in a test failover, not from the boot time.
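To make that concrete, add up every step between "failover triggered" and "customers served", not just the boot. A minimal sketch, assuming a hypothetical helper rto_check that takes the target in minutes followed by name=minutes pairs; the figures reuse the 15-minute boot and the extra hour described above.

```shell
# Hypothetical helper: sum the end-to-end RTO components and compare
# the total with the promised target. Arguments: target, then name=minutes.
rto_check() {
  local target_min=$1; shift
  local total=0 step
  for step in "$@"; do
    total=$(( total + ${step#*=} ))   # keep only the number after '='
  done
  if (( total <= target_min )); then
    echo "RTO ${total} min: within ${target_min} min target"
  else
    echo "RTO ${total} min: over ${target_min} min target by $(( total - target_min )) min"
  fi
}

rto_check 15 boot=15 dns_and_config=60
```

Feed it the timings you measure in a test failover; the components you forgot to list are usually the ones that break the promise.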

Recovery plans and orchestrated failover

A pile of replicated VMs is not a recoverable application — the database must come up before the app tier, the app tier before the web tier, and somewhere in there a script rewrites a connection string and updates DNS. A recovery plan encodes that: an ordered set of groups of VMs, with pre/post actions (manual steps or Azure Automation runbooks) between groups, so a failover executes as a single, repeatable, auditable operation instead of a frantic improvisation.

A typical three-tier plan:

Group	Contents	Pre-action	Post-action
Group 1	Database VMs	(none)	Runbook: verify DB online, open firewall
Group 2	App-tier VMs	Manual: confirm DB healthy	Runbook: update app config / conn string
Group 3	Web-tier VMs	(none)	Runbook: update Traffic Manager / DNS
Post-plan	—	—	Runbook: smoke test, notify on-call
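The property that matters in such a plan is ordering with a hard stop: each group or action runs only if everything before it succeeded. The sketch below imitates that behaviour locally with stand-in functions. It does not call Site Recovery, and every function name in it is hypothetical.

```shell
# Hypothetical simulation of recovery-plan ordering: run steps in sequence
# and halt at the first failure so later tiers never start on a broken base.
run_plan() {
  local step
  for step in "$@"; do
    if "$step"; then
      echo "ok   $step"
    else
      echo "HALT $step"
      return 1
    fi
  done
  echo "plan complete"
}

# Stand-ins for groups and runbooks (true = success, false = failure).
start_db()           { true; }
verify_db()          { true; }
start_app()          { true; }
update_conn_string() { false; }   # simulates a stale runbook
start_web()          { true; }

run_plan start_db verify_db start_app update_conn_string start_web \
  || echo "failover incomplete: fix the failed step and re-run"
```

Here the web tier never starts because the connection-string step failed, which is the behaviour you want: a halted plan is a visible problem, while a web tier pointing at nothing is a silent one.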

Run a test failover, an unplanned failover, and the commit from the CLI (subcommand and flag names follow the current site-recovery extension; confirm with --help on your version):

# TEST failover into an isolated network (no production impact): your rehearsal
az site-recovery recovery-plan test-failover \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --name rp-shop-prod \
  --failover-direction PrimaryToRecovery \
  --network-type VmNetworkAsInput \
  --network-id $(az network vnet show -g rg-dr-southindia -n vnet-dr-isolated --query id -o tsv) \
  --provider-specific-details '[{a2a:{recovery-point-type:Latest}}]'

# UNPLANNED failover (the real disaster; source may be gone)
az site-recovery recovery-plan unplanned-failover \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --name rp-shop-prod \
  --failover-direction PrimaryToRecovery \
  --source-site-operations NotRequired \
  --provider-specific-details '[{a2a:{recovery-point-type:Latest}}]'

# COMMIT once you've verified the failed-over app
az site-recovery recovery-plan failover-commit \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --name rp-shop-prod

The failover types, when to use each, and the data-loss implication:

Failover type	Source state	Data loss	When to use	Networking
Test failover	Source still running	None (isolated)	Quarterly rehearsal, audit evidence	Isolated VNet, no prod impact
Planned failover	Source healthy, controlled	Zero (clean shutdown first)	Migration, scheduled DR drill	Production target
Unplanned failover	Source degraded/gone	From latest available point (RPO)	Real disaster	Production target
Failback (re-protect)	Primary recovered	Minimal (reverse-replicate first)	Return to primary post-incident	Reverse direction

The full failover lifecycle — the order is the discipline:

Phase	What happens	You do	Common miss
Replicate	Continuous disk copy to target	Monitor RPO health	Ignoring replication-lag alerts
Test failover	Replica boots in isolated net	Verify app, then clean up	Forgetting cleanup → orphan cost
Unplanned failover	Replica boots in production	Run recovery plan, verify	No DNS/identity cutover plan
Commit	Failover finalised, points freed	Confirm before committing	Committing before verifying
Re-protect	Reverse replication secondary→primary	Enable once primary returns	Skipping → no way back
Failback	Return workload to primary	Planned failover in reverse	Never testing failback

The non-negotiable habit: run a test failover every quarter. It is the only thing that proves the plan works, surfaces drift (a new VM not in the plan, a script that rots, an IP that changed), and gives auditors evidence. A test failover into an isolated network has zero production impact — there is no excuse not to. Then clean up the test (a single action) so you are not paying for orphaned test VMs.

Architecture at a glance

The diagram traces both protection paths from the same source workload, so you can see how Backup and Site Recovery operate in parallel on the very same VMs. Read it left to right. On the far left sits the source estate in the primary region — your web, app and database VMs, each with an in-guest agent (the Backup VM extension and the ASR Mobility extension) doing two jobs at once. The Backup path (top) snapshots each VM and writes application-consistent recovery points into a Recovery Services vault, where the hardening lives: soft delete, immutability (locked) and Multi-User Authorization are the controls that keep those points alive even if an attacker with admin rights tries to delete them. The vault’s GRS redundancy with Cross-Region Restore means a second copy already sits in the paired region, restorable on your schedule. The Site Recovery path (bottom) streams disk writes through a source-region cache storage account into continuously replicated target-region managed disks, building crash- and application-consistent recovery points minutes apart.

Follow the flows to the right and the two paths converge on recovery. From Backup you choose a restore type — file, disk, or whole VM — to undo a deletion or roll back corruption. From Site Recovery you trigger a failover that a recovery plan orchestrates: database group first, app group next, web group last, with runbooks rewriting connection strings and updating DNS / Traffic Manager in between, landing the running application in the secondary region. The numbered badges mark the five places this architecture most often fails — an unhealthy guest agent that silently breaks backups, a vault left without immutability that ransomware erases, replication lag that quietly breaches your RPO, a failover that stalls because the recovery plan was never tested, and the cutover plumbing (DNS, identity) that leaves the app running but unreachable. The legend narrates each as symptom, the command to confirm it, and the fix — the same method as every incident: localise the failure to one hop, confirm with the named tool, apply the fix.

Real-world scenario

Northwind Financial runs a customer loan-origination platform on Azure: a three-tier app — two web VMs, two app VMs, and a clustered SQL Server pair — on Standard D-series instances in Central India, fronted by Application Gateway, serving roughly 3,000 loan applications a day. Compliance requires a 4-hour RTO and a 15-minute RPO for the loan database, plus seven-year retention of monthly backups for audit. The platform team is five engineers; the resilience budget is about ₹85,000/month.

Their original setup looked responsible and wasn’t. Azure Backup ran daily VM backups with 7-day retention into a Recovery Services vault — LRS, in the same region. No Site Recovery. No immutability. They had never run a restore test. On paper: “we have backups.” In reality, three latent failures stacked: 7-day retention couldn’t satisfy a 7-year audit or a corruption noticed late; an LRS vault would die with the region in a regional outage; and without ASR there was no way to meet a 4-hour RTO if Central India went dark.

The wake-up call was a near-miss, not a disaster. A botched schema migration corrupted a loan-status column, and the bad data wasn’t noticed for nine days — by which point the only clean copy had aged out of the 7-day retention. They recovered by manually reconstructing the column from downstream audit logs over a weekend. The post-incident review was blunt: the backups had worked perfectly and were useless, because the retention window was shorter than their detection latency. That single sentence reset the whole programme.

The rebuild had three parts. First, the vault. They recreated it as GRS with Cross-Region Restore, enabled enhanced soft delete (always-on, 30 days), and turned on immutability (locked) plus MUA with the security team holding the Resource Guard, so a compromised platform admin could no longer weaken backups. Second, the policy. They moved to tiered retention (30 daily, 12 weekly, 36 monthly, 7 yearly points) and added SQL transaction-log backups every 15 minutes to hit the 15-minute database RPO, with a 5-day instant-restore window for fast recent restores. Third, Site Recovery. They enabled ASR replication of all six VMs to South India, authored a recovery plan sequencing SQL → app → web with runbooks to rewrite the app’s connection string and update Traffic Manager DNS, and, as the crucial habit, scheduled a quarterly test failover into an isolated VNet.

The first test failover was humbling and exactly the point: the database came up, but the app tier failed because the runbook still pointed at the old connection string, and the measured end-to-end RTO was 5 hours 40 minutes — over their 4-hour target, almost entirely DNS TTL (set to 1 hour) and manual verification. They fixed the runbook, dropped the DNS TTL to 60 seconds, and automated the smoke test. The next quarterly test measured 2 hours 50 minutes, comfortably inside RTO, with the database at a 12-minute RPO. Eight months later, when Central India had a genuine storage-tier incident, they failed over for real in 2 hours 35 minutes with 9 minutes of data loss — inside both targets, no heroics, because the plan had been rehearsed four times. The lesson on the wall: “A green backup job is a hypothesis. A passed test restore is a capability. Only one of them pays out at 2 a.m.”

The programme as a before/after, because the gaps are the lesson:

Aspect	Before (looked safe)	After (was safe)	Why it mattered
Vault redundancy	LRS (same region)	GRS + Cross-Region Restore	Survives a region loss
Soft delete / immutability	Off	Enhanced (always-on) + locked	Survives ransomware deleting backups
Daily retention	7 days	30 days	Corruption noticed day 9 still recoverable
Long-term retention	None	12 wk / 36 mo / 7 yr	Meets the 7-year audit
Database RPO	24 h (daily)	15 min (log backups)	Meets compliance RPO
Regional DR	None	ASR replica to South India	Meets the 4-hour RTO
Recovery orchestration	None	Recovery plan + runbooks	Repeatable, auditable failover
Tested?	Never	Quarterly test failover	Found the broken runbook before the disaster
Measured RTO	Unknown (hope)	2h35m (real incident)	A number, not a prayer

Advantages and disadvantages

The Backup-plus-Site-Recovery model gives you broad, managed protection without a secondary datacenter to run — but it is not free, and it decays without discipline. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
Backup gives granular recovery (file → disk → whole VM) for delete and corruption	Backup alone can’t meet a tight RTO for a region loss — restore takes real time
ASR gives whole-workload failover to another region in minutes	ASR is the more expensive, more operationally demanding service; over-buying it is common
Hardened vault (soft delete, immutability, MUA) defeats ransomware that deletes backups	Defaults are unsafe: LRS, no immutability, no MUA — you must turn the knobs
Recovery plans turn a pile of VMs into a sequenced, auditable failover	A recovery plan is a hypothesis until tested; plans decay silently (drift)
No secondary infrastructure to run/patch until you actually fail over	You pay continuously for replica storage and protected instances even when idle
Long-term retention (up to 99 years) covers compliance archival cheaply per-point	Storage cost accumulates relentlessly with retention; easy to over-retain
Cross-region restore lets you recover in the paired region on your schedule	Cross-region restore I/O and egress add cost; only on GRS vaults
Application-consistent points (VSS) give clean database restores	App-consistency adds in-guest overhead; misconfigured scripts give only crash-consistent

The model is right for the overwhelming majority of estates: protect everything with Backup (it is cheap insurance against the most common loss — accidental deletion and corruption), and layer ASR onto the tier-1 subset that cannot tolerate regional downtime. It bites hardest on teams who confuse Backup with DR (and discover at the incident that restoring 50 VMs serially blows their RTO), who deploy with default redundancy and no immutability (and lose backups to ransomware), and who set up DR and never test it (and find the recovery plan broken when it matters). Every disadvantage is manageable — but only if you know it exists, which is the entire point of doing this deliberately.

Hands-on lab

Protect a VM with Azure Backup, harden the vault, take an on-demand backup, and run a file-level restore — all on a single small VM you delete at the end. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-backup-lab
LOC=centralindia
VAULT=rsv-lab-$RANDOM
VM=vm-lab-01
az group create -n $RG -l $LOC -o table

Step 2 — Create a small Linux VM to protect.

az vm create -g $RG -n $VM --image Ubuntu2204 --size Standard_B1s \
  --admin-username azureuser --generate-ssh-keys --public-ip-sku Standard -o table
# Drop a file we'll later "lose" and restore
az vm run-command invoke -g $RG -n $VM --command-id RunShellScript \
  --scripts "echo 'critical-loan-data-v1' | sudo tee /home/azureuser/important.txt"

Step 3 — Create a Recovery Services vault and harden it.

az backup vault create -n $VAULT -g $RG -l $LOC -o table

# Enhanced soft delete (always-on, 14 days) — irreversible hardening
az backup vault backup-properties set -n $VAULT -g $RG \
  --soft-delete-feature-state AlwaysON --soft-delete-duration 14

# Confirm the hardening took
az backup vault backup-properties show -n $VAULT -g $RG \
  --query "{soft:softDeleteFeatureState, days:softDeleteRetentionPeriodInDays}" -o table

Expected: soft = AlwaysON, days = 14.

Step 4 — Enable backup on the VM with the default policy.

az backup protection enable-for-vm -v $VAULT -g $RG \
  --vm $(az vm show -g $RG -n $VM --query id -o tsv) \
  --policy-name DefaultPolicy -o table

Step 5 — Trigger an on-demand backup (don’t wait for the schedule).

CONTAINER=$(az backup container list -v $VAULT -g $RG \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)
ITEM=$(az backup item list -v $VAULT -g $RG \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)

az backup protection backup-now -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" \
  --retain-until $(date -d "+30 days" +%d-%m-%Y) -o table

Watch the job until it completes (this takes several minutes — the first backup copies the full disk):

az backup job list -v $VAULT -g $RG --query "[0].{op:properties.operation, status:properties.status}" -o table

Expected: eventually status = Completed.
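Rather than re-running the list command by hand, you can poll. The sketch below is a generic loop, not an Azure feature: wait_for_status is a hypothetical helper that runs whatever status command you pass it until that command prints a terminal state. The commented usage line assumes you captured the job name in a variable called JOB.

```shell
# Hypothetical helper: poll a status command until it reports a terminal state.
# Usage: wait_for_status <interval_seconds> <max_tries> <command...>
wait_for_status() {
  local interval=$1 max_tries=$2; shift 2
  local status i
  for (( i = 1; i <= max_tries; i++ )); do
    status=$("$@")
    case "$status" in
      Completed*)       echo "$status"; return 0 ;;   # includes CompletedWithWarnings
      Failed|Cancelled) echo "$status"; return 1 ;;
    esac
    sleep "$interval"
  done
  echo "TimedOut"
  return 2
}

# Against a real job (assumes $JOB holds the job name):
#   wait_for_status 30 60 az backup job show -v "$VAULT" -g "$RG" -n "$JOB" \
#     --query properties.status -o tsv

# Local demonstration with a stand-in command:
wait_for_status 0 3 echo Completed
```

The distinct return codes (0 done, 1 failed, 2 timed out) let a pipeline decide whether to proceed, page someone, or just keep waiting.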

Step 6 — List recovery points and start a file-level restore.

RP=$(az backup recoverypoint list -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" \
  --query "[0].name" -o tsv)

# Mount the recovery point as an iSCSI target with a download script (file recovery)
az backup restore files mount-rp -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" --rp-name "$RP" -o json
# The output gives a script + password; running it mounts the recovery point's disks
# so you can copy /home/azureuser/important.txt back. Unmount when done:
az backup restore files unmount-rp -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" --rp-name "$RP"

Validation checklist. You created a hardened vault (enhanced soft delete, always-on), protected a VM, took an on-demand recovery point (file-system-consistent on a stock Linux VM, since application consistency on Linux needs pre/post scripts), and exercised file-level restore by mounting the recovery point, without ever needing the original VM intact. That is the whole Backup loop. What each step proves:

Step	What you did	What it proves	Real-world analogue
3	Enhanced soft delete always-on	The vault resists “delete the backups”	Ransomware hardening
4	Enable backup with a policy	Protection is policy-driven, not ad-hoc	Onboarding every prod VM
5	On-demand backup	You can force a point before risky change	Pre-deployment safety snapshot
6	Mount RP for file restore	Granular recovery without a full VM rebuild	The 90% case: “restore one file”

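Cleanup (avoid lingering vault/VM charges). Stop protection and delete the backup data before deleting the resource group, or the vault blocks deletion. One catch follows from Step 3: with soft delete always-on, the deleted recovery points stay in the soft-deleted state for the 14-day retention period, and the vault cannot be removed until that period lapses. If the group delete fails on the vault, delete the VM directly and re-run the group delete once the soft-deleted data has expired: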

# Stop protection AND delete backup data (lab only — never --delete-backup-data in prod casually)
az backup protection disable -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" \
  --delete-backup-data true --yes

az group delete -n $RG --yes --no-wait

Cost note. A B1s VM is a few rupees per hour and a single recovery point is a tiny storage charge; an hour of this lab is well under ₹50. Deleting the VM stops the compute charge immediately; the vault, holding only soft-deleted data, can be removed once the 14-day soft-delete period from Step 3 lapses.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you read mid-incident, then the entries that bite hardest with the full confirm-command detail.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	VM backup job fails with `UserErrorGuestAgentStatusUnavailable`	VM Agent / extension not running or unhealthy	`az vm get-instance-view -g RG -n VM --query "instanceView.vmAgent.statuses"`	Restart/repair VM Agent; reinstall the backup extension
2	Backup job stuck “In Progress” for hours	Snapshot/VSS hang, or another op holding the VM	Backup jobs blade; `az backup job list`; VM activity log	Cancel job; check VSS writers; retry; reboot if VSS wedged
3	Need to restore but the recovery point you want isn’t there	Retention too short — point aged out	`az backup recoverypoint list` (oldest point’s date)	Increase retention now; recover from CRR/secondary if any
4	“Cannot change vault redundancy” when moving LRS→GRS	Vault already holds a protected item	`az backup item list` shows items	Decide redundancy at creation; new vault + re-protect
5	Deleted a backup item by mistake — is it gone?	Soft delete window (if enabled) still holds it	Backup items → “soft-deleted” filter	`az backup protection undelete`; re-enable protection
6	ASR replication health “Critical”, RPO climbing	Replication lag — network/throughput/cache full	Site Recovery → Replicated items → RPO; cache SA metrics	Increase cache SA, check egress/throughput, throttle source I/O
7	Test failover boots but app unreachable	DNS/identity/dependency cutover not done	App in isolated VNet; check name resolution + app config	Add DNS/identity to recovery-plan runbooks; verify in test
8	Failover stuck / fails at a recovery-plan group	Stale runbook, missing target resource, dependency miss	Recovery plan job; Automation runbook error	Fix runbook; ensure target RG/VNet/NSG exist; re-run group
9	Restored VM boots but the database needs recovery	Only crash-consistent points captured	Recovery points show consistency type	Enable app-consistent (VSS / scripts); pick app-consistent point
10	SQL-in-VM backup fails / no log backups	SQL extension/AAD config wrong; perms missing	Vault → Backup items (SQL); SQL ext logs	Re-register SQL, grant the backup service account `sysadmin`, fix extension
11	Backup costs ballooning month over month	Over-retention + high churn + instant-restore days	Cost analysis by vault; protected-instance count	Trim retention tiers; review instant-restore days; archive tier
12	Can’t delete a recovery point / shorten retention	Immutability locked (working as intended)	Vault security settings show “Locked”	Wait for natural expiry; (locked = irreversible by design)
13	New VMs silently unprotected	No backup-coverage policy/automation	Backup center → protectable items; Azure Policy compliance	Azure Policy to auto-enable backup; Backup center coverage
14	Failback not possible after primary returns	Never re-protected after failover	ASR → replicated items show no reverse replication	Enable re-protect (reverse replication), then planned failback

The expanded form, with the full reasoning for the entries that bite hardest:

1. VM backup fails with UserErrorGuestAgentStatusUnavailable (or ExtensionStuckInDeletionOrTransitioning). Root cause: The Azure VM Agent is stopped, outdated, or the backup (VM snapshot) extension is unhealthy — Backup coordinates the snapshot through the agent, so a dead agent means no application-consistent point. Confirm:

az vm get-instance-view -g rg-app -n vm-web-01 \
  --query "instanceView.vmAgent.{status:statuses[0].displayStatus, version:vmAgentVersion}" -o table

Fix: Ensure the VM Agent is running and current (restart it inside the guest), then let Backup re-deploy its extension (or remove and re-enable protection). On Linux, confirm the walinuxagent service is up; on Windows, the WindowsAzureGuestAgent service.

3. The recovery point you need isn’t there — retention was too short. Root cause: The classic Northwind failure — corruption noticed after the point aged out of retention. Backup did its job; the retention window was shorter than your detection latency. Confirm:

# Oldest available recovery point — if it's newer than the corruption, you're out of luck
az backup recoverypoint list -v rsv-prod-cin -g rg-resiliency \
  --container-name "$C" --item-name "$I" \
  --query "sort_by([].{time:properties.recoveryPointTime}, &time)[0]" -o table

Fix: There is no fix after the fact except recovering from a cross-region/secondary copy if one exists. The real fix is preventive: set daily retention to ≥ 14–30 days so late-noticed corruption is still recoverable. Treat 7-day retention as dev-only.
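The retention-versus-detection-latency rule can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of any Azure SDK: the function name is hypothetical, and it assumes a simplified model of one recovery point per day, where a point taken on day D is kept through day D + retention.

```python
from datetime import date, timedelta


def clean_point_available(corruption_day: date, detected_day: date,
                          retention_days: int) -> bool:
    """Return True if a pre-corruption daily point still exists at detection.

    Simplified model (assumption): one recovery point per day, and a point
    taken on day D is retained through day D + retention_days.
    """
    last_clean_point = corruption_day - timedelta(days=1)
    expires_on = last_clean_point + timedelta(days=retention_days)
    return detected_day <= expires_on


if __name__ == "__main__":
    corrupted = date(2024, 1, 10)
    noticed = corrupted + timedelta(days=9)  # corruption found on day 9
    for retention in (7, 14, 30):
        ok = clean_point_available(corrupted, noticed, retention)
        print(f"{retention}-day retention: clean point available = {ok}")
```

Under this model a 7-day policy has already expired the last clean point by the time the corruption is noticed, while 14 or 30 days still holds it.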

6. ASR replication health goes Critical and RPO climbs above target. Root cause: Replication lag — the source is generating writes faster than they replicate, usually because the cache storage account is throttled/full, egress bandwidth is constrained, or a burst of disk churn overwhelmed the pipe. Confirm: Site Recovery → Replicated items shows per-VM RPO and health; the cache storage account’s metrics show throttling. Via KQL over the vault’s ASR logs:

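// ASR replication health / RPO breaches in the last 6 hours.
// Table and column names depend on how the vault's diagnostic settings are configured;
// in Azure diagnostics mode the same data lands in AzureDiagnostics
// (Category == "AzureSiteRecoveryReplicationStats", column rpoInSeconds_d).
ASRReplicationStats
| where TimeGenerated > ago(6h)
| where RpoInSeconds > 900   // your RPO target in seconds (15 min)
| project TimeGenerated, ReplicationProtectedItemName, RpoInSeconds, ReplicationHealth
| order by TimeGenerated desc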

Fix: Move the cache to a higher-performance storage account, ensure the source region has egress headroom, reduce a runaway write workload, and verify the replication policy’s retention isn’t oversized for the available throughput. RPO health is a leading indicator — alert on it before a real failover needs a fresh point.

7. Test failover boots the VMs but the application is unreachable. Root cause: The VMs are up in the isolated network, but DNS, identity, and dependency cutover were never part of the plan — the app can’t resolve names, can’t authenticate, or points at a primary-region dependency that isn’t in the test bubble. Confirm: In the isolated VNet, check name resolution from a failed-over VM and inspect the app’s configuration for primary-region hostnames; the application logs will show connection failures to unresolvable or unreachable endpoints. Fix: Add DNS updates, identity/endpoint rewrites, and dependency stand-ins to the recovery-plan runbooks, and verify them in the test failover — which is exactly what test failovers are for. An app that boots but can’t serve is the most common “passed the failover, failed the recovery” trap.
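The name-resolution part of that confirmation is easy to script. Here is a rough sketch in Python, intended to be run from a failed-over VM inside the isolated VNet; the function name is hypothetical and the dependency list is whatever your application's configuration actually points at.

```python
import socket


def unresolved_hosts(hostnames):
    """Return the hostnames that fail DNS resolution from this machine."""
    failed = []
    for name in hostnames:
        try:
            # Port is irrelevant here; we only care whether the name resolves.
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed
```

Calling `unresolved_hosts([...])` with the hostnames from the app's connection strings returns the ones the test bubble cannot resolve. It only proves resolution, not reachability or authentication, so treat an empty result as the first check passed rather than the recovery verified.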

9. The restored VM boots but the database runs crash recovery / lost recent transactions. Root cause: You restored from a crash-consistent recovery point, not an application-consistent one — the DB’s in-flight writes were torn at capture. Confirm: The recovery-point list shows each point’s consistency type; if your latest is crash-consistent, that’s why. Fix: Ensure application-consistent points are being created (VSS on Windows is default for VM backup; on Linux configure pre/post scripts), and when restoring a database, choose an application-consistent recovery point even if it is slightly older than the latest crash-consistent one — clean beats recent for stateful data.
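The "clean beats recent" rule can be expressed as a small selection function. This is an illustrative Python sketch only: the `RecoveryPoint` type and the "app" / "crash" labels are assumptions made for the example, not the values the Azure Backup API returns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecoveryPoint:
    name: str
    time: datetime
    consistency: str  # illustrative labels: "app" or "crash"


def pick_for_database(points: Sequence[RecoveryPoint],
                      not_after: datetime) -> Optional[RecoveryPoint]:
    """Newest application-consistent point at or before `not_after`.

    Crash-consistent points are ignored even when they are more recent.
    Returns None if no application-consistent point qualifies.
    """
    candidates = [p for p in points
                  if p.consistency == "app" and p.time <= not_after]
    return max(candidates, key=lambda p: p.time, default=None)
```

Given an application-consistent point at 02:00 and a crash-consistent point at 06:00, the function returns the 02:00 point. A `None` result is the signal that application-consistent points are not being created at all.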

13. New VMs are silently unprotected — coverage drift. Root cause: Backup is enabled per-item, and without automation, new VMs ship without protection. The estate’s coverage silently decays as it grows. Confirm: Backup center → protectable items lists VMs with no backup; Azure Policy compliance shows the gap. Fix: Use an Azure Policy that auto-enables backup on new VMs (built-in policies exist for “Configure backup on VMs”), and review Backup center coverage as a routine. Coverage is a governance problem; solve it with policy, not vigilance.
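Coverage drift is, at bottom, a set difference between the VMs that exist and the VMs that have a backup item. Here is a minimal Python sketch of that comparison, assuming you have already exported both lists of resource IDs (for example from the CLI or Backup center); the function name is hypothetical.

```python
def unprotected_vms(all_vm_ids, protected_vm_ids):
    """Return VM resource IDs that have no matching backup item.

    Azure resource IDs are case-insensitive, so compare in lower case.
    """
    protected = {vm_id.lower() for vm_id in protected_vm_ids}
    return sorted(vm_id for vm_id in all_vm_ids
                  if vm_id.lower() not in protected)
```

A non-empty result is the drift. Azure Policy remains the fix, and a check like this is only a way to see the gap between policy evaluations.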

Best practices

Protect everything with Backup; reserve ASR for tier-1. Backup is cheap insurance against the most common loss (deletion, corruption); ASR is for the subset that cannot tolerate a regional outage. Don’t over-buy ASR.
Harden every production vault. Enable enhanced soft delete (always-on), immutability (locked) once confident, and Multi-User Authorization with the Resource Guard held by a separate team. These three defeat ransomware that deletes backups.
Choose GRS + Cross-Region Restore at vault creation. Redundancy is immutable once the vault holds an item. An LRS vault dies with its region — useless for the regional disaster you’re insuring against.
Set daily retention to at least 14–30 days. Seven days is a common, painful default that loses you the clean copy when corruption is noticed late. Match retention to your detection latency, not optimism.
Test your recovery — restores and failovers — on a schedule. A green backup job and an enabled replication are hypotheses. Run a file/VM restore test and a test failover quarterly, then clean up. Untested DR is theatre.
Sequence failover with recovery plans and runbooks. Encode the tier order (DB → app → web) and automate the cutover (connection strings, DNS) so a real failover is repeatable, not improvised.
Make recovery points application-consistent for stateful workloads. VSS (Windows) / pre-post scripts (Linux) give transactionally clean restores; crash-consistent alone risks last-seconds data loss and DB recovery time.
Set RPO/RTO from business impact and prove you meet them. Measure RTO end-to-end in a test failover (including DNS TTL and verification), not from VM boot time. A number from a rehearsal beats a number from a hope.
Automate backup coverage with Azure Policy. New VMs should be protected by policy, not by someone remembering. Coverage drift is a governance failure.
Right-size retention to control cost. Trim daily/weekly/monthly tiers to what compliance and recovery actually need; over-retention is the top cause of a ballooning backup bill.
Tighten DNS TTLs on DR-fronted endpoints. A 1-hour TTL can add up to an hour to your failover RTO, because clients keep using the cached record until it expires. Drop critical-path TTLs to 30–60 seconds so cutover propagates fast.
Alert on leading indicators. Watch ASR RPO health, backup job failures, and backup coverage, not just “the restore failed,” which is a lagging signal you find too late.
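Two of the practices above (measure RTO end to end, and tighten DNS TTLs) come down to simple addition. The sketch below is a worked example in Python under stated assumptions: the phase durations are hypothetical rehearsal numbers, and the model is a worst case in which a client cached the old record just before cutover and holds it for a full TTL.

```python
def end_to_end_rto_minutes(failover_min: float, cutover_min: float,
                           dns_ttl_seconds: float,
                           verification_min: float) -> float:
    """Worst-case RTO in minutes: VM boot + cutover + full DNS TTL + verification."""
    return failover_min + cutover_min + dns_ttl_seconds / 60 + verification_min


if __name__ == "__main__":
    # Hypothetical rehearsal: 10 min boot, 5 min cutover, 5 min verification.
    print(end_to_end_rto_minutes(10, 5, 3600, 5))  # 1-hour TTL   -> 80.0
    print(end_to_end_rto_minutes(10, 5, 60, 5))    # 60-second TTL -> 21.0
```

With identical failover mechanics, the TTL alone moves the worst case from 80 minutes to 21, which is why VM boot time on its own is not an RTO.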

The alerts worth wiring before the next incident — the leading indicators:

Alert on	Signal	Threshold (starting point)	Why it’s leading
Backup job failure	Failed backup jobs	≥ 1 failure	Catches agent/extension breakage before a restore needs it
ASR RPO health	`RpoInSeconds` per item	> your RPO target	Warns RPO is breaching before a real failover
Replication health	ASR item health = Critical	Any item Critical	Lag/throughput problem surfacing early
Backup coverage	Unprotected protectable VMs	≥ 1	Coverage drift as the estate grows
Soft-deleted items	Items in soft-delete state	≥ 1 unexpected	Possible malicious/accidental deletion in flight
Vault security drift	Soft delete / immutability off	Any disabled on prod	Someone weakened the hardening

Security notes

Harden the vault as the primary control. Soft delete + immutability + MUA are security controls, not just operational ones — they are what stands between a credential-theft incident and total backup loss. Treat the vault’s data-protection settings as security-critical configuration.
Least-privilege RBAC on backup operations. Grant Backup Operator (run backups/restores) or Backup Reader rather than Owner; reserve Backup Contributor for those who manage policy. Destructive operations should require MUA approval.
Multi-User Authorization with a cross-boundary Resource Guard. Hold the Resource Guard in a different subscription/tenant controlled by a security team, so a compromised workload admin cannot both encrypt data and weaken the vault. This is the single most important anti-ransomware control after backup itself.
Customer-managed keys (CMK) where compliance requires it. Vaults encrypt at rest with platform-managed keys by default; for regulatory control you can use CMK from Key Vault — but then guard the Key Vault as carefully as the backups, and keep the key recoverable (its loss makes backups undecryptable). See Azure Key Vault: Secrets, Keys and Certificates Done Right.
Private endpoints for vault traffic. Use private endpoints so backup/restore traffic to the vault stays off the public internet, and so a compromised network can’t exfiltrate or tamper with recovery traffic.
Isolate the recovery environment for ransomware. For the highest tier, recover into a clean, isolated environment (an Isolated Recovery Environment) so you don’t re-introduce the malware during restore — covered in depth in Ransomware Resilience: Immutable Backups, Recovery Vaults, and Isolated Recovery Environments.
Protect the failed-over identity and secrets path. A failed-over app needs its secrets (Key Vault), identity (managed identity / Entra), and certificates available in the secondary region — replicate or co-locate them, or the app boots but can’t authenticate.
Audit destructive operations. Send vault diagnostic logs to a Log Analytics workspace and alert on disable-soft-delete, retention-reduction, and delete-backup-item operations — these are the fingerprints of an attacker preparing to delete backups.

The security controls that also improve resilience — secure and recoverable pull together here:

Control	Mechanism	Secures against	Also improves
Enhanced soft delete (always-on)	Vault data-protection setting	Attacker deleting backups	Accidental-delete recovery
Immutability (locked)	Vault security setting	Shortening/deleting points before expiry	Compliance retention guarantees
Multi-User Authorization	Resource Guard (separate tenant)	A single compromised admin	Change-control discipline
Least-privilege RBAC	Backup Operator/Reader roles	Over-broad backup/restore rights	Cleaner operational ownership
Private endpoints	Vault private link	Public exposure of backup traffic	Network reliability/isolation
CMK + guarded Key Vault	Customer-managed encryption keys	Regulatory key-control gaps	Defined key lifecycle
Diagnostic logging + alerts	Vault logs → Log Analytics	Silent malicious operations	Faster incident detection

Cost & sizing

The bill has a few dominant drivers, and they interact with every design choice above.

Protected-instance charge is the per-item monthly fee for each thing you back up (it scales by the size of the protected instance in tiers, e.g. up to 50 GB, 50–500 GB, then per 500 GB). This is often the largest line on small estates — every protected VM/DB carries it regardless of storage used.
Backup storage is billed per GB of retained recovery-point data, and it is LRS/ZRS/GRS-priced — GRS storage costs more than LRS because it keeps a geo copy. Retention multiplies this: 7 years of monthly points accumulates relentlessly. Archive tier for long-term, rarely-touched points cuts this materially.
Instant-restore snapshots are charged as managed-disk snapshots in the source region for the instant-restore window (1–5 days). More instant-restore days = faster recent restores = more snapshot cost.
ASR replication charges a per-protected-instance monthly fee, plus target-region replica storage, cache storage, and egress for the replication traffic. ASR is materially more expensive than Backup per workload, which is why you reserve it for tier-1.
Cross-region restore / egress adds I/O and egress when you actually restore in the paired region — cheap insurance relative to the outage it covers, but real.

Free and cheap angles worth knowing:

Item	Cost reality	Cheap lever
Soft delete (within window)	Free during the soft-delete retention period	Always enable it — no cost to safety
First 5 GB / month per region (Azure Files snapshot)	Often within free allowance for small shares	Keep small-share snapshots lean
LRS vs GRS storage	GRS ~2× LRS storage price	Use LRS for dev; GRS only where region loss matters
Archive tier (LTR)	Far cheaper per-GB than hot vault storage	Tier long-term monthly/yearly points to archive
ASR per-instance fee	Charged per replicated VM, continuously	Replicate only tier-1, not the whole estate
Instant-restore days	Snapshot storage per day retained	Drop to 1–2 days for non-critical VMs

A rough monthly picture for a small production estate (~10 VMs, ~2 TB protected, tier-1 subset of 3 VMs on ASR):

Cost driver	What you pay for	Rough INR / month	What it buys	Watch-out
Protected instances (10 VMs)	Per-instance monthly fee	~₹8,000–14,000	Backup coverage of the estate	Scales with instance size tiers
Backup storage (GRS, ~2 TB retained)	Per-GB retained, geo-priced	~₹10,000–20,000	Recoverable history	Grows with retention; archive old tiers
Instant-restore snapshots (5 days)	Source-region snapshot storage	~₹2,000–5,000	Fast recent restores	Trim days on non-critical VMs
ASR (3 tier-1 VMs)	Per-instance fee + replica + cache	~₹6,000–12,000	Fast regional failover	Most expensive per-workload — tier-1 only
Cross-region restore / egress	Restore I/O + egress when used	Episodic	Self-service paired-region restore	Only on a real restore/failover
Log Analytics (vault logs)	Per-GB ingestion	~₹1,000–3,000	Alerting + audit on destructive ops	Sample/route verbosely-logged vaults
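To turn the table above into a single monthly figure, the recurring ranges can simply be summed. The Python sketch below uses only the rough INR ranges from the table and leaves out the episodic cross-region restore line, since it has no monthly number. The names are illustrative.

```python
# Rough monthly ranges (low, high) in INR, copied from the table above.
ESTIMATE_INR = {
    "Protected instances (10 VMs)": (8_000, 14_000),
    "Backup storage (GRS, ~2 TB retained)": (10_000, 20_000),
    "Instant-restore snapshots (5 days)": (2_000, 5_000),
    "ASR (3 tier-1 VMs)": (6_000, 12_000),
    "Log Analytics (vault logs)": (1_000, 3_000),
}


def total_range(lines):
    """Sum the low and high ends of every recurring line item."""
    low = sum(lo for lo, _ in lines.values())
    high = sum(hi for _, hi in lines.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(ESTIMATE_INR)
    print(f"Recurring total: ~INR {low:,} to {high:,} per month")
```

On these figures the recurring total comes to roughly ₹27,000–54,000 per month, with backup storage the largest single line at both ends of the range.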

The cost discipline is the same as the resilience discipline: right-size retention (don’t keep 7 years of daily points), tier long-term data to archive, reserve ASR for tier-1, and measure — Northwind ended up cheaper after redesign in some line items because they stopped over-retaining daily points and only replicated the three VMs that actually needed it. For estate-wide cost control, pair this with Azure FinOps and Cost Management: Controlling Cloud Spend at Scale.

Interview & exam questions

1. What is the fundamental difference between Azure Backup and Azure Site Recovery? Azure Backup protects data — it takes point-in-time recovery points so you can restore a file, disk, database or whole VM after deletion or corruption. Azure Site Recovery protects availability — it continuously replicates a VM to another region so you can fail the workload over during a regional outage. Backup answers “can I get my data back?”; ASR answers “can I run my app elsewhere?” Most production workloads need both because they cover different failure modes.

2. A team backs up VMs daily but has no DR. A region fails. Why can’t Backup meet a 1-hour RTO? Restoring from Backup means copying recovery-point data back and rebuilding VMs, which takes real time proportional to data size — restoring many multi-hundred-GB VMs serially blows a 1-hour RTO. Backup is optimised for granular data recovery, not fast whole-region failover. The tool for a tight regional RTO is Site Recovery, which boots an already-replicated replica in minutes.

3. How does Azure Backup defend against ransomware that deletes the backups? Three layered vault controls. Soft delete retains deleted recovery points for a window (14–180 days) so a malicious delete is recoverable, and enhanced soft delete (always-on) makes that setting irreversible so an attacker can’t disable it first. Immutability (locked) makes recovery points un-deletable and un-shortenable before expiry. Multi-User Authorization requires a second approver (via a Resource Guard in a separate tenant) for destructive operations. Together they ensure a compromised admin cannot erase the backups.

4. What’s the difference between crash-consistent and application-consistent recovery points, and when does it matter? A crash-consistent point captures disks at an instant as if the power were pulled — it boots, but in-flight writes may be torn and a database may run crash recovery and lose the last transactions. An application-consistent point uses VSS (Windows) or pre/post scripts (Linux) to quiesce the application first, producing a transactionally clean moment. It matters for databases and stateful apps: always restore them from an application-consistent point, even if it’s slightly older than the latest crash-consistent one.

5. Why is vault redundancy a decision you must make at creation, and what should production use? Vault storage redundancy (LRS/ZRS/GRS) is immutable once the vault holds a protected item — you can’t change LRS→GRS without deleting all items. An LRS vault keeps all copies in one region, so a regional disaster destroys the backups with the data. Production backups should use GRS with Cross-Region Restore, which keeps a geo copy and lets you restore in the paired region on your schedule.

6. What is RPO vs RTO, and what governs each for Backup and ASR? RPO (Recovery Point Objective) is the maximum acceptable data loss in time — governed by how often you create recovery points (backup frequency for Backup; continuous replication for ASR, often minutes). RTO (Recovery Time Objective) is the maximum acceptable time to restore service — governed by how fast you restore/fail over and re-plumb (restore time for Backup; replica boot + DNS/identity cutover for ASR). Both should be set from business impact and proven in a test, not assumed.

7. Why must you test failovers, and what does a test failover do? A recovery plan is a hypothesis until exercised — DR plans decay (IPs change, scripts rot, new VMs aren’t added). A test failover boots the replicas in an isolated network with no impact to production or replication, so you can verify the app actually comes up, measure end-to-end RTO, and produce audit evidence — then clean it up. Skipping it is how “we have DR” becomes an 18-hour outage.

8. A recovery point you need to restore from has aged out of retention. What went wrong and how do you prevent it? The retention window was shorter than the detection latency — corruption was noticed after the clean point expired (e.g. 7-day retention, corruption found on day 9). There’s no fix after the fact except a cross-region/secondary copy if one exists. Prevent it by setting daily retention to at least 14–30 days so late-noticed corruption is still recoverable; treat 7-day retention as dev-only.

9. What does a recovery plan add over just replicating VMs with ASR? A pile of replicated VMs isn’t a recoverable application — tiers must come up in order and the cutover needs automation. A recovery plan sequences VMs into groups (DB → app → web) with pre/post actions (manual gates or Azure Automation runbooks) for things like rewriting connection strings and updating DNS, turning failover into a single repeatable, auditable operation instead of an improvisation under stress.

10. Your ASR replication health goes Critical and RPO climbs. What’s happening and what do you check? Replication lag — the source is generating writes faster than they replicate, usually due to a throttled/full cache storage account, constrained egress, or a churn burst. Check Site Recovery → Replicated items for per-VM RPO/health and the cache storage account’s throttling metrics. Fix by upgrading the cache storage, ensuring egress headroom, and reducing runaway write workloads. RPO health is a leading indicator — alert on it.

11. What’s the difference between a Recovery Services vault and a Backup vault? A Recovery Services vault is the long-standing vault for VM/file/SQL/SAP backup and Azure Site Recovery. A Backup vault is the newer model for cloud-native workloads the Recovery Services vault never covered — Azure Blobs, managed disks, PostgreSQL flexible server, AKS — with native immutability and MUA. You can’t migrate items between them, and ASR/VMs only live in the Recovery Services vault, so pick correctly before creating anything.

12. How do you ensure newly created VMs are actually protected? Backup is enabled per-item, so without automation new VMs ship unprotected and coverage drifts as the estate grows. Use an Azure Policy (built-in “Configure backup on VMs”) to auto-enable backup on new VMs, and review Backup center coverage routinely. Coverage is a governance problem solved with policy, not vigilance.

These map primarily to AZ-104 (Administrator) — implement and manage backup and recovery (Recovery Services vaults, backup policies, ASR) — and AZ-305 (Solutions Architect Expert) — design business-continuity solutions (RPO/RTO, backup vs DR, recovery objectives). The ransomware/immutability angle touches SC-100/AZ-500. A compact cert-mapping for revision:

Question theme	Primary cert	Exam objective area
Backup vs ASR, RPO/RTO	AZ-305	Design business-continuity solutions
Recovery Services vault, policies, retention	AZ-104	Implement and manage backup
ASR replication, recovery plans, failover	AZ-104 / AZ-305	Implement DR; design BC
Soft delete, immutability, MUA	AZ-500 / SC-100	Secure backup; ransomware resilience
Crash- vs app-consistent, restore types	AZ-104	Backup and recovery operations
Vault redundancy (LRS/ZRS/GRS)	AZ-305	Design for resiliency / data redundancy

Quick check

A user accidentally deletes a critical file; you need it back from three weeks ago. Which service, and which restore type?
Central India region goes fully offline and you must serve customers within 30 minutes. Which service, and why does daily Backup alone fail here?
True or false: you can change a vault’s redundancy from LRS to GRS after you’ve been backing up 50 VMs into it for a year.
A ransomware operator gets admin and deletes your backup items. Name the two vault controls that would still save you.
Your test failover boots all VMs successfully but the application can’t serve traffic. What’s the most likely missing piece, and where do you fix it?

Answers

Azure Backup, file-level restore: mount the recovery point from three weeks ago and copy the file back. This requires daily retention of at least 21 days. ASR can’t do this: it retains recovery points only for a short window (hours to days, depending on the replication policy) and fails over whole VMs rather than restoring individual files.
Azure Site Recovery — it keeps a continuously replicated, bootable replica in a paired region you can fail over to in minutes. Daily Backup fails the 30-minute RTO because restoring means copying recovery-point data back and rebuilding VMs, which takes far longer than booting an already-replicated replica.
False. Vault redundancy is immutable once the vault holds a protected item. To go LRS→GRS you’d have to stop protection and delete all backup data first (or create a new GRS vault and re-protect everything). Decide redundancy at creation.
Soft delete (ideally enhanced/always-on) retains the deleted recovery points for a recoverable window even after deletion; immutability (locked) prevents the points being deleted or their retention shortened at all. Multi-User Authorization further blocks a single compromised admin from disabling either. Any of these defeats the “delete the backups” step.
DNS / identity / dependency cutover wasn’t part of the plan — the VMs are up but can’t resolve names, authenticate, or reach a primary-region dependency. Fix it by adding those cutover steps (DNS updates, connection-string/identity rewrites) to the recovery-plan runbooks and verifying them in the test failover — which is exactly what test failovers exist to catch.

Glossary

Azure Backup — the service that takes point-in-time recovery points of VMs, files, SQL, SAP HANA, Azure Files and blobs into a vault, for restore after deletion or corruption.
Azure Site Recovery (ASR) — the service that continuously replicates VMs to another region and orchestrates failover for regional disaster recovery.
Recovery Services vault — the classic vault for VM/file/SQL/SAP backup and the control plane for ASR; holds RBAC, redundancy and soft-delete settings.
Backup vault — the newer vault for cloud-native workloads (blobs, managed disks, PostgreSQL flexible server, AKS) with native immutability and MUA; does not do VMs or ASR.
Recovery point — a point-in-time copy you can restore to; the unit of “how far back can I go.”
Backup policy — the schedule + tiered retention (daily/weekly/monthly/yearly) attached to protected items, defining RPO and history.
Soft delete — retention of deleted backup data for a window (14–180 days) so deletion is recoverable; enhanced/always-on makes it irreversible.
Immutability — recovery points cannot be deleted or have retention shortened before expiry; locked immutability is irreversible.
Multi-User Authorization (MUA) — destructive vault operations require a second approver via a Resource Guard, typically in a separate tenant.
RPO (Recovery Point Objective) — the maximum acceptable data loss, in time; set by backup frequency / replication cadence.
RTO (Recovery Time Objective) — the maximum acceptable time to restore service; set by restore/failover speed plus cutover.
Crash-consistent — a recovery point captured at an instant with no quiesce (as if power was pulled); boots but may lose last writes.
Application-consistent — a recovery point taken after quiescing the app (VSS / scripts) for a transactionally clean restore.
Instant restore — fast restore from snapshots kept in the source region for a 1–5 day window, before data is only in vault storage.
Replication — ASR’s continuous copy of disk writes to a target region via a cache storage account, building recovery points.
Recovery plan — an ordered set of VM groups with pre/post scripts/runbooks that orchestrate a sequenced, repeatable failover.
Test failover — booting replicas in an isolated network with no production impact; the rehearsal that proves DR works.
Failback / re-protect — reversing replication and returning the workload to the primary region after it recovers.
Cross-Region Restore (CRR) — a GRS-vault capability to restore in the paired region on demand, without waiting for a Microsoft-declared failover.
GRS / ZRS / LRS — geo-, zone-, and locally-redundant storage options for the vault; only GRS survives a regional loss.

Next steps

You can now protect data with Backup, defend it against ransomware with a hardened vault, and stand up regional DR with Site Recovery and a tested recovery plan. Build outward:

Next: High Availability vs Disaster Recovery: RTO and RPO Explained — set the objectives that drive every choice in this article before you size anything.
Related: Ransomware Resilience: Immutable Backups, Recovery Vaults, and Isolated Recovery Environments — go deep on the immutable-vault and isolated-recovery patterns that turn backups into a genuine ransomware defence.
Related: Azure Regions and Availability Zones: Designing for Resilience — the region/zone substrate that backup redundancy and ASR replication depend on.
Related: Azure Multi-Region Active-Active Architecture: Designing for Zero-Downtime — when minutes of failover are too many and you need active-active instead of (or alongside) ASR.
Related: Azure Storage Account Fundamentals: Blobs, Files, Queues and Tables — the LRS/ZRS/GRS redundancy model that also governs your vault, plus blob backup.
Related: Azure FinOps and Cost Management: Controlling Cloud Spend at Scale — keep retention and ASR spend honest as the estate grows.