Azure Resiliency

Azure Backup and Site Recovery: Protecting Workloads from Loss

The worst phone call of a career is not “the site is down.” It is “the site is down and the backups don’t restore.” A manufacturing client of mine backed up their file servers religiously for three years and never once tried to recover a whole machine. When ransomware encrypted their production estate on a Tuesday night, the backup jobs were green, the data was intact — and it still took eighteen hours to get the application serving customers, because nobody had ever built or rehearsed an orchestrated recovery. The data came back; the service did not, for the better part of a day, because restoring four hundred files is a different problem from standing up an application tier. That gap — between “I have the bytes” and “I can run the app” — is exactly the gap that Azure Backup and Azure Site Recovery (ASR) fill, and they fill different halves of it.

Azure Backup answers one question: can I get my data back to how it was at a point in time? It takes point-in-time copies (application-consistent where the workload supports it) of VMs, files, SQL Server, SAP HANA, Azure Files and blobs into a hardened Recovery Services vault (or, for newer workloads, a Backup vault), and lets you restore a single file, a single disk, or an entire machine. Azure Site Recovery answers a completely different question: can I run my whole application somewhere else, fast, in a defined order? It continuously replicates a VM’s disks to a secondary region (commonly the paired region), and on failover it powers on the replicas, attaches networking, and walks a recovery plan you authored: web tier, then app tier, then database, with scripts in between. Backup is your time machine against deletion and corruption; Site Recovery is your second site against a regional outage. You need both, and confusing them is how you end up with eighteen-hour Tuesdays.

This article is the production playbook for both. You will learn the vault model and why soft delete plus immutability is the only thing standing between you and a ransomware operator who got domain admin; every backup policy knob (frequency, retention tiers, instant-restore snapshots) and what each costs; how ASR replication actually works (the appliance, the cache storage account, crash- vs app-consistent recovery points, the RPO/RTO you can realistically promise); how to build and — the part everyone skips — test a recovery plan without touching production; and a structured failure→cause→confirm→fix table for the dozen ways these jobs break in the real world. Every operation gets the exact az command, a Bicep equivalent, and a KQL query where the answer lives in logs. The prose explains the why; the tables — there are many — are the reference you keep open at 02:00 when the vault is throwing UserErrorGuestAgentStatusUnavailable and the CFO is on the bridge.

What problem this solves

Data and applications die from causes that look nothing alike, and a single mechanism cannot defend against all of them. An engineer fat-fingers a DROP TABLE or deletes the wrong resource group. A bad deployment corrupts data subtly for six hours before anyone notices. A ransomware operator encrypts every reachable disk and then hunts down and deletes the backups, because they know that intact backups are the only thing that lets you refuse the ransom. An Azure region has a storage incident and your entire workload — perfectly healthy code — is unreachable for hours. Each of these is a different attack surface, and “we have backups” is a meaningless statement until you say which failure mode you mean.

What breaks without a real strategy is not the backup — it is the recovery. Teams discover, mid-incident, that their retention was 7 days and the corruption started on day 9; that the backups were in the same region that just failed; that the vault had no soft delete so the ransomware deleted the recovery points along with the data; that nobody knew the order to bring tiers up, or that the application needs a connection-string rewrite and a DNS swap that lives only in one person’s head — and that person is on a flight. The cruel truth of disaster recovery is that an untested recovery plan is a hypothesis, not a capability. DR plans decay silently: an IP changes, a dependency is added, a script rots, and the plan that worked at the last audit fails at the real incident.

Who hits this: everyone running production in the cloud, but it bites hardest on teams that treat backup as a checkbox rather than a tested capability. The finance and healthcare teams who must prove recoverability to auditors. The lean startups who set up daily VM backups, feel safe, and never once run a test restore. The enterprises with sprawling estates where backup coverage (is every new VM actually protected?) silently drifts. And anyone who thinks a snapshot is a backup — snapshots live next to the thing they protect and die with it, which is precisely useless against ransomware or a region loss.

To frame the whole field before the deep dive, here is the threat model: every loss event this article defends against, which service answers it, and the one control that actually saves you.

| Loss event | What’s lost | Primary defence | The control that saves you |
|---|---|---|---|
| Accidental delete (file/VM/RG) | Specific objects | Azure Backup | Soft delete on the vault + retention ≥ 14 days |
| Slow data corruption | Recent good state | Azure Backup | Long-enough retention + point-in-time choice |
| Ransomware encrypts disks | All reachable data | Azure Backup | Immutable vault + soft delete + MUA |
| Ransomware deletes backups | Your only recovery | Azure Backup | Immutability lock + Multi-User Authorization |
| Single-VM crash / OS rot | One machine | Azure Backup (restore VM) | Tested whole-VM restore, not just file restore |
| Availability-zone failure | One zone | Zone-redundant design | ZRS storage / zonal redundancy (often not DR) |
| Regional outage | The whole workload | Azure Site Recovery | Replication to a paired region + tested failover |
| Region loss + no order to recover | Time and sanity | ASR recovery plan | Sequenced, script-driven, rehearsed plan |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already be comfortable with the Azure resource model — subscriptions, resource groups, regions and Azure paired regions — and able to run az in Cloud Shell, read JSON output, and reason about managed disks and VNets. Helpful but not required: a working mental model of RTO (how long until the app is back) and RPO (how much data you can afford to lose), and a passing familiarity with managed identities and RBAC, because the vault’s access model leans on both. If those resilience terms are fuzzy, the conceptual groundwork lives in High Availability vs Disaster Recovery: RTO and RPO Explained; the region/zone substrate these services replicate across is covered in Azure Regions and Availability Zones: Designing for Resilience.

This sits in the Resiliency & Business Continuity track. It is downstream of basic compute and storage and upstream of full multi-region architecture. Backup and ASR are components of a resilience strategy, not the whole thing: they pair with active-active patterns from Azure Multi-Region Active-Active Architecture: Designing for Zero-Downtime (for workloads that cannot tolerate even a short failover), with the storage redundancy concepts in Azure Storage Account Fundamentals: Blobs, Files, Queues and Tables (LRS/ZRS/GRS, which also govern vault redundancy), and with the secret-protection discipline in Azure Key Vault: Secrets, Keys and Certificates Done Right (because customer-managed keys and the keys your failed-over app needs both live there). The hardest ransomware variant of this topic — air-gapped, immutable, isolated recovery — gets its own deep treatment in Ransomware Resilience: Immutable Backups, Recovery Vaults, and Isolated Recovery Environments.

A quick map of who owns what during a recovery, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it causes |
|---|---|---|---|
| Source workload (VM/DB/files) | The data being protected | App / DBA team | Agent down → backup fails; app inconsistency |
| In-guest agent (MARS / VM ext) | Snapshot coordination | Platform + app | Extension unhealthy → job fails |
| Recovery Services / Backup vault | Recovery points, policy, soft delete | Backup / platform team | Misconfigured retention; no immutability |
| Vault redundancy (LRS/ZRS/GRS) | Where copies physically live | Platform / architecture | Same-region copy lost in a regional outage |
| ASR replication path | Disk replication + recovery points | Platform / network | Replication lag → RPO breach |
| Recovery plan + scripts | Failover order, automation | App + platform | Wrong order, stale script → long RTO |
| DNS / networking / identity | Cutover plumbing | Network + identity | App up but unreachable; auth broken |

Core concepts

Six mental models make every later decision obvious.

Backup protects state; Site Recovery protects service. This is the master distinction and it drives everything. Azure Backup captures point-in-time copies of data so you can roll an object back to how it was — a file, a disk, a database, a whole VM. Azure Site Recovery captures a continuously updated replica of a running machine so you can power it on elsewhere. Backup’s unit of value is a recovery point (a moment you can return to); ASR’s unit of value is a failover (the act of running the workload in the secondary site). Backup defends against deletion and corruption, which are time problems; ASR defends against unavailability, which is a location problem. A VM can need both: Backup to undo a bad change, ASR to survive a region outage.

The vault is the trust boundary — and its hardening is the whole game against ransomware. Recovery points live in a vault (Recovery Services vault for the classic estate; the newer Backup vault for blobs, disks, Azure Database for PostgreSQL flexible server, AKS and more). The vault is a control-plane object with its own RBAC, its own redundancy setting, and — critically — its own data-protection controls: soft delete (deleted recovery points are retained, recoverable, for a window rather than purged immediately), immutability (recovery points cannot be deleted or shortened before expiry), and Multi-User Authorization (MUA) (destructive operations require a second approver via a Resource Guard). A modern ransomware playbook is encrypt the data, then delete the backups; these three controls are specifically what defeat the second step. A vault without them is a backup that an attacker with your credentials can erase.

Crash-consistent is not application-consistent, and the difference is your data integrity. Every recovery point that Backup or ASR captures has one of three consistency levels. A crash-consistent point is “as if you pulled the power cord”: disks are captured at an instant and in-flight writes may be torn; the machine boots, but a database may need crash recovery and could lose the last transactions. A file-system-consistent point flushes the OS file cache first (Linux), so on-disk files are coherent. An application-consistent point uses VSS on Windows (or pre/post scripts on Linux) to quiesce the application (flush database buffers, freeze writers) so the recovery point is a clean, transactionally consistent moment. For databases and stateful apps you want application-consistent points; ASR creates them on a configurable cadence, and Backup uses VSS by default for Windows VMs. If you only have crash-consistent points, plan for extra recovery time and possible loss of the last few seconds of data.
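To make the preference for application-consistent points concrete, here is a minimal Python sketch of choosing which recovery point to restore after an incident. The `RecoveryPoint` type, the `CONSISTENCY_RANK` ordering and `pick_recovery_point` are hypothetical names invented for illustration; they are not part of any Azure SDK, and a real selection would read the recovery points from the vault.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Hypothetical ranking for illustration: a higher number is cleaner to recover from.
CONSISTENCY_RANK = {"crash": 0, "file-system": 1, "application": 2}


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    consistency: str  # one of the keys in CONSISTENCY_RANK


def pick_recovery_point(
    points: List[RecoveryPoint],
    incident_at: datetime,
    minimum: str = "application",
) -> Optional[RecoveryPoint]:
    """Return the newest point taken before the incident that meets `minimum`
    consistency; fall back to the newest earlier point of any kind."""
    earlier = [p for p in points if p.taken_at < incident_at]
    if not earlier:
        return None
    good_enough = [
        p for p in earlier
        if CONSISTENCY_RANK[p.consistency] >= CONSISTENCY_RANK[minimum]
    ]
    # Prefer a clean point even if it is older; otherwise accept what exists.
    return max(good_enough or earlier, key=lambda p: p.taken_at)


points = [
    RecoveryPoint(datetime(2026, 6, 23, 1, 0), "application"),
    RecoveryPoint(datetime(2026, 6, 23, 1, 55), "crash"),
    RecoveryPoint(datetime(2026, 6, 23, 2, 5), "crash"),  # taken after the incident
]
incident = datetime(2026, 6, 23, 2, 0)
chosen = pick_recovery_point(points, incident)
print(chosen)
```

The sketch trades recency for cleanliness: it returns the 01:00 application-consistent point rather than the 01:55 crash-consistent one, which is the same trade a DBA makes when an hour of replayable data is cheaper than a suspect database.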

RPO and RTO are promises with prices, not aspirations. RPO (Recovery Point Objective) is the maximum data loss you accept, measured in time: “we can lose at most 15 minutes.” It is governed by how often you create recovery points: backup frequency for Backup (hourly to daily), and continuous replication for ASR (RPO often a few minutes, with app-consistent points at a configurable interval measured in hours). RTO (Recovery Time Objective) is the maximum time to restore service: “we are back within 2 hours.” It is governed by how fast you can restore or fail over and re-plumb: restoring a 2 TB VM from backup takes real time; an ASR failover boots a replica in minutes, but DNS, identity and dependency cutover add to it. Tighter RPO/RTO costs more (more frequent points, hot replicas, more automation). The discipline is to set them from business impact, not ambition, and then test that you meet them.
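The arithmetic behind those two promises is simple enough to sketch. The Python below is an illustrative model, not a sizing tool: `Schedule`, `meets_rpo` and `estimate_rto_hours` are hypothetical helpers, and the restore throughput and cutover time are assumptions you would replace with numbers measured in your own test restores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    name: str
    interval_minutes: int  # time between consecutive recovery points


def worst_case_rpo_minutes(schedule: Schedule) -> int:
    # Worst case: the failure hits just before the next point is due,
    # so everything written since the previous point is lost.
    return schedule.interval_minutes


def meets_rpo(schedule: Schedule, target_minutes: int) -> bool:
    return worst_case_rpo_minutes(schedule) <= target_minutes


def estimate_rto_hours(size_gb: float, throughput_mb_per_s: float,
                       cutover_minutes: float) -> float:
    """Rough RTO: time to move the data plus time to re-plumb DNS, identity, etc."""
    transfer_seconds = size_gb * 1024 / throughput_mb_per_s
    return transfer_seconds / 3600 + cutover_minutes / 60


schedules = [
    Schedule("daily VM backup", 24 * 60),
    Schedule("4-hourly VM backup", 4 * 60),
    Schedule("SQL log backup", 15),
]
for s in schedules:
    verdict = "meets" if meets_rpo(s, target_minutes=60) else "misses"
    print(f"{s.name}: {verdict} a 60-minute RPO")

# Assumed figures: 360 GB restored at 100 MB/s with 30 minutes of cutover work.
print(f"estimated RTO: {estimate_rto_hours(360, 100, 30):.2f} h")
```

Only the 15-minute log backup meets a 60-minute RPO here, which is the point of the exercise: the target comes from the business, and the schedule either satisfies it or does not.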

Restore is a spectrum, not a button. Azure Backup does not just “restore the VM.” It offers, from cheapest/fastest to most complete: file-level restore (mount a recovery point and copy individual files), disk restore (recover specific managed disks and attach them), replace existing (overwrite the source VM’s disks), and create new VM (build a fresh VM from the recovery point). Instant restore uses snapshots retained in the source region for a configurable window (1–5 days) so recent restores are near-instant and don’t pull from vault storage. Choosing the right restore type for the incident — one file vs a whole machine — is the difference between a five-minute fix and an hour-long rebuild.

Failover has phases, and “test” is the most important one. ASR failover is not a single act. A test failover spins up the replica in an isolated network with no impact to production or replication — this is your rehearsal and your audit evidence, and you should run it quarterly. A planned failover (zero data loss, for a controlled migration) shuts the source down cleanly first. An unplanned failover (the real disaster) runs from the latest available recovery point because the source is gone. After the dust settles you commit the failover (finalising it) and, when the primary region returns, re-protect and fail back. The lifecycle — replicate → test → fail over → commit → re-protect → fail back — is the thing you must understand, because skipping “test” is how the eighteen-hour Tuesday happens.
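The lifecycle above can be sketched as a small state machine. This is a simplified model for reasoning about the order of operations: the `Phase` names and the `ALLOWED` transition table are assumptions made for illustration, not ASR API states, and the real service has additional intermediate states (for example, cleaning up a test failover).

```python
from enum import Enum
from typing import Iterable


class Phase(Enum):
    REPLICATING = "replicating"
    TEST_FAILOVER = "test failover"
    FAILED_OVER = "failed over"
    COMMITTED = "committed"
    REPROTECTED = "re-protected"
    FAILED_BACK = "failed back"


# Which phase may follow which. A test failover returns to steady-state
# replication after cleanup; it never leads to a commit.
ALLOWED = {
    Phase.REPLICATING: {Phase.TEST_FAILOVER, Phase.FAILED_OVER},
    Phase.TEST_FAILOVER: {Phase.REPLICATING},
    Phase.FAILED_OVER: {Phase.COMMITTED},
    Phase.COMMITTED: {Phase.REPROTECTED},
    Phase.REPROTECTED: {Phase.FAILED_BACK},
    Phase.FAILED_BACK: {Phase.REPLICATING},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target`, or raise if the lifecycle does not allow it."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


def run(path: Iterable[Phase]) -> Phase:
    phase = Phase.REPLICATING
    for target in path:
        phase = advance(phase, target)
    return phase


# A rehearsal followed by a real incident and the return home.
final = run([
    Phase.TEST_FAILOVER, Phase.REPLICATING,
    Phase.FAILED_OVER, Phase.COMMITTED,
    Phase.REPROTECTED, Phase.FAILED_BACK, Phase.REPLICATING,
])
print(final)
```

Encoding the order this way makes one property explicit: a commit is only reachable through a real failover, and a test failover always returns to replication without touching production.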

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Term | One-line definition | Which service | Why it matters |
|---|---|---|---|
| Recovery Services vault | Classic vault for VM/file/SQL/SAP backup + ASR | Both | The hardened store; RBAC + redundancy + soft delete live here |
| Backup vault | Newer vault for blobs, disks, AKS, flexible-server DBs | Backup | Where modern workload backups go; supports immutability + MUA |
| Recovery point | A point-in-time copy you can restore to | Backup / ASR | The unit of “how far back can I go” |
| Backup policy | Schedule + retention rules attached to items | Backup | Defines RPO and how long you keep history |
| Soft delete | Deleted points retained, recoverable, for a window | Both | Defeats “attacker deletes the backups” |
| Immutability | Points can’t be deleted/shortened before expiry | Both | The lock ransomware can’t pick |
| MUA / Resource Guard | Destructive ops need a second approver | Both | Stops a single compromised admin |
| RPO | Max acceptable data loss (in time) | Both | Set by frequency / replication cadence |
| RTO | Max acceptable time to restore service | Both | Set by restore/failover speed + plumbing |
| Crash- vs app-consistent | Power-pull vs quiesced (VSS) recovery point | Both | Data integrity of what you restore |
| Instant restore | Snapshot-backed fast restore in source region | Backup | Speeds recent restores; costs snapshot storage |
| Replication | Continuous disk copy to a target region | ASR | The mechanism behind regional DR |
| Recovery plan | Sequenced failover with groups + scripts | ASR | Turns a pile of VMs into a recoverable app |
| Test failover | Failover into an isolated network, no impact | ASR | The rehearsal that makes DR real |
| Failback / re-protect | Return to primary after it recovers | ASR | Closing the loop post-incident |

Backup vs Site Recovery: choosing the right tool

The single most expensive mistake in this space is reaching for the wrong tool — backing up a workload that needed replication, or replicating one that just needed a longer retention. They are complements, not substitutes. Here is the head-to-head that settles it:

| Dimension | Azure Backup | Azure Site Recovery |
|---|---|---|
| Question it answers | “Can I get my data back?” | “Can I run my app elsewhere?” |
| Protects | VMs, files, SQL, SAP HANA, Azure Files, blobs, disks | VMs (Azure + on-prem VMware/Hyper-V/physical) |
| Unit of value | Recovery point (point-in-time) | Failover (running replica) |
| Typical RPO | Hours to a day (per schedule) | Seconds to minutes (continuous) |
| Typical RTO | Minutes to hours (restore time) | Minutes (boot replica) + cutover |
| Defends against | Delete, corruption, ransomware | Regional / large-scale outage |
| Storage cost model | Vault storage for retained points | Continuous replica storage + cache |
| Granularity | File / disk / DB / whole VM | Whole VM (and its dependencies) |
| History retained | Days to years (LTR) | Hours of recovery points (e.g. 24–72h) |
| Orchestration | Restore (manual/automated) | Recovery plans (sequenced, scripted) |

The decision rule, as a table — match the workload requirement to the tool:

| If the requirement is… | Use | Why |
|---|---|---|
| Undo an accidental delete weeks later | Backup (long retention) | ASR keeps only hours of points |
| Recover one corrupted file | Backup (file-level restore) | ASR is whole-VM only |
| Keep 7 years of monthly snapshots for audit | Backup (yearly retention / LTR) | Compliance archive is Backup’s job |
| Survive a full region outage in minutes | Site Recovery | Continuous replica, fast failover |
| Sequence web→app→DB failover with scripts | Site Recovery (recovery plan) | Orchestration is ASR’s job |
| Protect against ransomware deleting backups | Backup (immutable vault + MUA) | Hardened vault controls |
| Both: undo bad changes AND survive region loss | Both | They cover different failure modes |
| Near-zero downtime, no failover step at all | Neither alone → active-active | DR ≠ HA; see multi-region design |

A blunt rule I give every team: Backup is mandatory for anything that holds data; Site Recovery is for the subset that cannot tolerate a prolonged regional outage. Most estates over-buy ASR (it is the more expensive, more operationally demanding service) and under-invest in testing Backup. Protect everything with Backup; reserve ASR for the tier-1 workloads where minutes of regional downtime translate to real money or real harm.
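As a compact restatement of that rule, the sketch below encodes the decision in Python. It is only an illustration of the reasoning in this section: `choose_protection` and its three boolean inputs are hypothetical, and a real decision also weighs cost, compliance and operational maturity.

```python
from typing import List


def choose_protection(holds_data: bool,
                      must_survive_region_outage: bool,
                      tolerates_failover_step: bool = True) -> List[str]:
    """Map workload requirements to the tools discussed in this section."""
    tools: List[str] = []
    if holds_data:
        # Backup is the baseline for anything that holds data.
        tools.append("Azure Backup")
    if must_survive_region_outage:
        if tolerates_failover_step:
            tools.append("Azure Site Recovery")
        else:
            # DR is not HA: zero-downtime needs an active-active design instead.
            tools.append("active-active multi-region design")
    return tools


print(choose_protection(holds_data=True, must_survive_region_outage=False))
print(choose_protection(holds_data=True, must_survive_region_outage=True))
print(choose_protection(True, True, tolerates_failover_step=False))
```

Note that Backup appears in every answer for a data-bearing workload; Site Recovery and active-active are additions for the tier-1 subset, never replacements.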

A snapshot is not a backup — and three other myths

The most dangerous belief in this space is “we take snapshots, so we’re covered.” A snapshot lives next to the thing it protects, shares its fate, and offers no immutability — it is a convenience, not a recovery strategy. Here is each common myth against the reality:

| Common belief | Reality | Why it bites |
|---|---|---|
| “Disk snapshots are our backup” | Snapshots sit in the same subscription/region and have no immutability | Ransomware/region loss takes them with the data |
| “RAID / ZRS protects our data” | That’s hardware/zone availability, not point-in-time recovery | A DROP TABLE or corruption replicates instantly to every copy |
| “GRS storage means we have DR” | GRS is async replication, not a tested failover capability | No orchestration, no restore test, surprise RTO |
| “Backups are green, so we’re safe” | A green job proves capture, not recoverability | Untested restores fail when you finally need them |
| “ASR replaces Backup” | ASR keeps only hours of points and is whole-VM only | Can’t undo a file delete from three weeks ago |
| “We can change vault redundancy later” | Redundancy is immutable once an item is protected | Stuck on LRS during a regional outage |

The vault: Recovery Services vs Backup vault

Everything Backup and ASR do is anchored to a vault. There are two kinds today, and picking the wrong one wastes a day of rework because you cannot migrate items between them.

Recovery Services vault is the long-standing vault. It backs up Azure VMs, on-prem files/folders (via the MARS agent), SQL Server in Azure VMs, SAP HANA in Azure VMs, and Azure File Shares — and it is the control plane for Azure Site Recovery. If you are protecting VMs or running ASR, this is your vault.

Backup vault is the newer model for workloads the Recovery Services vault never covered: Azure Blobs (operational + vaulted backup), Azure Managed Disks, Azure Database for PostgreSQL and MySQL (including flexible server), and AKS (cluster state and persistent volumes). It has a cleaner data-protection model (native immutability, MUA) and is where Microsoft is investing for cloud-native workloads. It does not do VMs or ASR.

Here is which vault each workload belongs to — get this right before you create anything:

| Workload | Vault type | Backup style | Notes |
|---|---|---|---|
| Azure VM | Recovery Services | Snapshot + vault | VSS app-consistent by default (Windows) |
| On-prem files/folders (MARS) | Recovery Services | Agent → vault | The MARS agent, scheduled |
| SQL Server in Azure VM | Recovery Services | Stream (log/diff/full) | 15-min log RPO possible |
| SAP HANA in Azure VM | Recovery Services | Backint stream | Certified Backint integration |
| Azure File Share | Recovery Services | Snapshot-based | Snapshots managed by the vault |
| Azure Blob | Backup vault | Operational + vaulted | Point-in-time / continuous |
| Azure Managed Disk | Backup vault | Incremental snapshot | Snapshot in a resource group |
| PostgreSQL flexible server | Backup vault | Vaulted | Long-term retention beyond service default |
| AKS | Backup vault | Cluster + PV (via extension) | Backup extension + trusted access |
| Azure VM replication (DR) | Recovery Services | ASR replication | Not “backup”; it’s DR |

Create a Recovery Services vault and immediately set its storage redundancy (you can only change it before the first protected item exists):

# Recovery Services vault with GRS (cross-region) redundancy
az backup vault create \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --location centralindia
# Set redundancy BEFORE protecting anything (GeoRedundant / LocallyRedundant / ZoneRedundant)
az backup vault backup-properties set \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --backup-storage-redundancy GeoRedundant \
  --cross-region-restore-flag true

The Bicep equivalent, with soft delete and cross-region restore baked in:

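param location string = resourceGroup().location

resource rsv 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
  name: 'rsv-prod-cin'
  location: location
  sku: { name: 'RS0', tier: 'Standard' }
  identity: { type: 'SystemAssigned' }
  properties: {}
}

// Security settings (soft delete) live on the backupconfig child resource
resource rsvConfig 'Microsoft.RecoveryServices/vaults/backupconfig@2024-04-01' = {
  parent: rsv
  name: 'vaultconfig'
  properties: {
    enhancedSecurityState: 'Enabled'        // soft delete + security features
    softDeleteFeatureState: 'Enabled'
  }
}

// Redundancy and cross-region restore live on the backupstorageconfig child resource
resource rsvStorage 'Microsoft.RecoveryServices/vaults/backupstorageconfig@2024-04-01' = {
  parent: rsv
  name: 'vaultstorageconfig'
  properties: {
    storageModelType: 'GeoRedundant'
    crossRegionRestoreFlag: true
  }
}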

The vault redundancy choice is the same LRS/ZRS/GRS decision as a storage account, and it is consequential for DR — an LRS vault keeps every copy in one region, so a regional disaster takes the backups with the data. Match redundancy to the threat:

| Redundancy | Copies kept | Survives | When to use | Cost |
|---|---|---|---|---|
| LRS (Locally redundant) | 3 copies, one datacenter | Disk/rack/node failure | Dev/test; data with a separate regional copy | Lowest |
| ZRS (Zone redundant) | 3 copies across AZs | A whole availability zone | Prod where region loss is covered elsewhere | Medium |
| GRS (Geo redundant) | LRS + async copy to paired region | A whole region | Production default for backups | Higher |
| GRS + Cross-Region Restore | GRS, restorable from secondary on demand | Region loss, with self-service restore | Tier-1; restore without waiting for failover | Higher + restore I/O |

A note that has cost people their backups: redundancy is immutable once the vault holds a protected item. If you create an LRS vault, protect 200 VMs, then realise during a regional incident that you needed GRS — you cannot change it without deleting all protected items first. Decide redundancy at creation. For anything production, default to GRS with Cross-Region Restore enabled, which lets you restore in the paired region on your schedule rather than waiting for Microsoft to declare a failover.

Hardening the vault: soft delete, immutability and MUA

This is the section that matters most and that most teams skip. A backup an attacker can delete is not a backup. Three layered controls turn the vault from a convenience into a genuine ransomware defence, and they compound: soft delete buys you a recovery window, immutability removes the delete capability entirely, and MUA ensures no single compromised admin can disable either.

Soft delete retains deleted backup data for 14 days by default, configurable up to 180 days, so that an accidental or malicious “delete this backup item” can be undone. The first 14 days of soft-deleted data are free of charge; retention beyond that is billed as backup storage. With enhanced soft delete you can make it always-on (irreversible: it cannot be turned off, which closes the loophole where an attacker simply disables soft delete first). Check and configure it:

# Inspect soft-delete state and retention
az backup vault backup-properties show \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --query "{soft:softDeleteFeatureState, days:softDeleteRetentionPeriodInDays}" -o table

# Set enhanced soft delete to always-on (irreversible) with 30-day retention
az backup vault backup-properties set \
  --name rsv-prod-cin --resource-group rg-resiliency \
  --soft-delete-feature-state AlwaysOn \
  --soft-delete-duration 30

Immutability makes recovery points un-deletable and un-shortenable before their expiry. You enable it on the vault and then optionally lock it. Unlocked immutability can be turned off (good while you pilot); a locked immutable vault is irreversible — not even Microsoft support can delete a recovery point before it expires. That irreversibility is the entire point: it is the property a ransomware operator cannot defeat with stolen credentials.

resource rsvImmutability 'Microsoft.RecoveryServices/vaults@2024-04-01' = {
  name: 'rsv-prod-cin'
  location: location
  sku: { name: 'RS0', tier: 'Standard' }
  properties: {
    securitySettings: {
      immutabilitySettings: {
        state: 'Locked'   // 'Unlocked' while piloting; 'Locked' is irreversible
      }
    }
  }
}

Multi-User Authorization (MUA) protects operations, not just data: critical actions (disable soft delete, reduce retention, delete a backup item, stop protection with delete) require approval through a Resource Guard held in a different subscription or tenant, governed by a security team the workload admins don’t control. So even a fully compromised backup admin cannot quietly weaken the vault — the destructive op stalls awaiting a second party. Configure the guard’s scope:

# Create the Resource Guard (run by the security team, ideally in a separate subscription or tenant)
# Requires the dataprotection CLI extension
az dataprotection resource-guard create \
  --resource-group rg-security --name rg-guard-prod --location centralindia
# Then associate the vault with this guard and choose which operations it protects (portal or Bicep)

These controls layer; understand what each stops and its escape hatch (or lack of one):

| Control | What it stops | Default | Can an admin disable it? | Recommended prod setting |
|---|---|---|---|---|
| Soft delete (basic) | Permanent loss on accidental/malicious delete | On, 14 days | Yes (then 14-day window still applies) | Enable, ≥ 30 days |
| Enhanced soft delete (Always-on) | Attacker disabling soft delete first | Off | No (irreversible) | Enable, Always-on |
| Immutability (Unlocked) | Deleting/shortening points before expiry | Off | Yes | Enable while piloting |
| Immutability (Locked) | Same, irreversibly | Off | No (irreversible) | Enable + Lock for tier-1 |
| Multi-User Authorization | A single compromised admin weakening the vault | Off | Only with the second approver | Enable for prod vaults |
| RBAC least privilege | Over-broad backup/restore rights | | | Backup Operator, not Owner |

And the ransomware kill-chain, mapped to the control that breaks each step — this is why you layer them:

| Attacker step | Without hardening | Control that breaks it |
|---|---|---|
| Gains admin via phishing | Full control of estate | (out of scope: identity hardening) |
| Encrypts production disks | Data unusable | Backup itself (restore clean points) |
| Deletes backup items | No recovery → pay ransom | Soft delete (retains them) |
| Disables soft delete, then deletes | Soft delete bypassed | Enhanced soft delete (Always-on) |
| Shortens retention to expire points | Points vanish “legitimately” | Immutability (Locked) |
| Uses one stolen admin to do all above | One credential = total loss | MUA / Resource Guard |

These controls have states with one-way doors — the irreversible transitions are deliberate (an attacker can’t undo them either), so understand them before you flip the switch:

| State / transition | Reversible? | Effect | When to choose |
|---|---|---|---|
| Soft delete → Off | Yes | No retention of deleted points | Never on prod |
| Soft delete → On (basic) | Yes | 14–180 day recovery window | Minimum baseline |
| Soft delete → Always-on | No (one-way) | Can’t be disabled by anyone | Production hardening |
| Immutability → Unlocked | Yes | Points protected, can be turned off | While piloting immutability |
| Immutability → Locked | No (one-way) | Points immutable, irreversibly | Tier-1, once confident |
| MUA → Enabled | Yes (via approver) | Destructive ops need 2nd party | All production vaults |

If you take one thing from this article: for any vault holding production backups, enable enhanced soft delete (always-on), immutability (locked) once you are confident, and MUA. Those three turn “we have backups” into “we have backups an attacker cannot erase.”
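One way to keep that recommendation from drifting is to check it mechanically. The Python sketch below audits a vault's security posture against the three controls. The `VaultSecurity` fields and their string values are hypothetical stand-ins chosen for readability, not the ARM property names; in practice you would populate them from the vault's actual configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VaultSecurity:
    soft_delete: str       # "Off", "On", or "AlwaysOn" (illustrative values)
    soft_delete_days: int
    immutability: str      # "Disabled", "Unlocked", or "Locked" (illustrative values)
    mua_enabled: bool


def audit(vault: VaultSecurity) -> List[str]:
    """Return a list of hardening gaps; an empty list means all three controls are set."""
    findings: List[str] = []
    if vault.soft_delete != "AlwaysOn":
        findings.append("soft delete is not always-on: an admin can still disable it")
    if vault.soft_delete_days < 30:
        findings.append("soft-delete retention is below 30 days")
    if vault.immutability == "Disabled":
        findings.append("immutability is off: retention can be shortened")
    elif vault.immutability == "Unlocked":
        findings.append("immutability is unlocked: fine for a pilot, still reversible")
    if not vault.mua_enabled:
        findings.append("no Multi-User Authorization: one credential can weaken the vault")
    return findings


default_vault = VaultSecurity("On", 14, "Disabled", False)
hardened_vault = VaultSecurity("AlwaysOn", 30, "Locked", True)

for finding in audit(default_vault):
    print("GAP:", finding)
print("hardened vault gaps:", audit(hardened_vault))
```

Run against every production vault on a schedule, a check of this shape turns the hardening advice into something that fails loudly when a new vault is created with defaults.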

Azure Backup policy: every knob that sets your RPO and cost

A backup policy is the schedule-plus-retention contract attached to your protected items. It is where you set RPO (how often) and how much history you keep (retention), and it is the single biggest lever on both your recoverability and your bill. The Azure VM policy has these moving parts.

Backup frequency sets your RPO. Standard policy is daily (one recovery point per day); Enhanced policy (for VMs) supports multiple backups per day on an hourly schedule (every 4, 6, 8 or 12 hours), which tightens RPO, and it is the policy type that supports Trusted Launch and larger VMs. SQL-in-VM goes far tighter, with transaction-log backups as frequently as every 15 minutes.

Retention is tiered — daily, weekly, monthly and yearly points kept for different durations, the classic grandfather-father-son scheme. You keep many recent daily points and a few long-lived yearly points, balancing recoverability against storage cost. Azure Backup supports retention up to 99 years for long-term archival.

Instant-restore snapshot retention controls how many days (1–5 under the Standard policy) snapshots are kept in the source region for near-instant restores; older recovery points exist only in vault storage. A longer instant-restore window means faster recent restores but more snapshot storage cost.

Inspect the default policy and protect a VM with it:

# Show the default policy, then protect a VM with it
az backup policy show --vault-name rsv-prod-cin --resource-group rg-resiliency \
  --name DefaultPolicy -o json

# Enable backup for a VM under a named policy
az backup protection enable-for-vm \
  --vault-name rsv-prod-cin --resource-group rg-resiliency \
  --vm $(az vm show -g rg-app -n vm-web-01 --query id -o tsv) \
  --policy-name DefaultPolicy

A custom policy in Bicep — daily at 02:00 UTC, 30 daily / 12 weekly / 12 monthly / 7 yearly points, 5-day instant restore:

resource vmPolicy 'Microsoft.RecoveryServices/vaults/backupPolicies@2024-04-01' = {
  name: '${rsv.name}/pol-vm-prod'
  properties: {
    backupManagementType: 'AzureIaasVM'
    instantRpRetentionRangeInDays: 5
    schedulePolicy: {
      schedulePolicyType: 'SimpleSchedulePolicy'
      scheduleRunFrequency: 'Daily'
      scheduleRunTimes: [ '2026-06-23T02:00:00Z' ]
    }
    retentionPolicy: {
      retentionPolicyType: 'LongTermRetentionPolicy'
      dailySchedule:   { retentionTimes: ['2026-06-23T02:00:00Z'], retentionDuration: { count: 30,  durationType: 'Days'   } }
      weeklySchedule:  { daysOfTheWeek: ['Sunday'], retentionTimes: ['2026-06-23T02:00:00Z'], retentionDuration: { count: 12, durationType: 'Weeks' } }
      monthlySchedule: { retentionScheduleFormatType: 'Weekly', retentionScheduleWeekly: { daysOfTheWeek:['Sunday'], weeksOfTheMonth:['First'] }, retentionTimes:['2026-06-23T02:00:00Z'], retentionDuration: { count: 12, durationType: 'Months' } }
      yearlySchedule:  { retentionScheduleFormatType: 'Weekly', monthsOfYear:['January'], retentionScheduleWeekly: { daysOfTheWeek:['Sunday'], weeksOfTheMonth:['First'] }, retentionTimes:['2026-06-23T02:00:00Z'], retentionDuration: { count: 7, durationType: 'Years' } }
    }
  }
}

Every policy setting, its default, when to change it, and the trade-off — this is the option matrix to keep open while you design:

| Setting | Values | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| Policy type (VM) | Standard / Enhanced | Standard | Need hourly RPO, Trusted Launch, larger VMs | Enhanced costs more; some regions/SKUs only |
| Backup frequency | Daily / Hourly (4–12h) | Daily | Tighter RPO than a day | More points = more storage + snapshot churn |
| Daily retention | 7–9999 days | 30 days | Longer corruption-detection window | Storage grows with retention |
| Weekly retention | 1–5163 weeks | off | Keep weekly checkpoints | More long-lived points |
| Monthly retention | 1–1188 months | off | Compliance / monthly archive | Long-term storage cost |
| Yearly retention | 1–99 years | off | Audit / legal hold | Cheapest per-point but accumulates |
| Instant-restore days | 1–5 | 2 | Faster recent restores | Snapshot storage cost in source region |
| Time zone | Any TZ | UTC | Align backup window to off-peak local | Mis-set window can hit business hours |
| SQL log frequency | 15 min–24 h | n/a (SQL only) | Tight DB RPO | More log backups, more storage |

The retention-tier mental model, with the cost intuition for each tier:

| Tier | Typical retention | Recovers from | Cost intuition |
|---|---|---|---|
| Daily | 7–30 days | Recent accidents, fast corruption | Most points; bulk of recent storage |
| Weekly | 4–12 weeks | Slow corruption noticed weeks later | Fewer points, modest cost |
| Monthly | 6–36 months | Compliance “show me last quarter” | Long-lived, accumulates |
| Yearly | 1–10 (up to 99) years | Audit, legal hold | Cheap per-point but never-ending |

Two real-world rules: set daily retention to at least 14–30 days so corruption noticed a week or two late is still recoverable (7 days is a common, painful choice that ages out the good copy before anyone notices); and keep instant-restore at 5 days for production VMs so the restores you actually run during an incident are fast rather than pulling slowly from vault tiers.
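The first rule is simple arithmetic: the last clean recovery point must still exist when someone finally notices the problem. A minimal bash sketch of that check follows; the function names and the day counts are illustrative assumptions, not values read from a vault.

```bash
#!/usr/bin/env bash
# Sketch: does daily retention outlast the time it takes to notice corruption?
# All day counts below are hypothetical inputs.

# Succeeds (exit 0) when retention covers detection latency plus an optional margin.
retention_covers() {
  local retention_days=$1 detection_days=$2 margin_days=${3:-0}
  (( retention_days >= detection_days + margin_days ))
}

check_retention() {
  local retention_days=$1 detection_days=$2
  if retention_covers "$retention_days" "$detection_days"; then
    echo "OK: ${retention_days}-day retention covers a ${detection_days}-day detection latency"
  else
    echo "GAP: the clean copy ages out $(( detection_days - retention_days )) day(s) before detection"
  fi
}

check_retention 7 9    # GAP: the clean copy ages out 2 day(s) before detection
check_retention 30 9   # OK: 30-day retention covers a 9-day detection latency
```

Passing a third argument to retention_covers adds a safety margin, which is one way to encode "detection latency plus a week" as a design rule.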

Restore is a spectrum — pick the cheapest type that solves the incident

A policy creates recovery points; a restore uses one, and Backup gives you several restore types ranging from “copy one file” to “rebuild the whole machine.” Reaching for “create new VM” when the incident was a single deleted file wastes an hour. Match the restore type to the failure:

| Restore type | What it does | Speed | Use when | Gotcha |
|---|---|---|---|---|
| File-level (item) restore | Mount the RP, copy individual files | Fast (no full restore) | One or few files lost/corrupted | Mounts via iSCSI; unmount after, or it lingers |
| Disk restore | Recover specific managed disks, attach | Medium | One disk corrupted; need data, not OS | You attach + reconfigure the VM |
| Replace existing | Overwrite the source VM’s disks from RP | Medium | Whole VM corrupted, same identity wanted | Original disks swapped; brief downtime |
| Create new VM | Build a fresh VM from the RP | Slowest (full copy) | Source gone; want a clean rebuild | New name/IP; re-plumb networking/DNS |
| Instant restore (snapshot) | Restore from source-region snapshot | Near-instant | Recent point within instant-restore window | Only covers the 1–5 day snapshot window |
| Cross-Region Restore | Restore in the paired region from GRS copy | Medium | Primary region unavailable | GRS vault + CRR flag only; egress cost |

The restore-type decision as a quick lookup:

| If you need to recover… | Use this restore type |
|---|---|
| A handful of files from last week | File-level restore |
| One corrupted data disk, keep the OS | Disk restore |
| The same VM rolled back in place | Replace existing |
| A clean machine because the original is wrecked | Create new VM |
| A recent point, as fast as possible | Instant restore (snapshot) |
| Anything while the primary region is down | Cross-Region Restore |
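The same lookup can live in a runbook so the on-call engineer is pointed at the cheapest restore type first. This is only a sketch: the incident keywords (files, data-disk and so on) are hypothetical labels invented for the example, while the restore types are the ones listed above.

```bash
#!/usr/bin/env bash
# Map an incident keyword (hypothetical labels) to the restore type from the lookup.
pick_restore_type() {
  case "$1" in
    files)       echo "File-level restore" ;;
    data-disk)   echo "Disk restore" ;;
    rollback)    echo "Replace existing" ;;
    rebuild)     echo "Create new VM" ;;
    recent)      echo "Instant restore (snapshot)" ;;
    region-down) echo "Cross-Region Restore" ;;
    *)           echo "unknown incident: $1" >&2; return 1 ;;
  esac
}

pick_restore_type files        # File-level restore
pick_restore_type region-down  # Cross-Region Restore
```

An unknown keyword returns a non-zero status, so a calling script can stop and ask a human instead of guessing.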

Workload-specific backup: SQL, SAP HANA and Azure Files

VMs are the common case, but the Recovery Services vault protects database and file workloads with their own mechanisms and far tighter RPOs than daily VM snapshots. Know the model per workload:

| Workload | Backup mechanism | Tightest RPO | Restore granularity | Key requirement |
|---|---|---|---|---|
| Azure VM | VM snapshot + vault | ~4 h (Enhanced) | File / disk / whole VM | Healthy VM Agent |
| SQL Server in Azure VM | Full + differential + log stream | 15 min (log) | Point-in-time to the second | SQL extension, db_backupoperator |
| SAP HANA in Azure VM | Backint full + log stream | 15 min (log) | Point-in-time | Certified Backint config |
| Azure File Share | Vault-managed snapshots | Per schedule (hourly+) | Individual files / full share | Share registered to vault |
| Azure Blob (Backup vault) | Operational + vaulted | Continuous (operational) | Point-in-time within window | Backup vault, not RSV |
| Azure Managed Disk (Backup vault) | Incremental snapshot | Per schedule | Whole disk | Snapshot resource group |

For SQL-in-VM specifically, the three backup types compose into point-in-time recovery — and missing the log backups is the usual reason “we can only restore to last midnight”:

| SQL backup type | What it captures | Typical frequency | Role in PITR |
|---|---|---|---|
| Full | Entire database | Daily/weekly | The base to restore from |
| Differential | Changes since last full | Daily | Speeds restore, less log replay |
| Transaction log | Every committed transaction | Every 15 min | Rolls forward to any point in time |
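To see how the three types compose, it helps to work out which backups a point-in-time restore actually needs: the latest full at or before the target, the latest differential after that full, then every log backup up to the first one that covers the target. The sketch below does that arithmetic in bash. The schedule (one full at midnight, a differential every six hours, a log backup every 15 minutes) and all names are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical schedule, as minutes since midnight, each list in ascending order.
LOG_INTERVAL=15
FULLS=(0)               # full backup at 00:00
DIFFS=(360 720 1080)    # differentials at 06:00, 12:00, 18:00
LOGS=()
for (( t = LOG_INTERVAL; t < 1440; t += LOG_INTERVAL )); do LOGS+=("$t"); done

# Print the latest value v with floor < v <= limit (prints nothing if there is none).
latest_between() {
  local floor=$1 limit=$2 best="" v
  shift 2
  for v in "$@"; do
    if (( v > floor && v <= limit )); then best=$v; fi
  done
  echo "$best"
}

# Print the backups needed to restore to the given minute of the day.
restore_chain() {
  local target=$1 full diff base t logs=0
  full=$(latest_between -1 "$target" "${FULLS[@]}")
  if [[ -z $full ]]; then echo "no full backup at or before target" >&2; return 1; fi
  diff=$(latest_between "$full" "$target" "${DIFFS[@]}")
  base=${diff:-$full}
  # Logs taken after the base, through the first log that covers the target.
  for t in "${LOGS[@]}"; do
    if (( t > base && t < target + LOG_INTERVAL )); then logs=$(( logs + 1 )); fi
  done
  echo "full@${full} diff@${diff:-none} logs=${logs}"
}

restore_chain 815   # 13:35 -> full@0 diff@720 logs=7
restore_chain 200   # 03:20 -> full@0 diff@none logs=14
```

Empty the LOGS array and the second term of every answer still resolves, but nothing rolls forward past the last differential or full, which is the "last midnight" symptom in miniature.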

Azure Site Recovery: how replication actually works

Site Recovery’s job is to keep a bootable replica of your VM in another region, continuously, so you can fail over fast. Understanding the mechanism removes the mystery from the failure modes later.

For an Azure-to-Azure scenario (the common case), enabling replication on a source VM sets up the Site Recovery Mobility extension inside the VM, which intercepts disk writes and ships them to a cache storage account in the source region; ASR then asynchronously replicates that to target-region managed disks that form the replica. ASR continuously builds crash-consistent recovery points (typically every 5 minutes) and application-consistent recovery points on a configurable cadence (default every hour, using VSS on Windows / pre-post scripts on Linux). The result: an RPO usually in the single-digit minutes, and a menu of recovery points to fail over to. For on-premises sources (VMware, Hyper-V, physical), the architecture adds a configuration/process server appliance that aggregates and forwards replication, but the recovery-point concepts are identical.

ASR supports several source/target scenarios, and the moving parts differ — know which architecture you’re running before you debug it:

| Scenario | Source → target | Extra infrastructure | Typical use |
|---|---|---|---|
| Azure-to-Azure (A2A) | Azure VM → another Azure region | None (Mobility ext + cache SA only) | Regional DR for cloud VMs |
| VMware → Azure | On-prem VMware → Azure region | Configuration + process server appliance | Migrating/DR’ing VMware estates |
| Hyper-V → Azure | On-prem Hyper-V → Azure region | Provider on host (+ VMM if used) | DR for Hyper-V workloads |
| Physical → Azure | Bare-metal server → Azure region | Process server appliance | DR for legacy physical servers |
| Azure-to-Azure (zonal) | VM in one AZ → another AZ | None | Intra-region zone resilience |

Enable replication for an Azure VM from the CLI (modern az extension):

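# Replicate an Azure VM to a target region via an ASR-enabled Recovery Services vault.
# In the site-recovery extension the A2A settings travel inside --provider-details;
# confirm the key names for your installed version with:
#   az site-recovery protected-item create --help
VM_ID=$(az vm show -g rg-app -n vm-web-01 --query id -o tsv)
DR_RG_ID=$(az group show -n rg-dr-southindia --query id -o tsv)
az site-recovery protected-item create \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --fabric-name asr-cin --protection-container pc-cin \
  --name vm-web-01 \
  --policy-id "<replication-policy-id>" \
  --provider-details "{a2a:{fabric-object-id:$VM_ID,recovery-container-id:<target-container-id>,recovery-resource-group-id:$DR_RG_ID,vm-managed-disks:[{disk-id:<os-disk-id>,primary-staging-azure-storage-account-id:<cache-storage-account-id>,recovery-resource-group-id:$DR_RG_ID}]}}"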

The replication policy itself controls the consistency cadence and how many points you keep:

resource asrPolicy 'Microsoft.RecoveryServices/vaults/replicationPolicies@2024-04-01' = {
  name: '${rsv.name}/pol-asr-a2a'
  properties: {
    providerSpecificInput: {
      instanceType: 'A2A'
      recoveryPointHistory: 1440           // minutes of recovery points retained (24h)
      appConsistentFrequencyInMinutes: 60  // app-consistent point cadence
      crashConsistentFrequencyInMinutes: 5 // crash-consistent point cadence
      multiVmSyncStatus: 'Enable'
    }
  }
}

The replication-policy knobs and their trade-offs:

| Setting | Values | Default | When to change | Trade-off |
|---|---|---|---|---|
| Recovery-point retention | 0–72 hours (A2A) | 24 h | More points to choose from | More cache + storage |
| App-consistent frequency | 1 min–12 h (or off) | 60 min | Tighter clean-restore granularity | VSS overhead in the guest |
| Crash-consistent frequency | 5 min (fixed for A2A) | 5 min | n/a (not tunable) | n/a |
| Multi-VM consistency | On / Off | Off | App spans VMs needing same instant | Groups VMs; shared replication group |
| Target region | Any paired/allowed region | paired | Compliance / latency | Egress + capacity in target |
| Target disk type | Standard/Premium SSD | match source | Cost vs failover IOPS | Cheaper disk = slower failover perf |

The three consistency levels, side by side — know which one your restore needs:

| Consistency level | How it’s captured | Data integrity on restore | Best for | Cost/overhead |
|---|---|---|---|---|
| Crash-consistent | Disk state at an instant (no quiesce) | Boots; DB may run crash recovery, lose last writes | Stateless tiers; tight RPO | Lowest (every 5 min) |
| File-system-consistent | OS cache flushed (Linux) | Files coherent on disk | General Linux servers | Low |
| Application-consistent | VSS / scripts quiesce the app | Transactionally clean moment | Databases, stateful apps | Higher (VSS pauses writers) |

The honest RPO/RTO you can promise — and what each tier actually costs in effort and money:

| Approach | Realistic RPO | Realistic RTO | Cost | When it’s the right call |
|---|---|---|---|---|
| Daily Backup only | Up to 24 h | Hours (restore time) | Low | Non-critical; data, not uptime |
| Hourly Backup (Enhanced) | ~4–12 h | Hours | Low-medium | Important data, lax uptime |
| ASR replication | Minutes | Minutes + cutover | Medium | Tier-1 needing fast regional DR |
| ASR + automated recovery plan | Minutes | Tighter, repeatable | Medium-high | Multi-tier apps, audited RTO |
| Active-active multi-region | ~Zero | ~Zero (no failover) | Highest | Can’t tolerate any failover gap |

A blunt truth about ASR RTO: the boot is fast (minutes), but your real RTO includes DNS propagation, identity/dependency cutover, and any manual verification. Teams that promise “15-minute RTO” because the VM boots in 15 minutes get a nasty surprise when DNS TTLs and a forgotten connection-string change add an hour. Measure RTO end-to-end in a test failover, not from the boot time.
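The gap between "boot time" and "end-to-end RTO" is easiest to see as a sum of phases. The sketch below is only arithmetic in bash; every figure in it is a hypothetical placeholder, to be replaced with timings you measure in your own test failover.

```bash
#!/usr/bin/env bash
# Sum the phases of a failover, in minutes. All figures are hypothetical
# placeholders; replace them with timings measured in a test failover.
rto_total() {
  local total=0 phase
  for phase in "$@"; do total=$(( total + phase )); done
  echo "$total"
}

boot=15            # replica VMs powered on and reachable
dns_ttl=60         # worst case for clients to pick up the new DNS record
config_cutover=20  # connection strings, identity, dependent endpoints
verification=45    # smoke tests and sign-off

end_to_end=$(rto_total "$boot" "$dns_ttl" "$config_cutover" "$verification")
echo "boot only: ${boot} min, end to end: ${end_to_end} min"
# prints: boot only: 15 min, end to end: 140 min
```

Keeping the phases as separate named values also shows where to spend effort: in this made-up breakdown the DNS TTL alone is four times the boot time.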

Recovery plans and orchestrated failover

A pile of replicated VMs is not a recoverable application — the database must come up before the app tier, the app tier before the web tier, and somewhere in there a script rewrites a connection string and updates DNS. A recovery plan encodes that: an ordered set of groups of VMs, with pre/post actions (manual steps or Azure Automation runbooks) between groups, so a failover executes as a single, repeatable, auditable operation instead of a frantic improvisation.

A typical three-tier plan:

| Group | Contents | Pre-action | Post-action |
|---|---|---|---|
| Group 1 | Database VMs | (none) | Runbook: verify DB online, open firewall |
| Group 2 | App-tier VMs | Manual: confirm DB healthy | Runbook: update app config / conn string |
| Group 3 | Web-tier VMs | (none) | Runbook: update Traffic Manager / DNS |
| Post-plan | (none) | (none) | Runbook: smoke test, notify on-call |

Trigger a test failover, an unplanned failover, and the final commit from the CLI:

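# Subcommand and flag names follow the site-recovery CLI extension; confirm them for
# your installed version with: az site-recovery recovery-plan --help

# TEST failover into an isolated network (no production impact): your rehearsal
az site-recovery recovery-plan test-failover \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --recovery-plan-name rp-shop-prod \
  --failover-direction PrimaryToRecovery \
  --network-type VmNetworkAsInput \
  --network-id $(az network vnet show -g rg-dr-southindia -n vnet-dr-isolated --query id -o tsv) \
  --provider-specific-details '[{a2a:{recovery-point-type:Latest}}]'

# UNPLANNED failover (the real disaster; the source may be gone)
az site-recovery recovery-plan unplanned-failover \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --recovery-plan-name rp-shop-prod \
  --failover-direction PrimaryToRecovery \
  --source-site-operations NotRequired \
  --provider-specific-details '[{a2a:{recovery-point-type:Latest}}]'

# COMMIT once you've verified the failed-over app
az site-recovery recovery-plan failover-commit \
  --resource-group rg-resiliency --vault-name rsv-prod-cin \
  --recovery-plan-name rp-shop-prod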

The failover types, when to use each, and the data-loss implication:

| Failover type | Source state | Data loss | When to use | Networking |
|---|---|---|---|---|
| Test failover | Source still running | None (isolated) | Quarterly rehearsal, audit evidence | Isolated VNet, no prod impact |
| Planned failover | Source healthy, controlled | Zero (clean shutdown first) | Migration, scheduled DR drill | Production target |
| Unplanned failover | Source degraded/gone | From latest available point (RPO) | Real disaster | Production target |
| Failback (re-protect) | Primary recovered | Minimal (reverse-replicate first) | Return to primary post-incident | Reverse direction |

The full failover lifecycle — the order is the discipline:

| Phase | What happens | You do | Common miss |
|---|---|---|---|
| Replicate | Continuous disk copy to target | Monitor RPO health | Ignoring replication-lag alerts |
| Test failover | Replica boots in isolated net | Verify app, then clean up | Forgetting cleanup → orphan cost |
| Unplanned failover | Replica boots in production | Run recovery plan, verify | No DNS/identity cutover plan |
| Commit | Failover finalised, points freed | Confirm before committing | Committing before verifying |
| Re-protect | Reverse replication primary↔secondary | Enable once primary returns | Skipping → no way back |
| Failback | Return workload to primary | Planned failover in reverse | Never testing failback |

The non-negotiable habit: run a test failover every quarter. It is the only thing that proves the plan works, surfaces drift (a new VM not in the plan, a script that rots, an IP that changed), and gives auditors evidence. A test failover into an isolated network has zero production impact — there is no excuse not to. Then clean up the test (a single action) so you are not paying for orphaned test VMs.
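One kind of drift, a VM that exists in the estate but was never added to the plan, can be caught between rehearsals with a plain set difference. The following is a minimal sketch: both inventories are hypothetical hard-coded lists, and in practice they would come from your VM listing and from the recovery plan's group membership.

```bash
#!/usr/bin/env bash
# Print names present in the first newline-separated list but absent from the second.
missing_from_plan() {
  comm -23 <(sort -u <<<"$1") <(sort -u <<<"$2")
}

# Hypothetical inventories for illustration only.
estate=$'vm-web-01\nvm-web-02\nvm-app-01\nvm-app-02\nvm-sql-01\nvm-sql-02\nvm-cache-01'
plan=$'vm-web-01\nvm-web-02\nvm-app-01\nvm-app-02\nvm-sql-01\nvm-sql-02'

missing_from_plan "$estate" "$plan"   # prints vm-cache-01
```

An empty result means every VM in the estate list is covered by the plan list; anything printed is a candidate for the next plan update.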

Architecture at a glance

The diagram traces both protection paths from the same source workload, so you can see how Backup and Site Recovery operate in parallel on the very same VMs. Read it left to right. On the far left sits the source estate in the primary region — your web, app and database VMs, each with an in-guest agent (the Backup VM extension and the ASR Mobility extension) doing two jobs at once. The Backup path (top) snapshots each VM and writes application-consistent recovery points into a Recovery Services vault, where the hardening lives: soft delete, immutability (locked) and Multi-User Authorization are the controls that keep those points alive even if an attacker with admin rights tries to delete them. The vault’s GRS redundancy with Cross-Region Restore means a second copy already sits in the paired region, restorable on your schedule. The Site Recovery path (bottom) streams disk writes through a source-region cache storage account into continuously replicated target-region managed disks, building crash- and application-consistent recovery points minutes apart.

Follow the flows to the right and the two paths converge on recovery. From Backup you choose a restore type — file, disk, or whole VM — to undo a deletion or roll back corruption. From Site Recovery you trigger a failover that a recovery plan orchestrates: database group first, app group next, web group last, with runbooks rewriting connection strings and updating DNS / Traffic Manager in between, landing the running application in the secondary region. The numbered badges mark the five places this architecture most often fails — an unhealthy guest agent that silently breaks backups, a vault left without immutability that ransomware erases, replication lag that quietly breaches your RPO, a failover that stalls because the recovery plan was never tested, and the cutover plumbing (DNS, identity) that leaves the app running but unreachable. The legend narrates each as symptom, the command to confirm it, and the fix — the same method as every incident: localise the failure to one hop, confirm with the named tool, apply the fix.

Figure: Azure resilience architecture. A source VM estate in the primary region is protected by two parallel paths (Azure Backup into a hardened Recovery Services vault, and Azure Site Recovery through a source-region cache storage account into target-region managed disks) that converge on recovery in the secondary region, with five numbered failure points marked.

Real-world scenario

Northwind Financial runs a customer loan-origination platform on Azure: a three-tier app — two web VMs, two app VMs, and a clustered SQL Server pair — on Standard D-series instances in Central India, fronted by Application Gateway, serving roughly 3,000 loan applications a day. Compliance requires a 4-hour RTO and a 15-minute RPO for the loan database, plus seven-year retention of monthly backups for audit. The platform team is five engineers; the resilience budget is about ₹85,000/month.

Their original setup looked responsible and wasn’t. Azure Backup ran daily VM backups with 7-day retention into a Recovery Services vault — LRS, in the same region. No Site Recovery. No immutability. They had never run a restore test. On paper: “we have backups.” In reality, three latent failures stacked: 7-day retention couldn’t satisfy a 7-year audit or a corruption noticed late; an LRS vault would die with the region in a regional outage; and without ASR there was no way to meet a 4-hour RTO if Central India went dark.

The wake-up call was a near-miss, not a disaster. A botched schema migration corrupted a loan-status column, and the bad data wasn’t noticed for nine days — by which point the only clean copy had aged out of the 7-day retention. They recovered by manually reconstructing the column from downstream audit logs over a weekend. The post-incident review was blunt: the backups had worked perfectly and were useless, because the retention window was shorter than their detection latency. That single sentence reset the whole programme.

The rebuild had three parts. First, the vault. They recreated it as GRS with Cross-Region Restore, enabled enhanced soft delete (always-on, 30 days), and turned on immutability (locked) plus MUA with the security team holding the Resource Guard, so a compromised platform admin could no longer weaken backups. Second, the policy. They moved to tiered retention (30 daily, 12 weekly, 36 monthly, 7 yearly points) and added SQL transaction-log backups every 15 minutes to hit the 15-minute database RPO, with a 5-day instant-restore window for fast recent restores. Third, Site Recovery. They enabled ASR replication of all six VMs to South India, authored a recovery plan sequencing SQL → app → web with runbooks to rewrite the app’s connection string and update Traffic Manager DNS, and, as the crucial habit, scheduled a quarterly test failover into an isolated VNet.

The first test failover was humbling and exactly the point: the database came up, but the app tier failed because the runbook still pointed at the old connection string, and the measured end-to-end RTO was 5 hours 40 minutes — over their 4-hour target, almost entirely DNS TTL (set to 1 hour) and manual verification. They fixed the runbook, dropped the DNS TTL to 60 seconds, and automated the smoke test. The next quarterly test measured 2 hours 50 minutes, comfortably inside RTO, with the database at a 12-minute RPO. Eight months later, when Central India had a genuine storage-tier incident, they failed over for real in 2 hours 35 minutes with 9 minutes of data loss — inside both targets, no heroics, because the plan had been rehearsed four times. The lesson on the wall: “A green backup job is a hypothesis. A passed test restore is a capability. Only one of them pays out at 2 a.m.”

The programme as a before/after, because the gaps are the lesson:

| Aspect | Before (looked safe) | After (was safe) | Why it mattered |
|---|---|---|---|
| Vault redundancy | LRS (same region) | GRS + Cross-Region Restore | Survives a region loss |
| Soft delete / immutability | Off | Enhanced (always-on) + locked | Survives ransomware deleting backups |
| Daily retention | 7 days | 30 days | Corruption noticed day 9 still recoverable |
| Long-term retention | None | 12 wk / 36 mo / 7 yr | Meets the 7-year audit |
| Database RPO | 24 h (daily) | 15 min (log backups) | Meets compliance RPO |
| Regional DR | None | ASR replica to South India | Meets the 4-hour RTO |
| Recovery orchestration | None | Recovery plan + runbooks | Repeatable, auditable failover |
| Tested? | Never | Quarterly test failover | Found the broken runbook before the disaster |
| Measured RTO | Unknown (hope) | 2h35m (real incident) | A number, not a prayer |

Advantages and disadvantages

The Backup-plus-Site-Recovery model gives you broad, managed protection without a secondary datacenter to run — but it is not free, and it decays without discipline. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
|---|---|
| Backup gives granular recovery (file → disk → whole VM) for delete and corruption | Backup alone can’t meet a tight RTO for a region loss; restore takes real time |
| ASR gives whole-workload failover to another region in minutes | ASR is the more expensive, more operationally demanding service; over-buying it is common |
| Hardened vault (soft delete, immutability, MUA) defeats ransomware that deletes backups | Defaults are not hardened: Cross-Region Restore, immutability and MUA are all off until you turn the knobs |
| Recovery plans turn a pile of VMs into a sequenced, auditable failover | A recovery plan is a hypothesis until tested; plans decay silently (drift) |
| No secondary infrastructure to run/patch until you actually fail over | You pay continuously for replica storage and protected instances even when idle |
| Long-term retention (up to 99 years) covers compliance archival cheaply per-point | Storage cost accumulates relentlessly with retention; easy to over-retain |
| Cross-region restore lets you recover in the paired region on your schedule | Cross-region restore I/O and egress add cost; only on GRS vaults |
| Application-consistent points (VSS) give clean database restores | App-consistency adds in-guest overhead; misconfigured scripts give only crash-consistent |

The model is right for the overwhelming majority of estates: protect everything with Backup (it is cheap insurance against the most common loss — accidental deletion and corruption), and layer ASR onto the tier-1 subset that cannot tolerate regional downtime. It bites hardest on teams who confuse Backup with DR (and discover at the incident that restoring 50 VMs serially blows their RTO), who deploy with default redundancy and no immutability (and lose backups to ransomware), and who set up DR and never test it (and find the recovery plan broken when it matters). Every disadvantage is manageable — but only if you know it exists, which is the entire point of doing this deliberately.

Hands-on lab

Protect a VM with Azure Backup, harden the vault, take an on-demand backup, and run a file-level restore — all on a single small VM you delete at the end. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-backup-lab
LOC=centralindia
VAULT=rsv-lab-$RANDOM
VM=vm-lab-01
az group create -n $RG -l $LOC -o table

Step 2 — Create a small Linux VM to protect.

az vm create -g $RG -n $VM --image Ubuntu2204 --size Standard_B1s \
  --admin-username azureuser --generate-ssh-keys --public-ip-sku Standard -o table
# Drop a file we'll later "lose" and restore
az vm run-command invoke -g $RG -n $VM --command-id RunShellScript \
  --scripts "echo 'critical-loan-data-v1' | sudo tee /home/azureuser/important.txt"

Step 3 — Create a Recovery Services vault and harden it.

az backup vault create -n $VAULT -g $RG -l $LOC -o table

# Enhanced soft delete (always-on, 14 days) — irreversible hardening
az backup vault backup-properties set -n $VAULT -g $RG \
  --soft-delete-feature-state AlwaysON --soft-delete-duration 14

# Confirm the hardening took
az backup vault backup-properties show -n $VAULT -g $RG \
  --query "{soft:softDeleteFeatureState, days:softDeleteRetentionPeriodInDays}" -o table

Expected: soft = AlwaysON, days = 14.

Step 4 — Enable backup on the VM with the default policy.

az backup protection enable-for-vm -v $VAULT -g $RG \
  --vm $(az vm show -g $RG -n $VM --query id -o tsv) \
  --policy-name DefaultPolicy -o table

Step 5 — Trigger an on-demand backup (don’t wait for the schedule).

CONTAINER=$(az backup container list -v $VAULT -g $RG \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)
ITEM=$(az backup item list -v $VAULT -g $RG \
  --backup-management-type AzureIaasVM --query "[0].name" -o tsv)

az backup protection backup-now -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" \
  --retain-until $(date -d "+30 days" +%d-%m-%Y) -o table

Watch the job until it completes (this takes several minutes — the first backup copies the full disk):

az backup job list -v $VAULT -g $RG --query "[0].{op:properties.operation, status:properties.status}" -o table

Expected: eventually status = Completed.
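Re-running the job query by hand gets tedious, so a small polling loop helps. The sketch below assumes a generic helper (poll_until, a name invented here) that re-runs any status command until it prints a terminal state; latest_job_status wraps the same job listing used above and relies on the lab's $VAULT and $RG variables. Note that it watches whichever job is newest in the list, which may not be the backup job if other operations are running.

```bash
#!/usr/bin/env bash
# poll_until <interval_seconds> <max_attempts> <command...>
# Re-runs the command until it prints a terminal job status or attempts run out.
poll_until() {
  local interval=$1 attempts=$2 status i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    status=$("$@")
    case "$status" in
      Completed*)       echo "$status"; return 0 ;;   # Completed or CompletedWithWarnings
      Failed|Cancelled) echo "$status"; return 1 ;;
    esac
    sleep "$interval"
  done
  echo "TimedOut"
  return 2
}

# Status of the newest job, using the same listing as the lab command above.
latest_job_status() {
  az backup job list -v "$VAULT" -g "$RG" --query "[0].properties.status" -o tsv
}

# Usage (not run here): poll_until 30 60 latest_job_status
```

With a 30-second interval and 60 attempts the loop gives up after roughly half an hour, which is a generous ceiling for a first backup of a small lab VM.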

Step 6 — List recovery points and start a file-level restore.

RP=$(az backup recoverypoint list -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" \
  --query "[0].name" -o tsv)

# Mount the recovery point as an iSCSI target with a download script (file recovery)
az backup restore files mount-rp -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" --rp-name "$RP" -o json
# The output gives a script + password; running it mounts the recovery point's disks
# so you can copy /home/azureuser/important.txt back. Unmount when done:
az backup restore files unmount-rp -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" --rp-name "$RP"

Validation checklist. You created a hardened vault (enhanced soft delete, always-on), protected a VM, took an on-demand recovery point (file-system-consistent on this Linux VM, because no pre/post scripts are configured), and exercised file-level restore by mounting the recovery point, without ever needing the original VM intact. That is the whole Backup loop. What each step proves:

| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 3 | Enhanced soft delete always-on | The vault resists “delete the backups” | Ransomware hardening |
| 4 | Enable backup with a policy | Protection is policy-driven, not ad-hoc | Onboarding every prod VM |
| 5 | On-demand backup | You can force a point before risky change | Pre-deployment safety snapshot |
| 6 | Mount RP for file restore | Granular recovery without a full VM rebuild | The 90% case: “restore one file” |

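Cleanup (avoid lingering vault/VM charges). You must stop protection before deleting the resource group, or the vault blocks deletion. Because Step 3 set soft delete to always-on, the deleted backup data then stays in the soft-deleted state for 14 days; expect the vault to refuse deletion until that window lapses, so you may need to re-run the group delete afterwards: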

# Stop protection AND delete backup data (lab only — never --delete-backup-data in prod casually)
az backup protection disable -v $VAULT -g $RG \
  --container-name "$CONTAINER" --item-name "$ITEM" \
  --delete-backup-data true --yes

az group delete -n $RG --yes --no-wait

Cost note. A B1s VM is a few rupees per hour and a single recovery point is a tiny storage charge; an hour of this lab is well under ₹50. Deleting the resource group (after disabling protection) stops everything.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you read mid-incident, then the entries that bite hardest with the full confirm-command detail.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | VM backup job fails with UserErrorGuestAgentStatusUnavailable | VM Agent / extension not running or unhealthy | az vm get-instance-view -g RG -n VM --query "instanceView.vmAgent.statuses" | Restart/repair VM Agent; reinstall the backup extension |
| 2 | Backup job stuck “In Progress” for hours | Snapshot/VSS hang, or another op holding the VM | Backup jobs blade; az backup job list; VM activity log | Cancel job; check VSS writers; retry; reboot if VSS wedged |
| 3 | Need to restore but the recovery point you want isn’t there | Retention too short; point aged out | az backup recoverypoint list (oldest point’s date) | Increase retention now; recover from CRR/secondary if any |
| 4 | “Cannot change vault redundancy” when moving LRS→GRS | Vault already holds a protected item | az backup item list shows items | Decide redundancy at creation; new vault + re-protect |
| 5 | Deleted a backup item by mistake: is it gone? | Soft delete window (if enabled) still holds it | Backup items → “soft-deleted” filter | az backup protection undelete; re-enable protection |
| 6 | ASR replication health “Critical”, RPO climbing | Replication lag: network/throughput/cache full | Site Recovery → Replicated items → RPO; cache SA metrics | Increase cache SA, check egress/throughput, throttle source I/O |
| 7 | Test failover boots but app unreachable | DNS/identity/dependency cutover not done | App in isolated VNet; check name resolution + app config | Add DNS/identity to recovery-plan runbooks; verify in test |
| 8 | Failover stuck / fails at a recovery-plan group | Stale runbook, missing target resource, dependency miss | Recovery plan job; Automation runbook error | Fix runbook; ensure target RG/VNet/NSG exist; re-run group |
| 9 | Restored VM boots but the database needs recovery | Only crash-consistent points captured | Recovery points show consistency type | Enable app-consistent (VSS / scripts); pick app-consistent point |
| 10 | SQL-in-VM backup fails / no log backups | SQL extension/AAD config wrong; perms missing | Vault → Backup items (SQL); SQL ext logs | Re-register SQL, grant db_backupoperator, fix extension |
| 11 | Backup costs ballooning month over month | Over-retention + high churn + instant-restore days | Cost analysis by vault; protected-instance count | Trim retention tiers; review instant-restore days; archive tier |
| 12 | Can’t delete a recovery point / shorten retention | Immutability locked (working as intended) | Vault security settings show “Locked” | Wait for natural expiry; (locked = irreversible by design) |
| 13 | New VMs silently unprotected | No backup-coverage policy/automation | Backup center → protectable items; Azure Policy compliance | Azure Policy to auto-enable backup; Backup center coverage |
| 14 | Failback not possible after primary returns | Never re-protected after failover | ASR → replicated items show no reverse replication | Enable re-protect (reverse replication), then planned failback |

The expanded form, with the full reasoning for the entries that bite hardest:

1. VM backup fails with UserErrorGuestAgentStatusUnavailable (or ExtensionStuckInDeletionOrTransitioning). Root cause: The Azure VM Agent is stopped, outdated, or the backup (VM snapshot) extension is unhealthy — Backup coordinates the snapshot through the agent, so a dead agent means no application-consistent point. Confirm:

az vm get-instance-view -g rg-app -n vm-web-01 \
  --query "instanceView.vmAgent.{status:statuses[0].displayStatus, version:vmAgentVersion}" -o table

Fix: Ensure the VM Agent is running and current (restart it inside the guest), then let Backup re-deploy its extension (or remove and re-enable protection). On Linux, confirm the walinuxagent service is up; on Windows, the WindowsAzureGuestAgent service.

3. The recovery point you need isn’t there — retention was too short. Root cause: The classic Northwind failure — corruption noticed after the point aged out of retention. Backup did its job; the retention window was shorter than your detection latency. Confirm:

# Oldest available recovery point — if it's newer than the corruption, you're out of luck
az backup recoverypoint list -v rsv-prod-cin -g rg-resiliency \
  --container-name "$C" --item-name "$I" \
  --query "sort_by([].{time:properties.recoveryPointTime}, &time)[0]" -o table

Fix: There is no fix after the fact except recovering from a cross-region/secondary copy if one exists. The real fix is preventive: set daily retention to ≥ 14–30 days so late-noticed corruption is still recoverable. Treat 7-day retention as dev-only.

6. ASR replication health goes Critical and RPO climbs above target. Root cause: Replication lag — the source is generating writes faster than they replicate, usually because the cache storage account is throttled/full, egress bandwidth is constrained, or a burst of disk churn overwhelmed the pipe. Confirm: Site Recovery → Replicated items shows per-VM RPO and health; the cache storage account’s metrics show throttling. Via KQL over the vault’s ASR logs:

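// ASR replication health / RPO breaches in the last 6 hours
// Table and column names depend on how the vault's diagnostic settings are configured
// (resource-specific tables vs. AzureDiagnostics); check your workspace schema first.
ASRReplicationStats
| where TimeGenerated > ago(6h)
| where RpoInSeconds > 900   // your RPO target in seconds (15 min)
| project TimeGenerated, ReplicationProtectedItemName, RpoInSeconds, ReplicationHealth
| order by TimeGenerated desc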

Fix: Move the cache to a higher-performance storage account, ensure the source region has egress headroom, reduce a runaway write workload, and verify the replication policy’s retention isn’t oversized for the available throughput. RPO health is a leading indicator — alert on it before a real failover needs a fresh point.
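
A single spike above target is usually noise; a run of consecutive breaches is the signal worth paging on. A minimal Python sketch of that distinction, assuming you have exported (item, RPO) samples in chronological order. The function name and the three-sample threshold are illustrative assumptions; only the 900-second target comes from the query above:

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def sustained_rpo_breaches(
    samples: Iterable[Tuple[str, int]],
    target_seconds: int = 900,
    min_consecutive: int = 3,  # assumption: tune to your sampling interval
) -> List[str]:
    """Items that exceeded the RPO target for `min_consecutive` samples in a row.

    `samples` must be in chronological order; a sample at or below the
    target resets that item's streak.
    """
    streak = defaultdict(int)
    breached = set()
    for item, rpo_seconds in samples:
        if rpo_seconds > target_seconds:
            streak[item] += 1
            if streak[item] >= min_consecutive:
                breached.add(item)
        else:
            streak[item] = 0
    return sorted(breached)


samples = [
    ("vm-db-01", 1000), ("vm-web-01", 1000),
    ("vm-db-01", 1200), ("vm-web-01", 300),
    ("vm-db-01", 1500), ("vm-web-01", 1000),
]
print(sustained_rpo_breaches(samples))  # ['vm-db-01']
```

The web VM spikes twice but recovers in between, so only the database VM is reported.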

7. Test failover boots the VMs but the application is unreachable. Root cause: The VMs are up in the isolated network, but DNS, identity, and dependency cutover were never part of the plan — the app can’t resolve names, can’t authenticate, or points at a primary-region dependency that isn’t in the test bubble. Confirm: In the isolated VNet, check name resolution from a failed-over VM and inspect the app’s configuration for primary-region hostnames; the application logs will show connection failures to unresolvable or unreachable endpoints. Fix: Add DNS updates, identity/endpoint rewrites, and dependency stand-ins to the recovery-plan runbooks, and verify them in the test failover — which is exactly what test failovers are for. An app that boots but can’t serve is the most common “passed the failover, failed the recovery” trap.

9. The restored VM boots but the database runs crash recovery or has lost recent transactions. Root cause: You restored from a crash-consistent recovery point, not an application-consistent one, so the DB’s in-flight writes were torn at capture. Confirm: The recovery-point list shows each point’s consistency type; if your latest is crash-consistent, that’s why. Fix: Ensure application-consistent points are being created (VSS on Windows is the default for VM backup; on Linux, configure pre/post scripts), and when restoring a database, choose an application-consistent recovery point even if it is slightly older than the latest crash-consistent one. Clean beats recent for stateful data.
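
The rule of preferring a clean point over a recent one can be written down as a selection function. A sketch in Python, assuming you have already listed recovery points with a timestamp and a consistency label; the `RecoveryPoint` type and the "app"/"crash" labels are hypothetical stand-ins, not the values Azure Backup returns:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    time: datetime
    consistency: str  # hypothetical labels: "app" or "crash"


def pick_for_database(points: Iterable[RecoveryPoint]) -> Optional[RecoveryPoint]:
    """Newest application-consistent point, ignoring newer crash-consistent ones."""
    clean = [p for p in points if p.consistency == "app"]
    return max(clean, key=lambda p: p.time, default=None)


points = [
    RecoveryPoint(datetime(2024, 3, 1, 1, 0), "app"),
    RecoveryPoint(datetime(2024, 3, 1, 13, 0), "crash"),
]
chosen = pick_for_database(points)
print(chosen.time.isoformat(), chosen.consistency)
```

Returning `None` when no application-consistent point exists is deliberate in this sketch: it forces a conscious decision to accept a crash-consistent restore rather than falling back to one silently.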

13. New VMs are silently unprotected — coverage drift. Root cause: Backup is enabled per-item, and without automation, new VMs ship without protection. The estate’s coverage silently decays as it grows. Confirm: Backup center → protectable items lists VMs with no backup; Azure Policy compliance shows the gap. Fix: Use an Azure Policy that auto-enables backup on new VMs (built-in policies exist for “Configure backup on VMs”), and review Backup center coverage as a routine. Coverage is a governance problem; solve it with policy, not vigilance.
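
Until the policy is in place, coverage drift can be measured as a plain set difference between the VMs that exist and the VMs that have a backup item. A minimal Python sketch, assuming both lists of resource IDs have already been exported by whatever means you prefer; the function name and the shortened IDs are hypothetical:

```python
from typing import Iterable, List


def unprotected_vms(all_vm_ids: Iterable[str], protected_vm_ids: Iterable[str]) -> List[str]:
    """VM resource IDs that have no backup item; compared case-insensitively."""
    protected = {vm_id.lower() for vm_id in protected_vm_ids}
    return sorted(vm_id for vm_id in all_vm_ids if vm_id.lower() not in protected)


# Hypothetical, shortened IDs for illustration only.
all_vms = ["/rg-app/vm-web-01", "/rg-app/vm-web-02", "/rg-app/vm-db-01"]
protected = ["/RG-APP/VM-WEB-01", "/rg-app/vm-db-01"]

print(unprotected_vms(all_vms, protected))  # ['/rg-app/vm-web-02']
```

The comparison is lowercased because Azure resource IDs are case-insensitive and different tools return them with different casing.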

Best practices

The alerts worth wiring before the next incident — the leading indicators:

| Alert on | Signal | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| Backup job failure | Failed backup jobs | ≥ 1 failure | Catches agent/extension breakage before a restore needs it |
| ASR RPO health | RpoInSeconds per item | > your RPO target | Warns RPO is breaching before a real failover |
| Replication health | ASR item health = Critical | Any item Critical | Lag/throughput problem surfacing early |
| Backup coverage | Unprotected protectable VMs | ≥ 1 | Coverage drift as the estate grows |
| Soft-deleted items | Items in soft-delete state | ≥ 1 unexpected | Possible malicious/accidental deletion in flight |
| Vault security drift | Soft delete / immutability off | Any disabled on prod | Someone weakened the hardening |

Security notes

The security controls that also improve resilience — secure and recoverable pull together here:

| Control | Mechanism | Secures against | Also improves |
|---|---|---|---|
| Enhanced soft delete (always-on) | Vault data-protection setting | Attacker deleting backups | Accidental-delete recovery |
| Immutability (locked) | Vault security setting | Shortening/deleting points before expiry | Compliance retention guarantees |
| Multi-User Authorization | Resource Guard (separate tenant) | A single compromised admin | Change-control discipline |
| Least-privilege RBAC | Backup Operator/Reader roles | Over-broad backup/restore rights | Cleaner operational ownership |
| Private endpoints | Vault private link | Public exposure of backup traffic | Network reliability/isolation |
| CMK + guarded Key Vault | Customer-managed encryption keys | Regulatory key-control gaps | Defined key lifecycle |
| Diagnostic logging + alerts | Vault logs → Log Analytics | Silent malicious operations | Faster incident detection |

Cost & sizing

The bill has a few dominant drivers, and they interact with every design choice above.

Free and cheap angles worth knowing:

| Item | Cost reality | Cheap lever |
|---|---|---|
| Soft delete (within window) | Free during the soft-delete retention period | Always enable it; no cost to safety |
| First 5 GB / month per region (Azure Files snapshot) | Often within free allowance for small shares | Keep small-share snapshots lean |
| LRS vs GRS storage | GRS ~2× LRS storage price | Use LRS for dev; GRS only where region loss matters |
| Archive tier (LTR) | Far cheaper per-GB than hot vault storage | Tier long-term monthly/yearly points to archive |
| ASR per-instance fee | Charged per replicated VM, continuously | Replicate only tier-1, not the whole estate |
| Instant-restore days | Snapshot storage per day retained | Drop to 1–2 days for non-critical VMs |

A rough monthly picture for a small production estate (~10 VMs, ~2 TB protected, tier-1 subset of 3 VMs on ASR):

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
|---|---|---|---|---|
| Protected instances (10 VMs) | Per-instance monthly fee | ~₹8,000–14,000 | Backup coverage of the estate | Scales with instance size tiers |
| Backup storage (GRS, ~2 TB retained) | Per-GB retained, geo-priced | ~₹10,000–20,000 | Recoverable history | Grows with retention; archive old tiers |
| Instant-restore snapshots (5 days) | Source-region snapshot storage | ~₹2,000–5,000 | Fast recent restores | Trim days on non-critical VMs |
| ASR (3 tier-1 VMs) | Per-instance fee + replica + cache | ~₹6,000–12,000 | Fast regional failover | Most expensive per-workload; tier-1 only |
| Cross-region restore / egress | Restore I/O + egress when used | Episodic | Self-service paired-region restore | Only on a real restore/failover |
| Log Analytics (vault logs) | Per-GB ingestion | ~₹1,000–3,000 | Alerting + audit on destructive ops | Sample/route verbosely-logged vaults |

The cost discipline is the same as the resilience discipline: right-size retention (don’t keep 7 years of daily points), tier long-term data to archive, reserve ASR for tier-1, and measure — Northwind ended up cheaper after redesign in some line items because they stopped over-retaining daily points and only replicated the three VMs that actually needed it. For estate-wide cost control, pair this with Azure FinOps and Cost Management: Controlling Cloud Spend at Scale.
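
To turn the rough per-driver ranges into a single monthly envelope, add the low ends and the high ends separately. A small Python sketch using only the figures quoted for the ~10-VM estate; the episodic cross-region restore row is left out because it has no monthly figure, and the dictionary and function names are illustrative:

```python
from typing import Dict, Tuple

# (low, high) rough INR per month, as quoted for the ~10-VM example estate.
monthly_ranges_inr: Dict[str, Tuple[int, int]] = {
    "Protected instances (10 VMs)": (8_000, 14_000),
    "Backup storage (GRS, ~2 TB retained)": (10_000, 20_000),
    "Instant-restore snapshots (5 days)": (2_000, 5_000),
    "ASR (3 tier-1 VMs)": (6_000, 12_000),
    "Log Analytics (vault logs)": (1_000, 3_000),
}


def total_range(ranges: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low ends and the high ends of each cost driver separately."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


print(total_range(monthly_ranges_inr))  # (27000, 54000)
```

That puts the steady-state bill for this example at roughly ₹27,000–54,000 a month before any restore or failover traffic, with backup storage as the largest single line at either end of the range.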

Interview & exam questions

1. What is the fundamental difference between Azure Backup and Azure Site Recovery? Azure Backup protects data — it takes point-in-time recovery points so you can restore a file, disk, database or whole VM after deletion or corruption. Azure Site Recovery protects availability — it continuously replicates a VM to another region so you can fail the workload over during a regional outage. Backup answers “can I get my data back?”; ASR answers “can I run my app elsewhere?” Most production workloads need both because they cover different failure modes.

2. A team backs up VMs daily but has no DR. A region fails. Why can’t Backup meet a 1-hour RTO? Restoring from Backup means copying recovery-point data back and rebuilding VMs, which takes real time proportional to data size — restoring many multi-hundred-GB VMs serially blows a 1-hour RTO. Backup is optimised for granular data recovery, not fast whole-region failover. The tool for a tight regional RTO is Site Recovery, which boots an already-replicated replica in minutes.

3. How does Azure Backup defend against ransomware that deletes the backups? Three layered vault controls. Soft delete retains deleted recovery points for a window (14–180 days) so a malicious delete is recoverable, and its enhanced, always-on mode makes the setting irreversible so an attacker can’t disable it first; immutability (locked) makes recovery points un-deletable and un-shortenable before expiry; and Multi-User Authorization requires a second approver (via a Resource Guard in a separate tenant) for destructive operations. Together they ensure a single compromised admin cannot erase the backups.

4. What’s the difference between crash-consistent and application-consistent recovery points, and when does it matter? A crash-consistent point captures disks at an instant as if the power were pulled — it boots, but in-flight writes may be torn and a database may run crash recovery and lose the last transactions. An application-consistent point uses VSS (Windows) or pre/post scripts (Linux) to quiesce the application first, producing a transactionally clean moment. It matters for databases and stateful apps: always restore them from an application-consistent point, even if it’s slightly older than the latest crash-consistent one.

5. Why is vault redundancy a decision you must make at creation, and what should production use? Vault storage redundancy (LRS/ZRS/GRS) is immutable once the vault holds a protected item — you can’t change LRS→GRS without deleting all items. An LRS vault keeps all copies in one region, so a regional disaster destroys the backups with the data. Production backups should use GRS with Cross-Region Restore, which keeps a geo copy and lets you restore in the paired region on your schedule.

6. What is RPO vs RTO, and what governs each for Backup and ASR? RPO (Recovery Point Objective) is the maximum acceptable data loss in time — governed by how often you create recovery points (backup frequency for Backup; continuous replication for ASR, often minutes). RTO (Recovery Time Objective) is the maximum acceptable time to restore service — governed by how fast you restore/fail over and re-plumb (restore time for Backup; replica boot + DNS/identity cutover for ASR). Both should be set from business impact and proven in a test, not assumed.

7. Why must you test failovers, and what does a test failover do? A recovery plan is a hypothesis until exercised — DR plans decay (IPs change, scripts rot, new VMs aren’t added). A test failover boots the replicas in an isolated network with no impact to production or replication, so you can verify the app actually comes up, measure end-to-end RTO, and produce audit evidence — then clean it up. Skipping it is how “we have DR” becomes an 18-hour outage.

8. A recovery point you need to restore from has aged out of retention. What went wrong and how do you prevent it? The retention window was shorter than the detection latency — corruption was noticed after the clean point expired (e.g. 7-day retention, corruption found on day 9). There’s no fix after the fact except a cross-region/secondary copy if one exists. Prevent it by setting daily retention to at least 14–30 days so late-noticed corruption is still recoverable; treat 7-day retention as dev-only.

9. What does a recovery plan add over just replicating VMs with ASR? A pile of replicated VMs isn’t a recoverable application — tiers must come up in order and the cutover needs automation. A recovery plan sequences VMs into groups (DB → app → web) with pre/post actions (manual gates or Azure Automation runbooks) for things like rewriting connection strings and updating DNS, turning failover into a single repeatable, auditable operation instead of an improvisation under stress.

10. Your ASR replication health goes Critical and RPO climbs. What’s happening and what do you check? Replication lag — the source is generating writes faster than they replicate, usually due to a throttled/full cache storage account, constrained egress, or a churn burst. Check Site Recovery → Replicated items for per-VM RPO/health and the cache storage account’s throttling metrics. Fix by upgrading the cache storage, ensuring egress headroom, and reducing runaway write workloads. RPO health is a leading indicator — alert on it.

11. What’s the difference between a Recovery Services vault and a Backup vault? A Recovery Services vault is the long-standing vault for VM/file/SQL/SAP backup and Azure Site Recovery. A Backup vault is the newer model for cloud-native workloads the Recovery Services vault never covered — Azure Blobs, managed disks, PostgreSQL flexible server, AKS — with native immutability and MUA. You can’t migrate items between them, and ASR/VMs only live in the Recovery Services vault, so pick correctly before creating anything.

12. How do you ensure newly created VMs are actually protected? Backup is enabled per-item, so without automation new VMs ship unprotected and coverage drifts as the estate grows. Use an Azure Policy (built-in “Configure backup on VMs”) to auto-enable backup on new VMs, and review Backup center coverage routinely. Coverage is a governance problem solved with policy, not vigilance.

These map primarily to two exams: AZ-104 (Administrator), under “implement and manage backup and recovery” (Recovery Services vaults, backup policies, ASR), and AZ-305 (Solutions Architect Expert), under “design business-continuity solutions” (RPO/RTO, backup vs DR, recovery objectives). The ransomware/immutability angle touches SC-100/AZ-500. A compact cert-mapping for revision:

| Question theme | Primary cert | Exam objective area |
|---|---|---|
| Backup vs ASR, RPO/RTO | AZ-305 | Design business-continuity solutions |
| Recovery Services vault, policies, retention | AZ-104 | Implement and manage backup |
| ASR replication, recovery plans, failover | AZ-104 / AZ-305 | Implement DR; design BC |
| Soft delete, immutability, MUA | AZ-500 / SC-100 | Secure backup; ransomware resilience |
| Crash- vs app-consistent, restore types | AZ-104 | Backup and recovery operations |
| Vault redundancy (LRS/ZRS/GRS) | AZ-305 | Design for resiliency / data redundancy |

Quick check

  1. A user accidentally deletes a critical file; you need it back from three weeks ago. Which service, and which restore type?
  2. Central India region goes fully offline and you must serve customers within 30 minutes. Which service, and why does daily Backup alone fail here?
  3. True or false: you can change a vault’s redundancy from LRS to GRS after you’ve been backing up 50 VMs into it for a year.
  4. A ransomware operator gets admin and deletes your backup items. Name the two vault controls that would still save you.
  5. Your test failover boots all VMs successfully but the application can’t serve traffic. What’s the most likely missing piece, and where do you fix it?

Answers

  1. Azure Backup, file-level restore: mount the recovery point from three weeks ago and copy the file back. This requires daily retention of at least 21 days; ASR retains recovery points only for a short window (hours to days, not weeks) and restores whole VMs only, so it can’t do this.
  2. Azure Site Recovery — it keeps a continuously replicated, bootable replica in a paired region you can fail over to in minutes. Daily Backup fails the 30-minute RTO because restoring means copying recovery-point data back and rebuilding VMs, which takes far longer than booting an already-replicated replica.
  3. False. Vault redundancy is immutable once the vault holds a protected item. To go LRS→GRS you’d have to stop protection and delete all backup data first (or create a new GRS vault and re-protect everything). Decide redundancy at creation.
  4. Soft delete (ideally enhanced/always-on) retains the deleted recovery points for a recoverable window even after deletion; immutability (locked) prevents the points being deleted or their retention shortened at all. Multi-User Authorization further blocks a single compromised admin from disabling either. Any of these defeats the “delete the backups” step.
  5. DNS / identity / dependency cutover wasn’t part of the plan — the VMs are up but can’t resolve names, authenticate, or reach a primary-region dependency. Fix it by adding those cutover steps (DNS updates, connection-string/identity rewrites) to the recovery-plan runbooks and verifying them in the test failover — which is exactly what test failovers exist to catch.

Glossary

Next steps

You can now protect data with Backup, defend it against ransomware with a hardened vault, and stand up regional DR with Site Recovery and a tested recovery plan. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
