Most SQL Managed Instance “HA” incidents I have reviewed were not engine failures. They were the day the primary region went dark and the team discovered their connection strings hard-coded a regional FQDN, the service-account logins did not exist on the secondary, and nobody had ever actually run a failover. The data was intact and perfect; the application was down for three hours because the parts that move databases do not move logins, jobs, or linked servers. Managed Instance gives you two distinct continuity primitives — auto-failover groups for cross-region geo-replication and the Managed Instance link for hybrid replication to and from SQL Server. They solve different problems, they cannot be combined on the same instance, and both are useless until you have rehearsed the runbook.
This article is the build-and-prove guide. We treat business continuity not as a single feature but as a layered design: zone redundancy inside one region, a failover group across two regions, and the link for the hybrid path — each with its own RPO/RTO, its own networking, and its own failure modes. You will learn to wire both primitives correctly, route applications through listeners that survive a failover, pre-stage the server-level state that replication silently ignores, and drill the runbook on a cadence so the first failover you ever run is not at 3 a.m. during a real outage. Because this is a reference you will return to mid-incident, the tiers, the listeners, the ports, the error states, the update-policy matrix, and the failover playbook are all laid out as scannable tables alongside the prose, the az CLI, the T-SQL, and the Bicep.
By the end you will stop guessing. When a region degrades you will know whether to run a planned failover (no data loss, primary reachable) or a forced one (data loss possible, primary gone), what your measured RPO actually is, which listener your reporting traffic should hit, and exactly which pre-staged objects turn a “successful” failover into a working application. Knowing which within minutes is the difference between a controlled relocation and a regulated-downtime breach.
What problem this solves
A Managed Instance running in a single region is exposed to three failure classes the engine’s built-in HA does not cover: a region outage (datacenter or zone-fleet event), a bad change (a deployment or DROP TABLE that replicates instantly to any synchronous replica), and a hybrid migration where you need on-prem SQL Server and Azure to share data with minimal downtime. The local Always On availability group inside a Business Critical instance fails over in seconds — but only within the region. When the region is the thing that failed, local HA has nothing left to fail over to.
What breaks without a cross-region story: the business discovers continuity is a slide, not a control. The connection strings point at mi-prod-eastus2.<dns_zone>.database.windows.net, which no longer resolves to anything healthy. There is no second instance because nobody provisioned one in the partner’s DNS zone (and you cannot retrofit the DNS zone after creation). Even teams that did build a failover group often skip the rehearsal, so the first real failover surfaces the unreplicated server-level state — logins with mismatched SIDs, SQL Agent jobs that simply are not there, linked servers, credentials, msdb operators — and the overnight batch run fails on a database where every customer row is correct.
Who hits this: every team running a production Managed Instance under a continuity or regulatory obligation (RPO/RTO targets, an auditor asking “show me your last DR test”), every migration moving a 2 TB SQL Server database to MI inside a downtime window, and every hybrid estate that wants Azure as a DR target for on-prem SQL Server (or vice versa). The fix is almost never “buy a bigger SKU” — it is “design the layers, route through listeners, pre-stage server state, and drill the runbook until the numbers are measured, not marketing.”
To frame the whole field before the deep dive, here is every continuity layer this article covers, the failure it actually defends against, and where it lives:
| Continuity layer | Defends against | Scope of protection | RPO (typical) | RTO (typical) | Lives where |
|---|---|---|---|---|---|
| Local HA (built-in) | Node / disk failure | Single region | 0 (sync) | Seconds | The instance itself |
| Zone redundancy | Single-zone outage | Single region, multi-AZ | 0 (sync) | Seconds | Instance property |
| Auto-failover group | Region outage, region relocation | Cross-region (MI→MI) | ~5 s (async) | ~1 h end-to-end | Secondary RG, both regions |
| Managed Instance link | Hybrid migration, hybrid DR | SQL Server ↔ MI (one DB/link) | Seconds (async) | Minutes (cutover) | DAG between SQL Server + MI |
| Backups / PITR / LTR | Bad change, deletion, corruption | The thing replication can’t undo | Up to backup interval | Minutes–hours (restore) | Automated, per database |
The single most important sentence in this table: geo-replication is not a backup. A failover group and the link both move a DROP TABLE to the secondary in seconds. Point-in-time restore is the only layer in the list that defends against a bad change — keep it, even with a perfect failover group.
Learning objectives
By the end of this article you can:
- Map any Managed Instance continuity requirement to the right primitive — local HA, zone redundancy, a failover group, the link, or backups — and explain what each does and does not protect against.
- Provision an auto-failover group across two regions correctly: shared DNS zone set at creation, dedicated delegated subnets, cross-region connectivity, and a customer-managed (
Manual) failover policy. - Route applications through the read-write and read-only listeners so a failover never requires a connection-string change, and explain why live TCP connections do not migrate.
- Establish a Managed Instance link (distributed availability group) for online migration or hybrid DR, and choose the correct SQL Server version and MI update policy for one-way versus two-way scenarios.
- Decide planned versus forced failover under pressure, quantify the data-loss window each implies, and execute the exact
azcommand for each. - Pre-stage the server-level state replication ignores — logins with matching SIDs, SQL Agent jobs, linked servers, credentials — as a hard gate in the runbook, not a hopeful afterthought.
- Build, automate, and drill an RPO/RTO-driven failover runbook, and verify with concrete queries that listeners route correctly and replication is healthy.
- Diagnose the common failures — DNS-zone mismatch, subnet sizing, link version/policy traps, stale connections, missing server state — from the exact symptom to the confirming command to the fix.
Prerequisites & where this fits
You should already understand the Managed Instance basics: an MI is a near-100%-compatible SQL Server instance delivered as PaaS, deployed into a delegated subnet inside your VNet, with two service tiers — General Purpose (compute separated from remote storage) and Business Critical (a local Always On availability group on local SSD with a free readable replica). You should know how to run az in Cloud Shell, read JSON/-o table output, and be comfortable with T-SQL and SSMS. Familiarity with Always On availability groups, DNS CNAME resolution, and VNet peering will make every section land faster.
This sits in the Data platform & business continuity track. It assumes the platform fundamentals and the advanced-features layer from Azure SQL Database Advanced Patterns: Hyperscale, Elastic Pools, Ledger, and Always Encrypted with Secure Enclaves, and it pairs with the broader DR strategy in Active-Active Multi-Region on Azure: Building for RTO Near Zero and the IaaS-side runbook discipline in Azure Site Recovery for IaaS: Zone-to-Zone and Region Failover with Recovery Plans. When the symptom is connectivity rather than continuity, Troubleshooting Azure SQL Database: Connectivity, Timeouts, Throttling & Blocking is the companion playbook.
A quick map of who owns which layer during an incident, so you call the right person fast:
| Layer | What lives here | Who usually owns it | Continuity failure it can cause |
|---|---|---|---|
| Application / connection layer | Connection strings, retry logic, ORM | App / dev team | Hard-coded regional FQDN; no reconnect on failover |
| DNS / listener | Failover-group CNAMEs, DNS zone | Platform + DBA | Stale resolution; reads hitting the wrong instance |
| Managed Instance (engine) | Databases, local AG, replicas | DBA / platform | Tier mismatch; missing server-level state |
| Networking (VNet/subnet) | Delegated subnet, NSG, route table, peering | Network team | Provisioning failure; cross-region path down |
| Failover group / link | Geo-replication, DAG, failover policy | DBA / platform | Wrong policy; not synchronized at failover |
| Backups (PITR/LTR) | Automated backups, retention | Platform | The only defense against a replicated bad change |
Core concepts
Five mental models make every later decision obvious.
Continuity is layered, not a single switch. Local HA, zone redundancy, the failover group, the link, and backups are additive — each covers a failure the others do not. You do not pick one; you stack the ones your risk profile requires. A Business Critical instance with zone redundancy, a cross-region failover group, and long-term backup retention is three independent controls against three different failures, and removing any one re-opens a gap.
The two cross-region primitives are mutually exclusive on an instance. A failover group geo-replicates all user databases MI→MI to a partner instance in another region. The Managed Instance link builds a distributed availability group between SQL Server and MI for one database per link. You cannot put both on the same instance — choose the primitive that matches the job (MI-to-MI DR versus hybrid SQL-Server replication/migration).
The listener is the contract; the instance FQDN is not. A failover group publishes two CNAMEs in a shared DNS zone — a read-write listener that always points at the current primary and a read-only listener that points at the current secondary. Applications connect to listeners, never to a regional instance FQDN, so a role switch is invisible to the connection string. On failover the DNS record updates immediately, but existing TCP connections are not migrated — clients must reconnect, which is why aggressive retry-with-reconnect is mandatory in the data layer.
Replication moves databases, nothing else. Both the failover group and the link replicate user databases only. Logins, SQL Agent jobs, linked servers, credentials, server-level permissions, and master/msdb objects are NOT replicated. This is the single most common cause of a post-failover application outage: the data is perfect, but the service-account login is missing or has a mismatched SID, so the app cannot authenticate or the overnight job does not exist to run. Server-level state is your job, pre-staged on the secondary and gated in the runbook.
Geo-replication is asynchronous, so failover has two flavors. Replication across the WAN is async, which means at any instant the secondary is slightly behind. A planned failover synchronizes fully then switches roles (no data loss, requires the primary reachable) — your tool for drills and relocations. A forced failover promotes the secondary immediately without waiting (potential data loss equal to the un-replicated lag) — your only option when the primary region is gone. That lag is your RPO, and the only honest number is the one you measure in a drill.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Concept | One-line definition | Where it lives | Why it matters to continuity |
|---|---|---|---|
| Managed Instance (MI) | Near-100% SQL Server compatible PaaS in your VNet | Delegated subnet | The unit you replicate and fail over |
| General Purpose | Compute separated from remote storage (3-copy) | Service tier | Recovery is a remount, not an instant handoff |
| Business Critical | Local AG of 4 replicas on local SSD | Service tier | Local failover in seconds; free readable replica |
| Zone redundancy | Spreads replicas/storage across AZs | Instance property | Survives a single-zone outage in-region |
| Failover group (FOG) | Geo-replicates all user DBs MI→MI | Secondary RG | Cross-region DR with a stable listener |
| DNS zone | SAN-cert-scoped zone ID shared by partners | Set at creation | Same cert post-failover → must be shared |
| Read-write listener | CNAME → current primary | Shared DNS zone | App OLTP endpoint that survives failover |
| Read-only listener | CNAME → current secondary | Shared DNS zone | Reporting/read offload endpoint |
| MI link | Distributed AG between SQL Server and MI | DAG | Hybrid migration / DR, one DB per link |
| Update policy | SQL-engine versioning mode of the MI | Set at creation | Gates two-way link failback |
| Planned failover | Sync then switch (no data loss) | set-primary |
Drills, relocations, failback |
| Forced failover | Switch immediately (data loss possible) | --allow-data-loss |
Real region outage only |
| RPO | Data you can lose (the replication lag) | Design target | ~5 s under healthy async |
| RTO | Time to recover (switch + app recovery) | Design target | ~1 h end-to-end, drill it |
| Standby replica | License-free passive secondary | --secondary-type |
Cuts SQL licensing on DR-only vCores |
Service tiers, zone redundancy, and the built-in HA model
Before you reach for cross-region anything, understand what you already get inside one region — because the tier you chose changes both your local recovery behavior and what a geo-secondary costs. Managed Instance has two operational models:
General Purpose separates compute from storage. Your database files live on Azure remote storage (premium/standard managed disks) with built-in 3-copy durability. A node failure means a stateless compute node is reattached to the same storage — fast to provision, but recovery is a remount, not an instant replica handoff. There is no local readable replica.
Business Critical runs an Always On availability group of four replicas locally on local SSD. One replica is a free read-only replica you reach with ApplicationIntent=ReadOnly. Failover between local replicas is seconds, and you get the lowest-latency I/O. It costs more and caps at smaller maximum storage than General Purpose.
The tier comparison that drives both your local-HA expectation and your geo-secondary cost:
| Property | General Purpose | Business Critical | Why it matters |
|---|---|---|---|
| Compute/storage model | Compute separate from remote storage | Compute + local SSD storage | GP recovery is a remount; BC is a replica handoff |
| Local replicas | None (stateless compute) | 4-node Always On AG | BC fails over locally in seconds |
| Local readable replica | No | Yes (free, ApplicationIntent=ReadOnly) |
BC offloads reads in-region for free |
| Local failover time | Remount (longer) | Seconds | BC for the tightest in-region RTO |
| Max storage (approx) | 16 TB | ~5.5 TB (Gen5, scales with vCores) | GP for very large databases |
| I/O latency | Higher (remote) | Lowest (local SSD) | BC for latency-sensitive OLTP |
| Relative cost | Lower | Higher (includes SQL licensing on replicas) | Drives DR-secondary economics |
| Standby (license-free) secondary | Supported in FOG | Supported in FOG | Cut licensing on DR-only vCores |
Zone redundancy is the cheapest resilience upgrade most teams skip. Enabling it spreads the replicas (Business Critical) or the storage (General Purpose, where supported) across availability zones in the region, so a single-zone outage does not take the instance down. It is a property of the instance, not a separate resource:
# Enable zone redundancy on an existing instance (Business Critical)
az sql mi update \
--resource-group rg-data-eastus2 \
--name mi-prod-eastus2 \
--zone-redundant true
resource mi 'Microsoft.Sql/managedInstances@2023-08-01-preview' = {
name: 'mi-prod-eastus2'
location: 'eastus2'
sku: { name: 'GP_Gen5', tier: 'GeneralPurpose', family: 'Gen5', capacity: 8 }
properties: {
subnetId: miSubnet.id
zoneRedundant: true // spread across AZs; protects a single-zone outage
storageSizeInGB: 256
licenseType: 'LicenseIncluded'
}
}
The built-in HA layers and exactly what each one survives — read this before deciding you “have HA”:
| Built-in mechanism | Tier | Protects against | Does NOT protect against | Recovery behavior |
|---|---|---|---|---|
| 3-copy remote storage | General Purpose | Disk failure | Region outage, zone outage, bad change | Storage-level durability |
| Compute reattach | General Purpose | Compute-node failure | Region/zone outage | Remount to same storage |
| Local 4-node AG | Business Critical | Node/replica failure | Region outage | Local failover in seconds |
| Zone redundancy | Both (where supported) | Single-zone outage | Region outage, bad change | Stays up across AZs |
| Automated backups + PITR | Both | Bad change, deletion, corruption | (this is the one that does) | Restore to a point in time |
Zone redundancy protects against a zone failure. It does not protect against a region failure, a bad deployment, or accidental data deletion. That is the line where failover groups and the link begin. Keep the built-in HA and add cross-region DR and keep backups — they are layers, not alternatives.
Provisioning auto-failover groups across two regions
A failover group geo-replicates all user databases on an instance to a partner instance in a different region, with a managed listener that survives failover. Two hard prerequisites trip people up before they ever run the command:
- The two instances must share a DNS zone. A Managed Instance’s host certificate is a SAN cert scoped to a generated DNS zone ID (e.g.
a1b2c3d4e5f6). For a client to reconnect to the secondary using the same cert after failover, the secondary must be in the same DNS zone. You set this at creation time of the secondary by pointing it at the primary as its DNS zone partner — you cannot retrofit it. - The subnets must have connectivity. Global VNet peering is the recommended path (private, backbone, no gateway). VPN/ExpressRoute also work but are slower for the seeding phase.
The prerequisites as a checklist with the failure you get if you skip each — this is the table that saves a re-provision:
| Prerequisite | Requirement | When set | Failure if wrong |
|---|---|---|---|
| Shared DNS zone | Secondary created in primary’s DNS zone | Creation only | Clients can’t reconnect post-failover; not retrofittable |
| Same hardware family | Both instances same gen (e.g. Gen5) | Provisioning | FOG creation rejected |
| Secondary ≥ primary size | Secondary vCores/storage ≥ primary | Provisioning | Seeding/failover constrained or blocked |
| Same/compatible service tier | GP↔GP or BC↔BC (with caveats) | Provisioning | FOG creation rejected |
| Cross-region connectivity | Global VNet peering (recommended) | Before FOG create | Seeding fails / FOG can’t form |
| Collation match | Server collation identical | Provisioning | FOG creation rejected |
| Empty secondary | No user DBs on the secondary | Before FOG create | FOG create fails (DBs would conflict) |
| Distinct subnets | Each instance in its own delegated subnet | Provisioning | Both → DNS-zone/peering issues |
Create the secondary in the partner’s DNS zone (the --dns-zone-partner takes the primary instance resource ID):
# Secondary instance, placed into the primary's DNS zone
az sql mi create \
--resource-group rg-data-westus2 \
--name mi-prod-westus2 \
--location westus2 \
--subnet "/subscriptions/<sub>/resourceGroups/rg-net-westus2/providers/Microsoft.Network/virtualNetworks/vnet-westus2/subnets/snet-mi" \
--capacity 8 --storage 256GB \
--edition BusinessCritical --family Gen5 \
--dns-zone-partner "/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.Sql/managedInstances/mi-prod-eastus2"
With both instances live and peered, create the failover group on the secondary’s resource group, referencing the primary as --source-mi. Use --failover-policy Manual (customer-managed) so you decide when to fail over — this is the strongly recommended posture:
az sql instance-failover-group create \
--name fog-prod-sqlmi \
--resource-group rg-data-westus2 \
--location westus2 \
--source-mi mi-prod-eastus2 \
--partner-mi mi-prod-westus2 \
--partner-resource-group rg-data-westus2 \
--failover-policy Manual \
--grace-period 1
Note the directionality: --source-mi is the current primary (East US 2), --partner-mi is the secondary you just built (West US 2), and --location/--resource-group describe the secondary. Every create-time parameter, its default, and the trade-off — get these right at creation because several are not changeable later:
| Parameter | What it sets | Default / typical | When to change | Gotcha if wrong |
|---|---|---|---|---|
--failover-policy |
Who triggers failover | Manual (recommended) |
Automatic only if you accept MS-timed region failover |
Automatic can fail over on a transient region blip |
--grace-period |
Min hours before MS forces failover (Automatic only) | 1 (floor) | Tune under Automatic | Inert under Manual |
--secondary-type |
Geo vs Standby secondary | Geo |
Standby for DR-only (license-free) |
Reads on a Standby forfeit the license benefit |
--source-mi |
Current primary | (the primary) | — | Reversed → group points the wrong way |
--partner-mi |
The secondary | (the new instance) | — | Must be in the shared DNS zone |
--partner-resource-group |
Secondary’s RG | (secondary RG) | — | Mismatch → create fails |
--location |
Secondary region | (secondary region) | — | Must be the secondary, not primary |
Set --secondary-type Standby instead of the default Geo if the secondary is DR-only and you want the license-free passive replica benefit — you stop paying SQL licensing on its vCores as long as you never run read workloads on it. --grace-period only matters under a Microsoft-managed (Automatic) policy; it is the minimum wait (floor of 1 hour) before Microsoft forces a region-wide failover. Under Manual it is inert.
The failover-policy decision, distilled:
| If you want… | Choose | Because |
|---|---|---|
| Full control over when you fail over | Manual |
You decide; no surprise failover on a transient blip |
| Microsoft to fail over automatically on a sustained outage | Automatic |
Hands-off, but cedes timing and accepts grace-period delay |
| A DR-only secondary with no read workload | Standby secondary |
License-free passive replica saves SQL licensing |
| A secondary you also read from | Geo secondary |
Read offload, but you pay SQL licensing on it |
The same group expressed declaratively, so it is reproducible and reviewable:
resource fog 'Microsoft.Sql/managedInstances/instanceFailoverGroups@2023-08-01-preview' = {
name: 'fog-prod-sqlmi'
parent: secondaryMi // created on the secondary
properties: {
readWriteEndpoint: {
failoverPolicy: 'Manual' // customer-managed
failoverWithDataLossGracePeriodMinutes: 60 // honored only under Automatic
}
readOnlyEndpoint: { failoverPolicy: 'Enabled' }
partnerRegions: [ { location: 'eastus2' } ]
managedInstancePairs: [
{ primaryManagedInstanceId: primaryMi.id, partnerManagedInstanceId: secondaryMi.id }
]
secondaryType: 'Standby' // license-free DR secondary
}
}
Read-write and read-only listener endpoints for connection routing
This is the entire payoff. The failover group publishes two CNAMEs in the shared DNS zone, and they do not change when the primary role moves:
| Listener | FQDN format | Points to | Connection string intent | Read/write |
|---|---|---|---|---|
| Read-write | <fog-name>.<zone_id>.database.windows.net |
current primary | default (read-write) | Read-write |
| Read-only | <fog-name>.secondary.<zone_id>.database.windows.net |
current secondary | ApplicationIntent=ReadOnly |
Read-only |
So your applications connect to the listener, never to mi-prod-eastus2... directly:
# Read-write (OLTP) — always lands on whichever instance is primary
Server=tcp:fog-prod-sqlmi.a1b2c3d4e5f6.database.windows.net;Database=orders;Encrypt=True;
# Read-only (reporting) — lands on the geo-secondary
Server=tcp:fog-prod-sqlmi.secondary.a1b2c3d4e5f6.database.windows.net;Database=orders;ApplicationIntent=ReadOnly;Encrypt=True;
There are more endpoints in play than just these two, and confusing them is a classic outage. The complete endpoint reference for a Managed Instance in a failover group:
| Endpoint | Form | Resolves to | Survives failover? | Reachable from |
|---|---|---|---|---|
| FOG read-write listener | <fog>.<zone>.database.windows.net |
Current primary | Yes | VNet-local only |
| FOG read-only listener | <fog>.secondary.<zone>.database.windows.net |
Current secondary | Yes | VNet-local only |
| Instance VNet-local FQDN | <mi>.<zone>.database.windows.net |
That specific instance | No (pins a region) | VNet-local only |
| Instance public endpoint | <mi>.public.<zone>.database.windows.net,3342 |
That instance, port 3342 | No | Internet (if enabled) |
| Link DAG (mirroring) | Private IP/host : 5022 | DAG transport | n/a | VNet-local only |
Two operational truths to internalize, then the connection-string knobs that make the routing actually work:
- On failover, the DNS record updates immediately, but existing TCP connections are not migrated. Clients must reconnect, and they will only pick up the new primary after their DNS cache expires. Build aggressive retry-with-reconnect into your data layer; do not assume a live connection rides through.
- These listeners are not reachable over the public endpoint. They resolve over the VNet-local endpoint only — from inside the VNet or peered/VPN/ExpressRoute networks. Plan your app placement accordingly.
The connection-string parameters that govern failover behavior, and how to set each:
| Parameter | What it does | Recommended for FOG | Why |
|---|---|---|---|
Server |
Target endpoint | The listener, never an instance FQDN | Survives the role switch |
ApplicationIntent |
Read-write vs read-only routing | ReadOnly on the read-only listener |
Routes reporting to the secondary |
Encrypt |
TLS to the instance | True |
MI requires encrypted connections |
MultiSubnetFailover |
Parallel connect across listener IPs | True |
Faster reconnect after failover |
Connection Timeout |
Initial connect timeout | 30 s (raise during failover window) | Avoid premature timeout mid-switch |
| Retry policy (app) | Reconnect on transient/40613/4060 |
Exponential backoff, jitter | Connections don’t migrate; you must reconnect |
The errors your data layer will see during a failover window — treat every one as “reconnect,” not “fail the request”:
| Error | Meaning | Likely cause | Confirm | Correct app behavior |
|---|---|---|---|---|
40613 |
Database not currently available | Role switch in progress | instance-failover-group show state |
Retry with backoff (reconnect) |
4060 / 40615 |
Cannot open database / login blocked | Failover or login not on new primary | Check listener target + login exists | Reconnect; pre-stage login |
40197 / 40501 |
Service busy / throttling | Transient during switch or load | Resource metrics | Backoff and retry |
10928 / 10929 |
Resource limit reached | Connection/worker cap on new primary | DMV sys.dm_exec_requests |
Backoff; right-size secondary |
18456 |
Login failed | Login/SID missing on new primary | sys.server_principals on new primary |
Pre-stage login with matching SID |
| Conn timeout | No response within timeout | DNS cache stale; switch not complete | nslookup listener; FOG state |
Raise timeout; MultiSubnetFailover=True |
The Managed Instance link: near-real-time replication from SQL Server
Failover groups are MI-to-MI. The Managed Instance link is the hybrid story: it builds a distributed availability group between a SQL Server instance (on-prem or on an Azure VM) and a SQL Managed Instance, replicating a single database in near-real-time. Because it is a DAG, the underlying tech is the same Always On machinery your DBAs already know.
Key facts that shape the design:
- One database per link. To replicate ten databases you create ten links. A single MI hosts up to 100 links (500 on the Next-gen General Purpose tier).
- Minimum versions matter. One-way replication (SQL Server → MI) is GA from SQL Server 2016 SP3 / 2017 CU31 / 2019 CU20. Two-way DR with failback requires SQL Server 2022 (16.x) or SQL Server 2025 (17.x).
- Authentication is certificate-based. SQL Server and MI exchange the public keys of their respective certificates. Windows auth cannot establish the link.
- It uses the VNet-local endpoint only — not the public endpoint, not a private endpoint.
- User databases only. Logins, agent jobs, server-level objects, and system databases are not replicated. This is the single most common post-failover outage cause; handle it explicitly (see the runbook).
The link’s hard constraints and limits, enumerated so you design within them from day one:
| Constraint | Value / rule | Implication |
|---|---|---|
| Databases per link | 1 | N databases → N links |
| Links per MI | 100 (500 on Next-gen GP) | Caps the fan-in to one instance |
| Transport port | 5022 (DB mirroring), default instance | NSG + on-prem firewall must allow it |
| Authentication | Certificate-based only | Windows auth cannot form the link |
| Endpoint used | VNet-local only | Not public, not private endpoint |
| Replicated objects | User database(s) only | Server-level state is your job |
| One-way GA versions | SS 2016 SP3 / 2017 CU31 / 2019 CU20 | Older builds can’t link |
| Two-way (failback) versions | SS 2022 (16.x) / 2025 (17.x) | Failback needs modern SQL Server |
| MI update policy for failback | Must match SS 2022/2025 policy | Not supported on “Always-up-to-date” |
| System databases | Not replicated | master/msdb handled manually |
The trust handshake on the SQL Server side, abbreviated:
-- On SQL Server: certificate to authenticate the database mirroring endpoint
USE master;
CREATE CERTIFICATE SqlServerCert
WITH SUBJECT = 'SQL Server certificate for MI link';
-- Database mirroring endpoint that the DAG rides over (default instance uses 5022)
CREATE ENDPOINT Endpoint_Mirroring
STATE = STARTED
AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
FOR DATABASE_MIRRORING (
ROLE = ALL,
AUTHENTICATION = CERTIFICATE SqlServerCert,
ENCRYPTION = REQUIRED ALGORITHM AES);
The endpoint options that matter for the link, with their values and what each controls:
| Endpoint option | Values | Use for the link | Why |
|---|---|---|---|
LISTENER_PORT |
TCP port | 5022 (default instance) | DAG transport; only the default instance should use 5022 |
LISTENER_IP |
ALL / specific |
ALL (or scope to the MI subnet) |
Accept the DAG connection |
ROLE |
ALL / PARTNER / WITNESS |
ALL |
Supports both directions |
AUTHENTICATION |
CERTIFICATE / WINDOWS |
CERTIFICATE <cert> |
Windows auth can’t form the link |
ENCRYPTION |
REQUIRED / SUPPORTED / DISABLED |
REQUIRED ALGORITHM AES |
Encrypt the replication stream |
If multiple SQL Server instances share a host, each needs its own mirroring endpoint on a dedicated port. Only the default instance should use 5022. SSMS automates the entire certificate exchange and DAG creation through its “Replicate database to Managed Instance” wizard — use it for the first link, then script the rest with the latest
Az.Sqlmodule or REST.
One-way vs two-way link scenarios for migration and DR
The link is the same feature with two very different operating modes, gated by SQL Server version and the MI update policy:
| Scenario | Versions | Direction | MI update policy | Use it for |
|---|---|---|---|---|
| Online migration / read offload | SQL Server 2016–2019 | SQL Server → MI only | Any | Minimal-downtime migration; offload read/reporting/analytics to Azure |
| Two-way DR with failback | SQL Server 2022 / 2025 | Bidirectional | Must match SS 2022/2025 policy | Hybrid DR where Azure or on-prem can be primary, and you can fail back |
The hard constraint on two-way: replicating from MI back to SQL Server (and failing over in that direction) is not supported on the “Always-up-to-date” update policy. The instance must run the matching SQL Server 2022 or SQL Server 2025 update policy. If you stand up an MI on Always-up-to-date and later need to fail back to on-prem, you are stuck — you can fail over to MI but never back over the link. Decide the update policy when you provision, based on whether failback to SQL Server is a requirement.
The update-policy decision matrix — this is the one that bites a year later if you get it wrong:
| Update policy | Engine versioning | Two-way link failback | Choose when |
|---|---|---|---|
| Always-up-to-date | Latest engine features, continuous | Not supported | MI is the long-term home; no failback to on-prem ever |
| SQL Server 2022 policy | Aligned to SQL Server 2022 | Supported | You may fail back to a SQL Server 2022 estate |
| SQL Server 2025 policy | Aligned to SQL Server 2025 | Supported | You may fail back to a SQL Server 2025 estate |
For migration, the link is the only option that delivers a true online cutover to the Business Critical tier. The pattern: establish the link, let it seed and stay synchronized while production keeps running on SQL Server, then fail over to MI during a short, planned cutover window. For SQL Server 2016–2019 that failover is one-directional and breaks the link — which is exactly what you want for a migration, but means you must be confident before you cut over.
How the link compares with the other ways to get data into MI — pick the row that matches your downtime budget:
| Method | Online? | Min downtime | Target tier | Best for |
|---|---|---|---|---|
| MI link | Yes (continuous DAG) | Minutes (cutover) | GP or BC | Lowest-downtime migration; hybrid DR |
| Log Replay Service (LRS) | Near-online (log shipping) | Final restore window | GP or BC | Migration without 2016 SP3+ link prereqs |
| Native backup/restore (URL) | No | Full restore time | GP or BC | One-off, downtime acceptable |
| DMS (Database Migration Service) | Online mode available | Varies | GP or BC | Guided, assessment-driven migrations |
| Transactional replication | Yes | Cutover | GP or BC | Selective objects, ongoing sync |
Planned vs forced failover and the data-loss tradeoffs
Geo-replication is asynchronous over the WAN. That single fact drives the entire failover-vs-forced-failover decision and your achievable RPO. There are two distinct operations:
Planned (no data loss). set-primary without --allow-data-loss. It fully synchronizes the secondary, then switches roles. It requires the current primary to be reachable. This is your tool for DR drills, region relocations, and failback after an outage clears.
# Planned failover: run AGAINST the secondary you want to promote.
# --location/-g identify the secondary. No data loss.
az sql instance-failover-group set-primary \
--name fog-prod-sqlmi \
--resource-group rg-data-westus2 \
--location westus2
Forced (potential data loss). Same command with --allow-data-loss. It promotes the secondary immediately without waiting for in-flight transactions to replicate. This is your only option when the primary region is gone. Anything not yet replicated is lost — that gap is your RPO.
# Forced failover during a real regional outage. Data loss is possible.
az sql instance-failover-group set-primary \
--name fog-prod-sqlmi \
--resource-group rg-data-westus2 \
--location westus2 \
--allow-data-loss
The two operations side by side — print this and tape it to the runbook:
| Dimension | Planned failover | Forced failover |
|---|---|---|
| Command | set-primary |
set-primary --allow-data-loss |
| Data loss | None | Possible (= un-replicated lag) |
| Primary must be reachable | Yes | No |
| Use case | Drills, relocation, failback | Real region outage only |
| RPO realized | 0 | Up to your replication lag (~5 s healthy) |
| Sync before switch | Yes (waits) | No (switches immediately) |
| Old primary after | Stays primary until switch completes | Auto-rejoins as new secondary when back |
| Reversible | Yes (planned failback) | Yes (planned failback once healthy) |
The decision in one table — match the situation to the operation:
| If the primary region is… | And you are… | Run | Expect |
|---|---|---|---|
| Healthy and reachable | Drilling / relocating | Planned set-primary |
Zero data loss |
| Healthy and reachable | Failing back after an outage | Planned set-primary |
Zero data loss |
| Down / unreachable | In a real outage | --allow-data-loss |
Possible loss = replication lag |
| Degraded but reachable | Unsure | Try planned first | Falls back to forced only if needed |
The DNS update is immediate either way, but the role switch can take up to ~5 minutes; until it completes some databases on the new primary remain read-only. After a forced failover, when the old primary comes back it automatically rejoins as the new secondary — do not delete it in a panic. Once the region is healthy, use a planned set-primary (no data loss) to fail back.
There is a way to get a data-lossless forced failover for a drill: stop the workload, drain long-running transactions, disconnect all clients, and confirm the failover group status is
Synchronizingon both instances before forcing. But for production failback, always prefer planned over forced. Forced failover has no guardrails against data loss by design.
Networking prerequisites: subnet delegation, NSG, and route tables
Managed Instance is the most networking-opinionated PaaS database Azure ships. Get the subnet wrong and provisioning simply fails. The non-negotiables:
- A dedicated subnet delegated to
Microsoft.Sql/managedInstances. Nothing else lives in it. No other resource types, no other instances from a different DNS zone. - Subnet sizing: a
/27is the documented minimum, but plan/26or larger so you have room to scale vCores (scaling can temporarily need extra addresses). - An NSG and a route table must be associated with the subnet — they are required, not optional. The platform manages a set of mandatory rules/routes; do not strip them.
The networking requirements as a hard checklist with the symptom you get when each is wrong:
| Requirement | Rule | Symptom if violated |
|---|---|---|
| Dedicated subnet | Delegated to Microsoft.Sql/managedInstances, nothing else |
Provisioning fails |
| Subnet size | /27 minimum; plan /26+ |
Scale operations blocked (no free IPs) |
| NSG attached | Required; keep mandatory rules | Provisioning fails / instance unhealthy |
| Route table attached | Required; keep mandatory routes | Provisioning fails / management broken |
| No conflicting delegations | Only the MI delegation | Subnet rejected |
| Cross-region path (FOG) | Global VNet peering (recommended) | Seeding fails; FOG can’t form |
| Link path | Peering (Azure VM) or VPN/ExpressRoute (on-prem) | Link can’t establish over 5022 |
| Mirroring port (link) | 5022 open in NSG + on-prem firewall | DAG transport blocked |
For a failover group, the two instance subnets across regions need connectivity. Global VNet peering is the recommended choice:
# Global VNet peering between the two MI VNets (run both directions).
az network vnet peering create \
--name peer-eastus2-to-westus2 \
--resource-group rg-net-eastus2 \
--vnet-name vnet-eastus2 \
--remote-vnet "/subscriptions/<sub>/resourceGroups/rg-net-westus2/providers/Microsoft.Network/virtualNetworks/vnet-westus2" \
--allow-vnet-access true --allow-forwarded-traffic true
# ...then the symmetric peer-westus2-to-eastus2 with the remote-vnet reversed.
The cross-region connectivity options for the FOG seeding/replication path, compared:
| Option | Private? | Throughput | Setup effort | Best for |
|---|---|---|---|---|
| Global VNet peering | Yes (backbone) | High | Low (no gateway) | Recommended for MI↔MI FOG |
| VPN gateway (VNet-to-VNet) | Yes (encrypted tunnel) | Modest | Medium | When peering isn’t available |
| ExpressRoute | Yes (private circuit) | High | High (circuit) | Existing ER estates; hybrid |
Subnet delegation and the required associations, expressed declaratively in Bicep so it is reproducible:
resource miSubnet 'Microsoft.Network/virtualNetworks/subnets@2024-05-01' = {
parent: vnet
name: 'snet-mi'
properties: {
addressPrefix: '10.20.0.0/26'
delegations: [
{
name: 'miDelegation'
properties: { serviceName: 'Microsoft.Sql/managedInstances' }
}
]
networkSecurityGroup: { id: nsgMi.id }
routeTable: { id: routeMi.id }
}
}
For the link, connectivity is between SQL Server and the MI subnet: VNet peering if SQL Server is on an Azure VM, or VPN/ExpressRoute from on-prem. The data path uses the database mirroring port (default 5022) plus the AG communication ports; ensure your NSGs and on-prem firewalls allow that traffic over the private link. The link cannot use the public or private endpoint — only the VNet-local endpoint. If you need to deepen the resolution design across the hub, Private Endpoints and Private DNS at Scale: A Hub-and-Spoke Resolution Architecture covers the patterns the listeners depend on.
Building and rehearsing a failover runbook with RPO/RTO targets
A failover capability you have never executed is a liability, not a control. Write the runbook, attach numbers, and drill it on a schedule.
Set the targets first. For Managed Instance geo-failover, design around an RPO of ~5 seconds under healthy async replication (the realistic data-loss window in a forced failover) and an RTO of ~1 hour for the failover operation plus application recovery — the role switch itself is minutes, but end-to-end recovery is gated by your slowest dependency. Document the actual numbers you measure in drills; do not ship the marketing ones.
The RPO/RTO budget broken down so you know which line item to attack when the number is too high:
| Phase | Contributes to | Typical | How to reduce |
|---|---|---|---|
| Un-replicated WAN lag | RPO | ~5 s (healthy async) | Faster link, less write burst, planned over forced |
| Detect + decide | RTO | Minutes (human) | Pre-authorized runbook, clear trigger criteria |
| Role switch (DNS + promote) | RTO | Up to ~5 min | Nothing to tune; platform-bound |
| Client reconnect | RTO | Seconds–minutes | Retry-with-reconnect, low DNS TTL, MultiSubnetFailover |
| App-tier failover | RTO | Minutes | Pre-deploy app in DR region; warm standby |
| Server-state validation | RTO | Minutes | Pre-stage logins/jobs; automate the gate |
The runbook, in order:
-
Pre-stage the secondary. Sync logins with matching SIDs, agent jobs, linked servers, credentials, and any
master/msdbobjects. These are NOT replicated. This is the step that turns a “successful” failover into a broken application.-- On the SECONDARY instance: recreate the login with the SAME SID as primary. CREATE LOGIN app_svc WITH PASSWORD = '<strong-secret>', SID = 0x2A...; -- SID from primary -
Decide planned vs forced based on whether the primary is reachable. Drills and failback use planned (
set-primary). A real region-down event uses forced (--allow-data-loss). -
Verify replication health before any planned operation — status must be
Synchronizingon both instances. -
Execute the appropriate
set-primarycommand from the failover section. -
Fail over the app tier too. Repoint or restart app components in the secondary region so cross-region latency to the database does not silently degrade you. The DB moving alone is half a failover.
-
Validate (next section), then communicate completion.
-
Fail back with a planned failover once the primary region is healthy — never leave production running on a DR-only-sized secondary longer than necessary.
The server-level state that replication ignores — the pre-stage checklist that prevents the “data perfect, app down” outage:
| Object class | Replicated by FOG/link? | Pre-stage how | Symptom if missed |
|---|---|---|---|
| User databases | Yes | (automatic) | — |
| SQL logins | No | CREATE LOGIN … SID = <match> |
18456 login failed |
| Login SID match | No | Copy SID from primary | Orphaned users; auth fails |
| SQL Agent jobs | No | Script + recreate on secondary | Batch/ETL never runs |
| Linked servers | No | Recreate with credentials | Cross-server queries fail |
| Credentials / proxies | No | Recreate | Jobs/operations fail |
| Server-level permissions | No | Re-grant | Authorization gaps |
msdb operators/alerts |
No | Recreate | No alerting post-failover |
| Server-level audit | No | Reconfigure | Compliance gap |
Automate the health check so the drill is repeatable rather than artisanal:
# Pre-flight gate: refuse to proceed unless the group is Synchronizing.
state=$(az sql instance-failover-group show \
--name fog-prod-sqlmi -g rg-data-westus2 -l westus2 \
--query "replicationState" -o tsv)
echo "Failover group replication state: $state"
[ "$state" = "CATCH_UP" ] || { echo "Not in sync — aborting drill."; exit 1; }
The drill cadence and what each rehearsal proves:
| Drill | Cadence | Proves | Risk |
|---|---|---|---|
| Planned failover (prod) | Quarterly | Listener routing, server-state gate, app failover | Low (data-lossless, reversible) |
| Planned failback | Same drill, reverse | Failback path and timing | Low |
| Forced-failover tabletop | Quarterly | The team knows the data-loss decision | None (paper exercise) |
| Server-state diff | Monthly | Logins/jobs still in sync | None (read-only check) |
| Connectivity verification | Per release | Listeners resolve from app subnets | None |
Drill cadence: quarterly planned failover in production (it is data-lossless and reversible), plus a tabletop of the forced-failover path. The first time you discover the service account login does not exist on the secondary should be a Tuesday-afternoon drill, not a 3 a.m. region outage. For the IaaS counterpart of this discipline — recovery plans, sequencing, scripts — see Azure Site Recovery for IaaS: Zone-to-Zone and Region Failover with Recovery Plans.
Verifying and monitoring the system
After configuring a failover group or running a drill, confirm the system is actually in the state you think it is — do not trust the portal’s green tick alone:
# 1. Failover group exists, knows both partners, and reports replication role/state.
az sql instance-failover-group show \
--name fog-prod-sqlmi -g rg-data-westus2 -l westus2 \
--query "{role:replicationRole, state:replicationState, partner:partnerRegion[0].location, policy:readWriteEndpoint.failoverPolicy}" -o table
# 2. Both listeners resolve and the right instance answers.
# Read-write must hit the current primary; read-only must hit the secondary.
nslookup fog-prod-sqlmi.a1b2c3d4e5f6.database.windows.net
nslookup fog-prod-sqlmi.secondary.a1b2c3d4e5f6.database.windows.net
-- 3. From inside the VNet, confirm which physical instance you landed on
-- via each listener (run after connecting through the read-write FQDN).
SELECT @@SERVERNAME AS landed_on,
DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS rw_or_ro;
-- 4. For the link: confirm the distributed AG is healthy and synchronizing.
SELECT ag.name, drs.synchronization_state_desc, drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_groups ag ON ag.group_id = drs.group_id;
Expected: the failover group state is CATCH_UP/Synchronizing, both CNAMEs resolve, the read-write listener lands on a writable primary, the read-only listener lands on a read-only secondary, and the link’s DAG reports SYNCHRONIZED/HEALTHY. The instruments to wire and watch — what each tells you and the healthy reading:
| Check / signal | Tool | Healthy reading | What a bad reading means |
|---|---|---|---|
replicationState |
az ... instance-failover-group show |
CATCH_UP |
Behind → forced failover would lose data |
replicationRole |
az ... instance-failover-group show |
Primary on the live instance |
Role drifted / mid-switch |
| RW listener resolution | nslookup from app subnet |
New primary’s VNet-local IP | Stale DNS / wrong region |
Updateability per DB |
DATABASEPROPERTYEX(...) |
READ_WRITE on primary |
DB still read-only post-switch |
| Local AG health (BC) | sys.dm_hadr_database_replica_states |
HEALTHY / SYNCHRONIZED |
Local replica degraded |
| Link DAG state | sys.dm_hadr_database_replica_states |
SYNCHRONIZED |
Link drifting; cutover unsafe |
| Logins present | sys.server_principals |
Required logins exist + SID match | 18456 waiting to happen |
| Agent jobs present | msdb.dbo.sysjobs |
Expected jobs exist | Batch will silently skip |
| Replication lag | Azure Monitor metric | Low / flat (~seconds) | RPO quietly degrading |
| Failover events | Activity log | Expected, attributable | Unexpected/automatic failover |
Architecture at a glance
The diagram traces the data and control path of a Managed Instance continuity design, left to right, and maps each failure class onto the exact hop where it bites. Read it as the system actually behaves: applications in the primary region connect through the failover-group listeners (read-write to the primary, read-only to the secondary) — never to a regional instance FQDN — over the VNet-local endpoint. The primary Managed Instance in East US 2 (Business Critical, zone-redundant, a local 4-node Always On AG) holds the live databases. It asynchronously geo-replicates all user databases across global VNet peering to the secondary Managed Instance in West US 2, which is created in the same DNS zone so the listener cert still validates after a switch. On the hybrid side, a SQL Server estate replicates a single database to the primary over a distributed availability group on port 5022, the link path for migration and hybrid DR. Underneath both sits the control plane — automated backups and PITR (the only defense against a replicated bad change) and Azure Monitor watching replication state and lag.
Walk the badges to read the diagnostic map. Badge 1 sits on the listener: a stale DNS cache or a connection that never reconnects keeps clients pinned to the dead region — confirm by resolving the listener, fix with retry-with-reconnect and MultiSubnetFailover. Badge 2 sits on the geo-replication arrow: the secondary is behind, so a forced failover loses the un-replicated lag — confirm the FOG replicationState, decide planned versus forced accordingly. Badge 3 sits on the secondary instance: the databases arrived but the logins, SIDs, and Agent jobs did not — confirm with sys.server_principals, fix by pre-staging server state. Badge 4 sits on the link: the wrong update policy blocks failback to SQL Server — confirm the policy, fix by choosing it at provisioning. Badge 5 sits on the DNS zone itself: a secondary built in the wrong zone can never present a valid cert post-failover — confirm the zone IDs match, fix by re-provisioning into the shared zone. The whole method is in the picture: localize the failure to a hop, read the badge, run the named check, apply the fix.
Real-world scenario
Meridian Pay, a payments platform, ran its core ledger on SQL Server 2019 on-prem and needed to move to Azure SQL Managed Instance with a regulator-mandated cap of 15 minutes of cutover downtime on a database just under 2 TB, plus an ongoing cross-region DR posture once it landed in Azure. The team is six DBAs and platform engineers; the target was a Business Critical instance in East US 2 with a geo-secondary in West US 2.
Their first instinct for the migration was the Log Replay Service, but they could not hit the window during rehearsals — the final log restore plus validation blew past 15 minutes every time, every rehearsal, because the tail-log restore was serialized against a 2 TB database. They switched to the Managed Instance link. Because the link rides a distributed availability group and keeps the MI continuously seeded and synchronized while production stays live on SQL Server, the actual cutover collapsed to: stop the app, confirm the DAG was SYNCHRONIZED, fail over to MI, repoint the app at the failover-group read-write listener. Measured cutover: under 4 minutes — comfortably inside the regulatory cap, with the auditor watching.
The trap they hit — and the reason this is worth writing down — was the post-cutover outage that had nothing to do with data. The link replicates the database, not the instance. Their batch settlement Agent jobs and the service-account logins simply were not on the MI, so the first overnight settlement run failed even though every row of customer data was intact and correct. The on-call DBA spent forty minutes chasing a “data corruption” ghost before realizing the data was fine and the login was missing — the app was getting 18456 Login failed, not a data error. They fixed it by scripting logins with matching SIDs and recreating the Agent jobs as an explicit pre-cutover task — and then encoded it as a gate in the runbook so it could never be skipped again:
-- Pre-cutover gate on the target MI: fail loudly if a required login is missing,
-- BEFORE anyone declares the migration "done".
DECLARE @required TABLE (name sysname);
INSERT INTO @required VALUES (N'app_settlement_svc'), (N'app_ledger_svc');
IF EXISTS (SELECT 1 FROM @required r
WHERE NOT EXISTS (SELECT 1 FROM sys.server_principals p WHERE p.name = r.name))
THROW 50001, 'Required login(s) missing on MI — do not cut over.', 1;
Once on Azure, they layered in the cross-region failover group (Standby secondary in West US 2 to keep the DR vCores license-free) and ran their first quarterly planned failover. It surfaced exactly one more gap — a linked server to a pricing service that, like the logins, lived only on the old primary — which they added to the pre-stage list. By the second drill the planned failover was a clean nine-minute end-to-end exercise: switch, app repoint, validate, communicate. Their measured RPO under healthy replication sat at ~3 seconds; their measured planned-failover RTO was ~9 minutes, well under the 1-hour design target. They left forced failover as a tabletop, because the decision — accept the replication lag as lost — is a business call, not a button to practice.
The incident-and-build as a timeline, because the order of moves is the lesson:
| Time / phase | Symptom or step | Action taken | Effect | What it taught |
|---|---|---|---|---|
| Rehearsal | LRS cutover > 15 min | Use Log Replay Service | Missed window every time | Final log restore doesn’t scale to 2 TB |
| Re-plan | Need online cutover | Switch to MI link (DAG) | Continuous sync while live | Link is the only true online path to BC |
| Cutover | 2 TB, 15-min cap | Stop app, confirm sync, fail over, repoint | < 4 min cutover | The cutover collapses when pre-seeded |
| +1 night | Settlement job failed | Chased “corruption” | 40 min lost | Data was fine; the login was missing |
| Fix | 18456 on app |
Script logins (matching SID) + jobs | App recovered | Replication moves DBs, not server state |
| Hardening | Make it un-skippable | Add login gate to runbook | Can’t cut over with missing logins | Encode the lesson as a hard gate |
| DR build | Cross-region posture | FOG with Standby secondary | License-free DR | Standby cuts SQL licensing on DR vCores |
| Drill 1 | First planned failover | Run quarterly drill | Found missing linked server | Drills surface the gaps cheaply |
| Drill 2 | Clean failover | Switch + repoint + validate | RTO ~9 min, RPO ~3 s | Measured numbers, not marketing |
The lesson generalizes to failover groups too: cross-region geo-replication and the link both move user databases and nothing else. Server-level state is your job, and it belongs in the runbook as a hard gate, not a hopeful afterthought.
Advantages and disadvantages
The two-primitive model — failover groups for MI↔MI DR, the link for hybrid — is powerful precisely because each is purpose-built, but that specialization is also where the sharp edges live. Weigh it honestly:
| Advantages (why this model helps you) | Disadvantages (why it bites) |
|---|---|
| Failover-group listeners survive the role switch — no connection-string change on failover | Listeners resolve VNet-local only; public-endpoint clients can’t use them, constraining app placement |
| Standby secondary makes a DR-only replica license-free, cutting the cost of cross-region DR | A Geo secondary you read from forfeits the license benefit — easy to misconfigure and overpay |
| The link delivers a true online migration to Business Critical, cutover in minutes | The link is one database per link; ten databases means ten links to build and operate |
Customer-managed (Manual) policy means no surprise failover on a transient region blip |
You must notice and decide — there’s no automatic safety net unless you opt into Automatic |
| Both primitives reuse familiar Always On machinery your DBAs already know | Async geo-replication means a forced failover can lose data equal to the replication lag |
| Zone redundancy + FOG + backups stack into three independent controls | Geo-replication is not a backup — a DROP TABLE replicates in seconds; you still need PITR |
| The DNS zone is set once and the cert “just works” after failover | The DNS zone is set at creation only — get it wrong and you re-provision the secondary |
| Two-way link enables hybrid DR where Azure or on-prem can be primary | Two-way failback needs SQL Server 2022/2025 and the matching update policy — decided at provisioning |
The model is right for any team that needs both an MI↔MI DR story and a hybrid SQL-Server path, and that is willing to operate listeners, pre-stage server state, and drill. It bites hardest on teams that treat geo-replication as a backup (it is not), that read from a Geo secondary they meant to keep license-free, that try to retrofit the DNS zone or the update policy after creation, or that never rehearse and discover the missing logins during a real outage. Every disadvantage is manageable — but only if you designed for it before provisioning, which is the entire point of this article.
Hands-on lab
Stand up a single Managed Instance, then add a failover group to a second instance, route through the listeners, and run a planned failover — observing each step. This lab provisions real Managed Instances, which are not free and take time to deploy; treat it as a costed exercise (see the cost note) and delete everything at the end. Run in Cloud Shell (Bash).
Reality check: a Managed Instance can take 1.5–4+ hours to provision the first time (the second in a subnet is faster). If you only want to read the flow without the spend, follow the commands and expected outputs without running them — the sequence is the lesson.
Step 1 — Variables and resource groups.
SUB=$(az account show --query id -o tsv)
RG1=rg-mi-eastus2; LOC1=eastus2
RG2=rg-mi-westus2; LOC2=westus2
az group create -n $RG1 -l $LOC1 -o table
az group create -n $RG2 -l $LOC2 -o table
Step 2 — Networking: a delegated subnet, NSG, and route table per region. (Abbreviated; create a VNet in each region with a /26 subnet delegated to Microsoft.Sql/managedInstances, an NSG, and a route table attached — see the Bicep in the networking section.) Then peer the two VNets globally, both directions:
az network vnet peering create -n peer-e2-to-w2 -g $RG1 --vnet-name vnet-eastus2 \
--remote-vnet "/subscriptions/$SUB/resourceGroups/$RG2/providers/Microsoft.Network/virtualNetworks/vnet-westus2" \
--allow-vnet-access true --allow-forwarded-traffic true -o table
# ...repeat reversed for peer-w2-to-e2.
Step 3 — Create the primary instance (General Purpose keeps the lab cheaper).
az sql mi create -g $RG1 -n mi-lab-eastus2 -l $LOC1 \
--subnet "/subscriptions/$SUB/resourceGroups/$RG1/providers/Microsoft.Network/virtualNetworks/vnet-eastus2/subnets/snet-mi" \
--capacity 4 --storage 32GB --edition GeneralPurpose --family Gen5 \
--license-type LicenseIncluded -o table
Expected: a long-running deploy; when done, state = Ready.
Step 4 — Create the secondary in the primary’s DNS zone. This is the step you cannot retrofit:
az sql mi create -g $RG2 -n mi-lab-westus2 -l $LOC2 \
--subnet "/subscriptions/$SUB/resourceGroups/$RG2/providers/Microsoft.Network/virtualNetworks/vnet-westus2/subnets/snet-mi" \
--capacity 4 --storage 32GB --edition GeneralPurpose --family Gen5 \
--license-type LicenseIncluded \
--dns-zone-partner "/subscriptions/$SUB/resourceGroups/$RG1/providers/Microsoft.Sql/managedInstances/mi-lab-eastus2" -o table
Step 5 — Create the failover group (customer-managed) on the secondary’s RG.
az sql instance-failover-group create -n fog-lab -g $RG2 -l $LOC2 \
--source-mi mi-lab-eastus2 --partner-mi mi-lab-westus2 \
--partner-resource-group $RG2 --failover-policy Manual --grace-period 1 -o table
Step 6 — Inspect the group and resolve both listeners.
az sql instance-failover-group show -n fog-lab -g $RG2 -l $LOC2 \
--query "{role:replicationRole, state:replicationState, partner:partnerRegion[0].location}" -o table
# Read-write and read-only CNAMEs live in the shared DNS zone:
nslookup fog-lab.<zone_id>.database.windows.net
nslookup fog-lab.secondary.<zone_id>.database.windows.net
Expected: role: Primary, state: CATCH_UP; both CNAMEs resolve to VNet-local IPs.
Step 7 — Run a planned failover and watch the role move.
az sql instance-failover-group set-primary -n fog-lab -g $RG2 -l $LOC2 -o table
# Re-query: the West US 2 instance should now be Primary.
az sql instance-failover-group show -n fog-lab -g $RG1 -l $LOC1 \
--query "{role:replicationRole, state:replicationState}" -o table
Expected: after a few minutes the role has switched; the read-write listener now resolves to the West US 2 instance — with no change to the listener FQDN.
Step 8 — Fail back (planned, data-lossless).
az sql instance-failover-group set-primary -n fog-lab -g $RG1 -l $LOC1 -o table
The lab steps mapped to what each proves:
| Step | What you did | What it proves | Real-world analogue |
|---|---|---|---|
| 4 | Secondary in primary’s DNS zone | The zone is a creation-time decision | The #1 re-provision trap |
| 5 | --failover-policy Manual |
Customer-managed timing | Production-recommended posture |
| 6 | Resolve both listeners | The listener contract is real | App connection-string design |
| 7 | Planned set-primary |
Failover is data-lossless and reversible | Quarterly drill |
| 8 | Planned failback | The round trip works | Post-outage recovery |
Cleanup (Managed Instances bill continuously).
az group delete -n $RG1 --yes --no-wait
az group delete -n $RG2 --yes --no-wait
Cost note. Two GP Gen5 4-vCore instances bill on the order of ₹30–60+ per hour combined (plus storage and any egress), and they bill while provisioning — this lab is not free-tier. Keep the window short, or read the flow without provisioning. Deleting both resource groups stops all charges.
Common mistakes & troubleshooting
This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest with the full confirm-command detail underneath.
| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | After failover, app still hits the dead region | Stale DNS cache / connection never reconnected | nslookup <fog>.<zone>.database.windows.net; check it resolves to new primary |
Retry-with-reconnect; MultiSubnetFailover=True; lower DNS TTL reliance |
| 2 | App gets 18456 Login failed on the new primary |
Login/SID not pre-staged on secondary | SELECT name,sid FROM sys.server_principals on new primary vs old |
CREATE LOGIN … SID=<match>; add to runbook gate |
| 3 | Overnight batch/ETL didn’t run after failover | SQL Agent jobs not replicated | SELECT name FROM msdb.dbo.sysjobs on new primary |
Script + recreate jobs on secondary as pre-stage |
| 4 | FOG create fails referencing the secondary | DNS zone / family / tier / collation mismatch | az sql mi show --query "{dns:dnsZone,fam:sku.family,tier:sku.tier}" both |
Re-create secondary in shared DNS zone, matching family/tier |
| 5 | Forced failover lost more data than expected | Secondary was behind (un-replicated lag) | FOG replicationState (not CATCH_UP at failover) |
Prefer planned; for drills, drain + confirm sync first |
| 6 | Reporting queries are hitting the primary | App used read-write listener / no ApplicationIntent |
Confirm Server= is the .secondary. FQDN + ApplicationIntent=ReadOnly |
Point reporting at the read-only listener |
| 7 | Can’t fail back over the link to SQL Server | MI on “Always-up-to-date” update policy | az sql mi show --query "servicePrincipal" / portal update policy |
Re-provision MI with SS 2022/2025 policy (creation-time) |
| 8 | Link won’t establish | Port 5022 blocked, or Windows auth attempted | Test 5022 across the path; confirm cert-based auth | Open 5022 in NSG + firewall; use certificate auth |
| 9 | MI provisioning fails immediately | Subnet not delegated / too small / no NSG+route | az network vnet subnet show --query "{deleg:delegations,nsg:networkSecurityGroup}" |
Delegate to Microsoft.Sql/managedInstances; /26+; attach NSG+route |
| 10 | Listeners don’t resolve from the app | App not in VNet/peered network (public-only) | nslookup from the app host fails |
Place app in VNet/peered net; listeners are VNet-local |
| 11 | A DROP TABLE is gone on the secondary too |
Geo-replication replicated the bad change | Compare to PITR; it’s not a backup | Restore from PITR/LTR — keep backups even with FOG |
| 12 | Standby secondary still incurs SQL licensing | Read workload running on a Geo/used Standby |
Check for connections via the read-only listener | Make it truly passive; set --secondary-type Standby |
| 13 | Role switch “stuck”; some DBs read-only | Switch still completing (up to ~5 min) | FOG replicationState; per-DB Updateability |
Wait out the switch; don’t force again mid-switch |
| 14 | After forced failover, old primary “reappeared” | It auto-rejoined as the new secondary | FOG show lists it as secondary |
Leave it; fail back with planned set-primary |
The expanded form, with the full reasoning for the entries that cause the longest outages:
1. After failover, the app keeps hitting the dead region.
Root cause: The client cached the old DNS resolution or never tore down its connection pool, so it keeps trying the failed primary even though the listener CNAME already points elsewhere.
Confirm: nslookup <fog>.<zone>.database.windows.net from the app subnet — it should resolve to the new primary’s VNet-local IP. If the app still fails, it’s the client, not DNS.
Fix: Implement aggressive retry-with-reconnect (rebuild the connection on transient/40613), add MultiSubnetFailover=True, and don’t rely on a long-lived DNS cache. Connections do not migrate on failover — reconnection is mandatory.
2. The app gets 18456 Login failed immediately after failover.
Root cause: The login (or its SID) was never pre-staged on the secondary. Replication moved the database; it did not move the server-level login, so the contained database user is orphaned and authentication fails.
Confirm: On the new primary, SELECT name, sid FROM sys.server_principals WHERE type IN ('S','U','G') and compare to the old primary — a missing name or a different SID is the smoking gun.
Fix: CREATE LOGIN app_svc WITH PASSWORD='…', SID=0x… using the SID copied from the original primary, and encode a login-existence gate in the runbook (see the scenario) so a failover can never complete with missing logins.
3. The overnight batch/ETL silently didn’t run after a failover.
Root cause: SQL Agent jobs are not replicated. The database is on the new primary, but the jobs that operate on it live in msdb, which is server-level state.
Confirm: SELECT name FROM msdb.dbo.sysjobs on the new primary — the expected jobs are absent.
Fix: Script the Agent jobs (and their schedules, credentials, and proxies) and recreate them on the secondary as an explicit pre-stage step; verify them in the runbook.
4. Creating the failover group fails when it references the secondary.
Root cause: The secondary does not satisfy a creation prerequisite — most often a DNS-zone mismatch (it wasn’t created in the primary’s zone), but also hardware-family, service-tier, or collation mismatch, or the secondary already has user databases.
Confirm: az sql mi show -n mi-prod-westus2 -g rg-data-westus2 --query "{dns:dnsZone, family:sku.family, tier:sku.tier, collation:collation}" and compare to the primary.
Fix: The DNS zone and several other properties are creation-time only — re-create the secondary in the shared DNS zone with a matching family/tier/collation and no user databases.
5. A forced failover lost more data than the team expected.
Root cause: The secondary was behind when forced — the un-replicated lag became the data loss. Async replication is always slightly behind; a write burst widens the gap.
Confirm: The FOG replicationState was not CATCH_UP at the moment of failover (or replication lag metrics were elevated).
Fix: Prefer planned failover whenever the primary is reachable. For a data-lossless drill, stop the workload, drain long transactions, and confirm Synchronizing/CATCH_UP on both instances before forcing.
7. You can fail over to MI but cannot fail back to SQL Server over the link. Root cause: The MI runs the “Always-up-to-date” update policy, which does not support two-way (failback) replication over the link. Failback requires the SQL Server 2022 or 2025 update policy, chosen at provisioning. Confirm: Check the instance’s update policy in the portal or via the management API. Fix: The update policy is creation-time — re-provision the MI with the matching SQL Server 2022/2025 policy if hybrid failback is a requirement. Decide this before you build.
9. Managed Instance provisioning fails almost immediately.
Root cause: The subnet is not delegated to Microsoft.Sql/managedInstances, is too small (below /27), or is missing the required NSG/route table association.
Confirm: az network vnet subnet show -g rg-net -n snet-mi --vnet-name vnet --query "{prefix:addressPrefix, deleg:delegations[].serviceName, nsg:networkSecurityGroup.id, route:routeTable.id}".
Fix: Delegate the subnet to Microsoft.Sql/managedInstances, size it /26 or larger, and attach an NSG and a route table (keep the platform’s mandatory rules/routes).
11. A DROP TABLE (or bad deployment) is gone on the secondary too.
Root cause: Geo-replication is not a backup. A destructive change replicates to the secondary in seconds, so failing over does not recover it.
Confirm: The object is missing on both primary and secondary; only PITR/LTR has the pre-change state.
Fix: Restore from point-in-time restore or long-term retention. Keep automated backups and an appropriate retention policy even with a perfect failover group — they are the only defense against a replicated bad change.
Best practices
- Layer the controls; don’t pick one. Zone redundancy and a cross-region failover group and backups — three independent defenses against three different failures. Removing any one re-opens a gap.
- Set the DNS zone at creation. Always create the secondary in the primary’s DNS zone via
--dns-zone-partner. It cannot be retrofitted; getting it wrong means re-provisioning. - Use
--failover-policy Manualin production. Customer-managed timing means no surprise failover on a transient region blip — you decide when continuity events happen. - Connect applications to listeners, never to a regional instance FQDN. The read-write and read-only listeners survive the role switch; an instance FQDN pins you to a region.
- Build retry-with-reconnect into the data layer. Connections do not migrate on failover; aggressive reconnection (and
MultiSubnetFailover=True) is the difference between a blip and an outage. - Pre-stage server-level state as a hard gate. Logins with matching SIDs, Agent jobs, linked servers, credentials,
msdbobjects — replication ignores all of them. Encode an existence check in the runbook so a failover can’t “succeed” with missing logins. - Keep backups even with geo-replication. A failover group replicates a
DROP TABLEin seconds; only PITR/LTR defends against a bad change. Geo-replication is not a backup. - Use a Standby secondary for DR-only.
--secondary-type Standbymakes the passive replica license-free — but only if you never run read workloads on it. - Choose the update policy before provisioning if hybrid failback matters. Two-way link failback requires the SQL Server 2022/2025 policy; “Always-up-to-date” can never fail back over the link.
- Drill quarterly in production. Planned failover is data-lossless and reversible — rehearse it on a schedule, and tabletop the forced-failover decision. The first failover should never be a real outage.
- Measure RPO/RTO; don’t quote marketing numbers. Record the actual lag and the actual end-to-end recovery time from your own drills, and design to those.
- Fail over the app tier too. A database that moves alone leaves the app paying cross-region latency to its now-distant tier. Repoint or warm-start the app in the secondary region.
- Alert on replication state and lag, not just “instance down” — a secondary silently falling behind is your RPO quietly degrading.
Security notes
- Encrypt every connection. Managed Instance requires encrypted connections; keep
Encrypt=True(and a sane minimum TLS version) on every connection string, including the listeners. A failover drill is no excuse to weaken TLS. - Keep the listeners VNet-local. The failover-group listeners resolve only on the VNet-local endpoint — do not expose the database via the public endpoint (port 3342) unless you have a hard requirement, and if you do, restrict it tightly. Private is the default for a reason.
- Use Microsoft Entra authentication where you can. Prefer Entra-based logins and managed identity for applications over SQL logins, so credential rotation and conditional access apply — and so failover doesn’t hinge on a hand-synced SQL login password. Where SQL logins are unavoidable, pre-stage them with matching SIDs and manage the secret in Key Vault.
- Protect the link’s transport. The DAG rides certificate-authenticated, AES-encrypted database mirroring over port 5022 — scope the NSG and on-prem firewall to the exact peers, and rotate the certificates that establish trust.
- Encrypt at rest with customer-managed keys where required. TDE is on by default; if compliance demands CMK, ensure the key is available in both regions so a failover doesn’t strand the secondary without its key — a cross-region key gap turns a failover into an outage.
- Least-privilege the failover operators. The identity that runs
set-primaryand manages the failover group should have exactly the SQL/management roles it needs and no more; audit who can trigger a region failover. - Don’t leak topology in errors. Connection failures during a switch (
40613,18456) should be handled by the app’s retry layer, not surfaced verbatim to end users.
The security controls that also strengthen continuity — secure and resilient pull the same direction here:
| Control | Mechanism | Secures against | Also strengthens continuity |
|---|---|---|---|
| Encrypted connections | Encrypt=True, min TLS |
Eavesdropping / downgrade | — |
| VNet-local listeners | No public endpoint exposure | Internet attack surface | Forces correct (private) app placement |
| Entra auth + managed identity | Token-based logins | Static credential theft | Removes hand-synced login passwords from failover |
| Certificate-auth link (5022) | AES-encrypted DAG | Tampered replication stream | Reliable, authenticated hybrid path |
| CMK present in both regions | Cross-region key availability | Unauthorized key use | Prevents a key gap stranding the secondary |
| Least-privilege failover role | Scoped RBAC | Unauthorized region failover | Controlled, auditable continuity actions |
Cost & sizing
The bill drivers and how they interact with the design:
- You pay for two instances, not one. A cross-region failover group means a second full Managed Instance running in the partner region — the dominant line item. A geo-secondary that mirrors a Business Critical primary is a substantial recurring cost.
- Standby secondary cuts licensing, not compute.
--secondary-type Standbymakes the DR replica’s vCores license-free (you stop paying SQL Server licensing on them) as long as it carries no read workload — a meaningful saving on a DR-only secondary. Run a single reporting query against it and you forfeit the benefit. - Tier multiplies the secondary’s cost. A Business Critical secondary (local SSD, 4 replicas, licensing) costs far more than a General Purpose one. If the secondary is purely DR, weigh whether it must match the primary’s tier.
- Zone redundancy is a modest uplift, not a new resource. Enabling it adds cost on the tiers that support it, but it is the cheapest cross-AZ resilience you can buy and is a property toggle, not a second instance.
- Backups have a free allotment, then bill. PITR storage up to your configured retention has an included allotment tied to your provisioned storage; long-term retention (LTR) is billed separately by volume. Cross-region restore and egress add cost. This is cheap insurance against the bad-change failure geo-replication can’t cover.
- The link itself has no direct charge, but you pay for the MI it lands on, the SQL Server side, and any cross-region/hybrid egress over the path.
A rough monthly picture and what each line buys you (figures are indicative; validate with the pricing calculator for your region and SKU):
| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
|---|---|---|---|---|
| Primary MI (GP Gen5, 8 vCore) | Compute + 256 GB storage | ~₹1.5–2.5 L | The production instance | Right-size vCores to real load |
| Primary MI (BC Gen5, 8 vCore) | Compute + local SSD + licensing | ~₹3–4.5 L | Local HA, free read replica, low latency | Licensing dominates BC cost |
| Geo-secondary (Standby, GP) | Compute, license-free | ~₹1–1.5 L | Cross-region DR, no SQL licensing | Reads on it forfeit the benefit |
| Geo-secondary (Geo, BC) | Full BC + licensing | ~₹3–4.5 L | DR + read offload | Most expensive secondary option |
| Zone redundancy | Cross-AZ uplift | +10–30% of instance | Survive a single-zone outage | Only where the tier supports it |
| PITR (within allotment) | Backup storage | Often included | Bad-change recovery | LTR is billed on top |
| Long-term retention (LTR) | Retained backup volume | Per-GB, varies | Months/years of restore points | Grows with retention + size |
| Cross-region egress | Replication/restore traffic | Per-GB, modest | The DR path itself | Write-heavy workloads cost more |
Sizing rules of thumb: the secondary’s vCores/storage must be ≥ the primary’s, so you cannot save by under-provisioning the secondary; you can save by making a DR-only secondary Standby (license-free) and, where the risk allows, a lower tier than the primary. Measure your replication lag in a drill before assuming Geo reads are “free” capacity — and never delete backups to save money on a system whose only defense against a replicated DROP TABLE is PITR.
Interview & exam questions
1. What is the difference between a failover group and the Managed Instance link, and can you use both on one instance? A failover group geo-replicates all user databases MI→MI to a partner instance in another region with a managed listener that survives failover. The Managed Instance link builds a distributed availability group between SQL Server and MI to replicate one database for hybrid migration or DR. They are mutually exclusive on the same instance — you choose the primitive that matches the job (MI↔MI DR versus hybrid SQL-Server replication).
2. Why must the secondary instance be created in the primary’s DNS zone, and can you change it later? A Managed Instance’s host certificate is a SAN cert scoped to its generated DNS zone ID. For a client to reconnect to the secondary using the same listener cert after failover, the secondary must share that DNS zone. It is set at creation time via --dns-zone-partner and cannot be retrofitted — getting it wrong means re-provisioning the secondary.
3. A failover completed and every customer row is correct, but the app is down with login failures. What happened? Replication moves user databases only — logins, SIDs, Agent jobs, linked servers, and msdb objects are not replicated. The database arrived; the service-account login did not (or its SID doesn’t match), so authentication fails (18456) and contained users are orphaned. Pre-stage logins with matching SIDs and recreate jobs on the secondary, gated in the runbook.
4. When do you run a planned failover versus a forced one, and what’s the data-loss implication of each? Planned (set-primary, no flag) fully synchronizes then switches — no data loss, but requires the primary reachable; use it for drills, relocations, and failback. Forced (--allow-data-loss) promotes the secondary immediately without waiting — potential data loss equal to the un-replicated async lag; it’s your only option when the primary region is gone.
5. Your applications connect fine but a region failover would require changing connection strings. What’s wrong and how do you fix it? The apps are connecting to a regional instance FQDN instead of the failover-group listener. Point read-write (OLTP) traffic at <fog>.<zone>.database.windows.net and read-only (reporting) at <fog>.secondary.<zone>.database.windows.net with ApplicationIntent=ReadOnly. The listener CNAME follows the primary role, so no connection-string change is needed on failover.
6. What does --secondary-type Standby do, and what’s the catch? It designates a DR-only, license-free passive replica — you stop paying SQL Server licensing on the secondary’s vCores. The catch: the benefit holds only if you never run read workloads on it. Run a single reporting query through the read-only listener and you forfeit the license-free status (that’s what a Geo secondary is for).
7. You migrated a 2 TB database to MI with the link and need the lowest possible cutover downtime. Why is the link the right tool? The link keeps the MI continuously seeded and synchronized via a distributed availability group while production stays live on SQL Server, so the cutover collapses to: stop the app, confirm the DAG is SYNCHRONIZED, fail over to MI, repoint the app. It’s the only option that delivers a true online cutover to the Business Critical tier, with downtime in minutes rather than a full restore window.
8. A team built an MI on the “Always-up-to-date” update policy and now needs to fail back to on-prem SQL Server over the link. Can they? No. Two-way (failback) replication over the link is not supported on “Always-up-to-date” — it requires the SQL Server 2022 or 2025 update policy, which is a creation-time choice. They can fail over to MI but never back over the link, and the only fix is to re-provision the instance with the matching policy.
9. Why is geo-replication not a substitute for backups? A failover group (and the link) replicate changes — including destructive ones — to the secondary in seconds. A DROP TABLE or a bad deployment is on both replicas almost instantly, so failing over does not recover it. Point-in-time restore / LTR is the only layer that defends against a replicated bad change; keep backups even with a perfect failover group.
10. What are the networking prerequisites for a Managed Instance and a cross-region failover group? A dedicated subnet delegated to Microsoft.Sql/managedInstances (nothing else in it), /27 minimum (plan /26+), with a required NSG and route table attached. For a failover group, the two regional subnets need connectivity — global VNet peering is recommended (private, backbone, no gateway); VPN/ExpressRoute also work but are slower to seed.
11. After a forced failover during an outage, the old primary came back online. What do you do? Leave it. Once the failed region recovers, the old primary automatically rejoins as the new secondary — do not delete it in a panic. When the region is fully healthy, run a planned set-primary (data-lossless) to fail back to the original layout.
12. How do you achieve read scale-out and DR with the same failover group? Use the read-only listener (<fog>.secondary.<zone>.database.windows.net with ApplicationIntent=ReadOnly) to route reporting/analytics to the geo-secondary while OLTP uses the read-write listener. Note this makes the secondary a Geo (read) secondary, so you pay SQL licensing on it — a DR-only secondary that you want license-free must be Standby and unread.
These map to DP-300 (Azure Database Administrator Associate) — plan and implement data platform resources, high availability and disaster recovery for Azure SQL — and the BC/DR design content in AZ-305 (Solutions Architect Expert). The migration angle (link, LRS, DMS) touches DP-203/migration-focused material. A compact cert-mapping for revision:
| Question theme | Primary cert | Exam objective area |
|---|---|---|
| Failover groups, listeners, planned vs forced | DP-300 | Implement HA/DR for Azure SQL |
| MI link, migration cutover, update policy | DP-300 | Migrate; configure MI |
| Layered BC/DR design, RPO/RTO | AZ-305 | Design business continuity solutions |
| Networking (delegated subnet, peering) | AZ-305 / AZ-700 | Design/implement network connectivity |
| Backups vs geo-replication, PITR/LTR | DP-300 | Manage backups and recovery |
| Standby secondary, licensing, cost | DP-300 / AZ-305 | Optimize cost; HA/DR |
Quick check
- You need cross-region DR for a Managed Instance and a hybrid path to replicate one database from on-prem SQL Server. Can a single instance use both a failover group and the link?
- A failover completed, all data is correct, but the app fails with
18456 Login failed. What was not replicated, and how do you prevent it next time? - The primary region is completely down. Which failover command do you run, and what is the data-loss implication?
- Your apps connect to
mi-prod-eastus2.<zone>.database.windows.net. Why is that a problem for failover, and what should they connect to instead? - True or false: because you have a cross-region failover group, you can safely reduce your backup retention to save money.
Answers
- No. A single instance can host either a failover group (MI↔MI geo-replication of all user databases) or the Managed Instance link (a DAG to SQL Server for one database) — they are mutually exclusive on the same instance. Use a separate instance, or choose the primitive that matches the job.
- Server-level state — specifically the login (or its SID) — was not replicated; replication moves user databases only. Pre-stage the login with a matching SID on the secondary (
CREATE LOGIN … SID=0x…) and add a login-existence check as a hard gate in the failover runbook so a failover can’t complete with missing logins. - Run a forced failover:
az sql instance-failover-group set-primary … --allow-data-loss. It promotes the secondary immediately without waiting for in-flight transactions, so you may lose data equal to the un-replicated async lag (~5 s under healthy replication). It’s the only option when the primary is unreachable. - That is a regional instance FQDN, which pins the app to one region and does not move on failover — after a switch it points at a dead/secondary instance. Connect to the failover-group listeners instead: the read-write listener (
<fog>.<zone>.database.windows.net) for OLTP and the read-only listener (<fog>.secondary.<zone>.database.windows.net,ApplicationIntent=ReadOnly) for reporting. They follow the primary role automatically. - False. Geo-replication is not a backup — it replicates a
DROP TABLEor bad deployment to the secondary in seconds, so failing over can’t recover a bad change. PITR/LTR is the only defense against that failure class; keep an appropriate retention policy even with a perfect failover group.
Glossary
- Managed Instance (MI) — a near-100%-compatible SQL Server instance delivered as PaaS, deployed into a delegated subnet in your VNet.
- General Purpose — the tier that separates compute from remote storage (3-copy durable); recovery is a compute remount, with no local readable replica.
- Business Critical — the tier that runs a local 4-node Always On availability group on local SSD with a free read-only replica; local failover in seconds.
- Zone redundancy — an instance property that spreads replicas (BC) or storage (GP, where supported) across availability zones to survive a single-zone outage.
- Auto-failover group (FOG) — a managed construct that geo-replicates all user databases MI→MI to a partner instance in another region, with listeners that survive failover.
- DNS zone — the SAN-cert-scoped zone ID a Managed Instance is created with; the secondary must share it so the listener cert validates after failover (creation-time only).
- Read-write listener — the CNAME (
<fog>.<zone>.database.windows.net) that always resolves to the current primary; the OLTP endpoint. - Read-only listener — the CNAME (
<fog>.secondary.<zone>.database.windows.net) that resolves to the current secondary; the read-offload endpoint, used withApplicationIntent=ReadOnly. - Managed Instance link — a distributed availability group between a SQL Server instance and an MI, replicating one database in near-real-time for hybrid migration or DR.
- Distributed availability group (DAG) — the Always On construct spanning two availability groups that the link rides over, carried on the database mirroring port (5022).
- Update policy — the SQL-engine versioning mode of an MI (Always-up-to-date, or SQL Server 2022/2025 policy); two-way link failback requires the SQL Server 2022/2025 policy, set at creation.
- Planned failover —
set-primarywith no flag: fully synchronizes then switches roles (no data loss; requires the primary reachable). - Forced failover —
set-primary --allow-data-loss: promotes the secondary immediately (possible data loss equal to the un-replicated lag); for a real region outage. - RPO (Recovery Point Objective) — the data you can afford to lose; for MI geo-failover, ~5 s of un-replicated async lag under healthy replication.
- RTO (Recovery Time Objective) — the time to recover; the role switch is minutes, but end-to-end recovery (~1 h target) is gated by app and dependency failover.
- Standby secondary —
--secondary-type Standby: a DR-only, license-free passive replica; the license benefit holds only if no read workload runs on it. - Customer-managed (
Manual) failover policy — the recommended posture where you decide when to fail over, rather than Microsoft triggering it automatically. - Point-in-time restore (PITR) — restoring a database to a moment within the configured retention; the only defense against a replicated bad change (geo-replication is not a backup).
Next steps
You can now design, build, route, and drill a Managed Instance business-continuity solution. Build outward:
- Next: Azure SQL Database Advanced Patterns: Hyperscale, Elastic Pools, Ledger, and Always Encrypted with Secure Enclaves — the platform and advanced-features layer beneath these continuity controls.
- Related: Active-Active Multi-Region on Azure: Building for RTO Near Zero — push beyond active-passive to an active-active posture across regions.
- Related: Azure Site Recovery for IaaS: Zone-to-Zone and Region Failover with Recovery Plans — the IaaS-side runbook discipline that mirrors this one for VMs.
- Related: Troubleshooting Azure SQL Database: Connectivity, Timeouts, Throttling & Blocking — when the symptom is connectivity rather than continuity.
- Related: Private Endpoints and Private DNS at Scale: A Hub-and-Spoke Resolution Architecture — the private resolution patterns the VNet-local listeners depend on.
- Related: Eliminating Secrets in Azure: Key Vault, Managed Identity, and Automated Rotation — manage the SQL-login secrets you pre-stage so a failover never hinges on a hand-synced password.