Patching Failover Clusters with Cluster-Aware Updating and Stretch Clusters via Storage Replica

Patching a failover cluster by hand is how you turn a routine Tuesday into a quorum-loss incident. Cluster-Aware Updating (CAU) exists precisely so a human never has to remember to drain a node, suspend it, patch it, reboot, resume, and only then move to the next one. Get CAU right and a 16-node cluster patches itself overnight with every clustered role staying online. The second half of this article extends that same cluster across two datacenters: a Storage Replica stretch cluster that keeps a synchronous, block-level copy of your volumes in a second site, so a metro failure costs you a failover, not a restore. CAU ships in both Standard and Datacenter editions; Storage Replica is unrestricted only in Datacenter (from Windows Server 2019, Standard includes a limited version capped at a single 2 TB volume). The commands below target Windows Server 2022 and apply cleanly to 2019 and 2025.

1. CAU updating modes: self-updating versus remote-updating

CAU runs as an Update Coordinator that walks the cluster one node at a time. There are two ways to run it, and the choice shapes your whole patch pipeline.

Remote-updating mode. The coordinator runs outside the cluster - on an admin workstation or a management server - and drives an Update Run on demand. Nothing is installed in the cluster itself. This is the mode for ad-hoc, attended patching and for testing run profiles before you commit to automation.
Self-updating mode. You add a clustered CAU role (Microsoft.ClusterAwareUpdating) to the cluster. The role schedules itself and, when the window opens, one node temporarily hosts the coordinator, patches every other node, then hands the coordinator role to an already-patched node so the last one can be updated too. Fully unattended, no external box required.

The mechanism is identical in both modes - the difference is who owns the schedule. My default is self-updating for steady-state monthly patching, with a remote-updating run kept in the toolbox for emergencies and for dry runs.

Install the tooling on every node and on the management host:

$nodes = 'clu-n1','clu-n2','clu-n3','clu-n4'

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name RSAT-Clustering-PowerShell, FS-FileServer
}

# CAU UI + cmdlets on the management server (Install-WindowsFeature is Windows Server only):
Install-WindowsFeature -Name RSAT-Clustering-Mgmt, RSAT-Clustering-PowerShell

Confirm the cluster is actually ready to be updated by CAU. This single command is the best pre-flight check there is - it verifies WinRM, the firewall rules, the required PowerShell remoting, and that a self-updating role is installable:

Test-CauSetup -ClusterName clu-prod

Resolve every warning before going further. The most common failure is that the Remote Shutdown firewall rule group is not enabled on the nodes, which CAU needs in order to reboot them remotely. Passing -EnableFirewallRules to Add-CauClusterRole or Invoke-CauRun turns it on for you.

2. Configuring the CAU role, run profiles, and pre/post scripts

For self-updating, register the clustered role and bake the schedule, plugin, and options into it. The plugin Microsoft.WindowsUpdatePlugin pulls from whatever the node points at - WSUS, Windows Update for Business, or Microsoft Update.

Add-CauClusterRole -ClusterName clu-prod `
    -DaysOfWeek Sunday `
    -WeeksOfMonth @(2,4) `
    -MaxFailedNodes 1 `
    -MaxRetriesPerNode 2 `
    -RequireAllNodesOnline `
    -CauPluginName Microsoft.WindowsUpdatePlugin `
    -CauPluginArguments @{ 'IncludeRecommendedUpdates' = 'False' } `
    -PreUpdateScript  '\\fs01\cau$\pre-update.ps1' `
    -PostUpdateScript '\\fs01\cau$\post-update.ps1' `
    -StartDate '2026-06-14 02:00:00' `
    -Force

The knobs that matter:

Parameter	Why it matters
`MaxFailedNodes`	Hard stop. If more than N nodes fail to update, CAU aborts the run rather than chewing through the whole cluster. Keep it low - 1 for small clusters.
`MaxRetriesPerNode`	Transient WU failures are common; 2-3 retries absorb them without manual intervention.
`RequireAllNodesOnline`	Refuses to start if a node is already down. Prevents patching into a degraded quorum.
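`PreUpdateScript` / `PostUpdateScript`	Per-node hooks, not once per Update Run. The pre-update script runs on each node before it enters maintenance mode; the post-update script runs after the node has been updated and has left maintenance mode. Quiesce backups, page on-call, or validate health here.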

A pre-update script worth its salt blocks patching if the cluster is unhealthy. CAU runs it on each node before draining that node; if it fails, the node is not updated and counts toward MaxFailedNodes:

# pre-update.ps1 - fail before patching if anything is already unhealthy
$bad = Get-ClusterNode | Where-Object State -ne 'Up'
if ($bad) {
    Write-Error "Nodes not Up before patching: $($bad.Name -join ', ')"
    exit 1
}
# Storage Replica must be healthy before we start rebooting nodes.
# ReplicationStatus is read per replica, from each group's Replicas collection.
$lag = (Get-SRGroup).Replicas |
       Where-Object ReplicationStatus -ne 'ContinuouslyReplicating'
if ($lag) { Write-Error 'Storage Replica not in sync; aborting.'; exit 1 }
exit 0

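Keep the scripts in source control and deploy them to a highly available share that every node can read. CAU runs them on the nodes, not under your interactive account, so grant read access to the nodes’ computer accounts rather than to a user, and lock down write access so nobody can tamper with what runs during patching.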

3. Draining roles, fault domains, and update ordering

The reason CAU is safe is the drain. Before patching a node, CAU puts it into maintenance mode and live-migrates or fails over every clustered role off it - exactly what Suspend-ClusterNode -Drain does interactively. Nothing on that node is serving traffic when the reboot hits.

CAU’s ordering is availability-aware: by default it updates the node hosting the fewest roles first and, critically, respects the cluster’s quorum so it never suspends a node that would break the majority. On a Storage Spaces Direct or stretch cluster, CAU also waits for storage to finish resyncing before moving on - it will not pull a second node while volumes are still rebuilding from the first.

You can pin the order. On a stretch cluster you almost always want to patch one site fully, confirm it is healthy, then patch the other. Never interleave sites: CAU still takes only one node down at a time, but interleaving leaves both sites alternately a node short for the entire run, so a metro link blip at any point lands on a site that is already degraded:

# Patch Site A nodes first, then Site B, never interleaved
Invoke-CauRun -ClusterName clu-prod `
    -CauPluginName Microsoft.WindowsUpdatePlugin `
    -NodeOrder clu-n1,clu-n2,clu-n3,clu-n4 `
    -MaxFailedNodes 1 -MaxRetriesPerNode 2 `
    -RequireAllNodesOnline -Force   # Invoke-CauRun has no -Wait; it blocks until the run ends
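
If the node list is long or changes often, the site-by-site order can be derived rather than typed. A minimal sketch, assuming you maintain a node-to-site map yourself; Get-SiteOrderedNode is a hypothetical helper, not a CAU cmdlet, and the map mirrors this article's four-node example:

```powershell
# Hypothetical helper: returns node names grouped by site, in the site order given.
function Get-SiteOrderedNode {
    param(
        # Node name -> site name (maintained by you; not read from the cluster here)
        [Parameter(Mandatory)] [hashtable] $NodeSite,
        # Sites in the order they should be patched
        [Parameter(Mandatory)] [string[]]  $SiteOrder
    )
    foreach ($site in $SiteOrder) {
        # Sort within a site so the result is deterministic
        $NodeSite.Keys | Where-Object { $NodeSite[$_] -eq $site } | Sort-Object
    }
}

$nodeSite = @{
    'clu-n1' = 'SiteA'; 'clu-n2' = 'SiteA'
    'clu-n3' = 'SiteB'; 'clu-n4' = 'SiteB'
}
$order = Get-SiteOrderedNode -NodeSite $nodeSite -SiteOrder 'SiteA', 'SiteB'
$order -join ','   # clu-n1,clu-n2,clu-n3,clu-n4
```

The resulting array can be passed straight to -NodeOrder on Invoke-CauRun.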

Verify a drain by hand before trusting automation:

Suspend-ClusterNode -Name clu-n1 -Drain -Wait
Get-ClusterNode -Name clu-n1 | Get-ClusterGroup   # should be empty
Resume-ClusterNode -Name clu-n1 -Failback Immediate

CAU never reboots two nodes at once. The whole design is serial, drain-first, quorum-respecting. If you ever see two nodes down during a CAU run, something is wrong - stop and investigate; do not assume CAU did it.
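
That invariant is easy to watch for from a monitoring script while a run is in flight. A rough sketch, with Test-SerialUpdateInvariant as a hypothetical name; it only inspects the Name and State properties of whatever node objects you pass in, so a draining node (Paused) counts as not Up:

```powershell
# Hypothetical helper: true while at most one node is in any state other than Up.
function Test-SerialUpdateInvariant {
    param([Parameter(Mandatory)] [object[]] $Node)
    # Stringify State so the comparison works for enums and plain strings alike
    $notUp = @($Node | Where-Object { "$($_.State)" -ne 'Up' })
    [pscustomobject]@{
        NotUp = $notUp.Name -join ', '
        Holds = $notUp.Count -le 1
    }
}

# Stand-in objects; on a cluster node you would pass (Get-ClusterNode) instead.
$sample = @(
    [pscustomobject]@{ Name = 'clu-n1'; State = 'Paused' }
    [pscustomobject]@{ Name = 'clu-n2'; State = 'Up' }
    [pscustomobject]@{ Name = 'clu-n3'; State = 'Up' }
    [pscustomobject]@{ Name = 'clu-n4'; State = 'Up' }
)
Test-SerialUpdateInvariant -Node $sample
```

If Holds ever comes back false during a run, treat it as the stop-and-investigate signal described above.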

4. Storage Replica modes: synchronous, asynchronous, and log sizing

Storage Replica (SR) does block-level, volume-to-volume replication beneath the filesystem. Two modes, and the distinction is the whole disaster-recovery conversation:

Synchronous. The write is not acknowledged to the application until it is committed to the SR log on both the source and destination. Zero data loss - RPO = 0. The price is latency: every write pays the round-trip to the second site, so you need roughly <= 5 ms round-trip and fat, low-jitter links. This is the mode for a metro stretch cluster.
Asynchronous. The write acks locally and ships to the destination shortly after. Survives long-distance, high-latency links, but your RPO is non-zero - you can lose the in-flight writes. Use it across regions, not across a metro.
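
Microsoft states the synchronous latency requirement as an average, so it helps to check a set of measured samples against the budget rather than a single ping. A small sketch, assuming you have already collected round-trip times in milliseconds between the sites with whatever tool you trust; Test-SyncLatencyBudget is a hypothetical helper and the sample values are made up:

```powershell
# Hypothetical helper: summarises RTT samples against a latency budget (default 5 ms).
function Test-SyncLatencyBudget {
    param(
        [Parameter(Mandatory)] [double[]] $RttMs,
        [double] $BudgetMs = 5
    )
    $stats = $RttMs | Measure-Object -Average -Maximum
    [pscustomobject]@{
        AverageMs    = [math]::Round($stats.Average, 2)
        MaximumMs    = $stats.Maximum
        WithinBudget = $stats.Average -le $BudgetMs
    }
}

# Illustrative samples: mostly fine, with one spike
Test-SyncLatencyBudget -RttMs 2.1, 2.4, 6.0, 2.3
```

The maximum is reported alongside the average because a link that averages well can still spike past the budget, and every synchronous write issued during the spike pays for it.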

The log volume is the heart of SR and the part people under-provision. Every write lands in the log first, then drains to the data volume. If the log fills - because the destination fell behind or the link saturated - replication stalls. Microsoft’s guidance is at least 8 GB of log per data volume, sized up for write-heavy workloads. Put the log on the fastest storage you have (NVMe/SSD); log latency is in the critical path of every synchronous write.
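
To get a feel for why the log is under-provisioned so often, treat it as a buffer that fills at the write rate and empties at the rate the link can ship data. This is a back-of-envelope model under that assumption only, not how SR accounts for log space internally; Get-SRLogHeadroomSeconds and the rates are illustrative:

```powershell
# Hypothetical helper: seconds until a log of the given size fills, assuming it
# fills at (write rate - drain rate). Returns infinity if the link keeps up.
function Get-SRLogHeadroomSeconds {
    param(
        [Parameter(Mandatory)] [double] $LogSizeBytes,
        [Parameter(Mandatory)] [double] $WriteBytesPerSec,
        [Parameter(Mandatory)] [double] $DrainBytesPerSec
    )
    $net = $WriteBytesPerSec - $DrainBytesPerSec
    if ($net -le 0) { return [double]::PositiveInfinity }
    $LogSizeBytes / $net
}

# Made-up burst: 200 MB/s of writes against a link that ships 120 MB/s
Get-SRLogHeadroomSeconds -LogSizeBytes 8GB  -WriteBytesPerSec 200MB -DrainBytesPerSec 120MB
Get-SRLogHeadroomSeconds -LogSizeBytes 32GB -WriteBytesPerSec 200MB -DrainBytesPerSec 120MB
```

With these made-up rates an 8 GB log absorbs the burst for about 102 seconds, and a 32 GB log for roughly 410 seconds. Feed it measured rates from your own workload rather than guesses.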

Two iron rules SR enforces:

Log and data volumes must be GPT, not MBR.
The destination data volume must be identical in size to the source, and the destination volume is dismounted and inaccessible while it is a replication target - you cannot read it on the passive side. This surprises people: the second site’s copy is not a live, browsable share.
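
Both rules can be checked before you ever call an SR cmdlet. A minimal sketch, assuming you describe each data volume as an object with Name, PartitionStyle, and SizeBytes; Test-SRVolumePair is a hypothetical helper, the descriptors here are hand-built, and on real hosts you would populate them from Get-Disk and Get-Partition. Microsoft's documentation asks for data volumes of identical size, so the sketch checks equality:

```powershell
# Hypothetical helper: returns a list of problems; empty output means both checks pass.
function Test-SRVolumePair {
    param(
        [Parameter(Mandatory)] $Source,
        [Parameter(Mandatory)] $Destination
    )
    $problems = @()
    foreach ($v in @($Source, $Destination)) {
        if ($v.PartitionStyle -ne 'GPT') {
            $problems += "$($v.Name) is $($v.PartitionStyle), SR requires GPT"
        }
    }
    if ($Source.SizeBytes -ne $Destination.SizeBytes) {
        $problems += "Data volume sizes differ: $($Source.SizeBytes) vs $($Destination.SizeBytes)"
    }
    $problems
}

# Illustrative descriptors: the destination was initialised as MBR by mistake
$src = [pscustomobject]@{ Name = 'SiteA-Data'; PartitionStyle = 'GPT'; SizeBytes = 2TB }
$dst = [pscustomobject]@{ Name = 'SiteB-Data'; PartitionStyle = 'MBR'; SizeBytes = 2TB }
Test-SRVolumePair -Source $src -Destination $dst
```

This is only a cheap early filter; it does not replace a full topology validation on the real servers.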

# Validate a server pair can do Storage Replica before committing
Test-SRTopology -SourceComputerName clu-n1 -SourceVolumeName D: -SourceLogVolumeName E: `
    -DestinationComputerName clu-n3 -DestinationVolumeName D: -DestinationLogVolumeName E: `
    -DurationInMinutes 30 -ResultPath \\fs01\sr$\report

Test-SRTopology runs a real workload sample and emits an HTML report with measured latency and a recommended log size. Run it during a busy period - sizing on idle numbers is how you fill a log in production.

5. Building a two-site stretch cluster with replicated volumes

A stretch cluster is a single failover cluster whose nodes live in two sites, with SR keeping each site’s storage in sync. The shape: two nodes in Site A, two in Site B, asymmetric storage (each site has its own disks, visible only to that site’s nodes - nothing is shared across sites), and SR replicating Site A’s volumes to Site B.

First, install the feature everywhere and tell the cluster which node lives where using fault domains and sites. Site awareness is what makes the cluster keep roles local, fail over within a site before crossing the WAN, and order placement sensibly:

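$all = 'clu-n1','clu-n2','clu-n3','clu-n4'
Invoke-Command -ComputerName $all -ScriptBlock {
    # No -Restart here: rebooting every node at once would take the cluster down
    Install-WindowsFeature -Name Storage-Replica, FS-FileServer -IncludeManagementTools
}
# Storage-Replica needs a reboot to finish installing: drain and restart nodes one at a time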

# Define sites and assign nodes (fault-domain awareness)
New-ClusterFaultDomain -Name SiteA -Type Site -Location 'DC-East'
New-ClusterFaultDomain -Name SiteB -Type Site -Location 'DC-West'
Set-ClusterFaultDomain -Name clu-n1 -Parent SiteA
Set-ClusterFaultDomain -Name clu-n2 -Parent SiteA
Set-ClusterFaultDomain -Name clu-n3 -Parent SiteB
Set-ClusterFaultDomain -Name clu-n4 -Parent SiteB

# Prefer keeping groups in their home site; cross-site failover only when needed
(Get-Cluster).PreferredSite = 'SiteA'

Now establish the replication partnership. Add the disks to the cluster first, then create the SR partnership between the Site A data volume and its Site B counterpart, each with its own log:

New-SRPartnership `
    -SourceComputerName clu-n1 -SourceRGName rg-SiteA `
    -SourceVolumeName 'C:\ClusterStorage\DataA' -SourceLogVolumeName 'L:' `
    -DestinationComputerName clu-n3 -DestinationRGName rg-SiteB `
    -DestinationVolumeName 'C:\ClusterStorage\DataB' -DestinationLogVolumeName 'L:' `
    -ReplicationMode Synchronous `
    -LogSizeInBytes 8GB

ReplicationMode Synchronous is the line that buys you RPO = 0. The cluster now sees the replicated volume as a single resource that can move between sites; whichever site owns it is the source, and SR flips replication direction automatically on failover.

6. Quorum: file-share and cloud witness for split sites

A two-site cluster with an even number of nodes (2+2) has a fatal quorum problem: if the inter-site link dies, each site holds exactly half the votes and neither holds a majority. Under plain majority rules both halves stop, which is the cluster refusing to risk split brain, and that takes the whole cluster down. The witness is what breaks the tie, and where you put it is the entire decision.

The witness must live in a third, independent failure domain - never in Site A or Site B. If it sits in Site A and Site A is the site that fails, you lose the witness and two nodes at once: game over.
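
The arithmetic behind that is plain majority voting. A minimal sketch that deliberately ignores dynamic quorum and only asks whether one partition holds more than half the configured votes; Test-HasQuorum is a hypothetical helper:

```powershell
# Hypothetical helper: static majority check, no dynamic quorum adjustments.
function Test-HasQuorum {
    param(
        [Parameter(Mandatory)] [int] $VotesHeld,
        [Parameter(Mandatory)] [int] $TotalVotes
    )
    $needed = [math]::Floor($TotalVotes / 2) + 1
    $VotesHeld -ge $needed
}

# 2+2 nodes, no witness, link down: each side holds 2 of 4
Test-HasQuorum -VotesHeld 2 -TotalVotes 4   # False

# 2+2 nodes plus a neutral witness, Site A lost: Site B holds 2 nodes + witness
Test-HasQuorum -VotesHeld 3 -TotalVotes 5   # True

# Witness placed inside Site A, Site A lost: Site B holds only its 2 nodes
Test-HasQuorum -VotesHeld 2 -TotalVotes 5   # False
```

The third case is the one to remember: a witness that fails together with a site leaves the survivors exactly as stuck as having no witness at all.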

Cloud witness is the right answer for stretch clusters. It is a tiny blob in an Azure Storage account, reachable from both sites over HTTPS, neutral to either datacenter failing. My default.
File-share witness works when there is no internet egress - but only if you have a genuine third site to host it. A share inside one of the two datacenters defeats the purpose.

# Cloud witness - the third, neutral vote for a stretch cluster
Set-ClusterQuorum -CloudWitness `
    -AccountName 'stcluwitness' `
    -AccessKey  '<storage-account-key>' `
    -Endpoint   'core.windows.net'

# Confirm every node and the witness each hold one vote
Get-ClusterQuorum | Format-List *
Get-ClusterNode | Select-Object Name, DynamicWeight, NodeWeight
(Get-Cluster).WitnessDynamicWeight   # 1 = the witness vote currently counts

Dynamic quorum and dynamic witness are on by default and they are your friends here: the cluster automatically adjusts vote weights so the surviving site retains the majority during a sequential failure, and toggles the witness vote to keep the total odd. You do not manage these by hand, but you must read them correctly in logs - a node’s effective weight changes over time.

Dynamic quorum protects against sequential failures. A simultaneous, instantaneous loss of two same-site nodes before the cluster recomputes weights can still drop quorum. The cloud witness is what carries you through that window - which is exactly why it must not share a failure domain with either site.
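
When reading those weights it can help to total them up yourself. A sketch that sums DynamicWeight across node objects and adds the witness weight; Get-EffectiveVoteSummary is a hypothetical helper, and the sample objects stand in for what Get-ClusterNode would return:

```powershell
# Hypothetical helper: totals the votes currently in effect and reports whether the total is odd.
function Get-EffectiveVoteSummary {
    param(
        # Objects exposing a numeric DynamicWeight property
        [Parameter(Mandatory)] [object[]] $Node,
        [int] $WitnessDynamicWeight = 0
    )
    $nodeVotes = ($Node | Measure-Object -Property DynamicWeight -Sum).Sum
    $total     = $nodeVotes + $WitnessDynamicWeight
    [pscustomobject]@{
        NodeVotes   = $nodeVotes
        WitnessVote = $WitnessDynamicWeight
        TotalVotes  = $total
        IsOdd       = ($total % 2) -eq 1
    }
}

# Stand-in for a healthy 2+2 cluster with the witness vote active
$nodes = 'clu-n1', 'clu-n2', 'clu-n3', 'clu-n4' |
    ForEach-Object { [pscustomobject]@{ Name = $_; DynamicWeight = 1 } }
Get-EffectiveVoteSummary -Node $nodes -WitnessDynamicWeight 1

# On a cluster node (not run here):
# Get-EffectiveVoteSummary -Node (Get-ClusterNode) -WitnessDynamicWeight (Get-Cluster).WitnessDynamicWeight
```

Captured before and after a failure test, the two summaries show how the weights moved.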

Verify

Prove the build before you trust it. Run these after CAU configuration and after the stretch partnership is established.

CAU readiness and run history:

Test-CauSetup -ClusterName clu-prod                 # all checks pass
Get-CauClusterRole -ClusterName clu-prod            # role present, schedule correct
Get-CauReport -ClusterName clu-prod -Last -Detailed # last Update Run, per-node result

Replication health and lag:

(Get-SRGroup).Replicas | Select-Object DataVolume, ReplicationStatus, NumOfBytesRemaining
# ContinuouslyReplicating with NumOfBytesRemaining at/near 0 = healthy synchronous
Get-SRPartnership

Site failover and controlled failback. This is the test that matters - move the role off Site A as a planned stand-in for losing it, confirm roles and storage come up in Site B, then fail back cleanly:

# Move the replicated role to Site B and let SR reverse direction
Move-ClusterGroup -Name 'FS-Stretch' -Node clu-n3
Get-ClusterGroup 'FS-Stretch' | Select-Object Name, OwnerNode, State

# Confirm SR flipped: clu-n3 is now source, clu-n1 is destination
Get-SRPartnership | Select-Object SourceComputerName, DestinationComputerName

# Controlled failback to Site A once it is healthy again
Move-ClusterGroup -Name 'FS-Stretch' -Node clu-n1

Watch the SR event channel during the failover - Microsoft-Windows-StorageReplica/Admin (events 1215/5009/5015 mark direction changes and resync). A clean failover shows the destination becoming source within seconds and replication resuming in the new direction without a full resync.
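
Pulling that channel for the test window is easier with a reusable filter. A sketch that only builds the hashtable for Get-WinEvent; New-SREventFilter is a hypothetical helper, and the 30-minute default window is an arbitrary choice:

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for the Storage Replica admin channel.
function New-SREventFilter {
    param(
        [datetime] $Since = (Get-Date).AddMinutes(-30),
        # Optional list of event IDs to narrow the query
        [int[]] $Id
    )
    $filter = @{
        LogName   = 'Microsoft-Windows-StorageReplica/Admin'
        StartTime = $Since
    }
    if ($Id) { $filter.Id = $Id }
    $filter
}

$filter = New-SREventFilter -Id 1215, 5009, 5015
$filter.LogName

# On a node that took part in the failover (not run here):
# Get-WinEvent -FilterHashtable $filter |
#     Sort-Object TimeCreated |
#     Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap
```

Get-WinEvent raises an error when nothing matches the filter, so an empty window shows up as an error rather than an empty table.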

Enterprise scenario

A payments platform ran a 4-node stretch cluster across two metro datacenters 38 km apart, hosting the SQL FCI behind their settlement service. RPO was contractually zero, so SR ran synchronous. Patching had been a quarterly all-hands maintenance window until they moved to CAU self-updating - and the first automated run aborted halfway through, on a Sunday at 02:30, with one node already patched and the run refusing to continue.

The cause was their own pre-update gate, working exactly as designed. Their pre-update.ps1 checked Storage Replica for ContinuouslyReplicating before each node was drained, and that night a re-route by the dark-fibre provider pushed round-trip latency to 6 ms. At 6 ms, synchronous writes backed up, the SR log on the source crept toward full, and ReplicationStatus flipped out of ContinuouslyReplicating. The gate did its job and aborted rather than reboot a node while replication was unhealthy - which would have stalled writes to the settlement DB.

The fix was not to weaken the gate; it was to make the gate tolerant of transient lag while still blocking on real divergence. They changed the check to allow a bounded NumOfBytesRemaining and a short re-check window, and split CAU into a per-site NodeOrder so Site A fully completed and resynced before Site B was touched:

# pre-update.ps1 - tolerate transient lag, block on genuine divergence
function Get-DivergedReplica {
    # ReplicationStatus and NumOfBytesRemaining are read per replica
    (Get-SRGroup).Replicas | Where-Object {
        $_.ReplicationStatus -ne 'ContinuouslyReplicating' -and
        $_.NumOfBytesRemaining -gt 64MB    # bounded catch-up is acceptable
    }
}
if (Get-DivergedReplica) {
    Start-Sleep -Seconds 30   # ride out a brief link blip, then re-check
    if (Get-DivergedReplica) { Write-Error 'SR genuinely behind; aborting run.'; exit 1 }
}
exit 0

They also oversized the SR log from 8 GB to 32 GB on NVMe so a latency spike no longer threatened to fill it. The next month’s run completed unattended in 41 minutes, every settlement transaction served throughout. The lesson: synchronous SR makes link quality a first-class patching dependency, and your CAU pre-script is the right place to encode that - tolerant of jitter, intolerant of real data loss.

Written by Vinod
