Windows Failover Clustering and Storage Spaces Direct: A Production Build

Hyper-converged infrastructure on Windows Server collapses compute and storage onto the same nodes using Storage Spaces Direct (S2D), with Failover Clustering providing the high-availability fabric. Done right, you get a software-defined storage pool that survives a node loss without a SAN in sight. Done wrong, you build a cluster that loses quorum the first time a node reboots for patching. This walkthrough is the production build I run: quorum first, validate before you commit, enable S2D, then prove it survives failure.

The reference is a four-node cluster, each node with dual 25 GbE RDMA NICs and a mix of NVMe cache and SSD/HDD capacity drives. Commands target Windows Server 2022 (S2D requires Datacenter edition); the principles apply to 2019 and 2025 too.

1. Cluster concepts: quorum, votes, and witness types

Before a single command, internalize the quorum model, because it is the single most common source of self-inflicted outages.

A failover cluster stays online only while it holds quorum - a majority of votes. Each node gets one vote. A witness (disk, file share, or cloud) gets one vote, used as a tiebreaker. The cluster keeps running as long as more than half the votes are reachable.

Nodes	Votes without witness	Survives	Recommendation
2	2	0 node failures (50% is not a majority)	Witness is mandatory
3	3	1 node failure	Witness recommended
4	4	1 node failure (2 lost = 50%)	Witness mandatory for the 2-node-down edge
5	5	2 node failures	Witness recommended

The rule of thumb: always configure a witness. It costs nothing and turns even-node clusters from fragile to resilient.
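
The arithmetic behind that table is easy to sanity-check. What follows is a minimal sketch of static majority math only (it deliberately ignores dynamic quorum); Get-QuorumTolerance and its parameters are hypothetical names for illustration, not part of the FailoverClusters module.

```powershell
# Hypothetical helper: static majority math only (no dynamic quorum).
function Get-QuorumTolerance {
    param(
        [Parameter(Mandatory)][int]$Nodes,
        [switch]$Witness
    )
    # One vote per node, plus one tiebreaker vote if a witness exists.
    $votes = $Nodes + [int]$Witness.IsPresent
    # Quorum requires strictly more than half of all votes.
    $majority = [int][math]::Floor($votes / 2) + 1
    [pscustomobject]@{
        Nodes             = $Nodes
        Votes             = $votes
        Majority          = $majority
        # Simultaneous node losses survivable while the witness stays reachable.
        ToleratedFailures = $votes - $majority
    }
}

# Compare each cluster size with and without a witness.
2..5 | ForEach-Object {
    $n = $_
    [pscustomobject]@{
        Nodes          = $n
        WithoutWitness = (Get-QuorumTolerance -Nodes $n).ToleratedFailures
        WithWitness    = (Get-QuorumTolerance -Nodes $n -Witness).ToleratedFailures
    }
} | Format-Table -AutoSize
```

For the four-node reference build, the helper reports one tolerated failure without a witness and two with one, which is the whole argument for configuring it.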

Dynamic quorum and dynamic witness

Modern clusters do not use a static vote count. Dynamic quorum recalculates the working majority each time a node leaves, whether by clean shutdown or by a failure the cluster survives, so a cluster can run down to a last node standing if nodes drop one at a time. Dynamic witness toggles the witness vote on or off to keep the total odd, which is why a 4-node cluster gives the witness a vote and an odd-node cluster may not. Both are on by default and are not managed by hand, but you must understand them when reading logs - the vote a node holds changes over time.

Dynamic quorum protects against sequential failures, not simultaneous ones. If two of four nodes die at the same instant, before the cluster can recompute votes, a cluster without a witness is left holding 2 of 4 votes and loses quorum. The witness vote is what saves you in that window: the survivors plus the witness hold 3 of 5.

Witness type selection

Cloud witness - a tiny blob in an Azure Storage account. Lowest operational burden; my default, ideal for stretched or branch clusters.
File share witness (SMB) - a share on a separate, independent server (never a cluster node). Use when there is no internet egress.
Disk witness - a shared LUN. Not applicable to S2D hyper-converged clusters; there is no shared disk, so do not use it here.

2. Validate hardware and networks with Test-Cluster before you commit

Never create an S2D cluster without a clean validation report. Microsoft support will ask for it, and Test-Cluster catches firmware mismatches, missing RDMA, and bad drives before they become a 3 a.m. incident.

Install the features on every node first:

$nodes = 'hci-n1','hci-n2','hci-n3','hci-n4'

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name Hyper-V, Failover-Clustering, Data-Center-Bridging, FS-FileServer `
        -IncludeManagementTools
}

# Reboot if any install requested it, then proceed.

Run the full validation, including the storage-specific suite that exercises S2D readiness:

Test-Cluster -Node $nodes `
    -Include 'Storage Spaces Direct','Inventory','Network','System Configuration' `
    -ReportName 'C:\ClusterReports\precheck'

Open the generated HTML report. Treat every Failed as a hard stop and review every Warning. The checks that matter most for S2D: identical drive firmware across nodes, all eligible drives reported as unclaimed (no leftover partitions), consistent NIC drivers, and a passing network communication test. If a drive shows a stale partition table from a prior build, clear it before clustering (this is destructive - triple-check the host list):

Invoke-Command -ComputerName $nodes -ScriptBlock {
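    # Skip OS and boot disks, plus disks that are already RAW (nothing to clear)
    Get-Disk |
        Where-Object { -not $_.IsSystem -and -not $_.IsBoot -and $_.PartitionStyle -ne 'RAW' } |
        Clear-Disk -RemoveData -RemoveOEM -Confirm:$false -ErrorAction SilentlyContinue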
}
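
Before running anything destructive, it can help to preview exactly which disks a filter would select. The sketch below is one way to do that; Select-ClearableDisk is a hypothetical helper name, and the rule it applies (skip system, boot, and already-RAW disks) is an assumption you should adapt to your own hardware.

```powershell
# Hypothetical filter: pass through only disks that are candidates for wiping.
function Select-ClearableDisk {
    param([Parameter(Mandatory, ValueFromPipeline)]$Disk)
    process {
        # Never touch the OS or boot disk; RAW disks have nothing to clear.
        if (-not $Disk.IsSystem -and -not $Disk.IsBoot -and $Disk.PartitionStyle -ne 'RAW') {
            $Disk
        }
    }
}

# Preview on one node (read-only, nothing is modified):
# Get-Disk | Select-ClearableDisk | Format-Table Number, FriendlyName, PartitionStyle, Size
```

Only once the preview shows the disks you expect would I pipe the same selection into Clear-Disk.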

3. Create the cluster and configure the witness

Create the cluster without storage so S2D can claim the drives in a controlled step. Supply a static management IP and skip adding eligible storage:

New-Cluster -Name 'hci-clus01' `
    -Node $nodes `
    -StaticAddress '10.20.0.50' `
    -NoStorage

After creation, confirm that the cluster name object (CNO) was registered in Active Directory and DNS, then configure the witness. For a cloud witness you need an Azure Storage account name and one of its access keys:

Set-ClusterQuorum -Cluster 'hci-clus01' `
    -CloudWitness `
    -AccountName 'sthciwitness01' `
    -AccessKey '<storage-account-key>' `
    -Endpoint 'core.windows.net'

For an air-gapped environment, use a file share witness on an independent server instead:

Set-ClusterQuorum -Cluster 'hci-clus01' `
    -FileShareWitness '\\fsw-srv01\hci-clus01-witness'

Confirm the quorum model and that dynamic quorum is engaged:

Get-ClusterQuorum -Cluster 'hci-clus01' | Format-List *
(Get-Cluster).DynamicQuorum   # expect 1
Get-ClusterNode | Select-Object Name, State, DynamicWeight, NodeWeight

DynamicWeight is the vote the node currently holds under dynamic quorum; NodeWeight is whether it is eligible to vote at all. Both should read 1 on healthy nodes.
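
To make the vote state easier to read at a glance, the node weights can be rolled up into a single summary. This is a minimal sketch: Get-VoteSummary is a hypothetical helper, and it only assumes input objects that expose Name and DynamicWeight, as the Get-ClusterNode output above does.

```powershell
# Hypothetical helper: summarize the current vote distribution.
function Get-VoteSummary {
    param(
        [Parameter(Mandatory)][object[]]$Node,
        [int]$WitnessDynamicWeight = 0
    )
    # Sum the votes nodes currently hold under dynamic quorum.
    $nodeVotes = [int]($Node | Measure-Object -Property DynamicWeight -Sum).Sum
    $total = $nodeVotes + $WitnessDynamicWeight
    [pscustomobject]@{
        NodeVotes      = $nodeVotes
        WitnessVote    = $WitnessDynamicWeight
        TotalVotes     = $total
        # Strict majority of the votes currently in play.
        VotesNeeded    = [int][math]::Floor($total / 2) + 1
        NonVotingNodes = @($Node | Where-Object { $_.DynamicWeight -eq 0 } | ForEach-Object Name)
    }
}

# Usage on a cluster node (requires the FailoverClusters module):
# Get-VoteSummary -Node (Get-ClusterNode) -WitnessDynamicWeight (Get-Cluster).WitnessDynamicWeight
```

On the healthy four-node build with a witness, I would expect five total votes and three needed.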

4. Enable Storage Spaces Direct: pools, volumes, and resiliency

With a clean cluster, enable S2D. This single cmdlet claims eligible drives, builds the storage pool, and with mixed media auto-binds the fastest device as cache - NVMe in front of SSD/HDD capacity drives:

Enable-ClusterStorageSpacesDirect -CimSession 'hci-clus01' -Confirm:$false

Verify the pool and cache:

Get-StoragePool -IsPrimordial $false |
    Select-Object FriendlyName, Size, AllocatedSize, HealthStatus

Get-PhysicalDisk |
    Select-Object FriendlyName, MediaType, Usage, Size, HealthStatus |
    Sort-Object Usage

Drives bound to cache show Usage = Journal; capacity drives show Usage = Auto-Select.

Choosing a resiliency type

Resiliency is set per volume - the decision that defines your fault tolerance and usable capacity.

Resiliency	Min nodes	Survives	Capacity efficiency	Use for
Two-way mirror	2	1 drive or node	50%	Small clusters, latency-sensitive
Three-way mirror	3	2 drives or nodes at once	33%	Most production workloads
Dual parity	4	2 drives or nodes	up to ~80% (scales with nodes)	Archival, large cold data
Mirror-accelerated parity	4	2 drives or nodes	mixed	Mix of hot writes + cold capacity

On a four-node cluster, three-way mirror is the safe default for active workloads. Create volumes as Cluster Shared Volumes (CSV) so every node can access them concurrently - essential for live migration:

# Three-way mirror volume for VMs
New-Volume -FriendlyName 'vm-prod-01' `
    -FileSystem CSVFS_ReFS `
    -StoragePoolFriendlyName 'S2D on hci-clus01' `
    -ResiliencySettingName 'Mirror' `
    -PhysicalDiskRedundancy 2 `
    -Size 4TB

# Mirror-accelerated parity volume for bulk/archive
# (tier template names vary by OS build and media type; confirm with Get-StorageTier)
New-Volume -FriendlyName 'archive-01' `
    -FileSystem CSVFS_ReFS `
    -StoragePoolFriendlyName 'S2D on hci-clus01' `
    -StorageTierFriendlyNames 'Performance','Capacity' `
    -StorageTierSizes 1TB, 9TB

PhysicalDiskRedundancy 2 is what makes a mirror three-way (two copies can be lost). Use ReFS for S2D volumes: it is the recommended file system, it provides the integrity streams and block cloning that Hyper-V workloads benefit from, and mirror-accelerated parity requires it.
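
Resiliency choices translate directly into raw pool consumption, and it is worth doing the arithmetic before creating volumes. The sketch below is illustrative only: Get-PoolFootprint is a hypothetical helper, and the efficiency figures are assumptions (one third for three-way mirror, 50% for dual parity on exactly four nodes, with the mirror tier of the archive volume assumed to be three-way).

```powershell
# Hypothetical helper: raw pool capacity (TB) consumed by a volume of a given usable size.
function Get-PoolFootprint {
    param(
        [Parameter(Mandatory)][double]$UsableTB,
        [Parameter(Mandatory)][ValidateRange(0.01, 1.0)][double]$Efficiency
    )
    [math]::Round($UsableTB / $Efficiency, 2)
}

$threeWayMirror  = 1 / 3   # three full copies of every block
$dualParity4Node = 0.5     # assumed dual-parity efficiency on four nodes

# vm-prod-01: 4 TB three-way mirror
$vmProd = Get-PoolFootprint -UsableTB 4 -Efficiency $threeWayMirror

# archive-01: 1 TB mirror tier plus 9 TB parity tier
$archive = (Get-PoolFootprint -UsableTB 1 -Efficiency $threeWayMirror) +
           (Get-PoolFootprint -UsableTB 9 -Efficiency $dualParity4Node)

"vm-prod-01 consumes $vmProd TB of pool; archive-01 consumes $archive TB"
```

Under those assumptions the two volumes above consume 12 TB and 21 TB of raw pool respectively, so 14 TB of usable space costs 33 TB of drives.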

5. Cluster networking: RDMA, SMB Multichannel, and live migration

S2D moves storage traffic between nodes over SMB. This east-west traffic is the performance-critical path, so it must run over RDMA (RoCEv2 or iWARP) for low latency and CPU offload. Most builds use a converged design: the same dual NICs carry both storage and tenant traffic behind a SET (Switch Embedded Teaming) switch.

Create the SET switch on every node:

Invoke-Command -ComputerName $nodes -ScriptBlock {
    New-VMSwitch -Name 'ConvergedSwitch' `
        -NetAdapterName 'SMB1','SMB2' `
        -EnableEmbeddedTeaming $true `
        -AllowManagementOS $false
}

Add host vNICs for storage and map them to physical adapters so each RDMA path stays on a separate NIC, then enable RDMA on the vNICs:

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Add-VMNetworkAdapter -ManagementOS -Name 'SMB_1' -SwitchName 'ConvergedSwitch'
    Add-VMNetworkAdapter -ManagementOS -Name 'SMB_2' -SwitchName 'ConvergedSwitch'

    Set-VMNetworkAdapterTeamMapping -ManagementOS `
        -VMNetworkAdapterName 'SMB_1' -PhysicalNetAdapterName 'SMB1'
    Set-VMNetworkAdapterTeamMapping -ManagementOS `
        -VMNetworkAdapterName 'SMB_2' -PhysicalNetAdapterName 'SMB2'

    Enable-NetAdapterRdma -Name 'vEthernet (SMB_1)','vEthernet (SMB_2)'
}

Confirm RDMA is actually active end to end - this is where most “S2D is slow” tickets are resolved:

Get-NetAdapterRdma | Where-Object Enabled
Get-SmbClientNetworkInterface | Where-Object RdmaCapable
# During load, watch RDMA activity:
Get-SmbMultichannelConnection -SmbInstance SBL

RoCEv2 requires lossless Ethernet: Priority Flow Control (PFC) and DCB configured consistently on both the NICs and the physical switches. If DCB is not configured end to end, RoCE will fall back to lossy behavior and you will see retransmits and storage latency spikes. iWARP needs no switch DCB config, which is why many teams prefer it for simplicity.
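
For RoCEv2, the host half of that DCB configuration typically looks something like the following. Treat it as a sketch under stated assumptions, not a drop-in config: priority 3 for SMB Direct and a 50% bandwidth reservation are common choices but are assumptions here, the adapter names are the ones used in this build, and the physical switches must be configured to match.

```powershell
$nodes = 'hci-n1','hci-n2','hci-n3','hci-n4'

Invoke-Command -ComputerName $nodes -ScriptBlock {
    # Tag SMB Direct traffic (port 445) with 802.1p priority 3 (assumed value).
    New-NetQosPolicy -Name 'SMB' -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

    # Enable PFC only for the SMB priority; leave it off for all others.
    Enable-NetQosFlowControl  -Priority 3
    Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

    # Reserve a bandwidth share for the SMB traffic class (50% is an assumption).
    New-NetQosTrafficClass -Name 'SMB' -Priority 3 -BandwidthPercentage 50 -Algorithm ETS

    # Apply DCB on the physical adapters and do not accept settings pushed by the switch.
    Enable-NetAdapterQos -Name 'SMB1','SMB2'
    Set-NetQosDcbxSetting -Willing $false -Confirm:$false
}
```

Whatever values you choose, the priority number and PFC settings have to be identical on every host port and every switch port in the storage path.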

Finally, tune live migration to use SMB (which rides RDMA) and set sensible concurrency:

Invoke-Command -ComputerName $nodes -ScriptBlock {
    Set-VMHost -VirtualMachineMigrationPerformanceOption SMB `
        -MaximumVirtualMachineMigrations 2 `
        -MaximumStorageMigrations 2
}

6. Deploy clustered roles and tune failover policy

With storage and networking in place, deploy workloads. For Hyper-V, simply place VM files on a CSV path and create the VM as highly available:

New-VM -Name 'app-vm-01' `
    -MemoryStartupBytes 16GB `
    -Generation 2 `
    -Path 'C:\ClusterStorage\vm-prod-01' `
    -NewVHDPath 'C:\ClusterStorage\vm-prod-01\app-vm-01\disk0.vhdx' `
    -NewVHDSizeBytes 128GB

Add-ClusterVirtualMachineRole -VirtualMachine 'app-vm-01' -Cluster 'hci-clus01'

For a clustered file server serving application data (SQL, Hyper-V over SMB), use Scale-Out File Server (SOFS) for active-active access:

Add-ClusterScaleOutFileServerRole -Name 'sofs-app01' -Cluster 'hci-clus01'
New-SmbShare -Name 'sqldata' -Path 'C:\ClusterStorage\vm-prod-01\Shares\sqldata' `
    -FullAccess 'CONTOSO\SQLEngineAccounts'

Tune failover behavior on the role. The key knobs are how many failures are tolerated in a window before the cluster gives up, and whether a role fails back automatically:

$grp = Get-ClusterGroup -Name 'app-vm-01'
$grp.FailoverThreshold = 2      # max failovers ...
$grp.FailoverPeriod    = 6      # ... within this many hours
$grp.AutoFailbackType  = 0      # 0 = no failback (prevents ping-pong)

Set node preference where workloads should land, but leave failback disabled in most cases - it can move a VM back to a node that is healthy but still warming caches.
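
The threshold and period interact as a sliding window, which is easier to see in code. This is a simplified model for reasoning about the settings, not the cluster service's actual implementation; Test-FailoverAllowed and its parameters are hypothetical.

```powershell
# Hypothetical model: would the cluster attempt another failover for this role?
function Test-FailoverAllowed {
    param(
        [datetime[]]$PastFailover = @(),
        [Parameter(Mandatory)][int]$FailoverThreshold,
        [Parameter(Mandatory)][int]$FailoverPeriodHours,
        [datetime]$Now = (Get-Date)
    )
    # Only failovers inside the trailing window count against the threshold.
    $windowStart = $Now.AddHours(-$FailoverPeriodHours)
    $recent = @($PastFailover | Where-Object { $_ -gt $windowStart }).Count
    # Another attempt is allowed while the count is still below the threshold.
    $recent -lt $FailoverThreshold
}

# Example: two failovers in the last two hours, threshold 2 in 6 hours.
$now = Get-Date
Test-FailoverAllowed -PastFailover $now.AddHours(-1), $now.AddHours(-2) `
    -FailoverThreshold 2 -FailoverPeriodHours 6 -Now $now
```

With the values set above, a role that has already failed over twice in six hours is left in a failed state instead of bouncing between nodes, which is usually what you want while you investigate.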

7. Patching with Cluster-Aware Updating and maintenance mode

Patching is where the quorum design earns its keep. Cluster-Aware Updating (CAU) orchestrates a rolling update: drain one node, patch, reboot, bring it back, move to the next - so the cluster stays online throughout. Enable self-updating mode to patch on a schedule:

Add-CauClusterRole -ClusterName 'hci-clus01' `
    -DaysOfWeek Sunday `
    -WeeksOfMonth 2 `
    -MaxFailedNodes 1 `
    -MaxRetriesPerNode 2 `
    -EnableFirewallRules `
    -Force

To patch on demand:

Invoke-CauRun -ClusterName 'hci-clus01' `
    -CauPluginName 'Microsoft.WindowsUpdatePlugin' `
    -MaxFailedNodes 1 -MaxRetriesPerNode 2 -Force

For manual maintenance (firmware, hardware), drain roles and pause S2D on the node first. Skipping the storage pause is a classic mistake that triggers an unnecessary resync storm:

# Drain roles off the node
Suspend-ClusterNode -Name 'hci-n2' -Drain

# Then pause the S2D storage on that node
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq 'hci-n2' |
    Enable-StorageMaintenanceMode

# ... do the work, reboot if needed ...

Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq 'hci-n2' |
    Disable-StorageMaintenanceMode

Resume-ClusterNode -Name 'hci-n2' -Failback Immediate

Always confirm storage is fully healthy before touching the next node.
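
That health gate is easy to script so nobody moves on early. A minimal sketch, assuming the objects passed in expose the same HealthStatus, OperationalStatus, and JobState properties that Get-VirtualDisk and Get-StorageJob return; Test-StorageReady is a hypothetical helper name.

```powershell
# Hypothetical gate: true only when every virtual disk is healthy and no storage job is still active.
function Test-StorageReady {
    param(
        [object[]]$VirtualDisk = @(),
        [object[]]$StorageJob  = @()
    )
    # Guard against $null input, which is what an empty Get-StorageJob result becomes.
    $unhealthy = @($VirtualDisk | Where-Object {
        $null -ne $_ -and ($_.HealthStatus -ne 'Healthy' -or $_.OperationalStatus -ne 'OK')
    })
    $active = @($StorageJob | Where-Object {
        $null -ne $_ -and $_.JobState -ne 'Completed'
    })
    ($unhealthy.Count -eq 0) -and ($active.Count -eq 0)
}

# Usage between nodes (poll interval is arbitrary):
# while (-not (Test-StorageReady -VirtualDisk (Get-VirtualDisk) -StorageJob (Get-StorageJob))) {
#     Start-Sleep -Seconds 30
# }
```

Blocking on both conditions matters: a virtual disk can report Healthy for a moment while a repair job is still queued.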

Verify

Walk this list before declaring the cluster production-ready:

# Cluster and node health
Get-Cluster | Format-List Name, DynamicQuorum, QuorumType
Get-ClusterNode | Select-Object Name, State, StatusInformation

# S2D health - HealthStatus must be Healthy, OperationalStatus OK
Get-StoragePool -IsPrimordial $false | Select-Object FriendlyName, HealthStatus, OperationalStatus
Get-VirtualDisk | Select-Object FriendlyName, ResiliencySettingName, HealthStatus, OperationalStatus
Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'   # should return nothing

# No active repair/resync jobs (run after any maintenance)
Get-StorageJob

# Witness is configured and currently holds a vote
Get-ClusterQuorum | Format-List *
(Get-Cluster).WitnessDynamicWeight   # expect 1 with all four nodes up

# RDMA is live
Get-SmbClientNetworkInterface | Where-Object RdmaCapable -eq $true

A healthy cluster reports every node Up, every virtual disk Healthy/OK, zero unhealthy physical disks, and no lingering Get-StorageJob entries.

Failure-injection testing and reading cluster logs

Never trust an HA cluster you have not broken on purpose. Run these tests in a maintenance window, before the cluster carries production load.

Test 1 - Hard node loss. Power off a node abruptly (pull power or hard-stop the VM) and watch roles fail over:

Get-ClusterGroup | Select-Object Name, OwnerNode, State
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus

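Expect VMs to restart on surviving nodes once the cluster declares the node down (with the default compute-resiliency settings in Windows Server 2016 and later, the node can sit in an Isolated state for up to four minutes first), virtual disks to go to Degraded (not Failed), and quorum to hold. When the node returns, S2D auto-resyncs; track it with Get-StorageJob.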

Test 2 - Network partition. Disable a storage NIC and confirm SMB Multichannel reroutes over the surviving path with no role movement.

Test 3 - Witness loss. With one node already down, take the witness offline and confirm the cluster stays up (dynamic quorum should have already adjusted weights).

When something does not behave, the authoritative source is the cluster log. Generate one log per node, stamped in local time rather than UTC, and collect them in a single folder:

Get-ClusterLog -Destination 'C:\ClusterReports' -TimeSpan 60 -UseLocalTime

Grep the resulting .log files for the node that dropped and look for quorum, vote, and resource state transitions. Cross-reference with the System and FailoverClustering operational event logs:

Get-WinEvent -LogName 'Microsoft-Windows-FailoverClustering/Operational' -MaxEvents 100 |
    Where-Object { $_.LevelDisplayName -in 'Error','Warning' } |
    Format-Table TimeCreated, Id, Message -AutoSize

The most useful event IDs to learn: 1135 (node removed from active membership), 1177 (cluster lost quorum - the one you never want to see), and 1069/1146 (resource/role failures).
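
When triaging, I find it useful to annotate those IDs inline instead of memorizing them. The sketch below does that for any objects exposing Id and TimeCreated; Add-ClusterEventMeaning is a hypothetical helper, and the descriptions are just the short labels from the paragraph above.

```powershell
# Hypothetical helper: keep only the key cluster event IDs and label each one.
function Add-ClusterEventMeaning {
    param([Parameter(Mandatory, ValueFromPipeline)]$Record)
    begin {
        $meaning = @{
            1135 = 'Node removed from active membership'
            1177 = 'Cluster lost quorum'
            1069 = 'Resource/role failure'
            1146 = 'Resource/role failure'
        }
    }
    process {
        $id = [int]$Record.Id
        if ($meaning.ContainsKey($id)) {
            [pscustomobject]@{
                TimeCreated = $Record.TimeCreated
                Id          = $id
                Meaning     = $meaning[$id]
            }
        }
    }
}

# Usage against the System log on a cluster node:
# Get-WinEvent -FilterHashtable @{
#     LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering'
#     Id = 1135, 1177, 1069, 1146
# } | Add-ClusterEventMeaning | Sort-Object TimeCreated
```

Sorting the annotated output by time usually makes the order of events obvious: a 1135 followed closely by a 1177 tells a very different story from a lone 1069.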

Enterprise scenario

A four-node S2D cluster running ~120 production VMs started throwing storage latency alerts every few weeks - p99 write latency on the mirror CSVs spiking from sub-millisecond to 40-60 ms for ten to fifteen minutes, then clearing on its own. No node was down, Get-StorageJob showed nothing, and virtual disks read Healthy. The platform team chased the storage layer for two sprints before the actual cause surfaced in the SMB instance metrics.

The cluster ran RoCEv2 over a converged SET switch. PFC and DCB were configured on the hosts, but a firmware update on one top-of-rack switch had silently reset its priority-to-traffic-class mapping. RoCE was falling back to lossy behavior on exactly one path, and during VM backup windows the resulting congestion drops triggered go-back-N retransmits on the SBL (Storage Bus Layer) east-west traffic. The “intermittent” pattern was just the backup schedule.

The tell was non-zero RDMA failure counters that only climbed during the spikes:

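# RDMA error counters per adapter (performance counter set "RDMA Activity")
Get-Counter -Counter '\RDMA Activity(*)\RDMA Connection Errors',
                     '\RDMA Activity(*)\RDMA Failed Connection Attempts'
Get-SmbMultichannelConnection -SmbInstance SBL |
    Select-Object ServerName, ClientIpAddress, ClientRdmaCapable, ServerRdmaCapable
# Host-side DCB/PFC state; compare it against the switch's own PFC config and counters
Get-NetQosFlowControl
Get-NetAdapterQos | Where-Object Enabled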

The fix was reapplying the DCB/PFC config to the drifted switch and pinning switch firmware in the change baseline. They also added a synthetic check that alerts on any rise in the RDMA connection-error counters and on PFC pause-frame asymmetry between paired ports. Lesson: on RoCE-based S2D, treat the physical switch DCB state as part of the cluster, and monitor RDMA error counters - not just disk health - because the storage fabric can degrade while every disk still reports green.
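
A synthetic check of that kind can be as simple as diffing two counter samples. The following is a minimal sketch of the comparison logic only: Get-CounterRise is a hypothetical helper, the hashtables map an adapter name to a cumulative error count, and how you collect and store the samples is left as an assumption.

```powershell
# Hypothetical helper: report adapters whose cumulative error count rose between two samples.
function Get-CounterRise {
    param(
        [Parameter(Mandatory)][hashtable]$Previous,
        [Parameter(Mandatory)][hashtable]$Current
    )
    foreach ($name in $Current.Keys) {
        # Treat an adapter missing from the earlier sample as starting from zero.
        $before = if ($Previous.ContainsKey($name)) { $Previous[$name] } else { 0 }
        $delta = $Current[$name] - $before
        if ($delta -gt 0) {
            [pscustomobject]@{ Adapter = $name; Delta = $delta }
        }
    }
}

# Example with made-up sample values:
$earlier = @{ 'SMB1' = 4; 'SMB2' = 0 }
$later   = @{ 'SMB1' = 4; 'SMB2' = 7 }
Get-CounterRise -Previous $earlier -Current $later
```

Alerting on the delta and not the absolute value is the point: a counter that has been non-zero since the last reboot is noise, while one that climbs during a backup window is the signal.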

Pitfalls and next steps

The failures I see most often: no witness on an even-node design, losing quorum during the first patch reboot; running RoCE without end-to-end DCB and chasing phantom “slow storage” tickets that are really packet drops; skipping Enable-StorageMaintenanceMode before reboots and triggering needless resyncs; and using two-way mirror in production, which cannot survive a drive failure during a node reboot. Use three-way mirror unless you have a deliberate reason not to.

From here, layer in proactive operations: monitor S2D health faults and storage jobs, schedule periodic failure-injection drills (an untested HA design is a hope, not a guarantee), and benchmark IOPS headroom with VMFleet before filling the cluster. If you outgrow a single cluster, the next step is a stretch cluster with synchronous Storage Replica across two sites - but get the single-site fundamentals bulletproof first.