
Building Resilient Linux Storage with mdadm Software RAID

Hardware RAID controllers fail in their own quiet ways: a proprietary on-disk format you cannot read without an identical card, a battery-backed cache that silently dies, firmware that lies about a rebuild. Linux md (multiple devices) sidesteps all of that. It is the kernel’s software RAID layer, driven from userspace by mdadm, and it stores its metadata in an open, documented format you can assemble on any Linux box. This is a working tour of building production arrays: choosing a level, persisting the config so it survives reboot, tuning rebuild speed, wiring up hot spares, and — the part that actually matters — recovering cleanly when a disk dies at 3 a.m.

Scope: examples target current RHEL-family (RHEL 9/10, Rocky, Alma) and Ubuntu Server 22.04/24.04, with mdadm 4.x, which creates arrays with metadata version 1.2 by default. Device names below (/dev/sd[b-f]) are illustrative; on real hardware prefer stable /dev/disk/by-id/ paths so a renumbered SCSI bus never points wipefs or --create at the wrong disk. Run everything as root.

1. Choosing a RAID level: the trade-offs that bite

The level you pick is a function of your failure budget, your rebuild risk, and your write pattern — not a default. Here is the honest comparison.

| Level | Min disks | Usable capacity | Fault tolerance | Write penalty | Use when |
|---|---|---|---|---|---|
| RAID 1 | 2 | size of one disk | any 1 disk (per mirror) | low (2x writes) | Boot/root volumes, small datasets, simplest recovery |
| RAID 5 | 3 | (N-1) disks | any 1 disk | high (read-modify-write parity) | Read-mostly bulk storage where capacity matters |
| RAID 6 | 4 | (N-2) disks | any 2 disks | higher (dual parity) | Large arrays of large disks; survives a second failure during rebuild |
| RAID 10 | 4 | N/2 disks | 1 disk per mirror pair | low | Databases and write-heavy workloads needing both speed and redundancy |
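
The usable-capacity column reduces to simple arithmetic. A minimal sketch, assuming equal-size members and ignoring metadata overhead; raid_usable_tb is an illustrative name, not an mdadm feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: usable capacity for N equal-size disks.
# Usage: raid_usable_tb LEVEL NDISKS DISK_TB
raid_usable_tb() {
  local level=$1 n=$2 size=$3
  case "$level" in
    1)  echo "$size" ;;                  # every member holds a full copy
    5)  echo $(( (n - 1) * size )) ;;    # one disk's worth of parity
    6)  echo $(( (n - 2) * size )) ;;    # two disks' worth of parity
    10) echo $(( n * size / 2 )) ;;      # two copies of every block
    *)  echo "unsupported level: $level" >&2; return 1 ;;
  esac
}

raid_usable_tb 6 12 16   # 12 x 16 TB in RAID 6 -> 160
raid_usable_tb 5 12 16   # same disks in RAID 5 -> 176
```

The 16 TB you give up by choosing RAID 6 over RAID 5 on those twelve disks is the price of surviving a second failure.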

Two things experienced operators internalize:

Linux md has a useful “level 10” of its own that is not just nested 1+0: it supports near, far, and offset layouts and works with an odd number of disks. The default near layout behaves like classic RAID 10; far gives markedly better sequential read throughput on the same spindles at the cost of slower writes.

2. Creating an array and persisting it in mdadm.conf

Wipe any stale signatures first — leftover filesystem or RAID metadata is the most common cause of a confused assemble.

# Inspect before you touch anything
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
wipefs -a /dev/sdb /dev/sdc /dev/sdd /dev/sde

Create a RAID 6 across four members. The flags here are the ones that matter in production:

mdadm --create /dev/md0 \
  --level=6 \
  --raid-devices=4 \
  --metadata=1.2 \
  --bitmap=internal \
  --chunk=512 \
  /dev/sdb /dev/sdc /dev/sdd /dev/sde

The array is usable immediately and resyncs in the background. Watch it:

cat /proc/mdstat
mdadm --detail /dev/md0

The initial resync of a fresh array is normal and expected — md is making parity consistent. You can build a filesystem and start using the array while it runs; it will simply be slower until resync completes.
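
If you want a script to wait on the resync instead of eyeballing it, the progress line can be parsed. A sketch, assuming the usual /proc/mdstat layout in which a line such as "resync = 12.6% (...)" follows the array header; md_progress is a made-up name, and the optional file argument exists only so the parser can be run against a saved copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<array> <operation> <percent>" for every
# array that currently has a resync, recovery, reshape, or check running.
# Usage: md_progress [FILE]     (FILE defaults to /proc/mdstat)
md_progress() {
  awk '
    /^md[0-9]+ :/ { array = $1 }                  # remember the current array
    match($0, /(resync|recovery|reshape|check) *= *[0-9.]+%/) {
      s = substr($0, RSTART, RLENGTH)             # e.g. "resync = 12.6%"
      split(s, parts, /[ =]+/)                    # -> "resync", "12.6%"
      print array, parts[1], parts[2]
    }
  ' "${1:-/proc/mdstat}"
}
```

No output means nothing is in progress.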

Now persist the configuration. This is the step people skip, and then their array assembles under a random /dev/md127 after reboot. The fix is to write the array definition into mdadm.conf and regenerate the initramfs.

# RHEL family
mdadm --detail --scan >> /etc/mdadm.conf

# Debian/Ubuntu use a subdirectory
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

That appends a line like:

ARRAY /dev/md0 metadata=1.2 name=host01:0 UUID=3a9f...:e1c2...:...
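
Because the commands above append with >>, running them a second time leaves two ARRAY lines for the same array. A small guard, sketched here with a hypothetical dup_array_uuids function, prints any UUID that appears on more than one ARRAY line so you can delete the duplicate before regenerating the initramfs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list UUIDs that occur on more than one ARRAY line.
# Usage: dup_array_uuids /etc/mdadm.conf   (or /etc/mdadm/mdadm.conf)
dup_array_uuids() {
  awk '$1 == "ARRAY" {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^UUID=/) {
             # print each duplicated UUID once, on its second sighting
             if (seen[$i]++ == 1) print substr($i, 6)
           }
       }' "$1"
}
```

Empty output means every array is defined exactly once.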

The UUID is what assembly keys on, which is exactly why by-id device naming does not matter at assemble time — md finds members by superblock UUID, not by /dev path. Set HOMEHOST so arrays created elsewhere are not silently adopted:

HOMEHOST <system>
MAILADDR ops-storage@example.com

Then rebuild the initramfs so the boot environment knows about the array (details in section 8):

# RHEL family
dracut --force

# Debian/Ubuntu
update-initramfs -u

Finally, lay down a filesystem and a stable mount:

mkfs.xfs -L data /dev/md0
mkdir -p /srv/data
echo 'LABEL=data /srv/data xfs defaults,noatime 0 0' >> /etc/fstab
mount -a

3. Write-intent bitmaps and rebuild speed tuning

A write-intent bitmap records which regions of the array have outstanding writes. Without it, an unclean shutdown forces md to resync the entire array on next boot, because it has no idea which stripes might be inconsistent. With a bitmap, md resyncs only the dirty regions, turning a multi-hour full resync into seconds or minutes. The cost is extra writes to update bitmap bits; with a reasonably large bitmap chunk it is usually small, and the recovery-time win is enormous.

If you forgot it at create time, add it live:

mdadm --grow /dev/md0 --bitmap=internal

Tune the granularity with --bitmap-chunk (larger chunk = fewer bits = less overhead but coarser resync). The default internal bitmap stores bits alongside the superblock. For very write-heavy arrays you can place the bitmap on a separate fast device, but internal is the correct default for almost everyone.

Rebuild and resync speed is governed by two sysctls that throttle md so a rebuild does not starve foreground I/O:

# Current limits (KiB/s per array member)
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Raise the floor so an urgent rebuild gets real bandwidth
sysctl -w dev.raid.speed_limit_min=200000

The min value is a floor md will try to sustain even under load; max caps the rate when the system is otherwise idle. For an array that is degraded and exposed, raising speed_limit_min shortens the dangerous window at the cost of foreground latency. Persist a standing floor in /etc/sysctl.d/ (the values here are more moderate than the urgent one above) and raise it by hand when an array is exposed:

# /etc/sysctl.d/90-md-rebuild.conf
dev.raid.speed_limit_min = 100000
dev.raid.speed_limit_max = 500000
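
To see what a floor means in wall-clock terms, divide the member size by the sustained rate. A rough sketch that assumes the rebuild holds the given KiB/s across the whole disk, which real rebuilds under load often do not, so treat the result as a lower bound; rebuild_hours is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: hours to rebuild one member at a constant rate.
# Usage: rebuild_hours MEMBER_SIZE_BYTES RATE_KIB_PER_S
rebuild_hours() {
  awk -v bytes="$1" -v kib="$2" \
    'BEGIN { printf "%.1f\n", bytes / (kib * 1024) / 3600 }'
}

rebuild_hours 16000000000000 100000   # 16 TB member at the persisted floor -> 43.4
rebuild_hours 16000000000000 200000   # same member at the urgent floor     -> 21.7
```

Doubling the floor roughly halves the exposure window, which is the whole argument for raising it while an array is degraded.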

You can also raise the stripe cache for parity arrays (RAID 5/6) to help rebuild and write throughput. It is a per-array setting that costs RAM and does not survive a reboot:

echo 8192 > /sys/block/md0/md/stripe_cache_size
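
Before raising that value, check what it costs. The kernel's md documentation gives the memory consumed as page size times number of member disks times stripe_cache_size. The sketch below applies that formula; the 4096-byte page size is an assumption (confirm with getconf PAGESIZE) and stripe_cache_mib is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: RAM used by the stripe cache, in MiB.
# Usage: stripe_cache_mib ENTRIES NDISKS [PAGE_BYTES]
stripe_cache_mib() {
  local entries=$1 disks=$2 page=${3:-4096}   # 4096 is an assumed page size
  echo $(( entries * disks * page / 1024 / 1024 ))
}

stripe_cache_mib 8192 4    # the setting above on the 4-disk array -> 128
stripe_cache_mib 8192 12   # same setting on a 12-disk array       -> 384
```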

4. Configuring hot spares and spare-group sharing

A hot spare is an idle member md automatically pulls in the instant another member fails, starting the rebuild without human intervention. For any array you cannot babysit, a spare is the difference between “degraded for an hour” and “degraded until someone notices.”

Add a spare to an existing array — md parks it as a spare because the array already has its full complement of active devices:

mdadm --add /dev/md0 /dev/sdf
mdadm --detail /dev/md0 | grep -E 'State|spare'

You will see the new disk listed with state spare. If an active member now fails, md immediately promotes /dev/sdf and begins rebuilding onto it.

A single spare disk does not have to be dedicated to one array. Spare groups let several arrays share a pool: mdadm --monitor will move a spare (logically, by removing it from one array and adding it to another) from an array that has one to give to one that just went degraded and has none. Declare the group in mdadm.conf:

ARRAY /dev/md0 spare-group=pool1 UUID=3a9f...:...
ARRAY /dev/md1 spare-group=pool1 UUID=7b2e...:...

With both arrays in pool1 and the monitor daemon running (section 7), a failure on md1 that has no spare will cause mdadm --monitor to migrate an unused spare from md0. This is how you cover a rack of arrays with fewer spare disks than arrays.
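
A group only helps if more than one array is in it, and a typo in the group name silently splits the pool. A quick audit, sketched with a hypothetical spare_groups function, lists each spare-group found in the config with the arrays that belong to it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<group>: <array> <array> ..." per spare-group.
# Usage: spare_groups /etc/mdadm.conf   (or /etc/mdadm/mdadm.conf)
spare_groups() {
  awk '$1 == "ARRAY" {
         for (i = 3; i <= NF; i++)
           if ($i ~ /^spare-group=/) {
             group = substr($i, 13)                  # text after "spare-group="
             members[group] = members[group] " " $2  # $2 is the array device
           }
       }
       END { for (g in members) print g ":" members[g] }' "$1"
}
```

A group that lists a single array is a sign of a misspelled name or a missing ARRAY line.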

5. Simulating disk failure and replacing a failed member

You should rehearse the replacement procedure before you need it. md lets you fail a disk on purpose.

# Mark a member as failed
mdadm --manage /dev/md0 --fail /dev/sdc
cat /proc/mdstat

/proc/mdstat now shows the array degraded with an (F) next to the failed device. If you configured a hot spare, a rebuild onto it kicks off automatically and you will see a recovery progress line. Remove the failed disk so you can physically pull it:

mdadm --manage /dev/md0 --remove /dev/sdc
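
The (F) marker is also easy to pick out in a script, which is handy during a drill or as a cron check. A sketch, assuming the usual /proc/mdstat member syntax such as sdc[1](F); failed_members is an illustrative name and the file argument is there so it can be run against a saved copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<array> <device>" for each member flagged (F).
# Usage: failed_members [FILE]     (FILE defaults to /proc/mdstat)
failed_members() {
  awk '/^md[0-9]+ :/ {
         for (i = 3; i <= NF; i++)
           if ($i ~ /\(F\)$/) {
             dev = $i
             sub(/\[.*/, "", dev)      # "sdc[1](F)" -> "sdc"
             print $1, dev
           }
       }' "${1:-/proc/mdstat}"
}
```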

When the replacement drive is installed, wipe and add it. If there was no spare, the add directly triggers the rebuild; if a spare already absorbed the failure, the new disk simply becomes the next spare.

wipefs -a /dev/sdc
mdadm --manage /dev/md0 --add /dev/sdc

There is a better primitive when the failing disk has not died yet but is throwing read errors: --replace. It rebuilds onto a spare while keeping the suspect disk in the array as a read source, so you never drop to a degraded, parity-only state during the copy. This is strictly safer than fail-then-rebuild whenever the old disk is still readable.

# Copy onto a spare while the old disk still helps satisfy reads
mdadm --manage /dev/md0 --replace /dev/sdc
# Optionally name the target explicitly
mdadm --manage /dev/md0 --replace /dev/sdc --with /dev/sdf

Use --replace for proactive swaps of a SMART-warned drive; use --fail/--remove only when the disk is already dead or actively corrupting reads. The whole point of --replace is to avoid the exposure window.
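
That rule is simple enough to put in a runbook helper. The sketch below only prints the commands for a given situation and never runs them; swap_plan and its "readable"/"dead" states are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not execute) the swap commands.
# Usage: swap_plan ARRAY DISK readable|dead [SPARE]
swap_plan() {
  local array=$1 disk=$2 state=$3 spare=${4:-}
  case "$state" in
    readable)   # disk still answers reads: copy off it, no degraded window
      if [ -n "$spare" ]; then
        echo "mdadm --manage $array --replace $disk --with $spare"
      else
        echo "mdadm --manage $array --replace $disk"
      fi ;;
    dead)       # disk is gone or corrupting reads: drop it from the array
      echo "mdadm --manage $array --fail $disk"
      echo "mdadm --manage $array --remove $disk" ;;
    *)
      echo "state must be 'readable' or 'dead'" >&2
      return 1 ;;
  esac
}

swap_plan /dev/md0 /dev/sdc readable /dev/sdf
```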

6. Growing arrays and reshaping RAID levels online

md can grow and reshape arrays while they stay mounted and serving I/O. This is genuinely online — the data is live throughout — but it is also slow and write-intensive, so do it during a maintenance window and make sure you have a current backup first; a reshape rewrites stripe layout across every disk.

Add a disk and grow the array’s device count (here, 4 -> 5 members on a RAID 6):

mdadm --add /dev/md0 /dev/sdf
mdadm --grow /dev/md0 --raid-devices=5

A reshape that adds disks needs scratch space to back up stripes it is rewriting. For risky reshapes, point it at a backup file on a different device so an interruption is recoverable:

mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-reshape.bak

Watch the reshape; it reports a distinct reshape progress line in /proc/mdstat. When it finishes, the array is larger at the md layer but the filesystem still thinks it is the old size — grow the filesystem to claim the new space:

# XFS grows online by mount point
xfs_growfs /srv/data

# ext4 equivalent
resize2fs /dev/md0
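
In a scripted migration you will want to hold the filesystem grow until the reshape line is gone. A minimal sketch, assuming the reshape shows up in /proc/mdstat as "reshape = N%"; reshape_active is an illustrative name and the file argument is only there so the check can be run against a saved copy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) while any reshape is in progress.
# Usage: reshape_active [FILE]     (FILE defaults to /proc/mdstat)
reshape_active() {
  grep -Eq 'reshape *= *[0-9.]+%' "${1:-/proc/mdstat}"
}

# Example use (not executed here):
#   while reshape_active; do sleep 60; done
#   xfs_growfs /srv/data
```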

md can also convert between levels online — for example RAID 5 to RAID 6 by adding a disk and a second parity:

mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-reshape.bak

Level conversions are powerful but the slowest and most fragile operations md offers. Treat them as a planned migration, not a routine tweak, and never let the --backup-file live on the array you are reshaping.

7. Email and scripted monitoring with mdadm --monitor

An array that silently goes degraded is worse than no RAID, because you are now running on the last copy without knowing it. mdadm --monitor polls arrays and fires events (Fail, FailSpare, DegradedArray, SpareActive, RebuildFinished) to email and/or a script.

On a packaged system the monitor already runs as a service; confirm it and set the destination:

# Ensure MAILADDR is set in mdadm.conf (section 2), then:
systemctl enable --now mdmonitor       # RHEL family
systemctl enable --now mdmonitor.service  # Debian/Ubuntu: mdadm service

Send a synthetic test event to prove the mail path works end to end before you rely on it:

mdadm --monitor --scan --test --oneshot

For richer alerting, hand --monitor a program; mdadm invokes it with the event name, the array, and (when relevant) the device as arguments:

mdadm --monitor --scan --program=/usr/local/sbin/md-alert.sh --daemonise

A minimal handler that forwards to your on-call webhook:

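#!/usr/bin/env bash
# /usr/local/sbin/md-alert.sh  (chmod 0755)
# mdadm passes: event name, array device, and optionally the member device.
event="${1:-}"; array="${2:-}"; device="${3:-}"
# ALERT_WEBHOOK must be set in the monitor's environment; fail loudly if not.
: "${ALERT_WEBHOOK:?ALERT_WEBHOOK is not set}"
case "$event" in
  Fail|DegradedArray|FailSpare)
    # Forward only the events worth waking someone for.
    curl -fsS -X POST "$ALERT_WEBHOOK" \
      -H 'Content-Type: application/json' \
      -d "{\"text\":\"md ${event} on ${array} ${device} ($(hostname -f))\"}"
    ;;
esac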

Treat Fail and DegradedArray as paging events. RebuildFinished is informational but worth logging so you can confirm the array returned to clean.

8. Assembling arrays in initramfs and rescue scenarios

The boot story matters most when your root filesystem is on md. The initramfs must contain mdadm, the array’s config, and the assembly logic, or the kernel tries to mount root before the array is assembled and the boot fails.

The config baked into initramfs comes from mdadm.conf, so the section-2 ordering is load-bearing: write the ARRAY line first, then regenerate.

# RHEL family
dracut --force --add mdraid

# Debian/Ubuntu
update-initramfs -u

If you boot to an emergency shell or a rescue ISO and arrays are not present, assemble them by scanning superblocks — no /dev paths or config needed, because md keys on UUID:

mdadm --assemble --scan
cat /proc/mdstat

To assemble one specific array explicitly:

mdadm --assemble /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde

A degraded array will not always auto-assemble, because md conservatively refuses to start an array with members missing. When you are certain the present members are good and the missing one is genuinely gone, force it:

mdadm --assemble --scan --run     # start arrays that have enough members
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd

--force overrides md’s safety check and can update event counts to make a stale member usable. It is the correct tool in a real recovery, but it bypasses the protection that stops you assembling an inconsistent set. Use it deliberately, and never combine --force with a member you suspect is corrupt.
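
One way to be deliberate is to compare event counters before forcing anything: members with the same count were last written together, and one that lags far behind is stale. The sketch below pulls the counters out of mdadm --examine output, assuming its usual "/dev/sdX:" header and "Events : N" lines; event_counts is an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<device> <events>" from --examine output.
# Usage: mdadm --examine /dev/sd[b-e] | event_counts
event_counts() {
  awk '/^\/dev\// { dev = $1; sub(/:$/, "", dev) }   # "/dev/sdb:" -> "/dev/sdb"
       $1 == "Events" && $2 == ":" { print dev, $3 }'
}
```

If every present member reports the same number, a plain --run is usually enough and --force is not needed.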

To take an array down cleanly (for maintenance or to move disks to another host):

umount /srv/data
mdadm --stop /dev/md0

Verify

Confirm the array is in the state you expect before declaring victory.

# 1. Array is clean and all members active (not 'degraded' or 'recovering')
mdadm --detail /dev/md0 | grep -E 'State :|Active Devices|Working Devices|Failed|Spare'

# 2. Live status and any rebuild/reshape progress
cat /proc/mdstat

# 3. Per-disk superblock sanity and matching array UUID
mdadm --examine /dev/sd[b-e] | grep -E 'Array UUID|Device Role|State'

# 4. Config persists and matches the running array
mdadm --detail --scan
grep -E 'ARRAY|MAILADDR|HOMEHOST' /etc/mdadm.conf /etc/mdadm/mdadm.conf 2>/dev/null

# 5. Bitmap is present (look for the 'bitmap' line in detail output)
mdadm --detail /dev/md0 | grep -i bitmap

# 6. Monitoring is live and the mail path works
systemctl is-active mdmonitor 2>/dev/null || systemctl is-active mdadm
mdadm --monitor --scan --test --oneshot

A healthy parity array reports State : clean, Failed Devices : 0, and the expected Active Devices count, with a bitmap line present and no recovery/reshape line in /proc/mdstat.
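
For a cron job or a post-change gate, the first check can be reduced to an exit code. A sketch that reads mdadm --detail output on stdin, assuming its usual "State : ..." and "Failed Devices : N" lines; array_healthy is an illustrative name, and it treats both clean and active as healthy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 only if the array is clean/active with no
# failed devices; otherwise print a one-line reason and exit 1.
# Usage: mdadm --detail /dev/md0 | array_healthy
array_healthy() {
  awk -F ' : ' '
    BEGIN { failed = -1 }                       # -1 = line never seen
    $1 ~ /^ *State$/          { state = $2 }
    $1 ~ /^ *Failed Devices$/ { failed = $2 + 0 }
    END {
      gsub(/^ +| +$/, "", state)                # trim padding around the value
      if ((state == "clean" || state == "active") && failed == 0) exit 0
      print "unhealthy: state=" state " failed=" failed
      exit 1
    }'
}
```

Any state with a qualifier, such as "clean, degraded" or "active, recovering", fails the check by design.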

Enterprise scenario

A media-analytics platform team ran a 12-disk RAID 6 of 16 TB SATA drives as a near-line ingest tier. A drive failed on a Friday; the on-call did the obvious thing — --fail, --remove, swap, --add — and the array dropped into degraded state and began a full parity rebuild. At their configured speed_limit_min, reading 11 surviving 16 TB disks end-to-end was projected at well over a day. Midway through, a second disk threw a cluster of UREs. RAID 6 survived (it tolerates two failures), but it was a near-miss that would have been fatal on RAID 5, and it exposed two process gaps: rebuilds were too slow, and they were dropping to fully-degraded for swaps that did not need to.

The fix was procedural and configurational, not a rebuild of the array. They standardized three things across every md host:

The standing config that codified it:

# Proactive swap on SMART warning: no degraded window
mdadm --manage /dev/md0 --replace /dev/sdd --with /dev/sdm

# Shared spare pool + faster floor, made persistent
cat >> /etc/mdadm.conf <<'EOF'
ARRAY /dev/md0 spare-group=chassis1 UUID=...
ARRAY /dev/md1 spare-group=chassis1 UUID=...
MAILADDR storage-oncall@example.com
EOF
echo 'dev.raid.speed_limit_min = 200000' > /etc/sysctl.d/90-md-rebuild.conf
sysctl --system

The result was not a faster array — it was a smaller blast radius. Proactive --replace eliminated most degraded windows entirely, the shared spare turned every reactive failure into an automatic rebuild, and the paging rule meant nobody discovered a degraded array hours late again.

