Advanced LVM: Thin Provisioning, Snapshots, and Cache Pools

Classic (“thick”) LVM allocates every extent up front, so a 500 GiB logical volume consumes 500 GiB the moment you create it. That is wasteful when you run dozens of volumes that are mostly empty, and its old copy-on-write snapshots are a footgun: size them too small and they invalidate silently the instant they fill. Thin provisioning fixes both. Blocks are allocated on first write from a shared pool, snapshots cost only the metadata until something diverges, and dm-cache lets you front a slow spinning volume with a fast NVMe tier. The price is that you can now overcommit yourself into an outage. This guide covers the mechanics and the guardrails.

Everything below assumes lvm2 2.03+ and a kernel with dm-thin-pool and dm-cache (any current RHEL 8/9, Debian 12, or Ubuntu 22.04+). Confirm the tooling and modules first:

lvm version | head -1
modprobe -a dm_thin_pool dm_cache   # -a loads every named module; without it, dm_cache is parsed as a module parameter
lsmod | grep -E 'dm_thin_pool|dm_cache'

1. LVM layers: PVs, VGs, LVs, and the thin model

LVM stacks three abstractions. Physical volumes (PVs) are raw block devices (a disk, a partition, an mdraid array, a multipath device) initialized with an LVM header. Volume groups (VGs) pool one or more PVs into a single allocation arena carved into fixed-size physical extents (PEs), 4 MiB by default. Logical volumes (LVs) are what you format and mount; they are sequences of extents mapped onto the VG.

Thin provisioning inserts a fourth concept. Instead of mapping an LV directly onto VG extents, you create a thin pool — itself a special LV — and then create thin LVs that draw their blocks from that pool on demand. The pool has two internal components: a large data LV holding actual blocks, and a small metadata LV that records which pool block backs which logical address for every thin volume and snapshot.

# Initialize PVs and build a VG with a 4 MiB extent size (the default)
pvcreate /dev/sdb /dev/sdc
vgcreate vg_data /dev/sdb /dev/sdc
vgdisplay vg_data | grep -E 'PE Size|Total PE|Free  PE'

The mental model that prevents disasters: a thin pool can promise more space than it physically holds. Ten 100 GiB thin volumes on a 500 GiB pool is legal and useful — as long as their combined actual usage stays under 500 GiB. When the pool’s data hits 100%, writes to every thin volume on it block or fail. That is the central risk you are managing for the rest of this article.

2. Creating thin pools and sizing the metadata volume

You can create the pool in one command and let LVM auto-size metadata, but on anything production-grade you should size metadata deliberately. Metadata exhaustion is just as fatal as data exhaustion, and it is harder to grow under pressure.

# Create a 400 GiB thin pool; LVM auto-calculates a metadata LV
lvcreate --type thin-pool -L 400G -n thinpool vg_data

The metadata requirement scales with pool size and chunk size (the pool’s allocation granularity, distinct from the VG extent size). The rule of thumb is roughly pool_size / chunk_size * 64 bytes. With the default 64 KiB chunk, a 400 GiB pool needs on the order of 400 MiB of metadata; bump it for safety and to leave room for many snapshots. To control both explicitly:

# Explicit chunk size, explicit metadata size, then assemble the pool.
# Larger chunks = less metadata + better sequential throughput,
# but coarser snapshot CoW (more write amplification on small writes).
# LV names containing _tdata or _tmeta are reserved, so use plain names;
# lvconvert creates the hidden thinpool_tdata/thinpool_tmeta sub-LVs itself.
lvcreate -L 400G -n thinpool      vg_data
lvcreate -L 1G   -n thinpool_meta vg_data
lvconvert --type thin-pool --chunksize 128K \
  --poolmetadata vg_data/thinpool_meta \
  vg_data/thinpool

Choosing a chunk size is a real tradeoff. Small chunks (64 KiB) keep snapshots cheap because copy-on-write duplicates less data per changed block, but they balloon metadata and add per-IO overhead. Large chunks (256 KiB to 1 MiB) suit big sequential workloads and database files but waste space when snapshots diverge by tiny amounts. For mixed general-purpose storage, 128 KiB is a defensible default.
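
To see how chunk size moves the metadata requirement, the rule of thumb from this section can be run as plain shell arithmetic. This is a rough sketch, not LVM's own sizing logic: thin_meta_estimate_mib is a hypothetical helper name, and the 64 bytes per chunk figure is only the approximation quoted above.

```shell
#!/bin/sh
# Estimate thin-pool metadata size with the ~64 bytes/chunk rule of thumb.
# Usage: thin_meta_estimate_mib <pool_size_GiB> <chunk_size_KiB>
# (hypothetical helper; an approximation, not what lvcreate computes)
thin_meta_estimate_mib() {
    pool_gib=$1
    chunk_kib=$2
    # Number of chunks the pool is divided into.
    chunks=$(( pool_gib * 1024 * 1024 / chunk_kib ))
    # 64 bytes of mapping metadata per chunk.
    bytes=$(( chunks * 64 ))
    # Round up to a whole MiB.
    echo $(( (bytes + 1048575) / 1048576 ))
}

thin_meta_estimate_mib 400 64    # prints 400
thin_meta_estimate_mib 400 128   # prints 200
```

Doubling the chunk size halves the estimate, which is why the 1 GiB metadata LV used above leaves generous headroom for snapshots at 128 KiB chunks.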

Always carry a metadata spare. lvcreate and lvconvert create a hidden pool metadata spare LV ([lvol0_pmspare]) by default when they build a pool (--poolmetadataspare y), so lvconvert --repair has somewhere to write a rebuilt copy. Do not delete it to reclaim a gigabyte.

3. Provisioning thin volumes and monitoring overcommit

Now carve thin volumes out of the pool. Note the -V (virtual size) flag — that is the size the volume advertises, not what it consumes.

# Three thin volumes, each advertising 200 GiB, from a 400 GiB pool.
# Total virtual = 600 GiB on 400 GiB physical: 1.5x overcommit.
lvcreate --type thin -V 200G --thinpool thinpool -n app01 vg_data
lvcreate --type thin -V 200G --thinpool thinpool -n app02 vg_data
lvcreate --type thin -V 200G --thinpool thinpool -n db01  vg_data

mkfs.ext4 /dev/vg_data/app01
mount /dev/vg_data/app01 /srv/app01

The numbers you must watch are the pool’s Data% and Meta%. The pool row aggregates real consumption across every thin LV and snapshot it backs:

lvs -o name,lv_size,data_percent,metadata_percent,pool_lv vg_data

#   LV        LSize    Data%  Meta%  Pool
#   thinpool  400.00g  41.27  3.10
#   app01     200.00g  18.44         thinpool
#   app02     200.00g  62.10         thinpool
#   db01      200.00g  43.05         thinpool

Here the per-volume Data% is the fraction of each thin LV’s virtual size that has been written, while the pool’s Data% (41.27) is what actually matters for survival — it is the share of physical pool capacity in use. The per-LV percentages can each look modest while the pool quietly approaches full. Alert on the pool row, not the volumes. For ext4 and xfs, enable discard/trim so deleting files actually returns blocks to the pool (fstrim on a timer is gentler than the discard mount option for most workloads).

# Return freed blocks to the pool weekly rather than inline.
systemctl enable --now fstrim.timer
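
The overcommit ratio worked out by hand earlier (600 GiB virtual on 400 GiB physical) can also be computed from lvs output. The following is a minimal sketch: overcommit_ratio is a hypothetical helper, and it assumes three whitespace-separated columns (LV name, size in bytes, pool name) as produced by the lvs invocation shown in the trailing comment.

```shell
#!/bin/sh
# overcommit_ratio POOL
# stdin: rows of "lv_name size_bytes [pool_lv]"
# Prints (sum of virtual sizes in POOL) / (POOL's physical size).
# Hypothetical helper; snapshots count because they also list POOL as pool_lv.
overcommit_ratio() {
    pool=$1
    awk -v pool="$pool" '
        $1 == pool { phys = $2 }      # the pool row itself (pool_lv column empty)
        $3 == pool { virt += $2 }     # thin LVs and snapshots drawn from the pool
        END {
            if (phys > 0) { printf "%.2f\n", virt / phys }
            else { print "pool not found: " pool > "/dev/stderr"; exit 1 }
        }'
}

# Sample data matching the three 200 GiB volumes on a 400 GiB pool.
printf '%s\n' \
    'thinpool 429496729600' \
    'app01 214748364800 thinpool' \
    'app02 214748364800 thinpool' \
    'db01 214748364800 thinpool' | overcommit_ratio thinpool   # prints 1.50

# Against a live system (needs root):
#   lvs --noheadings --units b --nosuffix -o lv_name,lv_size,pool_lv vg_data \
#     | overcommit_ratio thinpool
```

A ratio is only a planning number. The pool's Data% remains the value to alert on, since it reflects blocks actually written.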

4. Taking, merging, and rolling back thin snapshots

Thin snapshots are nearly free at creation: they share all blocks with their origin and only allocate new pool blocks where origin or snapshot subsequently diverges. Unlike legacy LVM snapshots, they never need a pre-sized CoW area, and they do not silently break when “full” — they simply consume pool space like any other thin volume.

# Snapshot of a live volume. Thin snapshots are created with the activation-skip
# flag set, so a normal activation passes over them; -kn clears the flag.
lvcreate --snapshot --name app01_snap_pre_deploy vg_data/app01
lvchange -kn -ay vg_data/app01_snap_pre_deploy

To roll the origin back to the snapshot’s state, merge it. The merge of an in-use origin is deferred until the next deactivation/reactivation, so for a mounted root or busy volume you unmount (or reboot) to let it complete.

# Unmount the origin first if possible, then merge.
umount /srv/app01
lvconvert --merge vg_data/app01_snap_pre_deploy
# Origin now reflects the snapshot's contents; the snapshot LV is consumed.
mount /dev/vg_data/app01 /srv/app01

A snapshot is also a clean source for backups: snapshot, mount read-only, back up the stable point-in-time image, then drop it. Crucially, snapshots count against pool capacity. A long-lived snapshot of a write-heavy volume can consume as much pool space as a full second copy. Treat them as time-boxed, and reap them:

mount -o ro /dev/vg_data/app01_snap_pre_deploy /mnt/snap
# ... back up /mnt/snap ...
umount /mnt/snap
lvremove -y vg_data/app01_snap_pre_deploy
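
Time-boxing is easier to enforce with a small reaper. The sketch below only selects candidates; it is not a complete tool. stale_snapshots is a hypothetical helper, the *_snap_* naming pattern is an assumption borrowed from the example name above, and the input is assumed to be "name creation_epoch" pairs that you produce yourself (for instance by converting the lv_time field with date).

```shell
#!/bin/sh
# stale_snapshots MAX_AGE_SECONDS NOW_EPOCH
# stdin: rows of "lv_name creation_epoch"
# Prints names that match the snapshot naming convention and exceed MAX_AGE.
# Hypothetical helper; the *_snap_* pattern is an assumed convention.
stale_snapshots() {
    max_age=$1
    now=$2
    while read -r name created; do
        [ -n "$created" ] || continue        # skip blank or malformed rows
        case $name in
            *_snap_*) ;;                     # looks like a snapshot, keep checking
            *) continue ;;                   # never touch origin volumes
        esac
        if [ $(( now - created )) -gt "$max_age" ]; then
            echo "$name"
        fi
    done
    return 0
}

# Sample run: at t=10000 with a one-hour limit, only the old snapshot is listed.
printf '%s\n' \
    'app01 100' \
    'app01_snap_pre_deploy 100' \
    'app01_snap_fresh 9000' | stale_snapshots 3600 10000
# prints app01_snap_pre_deploy

# Feeding the result to lvremove is then one loop (review the list first):
#   ... | stale_snapshots 3600 "$(date +%s)" \
#       | while read -r lv; do lvremove -y "vg_data/$lv"; done
```

Selecting by name pattern rather than by "is a thin LV" is deliberate: it keeps a reaper bug from ever deleting an origin volume.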

5. Accelerating slow volumes with dm-cache and cachepools

dm-cache puts a fast device (NVMe/SSD) in front of a slow origin LV (HDD, network-backed, or RAID) and promotes hot blocks transparently. You build a cache pool on the fast PV — data plus its own metadata — then attach it to the origin. The result is one cached LV; the application sees no change.

# Add the fast device to the VG, then build a cache pool on it.
vgextend vg_data /dev/nvme0n1
lvcreate --type cache-pool -L 64G \
  --poolmetadatasize 128M \
  --cachemode writethrough \
  -n cachepool vg_data /dev/nvme0n1

# Attach the cache pool. LVM cannot cache an individual thin LV such as
# db01, so cache the thin pool instead (this caches its hidden data sub-LV).
lvconvert --type cache \
  --cachepool vg_data/cachepool \
  vg_data/thinpool

Mode choice is a durability decision. writethrough (the safe default) acknowledges a write only after it lands on the slow origin, so a failed cache device never loses acknowledged data: you get read acceleration with no write risk. writeback acknowledges from the fast device and destages later, which is far faster for write-heavy workloads but means a dead single cache device can lose unflushed data; only run it on a redundant (mirrored) cache. Inspect and detach as follows:

# -a includes the hidden [thinpool_tdata] sub-LV that carries the cache.
lvs -a -o name,cache_mode,cache_total_blocks,cache_used_blocks,cache_dirty_blocks vg_data
# Splitting flushes dirty blocks to origin and removes the cache cleanly.
lvconvert --splitcache vg_data/thinpool

Detach the cache before operations like pvmove of the origin’s slow device, and always --splitcache (which flushes) rather than yanking the NVMe. With writeback, an unflushed detach is data loss.

6. Online extend, reduce, and pvmove migrations

LVM’s headline feature is doing all of this live. Growing is always safe; shrinking is risky and must follow the filesystem order strictly.

# Grow an LV and its ext4/xfs filesystem in one shot.
lvextend -L +50G --resizefs vg_data/app02

--resizefs calls resize2fs (ext4) or xfs_growfs (xfs) for you. XFS cannot shrink at all — plan capacity accordingly. To reduce an ext4 volume you must shrink the filesystem first, then the LV, or you truncate live data:

# REDUCE (ext4 only): shrink FS first, then the LV. -r does both safely.
umount /srv/app01
e2fsck -f /dev/vg_data/app01
lvreduce -r -L 120G vg_data/app01   # -r resizes FS before shrinking the LV

To evacuate a failing disk with zero downtime, pvmove relocates extents to other PVs in the VG while the volumes stay mounted. It is restartable if interrupted.

# Move everything off a dying PV onto the rest of the VG, then remove it.
pvmove -i 5 /dev/sdc        # -i 5 prints progress every 5 seconds
vgreduce vg_data /dev/sdc
pvremove /dev/sdc
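
pvmove can only succeed if the remaining PVs have room for everything on the disk being evacuated, so a quick pre-check is worth running first. The following is a sketch under stated assumptions: can_evacuate is a hypothetical helper, it reads "pv_name total_extents allocated_extents" rows, and it compares raw extent counts only, ignoring allocation policies and tags that may further restrict placement.

```shell
#!/bin/sh
# can_evacuate SOURCE_PV
# stdin: rows of "pv_name pe_total pe_allocated" for every PV in the VG
# Exit 0 if the other PVs have enough free extents to absorb SOURCE_PV,
# 1 if they do not, 2 if SOURCE_PV is not in the input.
# Hypothetical helper; a coarse check, not a guarantee that pvmove succeeds.
can_evacuate() {
    src=$1
    awk -v src="$src" '
        $1 == src { need = $3; found = 1; next }   # extents that must move
        { free += $2 - $3 }                        # free extents elsewhere
        END {
            if (!found) { print "unknown PV: " src; exit 2 }
            if (free >= need) { print "ok: need " need ", free " free; exit 0 }
            print "short: need " need ", free " free
            exit 1
        }'
}

# Sample: two equal PVs, the source holding 12000 allocated extents.
printf '%s\n' \
    '/dev/sdb 25599 10000' \
    '/dev/sdc 25599 12000' | can_evacuate /dev/sdc
# prints: ok: need 12000, free 15599

# Against a live system (needs root):
#   pvs --noheadings -o pv_name,pv_pe_count,pv_pe_alloc_count -S vg_name=vg_data \
#     | can_evacuate /dev/sdc
```

If the check comes back short, add a replacement PV with vgextend before starting the move.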

Note: thin pool data and metadata LVs can be extended online (lvextend vg_data/thinpool), but you cannot shrink a thin pool. Size pool growth, not pool shrinkage, into your runbooks.

7. Recovering a full thin pool and emergency growth

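This is the incident that wakes people up. When a thin pool’s Data% reaches 100%, writes that need a new block are queued (for 60 seconds by default) and then fail on every thin volume it backs; when Meta% reaches 100%, the pool stops accepting changes. Either way, filesystems may remount read-only. Diagnose immediately: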

lvs -o name,data_percent,metadata_percent,lv_health_status vg_data
dmesg | grep -i thin   # look for "out-of-data-space" / "metadata exhaustion"

The fastest non-destructive fix is to add capacity to the pool — assuming the VG has free extents or you can add a PV in seconds:

# Emergency: extend pool data (and/or metadata) to relieve pressure.
vgextend vg_data /dev/sdd          # if no free extents remain
lvextend -L +100G vg_data/thinpool                    # grow data area
lvextend --poolmetadatasize +512M vg_data/thinpool    # grow metadata

If the pool is wedged, an offline metadata check often clears a transactional inconsistency. Deactivate the pool, then repair:

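# Unmount every filesystem in the VG first; deactivation fails on open LVs.
vgchange -an vg_data
lvconvert --repair vg_data/thinpool   # writes repaired metadata onto the spare and swaps it in
vgchange -ay vg_data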

The fastest destructive relief is deleting expendable snapshots, which immediately returns their diverged blocks to the pool — which is exactly why a runaway snapshot is the most common root cause of a full pool. Reap stale snapshots before reaching for new disks.

8. Automating thin-pool alerting and autoextend thresholds

You should never discover a full pool from dmesg. LVM’s dmeventd monitoring can automatically grow a pool when it crosses a threshold. Configure it in /etc/lvm/lvm.conf:

# /etc/lvm/lvm.conf  ->  activation { ... }
# When the pool exceeds 80% used, grow it by 20% of current size.
# Set the threshold to 100 to disable autoextend (monitoring/warnings still fire).
thin_pool_autoextend_threshold = 80
thin_pool_autoextend_percent = 20
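
Because each autoextend step is a percentage of the current size, the steps compound, and the free space the VG must hold in reserve grows with them. A small sketch of that arithmetic follows; autoextend_steps is a hypothetical helper, sizes are in MiB, and real LVM rounds each step to whole extents, so treat the numbers as approximate.

```shell
#!/bin/sh
# autoextend_steps START_MIB PERCENT STEPS
# Prints the growth and resulting pool size for each successive autoextend.
# Hypothetical helper; integer arithmetic, no extent rounding.
autoextend_steps() {
    size=$1
    pct=$2
    steps=$3
    i=0
    while [ "$i" -lt "$steps" ]; do
        grow=$(( size * pct / 100 ))     # percent of the *current* size
        size=$(( size + grow ))
        i=$(( i + 1 ))
        echo "step $i: +${grow} MiB -> ${size} MiB"
    done
}

# A 400 GiB (409600 MiB) pool with thin_pool_autoextend_percent = 20:
autoextend_steps 409600 20 2
# step 1: +81920 MiB -> 491520 MiB
# step 2: +98304 MiB -> 589824 MiB
```

Two steps already need 176 GiB of free extents in the VG, which is the headroom to budget for if autoextend is the safety net.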

Autoextend only works while the VG has free extents and the pool is monitored, so confirm monitoring is on and pair it with external alerting that fires before the autoextend threshold:

# Ensure the pool is monitored by dmeventd.
lvchange --monitor y vg_data/thinpool

# Independent alert at 75% so a human reacts before autoextend at 80%.
USED=$(lvs --noheadings -o data_percent vg_data/thinpool | tr -d ' %' | cut -d. -f1)
[ "${USED:-0}" -ge 75 ] && echo "WARN: vg_data/thinpool data ${USED}%" | logger -t lvm-watch

Wire that check into a systemd timer or your monitoring agent (a Prometheus node_exporter textfile collector reading lvs works well), and alert on both data_percent and metadata_percent. Metadata exhaustion gets forgotten precisely because the data area looks healthy.
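
A check that covers both metrics can be kept very small. The sketch below classifies a pool by whichever of data or metadata usage is worse. pool_status is a hypothetical helper; the 75% warning level mirrors the alert above, while the 90% critical level is an assumed value to tune for your environment.

```shell
#!/bin/sh
# pool_status DATA_PCT META_PCT [WARN] [CRIT]
# Prints OK, WARN, or CRIT based on the higher of the two percentages.
# Hypothetical helper; WARN defaults to 75, CRIT to an assumed 90.
pool_status() {
    data=${1:-0}
    meta=${2:-0}
    warn=${3:-75}
    crit=${4:-90}
    data=${data%%.*}                 # drop the fractional part: 41.27 -> 41
    meta=${meta%%.*}
    worst=$data
    if [ "$meta" -gt "$worst" ]; then
        worst=$meta                  # metadata can be the tighter constraint
    fi
    if [ "$worst" -ge "$crit" ]; then
        echo CRIT
    elif [ "$worst" -ge "$warn" ]; then
        echo WARN
    else
        echo OK
    fi
}

pool_status 41.27 3.10    # prints OK
pool_status 60.00 76.50   # prints WARN (metadata is the problem)
pool_status 91.20 10.00   # prints CRIT

# Against a live pool (needs root):
#   set -- $(lvs --noheadings -o data_percent,metadata_percent vg_data/thinpool)
#   pool_status "$1" "$2"
```

Taking the worse of the two values means a single alert rule covers the case where the data area looks healthy but metadata is nearly exhausted.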

Verify

Run these after standing up a pool to confirm the whole stack is sound:

# 1. Pool exists, is a thin-pool, and reports sane Data%/Meta%.
lvs -a -o name,lv_attr,lv_size,data_percent,metadata_percent vg_data

# 2. Metadata spare LV is present (needed for --repair).
lvs -a vg_data | grep -i pmspare

# 3. Pool is monitored by dmeventd (the Monitor column should read 'monitored';
#    lv_attr has no monitoring bit).
lvs -o name,seg_monitor vg_data/thinpool

# 4. Autoextend thresholds are loaded as configured.
lvmconfig activation/thin_pool_autoextend_threshold \
          activation/thin_pool_autoextend_percent

# 5. Any cached LV shows cache hit/miss counters and used blocks accumulating.
lvs -a -o name,cache_read_hits,cache_read_misses,cache_used_blocks,cache_dirty_blocks vg_data

# 6. Snapshot create -> remove cycle works (merge is not exercised here).
lvcreate -s -n _vtest vg_data/app01 && lvremove -y vg_data/_vtest

You want: an lv_attr starting with t for the pool (twi-...), non-100% Data%/Meta%, a visible [lvol*_pmspare], monitoring enabled, and your configured autoextend values echoed back.

Enterprise scenario

A fintech platform team ran a multi-tenant PostgreSQL fleet on a single LVM thin pool to claw back capacity from databases that provisioned 500 GiB but used 80. The constraint: a nightly pg_basebackup plus per-tenant LVM snapshots for fast restore testing. One quarter-end, a batch job rewrote a large partitioned table across most tenants at once. CoW divergence on dozens of snapshots, plus the write surge, drove the pool from 70% to 100% in under twenty minutes. Every tenant’s filesystem remounted read-only simultaneously — a fleet-wide outage from a single shared pool.

The fix had three parts. First, fail safe: they set the pool to error rather than queue on exhaustion so the blast radius was a clean read-only mount instead of indefinite hangs and corruption.

lvchange --errorwhenfull y vg_data/thinpool

Second, defense in depth on growth: autoextend at 80% with headroom always kept free in the VG, plus an independent alert at 70% that pages a human, because autoextend silently does nothing once the VG runs out of extents.

# lvm.conf
thin_pool_autoextend_threshold = 80
thin_pool_autoextend_percent = 25

Third, operational discipline: restore-test snapshots became strictly time-boxed (created, validated, dropped within an hour by a cron-driven reaper) so divergent CoW blocks could never accumulate unbounded. They also capped overcommit per pool at 1.5x rather than the 4x that had crept in, and split the largest tenants onto their own pools so one noisy database could no longer starve the rest. The pool has not filled since.
