Unpatched Linux boxes do not get compromised because nobody knew about the CVE. They get compromised because patching was a manual ritual that someone deprioritized for three sprints. The fix is not “patch more often” by willpower; it is a pipeline where applying updates is the default, reboots are orchestrated rather than feared, and a bad package can be rolled back without a war room. This walkthrough builds that pipeline on RHEL-family (dnf) and Debian-family (apt) hosts, layers in kernel live patching to shrink the reboot window, and wires up staged rings with drain-and-reboot automation.
Everything below assumes a modern fleet: RHEL/Rocky/AlmaLinux 8/9 or Fedora on the dnf side, Ubuntu 20.04+ or Debian 11+ on the apt side. Adjust package names if you are still on yum-era systems, but the strategy holds.
1. Decide the strategy before touching a config
Two decisions drive everything else: what you apply automatically, and how often.
On the what axis, the only safe default for unattended application is security updates only. Full updates pull in feature changes, behavioural shifts, and the occasional regression that you do not want landing at 03:00 with no human watching. Reserve full-package upgrades for a windowed, ring-gated process.
On the cadence axis, separate two concerns that people conflate:
| Concern | Recommended cadence | Notes |
|---|---|---|
| Metadata refresh + download | Daily | Cheap, idempotent, keeps the cache warm. |
| Security update application | Daily on lower rings, weekly on production | Decoupled from reboot. |
| Reboot for kernel/glibc/systemd | Weekly to monthly, ring-staged | Live patching extends this. |
| Full updates | Monthly maintenance window | Manual approval, tested in lower rings first. |
The single most important architectural choice: applying a package and rebooting into it are different events. A patch pipeline that couples them inherits the worst of both: either you reboot too aggressively or you never apply security fixes. Keep them separate and orchestrate the reboot explicitly (step 8).
2. Configure dnf-automatic safely (RHEL family)
dnf-automatic is the supported unattended-update mechanism on RHEL-family systems. Install it and edit one config file:
sudo dnf install -y dnf-automatic
sudo vi /etc/dnf/automatic.conf
The defaults download but do not apply. For a security-only, apply-on-its-own posture:
[commands]
# Only security-flagged updates. The other option is "default" (everything).
upgrade_type = security
# Apply downloaded updates (implies download_updates). With "no", download_updates picks download-only vs. notify-only.
apply_updates = yes
# Do NOT reboot from here. Reboot is orchestrated separately.
reboot = never
[emitters]
emit_via = stdio
[base]
# Options in [base] override dnf.conf. dnf-automatic already runs non-interactively; this just makes it explicit.
assumeyes = True
dnf-automatic ships several systemd timers (dnf-automatic-notifyonly.timer and dnf-automatic-download.timer also exist). The two that matter here are not the same, so use the right one:
dnf-automatic.timer: honours apply_updates/download_updates from the config. This is the one you want.
dnf-automatic-install.timer: forces install regardless of config. Avoid; it bypasses your apply_updates gate.
sudo systemctl enable --now dnf-automatic.timer
systemctl list-timers dnf-automatic.timer
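The shipped schedule depends on the package version: older builds fire roughly an hour after boot and then daily, newer ones at a fixed morning time plus a randomized delay. Check yours with systemctl cat dnf-automatic.timer. Either way, pin the schedule and add jitter so a thousand hosts do not hammer the mirror simultaneously: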
sudo systemctl edit dnf-automatic.timer
[Timer]
# Clear the shipped value before setting your own, or both apply.
OnCalendar=
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=45m
Persistent=true
Persistent=true means a host that was powered off at 03:30 runs the job at next boot instead of silently skipping the window.
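systemctl edit is interactive, which does not scale to a fleet. The same override can be written as a drop-in file from a script or config-management run. A minimal sketch, assuming the standard drop-in location and the override.conf file name that systemctl edit itself uses; write_timer_override is a hypothetical helper name, not part of dnf-automatic:

```shell
#!/usr/bin/env bash
# Sketch: write the timer override as a drop-in instead of using `systemctl edit`.
# write_timer_override is a hypothetical helper; the directory is a parameter so
# the function can be exercised against a scratch directory first.
write_timer_override() {
  local dropin_dir="${1:-/etc/systemd/system/dnf-automatic.timer.d}"
  local on_calendar="${2:-*-*-* 03:30:00}"
  local jitter="${3:-45m}"
  mkdir -p "$dropin_dir"
  # The empty OnCalendar= line resets the shipped triggers before ours is added.
  cat > "$dropin_dir/override.conf" <<EOF
[Timer]
OnCalendar=
OnCalendar=${on_calendar}
RandomizedDelaySec=${jitter}
Persistent=true
EOF
}
```

After writing the file on a real host, run sudo systemctl daemon-reload and restart the timer so the new schedule takes effect.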
3. Configure unattended-upgrades safely (Debian family)
The Debian-family equivalent is unattended-upgrades:
sudo apt-get update
sudo apt-get install -y unattended-upgrades apt-listchanges
The behaviour lives in /etc/apt/apt.conf.d/50unattended-upgrades. The key block is the origins allowlist — this is what scopes you to security only. On Ubuntu, keep the security pocket and drop the -updates pocket for unattended runs:
Unattended-Upgrade::Allowed-Origins {
"${distro_id}:${distro_codename}";
"${distro_id}:${distro_codename}-security";
"${distro_id}ESMApps:${distro_codename}-apps-security";
"${distro_id}ESM:${distro_codename}-infra-security";
// "${distro_id}:${distro_codename}-updates"; // leave commented for security-only
};
// Clean up obsolete kernels so /boot does not fill — a real outage cause.
Unattended-Upgrade::Remove-Unused-Kernel-Packages "true";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
// Detect reboot-required but do NOT auto-reboot here.
Unattended-Upgrade::Automatic-Reboot "false";
// Mail on failure if you have a local MTA; otherwise rely on the report (step 9).
Unattended-Upgrade::Mail "";
Unattended-Upgrade::MailReport "on-change";
Enable the activation flags in /etc/apt/apt.conf.d/20auto-upgrades:
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::AutocleanInterval "7";
Validate the config and do a dry run before trusting it:
sudo unattended-upgrade --dry-run --debug
This prints exactly which packages would be touched and which origins were considered. Read it. The apt-daily and apt-daily-upgrade systemd timers drive the schedule; override apt-daily-upgrade.timer the same way as the dnf timer above if you want a fixed window.
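The dry run tells you what would happen today; a static check on the config catches someone uncommenting the -updates pocket later. A rough sketch that lists the origins actually active in the allowlist and fails when any of them ends in -updates. The helper names active_origins and security_only are illustrative, and the parser assumes the one-origin-per-line layout shown above:

```shell
#!/usr/bin/env bash
# Sketch: list origins that are active (not commented out) in an
# unattended-upgrades config. Helper names are hypothetical.
active_origins() {
  local conf="${1:-/etc/apt/apt.conf.d/50unattended-upgrades}"
  awk '
    /Unattended-Upgrade::Allowed-Origins[ \t]*[{]/ { in_block = 1; next }
    in_block && /^[ \t]*[}];/                      { in_block = 0 }
    in_block && /^[ \t]*"/ {
      line = $0
      sub(/^[ \t]*"/, "", line)   # drop leading whitespace and opening quote
      sub(/".*$/, "", line)       # drop closing quote and anything after it
      print line
    }
  ' "$conf"
}

# Exit 0 when no active origin ends in "-updates" (scope is security-only).
security_only() {
  ! active_origins "$1" | grep -q -- '-updates$'
}
```

Run security_only as a CI or config-management assertion so scope drift fails loudly instead of surfacing as an unexpected feature upgrade.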
4. Pin, exclude, and version-lock the packages that must not move
Automation without exclusions is how you take down a database fleet because postgresql-server jumped a minor version overnight. Lock the packages whose upgrade you want to drive by hand.
RHEL family — use the versionlock plugin:
sudo dnf install -y 'dnf-command(versionlock)'
sudo dnf versionlock add postgresql-server kernel
sudo dnf versionlock list
# To later allow a controlled bump:
sudo dnf versionlock delete postgresql-server
You can also exclude packages globally in /etc/dnf/dnf.conf (applies to every dnf run, including manual ones):
[main]
exclude=kernel* kmod-* postgresql*
Excluding kernel* is a deliberate choice some shops make: it forces the kernel onto a separate, reboot-aware track and prevents dnf-automatic from staging a new kernel you have not scheduled a reboot for. Pair it with live patching (step 5) so you stay protected in the meantime.
Debian family — use apt-mark hold:
sudo apt-mark hold postgresql-16 linux-image-generic
apt-mark showhold
# Release when ready:
sudo apt-mark unhold postgresql-16
Held packages are skipped by unattended-upgrades automatically. For finer control (pin a specific version or a whole origin to a priority), drop a file in /etc/apt/preferences.d/:
Package: postgresql-16
Pin: version 16.3*
Pin-Priority: 1001
A priority above 1000 forces apt to hold at or downgrade to that version — stronger than a hold, which only freezes.
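Holds are only useful if they are still in place when the timer fires. One way to check is to compare the hold list against the packages you expect to be frozen. A small sketch, assuming the listing arrives on stdin with one package name per line, as apt-mark showhold prints it; missing_holds is a hypothetical helper name. dnf versionlock list prints full version patterns rather than bare names, so it would need a prefix match instead:

```shell
#!/usr/bin/env bash
# Sketch: report expected holds that are missing from a hold listing on stdin.
# missing_holds is a hypothetical helper; it returns 1 if any hold is absent.
missing_holds() {
  local held missing=0 pkg
  held="$(cat)"
  for pkg in "$@"; do
    # -x: whole-line match, -F: treat the package name as a literal string.
    if ! printf '%s\n' "$held" | grep -qxF -- "$pkg"; then
      echo "NOT HELD: $pkg"
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use would be apt-mark showhold | missing_holds postgresql-16 linux-image-generic as a pre-flight gate before a ring is allowed to apply.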
5. Live patch the kernel to shrink the reboot window
Live patching applies critical kernel security fixes to the running kernel without a reboot, buying you time between your unhurried reboot windows. It does not replace reboots — a live patch covers a specific set of CVEs, and you still reboot to land the on-disk kernel and everything live patching cannot touch.
RHEL family (kpatch) — Red Hat ships kernel live patches as kpatch-patch packages keyed to a specific kernel build:
sudo dnf install -y kpatch kpatch-dnf
# Opt this kernel (and future minor updates) into automatic live patching:
sudo dnf kpatch auto
# Inspect what is loaded:
sudo kpatch list
dnf kpatch auto is the important command: it installs the matching kpatch-patch for your current kernel and ensures the right patch is pulled whenever the kernel updates, so the running kernel stays covered until you reboot.
Ubuntu (Canonical Livepatch) — bundled with Ubuntu Pro (free for personal/small-scale use). Attach the machine and enable the service:
sudo pro attach <your-token>
sudo pro enable livepatch
canonical-livepatch status --verbose
canonical-livepatch status reports the patch level and which CVEs are covered on the running kernel. When it reports that a kernel upgrade or reboot is required, or that the running kernel is no longer supported for patching, that is your signal to schedule the reboot. The exact field names and state strings vary between client versions, so match on them loosely in automation.
Live patching is a bridge, not a destination. Track “days since reboot” as a fleet metric. A host that has been live-patched for 90 days has accumulated on-disk changes (glibc, systemd, openssl) that only a reboot or service restart activates. Live patching covers the kernel; it does nothing for userspace.
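The "days since reboot" figure can be derived on any Linux host from /proc/uptime, whose first field is seconds since boot. A minimal sketch; days_since_boot is a hypothetical helper name, and the file path is a parameter only so the function can be tested against a fixture:

```shell
#!/usr/bin/env bash
# Sketch: whole days since boot, from the first field of /proc/uptime.
# days_since_boot is a hypothetical helper name.
days_since_boot() {
  local uptime_file="${1:-/proc/uptime}"
  local seconds
  # /proc/uptime looks like "350735.47 234388.90"; keep the first field only.
  read -r seconds _ < "$uptime_file"
  # Strip the fractional part, then integer-divide by seconds per day.
  echo $(( ${seconds%%.*} / 86400 ))
}
```

If you already run node-exporter, its node_boot_time_seconds metric carries the same information and may be the simpler source for dashboards.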
6. Detect reboot-required state across distributions
You cannot orchestrate a reboot you cannot detect. The signal differs by distro, so normalize it.
Debian family is easy — the package system drops a flag file:
# Present == reboot required. The companion file lists the responsible packages.
test -f /var/run/reboot-required && echo "REBOOT REQUIRED"
cat /var/run/reboot-required.pkgs 2>/dev/null
RHEL family has no flag file by default; use dnf needs-restarting:
sudo dnf install -y dnf-utils
# Exit code 1 == reboot recommended (core libs/kernel changed), 0 == not needed.
dnf needs-restarting --reboothint
--reboothint specifically checks whether the running kernel or core libraries (glibc, systemd, openssl, dbus) were updated since boot. Wrap both into one portable check your automation can call anywhere:
#!/usr/bin/env bash
# reboot-required.sh -> exits 1 if a reboot is needed, 0 otherwise.
set -euo pipefail
if [ -f /var/run/reboot-required ]; then
  exit 1
fi
if command -v dnf >/dev/null 2>&1; then
  # needs-restarting exits 1 when a reboot is advised.
  dnf needs-restarting --reboothint >/dev/null 2>&1 || exit 1
fi
exit 0
For service restarts (not full reboots) after a library update, needs-restarting without --reboothint lists the running processes linked against deleted/updated libraries. On Debian, checkrestart from the debian-goodies package does the same job.
7. Stage rollout rings and pre-flight package testing
Rings turn “we patched and it broke everything” into “ring 0 caught it and 95% of the fleet never saw it.” Define rings by blast radius, not by team ownership:
| Ring | Population | Window offset | Gate to promote |
|---|---|---|---|
| 0 - canary | 1-2% (non-customer-facing) | Day 0 | No new alerts after 24h. |
| 1 - early | ~10% (internal + low-traffic) | Day 1-2 | Error budget intact, no support tickets. |
| 2 - broad | ~40% | Day 3-5 | Ring 1 clean for 48h. |
| 3 - remainder | rest of fleet | Day 5+ | Manual sign-off. |
Tag hosts so automation can target a ring. With Ansible, an inventory group is enough:
[patch_ring0]
canary-01.prod.internal
canary-02.prod.internal
[patch_ring1]
app-[01:05].prod.internal
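If ring membership lives in an external source of truth, the inventory groups above can be generated rather than hand-edited. A rough sketch, assuming the input is "hostname ring" pairs on stdin (that format, and the helper name inventory_from_tags, are illustrative and not something Ansible defines):

```shell
#!/usr/bin/env bash
# Sketch: turn "hostname ring" pairs on stdin into Ansible INI inventory groups.
# inventory_from_tags is a hypothetical helper; the input format is an assumption.
inventory_from_tags() {
  # Sort by ring number, then hostname, so each ring forms one contiguous group.
  LC_ALL=C sort -k2,2n -k1,1 | awk '
    NR == 1 || $2 != current {
      if (NR > 1) print ""            # blank line between groups
      printf "[patch_ring%s]\n", $2
      current = $2
    }
    { print $1 }
  '
}
```

Keeping the assignment in data means promoting a host between rings is a one-line change that can be reviewed, rather than an edit to the inventory itself.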
Pre-flight testing means: before you let a ring apply, you have already resolved the transaction somewhere and inspected it. dnf can resolve and print the transaction without applying it, and separately download the packages into the cache:
# Resolve only: prints the exact package set and surfaces conflicts, then answers "no".
sudo dnf upgrade --security --assumeno
# Download into the cache without installing (with --assumeno the download would be declined too).
sudo dnf upgrade --security --downloadonly -y
# Then apply from cache during the window (fast, no surprise re-resolution):
sudo dnf upgrade --security --cacheonly -y
On Debian, apt-get -s (simulate) gives the same dry-run transaction. Run it in ring 0 and diff the resulting package list against ring 1 to confirm you are promoting the same set of versions — mirror drift between days is a real cause of “it worked in canary.”
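The ring-to-ring comparison can be automated once each ring's resolved transaction is saved to a file. A minimal sketch, assuming each file holds one "name version" pair per line; how you capture that list is up to you, and ring_drift is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Sketch: print package/version lines that differ between two rings' resolved
# transactions. ring_drift is a hypothetical helper; input format is an assumption.
ring_drift() {
  local sorted_a sorted_b
  sorted_a="$(mktemp)"; sorted_b="$(mktemp)"
  # comm requires sorted input; use the C locale so sort and comm agree on order.
  LC_ALL=C sort -u "$1" > "$sorted_a"
  LC_ALL=C sort -u "$2" > "$sorted_b"
  # -3 hides lines common to both files. Lines only in the first file print
  # unindented; lines only in the second print with a leading tab.
  LC_ALL=C comm -3 "$sorted_a" "$sorted_b"
  rm -f "$sorted_a" "$sorted_b"
}
```

Empty output means both rings resolved the same set. Any output is a reason to pause promotion and find out whether the mirror moved between runs.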
8. Drain nodes and orchestrate the reboot
This is the part most pipelines skip and then page someone over. A bare reboot on a node serving traffic drops connections. Drain first, reboot, verify, then return to service.
For a load-balanced fleet, an Ansible block makes the sequence explicit and bounds concurrency with serial:
- name: Patch and reboot a ring with drain/verify
  hosts: patch_ring1
  serial: 1                 # one host at a time; raise carefully
  become: true
  max_fail_percentage: 0    # stop the whole ring if any host fails
  tasks:
    - name: Apply security updates from cache
      ansible.builtin.dnf:
        name: "*"
        state: latest
        security: true
        cacheonly: true
    - name: Check whether a reboot is required
      ansible.builtin.command: dnf needs-restarting --reboothint
      register: reboot_check
      changed_when: false
      failed_when: false    # exit 1 is "reboot needed", not an error
    - name: Drain, reboot, and wait for the host to return
      when: reboot_check.rc == 1
      block:
        - name: Remove from load balancer / mark node draining
          ansible.builtin.command: /usr/local/bin/lb-drain.sh "{{ inventory_hostname }}"
        - name: Reboot and block until SSH + systemd are back
          ansible.builtin.reboot:
            reboot_timeout: 900
            post_reboot_delay: 30
            test_command: systemctl is-system-running --wait
        - name: Return node to the load balancer
          ansible.builtin.command: /usr/local/bin/lb-enable.sh "{{ inventory_hostname }}"
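Two details that matter. First, serial: 1 guarantees only one node in the ring is ever out of rotation. Second, test_command: systemctl is-system-running --wait makes Ansible wait for the boot to actually settle before re-enabling the node, not merely for SSH to answer. That command exits 0 only when the system reaches the running state; a degraded boot (any failed unit) returns non-zero, so the reboot task keeps retrying until reboot_timeout and then fails the host, which with max_fail_percentage: 0 halts the ring. That is usually what you want, but if some hosts carry a known-failed unit, use a looser test_command for them. If you run Kubernetes, the analogue is kubectl drain <node> --ignore-daemonsets --delete-emptydir-data before the reboot and kubectl uncordon <node> after; honour PodDisruptionBudgets so you never evict past quorum.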
For non-clustered hosts, schedule the reboot inside the maintenance window rather than reaching for reboot directly, so users on the box get notice:
# Reboot in 5 minutes with a wall message; cancel with: sudo shutdown -c
sudo shutdown -r +5 "Scheduled kernel patch reboot (maintenance window)"
9. Report compliance and roll back a bad update
Reporting. You need to answer “which hosts are missing security patches?” on demand. Per-host, the truth is one command:
# RHEL family: lists available security advisories not yet applied.
dnf updateinfo list --security --available
# Debian family: count of pending security upgrades.
apt-get -s upgrade | grep -ci '^Inst.*security'
Pipe that into whatever you already aggregate with — Prometheus node-exporter textfile collector is the low-friction option:
#!/usr/bin/env bash
# /etc/cron.hourly/patch-metrics -> writes a metric node-exporter scrapes.
set -euo pipefail
out=/var/lib/node_exporter/textfile_collector/patching.prom
# Create the temp file next to the target: mv is only atomic within one filesystem.
tmp="$(mktemp "${out}.XXXXXX")"
pending=$(dnf -q updateinfo list --security --available 2>/dev/null | grep -c '/Sec' || true)
reboot=0; dnf needs-restarting --reboothint >/dev/null 2>&1 || reboot=1
{
  echo "# TYPE node_security_updates_pending gauge"
  echo "node_security_updates_pending ${pending}"
  echo "# TYPE node_reboot_required gauge"
  echo "node_reboot_required ${reboot}"
} > "$tmp"
chmod 0644 "$tmp"   # mktemp creates 0600; the exporter usually runs as another user
mv "$tmp" "$out"    # atomic swap so the scraper never reads a half-written file
Now node_security_updates_pending > 0 for longer than your SLA is an alert, and node_reboot_required == 1 for 14d is your “live patch is masking a stale host” alarm.
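The script above is dnf-specific. On Debian-family hosts the same two gauges can be produced from the apt simulation output and the reboot flag file. A sketch under the assumption that apt-get -s upgrade output is piped in on stdin; debian_patch_metrics is a hypothetical helper name, and the flag path is a parameter only to make the function testable:

```shell
#!/usr/bin/env bash
# Sketch: emit the same two gauges on a Debian-family host.
# Reads `apt-get -s upgrade` output on stdin; helper name is hypothetical.
debian_patch_metrics() {
  local flag_file="${1:-/var/run/reboot-required}"
  local pending reboot=0
  # grep -c prints the count but exits 1 on zero matches; mask that for `set -e`.
  pending="$(grep -ci '^Inst.*security' || true)"
  if [ -f "$flag_file" ]; then
    reboot=1
  fi
  echo "# TYPE node_security_updates_pending gauge"
  echo "node_security_updates_pending ${pending}"
  echo "# TYPE node_reboot_required gauge"
  echo "node_reboot_required ${reboot}"
}
```

Usage would look like apt-get -s upgrade | debian_patch_metrics, with the output redirected to a temp file and renamed into the textfile collector directory exactly as in the dnf script.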
Rollback. This is why transaction history matters. dnf records every transaction and can reverse one, provided the previous package versions are still available in an enabled repository:
sudo dnf history list # find the transaction ID of the bad update
sudo dnf history info 142 # inspect exactly what it changed
sudo dnf history undo 142 -y # reverse it (reinstalls prior versions)
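The kernel case is the cleanest rollback of all: RHEL and Debian keep the previous kernel installed, so a bad kernel is recovered by selecting the prior entry at the GRUB menu, or for the next boot only. RHEL 8/9 store boot entries as BLS snippets under /boot/loader/entries rather than as menuentry lines in grub.cfg, so list them with grubby and pick the index:
# List boot entries; index 0 is normally the newest kernel, 1 the previous one.
sudo grubby --info=ALL | grep -E '^(index|title)='
# Boot entry 1 once, without making it the permanent default.
sudo grub2-reboot 1
sudo reboot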
On Debian, apt-get install <package>=<old-version> performs the equivalent downgrade for a userspace package; pair it with an apt-mark hold so automation does not immediately re-upgrade it while you investigate.
Verify
Before declaring the pipeline production-ready, confirm each link end to end:
# 1. The timer is active and scheduled when you expect.
systemctl list-timers dnf-automatic.timer # or: apt-daily-upgrade.timer
# 2. A dry run shows ONLY security-scoped packages, nothing else.
sudo unattended-upgrade --dry-run --debug | grep -iE 'allowed origins|packages that'
sudo dnf upgrade --security --assumeno
# 3. Your locks actually hold.
sudo dnf versionlock list # or: apt-mark showhold
# 4. Live patching is loaded and covering the running kernel.
sudo kpatch list # or: canonical-livepatch status
# 5. Reboot detection returns the right exit code.
bash reboot-required.sh; echo "exit=$?"
# 6. The metric file is being written and scraped.
cat /var/lib/node_exporter/textfile_collector/patching.prom
# 7. A rollback is reversible.
sudo dnf history list | head
If the dry run in check 2 lists a feature package you did not expect, your origins/upgrade_type scoping is wrong. Fix it before the timer ever fires unattended.
Enterprise scenario
A fintech platform team ran ~1,800 RHEL 8 hosts behind an internal mirror, with dnf-automatic set to upgrade_type = security and reboot = never. Compliance was green on “patches applied,” yet a routine audit flagged dozens of hosts as running a kernel with a known privilege-escalation CVE. The patch was installed on disk — but the hosts had been live-patched via kpatch and not rebooted in over 70 days, so the audit scanner, which reads the running kernel version, saw the old build. Worse, several of those hosts had kernel* in dnf.conf’s exclude, so even the on-disk kernel was stale and live patching was the only thing protecting them; when the kpatch-patch for that build aged out, the coverage silently lapsed.
The constraint was real: these were latency-sensitive trading-support nodes where ad-hoc reboots were genuinely disruptive, which is exactly why the team had leaned on live patching in the first place. The fix was to stop treating “live-patched” as a terminal state and make reboot cadence a measured, enforced SLA. They emitted node_reboot_required and a days_since_boot metric, alerted when either crossed threshold, removed the blanket kernel* exclusion in favour of versionlock on a specific approved kernel build (so kernels still landed on disk on a controlled track), and drove reboots through the ring + drain Ansible flow with serial: 1. The alert that closed the gap was deliberately simple:
# Prometheus: a host live-patched but not rebooted past the SLA.
- alert: HostRebootOverdue
  expr: (node_reboot_required == 1) and on(instance) (time() - node_boot_time_seconds > 30*86400)
  for: 24h
  labels: { severity: warning }
  annotations:
    summary: "{{ $labels.instance }} needs a reboot to land kernel/libc updates (live patch is masking staleness)."
Within a quarter the “patched but vulnerable” gap closed, and because reboots now flowed through drained rings one node at a time, not a single trading-support service took a connection drop during the cleanup.