Most people poke systemd with systemctl start and never look further. That is a missed opportunity: a well-written unit file gives you restart policies, dependency ordering, resource caps, and a security sandbox for free — no supervisor daemon, no wrapper scripts, no init hacks. This is a practical walk through production-grade units, replacing cron with timers, fencing services into cgroups, and hardening them with the directives that actually move the needle.
Everything below assumes a modern distribution (systemd v245+, cgroup v2 as the default hierarchy). Check yours with systemctl --version and stat -fc %T /sys/fs/cgroup/ (you want cgroup2fs).
1. Unit anatomy: types, dependencies, and ordering
A unit is an INI-style file. Service units live in three directories, listed here in ascending priority:
| Path | Purpose |
|---|---|
| /usr/lib/systemd/system/ | Shipped by packages. Never edit directly. |
| /run/systemd/system/ | Runtime-generated, volatile. |
| /etc/systemd/system/ | Your local overrides and custom units. This wins. |
The Type= in [Service] tells systemd when the unit is “started”. This is the single most misconfigured field:
- simple: the default when ExecStart is set without Type=. systemd considers the service started the instant it forks. It does not wait for readiness.
- exec: like simple, but readiness is the successful execve() of the binary. Catches a missing binary or bad permissions immediately. Prefer this over simple.
- forking: the process double-forks and the parent exits. Set PIDFile= so systemd can track the real daemon. Classic for older daemons.
- notify: the service calls sd_notify(3) with READY=1 when it is genuinely ready to serve. The gold standard if your framework supports it.
- oneshot: runs to completion, then exits. Pair with RemainAfterExit=yes for “apply config once” units.
“Started” is not “ready”. With Type=simple, a unit ordered After= yours may launch while yours is still binding its socket. Use Type=notify or socket activation for genuine readiness gating.
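To make the Type=notify handshake concrete, here is a minimal Python sketch of what a readiness notification amounts to: one datagram sent to the socket named in the NOTIFY_SOCKET environment variable. The function name sd_notify is borrowed from the C API for familiarity; this is an illustration, not the libsystemd implementation, and it skips the finer points a production binding handles.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string such as "READY=1" to the service manager.

    Returns False when NOTIFY_SOCKET is unset (not running under a
    notify-aware unit), so the call is a harmless no-op elsewhere.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True
```

Call sd_notify("READY=1") only after the listening socket is bound and any connection pools are warm; until then systemd keeps dependent units waiting.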
Dependencies vs. ordering
These are orthogonal, and conflating them causes most “works on reboot, breaks on restart” bugs.
- Wants= / Requires= express dependency: if A Requires=B, starting A pulls in B. If B fails to start and A is ordered After=B, A is not started; if B is explicitly stopped or restarted, A is too. Wants= is the soft form: B is pulled in but its failure does not take A down. Default to Wants=; reserve Requires= for genuinely hard dependencies.
- After= / Before= express ordering only. After=B means “do not start until B has finished starting”. It says nothing about pulling B in.
You almost always need both. “Start my app after the database is up, and pull the database in if it isn’t” is:
[Unit]
Wants=postgresql.service
After=postgresql.service network-online.target
Note network-online.target, not network.target. The former blocks until the network is actually configured (it requires the relevant wait-online service to be enabled); the latter is only an ordering point for the networking stack. If your service needs a routable address at start, use network-online.target plus Wants=network-online.target.
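The split between dependency and ordering is easier to see in a toy model. The Python sketch below is a deliberate simplification, not how systemd builds its transaction: the function name start_plan and the dictionary inputs are hypothetical. It first collects what gets pulled in, then applies After= only to units already in that set.

```python
from graphlib import TopologicalSorter


def start_plan(target, wants, after):
    """Return the units started for `target`, in an order honouring After=.

    wants: unit -> units it pulls in (Wants=/Requires=)
    after: unit -> units it must start after (After=)
    """
    # Dependency pass: collect everything pulled in, transitively.
    pulled, stack = set(), [target]
    while stack:
        unit = stack.pop()
        if unit not in pulled:
            pulled.add(unit)
            stack.extend(wants.get(unit, ()))
    # Ordering pass: After= only constrains units that are already in the set.
    graph = {u: {d for d in after.get(u, ()) if d in pulled} for u in pulled}
    return list(TopologicalSorter(graph).static_order())


plan = start_plan(
    "app",
    wants={"app": ["postgresql"]},
    after={"app": ["postgresql", "network-online"]},
)
print(plan)  # ['postgresql', 'app']
```

In this model network-online never appears in the plan because nothing pulled it in, which is why an After= on its own is not enough.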
The [Install] section
[Install] is inert until you run systemctl enable. It defines enable-time behaviour, typically:
[Install]
WantedBy=multi-user.target
enable reads WantedBy=multi-user.target and creates a symlink in multi-user.target.wants/, which is how the unit starts at boot. No [Install] section means enable has nothing to do and the unit never starts automatically — a common gotcha.
2. Writing a production service unit
A complete, opinionated unit for a long-running daemon. Save it as /etc/systemd/system/widgetd.service.
[Unit]
Description=Widget API daemon
Documentation=https://internal.example.com/runbooks/widgetd
Wants=network-online.target
After=network-online.target postgresql.service
StartLimitIntervalSec=60
StartLimitBurst=5
[Service]
Type=notify
ExecStart=/usr/local/bin/widgetd --config /etc/widgetd/config.toml
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=2
TimeoutStartSec=30
TimeoutStopSec=20
WatchdogSec=30
# Run as an unprivileged, system-managed user
DynamicUser=yes
StateDirectory=widgetd
RuntimeDirectory=widgetd
[Install]
WantedBy=multi-user.target
A few deliberate choices:
- Restart=on-failure restarts only on non-zero exit, signals, timeouts, and watchdog failures, not on a clean exit 0. Use Restart=always only if a clean exit should also trigger a restart.
- StartLimitIntervalSec / StartLimitBurst form a crash-loop circuit breaker. More than 5 start attempts in 60 seconds and systemd marks the unit failed instead of spinning forever. These live in [Unit], not [Service], which is a frequent mistake.
- WatchdogSec=30 plus Type=notify is a true health gate. The process must call sd_notify(0, "WATCHDOG=1") at least once every 30 seconds; the usual practice is to ping every 15 seconds (half the interval) to leave headroom. Miss the deadline and systemd treats the service as hung and restarts it, catching deadlocks a process-alive check never would.
- DynamicUser=yes allocates a transient UID/GID for the service lifetime, with private /tmp and locked-down filesystem access by default. With StateDirectory=, systemd creates /var/lib/widgetd with correct ownership. The single highest-leverage hardening directive.
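When WatchdogSec= is set, systemd passes the timeout to the service in the WATCHDOG_USEC environment variable (and the target process in WATCHDOG_PID), so the ping interval need not be hard-coded. A minimal Python sketch of deriving it follows; the helper name watchdog_ping_interval is hypothetical.

```python
import os


def watchdog_ping_interval(environ=os.environ):
    """Return seconds between WATCHDOG=1 pings, or None if no watchdog is armed."""
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    # WATCHDOG_PID, when set, names the process the watchdog applies to.
    pid = environ.get("WATCHDOG_PID")
    if pid and int(pid) != os.getpid():
        return None
    # Ping at half the timeout so one late wakeup does not trip the watchdog.
    return int(usec) / 1_000_000 / 2
```

With WatchdogSec=30 this returns 15.0; a health loop would sleep that long between pings and skip the ping whenever its own checks fail.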
Load and start it:
sudo systemctl daemon-reload
sudo systemctl enable --now widgetd.service
systemctl status widgetd.service
daemon-reload is mandatory after any unit change; systemd caches parsed units in memory.
3. Drop-in overrides and templated units
Drop-ins: never edit shipped units
To tweak a vendor unit, do not copy it into /etc. Use a drop-in, merged on top of the original. The clean way:
sudo systemctl edit nginx.service
This opens an editor for /etc/systemd/system/nginx.service.d/override.conf. Add only the deltas:
[Service]
LimitNOFILE=65536
Restart=on-failure
RestartSec=5
Drop-ins survive package upgrades because the base unit stays untouched. To verify the merge:
systemctl cat nginx.service # shows base unit + all drop-ins
systemctl show nginx.service -p LimitNOFILE
Gotcha: most directives are last-one-wins, but list-valued ones like ExecStart= accumulate instead of being overridden. To replace ExecStart=, first clear it with an empty ExecStart= line, then set the new value. Otherwise you get two ExecStart entries, which systemd rejects for every service type except oneshot.
[Service]
ExecStart=
ExecStart=/usr/local/bin/nginx-wrapper
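The reset-then-set rule can be modelled in a few lines of Python. This is a simplified sketch of the merge semantics for one list-valued key, not systemd's parser; the function name merge_list_setting is hypothetical and the base ExecStart path /usr/sbin/nginx is assumed for illustration.

```python
def merge_list_setting(fragments, key="ExecStart"):
    """Apply a list-valued setting across unit fragments in load order.

    Each fragment is a list of "Key=value" lines. An empty assignment
    resets the accumulated list; any other assignment appends to it.
    """
    values = []
    for lines in fragments:
        for line in lines:
            name, sep, value = line.partition("=")
            if not sep or name.strip() != key:
                continue
            value = value.strip()
            if value:
                values.append(value)
            else:
                values.clear()
    return values


base = ["ExecStart=/usr/sbin/nginx"]  # assumed vendor value
override = ["ExecStart=", "ExecStart=/usr/local/bin/nginx-wrapper"]
print(merge_list_setting([base, override]))  # ['/usr/local/bin/nginx-wrapper']
```

Drop the empty ExecStart= line from the override and the result holds both commands, which is the double-entry situation described above.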
Templated (instanced) units
A unit file with @ in its name is a template. The string after @ is the instance, available as %i (and %I unescaped). One template, many instances:
/etc/systemd/system/tunnel@.service:
[Unit]
Description=SSH tunnel to %i
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/autossh -M 0 -N %i
Restart=always
RestartSec=10
DynamicUser=yes
[Install]
WantedBy=multi-user.target
Now start any number of instances from the one file:
sudo systemctl enable --now tunnel@db-replica.service
sudo systemctl enable --now tunnel@cache-01.service
Each gets independent state, logs, and lifecycle. Common specifiers: %i (instance), %n (full unit name), %H (hostname), %h (user home for user units). Use systemd-escape to safely build instance names containing slashes or special characters:
systemd-escape --template tunnel@.service "user@10.0.0.5"
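For intuition about what the escaped instance name looks like, here is a rough Python rendering of the plain (non-path) escaping rule as described in systemd.unit(5): "/" becomes "-", and any byte outside ASCII alphanumerics, ":", "_" and "." (plus a leading ".") becomes a \xNN escape. The function name escape_instance is hypothetical, and systemd-escape remains the authority for real unit names.

```python
import string

# Bytes that pass through unescaped.
_PLAIN = set((string.ascii_letters + string.digits + ":_.").encode())


def escape_instance(text: str) -> str:
    """Approximate systemd's unit-name escaping for an instance string."""
    out = []
    for index, byte in enumerate(text.encode()):
        if byte == ord("/"):
            out.append("-")
        elif byte in _PLAIN and not (index == 0 and byte == ord(".")):
            out.append(chr(byte))
        else:
            # Everything else, including "-" and "@", becomes \xNN.
            out.append(f"\\x{byte:02x}")
    return "".join(out)


print(escape_instance("user@10.0.0.5"))  # user\x4010.0.0.5
```

Inside the unit, %i expands to this escaped form and %I to the original string, so a template that needs the literal user@10.0.0.5 on its command line would use %I.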
4. Replacing cron with systemd timers
Timers beat cron for anything you want to observe: full journald logging, dependency ordering, resource control, and Persistent= to catch up on missed runs. A timer is two units — a .service that does the work and a .timer that triggers it.
/etc/systemd/system/backup.service:
[Unit]
Description=Nightly database backup
Wants=postgresql.service
After=postgresql.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/db-backup.sh
DynamicUser=yes
StateDirectory=db-backup
/etc/systemd/system/backup.timer:
[Unit]
Description=Run database backup nightly
[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true
RandomizedDelaySec=300
AccuracySec=1m
[Install]
WantedBy=timers.target
Key fields:
- OnCalendar= uses a DayOfWeek Year-Month-Day Hour:Minute:Second grammar. *-*-* 02:30:00 is daily at 02:30, Mon..Fri 09:00 is weekdays at 9am, *-*-01 00:00:00 is the first of each month. Always validate it (see Verify).
- Persistent=true records the last run on disk. If the machine was off at 02:30, the job fires shortly after the next boot. This is the cron-with-anacron behaviour you usually want, and the reason to switch.
- RandomizedDelaySec= jitters the start. Indispensable when a fleet would otherwise hammer a backup target at exactly 02:30:00.
- AccuracySec= lets systemd coalesce timers within a window to reduce wakeups. The default is one minute; tighten only if you need second precision.
Enable the timer, not the service:
sudo systemctl daemon-reload
sudo systemctl enable --now backup.timer
You can also drive a timer with OnBootSec= (after boot) or OnUnitActiveSec= (interval since the service last ran) for “every 15 minutes” schedules:
[Timer]
OnBootSec=15min
OnUnitActiveSec=15min
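The Persistent= catch-up decision described in this section boils down to one comparison: did a scheduled instant pass since the last recorded run? The Python sketch below models that for a daily schedule only. It is a simplification under stated assumptions (naive local datetimes, no jitter), and the function name missed_daily_run is hypothetical.

```python
from datetime import datetime, time, timedelta


def missed_daily_run(last_run: datetime, now: datetime, at: time) -> bool:
    """True if a daily run scheduled at `at` elapsed between last_run and now."""
    # Most recent scheduled instant that is not in the future.
    due = datetime.combine(now.date(), at)
    if due > now:
        due -= timedelta(days=1)
    return last_run < due


# Machine was off overnight and boots at 08:00: the 02:30 run is owed.
print(missed_daily_run(
    last_run=datetime(2026, 3, 17, 2, 30, 5),
    now=datetime(2026, 3, 18, 8, 0),
    at=time(2, 30),
))  # True
```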
5. Resource control with cgroup v2
Every service gets its own cgroup. On cgroup v2 you control resources with directives in [Service] (or a drop-in) that map directly onto kernel controllers — no ulimit guesswork.
[Service]
# CPU: cap at 1.5 cores; weight only matters under contention
CPUQuota=150%
CPUWeight=200
# Memory: throttle at 512M, OOM-kill the cgroup at 768M
MemoryHigh=512M
MemoryMax=768M
# Block IO, per device
IOWeight=100
IOWriteBandwidthMax=/dev/sda 20M
# Fork-bomb protection
TasksMax=256
What each one does:
- CPUQuota= is an absolute ceiling: 150% allows 1.5 cores of CPU time regardless of how idle the box is.
- CPUWeight= (default 100, range 1–10000) is relative: it only matters under contention, dividing CPU proportionally among competing cgroups.
- MemoryMax= is the hard limit; exceeding it triggers the cgroup OOM killer.
- MemoryHigh= is a soft throttle: past it the kernel aggressively reclaims and stalls the cgroup but does not kill. Set MemoryHigh below MemoryMax so the workload can shed memory before the hammer drops.
- IOWeight= is proportional (like CPUWeight); the *BandwidthMax= / *IOPSMax= directives are absolute per-device caps and require the device path.
- TasksMax= caps the pids in the cgroup. A cheap, effective fork-bomb guard.
Apply via drop-in to tune without touching the unit, then inspect the live accounting:
sudo systemctl set-property widgetd.service MemoryMax=1G CPUQuota=200%
systemctl show widgetd.service -p MemoryMax -p CPUQuotaPerSecUSec
systemd-cgtop
set-property applies the change live and persists it as a drop-in under /etc/systemd/system.control/widgetd.service.d/ (add --runtime to keep it only until the next reboot). systemd-cgtop is top for cgroups: per-service CPU, memory, and IO, the fastest way to find the noisy neighbour on a box.
Note: accounting must be on for the numbers to be real. Most distros enable MemoryAccounting / CPUAccounting by default, but if systemctl show reports zeros, set MemoryAccounting=yes and CPUAccounting=yes explicitly.
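To make the mapping onto kernel controllers concrete: MemoryHigh= and MemoryMax= end up as byte counts in the cgroup's memory.high and memory.max files, and CPUQuota= ends up as a quota/period pair in cpu.max. The Python sketch below reproduces that arithmetic, assuming the default 100 ms quota period and handling only plain K/M/G/T suffixes (no percentages, no infinity). Both function names are hypothetical.

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size such as "512M" to bytes (suffixes are base 1024)."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def cpu_max(quota_percent: float, period_usec: int = 100_000) -> str:
    """Render a CPUQuota= percentage as the "<quota> <period>" pair in cpu.max."""
    return f"{int(quota_percent / 100 * period_usec)} {period_usec}"


print(parse_size("512M"))  # 536870912
print(cpu_max(150))        # 150000 100000
```

Comparing these values with the files under the service's directory in /sys/fs/cgroup/ is a quick way to confirm a limit actually landed.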
6. Sandboxing and service hardening
This is where systemd quietly replaces a pile of AppArmor/SELinux work for common cases. Add these to [Service]; each line shrinks the blast radius of a compromised process.
[Service]
# Filesystem
# Comments must sit on their own line; systemd does not support trailing comments
# Entire filesystem read-only...
ProtectSystem=strict
# ...except these paths
ReadWritePaths=/var/lib/widgetd
ProtectHome=true
PrivateTmp=true
# Privilege
NoNewPrivileges=true
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Kernel / device exposure
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
PrivateDevices=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
SystemCallArchitectures=native
LockPersonality=true
MemoryDenyWriteExecute=true
The high-value ones, in priority order:
- NoNewPrivileges=true: the keystone. A process and its children can never gain privileges via setuid binaries or filesystem capabilities. Turn this on everywhere unless the service legitimately needs to escalate.
- ProtectSystem=strict: mounts the whole filesystem read-only inside the service’s namespace; punch holes with ReadWritePaths=. The weaker values true (/usr, /boot, /efi) and full (also /etc) are stepping stones; strict is the target.
- CapabilityBoundingSet=: drops every Linux capability not listed. Set it to exactly what the service needs. The canonical case is a web server that binds port 80/443 but otherwise runs unprivileged: it needs only CAP_NET_BIND_SERVICE, and AmbientCapabilities= grants that single capability to the unprivileged process so it can actually bind.
- SystemCallFilter=@system-service: a seccomp allowlist of syscalls appropriate for typical services, implicitly blocking exotic and dangerous ones (@swap, @reboot, @module, raw ptrace). Start here and add back named groups only if the service breaks.
- PrivateTmp=true and ProtectHome=true: namespace /tmp and hide user data. Cheap, almost never breaks anything.
systemd ships a scoring tool. After hardening, run systemd-analyze security widgetd.service. It rates each unit from “UNSAFE” to “OK” with an exposure score and a per-directive breakdown. Treat it as a checklist: drive the score down, retest the service after each change.
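For a quick pre-flight check in CI, before a unit ever reaches a host where systemd-analyze security can run, a few lines of Python can flag units that omit the high-value directives listed above. This is a naive text scan under stated assumptions: it looks only at the [Service] section of a single file, ignores drop-ins, and does not know that DynamicUser=yes implies some of these. The names RECOMMENDED and missing_hardening are hypothetical.

```python
RECOMMENDED = (
    "NoNewPrivileges",
    "ProtectSystem",
    "CapabilityBoundingSet",
    "SystemCallFilter",
    "PrivateTmp",
    "ProtectHome",
)


def missing_hardening(unit_text: str, wanted=RECOMMENDED):
    """List wanted directives that never appear in the [Service] section."""
    section = None
    seen = set()
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks and whole-line comments ("#" or ";").
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "Service" and "=" in line:
            seen.add(line.split("=", 1)[0].strip())
    return [name for name in wanted if name not in seen]
```

Treat the result as a prompt to review the unit, not as a score; the real exposure rating still comes from systemd-analyze security on the target host.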
Verify
Validate everything before declaring victory.
# Unit file syntax and warnings (catches typos, deprecated keys)
systemd-analyze verify /etc/systemd/system/widgetd.service
# Confirm the merged, effective configuration
systemctl cat widgetd.service
systemctl show widgetd.service -p Restart -p MemoryMax -p NoNewPrivileges
# Is it actually running and healthy?
systemctl status widgetd.service
systemctl is-active widgetd.service && echo OK
# Timers: list next/last run times, then dry-run the calendar expression
systemctl list-timers --all
systemd-analyze calendar "*-*-* 02:30:00" --iterations 5
# Trigger a oneshot timer's service manually without waiting for the clock
sudo systemctl start backup.service
journalctl -u backup.service -n 50 --no-pager
# Security posture
systemd-analyze security widgetd.service
systemd-analyze calendar printing five upcoming timestamps is the fastest way to prove an OnCalendar= expression means what you think, and systemd-analyze verify reports problems without starting anything.
Reading journald effectively
Logs are only useful if you can find the signal. journald gives structured, indexed filtering that grep over flat files cannot.
# Follow one unit, like tail -f
journalctl -u widgetd.service -f
# Current boot only, errors and worse
journalctl -u widgetd.service -b -p err
# Time-bounded
journalctl -u widgetd.service --since "2026-03-18 09:00" --until "2026-03-18 10:00"
# Warnings and worse from the previous boot (-1), useful for crash post-mortems
journalctl -b -1 -p warning
# Disk usage and retention
journalctl --disk-usage
sudo journalctl --vacuum-time=14d
Two configuration points that bite people:
- Persistence. On many systems the journal is volatile (/run/log/journal) and vanishes on reboot. To keep logs, set Storage=persistent in /etc/systemd/journald.conf and create /var/log/journal (mkdir -p /var/log/journal && systemctl restart systemd-journald). Without this, post-mortem evidence is gone the moment the box reboots.
- Rate limiting. journald drops bursts by default: roughly RateLimitBurst=10000 messages per RateLimitIntervalSec=30s per service. A chatty service has lines silently dropped (you will see a “Suppressed N messages” notice). Raise the limits with a journald drop-in, or fix the noise at the source, which is usually the better option.
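Because journald stores structured fields, its JSON output (journalctl -o json, one object per line) is easy to post-process. The Python sketch below tallies entries by syslog level using the PRIORITY field. The function name count_by_priority is hypothetical, and treating entries without PRIORITY as info is a choice made for this example.

```python
import json
from collections import Counter

LEVELS = ("emerg", "alert", "crit", "err", "warning", "notice", "info", "debug")


def count_by_priority(json_lines):
    """Tally journal entries by level name from `journalctl -o json` lines."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # PRIORITY is exported as a string digit; default to info (6) if absent.
        level = int(entry.get("PRIORITY", 6))
        counts[LEVELS[level]] += 1
    return counts
```

Feeding it the lines from journalctl -u widgetd.service -b -o json gives a quick read on whether a unit is mostly chatter or mostly errors before you decide to raise a rate limit.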
Debugging boot, cycles, and failures
# What failed, and why
systemctl --failed
systemctl status <unit> # last log lines + exit code/signal
journalctl -u <unit> -b
# Where is boot time going?
systemd-analyze # total firmware/loader/kernel/userspace time
systemd-analyze blame # slowest units, descending
systemd-analyze critical-chain # the ordering chain that gated boot
# Dependency cycles — systemd breaks them by dropping a unit and tells you which
journalctl -b | grep -i "found ordering cycle"
# Visualize what pulls in what
systemctl list-dependencies <unit>
Reach for systemd-analyze critical-chain when boot is slow: unlike blame (which lists slow units in isolation) it shows the serialized path that actually delayed reaching the target, accounting for parallelism. For ordering cycles, the journal names the units involved and which one systemd deleted to break the loop — fix it by relaxing an After=/Requires= rather than letting systemd choose for you.
Enterprise scenario
A payments platform ran a JVM settlement service under Restart=always with MemoryMax=4G. During a vendor outage the service wedged — threads alive, heap full, GC thrashing, but no exit. The process-alive check stayed green, so nothing restarted; settlements silently backed up for 40 minutes until on-call noticed the queue depth. The gotcha: Restart= only fires on process death, and a deadlocked-but-alive JVM never dies. MemoryMax made it worse — the cgroup OOM killer reaped the service mid-batch, and because Type=simple reported “started” at fork, the restart raced ahead of the readiness probe and took traffic before the DB pool reconnected.
The fix was a true liveness gate via the watchdog, not a process check. They switched to Type=notify, added a dedicated watchdog thread pinging sd_notify only when the work loop and DB pool were healthy, and split the soft/hard memory limits so the JVM could reclaim before being killed:
[Unit]
StartLimitIntervalSec=120
StartLimitBurst=4
[Service]
Type=notify
WatchdogSec=20
Restart=on-watchdog
RestartSec=5
MemoryHigh=3500M
MemoryMax=4G
// ping only when genuinely able to serve
if (workLoop.isHealthy() && dbPool.isReachable()) {
sdNotify.notifyWatchdog(); // every WatchdogSec/2 = 10s
}
Now a wedge trips the watchdog in 20 seconds and Restart=on-watchdog cycles it, while MemoryHigh lets the heap shed under pressure before MemoryMax ever bites. StartLimitBurst=4 keeps a genuinely broken build from crash-looping into a thundering reconnect storm.
Pitfalls and next steps
The recurring failure modes: forgetting daemon-reload after editing a unit (the change silently does nothing); putting StartLimit* in [Service] instead of [Unit] (ignored); using Requires= where Wants= belongs and cascading a non-critical failure into an outage; ordering on network.target when the service needs network-online.target; and over-hardening with SystemCallFilter until the service dies on an unrelated syscall — always test the running service after each hardening directive, not just that it starts.
From here, look at socket activation (.socket units) for on-demand start and zero-downtime restarts, systemd-run --scope/--slice for ad-hoc resource-controlled commands, and per-slice budgets to carve a box into tiers (a system-batch.slice with a hard CPU and memory ceiling so background jobs can never starve the foreground service). Run systemd-analyze security across the fleet and treat the worst scores as a backlog.