Running Rootless Containers in Production with Podman and Quadlet

The container conversation defaulted to “just run a daemon as root” for a decade, and most production fleets are still paying interest on that decision. Podman flips the model: no long-lived daemon, no root-owned socket, and containers that run entirely inside an unprivileged user’s namespace. Pair it with Quadlet – the systemd generator that turns declarative .container files into real units – and you get containers that are first-class systemd services, supervised by PID 1, with restart policies, resource caps, and journald logging for free. This is a walk through standing that up the way a platform team should, not the way a quickstart does.

Everything below assumes a recent Podman (v4.4 or newer for Quadlet; v5.x for pasta-by-default networking) on a cgroup v2 host. Check with podman version and stat -fc %T /sys/fs/cgroup/ – you want cgroup2fs. Quadlet shipped with Podman 4.4 and is the supported successor to podman generate systemd, which is deprecated but still ships.

1. Podman versus Docker: daemonless and rootless by design

The architectural difference is not cosmetic. Docker runs a privileged daemon (dockerd) that owns every container; your CLI is a thin client talking to a root-owned socket. Compromise the socket, compromise the host. Podman has no daemon. podman run forks conmon, which execs the OCI runtime (crun or runc), and the container is a child of your process tree. When you wire that into systemd, conmon becomes a child of the unit, and the kernel – not a daemon – is the source of truth.

Property              Docker (default)                   Podman (rootless)
Central daemon        dockerd runs as root               None
Container ownership   Child of daemon                    Child of your shell or systemd unit
Privilege at runtime  Root, unless userns remapped       Your UID, mapped via subuid/subgid
Supervision           Daemon restart policy              systemd unit (Restart=)
Socket exposure       Root-owned /var/run/docker.sock    Optional, per-user, off by default

The practical wins: a crashed container is restarted by systemd with the same policy you use for every other service; logs land in the journal; and a remote code execution inside a container lands an attacker as an unprivileged UID with no capabilities, not root.

Podman is also CLI-compatible with Docker. alias docker=podman covers the vast majority of muscle memory, and podman generate kube / podman kube play let you carry Kubernetes-style YAML between dev and a cluster.

2. Setting up rootless mode: subuid, subgid, and lingering

Rootless containers work by mapping a range of host UIDs/GIDs into the container’s user namespace. Your single login UID becomes root (UID 0) inside the container, and the container’s other UIDs map to a block of unprivileged host UIDs that belong to nobody else. Those ranges live in /etc/subuid and /etc/subgid.

On a fresh user, allocate a non-overlapping range of at least 65536 IDs:

# Grant the 'app' user an explicit 65536-wide subordinate range (check /etc/subuid and /etc/subgid first so it overlaps nobody else's)
sudo usermod --add-subuids 100000-165535 --add-subgids 100000-165535 app

# Verify
grep app /etc/subuid /etc/subgid
# /etc/subuid:app:100000:65536
# /etc/subgid:app:100000:65536

# After editing the maps, force Podman to re-read them
sudo -u app podman system migrate

The 65536 width matters: it lets a container reference the full standard UID space (0-65535). Too narrow and images that chown to high UIDs will fail with lchown ... invalid argument.
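The width check described above is easy to script. The following is a minimal sketch, not part of Podman: the function name check_subid_range and its arguments are hypothetical, and it only sums the range widths for one user rather than checking for overlaps.

```shell
# Hypothetical helper: report whether USER has at least MIN subordinate IDs
# in a subuid/subgid-style file (format: name:start:count, one range per line).
# Usage: check_subid_range USER FILE [MIN]
check_subid_range() (
  user=$1
  file=$2
  min=${3:-65536}
  # Sum the counts of every range that belongs to the user.
  total=$(awk -F: -v u="$user" '$1 == u { sum += $3 } END { print sum + 0 }' "$file")
  if [ "$total" -ge "$min" ]; then
    echo "ok: $user has $total IDs in $file"
  else
    echo "too narrow: $user has $total IDs in $file, want at least $min"
    return 1
  fi
)
```

Run it once per map, for example check_subid_range app /etc/subuid and check_subid_range app /etc/subgid. A non-zero exit status means the range is narrower than the requested minimum.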

For services that must survive logout and start at boot, enable lingering. Without it, the user’s systemd instance is torn down when their last session ends, taking every container with it.

# Keep the user's systemd --user manager running with no active login session
sudo loginctl enable-linger app

# Confirm
loginctl show-user app --property=Linger
# Linger=yes

Lingering is the single most-forgotten step. The container works in your SSH session, you log out, and it vanishes. Enable linger for any user hosting boot-persistent workloads.

3. User-namespace mapping, storage drivers, and volumes

Inside the namespace, the container believes it is root. On the host it is not. Prove it:

podman run --rm docker.io/library/alpine id
# uid=0(root) gid=0(root) ...   <- inside the container

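podman unshare cat /proc/self/uid_map
#          0       1000          1   <- container UID 0 == your own host UID (1000 here)
#          1     100000      65536   <- container UIDs 1-65536 == host UIDs 100000-165535

Those two lines are the whole security model: container root is just your own unprivileged login UID, and every other container UID lands in a subordinate range that owns nothing else on the system.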
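To see which host UID a given container UID lands on, the map can be evaluated mechanically. This is a sketch under assumptions: the helper name container_uid_to_host is hypothetical, and it expects the kernel's uid_map layout of three columns (ID inside the namespace, ID outside, range length).

```shell
# Hypothetical helper: translate a container UID to the host UID it maps to,
# given a uid_map in the kernel's "inside outside count" format.
# Usage: container_uid_to_host CONTAINER_UID [MAP_FILE]
container_uid_to_host() (
  cuid=$1
  map=${2:-/proc/self/uid_map}
  awk -v id="$cuid" '
    # A row matches when the ID falls inside [inside, inside + count).
    id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
    END { if (!found) exit 1 }
  ' "$map"
)
```

One way to use it is to save the namespace's map with podman unshare cat /proc/self/uid_map > /tmp/uid_map and then ask container_uid_to_host 33 /tmp/uid_map. An unmapped UID produces no output and a non-zero exit status.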

Storage driver

Rootless Podman defaults to overlay via the kernel’s unprivileged overlayfs (kernel 5.11+) or, failing that, fuse-overlayfs in userspace. You do not need vfs on any modern kernel, and you should avoid it – vfs deep-copies every layer and wastes enormous disk. Confirm:

podman info --format '{{.Store.GraphDriverName}}'
# overlay

Rootless storage lives under $HOME/.local/share/containers/storage, not /var/lib/containers. Put $HOME on a filesystem that supports the driver (xfs or ext4 are fine; some network filesystems are not).

Volumes and the :U trick

Because of the UID shift, a host bind mount owned by your real UID appears owned by root inside the container, so an image that runs as a non-root user cannot write to it. The fix is the :U mount option, which recursively chowns the source to the host UID/GID that the container’s user maps to, plus :Z/:z for SELinux relabelling on RHEL-family hosts:

# :U  -> chown source to the container user's mapped UID/GID
# :Z  -> apply a private SELinux label so the container can access it
podman run --rm \
  -v /srv/appdata:/data:U,Z \
  docker.io/library/alpine touch /data/written-by-container

For state you do not need to inspect from the host, prefer named volumes (podman volume create). They live inside Podman storage, already carry the right ownership, and sidestep the relabel dance entirely.

4. Grouping containers into pods and shared namespaces

A pod is Podman’s borrowing from Kubernetes: a set of containers that share a network namespace (one IP, one localhost, shared port space) and optionally other namespaces. The pod owns an infra container that holds the namespaces open so individual containers can come and go.

# Create a pod, publishing 8080 on the host once -- all members share it
podman pod create --name web --publish 8080:80

# A web server and the cache it uses, both in the pod
podman run -d --pod web --name app   docker.io/library/nginx:alpine
podman run -d --pod web --name cache docker.io/library/redis:7-alpine

# Inside the pod, 'app' reaches 'cache' on 127.0.0.1:6379 -- same netns.
# The infra container shown below is what holds that namespace open.
podman pod inspect web --format '{{.InfraContainerID}}'

The key insight for networking: you publish ports on the pod, not the member containers, because they share one network namespace. Containers in a pod talk to each other over localhost. This maps cleanly onto a sidecar pattern – proxy, app, and a metrics exporter co-scheduled and co-supervised.

5. Generating systemd units with Quadlet .container files

This is the heart of a production setup. You do not write [Service] units by hand and you do not use the deprecated podman generate systemd. You write a declarative .container file and let the Quadlet generator translate it into a real unit at boot and on daemon-reload.

Rootless Quadlet files go in ~/.config/containers/systemd/. (Root/system-wide units go in /etc/containers/systemd/.) Create ~/.config/containers/systemd/myapp.container:

[Unit]
Description=My application container
After=network-online.target
Wants=network-online.target

[Container]
Image=ghcr.io/acme/myapp:1.4.2
ContainerName=myapp
PublishPort=8080:8080
Environment=LOG_LEVEL=info
Volume=myapp-data.volume:/var/lib/myapp:Z
# Drop all capabilities, add back only what the app needs
DropCapability=ALL
AddCapability=NET_BIND_SERVICE
# Mark this image as a candidate for `podman auto-update`
AutoUpdate=registry
HealthCmd=curl -fsS http://localhost:8080/healthz || exit 1
HealthInterval=30s
HealthRetries=3

[Service]
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target

Note there is no [Service] ExecStart – Quadlet synthesises the podman run line from the [Container] section. The companion myapp-data.volume:/... references a .volume file you create alongside it:

# ~/.config/containers/systemd/myapp-data.volume
[Volume]
# Quadlet names the real volume "systemd-myapp-data"

Now reload and start. The unit name is the filename stem plus .service:

# Re-run the generator (Quadlet is invoked by daemon-reload)
systemctl --user daemon-reload

# Start it -- the unit is named after the .container file
systemctl --user start myapp.service
systemctl --user status myapp.service
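The naming rule differs by Quadlet file type, which matters once .volume or .pod files enter the picture. The helper below is a sketch: the function name quadlet_unit_name is hypothetical, and the suffix table reflects the podman-systemd.unit documentation as I understand it, so check it against your Podman version (.pod, .image, and .build arrived after 4.4).

```shell
# Hypothetical helper: print the systemd unit name Quadlet generates for a file.
# Usage: quadlet_unit_name PATH
quadlet_unit_name() (
  file=$(basename "$1")
  stem=${file%.*}
  case "$file" in
    *.container|*.kube) echo "$stem.service" ;;          # stem + .service
    *.volume)           echo "$stem-volume.service" ;;
    *.network)          echo "$stem-network.service" ;;
    *.pod)              echo "$stem-pod.service" ;;
    *.image)            echo "$stem-image.service" ;;
    *.build)            echo "$stem-build.service" ;;
    *) echo "not a Quadlet file: $file" >&2; return 1 ;;
  esac
)
```

For example, quadlet_unit_name ~/.config/containers/systemd/myapp-data.volume prints the unit to pass to systemctl --user status when a volume fails to create.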

Quadlet units are generated read-only into /run/user/<uid>/systemd/generator/; you never edit those. You edit the .container file and daemon-reload. This is the declarative loop that makes Podman feel like GitOps on a single host – the .container files are your source of truth and live in version control.

For a whole pod, Quadlet (Podman 5.0 or newer) provides a .pod file and you point each .container at it with Pod=mypod.pod. The generator wires the ordering dependencies for you.
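As an illustration of the .pod arrangement, the sketch below writes one pod file and one member container file. Everything named here is an assumption for the example: the function write_pod_quadlets, the file names web.pod and web-app.container, the image, and the ports.

```shell
# Sketch: write a .pod file and one member .container file into DIR.
# File names, image, and ports are illustrative assumptions.
# Usage: write_pod_quadlets [DIR]
write_pod_quadlets() (
  dir=${1:-$HOME/.config/containers/systemd}
  mkdir -p "$dir"
  cat > "$dir/web.pod" <<'EOF'
[Pod]
PodName=web
# Ports are published on the pod, not on its members
PublishPort=8080:80

[Install]
WantedBy=default.target
EOF
  cat > "$dir/web-app.container" <<'EOF'
[Container]
Image=docker.io/library/nginx:alpine
ContainerName=web-app
# Join the pod defined in web.pod
Pod=web.pod
EOF
)
```

After write_pod_quadlets and systemctl --user daemon-reload, the pod should be available as web-pod.service, and starting it is expected to bring up its member containers. Verify both points on your Podman version before relying on them.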

6. Rootless networking with pasta, slirp4netns, and port mapping

Rootless containers cannot create veth pairs against a host bridge – that needs CAP_NET_ADMIN on the host. So outbound traffic and published ports flow through a userspace network stack. There are two: pasta (from the passt project), the default since Podman 5.0, and the older slirp4netns.

# What is actually backing rootless networking?
podman info --format '{{.Host.NetworkBackend}}'   # netavark (the backend)
podman info --format '{{.Host.Slirp4NetNS.Executable}}'  # path if used; template field names are case-sensitive

# Run a container on the default rootless network (pasta on Podman 5)
podman run -d --name web -p 8080:80 docker.io/library/nginx:alpine

# Publish on a specific host interface only
podman run -d -p 127.0.0.1:8080:80 docker.io/library/nginx:alpine

Two production gotchas:

  1. Ports below 1024. A rootless user cannot bind privileged host ports by default. Either publish a high port and front it with a reverse proxy / firewall DNAT, or lower the threshold once, deliberately:

    # Allow rootless processes to bind from port 80 upward
    sudo sysctl -w net.ipv4.ip_unprivileged_port_start=80
    # Persist it
    echo 'net.ipv4.ip_unprivileged_port_start=80' | sudo tee /etc/sysctl.d/99-rootless-ports.conf
    
  2. Source IP for logging/allowlists. With pasta you can keep the real client IP. If an app’s access logs or IP allowlists matter, prefer pasta and verify the observed source address rather than assuming 10.0.2.x from slirp4netns.
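The privileged-port rule from the first gotcha reduces to one comparison against the sysctl. A minimal sketch, assuming a Linux host; the function name can_bind_rootless is hypothetical, and the optional second argument exists only so the check can be tried against a threshold other than the live one.

```shell
# Hypothetical helper: can an unprivileged process bind PORT on this host?
# THRESHOLD defaults to the live value of net.ipv4.ip_unprivileged_port_start.
# Usage: can_bind_rootless PORT [THRESHOLD]
can_bind_rootless() (
  port=$1
  threshold=${2:-$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)}
  if [ "$port" -ge "$threshold" ]; then
    echo "port $port: bindable rootless (threshold $threshold)"
  else
    echo "port $port: needs a higher port, DNAT, or a lower threshold (currently $threshold)"
    return 1
  fi
)
```

Calling can_bind_rootless 443 before deploying a PublishPort=443:... unit turns a confusing start failure into an explicit pre-flight message.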

For container-to-container DNS, create a named network – members resolve each other by container name through the aardvark-dns resolver that ships with netavark:

podman network create appnet
podman run -d --network appnet --name api  ghcr.io/acme/api:latest
podman run -d --network appnet --name web  ghcr.io/acme/web:latest
# 'web' can now reach the API at http://api:PORT

7. Automatic image updates with podman auto-update and rollback

Any container labelled io.containers.autoupdate=registry (the AutoUpdate=registry line in the Quadlet file sets this) participates in podman auto-update. It checks the registry for a newer digest on the same tag, pulls it, and restarts the systemd unit – only for containers managed by systemd, which Quadlet ones are by definition.

# Dry run: show what would update, change nothing
podman auto-update --dry-run

# Apply: pull newer digests and restart affected units
podman auto-update

The feature that makes this safe in production is automatic rollback. After restarting an updated unit, Podman checks whether the restart succeeded. If the unit fails to come up, Podman rolls the image reference back to the previous digest and restarts again. Rollback is keyed on systemd reporting the unit as started, so what counts as “started” matters: by default that is little more than “the process did not immediately exit”, and a subtly broken image will sail through. To make the healthcheck the gate, pair a real HealthCmd with Notify=healthy in the [Container] section (Podman 5.0 or newer), which delays the unit’s ready notification until the first healthcheck passes.

Wire it to a timer so it runs unattended. Podman ships podman-auto-update.service and podman-auto-update.timer; enable them per-user:

systemctl --user enable --now podman-auto-update.timer
systemctl --user list-timers podman-auto-update.timer

Pin to immutable digests for anything where surprise is unacceptable, and reserve AutoUpdate=registry for tags you genuinely want to track (a stable channel, an internal :prod tag your pipeline re-points). Auto-update on a :latest you do not control is how you get paged at 3 a.m.
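That policy can be enforced with a small lint over the Quadlet directory. The sketch below is an assumption-laden example, not a Podman feature: lint_autoupdate is a hypothetical name, and it only understands plain Image= and AutoUpdate=registry lines, flagging files whose image has no tag or uses :latest.

```shell
# Hypothetical lint: flag .container files that combine AutoUpdate=registry
# with an image that has no tag or uses :latest.
# Usage: lint_autoupdate [DIR]
lint_autoupdate() (
  dir=${1:-$HOME/.config/containers/systemd}
  status=0
  for f in "$dir"/*.container; do
    [ -e "$f" ] || continue
    grep -q '^AutoUpdate=registry' "$f" || continue
    image=$(sed -n 's/^Image=//p' "$f" | head -n 1)
    # Look only at the last path component so a registry port is not
    # mistaken for a tag (e.g. localhost:5000/app).
    last=${image##*/}
    case "$last" in
      *@sha256:*) tag=digest ;;
      *:*)        tag=${last##*:} ;;
      *)          tag=latest ;;      # no tag means :latest
    esac
    if [ "$tag" = latest ]; then
      echo "$f: AutoUpdate=registry tracks a floating :latest ($image)"
      status=1
    fi
  done
  return "$status"
)
```

Because it exits non-zero when anything is flagged, it can run as a CI step in the repository that holds the .container files.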

8. Logging, healthchecks, and resource limits via cgroups v2

Because the container is a systemd unit, its stdout/stderr go to the journal automatically. No log driver to configure.

# Follow a Quadlet unit's logs
journalctl --user -u myapp.service -f

# Or Podman's own view, which also reads the journal
podman logs -f myapp

Healthchecks declared in the Quadlet file (HealthCmd, HealthInterval, HealthRetries) are enforced by Podman via a transient systemd timer per container. Inspect current health:

podman healthcheck run myapp        # run it once, now
podman inspect myapp --format '{{.State.Health.Status}}'   # healthy | unhealthy | starting

You can have Podman act on failure with HealthOnFailure=kill (or restart / stop), turning a failed probe into a restart that systemd then re-supervises.
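Deployment scripts often need to block until the probe reports a result. The following is a minimal polling sketch built on the podman inspect call shown above; the function name wait_healthy and the one-second poll interval are assumptions.

```shell
# Hypothetical helper: poll a container's health until it is healthy,
# reports unhealthy, or TIMEOUT seconds pass.
# Usage: wait_healthy NAME [TIMEOUT_SECONDS]
wait_healthy() (
  name=$1
  timeout=${2:-60}
  elapsed=0
  status=
  while [ "$elapsed" -lt "$timeout" ]; do
    # An inspect failure (container not created yet) is treated as "no status".
    status=$(podman inspect "$name" --format '{{.State.Health.Status}}' 2>/dev/null) || status=
    case "$status" in
      healthy)   echo "$name is healthy after ${elapsed}s"; return 0 ;;
      unhealthy) echo "$name is unhealthy"; return 1 ;;
    esac
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "$name: timed out after ${timeout}s (last status: ${status:-unknown})"
  return 1
)
```

A typical call is wait_healthy myapp 90 right after systemctl --user restart myapp.service. The starting state simply keeps the loop going until the timeout.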

Resource limits ride on cgroup v2. Rootless support for CPU and memory limits requires that the relevant controllers be delegated to your user’s systemd instance through user@.service. systemd delegates memory and pids by default; whether cpu, cpuset, and io are delegated depends on the systemd version and distribution, and may need a Delegate= drop-in for user@.service. Set caps declaratively in the Quadlet file:

[Container]
Image=ghcr.io/acme/myapp:1.4.2
PodmanArgs=--memory=512m --cpus=1.5 --pids-limit=256

Or, more idiomatically, drive them through the unit’s [Service] directives so they appear in systemctl status:

[Service]
MemoryMax=512M
CPUQuota=150%
TasksMax=256

Verify delegation and live accounting:

# Which controllers are delegated to your user manager?
cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
# Expect at least: memory pids  (plus cpu and io if CPUQuota or I/O limits are to take effect)

# Live resource usage for running containers
podman stats --no-stream

If memory limits silently do nothing, the memory controller is not delegated – check the line above before blaming Podman.
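The delegation check can be made explicit instead of eyeballed. A sketch, assuming the single-line, space-separated format of cgroup.controllers; the function name missing_controllers is hypothetical.

```shell
# Hypothetical helper: report which of the wanted controllers are absent
# from a cgroup.controllers file (one space-separated line).
# Usage: missing_controllers FILE CONTROLLER...
missing_controllers() (
  file=$1
  shift
  have=" $(cat "$file") "
  missing=
  for want in "$@"; do
    case "$have" in
      *" $want "*) ;;                    # delegated
      *) missing="$missing $want" ;;     # not delegated
    esac
  done
  if [ -n "$missing" ]; then
    echo "not delegated:$missing"
    return 1
  fi
  echo "all delegated: $*"
)
```

Point it at the cgroup.controllers file of your user@<uid>.service cgroup and list the controllers your limits rely on, for example cpu memory pids. If a controller is reported missing, the usual remedy is a drop-in for user@.service that sets Delegate= to the controllers you need, followed by a daemon-reload and a fresh login.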

Verify

Run this end to end on a clean user to confirm the whole chain works:

# 1. Identity ranges and lingering are in place
grep "^$USER:" /etc/subuid /etc/subgid
loginctl show-user "$USER" --property=Linger        # Linger=yes

# 2. Rootless storage uses overlay, not vfs
podman info --format '{{.Store.GraphDriverName}}'   # overlay

# 3. The Quadlet unit generated and is active
systemctl --user daemon-reload
systemctl --user is-active myapp.service            # active

# 4. Container is healthy and listening
podman inspect myapp --format '{{.State.Health.Status}}'   # healthy
curl -fsS http://localhost:8080/healthz

# 5. It actually runs unprivileged on the host
podman top myapp huser hpid user                    # huser is your mapped UID, not 0

# 6. Auto-update sees it and the timer is armed
podman auto-update --dry-run
systemctl --user is-enabled podman-auto-update.timer   # enabled

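# 7. Survives a reboot (the real test)
sudo systemctl reboot
# ...once the host is back, reconnect and check:
systemctl --user is-active myapp.service            # active, post-reboot
# Started at boot rather than at your login? Compare these two timestamps:
uptime -s
systemctl --user show myapp.service --property=ActiveEnterTimestamp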

If step 7 fails but everything else passed, you forgot loginctl enable-linger.

Enterprise scenario

A payments platform team I worked with ran ~40 stateless API containers per edge node across a fleet of bare-metal RHEL 9 boxes, originally on a root Docker daemon. Their security organisation issued a hard control: no root-owned container daemon and no privileged ports bound by application processes, driven by a PCI assessment that flagged the Docker socket as a single point of total host compromise. The app, however, was hard-coded to bind port 443 and the access logs fed an IP-based fraud model, so they could not lose the real client source address.

They moved each service to rootless Podman under a dedicated svc-api user with linger enabled, defined every workload as a Quadlet .container file checked into the same Git repo as their Ansible, and let the generator produce the systemd units on each host. Two specifics made it stick. First, instead of granting the service the privilege to bind 443, they kept the container on a high port and lowered the unprivileged port threshold only as far as needed, fronting it with the host firewall so nothing application-owned ran as root:

# ~svc-api/.config/containers/systemd/payments-api.container  (rootless, runs under svc-api's user manager)
[Container]
Image=registry.internal/payments/api:prod
PublishPort=8443:8443
AutoUpdate=registry
HealthCmd=curl -fsS https://localhost:8443/healthz -k || exit 1
HealthInterval=15s
# pasta preserves the real client source IP for the fraud model
Network=pasta

Second, they switched the rootless network backend to pasta, which preserved the genuine client source IP end to end – slirp4netns had been rewriting every request to its internal gateway address, which would have blinded the fraud model. Auto-update was pointed at the internal :prod tag, which their pipeline re-pinned to an immutable digest only after a canary cleared, and the built-in rollback (gated on the 15-second healthcheck) caught two bad pushes in the first quarter without a human touching a host. The Docker socket finding was closed, and “restart policy” became a one-line Restart=on-failure that every other systemd service on the box already used.
