Ansible for Edge and IoT Fleet Management, In Depth — Pull-Mode, Signed Manifests, Constrained Devices and Intermittent Networks
Datacentre Ansible assumes warm prerequisites: every host reachable on a stable network, fast SSH, predictable hardware, plentiful CPU and disk. Edge Ansible assumes the opposite. Devices are scattered across thousands of locations, behind NAT and shared 4G modems, with 1GB of RAM and 8GB of flash, online for minutes a day. The “fleet” is 5,000 retail kiosks, 25,000 wind-turbine controllers, 100,000 vehicle telematics units, or 1.5 million water-meter gateways. Push-mode Ansible — open SSH from a control plane to every host on a schedule — is the wrong shape.
This lesson is the specialist guide to inverting the model: pull-based agents, signed manifests, fleet operators built on Kubernetes-at-the-edge (k3s, MicroK8s, KubeEdge), and the realistic operational patterns for constrained, intermittently-connected hardware. We will use ansible-pull, the Red Hat Device Edge / Image Mode for RHEL stack, and the ostree/bootc image-based update model where appropriate.
We will be opinionated about scale: at a hundred edge hosts, AAP push-mode still works. At a thousand, mesh execution nodes start to creak. At ten thousand and beyond, you must move to pull-mode and image-based updates; if you do not, you will brick devices in the field. The patterns here scale from one to one million.
Position in the curriculum. Tier 1–4 fluency required, plus the Tier 5 air-gapped lesson — many edge fleets are also air-gapped (industrial OT, aviation, defence). The compliance lesson informs the signing model.
What “edge” means and why it changes the rules
“Edge” covers a wide range; the relevant attributes for Ansible are:
- Numbers: 1k–1M devices, sometimes more.
- Location: thousands of physical sites — retail stores, wind farms, hospitals, vehicles.
- Connectivity: intermittent 4G/5G/satellite/WiFi, often behind carrier NAT. No inbound reachability.
- Hardware: small ARM64/x86 boards (Raspberry Pi, NXP iMX, NVIDIA Jetson). 1–8GB RAM. 8–256GB flash. No swap.
- Power: often constrained. Reboots cause field outages.
- Update window: minutes per day, sometimes per week. Updates must be atomic and rollback-able.
- Identity: device certificates, not user/password.
- Lifecycle: 5–10 years in field with occasional swap-outs.
Together these break every assumption datacentre Ansible makes:
| Datacentre assumption | Edge reality |
|---|---|
| Control plane connects out to host (push) | Host must connect to control plane (pull) |
| SSH always reachable | Device may be online for 4 minutes per day |
| Failed task = retry | Failed task on a 4G link = wait until tomorrow |
| dnf install foo or apt install foo | No network during the install window |
| Rollback = revert config | Rollback = revert the entire OS image (atomic) |
| Privilege escalation via sudo | Device root-of-trust signed firmware; no sudo at all |
| Inventory in CMDB updated by operators | Inventory is a serial number on a sticker, scaling weekly |
The mental shift is from “playbooks executed on hosts” to “hosts that pull and apply signed manifests.”
Three operational patterns at the edge
There are three patterns at scale; pick deliberately.
1. ansible-pull — the simplest pull-mode. Each device runs ansible-pull from cron / systemd timer, fetches a Git repo, runs the playbook locally. Works up to about 5k devices with discipline.
2. Image-based (ostree/bootc) — every change ships as a new bootable OS image. The device reboots into the new image (atomic) or rolls back (atomic) on failure. This is the Red Hat Device Edge and Fedora bootc / RHEL Image Mode model. Scales to millions; what big telcos and automakers actually use.
3. Kubernetes-at-the-edge — k3s/MicroK8s on each device, with a fleet operator (Rancher Fleet, Argo CD edge mode, EdgeX Foundry) that pulls and reconciles workloads. Most appropriate when the workload is itself container-based (ML models, data pipelines, video analytics).
In real fleets, you usually combine: image-based for the OS and platform layer (Pattern 2), Kubernetes for the application layer (Pattern 3), and ansible-pull for one-shot operational tasks (Pattern 1) when needed.
Pattern 1: ansible-pull for small to medium fleets
ansible-pull is the bare-minimum pull-mode runner. It clones a Git repo, runs a playbook locally with --connection=local, and exits. Configure on each device:
# /etc/systemd/system/ansible-pull.service
[Unit]
Description=ansible-pull
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/bin/ansible-pull \
-U https://gitea.kv.local/edge/edge-fleet.git \
-i localhost, \
-C main \
-d /var/lib/ansible-pull/repo \
--vault-password-file /etc/ansible/vault.pwd \
--verify-commit \
edge-pull.yml
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/ansible-pull.timer
[Unit]
Description=ansible-pull every 30 minutes (with jitter)
[Timer]
OnBootSec=2min
OnUnitActiveSec=30min
RandomizedDelaySec=10min
Persistent=true
[Install]
WantedBy=timers.target
Three details that matter:
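- --verify-commit: requires the Git tip to be GPG-signed by a key in the device's keyring. Without it, anyone with push access to the repo can run code on every device. Do not skip this.
- RandomizedDelaySec=10min: jitter prevents 5,000 devices from all hitting the Git server in the same minute. Without it, the synchronised burst can overwhelm your repo server.
- Persistent=true: intended to make missed runs fire at next boot, so a device offline for 6 days catches up. Note that systemd only honours Persistent= on OnCalendar= timers; with the monotonic OnBootSec=/OnUnitActiveSec= settings shown here, it is OnBootSec=2min that guarantees a run shortly after each boot. Switch to an OnCalendar= schedule if you want Persistent= to take effect.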
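To get a feel for what the jitter buys you, here is a minimal Python sketch that compares the busiest second with and without a randomised delay. It assumes each device picks a uniform delay within the jitter window (which is how RandomizedDelaySec behaves); the function name peak_arrivals and the fixed seed are illustrative choices, not part of any tool.

```python
import random
from collections import Counter


def peak_arrivals(devices: int, jitter_seconds: int, seed: int = 0) -> int:
    """Return the number of devices landing in the busiest one-second bucket."""
    if jitter_seconds <= 0:
        # No jitter: every device fires in the same second.
        return devices
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    buckets = Counter(rng.randrange(jitter_seconds) for _ in range(devices))
    return max(buckets.values())


if __name__ == "__main__":
    print("no jitter      :", peak_arrivals(5000, 0))
    print("10 min of jitter:", peak_arrivals(5000, 600))
```

With 5,000 devices spread over 600 seconds the average is about eight pulls per second, so the peak drops from thousands of simultaneous clones to a few tens.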
The repo is a normal Ansible project with one entry-point playbook (edge-pull.yml) using --connection=local. Tasks gated by host facts (when: ansible_facts.hostname.startswith('kiosk-')) let you carve out groups without separate inventories.
For small fleets up to ~5,000 devices, this is enough. The Git server (Gitea, GitLab, or similar), optionally alongside the AAP-bundled Hub for collections, is the only “infrastructure.” Git webhooks fire on push, not on fetch, so the server cannot tell you who has pulled; instead, have each device report the commit it applied to a metrics endpoint at the end of every run. That report is your fleet status dashboard.
Signed manifests and rollback for ansible-pull
ansible-pull itself doesn’t roll back; if the playbook breaks, the device is broken. Engineer rollback explicitly:
# edge-pull.yml — checkpointed apply
---
- hosts: localhost
  connection: local
  tasks:
    - name: Read currently applied commit hash
      ansible.builtin.slurp:
        src: /var/lib/ansible-pull/applied_commit
      register: applied
      ignore_errors: true

    - name: Compute pending commit hash
      ansible.builtin.command:
        cmd: git -C /var/lib/ansible-pull/repo rev-parse HEAD
      register: pending
      changed_when: false

    - name: Snapshot before applying (btrfs)
      ansible.builtin.command:
        cmd: btrfs subvolume snapshot / /.snapshots/pre-{{ pending.stdout[:7] }}
      when: applied.content | default('') | b64decode | trim != pending.stdout
      ignore_errors: true # not all devices have btrfs

    - name: Run the actual configuration role
      ansible.builtin.import_role:
        name: kiosk_configure

    - name: Smoke test
      ansible.builtin.import_role:
        name: kiosk_smoke

    - name: Persist applied commit (only if smoke test passed)
      ansible.builtin.copy:
        dest: /var/lib/ansible-pull/applied_commit
        content: "{{ pending.stdout }}\n"

    - name: Trim old snapshots to keep only last 3
      ansible.builtin.shell: |
        ls -1t /.snapshots | tail -n +4 | xargs -r -I {} btrfs subvolume delete /.snapshots/{}
      ignore_errors: true
Rollback is then a separate playbook triggered by a watchdog: if the device cannot reach the Git server for 24 hours and the smoke test is failing, the watchdog rolls back to the previous snapshot. This is the “deadman switch” pattern; it has saved more fleets than any other single mechanism.
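The watchdog's decision rule is small enough to write down. The following is a minimal Python sketch under stated assumptions: the function name should_roll_back and its parameters are hypothetical, and how the watchdog learns the last contact time and the smoke-test result is left to your implementation.

```python
from datetime import datetime, timedelta


def should_roll_back(
    last_contact: datetime,
    smoke_test_ok: bool,
    now: datetime,
    max_silence: timedelta = timedelta(hours=24),
) -> bool:
    """Deadman rule: roll back only if the device has been cut off for too
    long AND its smoke test is failing. Either condition alone is not enough."""
    silent_too_long = (now - last_contact) >= max_silence
    return silent_too_long and not smoke_test_ok


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0)
    print(should_roll_back(datetime(2024, 1, 1, 11, 0), False, now))
```

Requiring both conditions is deliberate: a healthy device on a dead link should keep running, and a failing device that can still reach Git may be fixed by the next pull.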
Pattern 2: image-based updates with ostree / bootc
For fleets above ~5k devices, the right primitive is the bootable OS image. Instead of mutating the running OS (running dnf install on a Pi in the field), you publish a new immutable image; the device boots into it and rolls back on failure. Atomic, image-based OTA updates of this kind are the norm in large automotive and aerospace fleets.
The Red Hat way is RHEL Image Mode (bootc), which uses OSTree as the on-device store and OCI container images as the build/distribution format. bootc is a small native runtime that knows how to switch between two deployments (current and pending) atomically.
# Containerfile — your edge OS image
FROM registry.redhat.io/rhel9/rhel-bootc:9.4
RUN dnf install -y \
podman \
systemd-container \
ansible-core \
python3-cryptography \
kiosk-app && \
dnf clean all
# Bake the kiosk app config into the image
COPY etc/kiosk/ /etc/kiosk/
# Enable services
RUN systemctl enable kiosk-app.service
# Bake the ansible pull config too
COPY etc/systemd/system/ansible-pull.service /etc/systemd/system/
COPY etc/systemd/system/ansible-pull.timer /etc/systemd/system/
RUN systemctl enable ansible-pull.timer
Build, push, sign (cosign attaches the signature to the image in the registry, so the push has to come first):
podman build -t registry.kv.local/kiosk-os:1.4.2 -f Containerfile .
podman push registry.kv.local/kiosk-os:1.4.2
cosign sign --key /etc/cosign-edge.key registry.kv.local/kiosk-os:1.4.2
On each device:
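bootc switch registry.kv.local/kiosk-os:1.4.2 # one-time: track this image and stage it for the next boot
bootc upgrade --check # report whether a newer image is available, without staging it
bootc upgrade # fetch and stage the new deployment
systemctl reboot # boots into the staged deployment (bootc upgrade --apply stages and reboots in one step)
# if anything is broken:
bootc rollback # queues the previous deployment for the next boot
systemctl reboot # boots into the previous deployment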
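The state machine: each device keeps two deployments on disk (current and pending). On reboot, the bootloader boots the pending one. With boot counting and a health-check framework such as greenboot configured, a deployment that fails to boot 3 times (or fails health checks within N minutes) makes the bootloader fall back to the previous one automatically. This fallback lives in the bootloader and the health-check tooling rather than in hardware, and bootc does not set it up on its own, so test it on every hardware class you ship. With it in place, a bad image is very hard to turn into a bricked device, and that property is what makes the model viable at million-device scale.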
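The fallback rule in the paragraph above can be modelled in a few lines. This is a toy Python sketch of the rule only, not of how any bootloader implements it; the function name boot_sequence and the three return labels are hypothetical.

```python
from typing import Iterable


def boot_sequence(outcomes: Iterable[bool], max_attempts: int = 3) -> str:
    """Decide which deployment the device ends up on.

    Each item in `outcomes` is one boot attempt of the pending deployment:
    True means it booted and passed its health checks, False means it did not.
    """
    failures = 0
    for booted_and_healthy in outcomes:
        if booted_and_healthy:
            return "pending"  # the new deployment is kept
        failures += 1
        if failures >= max_attempts:
            return "previous"  # attempts exhausted: fall back
    return "undecided"  # fewer than max_attempts failures seen so far


if __name__ == "__main__":
    print(boot_sequence([False, False, False]))
    print(boot_sequence([False, True]))
```

The useful property is that every path through the loop ends on a deployment that either passed its checks or is the one that was running before the update.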
Ansible’s role here is driving the image build, not running on each device:
# build-edge-image.yml — runs on the build host
- hosts: build_host
  tasks:
    - name: Render Containerfile from template
      ansible.builtin.template:
        src: Containerfile.j2
        dest: /tmp/build/Containerfile

    - name: Build image
      containers.podman.podman_image:
        name: "registry.kv.local/kiosk-os"
        tag: "{{ release }}"
        path: /tmp/build
        push: true
        push_args:
          dest: "registry.kv.local/kiosk-os:{{ release }}"

    - name: Sign with cosign
      ansible.builtin.command:
        cmd: cosign sign --key /etc/cosign-edge.key registry.kv.local/kiosk-os:{{ release }}

    - name: Update fleet rollout config
      ansible.builtin.uri:
        url: https://fleet.kv.local/api/rollouts
        method: POST
        body_format: json
        body:
          target: kiosks
          image: "registry.kv.local/kiosk-os:{{ release }}"
          wave: canary
          percentage: 1
Then a fleet operator (Rancher Fleet, Argo CD edge, custom controller) watches the rollout config. Each targeted device picks up the new image reference at its next check-in and runs the bootc switch locally; nothing is pushed to the device.
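One way a controller (or the device itself) can decide whether a device belongs to a "percentage: 1" wave is a stable hash of its serial number. The sketch below is an assumption about how such a selection could work, not the behaviour of any named fleet operator; rollout_bucket, in_wave and the rollout ID format are hypothetical.

```python
import hashlib


def rollout_bucket(device_serial: str, rollout_id: str) -> int:
    """Map a device to a stable bucket in the range 0..99 for one rollout."""
    digest = hashlib.sha256(f"{rollout_id}:{device_serial}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_wave(device_serial: str, rollout_id: str, percentage: int) -> bool:
    """True if the device falls inside the first `percentage` buckets."""
    return rollout_bucket(device_serial, rollout_id) < percentage


if __name__ == "__main__":
    fleet = [f"kiosk-{i:05d}" for i in range(10_000)]
    canaries = [s for s in fleet if in_wave(s, "kiosk-os-1.4.2", 1)]
    print(len(canaries), "devices in the 1% canary wave")
```

Because the bucket depends only on the serial and the rollout ID, a device that was in the 1% wave is automatically in the 5% and 25% waves too, and mixing the rollout ID into the hash means the same devices are not always the canaries.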
Pattern 3: Kubernetes-at-the-edge with fleet operators
For workloads that are themselves container-based — ML inference, video analytics, data pipelines — running k3s on each device and using a fleet operator is often the cleanest pattern.
k3s is a lightweight, single-binary Kubernetes distribution (well under 100MB) that runs on 1GB-RAM ARM64 boxes. Each device runs a single-node k3s cluster (or a 3-node mini-cluster across nearby devices); workloads are pods deployed via a fleet operator.
The Ansible role is to:
- Install and configure k3s on each device (one-time, via the OS image).
- Manage the fleet operator’s manifests (in Git, applied across the fleet).
- Manage the device certificates / mTLS that authenticates the device to the operator.
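# install k3s during image build (Pattern 2 + Pattern 3)
# Note: --server/--token join this node to an existing cluster at fleet-control.kv.local;
# for a standalone single-node cluster per device, drop both flags.
- name: Install k3s
  ansible.builtin.shell: |
    curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.30.5+k3s1 sh -s - \
      --token={{ vault_k3s_token }} \
      --server=https://fleet-control.kv.local:6443
  no_log: true # the join token appears on the command line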
The fleet manifest is GitOps:
# fleet/kiosk-app/fleet.yaml
namespace: kiosk
helm:
  chart: oci://registry.kv.local/charts/kiosk-app
  version: 1.4.2
targets:
  - name: canary
    clusterSelector:
      matchLabels:
        wave: canary
  - name: prod
    clusterSelector:
      matchLabels:
        wave: prod
Devices labelled wave: canary get the new chart first; once metrics confirm health, you bump wave: prod to the same version. The label change is tracked in Git; the rollback is git revert.
Ansible orchestrates the rollout (e.g., flipping labels in waves) but does not need to reach each device directly. The control plane publishes desired state; the device connects out and pulls it.
Connectivity reality: NAT, MQTT, and one-way trust
Most edge devices live behind carrier NAT (4G/5G/cable), which means no inbound connectivity from the control plane. Three viable bidirectional channels:
- HTTPS pull (every N minutes): devices initiate. Used by ansible-pull, bootc, fleet-agent, Argo CD edge.
- MQTT/AMQP: devices maintain a long-lived connection to a broker; the control plane publishes commands. Used by industrial IoT, EdgeX, AWS IoT Core, Azure IoT Hub.
- WebSocket / gRPC stream: devices maintain a long-lived stream; bidirectional RPC. Used by Tailscale-style VPN agents, RancherD, KubeEdge.
Each has trade-offs. HTTPS pull is the simplest but has high latency (a command waits up to one full poll interval, and half an interval on average). MQTT is real-time but adds a broker dependency. WebSocket gives RPC semantics but is heavyweight for very small devices.
Pick one and standardise. Mixing transports per device class makes the control plane brittle; uniform transport with class-specific topics/labels makes it manageable.
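Whichever transport you pick, the device-side loop needs a retry policy for flaky links. Below is a minimal Python sketch of one common choice, exponential backoff with random jitter capped at the normal poll interval. The function name next_attempt_delay and the 30-second retry base are assumptions for illustration.

```python
import random
from typing import Optional


def next_attempt_delay(
    poll_interval: float,
    consecutive_failures: int,
    retry_base: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before the next pull attempt.

    After a success, wait the normal poll interval. After failures, retry
    sooner, doubling the ceiling each time but never exceeding the normal
    interval, and pick a random point below the ceiling so that devices
    behind the same gateway do not retry in lockstep.
    """
    if consecutive_failures <= 0:
        return poll_interval
    rng = rng or random.Random()
    ceiling = min(poll_interval, retry_base * 2 ** (consecutive_failures - 1))
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    for failures in range(5):
        print(failures, round(next_attempt_delay(1800, failures), 1))
```

On devices that are online for only minutes a day, retrying quickly after a failure matters more than being polite to the server, which is why the ceiling starts small and only grows with repeated failures.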
For the AAP-aware reader: AAP automation mesh nodes are not viable as edge agents. They are too heavy, too tightly coupled to the controller, and require always-on network. Edge devices need the patterns above; AAP can still be the operator-facing UI that triggers fleet rollouts via webhooks.
Identity and trust at the edge
Each device must be uniquely identifiable and authenticate cryptographically. The chain:
- Hardware root of trust — TPM 2.0 module on the board, with a manufacturer-issued Endorsement Key (EK).
- Device certificate — issued at provisioning by an internal CA, bound to the TPM’s EK or AIK. Stored in TPM-protected NVRAM.
- Workload secrets — short-lived tokens issued by the control plane after device cert validation. Rotated automatically.
The Ansible workflow:
# device-provision.yml — runs on the build host or first-boot enrolment
- name: Generate device certificate via internal CA
  community.crypto.x509_certificate:
    path: "/var/lib/devices/{{ device_serial }}.crt"
    privatekey_path: "/var/lib/devices/{{ device_serial }}.key"
    csr_path: "/var/lib/devices/{{ device_serial }}.csr"
    provider: ownca
    ownca_path: /etc/pki/edge-ca/ca.crt
    ownca_privatekey_path: /etc/pki/edge-ca/ca.key
    ownca_privatekey_passphrase: "{{ vault_ca_passphrase }}"
    ownca_not_after: "+1095d" # 3 years
  no_log: true

- name: Bake cert into device image at first boot
  ansible.builtin.copy:
    src: "/var/lib/devices/{{ device_serial }}.crt"
    dest: "/etc/pki/device/cert.pem"
  delegate_to: "{{ device_address }}"
  no_log: true
Workload secrets (S3 keys, MQTT credentials) are not stored on the device; the device authenticates with its certificate and the control plane returns short-lived credentials each session. This pattern is what makes a stolen-device scenario manageable: revoke the cert, rotate workload creds, the lost device is locked out within minutes.
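The short-lived credential idea can be illustrated with a toy token service. This Python sketch uses an HMAC over "serial|expiry" as a stand-in for whatever token format your control plane really issues; SECRET, issue_token and validate_token are hypothetical names, and a real deployment would keep the signing key in an HSM or KMS and validate the device certificate before issuing anything.

```python
import hashlib
import hmac

SECRET = b"demo-only-control-plane-key"  # hypothetical; never hard-code in production


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(device_serial: str, now: int, ttl_seconds: int = 900) -> str:
    """Issue a token that expires ttl_seconds after `now` (Unix seconds)."""
    payload = f"{device_serial}|{int(now) + ttl_seconds}"
    return f"{payload}|{_sign(payload)}"


def validate_token(token: str, now: int, revoked_serials=frozenset()) -> bool:
    """Accept only well-formed, correctly signed, unexpired, unrevoked tokens."""
    try:
        serial, expires, signature = token.split("|")
        expires_at = int(expires)
    except ValueError:
        return False  # malformed token
    if not hmac.compare_digest(signature, _sign(f"{serial}|{expires}")):
        return False  # tampered or forged
    if serial in revoked_serials:
        return False  # device was reported lost or stolen
    return now < expires_at


if __name__ == "__main__":
    token = issue_token("SN-001", now=1_000)
    print(validate_token(token, now=1_500))
    print(validate_token(token, now=1_500, revoked_serials={"SN-001"}))
```

The short TTL is what bounds the damage: even if revocation never reaches a validator, a stolen token stops working when it expires.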
OTA security considerations
Any OTA update path is also an attack path. Defenses:
- All artefacts signed: Containerfile-built images signed with cosign; ansible-pull commits signed with GPG; manifests signed by the fleet operator’s CA.
- No public images allowed: device-side bootc config refuses unsigned or wrong-issuer images.
- CRL/OCSP enforced: revoked certs get blocked even before the next reboot.
- Two-person review on every fleet rollout (canary stage gated by survey).
- Rate limiting: a device cannot pull the same image more than N times per hour (defends against forced-rollback attack).
- Signed metrics: device-emitted metrics signed with device key, so you cannot fake “100k devices reporting healthy” from one compromised host.
Auditors love the signed-everything story. Auditors hate the “we trust whatever Git pushes” story. Make the right choice early.
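The rate-limiting defence from the list above is easy to get subtly wrong, so here is a minimal Python sketch of a sliding-window limiter keyed by device and image. The class name PullRateLimiter and its in-memory storage are assumptions; a real registry front end would keep this state somewhere shared and durable.

```python
from collections import defaultdict, deque


class PullRateLimiter:
    """Allow at most max_pulls of the same image per device per window."""

    def __init__(self, max_pulls: int, window_seconds: int = 3600) -> None:
        self.max_pulls = max_pulls
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # (device, image) -> pull timestamps

    def allow(self, device_id: str, image: str, now: float) -> bool:
        pulls = self._history[(device_id, image)]
        # Forget pulls that have aged out of the window.
        while pulls and now - pulls[0] >= self.window_seconds:
            pulls.popleft()
        if len(pulls) >= self.max_pulls:
            return False
        pulls.append(now)
        return True


if __name__ == "__main__":
    limiter = PullRateLimiter(max_pulls=2)
    for t in (0, 10, 20, 3601):
        print(t, limiter.allow("SN-001", "kiosk-os:1.4.2", now=t))
```

Denied attempts are not recorded, so a device hammering the endpoint cannot extend its own lockout; it simply waits for the oldest accepted pull to age out.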
Constrained-device pragmatics
A 1GB-RAM device cannot run an Ansible execution environment full of Python deps. Strategies:
- Strip the EE: build a minimal EE that contains only the collections actually used at the edge. Aim for under 100MB.
- Cache aggressively: each ansible-pull reuses cached facts and the Git checkout from last run.
- Avoid package at runtime: the OS comes via image; ansible-pull configures, never installs. This frees RAM and removes network dependency.
- Use gather_subset: !all,!facter,!ohai: limit fact gathering to essentials.
- Use the local connection: never SSH to localhost; always connection: local in the entry playbook.
- Watch memory: a single careless with_items over a 100k-line file blows the RAM budget. Use loop_control: { label }, prefer lineinfile over template-then-replace for small edits.
For genuinely tiny devices (microcontrollers under 256KB), Ansible is the wrong tool — you ship pre-built firmware over OTA frameworks like Mender, RAUC, or Zephyr’s MCUboot. Ansible lives one layer up: provisioning the gateway that talks to those firmware updaters.
Fleet operator: state machine and dashboards
The fleet operator (whether Rancher Fleet, a custom controller, or an AAP workflow with EDA) tracks each device’s state machine:
[unenrolled] -- enrol --> [pending] -- ack --> [active]
                                                  |
                             rollout (canary) ----|
                                                  v
                                          [pending-update] -- apply ok   --> [active@new]
                                                           -- apply fail --> [active@old] (rollback)
                                                  |
                             retire --------------|--> [retired]
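The diagram translates directly into a transition table. The Python sketch below is one possible encoding, under the assumptions that "active@new" and "active@old" are the active state with different versions and that retirement happens from the active state; Device, TRANSITIONS and step are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# (current state, event) -> next state, following the diagram above.
TRANSITIONS = {
    ("unenrolled", "enrol"): "pending",
    ("pending", "ack"): "active",
    ("active", "rollout"): "pending-update",
    ("pending-update", "apply_ok"): "active",
    ("pending-update", "apply_fail"): "active",
    ("active", "retire"): "retired",  # assumption: only active devices retire
}


@dataclass
class Device:
    serial: str
    state: str = "unenrolled"
    version: Optional[str] = None  # version currently running
    target: Optional[str] = None   # version being rolled out, if any


def step(device: Device, event: str, target: Optional[str] = None) -> Device:
    """Apply one event, rejecting anything the diagram does not allow."""
    key = (device.state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event!r} from {device.state!r}")
    if event == "rollout":
        device.target = target
    elif event == "apply_ok":
        device.version, device.target = device.target, None
    elif event == "apply_fail":
        device.target = None  # rollback: the old version keeps running
    device.state = TRANSITIONS[key]
    return device
```

Rejecting unknown transitions loudly is the point of the table: a device that reports "apply ok" while the operator thinks it is unenrolled is a signal worth alerting on, not smoothing over.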
Each transition emits a metric (Prometheus counter) and an event (Kafka, MQTT, or simple HTTP webhook). The dashboard answers:
- How many devices on each version?
- Which devices haven’t checked in in N hours?
- What is the rollout success rate per wave?
- Which devices are stuck in pending-update?
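Most of these dashboard questions reduce to simple aggregations over the latest check-in per device. A minimal Python sketch, assuming check-ins are kept as a dict of device ID to a record with version, state and last_seen fields (a hypothetical shape, not a specific product's schema):

```python
from collections import Counter
from datetime import datetime, timedelta


def version_counts(checkins: dict) -> Counter:
    """How many devices are on each version?"""
    return Counter(record["version"] for record in checkins.values())


def stale_devices(checkins: dict, now: datetime, max_age: timedelta) -> list:
    """Which devices have not checked in within max_age?"""
    return sorted(
        device for device, record in checkins.items()
        if now - record["last_seen"] > max_age
    )


def stuck_devices(checkins: dict) -> list:
    """Which devices are sitting in pending-update?"""
    return sorted(
        device for device, record in checkins.items()
        if record["state"] == "pending-update"
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    checkins = {
        "kiosk-1": {"version": "1.4.2", "state": "active", "last_seen": now},
        "kiosk-2": {"version": "1.4.1", "state": "pending-update",
                    "last_seen": now - timedelta(days=8)},
    }
    print(version_counts(checkins))
    print(stale_devices(checkins, now, timedelta(days=7)))
    print(stuck_devices(checkins))
```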
Wave management is critical at scale: 1% canary for 24 hours, 5% wave for 48 hours, 25% wave for 72 hours, then full rollout. Bake this into the fleet operator config; do not let release managers override it manually.
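The wave schedule above can be expressed as data and turned into a plan. This Python sketch assumes the percentages are cumulative (the 5% wave includes the canary devices) and that each soak period starts when its wave starts; WAVES and wave_plan are hypothetical names, with wave labels borrowed from the glossary.

```python
# (wave name, cumulative % of fleet, soak time in hours before the next wave)
WAVES = [
    ("canary", 1, 24),
    ("beta", 5, 48),
    ("stable", 25, 72),
    ("full", 100, 0),
]


def wave_plan(fleet_size: int, waves=WAVES) -> list:
    """Return (name, cumulative device count, start hour) for each wave."""
    plan = []
    start_hour = 0
    for name, percent, soak_hours in waves:
        # Integer ceiling so that even a tiny fleet gets at least one canary.
        devices = (fleet_size * percent + 99) // 100
        plan.append((name, devices, start_hour))
        start_hour += soak_hours
    return plan


if __name__ == "__main__":
    for name, devices, start in wave_plan(100_000):
        print(f"{name:<7} {devices:>7} devices, starts at hour {start}")
```

For a 100,000-device fleet this puts the full rollout at hour 144, six days after the canary starts, which is the cost of the schedule and the reason nobody should be tempted to shorten it by hand.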
Anti-patterns that destroy edge fleets
- Using ansible-playbook from a centralised control plane in push mode. SSH-to-50k-NAT’d-devices is a non-starter. Use pull-mode or image-based.
- Mutating the running OS in the field. A failed dnf upgrade on a kiosk in a remote store ends with a truck roll. Use atomic image-based updates.
- No randomised jitter on pull timers. All 5k devices hit Git at minute 0; Git falls over.
- Unsigned manifests / unsigned images. Anyone with push access owns the fleet.
- Persistent device credentials with broad scope. A stolen device == fleet compromise. Use short-lived tokens.
- One-shot rollback via “ssh in and fix it”. When the device is offline 23 hours a day, you cannot SSH in. Engineer rollback to fire automatically without human intervention.
- No TPM / no hardware root of trust. Your fleet identity scheme can be cloned by extracting the certificate from a flash dump.
- Skipping the “deadman switch” rollback. The device must roll back when it cannot reach the control plane for too long, not require remote action.
- Treating canary as optional. A bad image to 100% of fleet is a 24-hour incident; a bad image to 1% is a 1-hour incident.
Frequently asked questions
1. When is ansible-pull enough vs when do I need image-based updates?
Up to ~5k stable devices with reliable network and a config-only change set: ansible-pull is enough. Beyond that, or any case where you need to update kernel / glibc / firmware atomically: move to bootc/ostree image-based.
2. Can I run AAP at the edge?
No. AAP is a datacentre product: controller, hub, automation mesh assume always-on network and ample resources. Use AAP as the back-of-house operator UI for the fleet operator (trigger rollouts, run reports), not as a runtime on each device.
3. How do I pin Ansible content for edge?
Build the EE once on the build host with all collection versions pinned and signed. Bake the EE into the device image (Pattern 2). Devices never resolve Galaxy, never pip install, never reach out for content at runtime.
4. What about bandwidth costs?
A bootc upgrade typically transfers only the image layers that changed, not the whole image. For a 200MB app change on a 4GB OS image, the transfer can be on the order of ~50MB once the changed layers are compressed; measure your own images, because layer layout decides the real figure. With waves and randomised jitter, fleet bandwidth at peak is manageable. Always cap concurrent rollouts per gateway, and prefer cellular off-peak windows for non-urgent updates.
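A back-of-envelope calculation makes "manageable" concrete. The Python sketch below assumes pulls are spread evenly across the wave window, which jitter only approximates; the function names are hypothetical and the 50MB payload is the illustrative figure from the answer above.

```python
def wave_transfer_gb(devices: int, payload_mb: float) -> float:
    """Total data moved for one wave, in gigabytes (1 GB = 1000 MB)."""
    return devices * payload_mb / 1000


def wave_average_mbps(devices: int, payload_mb: float, window_hours: float) -> float:
    """Average aggregate throughput in megabits per second over the window."""
    total_megabits = devices * payload_mb * 8
    return total_megabits / (window_hours * 3600)


if __name__ == "__main__":
    # 1% canary of a 100,000-device fleet, 50MB per device, over 24 hours.
    print(f"{wave_transfer_gb(1000, 50):.0f} GB total")
    print(f"{wave_average_mbps(1000, 50, 24):.2f} Mbit/s average")
```

The same arithmetic for the final wave (75,000 remaining devices) is where registry egress and per-gateway caps start to matter, so run it for every wave, not only the canary.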
5. How do I handle 4G outages during update?
The bootc upgrade process tolerates interruption: layers that were fully fetched stay cached, so if connectivity drops mid-fetch, the next attempt only needs what is still missing. Atomic apply happens only after full fetch + verification.
6. What about OT / industrial devices that cannot run Ansible at all?
You don’t run Ansible on them. You run Ansible on the gateway that aggregates their data (the “edge gateway”, which is a Linux box). The gateway does protocol translation (Modbus/OPC-UA → MQTT) and applies updates to the OT devices via vendor-specific tooling. Ansible owns the gateway; vendor tooling owns the deepest leaf.
7. Are there security implications to using cosign / sigstore at the edge?
Yes. The device must trust the cosign public keys, and the OCSP/CRL infrastructure must be reachable for revocation checks. Bake the public key bundle into the image; for revocation, use short-lived certs (renew every 24h) and reject expired without OCSP lookup.
8. How do I handle a fleet of mixed device types and OSes?
Tag the inventory by capability: arch=arm64, ram=2g, network=4g, os=rhel9, class=kiosk. Roles use when guards on tags. Image builds produce per-class images. Fleet operator targets devices by tag, not by serial number — a serial-number-by-serial-number rollout does not scale.
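Targeting by tag is plain label matching. A minimal Python sketch, using the tag names from the answer above; the inventory shape and the function names matches and select_devices are assumptions for illustration.

```python
def matches(device_tags: dict, selector: dict) -> bool:
    """True if the device carries every key/value pair in the selector."""
    return all(device_tags.get(key) == value for key, value in selector.items())


def select_devices(inventory: dict, selector: dict) -> list:
    """Return the serials of all devices matching the selector, sorted."""
    return sorted(
        serial for serial, tags in inventory.items() if matches(tags, selector)
    )


if __name__ == "__main__":
    inventory = {
        "SN-001": {"arch": "arm64", "ram": "2g", "os": "rhel9", "class": "kiosk"},
        "SN-002": {"arch": "x86_64", "ram": "8g", "os": "rhel9", "class": "gateway"},
        "SN-003": {"arch": "arm64", "ram": "2g", "os": "rhel9", "class": "kiosk"},
    }
    print(select_devices(inventory, {"class": "kiosk", "arch": "arm64"}))
```

An empty selector matches every device, which mirrors how label selectors usually behave; guard against that in rollout tooling if "everything" should never be an accidental target.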
9. What’s the right way to monitor an edge fleet?
Each device emits a small metrics blob to Prometheus pushgateway / OTel collector / MQTT topic on every check-in. The control plane aggregates by version, wave, region. Anomaly detection (a device that hasn’t checked in for 7 days) triggers the maintenance ticket flow. Avoid scraping every device; that defeats the pull-mode pattern.
10. What’s the single most underrated edge practice?
The deadman watchdog rollback. Every edge device should have a small program that monitors “have I successfully completed health checks recently?” and triggers a bootc rollback (or git checkout previous) if the answer is no. Without this, a regression deployed during your 4-minute connectivity window can leave a device unrecoverable until physical service. With it, the worst case is a slightly-out-of-date device.
Hands-on lab — ansible-pull with signed commits
This lab simulates a small edge fleet with a single device pulling a signed playbook from a local Git server.
Prerequisites: Linux box (or Pi), git, gpg, ansible-core ≥ 2.16.
mkdir -p edge-lab/{repo,device}
cd edge-lab
# 1. Set up the signing key first, so the initial commit can be signed
#    (the empty passphrase is for the lab only)
gpg --batch --pinentry-mode loopback --passphrase '' --quick-gen-key 'edge-signer@kv.local' rsa4096 sign 1y
git init repo
git -C repo config user.name 'Edge Signer'
git -C repo config user.email 'edge-signer@kv.local'
git -C repo config user.signingkey 'edge-signer@kv.local'
# 2. Local git repo with a tiny playbook
cat > repo/edge-pull.yml << 'EOF'
- hosts: localhost
  connection: local
  tasks:
    - ansible.builtin.debug:
        msg: "Edge pull at {{ ansible_date_time.iso8601 }}, commit {{ lookup('env', 'COMMIT') | default('unknown', true) }}"
    - ansible.builtin.copy:
        dest: /tmp/edge-status
        content: |
          last-pull: {{ ansible_date_time.iso8601 }}
          host: {{ ansible_facts.hostname }}
EOF
( cd repo && git add edge-pull.yml && git commit -S -m "v1: tiny edge play" )
# 3. The device side: configure ansible-pull
cat > device/pull.sh << 'EOF'
#!/bin/bash
set -e
# Commit left by the previous pull, if any; exported so the playbook's env lookup can read it
export COMMIT=$(git -C /tmp/edge-repo rev-parse HEAD 2>/dev/null || echo unset)
exec ansible-pull \
  -U "${PWD}/repo" \
  -i localhost, \
  -d /tmp/edge-repo \
  --verify-commit \
  edge-pull.yml
EOF
chmod +x device/pull.sh
# 4. Run it (from the edge-lab directory, since pull.sh resolves the repo via $PWD)
device/pull.sh
cat /tmp/edge-status
# 5. Add an unsigned commit and watch --verify-commit reject it
( cd repo && git commit --allow-empty --no-gpg-sign -m "unsigned" )
device/pull.sh # should fail because tip is unsigned
The lab proves the signing chain end-to-end: the device pulls only commits signed by a trusted key. Extend it: add a systemd timer in device/, add a snapshot/rollback step in the playbook, add a watchdog that rolls back if /tmp/edge-status is older than N minutes.
Glossary
- Edge — distributed devices outside the datacentre, often constrained, often intermittent.
- Push mode — control plane connects out to host (default Ansible).
- Pull mode — host connects in to control plane (ansible-pull).
- Image-based update — entire OS replaced atomically; rollback by selecting previous image.
- bootc — OCI-image based bootable Linux runtime (Red Hat Image Mode for RHEL, Fedora bootc).
- OSTree — Git-like content-addressed filesystem that backs ostree-based atomic updates.
- k3s / MicroK8s — small Kubernetes distributions for edge.
- Fleet operator — controller (Rancher Fleet, Argo CD edge, custom) that drives manifests across many edge clusters/devices.
- TPM — Trusted Platform Module; hardware root of trust.
- Deadman switch — automatic rollback if device cannot reach control plane.
- Wave — slice of fleet (canary 1%, beta 5%, stable 25%, full).
Certification mapping
- EX374 — Workflow orchestration, RBAC, EDA (used to gate edge rollouts).
- CKA / CKAD — Kubernetes operator patterns directly applicable to fleet operators.
- AWS / Azure IoT specialty exams — broker / device shadow / OTA patterns map onto this lesson’s concepts.
Next steps
You now have an opinionated architecture for managing real edge fleets at any scale. The remaining specialist lessons cover:
- ITSM and ChatOps — wiring AAP and the fleet operator into ServiceNow/Jira/Slack so device rollouts go through controlled approval gates.
- Backup and storage automation — even at the edge, gateway data needs backup and lifecycle policy.
- Online database migrations — when an edge fleet’s data layer changes shape (e.g., from on-device SQLite to a cloud data lake).
- Observability — capping the lesson series with how to ingest fleet metrics into Prometheus/Grafana for executive dashboards.
If you only take one habit from this lesson: never mutate a running edge OS. Every change ships as a new signed image; every rollback is automatic and atomic. Once that property holds, every other edge problem becomes solvable.