The Cardinal Rule of Auto-Remediation
A self-healing script that runs without guardrails is just a faster way to take production down. Every public post-mortem you have read where “the auto-remediation made it worse” follows the same script: the healer detected a symptom, took an action, the action made the symptom look like it cleared, the underlying cause was still there, the symptom returned, the healer fired again — and the resource it was “healing” entered a restart loop, depended-on services started failing, the healer started firing on those, and within 4 minutes a single bad pod became a regional outage.
The discipline that prevents this:
| Layer | What it stops |
|---|---|
| Explicit Detect-Decide-Act loop | Conflating “broken” with “looks broken” |
| Idempotency keys | The same incident triggering the same fix multiple times in parallel |
| Blast-radius limits | A bug in the healer cascading across the fleet |
| Circuit breakers | The healer continuing to fire when its own actions are failing |
| Dry-run mode | Shipping un-tested healer logic to production |
| Audit log | “Why did the healer do that?” being unanswerable |
This lesson teaches each layer with shell scripts, a lib/heal.sh you can source, and two worked examples: a good healer that has saved sites, and a structurally identical bad one that nuked them.
The Detect-Decide-Act Loop
Every healer has three explicit phases:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ DETECT │───▶│ DECIDE │───▶│ ACT │
└──────────┘ └──────────┘ └──────────┘
▲ │
└────────────────────────────────┘
loop interval
- Detect: gather signal (metric scrape, health check, log query). Output: a fact like “queue_depth=12000”.
- Decide: apply policy. Output: an intent like “restart worker pid 12345” — or “no action”.
- Act: perform the intended action with idempotency, rate limit, and audit log.
The phases are separate functions, callable independently. This matters for testing: you can unit-test Decide with synthetic facts without ever Acting, and you can dry-run by replacing Act with a logger.
Why the Loop Must Have Three Phases (Not Two)
The naive form is “if condition then action” which collapses Detect and Decide into a single test. This works for trivial healers but fails as soon as policy gets non-trivial:
- “Restart worker if queue depth > 10000 and worker hasn’t been restarted in last 5 minutes and less than 3 workers are restarting fleet-wide right now.”
The three conditions are Decide logic. Forcing them into the Detect phase means you can’t reuse the detector for monitoring alerts; forcing them into Act means dry-run can’t show what would happen. Keep them separate.
Skeleton
#!/usr/bin/env bash
# heal-worker.sh — example healer skeleton
set -euo pipefail
source /usr/local/lib/heal.sh
readonly NAME=worker-restarter
readonly INTERVAL=60 # seconds
while true; do
fact=$(detect)
intent=$(decide "$fact")
if [[ -n "$intent" ]]; then
act "$intent"
fi
sleep "$INTERVAL"
done
detect() {
# Returns a fact line. Empty = no signal.
local depth
depth=$(redis-cli LLEN myapp:queue 2>/dev/null || echo 0)
printf 'queue_depth=%s\n' "$depth"
}
decide() {
local fact="$1"
local depth
depth=$(echo "$fact" | awk -F= '/queue_depth=/ {print $2}')
if (( depth > 10000 )); then
# Find the oldest stuck worker
local pid
pid=$(pgrep -of myapp-worker)
[[ -n "$pid" ]] && printf 'restart_worker pid=%s\n' "$pid"
fi
}
act() {
heal_act_with_guardrails "$1"
}
The heal_act_with_guardrails wraps idempotency + rate limit + circuit breaker + audit log. It’s what lib/heal.sh provides.
Idempotency Keys: One Incident, One Action
Without idempotency keys, two consecutive Detect cycles that observe the same problem trigger the same Act twice. For “restart worker”, that’s two restarts in 60 seconds — which can confuse a process supervisor and leave the worker in an unknown state.
The fix is an idempotency key derived from the incident not the cycle:
# Compute idempotency key from intent + a coarse time bucket
# Two identical intents in the same 5-minute bucket are deduped.
heal_idempotency_key() {
local intent="$1"
local bucket
bucket=$(( $(date +%s) / 300 ))
printf '%s\n' "$intent.$bucket" | sha256sum | cut -d' ' -f1
}
The key is then used to gate execution:
heal_act_with_guardrails() {
local intent="$1"
local key
key=$(heal_idempotency_key "$intent")
local marker="/var/lib/heal/$NAME/keys/$key"
if [[ -f "$marker" ]]; then
heal_log "SKIP duplicate intent: $intent (key=$key)"
return 0
fi
mkdir -p "$(dirname "$marker")"
: > "$marker"
# ...continue to act
}
Two adjacent cycles that observe the same problem hit the same key and the second is silently skipped. After the bucket rolls (5 minutes later), the key changes and the healer can fire again — which is the desired property: “fix once per 5 min, not on every cycle.”
Idempotency Across Restarts
The marker file persists across script restarts. This is on purpose: if your healer is restarted by systemd at 14:32 and saw an intent at 14:31, it shouldn’t re-fire that same intent in the same bucket. Persisted markers give you crash-safety.
GC the markers nightly:
# tmpfiles.d/heal.conf
d /var/lib/heal/*/keys 0750 heal heal -
e /var/lib/heal/*/keys - - - 7d
tmpfiles.d e directive deletes files older than 7 days. Sufficient for any sensible bucket size.
Blast-Radius Limits: The Rate Limit That Saves The Fleet
This is the most important guardrail in the entire lesson. Every healer must declare a blast-radius limit and refuse to exceed it.
Examples:
- “Restart at most 1 worker per minute, per host.”
- “Disable at most 3 services per 10 minutes, fleet-wide.”
- “Roll back at most 10% of pods per 5 minutes.”
The limit is enforced via a token bucket:
# heal_rate_limit_check NAME RATE BURST
# Returns 0 (allow) if a token is available, 1 (deny) otherwise.
heal_rate_limit_check() {
local name="$1" rate="$2" burst="$3"
local state="/var/lib/heal/$NAME/rate-$name"
local now last_refill tokens elapsed
now=$(date +%s)
if [[ -f "$state" ]]; then
last_refill=$(awk '{print $1}' "$state")
tokens=$(awk '{print $2}' "$state")
else
last_refill=$now
tokens=$burst
fi
elapsed=$(( now - last_refill ))
tokens=$(awk -v t="$tokens" -v e="$elapsed" -v r="$rate" -v b="$burst" \
'BEGIN { v = t + e * r; if (v > b) v = b; print v }')
awk -v t="$tokens" 'BEGIN { exit !(t >= 1) }' || {
printf '%d %s\n' "$now" "$tokens" > "$state"
return 1
}
tokens=$(awk -v t="$tokens" 'BEGIN { print t - 1 }')
printf '%d %s\n' "$now" "$tokens" > "$state"
return 0
}
Usage:
# Allow 1 restart per minute, with burst of 3 (catches up during quiet periods)
if heal_rate_limit_check "worker-restart" "0.0167" "3"; then
# Allowed — proceed with restart
systemctl restart myapp-worker.service
else
heal_log "Rate-limited: skipping worker restart"
fi
The token bucket auto-recovers: if the healer doesn’t fire for 10 minutes, the bucket refills to burst, ready for a small flurry. But sustained firing can’t exceed rate per second.
Fleet-Wide Rate Limits via a Shared Lock
Per-host limits aren’t enough when the healer runs on every host. If 100 hosts each “rate-limit to 1/min” and the trigger condition is fleet-wide, you get 100 actions per minute fleet-wide. Solution: a central coordinator (Redis, Consul KV, etcd) that issues fleet-wide tokens:
# Use Redis SETNX as a distributed lock. The lock has a TTL so it auto-expires.
heal_fleet_lock_acquire() {
local key="$1" ttl="$2"
local token
token=$(uuidgen)
if redis-cli SET "heal:lock:$key" "$token" EX "$ttl" NX | grep -q OK; then
printf '%s\n' "$token"
return 0
fi
return 1
}
heal_fleet_lock_release() {
local key="$1" token="$2"
# Lua: only release if we own the lock (avoids releasing someone else's lock if we're slow)
redis-cli EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end" 1 "heal:lock:$key" "$token"
}
# Usage: only one host fleet-wide can act on this incident
if token=$(heal_fleet_lock_acquire "kafka-broker-restart" 300); then
systemctl restart kafka.service
heal_fleet_lock_release "kafka-broker-restart" "$token"
fi
The Lua script in release ensures we don’t accidentally delete someone else’s lock if we held ours past TTL. This is the classic correct Redis distributed lock pattern (Redlock without quorum, sufficient for healer coordination).
Circuit Breakers: Stop When Your Actions Aren’t Working
A circuit breaker disables the healer after N consecutive failed actions. This prevents the worst failure mode: the healer firing repeatedly because its actions don’t actually fix the problem (the worker keeps crashing on restart, but the healer keeps trying to restart it).
heal_circuit_breaker_check() {
local name="$1" max_failures="$2" cooldown="$3"
local state="/var/lib/heal/$NAME/cb-$name"
local now failures last_failure
now=$(date +%s)
if [[ ! -f "$state" ]]; then
return 0 # closed, allow
fi
failures=$(awk '{print $1}' "$state")
last_failure=$(awk '{print $2}' "$state")
if (( failures >= max_failures )); then
if (( now - last_failure < cooldown )); then
heal_log "Circuit OPEN for $name (failures=$failures, cooldown=$((cooldown - (now - last_failure)))s remaining)"
return 1
else
# Cooldown expired — half-open: allow one trial
heal_log "Circuit HALF-OPEN for $name"
return 0
fi
fi
return 0
}
heal_circuit_breaker_record() {
local name="$1" outcome="$2" # success | failure
local state="/var/lib/heal/$NAME/cb-$name"
local now=$(date +%s)
if [[ "$outcome" == "success" ]]; then
rm -f "$state" # reset on success (half-open → closed)
else
local failures=0
[[ -f "$state" ]] && failures=$(awk '{print $1}' "$state")
failures=$((failures + 1))
printf '%d %d\n' "$failures" "$now" > "$state"
fi
}
Usage with verification:
heal_act_with_circuit_breaker() {
local intent="$1"
if ! heal_circuit_breaker_check "worker-restart" 3 600; then
return 1 # circuit open, skip
fi
systemctl restart myapp-worker
sleep 10 # give it time to start
if systemctl is-active --quiet myapp-worker; then
heal_circuit_breaker_record "worker-restart" success
else
heal_circuit_breaker_record "worker-restart" failure
return 1
fi
}
After 3 consecutive failures, the circuit opens and stays open for 10 minutes. After cooldown, one trial action is allowed (half-open); success closes the circuit, failure re-opens it for another 10 minutes.
The numbers come from production experience: 3 failures means “this isn’t a transient — we’re in a real failure mode.” 10 minutes cooldown is enough for a human to investigate (and for monitoring alerts to wake someone).
Dry-Run Mode: The Discipline That Catches The Bad Healer Before It Lives
Every healer must have a --dry-run flag that runs Detect+Decide but replaces Act with a logger:
DRY_RUN=${DRY_RUN:-false}
heal_act() {
local intent="$1"
if $DRY_RUN; then
heal_log "DRY-RUN: would have acted on: $intent"
return 0
fi
# ...real action
}
Every new healer should run for 7 days in dry-run before live mode. Compare:
- Number of intents emitted in dry-run.
- Severity of each (read the audit log).
- Whether the count looks reasonable for known incidents.
If your healer dry-runs at 200 actions/day on a 50-host fleet, that’s almost certainly a bug — real healers fire 0-5 actions/day. Tune detection thresholds before going live.
The Three-Stage Rollout
- Dry-run, dev fleet (1 day) — catch logic bugs in your decide phase.
- Dry-run, prod fleet (7 days) — see real production patterns.
- Canary live, 1 host (3 days) — actually fire on one host, watch for unintended consequences.
- Live, full fleet — gradual rollout, 10% → 25% → 50% → 100% over a week.
This is overkill for trivial healers (e.g., “remove core files older than 7 days”) but it’s exactly right for anything that mutates running services.
Audit Log: The Question Is Always “Why?”
Every action must produce an audit record with enough context to answer “why did the healer do that?” three months from now.
heal_audit() {
local intent="$1" outcome="$2" reason="$3"
local audit_file="/var/log/heal/audit.jsonl"
mkdir -p "$(dirname "$audit_file")"
jq -nc \
--arg ts "$(date -Iseconds)" \
--arg host "$(hostname)" \
--arg name "$NAME" \
--arg intent "$intent" \
--arg outcome "$outcome" \
--arg reason "$reason" \
--arg fact "$LAST_FACT" \
'{ts:$ts, host:$host, healer:$name, intent:$intent, outcome:$outcome, reason:$reason, fact:$fact}' \
>> "$audit_file"
}
Sample output:
{"ts":"2026-06-22T14:32:01+00:00","host":"web-07","healer":"worker-restarter","intent":"restart_worker pid=12345","outcome":"success","reason":"queue_depth=12450 > 10000","fact":"queue_depth=12450,workers_alive=8"}
JSONL (one JSON object per line) is the right format because it streams to log shippers and indexes well in Loki / OpenSearch / CloudWatch. Always include:
- Timestamp (ISO 8601 with timezone).
- Hostname.
- Healer identifier.
- The intent (what was decided).
- The outcome (what happened).
- The reason (the fact that triggered the decide).
- The full fact (for after-the-fact analysis).
When a post-incident review asks “why was worker pid 12345 restarted at 14:32?” you grep the audit log and have a complete answer.
The Drop-In lib/heal.sh
# lib/heal.sh — sourced helpers for self-healing scripts.
#
# Required env:
# NAME — short healer identifier (lowercase, no spaces)
#
# Optional env:
# DRY_RUN — true|false, default false
# HEAL_STATE_DIR — default /var/lib/heal/$NAME
# HEAL_AUDIT_FILE — default /var/log/heal/audit.jsonl
set -o errexit -o nounset -o pipefail
: "${NAME:?NAME must be set}"
: "${DRY_RUN:=false}"
: "${HEAL_STATE_DIR:=/var/lib/heal/$NAME}"
: "${HEAL_AUDIT_FILE:=/var/log/heal/audit.jsonl}"
heal_log() {
printf '[%s] [%s] %s\n' "$(date -Iseconds)" "$NAME" "$*"
}
heal_init() {
mkdir -p "$HEAL_STATE_DIR/keys" "$HEAL_STATE_DIR/rate" "$HEAL_STATE_DIR/cb"
mkdir -p "$(dirname "$HEAL_AUDIT_FILE")"
}
heal_audit() {
local intent="$1" outcome="$2" reason="${3:-}"
jq -nc \
--arg ts "$(date -Iseconds)" \
--arg host "$(hostname)" \
--arg name "$NAME" \
--arg intent "$intent" \
--arg outcome "$outcome" \
--arg reason "$reason" \
'{ts:$ts, host:$host, healer:$name, intent:$intent, outcome:$outcome, reason:$reason}' \
>> "$HEAL_AUDIT_FILE"
}
# Idempotency: same intent + 5min bucket = same key
heal_idem_key() {
local intent="$1"
local bucket=$(( $(date +%s) / 300 ))
printf '%s.%d' "$intent" "$bucket" | sha256sum | cut -d' ' -f1
}
heal_idem_seen() {
local key="$1"
[[ -f "$HEAL_STATE_DIR/keys/$key" ]]
}
heal_idem_mark() {
local key="$1"
: > "$HEAL_STATE_DIR/keys/$key"
}
# Token-bucket rate limit. Args: name, rate (per sec), burst
heal_rate_check() {
local name="$1" rate="$2" burst="$3"
local state="$HEAL_STATE_DIR/rate/$name"
local now last_refill tokens elapsed
now=$(date +%s)
if [[ -f "$state" ]]; then
last_refill=$(awk '{print $1}' "$state")
tokens=$(awk '{print $2}' "$state")
else
last_refill=$now
tokens=$burst
fi
elapsed=$(( now - last_refill ))
tokens=$(awk -v t="$tokens" -v e="$elapsed" -v r="$rate" -v b="$burst" \
'BEGIN { v = t + e * r; if (v > b) v = b; print v }')
awk -v t="$tokens" 'BEGIN { exit !(t >= 1) }' || {
printf '%d %s\n' "$now" "$tokens" > "$state"
return 1
}
tokens=$(awk -v t="$tokens" 'BEGIN { print t - 1 }')
printf '%d %s\n' "$now" "$tokens" > "$state"
}
# Circuit breaker. Args: name, max_failures, cooldown_sec
heal_cb_check() {
local name="$1" max="$2" cooldown="$3"
local state="$HEAL_STATE_DIR/cb/$name"
local now=$(date +%s) failures last_fail
[[ -f "$state" ]] || return 0
failures=$(awk '{print $1}' "$state")
last_fail=$(awk '{print $2}' "$state")
if (( failures >= max )); then
if (( now - last_fail < cooldown )); then
heal_log "CIRCUIT-OPEN $name (failures=$failures)"
return 1
fi
heal_log "CIRCUIT-HALF-OPEN $name"
fi
}
heal_cb_record() {
local name="$1" outcome="$2"
local state="$HEAL_STATE_DIR/cb/$name"
if [[ "$outcome" == "success" ]]; then
rm -f "$state"
else
local f=0
[[ -f "$state" ]] && f=$(awk '{print $1}' "$state")
printf '%d %d\n' "$((f + 1))" "$(date +%s)" > "$state"
fi
}
# All-in-one wrapper. Args: intent_string, action_function, success_check_function, reason
heal_act_with_guardrails() {
local intent="$1" action_fn="$2" check_fn="$3" reason="${4:-}"
local key
heal_init
key=$(heal_idem_key "$intent")
if heal_idem_seen "$key"; then
heal_log "SKIP idempotent: $intent"
return 0
fi
if ! heal_rate_check "$intent" "${HEAL_RATE:-0.0167}" "${HEAL_BURST:-3}"; then
heal_log "SKIP rate-limited: $intent"
return 0
fi
if ! heal_cb_check "$intent" "${HEAL_CB_MAX:-3}" "${HEAL_CB_COOLDOWN:-600}"; then
return 0
fi
if $DRY_RUN; then
heal_log "DRY-RUN: $intent"
heal_audit "$intent" dry-run "$reason"
heal_idem_mark "$key"
return 0
fi
heal_log "ACTING: $intent"
if "$action_fn"; then
sleep 5 # let action settle
if "$check_fn"; then
heal_audit "$intent" success "$reason"
heal_cb_record "$intent" success
heal_idem_mark "$key"
heal_log "OK: $intent"
else
heal_audit "$intent" verify-failed "$reason"
heal_cb_record "$intent" failure
heal_log "FAIL verify: $intent"
return 1
fi
else
heal_audit "$intent" action-failed "$reason"
heal_cb_record "$intent" failure
heal_log "FAIL action: $intent"
return 1
fi
}
Worked Example: A Worker-Queue Healer
#!/usr/bin/env bash
# heal-worker-queue.sh — restart workers when queue depth exceeds threshold
set -euo pipefail
NAME=worker-queue-healer
HEAL_RATE=0.0167 # 1 per minute
HEAL_BURST=3
HEAL_CB_MAX=3
HEAL_CB_COOLDOWN=600
source /usr/local/lib/heal.sh
readonly INTERVAL=60
readonly THRESHOLD=10000
detect() {
local depth alive
depth=$(redis-cli LLEN myapp:queue 2>/dev/null || echo 0)
alive=$(pgrep -c -f myapp-worker 2>/dev/null || echo 0)
printf 'queue_depth=%s,workers_alive=%s\n' "$depth" "$alive"
}
decide() {
local fact="$1"
local depth alive
depth=$(echo "$fact" | sed 's/.*queue_depth=\([0-9]*\).*/\1/')
alive=$(echo "$fact" | sed 's/.*workers_alive=\([0-9]*\).*/\1/')
if (( depth > THRESHOLD )) && (( alive > 0 )); then
local pid
pid=$(pgrep -of myapp-worker)
[[ -n "$pid" ]] && printf 'restart_worker pid=%s' "$pid"
fi
}
action_restart_worker() {
systemctl restart myapp-worker.service
}
check_worker_alive() {
systemctl is-active --quiet myapp-worker.service
}
heal_init
while true; do
fact=$(detect)
intent=$(decide "$fact")
if [[ -n "$intent" ]]; then
heal_act_with_guardrails "$intent" action_restart_worker check_worker_alive "$fact"
fi
sleep "$INTERVAL"
done
This healer will:
- Fire at most once per minute (rate limit).
- Skip duplicate intents within a 5-minute bucket (idempotency).
- Open the circuit after 3 consecutive failed restarts and stay open for 10 min.
- Verify the worker is alive after restart before declaring success.
- Audit-log every decision and outcome.
A Tale of Two Healers
The Good Healer (saved a site)
Site has Redis-backed queue. Workers occasionally OOM-kill, leaving queue stuck. Symptoms: queue_depth spikes, no workers consume.
The healer above runs once a minute. When queue exceeds 10k AND alive_workers > 0 (workers exist but stuck), it restarts the worker service. Rate limit ensures at most 1 restart per minute; circuit breaker opens after 3 fails (which would indicate a deploy-broken binary, not a stuck worker — needs human intervention). In 18 months, it has fired ~40 times, every fire was correct, MTTR for the stuck-worker class went from 25 minutes to under 2 minutes.
The Bad Healer (nuked a site)
Same site, earlier version. The healer was: “if queue_depth > 10000, restart worker.” No rate limit, no idempotency, no circuit breaker.
Real incident timeline:
- 14:30:00 — workers genuinely OOM. Queue starts climbing.
- 14:30:30 — queue_depth = 10500. Healer restarts workers.
- 14:30:45 — workers come up, start draining, but a coordinator dependency (database connection pool) is now exhausted because new workers each open a fresh pool.
- 14:31:00 — queue is still > 10000 (drain rate slow due to pool exhaustion). Healer restarts again. Database pool further hammered.
- 14:31:30 — pool exhaustion causes worker startup to time out. Workers crash on boot. Healer restarts again.
- 14:32:00 — every worker restart fails to come up. queue still > 10000. Healer fires every cycle.
- 14:35:00 — coordinator service falls over from connection storm. Site is down.
- 14:50:00 — engineer notices, kills the healer, resets the database, brings workers up cleanly. Site recovers at 15:05.
What the good healer would have done differently:
- Rate limit kicks in at 14:31. No further action.
- Circuit breaker opens at 14:31:30 after 3 fails. Healer stops firing.
- Audit log shows “verify-failed” outcomes — engineer is paged because monitoring alerts on circuit-open.
- Site degraded but not down. MTTR ~5 min instead of 35.
The same code, with guardrails, has a different outcome. This is the entire reason lib/heal.sh exists.
Healer Liveness: The Healer Itself Must Be Monitored
A healer that crashes and stops running is silently worse than no healer because operators may have removed manual procedures. Every healer must:
- Emit a heartbeat to the textfile collector (covered in L34).
- Have a Prometheus alert if the heartbeat stops.
- Run under systemd with
Restart=on-failure.
[Unit]
Description=Worker queue healer
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/heal-worker-queue.sh
Restart=on-failure
RestartSec=10
WatchdogSec=120
NotifyAccess=main
[Install]
WantedBy=multi-user.target
WatchdogSec=120 requires the script to call systemd-notify WATCHDOG=1 at least every 2 minutes; if it stops, systemd kills and restarts. Combined with Restart=on-failure, the healer self-recovers from script bugs.
The 8 Footguns
1. No Idempotency Means Multiple Fires Per Incident
Two cycles 60 seconds apart see the same incident, fire twice. Workers get restart-restarted in 60 seconds, often confusing the supervisor. Fix: idempotency keys with coarse buckets.
2. No Rate Limit Means Cascading Damage
Already covered in the bad-healer example. Fix: every healer declares a rate limit; every action goes through heal_rate_check.
3. No Circuit Breaker Means Infinite Loop Of Failure
Action fails, problem persists, healer fires again. Logs fill, the audit log fills, eventually disk fills. Fix: circuit breaker with explicit max-failures + cooldown.
4. Action Verification Skipped
You restart the service but never check it came back up. The healer thinks it succeeded but the service is dead. Fix: every action has an explicit verification function called after a settling sleep.
5. Detect Phase Includes Decide Logic
Detect is redis-cli LLEN; if redis-cli itself is failing because Redis is down, Detect returns “0” (a falsy value), and Decide concludes “no problem.” But there is a problem — Redis is down! Fix: Detect must return errors as facts, not silently return zero. local depth=$(redis-cli LLEN myapp:queue) || depth="ERROR".
6. Forgetting To heal_init
The heal_init function creates the state directories. Forgetting it means file writes fail silently (errexit catches them, but error is “no such directory” which is opaque). Fix: heal_init at the top of every healer’s main, before the loop.
7. Running Two Instances Of The Same Healer
Without a PID file or systemd lock, an admin running a manual instance of heal-worker-queue.sh while the systemd one is also running gives you 2× the actions. Fix: flock-based singleton:
exec 200>/var/run/heal-worker-queue.lock
flock -n 200 || { heal_log "Already running"; exit 1; }
8. Using bash Math On Floats For Token Bucket
bash arithmetic (( ... )) is integer-only. Trying (( tokens >= 0.5 )) is a syntax error or comparison fail. Fix: use awk for float math (as in lib/heal.sh above) or use millisecond-resolution integer tokens.
Quick-Reference Card
LOOP STRUCTURE
Detect → fact (a measurement, not a decision)
Decide → intent (a directive, not an action)
Act → operation (with guardrails)
Each phase is a separate function.
GUARDRAILS (in order applied)
1. Idempotency key → skip if same intent in same bucket
2. Rate limit → token bucket, e.g. 1/min with burst 3
3. Circuit breaker → open after 3 fails, 10min cooldown
4. Dry-run → log instead of act if DRY_RUN=true
5. Action with verification → action then check then record outcome
6. Audit log → JSONL with ts, host, healer, intent, outcome, reason
ROLLOUT
1. Dev fleet dry-run, 1 day
2. Prod fleet dry-run, 7 days
3. Single-host live (canary), 3 days
4. Gradual fleet rollout 10% → 100%
LIVENESS
Heartbeat to textfile collector
Prometheus alert on heartbeat stale
systemd Restart=on-failure + WatchdogSec
flock singleton to prevent double-runs
NUMBERS THAT WORK
Bucket size 5 min (idempotency)
Rate 1/min, burst 3 (most healers)
Circuit max 3 fails, cooldown 10 min
Verification settle 5-10 sec before checking
What’s Next
You can now build healers that detect symptoms, decide on remediation, act with bounded blast-radius, and audit every action. The next operational frontier is migration scripts: scripts that transform data, move it between systems, and must be safely re-runnable when something fails midway through. Migration is the ultimate test of idempotency — a half-completed migration must complete cleanly when retried, never duplicate, never lose rows.
In the next lesson — Migration Scripts: Data Transformations, ETL From Shell & Idempotent Re-Runs — we’ll build lib/migrate.sh covering checkpoint files for resumable migrations, the watermark pattern for incremental ETL, dry-run with row-count diff, transactional staging tables, and the disciplined back-out plan every migration needs before it goes live.