Shell Self-Healing Scripts: Detect-Decide-Act Loops, Blast-Radius Limits, Circuit Breakers & The Discipline That Stops Auto-Remediation From Becoming The Outage

The Cardinal Rule of Auto-Remediation

A self-healing script that runs without guardrails is just a faster way to take production down. Every public post-mortem you have read where “the auto-remediation made it worse” follows the same script: the healer detected a symptom, took an action, the action made the symptom look like it cleared, the underlying cause was still there, the symptom returned, the healer fired again — and the resource it was “healing” entered a restart loop, depended-on services started failing, the healer started firing on those, and within 4 minutes a single bad pod became a regional outage.

The discipline that prevents this:

Layer	What it stops
Explicit Detect-Decide-Act loop	Conflating “broken” with “looks broken”
Idempotency keys	The same incident triggering the same fix multiple times in parallel
Blast-radius limits	A bug in the healer cascading across the fleet
Circuit breakers	The healer continuing to fire when its own actions are failing
Dry-run mode	Shipping un-tested healer logic to production
Audit log	“Why did the healer do that?” being unanswerable

This lesson teaches each layer with shell scripts, a lib/heal.sh you can source, and two worked examples: a good healer that has saved sites, and a structurally identical bad one that nuked them.

The Detect-Decide-Act Loop

Every healer has three explicit phases:

┌──────────┐    ┌──────────┐    ┌──────────┐
│  DETECT  │───▶│  DECIDE  │───▶│   ACT    │
└──────────┘    └──────────┘    └──────────┘
     ▲                                │
     └────────────────────────────────┘
              loop interval

Detect: gather signal (metric scrape, health check, log query). Output: a fact like “queue_depth=12000”.
Decide: apply policy. Output: an intent like “restart worker pid 12345” — or “no action”.
Act: perform the intended action with idempotency, rate limit, and audit log.

The phases are separate functions, callable independently. This matters for testing: you can unit-test Decide with synthetic facts without ever Acting, and you can dry-run by replacing Act with a logger.

Why the Loop Must Have Three Phases (Not Two)

The naive form is “if condition then action” which collapses Detect and Decide into a single test. This works for trivial healers but fails as soon as policy gets non-trivial:

“Restart worker if queue depth > 10000 and worker hasn’t been restarted in last 5 minutes and less than 3 workers are restarting fleet-wide right now.”

The three conditions are Decide logic. Forcing them into the Detect phase means you can’t reuse the detector for monitoring alerts; forcing them into Act means dry-run can’t show what would happen. Keep them separate.

Skeleton

#!/usr/bin/env bash
# heal-worker.sh — example healer skeleton
set -euo pipefail

source /usr/local/lib/heal.sh

readonly NAME=worker-restarter
readonly INTERVAL=60   # seconds

while true; do
  fact=$(detect)
  intent=$(decide "$fact")
  if [[ -n "$intent" ]]; then
    act "$intent"
  fi
  sleep "$INTERVAL"
done

detect() {
  # Returns a fact line. Empty = no signal.
  local depth
  depth=$(redis-cli LLEN myapp:queue 2>/dev/null || echo 0)
  printf 'queue_depth=%s\n' "$depth"
}

decide() {
  local fact="$1"
  local depth
  depth=$(echo "$fact" | awk -F= '/queue_depth=/ {print $2}')
  if (( depth > 10000 )); then
    # Find the oldest stuck worker
    local pid
    pid=$(pgrep -of myapp-worker)
    [[ -n "$pid" ]] && printf 'restart_worker pid=%s\n' "$pid"
  fi
}

act() {
  heal_act_with_guardrails "$1"
}

The heal_act_with_guardrails wraps idempotency + rate limit + circuit breaker + audit log. It’s what lib/heal.sh provides.

Idempotency Keys: One Incident, One Action

Without idempotency keys, two consecutive Detect cycles that observe the same problem trigger the same Act twice. For “restart worker”, that’s two restarts in 60 seconds — which can confuse a process supervisor and leave the worker in an unknown state.

The fix is an idempotency key derived from the incident not the cycle:

# Compute idempotency key from intent + a coarse time bucket
# Two identical intents in the same 5-minute bucket are deduped.
heal_idempotency_key() {
  local intent="$1"
  local bucket
  bucket=$(( $(date +%s) / 300 ))
  printf '%s\n' "$intent.$bucket" | sha256sum | cut -d' ' -f1
}

The key is then used to gate execution:

heal_act_with_guardrails() {
  local intent="$1"
  local key
  key=$(heal_idempotency_key "$intent")
  local marker="/var/lib/heal/$NAME/keys/$key"

  if [[ -f "$marker" ]]; then
    heal_log "SKIP duplicate intent: $intent (key=$key)"
    return 0
  fi

  mkdir -p "$(dirname "$marker")"
  : > "$marker"
  # ...continue to act
}

Two adjacent cycles that observe the same problem hit the same key and the second is silently skipped. After the bucket rolls (5 minutes later), the key changes and the healer can fire again — which is the desired property: “fix once per 5 min, not on every cycle.”

Idempotency Across Restarts

The marker file persists across script restarts. This is on purpose: if your healer is restarted by systemd at 14:32 and saw an intent at 14:31, it shouldn’t re-fire that same intent in the same bucket. Persisted markers give you crash-safety.

GC the markers nightly:

# tmpfiles.d/heal.conf
d /var/lib/heal/*/keys 0750 heal heal -
e /var/lib/heal/*/keys - - - 7d

tmpfiles.d e directive deletes files older than 7 days. Sufficient for any sensible bucket size.

Blast-Radius Limits: The Rate Limit That Saves The Fleet

This is the most important guardrail in the entire lesson. Every healer must declare a blast-radius limit and refuse to exceed it.

Examples:

“Restart at most 1 worker per minute, per host.”
“Disable at most 3 services per 10 minutes, fleet-wide.”
“Roll back at most 10% of pods per 5 minutes.”

The limit is enforced via a token bucket:

# heal_rate_limit_check NAME RATE BURST
# Returns 0 (allow) if a token is available, 1 (deny) otherwise.
heal_rate_limit_check() {
  local name="$1" rate="$2" burst="$3"
  local state="/var/lib/heal/$NAME/rate-$name"
  local now last_refill tokens elapsed

  now=$(date +%s)
  if [[ -f "$state" ]]; then
    last_refill=$(awk '{print $1}' "$state")
    tokens=$(awk '{print $2}' "$state")
  else
    last_refill=$now
    tokens=$burst
  fi

  elapsed=$(( now - last_refill ))
  tokens=$(awk -v t="$tokens" -v e="$elapsed" -v r="$rate" -v b="$burst" \
    'BEGIN { v = t + e * r; if (v > b) v = b; print v }')

  awk -v t="$tokens" 'BEGIN { exit !(t >= 1) }' || {
    printf '%d %s\n' "$now" "$tokens" > "$state"
    return 1
  }

  tokens=$(awk -v t="$tokens" 'BEGIN { print t - 1 }')
  printf '%d %s\n' "$now" "$tokens" > "$state"
  return 0
}

Usage:

# Allow 1 restart per minute, with burst of 3 (catches up during quiet periods)
if heal_rate_limit_check "worker-restart" "0.0167" "3"; then
  # Allowed — proceed with restart
  systemctl restart myapp-worker.service
else
  heal_log "Rate-limited: skipping worker restart"
fi

The token bucket auto-recovers: if the healer doesn’t fire for 10 minutes, the bucket refills to burst, ready for a small flurry. But sustained firing can’t exceed rate per second.

Fleet-Wide Rate Limits via a Shared Lock

Per-host limits aren’t enough when the healer runs on every host. If 100 hosts each “rate-limit to 1/min” and the trigger condition is fleet-wide, you get 100 actions per minute fleet-wide. Solution: a central coordinator (Redis, Consul KV, etcd) that issues fleet-wide tokens:

# Use Redis SETNX as a distributed lock. The lock has a TTL so it auto-expires.
heal_fleet_lock_acquire() {
  local key="$1" ttl="$2"
  local token
  token=$(uuidgen)
  if redis-cli SET "heal:lock:$key" "$token" EX "$ttl" NX | grep -q OK; then
    printf '%s\n' "$token"
    return 0
  fi
  return 1
}

heal_fleet_lock_release() {
  local key="$1" token="$2"
  # Lua: only release if we own the lock (avoids releasing someone else's lock if we're slow)
  redis-cli EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end" 1 "heal:lock:$key" "$token"
}

# Usage: only one host fleet-wide can act on this incident
if token=$(heal_fleet_lock_acquire "kafka-broker-restart" 300); then
  systemctl restart kafka.service
  heal_fleet_lock_release "kafka-broker-restart" "$token"
fi

The Lua script in release ensures we don’t accidentally delete someone else’s lock if we held ours past TTL. This is the classic correct Redis distributed lock pattern (Redlock without quorum, sufficient for healer coordination).

Circuit Breakers: Stop When Your Actions Aren’t Working

A circuit breaker disables the healer after N consecutive failed actions. This prevents the worst failure mode: the healer firing repeatedly because its actions don’t actually fix the problem (the worker keeps crashing on restart, but the healer keeps trying to restart it).

heal_circuit_breaker_check() {
  local name="$1" max_failures="$2" cooldown="$3"
  local state="/var/lib/heal/$NAME/cb-$name"
  local now failures last_failure

  now=$(date +%s)
  if [[ ! -f "$state" ]]; then
    return 0   # closed, allow
  fi

  failures=$(awk '{print $1}' "$state")
  last_failure=$(awk '{print $2}' "$state")

  if (( failures >= max_failures )); then
    if (( now - last_failure < cooldown )); then
      heal_log "Circuit OPEN for $name (failures=$failures, cooldown=$((cooldown - (now - last_failure)))s remaining)"
      return 1
    else
      # Cooldown expired — half-open: allow one trial
      heal_log "Circuit HALF-OPEN for $name"
      return 0
    fi
  fi
  return 0
}

heal_circuit_breaker_record() {
  local name="$1" outcome="$2"   # success | failure
  local state="/var/lib/heal/$NAME/cb-$name"
  local now=$(date +%s)
  if [[ "$outcome" == "success" ]]; then
    rm -f "$state"   # reset on success (half-open → closed)
  else
    local failures=0
    [[ -f "$state" ]] && failures=$(awk '{print $1}' "$state")
    failures=$((failures + 1))
    printf '%d %d\n' "$failures" "$now" > "$state"
  fi
}

Usage with verification:

heal_act_with_circuit_breaker() {
  local intent="$1"
  if ! heal_circuit_breaker_check "worker-restart" 3 600; then
    return 1   # circuit open, skip
  fi

  systemctl restart myapp-worker
  sleep 10   # give it time to start
  if systemctl is-active --quiet myapp-worker; then
    heal_circuit_breaker_record "worker-restart" success
  else
    heal_circuit_breaker_record "worker-restart" failure
    return 1
  fi
}

After 3 consecutive failures, the circuit opens and stays open for 10 minutes. After cooldown, one trial action is allowed (half-open); success closes the circuit, failure re-opens it for another 10 minutes.

The numbers come from production experience: 3 failures means “this isn’t a transient — we’re in a real failure mode.” 10 minutes cooldown is enough for a human to investigate (and for monitoring alerts to wake someone).

Dry-Run Mode: The Discipline That Catches The Bad Healer Before It Lives

Every healer must have a --dry-run flag that runs Detect+Decide but replaces Act with a logger:

DRY_RUN=${DRY_RUN:-false}

heal_act() {
  local intent="$1"
  if $DRY_RUN; then
    heal_log "DRY-RUN: would have acted on: $intent"
    return 0
  fi
  # ...real action
}

Every new healer should run for 7 days in dry-run before live mode. Compare:

Number of intents emitted in dry-run.
Severity of each (read the audit log).
Whether the count looks reasonable for known incidents.

If your healer dry-runs at 200 actions/day on a 50-host fleet, that’s almost certainly a bug — real healers fire 0-5 actions/day. Tune detection thresholds before going live.

The Three-Stage Rollout

Dry-run, dev fleet (1 day) — catch logic bugs in your decide phase.
Dry-run, prod fleet (7 days) — see real production patterns.
Canary live, 1 host (3 days) — actually fire on one host, watch for unintended consequences.
Live, full fleet — gradual rollout, 10% → 25% → 50% → 100% over a week.

This is overkill for trivial healers (e.g., “remove core files older than 7 days”) but it’s exactly right for anything that mutates running services.

Audit Log: The Question Is Always “Why?”

Every action must produce an audit record with enough context to answer “why did the healer do that?” three months from now.

heal_audit() {
  local intent="$1" outcome="$2" reason="$3"
  local audit_file="/var/log/heal/audit.jsonl"
  mkdir -p "$(dirname "$audit_file")"
  jq -nc \
    --arg ts "$(date -Iseconds)" \
    --arg host "$(hostname)" \
    --arg name "$NAME" \
    --arg intent "$intent" \
    --arg outcome "$outcome" \
    --arg reason "$reason" \
    --arg fact "$LAST_FACT" \
    '{ts:$ts, host:$host, healer:$name, intent:$intent, outcome:$outcome, reason:$reason, fact:$fact}' \
    >> "$audit_file"
}

Sample output:

{"ts":"2026-06-22T14:32:01+00:00","host":"web-07","healer":"worker-restarter","intent":"restart_worker pid=12345","outcome":"success","reason":"queue_depth=12450 > 10000","fact":"queue_depth=12450,workers_alive=8"}

JSONL (one JSON object per line) is the right format because it streams to log shippers and indexes well in Loki / OpenSearch / CloudWatch. Always include:

Timestamp (ISO 8601 with timezone).
Hostname.
Healer identifier.
The intent (what was decided).
The outcome (what happened).
The reason (the fact that triggered the decide).
The full fact (for after-the-fact analysis).

When a post-incident review asks “why was worker pid 12345 restarted at 14:32?” you grep the audit log and have a complete answer.

The Drop-In `lib/heal.sh`

# lib/heal.sh — sourced helpers for self-healing scripts.
#
# Required env:
#   NAME          — short healer identifier (lowercase, no spaces)
#
# Optional env:
#   DRY_RUN       — true|false, default false
#   HEAL_STATE_DIR — default /var/lib/heal/$NAME
#   HEAL_AUDIT_FILE — default /var/log/heal/audit.jsonl

set -o errexit -o nounset -o pipefail

: "${NAME:?NAME must be set}"
: "${DRY_RUN:=false}"
: "${HEAL_STATE_DIR:=/var/lib/heal/$NAME}"
: "${HEAL_AUDIT_FILE:=/var/log/heal/audit.jsonl}"

heal_log() {
  printf '[%s] [%s] %s\n' "$(date -Iseconds)" "$NAME" "$*"
}

heal_init() {
  mkdir -p "$HEAL_STATE_DIR/keys" "$HEAL_STATE_DIR/rate" "$HEAL_STATE_DIR/cb"
  mkdir -p "$(dirname "$HEAL_AUDIT_FILE")"
}

heal_audit() {
  local intent="$1" outcome="$2" reason="${3:-}"
  jq -nc \
    --arg ts "$(date -Iseconds)" \
    --arg host "$(hostname)" \
    --arg name "$NAME" \
    --arg intent "$intent" \
    --arg outcome "$outcome" \
    --arg reason "$reason" \
    '{ts:$ts, host:$host, healer:$name, intent:$intent, outcome:$outcome, reason:$reason}' \
    >> "$HEAL_AUDIT_FILE"
}

# Idempotency: same intent + 5min bucket = same key
heal_idem_key() {
  local intent="$1"
  local bucket=$(( $(date +%s) / 300 ))
  printf '%s.%d' "$intent" "$bucket" | sha256sum | cut -d' ' -f1
}

heal_idem_seen() {
  local key="$1"
  [[ -f "$HEAL_STATE_DIR/keys/$key" ]]
}

heal_idem_mark() {
  local key="$1"
  : > "$HEAL_STATE_DIR/keys/$key"
}

# Token-bucket rate limit. Args: name, rate (per sec), burst
heal_rate_check() {
  local name="$1" rate="$2" burst="$3"
  local state="$HEAL_STATE_DIR/rate/$name"
  local now last_refill tokens elapsed

  now=$(date +%s)
  if [[ -f "$state" ]]; then
    last_refill=$(awk '{print $1}' "$state")
    tokens=$(awk '{print $2}' "$state")
  else
    last_refill=$now
    tokens=$burst
  fi

  elapsed=$(( now - last_refill ))
  tokens=$(awk -v t="$tokens" -v e="$elapsed" -v r="$rate" -v b="$burst" \
    'BEGIN { v = t + e * r; if (v > b) v = b; print v }')

  awk -v t="$tokens" 'BEGIN { exit !(t >= 1) }' || {
    printf '%d %s\n' "$now" "$tokens" > "$state"
    return 1
  }
  tokens=$(awk -v t="$tokens" 'BEGIN { print t - 1 }')
  printf '%d %s\n' "$now" "$tokens" > "$state"
}

# Circuit breaker. Args: name, max_failures, cooldown_sec
heal_cb_check() {
  local name="$1" max="$2" cooldown="$3"
  local state="$HEAL_STATE_DIR/cb/$name"
  local now=$(date +%s) failures last_fail
  [[ -f "$state" ]] || return 0
  failures=$(awk '{print $1}' "$state")
  last_fail=$(awk '{print $2}' "$state")
  if (( failures >= max )); then
    if (( now - last_fail < cooldown )); then
      heal_log "CIRCUIT-OPEN $name (failures=$failures)"
      return 1
    fi
    heal_log "CIRCUIT-HALF-OPEN $name"
  fi
}

heal_cb_record() {
  local name="$1" outcome="$2"
  local state="$HEAL_STATE_DIR/cb/$name"
  if [[ "$outcome" == "success" ]]; then
    rm -f "$state"
  else
    local f=0
    [[ -f "$state" ]] && f=$(awk '{print $1}' "$state")
    printf '%d %d\n' "$((f + 1))" "$(date +%s)" > "$state"
  fi
}

# All-in-one wrapper. Args: intent_string, action_function, success_check_function, reason
heal_act_with_guardrails() {
  local intent="$1" action_fn="$2" check_fn="$3" reason="${4:-}"
  local key

  heal_init
  key=$(heal_idem_key "$intent")
  if heal_idem_seen "$key"; then
    heal_log "SKIP idempotent: $intent"
    return 0
  fi

  if ! heal_rate_check "$intent" "${HEAL_RATE:-0.0167}" "${HEAL_BURST:-3}"; then
    heal_log "SKIP rate-limited: $intent"
    return 0
  fi

  if ! heal_cb_check "$intent" "${HEAL_CB_MAX:-3}" "${HEAL_CB_COOLDOWN:-600}"; then
    return 0
  fi

  if $DRY_RUN; then
    heal_log "DRY-RUN: $intent"
    heal_audit "$intent" dry-run "$reason"
    heal_idem_mark "$key"
    return 0
  fi

  heal_log "ACTING: $intent"
  if "$action_fn"; then
    sleep 5   # let action settle
    if "$check_fn"; then
      heal_audit "$intent" success "$reason"
      heal_cb_record "$intent" success
      heal_idem_mark "$key"
      heal_log "OK: $intent"
    else
      heal_audit "$intent" verify-failed "$reason"
      heal_cb_record "$intent" failure
      heal_log "FAIL verify: $intent"
      return 1
    fi
  else
    heal_audit "$intent" action-failed "$reason"
    heal_cb_record "$intent" failure
    heal_log "FAIL action: $intent"
    return 1
  fi
}

Worked Example: A Worker-Queue Healer

#!/usr/bin/env bash
# heal-worker-queue.sh — restart workers when queue depth exceeds threshold
set -euo pipefail

NAME=worker-queue-healer
HEAL_RATE=0.0167   # 1 per minute
HEAL_BURST=3
HEAL_CB_MAX=3
HEAL_CB_COOLDOWN=600

source /usr/local/lib/heal.sh

readonly INTERVAL=60
readonly THRESHOLD=10000

detect() {
  local depth alive
  depth=$(redis-cli LLEN myapp:queue 2>/dev/null || echo 0)
  alive=$(pgrep -c -f myapp-worker 2>/dev/null || echo 0)
  printf 'queue_depth=%s,workers_alive=%s\n' "$depth" "$alive"
}

decide() {
  local fact="$1"
  local depth alive
  depth=$(echo "$fact" | sed 's/.*queue_depth=\([0-9]*\).*/\1/')
  alive=$(echo "$fact" | sed 's/.*workers_alive=\([0-9]*\).*/\1/')

  if (( depth > THRESHOLD )) && (( alive > 0 )); then
    local pid
    pid=$(pgrep -of myapp-worker)
    [[ -n "$pid" ]] && printf 'restart_worker pid=%s' "$pid"
  fi
}

action_restart_worker() {
  systemctl restart myapp-worker.service
}

check_worker_alive() {
  systemctl is-active --quiet myapp-worker.service
}

heal_init

while true; do
  fact=$(detect)
  intent=$(decide "$fact")
  if [[ -n "$intent" ]]; then
    heal_act_with_guardrails "$intent" action_restart_worker check_worker_alive "$fact"
  fi
  sleep "$INTERVAL"
done

This healer will:

Fire at most once per minute (rate limit).
Skip duplicate intents within a 5-minute bucket (idempotency).
Open the circuit after 3 consecutive failed restarts and stay open for 10 min.
Verify the worker is alive after restart before declaring success.
Audit-log every decision and outcome.

A Tale of Two Healers

The Good Healer (saved a site)

Site has Redis-backed queue. Workers occasionally OOM-kill, leaving queue stuck. Symptoms: queue_depth spikes, no workers consume.

The healer above runs once a minute. When queue exceeds 10k AND alive_workers > 0 (workers exist but stuck), it restarts the worker service. Rate limit ensures at most 1 restart per minute; circuit breaker opens after 3 fails (which would indicate a deploy-broken binary, not a stuck worker — needs human intervention). In 18 months, it has fired ~40 times, every fire was correct, MTTR for the stuck-worker class went from 25 minutes to under 2 minutes.

The Bad Healer (nuked a site)

Same site, earlier version. The healer was: “if queue_depth > 10000, restart worker.” No rate limit, no idempotency, no circuit breaker.

Real incident timeline:

14:30:00 — workers genuinely OOM. Queue starts climbing.
14:30:30 — queue_depth = 10500. Healer restarts workers.
14:30:45 — workers come up, start draining, but a coordinator dependency (database connection pool) is now exhausted because new workers each open a fresh pool.
14:31:00 — queue is still > 10000 (drain rate slow due to pool exhaustion). Healer restarts again. Database pool further hammered.
14:31:30 — pool exhaustion causes worker startup to time out. Workers crash on boot. Healer restarts again.
14:32:00 — every worker restart fails to come up. queue still > 10000. Healer fires every cycle.
14:35:00 — coordinator service falls over from connection storm. Site is down.
14:50:00 — engineer notices, kills the healer, resets the database, brings workers up cleanly. Site recovers at 15:05.

What the good healer would have done differently:

Rate limit kicks in at 14:31. No further action.
Circuit breaker opens at 14:31:30 after 3 fails. Healer stops firing.
Audit log shows “verify-failed” outcomes — engineer is paged because monitoring alerts on circuit-open.
Site degraded but not down. MTTR ~5 min instead of 35.

The same code, with guardrails, has a different outcome. This is the entire reason lib/heal.sh exists.

Healer Liveness: The Healer Itself Must Be Monitored

A healer that crashes and stops running is silently worse than no healer because operators may have removed manual procedures. Every healer must:

Emit a heartbeat to the textfile collector (covered in L34).
Have a Prometheus alert if the heartbeat stops.
Run under systemd with Restart=on-failure.

[Unit]
Description=Worker queue healer
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/heal-worker-queue.sh
Restart=on-failure
RestartSec=10
WatchdogSec=120
NotifyAccess=main

[Install]
WantedBy=multi-user.target

WatchdogSec=120 requires the script to call systemd-notify WATCHDOG=1 at least every 2 minutes; if it stops, systemd kills and restarts. Combined with Restart=on-failure, the healer self-recovers from script bugs.

The 8 Footguns

1. No Idempotency Means Multiple Fires Per Incident

Two cycles 60 seconds apart see the same incident, fire twice. Workers get restart-restarted in 60 seconds, often confusing the supervisor. Fix: idempotency keys with coarse buckets.

2. No Rate Limit Means Cascading Damage

Already covered in the bad-healer example. Fix: every healer declares a rate limit; every action goes through heal_rate_check.

3. No Circuit Breaker Means Infinite Loop Of Failure

Action fails, problem persists, healer fires again. Logs fill, the audit log fills, eventually disk fills. Fix: circuit breaker with explicit max-failures + cooldown.

4. Action Verification Skipped

You restart the service but never check it came back up. The healer thinks it succeeded but the service is dead. Fix: every action has an explicit verification function called after a settling sleep.

5. Detect Phase Includes Decide Logic

Detect is redis-cli LLEN; if redis-cli itself is failing because Redis is down, Detect returns “0” (a falsy value), and Decide concludes “no problem.” But there is a problem — Redis is down! Fix: Detect must return errors as facts, not silently return zero. local depth=$(redis-cli LLEN myapp:queue) || depth="ERROR".

6. Forgetting To `heal_init`

The heal_init function creates the state directories. Forgetting it means file writes fail silently (errexit catches them, but error is “no such directory” which is opaque). Fix: heal_init at the top of every healer’s main, before the loop.

7. Running Two Instances Of The Same Healer

Without a PID file or systemd lock, an admin running a manual instance of heal-worker-queue.sh while the systemd one is also running gives you 2× the actions. Fix: flock-based singleton:

exec 200>/var/run/heal-worker-queue.lock
flock -n 200 || { heal_log "Already running"; exit 1; }

8. Using `bash` Math On Floats For Token Bucket

bash arithmetic (( ... )) is integer-only. Trying (( tokens >= 0.5 )) is a syntax error or comparison fail. Fix: use awk for float math (as in lib/heal.sh above) or use millisecond-resolution integer tokens.

Quick-Reference Card

LOOP STRUCTURE
  Detect → fact (a measurement, not a decision)
  Decide → intent (a directive, not an action)
  Act    → operation (with guardrails)
  Each phase is a separate function.

GUARDRAILS (in order applied)
  1. Idempotency key → skip if same intent in same bucket
  2. Rate limit → token bucket, e.g. 1/min with burst 3
  3. Circuit breaker → open after 3 fails, 10min cooldown
  4. Dry-run → log instead of act if DRY_RUN=true
  5. Action with verification → action then check then record outcome
  6. Audit log → JSONL with ts, host, healer, intent, outcome, reason

ROLLOUT
  1. Dev fleet dry-run, 1 day
  2. Prod fleet dry-run, 7 days
  3. Single-host live (canary), 3 days
  4. Gradual fleet rollout 10% → 100%

LIVENESS
  Heartbeat to textfile collector
  Prometheus alert on heartbeat stale
  systemd Restart=on-failure + WatchdogSec
  flock singleton to prevent double-runs

NUMBERS THAT WORK
  Bucket size 5 min (idempotency)
  Rate 1/min, burst 3 (most healers)
  Circuit max 3 fails, cooldown 10 min
  Verification settle 5-10 sec before checking

What’s Next

You can now build healers that detect symptoms, decide on remediation, act with bounded blast-radius, and audit every action. The next operational frontier is migration scripts: scripts that transform data, move it between systems, and must be safely re-runnable when something fails midway through. Migration is the ultimate test of idempotency — a half-completed migration must complete cleanly when retried, never duplicate, never lose rows.

In the next lesson — Migration Scripts: Data Transformations, ETL From Shell & Idempotent Re-Runs — we’ll build lib/migrate.sh covering checkpoint files for resumable migrations, the watermark pattern for incremental ETL, dry-run with row-count diff, transactional staging tables, and the disciplined back-out plan every migration needs before it goes live.

Shell Self-Healing Scripts: Detect-Decide-Act Loops, Blast-Radius Limits, Circuit Breakers & The Discipline That Stops Auto-Remediation From Becoming The Outage

The Cardinal Rule of Auto-Remediation

The Detect-Decide-Act Loop

Why the Loop Must Have Three Phases (Not Two)

Skeleton

Idempotency Keys: One Incident, One Action

Idempotency Across Restarts

Blast-Radius Limits: The Rate Limit That Saves The Fleet

Fleet-Wide Rate Limits via a Shared Lock

Circuit Breakers: Stop When Your Actions Aren’t Working

Dry-Run Mode: The Discipline That Catches The Bad Healer Before It Lives

The Three-Stage Rollout

Audit Log: The Question Is Always “Why?”

The Drop-In `lib/heal.sh`

Worked Example: A Worker-Queue Healer

A Tale of Two Healers

The Good Healer (saved a site)

The Bad Healer (nuked a site)

Healer Liveness: The Healer Itself Must Be Monitored

The 8 Footguns

1. No Idempotency Means Multiple Fires Per Incident

2. No Rate Limit Means Cascading Damage

3. No Circuit Breaker Means Infinite Loop Of Failure

4. Action Verification Skipped

5. Detect Phase Includes Decide Logic

6. Forgetting To `heal_init`

7. Running Two Instances Of The Same Healer

8. Using `bash` Math On Floats For Token Bucket

Quick-Reference Card

What’s Next

Written by Vinod

Comments

Shell Self-Healing Scripts: Detect-Decide-Act Loops, Blast-Radius Limits, Circuit Breakers & The Discipline That Stops Auto-Remediation From Becoming The Outage

The Cardinal Rule of Auto-Remediation

The Detect-Decide-Act Loop

Why the Loop Must Have Three Phases (Not Two)

Skeleton

Idempotency Keys: One Incident, One Action

Idempotency Across Restarts

Blast-Radius Limits: The Rate Limit That Saves The Fleet

Fleet-Wide Rate Limits via a Shared Lock

Circuit Breakers: Stop When Your Actions Aren’t Working

Dry-Run Mode: The Discipline That Catches The Bad Healer Before It Lives

The Three-Stage Rollout

Audit Log: The Question Is Always “Why?”

The Drop-In lib/heal.sh

Worked Example: A Worker-Queue Healer

A Tale of Two Healers

The Good Healer (saved a site)

The Bad Healer (nuked a site)

Healer Liveness: The Healer Itself Must Be Monitored

The 8 Footguns

1. No Idempotency Means Multiple Fires Per Incident

2. No Rate Limit Means Cascading Damage

3. No Circuit Breaker Means Infinite Loop Of Failure

4. Action Verification Skipped

5. Detect Phase Includes Decide Logic

6. Forgetting To heal_init

7. Running Two Instances Of The Same Healer

8. Using bash Math On Floats For Token Bucket

Quick-Reference Card

What’s Next

Written by Vinod

Comments

The Drop-In `lib/heal.sh`

6. Forgetting To `heal_init`

8. Using `bash` Math On Floats For Token Bucket