Building a Chaos Engineering Program: Hypotheses, Fault Injection, and Game Days

Most teams discover their failure modes the same way: at 3 a.m., from a customer, during the worst possible traffic. Chaos engineering inverts that. Instead of waiting for a dependency to fail and hoping the system degrades gracefully, you inject the failure deliberately, during business hours, with a hypothesis written down and an abort button within reach. The point is not to break things – breaking things is easy. The point is to falsify a specific belief about how your system behaves under stress, and to do it before the failure picks the time and place itself.

This article walks through standing up a chaos program from first experiment to continuous, automated, organization-wide practice. It is opinionated about the order: hypothesis discipline first, blast-radius controls second, production last.

Step 1 – Start from the steady-state hypothesis, not the fault

The single most common mistake is leading with the fault: “let’s kill a pod and see what happens.” That is not an experiment; it is vandalism with extra steps. A chaos experiment is a scientific one, and a scientific experiment needs a measurable hypothesis you can falsify.

The four principles from the Principles of Chaos Engineering are worth internalizing because they impose exactly this discipline:

Define steady state as a measurable output that indicates normal behavior – a business or system metric, not an internal implementation detail.
Hypothesize that steady state continues in both the control group (no fault) and the experimental group (fault injected).
Introduce real-world events – the failures that actually happen: dependency latency, instance loss, an unreachable region.
Try to disprove the hypothesis by looking for a difference in steady state between control and experimental groups.

Steady state must be a metric the business recognizes, measured at a high enough level that it survives normal noise. Good steady-state signals: successful checkouts per minute, p99 request latency, the ratio of 2xx to 5xx at the edge. Bad ones: CPU on a single node, queue depth on one consumer – these are causes, not symptoms, and they wander on their own.

The hypothesis is always stated in the positive and as a non-event: “When we add 200 ms of latency to the payments dependency, checkout success rate stays within 2% of its 7-day baseline.” If the experiment fails to disprove that, you have learned the system is resilient to that fault. If it disproves it, you have found a weakness on your terms – which is the entire point.
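
The hypothesis above can be turned into a mechanical verdict. The sketch below is one possible reading, not a standard API: the function names are hypothetical, and it assumes "within 2%" means a relative deviation from the baseline. It also treats a drifting control group as inconclusive, since the fault is then not the only variable.

```python
def steady_state_holds(baseline: float, observed: float, tolerance: float = 0.02) -> bool:
    """Return True if `observed` deviates from `baseline` by at most `tolerance` (relative)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(observed - baseline) / baseline <= tolerance


def verdict(baseline: float, control: float, experimental: float,
            tolerance: float = 0.02) -> str:
    """Compare control and experimental groups against the same baseline."""
    if not steady_state_holds(baseline, control, tolerance):
        # The control group moved on its own, so the result says nothing about the fault.
        return "inconclusive"
    if steady_state_holds(baseline, experimental, tolerance):
        return "not disproved"
    return "disproved"


if __name__ == "__main__":
    # Hypothetical numbers: 7-day baseline, control window, fault window.
    print(verdict(baseline=0.995, control=0.994, experimental=0.990))
```

Keeping the control comparison in the verdict is a design choice: it stops a noisy day from being recorded as a resilience finding.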

Step 2 – Build a reusable experiment template

Every experiment your organization runs should carry the same fields. Standardizing this turns chaos from a hero activity into a repeatable engineering practice and makes review trivial. Capture it as code – a YAML document checked into the service repo next to the runbook.

# experiments/payments-latency.yaml
experiment:
  id: chaos-payments-latency-200ms
  owner: payments-platform
  reviewers: [sre-oncall, payments-lead]

  # 1. The belief we are trying to falsify.
  steady_state:
    metric: checkout_success_rate
    query: "sum(rate(checkout_total{status='ok'}[5m])) / sum(rate(checkout_total[5m]))"
    tolerance: ">= 0.98"          # within 2% of the 7-day baseline of ~1.0

  hypothesis: >
    Injecting 200ms p50 latency into calls to the payments service keeps
    checkout_success_rate within tolerance, because the order service retries
    with a 1.5s budget and a circuit breaker isolates a fully degraded dependency.

  # 2. Hard limit on who/what can be affected.
  blast_radius:
    environment: staging          # promote to prod only after this passes
    scope: "service=payments, deployment=canary"
    traffic_percent: 5            # only 5% of canary fleet
    max_duration: 10m

  # 3. The conditions that immediately stop the experiment.
  abort_conditions:
    - "checkout_success_rate < 0.95 for 1m"
    - "edge_5xx_rate > 0.01 for 30s"
    - "manual_halt == true"       # the big red button

  fault:
    type: latency
    target: payments-service
    latency_ms: 200
    jitter_ms: 50

Three controls make or break safety. Blast radius is the explicit cap on what can be harmed: environment, a label selector, a traffic percentage, and a maximum duration. Abort conditions (sometimes called stop conditions or a halt) are automated kill switches that end the experiment the instant steady state degrades past a hard floor. That floor sits below the tolerance (0.95 against 0.98 in the template), so a run can disprove the hypothesis without tripping the abort, while a serious regression stops it immediately. And max_duration, set inside blast_radius, guarantees the fault is never left running because someone went to lunch: the experiment cleans up after itself.
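
An abort condition such as "checkout_success_rate < 0.95 for 1m" is a sustained-breach check, not a single-sample check. A minimal sketch of that logic follows, assuming samples arrive as (timestamp in seconds, value) pairs in ascending order. The function name and the sample format are illustrative assumptions, not part of any chaos tool.

```python
from typing import Iterable, Optional, Tuple


def should_abort(samples: Iterable[Tuple[float, float]],
                 floor: float, hold_seconds: float) -> bool:
    """Return True once the metric has stayed below `floor` for `hold_seconds`.

    A single recovery sample resets the timer, so brief dips do not abort the run.
    """
    breach_started: Optional[float] = None
    for timestamp, value in samples:
        if value < floor:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= hold_seconds:
                return True
        else:
            breach_started = None
    return False


if __name__ == "__main__":
    # Hypothetical 15-second scrapes: the breach begins at t=15 and persists.
    scrapes = [(0, 0.99), (15, 0.94), (30, 0.93), (45, 0.94), (60, 0.94), (75, 0.94)]
    print(should_abort(scrapes, floor=0.95, hold_seconds=60))
```

In practice this evaluation usually lives in the monitoring system (an alert rule or alarm) rather than in the experiment runner, so the kill switch still fires if the runner itself is unhealthy.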

Step 3 – Run in pre-production first, and earn the right to production

Chaos engineering’s reputation problem comes from people who read “you must experiment in production” and skipped the part where Netflix had years of tooling, a mature steady-state model, and automated aborts before they got there. Production is where the real unknowns live – realistic traffic, real data volumes, real concurrency – and pre-production environments lie to you in subtle ways. But you walk there.

The maturity ladder I use:

Stage	Environment	Fault scope	Trigger
0	Local / CI	Single dependency, mocked	Pull request
1	Staging	One service, synthetic load	Manual game day
2	Staging	Multi-service, AZ failure	Scheduled
3	Production	Single instance / pod, off-peak	Manual, supervised
4	Production	Continuous, small blast radius	Automated, business hours

You do not skip rungs. Each rung must pass cleanly – no surprises, aborts behaving correctly, observability showing exactly what happened – before the next. The fastest way to get chaos engineering banned at a company is to cause a customer-visible incident on your second experiment. Build credibility in staging, prove the abort path works, then ask for production.
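
The "pass cleanly before the next rung" rule can be encoded as a promotion gate. This is a sketch under stated assumptions: the RunResult fields mirror the three criteria named above, and the default of three consecutive clean runs is an illustrative threshold, not a rule from any framework.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RunResult:
    steady_state_held: bool      # the hypothesis was not disproved
    abort_path_verified: bool    # the automated abort was exercised and fired correctly
    surprises: int               # findings nobody predicted before the run


def is_clean(result: RunResult) -> bool:
    return result.steady_state_held and result.abort_path_verified and result.surprises == 0


def ready_to_promote(history: Sequence[RunResult], required_clean_runs: int = 3) -> bool:
    """True only if the most recent `required_clean_runs` runs were all clean."""
    if len(history) < required_clean_runs:
        return False
    return all(is_clean(r) for r in history[-required_clean_runs:])


if __name__ == "__main__":
    clean = RunResult(True, True, 0)
    surprised = RunResult(True, True, 1)
    print(ready_to_promote([surprised, clean, clean, clean]))
```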

Step 4 – Inject the four fault classes

Real-world events fall into a small number of classes. Cover these and you cover the overwhelming majority of incidents. I’ll show platform-native tooling where it exists and primitives where it doesn’t, because both have their place.

Latency

Slow is the new down. A dependency that returns errors trips your circuit breaker cleanly; one that gets slow ties up connection pools and threads and takes the caller down with it. Inject latency with tc/netem on Linux for a host-level test:

# Add 200ms +/- 50ms of delay to all egress on eth0.
sudo tc qdisc add dev eth0 root netem delay 200ms 50ms distribution normal

# Always know your rollback before you start the fault.
sudo tc qdisc del dev eth0 root netem

For application-aware, targeted latency against a single dependency, a proxy like Toxiproxy is cleaner because it scopes to one upstream instead of the whole NIC:

# Point the app at Toxiproxy, then add a latency toxic to the payments upstream.
# Flags go before the proxy name.
toxiproxy-cli create -l 127.0.0.1:5432 -u payments.internal:5432 payments
toxiproxy-cli toxic add -t latency -a latency=200 -a jitter=50 payments
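
Why a slow dependency hurts more than a failed one can be shown with Little's law: average in-flight requests equal arrival rate times latency. The numbers below (100 requests per second, a 50 ms baseline, a pool of 20 connections) are hypothetical and chosen only to illustrate the effect of adding 200 ms.

```python
def in_flight(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return requests_per_second * latency_ms / 1000


def pool_saturated(requests_per_second: float, latency_ms: float, pool_size: int) -> bool:
    """True when steady-state concurrency meets or exceeds the connection pool."""
    return in_flight(requests_per_second, latency_ms) >= pool_size


if __name__ == "__main__":
    rps, pool = 100, 20
    for label, latency in (("baseline", 50), ("with +200 ms fault", 250)):
        print(label, in_flight(rps, latency), pool_saturated(rps, latency, pool))
```

At baseline the caller needs about 5 connections. With the fault it needs about 25, which exceeds the assumed pool of 20, so requests queue for a connection even though the dependency never returned an error.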

Errors

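Force a dependency to return failures and verify the caller degrades rather than cascades. On managed platforms, AWS Fault Injection Service (FIS) can inject API errors and throttling at the control plane for the services it supports. Here, internal errors are returned for EC2 API calls made through a targeted IAM role (the target name payments-role is a placeholder that the template's targets section must define):

{
  "actions": {
    "InjectApiErrors": {
      "actionId": "aws:fis:inject-api-internal-error",
      "description": "Return internal errors for EC2 DescribeInstances calls from the target role",
      "parameters": {
        "service": "ec2",
        "operations": "DescribeInstances",
        "percentage": "100",
        "duration": "PT5M"
      },
      "targets": { "Roles": "payments-role" }
    }
  }
}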

More commonly you script the experiment with the FIS API; the template references reusable actions such as aws:fis:inject-api-internal-error for supported services and aws:ec2:stop-instances for instance loss. Start a run and FIS handles the stop conditions you attach via CloudWatch alarms:

aws fis start-experiment \
  --experiment-template-id EXT123abc \
  --tags Name=payments-error-injection

Resource exhaustion

CPU, memory, and disk pressure expose missing limits, bad autoscaling thresholds, and OOM-kill behavior. stress-ng is the right primitive:

# Saturate 2 CPUs and consume 1 GB of RAM for 60 seconds, then exit cleanly.
stress-ng --cpu 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief

# Fill a disk-backed path to test how the service handles a full volume.
stress-ng --hdd 1 --hdd-bytes 90% --timeout 60s

In Kubernetes, do this declaratively with a chaos operator. LitmusChaos and Chaos Mesh both express faults as custom resources the controller reconciles and – critically – reverts when the experiment’s duration expires:

# Chaos Mesh: 80% CPU burn on 1 pod of the payments deployment for 5 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: payments-cpu-burn
  namespace: payments
spec:
  mode: one                      # blast radius: a single pod
  selector:
    namespaces: [payments]
    labelSelectors:
      app: payments
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: 5m                   # auto-revert -- no orphaned fault

Dependency loss

The hardest and most valuable: make a dependency completely unreachable and confirm the system fails the right way – shedding load, serving cached or default data, or failing fast with a clear error, not hanging. Drop the traffic outright with a network-partition fault:

# Chaos Mesh: partition the payments pods from the database for 2 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payments-db-partition
  namespace: payments            # same namespace the Verify step queries
spec:
  action: partition
  mode: all
  selector:
    namespaces: [payments]
    labelSelectors: { app: payments }
  direction: to
  target:
    mode: all
    selector:
      namespaces: [data]
      labelSelectors: { app: postgres }
  duration: 2m
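
What "failing the right way" looks like on the caller side can be sketched as a bounded wait with a fallback. The helper below is a hypothetical illustration using only the standard library. It bounds how long the caller waits but does not interrupt the stuck call, so a real client should also set its own connect and read timeouts.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, timeout_s: float, fallback):
    """Run `fn` with a hard wait budget; return `fallback` on timeout or error."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort: a call that is already running keeps running
        return fallback
    except Exception:
        # Dependency returned an error: degrade to the fallback instead of cascading.
        return fallback


def unreachable_dependency():
    """Stand-in for a partitioned dependency that hangs instead of refusing."""
    time.sleep(0.3)
    return "fresh data"


if __name__ == "__main__":
    print(call_with_fallback(unreachable_dependency, timeout_s=0.05, fallback="cached data"))
```

A partition experiment is a direct test of this path: if the caller hangs for the full fault duration instead of returning the fallback within its budget, the hypothesis is disproved.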

Step 5 – Simulate region and AZ failure for real DR validation

A documented DR plan that has never been exercised is a hypothesis, not a capability. Availability-zone and region experiments are where chaos engineering pays for itself, because they validate the most expensive and least-tested part of your architecture.

For AZ failure, AWS FIS ships a purpose-built action that disrupts connectivity for a whole zone so you can confirm your service rebalances and your multi-AZ data tier holds:

{
  "actions": {
    "DisruptAZ": {
      "actionId": "aws:network:disrupt-connectivity",
      "parameters": { "scope": "availability-zone", "duration": "PT10M" },
      "targets": { "Subnets": "az-b-subnets" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:...:alarm/checkout-success-floor" }
  ]
}

On Azure, Chaos Studio exposes a Virtual Machine Scale Set shutdown fault that you can scope to a single zone, plus network-disconnect faults for dependency loss. The same template discipline applies – selector, duration, and an attached monitor that halts the run.

For the full region-loss test, do not start with a real one. Begin by simulating the symptoms: black-hole all traffic to the region’s endpoints at your global load balancer, or use DNS to pull a region out of rotation, and verify failover RTO against your stated objective. Only once that is boringly reliable do you graduate to actually disabling the region’s capacity. The metric to assert is not “did it fail over” but “did steady state hold within RTO/RPO” – measured at the edge, on the business signal.

A region experiment without a measured RTO assertion is theater. The hypothesis is concrete: “When us-east-1 becomes unreachable, global checkout_success_rate recovers to >= 0.98 within 120 seconds (our stated RTO), with RPO <= 5 seconds of orders.” Pass or fail, in numbers.
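
The RTO half of that hypothesis can be checked mechanically from the steady-state series. The sketch below assumes (timestamp in seconds, value) samples in ascending order and defines recovery as the first sample at or above the threshold after the last dip. The function name and that definition of recovery are assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple


def recovery_seconds(samples: Sequence[Tuple[float, float]],
                     fault_start: float, threshold: float) -> Optional[float]:
    """Seconds from `fault_start` until the metric is back at or above `threshold`.

    Returns 0.0 if the metric never dipped, and None if it had not recovered
    by the last sample.
    """
    last_bad = None
    recovered_at = None
    for timestamp, value in samples:
        if timestamp < fault_start:
            continue
        if value < threshold:
            last_bad = timestamp
            recovered_at = None          # a later dip invalidates an earlier recovery
        elif last_bad is not None and recovered_at is None:
            recovered_at = timestamp
    if last_bad is None:
        return 0.0
    if recovered_at is None:
        return None
    return recovered_at - fault_start


if __name__ == "__main__":
    # Hypothetical 10-second samples; the region is made unreachable at t=10.
    series = [(0, 0.99), (10, 0.99), (20, 0.60), (30, 0.70), (40, 0.985), (50, 0.99)]
    measured = recovery_seconds(series, fault_start=10, threshold=0.98)
    print(measured, measured is not None and measured <= 120)
```

The resolution of the answer is limited by the sampling interval, so the scrape interval should be well below the RTO being asserted.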

Step 6 – Automate continuous chaos and wire it into CI

Manual game days find weaknesses once. Continuous, automated experiments catch regressions – the resilience you had last quarter that a refactor quietly removed. The goal is to make chaos a standing check, not an event.

Two integration points matter. First, a CI gate: a small, fast, fully isolated experiment that runs on every change to a critical service. Inject latency or a dependency error against an ephemeral environment and assert that steady state holds. This is where Chaos Toolkit fits well: its experiments are JSON or YAML files, and chaos run exits non-zero when the steady-state hypothesis fails, so the pipeline fails like any other test:

# .github/workflows/chaos-gate.yml
name: chaos-gate
on: [pull_request]
jobs:
  resilience:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install chaostoolkit chaostoolkit-prometheus
      # Probes steady state BEFORE, injects the fault, re-probes AFTER.
      # Exits non-zero if the after-probe breaks tolerance -> PR blocked.
      - run: chaos run experiments/payments-latency.json

Second, a scheduled production cadence at progressively larger blast radius, running during business hours when engineers are watching – the opposite of the instinct to run risky things overnight. Off-hours chaos means an undetected failure smolders until morning; business-hours chaos means a human sees the abort fire in real time. Schedule via your orchestrator (FIS experiments can be run on a schedule; Chaos Mesh has a Schedule CRD that takes a cron expression in its schedule field) and let the automated stop conditions be the safety net.
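
A scheduler can enforce the business-hours rule with a guard that runs before any fault starts. This is a minimal sketch: the 10:00 to 16:00 weekday window is an assumed policy, and the function expects a datetime already expressed in the team's local time.

```python
from datetime import datetime, time


def in_business_hours(now: datetime,
                      start: time = time(10, 0),
                      end: time = time(16, 0)) -> bool:
    """True on Monday through Friday when start <= local time < end."""
    is_weekday = now.weekday() < 5          # Monday is 0, Sunday is 6
    return is_weekday and start <= now.time() < end


if __name__ == "__main__":
    # 2024-01-03 was a Wednesday; 2024-01-06 was a Saturday.
    print(in_business_hours(datetime(2024, 1, 3, 11, 0)))
    print(in_business_hours(datetime(2024, 1, 6, 11, 0)))
```

Leaving margin at both ends of the window is deliberate: an experiment started just before people leave gets the same lack of attention as one started overnight.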

Step 7 – Measure, feed back, and track resilience debt

An experiment that finds a weakness and does not produce a tracked action item was a waste of risk. Close the loop with three outputs every time:

A verdict against the hypothesis – disproved or not – with the steady-state graphs for control and experimental windows attached. Numbers, not vibes.
A findings record for anything surprising: a missing timeout, an autoscaler that reacted too slowly, an alert that never fired. Each becomes a ticket with an owner.
A resilience-debt entry when you find a weakness you choose not to fix immediately. This is the most underused practice in the whole discipline. Resilience debt is tracked exactly like tech debt – logged, prioritized, and visible – so “we know region failover is 40 seconds over RTO” is a deliberate, owned decision rather than a landmine.

Track resilience debt where leadership already looks. A simple risk register works:

ID     Finding                              Severity  Owner     RTO impact  Status
RD-12  Order cache has no TTL fallback      High      orders    +90s        In progress
RD-19  AZ-b drain rebalances in 38s (>30s)  Medium    platform  +8s         Accepted
RD-23  Payments breaker opens too late      High      payments  cascade     Fixed (PR-441)

The trend that matters over quarters is mean time to detection and recovery falling while blast radius and experiment frequency rise. If experiments stop finding anything, you are either highly resilient or running experiments that are too timid – usually the latter, so escalate the fault classes.
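
If the register lives in code or a tracker export, the open debt can be listed most severe first for each review. The sketch below reuses the three rows from the register above. The DebtEntry type, the severity ranking, and the rule that "Accepted" still counts as open debt are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

SEVERITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


@dataclass(frozen=True)
class DebtEntry:
    id: str
    finding: str
    severity: str
    owner: str
    status: str


def open_debt(entries: Sequence[DebtEntry]) -> List[DebtEntry]:
    """Entries that are not fixed yet, most severe first."""
    still_open = [e for e in entries if not e.status.startswith("Fixed")]
    return sorted(still_open, key=lambda e: SEVERITY_RANK[e.severity])


REGISTER = [
    DebtEntry("RD-12", "Order cache has no TTL fallback", "High", "orders", "In progress"),
    DebtEntry("RD-19", "AZ-b drain rebalances in 38s (>30s)", "Medium", "platform", "Accepted"),
    DebtEntry("RD-23", "Payments breaker opens too late", "High", "payments", "Fixed (PR-441)"),
]

if __name__ == "__main__":
    for entry in open_debt(REGISTER):
        print(entry.id, entry.severity, entry.owner, entry.status)
```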

Verify

Confirm the program – not just one experiment – actually works:

# 1. The fault was applied. With Chaos Mesh, the controller records events.
kubectl get networkchaos payments-db-partition -n payments -o yaml \
  | grep -A5 "status:"

# 2. The fault auto-reverted on schedule (no orphaned faults left running).
kubectl get networkchaos,stresschaos -A      # expect none "Running" past duration

# 3. The abort path fires. Force steady state below the floor and confirm
#    the experiment halts itself -- this is the test that earns prod access.
#    Watch the experiment status flip to halted/finished when the alarm trips.

# 4. Steady state was actually measured, not assumed. Query the same metric
#    the experiment used, over the experiment window.
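#    -G with --data-urlencode keeps curl from globbing the {} and [] in the query.
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(checkout_total{status='ok'}[5m])) / sum(rate(checkout_total[5m]))" \
  | jq '.data.result'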

A green experiment with no steady-state graph is unverified. The deliverable is always the before/after on the business signal – if you cannot produce that, you ran a fault, not an experiment.

Enterprise scenario

A retail platform team I worked with had a textbook multi-AZ architecture on EKS across three zones, with an Aurora cluster spanning all three. Every architecture review checked the box. Then a real AZ impairment hit and checkout dropped 60% for eleven minutes – far past their 30-second RTO – before recovering on its own. The architecture was correct on paper; the behavior under partial failure was not what anyone believed.

The constraint was that they could not reproduce the failure on demand to debug it, and leadership was understandably nervous about deliberately breaking production again. So they built up to it. In staging they used FIS aws:network:disrupt-connectivity scoped to one AZ’s subnets, with a CloudWatch alarm on synthetic checkout success wired as a stop condition. The first run reproduced the symptom immediately and revealed two compounding faults the design review had missed. First, their Kubernetes pod readiness probes had a 30-second periodSeconds with a failureThreshold of 3 – so endpoints in the dead zone stayed in rotation for up to 90 seconds, and the service kept routing to pods it could not reach. Second, the JDBC pool’s connection-validation timeout was longer than the request timeout, so threads blocked on dead connections instead of failing fast and being replaced.

The fix was unglamorous and exactly what chaos is for: tighten the readiness probe so dead endpoints leave rotation in under 10 seconds, and set the connection validation timeout below the request timeout so a partitioned database connection fails fast.

# Readiness probe tuned so a partitioned pod leaves the Service quickly.
readinessProbe:
  httpGet: { path: /healthz/ready, port: 8080 }
  periodSeconds: 3          # was 30
  failureThreshold: 2       # was 3  -> out of rotation in ~6s, not ~90s
  timeoutSeconds: 2
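
The "~90s" and "~6s" figures follow from multiplying the probe period by the failure threshold. The helper below is a rough approximation under that assumption. It ignores the probe timeout and the time for endpoint updates to propagate, so real ejection takes somewhat longer.

```python
def seconds_to_unready(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time for consecutive probe failures to mark a pod unready."""
    return period_seconds * failure_threshold


if __name__ == "__main__":
    print("before:", seconds_to_unready(period_seconds=30, failure_threshold=3))
    print("after: ", seconds_to_unready(period_seconds=3, failure_threshold=2))
```

Against a 30-second RTO, the original settings could not pass even in the best case, which is the kind of arithmetic an experiment forces a team to do.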

They added the AZ-disruption experiment to a monthly schedule, first in staging, then – once three consecutive runs held RTO – supervised in production off-peak, then automated during business hours. The next real AZ event, two quarters later, was a non-incident: checkout dipped 4% and recovered in 19 seconds, comfortably inside RTO. The experiment that once reproduced an 11-minute outage now runs unattended every week and asserts checkout_success_rate >= 0.98 the whole time.

Written by Vinod
