Ansible for Disaster Recovery, In Depth — RPO/RTO Engineering, Site Failover and Cross-Region Runbooks
Disaster recovery is the operations discipline most likely to be rehearsed on paper and broken in production. Every regulated company has a DR plan. Almost no regulated company can execute it inside the recovery window the plan claims, because the plan is a Word document and the failover is a 600-step manual procedure that depends on the one engineer who happened to write it.
Ansible fixes that, but only if you treat DR the way you treat any other production system: with versioned code, signed artefacts, automated validation, and frequent rehearsal. This lesson is the specialist-tier guide to engineering DR with Ansible — translating recovery objectives into runbooks, validating replication, orchestrating cross-region cutovers from the Ansible Automation Platform (AAP), and proving recovery on a cadence that satisfies auditors and operators alike.
We will not cover the marketing distinction between “DR” and “BC/BCP”. We will cover the engineering: how to make a failover happen, how to know it worked, how to fail back without losing data, and how to keep the runbook honest as the underlying systems change.
Position in the curriculum. This lesson assumes Tier 1–4 fluency: roles, dynamic inventory, AAP workflows, vault, no_log discipline, at least one cloud collection, and the Tier 5 compliance lesson (because DR runbooks are themselves compliance evidence).
What “DR” really means in the Ansible context
DR is the discipline of continuing to operate a service after a fault that exceeds the local fault domain. “Local fault domain” here is whatever your steady-state HA was designed for: a node, a rack, a switch, a rack row, a power feed. Anything that takes out more than that is a DR event: a regional outage, a fibre cut, a ransomware encryption, a cloud-provider-wide control plane failure, a fire, a flood, or a misapplied automation script whose blast radius covers the whole site.
DR has two non-negotiable numbers, set by the business:
- RPO — Recovery Point Objective: the maximum acceptable data loss measured in time. RPO=0 means no data loss is acceptable (synchronous replication). RPO=15min means we can lose up to 15 minutes of data.
- RTO — Recovery Time Objective: the maximum acceptable downtime measured in time. RTO=0 means service must be continuously available (active-active). RTO=4h means we have 4 hours from declaration of disaster to restored service.
There is a third number that often gets ignored: MTPD — Maximum Tolerable Period of Disruption, the point at which the business itself fails. RTO must always be less than MTPD. RPO and RTO together define the DR architecture pattern you must build:
| Pattern | Typical RPO | Typical RTO | Cost | When to use |
|---|---|---|---|---|
| Backup & restore | Hours–days | Hours–days | Low | Non-critical apps, archive, dev/test |
| Pilot light | Minutes | 30–60min | Medium | Core data replicated, app stack scaled-out on demand |
| Warm standby | Minutes | 5–15min | High | Smaller capacity running hot, scaled up on failover |
| Active-passive | Seconds | 1–5min | High | Full capacity in DR site, idle until failover |
| Active-active | 0 (sync) | 0 (transparent) | Very high | Mission-critical financial/medical, cross-region traffic |
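The table can be read as a lookup from targets to the cheapest pattern that still meets them. The following is a minimal sketch of that lookup in Python. The numeric floors per pattern are illustrative assumptions (the table only gives ranges such as "Minutes"), and the names `Pattern`, `PATTERNS` and `cheapest_pattern` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    min_rpo_s: int  # tightest RPO this pattern is assumed to reach, in seconds
    min_rto_s: int  # tightest RTO this pattern is assumed to reach, in seconds


# Cheapest first, in the same order as the table rows.
# The numeric floors are illustrative assumptions, not figures from the table.
PATTERNS = [
    Pattern("backup & restore", 3600, 3600),
    Pattern("pilot light", 60, 1800),
    Pattern("warm standby", 60, 300),
    Pattern("active-passive", 1, 60),
    Pattern("active-active", 0, 0),
]


def cheapest_pattern(rpo_s: int, rto_s: int, mtpd_s: int) -> str:
    """Return the first (cheapest) pattern whose floors fit inside the targets."""
    if rpo_s < 0 or rto_s < 0:
        raise ValueError("RPO and RTO cannot be negative")
    if rto_s >= mtpd_s:
        raise ValueError("RTO must be less than MTPD")
    for pattern in PATTERNS:
        if pattern.min_rpo_s <= rpo_s and pattern.min_rto_s <= rto_s:
            return pattern.name
    raise AssertionError("unreachable: active-active accepts any target")


if __name__ == "__main__":
    # RPO 5 minutes and RTO 30 minutes; the 4-hour MTPD is an assumed value.
    print(cheapest_pattern(rpo_s=300, rto_s=1800, mtpd_s=4 * 3600))
```

With an RPO of 5 minutes and an RTO of 30 minutes the sketch selects pilot light; tightening the RTO to 10 minutes moves the answer to warm standby under the same assumed floors.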
Ansible’s job in DR is orchestration, not storage. The data replication itself is done by storage arrays (SnapMirror, Storage Replica, EBS replication), database engines (Postgres streaming, MySQL group replication, MongoDB replica sets), or block-level tools (Veeam, Zerto, AWS DRS). Ansible’s job is to:
- Validate that replication is healthy and within RPO every minute of every day.
- Promote the DR copy when a disaster is declared.
- Reconfigure networking, DNS, certificates and identity to make the DR copy reachable.
- Restart application stacks in the DR site in the correct order.
- Verify that the failed-over service actually works, end-to-end.
- Fail back cleanly when the primary site recovers, without data loss.
- Test the whole flow on a non-disruptive cadence — quarterly or better.
If Ansible owns those seven jobs, your DR plan is code. If not, your DR plan is a wiki page that no one has read in 18 months.
The DR architecture this lesson assumes
For concreteness, the rest of this lesson assumes a representative dual-region hybrid pattern that maps onto most enterprise environments:
- Primary site: on-prem datacentre (DC1) with vSphere, NetApp storage, an Active Directory forest, and a Postgres cluster that backs the application.
- DR site: AWS region eu-west-1, with EC2/EBS for compute/storage, RDS Postgres as a read-replica that can be promoted, and Route 53 for DNS.
- Replication: NetApp SnapMirror (storage), AWS DMS or Postgres logical replication (database), and aws s3 sync or Veeam Cloud Connect (file/object).
- Control plane: AAP (Automation Controller + automation mesh), with one control node in the primary site, one in DR, and a hop node in each, so workflows can run from either side. Every execution environment is signed and pinned.
- Identity: AD with replication to a DR domain controller, and AWS IAM Identity Center federated to AD via SAML.
This is a pilot-light/warm-standby hybrid: the database is hot, application servers are scaled to zero in DR until failover. RPO target is 5 minutes; RTO target is 30 minutes. Those numbers are what we will engineer to.
The DR repository layout
Treat the DR runbook as a top-level Ansible repository, not a side project of the main one. The blast radius of a DR run is the entire estate; it deserves its own roles, its own review process, and its own test matrix.
dr-orchestration/
├── ansible.cfg
├── collections/requirements.yml      # pinned: amazon.aws, community.aws,
│                                     # community.postgresql, ansible.posix
├── inventory/
│   ├── primary/                      # DC1 (vSphere)
│   │   ├── hosts.yml
│   │   └── group_vars/
│   └── dr/                           # AWS eu-west-1
│       ├── aws_ec2.yml               # dynamic inventory plugin
│       └── group_vars/
├── group_vars/
│   └── all/
│       ├── rpo_rto.yml               # the SLOs (single source of truth)
│       └── vault.yml                 # encrypted secrets
├── playbooks/
│   ├── 00-prereq-validate.yml
│   ├── 10-replication-health.yml     # runs every minute via AAP schedule
│   ├── 20-declare-disaster.yml       # human-gated, requires approval
│   ├── 30-failover.yml               # the actual cutover
│   ├── 40-validate-failover.yml      # synthetic transactions
│   ├── 50-cutover-dns.yml            # last step, feels irreversible
│   ├── 60-failback-prep.yml
│   ├── 70-failback.yml
│   └── 99-game-day.yml               # non-destructive test
└── roles/
    ├── replication_check_netapp/
    ├── replication_check_postgres/
    ├── replication_check_s3/
    ├── fence_primary/                # used by 30-failover.yml, step 1
    ├── promote_rds/
    ├── scale_asg/
    ├── reconfigure_dns/
    ├── reconfigure_ad_trust/
    ├── reconfigure_certs/
    ├── app_smoke_test/
    └── dr_evidence/
The split between playbooks/10-replication-health.yml (continuous) and playbooks/30-failover.yml (event) is fundamental. You cannot fail over to a replica that is broken, and you do not want to discover the replica is broken during a real disaster. The continuous playbook runs every minute and screams when replication lag exceeds RPO.
RPO engineering: continuous replication validation
The first runbook to write is not the failover — it is the replication health check. Without this, every other runbook is a guess.
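# playbooks/10-replication-health.yml
---
- name: Replication health (must stay green)
  hosts: localhost
  gather_facts: true           # ansible_date_time is used below
  gather_subset: [min]
  vars:
    rpo_db_seconds: 300        # 5 minutes
    rpo_storage_seconds: 600   # 10 minutes
    rpo_object_seconds: 900    # 15 minutes
  tasks:
    - name: Postgres replication lag as seen from the primary
      # pg_stat_replication lives on the primary and exposes replay_lag per
      # replica; pg_last_xact_replay_timestamp() only has a value on a standby.
      ansible.builtin.shell: |
        psql -h primary-db.kv.local -U dr_observer -d app -At -c "
          SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag), 0)::int
          FROM pg_stat_replication
          WHERE application_name = 'dr_replica'
        "
      register: pg_lag
      changed_when: false
      no_log: true

    - name: Fail loudly if Postgres RPO breached or the replica is not connected
      ansible.builtin.fail:
        msg: "Postgres replication lag '{{ pg_lag.stdout }}'s > RPO {{ rpo_db_seconds }}s (empty = no dr_replica row)"
      when: (pg_lag.stdout | trim | length == 0) or (pg_lag.stdout | int > rpo_db_seconds)

    - name: NetApp SnapMirror lag
      netapp.ontap.na_ontap_rest_info:
        hostname: "{{ netapp_host }}"
        username: "{{ netapp_user }}"
        password: "{{ netapp_pass }}"
        gather_subset: snapmirror_info
      register: sm
      no_log: true

    - name: Fail loudly if storage RPO breached
      # iso8601_duration_to_seconds is assumed to be a local filter plugin
      # shipped in filter_plugins/; it is not a community.general filter.
      ansible.builtin.fail:
        msg: "SnapMirror lag {{ item.lag_time }} on {{ item.destination_path }}"
      loop: "{{ sm.ontap_info.snapmirror_info.records }}"
      when: (item.lag_time | iso8601_duration_to_seconds) > rpo_storage_seconds

    - name: Worst SnapMirror lag in seconds
      ansible.builtin.set_fact:
        sm_lag_max: >-
          {{ sm.ontap_info.snapmirror_info.records
             | map(attribute='lag_time')
             | map('iso8601_duration_to_seconds')
             | max }}

    - name: S3 cross-region replication latency (last 5 minutes)
      # Uses the AWS CLI. replication_rule_id is an assumed variable: the S3
      # replication metrics are published per replication rule (RuleId).
      ansible.builtin.command:
        cmd: >-
          aws cloudwatch get-metric-statistics
          --namespace AWS/S3 --metric-name ReplicationLatency
          --dimensions
          Name=SourceBucket,Value={{ src_bucket }}
          Name=DestinationBucket,Value={{ dst_bucket }}
          Name=RuleId,Value={{ replication_rule_id }}
          --period 60 --statistics Maximum
          --start-time {{ '%Y-%m-%dT%H:%M:%SZ' | strftime((ansible_date_time.epoch | int) - 300, utc=true) }}
          --end-time {{ ansible_date_time.iso8601 }}
          --output json
      register: s3_lag
      changed_when: false

    - name: Worst S3 replication latency in the window (0 if no datapoints)
      ansible.builtin.set_fact:
        s3_lag_max: "{{ (s3_lag.stdout | from_json).Datapoints | map(attribute='Maximum') | max | default(0) }}"

    - name: Fail loudly if object RPO breached
      ansible.builtin.fail:
        msg: "S3 replication latency {{ s3_lag_max }}s > RPO {{ rpo_object_seconds }}s"
      when: (s3_lag_max | float) > rpo_object_seconds

    - name: Push DR health to Prometheus pushgateway
      ansible.builtin.uri:
        url: "http://pushgw.kv.local:9091/metrics/job/dr_health"
        method: POST
        body: |
          dr_pg_lag_seconds {{ pg_lag.stdout | int }}
          dr_sm_lag_seconds {{ sm_lag_max }}
          dr_s3_lag_seconds {{ s3_lag_max }}
          dr_health_check_unixtime {{ ansible_date_time.epoch }}
        status_code: 200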
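The SnapMirror check compares a lag reported as an ISO 8601 duration (for example PT8M30S) against an RPO in seconds, so it needs a conversion step. The following is a minimal sketch of that conversion in Python, wrapped in the `FilterModule` class that Ansible loads from a `filter_plugins/` directory. The filter name `iso8601_duration_to_seconds` is hypothetical, and the sketch assumes durations use only days, hours, minutes and seconds.

```python
import re

# Days plus an optional time part; weeks, months and years are deliberately
# not handled because their length in seconds is ambiguous.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def iso8601_duration_to_seconds(value: str) -> int:
    """Convert a duration such as 'PT8M30S' or 'P1DT2H' to whole seconds."""
    match = _DURATION.match(value)
    if match is None or value in ("P", "PT"):
        raise ValueError(f"unsupported ISO 8601 duration: {value!r}")
    parts = {k: float(v) if v else 0.0 for k, v in match.groupdict().items()}
    total = (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )
    return int(total)


class FilterModule:
    """Ansible filter plugin entry point (place the file in filter_plugins/)."""

    def filters(self):
        return {"iso8601_duration_to_seconds": iso8601_duration_to_seconds}
```

Rejecting unknown formats with an exception is a deliberate choice here: a health check that silently treats an unparseable lag as zero would report green while replication is broken.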
Run this from AAP every minute on a schedule. Wire the Prometheus metrics into Alertmanager: a single missed minute is ignored; three consecutive breaches page a P1 through PagerDuty. The breach itself is your early warning that DR cannot meet RPO right now, long before any actual disaster.
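The paging rule described above (ignore one missed sample, page on three consecutive breaches) would normally live in the alerting rules rather than in Ansible. As a way to reason about it, here is a minimal Python sketch of the decision. The function name `page_needed` is hypothetical, and treating two or more consecutive missing samples as page-worthy is an assumption, not something the rule above specifies.

```python
from typing import Optional, Sequence


def page_needed(
    samples: Sequence[Optional[float]],
    rpo_seconds: float,
    consecutive: int = 3,
) -> bool:
    """Decide whether the newest samples justify a page.

    samples are lag values in seconds, oldest first, one per scheduled run;
    None marks a run that produced no measurement.
    """
    streak = 0  # consecutive breaches seen, walking back from the newest sample
    gap = 0     # consecutive missing samples seen
    for lag in reversed(samples):
        if lag is None:
            gap += 1
            if gap > 1:
                # Assumption: being blind for two or more runs is itself an incident.
                return True
            continue  # a single missed run is ignored
        gap = 0
        if lag <= rpo_seconds:
            return False  # a healthy sample ends the streak
        streak += 1
        if streak >= consecutive:
            return True
    return False
```

For example, `page_needed([10, 400, 400, 400], rpo_seconds=300)` pages, while `page_needed([400, 400, 10, 400], rpo_seconds=300)` does not, because the healthy sample breaks the streak.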
RTO engineering: the failover runbook
The failover playbook is the most important Ansible file in your estate. Treat it accordingly: small, idempotent, modular, and gated by approval.
# playbooks/30-failover.yml
---
- name: DR failover (primary -> DR)
  hosts: localhost
  gather_facts: false
  vars_prompt:
    - name: confirm
      prompt: "Type FAILOVER to proceed. This will promote DR and stop primary writes."
      private: false
  pre_tasks:
    - name: Abort unless explicitly confirmed
      ansible.builtin.assert:
        that: confirm == "FAILOVER"
        fail_msg: "DR failover not confirmed; aborting."
  tasks:
    - name: 1. Stop writes to primary (fence)
      ansible.builtin.import_role:
        name: fence_primary
      tags: [fence]

    - name: 2. Final replication catch-up window
      ansible.builtin.import_role:
        name: replication_check_postgres
      vars:
        rpo_db_seconds: 60
      tags: [validate]

    - name: 3. Promote RDS read-replica to primary
      ansible.builtin.import_role:
        name: promote_rds
      tags: [promote]

    - name: 4. Scale up DR application tier
      ansible.builtin.import_role:
        name: scale_asg
      vars:
        target_capacity: "{{ primary_capacity }}"
      tags: [compute]

    - name: 5. Reconfigure AD trust + DNS (Route 53)
      ansible.builtin.import_role:
        name: reconfigure_dns
      tags: [dns]

    - name: 6. Re-issue certificates pointing at DR endpoints
      ansible.builtin.import_role:
        name: reconfigure_certs
      tags: [tls]

    - name: 7. Smoke test
      ansible.builtin.import_role:
        name: app_smoke_test
      tags: [verify]

    - name: 8. Record evidence
      ansible.builtin.import_role:
        name: dr_evidence
      vars:
        event: failover
      tags: [evidence]
Note the structure: each step is a role, each step is tagged, each step is idempotent, and each step is named with the order it must execute. You can re-run this playbook safely; that is the point of idempotency in DR. If step 4 (scale ASG) fails because of an EC2 capacity issue and the operator fixes it manually, re-running the playbook re-evaluates steps 1 to 3, finds nothing left to change, sees the ASG already at the target capacity, and carries on with step 5. This is the difference between a runbook that survives a real disaster and a runbook that breaks the moment reality intrudes.
The vars_prompt is deliberately a typed phrase, not a y/n. Tired engineers at 3am hit y by reflex.
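The two properties described above, a typed confirmation gate and steps that are safe to re-run, can be modelled in a few lines. The following Python sketch is only an illustration of the control flow, not how Ansible implements it; `Step`, `run_failover` and `demo_steps` are hypothetical names, and the in-memory `state` dictionary stands in for the real infrastructure.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Step(NamedTuple):
    name: str
    is_done: Callable[[], bool]  # read-only probe of the current state
    apply: Callable[[], None]    # performs the change


def run_failover(steps: List[Step], confirm: str) -> List[Tuple[str, str]]:
    """Run steps in order, skipping any whose desired state already holds."""
    if confirm != "FAILOVER":
        raise PermissionError("DR failover not confirmed; aborting.")
    report = []
    for step in steps:
        if step.is_done():
            report.append((step.name, "ok"))  # nothing to do on a re-run
            continue
        step.apply()
        if not step.is_done():
            raise RuntimeError(f"{step.name} did not converge")
        report.append((step.name, "changed"))
    return report


def demo_steps(state: Dict[str, bool]) -> List[Step]:
    """Build four toy steps backed by a dictionary instead of real systems."""
    def make(key: str) -> Step:
        return Step(key, lambda: state[key], lambda: state.__setitem__(key, True))
    return [make(k) for k in ("fence", "promote", "scale", "dns")]


if __name__ == "__main__":
    # The operator already fixed the scale step by hand before the re-run.
    state = {"fence": True, "promote": True, "scale": True, "dns": False}
    for name, status in run_failover(demo_steps(state), confirm="FAILOVER"):
        print(name, status)
```

The re-run reports the first three steps as unchanged and only applies the DNS step, which mirrors what an idempotent playbook does after a manual fix.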
The fence: stop primary writes before promoting
This is the step that operators most often skip in tabletop exercises and most often regret in real failovers. You must stop the primary from accepting writes before promoting the replica, or you will end up with a split brain and no clean way to merge the two timelines back.
# roles/fence_primary/tasks/main.yml
---
- name: Disable primary load balancer (HAProxy/F5)
  community.general.haproxy:
    socket: /var/lib/haproxy/stats
    state: disabled
    backend: "{{ item.backend }}"
    host: "{{ item.host }}"
  loop: "{{ primary_app_hosts }}"
  delegate_to: lb-primary.kv.local

- name: Reject writes at primary Postgres (read-only)
  community.postgresql.postgresql_set:
    name: default_transaction_read_only
    value: "on"
    login_host: "{{ primary_pg_host }}"
    login_user: postgres
  no_log: true

- name: Tell ALL primary app hosts to stop ingesting
  ansible.builtin.systemd:
    name: kv-app
    state: stopped
  delegate_to: "{{ item }}"
  loop: "{{ groups['primary_app'] }}"
  ignore_errors: true        # a failed stop must not block the failover
  ignore_unreachable: true   # primary may already be unreachable

- name: Flag the storage volume as fenced (NetApp)
  netapp.ontap.na_ontap_volume:
    state: present
    name: "{{ item }}"
    is_online: false
    hostname: "{{ netapp_host }}"
    username: "{{ netapp_user }}"
    password: "{{ netapp_pass }}"
  loop: "{{ primary_volumes }}"
  no_log: true
ignore_errors: true and ignore_unreachable: true on the systemd stop are intentional: in a real disaster the primary may be unreachable, and Ansible treats an unreachable host as a separate condition from a failed task, so ignore_errors alone does not cover it. The fence is layered rather than absolute. default_transaction_read_only=on makes new transactions read-only for any client that does manage to connect, but it is only a default that a session can override, so it works together with the LB draining traffic at the network edge and the storage volume being taken offline.
Promotion, scale-up, and DNS cutover
These are the three steps that take the most clock time. Engineer each one to stream progress to the operator rather than sit silently for 10 minutes.
# roles/promote_rds/tasks/main.yml
---
- name: Promote RDS read replica
  amazon.aws.rds_instance:
    id: "{{ dr_db_identifier }}"
    state: present
    read_replica: false        # false promotes an existing replica
    region: "{{ dr_region }}"
    wait: true
    wait_timeout: 600
  register: rds_promo

- name: Verify Postgres accepts writes
  community.postgresql.postgresql_query:
    db: app
    login_host: "{{ rds_promo.endpoint.address }}"
    login_user: "{{ pg_user }}"
    login_password: "{{ pg_pass }}"
    query: "CREATE TABLE IF NOT EXISTS dr_canary (ts timestamptz); INSERT INTO dr_canary VALUES (now());"
  no_log: true

- name: Update Postgres connection string in AWS Secrets Manager
  amazon.aws.secretsmanager_secret:
    name: app/db/url
    secret_type: string
    secret: "postgresql://{{ pg_user }}:{{ pg_pass }}@{{ rds_promo.endpoint.address }}:5432/app"
    region: "{{ dr_region }}"
    state: present
  no_log: true
The “canary” insert is your proof that promotion worked. If the role completes without that insert succeeding, you do not have a working DR — you have a half-promoted replica, and step 4 (scale ASG) will start launching application servers that immediately fail to connect.
# roles/scale_asg/tasks/main.yml
---
- name: Scale ASG to primary capacity
  amazon.aws.autoscaling_group:
    name: "{{ dr_asg_name }}"
    region: "{{ dr_region }}"
    desired_capacity: "{{ target_capacity }}"
    min_size: "{{ target_capacity }}"
    max_size: "{{ (target_capacity | int * 1.5) | round | int }}"
    wait_for_instances: true
    wait_timeout: 900
  register: scale_result

# The ASG module returns instance IDs, so look up the private IPs first.
- name: Look up the instances the ASG launched
  amazon.aws.ec2_instance_info:
    region: "{{ dr_region }}"
    instance_ids: "{{ scale_result.instances }}"
  register: asg_instances
  when: scale_result.instances | default([]) | length > 0

- name: Wait until /healthz returns 200 on every new instance
  ansible.builtin.uri:
    url: "http://{{ item.private_ip_address }}:8080/healthz"
    status_code: 200
    timeout: 5
  register: healthz
  until: healthz.status == 200
  retries: 30
  delay: 10
  delegate_to: localhost
  loop: "{{ asg_instances.instances | default([]) }}"
The DNS cutover comes only after the database is promoted and the application tier is healthy, because it is the visible-to-customers step. Once Route 53 points at the DR ALB, traffic shifts; that is the moment customers see a recovered service.
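# roles/reconfigure_dns/tasks/main.yml
---
- name: Move app.example.com to DR ALB
  amazon.aws.route53:
    state: present
    zone: example.com
    record: app.example.com
    type: A
    alias: true
    alias_hosted_zone_id: "{{ dr_alb_hosted_zone_id }}"
    value: "{{ dr_alb_dns_name }}"
    ttl: 60          # only takes effect on non-alias records, see note below
    overwrite: true
  no_log: true

- name: Bump TTL back up after stabilisation
  amazon.aws.route53:
    state: present
    zone: example.com
    record: app.example.com
    type: A
    alias: true
    alias_hosted_zone_id: "{{ dr_alb_hosted_zone_id }}"
    value: "{{ dr_alb_dns_name }}"
    ttl: 300         # only takes effect on non-alias records, see note below
    overwrite: true
  when: stable | default(false)
Notice the TTL=60 around cutover. You want clients to re-resolve quickly when DR comes online, but you do not want a permanently low TTL that hammers your DNS bill. Two caveats apply. First, resolvers keep the old answer for the old TTL, so the low value has to be in place before the failover, not set during it. Second, Route 53 does not accept a TTL on alias records: answers for an alias to an ALB carry the load balancer's own 60-second TTL, so the ttl values above only matter if the record is a plain A or CNAME record. The stable flag is set by a follow-up task once the cutover has held for an hour.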
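The timeouts configured in these three roles are effectively an RTO budget, and it is worth adding them up against the 30-minute target. The following Python sketch does that arithmetic with the values that appear in the roles above; the dictionary layout and the function name `rto_headroom` are hypothetical, and the figures are ceilings, not expected durations.

```python
RTO_TARGET_S = 30 * 60  # the lesson's RTO target: 30 minutes

# Worst-case waits taken from the role parameters shown above.
worst_case_s = {
    "promote_rds (wait_timeout)": 600,
    "scale_asg (wait_timeout)": 900,
    "healthz polling (30 retries x 10 s delay)": 30 * 10,
    "dns re-resolution (60 s answer TTL)": 60,
}


def rto_headroom(budget: dict, target_s: int) -> int:
    """Seconds left over if every step runs to its configured ceiling."""
    return target_s - sum(budget.values())


if __name__ == "__main__":
    headroom = rto_headroom(worst_case_s, RTO_TARGET_S)
    print(f"worst case {sum(worst_case_s.values())} s, headroom {headroom} s")
```

Under these assumptions the ceilings alone add up to 1860 seconds, which is 60 seconds over the target before the fence, certificate and smoke-test steps are counted, and the health polling loops over instances one at a time. That does not mean a run will miss RTO, only that the timeouts do not by themselves guarantee it; rehearsal timings are what show the real margin.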
End-to-end smoke test
The last technical step before declaring DR successful is to prove the application works, not just that the infrastructure is up. Health-check endpoints lie. Write a smoke test that exercises the actual user path.
# roles/app_smoke_test/tasks/main.yml
---
# The failover play runs with gather_facts: false, so build the timestamp
# with now() instead of ansible_date_time, and build it once.
- name: Build one canary identity for the whole run
  ansible.builtin.set_fact:
    canary_email: "dr-canary+{{ now(utc=true).strftime('%Y%m%d%H%M%S') }}@kv.local"

- name: Synthetic user journey
  block:
    - name: "Synthetic transaction: sign up a test user"
      ansible.builtin.uri:
        url: "https://app.example.com/signup"
        method: POST
        body_format: json
        body:
          email: "{{ canary_email }}"
          password: "{{ canary_pass }}"
        status_code: 201
        return_content: true
      register: signup
      no_log: true

    - name: "Synthetic transaction: log in"
      ansible.builtin.uri:
        url: "https://app.example.com/login"
        method: POST
        body_format: json
        body:
          email: "{{ canary_email }}"
          password: "{{ canary_pass }}"
        status_code: 200
        return_content: true
      register: login
      no_log: true

    - name: "Synthetic transaction: read a row"
      ansible.builtin.uri:
        url: "https://app.example.com/api/v1/profile"
        method: GET
        headers:
          Authorization: "Bearer {{ login.json.token }}"
        status_code: 200
      no_log: true

  # uri already fails on an unexpected status, so the verdict lives in rescue.
  rescue:
    - name: Fail the whole DR if smoke test fails
      ansible.builtin.fail:
        msg: "Smoke test failed; DR is NOT yet customer-ready"
This is the runbook step that catches misconfigurations DNS health checks can never see: an expired certificate, a missing IAM role, a region-mismatched secret, a forgotten Postgres extension. If your smoke test passes, you can confidently tell the business “we are in DR, customers can use the service.”
Failback: the part everyone forgets
Failover is rehearsed; failback is improvised. That is why most real DR events end with the company running on the DR site for months — the team is too scared to fail back, because they know the failback was never tested.
Failback is its own playbook with its own gating. The steps are roughly the inverse of failover:
1. Verify primary is healthy (storage replicated, network up, certs valid).
2. Reverse replication direction: DR (now primary) replicates back to the original primary (now standby).
3. Wait until reverse replication is within RPO.
4. Schedule a maintenance window.
5. Fence DR writes.
6. Promote primary.
7. Scale up primary application tier.
8. Cut DNS back to primary.
9. Smoke test.
10. Re-establish forward replication (primary → DR).
11. Scale DR back down to pilot-light.
12. Record evidence.
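The dangerous step is #2, reversing replication. Postgres does not let you simply “reverse” a logical replication slot; you need to either rebuild the original primary from the DR site (pg_basebackup --pgdata=/var/lib/postgresql/data --host=dr-primary --wal-method=stream) or use a tool like AWS DMS that can do bidirectional sync. Note that pg_basebackup needs a physical replication connection to the source, which a managed RDS instance does not offer to outside hosts; with an RDS-based DR site the rebuild has to go through logical replication, DMS, or a dump and restore instead. Whichever you choose, bake it into a role and rehearse it, because failback is the failure mode that costs you months of running on a more expensive site.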
Cross-cloud and hybrid DR considerations
The pattern above is already hybrid: the primary is on-prem and the DR site lives entirely in AWS. Other enterprises run hybrid DR with a different replication stack, or cross-cloud DR (primary AWS, DR Azure). Ansible handles both, but the validation steps grow.
Hybrid (vSphere → AWS): the storage replication is usually NetApp SnapMirror to Cloud Volumes ONTAP, or Veeam Cloud Connect, or AWS Storage Gateway. Ansible orchestrates the same way; you just import a different replication-check role. The networking is harder because Direct Connect / VPN tunnels must come up, and routes must be advertised. Pre-stage the BGP peerings and keep them administratively shut, then use community.aws.directconnect_virtual_interface to enable them during failover.
Cross-cloud (AWS → Azure): you cannot use cloud-native replication; you must use application-level (Postgres logical replication over a VPN), storage-level (Veeam), or stream-based (Kafka MirrorMaker) replication. Identity federation gets harder — you need both AD/IAM Identity Center and Azure AD Connect to be replicated. This is the case where AAP automation mesh shines: control plane in one cloud, execution nodes in both, no inbound firewall holes anywhere.
Multi-region active-active (rare, but covered for completeness): there is no failover playbook because every site is always serving traffic. Ansible’s role here is traffic shaping — using Route 53 latency-based routing or Azure Traffic Manager to drain traffic away from a degraded region while the underlying replication keeps converging. The “DR” event is a configuration change, not a promotion.
DR testing: game days, table-tops and chaos
A DR runbook that has not been tested in 12 months is broken. The exam question is not “is it broken?” but “how broken?”. Test on a cadence:
- Monthly tabletop (1 hour): an engineer reads the runbook aloud while another challenges every step. Catches stale documentation.
- Quarterly partial DR (4 hours): execute steps 1–4 of failover (fence, validate, promote, scale) in a separate VPC/account. Do not cut DNS. Catches infrastructure drift.
- Annual full DR (1 day): execute the full failover, run the business on DR for the day, fail back. Catches certificate problems, IAM problems, and SaaS integrations.
- Continuous chaos (always): kill replication processes, expire certificates intentionally in pre-prod, simulate AZ failures with aws fis or Azure Chaos Studio. Catches RPO breaches and runbook fragility.
The playbooks/99-game-day.yml is your scaffolding for the partial DR test. It runs the failover playbook with --check --diff against a _test_ inventory, then runs the smoke test against a clone of the DR environment. Schedule it monthly in AAP. The test that runs every month and emails a green “DR rehearsal: PASS” message to risk management is worth more than 100 pages of plan documentation.
DR evidence: what auditors actually want
Every DR run, real or rehearsed, must produce evidence. The minimum bundle:
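# roles/dr_evidence/tasks/main.yml
---
# now() is used because the failover play runs with gather_facts: false.
- name: Evidence directory and end-of-run timestamps
  ansible.builtin.set_fact:
    dr_evidence_dir: "/var/lib/dr/evidence/{{ now().timestamp() | int }}-{{ event }}"
    dr_end_epoch: "{{ now().timestamp() | int }}"
    dr_end_iso: "{{ now(utc=true).strftime('%Y-%m-%dT%H:%M:%SZ') }}"

- name: Create the evidence directory
  ansible.builtin.file:
    path: "{{ dr_evidence_dir }}"
    state: directory
    mode: "0750"

- name: Capture replication lag at start of run
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/01-replication-lag.json"
    content: "{{ replication_state | to_nice_json }}"

# Built as one dict so numbers and booleans serialise as valid JSON.
- name: Capture failover duration
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/02-timing.json"
    content: >-
      {{ {'start': run_start,
          'end': dr_end_iso,
          'rto_seconds': (dr_end_epoch | int) - (run_start_epoch | int),
          'rto_target_seconds': rto_target | int,
          'met_rto': ((dr_end_epoch | int) - (run_start_epoch | int)) <= (rto_target | int)
         } | to_nice_json }}

- name: Capture smoke-test results
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/03-smoke-test.json"
    content: "{{ smoke_results | to_nice_json }}"

# Assumes the playbook runs from a git checkout of the runbook repository.
- name: Look up the runbook commit hash
  ansible.builtin.command:
    cmd: git rev-parse HEAD
    chdir: "{{ playbook_dir }}"
  register: runbook_commit
  changed_when: false

- name: Record the runbook commit hash
  ansible.builtin.copy:
    dest: "{{ dr_evidence_dir }}/00-runbook-commit.json"
    content: "{{ {'commit': runbook_commit.stdout} | to_nice_json }}"

# A checksum manifest, not a cryptographic signature; tamper resistance
# comes from Object Lock on the bucket.
- name: Build the checksum manifest
  ansible.builtin.shell:
    cmd: "sha256sum {{ dr_evidence_dir }}/*.json > {{ dr_evidence_dir }}/MANIFEST.sha256"

- name: Push every evidence file to the bucket (Object Lock enabled)
  amazon.aws.s3_object:
    bucket: "{{ evidence_bucket }}"
    object: "{{ dr_evidence_dir | basename }}/{{ item | basename }}"
    src: "{{ item }}"
    mode: put
  loop: "{{ query('ansible.builtin.fileglob', dr_evidence_dir ~ '/*') }}"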
The S3 bucket must have Object Lock enabled in compliance mode with a retention period that exceeds your audit cycle. Auditors do not accept “we have logs” — they accept “we have logs you cannot tamper with.” Every quarterly review then becomes a query: “show me all DR runs in the last 90 days, their RTO numbers, and whether we met target.” That is a Glue/Athena query against the evidence bucket. It is also the kind of report risk management loves.
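The quarterly question ("all DR runs in the last 90 days, their RTO numbers, and whether we met target") is a filter over the timing records. The following Python sketch shows the shape of that query over records that have already been downloaded and parsed. It assumes each record has the fields of 02-timing.json with UTC timestamps ending in Z; `parse_utc` and `rto_report` are hypothetical names, and in production the same query would more likely run in Athena as described above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2024-05-01T10:30:00Z."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def rto_report(records: Iterable[Dict], now: datetime, days: int = 90) -> List[Dict]:
    """Return one row per DR run that ended within the last `days` days."""
    cutoff = now - timedelta(days=days)
    rows = []
    for rec in records:
        if parse_utc(rec["end"]) < cutoff:
            continue  # outside the review window
        rows.append(
            {
                "end": rec["end"],
                "rto_seconds": rec["rto_seconds"],
                "rto_target_seconds": rec["rto_target_seconds"],
                # Recomputed from the raw numbers instead of trusting a stored flag.
                "met_rto": rec["rto_seconds"] <= rec["rto_target_seconds"],
            }
        )
    return sorted(rows, key=lambda row: parse_utc(row["end"]))


if __name__ == "__main__":
    # Fabricated sample records, for illustration only.
    sample = [
        {"end": "2024-01-10T09:00:00Z", "rto_seconds": 1500, "rto_target_seconds": 1800},
        {"end": "2024-05-02T09:00:00Z", "rto_seconds": 2100, "rto_target_seconds": 1800},
        {"end": "2024-04-01T09:00:00Z", "rto_seconds": 1200, "rto_target_seconds": 1800},
    ]
    review_date = datetime(2024, 5, 15, tzinfo=timezone.utc)
    for row in rto_report(sample, now=review_date):
        print(row)
```

The sample records are made up for the demonstration; only the field names come from the timing file.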
Workflow: stitching it all together in AAP
The full DR workflow in Automation Controller is a directed graph:
[Replication health (scheduled, every 1m)]
|
| (failure path: PagerDuty incident)
|
[Declare disaster (manual approval)]
|
v
[30-failover.yml]
|
+-- on success --> [Notify ops + risk]
|
+-- on failure --> [Rollback (40-rollback.yml)]
|
v
[Page incident commander]
Two AAP-specific notes:
- Survey-driven approval: the “Declare disaster” node is a survey job template that asks for a human-typed reason and an incident ticket number. The Survey output flows into the next node as extra_vars, which become evidence. No silent automatic disasters.
- Mesh-aware execution nodes: replication-health checks must be able to run from either site (so a site outage does not blind the check). Configure the schedule to use a node_type=execution mesh group that contains nodes in both DC1 and DR; AAP will pick a healthy one.
Anti-patterns that destroy DR
A non-exhaustive list, all collected from real audits:
- Treating backups as DR. Backups are the last line of defence; they are not RPO=5min. If your DR plan is “restore from backup”, your real RPO is 24 hours.
- A failover runbook that runs as a single shell script. No idempotency, no resumability, no pre-flight check. A single retry breaks state.
- Hardcoded primary endpoints in application config. When DR comes up, the app still points at the dead primary. Use service discovery or AWS Secrets Manager / Vault and rotate on failover.
- Untested fence step. Discovered during the real disaster, when both sites are accepting writes for 12 minutes and you now have two divergent timelines.
- DNS TTL of 86400s. Customers cache the dead IP for a day. Set TTL=60 well before failover, not during.
- Failback that takes longer than failover. This is a sign that failback was an afterthought. Engineer it to the same standard as failover.
- No rehearsal cadence. A DR plan that lives in Confluence and has not been executed in a year is a fairy tale.
- Using prod credentials for DR. Vault DR credentials separately; they should not depend on prod KMS keys (which will be down with prod).
- Cross-region IAM dependencies. DR account using a KMS key in the primary region is a dead DR. Use multi-region keys.
Frequently asked questions
1. Why orchestrate DR with Ansible instead of a vendor tool like Zerto, AWS DRS or Azure Site Recovery? Vendor tools handle one slice — typically VM-level block replication and orchestrated boot. They do not handle DNS cutover, certificate re-issuance, AD trust reconfiguration, application smoke tests, or evidence generation. Use a vendor tool as one role inside the Ansible runbook. The runbook is still the source of truth for “what does failover mean for this business service.”
2. What’s the smallest sensible RPO target? Anything below 30 seconds requires synchronous replication and a third site to break ties. For most enterprises, the practical floor is 5 minutes with asynchronous replication and a well-tested fence. Below that, costs and complexity explode.
3. How do I keep DR roles in sync with prod?
Use the same roles. The application playbook that deploys to prod must also deploy to DR — no DR-specific forks. The differences live in group_vars/dr/ (region, sizing, replication endpoints) only. If you find yourself maintaining two copies of a role, you have an architectural drift bug.
4. What about ransomware?
Ransomware is the DR scenario most enterprises fail. The replica is encrypted within minutes of the primary. Defend with immutable, air-gapped backups (S3 Object Lock in compliance mode, NetApp SnapVault with locked snapshots, or tape) plus the DR replica. Failover may not be from the replica; it may be from a 4-hour-old immutable backup, accepting the higher RPO as a deliberate trade-off. Have a separate ransomware runbook (playbooks/30-ransomware-failover.yml) that promotes the immutable backup, not the live replica.
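Choosing what to promote in a ransomware runbook comes down to one rule: the newest immutable copy taken before the compromise. The following Python sketch illustrates that selection and the RPO you accept by using it. `Backup`, `pick_restore_point` and `effective_rpo_seconds` are hypothetical names, and the compromise time is assumed to come from incident response, where it is usually an estimate.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass(frozen=True)
class Backup:
    label: str
    taken_at: datetime
    immutable: bool  # e.g. held under Object Lock or a locked snapshot


def pick_restore_point(backups: Iterable[Backup], compromised_at: datetime) -> Backup:
    """Newest immutable backup taken strictly before the compromise."""
    clean = [b for b in backups if b.immutable and b.taken_at < compromised_at]
    if not clean:
        raise LookupError("no immutable backup predates the compromise")
    return max(clean, key=lambda b: b.taken_at)


def effective_rpo_seconds(backup: Backup, declared_at: datetime) -> int:
    """Data-loss window accepted by restoring this backup."""
    return int((declared_at - backup.taken_at).total_seconds())


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative inventory of restore candidates.
    candidates = [
        Backup("live-replica", datetime(2024, 5, 1, 11, 59, tzinfo=utc), immutable=False),
        Backup("locked-08h", datetime(2024, 5, 1, 8, 0, tzinfo=utc), immutable=True),
        Backup("locked-04h", datetime(2024, 5, 1, 4, 0, tzinfo=utc), immutable=True),
        Backup("locked-12h", datetime(2024, 5, 1, 12, 0, tzinfo=utc), immutable=True),
    ]
    compromised = datetime(2024, 5, 1, 10, 30, tzinfo=utc)
    declared = datetime(2024, 5, 1, 12, 0, tzinfo=utc)
    chosen = pick_restore_point(candidates, compromised)
    print(chosen.label, effective_rpo_seconds(chosen, declared))
```

The live replica is skipped even though it is the freshest copy, and so is the immutable backup taken after the compromise, because immutability protects a copy from later tampering but not from having captured already-encrypted data.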
5. How do I federate AD between primary and DR without making AD itself a SPOF? Two domain controllers in DR (different AZs), site-aware DNS, and replication topology configured so that DR is a separate site. AD Connect to Azure AD must use a service account replicated to both sites. Test the DR-only logon path quarterly: shut off the primary DCs at the network layer and ensure a user can still authenticate.
6. What’s the right way to handle stateful Kubernetes workloads in DR? For PVCs, use storage-class replication (Portworx, Longhorn, NetApp Trident with SnapMirror). For control plane, run separate clusters per region — never replicate the etcd quorum across regions; the latency will destroy you. Use Velero or Trident to back up cluster state hourly, and fail over by restoring into the DR cluster. The Kubernetes lesson covers the operator pattern; in DR, treat the operator manifest as the authoritative state.
7. How do I test DR without scaring the business? Invest in a “DR pre-prod” environment: a separate VPC/account that mirrors prod’s topology at 1/4 size with synthetic data. Run the full failover/failback there monthly. Run partial real-prod tests (steps 1–4) quarterly. Run full real-prod tests once a year, scheduled, with stakeholder buy-in.
8. Should the failover playbook use serial and strategy: free?
For application-tier scaling, yes — serial: 25% gives you a rolling start that lets healthchecks catch failed AMIs early. For database promotion, no — it must be sequential. Use one playbook per phase, each with the right strategy.
9. How do I handle ChatOps for DR declarations?
A Slack bot calls the AAP API to launch the 20-declare-disaster.yml job template, which in turn requires a survey approval on the actual failover. The Slack interaction is a convenience; the gating still happens inside AAP, with the audit trail. Never expose the failover endpoint directly to a chat command.
10. What’s the single most underrated DR practice?
The cleanup playbook. After a successful failover or rehearsal, run a playbook that records the commit hash of the runbook, archives the evidence, regenerates RPO/RTO Grafana dashboards with the actual numbers, and files a Jira ticket for any role that hit ignore_errors. The teams that do this never have a DR plan that quietly rots; the teams that don’t have a “DR reawakening project” once every two years.
Hands-on lab — your first automated failover
The following lab fails over a tiny Postgres + Flask app from a “primary” Docker Compose stack to a “DR” Docker Compose stack, all on one machine. It teaches the runbook structure without needing two cloud accounts.
Prerequisites: docker, docker compose, ansible-core ≥ 2.16, community.postgresql.
mkdir -p dr-lab/{primary,dr,playbooks,roles}
cd dr-lab
# primary/docker-compose.yml
# Create primary/init.sql before the first `up` (an empty file is enough);
# bind-mount paths are resolved relative to this compose file.
services:
  pg-primary:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: appsecret
      POSTGRES_INITDB_ARGS: "-c wal_level=logical"
    ports: ["5432:5432"]
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
  app-primary:
    image: python:3.12-slim
    # A literal block keeps the newlines that the decorator and def need;
    # a folded scalar would join the Python source into one invalid line.
    command:
      - bash
      - -c
      - |
        set -e
        pip install flask psycopg2-binary
        cat > /tmp/app.py <<'EOF'
        import os
        import psycopg2
        from flask import Flask

        app = Flask(__name__)

        @app.route("/")
        def index():
            conn = psycopg2.connect(os.environ["DB"])
            cur = conn.cursor()
            cur.execute("SELECT now()")
            return str(cur.fetchone())

        app.run(host="0.0.0.0", port=8080)
        EOF
        python /tmp/app.py
    environment:
      DB: postgresql://app:appsecret@pg-primary/app
    ports: ["8080:8080"]
    depends_on: [pg-primary]
# dr/docker-compose.yml: same shape but ports 5433/8081
services:
  pg-dr:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: appsecret
    ports: ["5433:5432"]
  app-dr:
    image: python:3.12-slim
    command: ... # same as primary
    environment:
      DB: postgresql://app:appsecret@pg-dr/app
    ports: ["8081:8080"]
    depends_on: [pg-dr]
# playbooks/30-failover.yml
- hosts: localhost
  gather_facts: false
  vars_prompt:
    - name: confirm
      prompt: "Type FAILOVER to proceed"
      private: false
  tasks:
    - ansible.builtin.assert: { that: confirm == "FAILOVER" }

    - name: Stop primary writes (read-only)
      community.postgresql.postgresql_set:
        name: default_transaction_read_only
        value: "on"
        login_host: 127.0.0.1
        login_port: 5432
        login_user: app          # POSTGRES_USER=app is the superuser in this image
        login_password: appsecret
      no_log: true

    - name: Take final snapshot
      ansible.builtin.shell: |
        docker exec primary-pg-primary-1 pg_dumpall -U app > /tmp/dr-final.sql

    - name: Restore into DR
      ansible.builtin.shell: |
        cat /tmp/dr-final.sql | docker exec -i dr-pg-dr-1 psql -U app

    - name: Smoke test DR
      ansible.builtin.uri:
        url: "http://127.0.0.1:8081/"
        status_code: 200
      register: dr_smoke
      until: dr_smoke.status == 200
      retries: 10
      delay: 2
Run:
docker compose -f primary/docker-compose.yml up -d
docker compose -f dr/docker-compose.yml up -d
ansible-playbook playbooks/30-failover.yml -e confirm=FAILOVER
curl http://localhost:8081/ # works; 8080 is now read-only
Now extend the lab: add a 40-validate-failover.yml that times the run, a 50-cutover-dns.yml that switches a /etc/hosts entry, a 60-failback-prep.yml that reverses the dump direction, and a 99-game-day.yml that runs the whole flow with --check. You will have a complete DR mental model in under an hour.
Glossary
- RPO — Recovery Point Objective. Maximum tolerable data loss measured in time.
- RTO — Recovery Time Objective. Maximum tolerable downtime measured in time.
- MTPD — Maximum Tolerable Period of Disruption. Business survival ceiling.
- Fence — The act of stopping the primary from accepting writes before promotion.
- Promotion — Converting a read replica to an independent primary.
- Failback — Returning service to the original primary after recovery.
- Pilot light — DR pattern where data is hot but compute is scaled to zero.
- Warm standby — DR pattern where compute runs at reduced capacity.
- Active-active — DR pattern where every site serves traffic continuously.
- Game day — Scheduled rehearsal of part or all of the DR runbook.
- Object Lock — S3 (or compatible) feature that makes objects immutable for a retention period.
- Split brain — Failure mode where two sites both believe they are primary.
Certification mapping
This lesson maps directly onto:
- EX374 — Exam objectives for orchestration, AAP workflows, RBAC, surveys, and audit.
- AWS Solutions Architect Pro — DR strategy domain (pilot-light/warm-standby/active-active trade-offs).
- CISSP — BCP/DRP domain (RPO/RTO definitions, MTPD).
- ISO 27031 — IT readiness for business continuity.
Next steps
You now have an opinionated, code-first DR foundation. The next two specialist lessons build on it:
- OS migrations and large-scale upgrades — DR planning often forces a parallel “lift and shift” because the DR site runs newer hardware/firmware. The next lesson covers P2V, V2V, RHEL major-version upgrades, and the migration runbook patterns that mesh with DR.
- Air-gapped automation — Some DR sites are isolated by regulation or threat model. Running Ansible against an air-gap requires a different content-distribution strategy that we will cover in detail.
If you only take one habit from this lesson: run your DR playbook on a schedule, not on a calendar reminder. The DR plan that runs every month against pre-prod is the DR plan that works the day reality demands it.