
Configure AWS Elastic Disaster Recovery (DRS) for Cross-Region Server Failover and Failback

A mid-size payments company runs its authorization core — a dozen Linux app servers and two Windows license-bound appliances — on EC2 in us-east-1. The audit finding that started this project was blunt: the documented DR plan was “restore from AMIs,” nobody had timed it, and the last real attempt took eleven hours because the AMIs were three weeks stale. The mandate is now a contractual RTO of 60 minutes and RPO of seconds for the authorization path, proven by a non-disruptive drill every quarter. AWS Elastic Disaster Recovery (DRS) is the right tool: it does continuous, block-level, asynchronous replication of whole servers into a low-cost staging area in a second Region, and it orchestrates booting production-grade instances from that replicated data on demand — then fails you back when the primary Region returns.

This guide walks the full loop end to end: install the agent, watch replication go healthy, run a drill, run a real recovery, and fail back cleanly, with the wiring an enterprise actually needs around it. But it is also a reference. Because you will open this mid-incident at 02:00 with a Region down and a CHG ticket open, every replication state, every launch-configuration flag, every aws drs error, every limit and every cost driver is laid out as a scannable table. Read the prose once to build the mental model; then keep the tables open when it counts. The bar here is exhaustive enumeration — “every option, end to end” — not a long narrative.

By the end you will stop hoping. When the primary Region browns out you will know exactly which servers are CONTINUOUS, which snapshot to recover to (latest for an outage, a chosen point-in-time to land before a ransomware event), what each launch flag will do to the booted instance, how to flip DNS, and — the half most teams forget — how to reverse the whole thing and get home without losing the writes taken during the outage.

What problem this solves

DR plans rot silently. An AMI-and-runbook strategy looks fine in a wiki until the day you need it, and then three things bite at once: the images are stale (so you lose hours of data), nobody has timed the restore (so the RTO is fiction), and license-bound appliances cannot be rebuilt from a script at all (so they are simply absent in the recovery Region). The pain is not “we have no DR” — it is “we have DR on paper that has never been proven, for a workload where the regulator now wants a quarterly, evidenced drill.”

What breaks without DRS: you discover your real RTO during a real outage, with the business bridge full and revenue draining. AMIs captured “nightly” turn out to be 19 hours stale at the worst moment, so RPO is a day, not seconds. The Windows authorization appliances — vendor software, license-bound to a MAC/hostname — have no clean rebuild path, so the “DR Region” silently excludes the exact servers the authorization flow depends on. And because failover was never rehearsed, the first attempt is also the first time anyone has run start-recovery, under maximum pressure.

Who hits this: any team with a contractual or regulatory RTO/RPO on servers they cannot trivially re-provision — payments, healthcare, trading, anything with license-bound appliances or hand-built hosts. DRS targets exactly this gap: continuous block replication keeps RPO in seconds, the staging area is cheap so you can afford it always-on, a drill proves the RTO without touching production, and because DRS treats every server as a block device it replicates the Windows appliances identically to the Linux fleet — which is precisely why it beats an AMI strategy for things you cannot rebuild from code.

To frame the whole field before the deep dive, here is the loop this article covers, the AWS verb that drives each phase, and the single gate that says “this phase passed”:

| Phase | What happens | Driving aws drs call | Touches production? | Pass/fail gate |
|---|---|---|---|---|
| Initialize | Stand up DRS + service roles in the target | initialize-service | No | Roles exist in us-west-2 |
| Replicate | Agent streams disk blocks to staging | (agent installer) | Read-only on source | Every server CONTINUOUS, lag < RPO |
| Drill | Boot recovery EC2 in isolation, time it | start-recovery --is-drill | No (isolated subnet) | RTO ≤ 60 min, synthetic txn passes |
| Recover | Real cutover to the target Region | start-recovery | Yes (this is the outage) | App healthy + DNS flipped |
| Fail back | Reverse replication, return home | reverse-replication, start-failback-launch | Yes (planned window) | us-east-1 primary, CONTINUOUS home |
| Teardown | Stop paying for drill artifacts | terminate-recovery-instances | No | No orphan recovery EC2 / EBS |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

Two AWS Regions are chosen and fixed for this guide: source us-east-1, recovery/target us-west-2. DRS is initialized per-Region in the target. You need a target VPC in us-west-2 with private subnets, route tables and security groups (we build them with Terraform below), and a staging subnet for the lightweight replication servers. You need AWS CLI v2 ≥ 2.15 with the drs command set, plus jq — confirm with aws drs help. You need IAM rights to create the DRS service roles, plus an operator role allowed to call drs:* for drills and recovery. The network path matters: outbound TCP 1500 from each source server to the staging subnet, and TCP 443 to the DRS and S3 endpoints (or VPC endpoints for a private path). You need root/Administrator on each source to install the agent. And you need a change-management hook in ServiceNow for the recovery runbook, with HashiCorp Vault available to mint short-lived installer credentials — we never bake a long-lived key into a server.

This sits in the resilience / BCDR track. It assumes the compute and networking fundamentals from the AWS EC2 Deep Dive: Instances, AMIs, EBS, User Data, IMDS and the AWS VPC Deep Dive: Subnets, Routing, IGW, NAT, Endpoints, because the staging subnet, route tables and VPC endpoints are where replication lives or dies. It pairs with the broader strategy in Enterprise Architecture on AWS: DR Strategies and Enterprise Architecture on AWS: Multi-Region — DRS is the warm-standby-adjacent, pilot-light-priced point on that spectrum. It complements, not replaces, AWS Backup with Organizations, Vault Lock, Cross-Account & Cross-Region Recovery: DRS gives you seconds-RPO server failover, Backup gives you immutable, governed point-in-time copies — most regulated shops run both.

Where DRS sits among the AWS resilience tools, so you reach for the right one:

| Tool | Granularity | RPO | RTO | Cost at rest | Best for |
|---|---|---|---|---|---|
| Elastic Disaster Recovery (DRS) | Whole server (block) | Seconds | Minutes (boot + DNS) | Low (staging EBS + t3) | Lift-and-shift servers, appliances, hand-built hosts |
| AWS Backup | Resource (vol/DB/FS) | Hours (schedule) | Hours (restore) | Storage only | Governed, immutable, long-retention copies |
| RDS/Aurora cross-Region replica | Database | Seconds–min | Minutes (promote) | A replica's compute | Managed DB tier specifically |
| S3 Cross-Region Replication | Object | Seconds–min | N/A (already there) | Storage + transfer | Object data, not servers |
| Pilot light / warm standby (custom) | Whole stack | App-defined | Seconds–min | Medium–high | Cloud-native stacks with IaC and golden AMIs |
| Multi-site active/active | Whole stack | ~0 | ~0 | High (2× live) | Workloads that cannot take any downtime |

Core concepts

Six mental models make every later step obvious.

DRS lives in the target Region. Replication, the staging area, snapshots and recovery launches all live where you want to recover to: us-west-2. You run initialize-service there. Initializing it in us-east-1 instead would make us-east-1 the recovery target, the exact inverse of the plan. This single fact trips more first-timers than anything else, so it leads the playbook.

The agent turns a server into a block-level source. The AWS Replication Agent installs on each source, inventories every attached disk, does one full initial sync (every used block, once), then streams only changed blocks asynchronously across the Region boundary. Because it reads blocks, not files, it does not care what the OS is or whether the software is license-bound — which is why it handles the Windows appliances identically to the Linux fleet.

The staging area is deliberately cheap. In the target, DRS maintains a small subnet of low-cost replication servers (default t3.small) plus low-cost EBS volumes (GP3) that hold a continuously-updated copy of every source disk. Nothing production-grade runs there until you ask. This is the entire cost story versus a warm standby: you pay for staging storage and tiny replication servers, not a parallel fleet.

Recovery is a launch, not a restore. On drill or recovery, DRS launches full-size recovery instances from the latest — or a chosen point-in-time — snapshot into your real target subnets, behind your security groups, using a per-server launch configuration you control. A drill does this into isolation while replication keeps running; a real recovery does the same minus --is-drill, plus the DNS move.

Point-in-time recovery is what beats ransomware. The PIT policy keeps a ladder of snapshots (e.g. every 10 min for an hour, hourly for a day, daily for several days). That lets you recover to just before a corruption or encryption event, not only “now” — the difference between losing minutes and restoring an already-poisoned disk.

Failback is a first-class, reversible phase. When us-east-1 is healthy again, DRS reverses replication: the running recovery instances become sources and stream their current state back to the original Region, so you return without losing the writes taken during the outage. A DR plan you cannot reverse is not a plan.

The vocabulary side by side — pin these down before the deep sections:

| Term | One-line definition | Lives in | Why it matters |
|---|---|---|---|
| Source server | A registered server being replicated | DRS console (target Region) | The unit you drill/recover/fail back |
| Replication Agent | Block-reader installed on the source | On each source OS | Streams changed blocks; no agent = no DR |
| Staging area | Cheap subnet of replication servers + EBS | Target Region | Holds the live copy; the cost floor |
| Replication server | t3.small that receives blocks | Staging subnet | Throughput bottleneck if undersized |
| PIT snapshot | A point-in-time EBS snapshot ladder | Target Region | Recover to before an event |
| Launch configuration | Per-server "how to boot" recipe | DRS console | Controls IP, size, licensing, tags |
| Recovery instance | The EC2 launched on drill/recovery | Target subnets | The thing that serves during failover |
| Drill | A non-disruptive recovery into isolation | Target Region | The quarterly audit deliverable |
| Failback | Reverse replication back to the origin | Both Regions | Returning home without data loss |
| dataReplicationState | Health of a server's replication | describe-source-servers | CONTINUOUS = ready; anything else isn't |

The aws drs verbs you will actually run across the whole loop, grouped by phase — keep this as your command index:

| aws drs command | Phase | What it does | You run it… |
|---|---|---|---|
| initialize-service | Setup | Stand up DRS + service roles in the target | Once per target Region |
| create-replication-configuration-template | Setup | Define the default staging footprint | Once, then update as needed |
| describe-source-servers | Replicate | List servers + dataReplicationState/lag | Constantly; it is the health gate |
| update-launch-configuration | Replicate | Set how a server's recovery boots | Per server, version-controlled |
| get-launch-configuration | Replicate | Read a server's launch recipe | To audit osByol/sizing |
| describe-recovery-snapshots | Recover | List PIT snapshots for a server | Before a PIT recovery |
| start-recovery | Drill / Recover | Launch recovery instances (--is-drill for drills) | Drill quarterly; recover on outage |
| describe-jobs | Drill / Recover | Track a launch job to COMPLETED | While a launch runs |
| describe-recovery-instances | Recover / Failback | Show recovery EC2 + failbackState | Post-launch and during failback |
| reverse-replication | Failback | Make recovery instances replicate home | When the origin is healthy again |
| start-failback-launch | Failback | Finalise failback to the origin | In a maintenance window |
| terminate-recovery-instances | Teardown | Remove launched recovery EC2 | Immediately after a drill |
| stop-replication | Teardown | Stop replicating a source (keeps record) | Decommission / cost cleanup |
| delete-source-server | Teardown | Remove a source + its staging resources | Full removal |

Choosing Regions, networking, and the staging footprint

Before any CLI, lock the topology, because a real disaster must not depend on click-ops memory. The shape is deliberately simple, which is what makes it auditable: production runs in us-east-1; an agent on each server streams blocks to a cheap staging area in us-west-2; on demand DRS launches full-size recovery instances into your real target subnets; authoritative failover DNS steers clients to the recovery Region without a client-side change.

The data path and the ports it needs

Replication is a few flows on a few ports. Get one wrong and agents register but never leave INITIAL_SYNC. Enumerate every path:

| Flow | Source → Destination | Port / protocol | Direction | Why it exists | If it's blocked |
|---|---|---|---|---|---|
| Block stream | Source server → staging replication servers | TCP 1500 | Outbound from source | Carries replicated disk blocks | Stuck in INITIAL_SYNC; dataReplicationError set |
| Agent ↔ DRS API | Source / staging → DRS service | TCP 443 | Outbound | Registration, control plane | Agent never registers |
| Agent ↔ S3 | Source → regional installer bucket | TCP 443 | Outbound | Pull installer + components | Install fails to download |
| Replication server ↔ EBS/EC2 | Staging subnet → EC2/EBS endpoints | TCP 443 | Outbound | Manage staging volumes | Staging provisioning errors |
| Recovery inbound | DNS origin range → recovery EC2 | TCP 443 (app) | Inbound | Live traffic post-cutover | Cutover "works" but no traffic |
| Failback stream | Recovery EC2 → original Region | TCP 1500 | Outbound | Reverse replication home | Failback stuck |
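
A quick pre-flight probe can rule out a blocked port before you suspect the agent. The sketch below is a minimal Python check and not part of DRS; the function names are illustrative, and a successful TCP connect only shows the port is reachable from where you run it, not that replication will succeed.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def check_paths(paths: dict, timeout: float = 3.0) -> dict:
    """Probe each named (host, port) pair and report which are reachable."""
    return {
        name: tcp_port_open(host, port, timeout)
        for name, (host, port) in paths.items()
    }
```

Run from a source server, a call such as `check_paths({"block stream": ("10.20.1.15", 1500), "DRS API": ("drs.us-west-2.amazonaws.com", 443)})` would flag which flow is blocked; the 10.20.1.15 address is a placeholder for one of your replication servers.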

Decide public vs private for the data plane up front — it changes both security posture and NAT cost:

| Routing option | data-plane-routing value | Path | Needs | Cost note | When to pick |
|---|---|---|---|---|---|
| Private IP via VPC endpoints | PRIVATE_IP | Stays on AWS backbone | DRS interface endpoint + private subnets | Avoids NAT data-processing | Regulated / least-exposure (our choice) |
| Public IP | PUBLIC_IP | Over internet (TLS) | Public subnet / IGW on staging | NAT or IGW egress | Quick PoC only |

The staging footprint settings

The staging area’s size and cost come from a handful of template settings. Tune them deliberately:

| Setting | What it controls | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| staging-area-subnet-id | Where replication servers live | (none; required) | Always set it | Must have a route to sources + endpoints |
| replication-server-instance-type | Size of the receiver | t3.small | Many/large/high-churn sources | Bigger = faster sync but higher rest cost |
| use-dedicated-replication-server | One server per source vs shared | false | Strict isolation needs | Dedicated multiplies cost |
| default-large-staging-disk-type | EBS type for big volumes | GP3 | Rarely | GP2/ST1 change perf and price |
| ebs-encryption | Encrypt staging volumes | DEFAULT | Use a CMK for control | CUSTOM needs the KMS key + grants |
| data-plane-routing | Private vs public block path | PRIVATE_IP (set it) | PoC only → public | Public exposes the path |
| create-public-ip | Give replication servers a public IP | false | Public routing | Leave false for private |
| bandwidth-throttling | Cap replication Mbps (0 = unlimited) | 0 | Protect a thin source link | Too low → lag climbs past RPO |
| associate-default-security-group | Auto-attach the default SG | true (set false) | Always set false | Default SG is too permissive |
| replication-servers-security-groups-ids | SG for replication servers | (none) | Always set your own | Must allow 1500 inbound from sources |
| pit-policy | The PIT snapshot ladder | (provided) | Match retention to compliance | Longer retention = more snapshot cost |

Initialize DRS and create the service roles in the target Region

DRS is initialized in the target Region (us-west-2) — that is where replication and recovery live. Initialization creates the default replication settings and the IAM service roles DRS needs.

export SRC_REGION=us-east-1
export DR_REGION=us-west-2

# Initialize Elastic Disaster Recovery in the recovery Region.
aws drs initialize-service --region "$DR_REGION"

That call creates AWSServiceRoleForElasticDisasterRecovery and the recovery/conversion roles. Confirm they exist before going further:

# IAM is a global service, so this call needs no --region.
aws iam list-roles \
  --query "Roles[?contains(RoleName, 'ElasticDisasterRecovery')].RoleName" \
  --output table

The roles DRS creates, and what each is allowed to do — know these so you can scope and audit them:

| Role | Created by | Purpose | Over-permission risk |
|---|---|---|---|
| AWSServiceRoleForElasticDisasterRecovery | initialize-service | Service-linked role for DRS internals | AWS-managed; do not edit |
| DRS recovery instance role | Initialization | Lets launched instances talk to DRS | Scope to what recovery needs |
| DRS conversion role | Initialization | Runs the boot-converter on launch | Temporary; deleted after launch |
| drs:StartRecovery (operator) | You attach it | Launch drills/recovery | High: can boot copies of prod |
| drs:* (admin) | You attach it | Full DRS administration | Highest: break-glass only |

Now define the default replication configuration template so every server that registers inherits a sane, least-cost staging footprint. Point it at the staging subnet, encrypt the staging volumes, keep the cheap default instance type, and attach the PIT policy:

# Boolean settings are bare flags in the AWS CLI: --x enables, --no-x disables.
aws drs create-replication-configuration-template \
  --region "$DR_REGION" \
  --staging-area-subnet-id subnet-0dr1staging0west2a \
  --replication-server-instance-type t3.small \
  --no-use-dedicated-replication-server \
  --default-large-staging-disk-type GP3 \
  --ebs-encryption DEFAULT \
  --data-plane-routing PRIVATE_IP \
  --no-create-public-ip \
  --no-associate-default-security-group \
  --replication-servers-security-groups-ids sg-0drsstaging0001 \
  --bandwidth-throttling 0 \
  --staging-area-tags Environment=dr,Owner=platform \
  --pit-policy '[{"enabled":true,"interval":10,"retentionDuration":60,"units":"MINUTE","ruleID":1},{"enabled":true,"interval":1,"retentionDuration":24,"units":"HOUR","ruleID":2},{"enabled":true,"interval":1,"retentionDuration":3,"units":"DAY","ruleID":3}]'

data-plane-routing PRIVATE_IP keeps replication traffic on private addressing (pair it with the VPC endpoints in the next section). The PIT policy is what delivers point-in-time recovery — decode the ladder so you can tune retention to your compliance need:

| Rule | Interval | Retention | Units | What it buys you | Cost driver |
|---|---|---|---|---|---|
| 1 | 10 | 60 | MINUTE | Recover to within ~10 min for the last hour | Most snapshots; finest grain |
| 2 | 1 | 24 | HOUR | Hourly points across a day | Moderate snapshot count |
| 3 | 1 | 3 | DAY | Daily points for three days | Few snapshots; cheap, coarse |

The fields that make up each PIT rule, in case you tailor it:

| PIT field | Meaning | Valid values | Note |
|---|---|---|---|
| enabled | Whether this rule is active | true / false | Disable without deleting |
| interval | Spacing between snapshots | positive integer | Combined with units |
| retentionDuration | How long to keep | positive integer | Combined with units |
| units | Time unit | MINUTE / HOUR / DAY | Per-rule |
| ruleID | Stable identifier | unique integer | Reference for updates |
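
To see what a ladder costs before you change it, it can help to count the snapshots each rule retains. The sketch below is a back-of-envelope calculation in Python, assuming each rule keeps roughly retentionDuration / interval snapshots; the helper names are illustrative, and the number DRS actually holds at any instant may differ slightly.

```python
# Minutes per PIT unit, used to express every rule on one scale.
UNIT_MINUTES = {"MINUTE": 1, "HOUR": 60, "DAY": 1440}

# The three-rule policy from the template command above.
POLICY = [
    {"enabled": True, "interval": 10, "retentionDuration": 60, "units": "MINUTE", "ruleID": 1},
    {"enabled": True, "interval": 1, "retentionDuration": 24, "units": "HOUR", "ruleID": 2},
    {"enabled": True, "interval": 1, "retentionDuration": 3, "units": "DAY", "ruleID": 3},
]


def snapshots_kept(rule: dict) -> int:
    """Approximate snapshots one rule retains (0 if the rule is disabled)."""
    if not rule["enabled"]:
        return 0
    return rule["retentionDuration"] // rule["interval"]


def describe_ladder(policy: list) -> list:
    """Summarise each rule: spacing, look-back window and snapshot count."""
    rows = []
    for rule in sorted(policy, key=lambda r: r["ruleID"]):
        per_unit = UNIT_MINUTES[rule["units"]]
        rows.append({
            "ruleID": rule["ruleID"],
            "granularity_minutes": rule["interval"] * per_unit,
            "window_minutes": rule["retentionDuration"] * per_unit,
            "snapshots": snapshots_kept(rule),
        })
    return rows


if __name__ == "__main__":
    for row in describe_ladder(POLICY):
        print(row)
```

For the policy above this gives about 6, 24 and 3 snapshots per server for rules 1 to 3, so stretching the daily rule is far cheaper than stretching the ten-minute rule.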

Provision the target landing zone with Terraform

Treat the recovery network as infrastructure-as-code. This is Terraform, applied through CI (GitHub Actions with OIDC to AWS — no stored keys). Keep it minimal and explicit: VPC endpoints for a private replication path, a staging security group, and the recovery security group the launched instances will use.

# providers.tf — operate in the recovery Region
provider "aws" {
  region = "us-west-2"
}

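# Staging security group: the replication stream from sources, plus HTTPS for
# the interface endpoints that reuse this group further down.
resource "aws_security_group" "drs_staging" {
  name        = "drs-staging"
  description = "DRS replication servers - inbound replication"
  vpc_id      = var.dr_vpc_id

  ingress {
    description = "AWS Replication Agent stream"
    from_port   = 1500
    to_port     = 1500
    protocol    = "tcp"
    cidr_blocks = [var.source_fleet_cidr] # e.g. 10.10.0.0/16 in us-east-1
  }
  # Interface endpoints answer on 443; without this rule the endpoint ENIs
  # that share this group would drop API calls.
  ingress {
    description = "HTTPS to interface endpoints"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.source_fleet_cidr]
    self        = true # replication servers in this same group
  }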
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = { Name = "drs-staging", Environment = "dr" }
}

# Interface endpoints so replication + API calls never leave the AWS network.
locals {
  drs_endpoints = ["com.amazonaws.us-west-2.drs"]
}

resource "aws_vpc_endpoint" "drs" {
  for_each            = toset(local.drs_endpoints)
  vpc_id              = var.dr_vpc_id
  service_name        = each.value
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.dr_private_subnet_ids
  security_group_ids  = [aws_security_group.drs_staging.id]
  private_dns_enabled = true
  tags                = { Name = "vpce-drs" }
}

# Recovery instances land here at failover/drill time.
resource "aws_security_group" "recovery" {
  name        = "drs-recovery-app"
  description = "Launched recovery instances - app traffic"
  vpc_id      = var.dr_vpc_id

  ingress {
    description = "Authorization API from failover-DNS origin range"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = var.dns_origin_cidrs
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = { Name = "drs-recovery-app", Environment = "dr" }
}

S3 (gateway) and EC2/EBS interface endpoints are also worth adding for a fully private control path; they are routine and omitted here for length. Run it the usual way:

terraform init
terraform plan  -var-file=dr.tfvars -out=dr.plan
terraform apply dr.plan

The endpoints worth adding for a fully private DRS path, and what each carries:

| Endpoint | Type | Carries | Skip it and… |
|---|---|---|---|
| com.amazonaws.<region>.drs | Interface | DRS control + data plane | Replication rides public/NAT |
| com.amazonaws.<region>.s3 | Gateway | Installer + component pulls | Installer egress via NAT (data charges) |
| com.amazonaws.<region>.ec2 | Interface | EC2 control for launches | Launch API over the internet |
| com.amazonaws.<region>.ebs | Interface | EBS direct APIs (snapshots) | Snapshot ops over the internet |
| com.amazonaws.<region>.kms | Interface | CMK calls if ebs-encryption CUSTOM | KMS calls leave the VPC |

The two security groups, side by side — they do very different jobs and conflating them is a common error:

| SG | Attached to | Inbound | Outbound | Mistake to avoid |
|---|---|---|---|---|
| drs-staging | Replication servers + endpoints | TCP 1500 from source CIDR (the endpoints also need TCP 443) | All (to AWS APIs) | Forgetting 1500 → stuck sync |
| drs-recovery-app | Launched recovery EC2 | TCP 443 from DNS origin range | All | Opening 0.0.0.0/0 inbound → public exposure |

Install the AWS Replication Agent on each source server

The agent is what turns a server into a replicating source. Do not paste a long-lived access key onto a production box. Have HashiCorp Vault’s AWS secrets engine mint a short-TTL credential scoped to exactly the DRS agent-installation actions, install, then let the lease expire:

# On a bastion: pull a 15-minute installer credential from Vault.
eval "$(vault read -format=json aws/creds/drs-agent-install \
  | jq -r '.data | "export AWS_ACCESS_KEY_ID=\(.access_key)\nexport AWS_SECRET_ACCESS_KEY=\(.secret_key)"')"

On each Linux source server (the credential is passed to the installer, not persisted):

wget -O ./aws-replication-installer-init \
  "https://aws-elastic-disaster-recovery-us-west-2.s3.us-west-2.amazonaws.com/latest/linux/aws-replication-installer-init"
chmod +x aws-replication-installer-init

sudo ./aws-replication-installer-init \
  --region us-west-2 \
  --aws-access-key-id "$AWS_ACCESS_KEY_ID" \
  --aws-secret-access-key "$AWS_SECRET_ACCESS_KEY" \
  --no-prompt

On each Windows appliance, fetch the Windows installer (AwsReplicationWindowsInstaller.exe) from the windows/ path of the same regional bucket and run it from an elevated PowerShell with the same --region us-west-2 --no-prompt flags. The agent inventories every attached disk and immediately begins the initial sync, a full block-level copy.

The installer flags you actually use, and when:

| Flag | Purpose | Default | When to set |
|---|---|---|---|
| --region | Target (DRS) Region | (none; required) | Always: us-west-2 here |
| --aws-access-key-id / --aws-secret-access-key | Short-TTL install creds | (none) | From Vault, per install |
| --no-prompt | Non-interactive | interactive | Automation / fleet rollout |
| --devices | Replicate only listed disks | all disks | Exclude scratch/ephemeral volumes |
| --no-upgrade | Pin agent version | upgrades | Change-controlled fleets |
| --s3-endpoint / --endpoint | Private endpoints | public | Fully private install path |

For fleets, do not hand-run this. Drive it with Ansible so installation is repeatable and logged:

# drs-agent.yml — install the agent across the source fleet
- hosts: authorization_core
  become: true
  vars:
    drs_region: us-west-2
  tasks:
    - name: Stage the DRS installer
      get_url:
        url: "https://aws-elastic-disaster-recovery-{{ drs_region }}.s3.{{ drs_region }}.amazonaws.com/latest/linux/aws-replication-installer-init"
        dest: /opt/aws-replication-installer-init
        mode: "0755"
    - name: Install and register the agent
      command: >
        /opt/aws-replication-installer-init
        --region {{ drs_region }}
        --aws-access-key-id {{ lookup('env','AWS_ACCESS_KEY_ID') }}
        --aws-secret-access-key {{ lookup('env','AWS_SECRET_ACCESS_KEY') }}
        --no-prompt
      args:
        creates: /var/lib/aws-replication-agent
      no_log: true  # keep the short-lived credentials out of Ansible output and logs

Source-OS support is broad but not infinite — confirm your hosts are in scope before you promise an RTO on them:

| Source type | DRS support | Note |
|---|---|---|
| Linux (RHEL/CentOS, Ubuntu, Amazon Linux, SUSE, Debian) | Yes | Kernel + version matrix applies; check current docs |
| Windows Server (incl. license-bound appliances) | Yes | Use osByol:true to preserve BYOL |
| Physical / on-prem servers | Yes | Same agent; needs network path to staging |
| Other-cloud VMs | Yes | Treated as a server with disks |
| Containers / serverless | No | DRS replicates servers, not tasks/functions |
| Unsupported kernel/OS build | No | Agent install fails fast; verify first |

Confirm replication reaches a healthy, continuous state

Initial sync moves every used block once; after that the agent ships only changed blocks, which is what keeps RPO in seconds. Watch each source server progress to CONTINUOUS:

# Backlog is reported per disk under replicatedDisks, so sum it per server.
aws drs describe-source-servers --region "$DR_REGION" \
  --query 'items[].{Host:sourceProperties.identificationHints.hostname,
           State:dataReplicationInfo.dataReplicationState,
           Lag:dataReplicationInfo.lagDuration,
           BacklogBytes:sum(dataReplicationInfo.replicatedDisks[].backloggedStorageBytes || `[]`)}' \
  --output table

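You want dataReplicationState = CONTINUOUS and a near-zero lagDuration (an ISO-8601 duration like PT0S). The states you will meet most often, what each means, and what to do about it (the API also reports short-lived states such as INITIATING, BACKLOG and CREATING_SNAPSHOT on the way to CONTINUOUS):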

| dataReplicationState | Meaning | Normal when… | Action if stuck |
|---|---|---|---|
| INITIAL_SYNC | First full block copy in flight | Day one, just after install | If it never finishes → check TCP 1500 / bandwidth |
| RESCAN | Re-reading disks after a large change/restart | After bulk writes or reboot | Wait; persistent rescans → disk churn or agent issue |
| CONTINUOUS | Streaming changed blocks; ready | Steady state (the goal) | None; this is the pass state |
| STALLED | Replication stopped progressing | Never | Read dataReplicationError; fix and resume |
| DISCONNECTED | Agent lost contact with DRS | Never | Network/agent down; restart agent, check 443 |
| PAUSED | You paused replication | Intentional maintenance | start-replication to resume |
| STOPPED | Replication stopped (decommission) | Decommissioning | Expected; re-init if you need it back |
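
The CONTINUOUS-plus-lag gate is easy to automate once the describe-source-servers JSON is in hand. The following is a minimal Python sketch that parses the ISO-8601 lagDuration and lists servers that fail the gate. It assumes the items use the field names queried above, handles only day/hour/minute/second durations, and treats a missing lagDuration as "unknown, fail"; relax that last choice if your fleet omits the field at steady state.

```python
import re

# Matches durations such as PT0S, PT1M30S or P1DT2H (no months or years).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def lag_seconds(duration: str) -> float:
    """Convert an ISO-8601 duration string to seconds."""
    match = _DURATION.match(duration)
    if not match or duration in ("P", "PT"):
        raise ValueError(f"unsupported duration: {duration!r}")
    parts = {k: float(v) if v else 0.0 for k, v in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def not_ready(items: list, rpo_seconds: float) -> list:
    """Return source server IDs that are not CONTINUOUS within the RPO."""
    failing = []
    for item in items:
        info = item.get("dataReplicationInfo", {})
        state = info.get("dataReplicationState")
        lag = info.get("lagDuration")
        ok = (state == "CONTINUOUS" and lag is not None
              and lag_seconds(lag) <= rpo_seconds)
        if not ok:
            failing.append(item.get("sourceServerID", "unknown"))
    return failing
```

Feeding it the `items` array from `aws drs describe-source-servers --output json` and your RPO in seconds gives a list you can attach to the change ticket; an empty list is the pass condition.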

When a server is unhealthy, dataReplicationInfo.dataReplicationError names the class. Map each to a confirm-and-fix:

| dataReplicationError | Root cause | Confirm | Fix |
|---|---|---|---|
| AGENT_NOT_SEEN | Agent process down / host unreachable | Is the host up? Is the agent service running? | Restart aws-replication-agent; check 443 egress |
| SNAPSHOTS_FAILURE | Staging EBS snapshot couldn't be taken | EBS limits / KMS grant on the CMK | Raise snapshot quota; fix KMS key policy |
| NOT_CONVERGING | Lag growing, source out-writes the link | lagDuration climbing | Remove the cap (bandwidth-throttling 0 = unlimited); use a bigger replication server |
| UNSTABLE_NETWORK | Packet loss / flaps on the path | VPC Flow Logs; path MTU | Stabilise route; prefer private endpoints |
| FAILED_TO_CREATE_STAGING_DISKS | Can't provision staging volumes | Service quotas / subnet capacity | Raise EBS quota; check subnet free IPs |
| FAILED_TO_AUTHENTICATE_WITH_SERVICE | Agent creds/role invalid | IAM role for the agent | Re-run install with valid short-TTL creds |

Now define how a recovery instance should boot by setting each server’s launch configuration. The critical flag for a real cutover is copy-private-ip false with a target-subnet IP plan; right-sizing and BYOL licensing matter for the appliances:

SERVER_ID=s-1111aaaa2222bbbb3   # from describe-source-servers

# Booleans are bare flags in the AWS CLI: --copy-tags enables, --no-copy-private-ip disables.
aws drs update-launch-configuration --region "$DR_REGION" \
  --source-server-id "$SERVER_ID" \
  --name "auth-core-01-recovery" \
  --launch-disposition STARTED \
  --no-copy-private-ip \
  --copy-tags \
  --target-instance-type-right-sizing-method BASIC \
  --licensing '{"osByol":true}'

Every launch-configuration flag, what it does to the booted instance, and when to flip it:

| Flag | What it controls | Default | Set it when… | Gotcha |
|---|---|---|---|---|
| launch-disposition | STARTED vs STOPPED on launch | STARTED | STOPPED to inspect before powering on | Drills can use STOPPED to stage quietly |
| copy-private-ip | Reuse the source's private IP | false | Keep false for cross-Region cutover | true can collide with prod / wrong CIDR |
| copy-tags | Carry source tags to the instance | false | Cost allocation, ownership | Without it, recovery EC2 is untagged |
| target-instance-type-right-sizing-method | Auto-pick instance size | BASIC | NONE to pin a type yourself | BASIC may under/over-size; drill it |
| licensing.osByol | Bring-your-own Windows license | false | License-bound Windows appliances | Omit → AWS-provided Windows billing added |
| target-instance-type (via NONE) | Exact instance type | (derived) | Strict perf/SLA per server | You own sizing correctness |
| Launch template (managed) | Subnet, SG, IAM profile of recovery EC2 | DRS-managed | Land in the right subnet/SG | Edit the DRS-managed template, not a copy |

osByol:true preserves bring-your-own-license rather than billing AWS-provided Windows. BASIC right-sizing maps to a comparable family instead of you hard-coding one per server — but verify it meets the performance need in a drill, not during an incident.
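
Because these settings are per server, drift is easy to miss. A small audit over the output of get-launch-configuration can catch it before a drill does. The sketch below is one possible Python check, assuming the response uses the keys copyPrivateIp, copyTags and licensing.osByol; the rule set mirrors this guide's choices and is not an AWS requirement, and `is_windows_byol` is a flag you supply from your own inventory.

```python
def launch_config_violations(cfg: dict, is_windows_byol: bool) -> list:
    """Return human-readable problems with one server's launch configuration."""
    problems = []
    if cfg.get("copyPrivateIp", False):
        problems.append("copyPrivateIp must be false for a cross-Region cutover")
    if not cfg.get("copyTags", False):
        problems.append("copyTags should be true so recovery EC2 is tagged")
    if is_windows_byol and not cfg.get("licensing", {}).get("osByol", False):
        problems.append("osByol must be true for license-bound Windows appliances")
    return problems


if __name__ == "__main__":
    # Example: a config that still has the defaults for tags and licensing.
    sample = {"copyPrivateIp": False, "copyTags": False, "licensing": {"osByol": False}}
    for line in launch_config_violations(sample, is_windows_byol=True):
        print(line)
```

Running it across every source server in CI, with the JSON checked into version control, turns the table above into an enforced policy rather than a convention.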

Run a non-disruptive failover drill

This is the quarter’s audit deliverable and it must not touch production or replication. A drill launches recovery instances from the latest snapshot into an isolated test subnet while replication keeps running. Open a ServiceNow change ticket first (the runbook references the CHG number in every step), then launch:

# Launch a DRILL for one or many servers from the latest point in time.
aws drs start-recovery --region "$DR_REGION" \
  --is-drill \
  --source-servers sourceServerID="$SERVER_ID"

For the whole authorization core in one orchestrated job, pass every server ID in a single start-recovery call so DRS launches them together. Track the job to completion:

JOB_ID=$(aws drs start-recovery --region "$DR_REGION" --is-drill \
  --source-servers sourceServerID=s-1111aaaa2222bbbb3 \
                   sourceServerID=s-4444cccc5555dddd6 \
  --query "job.jobID" --output text)

aws drs describe-jobs --region "$DR_REGION" \
  --filters jobIDs="$JOB_ID" \
  --query "items[].{Job:jobID,Status:status,Type:type}" --output table
# ...poll until Status = COMPLETED
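
The polling step can be scripted so nobody refreshes a terminal during a drill. This is a hedged Python sketch: `fetch_job` is a function you supply that returns one job record (with boto3 it might wrap `describe_jobs(filters={"jobIDs": [job_id]})["items"][0]`, which is an assumption about your tooling rather than part of this guide), and the per-server `launchStatus` field is assumed to be present under `participatingServers`.

```python
import time
from typing import Callable


def wait_for_job(fetch_job: Callable[[str], dict], job_id: str,
                 poll_seconds: float = 15.0, timeout_seconds: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Poll until the job reports COMPLETED, or raise TimeoutError."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch_job(job_id)
        if job["status"] == "COMPLETED":
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} at timeout")
        sleep(poll_seconds)


def failed_launches(job: dict) -> list:
    """List source servers whose launch failed inside a finished job."""
    return [
        server["sourceServerID"]
        for server in job.get("participatingServers", [])
        if server.get("launchStatus") == "FAILED"
    ]
```

Checking `failed_launches` after the wait matters because a job can finish while an individual server's launch did not succeed.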

How a drill and a real recovery differ — same engine, very different blast radius:

| Aspect | Drill (--is-drill) | Real recovery (no flag) |
|---|---|---|
| Production impact | None | This is the cutover |
| Replication | Keeps running uninterrupted | Keeps running (until failback) |
| Target subnet | Isolated test subnet | Real recovery subnets |
| DNS | Not touched | Flipped to recovery Region |
| Point in time | Usually latest | Latest (outage) or chosen PIT (corruption) |
| Cost | Compute only while drill runs | Real running cost until failback |
| Purpose | Prove RTO, capture evidence | Restore service |
| Teardown | Terminate immediately after | Keep until failback completes |

A start-recovery job moves through states — know them so “is it done yet?” has a precise answer:

| Job status | Meaning | Typical next step |
| --- | --- | --- |
| PENDING | Accepted, not yet running | Wait |
| STARTED | Launch in progress | Poll describe-jobs |
| COMPLETED | Recovery instances launched | Boot app + run synthetic txn |
| FAILED | Launch failed | Read job log; fix launch config / quota |

The start-recovery parameters that change what gets launched — the verb that does the real work, end to end:

| Parameter | What it does | Drill value | Real-recovery value |
| --- | --- | --- | --- |
| --is-drill | Marks this a non-disruptive drill | Present | Absent |
| --source-servers sourceServerID=… | Which servers to launch | All in-scope, one job | All affected, one job |
| …,recoverySnapshotID=… | Pin a point-in-time snapshot | Usually omit (latest) | Set for corruption/ransomware |
| (launch config, set earlier) | Disposition, IP, sizing, licensing | From update-launch-configuration | Same (version-controlled) |
| --query "job.jobID" | Capture the job to track | Yes | Yes |

When the job completes, the drill instances are running in us-west-2. Boot the application, run your synthetic authorization transaction against them, and capture timings. Record the measured RTO against the 60-minute SLA in the ServiceNow ticket, then terminate the drill instances to stop paying for them:

aws drs terminate-recovery-instances --region "$DR_REGION" \
  --recovery-instance-i-ds i-0recovery1111 i-0recovery2222

Replication was never interrupted; you have just proven recovery works without a real outage. The evidence the auditor wants from each drill, and where it comes from:

| Evidence item | Source | Pass criterion |
| --- | --- | --- |
| Measured RTO | Timestamp from start-recovery → synthetic txn pass | ≤ 60 min |
| Measured RPO at cutover | lagDuration just before launch | Within the seconds-level RPO target |
| Data correctness | App-level checks (row counts, known test record) | Matches expected state |
| Monitoring intact | Recovery EC2 visible in your APM/agent | Host green within minutes |
| Clean teardown | describe-recovery-instances empty after | No orphan instances/EBS |
| Change record | ServiceNow CHG with all of the above | Approved + closed |
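
The Measured RTO row comes down to arithmetic on two timestamps. A sketch, assuming you capture epoch seconds with `date +%s` immediately before `start-recovery` and again when the synthetic transaction passes; `rto_check` is a hypothetical helper name, and the 60-minute default is this guide's SLA:

```shell
# rto_check START_EPOCH END_EPOCH [LIMIT_MINUTES]
# Prints the measured RTO in whole minutes and returns non-zero on a breach.
rto_check() {
  local start="$1" end="$2" limit_min="${3:-60}"
  local elapsed_s=$((end - start))
  # Round up so a 60m01s recovery is reported as 61 minutes, not 60.
  local elapsed_min=$(( (elapsed_s + 59) / 60 ))
  if [ "$elapsed_min" -le "$limit_min" ]; then
    echo "RTO ${elapsed_min} min: PASS (limit ${limit_min} min)"
  else
    echo "RTO ${elapsed_min} min: FAIL (limit ${limit_min} min)"
    return 1
  fi
}
```

In a drill script this might look like `T0=$(date +%s)` before the launch, `T1=$(date +%s)` after the synthetic transaction, then `rto_check "$T0" "$T1"`, with the printed line pasted into the CHG.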

Execute a real cross-region recovery (failover)

When us-east-1 is genuinely down (or you are committing to a planned Region cutover), the steps are the same minus --is-drill, plus the DNS move. Recover to the latest point in time for an outage, or to a chosen recovery point to land before a corruption/ransomware event:

# Real failover for the full authorization core, latest point in time.
aws drs start-recovery --region "$DR_REGION" \
  --source-servers sourceServerID=s-1111aaaa2222bbbb3 \
                   sourceServerID=s-4444cccc5555dddd6

To recover to a specific earlier snapshot, list the points and target one:

aws drs describe-recovery-snapshots --region "$DR_REGION" \
  --source-server-id "$SERVER_ID" \
  --query "items[].{Snap:snapshotID,Time:timestamp}" --output table

aws drs start-recovery --region "$DR_REGION" \
  --source-servers sourceServerID="$SERVER_ID",recoverySnapshotID=pit-0abc123def456

Choosing the recovery point is a real decision — pick deliberately:

| Scenario | Recover to… | Why | Risk if you pick wrong |
| --- | --- | --- | --- |
| Region outage / hardware loss | Latest snapshot | Minimise data loss; source was healthy | None; latest is correct |
| Ransomware / encryption event | A PIT before the event | Avoid restoring poisoned disks | Latest = recovering the encryption |
| Bad deploy / logical corruption | PIT before the deploy | Roll back to known-good state | Latest = same corruption |
| Compliance “restore to T” test | The specified PIT | Prove PIT works | Latest fails the test intent |
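
For the "PIT before the event" rows, picking the snapshot can be mechanical once the event time is known. A sketch, assuming snapshot rows arrive as "snapshotID timestamp" pairs and that every timestamp shares one format and timezone, so that string order equals chronological order; `pick_snapshot_before` is a hypothetical helper name:

```shell
# pick_snapshot_before EVENT_TIME
# Reads "SNAPSHOT_ID TIMESTAMP" lines on stdin and prints the ID of the
# newest snapshot taken strictly before EVENT_TIME. Returns 1 if none exists.
pick_snapshot_before() {
  awk -v event="$1" '
    # String comparison: valid only when all timestamps use the same layout.
    $2 < event && $2 > best { best = $2; id = $1 }
    END { if (id == "") exit 1; print id }
  '
}
```

The input could come from `describe-recovery-snapshots` with `--query "items[].[snapshotID,timestamp]" --output text`, which prints one whitespace-separated row per snapshot. Check the actual timestamp format in your output before trusting the string comparison.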

Once the recovery job is COMPLETED and the app passes health checks, flip traffic. Update your authoritative failover DNS to mark the us-east-1 origin down and promote the us-west-2 recovery instances as the live origin for the authorization hostname, so clients follow DNS without any client-side change. The DNS layer is where many otherwise-perfect failovers quietly fail; if you run this on Route 53, the mechanics are in Route 53: DNS Records, Routing Policies & Health Checks. Verify the cutover:

aws drs describe-recovery-instances --region "$DR_REGION" \
  --query "items[].{Host:recoveryInstanceProperties.identificationHints.hostname, \
           EC2:ec2InstanceID,Failback:failback.state}" --output table

The cutover sequence as an ordered checklist — order is the lesson, because skipping DNS is the classic miss:

| # | Step | Command / action | Gate before next step |
| --- | --- | --- | --- |
| 1 | Declare the incident | Open CHG; assemble bridge | CHG number issued |
| 2 | Pick recovery point | Latest vs PIT decision | Point agreed |
| 3 | Launch recovery | start-recovery (no --is-drill) | Job COMPLETED |
| 4 | Boot + health-check app | App readiness probes | App green |
| 5 | Data correctness check | Known record / signed txn | Data verified |
| 6 | Flip failover DNS | Mark origin down, promote recovery | Resolver returns recovery IPs |
| 7 | Confirm live traffic | Synthetic txn through DNS | Real requests served |
| 8 | Record + monitor | Update CHG; watch dashboards | Steady state |
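
Because each step has a gate, the checklist maps naturally onto a runner that refuses to continue past a failed gate. A sketch of that idea, assuming each step is wrapped in a shell function that returns non-zero when its gate is not met; `run_gates` and the step names in the usage note are illustrative, not part of any AWS tooling:

```shell
# run_gates STEP [STEP...]
# Runs each step function in order and stops at the first one that fails,
# so a later step (such as the DNS flip) never runs on an unverified state.
run_gates() {
  local step n=0
  for step in "$@"; do
    n=$((n + 1))
    echo "[$n] $step: running"
    if ! "$step"; then
      echo "[$n] $step: GATE FAILED, stopping before later steps" >&2
      return 1
    fi
    echo "[$n] $step: gate passed"
  done
}
```

A cutover script might then call `run_gates launch_recovery health_check_app verify_data flip_dns confirm_traffic`, where each name is a function you write around the commands in this section.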

You are now serving authorization out of us-west-2.

Fail back to the primary Region

Failback is the half teams forget — and a DR plan you cannot reverse is not a plan. When us-east-1 is healthy again, DRS reverses the replication: the running recovery instances become sources and stream their current state back to the original Region, so you return without losing the writes taken during the outage.

# Reverse replication: recovery instances -> original source Region.
aws drs reverse-replication --region "$DR_REGION" \
  --recovery-instance-id i-0recovery1111

Watch the failback direction sync, then, in a maintenance window, complete the failback so the original-Region servers become production again and the DR posture flips back to normal (us-east-1 source → us-west-2 staging):

aws drs describe-recovery-instances --region "$DR_REGION" \
  --query "items[].{Host:recoveryInstanceProperties.identificationHints.hostname,Failback:failback.state}" \
  --output table
# Expect: FAILBACK_READY_FOR_LAUNCH -> then finalize in a window:

aws drs start-failback-launch --region "$DR_REGION" \
  --recovery-instance-i-ds i-0recovery1111 i-0recovery2222

The failbackState values you’ll watch, and what each means:

| failbackState | Meaning | What you do |
| --- | --- | --- |
| FAILBACK_NOT_STARTED | Still serving from recovery Region | Begin reverse-replication when origin is healthy |
| FAILBACK_IN_PROGRESS | Streaming current state back to origin | Wait; monitor lag |
| FAILBACK_READY_FOR_LAUNCH | Origin has the current data | Schedule a window |
| FAILBACK_COMPLETED | Origin is primary again | Re-establish forward DR; move DNS home |
| FAILBACK_ERROR | Reverse replication failed | Check origin network/agent; retry |
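
The same mapping can live in the runbook script so the on-call engineer gets the next action printed next to the state. A small sketch covering only the five states listed above; `failback_next_step` is a hypothetical helper, and any other value is treated as unknown rather than guessed at:

```shell
# failback_next_step STATE
# Prints the runbook action for a failbackState value from the table above.
failback_next_step() {
  case "$1" in
    FAILBACK_NOT_STARTED)      echo "begin reverse-replication once the origin is healthy" ;;
    FAILBACK_IN_PROGRESS)      echo "wait and monitor lag" ;;
    FAILBACK_READY_FOR_LAUNCH) echo "schedule a window for start-failback-launch" ;;
    FAILBACK_COMPLETED)        echo "re-establish forward DR and move DNS home" ;;
    FAILBACK_ERROR)            echo "check origin network/agent, then retry" ;;
    *) echo "unknown state: $1" >&2; return 1 ;;
  esac
}
```

Failing on an unrecognised state is deliberate: during a failback it is safer to stop and look than to fall through to a default action.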

Move your failover DNS back to the us-east-1 origin, confirm in monitoring, and you have completed the full loop. The forward-vs-reverse data flow side by side, so the direction is never ambiguous:

| Aspect | Normal / forward | Failback / reverse |
| --- | --- | --- |
| Source of truth | us-east-1 production | us-west-2 recovery EC2 |
| Direction of blocks | east → west (staging) | west → east (origin) |
| Triggered by | Steady state | reverse-replication |
| Finalised by | (always on) | start-failback-launch in a window |
| DNS points to | us-east-1 | us-west-2 until failback completes |

Architecture at a glance

Read the diagram left to right as the data actually flows. In us-east-1 the production fleet runs normally; on each server an AWS Replication Agent reads every disk block and streams changes asynchronously over TCP 1500 across the Region boundary — that hop carries badge 1, the place a blocked port traps a server in INITIAL_SYNC. The stream rides a private path through a DRS VPC interface endpoint so replication never touches the public internet. It lands in the staging area in us-west-2: small t3.small replication servers plus low-cost GP3 EBS holding a continuously-updated copy of every source disk, with a PIT snapshot ladder (10-minute, hourly, daily). Badge 2 sits here — throttling or an undersized replication server is where lag creeps past your RPO. Nothing production-grade runs in staging until you ask.

On drill or recovery, DRS launches full-size recovery instances into your real target subnets. Badge 4 marks the launch itself — BASIC right-sizing or a copy-private-ip mistake bites here — and badge 3 marks the drs:StartRecovery permission, because anyone holding it can boot copies of production, which is why it is scoped tightly and gated behind change control. Finally, authoritative failover DNS health-checks the us-east-1 origin and swaps the authorization hostname to the recovery instances; badge 5 is the failover most teams forget — healthy recovery EC2 that nothing routes to because the DNS flip was skipped. The green arrow back from recovery to staging is the failback path: when the origin returns, the running instances reverse-replicate their current state home. Follow the numbers and you have both the architecture and the diagnostic map in one picture.

AWS Elastic Disaster Recovery cross-Region architecture from us-east-1 to us-west-2: production EC2 app servers and two Windows appliances run the AWS Replication Agent that streams disk blocks over TCP 1500 through a DRS VPC interface endpoint into a low-cost us-west-2 staging area of t3.small replication servers and GP3 EBS with a point-in-time snapshot ladder, from which DRS launches right-sized recovery EC2 on drill or cutover behind a scoped drs:StartRecovery permission, with authoritative failover DNS health-checking the origin and swapping traffic to the recovery Region, and a reverse failback path back to staging — numbered badges mark the agent sync, staging lag, StartRecovery blast radius, launch right-sizing, and the DNS cutover failure points

Real-world scenario

Cresta Pay runs the authorization core described in the intro: twelve Linux app servers (.NET and Java) and two Windows license-bound HSM-front appliances on EC2 in us-east-1, fronted by an NLB, processing ~3,400 authorizations/second at peak. The platform team is five engineers; the pre-DRS “DR plan” was nightly AMIs and a Confluence runbook nobody had executed. The PCI auditor’s finding was explicit: prove a 60-minute RTO and seconds-RPO with a non-disruptive drill every quarter, or accept a finding.

The first attempt to drill exposed the classic trap. An engineer ran aws drs initialize-service in us-east-1 — the Region they thought of as “where the servers are” — and spent a morning confused that the staging area wanted to replicate out of us-west-2. Re-initializing in the target Region fixed it in minutes, and it became rule one in the runbook. The second snag was network: ten of twelve Linux servers reached CONTINUOUS within two hours, but two sat in INITIAL_SYNC indefinitely. describe-source-servers showed dataReplicationError: AGENT_NOT_SEEN on one and a stalled byte counter on the other; the cause was a source-side security group missing egress TCP 1500 to the staging subnet on those two hosts (they were in a stricter SG). Opening 1500 cleared both, and they added a Reachability check to the pre-drill checklist.

The first real drill was the revelation. They opened a CHG, ran start-recovery --is-drill for all fourteen servers in one job, and watched. The Linux fleet launched and passed synthetic authorization in 38 minutes — comfortably inside SLA. The two Windows appliances, though, came up as AWS-provided Windows because the first launch configuration omitted osByol:true; the drill instances ran fine but would have added Windows licensing to every future recovery, and worse, the vendor license keyed to the original build was at risk. Setting licensing.osByol=true and re-drilling brought them up correctly as BYOL. The drill also caught a right-sizing surprise: BASIC mapped one Java server to a smaller instance family whose memory was tight under load; they pinned that one server’s type explicitly with right-sizing-method NONE.

The near-miss that justified the whole program came three months later — not a Region outage, but a bad deploy that corrupted a config store at 14:20. Because DRS held a 10-minute PIT ladder, they listed snapshots, picked the 14:10 point with recoverySnapshotID, and launched recovery to a state before the corruption. They never had to cut public traffic — they validated against the recovered instances, confirmed the good state, fixed the deploy, and discarded the recovery instances — but it proved PIT worked for real, and it would have been the play if the corruption had been ransomware.

The lasting fix had four parts. One: the runbook now leads with “initialize/verify DRS is in us-west-2” and a describe-source-servers health gate — no server out of CONTINUOUS, no drill. Two: launch configurations are in version control (update-launch-configuration driven from a reviewed file), with osByol:true on the appliances and explicit sizing on the memory-tight server. Three: the DNS cutover is an explicit, tested step with its own health-check, not an afterthought — the first dry run had healthy recovery EC2 that nothing routed to for nine minutes because nobody owned the flip. Four: failback is rehearsed too; they reverse-replicate to a scratch account quarterly so the first real failback is not the first failback ever. The quarterly drill now costs about ₹3,100 of compute (a few instance-hours, torn down immediately) and produces a clean, signed CHG. The auditor’s finding closed, and the line on the wall became: “DR you haven’t timed is a wish. DR you can’t reverse is a trap.”

The incident-to-fix timeline, because the order of moves is the lesson:

| Time | Event | Action | Effect | What it taught |
| --- | --- | --- | --- | --- |
| Day 1 | DRS set up wrong | initialize-service in us-east-1 | Staging tried to replicate the wrong way | Rule 1: DRS lives in the target |
| Day 1 | 2 servers stuck INITIAL_SYNC | Read dataReplicationError | AGENT_NOT_SEEN + blocked 1500 | Add a 1500 reachability gate |
| Wk 2 | First drill | start-recovery --is-drill, 14 servers | Linux in 38 min; Windows as AWS-licensed | Set osByol:true on appliances |
| Wk 2 | Right-size miss | BASIC undersized one Java host | Memory tight under load | Pin type with right-sizing NONE |
| Mo 3 | Bad deploy 14:20 | Recover to 14:10 PIT | Landed before corruption | PIT proven for real |
| Mo 3 | DNS dry run | Forgot the flip | Healthy EC2, no traffic, 9 min | Make DNS cutover an owned step |
| Ongoing | Quarterly drill | Drill + teardown | ₹3,100, signed CHG | Finding closed |

Advantages and disadvantages

DRS is the cheap-at-rest, fast-to-recover point on the BCDR spectrum, and the trade-offs are specific:

| Advantages | Disadvantages |
| --- | --- |
| Continuous block replication → RPO in seconds, not the hours an AMI/snapshot schedule gives | Replicates servers, not application semantics: it won’t repair logical/app-level state for you |
| Cheap at rest: staging EBS + tiny t3 replication servers, not a duplicate fleet | You pay for staging storage continuously, and real compute the moment a drill/recovery runs |
| Block-level means it replicates license-bound appliances and hand-built hosts identically | OS/kernel support matrix applies; an unsupported build simply can’t be a source |
| Drills are non-disruptive: prove RTO every quarter without touching prod or replication | Drills cost compute and leave EBS billing if you forget to terminate |
| Point-in-time recovery lets you land before ransomware/corruption, not just “now” | Longer PIT retention = more snapshot storage cost; you must tune it |
| Failback is first-class and reversible: return home without losing outage-window writes | Failback is easy to under-rehearse; the first real one fails if never practised |
| Orchestrated, scripted recovery (start-recovery) replaces a hand-run runbook | drs:StartRecovery effectively lets a holder boot copies of production, so it must be tightly scoped |

DRS is the right tool for lift-and-shift servers, appliances and hand-built hosts where you want seconds-RPO failover without paying for a hot standby. It is not the tool for the managed tiers — use an RDS/Aurora cross-Region replica for the database (see Aurora High Availability, Global Database & Zero-Downtime), and CRR for S3 — nor for stacks that are fully cloud-native with golden AMIs and IaC, where a pilot-light/warm-standby pattern is cleaner. And it does not absolve you of immutable backups: pair it with AWS Backup with Organizations, Vault Lock & Cross-Region Recovery for governed, ransomware-resistant copies.

Hands-on lab

Stand up DRS for one throwaway Linux EC2 source, watch it reach CONTINUOUS, run a drill, and tear it all down. Keep it small and delete at the end — the only real cost is a few instance-hours of staging plus the brief drill instance. Run in CloudShell (or any host with AWS CLI v2 and credentials).

Step 1 — Set Regions and confirm the CLI has drs.

export SRC_REGION=us-east-1
export DR_REGION=us-west-2
aws drs help >/dev/null && echo "drs commands present"

Step 2 — Initialize DRS in the target Region and verify roles.

aws drs initialize-service --region "$DR_REGION"
aws iam list-roles \
  --query "Roles[?contains(RoleName,'ElasticDisasterRecovery')].RoleName" --output table

Expected: at least AWSServiceRoleForElasticDisasterRecovery listed.

Step 3 — Create a minimal replication template pointed at a staging subnet you control in us-west-2 (replace the subnet/SG IDs):

# Boolean options are bare flags in the AWS CLI (--no-... disables them).
# The staging tag matches the Environment=dr filter used in the teardown check.
aws drs create-replication-configuration-template --region "$DR_REGION" \
  --staging-area-subnet-id subnet-EXAMPLEstaging \
  --staging-area-tags Environment=dr \
  --replication-server-instance-type t3.small \
  --no-use-dedicated-replication-server \
  --default-large-staging-disk-type GP3 \
  --ebs-encryption DEFAULT \
  --data-plane-routing PRIVATE_IP \
  --no-create-public-ip \
  --no-associate-default-security-group \
  --replication-servers-security-groups-i-ds sg-EXAMPLEstaging \
  --bandwidth-throttling 0 \
  --pit-policy '[{"enabled":true,"interval":10,"retentionDuration":60,"units":"MINUTE","ruleID":1}]'

Step 4 — Install the agent on a throwaway Linux EC2 in us-east-1. Use a short-TTL credential (or a tightly scoped lab key you delete after). On the instance:

wget -O ./aws-replication-installer-init \
  "https://aws-elastic-disaster-recovery-us-west-2.s3.us-west-2.amazonaws.com/latest/linux/aws-replication-installer-init"
chmod +x aws-replication-installer-init
sudo ./aws-replication-installer-init --region us-west-2 --no-prompt \
  --aws-access-key-id "$AWS_ACCESS_KEY_ID" --aws-secret-access-key "$AWS_SECRET_ACCESS_KEY"

Step 5 — Watch it reach CONTINUOUS.

watch -n 30 'aws drs describe-source-servers --region us-west-2 \
  --query "items[].{Host:sourceProperties.identificationHints.hostname,\
  State:dataReplicationInfo.dataReplicationState,Lag:dataReplicationInfo.lagDuration}" --output table'

Expected: INITIAL_SYNC → CONTINUOUS with lagDuration near PT0S. If it never leaves INITIAL_SYNC, check egress TCP 1500 from the source to the staging subnet.
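
To turn "lagDuration near PT0S" into a pass/fail gate, the duration string has to become a number. A sketch, assuming lagDuration is an ISO-8601 duration with whole-number hour, minute and second components (for example PT0S or PT1M30S); fractional or day-level values are rejected rather than guessed at, and both function names are hypothetical:

```shell
# iso_duration_to_seconds PTnHnMnS
# Converts a time-only ISO-8601 duration with integer parts to seconds.
iso_duration_to_seconds() {
  local rest="${1#PT}" total=0 num unit
  [ "$rest" != "$1" ] || return 1        # must start with "PT"
  [ -n "$rest" ] || return 1             # "PT" alone is not a duration
  while [ -n "$rest" ]; do
    num="${rest%%[HMS]*}"                # digits before the next unit letter
    case "$num" in ''|*[!0-9]*) return 1 ;; esac
    rest="${rest#"$num"}"
    unit="${rest%"${rest#?}"}"           # first remaining character
    rest="${rest#?}"
    # Strip leading zeros so the shell does not read "08" as octal.
    while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do num="${num#0}"; done
    case "$unit" in
      H) total=$((total + num * 3600)) ;;
      M) total=$((total + num * 60)) ;;
      S) total=$((total + num)) ;;
      *) return 1 ;;
    esac
  done
  echo "$total"
}

# lag_within_rpo LAG_DURATION RPO_SECONDS
# Returns 0 if the lag is at or under the RPO, 1 if over, 2 if unparseable.
lag_within_rpo() {
  local secs
  secs="$(iso_duration_to_seconds "$1")" || return 2
  [ "$secs" -le "$2" ]
}
```

With these in place, `lag_within_rpo "$LAG" 10` could gate a drill on a 10-second RPO, where `$LAG` is the lagDuration value read from `describe-source-servers`.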

Step 6 — Run a drill and time it.

SERVER_ID=$(aws drs describe-source-servers --region us-west-2 \
  --query "items[0].sourceServerID" --output text)
JOB_ID=$(aws drs start-recovery --region us-west-2 --is-drill \
  --source-servers sourceServerID="$SERVER_ID" --query "job.jobID" --output text)
aws drs describe-jobs --region us-west-2 --filters jobIDs="$JOB_ID" \
  --query "items[].{Status:status,Type:type}" --output table   # poll to COMPLETED

Step 7 — Teardown (do this or it bills).

# Terminate any drill instances:
aws drs describe-recovery-instances --region us-west-2 \
  --query "items[].recoveryInstanceID" --output text | xargs -r \
  aws drs terminate-recovery-instances --region us-west-2 --recovery-instance-i-ds

# Stop and remove the source server (also removes its staging resources):
aws drs stop-replication --region us-west-2 --source-server-id "$SERVER_ID"
aws drs delete-source-server --region us-west-2 --source-server-id "$SERVER_ID"

# Terminate the throwaway EC2 source, and uninstall the agent if reusing it:
#   sudo /var/lib/aws-replication-agent/uninstall.sh

After teardown, verify nothing is left billing — run each check and confirm the expected empty/clean result:

| Check | Command | Expected |
| --- | --- | --- |
| No recovery instances | aws drs describe-recovery-instances --region us-west-2 | Empty items |
| No replicating source | aws drs describe-source-servers --region us-west-2 | Empty (after delete) |
| No orphan EBS in staging | EC2 console → Volumes, filter Environment=dr | None unattached |
| EC2 source terminated | EC2 console → Instances | Lab instance gone |

terminate-recovery-instances does not stop source-side replication — they are independent calls, which is exactly the trap that leaves staging volumes billing after a “cleanup.”

Common mistakes & troubleshooting

The differentiator. Before the playbook, the instruments — what each tool tells you during a DRS incident, so you reach for the right one instead of guessing:

| Tool | What it shows | How to reach it | Best for |
| --- | --- | --- | --- |
| describe-source-servers | Per-server state, lag, dataReplicationError | CLI | “Is replication healthy?” (the first gate) |
| describe-jobs | Launch job status (STARTED/COMPLETED/FAILED) | CLI | “Did my drill/recovery finish?” |
| describe-recovery-instances | Recovery EC2 IDs + failbackState | CLI | “Did it boot? Where is failback?” |
| get-launch-configuration | Disposition, IP, sizing, licensing | CLI | “Will it boot the way I expect?” |
| VPC Reachability Analyzer | Whether a path on a port works | Console / CLI | Proving TCP 1500 source→staging |
| VPC Flow Logs | Accepted/rejected flows on the path | CloudWatch / S3 | Confirming a blocked/flapping link |
| CloudTrail | Who called StartRecovery/Terminate… and when | CloudTrail / Athena | Audit + “who launched this?” |
| CloudWatch alarms | Replication-stalled / lag breach | CloudWatch | Catching STALLED before a drill |
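
The first gate, "no server out of CONTINUOUS, no drill", is easy to automate on top of describe-source-servers. A sketch, assuming the input is one "hostname state" pair per line; `replication_gate` is a hypothetical helper, and it treats empty input as a failure so a wrong Region or an empty result cannot pass silently:

```shell
# replication_gate
# Reads "HOSTNAME STATE" lines on stdin. Succeeds only if at least one
# server was seen and every server is in the CONTINUOUS state.
replication_gate() {
  awk '
    NF >= 2 { seen++ }
    NF >= 2 && $2 != "CONTINUOUS" { print "NOT READY: " $1 " is " $2; bad++ }
    END {
      if (seen == 0 || bad > 0) exit 1
      print "all " seen " servers CONTINUOUS"
    }
  '
}
```

The rows could be produced with `--query "items[].[sourceProperties.identificationHints.hostname,dataReplicationInfo.dataReplicationState]" --output text` and piped straight into `replication_gate` at the top of the drill script.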

Now the playbook. Each row is a real failure mode with the exact way to confirm it and the fix. Scan it, then read the detail for whichever bites.

| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
| --- | --- | --- | --- | --- |
| 1 | Staging wants to replicate the wrong direction | DRS initialized in the source Region | aws drs describe-replication-configuration-templates in each Region | Re-run initialize-service in us-west-2; tear down the wrong one |
| 2 | Server stuck in INITIAL_SYNC forever | TCP 1500 blocked (source egress or staging SG) | describe-source-servers → dataReplicationError; VPC Reachability Analyzer | Open 1500 source→staging; confirm staging SG ingress |
| 3 | dataReplicationState = STALLED | Snapshot failure or non-converging lag | dataReplicationInfo.dataReplicationError | Fix KMS grant / set throttle to 0 (unthrottled) / bigger replication server |
| 4 | Agent shows DISCONNECTED | Agent process down or 443 egress blocked | Is the service running? Check 443 to DRS endpoint | Restart aws-replication-agent; allow 443 |
| 5 | Lag climbs past RPO | Bandwidth throttle too low / link saturated | lagDuration trending up | Set bandwidth-throttling 0; upsize replication server |
| 6 | Drill collides with production | Forgot --is-drill or copy-private-ip true into prod subnet | Check the launch config + the job target | Always --is-drill; copy-private-ip false; isolated subnet |
| 7 | Windows appliance billed AWS-Windows | osByol:true omitted on launch config | aws drs get-launch-configuration → licensing | Set licensing.osByol=true; re-launch |
| 8 | Recovery instance too small/slow | BASIC right-sizing under-provisioned | Compare recovery instance type vs source need | Pin type with right-sizing-method NONE |
| 9 | Recovery healthy but no traffic | DNS cutover step skipped | dig the hostname; check origin health-check | Make the DNS flip an explicit, owned runbook step |
| 10 | Failback never starts / errors | Reverse path blocked or origin agent down | describe-recovery-instances → failbackState | Fix origin network/agent; retry reverse-replication |
| 11 | Staging EBS still billing after “cleanup” | terminate-recovery-instances doesn’t stop replication | List source servers still present | Also stop-replication + delete-source-server |
| 12 | Recovered to a poisoned disk | Used latest during a ransomware/corruption event | Compare event time vs snapshot time | Recover to a PIT before the event via recoverySnapshotID |

Initializing DRS in the wrong Region

DRS lives in the target. Running initialize-service in us-east-1 sets up replication out of us-west-2, the opposite of the plan. Confirm: describe-replication-configuration-templates in each Region shows where staging lives. Fix: initialize in us-west-2; remove the inverted setup.

Port 1500 blocked

If the staging security group or a source-side egress rule misses TCP 1500, agents register but never leave INITIAL_SYNC. Confirm: describe-source-servers → dataReplicationInfo.dataReplicationError, and run VPC Reachability Analyzer source→staging on 1500. Fix: open egress 1500 on the source SG and ingress 1500 on the staging SG from the source CIDR.

Drills that touch production

Forgetting --is-drill, or pointing a drill’s launch config at the production subnet/IP, can collide with live systems. Confirm: the start-recovery job’s target subnet and the launch config’s copy-private-ip. Fix: keep a dedicated test subnet, always pass --is-drill, and keep copy-private-ip false.

Windows BYOL billed as AWS-provided

Omitting osByol:true on license-bound appliances silently adds Windows licensing to every recovery instance and can jeopardise the vendor’s host-keyed license. Confirm: get-launch-configuration → licensing. Fix: licensing.osByol=true, then re-launch.

Skipping the failback rehearsal

Teams drill failover and never failback; the first real failback then fails under pressure. Confirm: check whether reverse-replication/start-failback-launch have ever been exercised. Fix: rehearse the reverse loop quarterly (to a scratch account is fine).

Right-sizing surprises

BASIC maps to a comparable family, but verify the recovery instance type actually meets the performance need before an incident. Confirm: compare the booted instance type to the source’s CPU/RAM under load during a drill. Fix: pin critical servers with right-sizing-method NONE and an explicit target-instance-type.

Best practices

Security notes

Keep the DR plane as governed as production. Replication uses EBS encryption (ebs-encryption DEFAULT, or a CMK with CUSTOM) at rest and TLS in transit; with data-plane-routing PRIVATE_IP plus the VPC interface endpoints, replication traffic never touches the public internet. Agent installation pulls short-lived credentials (Vault’s AWS secrets engine) so no static key lands on a server — and any operator who can call drs:StartRecovery is, in effect, able to launch copies of production, so scope that IAM permission tightly and gate it behind a ServiceNow change. If you encrypt staging with a customer-managed key, the KMS grants matter — the mechanics are in AWS KMS Encryption Deep Dive: Keys, Policies, Envelope, Rotation. Roll the DR Region into your normal posture tooling so a misconfigured staging SG, a publicly exposed recovery instance, or an unencrypted volume is caught continuously, and run your endpoint/EDR agent on the source images so the sensor is present on every recovery instance from first boot. For workforce access to the DRS console and the break-glass operator role, federate through SSO with conditional access rather than IAM users, and require MFA on the recovery role.

The DRS-specific permissions and the blast radius of each:

| Permission / action | Who needs it | Blast radius | Guardrail |
| --- | --- | --- | --- |
| drs:DescribeSourceServers | On-call, dashboards | Read-only | Broad read is fine |
| drs:UpdateLaunchConfiguration | Platform engineers | Changes how recovery boots | Reviewed-file driven; PR-gated |
| drs:StartRecovery | Break-glass operator | Boots copies of production | Scope tight; MFA; CHG-gated |
| drs:TerminateRecoveryInstances | Operator | Removes recovery EC2 | Scope to DR account/Region |
| drs:StopReplication / DeleteSourceServer | Senior platform | Stops DR for a server | Senior-only; audited |
| drs:* (admin) | Rare | Full control | Break-glass identity only |

The encryption-in-transit/at-rest posture at a glance:

| Layer | Mechanism | Setting | Verify |
| --- | --- | --- | --- |
| Block stream in transit | TLS over TCP 1500 | (default) | Private path via endpoint |
| Staging volumes at rest | EBS encryption | ebs-encryption DEFAULT/CUSTOM | Volumes show encrypted |
| PIT snapshots at rest | EBS snapshot encryption | Inherits volume encryption | Snapshots encrypted |
| CMK control (optional) | KMS customer-managed key | CUSTOM + key + grants | Key policy allows DRS roles |
| Recovery instance volumes | EBS encryption | From launch template | Encrypted on boot |

Cost & sizing

DRS is deliberately cheap at rest, which is the entire point versus a warm standby: you pay a small per-source-server hourly DRS charge, the low-cost staging EBS volumes (GP3) holding replicated data, and the small t3.small replication servers — not full-size duplicate infrastructure. Real compute cost only appears while drill or recovery instances run, so terminate drill instances the moment validation is captured — the single biggest avoidable line item.

What actually drives the DRS bill, and how to keep each honest:

| Cost driver | Billed as | Rough scale | Control it by |
| --- | --- | --- | --- |
| Per-source-server DRS charge | Hourly per replicating server | Small, continuous | Stop replicating decommissioned sources |
| Staging EBS (GP3) | Per GB-month of replicated data | Proportional to total disk | Exclude scratch/ephemeral volumes (--devices) |
| Replication servers | t3.small hours (shared) | Low, continuous | Don’t use dedicated unless required |
| PIT snapshot storage | Per GB-month of snapshots | Grows with retention ladder | Tune pit-policy to compliance, not “max” |
| Drill / recovery compute | Full instance-hours while running | Spiky | Terminate immediately after the drill |
| Data transfer | Cross-Region + any NAT | Per GB | Private endpoints; avoid NAT data-processing |

A rough monthly sketch (illustrative; verify against the AWS pricing calculator for your Region and disk sizes):

| Item | Assumption | Rough monthly |
| --- | --- | --- |
| 14 source servers (DRS charge) | Small per-server hourly | A few thousand ₹ |
| Staging EBS (GP3) | ~2 TB replicated | Storage-driven |
| Replication servers | Shared t3.small | Low |
| PIT snapshots | 10m/1h/3d ladder, ~2 TB | Moderate |
| Quarterly drill | 14 instances × ~1 hr, torn down | ≈ ₹3,100 / drill |
| At-rest baseline | No drill running | Dominated by EBS + per-server |
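
The drill line is just instances × hours × blended hourly rate, which makes it easy to see what forgetting the teardown costs. A sketch of that arithmetic; `drill_cost` is a hypothetical helper, and the rate is a placeholder you supply from the pricing calculator for your own instance mix, not an AWS price:

```shell
# drill_cost INSTANCES HOURS RATE_PER_INSTANCE_HOUR
# Prints the compute cost of a drill to two decimals, in whatever
# currency the rate is expressed in.
drill_cost() {
  awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * h * r }'
}
```

With a placeholder rate of 200 per instance-hour, `drill_cost 14 1 200` gives the one-hour drill, and `drill_cost 14 24 200` shows the same fleet left running for a day.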

The teardown calls are independent — this is the single most common way DRS keeps billing after a “cleanup,” so know exactly what each call releases and what it leaves behind:

| Call | Releases | Leaves behind | You still pay for… |
| --- | --- | --- | --- |
| terminate-recovery-instances | Launched recovery EC2 + its EBS | Source-side replication + staging | Per-server DRS charge + staging EBS |
| stop-replication | Active replication for that source | The source-server record in DRS | Nothing ongoing for that source |
| delete-source-server | The source record + its staging resources | Nothing (full removal) | Nothing |
| terraform destroy (landing zone) | VPC endpoints, SGs you created | DRS objects (separate) | Nothing (network) |
| Agent uninstall (on source) | The agent on the host | DRS-side record (until deleted) | Nothing |

Sizing the replication servers is the one knob that affects both cost and whether you meet RPO: too small and high-churn sources push lagDuration past your target (the NOT_CONVERGING error); too large and you pay for idle receive capacity. Start at t3.small, watch lagDuration under real write load during a drill, and step up only the servers that need it.

| Source profile | Replication server | Why |
| --- | --- | --- |
| Low/steady write rate | t3.small (default, shared) | Cheapest; sync keeps up easily |
| High-churn DB-like disks | Larger t3/m-family | Avoids NOT_CONVERGING lag |
| Many sources, mixed | Shared t3.small + selective upsize | Pay for headroom only where needed |
| Strict isolation requirement | use-dedicated-replication-server true | One server per source (costlier) |

Interview & exam questions

Q1. In which Region do you initialize DRS, and why? In the target/recovery Region (us-west-2). Replication, the staging area, snapshots and recovery launches all live where you recover to; initializing in the source Region inverts the design so staging would replicate out of the recovery Region. (AWS SAP-C02, AWS Certified Security; BCDR design.)

Q2. How does DRS keep RPO in seconds? The Replication Agent does one full initial sync of every used block, then streams only changed blocks continuously and asynchronously to the staging area, so the staging copy trails the source by seconds, not the hours an AMI/snapshot schedule gives. (SAP-C02.)

Q3. What is the difference between a drill and a real recovery in DRS? A drill (start-recovery --is-drill) launches recovery instances into isolation while replication keeps running, to prove RTO without touching production; a real recovery is the same call minus --is-drill, into real subnets, followed by the DNS cutover. (SAP-C02; operational.)

Q4. How do you recover to a point before a ransomware event rather than to the corrupted “now”? Use the PIT policy’s snapshot ladder: describe-recovery-snapshots to list points, then start-recovery with recoverySnapshotID set to a snapshot timestamped before the event. (Security specialty; resilience.)

Q5. A source server is stuck in INITIAL_SYNC. What is the single most likely cause and how do you confirm it? Blocked TCP 1500 from the source to the staging subnet. Confirm with describe-source-servers → dataReplicationInfo.dataReplicationError and VPC Reachability Analyzer on 1500. (Operational; troubleshooting.)

Q6. Why must osByol:true be set for license-bound Windows appliances? Without it, recovery instances launch as AWS-provided Windows, adding licensing charges and risking the vendor’s host-keyed license; osByol:true preserves bring-your-own-license. (Cost + licensing.)

Q7. What does copy-private-ip control, and what is the right value for a cross-Region cutover? Whether the recovery instance reuses the source’s private IP. For cross-Region cutover keep it false — the source CIDR won’t exist in the target VPC and reusing it risks collisions; plan target-subnet addressing instead. (SAP-C02; networking.)

Q8. How does failback work in DRS? reverse-replication makes the running recovery instances sources and streams their current state back to the original Region; once FAILBACK_READY_FOR_LAUNCH, start-failback-launch in a maintenance window makes the origin primary again without losing outage-window writes. (SAP-C02; BCDR.)

Q9. Why does terminating recovery instances not stop your staging bill? terminate-recovery-instances only removes the launched EC2; source-side replication (and its staging EBS) is a separate lifecycle — you must also stop-replication/delete-source-server. (Cost; operational trap.)

Q10. When would you choose DRS over a pilot-light/warm-standby pattern? When you’re recovering servers you can’t trivially rebuild from code — license-bound appliances, hand-built hosts, lift-and-shift VMs — and want seconds-RPO without paying for a hot standby. Cloud-native stacks with golden AMIs and IaC are usually better served by pilot light. (SAP-C02; architecture trade-off.)

Q11. What guardrails belong on drs:StartRecovery? It can boot copies of production, so scope it to the DR account/Region, gate it behind change control (a CHG), require MFA on the operator role, and federate access via SSO rather than IAM users. (Security specialty.)

Q12. How do you keep the DRS replication path off the public internet? Set data-plane-routing PRIVATE_IP, create a DRS VPC interface endpoint (plus S3 gateway / EC2 / EBS endpoints), and use private subnets — this also avoids NAT data-processing charges. (Networking; cost.)
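
For the point-in-time answer in Q4, the selection step can be sketched in a few lines of Python. This assumes each snapshot record carries a `snapshotID` and an ISO 8601 `timestamp`, as in the `items` list from describe-recovery-snapshots. The snapshot IDs, the times and the 14:20 cutoff are invented for illustration.

```python
from datetime import datetime, timezone


def parse_timestamp(text):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is normalized for older Python versions."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def latest_snapshot_before(snapshots, cutoff):
    """Return the snapshotID of the newest snapshot taken strictly before cutoff, or None."""
    candidates = [s for s in snapshots if parse_timestamp(s["timestamp"]) < cutoff]
    if not candidates:
        return None
    newest = max(candidates, key=lambda s: parse_timestamp(s["timestamp"]))
    return newest["snapshotID"]


# Hypothetical records, shaped like the "items" list from describe-recovery-snapshots.
snapshots = [
    {"snapshotID": "pit-aaaa", "timestamp": "2024-05-01T14:00:00Z"},
    {"snapshotID": "pit-bbbb", "timestamp": "2024-05-01T14:10:00Z"},
    {"snapshotID": "pit-cccc", "timestamp": "2024-05-01T14:30:00Z"},
]

# The moment of the bad event, as a timezone-aware UTC datetime.
event_time = datetime(2024, 5, 1, 14, 20, tzinfo=timezone.utc)

print(latest_snapshot_before(snapshots, event_time))
```

The ID it returns is the value you would pass as recoverySnapshotID to start-recovery for that source server. A result of None means the PIT policy no longer retains a point before the event.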

Quick check

  1. In which Region do you run aws drs initialize-service for a us-east-1 → us-west-2 setup, and why?
  2. A server shows dataReplicationState = STALLED with dataReplicationError: NOT_CONVERGING. What is happening and what’s the fix?
  3. You need to recover to a state just before a bad deploy at 14:20. Which call lists your options, and which flag selects the earlier point?
  4. Name two things that are billing you that terminate-recovery-instances alone will not stop.
  5. What single launch-configuration flag prevents a cross-Region recovery instance from colliding with production addressing?

Answers

  1. In the target Region, us-west-2 — that’s where replication, staging and recovery live; running it in the source Region inverts the design.
  2. The source is out-writing the replication link so lag is growing and not converging. Fix: set bandwidth-throttling 0 and/or move that source to a larger replication server.
  3. aws drs describe-recovery-snapshots lists the PIT points; pass recoverySnapshotID=<earlier-snap> to start-recovery to land before the event.
  4. Source-side replication (its per-server DRS charge) and the staging EBS volumes — stop those with stop-replication / delete-source-server.
  5. copy-private-ip false — so the recovery instance gets a target-subnet IP instead of reusing the source’s private IP.

Glossary

Next steps
