Advanced EC2 Auto Scaling: Warm Pools, Lifecycle Hooks, and Zero-Downtime Instance Refresh

Most teams stand up an Auto Scaling group, attach a target-tracking policy, and call it done. That works right up until the moment it doesn’t: a traffic spike outruns a five-minute boot time, a Spot reclaim kills in-flight requests, or an AMI rollout takes down half the fleet because nobody told the load balancer to drain connections first. An EC2 Auto Scaling group (ASG) is not a thermostat — it is a state machine over instance lifecycles, and the interesting engineering lives in the transitions between Pending, InService, Terminating, Warmed:Stopped and the wait states in between.

This guide walks the controls I reach for on every production fleet: launch templates and capacity strategy, warm pools, lifecycle hooks, instance refresh, Spot interruption choreography, and health-check tuning — with the failure modes that justify each one. The difference between an ASG that quietly absorbs a flash sale and one that pages you at 2am is almost never the scaling policy; it is whether the transitions are instrumented. A scale-out is only as fast as your slowest boot. A scale-in is only as safe as your drain. An AMI rollout is only as reversible as your rollback alarm.

Because this is a reference you will return to mid-incident and mid-design, the prose explains the mechanism but the tables enumerate every option, default, limit and failure fingerprint — every allocation strategy, every lifecycle state, every instance-refresh preference, every Spot signal, every termination policy. Read the prose once; keep the tables open when you are tuning warmup, sizing a warm pool, or deciding whether a stuck Terminating:Wait is a hook that never called back or a heartbeat you forgot to extend.

What problem this solves

Reactive scaling has a built-in lie: the metric breaches after demand has already arrived, and the replacement instance then has to boot, pull containers, JIT-warm, prime connection pools, and pass health checks before it serves a single request. If that takes four minutes, a sharp spike is four minutes of degraded service no scaling policy can shorten — the policy fired correctly, the capacity just wasn’t ready. Teams paper over this with a fat On-Demand floor they pay for around the clock, which is expensive, or with aggressive step scaling that overshoots and thrashes.

The second pain is uncontrolled termination. By default the ASG kills an instance the instant it decides to scale in — mid-request, mid-job, mid-flush — and the same brutality applies when a Spot instance is reclaimed or an AMI rollout replaces a node. Without a drain contract, every scale-in event drops connections, every Spot interruption loses in-flight work, and every deploy is a coin-flip. The risk team at any payments or healthcare shop will (correctly) veto Spot entirely until you can prove an infrastructure event never kills a live request.

The third pain is risky rollouts. “Ship a new AMI” should be a routine, reversible, observable operation. Done wrong — bump desired_capacity and pray, or terminate instances by hand — it takes down capacity, offers no canary, and has no automatic revert when the new build is bad. Who hits all three: anyone running a stateful or latency-sensitive tier on EC2 at scale, anyone trying to capture Spot savings without dropping work, and anyone who deploys by replacing instances. The fix is to treat the ASG as the state machine it is and wire the transitions — warm pools for the cold-start gap, lifecycle hooks for the drain/bootstrap windows, instance refresh for controlled rollouts, and capacity rebalancing for graceful Spot exits.

To frame the field before the deep dive, here is every control this article covers, the production pain it removes, and the single setting that anchors it:

Control | Pain it removes | The state/transition it owns | Anchor setting
Launch template + mixed instances | Single-pool capacity stall; all-or-nothing purchase | What an instance is at launch | spot_allocation_strategy
Scaling policies | Over/under-provisioning; thrash | When desired_capacity changes | EstimatedInstanceWarmup
Warm pool | Cold-start gap on scale-out | Warmed:Stopped → InService | pool-state + min-size
Lifecycle hooks | Killed-mid-request; un-bootstrapped nodes | Pending:Wait / Terminating:Wait | heartbeat-timeout + default-result
Instance refresh | Risky AMI rollouts; no canary, no revert | Rolling replacement of the fleet | MinHealthyPercentage / AutoRollback
Capacity Rebalancing | Dropped work on Spot reclaim | Proactive replace before the 2-min gun | --capacity-rebalance
Health checks + termination policy | Booted-but-broken in rotation; wrong instance dies | Who is healthy / who dies on scale-in | health_check_type / termination-policies

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand EC2 fundamentals (instances, AMIs, instance families/sizes, EBS volume types, IMDS instance metadata) at the level of “AWS EC2 Deep Dive: Instances, AMIs, EBS, User Data, IMDS”, and the Auto Scaling basics (launch templates, a single target-tracking policy, min/max/desired) from “EC2 Auto Scaling: Launch Templates, Policies, Lifecycle”. You should be comfortable running the aws CLI with named profiles, reading JSON output, and applying Terraform. Familiarity with an Application Load Balancer (ALB) and target groups helps, because the drain contract runs through them; see “Elastic Load Balancing: ALB, NLB, GWLB Deep Dive”.

This sits in the Compute / Reliability track. It is downstream of the EC2 and ASG fundamentals and upstream of the Spot-heavy and event-driven scaling patterns: “EC2 Spot + Mixed Instances: Capacity-Optimized ASGs and Interruption Handling” goes deeper on the purchase-option blend, and “E-commerce Black Friday: AWS Surge Autoscaling Architecture” shows the whole stack under flash-sale load. Observability for all of it lives in “CloudWatch & CloudTrail Observability Deep Dive”.

A quick map of who owns what during an ASG incident, so you escalate to the right person fast:

Layer | What lives here | Who usually owns it | Failure classes it can cause
Launch template / AMI | Boot image, user-data, IMDS, EBS | Platform / app team | Bad AMI → refresh fails; slow boot → cold-start gap
ASG control plane | desired/min/max, policies, hooks | Platform / SRE | Thrash, stuck *:Wait, refresh stall
Warm pool | Pre-initialized reserve | Platform / SRE | Empty pool → slow scale-out; reuse bugs
Lifecycle hook handler | Drain/bootstrap automation (SSM/Lambda) | App team | Stuck Terminating:Wait; dropped requests
Load balancer / target group | Routing, health, deregistration delay | Network / app team | 5xx on scale-in; flapping registration
Spot capacity pools | Reclaim risk per type/AZ | AWS (you choose pools) | Simultaneous interruptions; capacity stall
CloudWatch alarms | Scaling triggers + rollback signal | SRE / app team | Mis-scaled fleet; no auto-revert on bad deploy

Core concepts

Five mental models make every later decision obvious.

An ASG is a state machine, not a counter. The group does not just hold a number; it moves every instance through a defined lifecycle: Pending → (optional Pending:Wait for a launch hook) → Pending:Proceed → InService, and on the way out InService → (optional Terminating:Wait for a terminate hook) → Terminating:Proceed → Terminated. Warm pools add Warmed:* variants. Every control in this article is a hook into, or a policy over, one of these transitions. Diagnosing the ASG always starts with “what state is the instance stuck in?”

Capacity diversity is a reliability primitive, not a cost trick. A single instance type is a single point of failure for capacity — when m6i.large is exhausted in an AZ, scale-out stalls and your spike goes unanswered. A mixed instances policy lets the group draw from a diversified pool of types across multiple AZs and blend On-Demand with Spot. The allocation strategy is the lever that turns that diversity into either resilience (price-capacity-optimized) or raw savings at higher interruption risk (lowest-price).

Cold start is a gap you pre-pay, not a latency you accept. Reactive scaling reacts after the breach, and the new instance still pays the full boot tax: AMI/EBS init, container pull, runtime JIT, DI/connection-pool priming, health-check pass — often minutes. A warm pool is a reserve of instances held past that expensive bootstrap in Stopped/Hibernated/Running state, so scale-out resumes a warm instance in seconds instead of launching cold. You move the cost from “every spike” to “once, in the background.”

Termination is a contract you must sign. By default the ASG terminates instantly. Lifecycle hooks insert a *:Wait state and hand you a window — to drain the load balancer and finish in-flight work before a terminate, or to bootstrap and register before a launch. The same drain path serves normal scale-in, Spot reclaim, and instance refresh. The contract is: the instance does not proceed until you call complete-lifecycle-action or the heartbeat times out.

A rollout is a controlled, reversible, observable replacement. Instance refresh rolls the fleet to the current launch template version in batches, honouring a minimum (and optional maximum) healthy percentage, an instance warmup, optional checkpoints for canary bake time, and an optional alarm-based rollback that auto-reverts on a bad build. “Update the AMI” becomes a normal, abortable operation instead of a manual fire drill.

The lifecycle states in one table

Before the deep sections, pin down every state an instance passes through and what each means operationally. This is the single most useful reference when something is stuck:

Lifecycle state | What it means | Triggered by | What you do here | Stuck-here symptom
Pending | Launching, not yet in service | Scale-out / refresh / replacement | Nothing (transient) | Slow boot if it lingers
Pending:Wait | Held by a launch lifecycle hook | Launch hook attached | Run bootstrap, then complete-lifecycle-action | Hook never called back → ABANDON/timeout
Pending:Proceed | Hook done, finishing launch | Hook completed | Nothing (transient) | -
InService | Healthy, taking traffic | Passed health checks + grace | Normal operation | Booted-but-broken if health type is EC2
Terminating | Being terminated | Scale-in / refresh / unhealthy | Nothing (transient) | -
Terminating:Wait | Held by a terminate lifecycle hook | Terminate hook attached | Drain ELB, finish jobs, complete-lifecycle-action | Drain never reports → waits out heartbeat
Terminating:Proceed | Hook done, finishing termination | Hook completed | Nothing (transient) | -
Terminated | Gone | - | - | -
Warmed:Pending | Entering the warm pool, bootstrapping | Warm pool + launch hook | Bootstrap for the pool (distinguish from Pending) | Bootstrap not pool-aware → wrong behaviour
Warmed:Stopped | In warm pool, stopped, pre-initialized | Warm pool (Stopped) | Nothing (reserve waiting to be resumed) | Pool empty → no fast scale-out
Warmed:Hibernated | In warm pool, hibernated (RAM saved) | Warm pool (Hibernated) | Nothing | -
Warmed:Running | In warm pool, running, billed | Warm pool (Running) | Nothing (fastest, most expensive) | Paying compute for idle reserve
Standby | Removed from rotation, still in group | enter-standby (manual) | Maintenance without termination | Forgot to exit-standby → lost capacity
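
When an instance lingers in one state, the table above tells you where to look first. A minimal shell sketch of that lookup follows; the function name `stuck_hint` and the hint wording are illustrative condensations of the table, not part of any AWS tool.

```shell
# stuck_hint STATE
# Prints a first-thing-to-check hint for an instance that lingers in STATE.
# Hypothetical helper: the name and hint text are condensed from the table above.
stuck_hint() (
  case "$1" in
    Pending)           echo "slow boot: time the AMI and user-data" ;;
    Pending:Wait)      echo "launch hook: did bootstrap call complete-lifecycle-action?" ;;
    Terminating:Wait)  echo "terminate hook: did the drain report back or extend the heartbeat?" ;;
    Warmed:Pending*)   echo "warm pool bootstrap: is the script pool-aware?" ;;
    Standby)           echo "standby: check for a missing exit-standby" ;;
    *)                 echo "transient or healthy: no hook to chase" ;;
  esac
)

stuck_hint "Terminating:Wait"
```

Feed it the LifecycleState string reported by `aws autoscaling describe-auto-scaling-instances` for the instance you are chasing.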

The vocabulary side by side

Concept | One-line definition | Where it lives | Why it matters
Launch template | Versioned blueprint for an instance | EC2 → Launch Templates | The unit instance refresh rolls forward
Mixed instances policy | Diversified types + purchase blend | On the ASG | Capacity resilience + Spot savings
Allocation strategy | How Spot pools are chosen | instances_distribution | Resilience vs cheapest
Warm pool | Pre-initialized instance reserve | On the ASG | Seconds-not-minutes scale-out
Lifecycle hook | Wait state on a transition | On the ASG | Safe drain / bootstrap window
Heartbeat | The hook’s countdown clock | Per in-progress hook action | Extend it or the instance proceeds
Instance refresh | Rolling replacement to new template | ASG operation | Zero-downtime AMI/template rollout
Checkpoint | A pause % during a refresh | Refresh preferences | Canary bake before continuing
Capacity Rebalancing | Proactive Spot replacement | ASG flag | Graceful exit before reclaim
Rebalance recommendation | Early “elevated risk” signal | EventBridge / metadata | Drain before the 2-min notice
Interruption notice | Hard “going away in ~2 min” | Instance metadata / EventBridge | Last-chance drain
Default instance warmup | Group-wide time-to-ready | On the ASG | Stops double-scaling on fresh capacity
Termination policy | Who dies on scale-in | On the ASG | Sheds the right (oldest/stalest) capacity
Scale-in protection | “Don’t kill this instance” flag | Per instance | Protect non-resumable work

Launch templates, mixed instances, and allocation strategy

Launch configurations are dead; everything below requires a launch template. The template is versioned, supports the full EC2 surface (IMDSv2 enforcement, instance tags, detailed monitoring, instance-store mappings), and is the unit instance refresh rolls forward.

resource "aws_launch_template" "app" {
  name_prefix   = "app-"
  image_id      = var.ami_id
  instance_type = "m6i.large" # overridden by the mixed instances policy below

  metadata_options {
    http_tokens                 = "required" # IMDSv2 only
    http_put_response_hop_limit = 2
    instance_metadata_tags      = "enabled"
  }

  monitoring { enabled = true } # 1-minute metrics, not 5

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = 30
      volume_type           = "gp3"
      throughput            = 250
      delete_on_termination = true
      encrypted             = true
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags          = { Name = "app", Environment = "prod" }
  }
}

The equivalent in CLI, creating a template from a JSON spec:

aws ec2 create-launch-template \
  --launch-template-name app \
  --launch-template-data '{
    "ImageId": "ami-0abc123",
    "InstanceType": "m6i.large",
    "MetadataOptions": {"HttpTokens": "required", "HttpPutResponseHopLimit": 2, "InstanceMetadataTags": "enabled"},
    "Monitoring": {"Enabled": true},
    "BlockDeviceMappings": [{"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 30, "VolumeType": "gp3", "Throughput": 250, "Encrypted": true, "DeleteOnTermination": true}}]
  }'

Every launch-template field that matters for an ASG

The template is where most “why is this instance wrong” bugs originate. Enumerate the fields you actually set, the default, and the gotcha:

Field | What it sets | Default | When to change | Gotcha / limit
image_id | The AMI to boot | (none) | Every AMI roll | Stale/bad AMI = refresh fails health check
instance_type | Base type | (none) | Rarely (overridden by policy) | Ignored when mixed-instances overrides exist
metadata_options.http_tokens | IMDSv1 vs IMDSv2 | optional | Always set required | required breaks SDKs that only do IMDSv1
http_put_response_hop_limit | IMDS hop limit | 1 | Set 2 for container workloads | Pods/containers need ≥2 to reach IMDS
instance_metadata_tags | Tags via IMDS | disabled | When app reads its own tags | Off by default; enable explicitly
monitoring.enabled | 1-min vs 5-min metrics | 5-min | Set true for responsive scaling | Detailed monitoring billed per instance
block_device_mappings | EBS volumes | AMI default | gp3 + size + throughput | delete_on_termination defaults vary
instance_market_options | Spot at template level | On-Demand | Leave to the ASG policy | Don’t set Spot here AND in the policy
iam_instance_profile | Role for the instance | (none) | Always (SSM, app perms) | Missing profile breaks SSM drain hooks
security_group_ids | Network exposure | default SG | Always set explicitly | VPC default SG is usually wrong
user_data | Boot script | (none) | Bootstrap | Runs on every launch incl. warm pool
tag_specifications | Tags on instance/volume | (none) | Always (cost allocation) | Per-resource-type; volumes need their own
ebs_optimized | Dedicated EBS bandwidth | per type | Leave default on Nitro | Built-in on modern types

A single instance type is a single point of failure for capacity. A mixed instances policy lets the group draw from a diversified pool and blend purchase options:

resource "aws_autoscaling_group" "app" {
  name                      = "app"
  min_size                  = 6
  max_size                  = 60
  desired_capacity          = 6
  vpc_zone_identifier       = var.private_subnet_ids
  health_check_type         = "ELB"
  health_check_grace_period = 90

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 2  # always-on floor
      on_demand_percentage_above_base_capacity = 25 # 25% OD / 75% Spot above the floor
      spot_allocation_strategy                 = "price-capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }
      override { instance_type = "m6i.large" }
      override { instance_type = "m6a.large" }
      override { instance_type = "m5.large" }
      override { instance_type = "m5n.large" }
    }
  }
}

Allocation strategy: the lever that matters

The allocation strategy decides how the ASG picks Spot capacity pools (an instance-type × AZ combination). Pick wrong and you either park the whole group in the one pool about to be reclaimed, or pay more than you needed to. Every strategy, side by side:

Strategy | How it picks Spot pools | Interruption risk | Cost | Use when
price-capacity-optimized | Weights pools by spare capacity and price | Low | Low-ish | Default for almost everything
capacity-optimized | Deepest-capacity pools only | Lowest | Higher than price-cap-opt | Reclaims are very costly; price secondary
capacity-optimized-prioritized | Capacity-optimized, but honour your override order | Low | Varies | You have a real type preference
lowest-price | Cheapest N pools | High (shallow pools) | Lowest | Genuinely fault-tolerant batch only
diversified (legacy; an EC2 Fleet / Spot Fleet strategy, not a valid ASG spot_allocation_strategy value) | Spread evenly across all pools | Medium | Medium | Rarely; superseded by price-cap-opt

The On-Demand side of the distribution has its own two knobs, and they compose with the Spot strategy:

Distribution setting | What it does | Default | Typical prod value | Effect
on_demand_base_capacity | Absolute floor of On-Demand instances | 0 | 2–4 | Guarantees a minimum always-on capacity
on_demand_percentage_above_base_capacity | % OD for capacity above the base | 100 | 20–30 | Lower = more Spot savings, more risk
spot_allocation_strategy | How Spot pools are chosen | lowest-price (legacy) | price-capacity-optimized | The resilience lever
spot_instance_pools | # of pools (only for lowest-price) | 2 | n/a with price-cap-opt | Ignored by capacity strategies
spot_max_price | Cap on Spot price | On-Demand price | Leave empty | Setting it too low = no capacity
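
To see how the base and the percentage compose, here is a small arithmetic sketch. The function name `od_spot_split` is hypothetical, and rounding the fractional On-Demand share up is an assumption of this sketch rather than documented ASG behaviour, so confirm the real split on a live group.

```shell
# od_spot_split DESIRED BASE PCT
# Splits DESIRED capacity into On-Demand and Spot counts:
#   the first BASE instances are On-Demand, then PCT percent of the rest.
# ASSUMPTION: a fractional On-Demand share above the base is rounded up.
od_spot_split() (
  desired=$1
  base=$2
  pct=$3
  if [ "$desired" -le "$base" ]; then
    # Everything fits inside the On-Demand floor.
    echo "on_demand=$desired spot=0"
  else
    above=$((desired - base))
    od_above=$(( (above * pct + 99) / 100 ))   # integer ceiling of above * pct / 100
    echo "on_demand=$((base + od_above)) spot=$((above - od_above))"
  fi
)

od_spot_split 6 2 25
od_spot_split 60 2 25
```

With the group defined earlier (base 2, 25% above base), the sketch gives 3 On-Demand and 3 Spot at the minimum of 6, and 17 On-Demand and 43 Spot at the maximum of 60 under the rounding assumption.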

The allocation strategy can only work if you give it pools to choose from. Diversify deliberately, but keep the types roughly fungible behind a load balancer — mixing large and 2xlarge skews per-instance load unless you set capacity weights. How to choose your override set:

Diversification axis | Minimum for resilience | Why | Gotcha if you skip it
Instance types | 4+ | More Spot pools to draw from | One exhausted pool stalls scale-out
Instance families | 2+ (m6i,m6a,m5) | Decorrelate reclaim events | Same family can be reclaimed together
AZs (vpc_zone_identifier) | 3 | AZ-level capacity isolation | Dropping to 2 AZs cuts your pool count by a third
Generations | 1–2 | Newer = cheaper/better, older = available | All-newest can be capacity-thin
Sizes | Same size, or set weights | Fungible load per instance | Mixed sizes skew LB distribution

Rule of thumb: diversify across at least four instance types and three AZs before tuning anything else.
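
Since a Spot pool is one instance type in one AZ, the rule of thumb reduces to simple counting. The sketch below applies it; `diversification_check` is a hypothetical helper name, and the thresholds are the ones stated above.

```shell
# diversification_check TYPES AZS
# Counts Spot pools (instance types x AZs) and applies the rule of thumb:
# at least 4 instance types and at least 3 AZs.
# Hypothetical helper for illustration only.
diversification_check() (
  types=$1
  azs=$2
  pools=$((types * azs))
  if [ "$types" -ge 4 ] && [ "$azs" -ge 3 ]; then
    echo "ok: $pools pools"
  else
    echo "thin: $pools pools (want >=4 types, >=3 AZs)"
  fi
)

diversification_check 4 3
diversification_check 2 2
```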

You can also let AWS pick types for you with attribute-based instance selection — specify vCPU/memory ranges and the ASG enumerates matching types:

override {
  instance_requirements {
    vcpu_count {
      min = 2
      max = 4
    }
    memory_mib {
      min = 7168
      max = 16384
    }
    instance_generations = ["current"]
  }
}

Approach | Pros | Cons | Use when
Explicit override list | Predictable, reviewed | Manual to maintain | You know your fungible set
instance_requirements (ABS) | Huge pool, future-proof | Can pull surprising types | You want max Spot capacity breadth

Scaling policies: target tracking, step, and predictive

Three policy types, and they compose. The right web-tier setup is usually a target-tracking policy on a load-correlated metric, optionally augmented by predictive scaling for cyclical demand, with step scaling reserved for asymmetric responses.

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name app \
  --policy-name tt-requests-per-target \
  --policy-type TargetTrackingScaling \
  --estimated-instance-warmup 90 \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/50dc6c495c0c9188/targetgroup/app-tg/943f017f100becff"
    },
    "TargetValue": 1000.0
  }'
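
As a rough mental model (not the service's actual algorithm, which also involves alarms, warmup and scale-in damping), target tracking resizes the group in proportion to how far the metric sits from its target. The sketch below assumes integer metric values; `tt_desired` is a hypothetical name.

```shell
# tt_desired CURRENT METRIC TARGET MIN MAX
# Simplified proportional model of target tracking:
#   desired = ceil(CURRENT * METRIC / TARGET), clamped to [MIN, MAX].
# ASSUMPTION: integer inputs; the real service applies further smoothing.
tt_desired() (
  current=$1
  metric=$2
  target=$3
  min=$4
  max=$5
  want=$(( (current * metric + target - 1) / target ))   # integer ceiling
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
)

# 6 instances each seeing 1500 requests against a target of 1000
tt_desired 6 1500 1000 6 60
```

In this model, six instances at 1500 requests per target against a target of 1000 resolve to nine instances, and the group's min and max bound the result in both directions.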

Policy types compared

Policy type | How it decides | Best metric | Reacts | Use when | Avoid when
Target tracking | Holds a metric at a target | ALBRequestCountPerTarget | After breach + warmup | Steady-state web/API tier | You need asymmetric response
Step scaling | Add/remove N per breach band | CPU, custom | After breach | Aggressive/asymmetric scaling | Simple steady demand
Simple scaling (legacy) | One adjustment + cooldown | any | After breach + cooldown | Never (superseded) | Always; use step instead
Predictive | Forecasts ahead of demand | CPU, request count, custom | Before demand | Cyclical/recurring traffic | Spiky/random load
Scheduled action | Fixed capacity at a time | n/a (time-based) | At the scheduled time | Known events (sale launch) | Unpredictable demand

Predefined target-tracking metrics

Predefined metric type | What it tracks | Good for | Caveat
ASGAverageCPUUtilization | Mean CPU across the group | CPU-bound work | Lags; conflates GC/IO with demand
ALBRequestCountPerTarget | Requests per healthy target | Web/API tiers | Needs the ALB ResourceLabel
ASGAverageNetworkIn | Bytes in per instance | Ingest-bound | Noisy; rarely the true driver
ASGAverageNetworkOut | Bytes out per instance | Egress-bound | Same
Custom metric spec | Any CloudWatch metric | Queue depth, p95 latency | You own the math/aggregation

EstimatedInstanceWarmup (or the group-level default instance warmup) is the single most-overlooked field. It tells the ASG to ignore a freshly launched instance’s metrics until it has warmed up, so you don’t double-scale while new capacity boots. Set it to your real time-to-ready, not zero. The warmup-related settings that interact:

Setting | Scope | What it does | Default | Set it to
EstimatedInstanceWarmup | Per policy | Ignore new-instance metrics this long | (falls back to default) | Real time-to-ready
default_instance_warmup | Group | Default warmup for all policies + refresh | 0 (if unset) | Real time-to-ready (set once)
Cooldown (simple scaling) | Per policy | Wait after a scaling action | 300 s | Avoid simple scaling
metrics_granularity | Group | 1-min vs 5-min group metrics | 1 min | Keep 1 min
Health check grace period | Group | Amnesty before health checks count | 0 / 300 | Boot-to-healthy + margin

Predictive scaling is best run in ForecastOnly mode for a week first, then flipped to ForecastAndScale once you trust the forecast — and paired with a target-tracking policy that handles the unpredicted remainder.

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name app \
  --policy-name predictive-cpu \
  --policy-type PredictiveScaling \
  --predictive-scaling-configuration '{
    "MetricSpecifications": [{
      "TargetValue": 50,
      "PredefinedMetricPairSpecification": {"PredefinedMetricType": "ASGCPUUtilization"}
    }],
    "Mode": "ForecastOnly",
    "SchedulingBufferTime": 300
  }'

Predictive setting | What it does | Default | Note
Mode | ForecastOnly vs ForecastAndScale | ForecastOnly | Always observe first
SchedulingBufferTime | Launch this many seconds early | 300 s | Cover boot time before demand
MaxCapacityBreachBehavior | Allow exceeding max? | HonorMaxCapacity | Or IncreaseMaxCapacity
MaxCapacityBuffer | Headroom % above forecast | 10 | Only with IncreaseMaxCapacity

Warm pools: paying down cold-start latency

Target tracking is reactive — it reacts after the metric breaches, and the new instance still has to boot, pull containers, JIT-warm, and pass health checks. If that takes four minutes, a sharp spike is four minutes of degraded service. A warm pool is a pre-initialized reserve of instances held in Stopped (or Hibernated, or Running) state, already past the expensive bootstrap. On scale-out the ASG starts a stopped instance instead of launching from scratch — seconds instead of minutes.

aws autoscaling put-warm-pool \
  --auto-scaling-group-name app \
  --pool-state Stopped \
  --min-size 4 \
  --max-group-prepared-capacity 20 \
  --instance-reuse-policy '{"ReuseOnScaleIn": true}'

resource "aws_autoscaling_group" "app" {
  # ... as above ...
  warm_pool {
    pool_state                  = "Stopped"
    min_size                    = 4
    max_group_prepared_capacity = 20
    instance_reuse_policy { reuse_on_scale_in = true }
  }
}

Pool state: the cost/latency trade

State choice drives the cost/latency trade:

Pool state | Resume latency | EBS cost | EC2 cost while warm | Use when
Stopped | Seconds | Yes (volumes) | None | Default. Bootstrap is expensive, RAM state is not needed
Hibernated | Fast, RAM restored | Yes (incl. RAM-to-disk) | None | App has long in-memory warmup (large caches, JIT)
Running | Near-instant | Yes | Yes | Latency is critical and you’ll eat the compute cost

Every warm-pool setting

Setting | What it does | Default | When to change | Gotcha
pool_state | Stopped/Hibernated/Running | Stopped | Latency/cost trade | Hibernate needs encrypted root + supported type
min_size | Min instances kept warm | 0 | Size to spike-rate gap | Too small = no benefit on a real spike
max_group_prepared_capacity | Cap on warm + in-service prepared | max_size | Bound the reserve cost | Counts toward prepared, not desired
instance_reuse_policy.reuse_on_scale_in | Return scaled-in instances to pool | false | Cost optimization | App must tolerate stop/resume cleanly

The two details that bite people

First, the warm-pool transition runs your lifecycle hooks. An instance entering the pool fires autoscaling:EC2_INSTANCE_LAUNCHING, and leaving it (into service) fires its own transition, so your bootstrap automation must know which phase it’s in (LifecycleState is Warmed:Pending vs Pending). Second, ReuseOnScaleIn returns scaled-in instances to the pool instead of terminating them, which is great for cost but means your app must tolerate being stopped and resumed cleanly. Size min-size to cover the gap between your spike rate and your real launch time, not your whole peak.

The phases an instance moves through, and what your bootstrap must do in each:

Phase / state | Hook fired | What user-data / hook should do | Common mistake
Entering pool | EC2_INSTANCE_LAUNCHING (Warmed:Pending) | Full expensive bootstrap (pull image, JIT-warm) | Registering with LB here (it’s not in service)
In pool | none | Nothing (stopped/hibernated) | Assuming it’s serving traffic
Leaving pool → service | EC2_INSTANCE_LAUNCHING (Pending) | Light re-init only (refresh creds, re-register) | Re-running the full bootstrap (slow)
Scaled in (with reuse) | terminate hook then back to pool | Drain, then expect a stop | Treating it as a permanent termination

Detect the phase from the hook payload, from the instance's current lifecycle state, or from the target lifecycle state that IMDS exposes (autoscaling/target-lifecycle-state):

# In the launch hook handler: branch on the transition origin
STATE=$(aws autoscaling describe-auto-scaling-instances \
  --instance-ids "$INSTANCE_ID" \
  --query 'AutoScalingInstances[0].LifecycleState' --output text)
# While the hook holds the instance, the state carries a :Wait suffix:
# "Warmed:Pending:Wait" -> full bootstrap for the pool
# "Pending:Wait"        -> light re-init, this one is going into service
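
The branch itself can be a plain `case` on that state string. The sketch below adds one assumption that is not in the text above: a marker file (hypothetical path, passed as the second argument) written at the end of a full bootstrap, so that a cold launch straight into service, which also reports Pending, still gets the full bootstrap. `bootstrap_mode` is an illustrative name.

```shell
# bootstrap_mode STATE MARKER_FILE
# Decides which bootstrap to run from the ASG lifecycle state.
#   Warmed:Pending*  -> entering the warm pool: full bootstrap
#   Pending*         -> going into service: light re-init if this instance was
#                       already bootstrapped (marker exists), else full
# ASSUMPTION: the full bootstrap touches MARKER_FILE when it finishes.
bootstrap_mode() (
  state=$1
  marker=$2
  case "$state" in
    Warmed:Pending*) echo "full" ;;
    Pending*)
      if [ -e "$marker" ]; then echo "light"; else echo "full"; fi ;;
    *) echo "skip" ;;
  esac
)

bootstrap_mode "Warmed:Pending:Wait" /var/lib/app/bootstrapped
```

The prefix patterns match the state with or without the :Wait suffix, so the same function works whether it runs inside the hook window or just before it.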

Sizing a warm pool

The right min_size covers the gap between how fast demand arrives and how fast a cold launch can answer it. A worked rule:

Input | Example value | Role in sizing
Worst observed surge rate | +30 instances in 2 min | Demand side
Cold launch time-to-ready | 3.5 min | Why you can’t launch in time
Warm resume time | ~20 s | Why the pool helps
Warm min_size | ≥ surge over (cold − warm) window | Cover the deficit, not the whole peak
Cost of the reserve | EBS for min_size stopped vols | The bill you pay for the headroom
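
One way to turn the table into a number is sketched below. It reads "surge over the (cold − warm) window" as: the instances the surge demands during the time a cold launch lags behind a warm resume, assuming the surge arrives at a steady rate and stops after its observed duration. That interpretation, and the name `warm_pool_min_size`, are assumptions of this sketch.

```shell
# warm_pool_min_size SURGE_INSTANCES SURGE_SECONDS COLD_SECONDS WARM_SECONDS
# Instances demanded during the (cold - warm) deficit window.
# ASSUMPTION: the surge arrives linearly over SURGE_SECONDS and then stops,
# so the window is capped at the surge duration.
warm_pool_min_size() (
  surge_instances=$1
  surge_seconds=$2
  cold_seconds=$3
  warm_seconds=$4
  window=$((cold_seconds - warm_seconds))
  if [ "$window" -lt 0 ]; then window=0; fi
  if [ "$window" -gt "$surge_seconds" ]; then window=$surge_seconds; fi
  # integer ceiling of surge_instances * window / surge_seconds
  echo $(( (surge_instances * window + surge_seconds - 1) / surge_seconds ))
)

# Table values: +30 instances in 120 s, cold 210 s (3.5 min), warm 20 s
warm_pool_min_size 30 120 210 20
```

With the table's values the deficit window (190 s) outlasts the two-minute surge, so the whole surge of 30 has to come from the pool. If cold launch took only 80 s, the same surge would need a pool of 15.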

Lifecycle hooks: clean drain and safe bootstrap

By default the ASG terminates an instance the instant it decides to scale in — mid-request, mid-job, mid-flush. Lifecycle hooks insert a wait state into the transition and hand you a window to act before the instance proceeds.

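There are two hook types: a launch hook (autoscaling:EC2_INSTANCE_LAUNCHING) holds a new instance in Pending:Wait while it bootstraps, and a terminate hook (autoscaling:EC2_INSTANCE_TERMINATING) holds a departing instance in Terminating:Wait while it drains. A terminate hook with a five-minute window: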

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name app \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 300 \
  --default-result CONTINUE

resource "aws_autoscaling_lifecycle_hook" "drain" {
  name                   = "drain-on-terminate"
  autoscaling_group_name = aws_autoscaling_group.app.name
  lifecycle_transition   = "autoscaling:EC2_INSTANCE_TERMINATING"
  heartbeat_timeout      = 300
  default_result         = "CONTINUE"
}

Every lifecycle-hook setting

Setting | What it does | Default | Launch hook | Terminate hook
lifecycle_transition | Which transition to intercept | (none) | EC2_INSTANCE_LAUNCHING | EC2_INSTANCE_TERMINATING
heartbeat_timeout | Seconds the instance waits | 3600 | Bootstrap budget | > deregistration delay
default_result | What happens if you never call back | ABANDON | usually ABANDON | usually CONTINUE
notification_target_arn | Where the event is sent (SNS/SQS) | EventBridge default | optional | optional
role_arn | Role to publish to the target | (none) | with SNS/SQS target | with SNS/SQS target
notification_metadata | Extra payload data | (none) | optional | optional

default-result: a safety decision, not a formality

--default-result is the behaviour when your automation never reports back. For a terminating hook, CONTINUE means “if my drain logic never reports back, proceed with termination anyway” — correct, because a stuck drain shouldn’t pin a dying instance forever. For a launching hook, ABANDON is usually right: a bootstrap that never signals success should be thrown away, not put into service. The instance stays in the wait state until you call complete-lifecycle-action or the heartbeat times out (extendable with record-lifecycle-action-heartbeat).

Hook type | default_result | Meaning if no callback | Why
Terminating | CONTINUE | Terminate anyway after timeout | A stuck drain must not pin a dying node forever
Terminating | ABANDON | Also terminates (no resume) | Rarely different for terminate
Launching | ABANDON | Throw the instance away | A bootstrap that never succeeds is unfit
Launching | CONTINUE | Put it in service anyway | Dangerous: serves traffic un-bootstrapped

Wire the hook to an EventBridge rule and a small handler. A drain runbook on the instance via SSM:

# Triggered by the EC2_INSTANCE_TERMINATING event; runs on the instance.
# 1. Deregister from the target group so the ALB stops sending new requests.
aws elbv2 deregister-targets \
  --target-group-arn "$TG_ARN" \
  --targets Id="$INSTANCE_ID"

# 2. Wait out deregistration_delay so in-flight requests finish.
aws elbv2 wait target-deregistered \
  --target-group-arn "$TG_ARN" \
  --targets Id="$INSTANCE_ID"

# 3. Tell the ASG it's safe to terminate now (don't wait for the timeout).
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name app \
  --lifecycle-action-result CONTINUE \
  --instance-id "$INSTANCE_ID"

If a drain legitimately needs longer than the heartbeat (a long job finishing), extend the clock instead of letting it expire:

aws autoscaling record-lifecycle-action-heartbeat \
  --lifecycle-hook-name drain-on-terminate \
  --auto-scaling-group-name app \
  --instance-id "$INSTANCE_ID"

The hook actions you drive

Action / command Purpose When
complete-lifecycle-action Release the wait state now Drain/bootstrap finished
record-lifecycle-action-heartbeat Reset the timeout clock Work needs more time
--lifecycle-action-result CONTINUE Proceed with the transition Normal completion
--lifecycle-action-result ABANDON Abandon (terminate the launch) Bootstrap failed
--lifecycle-action-token UUID identifying one specific lifecycle action Alternative to --instance-id; targets exactly one action

Set the hook’s heartbeat-timeout comfortably above the target group’s deregistration_delay.timeout_seconds (default 300s). If the hook times out before drain completes, the instance is killed mid-flight and you’ve gained nothing. The timing relationship that must hold:

Timer Default Relationship If violated
Target group deregistration_delay 300 s Baseline drain time Connections cut mid-flight
Hook heartbeat_timeout 3600 s > deregistration delay Instance killed before drain done
ELB connection idle timeout 60 s < deregistration delay Idle conns closed first (fine)
Spot interruption window ~120 s Drain must fit or be partial Reclaim before drain → use rebalance
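The ordering in the table can be checked mechanically before a hook ships. The following is a sketch only: `check_drain_timers` is a hypothetical helper, and it takes the three timers as plain numbers rather than reading them from AWS.

```shell
# Sketch: verify the drain timers from the table are ordered correctly.
# Arguments (seconds): deregistration delay, hook heartbeat timeout, ELB idle timeout.
check_drain_timers() {
  local dereg=$1 heartbeat=$2 idle=$3
  if [ "$heartbeat" -le "$dereg" ]; then
    echo "FAIL: heartbeat ${heartbeat}s must exceed deregistration delay ${dereg}s"
    return 1
  fi
  if [ "$idle" -ge "$dereg" ]; then
    echo "WARN: idle timeout ${idle}s is not below deregistration delay ${dereg}s"
  fi
  echo "OK"
}

check_drain_timers 300 330 60   # default delay and idle timeout, 330 s heartbeat: prints OK
```

In a pipeline, the non-zero return on a too-short heartbeat could gate the deploy of the hook configuration.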

Instance refresh: rolling AMI and template updates

You baked a new AMI. The wrong way to ship it is to bump desired_capacity and pray, or to terminate instances by hand. Instance refresh rolls the fleet to the current launch template version in controlled batches, replacing instances while honoring health checks and your minimum healthy percentage.

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name app \
  --strategy Rolling \
  --desired-configuration '{
    "LaunchTemplate": {
      "LaunchTemplateId": "lt-0abc123",
      "Version": "$Latest"
    }
  }' \
  --preferences '{
    "MinHealthyPercentage": 90,
    "MaxHealthyPercentage": 110,
    "InstanceWarmup": 120,
    "ScaleInProtectedInstances": "Wait",
    "StandbyInstances": "Wait",
    "CheckpointPercentages": [25, 50, 100],
    "CheckpointDelay": 600
  }'

Every instance-refresh preference

The preferences are the whole game — enumerate them:

Preference What it does Default Set it to Effect / gotcha
MinHealthyPercentage Floor of healthy capacity during refresh 90 90 Lower = faster, riskier
MaxHealthyPercentage Ceiling that enables surge 100 110+ >100 launches before terminating (no dip)
InstanceWarmup Healthy-for-this-long before counting default_instance_warmup Real time-to-ready Same value as scaling warmup
CheckpointPercentages Pause points (e.g. [25,50,100]) (none) Canary thresholds ending in 100 Each is a bake gate; the last value must be 100 or the refresh stops short of the full fleet
CheckpointDelay Seconds to pause at each checkpoint 3600 600 Watch dashboards during the pause
ScaleInProtectedInstances How to treat scale-in-protected nodes (Refresh / Ignore / Wait) Wait Wait Wait pauses for protection to be removed and fails the refresh after about an hour
StandbyInstances How to treat instances in Standby (Terminate / Ignore / Wait) Wait Wait Wait pauses for them to return to service, with the same one-hour limit
SkipMatching Skip instances already on target false true Avoids replacing already-current nodes
AutoRollback Revert on alarm/failure false true Needs a desired configuration with a numbered launch template version ($Latest/$Default cannot be rolled back)
AlarmSpecification.Alarms CloudWatch alarms that trip rollback (none) your 5xx/latency alarm The auto-revert trigger
MaxHealthyPercentage + InstanceWarmup Together set surge speed (n/a) Tune together Too tight = slow, too loose = cost

The preferences that matter, in prose

Refresh strategy and the surge math

Min / Max healthy Behaviour Capacity during refresh Speed Use when
90 / 100 Terminate first, then replace Dips to 90% Slower Cost-sensitive, can tolerate a dip
90 / 110 Surge: launch then terminate Never below 100% Faster, briefly hot Production zero-downtime default
100 / 150 Aggressive surge Up to 150% briefly Fastest Need speed, tolerate the extra cost
50 / 100 Replace half at a time Dips to 50% Fast Only for tolerant, stateless tiers
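The surge math in the table reduces to two instance counts: the fewest healthy instances the refresh may leave running and the most it may run at once. The sketch below uses a hypothetical helper, `refresh_bounds`, and assumes conservative rounding (floor rounded up, ceiling rounded down); the service's own rounding of fractional results may differ.

```shell
# Sketch: instance-count bounds during a rolling refresh.
# Assumption: the healthy floor rounds up and the surge ceiling rounds down.
refresh_bounds() {
  local desired=$1 min_pct=$2 max_pct=$3
  local floor=$(( (desired * min_pct + 99) / 100 ))   # fewest healthy instances allowed
  local ceiling=$(( desired * max_pct / 100 ))        # most instances allowed at once
  echo "$floor $ceiling"
}

refresh_bounds 10 90 100    # 9 10  -> terminate first, capacity dips to 9
refresh_bounds 10 90 110    # 9 11  -> room to launch 1 extra before terminating
refresh_bounds 10 100 150   # 10 15 -> aggressive surge
refresh_bounds 10 50 100    # 5 10  -> half the fleet at a time
```

When the ceiling equals the desired capacity there is no room to launch first, which is why a 100% maximum forces the dip described in the first row.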

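Better still, attach alarm-based rollback so a CloudWatch alarm trips an automatic revert to the previous configuration. Rollback is only supported when the refresh names a desired configuration with a numbered launch template version ($Latest and $Default cannot be rolled back); version 7 below is a placeholder:

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name app \
  --desired-configuration '{
    "LaunchTemplate": { "LaunchTemplateId": "lt-0abc123", "Version": "7" }
  }' \
  --preferences '{
    "MinHealthyPercentage": 90,
    "AutoRollback": true,
    "AlarmSpecification": { "Alarms": ["app-5xx-high"] }
  }'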

In Terraform, an instance_refresh block on the ASG triggers a refresh automatically whenever the launch template version changes, which makes “update AMI” a normal apply:

resource "aws_autoscaling_group" "app" {
  # ...
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
      max_healthy_percentage = 110
      instance_warmup        = 120
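      checkpoint_percentages = [25, 50, 100] # end at 100 or the refresh stops at 50%
      checkpoint_delay       = 600
      auto_rollback          = true # needs a numbered launch template version, not $Latest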
      alarm_specification { alarms = ["app-5xx-high"] }
    }
    triggers = ["tag"] # also refresh on tag changes, not just LT version
  }
}

Monitoring and aborting a refresh

Monitor and, if needed, abort:

aws autoscaling describe-instance-refreshes --auto-scaling-group-name app \
  --query 'InstanceRefreshes[0].[Status,PercentageComplete,StatusReason]' --output text

aws autoscaling cancel-instance-refresh --auto-scaling-group-name app

Cancellation stops further replacements but does not roll back instances already replaced — AutoRollback does. The refresh statuses you will see and what each means:

Status Meaning Your move
Pending Accepted, not started Wait
InProgress Replacing instances, or paused at a checkpoint Watch dashboards
Successful Whole fleet on the new config Done
Cancelling / Cancelled You aborted Already-replaced stay new
RollbackInProgress Reverting to previous config Alarm tripped or you rolled back
RollbackSuccessful Fleet back on old config Investigate the bad build
RollbackFailed Revert itself failed Manual intervention needed
Failed Could not maintain healthy % Check warmup, health, grace period
Baking Replacements finished; the configured post-refresh bake time is running Final soak window; watch alarms
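A script that waits on a refresh needs to know which of these statuses are final. The following is a sketch under that reading of the table: `refresh_is_terminal` and `wait_for_refresh` are hypothetical helper names, and the polling function is only defined here, not run, because it needs AWS credentials and a real group.

```shell
# Sketch: decide whether an instance refresh has reached a final state.
refresh_is_terminal() {
  case "$1" in
    Successful|Failed|Cancelled|RollbackSuccessful|RollbackFailed) return 0 ;;
    *) return 1 ;;   # Pending, InProgress, Cancelling, RollbackInProgress, Baking
  esac
}

# Poll the most recent refresh of group "$1" every "$2" seconds (default 30)
# and print the final status. Defined only; not invoked in this sketch.
wait_for_refresh() {
  while :; do
    status=$(aws autoscaling describe-instance-refreshes \
      --auto-scaling-group-name "$1" \
      --query 'InstanceRefreshes[0].Status' --output text)
    if refresh_is_terminal "$status"; then
      echo "$status"
      return 0
    fi
    sleep "${2:-30}"
  done
}
```

A caller might run `wait_for_refresh app 15` after starting a refresh and branch on the printed status. A group with no refresh history would keep this loop polling, so a real script would likely add a timeout.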

Spot blends, rebalance recommendations, and interruption handling

Running 75% Spot only works if interruptions are choreographed, not endured. There are two signals, and only the interruption notice carries a guaranteed two-minute warning:

The two Spot signals

Signal Timing Certainty Delivery What to do
Rebalance recommendation Earlier, best-effort “Elevated risk” EventBridge / metadata Launch replacement, begin drain early
Interruption notice (instance-action) ~2 min before Definite Metadata / EventBridge Last-chance drain; finish fast
Termination (Spot) At T-0 Happening (none) Instance is reclaimed

Turn on Capacity Rebalancing so the ASG acts on the rebalance recommendation: it launches a replacement proactively and lets you drain the at-risk instance before the two-minute gun even fires.

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name app \
  --capacity-rebalance

Pair it with a termination lifecycle hook (above) so the drain on a rebalance/interruption follows the exact same deregister-and-wait path as a normal scale-in. The on-instance agent should watch for both signals:

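# Poll IMDSv2 for both Spot signals from a sidecar/systemd unit.
IMDS=http://169.254.169.254/latest
TOKEN=$(curl -s -X PUT "$IMDS/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
# Both endpoints return 404 until a signal exists, so test the status code
# (a plain `curl -s` would capture the 404 error page as a non-empty body).
for SIGNAL in spot/instance-action events/recommendations/rebalance; do
  CODE=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $TOKEN" "$IMDS/meta-data/$SIGNAL")
  [ "$CODE" = "200" ] && echo "$SIGNAL present: begin connection draining"
done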

Interruption-handling options compared

Mechanism What it does Effort Best for Limit
Capacity Rebalancing ASG replaces at-risk instances early One flag All ASG Spot fleets Replacement still needs capacity
Terminate lifecycle hook Drains on reclaim like a scale-in Hook + handler Stateful/latency-sensitive Drain must fit the window
IMDS instance-action poll Per-instance last-chance drain Sidecar/systemd Custom drain logic Only ~2 min
AWS Node Termination Handler Cordon/drain k8s node on both signals Helm install Kubernetes on EC2 k8s-specific
EventBridge rule → Lambda Centralized reaction to signals Rule + function Fleet-wide automation Adds a moving part

If you run Kubernetes on these nodes, don’t hand-roll this — the AWS Node Termination Handler consumes both signals and cordons/drains the node for you. The principle is identical: convert a hardware-level warning into a graceful application drain.

Health checks, ELB integration, and termination policies

Two independent health verdicts decide whether an instance lives: EC2 status checks (is the VM alive?) and ELB health checks (does the app respond?). Set health_check_type = "ELB" or your ASG will happily keep a booted-but-broken instance in rotation because the hypervisor is fine while your process is crash-looping.

Health check types

Health check type Checks Catches Misses Use when
EC2 (default) VM/hypervisor status Dead instance, failed status checks App crash-loop, port not bound Never alone for a served tier
ELB Target group health probe App not responding, bad port Nothing the probe doesn’t test Any LB-fronted tier
Custom (set-instance-health) Your own signal App-specific health Whatever you don’t report Bespoke health logic
EBS (attached) Volume reachability Impaired EBS App-level issues Volume-sensitive workloads

The health_check_grace_period is the launch-time amnesty: how long after an instance starts before health checks count against it. Too short and the ASG kills instances that simply haven’t finished booting, producing a launch/terminate thrash loop. Set it to your boot-to-healthy time plus margin.

Grace period vs boot time Result
Grace < boot-to-healthy Thrash: ASG kills instances mid-boot, relaunches, repeats
Grace ≈ boot-to-healthy Borderline; transient blips can still evict
Grace = boot-to-healthy + margin Correct: real failures caught, boots survive
Grace far too long Slow to evict genuinely broken instances

For scale-in, termination policies decide who dies. The default is sensible (allocation-strategy alignment, then oldest launch template/config, then closest to the next billing hour, balanced across AZs), but custom policies matter during rollouts.

Termination policies

Policy Sheds Pairs with Use when
Default Balanced AZ → oldest LT → near billing hour General use Steady-state
OldestInstance The stalest instance AMI hygiene Always shed oldest capacity
NewestInstance The most recent instance Testing/rollback Undo a bad recent launch
OldestLaunchTemplate Old-template instances first Rollouts Converge on the new version on scale-in
OldestLaunchConfiguration Old launch-config first Legacy (LC) groups Migrating off launch configs
ClosestToNextInstanceHour Best billing efficiency Cost focus Maximize per-hour value (less relevant post per-second billing)
AllocationStrategy Realigns to the Spot strategy Mixed instances Keep the fleet optimally distributed

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name app \
  --termination-policies "OldestLaunchTemplate" "Default" \
  --default-instance-warmup 90

default-instance-warmup set here at the group level becomes the default EstimatedInstanceWarmup for every policy and refresh — set it once, correctly, and stop repeating yourself. Use instance scale-in protection for nodes you can’t lose mid-task (a long-running consumer draining a queue), and let the termination policy route around them.

# Protect a node doing non-resumable work from scale-in
aws autoscaling set-instance-protection \
  --auto-scaling-group-name app \
  --instance-ids i-0abc123 \
  --protected-from-scale-in

Architecture at a glance

The diagram traces a request and the ASG’s control loop together, left to right, then marks each transition where a failure class bites. Read it as four zones. Traffic enters at an ALB that distributes to a target group of healthy instances — this is where the drain contract lives, because deregistration is how an instance stops taking new requests. The ASG control plane zone holds the group itself plus its scaling policies and CloudWatch alarms; this is the loop that watches the metric, decides desired_capacity, and fires the lifecycle transitions. The lifecycle zone is the heart of the system: a launching instance can sit in Pending:Wait for a bootstrap hook, a terminating one in Terminating:Wait for a drain hook, and instance refresh drives rolling replacement through these same states. Finally the warm pool zone holds the Warmed:Stopped reserve that feeds fast scale-out, drawing instances from mixed Spot/On-Demand capacity across AZs.

Follow the numbered badges to read the failure map. Each one sits on the exact hop where it bites: a warm pool sized too small (1) means scale-out falls back to cold launches and the spike goes unanswered; a terminate hook whose heartbeat is shorter than the target group’s deregistration delay (2) cuts connections mid-flight; an instance refresh with MaxHealthyPercentage = 100 and no surge (3) dips capacity during the rollout; a Spot reclaim without Capacity Rebalancing (4) drops in-flight work; and a health_check_type = EC2 (5) leaves a booted-but-broken instance serving 5xx because the hypervisor is fine. The legend narrates each as symptom · how to confirm · fix. The method is always the same: localise the problem to a transition, read the badge, run the named aws command, apply the fix.

EC2 Auto Scaling architecture showing traffic from an Application Load Balancer into a target group of healthy instances, the ASG control plane with scaling policies and CloudWatch alarms driving desired capacity, the instance lifecycle with Pending:Wait bootstrap and Terminating:Wait drain hooks plus rolling instance refresh, and a warm pool of Warmed:Stopped instances fed by mixed Spot and On-Demand capacity across availability zones — with numbered failure points marking an undersized warm pool, a drain hook heartbeat shorter than the deregistration delay, a no-surge instance refresh that dips capacity, a Spot reclaim without capacity rebalancing dropping work, and an EC2-only health check leaving a booted-but-broken instance in rotation

Real-world scenario

Cohort Pay runs its card-authorization API on an ASG behind an ALB: a JVM service (Spring Boot) on m6i.large, target tracking on CPU, 100% On-Demand, in ap-south-1 across three AZs. Steady traffic is ~600 requests/second with a 7pm spike to ~2,200 rps when a partner merchant runs daily promotions. The platform team is five engineers; the monthly EC2 spend is about ₹9.4 lakh. Two problems collided.

First, the JVM service took ~3.5 minutes from launch to warm — config fetch from Parameter Store, connection-pool priming to the HSM and the database, and JIT compilation of the hot authorization path. Every 7pm spike meant minutes of elevated p99 latency and a scatter of 5xx while target tracking spun up cold capacity that wasn’t ready to serve. The on-call reflex — raise the CPU target or add a step policy — only made the ASG launch more cold instances faster, overshooting and then scaling back in, a thrash that never closed the latency gap.

Second, finance wanted the ~58% cost reduction Spot would bring, but the risk team had a hard, audited rule: an authorization request in flight must never be killed by an infrastructure event. Naive Spot was a non-starter — a reclaim mid-auth was exactly the failure they were chartered to prevent. The two requirements looked contradictory: go cheaper with Spot, but never drop a request when Spot (or anything) reclaims a node.

The fix combined four of the controls above. They added a Stopped warm pool with min_size sized to their worst observed surge (about +18 instances over two minutes) so scale-out resumed pre-initialized instances in ~20 seconds instead of cold-launching for 3.5 minutes: the disk-level bootstrap happened once, in the background, not on the critical path. (A Stopped instance keeps its disk but not its memory, so JIT-compiled code and primed connection pools are rebuilt on resume; preserving those would take a Hibernated pool.) They moved to a mixed instances policy at on_demand_base_capacity = 6 with 30% On-Demand above the base and price-capacity-optimized Spot across five m6i/m6a/m5 sizes in three AZs. Crucially, they enforced the no-killed-request rule with Capacity Rebalancing plus a terminating lifecycle hook that deregistered the instance from the ALB target group and waited out the full deregistration_delay before completing the action, so Spot reclaims, rebalance recommendations, and normal scale-in all drained through one identical path.

# The contract that satisfied the risk team: never complete termination
# until the ALB has stopped routing and in-flight auths have finished.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name auth-drain \
  --auto-scaling-group-name payments-authz \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 330 \
  --default-result CONTINUE   # 330s > the 300s deregistration delay, with margin

AMI patching moved to instance refresh with MaxHealthyPercentage = 110 (surge, so capacity never dipped below 100%), checkpoints at 25% and 50% with a 10-minute bake each, and AutoRollback wired to their 5xx alarm — so a bad build paused at the first checkpoint and reverted itself instead of paging anyone. The timeline of the migration tells the story:

Phase Change Symptom before Result after
Week 0 Baseline: CPU target tracking, 100% OD p99 spikes to seconds on every 7pm surge (baseline, no change yet)
Week 1 Add Stopped warm pool (min_size 18) 3.5-min cold launches during spike Scale-out in ~20 s; p99 flat
Week 2 Mixed instances, 30% OD above base All-OD cost (₹9.4L/mo) Spot blend, savings begin
Week 3 Capacity Rebalancing + terminate drain hook Reclaim could drop an in-flight auth Every reclaim drains cleanly
Week 4 Instance refresh + checkpoints + AutoRollback Manual, risky AMI rollouts Canary + auto-revert on 5xx
Steady (all of the above) (n/a) p99 flat, cost −55%, 0 dropped auths in 12 mo

Net result: p99 latency during surges dropped from seconds to flat, compute cost fell ~55% (to about ₹4.2 lakh/month), and in twelve months of Spot interruptions not one authorization request was dropped. The lesson on the wall: “Scaling policy is the easy 10%. The transitions — warm, drain, refresh, rebalance — are the 90% that actually keeps you up.”

Advantages and disadvantages

The state-machine model of EC2 Auto Scaling is what makes warm pools, clean drains, and reversible rollouts possible — but every one of those controls is a knob you must turn, and the defaults are tuned for simplicity, not for production safety. Weigh it honestly:

Advantages (why these controls help you) Disadvantages (why they bite)
Warm pools collapse scale-out from minutes to seconds without paying for idle compute (Stopped state) Warm pools add real complexity: user-data runs in two phases, and bootstrap must be pool-aware
Lifecycle hooks give a guaranteed drain/bootstrap window on every transition, Spot reclaim included A hook with no working callback or too-short heartbeat stalls or cuts — a stuck *:Wait is its own incident
Instance refresh makes “ship an AMI” a controlled, abortable, auto-revertible operation Misconfigured (no surge, bad warmup, wrong health type) it can dip capacity or stall mid-fleet
Mixed instances + price-capacity-optimized turn Spot into resilient, cheap capacity Spot still requires interruption choreography; lowest-price without it drops work in waves
Capacity Rebalancing converts reclaims into graceful, pre-warned drains It launches replacements early, briefly running hot and costing a little more
ELB health checks evict booted-but-broken instances automatically Defaults are unsafe: EC2 health type, 0/300 grace, no warm pool, instant termination
Termination policies let you shed the right capacity (oldest/old-template) during rollouts Wrong policy kills new instances during a deploy, or non-resumable work without protection
Predictive scaling pre-empts cyclical demand before it arrives Useless (or harmful) on spiky/random load; needs weeks of clean history to trust

The model is right for any EC2 tier that must absorb spikes, capture Spot savings, or roll AMIs without downtime. It bites hardest on teams that adopt one control without its partner — a warm pool with pool-unaware bootstrap, Spot without rebalancing, an instance refresh without surge or rollback. The disadvantages are all manageable, but only if you know the transition each control owns, which is the point of this article.

Hands-on lab

Stand up an ASG behind an ALB, add a warm pool and a terminate drain hook, then run a no-op instance refresh and watch it surge through the lifecycle — all free-tier-friendly (t3.micro; delete at the end). Run with the AWS CLI configured to a sandbox account and a default VPC.

Step 1 — Variables and a security group.

export AWS_DEFAULT_REGION=ap-south-1
VPC=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true --query 'Vpcs[0].VpcId' --output text)
SUBNETS=$(aws ec2 describe-subnets --filters Name=vpc-id,Values=$VPC \
  --query 'Subnets[].SubnetId' --output text | tr '\t' ',')
SG=$(aws ec2 create-security-group --group-name asg-lab --description "asg lab" \
  --vpc-id $VPC --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id $SG --protocol tcp --port 80 --cidr 0.0.0.0/0
AMI=$(aws ssm get-parameters --names \
  /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
  --query 'Parameters[0].Value' --output text)

Expected: a VPC id, a comma-separated subnet list across AZs, a security-group id, and a current Amazon Linux 2023 AMI id.

Step 2 — A launch template with IMDSv2 and a tiny web server in user-data.

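# Strip newlines: base64 wraps long output, and a wrapped value would break the JSON below.
USERDATA=$(printf '#!/bin/bash\ndnf -y install httpd\nsystemctl enable --now httpd\necho ok > /var/www/html/health\n' | base64 | tr -d '\n')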
LT=$(aws ec2 create-launch-template --launch-template-name asg-lab \
  --launch-template-data "{
    \"ImageId\": \"$AMI\", \"InstanceType\": \"t3.micro\",
    \"SecurityGroupIds\": [\"$SG\"], \"UserData\": \"$USERDATA\",
    \"MetadataOptions\": {\"HttpTokens\": \"required\"}
  }" --query 'LaunchTemplate.LaunchTemplateId' --output text)

Step 3 — An ALB, target group, and listener.

ALB=$(aws elbv2 create-load-balancer --name asg-lab --type application \
  --subnets ${SUBNETS//,/ } --security-groups $SG \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)
TG=$(aws elbv2 create-target-group --name asg-lab --protocol HTTP --port 80 \
  --vpc-id $VPC --target-type instance --health-check-path /health \
  --query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 create-listener --load-balancer-arn $ALB --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=$TG

Step 4 — Create the ASG with ELB health checks and a sane grace period.

aws autoscaling create-auto-scaling-group --auto-scaling-group-name asg-lab \
  --launch-template "LaunchTemplateId=$LT,Version=\$Latest" \
  --min-size 2 --max-size 6 --desired-capacity 2 \
  --vpc-zone-identifier "$SUBNETS" \
  --target-group-arns $TG \
  --health-check-type ELB --health-check-grace-period 120 \
  --default-instance-warmup 120

Expected: after ~2 minutes, two instances reach InService. Confirm:

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names asg-lab \
  --query 'AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState,HealthStatus]' --output table

Step 5 — Add a Stopped warm pool and watch it fill.

aws autoscaling put-warm-pool --auto-scaling-group-name asg-lab \
  --pool-state Stopped --min-size 2 --max-group-prepared-capacity 4

aws autoscaling describe-warm-pool --auto-scaling-group-name asg-lab \
  --query '[WarmPoolConfiguration,Instances[].[InstanceId,LifecycleState]]' --output json
# Look for instances in Warmed:Pending -> Warmed:Stopped

Step 6 — Add a terminate drain hook so scale-in waits.

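# Lab shortcut: no handler is wired up, so each termination simply waits out the
# 120 s heartbeat and then continues. In production, keep the heartbeat above the
# target group's deregistration delay (300 s by default).
aws autoscaling put-lifecycle-hook --lifecycle-hook-name drain \
  --auto-scaling-group-name asg-lab \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 120 --default-result CONTINUE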

Step 7 — Run a no-op instance refresh with surge and watch the lifecycle.

aws autoscaling start-instance-refresh --auto-scaling-group-name asg-lab \
  --preferences '{"MinHealthyPercentage":90,"MaxHealthyPercentage":110,"InstanceWarmup":120}'

watch -n 10 'aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name asg-lab \
  --query "InstanceRefreshes[0].[Status,PercentageComplete]" --output text'
# Status walks Pending -> InProgress -> Successful; capacity never dips below 100%

Step 8 — Teardown (delete in order to avoid dependency errors).

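aws autoscaling delete-warm-pool --auto-scaling-group-name asg-lab --force-delete
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name asg-lab --force-delete
# Group deletion is asynchronous; wait (up to ~10 minutes) for the group and its
# instances to disappear before removing the resources they still reference.
for _ in $(seq 1 40); do
  [ "$(aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names asg-lab \
    --query 'length(AutoScalingGroups)' --output text)" = "0" ] && break
  sleep 15
done
aws elbv2 delete-listener --listener-arn $(aws elbv2 describe-listeners --load-balancer-arn $ALB --query 'Listeners[0].ListenerArn' --output text)
aws elbv2 delete-load-balancer --load-balancer-arn $ALB
aws elbv2 wait load-balancers-deleted --load-balancer-arns $ALB
aws elbv2 delete-target-group --target-group-arn $TG
aws ec2 delete-launch-template --launch-template-id $LT
aws ec2 delete-security-group --group-id $SG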

Common mistakes & troubleshooting

The ASG fails in transitions, and almost every failure has a precise fingerprint. This is the playbook: match the symptom, run the confirm command, apply the fix.

# Symptom Root cause Confirm (exact command) Fix
1 Instance stuck in Pending:Wait for the full heartbeat Launch hook handler never calls complete-lifecycle-action describe-auto-scaling-instances ... LifecycleState shows Pending:Wait Fix handler to call complete; set sane default_result ABANDON
2 Instance stuck in Terminating:Wait, capacity not freed Drain logic never reports back; heartbeat huge describe-scaling-activities shows long Terminating:Wait Make handler call complete-lifecycle-action; lower heartbeat
3 5xx on every scale-in / Spot reclaim No terminate hook, or heartbeat < deregistration delay TG deregistration_delay vs hook heartbeat_timeout Add hook; set heartbeat > deregistration_delay
4 Scale-out still slow despite a warm pool Warm pool min_size too small / pool empty describe-warm-pool shows 0 Warmed:Stopped Raise min_size to cover the surge-rate gap
5 Launch/terminate thrash loop right after boot health_check_grace_period < boot-to-healthy describe-scaling-activities shows repeated launch→terminate Raise grace to boot-to-healthy + margin
6 Booted-but-broken instance serving traffic health_check_type = EC2 (hypervisor OK, app dead) describe-auto-scaling-groups ... HealthCheckType = EC2 Set health_check_type = ELB
7 Capacity dips during an AMI rollout Instance refresh with MaxHealthyPercentage = 100 (no surge) describe-instance-refreshes shows dip; preferences max 100 Set MaxHealthyPercentage = 110+
8 Instance refresh stuck InProgress, never completes New AMI fails health within warmup; can’t hold min healthy describe-instance-refreshes ... StatusReason Fix AMI/health path; verify warmup matches boot time
9 Bad AMI rolled to whole fleet, no revert AutoRollback not set / no alarm wired Refresh Preferences.AutoRollback = false Enable AutoRollback + an alarm spec
10 Spot interruptions drop work in waves lowest-price strategy, no Capacity Rebalancing instances_distribution.spot_allocation_strategy Switch to price-capacity-optimized; enable --capacity-rebalance
11 Scale-out stalls; “no capacity” errors Too few instance types/AZs for Spot to draw from describe-scaling-activities shows insufficient-capacity Diversify to 4+ types, 3 AZs
12 New instances never register with the ALB Wrong/missing IAM profile or SG; SSM hook can’t run describe-target-health shows no targets Fix instance profile + SG; verify hook ran
13 Warm-pool instances behave as if in service Bootstrap not distinguishing Warmed:Pending from Pending Check user-data branch on LifecycleState Branch bootstrap on lifecycle state
14 Long-running consumer killed on scale-in No scale-in protection on the busy node describe-auto-scaling-instances ... ProtectedFromScaleIn set-instance-protection --protected-from-scale-in
15 Double-scaling: ASG over-provisions on a spike EstimatedInstanceWarmup/default_instance_warmup = 0 Policy/group warmup is 0 Set warmup to real time-to-ready

Error and limit reference

The control-plane errors and the hard limits you will actually hit:

Error / condition Where it surfaces Likely cause Fix
Failed to launch ... insufficient capacity Scaling activities No Spot/OD capacity in chosen pools Diversify types/AZs; widen Spot strategy
Launch template version ... does not exist ASG / refresh Pinned a deleted/non-existent LT version Use $Latest/$Default or a valid version
Health check grace period evictions Scaling activities Grace too short Raise grace period
Instance failed to pass health checks Refresh StatusReason Bad AMI / health path Fix AMI; verify probe path returns 200
Could not maintain minimum healthy percentage Refresh failed Warmup too short or capacity tight Raise warmup; relax min healthy; add capacity
Lifecycle action ... already completed Hook handler Double complete-lifecycle-action Make handler idempotent (use token)
AccessDenied on elasticloadbalancing:DeregisterTargets Drain handler logs Instance profile lacks ELB perms Grant the drain role ELB deregister perms

Limit (default, soft unless noted) Value Note
ASGs per region 500 Adjustable via quota
Launch templates per region 5,000 Each with many versions
Launch template versions 10,000 per template Prune old versions
Instances per ASG (bounded by EC2 limits) Effectively your account’s instance quotas
Lifecycle hooks per ASG 50 Per group
Scaling policies per ASG 50 Step + target + predictive
Scheduled actions per ASG 125 Time-based capacity changes
Warm pool max prepared capacity ≤ max_size Cannot exceed group max
Spot interruption notice ~2 minutes Hard, not adjustable
Lifecycle heartbeat timeout 30 s to 7,200 s per heartbeat; total wait capped at 172,800 s (48 h) or 100 × the heartbeat timeout, whichever is smaller Per action

Best practices

Security notes

The ASG itself is mostly a control-plane resource, but the instances it launches inherit a security posture you set in the launch template — get it wrong and every node in the fleet is wrong.

Area Risk Control
IMDS SSRF stealing instance-role credentials via IMDSv1 http_tokens = required (IMDSv2 only); hop_limit = 1 (or 2 for containers, no more)
Instance profile Over-broad role on every instance Least-privilege role; the drain hook needs only elasticloadbalancing:DeregisterTargets and DescribeTargetHealth plus autoscaling:CompleteLifecycleAction and RecordLifecycleActionHeartbeat
EBS encryption Data at rest unencrypted encrypted = true in block device mappings; account-default EBS encryption on
Security groups Fleet exposed beyond the ALB SG allows ingress only from the ALB SG, not 0.0.0.0/0
User-data secrets Plaintext secrets in user-data (readable via IMDS) Pull secrets from Secrets Manager/Parameter Store at boot, never bake them in
AMI provenance Unpatched/untrusted AMI rolled fleet-wide Pin to vetted, scanned AMIs; refresh from a hardened pipeline
Hook handler Lambda/SSM with excess permissions Scope the handler role to the specific hook actions and target group
API traffic Drain handler reaching the ELB and Auto Scaling APIs over the public internet VPC endpoints for elasticloadbalancing/autoscaling keep API calls private

The instance profile that the drain hook needs is small — resist the urge to attach a broad role:

{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["autoscaling:CompleteLifecycleAction", "autoscaling:RecordLifecycleActionHeartbeat"], "Resource": "*"},
    {"Effect": "Allow", "Action": ["elasticloadbalancing:DeregisterTargets", "elasticloadbalancing:DescribeTargetHealth"], "Resource": "*"}
  ]
}

For least-privilege IAM patterns beyond this, see IAM Least Privilege & Permission Boundaries.

Cost & sizing

The ASG is free; you pay for the instances, the EBS attached to warm-pool members, the ALB, and any detailed monitoring. The levers that actually move the bill:

| Cost driver | What drives it | Lever | Rough magnitude |
| --- | --- | --- | --- |
| On-Demand floor | on_demand_base_capacity × instance price | Keep the floor minimal | Largest steady cost if over-set |
| Spot vs On-Demand mix | on_demand_percentage_above_base_capacity | Lower % = more savings | Spot saves ~50–70% vs OD |
| Warm pool (Stopped) | EBS volumes for min_size reserve | Right-size min_size; Stopped not Running | EBS-only; no compute |
| Warm pool (Running) | Full instance cost for the reserve | Use only when latency demands | Same as in-service capacity |
| Detailed monitoring | 1-min metrics per instance | On for responsive scaling | Small per-instance charge |
| Instance refresh surge | Extra instances during rollout | Surge briefly, then settle | Transient (rollout duration only) |
| Cross-AZ data | Traffic between AZs | Keep chatty paths in-AZ where possible | Per-GB |
| ALB | LCU-hours + hourly | Right-size; consolidate | Modest baseline |

Sizing guidance

| Decision | Heuristic | Why |
| --- | --- | --- |
| min_size (group) | Survive one AZ loss at steady traffic | Reliability floor |
| max_size (group) | Peak demand + headroom, within quotas | Don’t cap a real spike |
| on_demand_base_capacity | Minimum capacity you’d run all-OD | Floor against Spot reclaim waves |
| Warm pool min_size | Surge over (cold − warm) launch window | Pay for the deficit, not the peak |
| Instance size | Same fungible size across types | Even LB distribution |
| InstanceWarmup | Real p95 boot-to-healthy | Avoid double-scaling and false unhealthy |
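
Two of these heuristics reduce to simple arithmetic. The sketch below is one reading of them, not a formula from AWS: the function names and the inputs (instances needed at steady traffic, surge rate in instances per minute, measured cold and warm launch times) are assumptions you would replace with your own measurements.

```python
import math


def min_size_for_az_loss(instances_needed: int, az_count: int) -> int:
    """Smallest group size, spread evenly across AZs, that still leaves
    `instances_needed` instances running after one AZ is lost."""
    if az_count < 2:
        raise ValueError("surviving an AZ loss needs at least two AZs")
    per_az = math.ceil(instances_needed / (az_count - 1))
    return per_az * az_count


def warm_pool_min_size(surge_instances_per_min: float,
                       cold_launch_s: float,
                       warm_launch_s: float) -> int:
    """Instances demanded during the extra time a cold launch takes over a
    warm resume: the deficit the pool has to cover, not the peak."""
    gap_min = max(cold_launch_s - warm_launch_s, 0) / 60
    return math.ceil(surge_instances_per_min * gap_min)


if __name__ == "__main__":
    # 4 instances carry steady traffic across 3 AZs.
    print(min_size_for_az_loss(4, 3))
    # Surge adds 2 instances/min; cold launch 240 s, warm resume 30 s.
    print(warm_pool_min_size(2, 240, 30))
```

With those inputs the group floor comes out at 6 (two per AZ, so any two AZs still hold four) and the pool at 7 (two instances per minute over a 3.5-minute gap).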

A worked figure: a steady fleet of six m6i.large instances in ap-south-1, with an On-Demand base of 2, 30% On-Demand above that base, and a 4-instance Stopped warm pool, lands at roughly ₹55,000–75,000/month depending on Spot pricing. The same fleet run all On-Demand, with no warm pool but a fatter floor to fake the latency, costs about ₹1.3 lakh/month. The warm pool’s only marginal cost is the EBS for four stopped volumes (a few hundred rupees/month), and it cuts scale-out from minutes to seconds, which is almost always worth it. There is no free tier for sustained ASG capacity; the t3.micro lab above fits within the 750 hours/month free-tier EC2 allowance if you delete it promptly.
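
To rerun this kind of estimate with your own prices, a small calculator is enough. This is a sketch under stated assumptions: the function names are hypothetical, the rates in the example are placeholders in arbitrary currency units (not AWS prices, and not meant to reproduce the rupee figures above), and rounding the On-Demand share up is an assumption you should verify against the split your group actually launches.

```python
import math

HOURS_PER_MONTH = 730  # average hours in a month


def split_capacity(desired: int, od_base: int, od_pct_above_base: int):
    """Return (on_demand, spot) instance counts for a mixed instances group.

    Assumption: a fractional On-Demand share above the base is rounded up.
    """
    above_base = max(desired - od_base, 0)
    od_above = math.ceil(above_base * od_pct_above_base / 100)
    on_demand = min(od_base, desired) + od_above
    return on_demand, desired - on_demand


def monthly_cost(on_demand: int, spot: int, od_hourly: float,
                 spot_hourly: float, warm_pool_volumes: int = 0,
                 ebs_monthly_per_volume: float = 0.0) -> float:
    """Compute cost plus the EBS-only cost of a Stopped warm pool."""
    compute = (on_demand * od_hourly + spot * spot_hourly) * HOURS_PER_MONTH
    return compute + warm_pool_volumes * ebs_monthly_per_volume


if __name__ == "__main__":
    od, spot = split_capacity(desired=6, od_base=2, od_pct_above_base=30)
    # Placeholder rates in arbitrary currency units, not real AWS prices.
    print(od, spot, monthly_cost(od, spot, 10.0, 4.0, 4, 100.0))
```

The structure mirrors the cost table: the On-Demand floor and the percentage above it set the compute bill, while a Stopped warm pool only adds the per-volume EBS term.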

Interview & exam questions

Q1. Why is a single instance type a reliability risk in an ASG, and how does a mixed instances policy fix it? A single type is a single capacity pool; when it’s exhausted in an AZ, scale-out stalls and the spike goes unanswered. A mixed instances policy draws from multiple types across AZs, so capacity-aware allocation (price-capacity-optimized) can route around a thin pool. (SAA-C03, SAP-C02.)

Q2. What is a warm pool, and when would you choose Hibernated over Stopped? A warm pool is a pre-initialized reserve of instances held past the expensive bootstrap, resumed in seconds on scale-out. Choose Hibernated when the app has a long in-memory warmup (large caches, JIT state) worth preserving across the stop; Stopped (the default) suffices when only the disk-level bootstrap is expensive. (SAP-C02.)

Q3. A lifecycle hook is meant to drain connections on scale-in. What single timing relationship must hold, and what breaks if it doesn’t? The hook’s heartbeat_timeout must exceed the target group’s deregistration_delay; otherwise the heartbeat expires and the instance is terminated mid-drain, cutting in-flight requests — defeating the hook’s purpose. (DVA-C02, SOA-C02.)

Q4. How does MaxHealthyPercentage > 100 change an instance refresh? It enables surge: the refresh launches replacement instances before terminating old ones. With MinHealthyPercentage held at 100, total capacity never dips below 100% during the rollout, which is the closest thing to a true zero-downtime rolling deploy. (SAP-C02, DOP-C02.)

Q5. What is the difference between the Spot interruption notice and the rebalance recommendation? The interruption notice is a hard “~2 minutes until reclaim”; the rebalance recommendation is an earlier, best-effort “elevated risk of interruption” that often precedes it. Capacity Rebalancing acts on the recommendation to replace and drain before the 2-minute gun fires. (SAP-C02.)

Q6. Why set health_check_type = ELB instead of leaving the default? The default EC2 check only verifies the hypervisor/VM is alive; an app that crash-loops or never binds its port still passes. ELB ties health to the target group’s application probe, so booted-but-broken instances are evicted. (SAA-C03.)

Q7. An ASG launches new capacity on a spike, then immediately scales back in, repeatedly. Name two likely causes. Either EstimatedInstanceWarmup/default_instance_warmup is 0 (so fresh instances’ metrics trigger more scaling before they’re ready — double-scaling), or health_check_grace_period is shorter than boot-to-healthy (so the ASG kills instances mid-boot). (SOA-C02.)

Q8. How do you ship a new AMI to an ASG with a canary and automatic rollback? Run an instance refresh with CheckpointPercentages (e.g. 25, 50, 100; the final value must be 100 to replace the whole fleet) and a CheckpointDelay bake period as the canary, plus AutoRollback: true and an AlarmSpecification referencing a 5xx/latency alarm, so a bad build pauses at the first checkpoint and reverts itself. (DOP-C02.)

Q9. Why does running a warm pool require pool-aware bootstrap automation? The launch transition fires for both entering the pool (Warmed:Pending) and entering service (Pending). Bootstrap that doesn’t branch on LifecycleState may register a stopped pool instance with the load balancer, or re-run the full expensive bootstrap on the fast resume path. (SAP-C02.)

Q10. Which termination policy do you favour during a rollout, and why? OldestLaunchTemplate — so that when the ASG scales in during a refresh, it sheds old-template instances first and the fleet converges on the new version rather than killing freshly updated nodes. (DOP-C02.)

Q11. Why is scaling out a band-aid for SNAT/connection or per-instance memory problems, and how does this relate to ASG sizing? Scaling out adds instances but doesn’t fix a per-instance constraint (each new node hits the same ceiling) — the same anti-pattern as masking an OOM by adding capacity. The fix is in the instance (code/RAM), and the ASG should be sized for demand, not to dilute a per-instance bug. (SAP-C02.)

Q12. How does on_demand_base_capacity interact with a Spot reclaim wave? It guarantees an absolute floor of On-Demand instances that Spot interruptions cannot touch, so a correlated reclaim across Spot pools degrades capacity down to — but never below — that floor. (SAP-C02.)
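
The refresh settings from Q4 and Q8 can be assembled and sanity-checked before they reach the API. The sketch below builds a preferences dictionary using the field names of the instance-refresh API; the function name, the default values, and the alarm name `fleet-5xx-rate` are hypothetical, and the checks reflect the documented ranges as I understand them (MaxHealthyPercentage between 100 and 200, at most 100 points above MinHealthyPercentage).

```python
def refresh_preferences(min_healthy: int = 100,
                        max_healthy: int = 110,
                        checkpoints=(25, 50, 100),
                        checkpoint_delay_s: int = 600,
                        alarm_names=("fleet-5xx-rate",)) -> dict:
    """Build instance-refresh preferences for a surge rollout with canary
    checkpoints and alarm-driven auto rollback."""
    checkpoints = list(checkpoints)
    if not 100 <= max_healthy <= 200:
        raise ValueError("MaxHealthyPercentage must be between 100 and 200")
    if not 0 <= min_healthy <= 100:
        raise ValueError("MinHealthyPercentage must be between 0 and 100")
    if max_healthy - min_healthy > 100:
        raise ValueError("max and min healthy may differ by at most 100")
    if checkpoints != sorted(set(checkpoints)):
        raise ValueError("checkpoints must be strictly ascending")
    if checkpoints and checkpoints[-1] != 100:
        # Policy of this sketch: only full-fleet rollouts are allowed.
        raise ValueError("last checkpoint must be 100 to replace the whole fleet")

    prefs = {
        "MinHealthyPercentage": min_healthy,
        "MaxHealthyPercentage": max_healthy,
        "AutoRollback": True,
        "AlarmSpecification": {"Alarms": list(alarm_names)},
    }
    if checkpoints:
        prefs["CheckpointPercentages"] = checkpoints
        prefs["CheckpointDelay"] = checkpoint_delay_s  # bake time in seconds
    return prefs


if __name__ == "__main__":
    print(refresh_preferences())
```

The returned dictionary is shaped to be passed as the `Preferences` argument when starting a rolling instance refresh (for example via boto3's `start_instance_refresh`); treat that wiring as an assumption to confirm against your SDK version.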

Quick check

  1. What lifecycle state does an instance enter when a terminating lifecycle hook is attached, and what must you call to release it?
  2. Your warm pool is configured but scale-out is still slow during a real spike. What is the most likely single cause?
  3. Which MaxHealthyPercentage value enables surge during an instance refresh, and what does surge prevent?
  4. Name the two Spot signals and which one Capacity Rebalancing acts on.
  5. Why must health_check_grace_period be at least your boot-to-healthy time?

Answers

  1. Terminating:Wait — release it by calling complete-lifecycle-action (with CONTINUE to proceed, or ABANDON), or let the heartbeat time out.
  2. The warm pool min_size is too small (or the pool is empty). It isn’t sized to cover the instances your surge demands during the cold-launch window, so scale-out falls back to cold launches.
  3. Any value above 100 (e.g. 110) enables surge; with MinHealthyPercentage at 100, the refresh launches replacements before terminating old instances, so capacity never dips below 100% during the rollout.
  4. The rebalance recommendation (earlier, “elevated risk”) and the interruption notice (~2 minutes, definite). Capacity Rebalancing acts on the rebalance recommendation.
  5. Because health checks count against an instance after the grace period; if it’s shorter than boot-to-healthy, the ASG marks still-booting instances unhealthy and kills them, producing a launch/terminate thrash loop.
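
The timing relationships in these answers are easy to check mechanically before a deploy. The following is a minimal sketch: `timing_problems` is a hypothetical helper, and its inputs are values you would read from your own hook, target group and ASG configuration plus a measured p95 boot-to-healthy time.

```python
def timing_problems(heartbeat_timeout_s: int,
                    deregistration_delay_s: int,
                    health_check_grace_period_s: int,
                    instance_warmup_s: int,
                    boot_to_healthy_p95_s: int) -> list[str]:
    """Return a human-readable list of timing misconfigurations."""
    problems = []
    if heartbeat_timeout_s <= deregistration_delay_s:
        problems.append(
            "heartbeat_timeout must exceed deregistration_delay, "
            "or the instance is terminated mid-drain")
    if health_check_grace_period_s < boot_to_healthy_p95_s:
        problems.append(
            "health_check_grace_period is shorter than boot-to-healthy, "
            "so booting instances are marked unhealthy and replaced")
    if instance_warmup_s < boot_to_healthy_p95_s:
        problems.append(
            "instance warmup is shorter than boot-to-healthy, "
            "so fresh instances can trigger double-scaling")
    return problems


if __name__ == "__main__":
    for problem in timing_problems(
            heartbeat_timeout_s=60, deregistration_delay_s=120,
            health_check_grace_period_s=100, instance_warmup_s=0,
            boot_to_healthy_p95_s=240):
        print(problem)
```

Running it with the deliberately bad values above reports all three problems; a configuration with a 300-second heartbeat, a 120-second drain, and 300 seconds of grace and warmup against a 240-second boot returns an empty list.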



Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
