Regional Managed Instance Groups: Autohealing, Canary Rollouts, and Stateful MIGs

A Managed Instance Group (MIG) is Compute Engine’s unit of fleet management: it owns a set of identical VMs derived from an instance template, keeps them at a target size, repairs them when they fail, and rolls new versions out gradually. Get this layer right and a VM-based service behaves like a managed platform: it self-heals, scales on demand, and ships canaries without a human babysitting gcloud. Get it wrong and you have a fragile fleet where one bad image takes the whole service down.

This walkthrough builds a production-grade regional MIG end to end: zone distribution, templates and update strategy, autohealing, rolling updates, canaries, autoscaling, and stateful configuration. Commands use gcloud compute instance-groups managed, with a parallel Terraform example built on google_compute_region_instance_group_manager.

Step 1: Regional vs zonal, and why regional wins for production

A zonal MIG places all instances in one zone. If that zone has an outage, your service is gone. A regional MIG spreads instances across multiple zones in a region (up to three by default) and keeps them balanced, so a single-zone failure takes out only a fraction of capacity.

Property	Zonal MIG	Regional MIG
Failure domain	One zone	Multiple zones in a region
Max size	Up to 1,000 VMs	Up to 2,000 VMs; spread across zones
Recommended for prod	No	Yes
Update unit	Per zone	Across zones, zone-aware

Create a regional MIG and pin the zones explicitly so you control placement instead of letting Compute Engine pick:

gcloud compute instance-groups managed create web-mig \
  --project=PROJECT_ID \
  --region=us-central1 \
  --template=web-tmpl-v1 \
  --size=6 \
  --zones=us-central1-a,us-central1-b,us-central1-f

The default distribution policy is EVEN: with size 6 across three zones you get 2 instances per zone. The target shape controls how the group reconciles when a zone is short on capacity. EVEN insists on balance; ANY and BALANCED let the group prefer availability over perfect symmetry, which matters when one zone can’t fulfill a resource request (common with GPUs or large machine types).

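# Prefer availability when a zone can't satisfy the request.
# BALANCED needs proactive instance redistribution turned off.
gcloud compute instance-groups managed update web-mig \
  --region=us-central1 \
  --target-distribution-shape=BALANCED \
  --instance-redistribution-type=NONE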

Rule of thumb: production fleets are regional with three zones and BALANCED shape unless you have a hard reason (data locality, licensing) to pin a single zone.
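To make the EVEN arithmetic concrete, here is a minimal shell sketch of how a group size divides across zones. The function name even_split is hypothetical, and the assumption that the remainder is handed out one instance per zone is only illustrative: which zones actually receive the extra instance is decided by Compute Engine, not by this script.

```shell
# Sketch: how an EVEN target shape splits SIZE instances across ZONES zones.
# Assumption: zone counts differ by at most one; zone order here is arbitrary.
even_split() (
  size=$1; zones=$2
  base=$((size / zones))     # every zone gets at least this many
  extra=$((size % zones))    # this many zones get one more
  out=""
  i=1
  while [ "$i" -le "$zones" ]; do
    n=$base
    if [ "$i" -le "$extra" ]; then n=$((base + 1)); fi
    out="$out${out:+ }$n"
    i=$((i + 1))
  done
  echo "$out"
)

even_split 6 3   # prints: 2 2 2
even_split 7 3   # prints: 3 2 2
```

The second call shows why a size that is a multiple of the zone count is easier to reason about: losing the zone that carries the extra instance costs more capacity than losing either of the others.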

Step 2: Instance templates and the update model

A MIG never references a raw VM config; it references an instance template, which is immutable. You don’t edit a template, you create a new one and tell the MIG to migrate to it. That immutability is the whole basis of safe rollouts: every version is a named, frozen artifact.

gcloud compute instance-templates create web-tmpl-v2 \
  --project=PROJECT_ID \
  --machine-type=e2-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=50GB \
  --boot-disk-type=pd-balanced \
  --tags=http-server \
  --metadata=startup-script-url=gs://PROJECT_ID-cfg/startup.sh \
  --region=us-central1

The MIG’s update policy has two type modes that decide when instances move to a new template:

PROACTIVE - the MIG actively replaces instances to converge on the target version. This is what you want for a real rollout.
OPPORTUNISTIC - the MIG does nothing on its own; instances only adopt the new template when they happen to be recreated (autohealing, autoscaling scale-up, manual recreate). Use this when you want to stage a version and let it bleed in, or when you’ll drive the rollout yourself with update-instances.

You set this on the group, then trigger rollouts by changing the version. The rolling-action start-update command makes the intent explicit:

# In a regional MIG a fixed max-surge must be 0 or at least the number of zones (3 here)
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --region=us-central1 \
  --version=template=web-tmpl-v2 \
  --type=proactive \
  --max-surge=3 \
  --max-unavailable=0

Step 3: Autohealing with a health check and an initial delay

By default a MIG only recreates instances that are deleted or whose VM crashes at the hypervisor level. That does not catch an app that’s hung, deadlocked, or returning 500s. Autohealing fixes that: you attach an HTTP/TCP health check and the MIG recreates any instance the check reports unhealthy.

Create a health check that probes the application, not just the port:

gcloud compute health-checks create http web-autoheal-hc \
  --project=PROJECT_ID \
  --port=8080 \
  --request-path=/healthz \
  --check-interval=10s \
  --timeout=5s \
  --healthy-threshold=2 \
  --unhealthy-threshold=3

Then bind it to the MIG with an initial delay. This is the single most-misconfigured field on a MIG. The initial delay is how long after an instance boots the MIG waits before autohealing starts judging it. Set it shorter than your real warm-up (image pull, JIT warm, cache fill, DB connection pool) and the MIG will kill healthy-but-still-booting instances in a loop, never reaching steady state.

gcloud compute instance-groups managed update web-mig \
  --region=us-central1 \
  --health-check=web-autoheal-hc \
  --initial-delay=300

Important: use a separate, more lenient health check for autohealing than the one your load balancer uses for routing. The LB check decides “stop sending traffic” (cheap, reversible). The autoheal check decides “destroy this VM” (expensive, irreversible). A flapping dependency should drain a node, not nuke it.
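The timing relationships in this step can be sanity-checked with a little arithmetic before you bind the check. This is a rough sketch: the function names are hypothetical, the detection time is approximated as check interval times unhealthy threshold, and the 240-second warm-up is an assumed figure standing in for whatever you measure on your own image.

```shell
# Sketch: sanity-check autohealing timing. All values are in seconds.

# Rough time for a failing instance to be marked unhealthy:
# check interval x unhealthy threshold (ignores probe timeouts and jitter).
detection_time() { echo $(( $1 * $2 )); }

# The initial delay must exceed the measured warm-up, or autohealing
# recreates instances that are still booting.
# Args: initial delay, measured warm-up.
initial_delay_ok() {
  if [ "$1" -gt "$2" ]; then
    echo "ok: $(( $1 - $2 ))s of margin"
  else
    echo "too short: instances may be recreated while still warming up"
    return 1
  fi
}

detection_time 10 3        # interval and threshold from web-autoheal-hc: prints 30
initial_delay_ok 300 240   # hypothetical 240s warm-up: prints "ok: 60s of margin"
```

Running the same check with an initial delay of 180 against a 360-second warm-up returns non-zero, which is the churn condition described above.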

Step 4: Rolling updates - maxSurge, maxUnavailable, and minimal disruption

A rolling update replaces instances in waves governed by two knobs:

maxSurge - how many extra instances above target the MIG may create temporarily. Surge first, then delete old, so capacity never dips.
maxUnavailable - how many instances may be down (being replaced) at once.

The combination defines your disruption budget. For zero-capacity-loss rollouts, surge and keep unavailable at zero:

gcloud compute instance-groups managed rolling-action start-update web-mig \
  --region=us-central1 \
  --version=template=web-tmpl-v2 \
  --max-surge=3 \
  --max-unavailable=0 \
  --min-ready=120 \
  --replacement-method=substitute

Key fields:

--min-ready holds a freshly created, health-check-passing instance in service for that long before the MIG counts it “ready” and proceeds. This catches versions that pass health checks immediately but fall over under real traffic a minute later.
--replacement-method=substitute creates a new instance to replace an old one (new name, new IP unless stateful). The alternative, recreate, reuses the same instance name in place and is required for stateful MIGs.

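In a regional MIG the update is zone-aware: the MIG won’t take down more than the allowed fraction in any single zone at once, so a rollout never collapses an availability zone. You don’t configure this; it’s inherent to regional groups respecting the distribution policy. One consequence: a fixed maxSurge or maxUnavailable must be either 0 or at least the number of zones in the group, so with three zones the smallest non-zero value is 3.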
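Before starting a rollout it helps to check the disruption budget on paper. The sketch below validates fixed values against the regional rule (a fixed maxSurge or maxUnavailable must be 0 or at least the zone count, and the two cannot both be 0) and prints the capacity envelope: the most instances the group may run and the fewest it may have in service. The function names are hypothetical and the model ignores percentages and autoscaler resizing.

```shell
# Sketch: validate fixed rollout limits for a regional MIG and print the envelope.

# Args: value, zone count. Succeeds when the value is 0 or >= the zone count.
fixed_ok() { [ "$1" -eq 0 ] || [ "$1" -ge "$2" ]; }

# Args: target size, max surge, max unavailable, zone count.
rollout_envelope() {
  if ! fixed_ok "$2" "$4" || ! fixed_ok "$3" "$4"; then
    echo "invalid: fixed values must be 0 or >= $4 for a $4-zone regional MIG"
    return 1
  fi
  if [ $(( $2 + $3 )) -eq 0 ]; then
    echo "invalid: max surge and max unavailable cannot both be 0"
    return 1
  fi
  echo "peak=$(( $1 + $2 )) floor=$(( $1 - $3 ))"
}

rollout_envelope 6 3 0 3    # surge first: prints peak=9 floor=6
rollout_envelope 24 0 3 3   # no surge:    prints peak=24 floor=21
```

The peak figure is what you need quota and zone capacity for; the floor figure is what has to carry production traffic while the rollout is in flight.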

To stop a bad rollout immediately:

gcloud compute instance-groups managed rolling-action stop-proactive-update web-mig \
  --region=us-central1

Step 5: Canary releases with two template versions

A canary runs the new template on a slice of the fleet while the bulk stays on the known-good version. The MIG models this natively with two versions on the same group, where the canary version carries a target-size:

# Fixed max-surge in a three-zone regional MIG must be 0 or >= 3
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --region=us-central1 \
  --version=template=web-tmpl-v1 \
  --canary-version=template=web-tmpl-v2,target-size=20% \
  --type=proactive \
  --max-surge=3 \
  --max-unavailable=0 \
  --min-ready=180

Now 20% of instances run v2 and 80% run v1. Watch metrics, error rates, and latency on the canary slice. If it’s healthy, promote by making v2 the sole version (no canary); if not, drop the canary version so the MIG replaces the v2 instances and the fleet converges back to v1.

# Promote: v2 becomes the whole fleet
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --region=us-central1 \
  --version=template=web-tmpl-v2 \
  --type=proactive --max-surge=3 --max-unavailable=0

# Or roll back: drop the canary; the MIG replaces the v2 instances with v1
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --region=us-central1 \
  --version=template=web-tmpl-v1 \
  --type=proactive --max-surge=3 --max-unavailable=0

target-size accepts a percentage or a fixed count. Percentages are evaluated against the current group size, so a canary scales with the fleet - useful when an autoscaler is also resizing during the bake.
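During the bake it is useful to confirm how much of the fleet is really on the canary template. The helper below is a sketch: canary_share is a hypothetical name, and it only parses two-column CSV on stdin (instance, template URL), so it can be tried without touching a live group. The commented gcloud pipeline is one plausible way to feed it, assuming the version.instanceTemplate field is populated for your instances.

```shell
# Sketch: report what share of a MIG runs a given template.
# Reads "instance,templateURL" CSV lines on stdin; arg 1 is the template name.
canary_share() {
  awk -F, -v tmpl="$1" '
    { total++
      n = split($2, parts, "/")          # template name is the last URL segment
      if (parts[n] == tmpl) canary++ }
    END { if (total == 0) { print "no instances"; exit 1 }
          printf "%d/%d (%d%%)\n", canary, total, canary * 100 / total }'
}

# Usage against a live group (needs gcloud and project access):
# gcloud compute instance-groups managed list-instances web-mig \
#   --region=us-central1 \
#   --format='csv[no-heading](instance,version.instanceTemplate)' \
#   | canary_share web-tmpl-v2
```

If the reported share stays below the requested target-size for a long time, the canary instances are probably stuck creating or verifying rather than serving.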

Step 6: Autoscaling on CPU, LB utilization, and custom metrics

Attach an autoscaler so the group resizes on demand. The cleanest mental model: pick one or more signals, give each a target, and the autoscaler computes the size that holds every signal at its target, then takes the max.
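That mental model can be written down as a few lines of integer arithmetic. This is a simplified sketch, not the autoscaler's actual algorithm: it assumes each signal wants ceil(current size x observed utilization / target utilization) instances, uses whole percentages, and ignores stabilization windows and the min/max replica bounds. The function names are hypothetical.

```shell
# Sketch: per-signal recommended size, then the max across signals.

# Args: current size, observed utilization %, target utilization %.
size_for_signal() { echo $(( ($1 * $2 + $3 - 1) / $3 )); }   # ceiling division

# Largest of the given integers.
max_of() {
  m=$1
  for v in "$@"; do
    if [ "$v" -gt "$m" ]; then m=$v; fi
  done
  echo "$m"
}

cpu=$(size_for_signal 6 90 60)   # 6 VMs at 90% CPU against a 60% target
lb=$(size_for_signal 6 70 80)    # same VMs at 70% LB utilization against 80%
echo "cpu wants $cpu, lb wants $lb, autoscaler picks $(max_of "$cpu" "$lb")"
# prints: cpu wants 9, lb wants 6, autoscaler picks 9
```

Taking the max means the most stressed signal always wins, which is why adding a second signal can only raise the group size, never lower it.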

CPU utilization is the default starting point:

gcloud compute instance-groups managed set-autoscaling web-mig \
  --region=us-central1 \
  --min-num-replicas=6 \
  --max-num-replicas=30 \
  --target-cpu-utilization=0.6 \
  --cool-down-period=90

For a group behind an HTTP(S) load balancer, load-balancing utilization is usually a better signal than CPU because it tracks the serving capacity you defined on the backend service (e.g., max RPS per instance):

gcloud compute instance-groups managed set-autoscaling web-mig \
  --region=us-central1 \
  --min-num-replicas=6 --max-num-replicas=30 \
  --target-load-balancing-utilization=0.8 \
  --cool-down-period=90

For queue-driven or app-specific load, scale on a custom Cloud Monitoring metric - this is how you size workers off backlog depth rather than CPU:

gcloud compute instance-groups managed set-autoscaling worker-mig \
  --region=us-central1 \
  --min-num-replicas=2 --max-num-replicas=50 \
  --custom-metric-utilization='metric=custom.googleapis.com/app/queue_depth,utilization-target=100,utilization-target-type=GAUGE' \
  --cool-down-period=120

utilization-target-type matters: GAUGE targets the instantaneous per-instance value (100 messages each), DELTA_PER_SECOND/DELTA_PER_MINUTE target a rate. Pick the one that matches how your metric is emitted, or the autoscaler will chase the wrong number.

Production note: keep min-num-replicas high enough to survive the loss of one zone in a regional group. If you need 6 instances to serve peak and you run three zones, a floor of 6 means losing a zone drops you to ~4 until repair - size the floor for the post-failure target, not the happy path.
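The zone-loss arithmetic is easy to script. The sketch below assumes an even spread across zones and that the zone you lose is the fullest one; the function names are hypothetical.

```shell
# Sketch: size the autoscaler floor for the loss of one zone.

# Instances left after losing the fullest zone. Args: group size, zone count.
survivors() { echo $(( $1 - ($1 + $2 - 1) / $2 )); }

# Smallest floor that still leaves NEEDED instances after one zone is lost.
# Args: instances needed, zone count. Computes ceil(needed * zones / (zones - 1)).
zone_safe_floor() { echo $(( ($1 * $2 + $2 - 2) / ($2 - 1) )); }

survivors 6 3         # prints 4: a floor of 6 leaves only 4 instances
zone_safe_floor 6 3   # prints 9: a floor of 9 keeps 6 after a zone loss
survivors 9 3         # prints 6
```

For the example in the note, that means setting --min-num-replicas=9 rather than 6 if six instances are what peak traffic really needs.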

Step 7: Stateful MIGs - preserved disks, stateful IPs, per-instance configs

Default MIGs are stateless: replace an instance and it gets a fresh disk and a new internal IP. That’s wrong for stateful workloads (databases, brokers, anything with identity or local data). A stateful MIG preserves named resources across recreate and update.

Two layers of statefulness:

Stateful policy on the group - a blanket rule that all instances keep their data disk(s) and (optionally) internal IP across updates.
Per-instance configs - individual overrides naming the exact disk and metadata for one instance, so VM web-mig-abc always reattaches its disk.

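Set a group-wide stateful policy preserving a data disk and the internal IP. Two prerequisites apply: a stateful regional MIG needs proactive instance redistribution turned off, and stateful configuration cannot be combined with autoscaling, so remove any autoscaler (such as the one attached in Step 6) from the group first:

gcloud compute instance-groups managed update web-mig \
  --region=us-central1 \
  --instance-redistribution-type=NONE \
  --stateful-disk=device-name=data,auto-delete=never \
  --stateful-internal-ip=interface-name=nic0,auto-delete=on-permanent-instance-deletion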

Pin a specific disk and instance-specific metadata to one named instance with a per-instance config:

gcloud compute instance-groups managed instance-configs create web-mig \
  --region=us-central1 \
  --instance=web-mig-7x2q \
  --stateful-disk=device-name=data,source=projects/PROJECT_ID/zones/us-central1-a/disks/web-data-7x2q,mode=rw,auto-delete=never \
  --stateful-metadata=role=primary

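Critically, stateful MIGs must use recreate as the replacement method, not substitute: the instance name is preserved so its identity and preserved resources survive. Updates on stateful groups are also non-disruptive only to the extent the workload tolerates an in-place recreate; plan rollouts around your quorum (e.g., for a 3-node quorum, never take down more than 1). Because a regional MIG only accepts a fixed maxUnavailable of 0 or at least the number of zones, a quorum that tolerates a single node down is best served by an OPPORTUNISTIC update type, applying the new template one instance at a time with update-instances.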

resource "google_compute_region_instance_group_manager" "stateful" {
  name               = "web-mig"
  region             = "us-central1"
  base_instance_name = "web-mig"
  target_size        = 3

  version {
    instance_template = google_compute_instance_template.web_v2.id
  }

  stateful_disk {
    device_name = "data"
    delete_rule = "NEVER"
  }

  stateful_internal_ip {
    interface_name = "nic0"
    delete_rule    = "ON_PERMANENT_INSTANCE_DELETION"
  }

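  update_policy {
    type                         = "OPPORTUNISTIC" # drive updates per instance with update-instances
    instance_redistribution_type = "NONE"          # required for stateful regional MIGs
    minimal_action               = "RESTART"
    replacement_method           = "RECREATE"      # required for stateful
    max_surge_fixed              = 0               # cannot surge a stateful group
    max_unavailable_fixed        = 3               # regional: must be 0 or >= number of zones
  }

  auto_healing_policies {
    health_check      = google_compute_health_check.autoheal.id
    initial_delay_sec = 300
  }
}

Note max_surge_fixed = 0: a stateful group can’t surge because the preserved identity/disk can’t be duplicated, so availability during an update depends entirely on how many instances are down at once. A regional group rejects a fixed max_unavailable below the zone count, so this example sets the type to OPPORTUNISTIC and leaves the pace to update-instances, one VM at a time, to protect quorum.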
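How many nodes a stateful cluster can lose at once follows from its size. The sketch below assumes a simple majority quorum, which is an assumption about your workload rather than anything the MIG knows about; the function names are hypothetical.

```shell
# Sketch: majority quorum and the largest safe unavailability budget.

# Votes needed for a majority among N nodes.
quorum() { echo $(( $1 / 2 + 1 )); }

# Nodes that can be down at once while a majority remains.
max_down() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 5 7; do
  echo "nodes=$n quorum=$(quorum "$n") max_down=$(max_down "$n")"
done
# prints:
# nodes=3 quorum=2 max_down=1
# nodes=5 quorum=3 max_down=2
# nodes=7 quorum=4 max_down=3
```

Keep in mind that this budget is shared with unplanned failures: if a node is already down, a three-node cluster has no room left for an update.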

Step 8: Draining, surge protection, and validating safely

Before any update touches production traffic, make sure removed instances drain instead of dropping connections. Connection draining lives on the backend service, not the MIG - set it so an instance being replaced finishes in-flight requests:

gcloud compute backend-services update web-backend \
  --global \
  --connection-draining-timeout=120

The safe-rollout pattern that ties the whole article together:

Build web-tmpl-vN+1, validate it boots and passes /healthz in a scratch MIG or a single test instance.
Start a canary at 10-20% with max-unavailable=0, a real min-ready, and surge enabled (stateless) so capacity never dips.
Bake against SLO dashboards for the canary slice; watch error rate and p99, not just “instances healthy.”
Promote to 100% proactively, or stop-proactive-update + roll back to the prior template on regression.
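For stateful groups, update one instance at a time with the recreate method (in a regional group, through an OPPORTUNISTIC policy and update-instances, because a fixed max-unavailable of 1 is below the zone count), and verify quorum after each wave.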
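The stateless part of that pattern can be wrapped in a few helper functions. This is a sketch under stated assumptions: the helper names (run, start_update, canary, promote, rollback) are hypothetical, the surge of 3 assumes a three-zone regional group, and the script only prints the gcloud commands unless DRY_RUN=0 is set. The decision to promote or roll back is left to you and your dashboards.

```shell
# Sketch: canary, promote, and roll back as dry-run-by-default helpers.
MIG=${MIG:-web-mig}
REGION=${REGION:-us-central1}
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to really invoke gcloud

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Shared rollout settings: surge first, never dip below target, real min-ready.
start_update() {
  run gcloud compute instance-groups managed rolling-action start-update "$MIG" \
    --region="$REGION" --type=proactive \
    --max-surge=3 --max-unavailable=0 --min-ready=180 "$@"
}

# Args: stable template, canary template.
canary() {
  start_update --version="template=$1" --canary-version="template=$2,target-size=20%"
}

# Arg: template that becomes the whole fleet.
promote() { start_update --version="template=$1"; }

# Arg: known-good template to return to.
rollback() {
  run gcloud compute instance-groups managed rolling-action stop-proactive-update "$MIG" \
    --region="$REGION"
  start_update --version="template=$1"
}

canary web-tmpl-v1 web-tmpl-v2   # prints the canary command in dry-run mode
```

After the bake, call promote web-tmpl-v2 if the canary held its SLOs, or rollback web-tmpl-v1 on regression.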

Verify

Confirm the group, its versions, health, and per-instance state:

# Group summary: target size, versions, instance template(s) in use
gcloud compute instance-groups managed describe web-mig \
  --region=us-central1

# Per-instance status: which template each VM runs and its current action
gcloud compute instance-groups managed list-instances web-mig \
  --region=us-central1 \
  --format='table(instance, status, currentAction, version.name, instanceHealth[0].detailedHealthState)'

Healthy steady state shows every instance RUNNING, currentAction=NONE, and detailedHealthState=HEALTHY. During a rollout you’ll see CREATING, DELETING, RECREATING, or VERIFYING actions - if instances are stuck in VERIFYING or churning in RECREATING, your initial delay or health check is wrong (revisit Step 3).
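The steady-state condition can be turned into a pass/fail check for scripts or CI. This is a sketch: check_steady is a hypothetical helper that reads four-column CSV on stdin (instance, status, currentAction, health state), prints every instance that is not in steady state, and returns non-zero if it found any. It assumes autohealing is configured, since without a health check the health column is empty and every instance would be flagged.

```shell
# Sketch: flag instances that are not RUNNING / NONE / HEALTHY.
# Reads "instance,status,currentAction,health" CSV lines on stdin.
check_steady() {
  awk -F, '
    $2 != "RUNNING" || $3 != "NONE" || $4 != "HEALTHY" { print; bad++ }
    END { if (bad) exit 1 }'
}

# Usage against a live group (needs gcloud and project access):
# gcloud compute instance-groups managed list-instances web-mig \
#   --region=us-central1 \
#   --format='csv[no-heading](instance.basename(),status,currentAction,instanceHealth[0].detailedHealthState)' \
#   | check_steady && echo "steady state"
```

Polling this in a loop with a timeout gives a simple gate between rollout waves.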

Check that the autoscaler is making decisions you expect:

gcloud compute instance-groups managed describe web-mig \
  --region=us-central1 \
  --format='yaml(autoscaler.status, autoscaler.statusDetails)'

For stateful groups, confirm the per-instance config actually pinned the disk:

gcloud compute instance-groups managed instance-configs describe web-mig \
  --region=us-central1 \
  --instance=web-mig-7x2q

Enterprise scenario

A payments platform team ran a fraud-scoring service on a regional MIG of 24 GPU-backed instances (a2 family) across three zones in us-central1. They hit two compounding problems during a model-image rollout.

First, the rollout deadlocked on capacity. They used max-surge=4 to keep capacity flat, but a2 GPUs were constrained in us-central1-a that afternoon. Surging needs spare capacity to create new instances before deleting old ones; with no GPU headroom, the new instances sat in CREATING and the rollout stalled at 30%, holding double cost on the instances that did surge.

Second, the new image pulled a 9 GB model from GCS on boot and took ~6 minutes to warm. Their autoheal initial-delay was 180 seconds - so the MIG started recreating instances that were still loading the model, and the regional group churned through GPU quota trying (and failing) to land healthy nodes.

The fix was a deliberate switch in strategy for a capacity-constrained, slow-warming fleet:

# 1. Raise the autoheal initial delay above real warm-up time
gcloud compute instance-groups managed update fraud-mig \
  --region=us-central1 \
  --health-check=fraud-warmup-hc \
  --initial-delay=480

# 2. Roll WITHOUT surge: delete-then-create using the unavailable budget,
#    so it never needs spare GPU capacity to proceed. A three-zone regional
#    MIG only accepts a fixed max-unavailable of 0 or >= 3, so 3 (one
#    instance per zone) is the smallest step available.
gcloud compute instance-groups managed rolling-action start-update fraud-mig \
  --region=us-central1 \
  --version=template=fraud-tmpl-v7 \
  --type=proactive \
  --max-surge=0 \
  --max-unavailable=3 \
  --min-ready=300

By moving from a surge-based rollout to a surge-zero rollout driven by max-unavailable, the update never required free GPU capacity to make forward progress - it reused the slots freed by the deleted instances. Pairing that with an initial-delay of 480s (comfortably above the 6-minute warm) stopped the autoheal churn. The rollout completed in ~3 hours with at most 3 of 24 instances (one per zone) out of service at a time, and they accepted the brief capacity dip because the service ran with enough margin at min replicas to absorb it. The lesson: surge trades capacity headroom for speed; when capacity is the scarce resource, max-unavailable is the safer lever - and your autoheal initial delay must always exceed real warm-up, not just boot.


Written by Vinod
