Most teams write off CloudFormation after their first 500-line YAML file and reach for CDK or Terraform. That’s a mistake deep inside AWS Organizations, where CloudFormation is the substrate StackSets, Control Tower, and Service Catalog are built on and the only engine that deploys across every account with native org trust. This is the part that earns its keep: org-wide rollout, extending the resource model, enforcing policy before a resource exists, and keeping reality honest with drift detection.
1. When raw CloudFormation beats CDK and Terraform on AWS
Pick the tool for the constraint, not the fashion. CloudFormation wins in a narrow but important band:
- No external state, no state lock to corrupt. The stack is the state, stored and managed by the service. There is no S3 backend to bootstrap, no DynamoDB lock table, no terraform force-unlock at 3 a.m.
- Native org trust. StackSets with SERVICE_MANAGED permissions deploy into accounts created by AWS Organizations without you provisioning a cross-account IAM role per account. Terraform needs an assumable role wired into every account first.
- Drift detection and Hooks are first-party. The control plane knows the desired state, so it can compare live resources against it and can intercept create/update operations with policy checks before mutation.
- It’s the lingua franca. Control Tower, Service Catalog, and SAM deploy through CloudFormation, and CDK compiles down to it. CDK is a synthesizer; the thing that actually runs is a template.
CDK is the better authoring experience for complex app stacks; Terraform wins the moment you span multiple clouds. But for an AWS landing zone, guardrails, and account baselines, raw CloudFormation plus StackSets is frequently the right answer precisely because it has the fewest moving parts.
Rule of thumb: “deploy this baseline into every account and keep it correct” is a StackSets job. “Build a bespoke application” is a CDK job (which still emits CloudFormation).
2. Multi-account, multi-region rollout with StackSets
A StackSet is a template plus a deployment definition that fans out stack instances across target accounts and regions. The pivotal choice is the permission model.
Self-managed requires you to pre-create two IAM roles: an administration role in the admin account and an execution role in every target account trusting it. This is the legacy path and the source of most StackSet pain.
Service-managed integrates with AWS Organizations: AWS manages the trust, you target organizational units (OUs) instead of enumerating account IDs, and new accounts landing in a targeted OU enroll automatically. This is what you want for a landing zone.
Before service-managed StackSets work, enable trusted access between CloudFormation and Organizations once, from the management account:
aws cloudformation activate-organizations-access
Create a service-managed StackSet, then roll it out to OUs. Two operational dials matter more than anything else here: --auto-deployment (enroll/remove accounts as OU membership changes) and the --operation-preferences that control blast radius.
# Create the StackSet (run from the Organizations management or a delegated admin account)
aws cloudformation create-stack-set \
--stack-set-name org-baseline-guardrails \
--template-body file://baseline.yaml \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--capabilities CAPABILITY_NAMED_IAM \
--description "Account baseline: config recorder, log bucket, IAM password policy"
# Roll out to OUs, region by region, with a conservative failure tolerance
aws cloudformation create-stack-instances \
--stack-set-name org-baseline-guardrails \
--deployment-targets OrganizationalUnitIds=ou-ab12-1a2b3c4d,ou-ab12-5e6f7g8h \
--regions us-east-1 eu-west-1 \
--operation-preferences \
RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5
FailureTolerancePercentage=5 means the operation stops rolling forward in a region once more than 5% of that region’s instances fail, so a bad template in account 3 doesn’t get force-fed to 200 more. RegionConcurrencyType=SEQUENTIAL deploys one region fully before the next, which is what you want for a regional canary. Switch to PARALLEL only once you trust the change.
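To make the percentages concrete, here is a minimal sketch of the arithmetic, assuming the documented rounding (both percentages round down, with concurrency floored at one) and the default strict failure-tolerance mode, which caps concurrency at tolerated failures plus one. The function name effective_limits is hypothetical; treat the numbers as an estimate, not a guarantee of what the service will do.

```python
def effective_limits(instances_per_region: int,
                     max_concurrent_pct: int,
                     failure_tolerance_pct: int,
                     strict: bool = True) -> dict:
    """Translate StackSets percentage preferences into per-region counts."""
    # Both percentages round down to a whole number of accounts.
    tolerated = instances_per_region * failure_tolerance_pct // 100
    # Concurrency never rounds to zero, otherwise nothing would deploy.
    concurrent = max(1, instances_per_region * max_concurrent_pct // 100)
    if strict:
        # Assumption: strict mode caps concurrency at tolerated + 1 so that
        # failures cannot overshoot the tolerance while batches are in flight.
        concurrent = min(concurrent, tolerated + 1)
    return {"max_concurrent": concurrent, "tolerated_failures": tolerated}


print(effective_limits(200, 25, 5))                # {'max_concurrent': 11, 'tolerated_failures': 10}
print(effective_limits(200, 25, 5, strict=False))  # {'max_concurrent': 50, 'tolerated_failures': 10}
print(effective_limits(12, 25, 5))                 # {'max_concurrent': 1, 'tolerated_failures': 0}
```

The last line is the one to notice: in a small OU, 5% rounds down to zero tolerated failures, so the first failure stops the region.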
Updating the template is its own operation. Always update the StackSet and let it propagate, rather than touching instances directly:
aws cloudformation update-stack-set \
--stack-set-name org-baseline-guardrails \
--template-body file://baseline.yaml \
--capabilities CAPABILITY_NAMED_IAM \
--operation-preferences MaxConcurrentPercentage=25,FailureTolerancePercentage=0
| Concern | Self-managed | Service-managed |
|---|---|---|
| Trust setup | You create admin + execution roles per account | AWS Organizations manages trust |
| Targeting | Explicit account IDs | OUs (and the whole org root) |
| New-account enrollment | Manual | Automatic deployment |
| Best for | Pre-existing accounts outside Orgs | Landing zones, org guardrails |
3. Extending the resource model with Lambda-backed custom resources
When no AWS resource type exists for what you need (seed a DynamoDB table, look up an AMI by SSM parameter, call a third-party API during deploy), a custom resource bridges the gap. CloudFormation sends a request to a Lambda function (or SNS topic), and the stack blocks until your code calls back to a pre-signed S3 URL with SUCCESS or FAILED.
The contract is unforgiving, and getting it wrong is the classic CloudFormation foot-gun. Three rules keep you out of trouble:
- You must always respond. If your function errors, times out, or forgets to POST to the response URL, the stack hangs for up to an hour, then fails. Wrap everything in try/except and respond in the failure path too.
- Handle all three request types. Create, Update, and Delete all invoke the same function. A Delete that throws will wedge a stack you can never cleanly remove.
- PhysicalResourceId semantics drive replacement. Return a stable ID for in-place updates. Return a different ID and CloudFormation treats it as a replacement, then sends a Delete for the old physical ID after the new one succeeds. Mishandling this is how custom resources delete the thing they just created.
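One way to keep the PhysicalResourceId rule honest is to derive the ID deterministically from only the properties that define the resource's identity. The sketch below assumes a hypothetical custom resource whose identity is its TableName; REPLACEMENT_KEYS, physical_id, and classify_update are illustrative names, not part of any AWS library.

```python
import hashlib
import json

# Hypothetical: the properties that, if changed, mean "this is a different resource".
REPLACEMENT_KEYS = ("TableName",)


def physical_id(logical_id: str, properties: dict) -> str:
    """Derive a deterministic ID from identity-defining properties only."""
    identity = {key: properties.get(key) for key in REPLACEMENT_KEYS}
    digest = hashlib.sha256(json.dumps(identity, sort_keys=True).encode()).hexdigest()
    return f"{logical_id}-{digest[:12]}"


def classify_update(event: dict) -> str:
    """Report whether an Update request will be seen as in-place or a replacement."""
    new_id = physical_id(event["LogicalResourceId"], event["ResourceProperties"])
    return "in-place" if new_id == event["PhysicalResourceId"] else "replacement"


created = physical_id("Seed", {"TableName": "app", "Version": "3"})
version_bump = {"LogicalResourceId": "Seed", "PhysicalResourceId": created,
                "ResourceProperties": {"TableName": "app", "Version": "4"}}
rename = {"LogicalResourceId": "Seed", "PhysicalResourceId": created,
          "ResourceProperties": {"TableName": "app-v2", "Version": "4"}}
print(classify_update(version_bump))  # in-place
print(classify_update(rename))        # replacement
```

With this shape, a Version bump keeps the same ID, while a renamed table produces a new ID and the follow-up Delete targets the old one, which is the behavior you want.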
A minimal, correct handler in Python. CloudFormation only makes the cfnresponse module available when the function code is defined inline in the template (the ZipFile property), so for real code vendor your own small responder so behavior is explicit:
import json
import urllib.request


def send(event, context, status, data=None, physical_id=None):
    """PUT the result to the pre-signed S3 URL CloudFormation is waiting on."""
    body = json.dumps({
        "Status": status,
        "Reason": f"See CloudWatch log stream: {context.log_stream_name}",
        # Reuse the existing ID on Update/Delete unless the caller supplies one.
        "PhysicalResourceId": physical_id or event.get("PhysicalResourceId") or context.log_stream_name,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }).encode()
    req = urllib.request.Request(event["ResponseURL"], data=body, method="PUT")
    # Empty content type, as cfnresponse sends it, to match the pre-signed URL.
    req.add_header("content-type", "")
    req.add_header("content-length", str(len(body)))
    urllib.request.urlopen(req)


def handler(event, context):
    try:
        rt = event["RequestType"]
        if rt == "Delete":
            # Idempotent teardown; never raise on a resource that's already gone.
            return send(event, context, "SUCCESS")
        # Create/Update work goes here.
        result = {"Value": "computed-output"}
        send(event, context, "SUCCESS", data=result, physical_id="my-stable-id")
    except Exception as e:
        # Always answer, even on failure, or the stack waits out the full timeout.
        print(f"failed: {e}")
        send(event, context, "FAILED")
Reference it from the template. Outputs the function returns under Data are readable with !GetAtt:
Resources:
  Seed:
    Type: Custom::Seed
    Properties:
      ServiceToken: !GetAtt SeedFunction.Arn
      # Any change to these properties triggers an Update invocation
      TableName: !Ref AppTable
      Version: "3"
Outputs:
  SeedValue:
    Value: !GetAtt Seed.Value
Failure-mode hard truth: a Lambda timeout does not notify CloudFormation. Set the function timeout well below the resource’s 1-hour ceiling, and consider a Step Functions or SNS pattern for long-running work. For genuinely reusable extensions, prefer a registered resource type (Section 8) over a one-off custom resource. AWS also publishes the open-source AWSUtility::CloudFormation::CommandRunner and the broader CloudFormation Provider Development Kit (CFN-CLI) for this.
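Because a timeout is silent, one common defensive pattern is a watchdog: a timer that sends FAILED shortly before the Lambda deadline if the real work has not finished. The following is a sketch under assumptions, not a drop-in: run_with_watchdog and OnceResponder are hypothetical names, respond stands in for a responder with the same positional signature as the send helper above, and work is whatever function does the Create/Update logic. The only Lambda API it relies on is context.get_remaining_time_in_millis().

```python
import threading


class OnceResponder:
    """Wrap a responder so only the first call reaches CloudFormation."""

    def __init__(self, respond):
        self._respond = respond
        self._lock = threading.Lock()
        self._done = False

    def __call__(self, event, context, status, data=None):
        with self._lock:
            if self._done:
                return False  # someone already answered; stay silent
            self._done = True
        self._respond(event, context, status, data)
        return True


def run_with_watchdog(event, context, respond, work, margin_ms=5000):
    """Run work(event) and guarantee exactly one response before the deadline."""
    once = OnceResponder(respond)
    # Fire the watchdog margin_ms before Lambda would kill the invocation.
    delay_s = max(0.0, (context.get_remaining_time_in_millis() - margin_ms) / 1000)
    timer = threading.Timer(delay_s, once, args=(event, context, "FAILED"))
    timer.daemon = True
    timer.start()
    try:
        once(event, context, "SUCCESS", work(event))
    except Exception:
        once(event, context, "FAILED")
    finally:
        timer.cancel()
```

The once-only wrapper matters: without it, slow work that finishes just after the watchdog fires would send a second, contradictory response.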
4. Proactive policy enforcement with Hooks
Drift detection tells you something went wrong after it happened. Hooks stop it before. A CloudFormation Hook runs custom validation at the PRE_PROVISION invocation point, before a create (and optionally an update or delete) operation on the targeted resource types, and can fail the operation so the resource is never created. This is policy-as-code that runs inside the deploy, not a nightly scan.
Two flavors exist. Guard Hooks let you write rules in CloudFormation Guard DSL with no Lambda to operate. Lambda Hooks call your own function for arbitrary logic. For most guardrails, Guard is the lower-maintenance choice. A rule that forbids public S3 buckets:
# s3-no-public.guard
rule s3_buckets_block_public_access {
    AWS::S3::Bucket {
        Properties {
            PublicAccessBlockConfiguration exists
            PublicAccessBlockConfiguration.BlockPublicAcls == true
            PublicAccessBlockConfiguration.RestrictPublicBuckets == true
        }
    }
}
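For unit-testing the same policy locally, or as the core of a Lambda Hook, the rule's logic can be mirrored in plain Python. This is a sketch that assumes the template has already been parsed into a dict; public_bucket_violations is a hypothetical helper and does not use the Guard engine or any Hook request format.

```python
def public_bucket_violations(template: dict) -> list[str]:
    """Return logical IDs of S3 buckets that fail the public-access rule."""
    required = ("BlockPublicAcls", "RestrictPublicBuckets")
    failing = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::S3::Bucket":
            continue  # the rule only targets buckets
        block = resource.get("Properties", {}).get("PublicAccessBlockConfiguration")
        # Fail when the block is missing or either flag is not literally true.
        if not isinstance(block, dict) or any(block.get(k) is not True for k in required):
            failing.append(logical_id)
    return failing


template = {
    "Resources": {
        "Good": {"Type": "AWS::S3::Bucket", "Properties": {
            "PublicAccessBlockConfiguration": {
                "BlockPublicAcls": True, "RestrictPublicBuckets": True}}},
        "Open": {"Type": "AWS::S3::Bucket", "Properties": {}},
        "Queue": {"Type": "AWS::SQS::Queue"},
    }
}
print(public_bucket_violations(template))  # ['Open']
```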
Hooks are themselves CloudFormation extensions, configured via the type configuration. The critical setting is failure mode: FAIL blocks non-compliant deploys; WARN only emits to the stack events. Start in WARN to measure impact, then promote to FAIL.
# Set a hook to actively block non-compliant operations
aws cloudformation set-type-configuration \
--type HOOK \
--type-name MyOrg::Guard::S3Public \
--configuration '{
"CloudFormationConfiguration": {
"HookConfiguration": {
"TargetStacks": "ALL",
"FailureMode": "FAIL",
"Properties": {}
}
}
}'
Deploy a Hook org-wide by activating it as a third-party/private type in each account (a StackSet that registers the hook is the clean pattern). Because Hooks evaluate inside the CloudFormation control plane, they catch every stack operation however it was submitted (console, CLI, CDK, or SAM), not just your pipeline. They do not see changes made directly against a service API outside CloudFormation; that gap is what SCPs and drift detection cover. The division of labor is the point: an SCP can deny an API call broadly, but a Hook can apply nuanced, template-aware logic (“RDS must have StorageEncrypted: true and deletion protection”) that an SCP can’t express.
5. Change sets and nested stacks for safe, reviewable updates
Never run update-stack blind against production. A change set is a dry run: CloudFormation computes the diff and, crucially, tells you which resources will be replaced (destroyed and recreated) versus modified in place. Replacement is where outages hide.
aws cloudformation create-change-set \
--stack-name prod-network \
--change-set-name net-2026-06-04 \
--template-body file://network.yaml \
--capabilities CAPABILITY_IAM
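# Wait until CloudFormation has finished computing the diff
aws cloudformation wait change-set-create-complete \
--stack-name prod-network --change-set-name net-2026-06-04
# Inspect before executing; look hard at "Replacement": "True"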
aws cloudformation describe-change-set \
--stack-name prod-network --change-set-name net-2026-06-04 \
--query 'Changes[].ResourceChange.{LogicalId:LogicalResourceId,Action:Action,Replace:Replacement}' \
--output table
aws cloudformation execute-change-set \
--stack-name prod-network --change-set-name net-2026-06-04
For new stacks, --change-set-type CREATE gives you the same preview before the first deploy. The change set is the artifact your reviewer approves in a PR or a pipeline manual-approval gate.
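In a pipeline gate, the same review can be automated by scanning the describe-change-set response for anything destructive. A minimal sketch, assuming the response has been loaded into a dict (for example from the CLI's JSON output); risky_changes is a hypothetical helper, and treating "Conditional" replacements as risky is a conservative choice rather than a rule from the service.

```python
def risky_changes(change_set: dict) -> list[dict]:
    """Pick out removals and (possible) replacements from a change set."""
    risky = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        replacement = rc.get("Replacement", "False")
        # "Conditional" means CloudFormation cannot rule replacement out yet.
        if rc.get("Action") == "Remove" or replacement in ("True", "Conditional"):
            risky.append({
                "LogicalId": rc.get("LogicalResourceId"),
                "Action": rc.get("Action"),
                "Replacement": replacement,
            })
    return risky


sample = {"Changes": [
    {"ResourceChange": {"LogicalResourceId": "Vpc", "Action": "Modify", "Replacement": "False"}},
    {"ResourceChange": {"LogicalResourceId": "SubnetA", "Action": "Modify", "Replacement": "True"}},
    {"ResourceChange": {"LogicalResourceId": "OldRoute", "Action": "Remove"}},
]}
for item in risky_changes(sample):
    print(item["LogicalId"], item["Action"], item["Replacement"])
```

A gate built on this would fail the pipeline, or require an extra approval, whenever the returned list is non-empty.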
Nested stacks decompose a large template into reusable child stacks referenced by an AWS::CloudFormation::Stack resource pointing at a child template in S3. The parent’s change set surfaces nested changes when you pass --include-nested-stacks, so you keep one reviewable diff across the whole tree. Nested stacks share a lifecycle with the parent (delete the parent, the children go too), which is the right coupling for “these components are one unit.” When components have independent lifecycles, use cross-stack references via Export/Fn::ImportValue instead, and accept that an export can’t be changed while another stack imports it.
6. Detecting and reconciling drift across stacks and an Organization
Drift is the gap between the template and what’s actually deployed after someone clicks in the console. CloudFormation detects it asynchronously: you start a detection operation, poll for completion, then read per-resource results.
# Per-stack: start, wait, then inspect
DID=$(aws cloudformation detect-stack-drift --stack-name prod-network \
--query StackDriftDetectionId --output text)
aws cloudformation describe-stack-drift-detection-status \
--stack-drift-detection-id "$DID" \
--query '{Status:DetectionStatus,Drift:StackDriftStatus}'
# Show exactly which resources drifted and how
aws cloudformation describe-stack-resource-drifts \
--stack-name prod-network \
--stack-resource-drift-status-filters MODIFIED DELETED \
--query 'StackResourceDrifts[].{Id:LogicalResourceId,Status:StackResourceDriftStatus}' \
--output table
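The CLI snippet checks status once; in automation you need the polling loop the paragraph describes. Below is a sketch that assumes a boto3 CloudFormation client is passed in as cfn (it uses only detect_stack_drift and describe_stack_drift_detection_status). The function name, poll interval, and timeout are illustrative choices.

```python
import time


def wait_for_drift(cfn, stack_name, poll_seconds=5.0, timeout_seconds=600.0,
                   sleep=time.sleep):
    """Start drift detection and block until it completes; return the drift status."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id)
        state = status["DetectionStatus"]
        if state == "DETECTION_COMPLETE":
            return status["StackDriftStatus"]  # e.g. IN_SYNC or DRIFTED
        if state == "DETECTION_FAILED":
            raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"drift detection for {stack_name} did not finish in time")
        sleep(poll_seconds)


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   print(wait_for_drift(boto3.client("cloudformation"), "prod-network"))
```

Passing the client and the sleep function in keeps the loop testable without touching AWS.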
At org scale, you don’t want to script a loop over every account. StackSets has native drift detection that fans out across all stack instances:
OID=$(aws cloudformation detect-stack-set-drift \
--stack-set-name org-baseline-guardrails \
--query OperationId --output text)
aws cloudformation describe-stack-set-operation \
--stack-set-name org-baseline-guardrails --operation-id "$OID" \
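--query 'StackSetOperation.{Status:Status,Drift:StackSetDriftDetectionDetails.DriftStatus,Drifted:StackSetDriftDetectionDetails.DriftedStackInstancesCount}'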
Reconciliation is deliberately manual, and that’s correct. CloudFormation does not auto-revert drift, because the right response depends on intent:
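- Unintended drift: CloudFormation only acts on differences between the previous and the new template, so re-running update-stack with an unchanged template is rejected with “No updates are to be performed.” Either revert the resource by hand to match the template, or deploy a template change (via update-stack or the StackSet) that touches the drifted property so CloudFormation rewrites it.
- Intended change made out-of-band: update the template to match reality, then deploy, so code and cloud reconverge with the template as the source of truth. Then find out who clicked in the console and close that door with the Hook and SCP from Section 4.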
Operationalize this: an EventBridge scheduled rule that triggers stack/StackSet drift detection on a cadence, with results sent to Security Hub or an SNS alert. Drift you don’t measure is drift you discover during an incident.
7. Deletion safety: stack policies, retention, and termination protection
Three independent controls protect against the worst CloudFormation mistakes. Use all three; they guard different layers.
Termination protection is a stack-level flag that blocks delete-stack entirely. Turn it on for anything stateful or production:
aws cloudformation update-termination-protection \
--stack-name prod-network --enable-termination-protection
DeletionPolicy and UpdateReplacePolicy are per-resource attributes. DeletionPolicy: Retain keeps a resource when its stack is deleted; Snapshot takes a final snapshot for resources that support it (RDS, EBS, ElastiCache). Set UpdateReplacePolicy too, because a replacement during an update deletes the old resource just as surely as a stack delete:
Resources:
  DataBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: Retain
    UpdateReplacePolicy: Retain
  AppDatabase:
    Type: AWS::RDS::DBInstance
    DeletionPolicy: Snapshot
    UpdateReplacePolicy: Snapshot
    Properties:
      DeletionProtection: true
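Forgetting one of the two attributes is easy to catch mechanically before deploy. The sketch below lints a parsed template dict; STATEFUL_TYPES is a hypothetical starter list you would extend for your estate, and unprotected_resources is an illustrative name rather than an existing tool.

```python
# Hypothetical starter list of resource types worth protecting.
STATEFUL_TYPES = {"AWS::S3::Bucket", "AWS::RDS::DBInstance", "AWS::DynamoDB::Table"}

SAFE_VALUES = {
    "DeletionPolicy": {"Retain", "Snapshot", "RetainExceptOnCreate"},
    "UpdateReplacePolicy": {"Retain", "Snapshot"},
}


def unprotected_resources(template: dict) -> dict:
    """Map logical ID -> list of protective attributes that are missing or unsafe."""
    findings = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in STATEFUL_TYPES:
            continue
        missing = [attr for attr, safe in SAFE_VALUES.items()
                   if resource.get(attr) not in safe]
        if missing:
            findings[logical_id] = missing
    return findings


template_to_lint = {"Resources": {
    "DataBucket": {"Type": "AWS::S3::Bucket",
                   "DeletionPolicy": "Retain", "UpdateReplacePolicy": "Retain"},
    "Logs": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain"},
    "Fn": {"Type": "AWS::Lambda::Function"},
}}
print(unprotected_resources(template_to_lint))  # {'Logs': ['UpdateReplacePolicy']}
```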
Stack policies are JSON documents (distinct from IAM) that restrict which resources an update may modify or replace. The canonical pattern is “allow everything, deny replacement of the database”:
{
"Statement": [
{ "Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*" },
{ "Effect": "Deny", "Action": ["Update:Replace", "Update:Delete"],
"Principal": "*", "Resource": "LogicalResourceId/AppDatabase" }
]
}
aws cloudformation set-stack-policy \
--stack-name prod-network --stack-policy-body file://stack-policy.json
These layers compose: termination protection stops accidental stack deletes, the stack policy stops a careless update from replacing your database, and DeletionPolicy/UpdateReplacePolicy are the last net if the resource leaves the stack anyway.
8. Modularizing with the CloudFormation Registry and modules
Copy-pasted YAML rots. Two registry features give you real reuse with versioning.
Modules (MODULE type) package a fragment of template (resources plus their wiring) into a versioned, reusable building block that expands inline at deploy time. Unlike nested stacks, a module isn’t a separate stack at runtime, so there’s no extra stack to manage and no cross-stack export limits, while still centralizing a pattern like “our standard encrypted bucket.” Author the fragment, then register it:
aws cloudformation register-type \
--type MODULE \
--type-name MyOrg::S3::SecureBucket::MODULE \
--schema-handler-package s3://my-cfn-artifacts/secure-bucket-module.zip
Resource types (RESOURCE type), built with the CloudFormation Provider Development Kit (CFN-CLI), are full custom providers with create/read/update/delete/list handlers and drift support. This is the production-grade alternative to a Lambda-backed custom resource: it participates in drift detection, gets a proper schema, and is versioned in the registry. Reach for it when an extension is reused across many stacks and teams.
Both are governed by the same registry primitives: register-type to publish a version, set-type-default-version to promote, and (for service-managed StackSets) a registration StackSet so every account has the type available.
Verify
After wiring the above, confirm each piece independently rather than trusting a green stack.
# StackSet rolled out cleanly to every targeted instance
aws cloudformation list-stack-instances \
--stack-set-name org-baseline-guardrails \
--query 'Summaries[?StackInstanceStatus.DetailedStatus!=`SUCCEEDED`]'
# No drift across the StackSet (empty/IN_SYNC is the pass condition)
aws cloudformation list-stack-instances \
--stack-set-name org-baseline-guardrails \
--query 'Summaries[].{Account:Account,Drift:DriftStatus}' --output table
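# Hook is registered and live
aws cloudformation describe-type \
--type HOOK --type-name MyOrg::Guard::S3Public \
--query '{Status:DeprecatedStatus,Default:DefaultVersionId}'
# Hook configuration carries the FailureMode you intend
aws cloudformation batch-describe-type-configurations \
--type-configuration-identifiers Type=HOOK,TypeName=MyOrg::Guard::S3Public \
--query 'TypeConfigurations[0].Configuration'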
# Termination protection is on for stateful stacks
aws cloudformation describe-stacks \
--stack-name prod-network \
--query 'Stacks[0].EnableTerminationProtection'
A real end-to-end test for the Hook: submit a change set that creates a public S3 bucket and confirm it is rejected with the Hook’s failure reason in the stack events. A guardrail you haven’t watched block something is a guardrail you don’t actually have.
Pitfalls
The failures that recur on real estates, in priority order:
- Custom resources that never call back. A timeout or unhandled exception wedges the stack for an hour. This is the single most common CloudFormation outage. Respond in every code path.
- Replacement you didn’t see coming. Editing an immutable property (a DB engine, a subnet’s AZ) silently triggers destroy-and-recreate. Read change sets for "Replacement": "True" before every prod execute, and back it with a stack policy.
- DeletionPolicy without UpdateReplacePolicy. Teams protect against stack deletion but forget that a replacement during a routine update deletes the resource too. Set both.
- StackSets fanned out at full concurrency. MaxConcurrentPercentage=100 with FailureTolerancePercentage=100 turns one bad template into an org-wide incident. Canary by region first.
- Hooks left in WARN forever. A warning that blocks nothing is theater. Measure in WARN, then commit to FAIL.
Next steps: codify all of this as a pipeline (CodePipeline or GitHub Actions) where the only way to change production is a reviewed change set that has passed Hook validation, then let StackSets and scheduled drift detection keep the org converged without anyone touching a console.