Every monolith modernization that ends in disaster started the same way: someone got approval for a “rewrite,” froze feature work for eighteen months, and tried to land a big-bang cutover over a weekend. The strangler fig pattern is the disciplined alternative. Named after the vine that grows around a host tree until the original rots away and the fig stands on its own, it replaces a monolith one capability at a time while the system keeps shipping features and serving traffic. This article is the playbook: how to find seams, insert a routing facade, extract the first service with branch-by-abstraction, migrate data ownership off the shared database, keep both systems consistent with change-data-capture, and cut over each slice with a real rollback path.
Why the big-bang rewrite fails
The big-bang rewrite is seductive because it promises a clean break. In practice it fails for structural reasons, not execution ones:
- The target moves. The monolith keeps accreting features during the rewrite. You are chasing a system that is still changing, so the rewrite is never “done.”
- Risk is concentrated at the end. All the integration risk – data migration, cutover, performance, behavioral parity – arrives in a single high-stakes event. There is no incremental signal that you are on track.
- There is no rollback. Once you flip DNS to the new system and start taking writes, going back means reconciling everything that changed in between. Most teams cannot, so they push forward through outages.
- Value is deferred for the entire project. The business funds eighteen months of work and sees nothing until the end. Funding evaporates after the first re-plan.
The strangler fig inverts every one of these. Risk is amortized across many small cutovers, each independently reversible. Value ships continuously because every extracted slice goes live on its own. And because the monolith and the new services run side by side behind a facade, you always have a fallback: route the slice back to the monolith.
The mental model: you are not building a replacement and then switching. You are growing a new system through the old one, slice by slice, until the monolith has nothing left to do and you delete it.
Step 1 – Find seams and bounded contexts
You cannot strangle a monolith you do not understand. The first job is to find seams – natural fracture lines where a capability can be separated with a narrow, well-defined interface. The best seams align with bounded contexts: areas of the domain with their own language, their own data, and their own reasons to change. Billing, Inventory, Notifications, and Catalog are typically separate contexts even if they live in one codebase.
Do not guess. Measure coupling against change. Two signals matter most:
- Data coupling – which tables are read and written together. Tables that are always touched in the same transactions belong to the same context.
- Change coupling – which files change together in commits. Modules that co-change are coupled regardless of how the package structure looks.
Mine git history for change coupling cheaply:
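# Files most frequently changed in the same commits as billing code, last 2 years.
# --full-diff lists every file in each matching commit, not only the billing paths.
git log --since="2 years ago" --full-diff --name-only --pretty=format: -- 'src/billing/**' \
  | grep -v '^$' \
  | sort | uniq -c | sort -rn | head -30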
For data coupling, pull the foreign-key graph straight from the database and look for clusters that the application logic also keeps together:
-- PostgreSQL: every FK edge, to find tightly bound table clusters
SELECT
    tc.table_name   AS from_table,
    ccu.table_name  AS to_table,
    kcu.column_name AS via_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
   AND tc.constraint_schema = kcu.constraint_schema  -- do not match same-named constraints in other schemas
JOIN information_schema.constraint_column_usage ccu
    ON tc.constraint_name = ccu.constraint_name
   AND tc.constraint_schema = ccu.constraint_schema
WHERE tc.constraint_type = 'FOREIGN KEY'
ORDER BY from_table;
Rank candidate slices by value over difficulty. Pick a first slice that is high-value enough to justify the platform work but loosely coupled enough to extract safely. Notifications or Catalog reads are classic first slices: they have few inbound foreign keys, Catalog reads are almost entirely read traffic, and a bug there does not corrupt orders or money.
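One way to make "value over difficulty" concrete is a crude score per candidate. The sketch below is illustrative only: the `Candidate` fields, the value scale, and the difficulty formula are assumptions for the example, not a standard metric. It assumes Java 17 or later.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical scoring sketch: every field, scale, and weight here is an assumption.
final class SliceRanker {
    /** A candidate slice with rough, team-estimated inputs. */
    record Candidate(String name, int businessValue, int inboundForeignKeys, int coChangedModules) {
        // Difficulty grows with data coupling (inbound FKs) and change coupling.
        int difficulty() {
            return 1 + inboundForeignKeys + coChangedModules;
        }

        double score() {
            return (double) businessValue / difficulty();
        }
    }

    /** Returns candidates ordered by value over difficulty, best first. */
    static List<Candidate> rank(List<Candidate> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(Candidate::score).reversed())
                .toList();
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(List.of(
                new Candidate("Billing", 9, 12, 8),
                new Candidate("Notifications", 5, 1, 2),
                new Candidate("Catalog reads", 7, 2, 3)));
        ranked.forEach(c -> System.out.println(c.name() + " score=" + c.score()));
    }
}
```

With these made-up inputs, Billing ranks last despite having the highest value, because its coupling dominates the score. Treat the output as a conversation starter, not a decision.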
Step 2 – Insert a routing facade
Before you extract anything, put a facade in front of the monolith so you control routing. Today it forwards 100% of traffic to the monolith. As you extract slices, you redirect specific routes to new services – without clients knowing anything moved. This facade is the linchpin of the whole pattern: it is the layer where you shift traffic incrementally and roll back instantly.
Keep the facade dumb. It is an HTTP reverse proxy with route matching, not a place for business logic, or it becomes a new monolith. Envoy, NGINX, or a cloud gateway (Azure Application Gateway, AWS ALB, API Gateway) all work. Here is the routing core in NGINX – everything to the monolith except the one path we are about to own:
upstream monolith    { server monolith.internal:8080; }
upstream catalog_svc { server catalog-service.internal:8443; }
server {
    listen 443 ssl;
    server_name api.example.com;
    # Extracted slice: catalog reads now go to the new service
    location /api/catalog/ {
        proxy_pass https://catalog_svc;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
    # Everything else still served by the monolith
    location / {
        proxy_pass http://monolith;
        proxy_set_header Host $host;
    }
}
Two non-negotiables for the facade:
- It must be transparent. Same hostnames, same paths, same auth semantics. Clients should not be redeployed because you moved a route. That is what lets you migrate without coordinating with every consumer.
- Routing must be data-driven, not a redeploy. Put the route table behind a config reload or a control plane so flipping a slice from monolith to service (and back) is a config change measured in seconds. Envoy’s xDS or an API gateway’s route API is ideal here.
The facade is where “incremental” stops being a slogan. With it in place, every extraction becomes: build the service, point its route at it, watch metrics, and if anything is wrong, point the route back. No client ever knows.
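To show what "data-driven routing" means mechanically, here is a minimal sketch of a route table that is swapped at runtime. It is not how Envoy or NGINX implement routing; the `RouteTable` class and the upstream names are hypothetical, and a real facade would get this from its proxy's control plane.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a reloadable route table; upstream names are assumptions.
final class RouteTable {
    private final String defaultUpstream;
    // The whole table is swapped atomically, so readers never see a half-applied update.
    private final AtomicReference<Map<String, String>> routes;

    RouteTable(String defaultUpstream, Map<String, String> initial) {
        this.defaultUpstream = defaultUpstream;
        this.routes = new AtomicReference<>(Map.copyOf(initial));
    }

    /** Called by a config watcher or control plane; no process restart involved. */
    void reload(Map<String, String> updated) {
        routes.set(Map.copyOf(updated));
    }

    /** Longest matching path prefix wins; unmatched paths fall through to the default. */
    String resolve(String path) {
        return routes.get().entrySet().stream()
                .filter(e -> path.startsWith(e.getKey()))
                .max(Comparator.comparingInt(e -> e.getKey().length()))
                .map(Map.Entry::getValue)
                .orElse(defaultUpstream);
    }

    public static void main(String[] args) {
        RouteTable table = new RouteTable("monolith", Map.of());
        System.out.println(table.resolve("/api/catalog/42")); // served by the default upstream
        table.reload(Map.of("/api/catalog/", "catalog_svc")); // cut the slice over
        System.out.println(table.resolve("/api/catalog/42"));
        table.reload(Map.of());                               // roll the slice back
        System.out.println(table.resolve("/api/catalog/42"));
    }
}
```

Cutover and rollback are the same operation with different data, which is why rollback can be as fast as cutover.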
Step 3 – Extract the first service with branch-by-abstraction
Now extract a slice. The unsafe way is to rip the code out, stand up a service, and flip traffic. Branch-by-abstraction makes the same extraction incremental and reversible, so the monolith stays releasable the entire time.
The technique has four moves:
- Introduce an abstraction in the monolith over the capability you are extracting – an interface that all callers go through.
- Keep the existing in-process implementation behind that interface so nothing changes yet.
- Add a second implementation that calls the new remote service.
- Switch implementations behind a flag, percentage by percentage, then delete the old one once the remote path is proven.
In code, the abstraction is just an interface with two implementations chosen by a flag:
public interface CatalogGateway {
    Product findById(String sku);
}
// Old path: still the in-process monolith module
public class InProcessCatalogGateway implements CatalogGateway { /* legacy code */ }
// New path: calls the extracted service over HTTP.
// HttpClient here stands for the application's own JSON-over-HTTP wrapper,
// not java.net.http.HttpClient.
public class RemoteCatalogGateway implements CatalogGateway {
    private final HttpClient http;
    public RemoteCatalogGateway(HttpClient http) {
        this.http = http;
    }
    @Override
    public Product findById(String sku) {
        return http.get("/api/catalog/" + sku, Product.class);
    }
}
// Selection driven by a runtime flag, not a deploy
CatalogGateway gateway = flags.isEnabled("catalog.remote", request)
        ? remoteCatalogGateway
        : inProcessCatalogGateway;
Because the switch is a flag evaluated per request, you can ramp the remote path from 1% to 100% gradually and watch error rate and latency at each step. If the new service misbehaves, set the flag to 0% and you are instantly back on the in-process code – no redeploy, no rollback drama. Only after the remote path has carried full traffic cleanly do you delete InProcessCatalogGateway. That deletion is the moment a slice of the monolith actually dies.
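The percentage ramp only behaves well if a given caller lands on the same side of the flag from one request to the next. A minimal sketch of that bucketing follows, assuming a stable key such as a customer or session ID is available on the request. `PercentageRollout` is a hypothetical class, not the API of any particular feature-flag product.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical flag evaluator: stable bucketing, not any specific flag product's API.
final class PercentageRollout {
    private volatile int percent; // 0..100, changed at runtime by an operator

    PercentageRollout(int percent) {
        setPercent(percent);
    }

    void setPercent(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be 0..100");
        }
        this.percent = percent;
    }

    /** The same key always maps to the same bucket, so a caller does not flap between paths. */
    static int bucket(String stableKey) {
        CRC32 crc = new CRC32();
        crc.update(stableKey.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100); // 0..99
    }

    /** True when this key should take the remote path at the current percentage. */
    boolean isEnabled(String stableKey) {
        return bucket(stableKey) < percent;
    }
}
```

Because buckets are fixed, raising the percentage only adds callers to the remote path and never moves existing ones back, and setting it to 0 sends everyone to the in-process implementation.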
Step 4 – Migrate data ownership: shared database to service-owned stores
Routing is the easy half. The hard half is data. A monolith almost always has a shared database where every module reads every table. A service that still reaches into the monolith’s tables is not a service – it is a distributed module with a network hop, and you have made things worse. The goal is data ownership: each service owns its data and exposes it only through its API.
The failure mode to ban first: the new service connecting directly to the monolith’s database. Do that and you have shared-database coupling forever, where a schema change in the monolith silently breaks the service. Ownership means exactly one writer per table.
Migrate in deliberate phases rather than one move:
| Phase | Writes | Reads | What it buys you |
|---|---|---|---|
| 0. Shared DB | Monolith | Both | Starting point; tight coupling |
| 1. Encapsulate | Monolith | Service via monolith API | Break direct table reads first |
| 2. Replicate | Monolith (source of truth) | Service from its own copy, kept in sync | Service runs on owned store, still safe |
| 3. Flip writes | Service (source of truth) | Service; monolith reads via service API | Service now owns the data |
| 4. Sever | Service | Service | Drop the old tables; coupling gone |
The ordering is the whole point: stop direct cross-boundary reads before you move the data, and move the data before you move write ownership. Each phase is independently deployable and independently reversible. The dangerous step – flipping write ownership – happens last, after the service has already been serving reads from its own store long enough to trust it.
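The phase table can be encoded so that tooling refuses to skip a step. This is a sketch under assumptions: the enum and method names are hypothetical, and treating the final phase as irreversible reflects the fact that the old tables are dropped there.

```java
// Hypothetical model of the phase table above; names are illustrative.
enum OwnershipPhase {
    SHARED_DB("monolith"),
    ENCAPSULATE("monolith"),
    REPLICATE("monolith"),
    FLIP_WRITES("service"),
    SEVER("service");

    private final String writer;

    OwnershipPhase(String writer) {
        this.writer = writer;
    }

    /** The single system allowed to write in this phase. */
    String writer() {
        return writer;
    }

    /** Only one step forward or one step back is allowed; phases are never skipped. */
    boolean canMoveTo(OwnershipPhase target) {
        if (this == SEVER) {
            return false; // assumption: once the old tables are dropped there is no way back
        }
        return Math.abs(target.ordinal() - ordinal()) == 1;
    }
}
```

A deployment script could call `canMoveTo` before changing a slice's recorded phase, which turns the ordering rule from a convention into a check.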
Carve the new store’s schema to the bounded context. Foreign keys that crossed into other contexts become API calls or denormalized copies. If Catalog referenced supplier_id from a Suppliers table now owned by another service, the catalog store keeps the supplier_id value but loses the database-level FK; integrity across contexts is now enforced by the application, not the engine. That is the cost of a clean boundary, and it is worth paying.
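What "enforced by the application" can look like in the simplest case is sketched below. `CatalogStore` and the supplier check are hypothetical names; the predicate stands in for a call to the Suppliers service API or a lookup in a local denormalized copy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: CatalogStore and the supplier check are illustrative names.
final class CatalogStore {
    record Product(String sku, String supplierId) {}

    private final Map<String, Product> bySku = new HashMap<>();
    // Stands in for a Suppliers service call or a lookup in a local denormalized copy.
    private final Predicate<String> supplierExists;

    CatalogStore(Predicate<String> supplierExists) {
        this.supplierExists = supplierExists;
    }

    /** The check the database FK used to perform now runs in application code. */
    void save(Product p) {
        if (!supplierExists.test(p.supplierId())) {
            throw new IllegalArgumentException("unknown supplier: " + p.supplierId());
        }
        bySku.put(p.sku(), p);
    }

    int size() {
        return bySku.size();
    }
}
```

Unlike a database FK, this check is not transactional across contexts: a supplier can disappear after the check passes. The catalog side therefore has to tolerate a `supplier_id` that no longer resolves.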
Step 5 – Keep systems in sync with change-data-capture
Phase 2 above requires the service’s store to mirror the monolith’s data continuously while the monolith remains the source of truth. The correct tool is change-data-capture (CDC): tail the database transaction log and stream every committed change as an event. CDC reads the write-ahead log, so it captures every committed change with no application code change in the monolith and no missed writes from a side-door SQL update.
Debezium is the standard implementation. Point it at the monolith’s Postgres write-ahead log and it emits one message per row change:
# Debezium PostgreSQL connector: stream catalog changes off the WAL
name: catalog-cdc-connector
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: monolith-db.internal
  database.port: "5432"
  database.user: debezium
  database.dbname: monolith
  plugin.name: pgoutput            # logical decoding plugin shipped with Postgres
  slot.name: catalog_cdc_slot
  table.include.list: public.products,public.product_prices
  topic.prefix: monolith
  snapshot.mode: initial           # backfill existing rows once, then stream
snapshot.mode: initial backfills the existing rows once, then switches to streaming incremental changes – so the service store is seeded and kept current. A small consumer applies each change event to the service’s own database, transforming the monolith’s schema into the service’s schema as it goes:
@KafkaListener(topics = "monolith.public.products")
public void onProductChange(ChangeEvent event) {
    switch (event.op()) { // c=create, u=update, d=delete, r=snapshot read
        case "c", "u", "r" -> catalogRepo.upsert(map(event.after()));
        case "d" -> catalogRepo.delete(event.before().get("id"));
    }
}
Two correctness rules apply during the sync window:
- Consume idempotently. CDC delivers at-least-once, so the same change can arrive twice after a connector or consumer restart. Make every apply an upsert or delete keyed by primary key, never a blind insert.
- Monitor replication lag. The service store trails the monolith by the consumer’s lag. You cannot safely flip write ownership (phase 3) until lag is consistently near zero, so alert on Debezium’s MilliSecondsBehindSource metric and the consumer group lag before you trust the copy.
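The idempotency rule can be sketched as an apply step that remembers the last source position it applied per key and ignores anything at or before it. This is an in-memory illustration under assumptions: `IdempotentApplier` and the `sourcePosition` field are hypothetical. In a real consumer the position would come from the change event's source metadata (for Postgres, a log sequence number), and the map would be a database upsert.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the service store; a real one would be a database upsert.
final class IdempotentApplier {
    /** sourcePosition is an assumed monotonically increasing position from the source log. */
    record Change(String op, String id, String payload, long sourcePosition) {}

    private final Map<String, String> rows = new HashMap<>();
    // Last applied position per key, kept even after deletes so replays stay harmless.
    private final Map<String, Long> lastApplied = new HashMap<>();

    /** Returns true if the change was applied, false if it was a duplicate or stale replay. */
    boolean apply(Change c) {
        long seen = lastApplied.getOrDefault(c.id(), -1L);
        if (c.sourcePosition() <= seen) {
            return false; // already applied: this is a redelivery
        }
        switch (c.op()) {
            case "c", "u", "r" -> rows.put(c.id(), c.payload()); // upsert, never a blind insert
            case "d" -> rows.remove(c.id());
            default -> throw new IllegalArgumentException("unknown op: " + c.op());
        }
        lastApplied.put(c.id(), c.sourcePosition());
        return true;
    }

    String get(String id) {
        return rows.get(id);
    }
}
```

The position guard matters most for deletes: without it, a replayed older update arriving after a delete would resurrect the row.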
CDC turns the riskiest part of decomposition – moving live data without losing writes – into a boring, observable stream. The monolith stays the source of truth the entire time the service is proving itself on a replica. You only flip the source of truth once the copy has demonstrably kept pace.
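"Demonstrably kept pace" is easier to enforce as a gate than as a judgment call. The sketch below assumes lag readings are sampled periodically from whatever metric you alert on. `LagGate`, the window size, and the threshold are hypothetical and need tuning per system.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical gate: window size and threshold are assumptions to tune per system.
final class LagGate {
    private final int windowSize;
    private final long maxLagMillis;
    private final Deque<Long> samples = new ArrayDeque<>();

    LagGate(int windowSize, long maxLagMillis) {
        this.windowSize = windowSize;
        this.maxLagMillis = maxLagMillis;
    }

    /** Add one lag reading, keeping only the most recent windowSize samples. */
    void addSample(long lagMillis) {
        samples.addLast(lagMillis);
        if (samples.size() > windowSize) {
            samples.removeFirst();
        }
    }

    /** Safe only when the window is full and every sample is within the threshold. */
    boolean safeToFlipWrites() {
        return samples.size() == windowSize
                && samples.stream().allMatch(lag -> lag <= maxLagMillis);
    }
}
```

Requiring a full window of good samples, rather than one good reading, is what distinguishes "consistently near zero" from "near zero right now."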
Step 6 – Cutover, parity testing, and safe rollback per slice
Each slice cuts over independently, and each cutover needs the same three guarantees: parity before, observability during, rollback after.
Parity testing. Before sending real traffic to the new service, prove it produces the same answers as the monolith. Run both in parallel and compare – a technique sometimes called shadowing or dark traffic. Mirror live requests to the new service, compare responses to the monolith’s, but only return the monolith’s answer to the client:
# Envoy: mirror live traffic to the new service for parity checks.
# The mirrored response is discarded; the client still gets the monolith.
route:
  cluster: monolith
  request_mirror_policies:
    - cluster: catalog_svc
      runtime_fraction:
        default_value: { numerator: 10, denominator: HUNDRED } # mirror 10%
Log diffs between the two responses. Investigate every mismatch – it is either a bug in the new service or an undocumented behavior of the monolith you just discovered. Do not cut over until the diff rate is effectively zero.
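A minimal sketch of the diffing side follows, assuming something outside the proxy captures both responses for the same request and hands them to a comparator. `ParityChecker` and its `Response` record are hypothetical names.

```java
import java.util.Objects;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical comparator for mirrored responses; the compared fields are an assumption.
final class ParityChecker {
    record Response(int status, String body) {}

    private final LongAdder compared = new LongAdder();
    private final LongAdder mismatched = new LongAdder();

    /** Compares one monolith/service response pair and logs the diff if they disagree. */
    boolean compare(String requestId, Response monolith, Response service) {
        compared.increment();
        boolean same = monolith.status() == service.status()
                && Objects.equals(monolith.body(), service.body());
        if (!same) {
            mismatched.increment();
            System.err.println("parity mismatch request=" + requestId
                    + " monolith=" + monolith + " service=" + service);
        }
        return same;
    }

    /** Fraction of compared pairs that differed; 0.0 when nothing has been compared yet. */
    double mismatchRate() {
        long total = compared.sum();
        return total == 0 ? 0.0 : (double) mismatched.sum() / total;
    }
}
```

Exact string equality is the strictest possible comparison. In practice you would likely normalize fields that legitimately differ, such as timestamps or key ordering, before comparing. Otherwise the diff log fills with noise and real mismatches get lost.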
Progressive cutover. Once parity holds, shift real traffic in stages – 1% -> 10% -> 50% -> 100% – watching error rate, latency, and business metrics at each step. The facade (Step 2) plus the per-request flag (Step 3) make each increment a config change.
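The staged ramp can be expressed as a small controller that advances one step while the observed error rate stays within budget and drops to zero the moment it does not. The step sizes mirror the text. `TrafficRamp` and the error-budget parameter are hypothetical, and a real controller would also look at latency and business metrics.

```java
// Hypothetical ramp controller; steps mirror the text, the error budget is an assumption.
final class TrafficRamp {
    private static final int[] STEPS = {0, 1, 10, 50, 100};

    private final double maxErrorRate;
    private int index = 0;

    TrafficRamp(double maxErrorRate) {
        this.maxErrorRate = maxErrorRate;
    }

    int currentPercent() {
        return STEPS[index];
    }

    /** Takes the error rate observed at the current step and returns the next percentage. */
    int evaluate(double observedErrorRate) {
        if (observedErrorRate > maxErrorRate) {
            index = 0; // roll back to the monolith immediately
        } else if (index < STEPS.length - 1) {
            index++;   // healthy: advance one step, never skip
        }
        return STEPS[index];
    }
}
```

The returned percentage would be written to the flag or the facade's route weights. The controller itself holds no routing logic.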
Rollback. This is the property the big-bang rewrite never has. For a read slice, rollback is trivial: point the route back at the monolith. For a write slice, rollback is only safe if the monolith’s data is still current. That is why CDC runs in both directions during the write-flip window: while the service owns writes, its changes stream back into the monolith’s tables. As long as the monolith stays current, flipping the route back is a safe, instant escape hatch. The moment you stop syncing back to the monolith, you have crossed the point of no return for that slice, so do it only after the slice has been stable in production.
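Two-way sync has a hazard worth guarding against explicitly: a change replicated from A to B is itself a change in B, and a naive pipeline would send it straight back to A. One common guard is to tag each write with its originating system and forward only local ones. The sketch assumes such an origin tag exists (a column or message header you add yourself); `SyncBackFilter` and the system names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical loop guard for two-way sync; the origin tag is an assumed column or header.
final class SyncBackFilter {
    record Change(String table, String id, String origin) {}

    private final String localSystem;

    SyncBackFilter(String localSystem) {
        this.localSystem = localSystem;
    }

    /** Forwards only changes that originated locally; drops echoes of the other side's writes. */
    List<Change> toForward(List<Change> captured) {
        List<Change> out = new ArrayList<>();
        for (Change c : captured) {
            if (localSystem.equals(c.origin())) {
                out.add(c);
            }
        }
        return out;
    }
}
```

Each direction of the sync runs its own filter, so a write crosses the boundary exactly once.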
Step 7 – Decommission the monolith, avoid the distributed big ball of mud
As slices peel off, the monolith shrinks until it owns nothing. Decommission deliberately. A module is safe to delete only when no traffic reaches it – prove that with the data, not a guess:
// Azure Monitor / App Insights: confirm a route is truly dead before deleting it
requests
| where timestamp > ago(30d)
| where url has "/api/catalog/"
| summarize hits = count() by bin(timestamp, 1d)
| order by timestamp asc
// Zero across the full window => the monolith path is dead and removable
Thirty days of zeros across a full business cycle (including month-end batch jobs that the daily view can hide) means the code is dead and you can delete it. Deleting dead code is not optional housekeeping; leaving it invites someone to wire it back in and resurrect the coupling you worked to remove.
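If you automate this check, note that a query which groups by day returns no row at all for a day without requests, so "zero" shows up as a missing day rather than a `0`. The sketch below treats absent days as zero and refuses to pass a window shorter than the required length. `DeadRouteCheck` and its parameters are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Map;

// Hypothetical post-processing of daily hit counts exported from the monitoring query.
final class DeadRouteCheck {
    /**
     * True only if the window spans at least minDays and no day in it has any hits.
     * A day absent from the map counts as zero hits.
     */
    static boolean isDead(Map<LocalDate, Long> hitsByDay, LocalDate from,
                          LocalDate toInclusive, int minDays) {
        long days = ChronoUnit.DAYS.between(from, toInclusive) + 1;
        if (days < minDays) {
            return false; // window too short to cover a full business cycle
        }
        for (LocalDate d = from; !d.isAfter(toInclusive); d = d.plusDays(1)) {
            if (hitsByDay.getOrDefault(d, 0L) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

A single hit on day 29 fails the check, which is the point: a monthly batch job is exactly the caller a shorter window would miss.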
The trap on the far side is the distributed big ball of mud: you decomposed the monolith into services that are still mutually entangled – chatty synchronous call chains, shared databases that never got split, no clear ownership. That is strictly worse than the monolith, because now the coupling has network latency and partial-failure modes too. Guardrails that prevent it:
- One owner per dataset. If two services write the same table, you have not finished decomposing – you have distributed the shared database. Finish the data migration before declaring victory.
- No synchronous call chains for writes. A request that fans out synchronously through five services is a distributed monolith: its availability is at best the product of the availabilities of every service in the chain. Prefer asynchronous events for cross-service workflows.
- Boundaries follow the domain, not the org chart or the database. Services map to bounded contexts. If you sliced by technical layer (a “database service,” a “validation service”), you built tiers, not services.
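The availability arithmetic behind the synchronous-chain guardrail is worth running once with real numbers. The sketch assumes each hop fails independently, which is a simplification, and the 99.9% figure is only an example.

```java
// Worked arithmetic for a synchronous call chain; assumes independent failures per hop.
final class ChainAvailability {
    /** Availability of the whole chain is the product of each hop's availability. */
    static double of(double... hopAvailabilities) {
        double total = 1.0;
        for (double a : hopAvailabilities) {
            total *= a;
        }
        return total;
    }

    public static void main(String[] args) {
        // Five services, each available 99.9% of the time (example figure).
        System.out.println(of(0.999, 0.999, 0.999, 0.999, 0.999));
    }
}
```

Five hops at 99.9% each come out at roughly 99.5%, so the chain has about five times the downtime of any single service in it.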
Enterprise scenario
A retail platform team I worked with ran a ten-year-old Java monolith on a single shared Oracle database. Order placement, pricing, and inventory all lived in one deployable and all wrote the same orders, inventory, and prices tables in shared transactions. Black Friday was the forcing function: the monolith could not scale pricing independently of order capture, and every pricing change required a full monolith deploy with a four-hour change window. Leadership wanted a rewrite. The architecture team pushed back and ran a strangler fig instead.
The constraint that shaped everything: zero downtime and zero tolerance for incorrect prices. A wrong price shown to a customer was a regulatory and trust problem, so parity had to be provable, not assumed.
They started with Pricing, the highest-value, most read-heavy slice. The sequence:
- Stood up an Envoy facade in front of the monolith; 100% of traffic still hit the monolith on day one.
- Introduced a PriceProvider abstraction in the monolith (branch-by-abstraction) with the legacy in-process implementation behind it.
- Built a Pricing service with its own Postgres store, seeded and kept current from Oracle by a Debezium connector reading the redo log via LogMiner. The monolith stayed the source of truth.
- Mirrored 100% of live pricing requests to the new service for two weeks via Envoy request mirroring, logging every response diff. The mismatch rate started at 0.3% – all traced to a rounding rule in a legacy stored procedure no one had documented. They fixed the service to match, and the diff rate went to zero.
- Ramped real traffic 1% -> 100% over a week behind the flag, then flipped write ownership to the Pricing service with bidirectional CDC back to Oracle so rollback stayed available.
The rounding-rule diff is the part worth dwelling on. Without parity mirroring it would have shipped as a subtle pricing bug across millions of requests. The mirror config that caught it was deliberately simple:
# Envoy: 100% mirror of pricing traffic during the parity window
route:
  cluster: monolith
  request_mirror_policies:
    - cluster: pricing_svc
      runtime_fraction:
        default_value: { numerator: 100, denominator: HUNDRED }
Outcome: Pricing now scales and deploys independently, the four-hour change window for price updates is gone, and the team has reused the same facade-mirror-CDC-cutover template to extract Inventory and Notifications. The Oracle monolith is still running – now down to order capture only – and is on a roadmap to deletion rather than a doomed weekend cutover. The lesson they kept repeating: the strangler fig was slower to start than a rewrite would have been, and far faster to deliver value, because every slice went live on its own and nothing ever required a big-bang flip.
Verify
Before you call any slice “migrated,” confirm it against reality, not intent:
- Facade routing is data-driven. Flip a slice’s route from service to monolith and back via config only, with no redeploy. Time it; it should be seconds.
- The old path is reversible. With the remote flag at 0%, confirm traffic falls back to the in-process implementation with zero errors.
- Parity is proven, not assumed. Pull the response-diff logs from the mirroring window and confirm the mismatch rate is effectively zero before any real cutover.
- CDC lag is near zero. Check that Debezium’s MilliSecondsBehindSource and consumer-group lag are consistently low before flipping write ownership.
- One writer per table. Query the new service’s store and the monolith and confirm no table accepts writes from both after the slice is done.
- The dead path is actually dead. Run the request-count query over a full business cycle and confirm zero hits before deleting monolith code.
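The one-writer-per-table check lends itself to automation if you can export who wrote to which table over some window, for example from database audit logs. That export is an assumption here, and `SingleWriterCheck` and `WriteEvent` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical check over exported write-audit records; the record shape is an assumption.
final class SingleWriterCheck {
    record WriteEvent(String table, String writer) {}

    /** Returns tables that received writes from more than one system, with the writers seen. */
    static Map<String, Set<String>> violations(List<WriteEvent> events) {
        Map<String, Set<String>> writers = new TreeMap<>();
        for (WriteEvent e : events) {
            writers.computeIfAbsent(e.table(), t -> new TreeSet<>()).add(e.writer());
        }
        writers.values().removeIf(w -> w.size() < 2); // keep only tables with multiple writers
        return writers;
    }
}
```

An empty result is the pass condition. Any entry names a table whose ownership migration is not finished, along with the systems still writing to it.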