Strangler Fig Migration: Incrementally Decomposing a Monolith into Services

Every monolith modernization that ends in disaster started the same way: someone got approval for a “rewrite,” froze feature work for eighteen months, and tried to land a big-bang cutover over a weekend. The strangler fig pattern is the disciplined alternative. Named after the vine that grows around a host tree until the original rots away and the fig stands on its own, it replaces a monolith one capability at a time while the system keeps shipping features and serving traffic. This article is the playbook: how to find seams, insert a routing facade, extract the first service with branch-by-abstraction, migrate data ownership off the shared database, keep both systems consistent with change-data-capture, and cut over each slice with a real rollback path.

Why the big-bang rewrite fails

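The big-bang rewrite is seductive because it promises a clean break. In practice it fails for structural reasons, not execution ones: it concentrates all the risk in a single cutover, it delivers no value until the very end, and once you flip there is no fallback.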

The strangler fig inverts every one of these. Risk is amortized across many small cutovers, each independently reversible. Value ships continuously because every extracted slice goes live on its own. And because the monolith and the new services run side by side behind a facade, you always have a fallback: route the slice back to the monolith.

The mental model: you are not building a replacement and then switching. You are growing a new system through the old one, slice by slice, until the monolith has nothing left to do and you delete it.

Step 1 – Find seams and bounded contexts

You cannot strangle a monolith you do not understand. The first job is to find seams – natural fracture lines where a capability can be separated with a narrow, well-defined interface. The best seams align with bounded contexts: areas of the domain with their own language, their own data, and their own reasons to change. Billing, Inventory, Notifications, and Catalog are typically separate contexts even if they live in one codebase.

Do not guess. Measure coupling against change. Two signals matter most:

  1. Data coupling – which tables are read and written together. Tables that are always touched in the same transactions belong to the same context.
  2. Change coupling – which files change together in commits. Modules that co-change are coupled regardless of how the package structure looks.

Mine git history for change coupling cheaply:

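# Files most frequently changed in the same commits as billing code, last 2 years.
# --full-diff lists every file in each matching commit, not only the billing paths.
git log --since="2 years ago" --full-diff --name-only --pretty=format: -- 'src/billing/**' \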
  | grep -v '^$' \
  | sort | uniq -c | sort -rn | head -30
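If you would rather post-process the commit list in code, for example to exclude the module's own files from the ranking, the same count can be sketched in a few lines of Java. This is a minimal illustration, assuming the commits have already been parsed into one list of paths per commit; `ChangeCoupling` and `coChangedWith` are hypothetical names, not part of any tool.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts files outside a module that change
// in the same commits as files inside it.
final class ChangeCoupling {

    /**
     * @param commits      one list of changed paths per commit
     * @param modulePrefix path prefix of the module under study, e.g. "src/billing/"
     * @return outside file -> number of commits it shared with the module
     */
    static Map<String, Integer> coChangedWith(List<List<String>> commits, String modulePrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> files : commits) {
            boolean touchesModule = files.stream().anyMatch(f -> f.startsWith(modulePrefix));
            if (!touchesModule) {
                continue; // commit is irrelevant to this module
            }
            // De-duplicate so a path listed twice in one commit counts once.
            for (String file : new HashSet<>(files)) {
                if (!file.startsWith(modulePrefix)) {
                    counts.merge(file, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Files with a high count here are the ones that will resist being left behind when the module moves.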

For data coupling, pull the foreign-key graph straight from the database and look for clusters that the application logic also keeps together:

-- PostgreSQL: every FK edge, to find tightly bound table clusters
SELECT
    tc.table_name      AS from_table,
    ccu.table_name     AS to_table,
    kcu.column_name    AS via_column
FROM information_schema.table_constraints tc
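JOIN information_schema.key_column_usage kcu
     ON tc.constraint_name   = kcu.constraint_name
    AND tc.constraint_schema = kcu.constraint_schema
    AND tc.table_name        = kcu.table_name   -- constraint names are only unique per table
JOIN information_schema.constraint_column_usage ccu
     ON tc.constraint_name   = ccu.constraint_name
    AND tc.constraint_schema = ccu.constraint_schema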
WHERE tc.constraint_type = 'FOREIGN KEY'
ORDER BY from_table;
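To turn those edges into candidate clusters, one simple option is to group tables into connected components. The sketch below assumes the query result has been loaded as `{from_table, to_table}` pairs; `TableClusters` is a hypothetical name, and union-find is one possible grouping technique, not something the database provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: groups tables that are connected by foreign keys.
final class TableClusters {

    /** Each edge is {fromTable, toTable}, as returned by the FK query. */
    static List<Set<String>> clusters(List<String[]> fkEdges) {
        Map<String, String> parent = new HashMap<>();
        for (String[] edge : fkEdges) {
            union(parent, edge[0], edge[1]);
        }
        // Collect every table under the root of its component.
        Map<String, Set<String>> byRoot = new TreeMap<>();
        for (String table : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(parent, table), k -> new TreeSet<>()).add(table);
        }
        return new ArrayList<>(byRoot.values());
    }

    // Union-find lookup with path compression.
    private static String find(Map<String, String> parent, String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(parent, p);
            parent.put(x, p);
        }
        return p;
    }

    private static void union(Map<String, String> parent, String a, String b) {
        parent.put(find(parent, a), find(parent, b));
    }
}
```

In a mature schema, foreign keys alone often connect almost everything, so treat the components as a starting point and prune edges that the application rarely exercises together.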

Rank candidate slices by value over difficulty. Pick a first slice that is high-value enough to justify the platform work but loosely coupled enough to extract safely. Notifications or Catalog reads are classic first slices: they are mostly read-heavy, have few inbound foreign keys, and a bug there does not corrupt orders or money.
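"Value over difficulty" can be made explicit with a rough score. The sketch below is only an illustration: the `value` scale, the difficulty formula (inbound foreign keys plus a double weight for tables the slice writes), and the sample numbers are all assumptions to tune for your own system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical scoring helper for ordering candidate slices.
final class SliceRanking {

    record Slice(String name, int value, int inboundFks, int writeTables) {
        // Assumed heuristic: inbound FKs and written tables make extraction harder.
        double score() {
            return value / (1.0 + inboundFks + 2.0 * writeTables);
        }
    }

    /** Returns the candidates ordered from best first slice to worst. */
    static List<Slice> rank(List<Slice> candidates) {
        List<Slice> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Slice::score).reversed());
        return sorted;
    }
}
```

With sample inputs such as Catalog reads (value 8, one inbound FK, no writes) and Billing (value 10, six inbound FKs, four written tables), Catalog reads ranks first even though Billing is nominally more valuable.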

Step 2 – Insert a routing facade

Before you extract anything, put a facade in front of the monolith so you control routing. Today it forwards 100% of traffic to the monolith. As you extract slices, you redirect specific routes to new services – without clients knowing anything moved. This facade is the linchpin of the whole pattern: it is the layer where you shift traffic incrementally and roll back instantly.

Keep the facade dumb. It is an HTTP reverse proxy with route matching, not a place for business logic, or it becomes a new monolith. Envoy, NGINX, or a cloud gateway (Azure Application Gateway, AWS ALB, API Gateway) all work. Here is the routing core in NGINX – everything to the monolith except the one path we are about to own:

upstream monolith   { server monolith.internal:8080; }
upstream catalog_svc { server catalog-service.internal:8443; }

server {
    listen 443 ssl;
    server_name api.example.com;

    # Extracted slice: catalog reads now go to the new service
    location /api/catalog/ {
        proxy_pass         https://catalog_svc;
        proxy_set_header   Host $host;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # Everything else still served by the monolith
    location / {
        proxy_pass http://monolith;
        proxy_set_header Host $host;
    }
}

Two non-negotiables for the facade:

The facade is where “incremental” stops being a slogan. With it in place, every extraction becomes: build the service, point its route at it, watch metrics, and if anything is wrong, point the route back. No client ever knows.
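The routing decision itself is small enough to model in a few lines. The following is a sketch of the idea only, not how NGINX or Envoy implement it: `RouteTable` is a hypothetical class that resolves a path to the longest matching prefix and falls back to the monolith.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the facade's routing table.
final class RouteTable {
    private final String defaultUpstream;
    private final Map<String, String> prefixToUpstream = new HashMap<>();

    RouteTable(String defaultUpstream) {
        this.defaultUpstream = defaultUpstream;
    }

    /** Send a path prefix to an extracted service. */
    void route(String prefix, String upstream) {
        prefixToUpstream.put(prefix, upstream);
    }

    /** Roll a slice back: its prefix falls through to the default again. */
    void rollback(String prefix) {
        prefixToUpstream.remove(prefix);
    }

    /** Longest matching prefix wins; anything unmatched goes to the default upstream. */
    String resolve(String path) {
        String best = null;
        for (String prefix : prefixToUpstream.keySet()) {
            if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null ? defaultUpstream : prefixToUpstream.get(best);
    }
}
```

Extraction is `route("/api/catalog/", "catalog_svc")`; rollback is `rollback("/api/catalog/")`. Both are data changes, which is what keeps the facade free of business logic.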

Step 3 – Extract the first service with branch-by-abstraction

Now extract a slice. The unsafe way is to rip the code out, stand up a service, and flip traffic. Branch-by-abstraction makes the same extraction incremental and reversible, so the monolith stays releasable the entire time.

The technique has four moves:

  1. Introduce an abstraction in the monolith over the capability you are extracting – an interface that all callers go through.
  2. Keep the existing in-process implementation behind that interface so nothing changes yet.
  3. Add a second implementation that calls the new remote service.
  4. Switch implementations behind a flag, percentage by percentage, then delete the old one once the remote path is proven.

In code, the abstraction is just an interface with two implementations chosen by a flag:

public interface CatalogGateway {
    Product findById(String sku);
}

// Old path: still the in-process monolith module
public class InProcessCatalogGateway implements CatalogGateway { /* legacy code */ }

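// New path: calls the extracted service over HTTP.
// HttpClient is assumed to be an application-level JSON client wrapper;
// java.net.http.HttpClient has no get(path, type) method.
public class RemoteCatalogGateway implements CatalogGateway {
    private final HttpClient http;
    public RemoteCatalogGateway(HttpClient http) { this.http = http; }
    @Override
    public Product findById(String sku) {
        return http.get("/api/catalog/" + sku, Product.class);
    }
}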

// Selection driven by a runtime flag, not a deploy
CatalogGateway gateway = flags.isEnabled("catalog.remote", request)
        ? remoteCatalogGateway
        : inProcessCatalogGateway;

Because the switch is a flag evaluated per request, you can ramp the remote path from 1% to 100% gradually and watch error rate and latency at each step. If the new service misbehaves, set the flag to 0% and you are instantly back on the in-process code – no redeploy, no rollback drama. Only after the remote path has carried full traffic cleanly do you delete InProcessCatalogGateway. That deletion is the moment a slice of the monolith actually dies.
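A percentage ramp works best when the same caller lands on the same side of the flag from one request to the next. One common way to get that is to hash a stable key into a bucket from 0 to 99 and compare it with the current percentage. The sketch below assumes such a key exists (a user or tenant ID); `PercentageFlag` is a hypothetical class, not the API of any particular flag library.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical percentage flag with stable, hash-based bucketing.
final class PercentageFlag {
    private volatile int percent; // 0..100, changeable at runtime

    PercentageFlag(int percent) {
        set(percent);
    }

    void set(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.percent = percent;
    }

    /** Maps a key to a fixed bucket in [0, 100). */
    static int bucket(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** A key is enabled when its bucket falls below the current percentage. */
    boolean isEnabled(String key) {
        return bucket(key) < percent;
    }
}
```

Because buckets are fixed, raising the percentage only ever adds callers to the remote path, and setting it to 0 sends everyone back to the in-process implementation.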

Step 4 – Migrate data ownership: shared database to service-owned stores

Routing is the easy half. The hard half is data. A monolith almost always has a shared database where every module reads every table. A service that still reaches into the monolith’s tables is not a service – it is a distributed module with a network hop, and you have made things worse. The goal is data ownership: each service owns its data and exposes it only through its API.

The failure mode to ban first: the new service connecting directly to the monolith’s database. Do that and you have shared-database coupling forever, where a schema change in the monolith silently breaks the service. Ownership means exactly one writer per table.

Migrate in deliberate phases rather than one move:

Phase          | Writes                    | Reads                                   | What it buys you
0. Shared DB   | Monolith                  | Both                                    | Starting point; tight coupling
1. Encapsulate | Monolith                  | Service via monolith API                | Break direct table reads first
2. Replicate   | Monolith (source of truth)| Service from its own copy, kept in sync | Service runs on owned store, still safe
3. Flip writes | Service (source of truth) | Service; monolith reads via service API | Service now owns the data
4. Sever       | Service                   | Service                                 | Drop the old tables; coupling gone

The ordering is the whole point: stop direct cross-boundary reads before you move the data, and move the data before you move write ownership. Each phase is independently deployable and independently reversible. The dangerous step – flipping write ownership – happens last, after the service has already been serving reads from its own store long enough to trust it.
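It can help to encode the phases so tooling refuses to skip one. The enum below is a sketch of that idea with hypothetical names. It assumes moves are allowed only between adjacent phases, and it treats the final phase as one-way because the old tables are gone by then.

```java
// Hypothetical encoding of the data-ownership phases for one slice.
enum MigrationPhase {
    SHARED_DB("monolith", "monolith tables"),
    ENCAPSULATE("monolith", "monolith API"),
    REPLICATE("monolith", "service store"),
    FLIP_WRITES("service", "service store"),
    SEVER("service", "service store");

    final String writer;           // who is the source of truth for writes
    final String serviceReadsFrom; // where the service gets its data

    MigrationPhase(String writer, String serviceReadsFrom) {
        this.writer = writer;
        this.serviceReadsFrom = serviceReadsFrom;
    }

    /** One step forward or back at a time; nothing leaves SEVER (assumption). */
    boolean canMoveTo(MigrationPhase target) {
        return this != SEVER && Math.abs(target.ordinal() - ordinal()) == 1;
    }
}
```

A deployment script can then check `current.canMoveTo(next)` before touching configuration, which makes an accidental jump from shared reads straight to flipped writes impossible.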

Carve the new store’s schema to the bounded context. Foreign keys that crossed into other contexts become API calls or denormalized copies. If Catalog referenced supplier_id from a Suppliers table now owned by another service, the catalog store keeps the supplier_id value but loses the database-level FK; integrity across contexts is now enforced by the application, not the engine. That is the cost of a clean boundary, and it is worth paying.
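Application-enforced integrity might look like the following sketch. `SupplierDirectory` stands in for a call to the Suppliers service API and the in-memory map stands in for the catalog store; both, like `CatalogWriter` itself, are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical catalog write path that checks a cross-context reference in code.
final class CatalogWriter {

    /** Stand-in for a lookup against the Suppliers service API. */
    interface SupplierDirectory {
        boolean exists(String supplierId);
    }

    record Product(String sku, String supplierId) {}

    private final SupplierDirectory suppliers;
    private final Map<String, Product> store = new HashMap<>(); // stand-in for the catalog store

    CatalogWriter(SupplierDirectory suppliers) {
        this.suppliers = suppliers;
    }

    /** The check the database-level FK used to perform now happens here. */
    void save(Product product) {
        if (!suppliers.exists(product.supplierId())) {
            throw new IllegalArgumentException("unknown supplier: " + product.supplierId());
        }
        store.put(product.sku(), product);
    }

    Optional<Product> find(String sku) {
        return Optional.ofNullable(store.get(sku));
    }
}
```

Unlike a foreign key, this check only holds at write time: a supplier removed later leaves a dangling supplier_id, so the owning service usually needs a policy for that case as well.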

Step 5 – Keep systems in sync with change-data-capture

Phase 2 above requires the service’s store to mirror the monolith’s data continuously while the monolith remains the source of truth. The correct tool is change-data-capture (CDC): tail the database transaction log and stream every committed change as an event. CDC reads the write-ahead log, so it captures every committed change with no application code change in the monolith and no missed writes from a side-door SQL update.

Debezium is the standard implementation. Point it at the monolith’s Postgres write-ahead log and it emits one message per row change:

# Debezium PostgreSQL connector: stream catalog changes off the WAL
name: catalog-cdc-connector
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: monolith-db.internal
  database.port: "5432"
  database.user: debezium
  database.dbname: monolith
  plugin.name: pgoutput              # logical decoding plugin shipped with Postgres
  slot.name: catalog_cdc_slot
  table.include.list: public.products,public.product_prices
  topic.prefix: monolith
  snapshot.mode: initial            # backfill existing rows once, then stream

snapshot.mode: initial backfills the existing rows once, then switches to streaming incremental changes – so the service store is seeded and kept current. A small consumer applies each change event to the service’s own database, transforming the monolith’s schema into the service’s schema as it goes:

@KafkaListener(topics = "monolith.public.products")
public void onProductChange(ChangeEvent event) {
    switch (event.op()) {                       // c=create, u=update, d=delete, r=snapshot read
        case "c", "u", "r" -> catalogRepo.upsert(map(event.after()));
        case "d"           -> catalogRepo.delete(event.before().get("id"));
    }
}

Two correctness rules apply during the sync window:

CDC turns the riskiest part of decomposition – moving live data without losing writes – into a boring, observable stream. The monolith stays the source of truth the entire time the service is proving itself on a replica. You only flip the source of truth once the copy has demonstrably kept pace.
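One property a consumer like the one above generally needs is that redelivered or out-of-order events never move the copy backwards. A minimal sketch of that guard follows, assuming each event carries a monotonically increasing log position, modeled here as a plain `long lsn`. The `ReplicaApplier` and `Change` types are hypothetical and the maps stand in for the service's database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical idempotent applier: ignores duplicates and stale events per row.
final class ReplicaApplier {

    record Change(String op, String id, String payload, long lsn) {}

    private final Map<String, String> rows = new HashMap<>();      // stand-in for the service store
    private final Map<String, Long> lastApplied = new HashMap<>(); // highest position applied per row

    /** Returns true if the change was applied, false if it was a duplicate or stale. */
    boolean apply(Change change) {
        long seen = lastApplied.getOrDefault(change.id(), -1L);
        if (change.lsn() <= seen) {
            return false;
        }
        if (change.op().equals("d")) {
            rows.remove(change.id());
        } else {
            rows.put(change.id(), change.payload()); // c, u and r are all upserts
        }
        // Kept even after a delete, so an older update cannot resurrect the row.
        lastApplied.put(change.id(), change.lsn());
        return true;
    }

    Optional<String> get(String id) {
        return Optional.ofNullable(rows.get(id));
    }
}
```

In a real store the position would live in the same row or transaction as the data, so the check and the write cannot drift apart.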

Step 6 – Cutover, parity testing, and safe rollback per slice

Each slice cuts over independently, and each cutover needs the same three guarantees: parity before, observability during, rollback after.

Parity testing. Before sending real traffic to the new service, prove it produces the same answers as the monolith. Run both in parallel and compare – a technique sometimes called shadowing or dark traffic. Mirror live requests to the new service, compare responses to the monolith’s, but only return the monolith’s answer to the client:

# Envoy: mirror live traffic to the new service for parity checks.
# The mirrored response is discarded; the client still gets the monolith.
route:
  cluster: monolith
  request_mirror_policies:
    - cluster: catalog_svc
      runtime_fraction:
        default_value: { numerator: 10, denominator: HUNDRED }   # mirror 10%

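Log diffs between the two responses. Because the proxy discards the mirrored response, the comparison has to happen out of band: record both responses keyed by a request ID and diff them offline. Investigate every mismatch – it is either a bug in the new service or an undocumented behavior of the monolith you just discovered. Do not cut over until the diff rate is effectively zero.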
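The comparison step can be sketched as a small function over two captured responses. This is an illustration under assumptions: responses are already parsed into a status code and a flat field map, and fields that legitimately differ between systems (timestamps, trace IDs) are passed in as an ignore list. `ParityDiff` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical parity comparator for one mirrored request.
final class ParityDiff {

    record Response(int status, Map<String, Object> body) {}

    /** Returns one human-readable line per difference; an empty list means parity. */
    static List<String> diff(Response monolith, Response candidate, Set<String> ignoredFields) {
        List<String> differences = new ArrayList<>();
        if (monolith.status() != candidate.status()) {
            differences.add("status: " + monolith.status() + " != " + candidate.status());
        }
        // Compare the union of fields so missing and extra fields both show up.
        Set<String> fields = new TreeSet<>(monolith.body().keySet());
        fields.addAll(candidate.body().keySet());
        fields.removeAll(ignoredFields);
        for (String field : fields) {
            Object expected = monolith.body().get(field);
            Object actual = candidate.body().get(field);
            if (!Objects.equals(expected, actual)) {
                differences.add(field + ": " + expected + " != " + actual);
            }
        }
        return differences;
    }
}
```

Keep the ignore list short and reviewed: every field added to it is a field you have decided not to verify.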

Progressive cutover. Once parity holds, shift real traffic in stages – 1% -> 10% -> 50% -> 100% – watching error rate, latency, and business metrics at each step. The facade (Step 2) plus the per-request flag (Step 3) make each increment a config change.
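The staged ramp can be driven by a simple gate: advance one step when the slice is healthy, drop to zero when it is not. The sketch below uses the 1/10/50/100 steps from the text; the error-rate and latency thresholds and the `RampController` name are assumptions for illustration.

```java
import java.util.List;

// Hypothetical ramp gate: one step up when healthy, straight to 0% when not.
final class RampController {
    private static final List<Integer> STEPS = List.of(0, 1, 10, 50, 100);

    private final double maxErrorRate;
    private final double maxP99Millis;
    private int index = 0; // start with no real traffic on the new service

    RampController(double maxErrorRate, double maxP99Millis) {
        this.maxErrorRate = maxErrorRate;
        this.maxP99Millis = maxP99Millis;
    }

    int currentPercent() {
        return STEPS.get(index);
    }

    /** Feed in the metrics observed at the current step; returns the next percentage. */
    int evaluate(double errorRate, double p99Millis) {
        boolean healthy = errorRate <= maxErrorRate && p99Millis <= maxP99Millis;
        if (!healthy) {
            index = 0;
        } else if (index < STEPS.size() - 1) {
            index++;
        }
        return currentPercent();
    }
}
```

The returned percentage is what you would write into the flag or the facade's traffic weights. Business metrics need the same kind of gate, but they are harder to reduce to a single threshold.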

Rollback. This is the property the big-bang rewrite never has. For a read slice, rollback is trivial: point the route back at the monolith. For a write slice, rollback is only safe if the monolith’s data is still current – which is exactly why CDC sync runs bidirectionally during the write-flip window: while the service owns writes, stream the service’s changes back into the monolith’s tables. As long as the monolith stays current, flipping the route back is a safe, instant escape hatch. The moment you stop syncing back to the monolith, you have crossed the point of no return for that slice – so do it only after the slice has been stable in production.
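Bidirectional sync has one well-known hazard: a change replicated from A to B is captured on B and shipped back to A, forever. A common remedy is to tag every change with its origin and never forward a change to the system it came from. The simulation below sketches that idea; `SyncedStore`, the in-memory `outbox` standing in for the CDC stream, and the origin tag are all illustrative assumptions, not Debezium features.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-way sync model with origin tagging to suppress echoes.
final class SyncedStore {

    record Change(String id, String value, String origin) {}

    final String name;
    private final Map<String, String> rows = new HashMap<>();
    private final List<Change> outbox = new ArrayList<>(); // stand-in for the captured change stream

    SyncedStore(String name) {
        this.name = name;
    }

    /** A local write is captured with this store as its origin. */
    void write(String id, String value) {
        rows.put(id, value);
        outbox.add(new Change(id, value, name));
    }

    /** A replicated change is captured too, but keeps its original origin. */
    void applyReplicated(Change change) {
        rows.put(change.id(), change.value());
        outbox.add(change);
    }

    /** Ships captured changes to the peer, skipping any that originated there. */
    int syncTo(SyncedStore peer) {
        int shipped = 0;
        for (Change change : outbox) {
            if (!change.origin().equals(peer.name)) {
                peer.applyReplicated(change);
                shipped++;
            }
        }
        outbox.clear();
        return shipped;
    }

    String get(String id) {
        return rows.get(id);
    }
}
```

A write on the service reaches the monolith once, and the monolith's capture of that same change is not sent back.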

Step 7 – Decommission the monolith, avoid the distributed big ball of mud

As slices peel off, the monolith shrinks until it owns nothing. Decommission deliberately. A module is safe to delete only when no traffic reaches it – prove that with the data, not a guess:

// Azure Monitor / App Insights: confirm a route is truly dead before deleting it
requests
| where timestamp > ago(30d)
| where url has "/api/catalog/"
| summarize hits = count() by bin(timestamp, 1d)
| order by timestamp asc
// Zero across the full window => the monolith path is dead and removable

Thirty days of zeros across a full business cycle (including month-end batch jobs that the daily view can hide) means the code is dead and you can delete it. Deleting dead code is not optional housekeeping; leaving it invites someone to wire it back in and resurrect the coupling you worked to remove.
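A per-day summary only returns rows for days that had traffic, so "all zeros" really means "no rows anywhere in the window". The check is easier to trust when it walks every calendar day explicitly. The sketch below assumes the query result has been loaded into a map of day to hit count; `DeadRouteCheck` is a hypothetical helper.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical check that a route saw no traffic on any day of a window.
final class DeadRouteCheck {

    /** hitsByDay holds only the days that had traffic; absent days count as zero. */
    static List<LocalDate> daysWithTraffic(Map<LocalDate, Long> hitsByDay,
                                           LocalDate from, LocalDate toInclusive) {
        List<LocalDate> busy = new ArrayList<>();
        for (LocalDate day = from; !day.isAfter(toInclusive); day = day.plusDays(1)) {
            if (hitsByDay.getOrDefault(day, 0L) > 0) {
                busy.add(day);
            }
        }
        return busy;
    }

    static boolean isDead(Map<LocalDate, Long> hitsByDay, LocalDate from, LocalDate toInclusive) {
        return daysWithTraffic(hitsByDay, from, toInclusive).isEmpty();
    }
}
```

Returning the offending days rather than a bare boolean makes the month-end stragglers visible: a single hit on the last day of the month is exactly the batch job you were about to break.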

The trap on the far side is the distributed big ball of mud: you decomposed the monolith into services that are still mutually entangled – chatty synchronous call chains, shared databases that never got split, no clear ownership. That is strictly worse than the monolith, because now the coupling has network latency and partial-failure modes too. Guardrails that prevent it:

Enterprise scenario

A retail platform team I worked with ran a ten-year-old Java monolith on a single shared Oracle database. Order placement, pricing, and inventory all lived in one deployable and all wrote the same orders, inventory, and prices tables in shared transactions. Black Friday was the forcing function: the monolith could not scale pricing independently of order capture, and every pricing change required a full monolith deploy with a four-hour change window. Leadership wanted a rewrite. The architecture team pushed back and ran a strangler fig instead.

The constraint that shaped everything: zero downtime and zero tolerance for incorrect prices. A wrong price shown to a customer was a regulatory and trust problem, so parity had to be provable, not assumed.

They started with Pricing, the highest-value, most read-heavy slice. The sequence:

  1. Stood up an Envoy facade in front of the monolith; 100% of traffic still hit the monolith on day one.
  2. Introduced a PriceProvider abstraction in the monolith (branch-by-abstraction) with the legacy in-process implementation behind it.
  3. Built a Pricing service with its own Postgres store, seeded and kept current from Oracle by a Debezium connector reading the redo log via LogMiner. The monolith stayed the source of truth.
  4. Mirrored 100% of live pricing requests to the new service for two weeks via Envoy request mirroring, logging every response diff. The mismatch rate started at 0.3% – all traced to a rounding rule in a legacy stored procedure no one had documented. They fixed the service to match, and the diff rate went to zero.
  5. Ramped real traffic 1% -> 100% over a week behind the flag, then flipped write ownership to the Pricing service with bidirectional CDC back to Oracle so rollback stayed available.

The rounding-rule diff is the part worth dwelling on. Without parity mirroring it would have shipped as a subtle pricing bug across millions of requests. The mirror config that caught it was deliberately simple:

# Envoy: 100% mirror of pricing traffic during the parity window
route:
  cluster: monolith
  request_mirror_policies:
    - cluster: pricing_svc
      runtime_fraction:
        default_value: { numerator: 100, denominator: HUNDRED }
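Which rounding rule the stored procedure applied is not stated here. Purely as an illustration of how small such a divergence can be, the sketch below rounds the same raw price to two decimals under two standard `java.math.RoundingMode` values. The class name and sample values are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustration only: two standard rounding modes disagree on exact midpoints.
final class RoundingParity {

    /** Rounds a raw price string to two decimal places with the given mode. */
    static BigDecimal price(String raw, RoundingMode mode) {
        return new BigDecimal(raw).setScale(2, mode);
    }
}
```

`price("2.345", RoundingMode.HALF_UP)` yields 2.35 while `price("2.345", RoundingMode.HALF_EVEN)` yields 2.34, and the two agree on 2.346. A mismatch that appears only on midpoint values is the kind of difference that unit tests with round numbers miss and a full traffic mirror catches.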

Outcome: Pricing now scales and deploys independently, the four-hour change window for price updates is gone, and the team has reused the same facade-mirror-CDC-cutover template to extract Inventory and Notifications. The Oracle monolith is still running – now down to order capture only – and is on a roadmap to deletion rather than a doomed weekend cutover. The lesson they kept repeating: the strangler fig was slower to start than a rewrite would have been, and far faster to deliver value, because every slice went live on its own and nothing ever required a big-bang flip.

Verify

Before you call any slice “migrated,” confirm it against reality, not intent:

Migration checklist
