Event sourcing is sold as “store events instead of state,” which is true and also the least interesting thing about it. The hard parts show up six months in: an aggregate that needs ten thousand events to rehydrate, a projection that drifted from the source of truth, a privacy request to delete data from a store whose entire contract is that you never delete, and a v1 event you must keep deserializing forever. This article walks the system end to end, from aggregate boundaries to retention at scale, with the decisions that separate a design that survives production from one nobody wants to touch.
Step 1 - Know when event sourcing is the wrong choice
Event sourcing earns its complexity when the history of changes is itself a business asset: ledgers, trading, inventory, order lifecycles, anything audited, anything where “how did we get here” is a question someone will pay to answer. If you only need current state and would throw the history away, you are buying a large tax for nothing.
The honest cost, before any benefits: you run two eventually-consistent models – a write model (events) and one or more read models (projections); every event is part of your public contract forever, so you cannot ALTER a column, you version a schema and keep old readers alive; and debugging means replaying history, not reading a row, on a team where most engineers have never done it.
Rule of thumb: reach for event sourcing when at least two of these are true – you need a provable audit trail, you have genuinely concurrent writers to the same entity, you need temporal queries (“state as of last Tuesday”), or you must feed several independently-evolving read models. If only one is true, a transactional database with an audit table is almost always the better trade. Do not event-source the whole system; event-source the few aggregates that need it.
A frequent failure mode is event-sourcing a CRUD admin screen because the architecture document said so. Scope it to bounded contexts where the model pays for itself.
Step 2 - Design aggregates and consistency boundaries
The aggregate is the unit of consistency. It is the only place you enforce invariants transactionally, and in event sourcing it is also your concurrency boundary and (usually) your event stream. Get this boundary wrong and everything downstream fights you.
Two rules from domain-driven design that matter more here than anywhere else:
- An aggregate is the smallest set of objects that must change together to keep an invariant true. If two pieces of data can be slightly out of sync without breaking a rule, they belong in different aggregates.
- One transaction commits to exactly one aggregate. Cross-aggregate changes are coordinated with events and (where needed) sagas/process managers, not a distributed transaction.
Keep aggregates small. A large aggregate – say, a Customer that owns every order they ever placed – produces an enormous stream, serializes all that customer’s writes through one optimistic-concurrency lock, and rehydrates slowly. Prefer an Order aggregate referencing a CustomerId. Should OrderLine totals live inside Order? Yes – “order total must equal the sum of lines” is an invariant that must hold atomically. Should the customer’s loyalty balance update in the same transaction? No; that is a different invariant – raise an OrderPlaced event and let a process manager adjust loyalty.
A minimal aggregate that protects an invariant and emits events rather than mutating state directly:
type DomainEvent =
| { type: "OrderPlaced"; orderId: string; customerId: string; at: string }
| { type: "OrderLineAdded"; orderId: string; sku: string; qty: number; unitCents: number }
| { type: "OrderSubmitted"; orderId: string };
class Order {
private id!: string;
private status: "draft" | "submitted" = "draft";
private lineCount = 0;
// version = number of events already applied; drives optimistic concurrency
public version = 0;
static rehydrate(events: DomainEvent[]): Order {
const o = new Order();
for (const e of events) o.apply(e, false);
return o;
}
// apply mutates in-memory state; only NEW events go to `pending`
private apply(e: DomainEvent, isNew: boolean) {
switch (e.type) {
case "OrderPlaced": this.id = e.orderId; this.status = "draft"; break;
case "OrderLineAdded": this.lineCount++; break;
case "OrderSubmitted": this.status = "submitted"; break;
}
this.version++;
if (isNew) this.pending.push(e);
}
public pending: DomainEvent[] = [];
submit() {
if (this.status !== "draft") throw new Error("already submitted");
if (this.lineCount === 0) throw new Error("cannot submit empty order"); // the invariant
this.apply({ type: "OrderSubmitted", orderId: this.id }, true);
}
}
This is the heart of the model: commands validate against current in-memory state and produce events; events are the only thing that mutate state, on replay and for new changes alike. That symmetry is what lets you rebuild any aggregate from its log.
Step 3 - Design the event store: append semantics and optimistic concurrency
An event store needs surprisingly little: an append-only log partitioned by stream, total ordering within a stream, and a way to reject a write based on a stale version. You can run a purpose-built store (EventStoreDB, AxonServer) or build one on Postgres – the latter is the most common and maintainable choice for teams who already operate Postgres.
A workable Postgres schema:
CREATE TABLE events (
global_position BIGINT GENERATED ALWAYS AS IDENTITY, -- total order for projections
stream_id TEXT NOT NULL, -- e.g. 'order-9f3c...'
stream_version INT NOT NULL, -- 1-based position within the stream
event_type TEXT NOT NULL,
event_version SMALLINT NOT NULL DEFAULT 1, -- schema version of THIS event type
data JSONB NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}', -- correlation id, causation id, actor
occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (stream_id, stream_version)
);
-- The optimistic-concurrency guard: two writers cannot both claim version N.
CREATE UNIQUE INDEX uq_stream_version ON events (stream_id, stream_version);
-- Projections subscribe in global order; this index makes the catch-up scan cheap.
CREATE INDEX ix_global_position ON events (global_position);
Appending is a single statement per event whose success or failure is the concurrency check. You read the aggregate at version = N, decide the new events, and insert them at stream_version = N+1, N+2, .... If a concurrent writer already took N+1, the unique index raises a violation and you retry the command against the now-newer state:
-- Append events 6 and 7, expecting the stream to currently be at version 5.
-- If anyone else wrote version 6 first, this fails on uq_stream_version.
INSERT INTO events (stream_id, stream_version, event_type, event_version, data, metadata)
VALUES
('order-9f3c', 6, 'OrderLineAdded', 1, '{"sku":"A1","qty":2,"unitCents":1999}', '{"actor":"u-42"}'),
('order-9f3c', 7, 'OrderSubmitted', 1, '{}', '{"actor":"u-42"}');
Two common mistakes here. First: do not reach for
SERIALIZABLEtransactions when a unique index already gives you exactly the guarantee you need at a fraction of the cost. Second:global_positionfrom an identity column is monotonic but not gap-free under concurrency – a transaction can reserve position 100 and commit slowly while 101 commits first and becomes visible to a reader before 100. Projections that page byglobal_position > last_seencan therefore skip events. The fix: make subscribers gap-tolerant (a transactional outbox or apg_logical/wal2jsonchange-feed gives commit-order delivery), or track a low-water mark and reprocess a small overlap. Treat the global position as a cursor, not a contiguous counter.
One more rule: publishing an appended event to subscribers must be atomic with the insert. The transactional outbox pattern – insert events and an outbox row in one transaction, then relay – or logical replication off the same table avoids the dual-write problem where an event is stored but never published.
Step 4 - Snapshot to bound rehydration cost
Rehydration replays a stream from version 1. Fine for a short-lived aggregate; unacceptable for a long-lived one (a bank account open for years) where you replay tens of thousands of events on every command. Snapshots cap that cost: periodically persist the aggregate’s materialized state at a known version, then load snapshot + events after the snapshot version.
CREATE TABLE snapshots (
stream_id TEXT NOT NULL,
stream_version INT NOT NULL, -- the event version this snapshot reflects
state JSONB NOT NULL,
snapshot_version SMALLINT NOT NULL DEFAULT 1, -- schema version of the snapshot shape
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (stream_id) -- keep only the latest; or PK(stream_id, stream_version) to keep history
);
Loading then reads the snapshot and only the tail after it:
-- 1) Get the snapshot (if any)
SELECT stream_version, state FROM snapshots WHERE stream_id = 'acct-7';
-- 2) Get only the tail
SELECT event_type, event_version, data
FROM events
WHERE stream_id = 'acct-7' AND stream_version > $snapshotVersion
ORDER BY stream_version;
Design choices that matter:
- Snapshot frequency is a throughput-vs-storage knob, not a correctness one. A common policy is “snapshot every N events” (e.g. every 100) checked at write time. Pick N so the post-snapshot tail you replay stays small relative to your latency budget.
- Snapshots are a cache, never the source of truth. You must be able to delete every snapshot and rebuild from events with zero data loss – which is what makes them safe to discard when their schema changes.
- Version your snapshot schema. When the aggregate’s in-memory shape changes, bump
snapshot_version. On load, if the stored snapshot is older than the code expects, ignore it and replay from events rather than deserialize a stale shape – treating a stale snapshot as valid is a classic source of silent corruption. - Do not snapshot synchronously on the hot path if it hurts write latency; compute snapshots asynchronously from the same subscription that drives projections.
Step 5 - Build read-side projections and rebuild them without downtime
Projections (the read side of CQRS) consume the event log in global order and write denormalized views optimized for queries – a SQL table, an Elasticsearch index, a cache. Each projection tracks its position in the log, must be idempotent (the same event may be delivered more than once), and must process events in order within a stream.
CREATE TABLE projection_checkpoints (
projection_name TEXT PRIMARY KEY,
last_position BIGINT NOT NULL DEFAULT 0
);
A projector loop, with the checkpoint advanced in the same transaction as the view write so a crash can only replay, never lose:
async function runProjector(name: string, handle: (e: StoredEvent) => Promise<void>) {
const { last_position } = await db.one(
`SELECT last_position FROM projection_checkpoints WHERE projection_name = $1`, [name]);
const batch = await db.any(
`SELECT global_position, stream_id, event_type, event_version, data
FROM events WHERE global_position > $1
ORDER BY global_position LIMIT 500`, [last_position]);
for (const e of batch) {
await db.tx(async t => {
await handle(e); // idempotent upsert into the read model
await t.none(`UPDATE projection_checkpoints
SET last_position = $1 WHERE projection_name = $2`,
[e.global_position, name]);
});
}
}
Event sourcing wins for read models because of rebuildability: a projection is a pure function of the log, so you can throw it away and recompute it. The technique for reshaping a projection without downtime is a blue/green rebuild:
- Create the new projection table/index alongside the live one (
orders_read_v2). - Reset a new checkpoint to 0 and let a rebuild worker replay the entire log into v2 while v1 keeps serving reads. Replay is fast – a sequential scan with no command validation.
- When v2 catches up to head and stays caught up, flip reads to v2 atomically (swap a view name, feature flag, or alias). For Elasticsearch this is an index alias swap:
curl -XPOST "$ES/_aliases" -H 'Content-Type: application/json' -d '{
"actions": [
{ "remove": { "index": "orders_read_v1", "alias": "orders_read" } },
{ "add": { "index": "orders_read_v2", "alias": "orders_read" } }
]
}'
- Keep v1 for a rollback window, then drop it.
Because the rebuild reads from an immutable log, you can run it repeatedly with no risk to the write side. This is the most liberating property of the architecture: read-side schema changes stop being scary migrations and become “replay into a new table.”
Step 6 - Evolve event schemas: versioning, upcasting, weak schema
Events are immutable and permanent, so the schema problem is “how do new code read old events,” not “how do I change old events.” Three layered techniques, weakest to strongest:
Weak-schema serialization. Use a format that tolerates added/removed optional fields – JSON with lenient deserialization, or Avro/Protobuf with their compatibility rules. Adding an optional field is then a non-event: old events lack it and you default it. Make additive change your default and you avoid most versioning work.
Versioned events plus upcasting. When a change is not additive (a field is split, renamed, or its meaning changes), bump event_version and write an upcaster – a pure function that transforms a v1 payload into the v2 shape at read time, before it reaches your aggregate or projector. The store keeps the original v1 bytes forever; only the in-memory representation is upgraded.
// v1: { name: "Ada Lovelace" } -> v2: { firstName, lastName }
function upcastCustomerRegistered(v: number, data: any): any {
if (v >= 2) return data;
const [firstName, ...rest] = (data.name ?? "").split(" ");
return { firstName, lastName: rest.join(" ") };
}
// Apply on the read path so the rest of the code only ever sees the latest shape.
function deserialize(e: StoredEvent) {
if (e.event_type === "CustomerRegistered")
return upcastCustomerRegistered(e.event_version, e.data);
return e.data;
}
Copy-and-replace / event migration. Reserved for breaking changes you cannot upcast cleanly (splitting one event type into two, fixing genuinely corrupt history). Replay the old stream through a transformation into a new stream, then cut over. This rewrites history, so it is the last resort and must be auditable.
Practical guidance: prefer additive changes; reach for upcasting when you must reshape; avoid in-place migration unless you truly have no alternative. Never delete an old event type’s deserializer while any unupcasted instance of it exists in the store – and in an immutable store, that is forever. Keep upcasters as a versioned chain (v1->v2->v3) so each step stays small and testable.
Step 7 - Handle GDPR deletion and PII in an append-only log
This requirement breaks naive event sourcing: GDPR’s right to erasure obliges you to delete a person’s PII, while event sourcing’s contract is that you never delete. You reconcile them with crypto-shredding:
- Generate a per-subject encryption key (one key per data subject, e.g. per customer).
- Encrypt the PII fields of each event with that subject’s key before appending; store only ciphertext in the immutable log. Keep the keys in a separate, mutable store (a KMS, Vault, or a
subject_keystable). - To honor erasure, delete the key. The events remain byte-for-byte and ordered, but their PII fields are now permanently unrecoverable ciphertext. Non-PII facts (an order happened, totals, timestamps) survive, so invariants and the audit trail stay intact; only the personal data is gone.
// On append: encrypt PII with the subject's key; non-PII stays clear.
const key = await keystore.getOrCreate(customerId); // mutable store
const event = {
type: "CustomerRegistered",
customerId, // pseudonymous id, not PII
emailEnc: encrypt(key, email), // ciphertext in the immutable log
at: new Date().toISOString(),
};
// On erasure request: irreversibly drop the key. Events are untouched.
await keystore.delete(customerId); // emailEnc can now never be decrypted again
Complementary practices: keep PII out of stream_ids and projection keys (use opaque ids), and treat projections as deletable – being derived, you can purge a subject’s rows from read models directly and rebuild. Crypto-shredding protects the immutable log; projection purge handles the queryable copies. Confirm with your DPO; authorities have generally accepted irreversible key destruction as equivalent to erasure, but get it in writing.
Step 8 - Capacity, retention, and partitioning at scale
An append-only store only grows, so plan partitioning and retention up front rather than when the table is a terabyte.
- Partition by time. Use native partitioning (Postgres declarative partitioning by
occurred_at, monthly or weekly). New writes land in the latest partition; old partitions become read-mostly and can be moved to cheaper storage or detached.
CREATE TABLE events (
global_position BIGINT GENERATED ALWAYS AS IDENTITY,
stream_id TEXT NOT NULL, stream_version INT NOT NULL,
event_type TEXT NOT NULL, data JSONB NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT now()
) PARTITION BY RANGE (occurred_at);
CREATE TABLE events_2026_06 PARTITION OF events
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
- Do not “retain” by deleting events – that destroys the source of truth and your audit trail. If a regulator or storage budget forces a horizon, archive cold partitions (export to object storage as Parquet, detach the partition) rather than dropping rows, and keep snapshots so aggregates whose early events are archived can still rehydrate from
snapshot + recent tail. - Streaming throughput: if you fan events out through Kafka, partition the topic by
stream_id(hashed) so all events for one aggregate land on one partition and keep order; never split a single stream across partitions, or you lose ordering. Size partition count for consumer parallelism, not today’s volume. - Capacity math: estimate events/sec x average event bytes x retention. Watch JSONB overhead – it stores keys per row, so short field names and shallow payloads materially reduce footprint at billions of events.
Verify
Treat these as the acceptance tests for an event-sourced system; if any fails, you have a latent production incident.
- Concurrency: fire two commands against the same stream at the same expected version; exactly one commits and the other gets a concurrency violation and retries. Assert no event is silently lost or double-applied.
-- Reconstruct any aggregate from scratch to prove the log is sufficient.
SELECT event_type, event_version, data
FROM events WHERE stream_id = 'order-9f3c' ORDER BY stream_version;
- Snapshot equivalence: load an aggregate via
snapshot + tailand via full replay; assert byte-identical state. Then delete all snapshots and confirm the system still works – snapshots must be disposable. - Projection rebuildability: truncate a read model, reset its checkpoint to 0, replay, and diff the result against the pre-truncation copy. They must match exactly. This is your standing proof that projections are pure functions of the log.
- Upcasting: keep golden v1 payloads in a fixture and assert each upcaster still produces the current shape on every build, so you never lose the ability to read old events.
- Crypto-shredding: after deleting a subject’s key, assert that decrypting their events fails and that non-PII facts and downstream invariants are unaffected.
- Ordering under load: run a soak test and assert per-stream ordering holds end to end (store -> subscription -> projection), with idempotent reprocessing on redelivery.
Enterprise scenario
A payments platform ran account balances as event-sourced aggregates on Postgres. Most accounts were tiny, but a handful of merchant settlement accounts accumulated hundreds of thousands of FundsCaptured / FundsSettled events. Rehydrating one to authorize a new capture had crept to several seconds, and during end-of-month settlement those accounts were exactly the hot path. Worse, two settlement workers occasionally hit the same account at once; one lost its write to a concurrency violation, retried against a stream that took seconds to reload, and retries stacked until a few captures timed out.
The constraint: they could not split a settlement account (its running balance is a single invariant) nor relax the concurrency guard (double-spend was unacceptable). The fix was snapshots tuned to the workload plus a tighter retry – snapshot every 200 events, load snapshot + tail, and cap concurrency retries with jittered backoff so a contended account did not amplify load:
-- Rehydrate a heavy settlement account cheaply: state at the snapshot + only the recent tail.
WITH snap AS (
SELECT stream_version, state FROM snapshots WHERE stream_id = 'acct-merchant-88'
)
SELECT e.event_type, e.event_version, e.data
FROM events e, snap
WHERE e.stream_id = 'acct-merchant-88'
AND e.stream_version > COALESCE((SELECT stream_version FROM snap), 0)
ORDER BY e.stream_version;
Rehydration dropped from seconds to single-digit milliseconds, the retry cap stopped the stacking, and settlement throughput recovered – without touching the aggregate boundary or weakening concurrency. The lesson: snapshot frequency is an operational knob you set per workload, not a one-size constant, and you only learn the right N from the streams that actually hurt.
Checklist
Event sourcing rewards discipline and punishes shortcuts. If aggregates are small, appends are guarded, snapshots are disposable, projections are rebuildable, and PII is crypto-shredded, you get a system whose history is provable, whose read models are cheap to change, and whose worst-case recovery is “replay the log.” Skip any one and you get an immutable monument to a decision you can no longer reverse.