The architecture review stalls on one slide: “Which database?” Someone says “just use Postgres,” someone else says “DynamoDB scales infinitely,” and a third says “Aurora is faster.” All three are right and all three are dangerous, because the question was never which database is best — there is no best. The question is which store fits this workload’s data model, access pattern, scale curve and consistency requirement, and on AWS three managed services cover the overwhelming majority of that space: Amazon RDS (managed relational engines you already know — PostgreSQL, MySQL, MariaDB, Oracle, SQL Server), Amazon Aurora (AWS’s cloud-native relational engine that speaks MySQL and PostgreSQL wire protocols but rebuilds the storage layer underneath), and Amazon DynamoDB (a fully managed, serverless key-value and document store built for single-digit-millisecond latency at any scale).
Pick wrong and you pay for it for years. Force relational, join-heavy, ad-hoc-query data into DynamoDB and you end up scanning tables, fanning out reads in application code, and rewriting access patterns every time the product changes. Force a write-heavy, 100,000-events-per-second firehose onto a single RDS instance and you hit the vertical-scaling ceiling, because read replicas add read capacity and replica lag but do nothing for writes. Run Aurora when a db.t4g.micro RDS instance would do and you burn money on capacity you never touch. This article is the decision framework an architect with 22 years of experience uses to get it right the first time, and to recognise, fast, when an existing choice has gone wrong.
By the end you will choose data-model-first, not hype-first. You will know exactly what RDS gives you that Aurora does not (and vice versa), why DynamoDB’s single-digit-millisecond promise depends entirely on your partition key, what each service’s real limits are (connections, IOPS, item size, RCU/WCU, storage ceilings), how the consistency models actually differ, and what each one costs. Because this is a reference you will return to mid-design and mid-incident, every comparison (engines, instance classes, capacity modes, indexes, endpoints, limits, failure modes) is laid out as a scannable table. Read the prose once; keep the tables open when the decision is live.
What problem this solves
Choosing a database is a one-way door dressed up as a two-way door. Migrating between relational engines is painful; migrating from relational to NoSQL (or back) is a re-architecture, because the data model, the access patterns and half your application code change with it. The cost of a wrong pick is not a config tweak — it is months of remodelling, a scaling wall hit during your biggest traffic event, or a bill that grows faster than revenue.
What breaks without a clear framework: teams default to “the database we know” and put everything on one RDS instance, then discover at 10× scale that vertical scaling has a ceiling and read replicas don’t help writes. Or they over-rotate on “NoSQL scales” and put relational, reporting-heavy data on DynamoDB, then bolt on a second system (often back to a SQL store, or OpenSearch, or Athena over S3) to answer the queries DynamoDB can’t. Or they reach for Aurora reflexively for a tiny internal app and pay a premium for a cluster they’ll never stress. Each of these is a data-model mistake wearing a service-selection costume.
Who hits this: every team standing up a new service, every monolith being decomposed (where one database becomes several, each fitted to its bounded context), and every product that succeeds — because success is exactly when the under-considered database choice fails. The fix is a disciplined decision: name the data model, enumerate the access patterns, project the scale curve, state the consistency requirement, then map those four facts to the store that fits.
To frame the whole field before the deep dive, here is the one-glance summary — the four facts that decide it and what each service answers:
| Deciding factor | RDS | Aurora | DynamoDB |
|---|---|---|---|
| Data model | Relational (normalised, joins, transactions) | Relational (same SQL, cloud-native storage) | Key-value / document (denormalised, access-pattern-shaped) |
| Query style | Ad-hoc SQL, joins, aggregates, reports | Ad-hoc SQL at higher throughput | Known key lookups; queries via designed keys/indexes only |
| Scale ceiling | Vertical (instance size) + read replicas | Higher write throughput; 15 read replicas; storage to 128 TiB | Horizontal, effectively unbounded if keys are well-distributed |
| Consistency | Strong (ACID, single-writer) | Strong (ACID); reader nodes near-real-time | Eventually consistent by default; strongly consistent reads optional |
| Latency profile | Good; degrades with load/locks | Better at scale; sub-10 ms reads on replicas | Single-digit ms at any scale (with the right key) |
| Ops model | You patch/size/tune; managed backups/HA | More managed (storage auto-grows); you size compute | Serverless: no instances, no patching, no capacity tuning (on-demand) |
| Pick it when | Existing SQL app; lift-and-shift; commodity relational | High-throughput SQL; MySQL/PG compatibility + scale/HA | Massive scale; predictable latency; well-defined access patterns |
Learning objectives
By the end of this article you can:
- Choose between RDS, Aurora and DynamoDB from four facts — data model, access patterns, scale curve and consistency requirement — and defend the choice.
- Enumerate what each RDS engine gives you, pick the right instance class and storage type, and configure Multi-AZ vs read replicas for the right reason.
- Explain Aurora’s decoupled compute/storage architecture, choose between provisioned and Serverless v2 capacity, and use the right cluster endpoint (writer / reader / custom) for each workload.
- Design a DynamoDB table around access patterns — partition/sort keys, GSIs vs LSIs, on-demand vs provisioned capacity, and consistency — and avoid hot partitions.
- Reason about each service’s real limits (connections, IOPS, item size, RCU/WCU, storage ceilings) and recognise when you’ve hit one.
- Diagnose the common production failures — replica lag, connection exhaustion, hot partitions, throttling, runaway costs — with the exact CLI/console path to confirm and fix each.
- Right-size and cost-model each option in INR/USD, including free-tier limits, and pick the cheapest store that meets the requirement.
Prerequisites & where this fits
You should be comfortable with core AWS concepts: a VPC with private subnets, security groups, IAM roles and policies, and the difference between the AWS Management Console, the aws CLI and infrastructure-as-code. Basic SQL (SELECT/JOIN/transaction) and the idea of an index help for the relational half; familiarity with key-value thinking (a hash map at planet scale) helps for DynamoDB. You do not need prior Aurora or DynamoDB experience — that’s what this builds.
This sits in the Data & Storage track and is upstream of almost every application design. It assumes the networking foundation from Amazon VPC: Subnets, Route Tables & Security Groups (databases live in private subnets, reachable only through security groups) and the account/identity foundation from AWS Organizations & IAM Foundations. It pairs with AWS Storage: S3 Storage Classes & Lifecycle (S3 is the data-lake target DynamoDB and Aurora export to), with AWS Compute: EC2 vs Lambda vs ECS vs EKS (what connects to these databases), and with AWS Backup & Disaster-Recovery Strategies (how you protect them). The front-door choice in ALB vs NLB vs API Gateway often sits in front of the same compute that talks to these stores.
A quick map of who owns what during a design or an incident, so you pull in the right person:
| Layer | What lives here | Who usually owns it | What it decides / can break |
|---|---|---|---|
| Application / data access | ORM, queries, access patterns, connection pool | App / dev team | Wrong store choice; connection exhaustion; N+1 queries |
| Database engine | RDS engine / Aurora / DynamoDB table | DBA / platform team | Schema, indexes, capacity mode, consistency |
| Storage layer | EBS (RDS), Aurora distributed storage, DynamoDB partitions | AWS (managed) | IOPS, throughput, storage ceiling, replication |
| Network | VPC, subnets, security groups, endpoints | Network team | Reachability; private access via VPC endpoints / PrivateLink |
| Identity | IAM roles, DB auth, KMS keys | Security team | Who can connect; encryption; least privilege |
| Cost / FinOps | Instance size, RCU/WCU, storage, I/O, backups | FinOps + owners | The bill; over-provisioning; runaway on-demand |
Core concepts
Six mental models make every later decision obvious.
Data model first, service second. The single most consequential property is how your data is shaped and queried. Relational data — entities with relationships, queried with joins, aggregates and ad-hoc filters you can’t fully predict — wants RDS or Aurora. Key-value/document data — accessed by a known identifier with a small, well-defined set of patterns — wants DynamoDB. Everything else (latency, scale, cost) is a second-order tie-breaker. Pick the model wrong and no amount of tuning saves you.
RDS is managed engines; Aurora is a re-engineered engine. RDS takes the database engines you already run (PostgreSQL, MySQL, MariaDB, Oracle, SQL Server) and operates them for you — provisioning, patching, backups, Multi-AZ failover. The engine is the same software you’d run on a server, on EBS storage. Aurora keeps the MySQL/PostgreSQL wire protocol and SQL but replaces the storage engine with a purpose-built, distributed, log-structured store that spreads six copies of your data across three Availability Zones and lets storage grow automatically. Aurora is “RDS-compatible SQL with a different, faster, more available storage layer,” not a drop-in for every extension and quirk of vanilla Postgres/MySQL.
DynamoDB is a partitioned hash map, and the partition key is everything. DynamoDB stores items (rows) in partitions chosen by hashing the partition key. A read or write goes straight to the partition for that key — O(1), single-digit milliseconds, at any table size. The catch: throughput is per partition, so if your access concentrates on one key (a “hot partition”), you throttle even though the table as a whole is far under capacity. Every DynamoDB design decision — keys, indexes, item collections — exists to spread load evenly across partitions. Get the key right and it scales forever; get it wrong and it throttles at modest load.
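To make the hot-partition effect concrete, here is a toy model of hash partitioning. It is a sketch only: DynamoDB's real hash function and partition count are internal, so the MD5 hash, the fixed `NUM_PARTITIONS = 8` and the helper names are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # assumption: real partition counts are managed by DynamoDB


def partition_for(key: str, partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition-key value to a partition index (toy stand-in for the internal hash)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


def load_per_partition(keys: list[str]) -> Counter:
    """Count how many requests land on each partition."""
    return Counter(partition_for(k) for k in keys)


# 10,000 requests keyed by a high-cardinality attribute (one key per user)
by_user = load_per_partition([f"user#{i}" for i in range(10_000)])

# 10,000 requests keyed by a low-cardinality attribute (three status values)
statuses = ["PENDING", "SHIPPED", "DELIVERED"]
by_status = load_per_partition([statuses[i % 3] for i in range(10_000)])

print("userId key: partitions used =", len(by_user), "busiest =", max(by_user.values()))
print("status key: partitions used =", len(by_status), "busiest =", max(by_status.values()))
```

With the status key, at most three partitions ever receive traffic, so each absorbs at least a third of the load while the rest sit idle. That is the shape of a hot partition.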
Consistency is a spectrum you choose, not a given. RDS and Aurora are ACID: a single writer, strong consistency, transactions. Aurora reader endpoints serve reads from replicas that lag the writer by milliseconds (near-real-time but not the writer’s exact instant). DynamoDB defaults to eventual consistency (a read may not reflect the most recent write for a short window, because it might hit a replica that hasn’t caught up) and offers strongly consistent reads as an opt-in (twice the RCU cost of an eventually consistent read; available on the base table and LSIs, not on GSIs). Knowing which guarantee each store gives, and which your workload actually needs, prevents both correctness bugs and over-paying for consistency you don’t require.
Capacity is sized differently per service. RDS capacity is an instance class (vCPU + RAM, e.g. db.r6g.xlarge) plus a storage type (gp3/io2) — you pick the box. Aurora is the same instance idea, but storage auto-scales; Aurora Serverless v2 even scales the compute in fine-grained ACUs (Aurora Capacity Units) with load. DynamoDB capacity is read/write units — on-demand (pay per request, no planning) or provisioned (you set RCU/WCU, optionally with auto-scaling). Knowing the unit each service bills in is how you size and cost it.
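The DynamoDB units are easiest to understand as arithmetic. The sketch below uses the published unit definitions (one RCU is one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads; one WCU is one write per second of an item up to 1 KB). The function names are hypothetical, and transactional requests, which cost double, are left out.

```python
import math


def read_capacity_units(item_bytes: int, reads_per_sec: float,
                        strongly_consistent: bool = False) -> float:
    """RCUs needed for a steady read rate; item size is rounded up to 4 KB blocks."""
    units_per_read = math.ceil(item_bytes / 4096)
    if not strongly_consistent:
        units_per_read /= 2  # eventually consistent reads cost half
    return units_per_read * reads_per_sec


def write_capacity_units(item_bytes: int, writes_per_sec: float) -> float:
    """WCUs needed for a steady write rate; item size is rounded up to 1 KB blocks."""
    return math.ceil(item_bytes / 1024) * writes_per_sec


# A 6 KB item read 100 times per second and written 50 times per second
item = 6 * 1024
print(read_capacity_units(item, 100))                            # eventually consistent
print(read_capacity_units(item, 100, strongly_consistent=True))  # strongly consistent
print(write_capacity_units(item, 50))
```

The rounding is why item size matters for cost: a 6 KB item pays for 8 KB on every strongly consistent read and for six write units on every write.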
Managed ≠ zero-ops, except where it is. RDS and Aurora still need you to choose instance sizes, manage connection pools, tune parameters, schedule maintenance windows and watch metrics — AWS manages the infrastructure, you manage the database. DynamoDB on-demand is the closest to genuinely serverless: no instances, no patching, no capacity tuning, scaling handled for you — your only job is the data model and the keys. The further right you go (RDS → Aurora → Aurora Serverless v2 → DynamoDB on-demand), the less operational surface you own.
The vocabulary in one table
Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:
| Term | One-line definition | Which service | Why it matters |
|---|---|---|---|
| Engine | The database software (PostgreSQL, MySQL, …) | RDS | Decides SQL dialect, features, licensing |
| DB instance | The compute box (class = vCPU + RAM) running the engine | RDS / Aurora | Capacity ceiling; cost; failover unit |
| Multi-AZ | A synchronous standby in another AZ for failover | RDS / Aurora | HA; standby is not readable (RDS classic) |
| Read replica | An async copy serving read-only traffic | RDS / Aurora | Scales reads; can lag the primary |
| Cluster | Aurora’s writer + readers sharing one storage volume | Aurora | The unit you create; endpoints point into it |
| Cluster endpoint | DNS that routes to writer / readers / custom set | Aurora | Send writes vs reads to the right node |
| ACU | Aurora Capacity Unit (≈2 GiB RAM + CPU/network) | Aurora Serverless v2 | The fine-grained scaling/billing unit |
| Partition key | The attribute hashed to place an item | DynamoDB | Distribution; hot-partition risk |
| Sort key | Second key for ordering within a partition | DynamoDB | Range queries; item collections |
| RCU / WCU | Read / Write Capacity Unit (one unit of throughput) | DynamoDB | Provisioned capacity + billing unit |
| GSI / LSI | Global / Local Secondary Index | DynamoDB | Query by non-key attributes |
| Item | One record (≤ 400 KB) | DynamoDB | The unit of storage and capacity math |
| PITR | Point-in-time recovery | All three | Restore to any second in the window |
The service comparison reference
Before the per-service deep dives, here is the master comparison you scan first — every dimension that separates the three, side by side. The non-obvious rows are failover time (Aurora is much faster than RDS classic Multi-AZ), read scaling (DynamoDB and Aurora beat RDS), and operational surface (which shrinks left to right).
| Dimension | RDS | Aurora | DynamoDB |
|---|---|---|---|
| Type | Managed relational | Cloud-native relational | Managed NoSQL (key-value/doc) |
| Engines / API | PostgreSQL, MySQL, MariaDB, Oracle, SQL Server | MySQL-compatible, PostgreSQL-compatible | DynamoDB API (PartiQL optional) |
| Storage model | EBS volume per instance | Distributed, 6 copies / 3 AZs, auto-grow | Partitioned, replicated across 3 AZs |
| Max storage | 64 TiB (gp3, varies by engine) | 128 TiB per cluster | Unbounded (per-item ≤ 400 KB) |
| Write scaling | Vertical (bigger instance) | Vertical; higher ceiling per instance | Horizontal (add partitions automatically) |
| Read scaling | Up to 15 read replicas (5 for Oracle / SQL Server) | Up to 15 low-lag reader replicas | Horizontal; eventually-consistent reads cheap |
| Replica lag | Async, seconds possible | Typically < 100 ms (shared storage) | N/A (eventual reads ~ms behind) |
| Failover time | 60–120 s (Multi-AZ classic) | Typically < 30 s (often < 15 s) | Transparent (no failover concept) |
| Consistency | Strong (ACID) | Strong (ACID) | Eventual (default) / strong (opt-in) |
| Transactions | Full, multi-statement | Full, multi-statement | Limited (TransactWrite/Get, ≤ 100 items) |
| Joins / ad-hoc queries | Yes | Yes | No (design keys/indexes up front) |
| Global / multi-region | Cross-region read replicas | Aurora Global Database (< 1 s lag) | Global Tables (active-active) |
| Backups | Automated + snapshots, PITR | Continuous to S3, PITR, fast clone | Continuous PITR (35 days), on-demand |
| Scaling effort | Resize instance (downtime/replica) | Resize / Serverless v2 auto | On-demand: none; provisioned: auto-scaling |
| Pricing unit | Instance-hour + storage + I/O | Instance-hour (or ACU-hr) + storage + I/O | Per-request (on-demand) or RCU/WCU + storage |
| Best fit | Lift-and-shift, commodity SQL | High-throughput SQL, HA, MySQL/PG compat | Massive scale, predictable latency, key access |
| Worst fit | Web-scale writes; unpredictable spikes | Tiny apps (overkill); niche engine features | Relational/reporting; ad-hoc analytics |
And the decision as a data-model-first flow in table form — start at the top, stop at the first row that matches:
| If your data / workload is… | …then the right store is | Because |
|---|---|---|
| An existing PostgreSQL/MySQL/Oracle/SQL Server app to migrate with minimal change | RDS (same engine) | Lift-and-shift; keep the engine, gain management |
| Relational, but you need higher write throughput, faster failover, or 5–15 replicas | Aurora | Same SQL, cloud-native storage & HA |
| Relational with spiky/unpredictable load you don’t want to size | Aurora Serverless v2 | Compute auto-scales in ACUs |
| Key-value or document, accessed by a known ID, at large or unpredictable scale | DynamoDB | O(1) partitioned access, serverless scale |
| Time-series / event firehose at very high write rate | DynamoDB (or purpose-built TS store) | Horizontal writes; no replica sprawl |
| Heavy ad-hoc analytics / joins over big data | Redshift / Athena (not these three) | OLAP, not OLTP — different tool |
| In-memory caching / leaderboards / sub-ms | ElastiCache in front of any of the above | Cache layer, not the system of record |
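The first-match logic of the table above can be sketched as a small function. This is an illustration of the flow, not a complete selector: the `Workload` fields and `choose_store` are hypothetical names, only five of the rows are modelled, and the fallback to RDS for otherwise-unremarkable relational data is an assumption in line with treating RDS as the relational default.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    relational: bool             # joins, ad-hoc SQL, multi-statement transactions
    existing_engine: bool        # migrating an existing PostgreSQL/MySQL/Oracle/SQL Server app
    high_throughput_or_ha: bool  # needs more write throughput, faster failover, many replicas
    spiky_load: bool             # unpredictable load you do not want to size
    analytics_heavy: bool        # OLAP-style scans and joins over big data


def choose_store(w: Workload) -> str:
    """Walk the decision table top to bottom and stop at the first matching row."""
    if w.relational and w.existing_engine:
        return "RDS"
    if w.relational and w.high_throughput_or_ha:
        return "Aurora"
    if w.relational and w.spiky_load:
        return "Aurora Serverless v2"
    if not w.relational and not w.analytics_heavy:
        return "DynamoDB"
    if w.analytics_heavy:
        return "Redshift / Athena"
    return "RDS"  # assumption: plain relational data with no special pressure


# A key-value workload accessed by a known ID
print(choose_store(Workload(False, False, False, False, False)))
```

Row order matters: an existing engine being migrated matches the first row even if it also has spiky load, which mirrors the "stop at the first row that matches" rule.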
Amazon RDS, option by option
RDS is the safe default for relational data: pick the engine you know, the box that fits, the storage that performs, and let AWS run it. Every choice below is a lever.
Engine choice
RDS runs five engines; the choice is driven by your existing app, licensing, and feature needs. What each gives you and the gotcha:
| Engine | Pick it when | Licensing | Key strength | Gotcha |
|---|---|---|---|---|
| PostgreSQL | Modern apps; rich SQL, extensions (PostGIS, JSONB) | Open-source (no licence) | Extensions, standards-compliance, JSON | Major-version upgrades need care |
| MySQL | LAMP-style apps, broad ecosystem | Open-source | Ubiquity, tooling, replication | Some features lag Postgres |
| MariaDB | MySQL drop-in, prefer the fork | Open-source | MySQL-compatible, some extra engines | Smaller managed-feature parity |
| Oracle | Existing Oracle estate, PL/SQL, specific features | BYOL or License-Included | Enterprise features, compatibility | Cost; licence complexity |
| SQL Server | .NET shops, T-SQL, SSRS/SSAS-adjacent | License-Included (editions) | Windows/.NET integration | Cost; edition limits (RAM/cores) |
# Create a PostgreSQL RDS instance in private subnets (gp3, Multi-AZ)
aws rds create-db-instance \
--db-instance-identifier orders-prod \
--engine postgres --engine-version 16.4 \
--db-instance-class db.r6g.large \
--allocated-storage 100 --storage-type gp3 \
--multi-az --no-publicly-accessible \
--master-username appadmin --manage-master-user-password \
--db-subnet-group-name db-private --vpc-security-group-ids sg-0abc123
# Terraform: the same instance, with secrets in Secrets Manager
resource "aws_db_instance" "orders" {
identifier = "orders-prod"
engine = "postgres"
engine_version = "16.4"
instance_class = "db.r6g.large"
allocated_storage = 100
max_allocated_storage = 500 # storage autoscaling ceiling
storage_type = "gp3"
multi_az = true
publicly_accessible = false
db_subnet_group_name = aws_db_subnet_group.private.name
vpc_security_group_ids = [aws_security_group.db.id]
username = "appadmin"
manage_master_user_password = true # AWS-managed master secret
backup_retention_period = 14
storage_encrypted = true
deletion_protection = true
}
Instance classes
The instance class is the box: vCPU, RAM, network. Families differ in CPU/RAM ratio and price. Match the workload:
| Family | Profile | RAM:vCPU | Pick it for | Avoid for |
|---|---|---|---|---|
| db.t4g / t3 | Burstable (CPU credits) | ~4:1 | Dev/test; low or spiky load | Sustained high CPU (credits run out) |
| db.m6g / m7g | General purpose | ~4:1 | Balanced OLTP workloads | Memory-bound big working sets |
| db.r6g / r7g | Memory-optimised | ~8:1 | Large buffer pool, cache-heavy | Pure CPU-bound compute |
| db.x2g | Extra memory-optimised | ~16:1 | Very large in-memory data sets | Cost-sensitive small apps |
| Graviton (g) | ARM-based, better price/perf | — | Almost everything (cheaper) | Engine/version not yet supporting ARM |
A burstable db.t4g runs on CPU credits: cheap at idle, but a sustained workload exhausts credits and throttles (or bills for surplus credits). The first scaling mistake on RDS is leaving production on a t-class and hitting the credit wall under load. The fix is the right class and, where you can, Graviton for the price/performance.
# Resize the instance class (incurs a brief failover on Multi-AZ)
aws rds modify-db-instance --db-instance-identifier orders-prod \
--db-instance-class db.r6g.xlarge --apply-immediately
Storage types
RDS storage is EBS underneath; the type decides IOPS and throughput characteristics and cost:
| Storage type | IOPS model | Throughput | Pick it for | Limit / note |
|---|---|---|---|---|
| gp3 (general purpose SSD) | 3,000 baseline, provisionable to 16,000 | 125–1,000 MB/s | The default for most OLTP | Decouples IOPS from size — cheaper than gp2 |
| gp2 (legacy SSD) | 3 IOPS/GB (burst to 3,000) | Scales with size | Legacy; migrate to gp3 | IOPS tied to volume size |
| io2 Block Express | Provisioned, up to 256,000 | Very high | Latency-sensitive, high-IOPS OLTP | Highest cost; for demanding workloads |
| Magnetic | Low | Low | Never for production | Legacy only |
The classic storage mistake is sizing gp2 and discovering your IOPS are capped by volume size, not by need — a 100 GB gp2 volume gives ~300 baseline IOPS. gp3 decouples IOPS from size, so you provision the IOPS you need without inflating storage. Watch the ReadIOPS/WriteIOPS and DiskQueueDepth CloudWatch metrics — a rising queue depth means you’re IOPS-starved.
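The gp2 sizing trap is simple arithmetic. The sketch below applies the 3 IOPS/GB rule from the table; the 100 IOPS floor and 16,000 IOPS cap are the single-volume EBS figures and are an assumption here, since RDS may stripe larger volumes. The function names are hypothetical.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000 (single-volume figures)."""
    return max(100, min(3 * size_gib, 16_000))


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest gp2 volume (GiB) whose baseline reaches target_iops."""
    return -(-target_iops // 3)  # ceiling division


print(gp2_baseline_iops(100))    # the 100 GB volume from the text
print(gp2_size_for_iops(3_000))  # storage you must buy on gp2 just to match gp3's baseline
```

On gp2, matching gp3's 3,000 baseline IOPS means provisioning roughly 1,000 GB whether or not you need the space, which is the cost gp3 removes.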
Multi-AZ vs read replicas — two different jobs
These are constantly confused. Multi-AZ is for availability (a hot standby that takes over on failure); read replicas are for read scaling (extra readable copies). They solve different problems and you often want both:
| Property | Multi-AZ (instance) | Multi-AZ DB cluster | Read replica |
|---|---|---|---|
| Purpose | Failover / HA | HA + 2 readable standbys | Scale reads |
| Standby readable? | No | Yes (2 readers) | Yes (it’s a replica) |
| Replication | Synchronous | Semi-sync to 2 readers | Asynchronous |
| Failover time | 60–120 s | Often < 35 s | Promote manually (minutes) |
| Cross-region? | No (same region) | No | Yes |
| Data loss risk | None (sync) | Minimal | Possible (async lag) |
| Cost | ~2× (standby) | ~3× (two extra) | Per-replica instance |
| Use when | Any production DB | HA + some read offload | Read-heavy; reporting; cross-region DR |
# Add a read replica (can be in another region for DR)
aws rds create-db-instance-read-replica \
--db-instance-identifier orders-prod-replica-1 \
--source-db-instance-identifier orders-prod \
--db-instance-class db.r6g.large
A read replica replicates asynchronously, so it can lag the primary: reads from it may be stale by seconds under write load. Never route a read-your-own-write flow (user updates a profile, immediately reloads it) to a replica unless the application tolerates stale data; send those reads to the primary instead. Watch ReplicaLag; if it climbs, the replica is undersized or the write rate is too high.
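One way to handle read-your-own-write flows is to route by how recently the user wrote. The following is a minimal sketch under stated assumptions: `choose_read_target` and its parameters are hypothetical, the lag value is assumed to come from the ReplicaLag metric, and the one-second safety margin is arbitrary.

```python
def choose_read_target(seconds_since_user_write: float,
                       replica_lag_seconds: float,
                       safety_margin_seconds: float = 1.0) -> str:
    """Return 'primary' when the replica may not yet contain the user's last write."""
    if seconds_since_user_write <= replica_lag_seconds + safety_margin_seconds:
        return "primary"
    return "replica"


# User saved a profile 0.2 s ago while the replica is 3 s behind
print(choose_read_target(0.2, 3.0))
# User last wrote 30 s ago: the replica has caught up with that write
print(choose_read_target(30.0, 3.0))
```

The trade-off is that every read pinned to the primary consumes the capacity replicas were meant to relieve, so the window should be as short as the observed lag allows.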
Parameters, options and maintenance
RDS exposes engine tuning through parameter groups (engine settings like max_connections, work_mem) and option groups (engine add-ons). The settings you change most and why:
| Setting / control | What it does | Default | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| max_connections | Cap on concurrent connections | Formula of instance RAM | App opens too many; or limit a noisy app | Too high → memory pressure; use a pooler |
| work_mem (PG) | Memory per sort/hash op | Modest | Heavy analytical queries | Too high × many connections → OOM |
| Backup retention | Days of automated backups + PITR | 7 (0 disables) | Compliance / recovery window | Storage cost; 0 disables PITR |
| Maintenance window | When patches/upgrades apply | AWS-assigned | Align to low-traffic hours | Patches can cause brief failover |
| Performance Insights | Captures wait events / top SQL | Off (enable it) | Always in prod | Small cost; huge diagnostic value |
| Deletion protection | Blocks accidental delete | Off | Always in prod | Must disable before intended delete |
| Storage autoscaling | Grows storage on the fly | Off | Avoid full-disk outages | Set a max_allocated_storage ceiling |
| IAM DB authentication | Auth via IAM tokens, not passwords | Off | Centralise auth; rotate-free | Token TTL ~15 min; pooling considerations |
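The work_mem gotcha in the table is a multiplication worth doing before you change the parameter. This sketch estimates the worst case, assuming every connection runs a query that uses the full work_mem for each of its sort/hash operations at once; the function name and example numbers are hypothetical.

```python
def worst_case_sort_memory_gib(max_connections: int, work_mem_mib: int,
                               ops_per_query: int = 1) -> float:
    """Upper bound on sort/hash memory if every connection uses work_mem for every operation."""
    return max_connections * work_mem_mib * ops_per_query / 1024


# 500 connections, work_mem raised to 64 MiB, two sort/hash operations per query
needed = worst_case_sort_memory_gib(500, 64, ops_per_query=2)
instance_ram_gib = 16  # db.r6g.large
print(f"worst case {needed} GiB vs {instance_ram_gib} GiB of instance RAM")
```

Real workloads rarely reach the bound, but when the bound is several times the instance RAM, a pooler that caps active connections is safer than hoping queries never align.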
RDS limits and quotas
The real numbers that bite. These are the ones you hit in production:
| Limit | Typical value | What hitting it looks like | Mitigation |
|---|---|---|---|
| Max storage (gp3) | 64 TiB (engine-dependent) | Can’t grow further | Archive/shard; consider Aurora (128 TiB) |
| Max connections | RAM-derived (hundreds–few thousand) | “too many connections” errors | RDS Proxy / app pooler; bigger instance |
| Read replicas | 15 (5 for Oracle / SQL Server) | Can’t add another | Aurora (lower-lag readers) or cache layer |
| Provisioned IOPS (io2) | Up to 256,000 | I/O-bound; high DiskQueueDepth | More IOPS; faster storage; query tuning |
| Backup retention | 0–35 days | Can’t restore beyond window | Snapshot to longer-term; export to S3 |
| Instance RAM (largest) | Hundreds of GB (e.g. x2g) | Working set won’t fit | Memory-optimised class; or partition data |
| DB name / identifier rules | Engine-specific length/charset | Create fails | Follow naming constraints |
Amazon Aurora, the cloud-native relational engine
Aurora keeps the SQL you know and rebuilds everything beneath it. Understand the storage architecture first; the rest follows.
The decoupled storage architecture
In classic RDS, one instance owns one EBS volume. Aurora splits compute (the DB instances) from storage (a shared, distributed volume). The storage layer keeps six copies of every 10 GB segment across three AZs and is log-structured — instances ship redo log records to storage, which materialises pages. The consequences are the whole point of Aurora:
| Architectural property | What it gives you | Contrast with RDS classic |
|---|---|---|
| 6 copies / 3 AZs | Survives an AZ + one more failure with no data loss | Single EBS volume per instance |
| Shared storage volume | All replicas read the same data — low lag | Each replica has its own copy (more lag) |
| Storage auto-grows | Up to 128 TiB, no pre-provisioning | You set/extend allocated_storage |
| Log-structured (ship redo) | Less network/IO amplification, faster writes | Full-page writes over the network |
| Fast failover | Replica is promoted in seconds (shared storage) | 60–120 s standby promotion |
| Fast clone / backtrack | Copy-on-write clones; rewind in time | Restore from snapshot (slower) |
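The "AZ + one more failure" claim follows from quorum arithmetic. The sketch below assumes the quorum sizes AWS has published for Aurora storage (4 of 6 copies for writes, 3 of 6 for reads); they are not stated in this article, and the function name is hypothetical.

```python
COPIES_PER_AZ = 2
AZS = 3
WRITE_QUORUM = 4  # assumption: published Aurora write quorum, out of 6 copies
READ_QUORUM = 3   # assumption: published Aurora read quorum, out of 6 copies


def availability(copies_lost: int) -> dict[str, bool]:
    """Report whether write and read quorums still hold after losing some copies."""
    healthy = COPIES_PER_AZ * AZS - copies_lost
    return {"writes": healthy >= WRITE_QUORUM, "reads": healthy >= READ_QUORUM}


print("one AZ down (2 copies):   ", availability(2))
print("AZ + 1 more (3 copies):   ", availability(3))
print("two AZs down (4 copies):  ", availability(4))
```

Losing a whole AZ leaves both quorums intact; losing an AZ plus one more copy pauses writes but keeps the data readable and repairable, which is the "no data loss" case in the table.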
Cluster endpoints — send the right traffic to the right node
An Aurora cluster has one writer and up to 15 readers sharing storage. In normal operation you don’t connect to instances directly; you connect to endpoints that route for you. Using the wrong endpoint is a common, silent mistake (sending reads to the writer wastes its capacity; sending writes to a reader fails):
| Endpoint | Routes to | Use for | Behaviour on failover |
|---|---|---|---|
| Cluster (writer) endpoint | Current writer | All writes; read-after-write | Auto-points to the new writer |
| Reader endpoint | Load-balanced across readers | Read-only traffic, reports | Drops failed readers; balances rest |
| Custom endpoint | A chosen subset of instances | Isolate (e.g. analytics on big readers) | You define membership |
| Instance endpoint | One specific instance | Debugging a single node | Doesn’t move on failover |
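In application code the endpoint choice usually reduces to a small routing helper. The sketch below is illustrative only: the hostnames are placeholders in the usual Aurora endpoint shape, the custom endpoint name `analytics` is hypothetical, and `endpoint_for` is not part of any AWS SDK.

```python
# Placeholder hostnames; real values come from the cluster's endpoint list.
ENDPOINTS = {
    "writer": "shop-aurora.cluster-EXAMPLE.us-east-1.rds.amazonaws.com",
    "reader": "shop-aurora.cluster-ro-EXAMPLE.us-east-1.rds.amazonaws.com",
    "analytics": "analytics.cluster-custom-EXAMPLE.us-east-1.rds.amazonaws.com",
}


def endpoint_for(is_write: bool, needs_read_after_write: bool = False,
                 is_analytics: bool = False) -> str:
    """Pick the endpoint per the table: writes and read-after-write go to the writer."""
    if is_write or needs_read_after_write:
        return ENDPOINTS["writer"]
    if is_analytics:
        return ENDPOINTS["analytics"]  # custom endpoint isolating heavy readers
    return ENDPOINTS["reader"]


print(endpoint_for(is_write=True))
print(endpoint_for(is_write=False))
print(endpoint_for(is_write=False, is_analytics=True))
```

Keeping the decision in one function makes the silent mistake visible: a code path that never calls it with `is_write=False` is sending all reads to the writer.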
# Create an Aurora PostgreSQL cluster, then add a reader
aws rds create-db-cluster --db-cluster-identifier shop-aurora \
--engine aurora-postgresql --engine-version 16.4 \
--master-username appadmin --manage-master-user-password \
--db-subnet-group-name db-private --vpc-security-group-ids sg-0abc123
aws rds create-db-instance --db-instance-identifier shop-aurora-1 \
--db-cluster-identifier shop-aurora --engine aurora-postgresql \
--db-instance-class db.r6g.large
aws rds create-db-instance --db-instance-identifier shop-aurora-reader-1 \
--db-cluster-identifier shop-aurora --engine aurora-postgresql \
--db-instance-class db.r6g.large
Capacity modes — provisioned vs Serverless v2
Aurora compute comes two ways. Provisioned = you pick instance classes (like RDS). Serverless v2 = the cluster scales compute up and down in fine-grained ACUs (each ACU ≈ 2 GiB RAM with proportional CPU/network) based on live load. Choose by predictability:
| Capacity mode | How it scales | Billing | Pick it when | Watch-out |
|---|---|---|---|---|
| Provisioned | You resize the instance class | Per instance-hour | Steady, predictable load | Pay for peak even at idle |
| Serverless v2 | Auto, in 0.5-ACU steps, near-instant | Per ACU-hour (min…max you set) | Spiky / unpredictable / dev | Set min ACU > 0 to avoid cold-ish ramp; cost if always-busy |
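Choosing by predictability becomes clearer with a rough cost model. The sketch below compares a flat instance-hour bill with an ACU-hour bill over a load profile. Both hourly prices are hypothetical placeholders, not AWS list prices, and storage and I/O charges are ignored.

```python
def provisioned_monthly_cost(instance_hourly_usd: float, hours: float = 730) -> float:
    """Provisioned: you pay the instance rate for every hour, busy or idle."""
    return instance_hourly_usd * hours


def serverless_monthly_cost(acu_hourly_usd: float, hourly_acu_profile: list[float]) -> float:
    """Serverless v2: you pay for the ACUs actually in use each hour."""
    return acu_hourly_usd * sum(hourly_acu_profile)


# Hypothetical prices; check the current pricing page for your region.
INSTANCE_PRICE = 0.26  # USD per instance-hour (placeholder)
ACU_PRICE = 0.12       # USD per ACU-hour (placeholder)

# Spiky month: idle at 0.5 ACU for 600 hours, busy at 8 ACU for 130 hours
spiky = [0.5] * 600 + [8.0] * 130
# Always-busy month: 8 ACU for all 730 hours
busy = [8.0] * 730

print("provisioned:", round(provisioned_monthly_cost(INSTANCE_PRICE), 2))
print("serverless (spiky):", round(serverless_monthly_cost(ACU_PRICE, spiky), 2))
print("serverless (always busy):", round(serverless_monthly_cost(ACU_PRICE, busy), 2))
```

Under these placeholder prices the spiky profile favours Serverless v2 and the always-busy profile favours provisioned, which is the "cost if always-busy" watch-out in the table.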
# Serverless v2: set the ACU range on the cluster
aws rds modify-db-cluster --db-cluster-identifier shop-aurora \
--serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=16
resource "aws_rds_cluster" "shop" {
cluster_identifier = "shop-aurora"
engine = "aurora-postgresql"
engine_mode = "provisioned" # Serverless v2 uses provisioned mode + scaling config
master_username = "appadmin"
manage_master_user_password = true # AWS-managed master secret
serverlessv2_scaling_configuration {
min_capacity = 0.5
max_capacity = 16
}
storage_encrypted = true
deletion_protection = true
}
Aurora replication and global reach
Within a region you add up to 15 readers with sub-100 ms lag (they read shared storage). Across regions, Aurora Global Database replicates with typically < 1 s lag and supports a fast managed failover for DR/low-latency global reads:
| Feature | Lag | Region scope | Use for | Note |
|---|---|---|---|---|
| Reader replicas (in-region) | < 100 ms (shared storage) | Same region | Read scaling, HA | Up to 15 |
| Aurora Global Database | < 1 s typical | Secondary regions | Global reads, DR | Managed cross-region failover |
| Cross-region snapshot copy | N/A (point-in-time) | Any | Compliance, migration | Not continuous |
| Backtrack (MySQL-compat) | N/A (rewind) | Same cluster | Undo bad writes fast | Set retention window |
Aurora limits
| Limit | Value | Note |
|---|---|---|
| Max cluster storage | 128 TiB | Auto-grows; no pre-provisioning |
| Reader replicas | 15 per cluster | Shared storage → low lag |
| Serverless v2 ACUs | 0.5 up to 128 ACU | ≈ 1 GiB to 256 GiB RAM range |
| Global Database secondary regions | Multiple (region-dependent) | Add for DR / locality |
| Connections | Instance-class dependent | Use RDS Proxy for pooling at scale |
| Engine compatibility | MySQL 5.7/8.0-compat; PostgreSQL versions | Not every extension/feature of vanilla |
Amazon DynamoDB, designed around access patterns
DynamoDB inverts relational design: you don’t model entities and then query them; you enumerate the queries and design keys so those queries are O(1). Get this right and it scales without limit; get it wrong and you scan, throttle and overpay.
Keys, partitions and the hot-partition trap
Every item has a partition key (hashed to choose a partition). Optionally a sort key gives ordering within a partition and enables range queries — items sharing a partition key form an item collection. The cardinal rule: spread load evenly across partition keys, because throughput is per-partition. Key-design choices and their consequences:
| Key design | Distribution | Query power | Risk | Use when |
|---|---|---|---|---|
| High-cardinality PK (e.g. userId) | Even | Get one item by key | Low | Per-entity lookups |
| PK + sort key (e.g. userId + timestamp) | Even (per user) | Range/sorted within user | Hot if one user dominates | Time-ordered per entity |
| Low-cardinality PK (e.g. status) | Skewed → hot partition | Limited | High (throttles) | Almost never |
| Composite / synthetic key (sharding suffix) | Even (forced) | Needs fan-out read | Read complexity | Unavoidably hot keys |
| Single-table design (overloaded keys) | Even (by design) | Many patterns, one table | Modeling complexity | Microservice owning many patterns |
A hot partition is the number-one DynamoDB failure: one partition key takes disproportionate traffic and throttles while the table is far under total capacity. The fix is key design — write sharding (append a calculated suffix to spread writes), choosing a higher-cardinality attribute, or restructuring item collections. Adaptive capacity absorbs mild imbalance automatically, but it is not a substitute for an evenly distributed key.
# Create a table: PK=PK (string), SK=SK (string), on-demand billing, PITR + encryption
aws dynamodb create-table --table-name shop-events \
--attribute-definitions AttributeName=PK,AttributeType=S AttributeName=SK,AttributeType=S \
--key-schema AttributeName=PK,KeyType=HASH AttributeName=SK,KeyType=RANGE \
--billing-mode PAY_PER_REQUEST \
--sse-specification Enabled=true
aws dynamodb update-continuous-backups --table-name shop-events \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# Terraform equivalent of the CLI commands above
resource "aws_dynamodb_table" "shop_events" {
  name         = "shop-events"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "PK"
  range_key    = "SK"
  # HCL allows only one argument in a single-line block, so attribute blocks span lines
  attribute {
    name = "PK"
    type = "S"
  }
  attribute {
    name = "SK"
    type = "S"
  }
  point_in_time_recovery { enabled = true }
  server_side_encryption { enabled = true }
}
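The write-sharding fix described above can be sketched in a few lines of shell. This is a minimal illustration, not a prescribed implementation: the shard count of 10, the `#<n>` suffix format and the `HUB#BLR` / `evt-42` values are all assumptions chosen for the example. The suffix is derived from a stable item identifier, so the same item always lands on the same shard.

```shell
# Write-sharding sketch: spread one hot logical key over SHARDS physical keys.
# SHARDS and the "#<n>" suffix format are assumptions for illustration.
SHARDS=10

# sharded_pk <logical-pk> <item-id>
# Hashes the item id with cksum (a CRC, stable across runs) and appends
# the hash modulo SHARDS as a suffix to the logical partition key.
sharded_pk() {
  local pk="$1" item_id="$2" crc
  crc=$(printf '%s' "$item_id" | cksum | cut -d' ' -f1)
  printf '%s#%d\n' "$pk" $(( crc % SHARDS ))
}

# all_shards <logical-pk>
# Lists every physical key a reader has to Query: the fan-out read
# that is the price of sharding.
all_shards() {
  local pk="$1" i=0
  while [ "$i" -lt "$SHARDS" ]; do
    printf '%s#%d\n' "$pk" "$i"
    i=$(( i + 1 ))
  done
}
```

A writer would call `sharded_pk "HUB#BLR" "evt-42"` to pick the partition key for a new item; a reader that needs everything for the hub loops over `all_shards "HUB#BLR"` and merges the results. A larger shard count spreads writes further but makes every full read more expensive.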
Capacity modes — on-demand vs provisioned
Two ways to pay for throughput. On-demand scales automatically and bills per request — zero capacity planning. Provisioned sets RCU/WCU (optionally auto-scaled) — cheaper for steady, predictable traffic. Choose by predictability and spikiness:
| Capacity mode | Scaling | Billing | Pick it when | Watch-out |
|---|---|---|---|---|
| On-demand | Instant, automatic | Per read/write request | Spiky, unpredictable, new, dev | More expensive per-request at high steady volume |
| Provisioned | You set RCU/WCU (+ auto-scaling) | Per provisioned unit-hour | Steady, predictable load | Under-provision → throttling; over → waste |
| Provisioned + auto-scaling | Target-utilisation tracking | Per unit-hour | Predictable with gentle ramps | Can’t react to instant flash spikes |
| Reserved capacity | N/A (commitment) | Discounted RCU/WCU | Large stable provisioned baseline | 1/3-yr commitment |
The capacity-unit math you must know: 1 WCU = one write/sec of an item up to 1 KB; 1 RCU = one strongly consistent read/sec up to 4 KB, or two eventually-consistent reads/sec of 4 KB. Bigger items and strong reads cost more units. Mis-estimating this is the second-most-common throttling cause after hot keys.
| Operation | Item size | Consistency | Capacity consumed |
|---|---|---|---|
| Write | 1 KB | — | 1 WCU |
| Write | 3.5 KB | — | 4 WCU (round up per KB) |
| Read | 4 KB | Strong | 1 RCU |
| Read | 4 KB | Eventual | 0.5 RCU |
| Read | 12 KB | Eventual | 1.5 RCU |
| Transactional write | 1 KB | — | 2 WCU |
| Transactional read | 4 KB | — | 2 RCU |
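The rounding rules in the table can be checked with a small calculator. The sketch below assumes item sizes in bytes (1 KB = 1024 bytes) and greater than zero; the function names `wcu` and `rcu` are made up for this example and are not part of any AWS tool.

```shell
# Capacity-unit calculator sketch (function names are illustrative).
KB=1024

# ceil_div <a> <b>: integer ceiling of a / b
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# wcu <item-bytes> [standard|transactional]
# Writes round up to the next 1 KB; transactional writes cost double.
wcu() {
  local units
  units=$(ceil_div "$1" "$KB")
  if [ "${2:-standard}" = "transactional" ]; then
    units=$(( units * 2 ))
  fi
  echo "$units"
}

# rcu <item-bytes> [eventual|strong|transactional]
# Reads round up to the next 4 KB block. Work in half-RCU steps so the
# arithmetic stays in integers: eventual = 1 half, strong = 2, transactional = 4.
rcu() {
  local blocks halves
  blocks=$(ceil_div "$1" $(( 4 * KB )))
  case "${2:-eventual}" in
    eventual)      halves=$blocks ;;
    strong)        halves=$(( blocks * 2 )) ;;
    transactional) halves=$(( blocks * 4 )) ;;
    *) echo "unknown read mode: $2" >&2; return 1 ;;
  esac
  if [ $(( halves % 2 )) -eq 0 ]; then
    echo $(( halves / 2 ))
  else
    echo "$(( halves / 2 )).5"
  fi
}
```

For example, `wcu 3584` (a 3.5 KB item) prints 4 and `rcu 12288 eventual` (a 12 KB item) prints 1.5, matching the rows above. Multiply the per-operation result by expected requests per second to size provisioned capacity.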
Secondary indexes — GSI vs LSI
Indexes let you query by non-key attributes. The two kinds differ sharply; choosing wrong forces a redesign:
| Property | Global Secondary Index (GSI) | Local Secondary Index (LSI) |
|---|---|---|
| Partition key | Any attribute (different from base) | Same as base table PK |
| Sort key | Any attribute | Different attribute |
| When created | Any time (add/remove later) | Only at table creation |
| Consistency | Eventual only | Strong or eventual |
| Capacity | Its own RCU/WCU (or on-demand) | Shares base table capacity |
| Count limit | 20 per table (default) | 5 per table |
| Item-collection size | Independent | 10 GB cap per partition key |
| Use when | Query by a totally different key | Alternate sort within same PK |
The big traps: an LSI can only be created with the table (you cannot add one later — you’d rebuild the table), and an LSI ties you to a 10 GB per-partition-key item-collection limit. GSIs are flexible (add anytime, own capacity) but eventually consistent only. Most designs favour GSIs.
# Add a GSI on an existing table (GSIs can be added later; LSIs cannot)
aws dynamodb update-table --table-name shop-events \
--attribute-definitions AttributeName=GSI1PK,AttributeType=S AttributeName=GSI1SK,AttributeType=S \
--global-secondary-index-updates \
'[{"Create":{"IndexName":"GSI1","KeySchema":[{"AttributeName":"GSI1PK","KeyType":"HASH"},{"AttributeName":"GSI1SK","KeyType":"RANGE"}],"Projection":{"ProjectionType":"ALL"}}}]'
Consistency and read options
DynamoDB lets you choose per read. Knowing the matrix prevents both stale-read bugs and over-paying:
| Read type | Returns | Cost | Where allowed | Use when |
|---|---|---|---|---|
| Eventually consistent (default) | May miss the most recent write (~ms) | 0.5 RCU / 4 KB | Base table + GSI | Default; high-volume reads |
| Strongly consistent | Latest committed write | 1 RCU / 4 KB | Base table + LSI (not GSI) | Read-after-write correctness |
| Transactional (TransactGetItems) | Snapshot-isolated set | 2 RCU / 4 KB | Base table | All-or-nothing reads |
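Because consistency is chosen per request, a thin wrapper can make the choice explicit at each call site. The following is a sketch only: `get_event` is a hypothetical helper name, and it assumes the PK/SK string key schema of the shop-events table created earlier. It relies on the `--consistent-read` / `--no-consistent-read` flags of `aws dynamodb get-item`, and asks for consumed capacity so the cost difference shows up in the response.

```shell
# get_event <table> <pk> <sk> [eventual|strong]
# Hypothetical helper: reads one item and lets the caller pick consistency.
# Assumes a table keyed on string attributes PK and SK, and key values that
# contain no double quotes (they are interpolated into JSON as-is).
get_event() {
  local table="$1" pk="$2" sk="$3" mode="${4:-eventual}"
  local flag="--no-consistent-read"
  if [ "$mode" = "strong" ]; then
    flag="--consistent-read"
  fi
  aws dynamodb get-item --table-name "$table" \
    --key "{\"PK\":{\"S\":\"$pk\"},\"SK\":{\"S\":\"$sk\"}}" \
    "$flag" --return-consumed-capacity TOTAL
}
```

A read that must see the write just made would be `get_event shop-events 'TRACK#1001' 'EVT#2026-06-23T10:00' strong`; leaving off the last argument gives the cheaper eventual read. The same flag is rejected when a Query targets a GSI, which is why the wrapper only covers base-table reads.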
Streams, TTL and the feature set
DynamoDB’s surrounding features turn it from a key-value store into an event source and a self-pruning store. The ones you’ll actually use:
| Feature | What it does | Why it matters | Note |
|---|---|---|---|
| DynamoDB Streams | Ordered change log of item modifications | Event-driven pipelines; CDC; replication | 24 h retention; triggers Lambda |
| TTL | Auto-deletes items past an epoch attribute | Free expiry of sessions/events | Deletion is a background process, typically within a few days of expiry (not exact); filter expired items on read |
| Global Tables | Multi-region active-active replication | Low-latency global writes; DR | Last-writer-wins conflict resolution |
| DAX | In-memory cache in front of DynamoDB | Microsecond reads for hot items | Only after proving DynamoDB is the bottleneck |
| PITR | Continuous backup, restore to any second | 35-day recovery window | Enable on every prod table |
| PartiQL | SQL-like query syntax | Familiar syntax over DynamoDB | Still bound by key/index design |
| Export to S3 | Full-table export without consuming capacity | Analytics in Athena/Redshift | Point-in-time; no RCU burn |
| Contributor Insights | Most-accessed keys / throttled keys | Find hot partitions | Turn on when diagnosing skew |
DynamoDB limits and quotas
The hard numbers that shape designs:
| Limit | Value | What hitting it means | Mitigation |
|---|---|---|---|
| Max item size | 400 KB | Write rejected | Split item; store blob in S3, pointer in item |
| Partition key length | 1–2048 bytes | Validation error | Shorten key |
| Sort key length | 1–1024 bytes | Validation error | Shorten key |
| GSIs per table | 20 (default, raisable) | Can’t add another | Consolidate access patterns |
| LSIs per table | 5 (creation-time only) | Can’t add | Rethink at design time |
| Item-collection (LSI) | 10 GB per partition key | Writes blocked for that key | Avoid LSI for large collections; use GSI |
| Query/Scan page | 1 MB per call | Paginated results | Paginate with LastEvaluatedKey |
| Transaction items | 100 items / 4 MB | Transaction rejected | Smaller batches |
| BatchWriteItem | 25 items / 16 MB | Batch rejected | Chunk the batch |
| On-demand throughput | Scales to high default ceilings | Traffic above 2× the previous peak within 30 minutes may briefly throttle | Pre-warm or use provisioned + scaling |
| Throughput (provisioned) | Per-table/account RCU/WCU quotas | ProvisionedThroughputExceeded | Raise capacity / quota; fix hot key |
Consistency, transactions and durability across the three
This is the dimension teams under-think. Side by side, what each guarantees and how to reason about it:
| Property | RDS | Aurora | DynamoDB |
|---|---|---|---|
| Default read consistency | Strong (from primary) | Strong (writer); ~ms-lag (reader) | Eventual |
| Strong read option | Always (primary) | Always (writer) | Opt-in (ConsistentRead=true) |
| Transactions | Full ACID, multi-statement | Full ACID, multi-statement | TransactWrite/Get, ≤ 100 items |
| Isolation levels | Engine-configurable | Engine-configurable | Serializable (within a transaction) |
| Durability | Multi-AZ sync (if enabled) | 6 copies / 3 AZs | 3-AZ replication, always |
| Cross-region consistency | Async replica (lag) | Global DB < 1 s | Global Tables (last-writer-wins) |
| Read-your-own-write | Yes (primary) | Yes (writer endpoint) | Yes only with strong read on base table |
The practical rule: if your correctness depends on reading exactly what you just wrote, read from the RDS/Aurora primary/writer or use a DynamoDB strongly consistent read on the base table (never a GSI). If “a few milliseconds stale” is fine (most read-heavy traffic), use replicas/reader endpoints and DynamoDB eventual reads — they’re cheaper and scale further.
Backup, recovery and DR across the three
How you protect each store, and how fast you recover:
| Mechanism | RDS | Aurora | DynamoDB |
|---|---|---|---|
| Automated backups | Daily + transaction logs | Continuous to S3 | Continuous (PITR) |
| PITR window | 0–35 days | Up to 35 days | Up to 35 days |
| On-demand snapshot | Yes | Yes (fast) | Yes (instant, no capacity burn) |
| Restore speed | Provisions a new instance (minutes+) | Fast; clone is near-instant (CoW) | New table from backup |
| Cross-region copy | Snapshot copy / cross-region replica | Snapshot copy / Global DB | Backup copy / Global Tables |
| RPO (typical) | ~5 min (PITR latest restorable time) | ~5 min (PITR latest restorable time) | ~5 min (PITR latest restorable time) |
| RTO (typical) | Minutes (restore/failover) | < 30 s failover; minutes restore | Seconds (multi-region) |
For the full strategy — vaults, cross-account isolation, tested restores — see AWS Backup & Disaster-Recovery Strategies. The rule that saves you: a backup you have never restored is a hope, not a plan — schedule restore drills.
Architecture at a glance
The diagram traces one application’s data plane left to right and shows where each store fits and where each one bites. Read it from the left: clients reach compute (EC2/ECS/EKS/Lambda; see AWS Compute: EC2 vs Lambda vs ECS vs EKS) inside the VPC, and that compute talks to three persistence options chosen by data model. Amazon RDS sits in private subnets as a Multi-AZ primary with a read replica: the writer takes transactional writes, the replica absorbs read-heavy reporting (and can lag, badge 1). Amazon Aurora is the same SQL but as a cluster: a writer plus reader endpoint over a shared 6-copy/3-AZ storage volume, so failover is seconds and reader lag is sub-100 ms (badge 2 marks the connection-pool ceiling you hit at scale). Amazon DynamoDB is reached over the AWS API (often via a VPC endpoint, so no NAT is needed) as partitioned key-value storage: O(1) by partition key, but throttling if one key runs hot (badge 3). A fourth zone shows the shared backbone every store leans on: KMS encryption, CloudWatch metrics/alarms, DynamoDB Streams and S3 export feeding analytics.
Notice the convergence: whichever store you pick, the operational truth lives in the same instruments — CloudWatch metrics (ReplicaLag, DatabaseConnections, ThrottledRequests), Performance Insights for the SQL engines, and Contributor Insights for DynamoDB hot keys. The badges map the four failures that actually page you (replica lag, connection exhaustion, hot partition, and a runaway capacity/cost spike) onto the exact node where each bites, and the legend narrates each as symptom · confirm · fix. The whole method is: localise the workload to the right store by data model, then watch the one metric that store fails on.
Real-world scenario
Trackwise Logistics runs a parcel-tracking platform across India: a web/mobile front end, a fleet of delivery scanners emitting events, and a back office for billing and reporting. The platform team is six engineers; the original design put everything on a single RDS PostgreSQL db.r6g.xlarge Multi-AZ instance in ap-south-1 (Mumbai), with two read replicas. Monthly database spend was about ₹95,000 and climbing.
The trouble started as the business grew. Scanner events — “parcel X scanned at hub Y at time Z” — reached 40,000 writes per second at peak, all hammering one PostgreSQL primary. The team’s reflex was vertical: bigger instance, then more read replicas. But replicas don’t help writes, and the primary’s WriteIOPS and DiskQueueDepth were pinned. ReplicaLag on the reporting replica climbed to 45 seconds during peaks, so the customer-facing “where is my parcel” page — which read from a replica — showed stale locations and generated support tickets. Adding a fifth replica was both expensive and useless for the write bottleneck. The architecture had a data-model mismatch wearing a scaling costume: a high-velocity, append-only, key-accessed event stream was living in a normalised relational table.
The breakthrough was separating concerns by data model. Tracking events are key-value, write-heavy, accessed by tracking ID — a textbook DynamoDB workload. Orders, invoices and financial reports are relational, transactional, join-heavy — they belong in SQL. The team split the system:
- DynamoDB for shipment events: partition key trackingId, sort key eventTimestamp, on-demand capacity (the load was spiky around delivery windows). Item collections per tracking ID gave instant, sorted event history with a single Query. A GSI on hubId + timestamp answered “all events at hub Y in the last hour” for operations. TTL auto-expired events older than 18 months. Streams fed a Lambda that updated a materialised “current status” item.
- Aurora PostgreSQL (migrated from RDS) for orders, invoices and reports: a writer plus two readers, with the reporting workload isolated on a custom endpoint pointing at the larger readers. Aurora’s sub-100 ms reader lag killed the stale-report problem, and faster failover improved availability.
- ElastiCache Redis in front of the “current status” lookups for sub-millisecond reads on the hottest tracking IDs.
The early DynamoDB design had one scare: an initial key of partition = "EVENT" (a constant) created a catastrophic hot partition — everything hashed to one place and throttled at a fraction of expected load. Contributor Insights showed a single partition key taking 100% of traffic. The fix was the proper high-cardinality key (trackingId), and throughput problems vanished.
The outcome: tracking-page latency fell from “up to 45 s stale” to single-digit milliseconds and always current; the event firehose scaled with zero replica management; and total database spend dropped to about ₹78,000/month because DynamoDB on-demand replaced four over-sized RDS instances and Aurora right-sized the relational core. The lesson on the wall: “Don’t scale the wrong database harder — move the workload to the database that fits its data model.”
The migration as a timeline, because the order of moves is the lesson:
| Phase | Symptom | Action taken | Effect | What it should have been |
|---|---|---|---|---|
| Baseline | One RDS instance, all workloads | (original design) | Works at small scale | Split by data model from day one |
| Growth | Writes pinned, replica lag 45 s | Scale up the instance | Brief relief, recurs | Don’t scale up to mask a model mismatch |
| Growth | Stale tracking page | Add a 5th read replica | No help (writes bottleneck) | Move events off relational |
| Redesign | Events identified as key-value | DynamoDB, PK=trackingId | Firehose absorbed | The correct move |
| Redesign | DynamoDB throttling early | Found constant PK = hot partition | Fix to high-cardinality key | Design keys for distribution first |
| Redesign | Relational core still on RDS | Migrate to Aurora + custom endpoint | Sub-100 ms reports, fast failover | — |
| Steady state | — | ElastiCache for hottest reads | Sub-ms current status | Cache last, after proving the need |
Advantages and disadvantages
Each store earns its place by fitting a data model — and each bites when forced outside it. Weigh them honestly:
| Service | Advantages | Disadvantages |
|---|---|---|
| RDS | Familiar engines (Postgres/MySQL/Oracle/SQL Server); full SQL, joins, transactions; easy lift-and-shift; mature tooling; managed backups/HA | Vertical write-scaling ceiling; replicas lag and don’t help writes; licensing cost (Oracle/SQL Server); you size/tune/patch the box |
| Aurora | Higher throughput than open-source RDS; 6-copy/3-AZ durability; sub-100 ms reader lag; fast failover; storage auto-grows to 128 TiB; Serverless v2 auto-scaling; fast clones/backtrack | Not 100% feature-parity with vanilla Postgres/MySQL (some extensions/quirks differ); overkill (and cost) for tiny apps; still vertical write-scaling per writer |
| DynamoDB | Serverless; single-digit-ms latency at any scale; no patching/sizing (on-demand); horizontal scale; Global Tables; Streams for CDC; per-request pricing | No joins/ad-hoc queries — access patterns up front; hot-partition risk; limited transactions (≤ 100 items); query inflexibility forces redesigns; cost surprises if access patterns are wrong |
When each matters: RDS is right when you have an existing relational app and want management without re-architecture, or when commodity SQL with joins and transactions is genuinely the model. Aurora is right when that same relational model needs more throughput, faster failover, more replicas, or auto-scaling compute — and you can live within its compatibility envelope. DynamoDB is right when the data is key-value/document, the access patterns are known and stable, and the scale or latency requirement exceeds what a single SQL writer can give. The recurring mistake is using scale or familiarity as the deciding factor instead of data model — that’s how relational data ends up throttling in DynamoDB and event firehoses end up drowning a Postgres primary.
Hands-on lab
Stand up one of each — a tiny RDS PostgreSQL instance, an Aurora Serverless v2 cluster, and an on-demand DynamoDB table — observe the difference, then tear it all down. Uses free-tier-eligible / minimal sizes; delete everything at the end to avoid charges. Run in CloudShell (Bash) with a default VPC, or set --db-subnet-group-name to your private subnet group.
Step 1 — Variables.
export AWS_PAGER="" # stop the CLI opening a pager
RG=db-lab
SG=$(aws ec2 describe-security-groups --filters Name=group-name,Values=default \
--query "SecurityGroups[0].GroupId" --output text)
echo "Using default security group $SG"
Step 2 — A small RDS PostgreSQL instance (free-tier class).
aws rds create-db-instance \
--db-instance-identifier ${RG}-rds \
--engine postgres --db-instance-class db.t4g.micro \
--allocated-storage 20 --storage-type gp3 \
--master-username labadmin --manage-master-user-password \
--no-publicly-accessible --vpc-security-group-ids $SG
Expected: a JSON block with "DBInstanceStatus": "creating". It takes a few minutes to become available.
Step 3 — An Aurora PostgreSQL Serverless v2 cluster.
aws rds create-db-cluster --db-cluster-identifier ${RG}-aurora \
--engine aurora-postgresql --engine-mode provisioned \
--master-username labadmin --manage-master-user-password \
--serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=4 \
--vpc-security-group-ids $SG
aws rds create-db-instance --db-instance-identifier ${RG}-aurora-1 \
--db-cluster-identifier ${RG}-aurora --engine aurora-postgresql \
--db-instance-class db.serverless
Expected: a cluster, then a db.serverless instance that scales between 0.5 and 4 ACUs.
Step 4 — A DynamoDB table (on-demand, PITR on).
aws dynamodb create-table --table-name ${RG}-events \
--attribute-definitions AttributeName=PK,AttributeType=S AttributeName=SK,AttributeType=S \
--key-schema AttributeName=PK,KeyType=HASH AttributeName=SK,KeyType=RANGE \
--billing-mode PAY_PER_REQUEST
aws dynamodb wait table-exists --table-name ${RG}-events
aws dynamodb update-continuous-backups --table-name ${RG}-events \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
Step 5 — Write and read a DynamoDB item (note: no schema, no instance).
aws dynamodb put-item --table-name ${RG}-events --item \
'{"PK":{"S":"TRACK#1001"},"SK":{"S":"EVT#2026-06-23T10:00"},"hub":{"S":"BLR"},"status":{"S":"in_transit"}}'
aws dynamodb query --table-name ${RG}-events \
--key-condition-expression "PK = :pk" \
--expression-attribute-values '{":pk":{"S":"TRACK#1001"}}'
Expected: the item comes back instantly — a single-partition Query, no joins, no capacity planning.
Step 6 — Watch the metric that each store fails on.
# DynamoDB: throttling (should be zero on this tiny load)
aws cloudwatch get-metric-statistics --namespace AWS/DynamoDB \
--metric-name ThrottledRequests --dimensions Name=TableName,Value=${RG}-events \
--start-time $(date -u -d '15 min ago' +%FT%TZ 2>/dev/null || date -u -v-15M +%FT%TZ) \
--end-time $(date -u +%FT%TZ) --period 300 --statistics Sum
# RDS: connections (once available)
aws cloudwatch get-metric-statistics --namespace AWS/RDS \
--metric-name DatabaseConnections --dimensions Name=DBInstanceIdentifier,Value=${RG}-rds \
--start-time $(date -u -d '15 min ago' +%FT%TZ 2>/dev/null || date -u -v-15M +%FT%TZ) \
--end-time $(date -u +%FT%TZ) --period 300 --statistics Maximum
Validation checklist. You created a relational instance (you size it), a cloud-native cluster that auto-scales compute (you set a range), and a serverless NoSQL table (you size nothing) — and queried DynamoDB by key with zero schema. That contrast is the lesson: the operational surface shrank from RDS → Aurora → DynamoDB. The steps mapped to what each proves:
| Step | What you did | What it proves |
|---|---|---|
| 2 | Create RDS with an instance class | You choose the box (vCPU/RAM) |
| 3 | Create Aurora Serverless v2 with an ACU range | Compute auto-scales between bounds |
| 4–5 | Create + query DynamoDB by key | No schema, no instance, O(1) by PK |
| 6 | Read the per-store failure metric | Each store fails on a different signal |
Cleanup (avoid lingering charges — do this).
aws dynamodb delete-table --table-name ${RG}-events
aws rds delete-db-instance --db-instance-identifier ${RG}-aurora-1 --skip-final-snapshot
aws rds delete-db-cluster --db-cluster-identifier ${RG}-aurora --skip-final-snapshot
aws rds delete-db-instance --db-instance-identifier ${RG}-rds --skip-final-snapshot
Cost note. A db.t4g.micro and a minimal Serverless v2 cluster left running for an hour are a few rupees; DynamoDB on-demand for a handful of requests is effectively free. The risk is forgetting to delete — RDS/Aurora bill per hour whether you use them or not, so run the cleanup.
Common mistakes & troubleshooting
This is the playbook — the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest with the full confirm-and-fix detail.
| # | Symptom | Root cause | Confirm (exact path / command) | Fix |
|---|---|---|---|---|
| 1 | Reads return stale data; “where is my X” is behind | Read replica / reader lag under write load | CloudWatch ReplicaLag (RDS) / AuroraReplicaLag; rising under writes | Read critical paths from primary/writer; reduce write rate; bigger replica; Aurora (lower lag) |
| 2 | App errors “too many connections” / “remaining connection slots reserved” | Connection exhaustion (no pooling; serverless fan-out) | RDS DatabaseConnections near max_connections; PG pg_stat_activity | RDS Proxy or app pooler; raise max_connections (carefully); bigger instance |
| 3 | DynamoDB ProvisionedThroughputExceededException / throttling | Hot partition (low-cardinality PK) or under-provisioned RCU/WCU | ThrottledRequests > 0; Contributor Insights shows one hot key | Re-key for high cardinality / write-sharding; on-demand or raise capacity |
| 4 | DynamoDB bill spikes unexpectedly | On-demand × inefficient access (scans, large items, missing index) | Cost Explorer by usage type; ConsumedReadCapacityUnits; Scan count | Query not Scan; add the right index; provisioned + auto-scaling for steady load |
| 5 | RDS/Aurora slow under load; high disk queue | IOPS-starved storage or unindexed/expensive queries | DiskQueueDepth high; Performance Insights top SQL/waits | gp3 → more IOPS / io2; add indexes; fix N+1; cache |
| 6 | Can’t connect to the DB at all (timeout) | Security group / subnet / public-access misconfig | Test from a host in-VPC; check SG inbound on DB port; PubliclyAccessible | Open SG from app SG on 5432/3306; private subnet routing; VPC endpoint for DynamoDB |
| 7 | Storage full → RDS read-only / outage | allocated_storage exhausted, autoscaling off | FreeStorageSpace near zero; instance in storage-full | Enable storage autoscaling with a ceiling; grow now; archive data |
| 8 | DynamoDB strong-read returns “not supported on GSI” | Strongly consistent read attempted on a GSI | Code uses ConsistentRead=true against a GSI | Strong reads only on base table/LSI; design around it |
| 9 | Tried to add an LSI to an existing table — can’t | LSIs are creation-time only | update-table rejects LSI add | Recreate the table with the LSI, or use a GSI instead |
| 10 | Aurora reads not load-balancing; one node hot | App connects to a single instance endpoint | Connection string targets an instance, not the reader endpoint | Use the reader endpoint (or a custom endpoint) for read traffic |
| 11 | RDS failover took ~2 minutes; users saw errors | Multi-AZ classic failover time + no app retry | Failover event in console; app has no reconnect logic | Aurora (faster failover) or Multi-AZ cluster; add connection retry/backoff |
| 12 | Item write rejected: “item size has exceeded the maximum” | Item > 400 KB | Write fails with size error | Split the item; store the blob in S3, keep a pointer in DynamoDB |
| 13 | DynamoDB query returns partial data | 1 MB page limit; not paginating | Results truncated; LastEvaluatedKey present | Paginate using LastEvaluatedKey; narrow the query |
| 14 | Burstable RDS throttles after a while under load | db.t-class CPU credits exhausted | CPUCreditBalance hits zero; CPU throttled | Move to m/r class; or accept surplus-credit billing |
The expanded form, for the entries that cost the most time:
1. Reads return stale data; the customer-facing view is behind.
Root cause: A read replica (RDS) or reader (Aurora) lags the primary under write load, and you routed a read-your-own-write or freshness-sensitive read to it.
Confirm: CloudWatch ReplicaLag (RDS) or AuroraReplicaLag climbing as writes rise; the reader returns data that’s seconds old.
Fix: Route freshness-critical reads to the primary/writer; reduce the write rate or size the replica up; on RDS, consider Aurora whose shared storage keeps reader lag sub-100 ms. Don’t add more replicas to fix write pressure — replicas don’t help writes.
2. “FATAL: too many connections” / connection-slot errors.
Root cause: Connection exhaustion — too many app connections (no pooling), or a serverless/Lambda fleet each opening its own connection, against the instance’s max_connections.
Confirm: RDS DatabaseConnections near the limit; in Postgres, SELECT count(*) FROM pg_stat_activity;.
Fix: Put RDS Proxy (or a client-side pooler like PgBouncer) in front to multiplex connections; only then consider raising max_connections (each connection costs RAM, so a bigger instance may be needed). For Lambda at scale, RDS Proxy is almost mandatory.
# Create RDS Proxy to pool connections (needs an IAM role + secret)
aws rds create-db-proxy --db-proxy-name orders-proxy \
--engine-family POSTGRESQL --role-arn arn:aws:iam::111122223333:role/rds-proxy \
--auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:...:secret:orders","IAMAuth":"DISABLED"}]' \
--vpc-subnet-ids subnet-aaa subnet-bbb
3. DynamoDB throttling — ProvisionedThroughputExceededException.
Root cause: A hot partition (a low-cardinality or constant partition key concentrates traffic) or genuinely under-provisioned RCU/WCU.
Confirm: CloudWatch ThrottledRequests / ReadThrottleEvents > 0; turn on Contributor Insights to see the most-accessed key — a single key taking the bulk of traffic is the smoking gun.
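Fix: Re-key for high cardinality; apply write sharding (suffix the key to spread writes) for unavoidably hot keys; for genuine under-provisioning, switch to on-demand (absorbs spikes) or raise provisioned capacity. Neither billing mode lifts the per-partition ceiling (about 3,000 RCU and 1,000 WCU per second), so extra table capacity will not cure a hot key. Adaptive capacity helps mild skew but won’t save a constant key.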
4. DynamoDB bill spikes unexpectedly.
Root cause: On-demand billing multiplied by inefficient access — full-table Scans, oversized items, or a missing index forcing reads of more data than needed.
Confirm: Cost Explorer grouped by usage type; ConsumedReadCapacityUnits far above expectation; a high Scan count in CloudWatch.
Fix: Replace Scan with Query (key/index-based); add the GSI that serves the pattern; for steady high volume, move to provisioned + auto-scaling (cheaper per request than on-demand); store large blobs in S3, not in items.
5. RDS/Aurora slow under load, high disk queue.
Root cause: IOPS-starved storage (gp2 capped by size, or insufficient provisioned IOPS) or expensive/unindexed queries.
Confirm: DiskQueueDepth elevated; Performance Insights shows the top SQL and wait events (e.g. IO:DataFileRead).
Fix: Move to gp3 and provision the IOPS you need (or io2 for very high demand); add the missing indexes; fix N+1 query patterns; add ElastiCache for hot reads. Throwing a bigger instance at an unindexed query just delays the wall.
6. Can’t connect at all (timeout).
Root cause: Security group / subnet / public-access misconfiguration — the DB isn’t reachable from the app.
Confirm: From a host inside the VPC, test the port (nc -zv <endpoint> 5432); check the DB security group allows inbound from the app’s SG on the DB port; check PubliclyAccessible and route tables.
Fix: Add an inbound rule from the app security group to the DB port (5432/3306); keep the DB in private subnets; for DynamoDB from a private subnet, add a VPC gateway endpoint so traffic doesn’t need a NAT.
# Allow the app SG to reach Postgres on 5432
aws ec2 authorize-security-group-ingress --group-id $DB_SG \
--protocol tcp --port 5432 --source-group $APP_SG
# Gateway VPC endpoint for DynamoDB (no NAT, no data-transfer cost)
aws ec2 create-vpc-endpoint --vpc-id vpc-0abc --service-name com.amazonaws.ap-south-1.dynamodb \
--route-table-ids rtb-0def
7. Storage full → RDS goes read-only.
Root cause: allocated_storage exhausted with storage autoscaling off.
Confirm: FreeStorageSpace near zero; the instance shows storage-full.
Fix: Enable storage autoscaling with a max_allocated_storage ceiling so it grows before it fills; grow storage now; archive or purge old data. Aurora avoids this entirely (storage auto-grows to 128 TiB).
8–14 are covered crisply in the table above: strong reads aren’t allowed on GSIs (design around it); LSIs are creation-time only (use a GSI or rebuild); use the reader endpoint for Aurora read balancing; add connection retry/backoff for failovers; respect the 400 KB item cap (blob to S3); paginate past the 1 MB page limit; and move off db.t burstable classes when sustained load exhausts CPU credits.
Best practices
- Choose data-model-first, every time. Name the model (relational vs key-value/document), enumerate the access patterns, project the scale curve, state the consistency need — then pick the store. Never let “the database we know” or “NoSQL scales” make the call.
- Put databases in private subnets, reachable only via security groups. No public accessibility for production data stores; reference the app’s security group, not CIDRs, for inbound.
- Pool connections for SQL stores. Use RDS Proxy (especially for Lambda/serverless fan-out) or a client pooler; connection exhaustion is the most common avoidable SQL outage.
- Enable Multi-AZ and tested backups on every production SQL database. Multi-AZ for failover, PITR for recovery, and practise the restore — an untested backup is a hope.
- Turn on Performance Insights (SQL) and Contributor Insights (DynamoDB). They turn a two-hour mystery into a two-minute lookup: top SQL/waits for RDS/Aurora, hottest keys for DynamoDB.
- Design DynamoDB around access patterns, not entities. High-cardinality partition keys, the right GSIs, write-sharding for hot keys; favour Query over Scan; remember LSIs are creation-time only.
- Right-size capacity to the load curve. On-demand DynamoDB and Aurora Serverless v2 for spiky/unpredictable load; provisioned (with auto-scaling / reserved capacity) for steady, predictable volume to cut cost.
- Use gp3 and provision IOPS for what you need. Don’t let gp2’s size-coupled IOPS throttle you; watch DiskQueueDepth.
- Encrypt at rest with KMS and in transit with TLS. Enable storage encryption at creation (you can’t add it in place easily later); enforce TLS connections.
- Use IAM authentication and Secrets Manager instead of long-lived passwords where possible; rotate managed master secrets.
- Alert on the leading indicators, not just “database down”: ReplicaLag, DatabaseConnections, DiskQueueDepth, FreeStorageSpace, DynamoDB ThrottledRequests and consumed-capacity vs provisioned.
- Add a cache (ElastiCache/DAX) last: only after proving the database itself is the latency bottleneck, never as the first reflex.
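Write-sharding for hot keys, mentioned in the DynamoDB practice above, can be sketched like this. It is an illustration under assumptions, not a prescribed scheme: the shard count of 10, the `#` separator and both function names are hypothetical, and the suffix is derived from an item ID so that the same item always lands on the same shard.

```python
import hashlib

SHARD_COUNT = 10  # hypothetical; size it to the write rate of the hottest key


def sharded_partition_key(base_key, item_id, shard_count=SHARD_COUNT):
    """Append a calculated suffix so writes to one hot key spread over shard_count keys."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_key}#{shard}"


def all_shard_keys(base_key, shard_count=SHARD_COUNT):
    """Every partition key a reader must query and merge to see the whole logical key."""
    return [f"{base_key}#{n}" for n in range(shard_count)]
```

The trade-off is on the read side: a query for the logical key becomes one Query per shard key, merged in application code, so keep the shard count as small as the write rate allows.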
The alarms worth wiring before the next incident — the leading indicators per store:
| Alert on | Store | Metric | Threshold (starting point) | Why it’s leading |
|---|---|---|---|---|
| Replica lag | RDS / Aurora | ReplicaLag / AuroraReplicaLag | > 5 s sustained | Stale reads before users complain |
| Connection pressure | RDS / Aurora | DatabaseConnections | > 80% of max_connections | Predicts connection-exhaustion errors |
| Storage exhaustion | RDS | FreeStorageSpace | < 10% free | Prevents read-only / outage |
| IOPS starvation | RDS / Aurora | DiskQueueDepth | > 5 sustained | Slow queries before timeouts |
| DynamoDB throttling | DynamoDB | ThrottledRequests | > 0 sustained | Hot key / under-capacity early |
| Capacity vs provisioned | DynamoDB | Consumed vs Provisioned | > 80% of provisioned | Pre-empts throttling on provisioned tables |
| CPU credits (burstable) | RDS | CPUCreditBalance | trending to 0 | Predicts t-class throttle |
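What "sustained" means in these thresholds is worth pinning down before wiring alarms. The sketch below is one possible reading, assuming "sustained" means the last N consecutive datapoints all cross the threshold. The datapoint counts, the `RULES` structure and the derived `FreeStoragePercent` value (FreeStorageSpace divided by allocated storage) are assumptions for illustration, not CloudWatch settings.

```python
import operator

# metric -> (comparison, threshold, consecutive datapoints required)
# Thresholds come from the table; the datapoint counts are assumed.
RULES = {
    "ReplicaLag": (operator.gt, 5.0, 3),
    "DiskQueueDepth": (operator.gt, 5.0, 3),
    "ThrottledRequests": (operator.gt, 0.0, 3),
    "FreeStoragePercent": (operator.lt, 10.0, 1),
}


def breaches(metric, datapoints):
    """True when the most recent N datapoints all cross the metric's threshold."""
    compare, threshold, needed = RULES[metric]
    recent = datapoints[-needed:]
    return len(recent) == needed and all(compare(value, threshold) for value in recent)
```

Requiring several consecutive datapoints keeps a single spike from paging anyone, while a storage threshold is safe to act on from one reading because free space rarely recovers on its own.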
Security notes
- Encryption at rest with KMS. Enable storage encryption when you create an RDS/Aurora database. DynamoDB always encrypts at rest, so there you only choose the key (AWS-owned, AWS-managed, or customer-managed). Adding encryption to an existing unencrypted RDS instance means a snapshot-copy-and-restore, so do it from the start.
- Encryption in transit (TLS). Enforce TLS to RDS/Aurora (require SSL in the parameter group / connection string); DynamoDB and its VPC endpoint use HTTPS.
- Least-privilege IAM. Scope DynamoDB access to specific tables and actions (dynamodb:Query on table/shop-events, not dynamodb:* on *). Use IAM database authentication for RDS/Aurora to issue short-lived tokens instead of static passwords where it fits.
- Network isolation. Databases in private subnets; inbound only from the application’s security group on the DB port; VPC gateway/interface endpoints for DynamoDB so traffic never leaves the AWS network (and you skip NAT cost).
- Secrets management. Store credentials in Secrets Manager with rotation (RDS managed master password); never hard-code connection strings.
- Auditing. Enable engine audit logs (RDS) and CloudTrail for control-plane and DynamoDB data-plane events; ship logs to CloudWatch/S3. See AWS CloudTrail, Config & Audit Compliance.
- Deletion protection + backups. Turn on deletion protection for production stores and keep PITR enabled so an accidental drop or a bad deploy is recoverable.
The security controls mapped to what each defends and how to set it:
| Control | Setting / mechanism | Defends against | How to enable |
|---|---|---|---|
| Encryption at rest | KMS key on RDS/Aurora/DynamoDB | Disk/snapshot exposure | --storage-encrypted (set at create); DynamoDB SSE |
| Encryption in transit | Require SSL / HTTPS | Network sniffing / MITM | rds.force_ssl param; TLS in connection string |
| Network isolation | Private subnets + SG + VPC endpoint | Direct internet access | SG inbound from app SG; gateway endpoint for DynamoDB |
| Least-privilege access | Scoped IAM policies; IAM DB auth | Over-broad credentials | Resource-level ARNs; rds-db:connect |
| Secrets rotation | Secrets Manager managed secret | Leaked/static passwords | --manage-master-user-password; rotation schedule |
| Deletion protection | deletion_protection = true | Accidental drop | Flag on RDS/Aurora; PITR on DynamoDB |
| Audit trail | CloudTrail + engine audit logs | Undetected access/change | Enable trails; export DB logs to CloudWatch |
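The least-privilege control is easiest to see as a policy document. The sketch below builds one in Python and prints it as JSON. The account ID is a placeholder, the helper name is hypothetical, and the choice of Query plus GetItem is an assumption about what a read-only service needs; the table name and region follow the examples used earlier.

```python
import json


def table_read_policy(region, account_id, table_name):
    """Build an IAM policy allowing Query/GetItem on one table and its indexes only."""
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/{table_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:Query", "dynamodb:GetItem"],
                # Index ARNs sit under the table ARN, so GSI queries need the second entry.
                "Resource": [table_arn, f"{table_arn}/index/*"],
            }
        ],
    }


# Placeholder account ID; swap in your own before attaching the policy.
policy = table_read_policy("ap-south-1", "123456789012", "shop-events")
print(json.dumps(policy, indent=2))
```

Anything not listed (writes, deletes, Scan, other tables) is denied by default, which is the point: the policy names what the service may do rather than what it may not.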
Cost & sizing
What drives the bill, per store, and how to right-size:
- RDS / Aurora bill on instance-hours + storage + I/O + backups + data transfer. The instance class dominates; you pay for it whether busy or idle (unless Serverless v2). Multi-AZ roughly doubles the compute cost (the standby); each read replica is another instance. Right-size by watching CPU/RAM/IOPS utilisation and Performance Insights; don’t run prod on an oversized r-class “to be safe.”
- Aurora Serverless v2 bills per ACU-hour between your min and max: cheaper for spiky load (scales down at quiet times) but potentially more than a right-sized provisioned instance if you’re always busy at high ACUs.
- DynamoDB on-demand bills per read/write request + storage — zero capacity planning, ideal for spiky or new workloads, but more per-request than provisioned at high steady volume. Provisioned (with auto-scaling and optionally reserved capacity) is markedly cheaper for predictable load. Watch for Scan-driven and large-item costs.
- Reserved Instances (RDS/Aurora) and reserved capacity (DynamoDB) cut steady-state cost substantially for 1- or 3-year commitments — apply them once load is stable.
- Free tier: RDS gives 750 hours/month of a db.t2/t3/t4g.micro + 20 GB storage for 12 months; DynamoDB has a perpetual free tier (25 GB storage + 25 WCU/25 RCU provisioned, or a generous on-demand allowance). Aurora has no free tier; the cheapest path to “try Aurora” is a minimal Serverless v2 cluster, deleted after.
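The on-demand versus provisioned trade-off for DynamoDB writes reduces to a break-even utilisation, sketched below. The two prices in the usage note are illustrative placeholders, not quotes for any region, and the function names are hypothetical. The model also assumes items of 1 KB or less (one WCU per write) and ignores auto-scaling and reserved capacity.

```python
def break_even_utilisation(price_per_wcu_hour, price_per_million_writes):
    """Fraction of provisioned write capacity you must use before provisioned is cheaper."""
    writes_per_wcu_hour = 3600  # one WCU sustains one 1 KB write per second
    on_demand_cost_of_full_hour = writes_per_wcu_hour * price_per_million_writes / 1_000_000
    return price_per_wcu_hour / on_demand_cost_of_full_hour


def cheaper_mode(avg_writes_per_sec, provisioned_wcu, price_per_wcu_hour, price_per_million_writes):
    """Compare average utilisation of the provisioned WCUs against the break-even point."""
    utilisation = avg_writes_per_sec / provisioned_wcu
    threshold = break_even_utilisation(price_per_wcu_hour, price_per_million_writes)
    return "provisioned" if utilisation > threshold else "on-demand"
```

With placeholder prices of 0.00065 per WCU-hour and 1.25 per million writes, the break-even sits at roughly 14% utilisation: a table provisioned for 1,000 writes/sec that averages 50 is cheaper on-demand, while one averaging 600 is cheaper provisioned. Always rerun the arithmetic with current prices for your region.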
A rough monthly picture (ap-south-1, illustrative — always price for your region/usage):
| Configuration | What you pay for | Rough INR / month | Fits | Watch-out |
|---|---|---|---|---|
| RDS db.t4g.micro (free tier) | One burstable instance + 20 GB | ~₹0 (12 mo) then ~₹1,200 | Dev / tiny apps | Credit throttle under load |
| RDS db.r6g.large Multi-AZ | 2× memory-opt instance + gp3 | ~₹35,000–45,000 | Steady production OLTP | Standby doubles compute |
| Aurora db.r6g.large writer + 1 reader | 2 instances + storage + I/O | ~₹40,000–55,000 | High-throughput SQL + HA | Per-instance cost adds up |
| Aurora Serverless v2 (0.5–8 ACU) | ACU-hours between bounds | ~₹8,000–60,000 (load-driven) | Spiky / dev SQL | Always-busy = pricey |
| DynamoDB on-demand (moderate) | Per-request + storage | ~₹5,000–30,000 | Spiky key-value at scale | Scans/large items inflate it |
| DynamoDB provisioned + auto-scale | RCU/WCU-hours + storage | ~₹3,000–20,000 | Steady predictable load | Under-provision → throttle |
| DAX / ElastiCache (optional) | Cache node-hours | ~₹6,000–20,000 | Sub-ms hot reads | Only after proving the need |
The Trackwise lesson on cost: moving the event firehose off four oversized RDS instances onto DynamoDB on-demand, and right-sizing the relational core to Aurora, lowered the bill from ₹95,000 to ₹78,000 — proof that the cheapest store is usually the one that fits, not the smallest instance of the wrong one.
Interview & exam questions
1. How do you choose between RDS, Aurora and DynamoDB? Data-model first: relational data with joins/transactions/ad-hoc queries goes to RDS (existing engine, lift-and-shift) or Aurora (same SQL, higher throughput, faster failover, more replicas, auto-scaling). Key-value/document data with known access patterns at large or unpredictable scale goes to DynamoDB. Scale, latency and cost are tie-breakers after the model decides.
2. What does Aurora change versus RDS for the same engine? Aurora keeps the MySQL/PostgreSQL wire protocol and SQL but replaces the storage layer with a distributed store that keeps six copies across three AZs and auto-grows to 128 TiB. Consequences: faster failover (seconds vs 60–120 s), sub-100 ms reader lag (replicas read shared storage), up to 15 readers, and higher write throughput. The trade-off is that Aurora does not have 100% feature parity with vanilla Postgres/MySQL.
3. Why does DynamoDB’s single-digit-millisecond latency depend on the partition key? DynamoDB hashes the partition key to place items in partitions, and throughput is per partition. A well-distributed (high-cardinality) key spreads load evenly and gives O(1) access at any scale; a low-cardinality or constant key creates a hot partition that throttles even when the table is far under total capacity. Key design is the whole game.
4. Difference between a GSI and an LSI, and when do you use each? A GSI can have any partition/sort key, can be added or removed anytime, has its own capacity, but is eventually consistent only (max 20). An LSI shares the table’s partition key with a different sort key, must be created with the table, can be strongly consistent, shares the table’s capacity, and is bound by a 10 GB per-partition-key item-collection limit (max 5). Use a GSI for a different access key; an LSI for an alternate sort within the same partition key — decided at design time.
5. Multi-AZ vs read replicas on RDS — what’s the difference? Multi-AZ is for availability: a synchronous standby in another AZ that takes over on failure (the standby is not readable in classic Multi-AZ). Read replicas are for read scaling: asynchronous, readable copies that can lag the primary and don’t help write throughput. They solve different problems and you often deploy both.
6. DynamoDB on-demand vs provisioned capacity — how do you choose? On-demand scales automatically and bills per request — pick it for spiky, unpredictable, or new workloads with zero capacity planning. Provisioned (with auto-scaling and optionally reserved capacity) sets RCU/WCU and is markedly cheaper for steady, predictable load. The trade is convenience/spike-handling (on-demand) vs cost-efficiency at stable high volume (provisioned).
7. What’s the difference between eventually and strongly consistent reads in DynamoDB, and the cost? An eventually consistent read may not reflect the most recent write for a short window (it might hit a not-yet-caught-up replica) and costs 0.5 RCU per 4 KB. A strongly consistent read returns the latest committed write and costs 1 RCU per 4 KB — but is not available on GSIs. Use strong reads only where read-after-write correctness matters; eventual elsewhere to scale and save.
8. An RDS read replica is showing 30-second lag and users see stale data. What’s happening and what do you do? Replicas replicate asynchronously, so under heavy write load they lag — ReplicaLag climbs and reads from the replica are stale. Route freshness-critical reads to the primary, reduce write pressure or size the replica up, and consider Aurora (shared storage keeps reader lag sub-100 ms). Adding more replicas won’t fix it — they don’t help write throughput.
9. Your Lambda functions exhaust RDS connections under load. Fix? Each Lambda invocation opening its own connection overwhelms max_connections. Put RDS Proxy in front to pool and multiplex connections across invocations (it’s effectively required for serverless-to-RDS at scale); only then consider a larger instance to raise max_connections. Confirm via DatabaseConnections near the limit.
10. What is Aurora Serverless v2 and when is it the right call? Serverless v2 auto-scales Aurora compute in fine-grained ACUs (≈ 2 GiB RAM each) between a min and max you set, near-instantly with load, billing per ACU-hour. It’s right for spiky, unpredictable, or intermittent SQL workloads (dev/test, variable traffic) where a fixed instance would be over- or under-provisioned. For always-busy steady load, a right-sized provisioned instance can be cheaper.
11. How do DynamoDB Global Tables and Aurora Global Database differ for multi-region? Global Tables give multi-region active-active writes with last-writer-wins conflict resolution — any region can write. Aurora Global Database has one primary region (writes) and read-only secondary regions with < 1 s replication and managed failover for DR/locality. Active-active multi-master (Global Tables) vs single-writer-with-fast-DR (Aurora Global).
12. When is none of these three the right answer? For heavy ad-hoc analytics/joins over large data, use Redshift or Athena (OLAP), not these OLTP stores. For sub-millisecond caching/leaderboards, use ElastiCache in front of the system of record. For graph or time-series at scale, AWS has purpose-built stores. Don’t bend RDS/Aurora/DynamoDB into an analytics warehouse or a cache.
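The read-capacity arithmetic from question 7 can be checked with a few lines of Python. This is a sketch of the rule as stated there (size rounds up to 4 KB blocks, eventual reads cost half), with hypothetical function names; it covers single-item reads and leaves out transactional reads.

```python
import math


def read_capacity_units(item_size_bytes, strongly_consistent=False):
    """RCUs consumed by one read: size rounds up to 4 KB blocks; eventual reads cost half."""
    blocks = max(1, math.ceil(item_size_bytes / 4096))
    return blocks if strongly_consistent else blocks / 2


def provisioned_rcu(reads_per_sec, item_size_bytes, strongly_consistent=False):
    """Whole RCUs to provision for a steady read rate at a given item size."""
    return math.ceil(reads_per_sec * read_capacity_units(item_size_bytes, strongly_consistent))
```

For example, 500 reads/sec of 6 KB items needs 500 RCU with eventual consistency and 1,000 RCU with strong consistency, because each 6 KB item spans two 4 KB blocks.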
These map to AWS Certified Solutions Architect – Associate (SAA-C03) — design resilient, high-performing, cost-optimised architectures, including selecting databases — and to the Database Specialty / Data Engineer Associate scope for the deeper RDS/Aurora/DynamoDB internals. A compact cert-mapping for revision:
| Question theme | Primary cert | Objective area |
|---|---|---|
| Service selection by data model | SAA-C03 | Design resilient & high-performing architectures |
| RDS Multi-AZ vs replicas, failover | SAA-C03 | High availability & fault tolerance |
| Aurora internals, endpoints, Serverless v2 | DBS / Data Engineer | Database design & operations |
| DynamoDB keys, GSI/LSI, capacity | DBS / Data Engineer | NoSQL design & throughput |
| Consistency models & transactions | DBS / SAA-C03 | Data consistency & integrity |
| Cost optimisation (on-demand vs provisioned, RIs) | SAA-C03 | Cost-optimised architectures |
| Encryption, IAM, network isolation | SCS / SAA-C03 | Secure architectures |
Quick check
- You’re migrating an existing Oracle application with complex joins and stored procedures to AWS with minimal change. Which store, and why not DynamoDB?
- Your DynamoDB table throttles at a fraction of expected load, and Contributor Insights shows one partition key taking nearly all traffic. What is this called and how do you fix it?
- True or false: adding more RDS read replicas is the right fix for an app whose writes are bottlenecked.
- You need a strongly consistent read in DynamoDB but the attribute you query on is only on a GSI. What’s the problem, and what do you do?
- A relational workload has wildly spiky, unpredictable traffic and you don’t want to size a fixed instance. Which AWS option fits, and what unit does it scale/bill in?
Answers
- RDS (Oracle engine). It’s the same engine you already run, so it’s a lift-and-shift with full SQL, joins, transactions and PL/SQL — DynamoDB is wrong because it has no joins or ad-hoc queries and would force a complete re-architecture of the data model and application.
- A hot partition: a low-cardinality (or constant) partition key concentrates traffic on one partition, which throttles even though the table is far under total capacity. Fix by re-keying for high cardinality and/or write sharding (a calculated suffix to spread writes); on-demand or more capacity won’t help a constant key.
- False. Read replicas only scale reads and replicate asynchronously; they do nothing for write throughput. A write bottleneck needs a bigger writer (scale up), Aurora’s higher write ceiling, or moving the write-heavy workload to a horizontally-scaling store like DynamoDB.
- Strongly consistent reads aren’t supported on GSIs (GSIs are eventually consistent only). Either accept eventual consistency for that query, or redesign so the attribute is the base-table key (or an LSI, which can be strongly consistent) — decided at table-creation time.
- Aurora Serverless v2 (for the relational model). It auto-scales compute in ACUs (≈ 2 GiB RAM each) between a min and max you set, billing per ACU-hour, so it shrinks at quiet times and grows for spikes without you sizing a fixed instance.
Glossary
- Amazon RDS — managed relational database service running PostgreSQL, MySQL, MariaDB, Oracle or SQL Server; AWS handles provisioning, patching, backups and Multi-AZ failover, you run the same engine you would on a server.
- Amazon Aurora — AWS’s cloud-native relational engine, MySQL- and PostgreSQL-compatible, with a distributed storage layer (six copies across three AZs, auto-grow to 128 TiB), fast failover and low-lag readers.
- Amazon DynamoDB — fully managed serverless key-value and document database delivering single-digit-millisecond latency at any scale via partitioned storage.
- DB instance / instance class — the compute box (vCPU + RAM, e.g. db.r6g.large) running an RDS/Aurora engine; the capacity ceiling and failover unit.
- Multi-AZ — a synchronous standby in a second Availability Zone for automatic failover (availability), distinct from read scaling.
- Read replica — an asynchronous, readable copy of an RDS/Aurora database used to scale reads; can lag the primary.
- Aurora cluster — a writer plus up to 15 readers sharing one distributed storage volume; the unit you create in Aurora.
- Cluster endpoint (writer / reader / custom) — DNS names that route to the current writer, load-balance across readers, or target a chosen subset of Aurora instances.
- ACU (Aurora Capacity Unit) — the fine-grained scaling/billing unit of Aurora Serverless v2 (≈ 2 GiB RAM with proportional CPU/network).
- Partition key — the DynamoDB attribute hashed to choose an item’s partition; high cardinality spreads load, low cardinality creates hot partitions.
- Sort key — an optional second DynamoDB key that orders items within a partition and enables range queries and item collections.
- Hot partition — a partition key taking disproportionate traffic, throttling even when the table is under total capacity; fixed by key design / write sharding.
- RCU / WCU — Read / Write Capacity Unit: one strongly-consistent 4 KB read/sec (RCU) or one 1 KB write/sec (WCU); the provisioned-capacity and billing unit.
- GSI (Global Secondary Index) — a DynamoDB index with any partition/sort key, addable anytime, own capacity, eventually consistent only.
- LSI (Local Secondary Index) — a DynamoDB index sharing the table’s partition key with a different sort key; creation-time only, can be strongly consistent, 10 GB per-partition-key limit.
- On-demand vs provisioned (DynamoDB) — pay-per-request auto-scaling capacity (on-demand) versus pre-set RCU/WCU with optional auto-scaling (provisioned).
- Serverless v2 (Aurora) — Aurora capacity mode that auto-scales compute in ACUs with load, billing per ACU-hour.
- PITR (point-in-time recovery) — continuous backup allowing restore to any second within the retention window (up to 35 days) on RDS, Aurora and DynamoDB.
- DynamoDB Streams — an ordered change log of item modifications (24 h retention) for CDC and event-driven pipelines.
- Global Tables — multi-region active-active DynamoDB replication with last-writer-wins conflict resolution.
- RDS Proxy — a managed connection pooler that multiplexes application connections to RDS/Aurora, preventing connection exhaustion (key for serverless).
- DAX (DynamoDB Accelerator) — an in-memory cache in front of DynamoDB for microsecond reads on hot items.
Next steps
You can now choose RDS, Aurora or DynamoDB by data model and defend it. Build outward:
- Next: AWS Backup & Disaster-Recovery Strategies — protect whichever store you chose: snapshots, PITR, cross-region copy, and tested restores.
- Related: Amazon VPC: Subnets, Security Groups & Network Design — put databases in private subnets reachable only through security groups, with VPC endpoints for DynamoDB.
- Related: AWS Compute: EC2 vs Lambda vs ECS vs EKS — the compute that connects to these stores, and why Lambda needs RDS Proxy.
- Related: AWS Storage: S3 Storage Classes & Lifecycle — the data-lake target you export DynamoDB and Aurora data to for analytics.
- Related: AWS CloudTrail, Config & Audit Compliance — audit data-plane and control-plane access to your databases.