GCP Lesson 16 of 98

Google Cloud SQL, In Depth: Engines, HA, Read Replicas, Backups & Connectivity

Almost every application that does anything useful needs a relational database, and on Google Cloud the default answer is Cloud SQL — a fully managed service that runs the same MySQL, PostgreSQL and SQL Server engines you already know, while taking the operational pain (patching, backups, replication, failover, storage growth) off your plate. You bring your schema and your queries; Google runs the engine, the host, the disk and the high-availability machinery underneath.

The trouble with “fully managed” is that it hides a lot of decisions you still have to make, and a few of them are very hard to change after the fact. The engine and its major version, the edition, whether you turn on high availability at creation, whether you give the instance a public IP, and which encryption keys protect the data are choices that shape cost, performance, resilience and security for the life of the instance. Get them right up front and Cloud SQL is calm. Get them wrong and you are looking at a migration.

This lesson is the exhaustive version. We will walk every creation setting blade by blade, explain the load-bearing concepts in full — especially the difference between high availability and read replicas, which interviewers probe relentlessly and beginners routinely conflate — cover backups and point-in-time recovery, lay out every connectivity option with its trade-offs, explain customer-managed encryption, and finish with the decision that trips up architects: when Cloud SQL is the right tool and when you should reach for AlloyDB or Cloud Spanner instead. Commands are real gcloud sql against current Cloud SQL (2026), with the console blade names called out so you can follow along either way.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You should be comfortable with the Google Cloud resource hierarchy (organisation, folders, projects) and basic IAM, have the gcloud CLI installed and initialised, and understand what a VPC and a subnet are — connectivity is half of running Cloud SQL well. Basic SQL and a working knowledge of one relational engine (MySQL or PostgreSQL) will make the engine sections concrete. This is the Databases lesson of the GCP Zero-to-Hero course, sitting after Cloud Storage (G4) and before the VPC deep dive (G6); it is the foundation for the Advanced production-operations companion, Cloud SQL in Production: HA, Read Replicas, PSC Connectivity, and Maintenance, which goes deeper on PSC topologies, failover storms and connection pooling.

Core concepts: the mental model

Before the settings, fix the vocabulary — most Cloud SQL confusion is vocabulary confusion.

The single most important mental distinction in this lesson: HA is for availability of one logical database after a zone outage; a read replica is a second, queryable database for scaling reads or for DR. They look superficially similar (both are “a copy in another place”) and are completely different in mechanism, in what you connect to, and in what they protect against. We will return to this in its own section because it is the question.

Choosing the engine, version and edition

Engines

Cloud SQL offers three engines. The choice is usually dictated by your application or team, but the managed-feature surface differs, so it matters.

Engine Typical use Notes on Cloud SQL support
PostgreSQL New cloud-native apps, geospatial (PostGIS), JSON-heavy, anything wanting rich extensions Broadest extension support; the usual default for greenfield. IAM auth, PITR, cross-region replicas all supported.
MySQL LAMP-stack apps, WordPress, large existing MySQL estates Mature, widely understood; supports IAM auth and replicas. Some replication nuances (GTID) are managed for you.
SQL Server .NET applications, lift-and-shift from on-prem SQL Server Licensed editions (Express/Web/Standard/Enterprise) priced into the machine; some features (e.g. cross-region replicas) have engine-specific limits. No IAM database auth — uses SQL Server authentication.

The engine cannot be changed after creation. Moving from MySQL to PostgreSQL is a data migration (Database Migration Service or dump/restore), not a setting.

Major versions

Pick a current major version at creation. You can perform an in-place major version upgrade later (e.g. PostgreSQL 15 → 16) — Cloud SQL automates it, but it requires downtime and is irreversible, so test on a clone first. Minor versions (patches) are applied automatically during maintenance windows; you do not pick those.

Editions: Enterprise vs Enterprise Plus

This is a genuine fork, not marketing. It governs which machine shapes you can use, performance features, and how much you pay.

Dimension Enterprise Enterprise Plus
Positioning Standard managed databases, dev/test and most production Performance- and availability-critical production
Machine shapes Shared-core, standard, high-memory, and custom machine types Performance-optimised machine families (more vCPU/memory headroom)
Data cache Not available Optional SSD-backed data cache that accelerates reads by caching data on local SSD
Max backup retention Up to 365 automated backups Longer retention ceiling (more automated backups)
Maintenance / failover Standard Reduced downtime for near-zero-downtime planned maintenance and faster failover
Cost Lower Higher per vCPU/GB; you pay for the performance and availability headroom

Choose Enterprise for dev/test and the majority of line-of-business workloads. Choose Enterprise Plus when you need the data cache for read-heavy latency-sensitive workloads, the longer backup retention, or the reduced-downtime maintenance and faster failover for tier-1 systems. You can upgrade Enterprise → Enterprise Plus in place (a restart); plan it.

Machine type and storage

Machine type (the --tier)

The machine type sets vCPU and memory — the biggest single lever on both performance and cost. Cloud SQL exposes three ways to size, depending on edition:

Shape style Example tier When
Shared-core db-f1-micro, db-g1-small (legacy) / db-perf-optimized micros Dev/test and tiny workloads only — bursty, not for production; not eligible for the HA SLA.
Predefined standard / high-memory db-n1-standard-4, db-n1-highmem-8 (and current-generation equivalents) Most production; pick a CPU:memory ratio (standard vs high-memory) to match the workload.
Custom (Enterprise) db-custom-4-16384 (4 vCPU, 16 GB) When no predefined shape fits — set vCPU and memory independently within allowed ratios.
Performance-optimised (Enterprise Plus) db-perf-optimized-N-8 Enterprise Plus instances; required to use the data cache.

Memory matters more than people expect: the engine’s buffer pool / shared buffers live in RAM, so an under-memoried instance thrashes to disk. As a rule, size memory to hold your working set. You can change the machine type after creation — it requires a restart (a brief outage on a non-HA instance; on an HA instance the standby reduces the impact), so do it in a maintenance window.

Storage

Setting Choices Default / guidance Gotcha
Storage type SSD (default) or HDD SSD for anything latency-sensitive (almost everything). HDD only for large, sequential, cost-sensitive workloads. Storage type is fixed after creation — you cannot switch SSD↔HDD in place.
Storage capacity From 10 GB upward Start modest; IOPS and throughput scale with size, so very small disks are also slow. You can grow storage but never shrink it.
Automatic storage increases On / Off Turn on. Cloud SQL grows the disk automatically when it nears full, preventing the instance from going read-only / down. Once grown, it cannot shrink — set a sane growth limit if you want a ceiling.

The two storage gotchas — type is permanent and capacity only grows — are classic exam material. Always enable automatic storage increase in production: a full disk is one of the most common ways a Cloud SQL instance falls over.

High availability: the synchronous regional standby

This is the first half of the concept that matters most. A standard (non-HA) Cloud SQL instance is a single VM in a single zone. If that zone has an outage, your database is down until the zone recovers.

Regional high availability changes the topology. When you set the availability type to regional, Cloud SQL provisions a primary in one zone and a standby in another zone of the same region, with the data on a regional persistent disk that is replicated synchronously across both zones at the storage layer. Key properties:

Property Single-zone Regional HA
Failure domain tolerated None One zone in the region
Replication None (one disk) Synchronous, regional persistent disk
Failover None Automatic, endpoint preserved
Data loss on failover Total until zone recovers None (RPO ≈ 0)
Serves read traffic? n/a No — standby is passive
SLA None 99.95%
Cost 1× compute + disk ~2× compute + regional disk

Enable HA at creation (or convert later — it requires a restart):

gcloud sql instances create pg-prod \
  --project=app-data-prj \
  --database-version=POSTGRES_16 \
  --edition=ENTERPRISE_PLUS \
  --tier=db-perf-optimized-N-8 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --storage-type=SSD \
  --storage-size=100 \
  --storage-auto-increase \
  --enable-point-in-time-recovery \
  --backup-start-time=03:00 \
  --retained-backups-count=14

The one thing HA does not protect against is a region failure (every zone is in one region) or a logical error such as a bad DELETE (it is faithfully replicated to the standby in milliseconds). For region loss you need a cross-region read replica; for logical errors you need backups and PITR. HA, replicas and backups are three different protections — you generally want all three.

Read replicas: asynchronous, separately-served copies

This is the second half. A read replica is a separate instance — its own machine type, its own IP, its own connection name — that receives changes from the primary asynchronously and serves read-only queries. It exists for three reasons:

  1. Scale reads. Point reporting, analytics and read-heavy traffic at replicas to keep load off the primary.
  2. Cross-region disaster recovery. A replica in another region is a warm copy you can promote to a standalone primary if the primary’s region is lost.
  3. Migration / version testing. Promote a replica to cut over, or use one to test an upgrade.

Key properties:

Create an in-region read replica and a cross-region one for DR:

# In-region read replica to offload reporting
gcloud sql instances create pg-prod-replica-1 \
  --master-instance-name=pg-prod \
  --region=us-central1 \
  --tier=db-custom-4-16384

# Cross-region read replica for disaster recovery
gcloud sql instances create pg-prod-dr \
  --master-instance-name=pg-prod \
  --region=europe-west1 \
  --tier=db-custom-4-16384

Promote the DR replica if you lose the primary region (irreversible — it becomes a standalone primary):

gcloud sql instances promote-replica pg-prod-dr

HA vs read replica — the difference, stated plainly

This is the interview question. Memorise the contrast:

Aspect High availability (standby) Read replica
What it protects against A zone outage (same region) Scaling reads; region outage (cross-region, via promotion)
Replication Synchronous, storage-level Asynchronous, log-based
Data loss on failure None (RPO ≈ 0) Up to the replication lag
Can you query it? No — passive standby Yes — read-only
Separate endpoint? No — endpoint preserved on failover Yes — its own IP/connection name
Failover Automatic Manual promotion (irreversible)
Same region only? Yes (regional disk) No — same or different region
Billing ~2× compute (hidden standby) Full price per replica instance

A clean one-liner for the interview: “HA gives me the same database surviving a zone failure at the same endpoint with zero data loss; a read replica gives me a second, queryable database that lags the primary and can be promoted for read scaling or cross-region DR.” Most resilient production designs use HA for zone resilience + a cross-region replica for region DR + backups for logical recovery — they are complementary, not alternatives.

Backups and point-in-time recovery

HA and replicas protect against infrastructure failures; backups protect against logical disasters — a bad migration, a DROP TABLE, ransomware, a fat-fingered DELETE. They are non-negotiable.

Automated backups

Cloud SQL takes automated daily backups in a window you choose. These backups are incremental under the hood, stored redundantly and managed by Google.

Setting What it does Guidance
Automated backups Enable daily backups On for any instance you care about (required for HA and for creating replicas).
Backup window (--backup-start-time) The hour the daily backup begins (UTC) Pick a low-traffic hour, e.g. 03:00.
Retained backups (--retained-backups-count) How many automated backups to keep Default 7; up to 365 (more on Enterprise Plus). Match your compliance window.
Backup location Region(s) where backups are stored Default is multi-region near the instance; can be customised for data-residency.

Point-in-time recovery (PITR)

Automated daily backups alone only let you restore to a backup point. Point-in-time recovery layers transaction (write-ahead) log retention on top, letting you recover to any second within the retention window — essential for “undo the last 20 minutes” scenarios.

gcloud sql instances patch pg-prod \
  --enable-point-in-time-recovery \
  --retained-transaction-log-days=7

Crucially, PITR does not restore in place — it creates a new instance (a clone) at the chosen timestamp. That is by design: you recover the lost data into a fresh instance, verify it, then redirect the application or copy the rows back. PITR for PostgreSQL/MySQL uses retained transaction logs; the more log-days you retain the wider your recovery window (and the more storage you consume).

Recover to a moment just before a bad change:

gcloud sql instances clone pg-prod pg-prod-recovered \
  --point-in-time='2026-06-15T09:42:00.000Z'

On-demand backups

Take a manual backup before any risky operation — a major version upgrade, a large migration, a schema change:

gcloud sql backups create --instance=pg-prod \
  --description="pre-upgrade-2026-06-15"

On-demand backups are retained until you delete them (they are not subject to the automated retention count), so clean them up to control cost.

Restore vs clone vs export

Three different operations people confuse:

A backup you have never restored is a hope, not a backup. Periodically clone to a point in time and verify your restore objective (RTO/RPO) is real.

Connectivity: every way to reach the instance

How clients reach Cloud SQL is the part beginners under-think and the part that causes the most production incidents (and security findings). There are several models; they are not mutually exclusive.

1. Public IP + authorised networks

The instance gets a public IPv4 address. By default nothing can connect — you must add authorised networks (allow-listed source CIDRs) and the client must use SSL/TLS. Simple to start with, but a public database is an attack surface, and home/office IPs change.

gcloud sql instances patch pg-prod \
  --authorized-networks=203.0.113.10/32

Use public IP only for quick tests or when paired with the Auth Proxy (below). For production, prefer private IP.

2. Private IP via Private Service Access (PSA)

The instance gets an internal IP inside your VPC and is not reachable from the internet at all. This is the production default. It works through Private Service Access: you allocate an internal IP range in your VPC for Google-managed services and establish a VPC peering to the Cloud SQL service producer network; Cloud SQL then places the instance’s private IP in that peered range.

One-time per-VPC setup, then create the instance with private IP:

# 1. Allocate an internal range for Google service producers
gcloud compute addresses create google-managed-services-app-vpc \
  --global --purpose=VPC_PEERING --prefix-length=20 \
  --network=app-vpc

# 2. Create the private connection (PSA peering)
gcloud services vpc-peerings connect \
  --service=servicenetworking.googleapis.com \
  --ranges=google-managed-services-app-vpc \
  --network=app-vpc

# 3. Create the instance on private IP (no public IP)
gcloud sql instances create pg-prod \
  --database-version=POSTGRES_16 \
  --region=us-central1 \
  --network=projects/app-data-prj/global/networks/app-vpc \
  --no-assign-ip \
  --tier=db-custom-4-16384

--no-assign-ip removes the public IP entirely. Reaching a private-IP instance from another VPC or on-prem requires routing (VPC peering with custom-route export, a Shared VPC, or PSC — covered in the production companion lesson). For multi-VPC, centralised access, Private Service Connect (PSC) is the modern alternative to PSA and is covered in the Advanced lesson; for a single application VPC, PSA private IP is the standard.

3. The Cloud SQL Auth Proxy

The Auth Proxy is a small local binary (or sidecar/connector library) that creates a secure, IAM-authorised tunnel to your instance. It is the recommended way to connect from anywhere — laptops, GKE pods, Cloud Run, Compute Engine — because it gives you three things at once:

# Download once, then run pointing at the connection name
./cloud-sql-proxy app-data-prj:us-central1:pg-prod --port 5432
# App connects to 127.0.0.1:5432 as usual

The Auth Proxy works with both public and private IP instances and removes the need for authorised networks entirely — the security boundary becomes IAM, which is exactly where you want it. In GKE and Cloud Run it runs as a sidecar/integration and pairs naturally with Workload Identity so no keys are involved. The same machinery is available as language connector libraries (Java, Go, Python, Node.js) that embed the proxy logic directly into the client.

4. IAM database authentication

By default you authenticate to the database engine with a database username and password. IAM database authentication lets a Google identity (a user or a service account) log in to PostgreSQL or MySQL using a short-lived IAM access token instead of a password — no stored database passwords to leak or rotate.

# Enable IAM auth on the instance
gcloud sql instances patch pg-prod \
  --database-flags=cloudsql.iam_authentication=on

# Add a service account as an IAM database user
gcloud sql users create my-app@app-data-prj.iam \
  --instance=pg-prod --type=cloud_iam_service_account

The caller then connects (typically via the Auth Proxy/connector) using its IAM token; grant roles/cloudsql.instanceUser plus the engine-level GRANTs for what the user may touch. IAM auth is supported on MySQL and PostgreSQL; SQL Server uses SQL Server authentication and does not offer IAM database auth. Combining private IP + Auth Proxy + IAM auth + Workload Identity is the gold-standard, password-less, no-public-surface posture.

Connectivity at a glance

Model Reachable from Auth boundary Best for
Public IP + authorised networks Internet (allow-listed CIDRs) IP allow-list + SSL + DB password Quick tests; legacy clients
Private IP (PSA) Inside the VPC (and peered networks) Network + DB password/IAM Production single-VPC default
Auth Proxy / connectors Anywhere with IAM IAM (cloudsql.client) + mTLS Apps, GKE, Cloud Run, laptops
IAM database auth (layered on the above) IAM token instead of password Password-less DB login

Encryption: at rest, in transit, and CMEK

All Cloud SQL data is encrypted at rest by default with Google-managed keys — you do nothing. In transit, connections should always use TLS (the Auth Proxy and connectors enforce mutual TLS; for raw public-IP connections you can require SSL and even enforce client certificates).

For organisations that must control the key — for compliance, separation of duties, or the ability to revoke access cryptographically — Cloud SQL supports customer-managed encryption keys (CMEK) via Cloud KMS. With CMEK, the instance’s data is encrypted with a key you own and control in Cloud KMS; revoking or disabling the key renders the data inaccessible (a powerful and dangerous lever).

gcloud sql instances create pg-prod \
  --database-version=POSTGRES_16 \
  --region=us-central1 \
  --tier=db-custom-4-16384 \
  --disk-encryption-key=projects/sec-prj/locations/us-central1/keyRings/cloudsql/cryptoKeys/cloudsql-key

Two non-obvious rules: CMEK must be set at creation (you cannot wrap an existing instance’s storage after the fact), and the Cloud SQL service account must be granted the CryptoKey Encrypter/Decrypter role on the key, or creation fails. The KMS key must be in the same region as the instance. Treat CMEK key management as production-critical: lose or accidentally disable the key and you lose the database.

Maintenance windows, deny periods and database flags

Maintenance windows

Google patches the engine (minor versions) and the underlying host periodically. You control when by setting a maintenance window (day of week + hour). On HA and Enterprise Plus instances the disruption is minimised (near-zero-downtime maintenance); on single-zone instances expect a brief restart.

gcloud sql instances patch pg-prod \
  --maintenance-window-day=SUN \
  --maintenance-window-hour=4

You can also set maintenance timing (earlier vs later in the rollout schedule) and deny maintenance periods — date ranges (e.g. a holiday code-freeze, a Black-Friday window) during which Cloud SQL will not perform planned maintenance.

Database flags

Database flags are how you tune the engine — they map to the underlying MySQL/PostgreSQL/SQL Server configuration parameters (e.g. max_connections, log_min_duration_statement, shared_buffers-related settings, cloudsql.iam_authentication). Set them with --database-flags; some take effect live, others require a restart, and a subset of engine parameters is not exposed because Cloud SQL manages them. Always change flags in a maintenance window if a restart is implied, and test on a clone first.

After creation: what you can and cannot change

A quick reference for the operations that bite people:

Change Possible after creation? Notes
Engine (MySQL↔Postgres↔SQL Server) No Data migration only.
Major version upgrade Yes (in place) Downtime; irreversible; test on a clone.
Major version downgrade No Restore an older backup / migrate.
Edition (Enterprise → Enterprise Plus) Yes Restart required.
Machine type (vCPU/RAM) Yes Restart; do it in a window.
Storage type (SSD↔HDD) No Permanent.
Storage capacity Grow only Never shrinks; enable auto-increase.
Enable/disable HA Yes Restart.
Add/remove read replicas Yes Promotion of a replica is irreversible.
Public IP on/off Yes Can add private IP; PSA peering is per-VPC one-time.
CMEK No (must be set at creation) Cannot wrap existing storage later.
Maintenance window / flags Yes Some flags need a restart.

Embedded diagram

Google Cloud SQL deep dive

The diagram shows the whole picture in one frame: the regional primary + synchronous standby sharing a regional disk (HA, zone resilience, one endpoint), an in-region read replica and a cross-region replica fed asynchronously (read scaling and region DR), the backup/PITR layer underneath, and the four connectivity paths — public IP, private IP via PSA, the Auth Proxy, and IAM auth — converging on the primary. Keep this mental model: three independent protections (HA, replicas, backups) and a layered connectivity story.

Hands-on lab

We will create a small Cloud SQL for PostgreSQL instance, enable backups and PITR, add a read replica, take an on-demand backup, connect via the Auth Proxy, then clean everything up. Use a sandbox project and the $300 free trial credit (Cloud SQL is not in the Always Free tier — even small instances bill per hour, so do the cleanup).

0. Set context and enable the API.

gcloud config set project YOUR_SANDBOX_PROJECT
gcloud services enable sqladmin.googleapis.com

1. Create a small instance with backups + PITR. (We use a tiny shared-core tier and public IP to keep the lab cheap and simple; production would use private IP and a standard tier.)

gcloud sql instances create lab-pg \
  --database-version=POSTGRES_16 \
  --edition=ENTERPRISE \
  --tier=db-g1-small \
  --region=us-central1 \
  --storage-type=SSD \
  --storage-size=10 \
  --storage-auto-increase \
  --enable-point-in-time-recovery \
  --backup-start-time=03:00 \
  --retained-backups-count=7 \
  --root-password='ChangeMe_Lab_2026!'

Creation takes a few minutes. Expected: the instance reaches RUNNABLE.

2. Validate the configuration.

gcloud sql instances describe lab-pg \
  --format="yaml(databaseVersion, settings.tier, settings.availabilityType, \
  settings.backupConfiguration.enabled, \
  settings.backupConfiguration.pointInTimeRecoveryEnabled)"

Expected: POSTGRES_16, db-g1-small, availability ZONAL, backups enabled: true, PITR true.

3. Create a database and a user.

gcloud sql databases create appdb --instance=lab-pg
gcloud sql users create appuser --instance=lab-pg --password='AppUser_Lab_2026!'

4. Add an in-region read replica, then confirm it.

gcloud sql instances create lab-pg-replica \
  --master-instance-name=lab-pg \
  --region=us-central1 \
  --tier=db-g1-small

gcloud sql instances list --format="table(name, instanceType, region, state)"

Expected: two rows — lab-pg (CLOUD_SQL_INSTANCE) and lab-pg-replica (READ_REPLICA_INSTANCE).

5. Take an on-demand backup.

gcloud sql backups create --instance=lab-pg --description="lab-manual"
gcloud sql backups list --instance=lab-pg --format="table(id, type, status, windowStartTime)"

Expected: at least one ON_DEMAND backup with status SUCCESSFUL (and AUTOMATED ones over time).

6. Connect via the Auth Proxy (optional but recommended). In one terminal, run the proxy against the connection name; in another, connect with psql.

# Terminal A
./cloud-sql-proxy "$(gcloud sql instances describe lab-pg --format='value(connectionName)')" --port 5432
# Terminal B
psql "host=127.0.0.1 port=5432 user=appuser dbname=appdb"

Expected: a psql prompt — you are connected over the IAM-authorised, TLS tunnel without any authorised-networks entry.

7. Cleanup (do this — Cloud SQL bills per hour). Delete the replica first, then the primary.

gcloud sql instances delete lab-pg-replica --quiet
gcloud sql instances delete lab-pg --quiet
gcloud sql instances list   # expect: empty

Cost note. Two db-g1-small instances plus SSD storage for an hour of lab time is well within the free-trial credit (on the order of a few US cents to low tens of cents). The bill keeps running until the instances are deleted — stopping is not enough to stop storage charges, and on-demand backups persist until removed. Always run step 7.

Common mistakes & troubleshooting

Symptom Likely cause Fix
“Read replica didn’t fail over when the zone died” Confused a replica with HA; replicas have no automatic failover Enable regional HA for zone resilience; replicas are for read scaling / manual DR promotion
Stale data from a replica Replicas are asynchronous — replication lag Read latency-critical / read-your-writes data from the primary; monitor replication lag
Cannot connect over public IP No authorised network entry, or SSL not used Add the client CIDR to authorised networks (or, better, use the Auth Proxy)
Cannot connect over private IP from another VPC Private IP is only reachable inside the VPC / peered networks Set up VPC peering with route export, Shared VPC, or PSC
Instance went read-only / down Disk full Enable automatic storage increase; you cannot shrink afterwards
CMEK instance fails to create Cloud SQL service account lacks Encrypter/Decrypter on the key, or key in wrong region Grant the role on the KMS key in the same region; CMEK must be set at creation
Unexpected bill after “stopping” Stopped instances still incur storage (and IP) charges Delete unused instances; clean up on-demand backups
Major upgrade broke the app Upgraded in place without testing Always upgrade a cloned instance first; on-demand backup beforehand

Best practices

Security notes

Cost & sizing

The levers that move the Cloud SQL bill, roughly in order of impact:

  1. Machine type (vCPU/RAM). The biggest line item. Right-size; don’t over-provision “to be safe”.
  2. Edition. Enterprise Plus costs more per unit — pay for it only where you need the data cache, longer retention, or faster maintenance/failover.
  3. High availability. ~2× compute for the standby. Worth it for production; skip for dev/test.
  4. Read replicas. Each replica is a full-price instance. Add for genuine read load or DR, not by default.
  5. Storage. SSD GB provisioned (and it only grows). Plus backup storage and transaction-log retention for PITR.
  6. Network egress. Cross-region replication and reads from other regions incur egress.
  7. Idle instances. Stopping does not stop storage/IP charges — delete what you don’t need; use sustained/committed-use discounts for steady production.

For dev/test, a small/shared-core single-zone instance with short backup retention is fine. For production, expect HA + standard tier + backups + PITR + (often) a cross-region replica, and use committed-use discounts on the steady baseline.

Interview & exam questions

  1. What is the difference between Cloud SQL high availability and a read replica? HA provisions a synchronous, passive standby in another zone of the same region sharing a regional disk; it fails over automatically, preserves the endpoint, and loses no committed data — it protects against a zone outage. A read replica is a separate, asynchronously-replicated, queryable instance with its own endpoint, used to scale reads or, when promoted (irreversible), for cross-region DR. HA is availability of the same DB; a replica is a second DB.

  2. Does enabling HA give you a second instance you can read from? No. The standby is passive and unqueryable. To offload reads you need a read replica.

  3. What does HA not protect against, and what does? It does not protect against a region failure (use a cross-region read replica + promotion) or a logical error like a bad DELETE (use backups + PITR — the bad change is replicated synchronously to the standby).

  4. Explain point-in-time recovery and how it differs from restoring a backup. PITR uses retained transaction logs to recover to any second in the window and creates a new instance (clone) at that timestamp — non-destructive. Restoring a backup overwrites an existing target instance and only goes to backup points.

  5. Walk me through the connectivity options and which you’d pick for production. Public IP + authorised networks (test only), private IP via PSA (production default, no internet exposure), the Auth Proxy (IAM-authorised mTLS tunnel from anywhere), and IAM database auth (token instead of password). Production: private IP + Auth Proxy + IAM auth + Workload Identity.

  6. What is the Cloud SQL Auth Proxy and why use it? A local binary/sidecar/connector that opens a secure, IAM-authorised (mTLS) tunnel using the connection name, removing the need for authorised-network IP allow-lists or managed certificates. The security boundary becomes IAM (roles/cloudsql.client).

  7. Which engines support IAM database authentication? MySQL and PostgreSQL. SQL Server uses SQL Server authentication and does not support IAM DB auth.

  8. What can you not change after an instance is created? The engine, the storage type (SSD↔HDD), major-version downgrade, and CMEK (must be set at creation). Storage capacity can only grow.

  9. What’s the difference between Enterprise and Enterprise Plus editions? Enterprise Plus adds performance-optimised machines, an SSD data cache, a longer backup-retention ceiling, and near-zero-downtime maintenance / faster failover, at higher cost. Enterprise covers dev/test and most production.

  10. Why did the database go read-only, and how do you prevent it? The disk filled up. Enable automatic storage increase (with a limit). Note you cannot shrink the disk afterwards.

  11. How do you do disaster recovery across regions with Cloud SQL? Maintain a cross-region read replica; on regional loss, promote it to a standalone primary (irreversible) and repoint the application. HA alone does not cross regions.

  12. When would you choose Spanner or AlloyDB over Cloud SQL? Spanner when you need horizontal write scaling and strong consistency across regions/continents beyond a single primary’s limits. AlloyDB when you want PostgreSQL compatibility with much higher performance (especially analytical/HTAP and large datasets). Cloud SQL when a single-primary managed MySQL/PostgreSQL/SQL Server with vertical scaling is enough — which is most of the time.

Quick check

  1. Synchronous standby in another zone, automatic failover, endpoint preserved, no data loss — HA or read replica?
  2. A second instance you can run read-only queries against, with its own IP, that lags the primary — HA or read replica?
  3. True/false: stopping a Cloud SQL instance stops all charges.
  4. Which connectivity option makes IAM the authorisation boundary and needs no authorised-network entries?
  5. You need to undo a bad DELETE that ran 15 minutes ago. Which feature, and what does it produce?

Answers

  1. HA (synchronous standby, automatic failover, same endpoint, RPO ≈ 0).
  2. Read replica (asynchronous, separate endpoint, read-only).
  3. False — storage (and reserved IP) charges continue; you must delete the instance to stop them.
  4. The Cloud SQL Auth Proxy (and connector libraries), backed by roles/cloudsql.client.
  5. Point-in-time recovery (PITR) — it creates a new instance (clone) at the chosen timestamp; you then recover the rows from it.

Exercise

Design and partially build a production-grade Cloud SQL for PostgreSQL topology in a sandbox project:

  1. Create a VPC and set up Private Service Access (allocate a /20 range, create the peering).
  2. Create an Enterprise PostgreSQL 16 instance on private IP only (--no-assign-ip), with regional HA, automated backups, PITR, and automatic storage increase.
  3. Add a cross-region read replica for DR.
  4. Enable IAM database authentication and add a service account as an IAM database user with least-privilege grants.
  5. Connect from a small VM (or Cloud Shell) using the Auth Proxy.
  6. Take an on-demand backup, then perform a PITR clone to a timestamp a few minutes in the past and verify the data.
  7. Promote the DR replica, observe it become standalone, then tear everything down and confirm the bill stops.

Write a short paragraph justifying each resilience choice (HA vs replica vs backup) and which failure each one addresses.

Certification mapping

Glossary

Next steps

Cloud SQL vs AlloyDB vs Spanner: choosing the right managed database

Cloud SQL is the default, but it is not the only managed relational option, and architects are expected to know when to move up. All three are managed and relational; they differ in scale model and consistency.

Dimension Cloud SQL AlloyDB Cloud Spanner
Engine / compatibility MySQL, PostgreSQL, SQL Server PostgreSQL-compatible Proprietary (with a PostgreSQL interface); ANSI SQL
Scale model Single primary, scale vertically; read replicas for reads Single primary with very high performance; read pools Horizontally scalable reads and writes
Consistency / distribution Strong within the primary Strong within the primary Strong consistency, globally distributed
High availability Regional standby (zone) + cross-region replica (DR) Built-in high availability Multi-region, 99.999% configurations
Best for Most apps; lift-and-shift; standard OLTP PostgreSQL apps needing more performance, HTAP/analytical acceleration, large datasets Global scale, huge write throughput, no-compromise consistency (e.g. global inventory, financial ledgers)
Operational model Pick instance size Pick instance size (cluster) Provision compute capacity (nodes/PUs); design schema to avoid hotspots
Relative cost Lowest Higher Highest (for the scale/SLA it buys)

How to decide:

The exam framing is consistent: single-primary vertical scaling → Cloud SQL; PostgreSQL needing more horsepower → AlloyDB; global, horizontally-scaled, strongly-consistent writes → Spanner.

GCPCloud SQLDatabasesHigh AvailabilityRead ReplicasPostgreSQL
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments