GCP Lesson 10 of 98

Google Cloud Run, In Depth: Services, Jobs, Concurrency, Scaling & Traffic

Google Cloud Run is Google’s fully managed serverless container platform: you hand it a container image, and it runs that container on demand, scales it from zero to many copies as traffic arrives, scales it back to zero when traffic stops, and bills you (for the common case) only while a request is being handled. There are no nodes to patch, no clusters to size, no autoscaler to tune at the infrastructure level — Google runs all of that on its own infrastructure (built on the same internal platform, Borg, plus the open-source Knative serving model). You bring a container that listens on a port; Cloud Run does the rest. It is the sweet spot between Cloud Functions (where you bring a single function and Google builds the container) and GKE (where you bring and run the whole Kubernetes cluster).

This lesson is deliberately exhaustive. We cover the two execution resources — services (request-driven, long-lived, scale-to-zero) and jobs (run-to-completion batch work) — and then every knob that governs how a service behaves: the container contract (the PORT env var, statelessness, the request vs instance billing models), concurrency (how many requests one instance handles at once) and autoscaling (minimum and maximum instances, scale-to-zero, cold starts, CPU always-allocated vs throttled, and startup CPU boost), revisions and traffic splitting for blue-green and canary rollouts by percentage, networking (ingress controls, egress via Serverless VPC Access connector or Direct VPC egress, and fronting Cloud Run with a load balancer), environment variables and secrets, and service identity. We finish with the decision that interviewers and the ACE and Professional Cloud Architect exams love: Cloud Run vs Cloud Functions vs GKE. Every option gets the same treatment — what it is · the choices · the default · when to pick which · the trade-off · the limit · the cost impact · the gotcha — and every core operation comes with a real gcloud command. Everything below uses the current v2 surface (gcloud run with current flags); where a default flipped or a flag changed, I call it out.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You should already understand Google Cloud’s resource hierarchy — organisation → folder → project → resource — what a region is, how to run gcloud from Cloud Shell or a local SDK install (covered in the Fundamentals module), and the basics of a container image (a packaged application plus its dependencies, addressable by a registry path). Having read the Compute Engine deep dive helps — Cloud Run is where you go when you want to stop managing VMs entirely — but it is not required; we define every term. This is the serverless containers lesson of the Compute module in the GCP Zero-to-Hero course. It sits between raw VMs/MIGs and full Kubernetes: once you can drive a Cloud Run service and job fluently, you can ship most stateless web apps, APIs, and batch jobs without ever touching a node. For the production networking angle — private ingress, Direct VPC egress patterns, and PSC — pair this with Cloud Run in Production: Services, Jobs, VPC Egress, and Concurrency Tuning.

Core concepts

Before the options, fix six mental models. They explain why every setting is shaped the way it is.

You bring a container; Google runs everything below it. Cloud Run is serverless at the container level. You are responsible for the image (your code, runtime, and OS libraries inside it); Google is responsible for the host, the OS kernel, the autoscaler, request routing, TLS termination, and the regional fleet your instances run on. The unit you deploy is an OCI container image pulled from Artifact Registry (the modern registry; Container Registry is deprecated). You can build that image yourself with a Dockerfile, or let Cloud Run build it from source for you with Cloud Build and Buildpacks (gcloud run deploy --source .).

A service is a stack of immutable revisions, and traffic is a separate decision. When you deploy a service, Cloud Run creates a new revision — an immutable snapshot of the container image plus its entire configuration (env vars, secrets, CPU/memory, concurrency, scaling bounds, the service account). Revisions are never edited; every change produces a new one. Routing traffic to a revision is a separate operation from creating it. This separation is the whole basis of safe deploys: you can create a new revision that receives 0% of traffic, test it, then shift traffic to it gradually (canary) or all at once (blue-green), and roll back instantly by shifting traffic back — no rebuild, no redeploy.

An instance is a running copy of one revision; the autoscaler manages the count. Cloud Run runs your container as instances (sometimes called container instances). The autoscaler watches incoming load and the per-instance concurrency setting, then creates or removes instances to keep up — down to min-instances (zero by default) and up to max-instances. You do not pick instance count; you pick the bounds and the concurrency, and the platform sizes the fleet. This is the opposite of a Managed Instance Group, where you reason about VM counts directly.

Requests are the primary scaling signal; concurrency is requests-per-instance. Unlike a VM that handles whatever you throw at it, a Cloud Run service instance has a concurrency limit: the maximum number of requests it will process simultaneously (default 80, max 1000). The autoscaler does the arithmetic — roughly instances needed ≈ requests-in-flight ÷ concurrency — and adds instances when in-flight requests exceed current capacity. Higher concurrency means fewer instances (cheaper) but more contention inside each container; lower concurrency means more instances (more isolation, higher cost, more cold starts).

Billing is (usually) per-request, metered in 100 ms increments. In the default request-based billing model you pay for the vCPU and memory your instances use only while they are handling requests (plus a small per-request fee and a short startup/shutdown allowance), with billable time rounded up to the nearest 100 milliseconds. When no requests are in flight, an idle instance that is still alive is not billed for CPU (CPU is throttled to near-zero); the exception is an instance kept warm by min-instances, which bills at a lower idle rate. The alternative instance-based billing model bills for the entire lifetime of each instance (CPU always allocated) regardless of requests; choose it when you need background work or always-on CPU. Cloud Run also has a generous monthly free tier.

Statelessness is the contract, not a suggestion. Instances are ephemeral: the autoscaler can create and destroy them at any time, and there is no sticky storage on the instance beyond an in-memory writable filesystem (which counts against your memory limit and vanishes when the instance does). Anything that must persist (sessions, uploads, state) belongs in an external store (Cloud Storage, a database, Memorystore). Design for any request to land on any instance.

Key terms used throughout: service (request-driven app), job (run-to-completion task), revision (immutable config+image snapshot), instance (a running container), concurrency (max simultaneous requests per instance), cold start (the latency of spinning up a fresh instance), and ingress/egress (who can call your service / how your service reaches other resources).

Services vs jobs: the two execution resources

Cloud Run gives you two top-level resources. Choosing the right one is the first decision, and a common interview question.

| Aspect | Cloud Run service | Cloud Run job |
| --- | --- | --- |
| Trigger | An HTTP(S)/gRPC request (or a Pub/Sub push, Eventarc event, scheduled HTTP) | An explicit execution (gcloud run jobs execute, Cloud Scheduler, Workflows, Eventarc) |
| Lifecycle | Long-lived; scales up on traffic, scales to zero when idle | Runs to completion, then exits; no listening server |
| Listens on a port? | Yes: must serve on $PORT | No: runs your command and exits |
| Scaling unit | Instances driven by request concurrency | Tasks: N parallel copies of the same job |
| Retries | Per-request (client/LB retries) | Built-in task retries (--max-retries) and parallelism |
| Billing | Per-request (default) or per-instance | For the duration each task runs |
| Typical use | Web apps, REST/gRPC APIs, webhooks, SSR front-ends, microservices | Batch ETL, data migrations, report generation, scheduled maintenance, fan-out processing |
| Timeout | Per-request, default 5 min, max 60 min | Per-task, default 10 min, max 24 hours |

The mental test: if it answers requests, it is a service; if it does a unit of work and finishes, it is a job. A web API is a service. A nightly “regenerate all thumbnails” task is a job. A job with --tasks=100 --parallelism=20 runs 100 tasks, 20 at a time, with each task reading its index from the CLOUD_RUN_TASK_INDEX env var to pick its slice of work — a clean fan-out pattern without standing up a queue and workers. Create one of each:

# A service (serves HTTP on $PORT, scales to zero)
gcloud run deploy hello-svc --image=us-docker.pkg.dev/cloudrun/container/hello --region=us-central1

# A job (runs to completion; no port)
gcloud run jobs create nightly-etl --image=REGION-docker.pkg.dev/PROJECT/repo/etl:latest \
  --region=us-central1 --tasks=10 --parallelism=5 --max-retries=3 --task-timeout=30m
gcloud run jobs execute nightly-etl --region=us-central1
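
The task-index fan-out described above can be sketched on the application side. This is a minimal illustration, not the lesson's own code: it assumes each task reads CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT from its environment, and the work list and the `my_slice` helper are hypothetical names chosen for the example.

```python
import os


def my_slice(items, env=os.environ):
    """Return the share of `items` that this task should process."""
    # Cloud Run jobs set these per task; the defaults let the script run locally.
    index = int(env.get("CLOUD_RUN_TASK_INDEX", "0"))
    count = int(env.get("CLOUD_RUN_TASK_COUNT", "1"))
    # Striding by the task count gives every item to exactly one task.
    return items[index::count]


# Hypothetical work list; a real job would list objects, rows, or files instead.
work = [f"thumb-{n:03d}.png" for n in range(10)]

# Simulate task 1 of 4 by passing an explicit environment mapping.
print(my_slice(work, {"CLOUD_RUN_TASK_INDEX": "1", "CLOUD_RUN_TASK_COUNT": "4"}))
```

Because the slice depends only on the index and the count, a retried task recomputes the same share, which keeps retries safe as long as the work itself is idempotent.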

The rest of this lesson focuses mainly on services, because that is where concurrency, autoscaling, revisions, and traffic splitting live; the container, networking, env/secrets, and identity material applies to both.

The container contract

Cloud Run will run any container that obeys a small contract. Break it and the deploy fails the health check or the service misbehaves. These are the rules.

| Requirement | What it means | Default / value | Gotcha |
| --- | --- | --- | --- |
| Listen on $PORT (services) | Your server must bind 0.0.0.0:$PORT, the port Cloud Run injects | PORT=8080 by default; override with --port | Binding localhost or a hard-coded port other than $PORT = “container failed to start and listen” |
| HTTP/1, HTTP/2, gRPC, or WebSockets | The protocol your server speaks | HTTP/1 by default; opt in to HTTP/2 end-to-end (--use-http2) | WebSockets/streaming work but are bounded by the request timeout |
| Stateless | Any request may hit any instance; instances come and go | n/a | Never rely on local disk or in-memory state surviving; use external stores |
| Start quickly | Listen on the port within the startup window or fail | Startup probe deadline (configurable) | Heavy init (model loading, warm caches) lengthens cold starts; use a startup probe and/or CPU boost |
| Writable filesystem is in-memory | /tmp (and the container FS) is a tmpfs backed by RAM | Counts against the memory limit | Large temp files can OOM the instance; mount a GCS volume for big I/O |
| Listen, don’t poll, for work | A service does CPU work during a request; outside requests CPU is throttled (request billing) | n/a | Background threads after the response may be paused unless CPU is always-allocated |
| Max request/response size & timeout | Bounded request duration and (for buffered) size | Request timeout default 5 min, max 60 min | Long jobs belong in a Cloud Run job, not a long-held request |

Two contract points deserve emphasis. First, $PORT: read it from the environment rather than hard-coding, e.g. const port = process.env.PORT || 8080. Second, statelessness with an in-memory FS: writing to /tmp is fine for scratch within a request, but it lives in RAM and disappears with the instance — for durable or large files, attach a Cloud Storage FUSE volume or an NFS/Filestore volume (--add-volume / --add-volume-mount), or just call the storage API directly.
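
To make the $PORT rule concrete, here is a minimal sketch of a server that honours it, using only the Python standard library. The handler, the `listen_address` helper, and the 8080 fallback for local runs are illustrative choices, not part of the lesson.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def listen_address(env=os.environ):
    """Return the (host, port) pair a Cloud Run service should bind."""
    # Cloud Run injects PORT; 8080 is only a fallback for running locally.
    # Bind 0.0.0.0, not localhost, so the platform can reach the server.
    return ("0.0.0.0", int(env.get("PORT", "8080")))


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with a small plain-text body."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    """Serve until the instance is stopped; call this from the container entrypoint."""
    server = ThreadingHTTPServer(listen_address(), HelloHandler)
    server.serve_forever()
```

The entrypoint is left as a function so the module can be imported without blocking; a container would invoke `main()` as its start command.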

Concurrency: requests per instance

Concurrency is the single most important cost-and-performance lever on a Cloud Run service: the maximum number of requests one instance will process at the same time.

| Setting | What it controls | Range / default | When to raise | When to lower |
| --- | --- | --- | --- | --- |
| Concurrency (--concurrency) | Max simultaneous requests per instance | 1–1000; default 80 | I/O-bound apps (waiting on DB/HTTP) that can multiplex cheaply | CPU-bound work, or libraries that are not thread-safe / not concurrency-safe |

How to reason about it:

The relationship the autoscaler approximates: instances ≈ ceil(concurrent requests ÷ concurrency). So 800 concurrent requests at concurrency 80 ≈ 10 instances; the same load at concurrency 1 ≈ 800 instances. Tune concurrency by load-testing: raise it until p99 latency or error rate starts to climb, then back off. Gotcha: a too-high concurrency on a CPU-bound app produces slow, contended requests with few instances — the bill looks great but users suffer. A too-low concurrency on an I/O-bound app produces a huge instance count and a surprise invoice.
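
The arithmetic above is easy to check by hand. The sketch below only reproduces that approximation; the real autoscaler weighs other signals too, so treat the result as a sizing estimate. The function name is hypothetical.

```python
def instances_needed(concurrent_requests: int, concurrency: int) -> int:
    """Approximate instance count: ceil(concurrent requests / concurrency)."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (concurrent_requests + concurrency - 1) // concurrency


print(instances_needed(800, 80))  # 10
print(instances_needed(800, 1))   # 800
print(instances_needed(81, 80))   # 2: one request over capacity needs a second instance
```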

Autoscaling: min/max instances, scale-to-zero, cold starts

The autoscaler turns load into an instance count between two bounds you set.

| Setting | What it is | Default | Trade-off / cost | Gotcha |
| --- | --- | --- | --- | --- |
| Minimum instances (--min-instances) | Instances kept warm even at zero traffic | 0 (scale-to-zero) | >0 removes cold starts but you pay to keep them idle (at the idle/throttled rate unless CPU is always-on) | Set ≥1 for latency-sensitive services; costs accrue 24/7 |
| Maximum instances (--max-instances) | Upper bound on instances for the revision | 100 (raisable via quota) | Caps cost and protects downstreams (e.g. a DB) from being overwhelmed | Too low = 429/throttling under spikes; too high = a runaway bill or an overwhelmed database |
| Scale-to-zero | Service drops to 0 instances when idle | On (when min=0) | Cheapest possible (pay nothing when idle) | First request after idle pays a cold start |
| Maximum concurrent requests | (see Concurrency) | 80 | Governs how aggressively the autoscaler adds instances | |

The concepts behind the knobs:

gcloud run services update hello-svc --region=us-central1 \
  --min-instances=1 --max-instances=20 --concurrency=80
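
As a rough way to sanity-check bounds like these before a load test, the sketch below clamps the estimated instance count between the minimum and maximum and reports how many in-flight requests would exceed the max-instances × concurrency ceiling. It is a simplified model under the assumption that requests are the only scaling signal; `plan_capacity` is a hypothetical helper.

```python
def plan_capacity(in_flight: int, concurrency: int, min_instances: int, max_instances: int):
    """Return (instances, ceiling, overflow) for a given in-flight request count."""
    wanted = -(-in_flight // concurrency)  # ceiling division
    # The fleet never drops below the minimum or grows past the maximum.
    instances = max(min_instances, min(wanted, max_instances))
    ceiling = max_instances * concurrency   # most requests the service can hold at once
    overflow = max(0, in_flight - ceiling)  # requests beyond the ceiling
    return instances, ceiling, overflow


# Using the bounds from the command above: min 1, max 20, concurrency 80.
print(plan_capacity(0, 80, 1, 20))     # (1, 1600, 0): one warm instance at zero traffic
print(plan_capacity(800, 80, 1, 20))   # (10, 1600, 0)
print(plan_capacity(2000, 80, 1, 20))  # (20, 1600, 400): 400 requests over the ceiling
```

A non-zero overflow is the situation in which callers start to see throttling, so it is the number to watch when choosing --max-instances.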

CPU allocation: always-on vs throttled, and CPU boost

How Cloud Run allocates CPU changes both behaviour and billing. There are two independent choices: when CPU is available and whether to boost CPU at startup.

| Mode | What you get | Billing model | When to use | Gotcha |
| --- | --- | --- | --- | --- |
| CPU throttled / “allocated during requests” (default for services) | CPU is full speed while a request is being handled, then throttled to near-zero between requests | Request-based: pay only during requests (to 100 ms) | Standard request/response web apps and APIs; cheapest for spiky traffic | Background work after you send the response may be paused; timers/async jobs won’t run reliably |
| CPU always allocated (--no-cpu-throttling) | CPU is available for the entire instance lifetime, even with no requests | Instance-based: pay for the whole instance lifetime | Background processing, streaming, async work after response, in-memory caches/warm pools, services needing min-instances doing work | Costs more (billed even when idle); pairs naturally with min-instances |
| Startup CPU boost (--cpu-boost) | Temporarily gives the instance more CPU during startup to cut cold-start time | Adds to startup cost slightly; large net win on latency | Almost always for latency-sensitive services and JIT/JVM/Node apps with heavy init | Only helps the startup window; doesn’t change steady-state CPU |

The two settings answer different questions. CPU throttled vs always-allocated answers “do I need CPU between requests?” — the default (throttled, request-billed) is correct for the overwhelming majority of stateless request/response services and is the cheapest. Switch to always-allocated (--no-cpu-throttling, which moves you to instance-based billing) when you must run work outside the request lifecycle: a background goroutine, a streaming response that keeps computing, periodic in-process tasks, or a warm in-memory cache that must stay warm. Startup CPU boost answers “are my cold starts slow because startup is CPU-starved?” — it gives extra CPU only during container start, which meaningfully shortens cold starts for JVM/Node/Python apps that do heavy initialisation. The two combine freely.

You also size the instance directly:

gcloud run services update hello-svc --region=us-central1 \
  --cpu=1 --memory=512Mi --cpu-boost            # request-billed, boosted cold starts
# vs an always-on background worker:
gcloud run services update worker-svc --region=us-central1 \
  --cpu=1 --memory=512Mi --no-cpu-throttling --min-instances=1

CPU can be set in fractions or whole vCPUs (e.g. 0.5, 1, 2, up to 8) and memory from 128 MiB up to 32 GiB, within valid CPU/memory combinations: more CPU requires more minimum memory, and fractional CPU below 1 vCPU requires concurrency to be set to 1. Choose the smallest shape that meets your latency and concurrency targets.

Revisions and traffic splitting: blue-green and canary

This is the operational heart of Cloud Run and a guaranteed interview topic.

Every deploy creates a new immutable revision. A revision bundles the image and the full config; it never changes. By default, deploying a new revision routes 100% of traffic to it immediately (a straight cut-over). But because routing is separate from deploying, you can decouple them and roll out safely.

| Operation | Command | Effect |
| --- | --- | --- |
| Deploy and take all traffic (default) | gcloud run deploy SVC --image=... | New revision gets 100% |
| Deploy with no traffic + a tag | gcloud run deploy SVC --image=... --no-traffic --tag=candidate | New revision at 0%, testable at its candidate---… URL |
| Canary: 10% to the new tagged revision | gcloud run services update-traffic SVC --to-tags=candidate=10 | 10% canary, 90% stays on current |
| Ramp to 50/50 | gcloud run services update-traffic SVC --to-revisions=NEW=50,OLD=50 | Even split |
| Promote to 100% (blue-green flip) | gcloud run services update-traffic SVC --to-latest (or --to-revisions=NEW=100) | New revision serves all traffic |
| Instant rollback | gcloud run services update-traffic SVC --to-revisions=OLD=100 | Old revision serves all traffic immediately |

A typical canary flow:

# 1. Build a new revision but take no traffic; give it a tag for testing.
gcloud run deploy hello-svc --image=REGION-docker.pkg.dev/PROJECT/repo/app:v2 \
  --region=us-central1 --no-traffic --tag=v2

# 2. Smoke-test the candidate directly (the tagged URL), then send it 10% of live traffic.
gcloud run services update-traffic hello-svc --region=us-central1 --to-tags=v2=10

# 3. Watch errors/latency, then ramp.
gcloud run services update-traffic hello-svc --region=us-central1 --to-tags=v2=50
gcloud run services update-traffic hello-svc --region=us-central1 --to-latest   # 100%

# 4. If anything looks wrong at any step, roll back instantly.
gcloud run services update-traffic hello-svc --region=us-central1 --to-revisions=hello-svc-00001-abc=100

Gotchas: by default deploys cut over to 100% — pass --no-traffic if you want a controlled rollout. Old revisions linger (cost nothing while at 0% with min-instances 0) and are great for rollback; clean up truly dead ones periodically. Traffic percentages are integers and must sum to 100. Tags also let you wire per-revision URLs into tests without touching the live split.
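
The integer and sum-to-100 rules are simple to enforce in a deploy script before calling gcloud. The sketch below validates a full split and renders it as a --to-revisions argument; the helper name and the revision names in the example are hypothetical placeholders.

```python
from typing import Dict


def to_revisions_flag(split: Dict[str, int]) -> str:
    """Validate a full traffic split and render it as a --to-revisions argument."""
    for name, pct in split.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(pct, bool) or not isinstance(pct, int) or not 0 <= pct <= 100:
            raise ValueError(f"{name}: percentage must be an integer from 0 to 100")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"percentages sum to {total}, expected 100")
    return "--to-revisions=" + ",".join(f"{name}={pct}" for name, pct in split.items())


# Hypothetical revision names, for illustration only.
print(to_revisions_flag({"hello-svc-00002-xyz": 10, "hello-svc-00001-abc": 90}))
# --to-revisions=hello-svc-00002-xyz=10,hello-svc-00001-abc=90
```

Failing fast in the script is cheaper than discovering a malformed split halfway through a rollout.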

Networking: ingress, egress, and load balancing

Cloud Run’s defaults expose a public HTTPS URL with a Google-managed TLS certificate. Production needs tighter control over who can reach the service (ingress) and how the service reaches private resources (egress).

Ingress — who can call the service

| Ingress setting | Who can reach the service | When to use | Gotcha |
| --- | --- | --- | --- |
| all (default) | The public internet (the *.run.app URL) | Public APIs and sites | Still gated by IAM authentication unless you allow unauthenticated invocations |
| internal | Only traffic from your VPC(s), VPC-SC perimeter, and other internal Google traffic (e.g. Pub/Sub, Eventarc, Workflows) | Internal microservices not meant for the internet | The *.run.app URL stops working from outside; callers must be on the VPC |
| internal-and-cloud-load-balancing | Internal traffic plus an external HTTPS load balancer in front | Public service that must sit behind Cloud Armor / a custom domain / CDN | You must build the LB + serverless NEG; direct *.run.app is restricted |

Separately, authentication controls identity: by default Cloud Run requires the caller to present a valid IAM token (roles/run.invoker). Allowing unauthenticated access (--allow-unauthenticated, granting allUsers the invoker role) makes the service publicly callable — appropriate for a public website, not for an internal API. Ingress and auth are independent: a service can be ingress=all but still require authentication.

Egress — how the service reaches private resources

By default a Cloud Run instance reaches the public internet directly but cannot reach private VPC resources (a Cloud SQL private IP, a Memorystore instance, an internal API, an on-prem host over VPN). Two mechanisms connect it to your VPC.

| Mechanism | What it is | Egress modes | When to use | Gotcha |
| --- | --- | --- | --- | --- |
| Serverless VPC Access connector | A managed set of e2-micro-class VMs that bridge serverless → VPC; you provision a connector with a /28 range | “Private ranges only” (default) or “all traffic” | The established option; works everywhere; needed for shared-VPC patterns in some setups | You pay for and scale the connector instances; an extra hop; the /28 must not overlap |
| Direct VPC egress | The instance gets an IP directly on a VPC subnet, with no connector VMs | Route private ranges, or all egress, through the VPC | The modern, lower-latency, lower-cost default; higher throughput, scales with the service | Needs spare subnet IP space (sized to max instances); newer, check region/feature support |
| Static outbound IP | Pin egress to a fixed IP via Cloud NAT (with connector or Direct VPC egress) | n/a | When an upstream must allowlist your source IP | Requires routing egress through the VPC + Cloud NAT with a reserved IP |

Choose Direct VPC egress for new services (cheaper, faster, no connector to manage); use a connector where Direct VPC egress is not yet an option or an existing architecture standardises on it. Set --vpc-egress=private-ranges-only to send only RFC 1918 traffic through the VPC (internet still goes direct, cheaper) or all-traffic to force everything through the VPC (so a Cloud NAT can give you a fixed, allowlistable outbound IP and all egress is subject to your firewall/inspection).

# Direct VPC egress (modern): put instances on a subnet, route private ranges through the VPC
gcloud run services update hello-svc --region=us-central1 \
  --network=my-vpc --subnet=run-subnet --vpc-egress=private-ranges-only

# Or via a Serverless VPC Access connector (classic)
gcloud run services update hello-svc --region=us-central1 \
  --vpc-connector=my-connector --vpc-egress=private-ranges-only
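
To see what the two --vpc-egress values mean for a given destination, the sketch below classifies an IP address the way the text describes: under private-ranges-only, RFC 1918 destinations go through the VPC and everything else goes direct; under all-traffic, everything goes through the VPC. This is a deliberate simplification that follows the lesson's RFC 1918 description (the platform's actual private-range list may be broader), and `egress_path` is a hypothetical helper.

```python
import ipaddress

# The three RFC 1918 private blocks named in the text.
RFC1918 = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def egress_path(dest_ip: str, vpc_egress: str) -> str:
    """Return "vpc" or "direct" for a destination under a given --vpc-egress mode."""
    if vpc_egress == "all-traffic":
        return "vpc"
    if vpc_egress != "private-ranges-only":
        raise ValueError(f"unknown egress mode: {vpc_egress}")
    addr = ipaddress.ip_address(dest_ip)
    return "vpc" if any(addr in net for net in RFC1918) else "direct"


print(egress_path("10.1.2.3", "private-ranges-only"))  # vpc
print(egress_path("8.8.8.8", "private-ranges-only"))   # direct
print(egress_path("8.8.8.8", "all-traffic"))           # vpc
```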

Fronting Cloud Run with a load balancer

To put Cloud Run behind a global external Application Load Balancer — for a custom domain, Cloud CDN, Cloud Armor WAF/DDoS, or to blend Cloud Run with other backends — you create a serverless network endpoint group (NEG) that points at the service and wire it into the LB’s backend service. Set the service’s ingress to internal-and-cloud-load-balancing so it only accepts traffic via the LB (and internal sources), closing the public *.run.app URL. This is the standard production front door; the building blocks (forwarding rule → target proxy → URL map → backend service → serverless NEG) are covered in the load balancing module.

Environment variables, secrets, and service identity

A service needs configuration, secrets, and an identity to act as.

Environment variables

Plain configuration is injected as environment variables. They are part of the revision’s configuration, so changing them creates a new revision.

gcloud run services update hello-svc --region=us-central1 \
  --set-env-vars=LOG_LEVEL=info,FEATURE_X=true        # replace the whole set
gcloud run services update hello-svc --region=us-central1 \
  --update-env-vars=LOG_LEVEL=debug                    # add/change one, keep the rest
gcloud run services update hello-svc --region=us-central1 \
  --remove-env-vars=FEATURE_X                          # remove one

Gotcha: --set-env-vars replaces all env vars; use --update-env-vars/--remove-env-vars for incremental changes. Reserved names (like PORT) are managed by Cloud Run and cannot be set. Never put secrets in plain env vars — they are visible in the revision config to anyone with read access.

Secrets (from Secret Manager)

Sensitive values belong in Secret Manager and are referenced by the service, never baked into the image or plain env. Two delivery shapes:

| Delivery | How it appears | When to use | Gotcha |
| --- | --- | --- | --- |
| As an environment variable | The secret’s value is the env var’s value at instance start | Simple secrets (API keys, connection strings) | Read once at startup; rotating the secret needs a new revision unless you reference :latest and let new instances pick it up |
| Mounted as a file (volume) | The secret appears as a file at a mount path | Certificates, multi-line secrets, apps that read files; supports live-ish updates | Path-based access; the service account needs roles/secretmanager.secretAccessor |

# Inject a secret as an env var (:latest here; pin a version number such as :3 for reproducibility)
gcloud run services update hello-svc --region=us-central1 \
  --set-secrets=DB_PASSWORD=db-password:latest

# Or mount a pinned secret version as a file
# (--update-secrets keeps the secrets already set; a second --set-secrets would replace them)
gcloud run services update hello-svc --region=us-central1 \
  --update-secrets=/etc/secrets/tls.key=tls-key:3

Pin a specific version for reproducibility and controlled rotation, or :latest to always get the newest (new instances pick it up). The runtime service account must have roles/secretmanager.secretAccessor on the secret, or the instance fails to start.
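
On the application side, the two delivery shapes look like an environment lookup and a file read. The sketch below is one way to support both, assuming the env var name and mount path used in the commands above; `read_secret` is a hypothetical helper, and trimming the trailing newline is a convenience choice rather than a platform behaviour.

```python
import os
from pathlib import Path


def read_secret(env_name: str, file_path: str, env=os.environ) -> str:
    """Return a secret from an env var if set, otherwise from a mounted file."""
    value = env.get(env_name)
    if value is not None:
        return value  # env delivery: fixed for the lifetime of the instance
    path = Path(file_path)
    if path.is_file():
        # File delivery: re-reading on each call lets the app see an updated mount.
        return path.read_text().rstrip("\n")
    raise RuntimeError(f"secret not found in ${env_name} or at {file_path}")
```

A service would call something like `read_secret("DB_PASSWORD", "/etc/secrets/db-password")` at the point of use rather than caching the result at import time, so that a file-mounted secret is not frozen at startup.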

Service identity (the runtime service account)

Every Cloud Run service and job runs as a service account — its identity for calling other Google Cloud APIs (Cloud Storage, Pub/Sub, BigQuery, Secret Manager, a database via IAM auth). By default it uses the Compute Engine default service account, which is broad; best practice is a dedicated, least-privilege service account per service.

gcloud iam service-accounts create hello-svc-sa --display-name="hello-svc runtime"
gcloud run services update hello-svc --region=us-central1 \
  --service-account=hello-svc-sa@PROJECT.iam.gserviceaccount.com
# then grant only what it needs, e.g.:
gcloud secrets add-iam-policy-binding db-password \
  --member=serviceAccount:hello-svc-sa@PROJECT.iam.gserviceaccount.com \
  --role=roles/secretmanager.secretAccessor

Authentication uses the metadata server (Application Default Credentials) — the SDKs pick up the service account’s token automatically, with no key files. Two identities are in play and often confused: the runtime identity (what the service acts as, above) and the invoker identity (who is allowed to call the service, via roles/run.invoker). For service-to-service calls, give the caller’s service account roles/run.invoker on the callee and have the caller send an ID token.
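
For the service-to-service case, the caller can obtain an ID token from the metadata server and send it as a bearer token. The sketch below uses only the standard library and works only when running on Google Cloud, where the metadata server is reachable; the callee URL passed in is whatever service you are calling, and both function names are hypothetical.

```python
import urllib.parse
import urllib.request

# Metadata server endpoint that mints an ID token for the runtime service account.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def identity_token_request(audience: str) -> urllib.request.Request:
    """Build the metadata-server request for an ID token scoped to `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )


def call_with_id_token(service_url: str, path: str = "/"):
    """Fetch an ID token for the callee, then call it with a bearer header."""
    with urllib.request.urlopen(identity_token_request(service_url), timeout=5) as resp:
        token = resp.read().decode()
    request = urllib.request.Request(
        service_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.status, resp.read()
```

The audience is the callee's URL, and the call succeeds only if the caller's service account holds roles/run.invoker on the callee. In practice the Google auth client libraries wrap this same exchange.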

Cloud Run vs Cloud Functions vs GKE

The classic “which compute” decision. All three run your code; they differ in what you bring and what you manage.

| Dimension | Cloud Run | Cloud Functions (2nd gen) | GKE (Autopilot/Standard) |
| --- | --- | --- | --- |
| You bring | A container image (any language/runtime) | A single function (source); Google builds the container | Containers and the Kubernetes cluster/workloads |
| Unit of deploy | Service or job (a container) | A function bound to a trigger | Pods/Deployments/Jobs on nodes |
| Scaling | Request-driven, scale-to-zero | Event/request-driven, scale-to-zero | Pod + node autoscaling; scale-to-zero needs add-ons |
| State / longevity | Stateless, ephemeral instances | Stateless, short-lived | Stateful sets, DaemonSets, anything Kubernetes supports |
| Networking control | Ingress modes, VPC egress, LB + NEG | Similar (2nd gen runs on Cloud Run infra) | Full VPC-native networking, network policy, service mesh |
| Operational burden | Minimal (no nodes) | Minimal (no container build, no nodes) | Highest (you own cluster ops; even on Autopilot you own workloads) |
| Best for | Web apps, APIs, microservices, batch jobs, event consumers | Small event-driven glue, single-purpose triggers, lightweight webhooks | Complex/stateful systems, multi-service platforms, custom controllers, anything needing full Kubernetes |
| Cost shape | Per-request (or per-instance) | Per-invocation + compute | Per-node (you pay for the fleet) |

The heuristics:

A useful one-liner for interviews: Cloud Functions = bring a function; Cloud Run = bring a container; GKE = bring a cluster. Move up the ladder only when the layer below cannot express what you need.

[Figure: Google Cloud Run: services, jobs, scaling, traffic]

The diagram above shows the full Cloud Run model — a service built from immutable revisions with a traffic split routing percentages between them (the basis of blue-green and canary), the autoscaler sizing instances between min and max based on request concurrency (with scale-to-zero and cold starts), the CPU allocation choice (throttled/request-billed vs always-on/instance-billed) and startup CPU boost, the ingress controls and VPC egress path (connector or Direct VPC egress) into private resources, the service account identity with secrets from Secret Manager, and a separate job running parallel tasks to completion.

Hands-on lab

We will deploy a Cloud Run service, exercise concurrency, scaling, revisions, and a canary traffic split, then create and run a job, and finally clean up. The Cloud Run free tier (monthly free vCPU-seconds, memory-seconds, and requests) plus the $300 free-trial credit covers this comfortably; with scale-to-zero, the service costs nothing once idle.

1. Set your project, region, and enable the APIs.

gcloud config set project YOUR_PROJECT_ID
gcloud config set run/region us-central1
gcloud services enable run.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com

2. Deploy the sample service (Google’s public hello image; no build needed). Allow unauthenticated access so you can curl it.

gcloud run deploy hello-svc \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --allow-unauthenticated \
  --concurrency=80 --cpu=1 --memory=512Mi \
  --min-instances=0 --max-instances=10 --cpu-boost

Expected output: a deploy summary ending with a Service URL like https://hello-svc-xxxxxxxx-uc.a.run.app.

3. Validate. Hit the URL and confirm a 200:

URL=$(gcloud run services describe hello-svc --format='value(status.url)')
curl -s -o /dev/null -w "%{http_code}\n" "$URL"     # expect: 200

Inspect the live configuration (the concurrency setting and the latest ready revision):

gcloud run services describe hello-svc \
  --format="value(spec.template.spec.containerConcurrency, status.latestReadyRevisionName)"

4. Create a second revision and run a canary. Re-deploy with a tag and no traffic, then send the new revision 20% of traffic.

# New revision, 0% traffic, tagged for direct testing
gcloud run deploy hello-svc \
  --image=us-docker.pkg.dev/cloudrun/container/hello \
  --update-env-vars=RELEASE=v2 --no-traffic --tag=v2

# Test the candidate directly via its tagged URL
TAG_URL=$(gcloud run services describe hello-svc --format='value(status.traffic)' | grep -o "https://v2---[^'\",;} ]*" | head -1)
curl -s -o /dev/null -w "%{http_code}\n" "$TAG_URL"   # expect: 200

# Canary: 20% to v2, 80% to the previous revision
gcloud run services update-traffic hello-svc --to-tags=v2=20
gcloud run services describe hello-svc --format="value(status.traffic)"

5. Promote, then roll back — practise the blue-green flip and instant rollback.

gcloud run services update-traffic hello-svc --to-latest        # 100% to newest (promote)
# ...if something were wrong, roll back to a named earlier revision:
gcloud run revisions list --service=hello-svc --format="value(metadata.name)"
gcloud run services update-traffic hello-svc --to-revisions=PASTE_OLD_REVISION=100

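6. Create and run a job (run-to-completion, parallel tasks). This step uses Google’s public sample job image, which does its work and exits, rather than the HTTP hello image.

gcloud run jobs create hello-job \
  --image=us-docker.pkg.dev/cloudrun/container/job:latest \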
  --tasks=4 --parallelism=2 --max-retries=2 --task-timeout=5m
gcloud run jobs execute hello-job --wait
gcloud run jobs executions list --job=hello-job --format="value(metadata.name, status.succeededCount)"

Expected: the execution completes with all tasks succeeded.
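The hello image does the same thing in every task, but in a real job each task usually takes its own slice of the work. Cloud Run sets CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT inside each task; the sketch below shows one way an entrypoint script in your own job image might use them. The function name and the round-robin split are illustrative assumptions.

```shell
# items_for_this_task ITEM [ITEM...]
# Hypothetical helper: prints only the items assigned to this task,
# splitting the list round-robin across all tasks in the execution.
items_for_this_task() {
  index=${CLOUD_RUN_TASK_INDEX:-0}   # 0-based task index, set by Cloud Run
  count=${CLOUD_RUN_TASK_COUNT:-1}   # total tasks in the execution
  n=0
  for item in "$@"; do
    if [ $((n % count)) -eq "$index" ]; then
      echo "$item"
    fi
    n=$((n + 1))
  done
}

# Example inside a task: items_for_this_task file-a file-b file-c file-d
```

The defaults (index 0, count 1) mean the same script processes everything when run locally outside Cloud Run.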

7. Cleanup. Delete the service and job to stop all charges (scale-to-zero means the service was free while idle, but remove it to be tidy).

gcloud run services delete hello-svc --quiet
gcloud run jobs delete hello-job --quiet

Cost note. Cloud Run’s free tier grants a large monthly allowance of vCPU-seconds, memory-seconds, and 2 million requests; this lab stays well inside it. With min-instances=0 the service bills nothing while idle. The two cost traps to remember: setting min-instances > 0 (you pay to keep instances warm 24/7) and --no-cpu-throttling (instance-based billing charges for the whole instance lifetime, not just requests) — both are the right call sometimes, but neither is free. Deleting the resources above returns you to zero.
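To see why min-instances > 0 is a cost trap, it helps to count the billable time a warm instance accumulates. This is a back-of-the-envelope sketch: the function name is made up, it handles whole vCPUs only, and it deliberately outputs vCPU-seconds rather than money so you can compare the figure against the current free-tier allowance and rates on the pricing page.

```shell
# warm_vcpu_seconds MIN_INSTANCES VCPU_PER_INSTANCE DAYS
# Hypothetical helper: vCPU-seconds accrued by instances kept warm 24/7.
warm_vcpu_seconds() {
  echo $(( $1 * $2 * $3 * 86400 ))   # 86400 seconds per day
}

warm_vcpu_seconds 1 1 30   # prints 2592000
```

One warm 1-vCPU instance over a 30-day month is about 2.6 million vCPU-seconds, which is the number to hold up against the free allowance before leaving min-instances at 1.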

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Deploy fails: “container failed to start and listen on the port” | App binds a hard-coded port or localhost, not 0.0.0.0:$PORT | Read $PORT from the env and bind 0.0.0.0; or set --port to match |
| First request after idle is very slow | Cold start with scale-to-zero | Set --min-instances=1, enable --cpu-boost, shrink the image, defer heavy init |
| 429 Too Many Requests under load | Hit max-instances × concurrency ceiling | Raise --max-instances (and quota) and/or --concurrency; check downstream limits |
| Background/async work never finishes after the response | Default CPU throttling pauses CPU between requests | Use --no-cpu-throttling (instance billing) or move the work to a Cloud Run job |
| New deploy broke prod immediately | Default deploy cut 100% to the new revision | Deploy with --no-traffic --tag=..., canary, then promote; roll back with update-traffic |
| Service can’t reach Cloud SQL private IP / internal API | No VPC egress configured | Add Direct VPC egress (--network/--subnet) or a VPC connector; set --vpc-egress |
| Instance fails to start citing a secret | Runtime SA lacks roles/secretmanager.secretAccessor | Grant the accessor role on the secret to the service’s service account |
| Surprise bill on an idle service | min-instances > 0 and/or --no-cpu-throttling | Drop min-instances to 0 for spiky traffic; use request-based (throttled) CPU unless you need always-on |
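The 429 row comes down to simple arithmetic: a service can hold at most max-instances × concurrency requests in flight. The sketch below checks a peak load against that ceiling; ceiling_check is a hypothetical helper, and the peak in-flight number is something you would read from your own metrics.

```shell
# ceiling_check MAX_INSTANCES CONCURRENCY PEAK_IN_FLIGHT
# Hypothetical helper: compares a peak in-flight request count with the
# max-instances x concurrency ceiling.
ceiling_check() {
  ceiling=$(( $1 * $2 ))
  if [ "$3" -gt "$ceiling" ]; then
    echo "over: peak $3 exceeds ceiling $ceiling"
  else
    echo "ok: peak $3 fits within ceiling $ceiling"
  fi
}

ceiling_check 5 80 300   # prints: ok: peak 300 fits within ceiling 400
ceiling_check 5 80 500   # prints: over: peak 500 exceeds ceiling 400
```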

Best practices

Security notes

Interview & exam questions

  1. What is the difference between a Cloud Run service and a job? A service is request-driven, listens on $PORT, scales with traffic, and can scale to zero — for web apps and APIs. A job runs a command to completion (no port), supports parallel tasks and retries, and is triggered explicitly or on a schedule — for batch work. If it answers requests it’s a service; if it does work and exits it’s a job.
  2. Explain the container contract. The container must listen on 0.0.0.0:$PORT, be stateless (any instance may handle any request; no durable local disk — only an in-memory FS), and start quickly (pass the startup probe). It speaks HTTP/1, HTTP/2, gRPC, or WebSockets.
  3. How does concurrency affect cost and performance? Concurrency is the max simultaneous requests per instance (default 80, max 1000). Higher concurrency packs more work per instance → fewer instances → lower cost, good for I/O-bound apps; lower concurrency (down to 1) isolates requests → more instances → higher cost, needed for CPU-bound or non-thread-safe code.
  4. What is scale-to-zero and what is the trade-off? With min-instances=0 an idle service runs zero instances and costs nothing for compute; the trade-off is a cold start on the first request after idle. Mitigate with min-instances≥1, CPU boost, and a small image.
  5. What is a cold start and how do you reduce it? The latency to schedule, pull, start, and health-check a fresh instance. Reduce it by shrinking the image, deferring heavy initialisation, enabling startup CPU boost, and keeping min-instances≥1 warm.
  6. CPU always-allocated vs throttled — when each? Throttled (default, request-based billing) gives CPU only during requests — cheapest, correct for standard request/response apps. Always-allocated (--no-cpu-throttling, instance-based billing) keeps CPU on for the instance’s whole life — needed for background work, streaming, or warm caches; costs more.
  7. What does startup CPU boost do? It grants extra CPU only during container startup to shorten cold starts (great for JVM/Node/Python heavy-init apps); it does not change steady-state CPU.
  8. How do revisions and traffic splitting enable safe deploys? Each deploy creates an immutable revision, and routing is a separate operation. You can deploy at 0% traffic (--no-traffic), test via a tagged URL, then canary by percentage and promote (blue-green), or roll back instantly by shifting traffic to a prior revision — no rebuild.
  9. How do you do a canary with Cloud Run? Deploy the new revision with --no-traffic --tag=v2, smoke-test it, then update-traffic --to-tags=v2=10, watch metrics, and ramp 25→50→100, rolling back with update-traffic if needed.
  10. Connector vs Direct VPC egress? A Serverless VPC Access connector bridges serverless to your VPC via managed VMs (you pay for/scale the connector). Direct VPC egress puts the instance directly on a subnet — lower latency and cost, higher throughput, no connector to manage — the modern default; it needs spare subnet IPs.
  11. How do you make a Cloud Run service private? Set ingress to internal (or internal-and-cloud-load-balancing) so only VPC/internal traffic reaches it, and require authentication (don’t grant allUsers invoker). For a public service behind a WAF, use internal-and-cloud-load-balancing + a serverless NEG + Cloud Armor.
  12. Cloud Run vs Cloud Functions vs GKE? Cloud Functions = bring a single function (Google builds the container); Cloud Run = bring any container (scale-to-zero, request billing, no nodes); GKE = bring the whole Kubernetes cluster (stateful, custom controllers, full networking — most operational burden). Move up only when the layer below can’t express your needs.
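The concurrency answer above is easier to remember with numbers. As a rough sketch (steady load, Little's law, ignoring the autoscaler's own utilisation targets), the instances needed are about requests per second × latency ÷ concurrency, rounded up. The function name is hypothetical.

```shell
# instances_needed RPS LATENCY_MS CONCURRENCY
# Hypothetical estimate: ceil((RPS * LATENCY_MS / 1000) / CONCURRENCY),
# done in integer arithmetic by keeping both sides scaled by 1000.
instances_needed() {
  in_flight_x1000=$(( $1 * $2 ))       # requests/s * ms = in-flight * 1000
  per_instance_x1000=$(( $3 * 1000 ))  # concurrency on the same scale
  echo $(( (in_flight_x1000 + per_instance_x1000 - 1) / per_instance_x1000 ))
}

instances_needed 400 200 80   # prints 1
instances_needed 400 200 1    # prints 80
```

The same 400 requests per second at 200 ms needs one instance at the default concurrency of 80 and eighty instances at concurrency 1, which is the cost difference the answer describes.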

Quick check

  1. Which Cloud Run resource runs to completion with parallel tasks rather than listening on a port?
  2. What is the default concurrency for a service, and what is the maximum?
  3. Which setting keeps instances warm to avoid cold starts, and what does it cost?
  4. Which flag makes a new deploy take 0% of traffic so you can test it first?
  5. What is the modern, connector-free way to give a Cloud Run service access to private VPC resources?

Answers

  1. A job (executed via gcloud run jobs execute), which runs N tasks with configurable parallelism and retries and then exits.
  2. Default concurrency is 80; the maximum is 1000 (set to 1 for CPU-bound or non-thread-safe work).
  3. min-instances (≥1) keeps instances warm; you pay for them 24/7 (at the idle/throttled CPU rate, or the full instance rate with --no-cpu-throttling).
  4. --no-traffic (usually with --tag=NAME), so the new revision is created but receives no live traffic until you shift it.
  5. Direct VPC egress (--network/--subnet with --vpc-egress), which places the instance directly on a subnet without a Serverless VPC Access connector.

Exercise

Ship a small service safely and add a batch job. Using gcloud: (a) create a dedicated runtime service account with only roles/secretmanager.secretAccessor; (b) put a value in Secret Manager and deploy a Cloud Run service that mounts it (or injects it as env from a pinned version), with --concurrency=40, --cpu=1 --memory=512Mi, --min-instances=0 --max-instances=5, --cpu-boost, requiring authentication (no --allow-unauthenticated), and the dedicated SA attached; (c) deploy a second revision with --no-traffic --tag=v2, smoke-test the tagged URL with an ID token, then canary 10% to it and promote to 100%; (d) add Direct VPC egress so the service can reach a private range; (e) create a job with --tasks=6 --parallelism=3 --max-retries=2 and execute it; then (f) delete the service, the job, the secret, and the service account. In a sentence, explain why you required authentication and used a dedicated service account rather than the default.
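One step of the exercise that often trips people up is calling an authenticated service from the command line. A minimal sketch, assuming your own gcloud identity has been granted the invoker role on the service: authed_status is a hypothetical helper that attaches an ID token from gcloud auth print-identity-token as a bearer header.

```shell
# authed_status URL
# Hypothetical helper: prints the HTTP status code returned when calling URL
# with the caller's ID token as a bearer token.
authed_status() {
  token=$(gcloud auth print-identity-token) || return 1
  curl -s -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer ${token}" "$1"
}

# Example: authed_status "$TAG_URL"
```

A 403 here usually means the caller lacks the invoker role rather than that the service is broken, so check IAM before redeploying.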

Certification mapping

Glossary

Next steps

You can now drive a Cloud Run service and job end to end — the container contract, concurrency and autoscaling, CPU allocation, revisions and traffic splitting, networking, secrets, and identity. For the deep production networking angle — private ingress patterns, Direct VPC egress at scale, internal load balancers, IAP, and Private Service Connect — read Cloud Run in Production: Services, Jobs, VPC Egress, and Concurrency Tuning. Then continue into the storage layer your stateless services depend on with Google Cloud Storage, In Depth: Buckets, Storage Classes, Lifecycle, Versioning & Encryption, and when a single container is no longer enough, step up to the Google Kubernetes Engine deep dive.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
