GCP Lesson 59 of 98

Google Cloud Dataflow, In Depth: Apache Beam, Streaming vs Batch, Windowing & Autoscaling

Google Cloud Dataflow is Google’s fully managed, serverless service for running large-scale data processing pipelines — both batch (finite, bounded data, run once over a fixed dataset) and streaming (unbounded, never-ending data, processed continuously as it arrives). You write the pipeline once, in Apache Beam, and Dataflow provisions the workers, distributes the work, autoscales the fleet up and down with the load, rebalances stragglers, handles failures and retries, and tears everything down when a batch job finishes. There is no cluster to size, no scheduler to tune, no shuffle service to operate — Google runs all of that. You describe what transformation you want; Dataflow figures out how to execute it across a dynamically sized set of machines.

The thing that makes Dataflow distinctive — and the thing interviewers and the Professional Data Engineer exam probe hardest — is the unified model. The same Beam pipeline that processes a year of historical files in batch can process a live Pub/Sub stream with only a change of source and a windowing strategy. The hard parts of streaming — when is the data for a window complete? what do I do about events that arrive late? am I measuring time by when the event happened or by when my pipeline saw it? — are handled by three concepts that this lesson treats in full: windowing, watermarks, and triggers, all anchored on the distinction between event time and processing time. Get those three right and streaming stops being mysterious.

This lesson is deliberately exhaustive. We start from the Beam programming model — Pipeline, PCollection, PTransform, ParDo, GroupByKey, and the bounded/unbounded distinction — then cover batch versus streaming, the full windowing taxonomy (fixed, sliding, session, global), watermarks and the late-data problem, the complete trigger family (event-time, processing-time, data-driven, composite, and accumulation modes), the main sources and sinks (Pub/Sub, BigQuery, Cloud Storage, and more), and then the execution side on Dataflow: autoscaling, Streaming Engine, Dataflow Shuffle, Dataflow Prime, worker machine types and the every-job-option pipeline-options matrix, classic versus Flex templates and the Google-provided catalogue for no-code launches, the drain versus cancel decision and in-place update, and finally the architecture choice interviewers love — Dataflow versus Dataproc versus Cloud Data Fusion. Every option gets the same treatment — what it is · the choices · the default · when to pick which · the trade-off · the limit · the cost impact · the gotcha — and every core operation comes with a real gcloud command. Everything below reflects the current (2026) Dataflow surface, including the Runner v2 default and Dataflow Prime.

Learning objectives

After working through this lesson you will be able to:

Prerequisites & where this fits

You need a Google Cloud project with billing enabled (the $300 new-customer credit comfortably covers everything here; Dataflow itself has no perpetual free tier, but the lab below runs for minutes and costs cents), the Dataflow, Compute Engine, Cloud Storage, Pub/Sub, and BigQuery APIs enabled, the gcloud CLI installed and initialised, and either Python 3.9+ or a JDK if you want to author pipelines locally. An IAM principal that can run Dataflow needs roles/dataflow.developer (to create/cancel jobs) plus the ability to act as the worker service account; the workers themselves run as a service account that needs roles/dataflow.worker and read/write access to whatever sources and sinks the pipeline touches. A working mental model of Pub/Sub (the canonical streaming source) and BigQuery (the canonical sink) makes everything here click, because the textbook Dataflow job is Pub/Sub → Dataflow → BigQuery.

In the Google Cloud Zero-to-Hero course this is the Data module, the processing layer that sits between ingestion (Pub/Sub) and the warehouse (BigQuery). It follows the Cloud IAM deep dive (you need the worker-identity model) and precedes the Identity-Aware Proxy deep dive.

Core concepts

Dataflow is the runner; Apache Beam is the programming model and SDK you write against. Beam is portable — the same pipeline can run on Dataflow, Apache Flink, or Apache Spark — but Dataflow is its first-class, fully managed home on Google Cloud. So when you “write a Dataflow job”, you actually write a Beam pipeline and submit it with the Dataflow runner. Six terms carry the whole model.

A Pipeline is the entire data-processing job: a directed acyclic graph (DAG) of transforms, from one or more sources to one or more sinks. You construct it in code (Pipeline p = Pipeline.create(options) in Java, with beam.Pipeline(options=...) as p: in Python), build up the graph by applying transforms, and then run it — locally with the DirectRunner for testing, or on Dataflow with the DataflowRunner.

A PCollection (“parallel collection”) is the data that flows between transforms — Beam’s distributed, immutable dataset abstraction. It can be bounded (a finite dataset of known size — a file, a BigQuery table; this is batch) or unbounded (an infinite, continuously growing stream — a Pub/Sub subscription; this is streaming). Every element in a PCollection carries an implicit timestamp and a window, which is what makes time-based grouping possible. PCollections are immutable: a transform never mutates its input, it produces a new PCollection.

A PTransform is an operation that takes one or more PCollections and produces one or more PCollections — a Map, a Filter, a GroupByKey, a windowing assignment, a read, a write. You compose transforms with the apply operator (.apply(...) in Java, the | pipe in Python) and Beam encourages packaging your own multi-step logic as a composite transform so the pipeline reads at a high level.

ParDo is the most general element-wise transform — Beam’s “for each element, do this” primitive (analogous to the map/flatMap phase of MapReduce). You supply a DoFn (a class with a process method) that receives one input element and emits zero, one, or many output elements. ParDo is where most business logic lives: parsing, enriching, filtering, fan-out, format conversion.

GroupByKey is the core aggregation primitive: given a PCollection of key-value pairs, it groups all values that share a key into (key, Iterable<values>). This is the step that requires a shuffle (moving data across workers so all values for a key land together) and, for streaming, the step that forces Beam to decide when a group is complete — which is the whole reason windowing, watermarks, and triggers exist. Related primitives: CoGroupByKey (relational join of multiple keyed PCollections), Combine (associative/commutative aggregation like sum/mean/count, optimised with partial combining on the workers before the shuffle — far cheaper than GroupByKey when you only need an aggregate), and Flatten (merge several PCollections of the same type into one, like a UNION ALL).

A side input is a small, broadcast view of a PCollection that a ParDo can read in full for every main-input element — used for lookups/enrichment (e.g. a small dimension table). A window assigns each element to one or more time-based buckets; a trigger decides when the results for a window are emitted; a watermark is Dataflow’s estimate of how far event time has progressed. Those last three are the heart of streaming and have their own sections below.

Term What it is Batch analogue Notes
Pipeline The whole DAG, source(s) → sink(s) The job Run with DirectRunner (test) or DataflowRunner (prod)
PCollection Distributed, immutable dataset between transforms An RDD / DataFrame Bounded = batch; unbounded = streaming
PTransform An operation on PCollections A stage Compose with .apply / |; package as composite
ParDo (+ DoFn) General per-element processing map/flatMap Where most logic lives; emits 0…N per element
GroupByKey Group values by key (needs shuffle) GROUP BY Forces “is this group complete?” → windowing/triggers
Combine Associative aggregation (sum/mean/count) aggregate fns Cheaper than GBK — partial combine before shuffle

Batch versus streaming: what actually changes

Beam’s promise is that the transforms are the same; only the shape of the data and the way Dataflow runs it differ. The table makes the distinction concrete.

Aspect Batch Streaming
Input PCollection Bounded (finite, known end) Unbounded (infinite, never ends)
Typical source GCS files, BigQuery table Pub/Sub subscription
Job lifetime Runs to completion, then stops Runs forever until you drain/cancel/update
Default window A single global window (all data in one group) Still global by default — but you almost always apply event-time windows
Watermark Implicitly “complete” at the end of input A live, advancing estimate of event-time progress
Triggers Fire once when input is exhausted Fire repeatedly per the trigger you choose
Autoscaling Throughput-based; can scale to large fleets and back to zero at the end Backlog/CPU-based; keeps a live fleet sized to the stream
Shuffle Dataflow Shuffle (service-side, batch) Streaming Engine (service-side state + timers + shuffle)
Billing Usually short, bursty Continuous (a worker is always running) — the big cost driver
--streaming flag Off (auto-detected from a bounded source) On (auto-set when reading an unbounded source like Pub/Sub)

The single most important conceptual point: in batch, “the data is complete” is obvious — the file ends. In streaming, data never ends, so you must define finite slices of time to aggregate over (windows) and a way to decide each slice is done enough to emit (watermarks + triggers). Everything difficult about streaming flows from that one fact.

Event time versus processing time

This is the distinction every streaming question hinges on, so be precise.

In a perfect world the two would be equal. In reality there is skew: an event that happened at 10:00:00 (event time) might reach your DoFn at 10:00:07 (processing time), or — for a mobile device that was offline — at 14:00:00, four hours late. Windowing is defined in event time (so “the 10:00 window” always means events stamped 10:00, regardless of when they arrive), and the watermark is the mechanism that bridges the gap by estimating how far event time has progressed given what has been seen.

Beam lets you set the element timestamp from the source (Pub/Sub can use the publish time or a message attribute via --timestampAttribute), or assign it in a DoFn with outputWithTimestamp. Choosing the right timestamp source is a design decision: use the true event time from the payload when it exists, not Pub/Sub publish time, or your windows will be wrong by the ingestion delay.

Windowing: fixed, sliding, session, global

A window subdivides an (unbounded) PCollection into finite chunks by event time so that grouping/aggregation has something to close over. You apply a windowing strategy with beam.WindowInto(...) (Python) / Window.into(...) (Java) before the aggregation. Beam provides four window types.

Window type Shape Defined by Overlap Classic use
Fixed (tumbling) Contiguous, equal, non-overlapping intervals size (e.g. 60 s) None — each event in exactly one window “Count events per minute”; regular periodic aggregates
Sliding (hopping) Equal-length windows that overlap, advancing by a period size + period (e.g. size 60 s, period 10 s) Yes — each event in size/period windows “Trailing 1-minute average updated every 10 s”; moving averages
Session Data-driven, variable-length; a window per burst of activity, closed by a gap gap (e.g. 30 min of inactivity) None, but per-key “User session” — group a user’s activity until they go quiet
Global One single window covering all time (the default) n/a Batch totals; streaming only with a non-default trigger

Details and gotchas:

A crucial rule: windowing must be applied before the aggregation that needs it (the GroupByKey/Combine), and the window assignment propagates downstream until you re-window. Two PCollections must share a compatible windowing strategy to be joined with CoGroupByKey.

Watermarks: how Dataflow decides a window is “done”

A watermark is Dataflow’s continuously updated estimate of the statement “event time has now advanced to T; I believe I have seen (almost) all events with timestamp ≤ T.” It is how an unbounded pipeline decides a window can be closed and its result emitted.

You can observe the watermark in the Dataflow UI per stage (the “data watermark” and “system watermark”). A watermark that is stuck or lagging far behind processing time is the canonical streaming symptom — usually a slow/oversubscribed stage, an under-provisioned fleet, or a hot key — and it means windows are not closing and results are not appearing. Watching the watermark vs the backlog is the first move in any streaming incident.

Triggers, allowed lateness, and late data

If the watermark answers “is the window’s input complete?”, the trigger answers “when, and how many times, do I emit a result for this window?” By default a window fires once, when the watermark passes its end (one on-time pane). Triggers let you emit early (speculative, before the watermark — for low-latency dashboards) and late (after the watermark, to incorporate late data) results too. Each emission is a pane.

The trigger family:

Trigger Fires when Why you use it
AfterWatermark (event-time, default) When the watermark passes the end of the window The correct, on-time result
AfterWatermark.withEarlyFirings(...) Repeatedly before the watermark, on a processing-time delay Low-latency speculative results for live dashboards
AfterWatermark.withLateFirings(...) After the watermark, when late data arrives (within allowed lateness) Correct results that account for stragglers
AfterProcessingTime A fixed amount of processing time after the first element in the pane Periodic output independent of event time (e.g. flush every 30 s)
AfterCount / AfterPane.elementCountAtLeast(N) (data-driven) After N elements have accumulated in the window Emit once a batch size is reached
Repeatedly.forever(...) Wraps another trigger to fire it again and again Continuous output (e.g. with a global window in streaming)
AfterAny / AfterAll / AfterEach (composite) Combine sub-triggers with OR / AND / sequence “Fire when EITHER 1000 elements OR 60 s of processing time”
Never / DefaultTrigger Never fire early/late / the standard AfterWatermark Defaults

Two concepts complete the picture:

The mental model to carry into an interview: windowing = which bucket; watermark = is the bucket’s input complete; trigger = when/how often to emit the bucket; allowed lateness = how long to wait for stragglers; accumulation mode = whether each emission replaces or adds to the last. Five orthogonal knobs.

Sources and sinks (I/O connectors)

Beam ships a large catalogue of I/O connectors (...IO) for reading sources and writing sinks. The ones that matter on GCP:

Connector Read (source) Write (sink) Notes
PubSubIO ReadFromPubSub (unbounded) WriteToPubSub The streaming source; use with_attributes/timestamp_attribute to carry event time; reads from a subscription or a topic (topic creates a temp subscription)
BigQueryIO ReadFromBigQuery (table or query; bounded) WriteToBigQuery Write methods: STORAGE_WRITE_API (default, exactly-once, recommended), STREAMING_INSERTS (legacy), or FILE_LOADS (batch, free-er, higher latency). Supports CREATE_IF_NEEDED, WRITE_APPEND/WRITE_TRUNCATE
TextIO / FileIO ReadFromText (GCS, bounded; or watchForNewFiles for streaming) WriteToText (sharded) The classic GCS source/sink; output is sharded (-00000-of-00010)
AvroIO / ParquetIO read Avro/Parquet write Avro/Parquet Columnar/typed file formats on GCS
BigtableIO read write Wide-column NoSQL sink for high-throughput key-value
JdbcIO read via JDBC write via JDBC Cloud SQL / external relational DBs
SpannerIO / DatastoreIO read write Spanner / Firestore-in-Datastore-mode
KafkaIO read (unbounded) write Apache Kafka / Managed Service for Kafka

Key choices and gotchas:

Execution on Dataflow: workers, autoscaling, Streaming Engine, Shuffle, Prime

Once you submit a pipeline with the DataflowRunner, Dataflow optimises the DAG (fuses adjacent transforms to avoid materialising intermediate data, a step called fusion) and runs it on a fleet of worker VMs (Compute Engine instances under the hood). Here is every lever that governs that execution.

Runner v2. Modern Dataflow uses Runner v2 (the portability framework, Docker-based, required for cross-language and Python streaming features) — it is the default and you rarely toggle it.

Autoscaling. Dataflow’s signature feature. It adjusts the number of workers automatically based on load.

Streaming Engine. Moves the streaming pipeline’s state and shuffle off the worker VMs and into the Dataflow service backend. Benefits: workers become smaller and more disposable (less local state to move), autoscaling is much smoother and more responsive (rebalancing does not require shuffling state between VMs), and you use smaller persistent disks. It is the default for new streaming jobs and effectively required for good streaming autoscaling. Enable explicitly with --enable_streaming_engine. Cost: Streaming Engine has its own data-processed charge, but typically saves more on smaller/fewer workers and disk than it costs.

Dataflow Shuffle. The batch analogue: moves the shuffle (the data movement behind GroupByKey/CoGroupByKey/joins) off the workers into the service backend. Faster, more reliable, and lets workers use less local disk. It is the default for batch jobs (--experiments=shuffle_mode=service, now standard).

Dataflow Prime. The next-generation execution platform, layering two big features on top of Streaming Engine/Shuffle:

Worker machine type and resources.

Option What it controls Default Notes
--worker_machine_type The Compute Engine machine type for workers n1-standard-1 (batch) / n1-standard-2-ish (streaming) depending on mode Use n2/e2/c2 for better price-performance; bump for memory-heavy DoFns (or use Prime vertical scaling)
--num_workers / --max_num_workers Initial and ceiling worker count autoscaled The ceiling is your cost guardrail
--disk_size_gb Persistent disk per worker 250 GB batch / 400 GB streaming (much less with Streaming Engine) Streaming Engine slashes the disk you need
--worker_disk_type pd-standard vs pd-ssd pd-standard SSD for shuffle/disk-heavy jobs not using the service shuffle
--number_of_worker_harness_threads Parallel threads per worker runtime-chosen Lower it if a DoFn is not thread-safe or is memory-hungry
--worker_region / --worker_zone Where workers run the job’s region Pin for data residency / proximity to sources
GPU (--dataflow_service_options=worker_accelerator=...) Attach GPUs to workers none For ML inference inside the pipeline
FlexRS (--flexrs_goal) Flexible Resource Scheduling for batch — uses a mix of preemptible + regular VMs, delay-scheduled within 6 h off Big discount for non-urgent batch; not for streaming or time-sensitive jobs

The pipeline-options matrix (the flags every job can take, beyond the above):

Option What it does
--runner DataflowRunner (cloud) or DirectRunner (local test)
--project / --region Project and the region the job runs in (pick close to your data)
--temp_location / --staging_location GCS paths Dataflow uses for temp files and to stage your code (required)
--streaming Force streaming mode (auto-set by an unbounded source)
--job_name A human name; must be unique among running jobs
--service_account_email The worker service account identity (least-privilege!)
--subnetwork / --network / --no_use_public_ips VPC placement; private IPs only (needs Private Google Access / Cloud NAT)
--dataflow_kms_key CMEK for the job’s state and temp data
--autoscaling_algorithm THROUGHPUT_BASED or NONE
--enable_streaming_engine / --dataflow_service_options=enable_prime Streaming Engine / Prime
--experiments=... Opt-in features and tuning flags
--update Replace a running streaming job in place (see below)
--template_location (classic templates) stage the pipeline as a reusable template

Templates: classic, Flex, and Google-provided

A template packages a pipeline so it can be launched without the SDK or source code — by a gcloud command, the console, a REST call, Cloud Scheduler, or a Cloud Function — and with runtime parameters. This is how non-developers (or automation) run a pipeline, and how you separate “who builds the pipeline” from “who runs it”.

Classic templates Flex templates Google-provided templates
What The pipeline graph is compiled and staged to GCS at build time The pipeline is packaged as a Docker image (in Artifact Registry) plus a small template spec in GCS A large catalogue of ready-made classic/Flex templates Google publishes
Parameters Only ValueProvider parameters can vary at runtime (the graph is largely fixed) Any parameter can vary; the graph is built at launch time inside the container Parameterised for you
Flexibility Limited (graph fixed at staging) Full — dynamic graph, any I/O, any dependency baked into the image N/A
Status Legacy; still supported The recommended approach for new custom templates Use first if one fits
Launch gcloud dataflow jobs run gcloud dataflow flex-template run Either, per template

Job lifecycle: drain, cancel, and update

A batch job ends on its own when the input is exhausted. A streaming job runs forever, so you stop or change it deliberately — and the choice matters.

The exam-classic contrast: drain = stop cleanly, flush in-flight data, lose late data; cancel = stop now, lose in-flight data; update = swap code, keep state.

Dataflow versus Dataproc versus Cloud Data Fusion

GCP has three big data-processing services and the choice is a frequent interview/architecture question. They differ in who writes the code, what engine runs, and how managed it is.

Dataflow Dataproc Cloud Data Fusion
Engine / model Apache Beam, unified batch + streaming, serverless Managed Apache Spark / Hadoop / Hive / Flink / Presto clusters Visual, no-code/low-code ETL (built on CDAP), runs on Dataproc under the hood
Cluster None — fully serverless, autoscaled per job You provision/size clusters (ephemeral or long-lived); autoscaling optional Managed instance; executes pipelines on ephemeral Dataproc
Who builds Developers writing Beam (Java/Python/Go) Teams with existing Spark/Hadoop code & skills Data analysts / engineers via drag-and-drop UI
Streaming First-class (the unified model) Spark Structured Streaming / Flink, you operate it Real-time pipelines (limited)
Best for New streaming + batch pipelines; you want zero infra ops and true autoscaling Lift-and-shift of existing Hadoop/Spark workloads; full control of the engine/libraries Visual ETL for teams who do not want to code; rich connector library
Ops burden Lowest (serverless) Highest (you own cluster lifecycle/tuning) Low (visual), but pays for the instance

The decision rule: green-field batch and streaming with minimal ops → Dataflow. Existing Spark/Hadoop jobs or you need a specific OSS engine/library → Dataproc. Non-engineers building ETL by drag-and-drop, or you want a managed connector-rich GUI → Cloud Data Fusion (which itself runs on Dataproc). Dataflow is the answer the PDE exam expects whenever the scenario says “unified batch and streaming”, “serverless”, or “Apache Beam”.

Diagram: the Dataflow execution model

Google Cloud Dataflow: Apache Beam, streaming vs batch, windowing & autoscaling

The diagram traces an unbounded stream from Pub/Sub through a Beam pipeline (ParDo → window assignment → GroupByKey/Combine → BigQuery sink), with the time machinery off to the side — event-time windows, the advancing watermark drawing the on-time/late line, and the trigger emitting early/on-time/late panes — and the execution layer below it: an autoscaling worker fleet with Streaming Engine holding state and shuffle in the service backend.

Hands-on lab

This lab runs the canonical streaming pipeline — Pub/Sub → Dataflow → BigQuery — using a Google-provided Flex template (so you do not need the SDK installed), watches it autoscale, then drains it cleanly. It runs for a few minutes and costs cents on the $300 credit. Set your project first.

export PROJECT=$(gcloud config get-value project)
export REGION=us-central1
export BUCKET=gs://${PROJECT}-dataflow-lab
gcloud services enable dataflow.googleapis.com compute.googleapis.com \
  pubsub.googleapis.com bigquery.googleapis.com storage.googleapis.com

1. Create the staging bucket, a Pub/Sub topic + subscription, and a BigQuery dataset/table.

gsutil mb -l $REGION $BUCKET
gcloud pubsub topics create lab-events
gcloud pubsub subscriptions create lab-events-sub --topic=lab-events
bq mk --dataset --location=$REGION ${PROJECT}:dataflow_lab
bq mk --table ${PROJECT}:dataflow_lab.events \
  'data:STRING,attributes:STRING,message_id:STRING,publish_time:TIMESTAMP'

2. Launch the Google-provided “Pub/Sub Subscription to BigQuery” Flex template. No code — just parameters.

gcloud dataflow flex-template run "lab-ps-to-bq-$(date +%s)" \
  --template-file-gcs-location=gs://dataflow-templates-${REGION}/latest/flex/PubSub_Subscription_to_BigQuery \
  --region=$REGION \
  --parameters=inputSubscription=projects/${PROJECT}/subscriptions/lab-events-sub,outputTableSpec=${PROJECT}:dataflow_lab.events \
  --temp-location=${BUCKET}/temp \
  --max-workers=3 \
  --enable-streaming-engine

Expected output: a job resource with an id and currentState: JOB_STATE_QUEUEDJOB_STATE_RUNNING. The job appears in the Dataflow → Jobs console with a live execution graph.

3. Publish some events so there is data to flow through.

for i in $(seq 1 20); do
  gcloud pubsub topics publish lab-events \
    --message="{\"event\":\"click\",\"n\":$i}" \
    --attribute=source=lab
done

4. Watch it run and autoscale. List the job, then inspect it.

gcloud dataflow jobs list --region=$REGION --status=active
JOB_ID=$(gcloud dataflow jobs list --region=$REGION --status=active \
  --format='value(JOB_ID)' --filter="NAME:lab-ps-to-bq" | head -1)
gcloud dataflow jobs describe $JOB_ID --region=$REGION

In the console, open the job and watch the execution graph, the data watermark per stage, and the autoscaling panel (worker count over time). Within a minute or two you should see workers spin up.

5. Validate the data landed in BigQuery.

bq query --use_legacy_sql=false \
  "SELECT COUNT(*) AS rows FROM \`${PROJECT}.dataflow_lab.events\`"

Expected output: a non-zero rows count matching the messages you published (allow a minute for the pipeline to process and write).

6. Drain the streaming job (graceful stop — flushes in-flight data, then shuts down). Contrast with cancel, which would stop immediately.

gcloud dataflow jobs drain $JOB_ID --region=$REGION
# (use `gcloud dataflow jobs cancel $JOB_ID --region=$REGION` to stop *now* and discard in-flight data)
gcloud dataflow jobs list --region=$REGION   # state moves to JOB_STATE_DRAINING → JOB_STATE_DRAINED

Cleanup (do this — a running streaming job bills continuously, which is the one way this lab could cost real money):

# ensure the job is stopped (drained/cancelled) first, then:
bq rm -r -f -d ${PROJECT}:dataflow_lab
gcloud pubsub subscriptions delete lab-events-sub
gcloud pubsub topics delete lab-events
gsutil -m rm -r $BUCKET

Cost note: the only meaningful charge is the workers running while the streaming job is up — a couple of small workers for a few minutes is cents, fully absorbed by the $300 credit. Streaming Engine adds a small data-processed charge. The expensive mistake is forgetting to drain/cancel a streaming job — it will run (and bill) 24×7 until you stop it. Pub/Sub (10 GiB/month free), this volume of BigQuery storage/writes, and the GCS staging are all effectively free at lab scale. Always confirm gcloud dataflow jobs list --status=active is empty when you are done.

Common mistakes & troubleshooting

Symptom Likely cause Fix
Streaming job never emits results Default global window kept in streaming with the default trigger (never fires); or watermark stuck Apply an event-time window (fixed/sliding/session), or add an explicit trigger to the global window; investigate the stuck stage
Late events silently missing from windows Allowed lateness defaults to zero — late data is dropped Set withAllowedLateness(...) and an AfterWatermark.withLateFirings() trigger; pick an accumulation mode
Watermark stuck / lagging far behind A slow/oversubscribed stage, hot key (one key with disproportionate data), or too few workers Raise --max_num_workers, fix the hot key (add salt / better key), enable Streaming Engine; check the per-stage watermark in the UI
Double-counted totals in the sink ACCUMULATING mode into an additive sink (or late panes re-added) Use DISCARDING for additive sinks, or ACCUMULATING with an upsert/overwrite sink keyed by window
Job stuck QUEUED / fails to start Worker SA lacks roles/dataflow.worker or source/sink access; bad --temp_location; quota for CPUs exhausted Grant the worker SA correct roles + I/O access; supply a valid GCS temp path; check Compute Engine CPU quota in the region
Update rejected (“incompatible”) Renamed a stateful step, or changed a window/coder incompatibly Provide a --transform_name_mapping, keep stateful step names stable, or drain then relaunch for incompatible changes
Wrong-time windows (events in the wrong bucket) Timestamp taken from Pub/Sub publish time instead of the true event time in the payload Set the element timestamp from the payload’s event time (timestamp_attribute / outputWithTimestamp)
OutOfMemory on workers Memory-heavy DoFn / large per-key state on too-small a machine Use a bigger --worker_machine_type, fewer harness threads, or Dataflow Prime vertical autoscaling; reduce per-key state
Sliding-window job is huge/expensive Tiny period over a large size → each element in many windows Increase the period or reduce the size; reconsider whether you need a sliding window
Streaming autoscaling not happening Streaming Engine not enabled Add --enable_streaming_engine (required for responsive streaming autoscaling)

Best practices

Security notes

Interview & exam questions

1. Explain the difference between event time and processing time, and why it matters. Event time is when the event actually happened (the timestamp in the data); processing time is when the pipeline observed it (worker wall clock). They diverge by skew (network delay, queueing, offline devices). Windowing is defined in event time so aggregates are correct regardless of arrival delay; the watermark bridges the two by estimating event-time progress.

2. What is a watermark, and what does it decide? Dataflow’s continuously updated estimate that “event time has reached T and I have seen (almost) all events ≤ T.” When it passes a window’s end, the window’s input is considered complete and (with the default trigger) the on-time result is emitted. It is heuristic, so some events arrive after it — those are late.

3. Describe the four window types and a use case for each. Fixed/tumbling — contiguous non-overlapping intervals (“per-minute counts”). Sliding/hopping — overlapping fixed-length windows advancing by a period (“trailing 1-min average every 10 s”; each element in size/period windows). Session — data-driven, per-key, variable length closed by an inactivity gap (“a user session”). Global — one window for all time (batch totals; streaming only with an explicit trigger).

4. By default, what happens to late data, and how do you incorporate it? By default allowed lateness is zero, so late data (arriving after the watermark passes its window) is dropped. To include it, set withAllowedLateness(duration) and a late-firing trigger (AfterWatermark.withLateFirings(...)), and choose an accumulation mode. More allowed lateness = more state held = more cost.

5. Accumulating vs discarding accumulation mode — when each? Accumulating: each pane re-emits the full result so far — use with an upsert/overwrite sink keyed by window. Discarding: each pane emits only the new elements since the last firing — use with an additive sink. Mismatch causes double-counting (accumulating + additive) or lost data (discarding + overwrite).

6. Drain vs cancel vs update. Drain (streaming): stop ingesting, finish in-flight work, advance the watermark so windows flush, then stop — graceful, no in-flight loss (but late data after the drain is lost). Cancel: stop immediately, discard in-flight data — fast, lossy. Update (--update): replace a running streaming job with new code while preserving state and watermark — for zero-downtime deploys; requires a compatible graph (stateful step names/types must match).

7. What does Streaming Engine do, and why does it matter for autoscaling? It moves the streaming pipeline’s state and shuffle off worker VMs into the Dataflow service backend. Workers become small and disposable, so autoscaling can rebalance smoothly and quickly (no need to move large state between VMs) and disks shrink. It is effectively required for responsive streaming autoscaling and is the default for new streaming jobs.

8. GroupByKey vs Combine — why prefer Combine? Both aggregate by key, but Combine is associative/commutative, so Dataflow does partial combining on each worker before the shuffle (e.g. local partial sums), moving far less data across the network. GroupByKey shuffles all raw values to one place. For sums/means/counts, Combine is dramatically cheaper.

9. How do you run a pipeline without the SDK, with runtime parameters? Classic vs Flex templates? Use a template. Classic: graph staged to GCS at build time; only ValueProvider params vary (legacy). Flex: pipeline packaged as a Docker image + spec in GCS, graph built at launch so any param can vary and you can bake in dependencies — the recommended approach. Or use a Google-provided template (Pub/Sub→BQ, GCS→BQ, etc.) for the common cases.

10. Dataflow vs Dataproc vs Cloud Data Fusion? Dataflow: serverless Apache Beam, unified batch+streaming, lowest ops — choose for green-field pipelines. Dataproc: managed Spark/Hadoop clusters — choose to lift-and-shift existing Spark/Hadoop or when you need a specific OSS engine. Cloud Data Fusion: visual no-code ETL (runs on Dataproc) — choose for non-engineers building drag-and-drop pipelines.

11. Your streaming job’s watermark is stuck and results stopped appearing. How do you diagnose it? Open the per-stage watermark in the Dataflow UI to find the lagging stage; check the backlog and CPU; look for a hot key (skew), an under-provisioned fleet (raise --max_num_workers), Streaming Engine not enabled, or a slow external call in a DoFn. A stuck watermark means windows cannot close.

12. Why might windows contain events from the wrong time, and how do you fix it? Because the element timestamp was taken from Pub/Sub publish time (or assigned at processing time) rather than the true event time in the payload, so ingestion delay shifts events into later windows. Fix by setting the timestamp from the payload’s event-time field (timestamp_attribute on the Pub/Sub read, or outputWithTimestamp in a DoFn).

13. What is Dataflow Prime and when would you use it? The next-gen execution platform adding vertical autoscaling (auto-adjusting worker memory, not just count) and Right Fitting (per-stage resource hints), billed in DCUs. Use it for jobs with variable or hard-to-predict resource needs (uneven memory, a single hungry stage) so you do not size the whole job for its worst step.

Quick check

  1. Which window type is data-driven, per-key, and closed by an inactivity gap?
  2. What is the default allowed lateness, and what does that imply for late events?
  3. Which Dataflow feature moves streaming state and shuffle into the service backend, and what does it enable?
  4. Which stop operation preserves in-flight data by flushing it before shutting down — drain or cancel?
  5. To deploy new code to a 24×7 streaming job without losing state, which operation do you use?

Answers

  1. Session windows.
  2. Zero — by default late events (arriving after the watermark passes their window) are dropped; you must set withAllowedLateness and a late trigger to keep them.
  3. Streaming Engine — it enables smooth, responsive streaming autoscaling (and smaller workers/disks).
  4. Drain (cancel stops immediately and discards in-flight data).
  5. Update (--update / gcloud dataflow jobs update), with a compatible transform graph.

Exercise

Build the classic streaming-analytics pipeline end to end and exercise the time machinery. Create a Pub/Sub topic taxi-rides and a subscription, and a BigQuery table for per-minute aggregates. Write (or adapt a Beam quickstart for) a streaming pipeline that: reads from the subscription using a payload event-time field as the element timestamp (not publish time); applies a fixed 60-second event-time window; uses a trigger that fires early every 10 seconds (speculative), on time at the watermark, and late for up to 10 minutes of allowed lateness, in accumulating mode; Combines to count rides per window; and writes to BigQuery via the Storage Write API keyed so that accumulating panes upsert (overwrite) the window’s row. Launch it on Dataflow with Streaming Engine, --max_num_workers=4, a least-privilege worker service account, and no public IPs. Publish a burst of events including two deliberately late ones (timestamps a few minutes in the past) and confirm: on-time panes appear within a window, the late panes update the affected window’s row (not double-counted, because accumulating + upsert), and the autoscaling panel shows the fleet sizing to the load. Then make a compatible code change (e.g. add a field) and deploy it with --update, confirming the job keeps running with no data loss. Finally drain the job and verify the final windows flush. Tear everything down. Note which changes would have made the update incompatible and forced a drain-then-relaunch.

Certification mapping

Glossary

Next steps

GCPDataflowApache BeamStreamingData EngineeringPDE
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments