Controlling Egress on GCP: Hierarchical Firewall Policies and Cloud NAT, End to End

Most GCP egress incidents are not “the firewall was open.” They are “three layers of firewall disagreed, the route was asymmetric, and nobody could read the logs.” Controlled egress on Google Cloud is a stack: hierarchical firewall policies set the organizational baseline, network firewall policies handle the VPC-specific exceptions, Cloud NAT gives you a deterministic, allowlistable source IP, and Private Google Access plus Private Service Connect (PSC) remove the need for internet egress to Google APIs and published services. This guide wires all of that together and shows you how to verify it before it pages you.

1. How firewall evaluation actually works

Before authoring a single rule, internalise the evaluation order. A packet on GCP is evaluated against rules from multiple sources, in this sequence:

  1. Hierarchical firewall policies attached to the organization, then each folder down the resource hierarchy (closest-to-root first).
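  2. Legacy VPC firewall rules (the old compute.firewalls objects).
  3. Global, then regional, network firewall policies attached to the VPC. This is the default order; setting the network's firewall policy enforcement order to BEFORE_CLASSIC_FIREWALL moves network policies ahead of the legacy VPC rules.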
  4. An implied allow egress / deny ingress pair at the very end.

Within any single policy, rules are sorted by priority (lower number wins; 0–2147483643). The first rule that matches the packet determines the action — unless that rule uses the special goto_next action, which delegates the decision to the next level down. So a hierarchical rule can either decide (allow/deny) or explicitly defer (goto_next).

The mental model that keeps you sane: a hierarchical allow or deny is final, so reserve those actions for guardrails you never want a project owner to override. Use goto_next for everything you want lower levels to be able to decide. A baseline egress deny that does not use goto_next cannot be re-opened by a VPC-level allow.

This last point is the crux of egress control. If you put a low-priority (high-number) hierarchical “deny all egress” rule that is final, project teams literally cannot open egress without you editing the org policy. If you make it goto_next, they can.
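The first-decisive-layer behaviour can be sketched as a toy model in bash. This is a simplification for reasoning only, not how the platform is implemented: the function name resolve_egress is hypothetical, and each argument stands for the action of the first matching rule at one layer, ordered from the org node downward.

```shell
#!/usr/bin/env bash
# Toy model of cross-layer evaluation. Each argument is the action taken by
# the first matching rule at one layer, listed from the org node downward.
# Pass "goto_next" for a layer that defers (or has no matching rule).
resolve_egress() {
  local action
  for action in "$@"; do
    case "$action" in
      allow|deny)
        echo "$action"   # first decisive layer is final
        return 0
        ;;
      goto_next)
        continue         # defer to the next layer down
        ;;
      *)
        echo "unknown action: $action" >&2
        return 1
        ;;
    esac
  done
  echo "allow"           # nothing decided: the implied allow-egress applies
}

resolve_egress deny allow           # org-level deny wins, VPC allow is ignored
resolve_egress goto_next allow      # org defers, VPC-level allow decides
resolve_egress goto_next goto_next  # falls through to the implied allow
```

The first call is the guardrail case from the text: once the org layer says deny, nothing below it is consulted.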

2. Authoring an org/folder baseline egress deny

Create a hierarchical firewall policy and associate it with a folder (associating at the org node is also valid; folders give you blast-radius control during rollout). Hierarchical policies live under gcloud compute firewall-policies and require --organization for scope.

ORG_ID=123456789012
FOLDER_ID=987654321098

# 1. Create the policy container
gcloud compute firewall-policies create \
  --organization="$ORG_ID" \
  --short-name="egress-baseline" \
  --description="Org baseline: default-deny egress with explicit allowlist"

# Capture the generated policy ID (numeric)
POLICY=$(gcloud compute firewall-policies list \
  --organization="$ORG_ID" \
  --format="value(name)" \
  --filter="shortName=egress-baseline")

Now add rules. Lower priority numbers are evaluated first, so the explicit exceptions must sit above the deny. A common baseline: defer egress to RFC 1918 internal ranges to the lower levels, allow the Google APIs VIP, then deny the rest.

# Defer egress to internal RFC 1918 ranges (goto_next -> VPC-level policies decide)
gcloud compute firewall-policies rules create 1000 \
  --firewall-policy="$POLICY" --organization="$ORG_ID" \
  --direction=EGRESS --action=goto_next \
  --layer4-configs=all \
  --dest-ip-ranges=10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
  --enable-logging

# Allow egress to Google's restricted VIP for Private Google Access
gcloud compute firewall-policies rules create 1100 \
  --firewall-policy="$POLICY" --organization="$ORG_ID" \
  --direction=EGRESS --action=allow \
  --layer4-configs=tcp:443 \
  --dest-ip-ranges=199.36.153.4/30 \
  --enable-logging

# Final baseline: deny all other egress (NOT goto_next -> this is the guardrail)
gcloud compute firewall-policies rules create 2147483643 \
  --firewall-policy="$POLICY" --organization="$ORG_ID" \
  --direction=EGRESS --action=deny \
  --layer4-configs=all \
  --dest-ip-ranges=0.0.0.0/0 \
  --enable-logging

Associate the policy with the folder. Until you associate it, it enforces nothing:

gcloud compute firewall-policies associations create \
  --firewall-policy="$POLICY" --organization="$ORG_ID" \
  --folder="$FOLDER_ID" \
  --name="egress-baseline-folder-assoc"

The 199.36.153.4/30 range is the restricted.googleapis.com VIP, the entry point for Private Google Access when you want to block all general internet egress but still reach Google APIs. (private.googleapis.com is 199.36.153.8/30; restricted is the stricter one because it only serves APIs that support VPC Service Controls.)

3. Targets: secure tags vs network tags vs service accounts

A firewall rule narrows what it applies to via targets. You have three options, and they are not interchangeable.

| Target type | Where it works | IAM-governed | Notes |
|---|---|---|---|
| Network tags | VPC firewall rules only | No | Free-text strings on the instance; any editor can add one. Weakest control. |
| Service accounts | VPC rules, network + hierarchical policies | Yes (via SA usage) | Strong, but you burn a target SA and an instance runs as only one SA. |
| Secure tags | Network + hierarchical policies | Yes (Tag User/Admin roles) | Key/value resource-manager tags, IAM-bound. The modern recommendation. |

For hierarchical and network firewall policies, prefer secure tags (resource-manager tags), referenced as --target-secure-tags. They are governed by IAM (a user needs roles/resourcemanager.tagUser to bind one), so a developer cannot silently opt their VM into a privileged rule the way they can with a network tag.

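# Create a tag key/value (resource-manager tags).
# Firewall policies only accept tags created with purpose GCE_FIREWALL.
# organization=auto scopes the key to the whole org; confirm against current docs.
gcloud resource-manager tags keys create egress-tier \
  --parent="organizations/$ORG_ID" \
  --purpose=GCE_FIREWALL \
  --purpose-data=organization=auto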

KEY=$(gcloud resource-manager tags keys list \
  --parent="organizations/$ORG_ID" \
  --format="value(name)" --filter="shortName=egress-tier")

gcloud resource-manager tags values create allow-internet \
  --parent="$KEY"

# Reference it as a firewall target (full value resource name)
VALUE=$(gcloud resource-manager tags values list --parent="$KEY" \
  --format="value(name)" --filter="shortName=allow-internet")

gcloud compute firewall-policies rules create 900 \
  --firewall-policy="$POLICY" --organization="$ORG_ID" \
  --direction=EGRESS --action=allow \
  --layer4-configs=tcp:443 \
  --dest-ip-ranges=0.0.0.0/0 \
  --target-secure-tags="$VALUE" \
  --enable-logging

Now only instances explicitly bound to the egress-tier=allow-internet tag can reach the internet on 443; everything else hits the deny-all. Tag binding to a VM is a separate gcloud resource-manager tags bindings create operation against the instance’s resource name.
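A minimal sketch of that binding step follows. The helper names instance_resource_name and bind_egress_tag are hypothetical, as are the project number, zone, and instance ID; it assumes the binding parent takes the full Compute Engine resource name and that zonal resources need --location set to the zone.

```shell
#!/usr/bin/env bash
# Build the full resource name that tag bindings expect for a VM.
# Arguments: project number, zone, numeric instance ID (placeholders here).
instance_resource_name() {
  printf '//compute.googleapis.com/projects/%s/zones/%s/instances/%s\n' \
    "$1" "$2" "$3"
}

# Bind a tag value (for example "$VALUE" from the commands above) to one VM.
# Defined here but not called, so sourcing this file changes nothing.
bind_egress_tag() {
  local tag_value="$1" project_number="$2" zone="$3" instance_id="$4"
  gcloud resource-manager tags bindings create \
    --tag-value="$tag_value" \
    --parent="$(instance_resource_name "$project_number" "$zone" "$instance_id")" \
    --location="$zone"
}

# Preview the parent string without touching the API (made-up identifiers).
instance_resource_name 123456789 us-central1-a 4567890123456789
```

Calling bind_egress_tag "$VALUE" with real identifiers performs the binding; because the caller needs the Tag User role on the value, the bind itself is an auditable, IAM-gated action.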

4. Cloud NAT for deterministic, allowlistable IPs

Even with controlled egress, traffic that does leave needs a stable source IP so downstream partners can allowlist you. With automatic allocation, Cloud NAT picks the external IPs for you, and they can change as the gateway scales. For deterministic egress, reserve static external addresses and pin them.

REGION=us-central1
ROUTER=nat-router
gcloud compute routers create "$ROUTER" \
  --network=prod-vpc --region="$REGION"

# Reserve static, allowlistable external IPs
gcloud compute addresses create nat-ip-1 nat-ip-2 --region="$REGION"

gcloud compute routers nats create prod-nat \
  --router="$ROUTER" --region="$REGION" \
  --nat-external-ip-pool=nat-ip-1,nat-ip-2 \
  --nat-all-subnet-ip-ranges \
  --enable-logging

Two operational dials matter for scale:

gcloud compute routers nats update prod-nat \
  --router="$ROUTER" --region="$REGION" \
  --min-ports-per-vm=64 --max-ports-per-vm=1024 \
  --enable-dynamic-port-allocation

Capacity math you should do before go-live: IPs * 64512 / min-ports-per-vm = max concurrent VMs, where 64512 is the usable port count per NAT IP (65536 minus the 1024 reserved low ports). Two IPs at 64 min-ports support 2,016 VMs at minimum allocation; any VM that grows beyond its minimum lowers that ceiling. Under-size this and you get the most common NAT outage there is: port exhaustion under load, which presents as intermittent connection failures, not a clean error.
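The arithmetic is easy to script so it runs as part of a rollout review. The function names below are hypothetical, and the 500-VM example is an illustrative input, not a figure from this guide.

```shell
#!/usr/bin/env bash
# Usable source ports per NAT IP: 65536 minus the 1024 reserved low ports.
PORTS_PER_NAT_IP=64512

# nat_max_vms <nat_ip_count> <min_ports_per_vm>
# Upper bound on VMs the gateway can serve at minimum allocation.
nat_max_vms() {
  local ips="$1" min_ports="$2"
  echo $(( ips * PORTS_PER_NAT_IP / min_ports ))
}

# nat_ips_needed <vm_count> <ports_per_vm>
# Smallest number of NAT IPs that covers vm_count VMs when each one holds
# ports_per_vm ports (integer ceiling division).
nat_ips_needed() {
  local vms="$1" ports="$2"
  echo $(( (vms * ports + PORTS_PER_NAT_IP - 1) / PORTS_PER_NAT_IP ))
}

nat_max_vms 2 64          # prints 2016, matching the figure in the text
nat_ips_needed 500 1024   # prints 8: worst case if 500 VMs all reach 1024 ports
```

Sizing against max-ports-per-vm rather than the minimum is the conservative choice when dynamic port allocation is on, since busy VMs will climb toward the ceiling exactly when load peaks.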

5. Eliminating internet egress with Private Google Access and PSC

The strongest egress posture is one where workloads never touch the internet for Google services. Two mechanisms:

Private Google Access (PGA) lets VMs without external IPs reach Google APIs over internal routing. Enable it per-subnet and route the restricted VIP:

gcloud compute networks subnets update prod-subnet \
  --region="$REGION" --enable-private-ip-google-access

# Route the whole restricted VIP block (/30 = 4 addresses). The next hop is named
# default-internet-gateway, but this traffic stays on Google's network.
gcloud compute routes create restricted-google-apis \
  --network=prod-vpc \
  --destination-range=199.36.153.4/30 \
  --next-hop-gateway=default-internet-gateway \
  --priority=1000

Pair this with a private DNS zone for googleapis.com that holds an A record for restricted.googleapis.com pointing at the four addresses 199.36.153.4 through 199.36.153.7, and a CNAME from *.googleapis.com to restricted.googleapis.com, so SDK calls resolve to the restricted VIP rather than public endpoints. With PGA + restricted VIP + the firewall allow from step 2, a VM can call Storage, BigQuery, and Pub/Sub with zero internet egress.
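One way to lay out that private zone is sketched below, assuming Cloud DNS and the prod-vpc network used elsewhere in this guide. The zone name googleapis-restricted and both function names are hypothetical; the layout is an A record on restricted.googleapis.com covering the /30, plus a wildcard CNAME pointing at it.

```shell
#!/usr/bin/env bash
# Expand the restricted VIP block 199.36.153.4/30 into its four addresses,
# comma-separated, ready to pass to --rrdatas.
restricted_vip_rrdatas() {
  local ips=()
  local last
  for last in 4 5 6 7; do
    ips+=("199.36.153.$last")
  done
  local IFS=,
  echo "${ips[*]}"
}

# Create the private zone and its two record sets.
# Defined but not called here; run it once per VPC that needs the override.
create_restricted_dns() {
  local zone="googleapis-restricted" network="prod-vpc"
  gcloud dns managed-zones create "$zone" \
    --dns-name="googleapis.com." --visibility=private \
    --networks="$network" \
    --description="Resolve googleapis.com to the restricted VIP"
  gcloud dns record-sets create "restricted.googleapis.com." \
    --zone="$zone" --type=A --ttl=300 \
    --rrdatas="$(restricted_vip_rrdatas)"
  gcloud dns record-sets create "*.googleapis.com." \
    --zone="$zone" --type=CNAME --ttl=300 \
    --rrdatas="restricted.googleapis.com."
}

restricted_vip_rrdatas
```

Other Google domains a workload uses (gcr.io or pkg.dev, for example) would need their own zones in the same pattern; which ones apply depends on the services in play.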

Private Service Connect (PSC) extends the same idea to published services and supported Google APIs via a consumer endpoint with an IP inside your VPC:

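# PSC endpoints for Google APIs use a global internal address with purpose
# PRIVATE_SERVICE_CONNECT. The IP must sit outside every subnet range in the VPC.
gcloud compute addresses create psc-googleapis-ip \
  --global \
  --purpose=PRIVATE_SERVICE_CONNECT \
  --addresses=10.0.0.50 \
  --network=prod-vpc

# Endpoint names for Google APIs: 1-20 lowercase letters and digits, no hyphens
gcloud compute forwarding-rules create pscgoogleapis \
  --global \
  --network=prod-vpc \
  --address=psc-googleapis-ip \
  --target-google-apis-bundle=all-apis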

Now all-apis is reachable at 10.0.0.50 — a routable internal IP you can allowlist in firewall rules and point DNS at, with no dependency on the Google VIP ranges at all.

6. Enabling and reading firewall + NAT logs

You cannot audit what you cannot see. Note the --enable-logging flag on every rule above — for policy rules, logging is per-rule. Firewall logs land in Cloud Logging under the compute.googleapis.com/firewall log; NAT logs require enabling on the NAT gateway and can filter to errors only (translation failures, dropped packets) to keep volume sane.

# What hit the baseline deny in the last hour?
gcloud logging read \
  'logName=~"compute.googleapis.com%2Ffirewall"
   AND jsonPayload.disposition="DENIED"' \
  --freshness=1h \
  --format="table(timestamp, jsonPayload.connection.dest_ip,
                  jsonPayload.connection.dest_port,
                  jsonPayload.rule_details.reference)"

# NAT allocation errors (port exhaustion shows here)
gcloud logging read \
  'resource.type="nat_gateway"
   AND jsonPayload.allocation_status="DROPPED"' \
  --freshness=1h --format=json

For NAT, you can scope logging to errors only to control cost:

gcloud compute routers nats update prod-nat \
  --router="$ROUTER" --region="$REGION" \
  --enable-logging --log-filter=ERRORS_ONLY

The rule_details.reference field tells you which policy and rule matched — invaluable when three layers are in play. Route a sink for disposition="DENIED" to BigQuery for longer-term egress audit.
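A sketch of that sink follows, assuming a BigQuery dataset already exists. The sink name egress-denied, the function names, and the project and dataset identifiers are all hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Log filter that matches denied firewall log entries.
denied_egress_filter() {
  echo 'logName=~"compute.googleapis.com%2Ffirewall" AND jsonPayload.disposition="DENIED"'
}

# sink_destination <project-id> <dataset>
# BigQuery destination string in the form that logging sinks expect.
sink_destination() {
  printf 'bigquery.googleapis.com/projects/%s/datasets/%s\n' "$1" "$2"
}

# create_denied_sink <project-id> <dataset>
# Defined but not called here. After creation, grant the sink's writer
# identity write access on the dataset, or nothing will be delivered.
create_denied_sink() {
  gcloud logging sinks create egress-denied \
    "$(sink_destination "$1" "$2")" \
    --log-filter="$(denied_egress_filter)"
}

# Preview the destination string (made-up identifiers).
sink_destination my-audit-project egress_audit
```

Creating the sink at the folder or organization level instead of per project would capture every project under the baseline; that variant needs the corresponding scope flags and is left out of this sketch.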

7. Test the stack with Connectivity Tests before rollout

Network Intelligence Center’s Connectivity Tests analyse the configuration across your firewall layers, routes, and NAT, and add a live data-plane probe where the path supports one, so you can validate a rule stack before it reaches production. This is the single highest-leverage pre-rollout step.

gcloud network-management connectivity-tests create egress-to-partner \
  --source-instance=projects/PROJ/zones/us-central1-a/instances/app-vm \
  --destination-ip-address=203.0.113.10 \
  --destination-port=443 \
  --protocol=TCP

gcloud network-management connectivity-tests describe egress-to-partner \
  --format="value(reachabilityDetails.result)"

A result of REACHABLE confirms the path; UNREACHABLE returns the trace with the exact dropping rule (including hierarchical policy rules), which removes the guesswork. Run one test per intended egress flow and one negative test (expect UNREACHABLE) to prove the baseline deny actually blocks.
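The positive and negative tests can be checked in one pass with a small wrapper. This is a sketch: the function names are hypothetical, and it assumes the tests already exist under the names passed in.

```shell
#!/usr/bin/env bash
# expect_result <test-name> <expected> <actual>
# Prints PASS or FAIL and returns non-zero on a mismatch.
expect_result() {
  local name="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    echo "PASS $name"
  else
    echo "FAIL $name: expected $expected, got $actual"
    return 1
  fi
}

# connectivity_result <test-name>
# Fetches the verdict of an existing Connectivity Test. Not called here.
connectivity_result() {
  gcloud network-management connectivity-tests describe "$1" \
    --format="value(reachabilityDetails.result)"
}

# Offline demonstration with a literal verdict instead of an API call.
expect_result egress-to-partner REACHABLE REACHABLE
```

In a pipeline, each intended flow becomes expect_result NAME REACHABLE "$(connectivity_result NAME)", and the negative test uses UNREACHABLE as the expected value, so a baseline deny that silently stopped blocking fails the build.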

Enterprise scenario

A payments platform team rolled out the folder-level baseline deny across ~40 projects, validated every Connectivity Test, and shipped. Two weeks later a regional GKE fleet started throwing intermittent dial tcp: i/o timeout to an external KYC partner — but only under load, never in the canary. The baseline was fine; the partner’s allowlist was fine. The culprit was Cloud NAT port exhaustion that the per-flow Connectivity Tests structurally could not catch.

The fleet ran behind a single NAT IP at the default 64 min-ports-per-vm. The capacity formula said 64512 / 64 = 1008 VMs, comfortably above the node count. What they missed: Cloud NAT allocates ports per VM, so every pod on a VPC-native (alias IP) GKE node draws from that node's 64 ports, and a chatty workload opening many short-lived TLS connections to one destination IP:port burns through that per-node allocation fast. The nat_gateway logs made it unambiguous: allocation_status="DROPPED" spiked only at peak.

The fix was dynamic port allocation with a higher per-VM ceiling, so busy nodes could grow past 64 ports on demand, plus a second reserved IP (both on the partner allowlist). Dynamic port allocation and endpoint-independent mapping are mutually exclusive on a public NAT gateway, so endpoint-independent mapping stays off:

gcloud compute addresses create nat-ip-2 --region="$REGION"
gcloud compute routers nats update prod-nat \
  --router="$ROUTER" --region="$REGION" \
  --nat-external-ip-pool=nat-ip-1,nat-ip-2 \
  --min-ports-per-vm=64 --max-ports-per-vm=4096 \
  --enable-dynamic-port-allocation \
  --no-enable-endpoint-independent-mapping

The lasting lesson: a REACHABLE Connectivity Test proves the path, never the capacity. They added a dedicated alert on allocation_status="DROPPED" to the rollout checklist permanently.

Verify

Confirm the end-to-end posture from a workload instance and the control plane.

# 1. From an allowed VM: Google API over the private path succeeds
gcloud compute ssh app-vm --zone=us-central1-a --tunnel-through-iap \
  --command="curl -sS -o /dev/null -w '%{http_code}\n' https://storage.googleapis.com"

# 2. From a non-allowlisted VM: generic internet egress is denied (expect timeout/fail)
gcloud compute ssh locked-vm --zone=us-central1-a --tunnel-through-iap \
  --command="curl -sS --max-time 5 https://example.com || echo BLOCKED_AS_EXPECTED"

# 3. Confirm the NAT source IP is your reserved, allowlistable address
gcloud compute ssh app-vm --zone=us-central1-a --tunnel-through-iap \
  --command="curl -sS https://api.ipify.org"

# 4. Confirm associations are live
gcloud compute firewall-policies associations list --organization="$ORG_ID"

Step 3 should return one of nat-ip-1/nat-ip-2. If it returns something else, your route or NAT scope is wrong.
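The step 3 comparison can be automated instead of eyeballed. In this sketch the function names are hypothetical, the 203.0.113.x addresses are documentation placeholders, and the name filter assumes the reserved addresses follow the nat-ip-N naming used earlier.

```shell
#!/usr/bin/env bash
# is_reserved_ip <observed> <reserved>...
# Succeeds when the observed egress IP is one of the reserved NAT addresses.
is_reserved_ip() {
  local observed="$1"
  shift
  local ip
  for ip in "$@"; do
    if [ "$ip" = "$observed" ]; then
      return 0
    fi
  done
  return 1
}

# reserved_nat_ips <region>
# Lists the reserved addresses named nat-ip-*. Not called here.
reserved_nat_ips() {
  gcloud compute addresses list \
    --filter="region:($1) AND name~^nat-ip-" \
    --format="value(address)"
}

# Offline demonstration with placeholder addresses.
if is_reserved_ip 203.0.113.7 203.0.113.7 203.0.113.8; then
  echo "egress IP is on the allowlist"
fi
```

With real data, feed the output of step 3 as the first argument and the output of reserved_nat_ips "$REGION" as the rest; a non-zero exit then flags a wrong route or NAT scope.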

Rollout checklist

Pitfalls

Build it folder-first, prove every flow with a Connectivity Test, and keep the deny final. That combination gives you egress that is controlled, deterministic, and — most importantly when something breaks at 2 a.m. — readable.
