Most GCP egress incidents are not “the firewall was open.” They are “three layers of firewall disagreed, the route was asymmetric, and nobody could read the logs.” Controlled egress on Google Cloud is a stack: hierarchical firewall policies set the organizational baseline, network firewall policies handle the VPC-specific exceptions, Cloud NAT gives you a deterministic, allowlistable source IP, and Private Google Access plus PSC remove the need for internet egress entirely. This guide wires all of that together and shows you how to verify it before it pages you.
1. How firewall evaluation actually works
Before authoring a single rule, internalise the evaluation order. A packet on GCP is evaluated against rules from multiple sources, in this sequence:
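- Hierarchical firewall policies attached to the organization, then each folder down the resource hierarchy (closest-to-root first).
- Legacy VPC firewall rules (the old compute.firewalls objects).
- Global, then regional, network firewall policies attached to the VPC.
- An implied allow egress/deny ingress pair at the very end.
This is the default order. Setting the network's firewall policy enforcement order to BEFORE_CLASSIC_FIREWALL moves the network firewall policies ahead of the legacy VPC rules.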
Within any single policy, rules are sorted by priority (lower number wins; 0–2147483643). The first rule that matches the packet determines the action — unless that rule uses the special goto_next action, which delegates the decision to the next level down. So a hierarchical rule can either decide (allow/deny) or explicitly defer (goto_next).
The mental model that keeps you sane: hierarchical policies are for guardrails you never want a project owner to override (deny). Use goto_next for everything you want lower levels to be able to decide. A baseline egress-deny that does not use goto_next is final and cannot be re-opened by a VPC-level allow.
This last point is the crux of egress control. If you put a low-priority (high-number) hierarchical “deny all egress” rule that is final, project teams literally cannot open egress without you editing the hierarchical firewall policy. If you make it goto_next, they can.
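The difference between a final deny and a goto_next can be sketched as a toy model in shell. This is an illustration only: the rule format (priority|action|prefix), the helper names evaluate_policy and evaluate_stack, and the use of string-prefix matching in place of real CIDR matching are all assumptions made for the example, not anything GCP exposes.

```shell
#!/bin/sh
# Toy model of first-match evaluation with goto_next (hypothetical helpers).
# A rule is "priority|action|prefix". The prefix is a plain string prefix of
# the destination IP, standing in for CIDR matching; an empty prefix matches all.

# evaluate_policy DEST RULE...
# Prints the action of the first matching rule (lowest priority number first),
# or goto_next when no rule in this policy matches.
evaluate_policy() {
  dest=$1; shift
  sorted=$(printf '%s\n' "$@" | sort -t'|' -k1,1n)
  while IFS='|' read -r priority action prefix; do
    [ -n "$action" ] || continue          # skip blank input lines
    case "$dest" in
      "$prefix"*) echo "$action"; return 0 ;;
    esac
  done <<EOF
$sorted
EOF
  echo "goto_next"
}

# evaluate_stack DEST LEVEL...
# Each LEVEL is one policy: a space-separated list of rules, root level first.
evaluate_stack() {
  stack_dest=$1; shift
  for level in "$@"; do
    # Intentional word splitting: each word of $level is one rule.
    verdict=$(evaluate_policy "$stack_dest" $level)
    [ "$verdict" = "goto_next" ] || { echo "$verdict"; return 0; }
  done
  echo "allow"   # implied allow-egress rule at the very end
}

folder='1000|goto_next|10. 1100|allow|199.36.153. 2147483643|deny|'
vpc='500|allow|10.1. 600|allow|203.0.113.'

evaluate_stack 10.1.2.3 "$folder" "$vpc"       # folder defers, VPC level allows
evaluate_stack 203.0.113.10 "$folder" "$vpc"   # folder deny is final; VPC allow never runs
```

In the second call the VPC-level allow for 203.0.113. exists but is never consulted, which is the guardrail property the baseline relies on.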
2. Authoring an org/folder baseline egress deny
Create a hierarchical firewall policy and associate it with a folder (associating at the org node is also valid; folders give you blast-radius control during rollout). Hierarchical policies live under gcloud compute firewall-policies and require --organization for scope.
ORG_ID=123456789012
FOLDER_ID=987654321098
# 1. Create the policy container
gcloud compute firewall-policies create \
--organization="$ORG_ID" \
--short-name="egress-baseline" \
--description="Org baseline: default-deny egress with explicit allowlist"
# Capture the generated policy ID (numeric)
POLICY=$(gcloud compute firewall-policies list \
--organization="$ORG_ID" \
--format="value(name)" \
--filter="shortName=egress-baseline")
Now add rules. Lower priority numbers are evaluated first, so give the explicit allows smaller numbers than the deny. A common baseline: allow egress to RFC 1918 internal ranges and to Google APIs, then deny the rest.
# Defer egress to internal RFC 1918 ranges (goto_next -> VPC-level policies decide)
gcloud compute firewall-policies rules create 1000 \
--firewall-policy="$POLICY" --organization="$ORG_ID" \
--direction=EGRESS --action=goto_next \
--layer4-configs=all \
--dest-ip-ranges=10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
--enable-logging
# Allow egress to Google's restricted VIP for Private Google Access
gcloud compute firewall-policies rules create 1100 \
--firewall-policy="$POLICY" --organization="$ORG_ID" \
--direction=EGRESS --action=allow \
--layer4-configs=tcp:443 \
--dest-ip-ranges=199.36.153.4/30 \
--enable-logging
# Final baseline: deny all other egress (NOT goto_next -> this is the guardrail)
gcloud compute firewall-policies rules create 2147483643 \
--firewall-policy="$POLICY" --organization="$ORG_ID" \
--direction=EGRESS --action=deny \
--layer4-configs=all \
--dest-ip-ranges=0.0.0.0/0 \
--enable-logging
Associate the policy with the folder. Until you associate it, it enforces nothing:
gcloud compute firewall-policies associations create \
--firewall-policy="$POLICY" --organization="$ORG_ID" \
--folder="$FOLDER_ID" \
--name="egress-baseline-folder-assoc"
The 199.36.153.4/30 range is the restricted.googleapis.com VIP — the entry point for Private Google Access when you want to block all general internet egress but still reach Google APIs. (private.googleapis.com is 199.36.153.8/30; restricted is the stricter one that excludes APIs with no VPC-SC support.)
3. Targets: secure tags vs network tags vs service accounts
A firewall rule narrows what it applies to via targets. You have three options, and they are not interchangeable.
| Target type | Where it works | IAM-governed | Notes |
|---|---|---|---|
| Network tags | VPC firewall rules only | No | Free-text strings on the instance; any editor can add one. Weakest control. Not usable in firewall policies. |
| Service accounts | VPC rules, network + hierarchical policies | Yes (via SA usage) | Strong, but each instance has exactly one attached service account, so firewall scope becomes tied to workload identity. |
| Secure tags | Network + hierarchical policies | Yes (Tag User/Admin roles) | Key/value resource-manager tags, IAM-bound. The modern recommendation. |
For hierarchical and network firewall policies, prefer secure tags (resource-manager tags), referenced as --target-secure-tags. They are governed by IAM (a user needs roles/resourcemanager.tagUser to bind one), so a developer cannot silently opt their VM into a privileged rule the way they can with a network tag.
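# Create a tag key/value (resource-manager tags).
# Tags used by firewall policies must be created with the GCE_FIREWALL purpose;
# organization=auto makes the key usable across every VPC network in the org.
gcloud resource-manager tags keys create egress-tier \
--parent="organizations/$ORG_ID" \
--purpose=GCE_FIREWALL \
--purpose-data=organization=auto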
KEY=$(gcloud resource-manager tags keys list \
--parent="organizations/$ORG_ID" \
--format="value(name)" --filter="shortName=egress-tier")
gcloud resource-manager tags values create allow-internet \
--parent="$KEY"
# Reference it as a firewall target (full value resource name)
VALUE=$(gcloud resource-manager tags values list --parent="$KEY" \
--format="value(name)" --filter="shortName=allow-internet")
gcloud compute firewall-policies rules create 900 \
--firewall-policy="$POLICY" --organization="$ORG_ID" \
--direction=EGRESS --action=allow \
--layer4-configs=tcp:443 \
--dest-ip-ranges=0.0.0.0/0 \
--target-secure-tags="$VALUE" \
--enable-logging
Now only instances explicitly bound to the egress-tier=allow-internet tag can reach the internet on 443; everything else hits the deny-all. Tag binding to a VM is a separate gcloud resource-manager tags bindings create operation against the instance’s resource name.
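A minimal sketch of that binding step follows. The helper names are hypothetical, and the assumption here is that the parent is given as the full Compute Engine resource name built from the project number, zone, and numeric instance ID; check the identifiers against your own project before running it.

```shell
#!/bin/sh
# Build the full resource name that a tag binding expects for a VM.
# Arguments: project number, zone, numeric instance ID (all placeholders).
instance_resource_name() {
  printf '//compute.googleapis.com/projects/%s/zones/%s/instances/%s\n' "$1" "$2" "$3"
}

# Bind a tag value (tagValues/NNN) to one VM. The binding for a zonal
# resource is created in that zone, hence --location.
bind_tag_to_instance() {
  tag_value=$1; project_number=$2; zone=$3; instance_id=$4
  gcloud resource-manager tags bindings create \
    --tag-value="$tag_value" \
    --parent="$(instance_resource_name "$project_number" "$zone" "$instance_id")" \
    --location="$zone"
}

# Example with placeholder IDs:
# bind_tag_to_instance "$VALUE" 123456789012 us-central1-a 1234567890123456789
```

Because the binding itself is an IAM-checked operation, this is the step where the Tag User role actually gates who can put a VM into the allow-internet tier.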
4. Cloud NAT for deterministic, allowlistable IPs
Even with controlled egress, traffic that does leave needs a stable source IP so downstream partners can allowlist you. Default Cloud NAT auto-allocates ephemeral IPs that change. For deterministic egress, reserve static external addresses and pin them.
REGION=us-central1
ROUTER=nat-router
gcloud compute routers create "$ROUTER" \
--network=prod-vpc --region="$REGION"
# Reserve static, allowlistable external IPs
gcloud compute addresses create nat-ip-1 nat-ip-2 --region="$REGION"
gcloud compute routers nats create prod-nat \
--router="$ROUTER" --region="$REGION" \
--nat-external-ip-pool=nat-ip-1,nat-ip-2 \
--nat-all-subnet-ip-ranges \
--enable-logging
Two operational dials matter for scale:
- Manual NAT (--nat-external-ip-pool) gives you fixed IPs to hand to partners, but you must size the pool: each NAT IP provides 64,512 source ports, and Cloud NAT pre-allocates a per-VM minimum.
- --min-ports-per-vm (default 64) controls how many ports each VM reserves up front. Lower it to pack more VMs per IP; raise it for chatty workloads to avoid port exhaustion. With dynamic port allocation (--enable-dynamic-port-allocation), NAT scales a VM between --min-ports-per-vm and --max-ports-per-vm on demand, which is the better default for bursty fleets.
gcloud compute routers nats update prod-nat \
--router="$ROUTER" --region="$REGION" \
--min-ports-per-vm=64 --max-ports-per-vm=1024 \
--enable-dynamic-port-allocation
Capacity math you should do before go-live: IPs * 64512 / min-ports-per-vm = max concurrent VMs. Two IPs at 64 min-ports support 2,016 VMs at minimum allocation (2 * 64512 / 64). Under-size this and you get the most common NAT outage there is: port exhaustion under load, which presents as intermittent connection failures, not a clean error.
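The formula is easy to script so it can sit in a pre-rollout check. The sketch below uses hypothetical helper names and plain integer arithmetic; it models only the static minimum allocation, not dynamic port allocation.

```shell
#!/bin/sh
# Usable source ports per NAT IP, as stated in the capacity formula.
PORTS_PER_NAT_IP=64512

# nat_max_vms IPS MIN_PORTS: VMs supported at minimum allocation.
nat_max_vms() {
  echo $(( $1 * PORTS_PER_NAT_IP / $2 ))
}

# nat_ips_needed VMS MIN_PORTS: smallest pool that covers VMS (rounds up).
nat_ips_needed() {
  echo $(( ($1 * $2 + PORTS_PER_NAT_IP - 1) / PORTS_PER_NAT_IP ))
}

nat_max_vms 2 64        # prints 2016
nat_ips_needed 3000 64  # prints 3
```

Running the inverse calculation (nat_ips_needed) against the planned fleet size, rather than the current one, is the more useful habit, since the pool has to be on partner allowlists before the fleet grows into it.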
5. Eliminating internet egress with Private Google Access and PSC
The strongest egress posture is one where workloads never touch the internet for Google services. Two mechanisms:
Private Google Access (PGA) lets VMs without external IPs reach Google APIs over internal routing. Enable it per-subnet and route the restricted VIP:
gcloud compute networks subnets update prod-subnet \
--region="$REGION" --enable-private-ip-google-access
# Route the restricted VIP via the default internet gateway (internal path)
gcloud compute routes create restricted-google-apis \
--network=prod-vpc \
--destination-range=199.36.153.4/30 \
--next-hop-gateway=default-internet-gateway \
--priority=1000
Pair this with a private DNS zone for googleapis.com that maps restricted.googleapis.com to the four addresses in 199.36.153.4/30 (an A record) and *.googleapis.com to restricted.googleapis.com (a CNAME), so SDK calls resolve to the restricted VIP rather than public endpoints. With PGA + restricted VIP + the firewall allow from step 2, a VM can call Storage, BigQuery, and Pub/Sub with zero internet egress.
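One way to create that zone is sketched below. The zone name googleapis-private, the TTL, and the run/create_restricted_dns helpers are assumptions for illustration; the DRY_RUN switch only prints the commands so they can be reviewed before anything is created.

```shell
#!/bin/sh
# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# create_restricted_dns ZONE NETWORK
# Private zone for googleapis.com that steers every API hostname to the
# restricted VIP (199.36.153.4/30 = .4 through .7).
create_restricted_dns() {
  zone=$1; network=$2
  run gcloud dns managed-zones create "$zone" \
    --dns-name="googleapis.com." --visibility=private \
    --networks="$network" \
    --description="Resolve Google APIs to the restricted VIP"
  run gcloud dns record-sets create "restricted.googleapis.com." \
    --zone="$zone" --type=A --ttl=300 \
    --rrdatas="199.36.153.4,199.36.153.5,199.36.153.6,199.36.153.7"
  run gcloud dns record-sets create "*.googleapis.com." \
    --zone="$zone" --type=CNAME --ttl=300 \
    --rrdatas="restricted.googleapis.com."
}

# Preview without changing anything:
# DRY_RUN=1; create_restricted_dns googleapis-private prod-vpc
```

Other Google-owned domains a workload depends on (for example container registries) would need equivalent zones; which ones apply depends on the services in use.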
Private Service Connect (PSC) extends the same idea to published services and supported Google APIs via a consumer endpoint with an IP inside your VPC:
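# PSC endpoints for Google APIs need a global internal address with the
# PRIVATE_SERVICE_CONNECT purpose; the IP must sit outside every subnet range in the VPC.
gcloud compute addresses create psc-googleapis-ip \
--global \
--purpose=PRIVATE_SERVICE_CONNECT \
--network=prod-vpc \
--addresses=10.0.0.50
# Endpoint names for Google APIs: 1-20 lowercase letters and digits, no hyphens
gcloud compute forwarding-rules create pscgoogleapis \
--global \
--network=prod-vpc \
--address=psc-googleapis-ip \
--target-google-apis-bundle=all-apis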
Now all-apis is reachable at 10.0.0.50 — a routable internal IP you can allowlist in firewall rules and point DNS at, with no dependency on the Google VIP ranges at all.
6. Enabling and reading firewall + NAT logs
You cannot audit what you cannot see. Note the --enable-logging flag on every rule above — for policy rules, logging is per-rule. Firewall logs land in Cloud Logging under the compute.googleapis.com/firewall log; NAT logs require enabling on the NAT gateway and can filter to errors only (translation failures, dropped packets) to keep volume sane.
# What hit the baseline deny in the last hour?
gcloud logging read \
'logName=~"compute.googleapis.com%2Ffirewall"
AND jsonPayload.disposition="DENIED"' \
--freshness=1h \
--format="table(timestamp, jsonPayload.connection.dest_ip,
jsonPayload.connection.dest_port,
jsonPayload.rule_details.reference)"
# NAT allocation errors (port exhaustion shows here)
gcloud logging read \
'resource.type="nat_gateway"
AND jsonPayload.allocation_status="DROPPED"' \
--freshness=1h --format=json
For NAT, you can scope logging to errors only to control cost:
gcloud compute routers nats update prod-nat \
--router="$ROUTER" --region="$REGION" \
--enable-logging --log-filter=ERRORS_ONLY
The rule_details.reference field tells you which policy and rule matched — invaluable when three layers are in play. Route a sink for disposition="DENIED" to BigQuery for longer-term egress audit.
7. Test the stack with Connectivity Tests before rollout
Network Intelligence Center’s Connectivity Tests run the configuration-plane analysis and a live data-plane probe across your firewall layers, so you can validate a rule stack before it bites. This is the single highest-leverage pre-rollout step.
gcloud network-management connectivity-tests create egress-to-partner \
--source-instance=projects/PROJ/zones/us-central1-a/instances/app-vm \
--destination-ip-address=203.0.113.10 \
--destination-port=443 \
--protocol=TCP
gcloud network-management connectivity-tests describe egress-to-partner \
--format="value(reachabilityDetails.result)"
A result of REACHABLE confirms the path; UNREACHABLE returns the trace with the exact dropping rule (including hierarchical policy rules), which removes the guesswork. Run one test per intended egress flow and one negative test (expect UNREACHABLE) to prove the baseline deny actually blocks.
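The positive and negative checks can be wrapped in one small assertion helper so a rollout script fails loudly on a mismatch. This is a sketch: check_flow is a hypothetical name, and it assumes the tests already exist and that describe returns the result field used above.

```shell
#!/bin/sh
# check_flow NAME EXPECTED
# Compare a Connectivity Test result with what the design says it should be:
# REACHABLE for intended flows, UNREACHABLE for the negative test.
check_flow() {
  name=$1; expected=$2
  actual=$(gcloud network-management connectivity-tests describe "$name" \
    --format="value(reachabilityDetails.result)")
  if [ "$actual" = "$expected" ]; then
    echo "PASS $name: $actual"
  else
    echo "FAIL $name: expected $expected, got $actual"
    return 1
  fi
}

# Example:
# check_flow egress-to-partner REACHABLE
```

Treating an unexpected REACHABLE on the negative test as a failure is what turns the baseline deny from an intention into something verified.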
Enterprise scenario
A payments platform team rolled out the folder-level baseline deny across ~40 projects, validated every Connectivity Test, and shipped. Two weeks later a regional GKE fleet started throwing intermittent dial tcp: i/o timeout to an external KYC partner — but only under load, never in the canary. The baseline was fine; the partner’s allowlist was fine. The culprit was Cloud NAT port exhaustion that the per-flow Connectivity Tests structurally could not catch.
The fleet ran behind a single NAT IP at the default 64 min-ports-per-vm. The capacity formula said 64512 / 64 = 1008 VMs, comfortably above the node count. What they missed: Cloud NAT allocates ports per VM, so every pod on a GKE node shares that node's 64-port allocation, and a chatty workload opening many short-lived TLS connections to one destination IP:port exhausts it fast, because a source port cannot be reused for the same destination while an earlier connection still holds it. The nat_gateway logs made it unambiguous: allocation_status="DROPPED" spiking only at peak.
The fix was dynamic port allocation with a much higher per-VM ceiling, plus a second reserved IP (both still on the partner allowlist). Dynamic port allocation and Endpoint-Independent Mapping cannot be enabled on the same gateway, so the update explicitly keeps the latter off:
gcloud compute addresses create nat-ip-2 --region="$REGION"
gcloud compute routers nats update prod-nat \
--router="$ROUTER" --region="$REGION" \
--nat-external-ip-pool=nat-ip-1,nat-ip-2 \
--min-ports-per-vm=64 --max-ports-per-vm=4096 \
--enable-dynamic-port-allocation \
--no-enable-endpoint-independent-mapping
The lasting lesson: a REACHABLE Connectivity Test proves the path, never the capacity. They added a dedicated alert on allocation_status="DROPPED" to the rollout checklist permanently.
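A log-based metric is one way to give that alert something to watch. The sketch below assumes NAT logging is enabled with at least errors captured; the metric name and helper are hypothetical, and the alerting policy that fires on the metric still has to be created separately.

```shell
#!/bin/sh
# Same filter used to read the NAT logs by hand.
NAT_DROP_FILTER='resource.type="nat_gateway" AND jsonPayload.allocation_status="DROPPED"'

# create_nat_drop_metric NAME: counter metric over dropped NAT allocations.
create_nat_drop_metric() {
  gcloud logging metrics create "$1" \
    --description="Cloud NAT connections dropped for lack of ports" \
    --log-filter="$NAT_DROP_FILTER"
}

# Example:
# create_nat_drop_metric nat_port_drops
```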
Verify
Confirm the end-to-end posture from a workload instance and the control plane.
# 1. From an allowed VM: Google API over the private path succeeds
gcloud compute ssh app-vm --zone=us-central1-a --tunnel-through-iap \
--command="curl -sS -o /dev/null -w '%{http_code}\n' https://storage.googleapis.com"
# 2. From a non-allowlisted VM: generic internet egress is denied (expect timeout/fail)
gcloud compute ssh locked-vm --zone=us-central1-a --tunnel-through-iap \
--command="curl -sS --max-time 5 https://example.com || echo BLOCKED_AS_EXPECTED"
# 3. Confirm the NAT source IP is your reserved, allowlistable address
gcloud compute ssh app-vm --zone=us-central1-a --tunnel-through-iap \
--command="curl -sS https://api.ipify.org"
# 4. Confirm associations are live
gcloud compute firewall-policies associations list --organization="$ORG_ID"
Step 3 should return one of nat-ip-1/nat-ip-2. If it returns something else, your route or NAT scope is wrong.
Rollout checklist
Pitfalls
- Asymmetric egress paths. Cloud NAT only translates flows initiated from inside the VPC, and any VM interface with its own external IP bypasses it entirely. If a workload needs inbound connections, NAT will not provide them; you need a load balancer or an external IP. Mixing the two on one subnet means some VMs egress from their own external IP instead of the reserved NAT pool, and the partner's allowlist drops them with no error on your side.
- Port exhaustion. The default 64 min-ports plus a small IP pool caps your fleet far lower than people expect. Do the capacity math, enable dynamic port allocation, and alarm on allocation_status="DROPPED".
- Shadowed rules. A lower-numbered allow 0.0.0.0/0 placed above a narrower deny makes the deny unreachable, because the broad rule wins on priority. The Connectivity Tests trace and the rule_details.reference field in firewall logs are how you catch which rule actually fired versus which you thought would.
- goto_next confusion. A hierarchical deny set to goto_next is not a deny; it defers. Reserve goto_next for allows you want lower layers to refine; make guardrail denies final.
- DNS, not just firewall. Private Google Access without the DNS override still resolves googleapis.com to public IPs and breaks once you tighten egress. PGA is firewall plus routing plus DNS, all three.
Build it folder-first, prove every flow with a Connectivity Test, and keep the deny final. That combination gives you egress that is controlled, deterministic, and — most importantly when something breaks at 2 a.m. — readable.