Centralized Egress Inspection with AWS Network Firewall: Routing, Domain Filtering, and Suricata Rules

“Block all outbound except an allow-list” is one of those controls that auditors love and engineers underestimate. The hard part is not the firewall rule; it is getting every packet from forty spoke VPCs to traverse one inspection point symmetrically, surviving asymmetric routing, NAT, and the stateful-flow assumptions a Suricata engine makes. This guide builds a centralized AWS Network Firewall egress chokepoint behind a Transit Gateway, then layers domain allow-lists, custom IPS signatures, and the operational tuning that keeps it from becoming a pager magnet.

Everything here assumes you already have a hub-and-spoke Transit Gateway (TGW) and a non-overlapping CIDR plan. If you do not, build that first.

1. The inspection architecture

The pattern is a dedicated inspection VPC that owns internet egress for the whole estate. Spokes have no NAT gateway and no internet route of their own. All outbound traffic is forced to the TGW, which hairpins it into the inspection VPC, through the firewall, out a NAT gateway, and to the internet.

Spoke VPCs (no IGW/NAT)
      |  VPC route table: 0.0.0.0/0 -> TGW
   [ Transit Gateway ]   <- spoke TGW route table: 0.0.0.0/0 -> inspection attachment
      |  (appliance mode ON for the inspection attachment)
[ Inspection VPC ]
   TGW attachment subnet -> AWS Network Firewall endpoint (per AZ)
   firewall subnet       -> NAT Gateway (per AZ)
   NAT/public subnet     -> Internet Gateway
      |
   Internet

Three subnet tiers per AZ inside the inspection VPC, each with its own route table:

Subnet	Route table sends 0.0.0.0/0 to	Purpose
TGW attachment subnet	Firewall endpoint (`vpce-...`)	Forces TGW-delivered traffic into the firewall
Firewall subnet	NAT Gateway	Post-inspection traffic egresses via NAT
NAT/public subnet	Internet Gateway	NAT reaches the internet

The non-obvious move is the TGW attachment subnet route table: its default route points at the firewall VPC endpoint, not the NAT or IGW. That is what guarantees inspection happens before egress. Return traffic from the internet lands on the NAT gateway. The NAT/public subnet route table sends the spoke CIDRs (10.0.0.0/8 or your supernet) to the firewall endpoint in the same AZ, and the firewall subnet route table sends them on to the TGW, so replies get a second pass through the firewall on the way home.
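
The routing above can be sketched as a toy next-hop model. This is a minimal illustration, not an AWS API: the hop names, the ROUTES table, and the 10.0.0.0/8 supernet are assumptions chosen to mirror the three route tables described in this section.

```python
import ipaddress

# Assumed spoke supernet; substitute your own CIDR plan.
SPOKES = ipaddress.ip_network("10.0.0.0/8")

# Per-subnet route tables: destination class -> next hop (illustrative names).
ROUTES = {
    "tgw_attach_subnet": {"internet": "firewall"},
    "firewall_subnet": {"internet": "nat", "spoke": "tgw"},
    "public_subnet": {"internet": "igw", "spoke": "firewall"},
}

# Which subnet's route table applies when traffic leaves a given hop.
LEAVES_FROM = {
    "tgw": "tgw_attach_subnet",
    "firewall": "firewall_subnet",
    "nat": "public_subnet",
}

# Hops at which traffic leaves the inspection VPC.
EXITS = {"igw", "tgw"}


def trace(start: str, dst_ip: str) -> list[str]:
    """Follow next hops from `start` until traffic exits the inspection VPC."""
    dst = "spoke" if ipaddress.ip_address(dst_ip) in SPOKES else "internet"
    path, hop = [start], start
    while True:
        hop = ROUTES[LEAVES_FROM[hop]][dst]
        path.append(hop)
        if hop in EXITS:
            return path


if __name__ == "__main__":
    print(trace("tgw", "93.184.216.34"))  # outbound: delivered by the TGW
    print(trace("nat", "10.1.2.3"))       # reply: arrives at the NAT gateway
```

Both directions cross the firewall hop, which is the property the TGW attachment subnet route exists to guarantee.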

AWS shipped a native Network Firewall Transit Gateway attachment in July 2025 that lets you attach a firewall directly to a TGW and skip the hand-built inspection VPC, subnets, and route tables. It is the right default for greenfield. I am building the explicit VPC model here because it is what most existing estates run, it makes the packet flow legible, and the rule-group and logging mechanics are identical either way. Treat the native attachment as a simplification of the routing in this section, not of the rules in the rest.

2. Appliance mode and the symmetric-routing problem

Network Firewall is stateful. A flow’s SYN, its data, and its return packets must all hit the same firewall endpoint, or the engine sees half a conversation and drops it. With a multi-AZ TGW and multiple firewall endpoints, the default TGW behavior can send the forward path through AZ-a’s endpoint and the return path through AZ-b’s. That asymmetry breaks stateful inspection.

The fix is TGW appliance mode, enabled on the inspection VPC attachment. Appliance mode makes the TGW pick one endpoint via a flow hash and pin both directions of the connection to it for the flow’s lifetime.

aws ec2 modify-transit-gateway-vpc-attachment \
  --transit-gateway-attachment-id tgw-attach-0abc123def4567890 \
  --options ApplianceModeSupport=enable

In Terraform it is one argument on the attachment:

resource "aws_ec2_transit_gateway_vpc_attachment" "inspection" {
  subnet_ids             = [for s in aws_subnet.tgw_attach : s.id]
  transit_gateway_id     = aws_ec2_transit_gateway.hub.id
  vpc_id                 = aws_vpc.inspection.id
  appliance_mode_support = "enable"   # <-- the whole ballgame for stateful symmetry
}

Forgetting appliance_mode_support is the single most common cause of “the firewall randomly drops long-lived connections.” Symptoms are intermittent: short HTTP requests that complete inside one packet exchange may pass, while large downloads or persistent gRPC streams die mid-flight. If you see that pattern, check appliance mode before you touch a single rule.
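
To make the flow-hash idea concrete, here is a minimal sketch of a direction-independent hash. It is an assumption for illustration only: the Transit Gateway's actual hashing algorithm is not described in this guide, and pick_az and the AZ names are hypothetical. The point is only that sorting the two endpoints before hashing gives the forward and return packets the same key.

```python
import hashlib


def pick_az(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
            proto: str, azs: list[str]) -> str:
    """Map a flow to one AZ so that A->B and B->A land on the same endpoint."""
    # Sort the two (ip, port) endpoints so direction does not change the key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{proto}|{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    digest = hashlib.sha256(key).digest()
    return azs[int.from_bytes(digest[:4], "big") % len(azs)]


if __name__ == "__main__":
    azs = ["az-a", "az-b", "az-c"]  # hypothetical AZ labels
    forward = pick_az("10.1.2.3", 44321, "93.184.216.34", 443, "tcp", azs)
    reverse = pick_az("93.184.216.34", 443, "10.1.2.3", 44321, "tcp", azs)
    print(forward == reverse)
```

A hash over the unsorted 5-tuple would not have this property, which is the asymmetry appliance mode is there to remove.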

You also want separate TGW route tables for spokes and inspection so you do not create a routing loop. The route table associated with the spoke attachments carries a static 0.0.0.0/0 route to the inspection attachment; the route table associated with the inspection attachment receives the propagated spoke CIDRs so return traffic finds its way home.

3. Stateless vs stateful rule groups and evaluation order

A firewall policy references two engines. Packets always hit the stateless engine first.

Stateless engine — packet-by-packet, 5-tuple match, no flow context. Use it almost exclusively to forward everything to the stateful engine. The default action you want is aws:forward_to_sfe. Resist the temptation to drop here; you lose logging fidelity and flow visibility.

Stateful engine — Suricata-compatible, flow-aware, sees application-layer fields like TLS SNI and HTTP host. This is where egress policy lives.

The decision that will haunt you if you get it wrong is stateful rule evaluation order:

Mode	Behavior	When
`DEFAULT_ACTION_ORDER`	Pass rules, then drop, then reject, then alert (Suricata action precedence, not your file order)	Legacy default; surprising for allow-lists
`STRICT_ORDER`	Rule groups by `priority`, rules within a group top-to-bottom	What you want for a deny-by-default egress policy

With default action order, a drop you place last still loses to any pass because Suricata evaluates all pass actions before any drop. That makes a clean “allow these domains, drop everything else” policy nearly impossible to reason about. Use STRICT_ORDER. It also unlocks a real default action, so the policy itself denies traffic that matches nothing.

resource "aws_networkfirewall_firewall_policy" "egress" {
  name = "central-egress"

  firewall_policy {
    stateful_engine_options {
      rule_order              = "STRICT_ORDER"
      stream_exception_policy = "DROP"
    }

    # Stateless engine: do nothing clever, forward to stateful.
    stateless_default_actions             = ["aws:forward_to_sfe"]
    stateless_fragment_default_actions    = ["aws:forward_to_sfe"]

    # Deny-by-default at the policy level (STRICT_ORDER only).
    stateful_default_actions = ["aws:drop_established", "aws:alert_established"]

    stateful_rule_group_reference {
      priority     = 100
      resource_arn = aws_networkfirewall_rule_group.domain_allowlist.arn
    }
    stateful_rule_group_reference {
      priority     = 200
      resource_arn = aws_networkfirewall_rule_group.suricata_ips.arn
    }
  }
}

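aws:drop_established drops packets on established flows that match no rule while still letting the TCP handshake through, so the engine gets to see the ClientHello or HTTP request that your application-layer rules match on. aws:alert_established logs those drops so you can see exactly what you broke. stream_exception_policy = "DROP" decides what happens to flows that break mid-stream (e.g., after a deploy swaps the firewall): drop them rather than fail open. Lower priority numbers evaluate first, so your allow-list (100) runs before your custom IPS group (200). Keep in mind that a matching pass rule ends evaluation for that flow, so group 200 only inspects traffic the allow-list did not already pass.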
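
The difference between the two evaluation orders is easier to see in a toy model. The sketch below is an assumption-laden simplification, not the Suricata engine: Rule, verdict, and the example domains are hypothetical, rules match only on an SNI string, and alert is treated as non-terminal.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    action: str                     # "pass", "drop", "reject" or "alert"
    matches: Callable[[str], bool]  # predicate on the TLS SNI


# Action precedence used when the policy is NOT in strict order.
PRECEDENCE = {"pass": 0, "drop": 1, "reject": 2, "alert": 3}


def verdict(rules: list[Rule], sni: str, strict: bool) -> str:
    """Return the terminal action for `sni` under the chosen evaluation order."""
    if strict:
        ordered = rules  # file order is honored
        default = "drop"  # models a policy-level default drop
    else:
        # sorted() is stable, so file order only breaks ties within an action.
        ordered = sorted(rules, key=lambda r: PRECEDENCE[r.action])
        default = "pass"  # unmatched traffic is allowed in this mode
    for rule in ordered:
        if rule.action != "alert" and rule.matches(sni):
            return rule.action
    return default


RULES = [
    Rule("drop", lambda s: s.endswith(".evil.example.com")),  # specific deny first
    Rule("pass", lambda s: s.endswith(".example.com")),       # broad allow second
]

if __name__ == "__main__":
    for mode in (True, False):
        print("strict" if mode else "action order",
              verdict(RULES, "c2.evil.example.com", strict=mode))
```

In this model the specific drop wins only when file order is honored; under action precedence the broad pass is evaluated first and the drop never gets a chance.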

4. Domain-based egress filtering

You have two ways to express “only these domains.”

4a. Managed domain-list rule groups

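The simplest is a domain list rule group. You give it FQDNs and target types; Network Firewall compiles the Suricata rules for you. A leading dot is the wildcard: .amazonaws.com matches s3.amazonaws.com and sts.amazonaws.com. Because the policy in Section 3 uses STRICT_ORDER, the rule group has to declare the same rule order or it cannot be attached to that policy.

{
  "RulesSource": {
    "RulesSourceList": {
      "Targets": [".amazonaws.com", ".github.com", "registry.npmjs.org"],
      "TargetTypes": ["TLS_SNI", "HTTP_HOST"],
      "GeneratedRulesType": "ALLOWLIST"
    }
  },
  "StatefulRuleOptions": { "RuleOrder": "STRICT_ORDER" }
}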

aws network-firewall create-rule-group \
  --rule-group-name egress-allowlist \
  --type STATEFUL \
  --capacity 1000 \
  --rule-group file://allowlist.json

Under the hood, an ALLOWLIST for TLS_SNI generates rules along these lines (worth understanding, because they show the mechanism and its limits):

pass tls $HOME_NET any -> $EXTERNAL_NET any (ssl_state:client_hello; tls.sni; dotprefix; content:".amazonaws.com"; nocase; endswith; msg:"matching TLS allowlisted FQDNs"; priority:1; flow:to_server, established; sid:1; rev:1;)
drop tls $HOME_NET any -> $EXTERNAL_NET any (msg:"not matching any TLS allowlisted FQDNs"; priority:1; ssl_state:client_hello; flow:to_server, established; sid:3; rev:1;)

The filtering is on the TLS SNI for HTTPS and the HTTP Host header for cleartext HTTP. No decryption happens; the firewall reads the SNI from the unencrypted ClientHello. That is the central caveat: the SNI is client-supplied, so a client that sends an allow-listed SNI while connecting to an arbitrary IP is not blocked by SNI matching alone. Domain allow-lists are a guardrail against well-behaved clients and accidental data paths, not a hard control against a determined exfiltrator. If your threat model includes the latter, you need TLS inspection (decrypt) or IP-based egress control on top.
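
The dotprefix, nocase, and endswith keywords in the generated pass rule amount to a suffix match on the hostname with a dot prepended. A minimal sketch of that matching logic follows; sni_allowed is a hypothetical helper written for illustration, and it assumes non-wildcard targets are matched exactly.

```python
def sni_allowed(sni: str, targets: list[str]) -> bool:
    """Approximate domain-list matching for one SNI or Host value."""
    name = sni.lower()        # nocase
    dotted = "." + name       # dotprefix: prepend a dot before comparing
    for target in targets:
        target = target.lower()
        if target.startswith("."):
            # Wildcard entry: suffix match on the dot-prefixed name, so the
            # apex and every subdomain match but look-alike names do not.
            if dotted.endswith(target):
                return True
        elif name == target:
            # Explicit entry: exact match only.
            return True
    return False


TARGETS = [".amazonaws.com", ".github.com", "registry.npmjs.org"]

if __name__ == "__main__":
    for host in ("s3.eu-west-1.amazonaws.com", "evilamazonaws.com", "example.org"):
        print(host, sni_allowed(host, TARGETS))
```

The prepended dot is what stops evilamazonaws.com from matching a plain suffix comparison against amazonaws.com.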

4b. Hand-written SNI rules with strict order

When you need more than allow/deny — say, “allow checkip.amazonaws.com only if the server certificate issuer is Amazon” — drop to raw Suricata in a STRICT_ORDER group. AWS documents this exact pattern:

alert tls $HOME_NET any -> $EXTERNAL_NET 443 (ssl_state:client_hello; tls.sni; content:"checkip.amazonaws.com"; endswith; nocase; xbits:set, allowed_sni_destination_ips, track ip_dst, expire 3600; noalert; sid:238745;)
pass tcp $HOME_NET any -> $EXTERNAL_NET 443 (xbits:isset, allowed_sni_destination_ips, track ip_dst; flow: stateless; sid:89207006;)
pass tls $EXTERNAL_NET 443 -> $HOME_NET any (tls.cert_issuer; content:"Amazon"; msg:"Pass rules do not alert"; xbits:isset, allowed_sni_destination_ips, track ip_src; sid:29822;)
reject tls $EXTERNAL_NET 443 -> $HOME_NET any (tls.cert_issuer; content:"="; nocase; msg:"Block all other cert issuers not allowed by sid:29822"; sid:897972;)

A few rules that earn their place in any egress policy:

# Block TLS connecting straight to an IP literal in the SNI (skipped DNS — classic exfil/C2).
reject tls $HOME_NET any -> $EXTERNAL_NET any (ssl_state:client_hello; tls.sni; content:"."; pcre:"/^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$/"; msg:"IP in TLS SNI"; flow:to_server; sid:1239848;)

# Block deprecated TLS.
reject tls any any -> any any (msg:"TLS 1.0 or 1.1"; ssl_version:tls1.0,tls1.1; sid:2023070518;)

# Allow DNS queries only for allow-listed domains (drop all other DNS elsewhere in the policy).
pass dns $HOME_NET any -> $EXTERNAL_NET any (dns.query; dotprefix; content:".amazonaws.com"; endswith; nocase; msg:"Pass rules do not alert"; sid:118947;)
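
Before shipping a pcre-based rule it helps to exercise the pattern offline. The sketch below reuses the expression from the IP-in-SNI rule with Python's re module, on the assumption that this simple pattern behaves the same in both regex engines; is_ip_literal is a hypothetical helper.

```python
import re

# Same expression as the pcre option in the IP-in-SNI rule above.
IP_IN_SNI = re.compile(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")


def is_ip_literal(sni: str) -> bool:
    """True when the SNI looks like a dotted-quad rather than a hostname."""
    # The rule's content:"." acts as a cheap pre-filter before the regex runs.
    return "." in sni and IP_IN_SNI.search(sni) is not None


if __name__ == "__main__":
    for value in ("93.184.216.34", "api.github.com", "1.2.3.4.example.com"):
        print(value, is_ip_literal(value))
```

Note that the pattern checks shape, not validity: 999.999.999.999 also matches, which is harmless for a block rule.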

Note $HOME_NET. By default Network Firewall sets HOME_NET to the firewall’s own VPC CIDR. In a centralized model the source traffic originates in spoke CIDRs, so you must override HOME_NET in the rule group’s variables to include the full supernet (e.g. 10.0.0.0/8), or your $HOME_NET-anchored rules silently never match.

rule_variables {
  ip_sets {
    key = "HOME_NET"
    ip_set { definition = ["10.0.0.0/8"] }   # spoke supernet, not the inspection VPC
  }
}
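
Because a HOME_NET that misses a spoke fails silently, a small pre-deploy check is cheap insurance. This is a sketch using only the standard library; uncovered_spokes and the example CIDRs are hypothetical, and it assumes IPv4 ranges throughout.

```python
import ipaddress


def uncovered_spokes(spoke_cidrs: list[str], home_net: list[str]) -> list[str]:
    """Return the spoke CIDRs that no HOME_NET entry fully contains."""
    nets = [ipaddress.ip_network(cidr) for cidr in home_net]
    missing = []
    for cidr in spoke_cidrs:
        spoke = ipaddress.ip_network(cidr)
        if not any(spoke.subnet_of(net) for net in nets):
            missing.append(cidr)
    return missing


if __name__ == "__main__":
    spokes = ["10.1.0.0/16", "10.2.0.0/16", "172.16.5.0/24"]
    print(uncovered_spokes(spokes, ["10.0.0.0/8"]))
```

An empty result means every spoke source address will match $HOME_NET-anchored rules.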

5. Custom Suricata IPS signatures and managed threat-intel groups

Egress filtering and intrusion prevention are different jobs sharing one engine. For IPS, lean on AWS Managed Rule Groups — AWS maintains threat-intelligence feeds (botnet C2 domains, known-malware IPs, emerging threats) you reference by ARN and never have to curate yourself.

aws network-firewall list-rule-groups --scope MANAGED \
  --query "RuleGroups[?contains(Name, 'ThreatSignatures')].[Name,Arn]" --output table

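Reference a managed group in the policy exactly like a custom one, but choose its priority deliberately. In STRICT_ORDER a matching pass rule ends evaluation for that flow, so a group that runs after the allow-list never sees traffic to allowed domains. Give it a lower priority number than the allow-list if those flows should still be checked against the signatures:

stateful_rule_group_reference {
  priority     = 50
  resource_arn = "arn:aws:network-firewall:eu-west-1:aws-managed:stateful-rulegroup/ThreatSignaturesBotnetStrictOrder"
}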

For estate-specific detections, write your own. A custom signature group can use either raw Suricata strings or, increasingly useful, AWS’s domain/URL category keywords, which classify destinations without you maintaining domain lists:

# Block known malicious and phishing categories regardless of allow-list.
drop http any any -> any any (msg:"Block malware/phishing"; aws_domain_category:Malware,Phishing; sid:55555556; rev:1;)

aws_domain_category evaluates the TLS SNI or HTTP host; aws_url_category evaluates full URLs but requires TLS inspection to see the path. Keep these in a separate STRICT_ORDER group so their priority relative to the allow-list is explicit.

Two hard constraints worth pinning to a wall: every stateful rule needs a unique sid across all groups in the policy, and each rule group is created with a fixed capacity (a units budget you cannot raise after creation — size it with headroom or you will be recreating groups).
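
The sid constraint is easy to enforce in CI before the rules ever reach AWS. Below is a minimal sketch of such a check; duplicate_sids, the group names, and the sample rule strings are hypothetical, and the regex assumes one rule per line.

```python
import re
from collections import defaultdict

# Matches the sid option itself; requiring the trailing ';' skips mentions
# of "sid:NNN" inside a msg string, which end with a quote instead.
SID = re.compile(r"\bsid\s*:\s*(\d+)\s*;")


def duplicate_sids(groups: dict[str, str]) -> dict[int, list[str]]:
    """Map each sid used more than once to the rule groups that use it."""
    seen: dict[int, list[str]] = defaultdict(list)
    for name, text in groups.items():
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            match = SID.search(line)
            if match:
                seen[int(match.group(1))].append(name)
    return {sid: names for sid, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    groups = {
        "allowlist": 'pass tls any any -> any any (msg:"ok"; sid:1;)\n'
                     'drop tls any any -> any any (msg:"no"; sid:3;)',
        "ips": 'drop tcp any any -> any any (msg:"dup"; sid:3; rev:1;)',
    }
    print(duplicate_sids(groups))
```

Failing the pipeline on a non-empty result catches collisions between teams that maintain separate rule groups.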

6. High availability, scaling, and endpoint placement

Network Firewall is a managed, horizontally-scaling service, but you own the AZ topology. Place one firewall endpoint per AZ that carries traffic, with a TGW attachment subnet, firewall subnet, and NAT in each. A firewall endpoint is a single point of failure for its AZ; spanning AZs is what gives you resilience.

Each endpoint scales automatically to tens of Gbps; there is no instance to size. Per-flow throughput is bounded, so a single elephant flow cannot saturate an endpoint, but it also cannot go faster than the per-flow ceiling; shard genuinely large transfers across connections.
Appliance mode (Section 2) keeps each flow pinned to its AZ’s endpoint. If an AZ fails, new flows hash to surviving endpoints; in-flight flows in the dead AZ are lost, which is the expected behavior for a stateful device.
Keep firewall, NAT, and TGW attachment subnets in the same AZ so traffic does not cross-AZ inside the inspection VPC. Cross-AZ data transfer is billable and adds latency for no benefit.
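
The same-AZ rule can be checked mechanically from whatever inventory you already have. The sketch below is one hedged way to do it; cross_az_hops and the subnet names are hypothetical, and it assumes you can list each route as a pair of source subnet and the subnet hosting its next hop.

```python
def cross_az_hops(subnet_az: dict[str, str],
                  next_hops: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (source subnet, next-hop subnet) pairs that cross an AZ boundary."""
    return [(src, dst) for src, dst in next_hops
            if subnet_az[src] != subnet_az[dst]]


if __name__ == "__main__":
    subnet_az = {
        "tgw-a": "az-a", "fw-a": "az-a", "nat-a": "az-a",
        "tgw-b": "az-b", "fw-b": "az-b", "nat-b": "az-b",
    }
    # The tgw-b route mistakenly targets the firewall endpoint in az-a.
    next_hops = [("tgw-a", "fw-a"), ("fw-a", "nat-a"),
                 ("tgw-b", "fw-a"), ("fw-b", "nat-b")]
    print(cross_az_hops(subnet_az, next_hops))
```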

A firewall referencing N endpoints across N AZs:

resource "aws_networkfirewall_firewall" "egress" {
  name                = "central-egress"
  firewall_policy_arn = aws_networkfirewall_firewall_policy.egress.arn
  vpc_id              = aws_vpc.inspection.id

  dynamic "subnet_mapping" {
    for_each = aws_subnet.firewall          # one per AZ
    content { subnet_id = subnet_mapping.value.id }
  }

  delete_protection = true   # stop a stray `terraform destroy` from removing the inspection point
}

7. Logging, alerting, and tuning false positives

If it is not logged, it did not happen, and with a default-deny egress policy the logs are how you find the legitimate traffic you just broke. The two stateful log types that matter here are ALERT (traffic matching rules whose action is alert, drop, or reject, plus the aws:alert_established default) and FLOW (connection records). Send ALERT to a place you can query fast.

resource "aws_networkfirewall_logging_configuration" "egress" {
  firewall_arn = aws_networkfirewall_firewall.egress.arn

  logging_configuration {
    log_destination_config {
      log_type        = "ALERT"
      log_destination_type = "CloudWatchLogs"
      log_destination = { logGroup = aws_cloudwatch_log_group.nfw_alert.name }
    }
    log_destination_config {
      log_type        = "FLOW"
      log_destination_type = "S3"
      log_destination = { bucketName = aws_s3_bucket.nfw_flow.id, prefix = "flow" }
    }
  }
}

Put high-volume FLOW logs in S3 (cheap, queryable with Athena) and noisy-but-actionable ALERT logs in CloudWatch Logs for real-time querying and metric filters. A starter query to find what your allow-list is denying, in CloudWatch Logs Insights:

fields @timestamp, event.src_ip, event.tls.sni, event.http.hostname, event.alert.signature
| filter event.alert.action = "blocked"
| stats count(*) as hits by event.tls.sni, event.http.hostname
| sort hits desc
| limit 50

That single query is your tuning loop: run it, find the legitimate domain you forgot (every estate forgets one telemetry or package endpoint), add it to the allow-list, repeat. Roll out with stateful_default_actions set to aws:alert_established only, without the drop, for the first few days. You get the full list of what would be blocked without a single broken deploy; during that phase the would-be drops are logged with event.alert.action = "allowed", so adjust the filter accordingly. Add aws:drop_established once the alert stream is quiet. Wire a CloudWatch alarm on a metric filter for event.alert.action = "blocked" so a sudden spike (a new service, or genuine exfiltration) pages someone.
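
The same tuning loop can be run offline against exported ALERT records, for example when reviewing a day's logs in a notebook. This is a sketch under assumptions: top_blocked is a hypothetical helper, each input line is taken to be one JSON record, and the field names are the ones the Logs Insights query above relies on.

```python
import json
from collections import Counter


def top_blocked(lines: list[str], limit: int = 50) -> list[tuple[str, int]]:
    """Count blocked alerts per destination name, most frequent first."""
    hits: Counter[str] = Counter()
    for line in lines:
        event = json.loads(line).get("event", {})
        if event.get("alert", {}).get("action") != "blocked":
            continue
        # Prefer the TLS SNI, fall back to the HTTP host, else a placeholder.
        dest = (event.get("tls", {}).get("sni")
                or event.get("http", {}).get("hostname")
                or "(no SNI/host)")
        hits[dest] += 1
    return hits.most_common(limit)


if __name__ == "__main__":
    # Made-up sample records for illustration only.
    sample = [
        '{"event": {"alert": {"action": "blocked"}, "tls": {"sni": "telemetry.example.net"}}}',
        '{"event": {"alert": {"action": "allowed"}, "tls": {"sni": "api.github.com"}}}',
        '{"event": {"alert": {"action": "blocked"}, "tls": {"sni": "telemetry.example.net"}}}',
    ]
    for dest, count in top_blocked(sample):
        print(count, dest)
```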

8. Cost model — and when not to use it

Network Firewall bills two ways: an hourly charge per firewall endpoint plus a per-GB charge on traffic processed. With one endpoint per AZ across three AZs, you pay three endpoint-hours every hour before a single byte moves, and that is the line item that surprises people.
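
The two-part bill is simple arithmetic, sketched below so you can plug in your own numbers. The rates in the example are placeholders, not quoted AWS prices (check the current pricing page for your region), and monthly_cost, break_even_gb, and the 730-hour month are assumptions made for illustration.

```python
def monthly_cost(endpoints: int, gb_processed: float,
                 hourly_rate: float, per_gb_rate: float,
                 hours: float = 730.0) -> tuple[float, float]:
    """Return (fixed endpoint charge, variable traffic charge) for one month."""
    fixed = endpoints * hourly_rate * hours
    variable = gb_processed * per_gb_rate
    return fixed, variable


def break_even_gb(endpoints: int, hourly_rate: float, per_gb_rate: float,
                  hours: float = 730.0) -> float:
    """Monthly traffic volume at which the per-GB charge equals the fixed charge."""
    return endpoints * hourly_rate * hours / per_gb_rate


if __name__ == "__main__":
    # Placeholder rates: 0.50 per endpoint-hour, 0.05 per GB.
    fixed, variable = monthly_cost(3, 20_000, hourly_rate=0.50, per_gb_rate=0.05)
    print(f"fixed={fixed:.2f} variable={variable:.2f}")
    print(f"break-even GB/month={break_even_gb(3, 0.50, 0.05):.0f}")
```

Below the break-even volume the endpoint-hours dominate the bill; above it the per-GB fee does, which is the regime where the proxy comparison later in this section starts to matter.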

Egress control	Strengths	Watch-outs
Network Firewall	Stateful, L7 SNI/host filtering, IPS, managed threat intel, centralized	Per-endpoint hourly + per-GB; no decrypt without TLS inspection
Squid / explicit proxy	Cheap at high GB, full URL path, auth-aware, caching	You operate it (HA, patching, scaling); apps must be proxy-aware
NAT + route/SG + prefix lists	Nearly free, simple	IP-based only; no domain or L7 awareness; brittle as IPs churn
Route 53 Resolver DNS Firewall	Block by domain at resolution, very cheap	DNS-layer only; trivially bypassed by hardcoded IPs

Reach for a proxy when you are pushing very large volumes (the per-GB fee dominates) and your apps can speak to a proxy — Squid filtering full URLs can be cheaper and more granular. Reach for NAT plus prefix lists when “egress control” means a short, stable list of partner IPs and you do not need domains. Reach for DNS Firewall as a cheap first layer that complements, not replaces, Network Firewall. Choose Network Firewall when you need stateful L7 egress filtering and IPS centralized across many accounts without running appliances yourself — which is most regulated, multi-account estates. A common production answer is both: DNS Firewall as a cheap broad net, Network Firewall as the stateful chokepoint.

Verify

Validate routing and policy before you trust it. From a spoke instance with no other internet path:

# Allowed domain should connect; SNI is read from the ClientHello.
curl -sS -o /dev/null -w "%{http_code}\n" https://www.github.com   # expect 200/30x

# Denied domain should hang then fail (dropped established flow).
curl -sS --max-time 8 https://example.org ; echo "exit=$?"          # expect non-zero (timeout)

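# Direct-to-IP attempt: curl sends no SNI for an IP literal, so this should be dropped as not allow-listed.
# (The IP-in-SNI rule only fires for clients that put the address in the SNI field.)
curl -sS --max-time 8 https://93.184.216.34 ; echo "exit=$?"        # expect non-zero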

Confirm the data plane and config:

# Firewall endpoints exist, one per AZ, and report READY.
aws network-firewall describe-firewall --firewall-name central-egress \
  --query "FirewallStatus.SyncStates" --output table

# Appliance mode is actually enabled on the inspection attachment.
aws ec2 describe-transit-gateway-vpc-attachments \
  --transit-gateway-attachment-ids tgw-attach-0abc123def4567890 \
  --query "TransitGatewayVpcAttachments[0].Options.ApplianceModeSupport"   # expect "enable"

Then run the CloudWatch Logs Insights query from Section 7 and confirm denied attempts show event.alert.action = "blocked" with the expected SNI/host. If a denied curl succeeds, your traffic is bypassing the firewall (route table or appliance-mode error), not failing open at the rule layer — chase the routing first.

Enterprise scenario

A fintech platform team ran the textbook centralized model: forty spokes, one inspection VPC across three AZs, a TLS-SNI allow-list, deny-by-default. Weeks after go-live, their nightly data-warehouse load to a partner S3 bucket began failing roughly one run in five — never on smaller jobs, always on the multi-hundred-GB transfers. The allow-list had .amazonaws.com, so SNI matching was not the suspect. ALERT logs showed aws:drop_established firing on the partner’s flows mid-transfer.

The cause was asymmetric routing. The inspection VPC attachment had been created in an early Terraform module without appliance_mode_support = "enable". Short flows completed inside a single AZ’s endpoint by luck of the flow hash; long-lived multi-gigabyte streams lived long enough for the TGW to route return packets through a different AZ’s endpoint, where the stateful engine had no record of the flow and dropped it as a mid-stream break. The reason it was intermittent — not total — is exactly what made it hard to spot: it correlated with transfer duration, not destination.

The fix was one argument and a brief maintenance window to re-establish flows:

resource "aws_ec2_transit_gateway_vpc_attachment" "inspection" {
  subnet_ids             = [for s in aws_subnet.tgw_attach : s.id]
  transit_gateway_id     = aws_ec2_transit_gateway.hub.id
  vpc_id                 = aws_vpc.inspection.id
  appliance_mode_support = "enable"   # was absent; this is what fixed the intermittent drops
}

They also added a guardrail so it could never regress silently: an AWS Config rule plus a CI check asserting ApplianceModeSupport == enable on every inspection attachment, because the failure mode is invisible until a flow happens to live long enough — and by then it is a production incident, not a review comment.


Written by Vinod
