AWS Network Firewall in Production: Suricata Rule Engineering for Egress Inspection

Security groups and NACLs gate traffic by IP and port. They have nothing to say about which domain an exfiltrating process is calling out to over 443. Egress control at L7 is where AWS Network Firewall earns its keep: a managed, horizontally scaled Suricata engine you can drop into the data path to allowlist destinations by TLS SNI, run IDS/IPS signatures, and log every flow. The engine is the easy part. The hard part — the part that turns a “deployed” firewall into one that actually inspects traffic — is the route-table choreography and the rule-evaluation ordering. Get the routing wrong and traffic sails past untouched, or worse, gets blackholed by asymmetry. This guide builds the centralized inspection model end to end and engineers the Suricata rules to match.

1. Deployment models: distributed vs centralized inspection VPC

You can run Network Firewall two ways, and the choice dictates the entire routing design.

Model | Where firewall endpoints live | Inspects | Scales to
Distributed | One firewall per spoke VPC, endpoints in each AZ | That VPC’s traffic only | A handful of VPCs
Centralized | One firewall in a dedicated inspection VPC behind Transit Gateway | All spoke egress and (optionally) east-west | Tens to hundreds of VPCs

Distributed is simpler to reason about per VPC but multiplies cost and operational surface with every new account. For any platform running more than a few VPCs, the centralized inspection VPC behind a Transit Gateway is the model that scales: spokes have no internet path of their own, the TGW pulls their traffic into the inspection VPC, the firewall scrubs it, and a single egress path (NAT gateway plus IGW) sits behind the firewall.

The firewall does not route. It is a bump in the wire. Network Firewall creates a Gateway Load Balancer-style VPC endpoint in each inspection-VPC AZ, and you are responsible for steering traffic to that endpoint with route tables. Nothing is implicit.

The inspection VPC holds, per AZ: a firewall subnet (where the endpoint lives), a TGW-attachment subnet, and a NAT subnet. The firewall endpoint is the hinge every route table bends around.

2. The endpoint-and-route dance: appliance mode and symmetric return paths

This is where most implementations break. Stateful inspection requires that both directions of a flow traverse the same firewall endpoint. A multi-AZ TGW topology does not guarantee that by default: without appliance mode, the TGW tries to keep traffic in the Availability Zone it originated from, so when the two ends of a flow sit in different AZs the forward and return packets can reach the inspection VPC in different AZs. A flow’s SYN can enter via the us-east-1a endpoint and its SYN-ACK return via us-east-1b. The stateful engine in 1b never saw the SYN, drops the packet as out-of-state, and the connection hangs.

The fix is appliance mode on the inspection VPC’s TGW attachment. Appliance mode makes the TGW pin all packets of a flow (same 5-tuple) to the same AZ for the life of the flow, restoring symmetry.

# Enable appliance mode on the inspection VPC's TGW attachment.
# Without this, asymmetric routing silently breaks stateful rules.
aws ec2 modify-transit-gateway-vpc-attachment \
  --transit-gateway-attachment-id tgw-attach-0abc123 \
  --options ApplianceModeSupport=enable
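
To make the idea of pinning concrete, here is a toy model in Python. It is only an illustration of why a direction-independent flow key restores symmetry; the hash AWS actually uses is not documented here, and `pick_az`, the AZ list, and the sample addresses are all assumptions made for the sketch.

```python
import hashlib

AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]  # hypothetical inspection-VPC AZs

def pick_az(src_ip, src_port, dst_ip, dst_port, proto, azs=AZS):
    """Map a flow to one AZ so that both directions give the same answer."""
    # Sort the two endpoints so (A -> B) and (B -> A) build the same key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{proto}|{a[0]}:{a[1]}|{b[0]}:{b[1]}".encode()
    digest = hashlib.sha256(key).digest()
    return azs[int.from_bytes(digest[:4], "big") % len(azs)]

forward = pick_az("10.20.4.55", 44321, "10.30.1.10", 443, "tcp")
reverse = pick_az("10.30.1.10", 443, "10.20.4.55", 44321, "tcp")
print(forward == reverse)  # True
```

Because the key is built from the sorted endpoints, the SYN and the SYN-ACK always land on the same AZ, which is the property the stateful engine depends on.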

The route choreography in the inspection VPC, per AZ:

# TGW-attachment subnet: send everything to the firewall endpoint
aws ec2 create-route \
  --route-table-id rtb-tgw-az-a \
  --destination-cidr-block 0.0.0.0/0 \
  --vpc-endpoint-id vpce-0fw1111aaaa   # firewall endpoint, AZ a

# NAT subnet: return path to spokes goes back through the firewall (symmetry)
aws ec2 create-route \
  --route-table-id rtb-nat-az-a \
  --destination-cidr-block 10.0.0.0/8 \
  --vpc-endpoint-id vpce-0fw1111aaaa

On the spoke side, the spoke VPC route tables point 0.0.0.0/0 at the TGW; the TGW route table for spokes points the default route at the inspection VPC attachment. The spokes have no IGW and no NAT of their own — that is the whole point. There is no legal egress path that bypasses the firewall.

Keep every AZ’s traffic on its own endpoint. Routing AZ-a’s TGW subnet to AZ-b’s firewall endpoint works until the AZ-b endpoint or AZ fails, and it defeats appliance mode’s symmetry guarantee. One firewall endpoint per AZ, referenced only by that AZ’s subnets.
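
The one-endpoint-per-AZ rule is easy to check mechanically. The sketch below assumes you have already exported two small mappings from your own inventory: subnet to (AZ, endpoint it routes to) and endpoint to AZ. The function name, the subnet labels, and the second endpoint ID are hypothetical.

```python
def cross_az_routes(subnet_routes, endpoint_az):
    """Return (subnet, subnet_az, endpoint, endpoint_az) for routes that cross AZs."""
    problems = []
    for subnet, (subnet_az, endpoint) in sorted(subnet_routes.items()):
        ep_az = endpoint_az.get(endpoint)  # None if the endpoint is unknown
        if ep_az != subnet_az:
            problems.append((subnet, subnet_az, endpoint, ep_az))
    return problems

endpoint_az = {
    "vpce-0fw1111aaaa": "us-east-1a",
    "vpce-0fw2222bbbb": "us-east-1b",  # hypothetical AZ-b endpoint
}
subnet_routes = {
    "tgw-az-a": ("us-east-1a", "vpce-0fw1111aaaa"),
    "nat-az-a": ("us-east-1a", "vpce-0fw1111aaaa"),
    "tgw-az-b": ("us-east-1b", "vpce-0fw1111aaaa"),  # wrong: points at AZ a
}

for problem in cross_az_routes(subnet_routes, endpoint_az):
    print(problem)
```

An empty result means every subnet references only its own AZ's endpoint; anything printed is a route to fix before trusting stateful rules.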

3. Stateless vs stateful engines and rule evaluation order

A Network Firewall policy chains two engines, and traffic flows through them in a fixed order. Understanding this order is the difference between a rule that fires and one that is dead code.

packet -> [stateless engine] --(forward action)--> [stateful engine] -> verdict
                |
                +--(pass / drop)--> verdict (skips stateful)

The critical knob is the stateless default action, set in the firewall policy, which decides what happens to packets that match no stateless rule. For an inspection design you want the stateless engine to do almost nothing except hand everything to Suricata:

{
  "StatelessDefaultActions": ["aws:forward_to_sfe"],
  "StatelessFragmentDefaultActions": ["aws:forward_to_sfe"],
  "StatefulEngineOptions": {
    "RuleOrder": "STRICT_ORDER",
    "StreamExceptionPolicy": "DROP"
  }
}

Two settings here are load-bearing. RuleOrder: STRICT_ORDER changes how the stateful engine evaluates rules (covered next). StreamExceptionPolicy: DROP decides what happens to packets the engine can’t associate with a tracked stream (mid-stream packets after a flush, out-of-window data) — DROP is the secure default; the alternatives CONTINUE and REJECT trade safety for fewer broken connections.
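
A small lint over the policy document can catch a mis-set default before deployment. This is a sketch that assumes the policy is available as a Python dict with the same keys as the JSON above; `lint_policy` is a hypothetical helper, not part of any AWS SDK.

```python
def lint_policy(policy):
    """Return a list of warnings for settings this inspection design relies on."""
    warnings = []
    for key in ("StatelessDefaultActions", "StatelessFragmentDefaultActions"):
        if "aws:forward_to_sfe" not in policy.get(key, []):
            warnings.append(f"{key} does not forward to the stateful engine")
    options = policy.get("StatefulEngineOptions", {})
    if options.get("RuleOrder") != "STRICT_ORDER":
        warnings.append("RuleOrder is not STRICT_ORDER")
    if options.get("StreamExceptionPolicy") != "DROP":
        warnings.append("StreamExceptionPolicy is not explicitly DROP")
    return warnings

policy = {
    "StatelessDefaultActions": ["aws:forward_to_sfe"],
    "StatelessFragmentDefaultActions": ["aws:forward_to_sfe"],
    "StatefulEngineOptions": {
        "RuleOrder": "STRICT_ORDER",
        "StreamExceptionPolicy": "DROP",
    },
}
print(lint_policy(policy))  # []
```

Running the same function against a policy whose stateless default is `aws:drop` returns one warning per missing setting, which makes it a reasonable pre-merge check in a pipeline.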

4. Authoring Suricata rules: SNI allowlists, domain lists, and signatures

There are two ways to express stateful rules: domain-list rule groups (a managed abstraction) and Suricata-compatible rule strings (raw signatures). Use domain lists for the bulk allowlist and Suricata strings for protocol-aware logic and IDS.

A domain-list rule group for HTTP Host and TLS SNI allowlisting:

{
  "RulesSource": {
    "RulesSourceList": {
      "GeneratedRulesType": "ALLOWLIST",
      "TargetTypes": ["TLS_SNI", "HTTP_HOST"],
      "Targets": [
        ".amazonaws.com",
        ".pypi.org",
        ".pythonhosted.org",
        "registry.npmjs.org",
        ".ghcr.io"
      ]
    }
  }
}

A leading dot (.amazonaws.com) matches the domain and all subdomains; without it the match is exact. TLS_SNI inspects the Server Name Indication in the TLS ClientHello — no decryption required, since SNI is sent in the clear during the handshake.
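
The leading-dot semantics described above can be modelled in a few lines, which is handy for unit-testing an allowlist before it ships. This is a sketch of the matching rule as this guide states it, not the service's implementation; `matches_target` is a hypothetical name.

```python
def matches_target(hostname, target):
    """Leading dot: the domain itself plus all subdomains. No dot: exact match."""
    host = hostname.lower().rstrip(".")
    pattern = target.lower()
    if pattern.startswith("."):
        return host == pattern[1:] or host.endswith(pattern)
    return host == pattern

print(matches_target("s3.us-east-1.amazonaws.com", ".amazonaws.com"))  # True
print(matches_target("evilamazonaws.com", ".amazonaws.com"))           # False
print(matches_target("www.registry.npmjs.org", "registry.npmjs.org"))  # False
```

The second case is the one worth testing for: a suffix check without the dot boundary would let a look-alike registrable domain through.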

For finer control, author raw Suricata. To allow a specific SNI and drop everything else over TLS, ordering matters (see the next section). A protocol-aware allow-then-deny pair:

# Allow TLS to approved SNIs (suffix match: the domain and its subdomains)
pass tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; \
  dotprefix; content:".amazonaws.com"; nocase; endswith; \
  msg:"ALLOW AWS endpoints"; sid:1000001; rev:1;)

# Exact match needs startswith plus endswith; a bare content is a substring
# match, so registry.npmjs.org.evil.example would also pass
pass tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; \
  content:"registry.npmjs.org"; nocase; startswith; endswith; \
  msg:"ALLOW npm registry"; sid:1000002; rev:1;)

# Drop any other TLS handshake by SNI (default-deny for 443)
drop tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; \
  msg:"DENY non-allowlisted TLS SNI"; sid:1000099; rev:1;)

$HOME_NET and $EXTERNAL_NET are rule-variable sets you define on the rule group; point $HOME_NET at your spoke CIDRs. The tls.sni keyword with dotprefix plus endswith is the canonical way to do suffix matching on a domain. For DNS-layer control you can match queries directly:

# Alert on DNS queries for the .onion TLD (endswith anchors the match to the suffix)
alert dns $HOME_NET any -> any any (dns.query; \
  content:".onion"; nocase; endswith; \
  msg:"DNS query for .onion TLD"; sid:1000201; rev:1;)

SNI filtering is not TLS interception. You see the requested hostname, not the payload, and a determined adversary can use ESNI/ECH or domain fronting to evade it. SNI allowlisting raises the bar and catches the overwhelming majority of misconfigured or commodity-malware egress — treat it as one control, not the whole story.
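
Hand-writing one pass rule per domain gets error-prone as the allowlist grows. The sketch below generates pass rules in the same shape as the ones in this section from a list of targets, using the leading-dot convention to choose between suffix and exact matching. The function name, the SID base, and the validation regex are assumptions for illustration; the default-deny floor stays a hand-written rule.

```python
import re

# Conservative hostname pattern so a target cannot inject rule syntax.
_TARGET = re.compile(r"\.?[a-z0-9-]+(\.[a-z0-9-]+)+")

def sni_pass_rules(targets, first_sid=1000001):
    """Build one 'pass tls' rule per target; a leading dot means suffix match."""
    rules = []
    for offset, target in enumerate(targets):
        if not _TARGET.fullmatch(target):
            raise ValueError(f"refusing suspicious target: {target!r}")
        if target.startswith("."):
            match = f'dotprefix; content:"{target}"; nocase; endswith;'
        else:
            match = f'content:"{target}"; nocase; startswith; endswith;'
        rules.append(
            "pass tls $HOME_NET any -> $EXTERNAL_NET any "
            f'(tls.sni; {match} msg:"ALLOW {target}"; '
            f"sid:{first_sid + offset}; rev:1;)"
        )
    return rules

for rule in sni_pass_rules([".amazonaws.com", "registry.npmjs.org"]):
    print(rule)
```

Keeping the list in version control and regenerating the rule group from it makes the SID sequence and the match style consistent across reviews.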

5. Strict ordering, default-drop, and capacity planning

By default the stateful engine uses action-order evaluation (Suricata’s native precedence: pass > drop > reject > alert), which means a pass rule anywhere wins over a drop, regardless of where you placed it. For an allowlist firewall that is exactly backwards — you want your explicit drop of non-allowlisted SNI to be the floor, with passes above it in a deterministic sequence. That is what STRICT_ORDER gives you: rules evaluate top-to-bottom in the order of the rule groups (set by priority) and the rules within them, and the first match wins.

With strict order you also set an explicit default action for the policy. To get default-drop egress, add the drop action to the stateful default:

{
  "StatefulEngineOptions": { "RuleOrder": "STRICT_ORDER" },
  "StatefulDefaultActions": [
    "aws:drop_established",
    "aws:alert_established"
  ]
}

aws:drop_established drops packets on flows that the engine has established but that no rule explicitly passed — your safety net. Ordering of rule groups is then controlled by integer Priority (lower = evaluated first) in the policy reference.
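
The difference between the two evaluation modes is easiest to see on a two-rule example. The following is a deliberately simplified model (pass and drop only, matching on SNI alone) rather than a description of the engine's internals; the rule tuples and the `verdict` helper are hypothetical.

```python
ACTION_PRECEDENCE = {"pass": 0, "drop": 1, "reject": 2, "alert": 3}

def matches(sni, pattern):
    """Leading dot means suffix match; otherwise exact match."""
    if pattern.startswith("."):
        return sni == pattern[1:] or sni.endswith(pattern)
    return sni == pattern

def verdict(rules, sni, strict_order):
    """Return the action of the first matching rule under the chosen ordering."""
    if strict_order:
        ordered = rules  # evaluate exactly as written, top to bottom
    else:
        # Action order: group by action type; sorted() keeps ties in written order.
        ordered = sorted(rules, key=lambda rule: ACTION_PRECEDENCE[rule[0]])
    for action, pattern in ordered:
        if matches(sni, pattern):
            return action
    return "default"

rules = [
    ("drop", "bad.amazonaws.com"),  # specific deny written first
    ("pass", ".amazonaws.com"),     # broad allow written second
]
print(verdict(rules, "bad.amazonaws.com", strict_order=True))   # drop
print(verdict(rules, "bad.amazonaws.com", strict_order=False))  # pass
```

Under action order the broad pass is evaluated first and the specific deny never gets a say, which is why an allowlist design relies on strict order.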

Capacity is the constraint nobody plans for until provisioning fails. Every rule group reserves a fixed capacity (a unit count) at creation, and you cannot change it later — you size it up front. A firewall policy has a hard ceiling of 30,000 capacity units across all its stateful rule groups (and a separate budget for stateless). Suricata rules consume one unit each; domain-list entries consume capacity proportional to the list size and TargetTypes count.

# Reserve capacity generously; you cannot resize a rule group after creation.
aws network-firewall create-rule-group \
  --rule-group-name egress-tls-allowlist \
  --type STATEFUL \
  --capacity 200 \
  --rule-group file://allowlist.json

Size each group with headroom for new rules and list entries (a rev: bump edits a rule in place and consumes nothing extra), and track the cumulative total against 30,000 so a future rule group doesn’t get rejected at apply time.
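
Tracking the cumulative budget is simple arithmetic, sketched here so it can live next to the IaC. The 30,000 ceiling comes from the text above; the group names other than egress-tls-allowlist and all the capacities besides its 200 are made-up numbers for illustration.

```python
POLICY_STATEFUL_LIMIT = 30_000  # per-policy stateful ceiling quoted above

def capacity_report(reserved_by_group, new_group_capacity=0):
    """Sum reserved capacity and say whether a proposed new group still fits."""
    reserved = sum(reserved_by_group.values())
    remaining = POLICY_STATEFUL_LIMIT - reserved
    return {
        "reserved": reserved,
        "remaining": remaining,
        "fits": new_group_capacity <= remaining,
    }

groups = {
    "egress-tls-allowlist": 200,
    "dns-alerts": 100,          # hypothetical group
    "custom-ids": 5_000,        # hypothetical group
}
print(capacity_report(groups, new_group_capacity=25_000))
# {'reserved': 5300, 'remaining': 24700, 'fits': False}
```

Here 200 + 100 + 5,000 = 5,300 units are already reserved, leaving 24,700, so a proposed 25,000-unit group would be rejected.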

6. Importing and tuning managed threat-signature groups

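AWS publishes managed rule groups maintained by its threat-intelligence team: threat-signature groups such as ThreatSignaturesBotnet, ThreatSignaturesMalware, and ThreatSignaturesScanners, plus curated domain and IP reputation lists such as AbusedLegitMalwareDomains. Each group is published in two variants whose names end in ActionOrder or StrictOrder; attach the variant that matches the policy’s RuleOrder. They drop into a policy by ARN and update automatically.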

aws network-firewall list-rule-groups \
  --scope MANAGED \
  --managed-type AWS_MANAGED_THREAT_SIGNATURES \
  --query 'RuleGroups[].Name'

The failure mode with managed IDS signatures is false-positive flooding: enable a broad signature set in drop posture on day one and you will sever legitimate traffic and page yourself at 2 a.m. The disciplined rollout:

  1. Add the managed group to the policy in alert-only posture first. Managed groups expose an override so you can flip their action to DROP_TO_ALERT without editing the group:
{
  "ResourceArn": "arn:aws:network-firewall:us-east-1:aws-managed:stateful-rulegroup/ThreatSignaturesBotnetStrictOrder",
  "Priority": 200,
  "Override": { "Action": "DROP_TO_ALERT" }
}
  2. Run for a week. Pull the alert logs, identify the SIDs firing on known-good traffic, and build a rule-variable or a targeted pass rule above the managed group (strict order) to whitelist that exact flow.
  3. Once the alert stream is clean, remove the DROP_TO_ALERT override so the group enforces.

Never bulk-suppress by disabling a managed group. Suppress the specific SID for the specific source, and document why. A blanket disable of ThreatSignaturesMalware because one signature was noisy is how a real intrusion walks straight through.
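
For the alert-only week, a tally of which SIDs fire for which known-good sources gives you the exact (SID, source) pairs to write targeted pass rules for. This sketch assumes the alert log has been exported as one JSON record per line in the shape shown in the logging section; the sample SIDs and addresses are invented.

```python
import json
from collections import Counter

def noisy_signatures(log_lines, known_good_sources):
    """Count alerts per (signature_id, src_ip) for sources you trust."""
    counts = Counter()
    for line in log_lines:
        event = json.loads(line).get("event", {})
        alert = event.get("alert")
        if alert and event.get("src_ip") in known_good_sources:
            counts[(alert["signature_id"], event["src_ip"])] += 1
    return counts.most_common()

sample = [  # invented records, trimmed to the fields used here
    '{"event": {"src_ip": "10.20.4.55", "alert": {"signature_id": 2900001}}}',
    '{"event": {"src_ip": "10.20.4.55", "alert": {"signature_id": 2900001}}}',
    '{"event": {"src_ip": "10.20.9.10", "alert": {"signature_id": 2900002}}}',
    '{"event": {"src_ip": "10.99.0.7", "alert": {"signature_id": 2900001}}}',
]
print(noisy_signatures(sample, {"10.20.4.55", "10.20.9.10"}))
# [((2900001, '10.20.4.55'), 2), ((2900002, '10.20.9.10'), 1)]
```

Alerts from sources outside the known-good set are left out on purpose: those are the ones to investigate, not suppress.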

7. Alert and flow logging to S3/CloudWatch and the log schema

Alert logs and flow logs are configured independently, each with its own destination:

{
  "LoggingConfiguration": {
    "LogDestinationConfigs": [
      {
        "LogType": "FLOW",
        "LogDestinationType": "S3",
        "LogDestination": { "bucketName": "acme-nfw-logs", "prefix": "flow" }
      },
      {
        "LogType": "ALERT",
        "LogDestinationType": "CloudWatchLogs",
        "LogDestination": { "logGroup": "/anfw/alert" }
      }
    ]
  }
}

A decoded alert record (the event object is Suricata’s EVE JSON):

{
  "firewall_name": "acme-egress-inspection",
  "availability_zone": "us-east-1a",
  "event_timestamp": "1717842000",
  "event": {
    "src_ip": "10.20.4.55",
    "dest_ip": "151.101.0.223",
    "app_proto": "tls",
    "tls": { "sni": "pypi.org", "version": "TLS 1.3" },
    "alert": {
      "action": "blocked",
      "signature": "DENY non-allowlisted TLS SNI",
      "signature_id": 1000099
    }
  }
}

The fields that matter operationally: event.alert.action (allowed vs blocked), event.alert.signature_id (the SID to tune against), and event.tls.sni (what the workload was actually trying to reach). A CloudWatch Logs Insights query to surface your top blocked destinations:

fields event.tls.sni as sni, event.src_ip as src
| filter event.alert.action = "blocked"
| stats count(*) as hits by sni, src
| sort hits desc
| limit 25

That query is the single most useful day-2 artifact: it tells you which legitimate destinations you forgot to allowlist before users do.

Verify

Run these from a workload instance in a spoke subnet (via SSM Session Manager — there is no direct SSH path if the design is right) and from your operator workstation against the AWS APIs.

# 1. Appliance mode is actually on (the silent killer if off)
aws ec2 describe-transit-gateway-vpc-attachments \
  --transit-gateway-attachment-ids tgw-attach-0abc123 \
  --query 'TransitGatewayVpcAttachments[0].Options.ApplianceModeSupport'
# Expect: "enable"

# 2. Firewall reports READY in every AZ
aws network-firewall describe-firewall \
  --firewall-name acme-egress-inspection \
  --query 'FirewallStatus.SyncStates'

# 3. From the workload subnet: an allowlisted SNI succeeds
curl -sS -o /dev/null -w '%{http_code}\n' https://pypi.org
# Expect: 200

# 4. A non-allowlisted SNI is dropped (connection hangs then times out)
curl -sS --max-time 8 https://example.com ; echo "exit=$?"
# Expect: exit=28 (operation timed out) — the drop is silent at L4

Then confirm the firewall saw it:

# 5. The drop shows up in the alert log within ~1-2 minutes
aws logs filter-log-events \
  --log-group-name /anfw/alert \
  --filter-pattern '{ $.event.tls.sni = "example.com" }' \
  --query 'events[].message'

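If step 4 fails fast with a connection reset or refused error instead of a timeout, something is actively rejecting the flow (a reject rule, or a REJECT stream-exception policy), because a drop is always silent. If step 4 times out but step 5 shows no alert record, the packet probably never reached Suricata: check that the stateless default is aws:forward_to_sfe rather than a stateless drop. If step 3 also hangs, you likely have asymmetric routing; recheck appliance mode and that each AZ’s subnets reference only that AZ’s endpoint.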

Enterprise scenario

A fintech platform team migrated forty workload accounts behind a centralized inspection VPC to satisfy a PCI requirement that all egress be allowlisted by destination. The rollout passed every test in staging and then broke intermittently in production: roughly one connection in three to *.amazonaws.com would hang, specifically on the SSM and S3 gateway paths, while curl tests from the same host usually passed. A classic heisenbug.

The constraint was multi-AZ. Staging ran a single AZ; production ran three. Their TGW attachment for the inspection VPC did not have appliance mode enabled, so the TGW was load-balancing forward and return packets across AZs independently. A flow’s ClientHello would enter the AZ-a firewall endpoint, the stateful engine in AZ-a would establish state and pass the SNI, but the return traffic hashed to the AZ-c endpoint — whose Suricata engine had never seen the handshake and dropped the packets as out-of-stream under the default DROP stream-exception policy. The “one in three” was literally the three-AZ hash.

The fix was one API call plus a routing audit:

aws ec2 modify-transit-gateway-vpc-attachment \
  --transit-gateway-attachment-id tgw-attach-0fin999 \
  --options ApplianceModeSupport=enable

After enabling appliance mode, they re-audited every inspection-VPC route table to guarantee each AZ’s TGW-attachment and NAT subnets pointed only at that AZ’s firewall endpoint — the second half of the symmetry guarantee. Flow drops went to zero. The lesson they wrote into their landing-zone module: appliance mode is not optional for stateful inspection behind a multi-AZ TGW, and it must be asserted by the IaC, not clicked once and forgotten. They added a Config rule to alarm if any inspection-VPC TGW attachment ever reports ApplianceModeSupport != enable.
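
The same assertion can run in CI against the JSON that `aws ec2 describe-transit-gateway-vpc-attachments` prints. The sketch below assumes that output has been loaded into a dict and already filtered to inspection-VPC attachments (spoke attachments do not need appliance mode); the function name and the second attachment ID are hypothetical.

```python
def attachments_without_appliance_mode(describe_output):
    """Return attachment IDs whose ApplianceModeSupport is not 'enable'."""
    offenders = []
    for attachment in describe_output.get("TransitGatewayVpcAttachments", []):
        mode = attachment.get("Options", {}).get("ApplianceModeSupport")
        if mode != "enable":
            offenders.append(attachment["TransitGatewayAttachmentId"])
    return offenders

describe_output = {
    "TransitGatewayVpcAttachments": [
        {
            "TransitGatewayAttachmentId": "tgw-attach-0fin999",
            "Options": {"ApplianceModeSupport": "enable"},
        },
        {
            "TransitGatewayAttachmentId": "tgw-attach-0fin888",  # hypothetical
            "Options": {"ApplianceModeSupport": "disable"},
        },
    ]
}
print(attachments_without_appliance_mode(describe_output))
# ['tgw-attach-0fin888']
```

Failing the pipeline on a non-empty list turns the lesson into a guardrail instead of a one-time fix.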
