Designing Stateful Linux Firewalls with native nftables Rulesets and NAT

Most Linux firewalls in production are still iptables rulesets that nobody dares touch: hundreds of ordered lines, parallel IPv4 and IPv6 copies that have drifted apart, and a reload that flushes the table before re-adding rules, leaving a window where the box is wide open. nftables was built to fix exactly these problems. It replaces the four legacy tools (iptables, ip6tables, arptables, ebtables) with one syntax and one kernel subsystem, gives you first-class data structures (named sets, maps, verdict maps), and reloads atomically — the new ruleset commits in a single transaction or not at all.

This article builds a real stateful firewall and router from an empty ruleset: a correct base policy with connection tracking, scaling with named sets and verdict maps, SNAT/DNAT/masquerade for a forwarding box, rate limiting and connection caps, include files that reload atomically, a clean iptables migration, and packet tracing for when something inevitably does not match.

Scope: examples target current mainline nftables (0.9.x and newer) on RHEL-family (RHEL 9/10, Rocky, Alma) and Ubuntu Server 22.04/24.04, all of which ship the nftables kernel backend. Run everything as root. Where a command is destructive to connectivity I call it out — test from a console you cannot lock yourself out of.

1. The architecture: families, tables, chains, hooks, and priority

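Before writing a rule, internalize the object model, because nftables mistakes are almost always a misplaced chain rather than a wrong rule. A table is a namespace bound to one family (ip, ip6, inet, arp, bridge, netdev); a base chain attaches to a netfilter hook at a given priority; a regular chain has no hook and runs only when a rule jumps to it.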

The priority aliases are not decorative — they place your chain correctly relative to conntrack and the NAT engine:

Alias      Value   Typical hook           Purpose
raw        -300    prerouting/output      Runs before conntrack; use to notrack
mangle     -150    all                    Packet mangling
dstnat     -100    prerouting             DNAT happens here
filter        0    all                    Normal filtering
security     50    input/forward/output   After filter
srcnat      100    postrouting            SNAT/masquerade happens here

The single rule that explains most “my NAT does nothing” tickets: DNAT must be in a prerouting chain at dstnat priority, SNAT/masquerade in a postrouting chain at srcnat priority. A NAT statement is only valid in a chain of type nat; nft refuses to load one placed in a type filter chain, so read the load error instead of assuming the rule applied.
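To make the placement concrete, here is a minimal sketch of a shell helper that prints a skeleton ruleset with every base chain pinned to its hook and priority alias. The function name skeleton_ruleset is hypothetical, and the chains are deliberately empty: this is a map of where rules go, not a ruleset to load.

```shell
#!/bin/sh
# Sketch only: skeleton_ruleset is a hypothetical helper name.
# It prints base chains with no rules so each hook/priority pairing is visible.
skeleton_ruleset() {
    cat <<'EOF'
table inet filter {
    chain input   { type filter hook input   priority filter; policy drop; }
    chain forward { type filter hook forward priority filter; policy drop; }
    chain output  { type filter hook output  priority filter; policy accept; }
}

table ip nat {
    # DNAT belongs here: before routing, at dstnat (-100).
    chain prerouting  { type nat hook prerouting  priority dstnat; policy accept; }
    # SNAT/masquerade belongs here: after routing, at srcnat (100).
    chain postrouting { type nat hook postrouting priority srcnat; policy accept; }
}
EOF
}
```

Piping it through a check-only parse (skeleton_ruleset | nft -c -f -) should validate the layout without touching the live ruleset. Do not load it as-is: an input chain with policy drop and no accepts locks you out.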

2. A correct stateful base ruleset

Start from empty and build a default-drop input policy that accepts established traffic via conntrack. This is the skeleton every host gets.

#!/usr/sbin/nft -f
# /etc/nftables.conf
flush ruleset

table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;

        # 1. Fast-path established/related connections FIRST.
        ct state established,related accept
        ct state invalid drop

        # 2. Loopback is always trusted; guard against spoofed lo.
        iif "lo" accept
        iif != "lo" ip daddr 127.0.0.0/8 drop
        iif != "lo" ip6 daddr ::1 drop

        # 3. ICMP / ICMPv6 — do NOT blanket-drop ICMPv6, IPv6 needs it.
        ip protocol icmp icmp type { echo-request, destination-unreachable, time-exceeded, parameter-problem } accept
        ip6 nexthdr ipv6-icmp icmpv6 type { echo-request, nd-neighbor-solicit, nd-neighbor-advert, nd-router-advert, destination-unreachable, packet-too-big, time-exceeded, parameter-problem } accept

        # 4. New, allowed inbound services.
        tcp dport 22 ct state new accept
        tcp dport { 80, 443 } ct state new accept
    }

    chain forward {
        type filter hook forward priority filter; policy drop;
        ct state established,related accept
        ct state invalid drop
    }

    chain output {
        type filter hook output priority filter; policy accept;
    }
}

Three correctness points that separate a working firewall from a subtly broken one:

  1. Order conntrack first. ct state established,related accept at the top means a reply packet is accepted in one rule instead of walking the whole chain. Put it last and an early drop can sever return traffic for connections you meant to allow.
  2. Never blanket-drop ICMPv6. IPv6 depends on Neighbor Discovery (the nd-* types) and Path MTU Discovery (packet-too-big) at the network layer. Drop them and IPv6 connectivity degrades in confusing, intermittent ways. This is the most common IPv6 firewall bug.
  3. policy drop is the safety net, not the policy. The explicit accepts are the policy; the chain policy catches everything you forgot.

Check syntax without touching the live ruleset, then apply atomically:

nft -c -f /etc/nftables.conf   # -c: parse and validate, change nothing
nft -f /etc/nftables.conf      # atomic: whole file commits or nothing does
nft list ruleset               # the live, effective ruleset

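Persistence is the file plus the unit. On Debian and Ubuntu, nftables.service loads /etc/nftables.conf at boot; on the RHEL family it loads /etc/sysconfig/nftables.conf, so put the ruleset there or include your file from it. Either way the file is the persistence, but only if the unit is enabled: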

systemctl enable --now nftables
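Since a bad policy can cut off the session you are typing in, a guarded apply is worth having. The following is a rough sketch, not a hardened tool: safe_apply, the NFT override variable, and the "keep" confirmation word are all assumptions of this example. It snapshots the live ruleset, validates and applies the candidate, and starts a background timer that replays the snapshot unless you confirm in time.

```shell
#!/bin/sh
# Sketch only: safe_apply, the NFT override and the "keep" keyword are
# assumptions of this example, not part of nftables.
NFT=${NFT:-nft}

safe_apply() {
    new=$1
    grace=${2:-30}
    backup=$(mktemp) || return 1

    # Snapshot the live ruleset so it can be replayed in one transaction.
    { echo 'flush ruleset'; "$NFT" -s list ruleset; } > "$backup" \
        || { rm -f "$backup"; return 1; }

    # Validate, then commit; either failure leaves the old ruleset in place.
    "$NFT" -c -f "$new" && "$NFT" -f "$new" \
        || { rm -f "$backup"; return 1; }

    # Background timer restores the snapshot unless it is cancelled in time.
    ( sleep "$grace"; "$NFT" -f "$backup"; rm -f "$backup" ) &
    timer=$!

    printf 'Applied. Type "keep" within %s seconds to confirm: ' "$grace" >&2
    read -r answer
    if [ "$answer" = keep ]; then
        kill "$timer" 2>/dev/null
        rm -f "$backup"
        return 0
    fi
    wait "$timer"    # not confirmed: let the rollback run to completion
    return 1
}
```

Typical use would be safe_apply /etc/nftables.conf 60. If the new policy severs your session, the read never sees "keep" and the timer restores the old ruleset. Run it inside tmux or from a console, since a hangup can kill the background timer along with the shell.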

3. Named sets and verdict maps for policy that scales

The inline { 80, 443 } is an anonymous set: fine for constants, but it cannot be updated without rewriting the rule. Named sets are referenced objects you can modify at runtime, and verdict maps turn a linear chain into an O(1) lookup.

A named set with flags interval stores CIDR ranges; auto-merge collapses overlaps automatically:

table inet filter {
    set admin_nets {
        type ipv4_addr
        flags interval
        auto-merge
        elements = { 10.0.0.0/8, 192.0.2.10 }
    }

    set blocklist {
        type ipv4_addr
        flags interval, timeout      # timed entries for dynamic blocking
    }

    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        ct state invalid drop
        iif "lo" accept

        ip saddr @blocklist drop
        tcp dport 22 ip saddr @admin_nets ct state new accept
    }
}

Update a set live, without reloading the ruleset, and add a timed block that expires itself:

nft add element inet filter admin_nets { 198.51.100.0/24 }
nft delete element inet filter admin_nets { 192.0.2.10 }

# Block a scanner for one hour; the entry self-removes.
nft add element inet filter blocklist { 203.0.113.66 timeout 1h }
nft list set inet filter blocklist     # shows remaining TTL per element

Verdict maps (vmap) map a key directly to a verdict — accept, drop, jump <chain>. Instead of one rule per service, dispatch by port in a single lookup:

table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        ct state invalid drop
        iif "lo" accept
        ip protocol icmp accept
        ip6 nexthdr ipv6-icmp accept

        # One lookup dispatches every TCP service.
        tcp dport vmap {
            22  : jump svc_ssh,
            80  : accept,
            443 : accept,
            25  : drop
        }
    }

    chain svc_ssh {
        ip saddr @admin_nets ct state new accept
        ct state new limit rate 10/minute accept   # fallback for non-admin
    }
}

You can also map by interface. A router with several zones is far more readable as a verdict map keyed on the inbound interface than as a wall of iifname comparisons:

chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    ct state invalid drop

    iifname vmap {
        "eth0"   : jump zone_wan,
        "eth1"   : jump zone_lan,
        "br-dmz" : jump zone_dmz
    }
}

This is what keeps a firewall maintainable at scale: chains encode logic, sets and maps encode data, and only the data changes between hosts and over time.

4. NAT: masquerade, SNAT, and DNAT for a forwarding box

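NAT lives in base chains of type nat, conventionally grouped in their own table. Older kernels (before 4.18, per the nftables wiki) required both the prerouting and postrouting chains to be registered, even if one was empty, for the NAT engine to handle reply traffic. Newer kernels no longer need that, but defining both costs nothing and keeps the layout predictable.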

First, forwarding must be enabled at the kernel level. This is independent of any firewall rule; NAT without it does nothing:

sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv6.conf.all.forwarding=1
printf 'net.ipv4.ip_forward = 1\nnet.ipv6.conf.all.forwarding = 1\n' \
    | tee /etc/sysctl.d/99-forward.conf
sysctl --system

Now the NAT table. masquerade is dynamic SNAT to the outbound interface’s current address (use it on dynamic/DHCP uplinks); snat to <ip> is static and slightly cheaper when the WAN IP is fixed; dnat to <ip>:<port> redirects inbound traffic to an internal host:

table ip nat {
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;

        # DNAT: publish an internal web server on the WAN IP.
        iifname "eth0" tcp dport 443 dnat to 10.0.10.20:443

        # Port translation is allowed: external 8443 -> internal 443.
        iifname "eth0" tcp dport 8443 dnat to 10.0.10.20:443
    }

    chain postrouting {
        type nat hook postrouting priority srcnat; policy accept;

        # Static SNAT when the WAN address is fixed.
        oifname "eth0" ip saddr 10.0.10.0/24 snat to 203.0.113.5

        # OR masquerade for a dynamic uplink (use one or the other).
        # oifname "eth0" ip saddr 10.0.10.0/24 masquerade
    }
}

The piece people forget: DNAT changes the destination but not the firewall decision. Forwarded packets still traverse the forward chain, which is policy drop. You must explicitly allow the now-translated flow. Match it precisely with ct status dnat so you are not opening the LAN wider than intended:

chain forward {
    type filter hook forward priority filter; policy drop;
    ct state established,related accept
    ct state invalid drop

    # Allow exactly the DNATed flow to the backend.
    ip daddr 10.0.10.20 tcp dport 443 ct state new ct status dnat accept

    # Allow LAN clients out.
    iifname "eth1" oifname "eth0" accept
}

ct status dnat matches only connections the NAT engine actually translated, so a direct attempt to reach 10.0.10.20:443 from elsewhere does not slip through this rule.
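Because the DNAT rule and its forward allow must agree on the backend address and the translated port, one option is to generate them as a pair. A minimal sketch, assuming the ip nat prerouting and inet filter forward chains from this article already exist; publish_service and its argument order are hypothetical.

```shell
#!/bin/sh
# Sketch only: publish_service is a hypothetical helper. It assumes the
# "ip nat prerouting" and "inet filter forward" chains shown above exist.
publish_service() {
    wan_if=$1 ext_port=$2 backend=$3 int_port=$4

    # DNAT matches the port the client dialled (pre-translation).
    printf 'add rule ip nat prerouting iifname "%s" tcp dport %s dnat to %s:%s\n' \
        "$wan_if" "$ext_port" "$backend" "$int_port"

    # The forward allow matches the tuple after translation.
    printf 'add rule inet filter forward ip daddr %s tcp dport %s ct state new ct status dnat accept\n' \
        "$backend" "$int_port"
}
```

For example, publish_service eth0 8443 10.0.10.20 443 prints both rules, with the forward rule keyed on port 443 rather than 8443. Piping the output to nft -f - should commit the two together, so there is never a published port without its matching allow.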

5. Rate limiting, logging, and connection-limit protections

Three mechanisms harden the edge against floods and brute force. They are distinct and stack cleanly.

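Rate limiting with limit throttles matches by packet rate, optionally with a burst allowance. A bare limit is one bucket shared by all sources, so a single noisy client can use up the allowance for everyone: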

# Accept at most 10 new SSH connections/min, bursting to 5.
tcp dport 22 ct state new limit rate 10/minute burst 5 packets accept
tcp dport 22 ct state new drop   # anything over the limit

Per-source connection caps with ct count bound concurrent connections per client — the right tool against a single host opening thousands of sockets. It needs a meter (a dynamic set keyed on source address):

table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        ct state invalid drop
        iif "lo" accept

        # Drop new connections from a source with more than 20 open to port 443.
        tcp dport 443 ct state new meter conn_limit \
            { ip saddr ct count over 20 } drop
        tcp dport 443 ct state new accept
    }
}
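The nftables wiki describes the meter keyword as superseded by named sets with the dynamic flag. As a hedged sketch of that form, the helper below prints an equivalent set and rule for a given port and cap. It assumes a reasonably recent kernel and nft; render_connlimit and the set name per_src_conns are hypothetical.

```shell
#!/bin/sh
# Sketch only: render_connlimit and the set name per_src_conns are
# hypothetical; the table and chain names follow the example above.
render_connlimit() {
    port=$1 max=$2

    # A named dynamic set holds one entry per source address.
    printf 'add set inet filter per_src_conns { type ipv4_addr; size 65535; flags dynamic; }\n'

    # "insert" puts the rule at the top of the chain, ahead of any accept for the port.
    printf 'insert rule inet filter input tcp dport %s ct state new add @per_src_conns { ip saddr ct count over %s } drop\n' \
        "$port" "$max"
}
```

render_connlimit 443 20 prints two commands that can be reviewed and then piped to nft -f -. A named set has the practical advantage that nft list set inet filter per_src_conns shows which sources are currently being counted.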

Logging writes to the kernel log (read via journalctl -k or your syslog pipeline). Always rate-limit the log rule itself — an unlimited log on a flood is a self-inflicted disk and CPU outage:

# Log dropped input, but never more than a few lines/sec.
chain input {
    # ... accepts above ...
    limit rate 5/second log prefix "nft-drop-in: " level info
    counter comment "dropped-input"
}

counter is worth attaching liberally — it records packet and byte counts per rule with negligible cost and is your first instrument when asking “is this rule even matching?” (see Step 8).

6. Atomic reloads and include-file organization

nft -f <file> is atomic: the entire file is parsed, then committed in one kernel transaction. If line 200 has a syntax error, lines 1-199 never apply and the live ruleset is untouched. That is the whole point: there is no flush-then-rebuild window where the firewall is open, the way a script that runs iptables -F and then re-adds rules one iptables -A at a time leaves you exposed.

The idiom is flush ruleset at the top of the file followed by the full definition. Because flush and rebuild are in the same transaction, the swap is seamless:

#!/usr/sbin/nft -f
flush ruleset
include "/etc/nftables.d/sets.nft"
include "/etc/nftables.d/filter.nft"
include "/etc/nftables.d/nat.nft"

Gotcha: include resolves at parse time and is still atomic. But globs only match files that exist — include "/etc/nftables.d/*.nft" silently includes nothing if the directory is empty, which can leave you with a flush-ed empty ruleset and a default-accept kernel. Validate with nft -c -f in CI and after every render.
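A cheap guard against the empty-glob case is to confirm that the directory holds at least one matching file before loading anything. A minimal sketch in POSIX shell; require_includes is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: require_includes is a hypothetical pre-flight check.
require_includes() {
    dir=$1

    # An unmatched glob stays literal in POSIX sh, so test the first result.
    set -- "$dir"/*.nft
    if [ ! -e "$1" ]; then
        echo "refusing to load: no .nft files in $dir" >&2
        return 1
    fi
    echo "$# include file(s) found in $dir"
}
```

Chained as require_includes /etc/nftables.d && nft -c -f /etc/nftables.conf && nft -f /etc/nftables.conf, a missing render stops the reload before flush ruleset ever runs.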

Split by concern so a fleet config tool varies only the data files. Define variables and macros once and reuse them:

# /etc/nftables.d/sets.nft
define wan_if = "eth0"
define lan_net = 10.0.10.0/24

table inet filter {
    set admin_nets {
        type ipv4_addr
        flags interval
        auto-merge
        elements = { 10.0.0.0/8 }
    }
}

# /etc/nftables.d/filter.nft
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        ct state established,related accept
        ct state invalid drop
        iif "lo" accept
        tcp dport 22 ip saddr @admin_nets ct state new accept
        ip saddr $lan_net accept
    }
}

Render sets.nft per environment from Ansible/Jinja, keep filter.nft and nat.nft byte-identical everywhere, and commit the lot to Git. The chain logic is reviewed once; only the data differs between hosts.
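As a stand-in for the Ansible/Jinja template, the rendering step can be sketched in plain shell. The function name render_sets and its argument order (WAN interface, LAN prefix, then any number of admin prefixes) are assumptions of this example.

```shell
#!/bin/sh
# Sketch only: render_sets stands in for the Ansible/Jinja template; the
# argument order is an assumption of this example.
render_sets() {
    wan_if=$1 lan_net=$2
    shift 2

    # Remaining arguments are admin prefixes; join them with ", ".
    elements=
    for cidr in "$@"; do
        elements="${elements:+$elements, }$cidr"
    done

    # Some nft versions reject an empty "elements = { }", so emit the
    # line only when there is something to put in it.
    elem_line=
    if [ -n "$elements" ]; then
        elem_line="        elements = { $elements }"
    fi

    cat <<EOF
define wan_if = "$wan_if"
define lan_net = $lan_net

table inet filter {
    set admin_nets {
        type ipv4_addr
        flags interval
        auto-merge
$elem_line
    }
}
EOF
}
```

render_sets eth0 10.0.10.0/24 10.0.0.0/8 192.0.2.10 > sets.nft produces a per-host data file, and the resulting diff between two environments is limited to the define lines and the elements line.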

7. Migrating from iptables and coexisting with firewalld

If you are on a live iptables ruleset, do not hand-rewrite it — translate it. The iptables-nft shims and the translation tools do the mechanical work:

# Translate a single rule to see the nft equivalent.
iptables-translate -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
# -> nft add rule ip filter INPUT tcp dport 22 ip saddr 10.0.0.0/8 counter accept

# Translate an entire saved ruleset.
iptables-save > /tmp/rules.v4
iptables-restore-translate -f /tmp/rules.v4 > /etc/nftables-from-iptables.nft

Review the output. The translator is faithful but mechanical: it emits separate ip and ip6 tables rather than consolidating into inet, and it will not refactor repeated rules into sets. Treat it as a correct starting point to refactor, then validate with nft -c -f before cutting over.

A critical distinction on RHEL/Ubuntu: the legacy commands may already be nftables underneath via the nft backend. Check which backend iptables resolves to:

iptables --version          # look for "(nf_tables)" vs "(legacy)"
update-alternatives --display iptables   # Debian/Ubuntu: show the selected backend (--config switches it)

If it reports nf_tables, your iptables rules already live in the kernel nftables engine — but in separate tables from any native ruleset you write. Two tools writing different tables on the same hooks both apply, in priority order, which is a real source of “I removed the rule but traffic is still blocked.” Standardize on one path.
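For fleet audits, the backend check is easy to script by classifying the version string. A minimal sketch; iptables_backend is a hypothetical helper that takes the output of iptables --version as its argument.

```shell
#!/bin/sh
# Sketch only: iptables_backend is a hypothetical helper that classifies
# the string printed by "iptables --version".
iptables_backend() {
    case $1 in
        *'(nf_tables)'*) echo nf_tables ;;
        *'(legacy)'*)    echo legacy ;;
        *)               echo unknown ;;   # no backend suffix in the string
    esac
}
```

Call it as iptables_backend "$(iptables --version)", and repeat with ip6tables, since the two alternatives can be switched independently on Debian and Ubuntu.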

firewalld is the higher-level front end on RHEL and renders to nftables by default (FirewallBackend=nftables in /etc/firewalld/firewalld.conf). Running firewalld and hand-managing nftables.service means two owners fighting over the ruleset. Pick one:

# Going all-in on native nftables: stop firewalld owning the ruleset.
systemctl disable --now firewalld
systemctl enable --now nftables

If you must keep firewalld (for its zone model or because other tooling drives it), do not also enable nftables.service. Instead, inject custom native rules through firewalld’s direct/passthrough or a dedicated priority-separated table, and accept that firewalld owns flush. The failure mode to avoid is a flush ruleset in your file wiping firewalld’s tables on the next reload, or firewalld’s reload wiping yours.

8. Tracing packets and debugging with nft monitor and counters

When a packet does not do what the ruleset says it should, stop reading rules and watch the kernel evaluate them. nftables has a real packet tracer.

Tag the traffic you care about with meta nftrace set 1 in a high-priority chain (filter it tightly — tracing is verbose), then watch the trace stream:

# Trace just one client's traffic so the output stays readable.
nft add table inet trace
nft add chain inet trace prerouting \
    '{ type filter hook prerouting priority -300; }'
nft add rule inet trace prerouting ip saddr 198.51.100.7 meta nftrace set 1

# In another terminal, watch every chain/rule the tagged packets hit:
nft monitor trace

The trace output shows each rule the packet matched and the verdict at every hop — it tells you exactly where a packet is dropped or which chain it never reached. Remove the trace table when done so you are not tracing in production:

nft delete table inet trace

nft monitor (without trace) is the other half: it streams ruleset changes in real time — every add, delete, and set update — which is invaluable for catching a config tool stepping on your rules:

nft monitor              # live feed of all ruleset events
nft monitor rules        # just rule add/delete events

For the “is this rule matching at all?” question, attach counter and read it. A rule with zero packets after you have generated traffic is matching nothing — the bug is in the match, not downstream:

nft -a list ruleset      # -a adds rule handles; options must precede the command
nft list chain inet filter input
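On a long chain, picking out the zero-hit rules by eye is tedious. The filter below is a small sketch that reads nft listing output on stdin and prints only rules whose inline counter is still at zero, prefixed with the handle when one is present. zero_hit_rules is a hypothetical name, and it only sees rules that carry an inline counter.

```shell
#!/bin/sh
# Sketch only: zero_hit_rules is a hypothetical filter for the text that
# "nft -a list chain ..." prints; rules without a counter are not reported.
zero_hit_rules() {
    awk '
        / counter packets 0 bytes 0( |$)/ {
            handle = "?"
            # "# handle " is 9 characters; the digits follow it.
            if (match($0, /# handle [0-9]+$/))
                handle = substr($0, RSTART + 9, RLENGTH - 9)
            sub(/^[ \t]+/, "")               # drop the listing indentation
            sub(/ # handle [0-9]+$/, "")     # drop the trailing handle comment
            printf "handle %s: %s\n", handle, $0
        }
    '
}
```

Run it as nft -a list chain inet filter input | zero_hit_rules after generating test traffic. Anything it prints is a rule the traffic never matched.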

Use the handle from -a to surgically delete or insert one rule without rewriting the file — essential for live debugging:

nft delete rule inet filter input handle 17
nft insert rule inet filter input handle 4 ip saddr 203.0.113.0/24 drop   # lands before the rule with handle 4

Verify

Validate from the bottom up; a failure low in the stack invalidates everything above it.

# Syntax is valid and the unit will load it at boot.
nft -c -f /etc/nftables.conf && systemctl is-enabled nftables

# The live ruleset matches intent, with per-rule counters and handles.
nft -a list ruleset

# Conntrack is actually tracking flows (and not full).
conntrack -C                         # current entry count
cat /proc/sys/net/netfilter/nf_conntrack_max
conntrack -L -p tcp --dport 443 2>/dev/null | head   # live entries

# Forwarding is on wherever you NAT or route.
sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding

# NAT is rewriting as designed — watch the translation happen.
conntrack -L -d 203.0.113.5 2>/dev/null   # shows orig vs reply tuples for DNAT/SNAT

# Counters prove which rules fire under real traffic.
nft list chain inet filter input

# Packet-level proof on the wire, both sides of the NAT.
tcpdump -ni eth0 'tcp port 443'
tcpdump -ni eth1 'host 10.0.10.20'

The fastest NAT sanity check is conntrack -L: each entry prints the original tuple and the reply tuple. If the reply tuple’s source/destination is not rewritten the way you expect, your NAT rule is in the wrong chain or never matched — go back to Step 1’s priority table.
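Comparing the two tuples by hand gets old quickly, so here is a rough sketch that does it for you. It reads conntrack -L output on stdin and reports entries whose reply tuple does not mirror the original. nat_entries is a hypothetical helper, and it compares addresses only, so a port-only translation to the same address will not show up.

```shell
#!/bin/sh
# Sketch only: nat_entries is a hypothetical filter for "conntrack -L"
# output. It compares addresses, not ports.
nat_entries() {
    awk '
        {
            n = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^src=/) src[++n] = substr($i, 5)   # 1 = original, 2 = reply
                if ($i ~ /^dst=/) dst[n]   = substr($i, 5)
            }
            if (n < 2) next
            # Untranslated flows have a reply tuple that mirrors the original.
            if (dst[1] != src[2]) printf "DNAT %s -> %s\n", dst[1], src[2]
            if (src[1] != dst[2]) printf "SNAT %s -> %s\n", src[1], dst[2]
        }
    '
}
```

Use it as conntrack -L 2>/dev/null | nat_entries. A published service that produces no DNAT line under live traffic points back at the prerouting rule rather than at the forward chain.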

Enterprise scenario

A platform team ran a pair of active/standby Linux edge routers fronting a payments DMZ. They published three backend services via DNAT on the WAN and masqueraded outbound LAN traffic. After migrating the routers from RHEL 8 (iptables) to RHEL 9 (native nftables), two of the three published services worked and the third — a partner API on TCP 8443 mapped to an internal host on 443 — returned connection resets from the outside, intermittently.

The constraint: the cutover happened during a change window with the partner integration frozen, so they could not just “open it up and tighten later.” They needed the exact, minimal rule.

Tracing told the story in under a minute. With meta nftrace set 1 scoped to the partner’s source prefix, nft monitor trace showed the DNAT in prerouting firing correctly — the destination was rewritten to 10.0.10.30:443 — but the packet then hit the forward chain’s policy drop. The team had copied the forward allow rule from the working services, which matched tcp dport 443. The partner flow arrived on external port 8443; after DNAT the destination port was 443, but their forward rule had been written against the pre-NAT port out of habit, and a stray ct status dnat on the wrong service line meant only some flows matched. The fix was one precise rule keyed on the post-translation tuple plus the DNAT status:

# forward chain: allow exactly the partner's DNATed flow.
ip daddr 10.0.10.30 tcp dport 443 ct state new ct status dnat \
    ip saddr 198.51.100.0/24 accept

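They also folded the three backends into a named NAT map (a data map rather than a verdict map, since it returns an address and port instead of a verdict) so the next service addition was a data change, not a logic change, and could be reviewed as a one-line diff: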

table ip nat {
    map dnat_map {
        type inet_service : ipv4_addr . inet_service
        elements = {
            443  : 10.0.10.20 . 443,
            8080 : 10.0.10.25 . 80,
            8443 : 10.0.10.30 . 443
        }
    }
    chain prerouting {
        type nat hook prerouting priority dstnat; policy accept;
        iifname "eth0" tcp dport @dnat_map dnat to tcp dport map @dnat_map
    }
}

The lesson the team wrote into their runbook: after DNAT, every downstream rule sees the translated tuple, so forward-chain allows must be written against the post-NAT destination and gated with ct status dnat. And when a rule “should” match but does not, trace the packet — do not re-read the file a tenth time. The whole ruleset went into Git with nft -c -f running in CI, so a malformed include could never again reach a router.

Pitfalls and next steps

The recurring failures are predictable once you know the model: a NAT rule placed in the wrong chain type or hook, a forward allow written against the pre-DNAT port, a blanket ICMPv6 drop that quietly breaks IPv6, two owners (firewalld and nftables.service) flushing each other’s tables, and an unbounded log rule that turns a SYN flood into a disk-full incident. Each is invisible until you check the live ruleset with nft -a list ruleset, watch counters, and trace the packet.

From here, push the ruleset into provisioning: render the data files (sets.nft, NAT maps) per environment, keep chain logic byte-identical and reviewed once, and run nft -c -f in CI so a bad include never reaches a box. Wire nft counters into your metrics so a filling conntrack table or a rule suddenly matching a flood pages a dashboard, not you. The endgame is a firewall that is a small, reviewed, data-driven artifact in Git — not a thousand-line file nobody will touch.
