Cilium Beyond CNI: Cluster Mesh, Egress Gateway, and the BGP Control Plane

Most teams install Cilium, get a working CNI plus NetworkPolicy, and stop there. That leaves the most valuable half of the product on the shelf. The same eBPF dataplane that forwards your pod traffic can federate clusters into a single service namespace, force selected pods to leave the cluster through a stable source IP a partner has allowlisted, and speak BGP to your top-of-rack switches so PodCIDRs and LoadBalancer IPs are reachable from the physical network without an external load balancer. This guide wires all three together and shows how to debug them when they misbehave.

Everything here targets Cilium 1.16/1.17 with the BGP control plane v2 API (CiliumBGPClusterConfig and friends) and Cluster Mesh as it ships today. Commands are real and current; where a feature has a sharp edge I name it instead of papering over it.

1. The dataplane recap that makes the rest make sense

Three properties of the Cilium dataplane explain why the platform features behave the way they do.

Identities, not IPs. Cilium assigns every pod a numeric security identity derived from its labels, and the dataplane enforces policy on identities, not IP addresses. A flow is “frontend talking to payments,” not “10.0.3.7 talking to 10.0.9.4.” That indirection is what makes Cluster Mesh possible: if two clusters agree on what label set maps to which identity, a policy written in cluster A applies to the same workload in cluster B.

kube-proxy replacement. Cilium can fully replace kube-proxy, implementing Service load balancing in eBPF at the socket and tc layers instead of with iptables/IPVS. This removes a large iptables ruleset, lowers latency, and — critically for this article — is the mechanism that lets a ClusterIP resolve to backends in another cluster.

# Confirm kube-proxy replacement is actually active
cilium config view | grep -i kube-proxy-replacement
# kube-proxy-replacement   true

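# KubeProxyReplacement is reported by the agent, so query the agent directly
kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose | grep -A2 KubeProxyReplacement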

Routing mode. Cilium runs in tunnel mode (VXLAN/Geneve, the default and most portable) or native routing, where pod traffic is routed without encapsulation and the underlying network is expected to know the PodCIDRs. Native routing is faster and is the natural pairing with the BGP control plane — you advertise the PodCIDRs precisely so the fabric can route them. Check yours:

cilium config view | grep -E 'routing-mode|tunnel-protocol'
# routing-mode   tunnel   (or "native")

Mental model for the whole article: the eBPF datapath already does identity-aware L3/L4 load balancing. Cluster Mesh extends the identity and service catalog across clusters; Egress Gateway changes the source IP of selected egress flows; BGP changes what the physical network knows how to reach. They compose because they all sit on the same datapath.

2. Cluster Mesh: shared identities, global services, cross-cluster LB

Cluster Mesh connects two or more Cilium clusters so that Services, identities, and policy span all of them. Each cluster runs a clustermesh-apiserver (etcd plus a sync agent) that exposes its state; every other cluster’s agents read it.

Two invariants must hold before you connect anything, and getting them wrong is the most common failure:

  1. Unique, stable cluster-id and cluster-name per cluster. IDs are integers. With the default identity allocation, the usable range is 1 to 255, and IDs must be unique across the mesh. (Raising max-connected-clusters to 511 is possible, but it reshapes the identity bit layout and must be set identically, from install time, on every cluster; it is not a knob to flip on a live mesh.)
  2. Non-overlapping PodCIDRs across all clusters. Cluster Mesh routes pod-to-pod by IP; overlapping ranges are unroutable. Plan this at install.
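
Both invariants are easy to check before any cluster is connected. The following is a minimal sketch, not part of Cilium: the function name check_mesh_plan and the dict keys (name, id, pod_cidrs) are made up for illustration, and the plan is assumed to be assembled by hand from your install values.

```python
import ipaddress
from itertools import combinations


def check_mesh_plan(clusters, max_id=255):
    """Return a list of problems found in a mesh plan; an empty list means it passes.

    `clusters` is a list of dicts with hypothetical keys: name, id, pod_cidrs.
    """
    problems = []
    ids_seen = {}
    names_seen = set()

    for cluster in clusters:
        name, cid = cluster["name"], cluster["id"]
        if not 1 <= cid <= max_id:
            problems.append(f"{name}: cluster id {cid} is outside 1..{max_id}")
        if cid in ids_seen:
            problems.append(f"{name}: cluster id {cid} is already used by {ids_seen[cid]}")
        else:
            ids_seen[cid] = name
        if name in names_seen:
            problems.append(f"{name}: cluster name is not unique")
        names_seen.add(name)

    # Compare every pair of clusters and flag any overlapping PodCIDR ranges.
    for a, b in combinations(clusters, 2):
        for cidr_a in a["pod_cidrs"]:
            for cidr_b in b["pod_cidrs"]:
                if ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b)):
                    problems.append(f"{a['name']} {cidr_a} overlaps {b['name']} {cidr_b}")
    return problems


plan = [
    {"name": "cluster-east", "id": 1, "pod_cidrs": ["10.1.0.0/16"]},
    {"name": "cluster-west", "id": 2, "pod_cidrs": ["10.2.0.0/16"]},
]
for problem in check_mesh_plan(plan):
    print(problem)  # prints nothing for a valid plan
```

Running a check like this in CI against the Helm values for every cluster catches a reused ID or an overlapping range before it reaches a live mesh, where neither is cheap to change.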

Set the identity at install (Helm) so it is baked into every agent:

# Cluster 1
helm upgrade --install cilium cilium/cilium --namespace kube-system \
  --set cluster.name=cluster-east \
  --set cluster.id=1

# Cluster 2 — different name AND id
helm upgrade --install cilium cilium/cilium --namespace kube-system \
  --set cluster.name=cluster-west \
  --set cluster.id=2

Enable the apiserver and connect the clusters. The CLI handles cert/secret exchange and writes the peer config into both clusters:

# Run against each cluster context
cilium clustermesh enable --context cluster-east --service-type LoadBalancer
cilium clustermesh enable --context cluster-west --service-type LoadBalancer

# Wait for the apiserver to be Ready in both
cilium clustermesh status --context cluster-east --wait

# Bi-directional connect (run once; it configures both sides)
cilium clustermesh connect --context cluster-east --destination-context cluster-west

--service-type LoadBalancer exposes the apiserver via a cloud LB; on bare metal use NodePort or, better, a LoadBalancer IP that you then advertise with BGP (section 5 — this is one reason the features pair so well). The apiserver endpoint must be reachable from the other cluster’s nodes, so a private NLB or an advertised VIP is the production-grade choice over NodePort.

Global services

A normal Service is local. You opt a Service into the mesh with an annotation, and Cilium unions the endpoints from every cluster that defines a Service of the same name and namespace:

apiVersion: v1
kind: Service
metadata:
  name: checkout
  namespace: shop
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: checkout
  ports:
    - port: 8080

Deploy an identically named checkout Service in both clusters. Now checkout.shop.svc.cluster.local resolves locally as always, but the eBPF load balancer’s backend set for that ClusterIP includes pods from both clusters. A caller in cluster-east can be served by a checkout pod in cluster-west transparently — no DNS tricks, no second hostname.

3. Designing global vs local services, and failover affinity

“Make everything global” is the wrong instinct. The right unit of decision is per-Service, and the lever is affinity.

By default a global service load-balances across all backends in all clusters equally. That is rarely what you want for latency-sensitive paths — you do not want half your checkout calls crossing a region boundary in steady state. The service.cilium.io/affinity annotation fixes this:

metadata:
  annotations:
    service.cilium.io/global: "true"
    # Prefer local backends; spill to remote only if no healthy local exist
    service.cilium.io/affinity: "local"

affinity value    Steady-state routing     Falls back to
local             Local backends only      Remote backends, when no healthy local backend remains
remote            Remote backends only     Local backends, when no healthy remote backend remains
none (default)    All clusters, evenly     Nothing; every healthy backend is always eligible

affinity: local is the workhorse for active/active with regional failover: traffic stays in-cluster for latency and egress cost, and the mesh silently fails the Service over to the peer cluster only when local endpoints disappear. There is also service.cilium.io/shared: "false", which lets a cluster consume a global service but not export its own backends into it — useful for a cluster that should call a shared service without advertising its own pods as backends.
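
The affinity semantics can be modeled in a few lines. This is a simplified sketch of the selection rule as described here, not Cilium's datapath code: the function name eligible_backends and the (address, healthy) tuple shape are assumptions made for the example.

```python
def eligible_backends(affinity, local, remote):
    """Return the backend addresses eligible for new connections.

    `local` and `remote` are lists of (address, healthy) tuples.
    """
    healthy_local = [addr for addr, healthy in local if healthy]
    healthy_remote = [addr for addr, healthy in remote if healthy]

    if affinity == "local":
        # Stay in-cluster; spill to the peer only when nothing local is healthy.
        return healthy_local or healthy_remote
    if affinity == "remote":
        # Mirror image: prefer the peer, fall back to local backends.
        return healthy_remote or healthy_local
    if affinity == "none":
        # Default: every healthy backend in every cluster is a candidate.
        return healthy_local + healthy_remote
    raise ValueError(f"unknown affinity: {affinity!r}")


west = [("10.2.0.11", True), ("10.2.0.12", True)]
east = [("10.1.0.21", True)]
print(eligible_backends("local", west, east))

# Simulate every west backend failing: the peer cluster takes over.
west_down = [(addr, False) for addr, _ in west]
print(eligible_backends("local", west_down, east))
```

The point of the model is that failover is a property of the backend set, not of DNS: callers keep using the same ClusterIP while the set behind it changes.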

A design rule that saves incidents: keep stateful backends local. Global services are an L3/L4 endpoint union with no awareness of data locality. Federating a database Service so writes can land in either region is a correctness bug, not a feature. Make stateless, idempotent services global; pin stateful ones with affinity: local or keep them out of the mesh entirely.

4. Egress Gateway: fixed source IPs for partner allowlists

The recurring enterprise problem: a partner, a payment processor, or a legacy mainframe will only accept connections from a short list of source IPs. Pods get ephemeral, node-dependent addresses that change on every reschedule, so they cannot be allowlisted. Egress Gateway solves exactly this — it forces traffic from selected pods, to selected destinations, to leave the cluster SNAT’d to a stable IP owned by a designated gateway node.

Prerequisites: enable the feature and bpf-masquerade (Egress Gateway depends on eBPF masquerading).

helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
  --set egressGateway.enabled=true \
  --set bpf.masquerade=true \
  --set kubeProxyReplacement=true

Define the policy. This says: pods labelled app=billing in namespace shop, when talking to the partner CIDR, egress via the node matching egress-node=true, SNAT’d to 203.0.113.10.

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: billing-to-partner
spec:
  selectors:
    - podSelector:
        matchLabels:
          io.kubernetes.pod.namespace: shop
          app: billing
  destinationCIDRs:
    - "198.51.100.0/24"        # the partner's network
  egressGateway:
    nodeSelector:
      matchLabels:
        egress-node: "true"
    egressIP: 203.0.113.10     # must live on an interface of the gateway node

The egressIP must be a real address configured on an interface of the gateway node (commonly a secondary IP). The gateway node becomes a chokepoint and a SPOF for that traffic class, so in production you select a small set of gateway nodes and front the egress IP with a mechanism that can move it on failure: on bare metal, BGP or L2 announcement (sections 5 and 6); in cloud, a floating/secondary IP you can reassign.

Sharp edge: only traffic matching both the pod selector and a destinationCIDR is redirected and SNAT’d. Everything else egresses normally with the node’s IP. Do not write 0.0.0.0/0 as the destination unless you genuinely intend to funnel all egress from those pods through one node — that is a bandwidth and blast-radius decision, not a default. Also note return traffic must route back to the gateway node, so the egress IP and the gateway node must share a subnet/routing domain that the upstream network honors.
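
The "both conditions" rule is worth internalizing, because it determines which test traffic actually exercises the gateway. A minimal sketch of the match logic follows, assuming a simplified policy shape; the dict keys match_labels and destination_cidrs and the function name uses_egress_gateway are illustrative, not the CRD schema or Cilium code.

```python
import ipaddress


def uses_egress_gateway(pod_labels, dest_ip, policy):
    """True only if the pod matches the selector AND the destination is in a listed CIDR."""
    labels_match = all(
        pod_labels.get(key) == value for key, value in policy["match_labels"].items()
    )
    dest = ipaddress.ip_address(dest_ip)
    cidr_match = any(
        dest in ipaddress.ip_network(cidr) for cidr in policy["destination_cidrs"]
    )
    return labels_match and cidr_match


policy = {
    "match_labels": {"io.kubernetes.pod.namespace": "shop", "app": "billing"},
    "destination_cidrs": ["198.51.100.0/24"],
}
billing = {"io.kubernetes.pod.namespace": "shop", "app": "billing"}
frontend = {"io.kubernetes.pod.namespace": "shop", "app": "frontend"}

print(uses_egress_gateway(billing, "198.51.100.25", policy))   # partner address
print(uses_egress_gateway(billing, "8.8.8.8", policy))         # any other destination
print(uses_egress_gateway(frontend, "198.51.100.25", policy))  # pod not selected
```

Only the first call matches. A billing pod calling any address outside the partner range leaves with the node's IP, which is why a test against an arbitrary public endpoint says nothing about whether the policy works.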

5. BGP control plane: advertise PodCIDR and LoadBalancer IPs

In native-routing or bare-metal clusters, the physical network does not inherently know how to reach pod IPs or LoadBalancer VIPs. Cilium’s BGP control plane peers each node with your routers (ToR switches, a route reflector) and advertises that reachability dynamically. No more static routes that rot when a node is replaced.

The v2 API splits configuration into composable resources: a CiliumBGPClusterConfig (which nodes peer, with whom), a CiliumBGPPeerConfig (timers, families, reusable peer template), and CiliumBGPAdvertisement (what to advertise). Enable the feature first:

helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
  --set bgpControlPlane.enabled=true

Define a reusable peer config — graceful restart matters so a Cilium agent restart does not blackhole traffic while sessions re-establish:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: tor-peer
spec:
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 120
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: bgp        # ties to the Advertisement below

Cluster config: which nodes peer, and the router’s ASN/address. peerASN is the upstream router; localASN is what Cilium presents. Use private ASNs (64512 to 65534) unless your network team assigned one.

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      bgp-enabled: "true"        # only label nodes that should peer
  bgpInstances:
    - name: instance-65000
      localASN: 65000
      peers:
        - name: tor-switch
          peerASN: 64512
          peerAddress: 10.0.0.1          # the ToR/router
          peerConfigRef:
            name: tor-peer
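
A typo in an ASN is a common reason a session never establishes, so a sanity check on the values in the cluster config can be worthwhile. The sketch below is a hypothetical helper, not a Cilium API. It checks the 16-bit private range 64512 to 65534 used in this guide, and also the 32-bit private range 4200000000 to 4294967294 from RFC 6996, which this guide does not otherwise use.

```python
# Private-use ASN ranges per RFC 6996 (upper bounds are inclusive in the RFC,
# so the range() stop values are one higher).
PRIVATE_ASN_16BIT = range(64512, 65535)
PRIVATE_ASN_32BIT = range(4200000000, 4294967295)


def is_private_asn(asn):
    """True if `asn` falls in a private-use range."""
    return asn in PRIVATE_ASN_16BIT or asn in PRIVATE_ASN_32BIT


# The two ASNs from the cluster config above.
for label, asn in (("localASN", 65000), ("peerASN", 64512)):
    print(label, asn, "private" if is_private_asn(asn) else "NOT private")
```

If either value comes back as not private, confirm with the network team that the ASN was actually assigned to you before peering with it.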

Finally, what to advertise. Here both PodCIDR and LoadBalancer service IPs:

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: pod-and-lb
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: PodCIDR
    - advertisementType: Service
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchLabels:
          announce: "bgp"        # only Services with this label get advertised

The Service advertisement only fires for Service objects that carry the matching label and have an assigned LoadBalancer IP — which is where IPAM comes in.

6. LoadBalancer IPAM and L2 announcements for bare metal

On bare metal there is no cloud controller to fill in status.loadBalancer.ingress, so type: LoadBalancer Services sit <pending> forever. Cilium ships its own IPAM to hand out those IPs from pools you define — this is the MetalLB-equivalent, built into Cilium.

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  blocks:
    - cidr: "203.0.113.0/24"
  serviceSelector:
    matchLabels:
      announce: "bgp"            # scope this pool to Services we also advertise

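Now a LoadBalancer Service gets an IP from the pool automatically. You then make it reachable one of two ways: advertise it over BGP with the Service advertisement from section 5, or, on a flat L2 segment where BGP peering is not available, have a node answer ARP for it with an L2 announcement policy: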

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-default
spec:
  loadBalancerIPs: true          # announce LB IPs (not externalIPs here)
  interfaces:
    - ^eth[0-9]+$                 # interfaces eligible to answer ARP
  nodeSelector:
    matchLabels:
      l2-announce: "true"

L2 announcement needs kubeProxyReplacement=true and the l2announcements feature enabled; it uses a leader-election lease per Service, so exactly one node answers at a time. Pick BGP for scale and ECMP, L2 for simplicity when BGP peering is not available. Do not enable both for the same VIP.

7. Securing cross-cluster traffic with cluster-aware policy

Federating clusters quietly widens your blast radius: a global service means a pod in cluster-west can now reach a backend in cluster-east. Standard CiliumNetworkPolicy still applies on the receiving side, and you should tighten it to be cluster-aware rather than leaving the mesh open. Cilium exposes the source cluster as the well-known label io.cilium.k8s.policy.cluster:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: checkout-allow-east-only
  namespace: shop
spec:
  endpointSelector:
    matchLabels:
      app: checkout
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
            io.cilium.k8s.policy.cluster: cluster-east   # only this cluster
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP

This permits frontend to reach checkout only when the caller originates in cluster-east, even though checkout is a global service reachable from the whole mesh. Treat the mesh as you would any trust expansion: default-deny on the receiving namespace, then allow named cross-cluster flows explicitly, qualified by source cluster. The identity model makes this exact and durable — it survives pod IP churn on both sides because it is written in labels.
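
To see why the cluster label makes the rule exact, it helps to model the match. This is a deliberately simplified sketch of how a single ingress rule is evaluated against a caller's labels, assuming default-deny for everything the rule does not match; the function name ingress_allowed and the rule dict shape are illustrative, and real policy evaluation works on identities rather than raw label dicts.

```python
def ingress_allowed(source_labels, port, protocol, rule):
    """True if the caller carries every label the rule requires and targets an allowed port."""
    from_match = all(
        source_labels.get(key) == value for key, value in rule["from_labels"].items()
    )
    port_match = (port, protocol) in rule["ports"]
    return from_match and port_match


rule = {
    "from_labels": {
        "app": "frontend",
        "io.cilium.k8s.policy.cluster": "cluster-east",
    },
    "ports": {(8080, "TCP")},
}

east_frontend = {"app": "frontend", "io.cilium.k8s.policy.cluster": "cluster-east"}
west_frontend = {"app": "frontend", "io.cilium.k8s.policy.cluster": "cluster-west"}

print(ingress_allowed(east_frontend, 8080, "TCP", rule))  # same cluster label
print(ingress_allowed(west_frontend, 8080, "TCP", rule))  # same app, other cluster
```

The two callers differ only in the cluster label, and that alone decides the outcome. Nothing in the rule mentions an IP, so rescheduling pods on either side does not change the result.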

Verify

Prove each layer independently; debugging a federated, BGP-advertised, egress-pinned cluster is miserable if you cannot isolate which feature broke.

# --- Cluster Mesh ---
cilium clustermesh status --context cluster-east        # peers Ready, tunnels up
# Confirm a global service unioned backends from both clusters:
kubectl exec -n kube-system ds/cilium -- \
  cilium-dbg service list | grep -A4 checkout
# A global service shows backends with IPs from BOTH PodCIDRs.

# --- Egress Gateway ---
kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf egress list
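# Then from a billing pod, call a host INSIDE the policy's destinationCIDRs that
# echoes the caller's source IP. 198.51.100.20 is a placeholder for such a host;
# a public echo service such as ifconfig.me is outside 198.51.100.0/24, so that
# traffic would not be redirected through the gateway.
kubectl exec -n shop deploy/billing -- curl -s http://198.51.100.20/
# Must print the egressIP (203.0.113.10), not a node IP.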

# --- BGP ---
cilium bgp peers                  # Session State: established; per-peer
cilium bgp routes advertised ipv4 unicast    # PodCIDR + /32 LB IPs present

# --- LoadBalancer IPAM ---
kubectl get svc -n shop -o wide   # EXTERNAL-IP populated from the pool, not <pending>

# --- End-to-end flow visibility ---
hubble observe --namespace shop --follow
# Cross-cluster flows are tagged with the source/destination cluster.

For BGP specifically: if Session State is active or idle and never reaches established, it is almost always (a) a wrong peerASN/peerAddress, (b) the node not matching the nodeSelector (check the bgp-enabled label), or (c) the upstream router not configured to accept the session / expecting MD5 auth. cilium bgp peers shows the negotiated families and received/advertised route counts, which disambiguates “session up but no routes” from “no session.”
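
The triage order above can be written down as a small decision helper. This is a sketch under stated assumptions: the function triage_bgp is hypothetical, its inputs are values you read off the cilium bgp peers output by hand, and the hint for the "session up but no routes" case (checking the advertise and announce labels from section 5) is an inference from this guide's own configuration, not documented tool behavior.

```python
def triage_bgp(session_state, advertised_routes):
    """Return a list of things to check, in order; an empty list means nothing obvious is wrong."""
    if session_state.lower() != "established":
        # No session at all: the three usual suspects.
        return [
            "verify peerASN and peerAddress in the CiliumBGPClusterConfig",
            "verify the node matches the nodeSelector (bgp-enabled label)",
            "verify the upstream router accepts the session and its MD5 auth expectations",
        ]
    if advertised_routes == 0:
        # Session is up but nothing is being announced.
        return [
            "verify the CiliumBGPAdvertisement carries the label the peer config "
            "selects, and that Services carry the label the advertisement selects",
        ]
    return []


for hint in triage_bgp("active", 0):
    print("-", hint)
```

Separating the two failure classes first keeps you from editing advertisements when the real problem is that the session never came up.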

Enterprise scenario

A payments platform team ran two regional EKS-on-self-managed-nodes clusters (us-east, us-west) behind a single product. Two constraints collided. First, their acquiring bank’s firewall allowlisted exactly two source IPs and refused to add more — so every settlement call, from any cluster, had to appear to come from one of those two addresses. Second, the business wanted regional active/active: a settlement request landing in us-west should be served locally for latency, but fail over to us-east if the west settlement workers were down, without a DNS change.

They solved it by composing the three features. Cluster Mesh federated the clusters, and the settlement Service was made global with service.cilium.io/affinity: "local" so steady-state traffic stayed regional and failover to the peer cluster was automatic. The two allowlisted bank IPs were configured as egress IPs on a pair of dedicated gateway nodes per region, and a CiliumEgressGatewayPolicy funneled only the settlement pods, only to the bank’s CIDR, through them. The gateway IPs themselves were advertised via the BGP control plane so they survived gateway-node replacement without the bank ever seeing a new source.

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: settlement-egress
spec:
  selectors:
    - podSelector:
        matchLabels:
          io.kubernetes.pod.namespace: payments
          app: settlement
  destinationCIDRs:
    - "192.0.2.0/24"            # the bank's published ingress range
  egressGateway:
    nodeSelector:
      matchLabels:
        role: egress-gw
    egressIP: 198.51.100.7      # one of the two bank-allowlisted IPs

The decisive design choice was scoping destinationCIDRs to the bank’s range only. An earlier draft used 0.0.0.0/0, which pushed all egress from the settlement pods (telemetry, package pulls, DNS to external resolvers) through the two gateway nodes and saturated them during a deploy. Narrowing the CIDR to just the bank cut the traffic crossing the gateways by an order of magnitude and made the SPOF acceptable. The failover affinity earned its keep three weeks later when a us-west node group rolled badly: settlement traffic shifted to us-east automatically, the bank saw the same two source IPs throughout, and no customer transaction failed.

Checklist
