
Running the Managed Istio Add-on on AKS: mTLS, Ingress Gateways, and Egress Control

The AKS managed Istio add-on — Microsoft brands it Azure Service Mesh, and you address it everywhere as a revision string like asm-1-27 — takes the part of Istio that most teams get catastrophically wrong (control-plane lifecycle, version upgrades, CRD hygiene) and makes it Microsoft’s problem. The istiod control plane is installed, patched and health-monitored for you; the canary upgrade machinery is wired up; the gateway deployments are lifecycled. What the add-on does not do is make your security posture, routing, or egress correct by default. It ships with permissive mTLS and ALLOW_ANY egress out of the box. Every property that makes a mesh worth the Envoy memory tax — strict identity, least-privilege authorization, a fixed ingress edge, an auditable egress allowlist — is something you still configure deliberately, and the managed variant diverges from upstream Istio in a dozen specifics that silently break copy-pasted blog tutorials.

This is the production playbook for that gap. You will walk the full path a platform team takes: enable the add-on against a pinned revision, label namespaces with the revision label the add-on actually honours (not the one every tutorial shows), enforce STRICT PeerAuthentication scoped by AuthorizationPolicy with SPIFFE identities, stand up managed internal and external ingress gateways, lock egress down to a REGISTRY_ONLY + ServiceEntry allowlist, run a canary revision upgrade end to end with revision tags, and wire Envoy telemetry into Managed Prometheus. Because the add-on’s constraints are exactly where teams lose afternoons — the root namespace is aks-istio-system not istio-system; istio-injection=enabled is a no-op; the shared ConfigMap name is revision-suffixed; the egress gateway is unsupported on Pod Subnet clusters — the rules, settings, limits and failure modes here are all laid out as scannable tables. Read the prose once; keep the tables open during the incident.

By the end you will stop guessing why a pod came up 1/1 instead of 2/2, why flipping STRICT produced a wall of 503 UC, why your first AuthorizationPolicy black-holed traffic, why a VirtualService you applied changed nothing, and why an external call returns 502 from inside the sidecar. Each of those has one confirming command and one fix, and knowing which in ninety seconds is the difference between a clean rollout and a Sev-2 bridge.

What problem this solves

A service mesh exists to move three cross-cutting concerns — encryption in transit, workload identity/authorization, and traffic control — out of every application and into a uniform data plane of sidecar proxies. Without it, each team re-implements mTLS in their own language, authorization is a tangle of network policies and IP allowlists that break on every reschedule, and outbound traffic from a compromised pod can reach any host on the internet with nothing to stop or even log it. The managed add-on additionally solves the operational half: self-managing Istio means owning istiod upgrades, CRD migrations, and the blast radius of getting either wrong — work that has sunk more than one platform team.

What breaks without this knowledge is subtler than “the mesh is down,” because the add-on fails silently and asymmetrically. Label a namespace the way every upstream doc shows (istio-injection=enabled) and you get no sidecar and no error — the workload runs unencrypted, outside every policy you wrote, and looks healthy. Put a mesh-wide PeerAuthentication in istio-system (the upstream root namespace) and it is simply ignored, because the add-on’s root is aks-istio-system. Flip a namespace to STRICT before every caller is inside the mesh and you take an outage. Add your first ALLOW authorization policy and you default-deny everything you forgot to enumerate. Set REGISTRY_ONLY to lock egress and every undeclared external dependency starts returning 502 from the sidecar. None of these throw at apply time; they bite at request time, in production.

Who hits this: any platform or SRE team standing a mesh on AKS for PCI/zero-trust segmentation, anyone migrating from self-managed Istio or OSM, and anyone who copied a generic Istio tutorial and cannot work out why injection, mesh policy, or egress “doesn’t work.” It bites hardest on teams running Azure CNI Pod Subnet (the managed egress gateway is unsupported there) and on multi-team clusters where one namespace’s STRICT flip or authz policy ripples into another team’s calls. The fix is almost never “reinstall the mesh” — it is “use the add-on’s namespace, label, ConfigMap name and revision, not upstream Istio’s.”

To frame the whole field before the deep dive, here is every failure class this article covers, the question it forces, and the single command that localises it:

| Failure class | What you observe | First question to ask | First command to run | Most common single cause |
| --- | --- | --- | --- | --- |
| No sidecar injected | Pod is 1/1, traffic unencrypted, policies ignored | Did the namespace get the add-on’s label? | kubectl get pods -n <ns> (expect 2/2) | istio-injection=enabled used instead of istio.io/rev |
| STRICT breaks traffic (503 UC) | Upstream connection terminated after enabling STRICT | Is every caller in the mesh, and is the client TLS mode right? | istioctl x describe pod <pod> -n <ns> --istioNamespace aks-istio-system | A client still un-injected, or a DestinationRule forcing DISABLE |
| Authz black-holes traffic (403) | RBAC: access denied after first policy | Did the ALLOW policy enumerate every legit caller? | Envoy access log (rbac_access_denied) | First ALLOW policy is default-deny for the workload |
| Config not applied (STALE) | “I applied the VirtualService, nothing changed” | Is the proxy actually synced to the latest push? | istioctl proxy-status --istioNamespace aks-istio-system | Policy in wrong namespace, or proxy STALE |
| Egress blocked (502/000) | External call fails from inside the pod | Is there a ServiceEntry for the host under REGISTRY_ONLY? | kubectl exec ... -c <app-container> -- curl ... (the istio-proxy container’s own traffic bypasses the sidecar) | No ServiceEntry, or shared ConfigMap name wrong for the revision |
| Upgrade did nothing | New revision running but workloads unchanged | Did you restart workloads after repointing the tag? | istioctl proxy-status (mixed revisions) | Relabel/repoint without kubectl rollout restart |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with core Kubernetes objects (Deployment, Service, Namespace, labels/annotations) and with kubectl. You should understand what a sidecar is and the rough idea of a service mesh: a per-pod Envoy proxy that intercepts all inbound/outbound traffic so the platform — not the app — can do mTLS, routing and policy. Familiarity with mTLS (mutual TLS: both sides present certificates) and with Azure networking concepts (Standard Load Balancer, subnets/CIDRs, UDR, Azure Firewall) will let you move fast. You need an AKS cluster you can modify and Azure CLI 2.57.0+ (2.80.0+ if you want egress gateways).

This sits in the AKS networking & platform track. Conceptually it is downstream of Understanding Managed Kubernetes: AKS vs EKS vs GKE Compared and the broader Production AKS: Networking & Observability. It is the managed counterpart to the upstream-Istio deep dives — Istio Ambient Mesh: mTLS & Traffic Management and Istio Ambient: Waypoint Proxies & L7 Authorization — and a sibling to other mesh choices like Linkerd: mTLS, Retries & Multi-Cluster Failover. For the ingress/egress edges it pairs with Application Gateway for Containers: Gateway API & Traffic Splitting and Deterministic Egress with Azure NAT Gateway. Telemetry lands where Azure Monitor: Managed Prometheus & Managed Grafana for AKS picks it up.

A quick map of who owns which layer during a mesh incident, so you page the right person:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
| --- | --- | --- | --- |
| Client / DNS | TLS to the edge, name resolution | Frontend / SRE | North-south 503 only if the gateway IP/host is wrong |
| Managed ingress gateway | Public/internal LB, Gateway/VirtualService | Platform / network | 404 (no route/host match), 503 (no healthy upstream), source-range blocks |
| Envoy sidecar (data plane) | mTLS, authz, routing per pod | Platform + app | 503 UC (STRICT mismatch), 403 (authz), 502 (egress) |
| istiod (control plane) | xDS config push, cert issuance | Microsoft (managed) | STALE config, certificate issues (rare); root-namespace mistakes surface here |
| Shared ConfigMap / MeshConfig | Mesh-wide config (egress mode, access logs) | Platform | Egress mode, telemetry; wrong revision suffix = ignored |
| Egress (gateway / firewall) | Outbound allowlist, fixed source IP | Platform + network | 502 under REGISTRY_ONLY with no ServiceEntry |

Core concepts

Six mental models make every later diagnosis obvious.

The add-on is revision-scoped — there is no “Istio version” on the cluster. There is a revision like asm-1-27, and almost every object you touch is suffixed or keyed by that string: the istiod-asm-1-27 deployment, the istio-asm-1-27 reconciled ConfigMap, the istio-shared-configmap-asm-1-27 you actually edit, the istio.io/rev=asm-1-27 namespace label, the per-revision gateway pods. This is what makes the canary upgrade model work (two control planes side by side) and it is precisely why generic Istio docs — which assume a single un-suffixed istiod in istio-system — lead you astray.

The root namespace is aks-istio-system, not istio-system. Mesh-wide policy objects (a selector-less PeerAuthentication, the shared MeshConfig ConfigMap) live in the add-on’s root namespace. A PeerAuthentication you drop in istio-system is read by nothing. This single fact invalidates a large fraction of blog-post copy-paste, and it is the number-one reason a “mesh-wide STRICT” change appears to do nothing.

Injection requires an explicit revision label, applied at admission. istio-injection=enabled — the label every upstream tutorial uses — is silently ignored by the add-on. You must label the namespace istio.io/rev=asm-X-Y (or a revision tag, see below). Even then, labelling changes nothing about running pods: injection happens at pod admission, so you must kubectl rollout restart existing workloads to get a sidecar. A correctly-labelled namespace whose pods you never restarted still shows 1/1.

STRICT is a server-side contract; ISTIO_MUTUAL is the client-side one. PeerAuthentication governs what a workload’s sidecar accepts (PERMISSIVE = plaintext or mTLS; STRICT = mTLS only). A DestinationRule’s trafficPolicy.tls.mode governs what the client sidecar originates. The classic 503 UC (Envoy’s response flag for an upstream connection termination) after flipping STRICT is a mismatch: a server now demanding mTLS while some client is either un-injected (sends plaintext) or has a DestinationRule pinning it to DISABLE. Migrate clients in before you flip the server.

Authorization is default-allow until your first ALLOW policy, then default-deny for that workload. With no AuthorizationPolicy, mTLS proves identity but every authenticated workload can still call every other one. The moment you attach an AuthorizationPolicy with action ALLOW and at least one rule to a workload, anything not explicitly matched is rejected (403, rbac_access_denied in the Envoy log). Your first policy can therefore black-hole traffic you forgot to enumerate. Prefer source principals (SPIFFE identities like cluster.local/ns/checkout/sa/checkout-api) over IP rules — identities are stable across reschedules; IPs are not.

Egress is ALLOW_ANY until you make it REGISTRY_ONLY, set in the shared ConfigMap. By default a compromised pod can reach any internet host and your mesh provides zero egress control or logging. Flipping outboundTrafficPolicy.mode to REGISTRY_ONLY makes Envoy block anything not in the service registry — after which every external dependency must be declared as a ServiceEntry. You set this via the revision-suffixed shared ConfigMap (istio-shared-configmap-asm-X-Y), which the control plane merges over its reconciled default; you never edit the default istio-asm-X-Y ConfigMap directly.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Revision (asm-X-Y) | The installed Istio version identity | Suffix on most objects | Keys injection, ConfigMap, gateways, upgrades |
| Root namespace | Where mesh-wide policy/config lives | aks-istio-system | Mesh-wide STRICT / MeshConfig go here, not istio-system |
| Injection label | What onboards a namespace | istio.io/rev=asm-X-Y (or tag) | istio-injection=enabled is ignored → no sidecar |
| Sidecar (Envoy) | Per-pod proxy doing mTLS/routing/authz | Each mesh pod (2/2) | If absent, pod is outside the mesh entirely |
| PeerAuthentication | Server-side mTLS accept mode | Root ns (mesh) or app ns | PERMISSIVE → STRICT migration; mis-namespace = ignored |
| AuthorizationPolicy | L7 allow/deny by identity/path | App namespace | First ALLOW = default-deny for the workload |
| DestinationRule | Client-side TLS/subset/LB policy | App namespace | ISTIO_MUTUAL originates mTLS; subsets enable splits |
| VirtualService | Routing rules (host → destination) | App namespace | Weighted splits, header routing, gateway binding |
| Gateway | Ingress/egress L7 listener config | App namespace | Binds to a managed gateway by service label |
| ServiceEntry | Declares an external host to the registry | App namespace | Required for any egress under REGISTRY_ONLY |
| Shared ConfigMap | Your mesh config overlay | istio-shared-configmap-asm-X-Y | Egress mode, access logs; name must match revision |
| Revision tag | Stable alias for a revision | Cluster-scoped | Repoint once to move many namespaces at upgrade |

1. The managed add-on vs self-managed Istio: the revision model

The single most important mental model is that the add-on is revision-scoped, and the second is that it is a constrained Istio: you cannot set arbitrary MeshConfig, you cannot use upstream’s namespaces, and some upstream features (egress gateway) are gated by your cluster’s network plugin. Internalise the differences below before you touch anything, because each row is an afternoon someone has already lost.

| Aspect | Self-managed (upstream) Istio | AKS managed add-on | Why it matters |
| --- | --- | --- | --- |
| Control-plane lifecycle | You install/upgrade istiod | Microsoft installs/patches it | Patch upgrades auto-roll in your maintenance window |
| Root namespace | istio-system | aks-istio-system | Mesh-wide policy in the wrong ns is ignored |
| Ingress namespace | wherever you install gateways | aks-istio-ingress (managed) | Gateway pods/services are created and lifecycled for you |
| Egress namespace | wherever you install | aks-istio-egress (managed) | Egress gateway gated by Static Egress Gateway support |
| Injection label | istio-injection=enabled works | Only istio.io/rev=asm-X-Y | Upstream label is a silent no-op |
| MeshConfig | Fully editable | Partitioned allowed/supported/blocked | configSources etc. blocked; edit the shared ConfigMap |
| Shared config object | the istio ConfigMap | istio-shared-configmap-asm-X-Y (merged over default) | Must match the running revision name |
| istioctl target | istio-system by default | --istioNamespace aks-istio-system every call | Otherwise it talks to a control plane that isn’t there |
| Version identity | a chart/Helm version | a revision string asm-X-Y | Everything is keyed by it |
| Supported versions | your choice | at least two revisions; n-2 supported ~6 weeks after newest n | Outside that window = “allowed but unsupported” |

Why almost everything is revision-suffixed

The canary upgrade model demands that two control planes coexist, so every control-plane-scoped object carries the revision to avoid collisions. Knowing which name carries the suffix (and which is the one you edit) removes most of the confusion:

| Object | Name pattern | Namespace | You edit it? |
| --- | --- | --- | --- |
| Control plane deployment | istiod-asm-1-27 | aks-istio-system | No (managed) |
| Reconciled default config | istio-asm-1-27 (ConfigMap) | aks-istio-system | No, never edit directly |
| Shared overlay config | istio-shared-configmap-asm-1-27 | aks-istio-system | Yes, your MeshConfig overlay |
| External ingress gateway | aks-istio-ingressgateway-external (svc) + per-rev pods | aks-istio-ingress | Annotations only |
| Internal ingress gateway | aks-istio-ingressgateway-internal (svc) + per-rev pods | aks-istio-ingress | Annotations only |
| Egress gateway | named on enable, per-rev pods | aks-istio-egress | Annotations only |
| Namespace injection label | istio.io/rev=asm-1-27 (or a tag) | each app namespace | Yes |

Check what is actually available in your region before you do anything else — compatibility is a function of both the AKS version and the region:

az aks mesh get-revisions --location eastus2 -o table

The CLI versions and prerequisites that gate each capability:

| Requirement | Minimum | Gates | Notes |
| --- | --- | --- | --- |
| Azure CLI (mesh enable) | 2.57.0 | az aks mesh enable | aks-preview not required for GA features |
| Azure CLI (egress gateway) | 2.80.0 | az aks mesh enable-egress-gateway | Newer surface than ingress |
| Kubernetes version | >= 1.23 | Enabling the add-on at all | Match to a supported asm-X-Y |
| OSM add-on | must be removed | Coexistence | Istio and OSM cannot both be enabled |
| Network plugin (egress GW) | not Pod Subnet | Static Egress Gateway → egress GW | On Pod Subnet, use REGISTRY_ONLY + Firewall |
| istioctl | matches/near revision | tag, proxy-status, x describe | Always --istioNamespace aks-istio-system |
| Managed Prometheus | enabled on cluster | Metric scraping | Edit ama-metrics-settings-configmap to opt in |

The add-on’s structural limits and supportability rules — the numbers that shape your upgrade calendar and topology:

| Limit / rule | Value | Why it matters | Consequence if ignored |
| --- | --- | --- | --- |
| Supported revisions at once | at least two | Enables canary upgrades | |
| n-2 support window | ~6 weeks after newest n rolls out | Time to finish an upgrade | Falls to “allowed but unsupported” |
| Upgrade jump | n+1 or n+2 | Skip a version if needed | Larger jump = more validation |
| Root namespace | aks-istio-system (fixed) | Mesh-wide policy location | Policy elsewhere is ignored |
| MeshConfig fields | allowed / supported / blocked | Some fields (e.g. configSources) blocked | Apply fails or is dropped |
| Egress gateway on Pod Subnet | unsupported | Topology decision | enable-egress-gateway fails |
| Patch rollout | automatic, in maintenance window | Control plane only | Sidecars stay old until restart |

2. Enabling the add-on and the namespace labeling strategy

Enable on an existing cluster. If you omit --revision, AKS picks a current default — fine for a lab, but in production you pin the revision so it does not drift between environments or across a Terraform apply:

export RESOURCE_GROUP=rg-platform
export CLUSTER=aks-prod-eastus2
export REV=asm-1-27

az aks mesh enable \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER \
  --revision $REV

Two hard prerequisites bear repeating because either one fails the enable outright: Azure CLI 2.57.0+ (2.80.0+ for egress gateways), and removal of the Open Service Mesh add-on first, since the two cannot coexist. The add-on also requires Kubernetes >= 1.23.

Confirm the mesh mode and the control-plane pods, then pull the live revision so the rest of your scripts are not hard-coded:

az aks show -g $RESOURCE_GROUP -n $CLUSTER --query 'serviceMeshProfile.mode' -o tsv
# -> Istio

az aks get-credentials -g $RESOURCE_GROUP -n $CLUSTER
kubectl get pods -n aks-istio-system
# -> istiod-asm-1-27-... Running

ASM_REV=$(az aks show -g $RESOURCE_GROUP -n $CLUSTER \
  --query 'serviceMeshProfile.istio.revisions[0]' -o tsv)
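The revisions[0] lookup is only unambiguous while a single revision is installed; during a canary upgrade two control planes run side by side and the list holds two entries. A small guard can refuse to guess in that case. This is a minimal sketch: the function name pick_single_revision is illustrative (it is not part of the Azure CLI), and it assumes the revisions arrive as separate asm-X-Y words.

```shell
# Print the revision if exactly one asm-X-Y shaped entry is given;
# otherwise write a diagnostic to stderr and return non-zero.
pick_single_revision() {
  # "$@": revisions, e.g. the word-split output of
  #   az aks show -g "$RESOURCE_GROUP" -n "$CLUSTER" \
  #     --query 'serviceMeshProfile.istio.revisions' -o tsv
  if [ "$#" -ne 1 ]; then
    echo "expected exactly one revision, got $#: $*" >&2
    return 1
  fi
  case "$1" in
    # Loose shape check only: asm-<digit...>-<digit...>
    asm-[0-9]*-[0-9]*) printf '%s\n' "$1" ;;
    *) echo "not an asm-X-Y revision: $1" >&2; return 1 ;;
  esac
}
```

When the guard fails, choose the revision explicitly instead of relying on list order.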

The az aks mesh enable flags you will actually reach for:

| Flag | What it sets | Default | When to set it |
| --- | --- | --- | --- |
| --revision | Pin the asm-X-Y to install | latest default | Always in non-lab; prevents env drift |
| --resource-group / --name | Target cluster | | Required |
| --enable-ingress-gateway (or the enable-ingress-gateway subcommand) | Provision a gateway | off | When you need north-south entry |
| --ingress-gateway-type | external / internal | | One invocation per type |
| (egress subcommand) --istio-egressgateway-name | Name the egress gateway | | Only when Static Egress GW is supported |

The namespace labeling strategy

Do not label everything. Onboard namespaces deliberately — injection rewrites the pod spec and forces a restart, and a mesh that watches every namespace wastes istiod and Envoy memory. The label must match the running revision exactly:

# Correct for the add-on:
kubectl label namespace payments istio.io/rev=$ASM_REV --overwrite

# WRONG — silently skipped by the add-on, no sidecar injected:
# kubectl label namespace payments istio-injection=enabled

Labelling alone does nothing to running pods. Injection happens at admission, so restart existing workloads to get a sidecar:

kubectl rollout restart deployment -n payments
kubectl get pods -n payments
# Each pod should now show 2/2 READY (app container + istio-proxy)

The label-and-restart contract is where most “no sidecar” tickets come from. The full truth table of what each combination produces:

| Namespace label | Workload restarted? | Result | Pod READY |
| --- | --- | --- | --- |
| istio.io/rev=asm-1-27 (matches running rev) | Yes | Sidecar injected, in mesh | 2/2 |
| istio.io/rev=asm-1-27 (matches) | No | Old pods still un-injected | 1/1 |
| istio.io/rev=prod (tag → current rev) | Yes | Sidecar injected via tag | 2/2 |
| istio-injection=enabled | Yes | Ignored by add-on, no sidecar | 1/1 |
| istio.io/rev=asm-1-26 (stale, not running) | Yes | No matching control plane → no sidecar | 1/1 |
| No label | | Outside the mesh | 1/1 |
| Pod has sidecar.istio.io/inject: "false" | Yes | Explicitly opted out | 1/1 |
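Reading READY counts by eye does not scale past a handful of pods. The sketch below lists the pods in a namespace that carry no istio-proxy container, which is the practical meaning of every 1/1 row above. The function names are illustrative, and checking initContainers as well is an assumption made to cover clusters where the proxy is injected as a native sidecar.

```shell
# Emit one line per pod: "<pod> <container names...> <init container names...>".
list_pod_containers() {
  # $1: namespace
  kubectl get pods -n "$1" -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].name}{" "}{.spec.initContainers[*].name}{"\n"}{end}'
}

# Read those lines on stdin and print the pods with no istio-proxy container.
pods_missing_sidecar() {
  awk '{
    found = 0
    for (i = 2; i <= NF; i++) if ($i == "istio-proxy") found = 1
    if (!found) print $1
  }'
}

# Usage against a cluster (assumes the kubectl context is already set):
#   list_pod_containers payments | pods_missing_sidecar
```

An empty result after a rollout restart suggests that every pod in the namespace was admitted with a sidecar.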

A practical governance tactic at scale: pair the revision label with discoverySelectors in MeshConfig so istiod only watches mesh-labelled namespaces. On a large cluster this materially reduces istiod and Envoy memory by pruning irrelevant config from every proxy’s push. The injection control surfaces, ranked from broad to surgical:

| Control | Scope | Effect | Use when |
| --- | --- | --- | --- |
| istio.io/rev=<rev> on namespace | Namespace | Inject all pods in ns | Standard onboarding |
| istio.io/rev=<tag> on namespace | Namespace | Inject via stable alias | You want upgrade indirection |
| sidecar.istio.io/inject: "true" on pod | Pod | Force inject one pod | Opt a pod in within an un-labelled ns |
| sidecar.istio.io/inject: "false" on pod | Pod | Skip one pod | Exclude a job/batch pod in a mesh ns |
| discoverySelectors in MeshConfig | Mesh | istiod only watches matching ns | Large clusters; cut proxy memory |

3. Enforcing STRICT PeerAuthentication and scoping with AuthorizationPolicy

Out of the box the mesh runs PERMISSIVE mTLS: sidecars accept both plaintext and mTLS. That is the right default during onboarding (un-injected clients keep working) and the wrong default for production. The migration sequence is everything — you turn on STRICT only after every client of a service is inside the mesh, or you cause an outage.

The three mTLS modes and exactly what each does on the server side:

| PeerAuthentication mode | Server accepts | Use during | Risk if set too early |
| --- | --- | --- | --- |
| PERMISSIVE (default) | Plaintext and mTLS | Onboarding / migration | None, but no enforcement either |
| STRICT | mTLS only | Steady-state production | 503 UC from any un-injected/plaintext client |
| DISABLE | Plaintext only | Debug / explicit opt-out | Drops encryption; rarely correct |

Mesh-wide STRICT goes in the root namespace (aks-istio-system), with no selector:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system   # add-on root namespace, NOT istio-system
spec:
  mtls:
    mode: STRICT

A safer rollout is per-namespace STRICT, so you flip services one blast radius at a time. Pair it with an AuthorizationPolicy to move from “encrypted” to “encrypted and authorized” — mTLS proves identity, but without an authorization policy every authenticated workload can still call every other one:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity, not IP — survives pod reschedules
        principals: ["cluster.local/ns/checkout/sa/checkout-api"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/v1/charges"]

The scoping precedence — where you put a PeerAuthentication decides its blast radius:

| Placement | metadata.namespace | spec.selector | Applies to |
| --- | --- | --- | --- |
| Mesh-wide | aks-istio-system | none | Every workload in the mesh |
| Namespace-wide | app namespace | none | Every workload in that namespace |
| Workload-specific | app namespace | matchLabels | Only matching pods |
| Port-specific | app namespace | matchLabels + portLevelMtls | A single port on matching pods |

A subtle, important rule: an AuthorizationPolicy with an ALLOW action and at least one rule is default-deny for that workload — anything not explicitly matched is rejected. Adding your first ALLOW policy to a namespace can therefore black-hole traffic you forgot to enumerate. The action semantics, in full, because mixing them up is a common self-inflicted outage:

| action | With matching rule | With no policy attached | Evaluation order |
| --- | --- | --- | --- |
| (none attached) | | Allow all (mTLS still required if STRICT) | n/a |
| ALLOW | Allow matched; deny everything else | | After CUSTOM and DENY |
| DENY | Deny matched | | After CUSTOM, before ALLOW; overrides ALLOW |
| CUSTOM | Delegate to ext authz (e.g. OPA) | | Evaluated first, before DENY and ALLOW |
| AUDIT | Log match, do not enforce | | Logging only |

Prefer source principals and namespaces over IP-based rules; identities are stable across reschedules and are exactly what mTLS gives you. The match fields you have to work with, and their stability:

| Rule field | Matches on | Stable across reschedule? | Recommended? |
| --- | --- | --- | --- |
| from.source.principals | SPIFFE identity (SA) | Yes | Yes, first choice |
| from.source.namespaces | Caller namespace | Yes | Yes (coarse-grained) |
| from.source.ipBlocks | Source IP/CIDR | No (pods reschedule) | Avoid inside the mesh |
| to.operation.methods | HTTP method | Yes | Yes |
| to.operation.paths | HTTP path | Yes | Yes |
| when.key (conditions) | request attributes (JWT claims, headers) | Yes | Yes for L7 authz |
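A principal is easy to mistype, and a typo under an ALLOW policy is a silent deny. The sketch below builds the string from a namespace and service account, following the cluster.local/ns/<namespace>/sa/<service-account> shape used in the policy above. The helper names are illustrative, and the default trust domain of cluster.local is an assumption to verify against your mesh.

```shell
# Build the principal string an AuthorizationPolicy rule expects.
spiffe_principal() {
  # $1: namespace, $2: service account, $3: trust domain (optional)
  printf '%s/ns/%s/sa/%s\n' "${3:-cluster.local}" "$1" "$2"
}

# Look up the service account a running pod actually uses, so the principal
# is derived from the cluster rather than typed from memory.
pod_service_account() {
  # $1: namespace, $2: pod name
  kubectl get pod -n "$1" "$2" -o jsonpath='{.spec.serviceAccountName}'
}

# Usage:
#   spiffe_principal checkout "$(pod_service_account checkout <pod-name>)"
```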

4. Provisioning managed ingress gateways

The add-on provisions and lifecycles the gateways for you. Enable an external (internet-facing) and an internal (VNet-only) gateway:

az aks mesh enable-ingress-gateway \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --ingress-gateway-type external

az aks mesh enable-ingress-gateway \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --ingress-gateway-type internal

This creates two LoadBalancer services in aks-istio-ingress: aks-istio-ingressgateway-external (public IP) and aks-istio-ingressgateway-internal (an internal Standard LB IP, reachable only from the VNet). The label you bind your Gateway to is on those services, e.g. istio: aks-istio-ingressgateway-internal.

kubectl get svc -n aks-istio-ingress

The two managed gateway types side by side:

| Property | External gateway | Internal gateway |
| --- | --- | --- |
| Service name | aks-istio-ingressgateway-external | aks-istio-ingressgateway-internal |
| Azure resource | Standard LB with public IP | Standard internal LB |
| Reachable from | Internet | VNet (and peered/VPN/ER) only |
| Selector label | istio: aks-istio-ingressgateway-external | istio: aks-istio-ingressgateway-internal |
| Typical use | Public APIs, web | Internal services, private apps |
| Pair with | WAF / Front Door upstream | Application Gateway / private clients |

Customize the underlying Azure LB via annotations on the service. Two that almost every enterprise needs — pin the internal gateway to a dedicated subnet, and restrict the external gateway’s source ranges:

# Internal gateway -> specific subnet (must be in the mesh's VNet)
kubectl annotate svc aks-istio-ingressgateway-internal -n aks-istio-ingress \
  service.beta.kubernetes.io/azure-load-balancer-internal-subnet=snet-ingress --overwrite

# External gateway -> allow only known source CIDRs (e.g. your WAF / Front Door egress)
kubectl annotate svc aks-istio-ingressgateway-external -n aks-istio-ingress \
  service.beta.kubernetes.io/azure-allowed-ip-ranges="203.0.113.0/24,198.51.100.0/24" --overwrite

The Azure LB annotations you will use most on these gateway services, and what each buys:

| Annotation (service.beta.kubernetes.io/ prefix omitted) | Effect | Default | Gateway it fits |
| --- | --- | --- | --- |
| azure-load-balancer-internal: "true" | Make the LB internal | n/a (internal svc already is) | Internal |
| azure-load-balancer-internal-subnet | Pin internal LB to a subnet | LB picks a subnet | Internal |
| azure-allowed-ip-ranges | Restrict source CIDRs | open (external) | External |
| azure-load-balancer-resource-group | Place the public IP’s RG | node RG | External |
| azure-pip-name | Use a named static public IP | dynamic IP | External |
| azure-load-balancer-health-probe-request-path | Custom LB probe path | TCP probe | Either |

Bind a Gateway + VirtualService to the internal gateway. Note the selector points at the service label, and the Gateway object lives in the application namespace, not in aks-istio-ingress:

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: storefront-internal
  namespace: payments
spec:
  selector:
    istio: aks-istio-ingressgateway-internal
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "shop.internal.contoso.com"
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: storefront
  namespace: payments
spec:
  hosts:
  - "shop.internal.contoso.com"
  gateways:
  - storefront-internal
  http:
  - route:
    - destination:
        host: productpage
        port:
          number: 9080
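Once the Gateway and VirtualService are applied, a quick probe from inside the VNet confirms the binding. The sketch below resolves the internal gateway's IP and prints a curl command that sends the Host header the Gateway declares; a request without a matching Host never reaches the route. The helper names are illustrative, and the command is printed rather than executed so it can be reviewed first.

```shell
# Resolve the IP assigned to the managed internal gateway service.
internal_gateway_ip() {
  kubectl get svc aks-istio-ingressgateway-internal -n aks-istio-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Print a curl command that requests a path on a gateway IP with an explicit
# Host header and reports only the HTTP status code.
gateway_probe_cmd() {
  # $1: gateway IP, $2: host from the Gateway/VirtualService, $3: path (default /)
  printf 'curl -sS -o /dev/null -w %%{http_code} -H "Host: %s" http://%s%s\n' \
    "$2" "$1" "${3:-/}"
}

# Usage (from a machine that can reach the internal LB):
#   eval "$(gateway_probe_cmd "$(internal_gateway_ip)" shop.internal.contoso.com)"
```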

If you need the real client IP at the gateway (for WAF logging or rate limiting), set externalTrafficPolicy: Local on the external service. It preserves source IP and removes a hop, at the cost of less even traffic spreading across nodes:

kubectl patch svc aks-istio-ingressgateway-external -n aks-istio-ingress \
  --type merge -p '{"spec":{"externalTrafficPolicy":"Local"}}'

The externalTrafficPolicy trade-off, which trips up source-IP-dependent setups:

| externalTrafficPolicy | Source IP preserved? | Extra hop? | Load spread | Use when |
| --- | --- | --- | --- | --- |
| Cluster (default) | No (SNAT’d) | Yes (node→node) | Even across nodes | You do not need client IP |
| Local | Yes | No | Uneven (only nodes with pods) | WAF/rate-limit needs real client IP |

5. Routing: VirtualService, DestinationRule, and subset traffic splitting

Canary application releases (distinct from mesh-revision upgrades) are driven by a DestinationRule that declares subsets and a VirtualService that weights them. Define the subsets against pod labels, then split:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
  namespace: payments
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # use mesh-issued mTLS to the upstream
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
  namespace: payments
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10

The division of labour between the two routing objects, which people constantly conflate:

| Object | Governs | Key fields | Without it… |
| --- | --- | --- | --- |
| VirtualService | Where traffic goes | hosts, http.route.weight, match, gateways | Default load balancing across all endpoints of the host |
| DestinationRule | How it gets there | subsets, trafficPolicy.tls, loadBalancer, outlier detection | Subsets undefined → VirtualService subset refs fail |
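A canary usually moves through several weights, and hand-editing two numbers that must stay complementary invites mistakes. The sketch below renders the reviews VirtualService from the example above for a given v2 percentage. The function name is illustrative, and it assumes the v1/v2 subsets from the DestinationRule already exist.

```shell
# Print a reviews VirtualService that sends $1 percent of traffic to subset v2
# and the remainder to v1. Rejects anything that is not an integer in 0..100.
render_reviews_split() {
  case "$1" in
    ''|*[!0-9]*|0[0-9]*) echo "weight must be an integer 0-100" >&2; return 1 ;;
  esac
  if [ "$1" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 1
  fi
  cat <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
  namespace: payments
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: $((100 - $1))
    - destination:
        host: reviews
        subset: v2
      weight: $1
EOF
}

# Usage: render_reviews_split 25 | kubectl apply -f -
```

Keeping the weight as the only input means each promotion step is a one-argument change that can be reviewed as a diff.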

The DestinationRule client-side TLS modes — the counterpart to server-side PeerAuthentication:

| trafficPolicy.tls.mode | Client sidecar sends | Pair with server STRICT? | Typical use |
| --- | --- | --- | --- |
| ISTIO_MUTUAL | Mesh-issued mTLS | Yes | In-mesh service-to-service |
| SIMPLE | One-way TLS (you supply certs) | n/a | TLS origination to an external TLS endpoint |
| MUTUAL | mTLS with your own certs | n/a | External mTLS to a partner |
| DISABLE | Plaintext | No, causes 503 UC under STRICT | Debug only; remove before STRICT |
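Before flipping any namespace to STRICT it helps to know whether a DestinationRule anywhere in the cluster still pins a host to DISABLE. The sketch below lists every DestinationRule with its top-level TLS mode and filters for DISABLE. The function names are illustrative, and it inspects only spec.trafficPolicy.tls.mode, so subset-level or port-level overrides would need a separate check.

```shell
# Emit one line per DestinationRule: "<namespace> <name> <host> <tls mode>".
# The mode column is empty when no top-level trafficPolicy.tls.mode is set.
list_destinationrule_tls() {
  kubectl get destinationrules.networking.istio.io -A -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.spec.host}{" "}{.spec.trafficPolicy.tls.mode}{"\n"}{end}'
}

# Read those lines on stdin and keep only rules whose mode is DISABLE.
filter_tls_disable() {
  awk '$4 == "DISABLE" { print $1 "/" $2 " -> " $3 }'
}

# Usage: list_destinationrule_tls | filter_tls_disable
```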

Setting trafficPolicy.tls.mode: ISTIO_MUTUAL is what tells the client sidecar to originate mTLS. STRICT PeerAuthentication governs the server side (what it accepts); the DestinationRule governs the client side (what it sends). When you flip a namespace to STRICT, make sure no DestinationRule is overriding the client side back to DISABLE for that host: the server then resets the plaintext connection, and the client sidecar reports the classic 503 UC (upstream connection termination) you will spend an afternoon chasing. The routing capabilities a VirtualService unlocks beyond a flat weight split:

Capability | Field | Example use
Weighted split | http[].route[].weight | 90/10 canary
Header/path match | http[].match | Route x-canary: true to v2
Fault injection | http[].fault | Test 5xx/latency handling
Timeout | http[].timeout | Cap slow upstreams
Retries | http[].retries | Retry on 5xx/reset
Mirroring | http[].mirror | Shadow traffic to v2, ignore response
Redirect/rewrite | http[].redirect / rewrite | Path/host rewrites
CORS policy | http[].corsPolicy | Browser cross-origin rules at the mesh
Header manipulation | http[].headers | Add/remove request/response headers
Direct response | http[].directResponse | Return a fixed body without an upstream
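To make a few of those fields concrete, here is a minimal sketch that combines a header match, a timeout and retries with the 90/10 split from earlier. The helper name render_reviews_canary_vs, the x-canary header and the specific timeout and retry values are illustrative assumptions, not settings the add-on requires. The function only prints the manifest; nothing is applied until you pipe it to kubectl.

```shell
# Hypothetical helper: prints a VirtualService manifest to stdout.
# Host, subsets, header name and timings are illustrative values.
render_reviews_canary_vs() {
  cat <<'EOF'
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
  namespace: payments
spec:
  hosts:
  - reviews
  http:
  # Rule 1: requests carrying "x-canary: true" go straight to v2.
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: reviews
        subset: v2
  # Rule 2: everything else keeps the 90/10 split, with a timeout and retries.
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10
    timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 500ms
      retryOn: 5xx,reset
EOF
}
```

A possible usage is `render_reviews_canary_vs | kubectl apply -f -`. Rules are evaluated in order and the first match wins, which is why the header rule sits above the weighted catch-all.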

And the DestinationRule trafficPolicy knobs beyond TLS mode — the resilience controls people forget the mesh gives them for free:

trafficPolicy knob | Field | What it does | Typical setting
Load balancing | loadBalancer.simple | ROUND_ROBIN / LEAST_REQUEST / RANDOM | LEAST_REQUEST for uneven latencies
Connection pool (TCP) | connectionPool.tcp | Max connections, connect timeout | Cap to protect upstreams
Connection pool (HTTP) | connectionPool.http | Max requests/conn, pending | Tune for chatty clients
Outlier detection | outlierDetection | Eject failing endpoints | consecutive5xxErrors: 5
Locality LB | loadBalancer.localityLbSetting | Prefer same-zone endpoints | Cut cross-zone egress cost
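As a sketch of how several of these knobs sit together in one object, the helper below prints a DestinationRule for a hypothetical ledger service. The function name, the ledger host and namespace, and every numeric value are assumptions chosen for illustration; tune them against your own traffic before relying on them.

```shell
# Hypothetical helper: prints a DestinationRule with resilience settings.
# The "ledger" host/namespace and all thresholds are illustrative.
render_ledger_resilience_dr() {
  cat <<'EOF'
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: ledger
  namespace: ledger
spec:
  host: ledger
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL          # keep mesh mTLS on the client side
    loadBalancer:
      simple: LEAST_REQUEST       # favour the least-loaded endpoint
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 2s
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5     # eject after five 5xx responses in a row
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50      # never eject more than half the endpoints
EOF
}
```

A possible usage is `render_ledger_resilience_dr | kubectl apply -f -`. Keeping tls.mode in the same object avoids a second DestinationRule for the host silently changing the client-side TLS behaviour.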

6. Locking down egress with ServiceEntry and REGISTRY_ONLY

By default outboundTrafficPolicy.mode is ALLOW_ANY: a compromised pod can call any host on the internet, and your mesh provides zero egress control. Flip it to REGISTRY_ONLY so Envoy blocks anything not explicitly in the service registry. You set this via the shared ConfigMap, whose name is revision-specific and which the control plane merges over its reconciled default (you never edit the default istio-asm-X-Y ConfigMap directly):

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-shared-configmap-asm-1-27   # must match your revision
  namespace: aks-istio-system
data:
  mesh: |-
    accessLogFile: /dev/stdout
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
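Before testing traffic, it can help to confirm which mode the overlay actually declares. The following is a minimal sketch: egress_mode is a hypothetical helper that reads a MeshConfig YAML document on stdin and prints the outboundTrafficPolicy mode, falling back to ALLOW_ANY (the default described above) when the key is absent. It is a simple line-oriented parser, not a full YAML reader.

```shell
# Hypothetical helper: print outboundTrafficPolicy.mode from MeshConfig YAML
# on stdin, or ALLOW_ANY when the block is missing.
egress_mode() {
  awk '
    /^[[:space:]]*outboundTrafficPolicy:/ { in_block = 1; next }
    in_block && /^[[:space:]]*mode:/      { print $2; found = 1; exit }
    in_block && /^[^[:space:]]/           { in_block = 0 }   # a new top-level key ends the block
    END { if (!found) print "ALLOW_ANY" }
  '
}
```

A possible usage, assuming the revision from the example above: `kubectl get cm istio-shared-configmap-asm-1-27 -n aks-istio-system -o jsonpath='{.data.mesh}' | egress_mode`.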

The two egress modes and their security posture:

outboundTrafficPolicy.mode | Behaviour | Posture | Cost of running it
ALLOW_ANY (default) | Any external host reachable | Insecure: no control, no log | Zero config; zero protection
REGISTRY_ONLY | Only ServiceEntry-declared hosts | Auditable allowlist in Git | Every dependency needs a ServiceEntry

With that applied, every external dependency must be declared as a ServiceEntry. This turns egress into an auditable allowlist living in Git:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: contoso-payments-api
  namespace: payments
spec:
  hosts:
  - api.payments-partner.com
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

The ServiceEntry fields that decide how the host is resolved and treated:

Field | Values | Meaning | Gotcha
hosts | FQDN(s) | The external name(s) to allow | Wildcards (*.partner.com) supported but broad
location | MESH_EXTERNAL / MESH_INTERNAL | Outside vs inside the mesh | External = no mTLS expected by default
resolution | DNS / STATIC / NONE | How Envoy resolves endpoints | NONE for passthrough by SNI
ports.protocol | TLS / HTTPS / HTTP / TCP / GRPC | L7 treatment | TLS passthrough preserves end-to-end encryption
endpoints | IPs/hosts | Static endpoints when resolution: STATIC | Needed when no DNS
exportTo | namespaces / . / * | Visibility scope | . keeps it namespace-local
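Because wildcard hosts quietly widen the allowlist, a periodic audit is worth scripting. The sketch below assumes input lines of the form namespace, a tab, then space-separated hosts; wildcard_egress_hosts is a hypothetical helper name, and it only filters text, so it can be tried without a cluster.

```shell
# Hypothetical helper: reads "<namespace>\t<host> [<host> ...]" lines on stdin
# and prints "<namespace>/<host>" for every host containing a wildcard.
wildcard_egress_hosts() {
  awk -F'\t' '{
    n = split($2, hosts, " ")
    for (i = 1; i <= n; i++)
      if (index(hosts[i], "*") > 0) print $1 "/" hosts[i]
  }'
}
```

One way to feed it from a live cluster: `kubectl get serviceentries -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.spec.hosts[*]}{"\n"}{end}' | wildcard_egress_hosts`. An empty result means every declared host is an exact name.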

For defense-in-depth — a predictable source IP that a partner or Azure Firewall can allowlist — route that traffic through a managed Istio egress gateway, which builds on the AKS Static Egress Gateway feature. Provision it against a StaticGatewayConfiguration that owns a fixed egress IP prefix:

az aks mesh enable-egress-gateway \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --istio-egressgateway-name egress-partners \
  --istio-egressgateway-namespace aks-istio-egress \
  --gateway-configuration-name sgc-partners

Caveat worth knowing before you design around it: the Istio egress gateway requires Static Egress Gateway, which is not supported on Azure CNI Pod Subnet clusters — so the egress gateway isn’t either. On those clusters, enforce egress with REGISTRY_ONLY + ServiceEntry + Azure Firewall instead, and skip the gateway.
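A quick pre-flight can tell you which side of that caveat a cluster is on. This sketch assumes that `az aks show` reports a podSubnetId property on each agent pool and that it holds a subnet resource ID only when Pod Subnet is in use; treat both the property name and the helper name uses_pod_subnet as assumptions to verify against your own CLI output.

```shell
# Hypothetical helper: succeeds when any stdin line looks like a subnet
# resource ID (contains "/subnets/"); null or "None" lines do not match.
uses_pod_subnet() {
  grep -qi '/subnets/'
}
```

A possible usage: `az aks show -g $RESOURCE_GROUP -n $CLUSTER --query "agentPoolProfiles[].podSubnetId" -o tsv | uses_pod_subnet && echo "Pod Subnet cluster: plan for Azure Firewall, not the egress gateway"`.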

The three egress-enforcement strategies, and when each is the right tool:

Strategy | Gives you | Fixed source IP? | Works on Pod Subnet? | Best for
REGISTRY_ONLY + ServiceEntry | L7 allowlist in Git, identity-aware | No | Yes | Baseline egress control
+ Managed egress gateway | Above + a static IP prefix | Yes (Static Egress GW) | No | Partner allowlist-by-IP, non-Pod-Subnet
+ Azure Firewall (UDR) | Above + packet capture, central policy | Yes (firewall public IP) | Yes | PCI/audit on Pod Subnet clusters

7. Canary revision upgrades: tag, shift, roll back

This is where the managed add-on earns its keep. A minor revision upgrade runs the new istiod alongside the old one; you migrate workloads at your own pace and can roll back at any point before completing. You can move n+1 or skip to n+2, provided both are supported and AKS-compatible.

Minor versus patch upgrades behave completely differently — confusing them leaves your data plane stale:

Upgrade type | Example | Who triggers it | Data-plane effect | Rollback
Minor (revision) | asm-1-27 → asm-1-28 | You (az aks mesh upgrade start) | New istiod alongside; you migrate per namespace | complete or rollback while canary
Patch | 1.27.2 → 1.27.3 | AKS, in your maintenance window | Control plane only; sidecars unchanged until you restart | n/a (auto)

First, see your valid targets (if a newer revision is missing here, your AKS version is too old and must be upgraded first):

az aks mesh get-upgrades --resource-group $RESOURCE_GROUP --name $CLUSTER

If you set any custom MeshConfig, copy your shared ConfigMap to the new revision’s name first (e.g. istio-shared-configmap-asm-1-28) — it has to exist the moment the new control plane comes up. Then start the canary:

az aks mesh upgrade start \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --revision asm-1-28

Now both control planes are running. Rather than relabel every namespace (tedious and error-prone), use revision tags as a stable indirection. Point a tag at the old revision, label namespaces with the tag, and later you just repoint the tag:

# istioctl must target the add-on namespace
istioctl tag set prod --revision asm-1-27 --istioNamespace aks-istio-system
kubectl label namespace payments istio.io/rev=prod --overwrite

# When ready to shift, repoint the tag — all 'prod'-tagged namespaces move at once
istioctl tag set prod --revision asm-1-28 --istioNamespace aks-istio-system --overwrite

# Relabeling/repointing does nothing until you restart workloads:
kubectl rollout restart deployment -n payments

Verify both control planes — and, if ingress is enabled, the per-revision gateway pods sitting behind one shared, immutable service IP:

kubectl get pods -n aks-istio-system   # istiod-asm-1-27-* AND istiod-asm-1-28-*
kubectl get pods -n aks-istio-ingress  # gateway pods for both revisions; same LB IP
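Since repointing a tag changes nothing until pods restart, it is useful to list which pods are still on the old revision. The sketch below assumes the sidecar injector records the injecting revision in the pod annotation istio.io/rev; that annotation and the helper name pods_not_on_revision are assumptions to confirm on your cluster.

```shell
# Hypothetical helper: reads "<pod>\t<revision>" lines on stdin and prints
# pods whose recorded revision differs from $1 (including pods with none).
pods_not_on_revision() {
  awk -F'\t' -v want="$1" '$1 != "" && $2 != want { print $1 }'
}
```

A possible usage after the restart: `kubectl get pods -n payments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.istio\.io/rev}{"\n"}{end}' | pods_not_on_revision asm-1-28`. No output suggests the whole namespace has moved.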

Check your dashboards, then commit or revert. Completing removes the old control plane; rollback (after repointing the tag and restarting workloads back) removes the canary:

# Healthy -> finalize
az aks mesh upgrade complete --resource-group $RESOURCE_GROUP --name $CLUSTER

# Regression -> repoint tag to old rev, restart workloads, then:
az aks mesh upgrade rollback --resource-group $RESOURCE_GROUP --name $CLUSTER

The full canary upgrade runbook as an ordered table — the sequence is the lesson:

# | Step | Command / action | Gate before proceeding
1 | Check targets | az aks mesh get-upgrades | A valid n+1/n+2 exists (else upgrade AKS)
2 | Copy shared ConfigMap | create istio-shared-configmap-asm-1-28 | Exists before start
3 | Start canary | az aks mesh upgrade start --revision asm-1-28 | Both istiod-* pods Running
4 | Repoint tag | istioctl tag set prod --revision asm-1-28 --overwrite | Tag now → new rev
5 | Restart workloads (canary subset) | kubectl rollout restart deployment -n <ns> | Pods 2/2, proxy on new rev
6 | Verify | istioctl proxy-status; dashboards | All SYNCED, golden signals healthy
7a | Commit | az aks mesh upgrade complete | Old control plane removed
7b | Roll back | repoint tag to old rev, restart, az aks mesh upgrade rollback | Canary removed

Patch versions (e.g. 1.27.2 → 1.27.3) are different: AKS rolls them out automatically for istiod and gateways inside your planned maintenance window. Your sidecars do not update until you restart the workloads — patching the control plane alone leaves data-plane proxies on the old build.

8. Telemetry: metrics, access logs, and Managed Prometheus

Istio exposes rich Envoy metrics on each pod’s merged telemetry endpoint, port 15020 (/stats/prometheus). Azure Managed Prometheus does not scrape pod-annotation targets by default — you opt in by editing the ama-metrics-settings-configmap to enable pod-annotation-based scraping, then annotating mesh pods:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-settings-configmap
  namespace: kube-system
data:
  pod-annotation-based-scraping: |-
    podannotationnamespaceregex = "payments|checkout|aks-istio-ingress"

Annotate the mesh workloads so the agent knows where to scrape. The sidecar's agent merges Envoy's metrics and the app's onto 15020, so a single scrape target covers both:

# pod template annotations on your Deployments
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"
    prometheus.io/path: "/stats/prometheus"
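If you would rather patch existing Deployments than edit their manifests, a merge patch carries the same three annotations. This is a sketch: scrape_annotations_patch is a hypothetical helper that only prints the patch document.

```shell
# Hypothetical helper: prints a JSON merge patch that adds the scrape
# annotations to a Deployment's pod template.
scrape_annotations_patch() {
  cat <<'EOF'
{"spec":{"template":{"metadata":{"annotations":{
  "prometheus.io/scrape":"true",
  "prometheus.io/port":"15020",
  "prometheus.io/path":"/stats/prometheus"}}}}}
EOF
}
```

A possible usage: `kubectl patch deployment <name> -n payments --type merge -p "$(scrape_annotations_patch)"`. Changing the pod template triggers a rolling restart, so schedule it like any other rollout.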

The Istio/Envoy ports you must know — half of mesh debugging is knowing which port does what:

Port | Purpose | Direction | Notes
15006 | Inbound capture (to app via Envoy) | Inbound | Where STRICT mTLS is enforced
15001 | Outbound capture (from app) | Outbound | Egress decisions happen here
15021 | Health / readiness (/healthz/ready) | Inbound | Kubelet probes the sidecar here
15020 | Merged telemetry (/stats/prometheus) | Scrape | App + Envoy metrics in one target
15012 | xDS to istiod (mTLS) | To control plane | Config push channel
15000 | Envoy admin (config_dump, /clusters) | Local | istioctl pc reads this

For access logs, the accessLogFile: /dev/stdout line in the shared ConfigMap (from section 6) emits structured per-request logs to the istio-proxy container, where Container Insights picks them up. Be deliberate: mesh-wide access logging measurably increases Envoy CPU and log volume. Scope it with the Telemetry API to the namespaces that need it rather than blasting it across the fleet.

The golden-signal Istio series you actually alert on:

Metric series | What it measures | Read it for | Key labels
istio_requests_total | Request count by response code | Success rate, error spikes | response_code, source_workload, destination_service
istio_request_duration_milliseconds | Latency histogram | p50/p95/p99 | destination_service, le
istio_request_bytes / istio_response_bytes | Payload sizes | Throughput, anomalies | direction, workload
istio_tcp_connections_opened_total | TCP connections | Non-HTTP traffic health | source/destination
envoy_cluster_upstream_cx_connect_fail | Upstream connect failures | 503 UC root cause | cluster_name
pilot_proxy_convergence_time | Time for a push to converge | Control-plane health | quantile

Once metrics land in your Managed Prometheus workspace, you can compute a request success rate with a query along these lines (KQL shown; the table and column names depend on where your metrics are exported):

Metrics
| where Name == "istio_requests_total"
| extend code = tostring(parse_json(Tags)["response_code"])
| summarize total = sum(Val), errors = sumif(Val, toint(code) >= 500) by bin(TimeGenerated, 5m)
| extend success_rate = todouble(total - errors) / total
| project TimeGenerated, success_rate

The verification command set you run after each major step — these catch the failure modes that produce confusing 503s and silent plaintext:

# | Command | Confirms | Bad result looks like
1 | kubectl get pods -n payments | Sidecars injected | Any 1/1 in a mesh namespace
2 | istioctl x describe pod <pod> -n payments --istioNamespace aks-istio-system | mTLS is genuinely STRICT | Effective mTLS mode is not STRICT
3 | istioctl proxy-status --istioNamespace aks-istio-system | Config pushed everywhere | Any STALE for a config type
4 | kubectl exec ... -c <app-container> -- curl https://example.com | Egress locked under REGISTRY_ONLY | 200 (should be 000 for HTTPS, 502 for plain HTTP)
5 | kubectl get svc aks-istio-ingressgateway-external -n aks-istio-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}' | Ingress LB has an IP | Empty / <pending>

istioctl proxy-status is the highest-signal command in the set: if a proxy shows STALE for any config type, that workload is running stale routing or policy, which is the usual root cause of “I applied the VirtualService but nothing changed.”
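On a large fleet the proxy-status table is long, so a small filter that surfaces only the lagging proxies saves scrolling. The sketch below treats the output as whitespace-separated columns with the proxy name first and a header on line one; that layout and the helper name stale_proxies are assumptions, since column sets vary between istioctl releases.

```shell
# Hypothetical helper: reads `istioctl proxy-status` text on stdin and prints
# the first column (proxy name) of every non-header row that mentions STALE.
stale_proxies() {
  awk 'NR > 1 && /STALE/ { print $1 }'
}
```

A possible usage: `istioctl proxy-status --istioNamespace aks-istio-system | stale_proxies`. Empty output means no proxy reported STALE at the moment of the check.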

Architecture at a glance

Read the diagram left to right as a single request’s journey, with the control and egress paths branching off it. A client opens HTTPS to the managed ingress layer in aks-istio-ingress — either the external gateway (Standard LB, public IP) or the internal gateway (internal Standard LB pinned to snet-ingress, reachable only from the VNet). The gateway matches a Gateway/VirtualService and routes into the mesh data plane: your application pod runs 2/2 (app container + Envoy sidecar), inbound traffic is captured on port 15006 where STRICT PeerAuthentication demands mTLS, and an AuthorizationPolicy evaluates the caller’s SPIFFE identity before the request reaches your code. Off to the side, the control plane in aks-istio-system — the revision-suffixed istiod-asm-1-27 plus the istio-shared-configmap you overlay — pushes xDS config to every sidecar over port 15012. When the app calls out, the egress path enforces REGISTRY_ONLY: traffic is allowed only if a ServiceEntry declares the host, and optionally leaves through a managed egress gateway with a static IP prefix (or Azure Firewall on Pod Subnet clusters).

The five numbered badges mark exactly where the managed add-on’s specifics bite. (1) at the app pod is the no-sidecar trap (istio-injection=enabled ignored, or stale revision) — confirm with kubectl get pods showing 1/1. (2) at STRICT mTLS is the migration outage (503 UC when a client is still plaintext or a DestinationRule forces DISABLE). (3) at the AuthorizationPolicy is the default-deny black-hole (403/rbac_access_denied for an un-enumerated caller). (4) at the control plane is the wrong-namespace / STALE config problem (policy in istio-system not aks-istio-system). (5) at egress is the blocked external call (502 with no ServiceEntry, or a shared ConfigMap whose name doesn’t match the revision). The legend narrates each as symptom · confirm · fix — the whole diagnostic method on one canvas.

Managed Istio add-on architecture on AKS, traced left to right: a client opens HTTPS to the managed ingress layer in aks-istio-ingress (external gateway on a Standard public-IP load balancer and internal gateway on an internal Standard LB pinned to snet-ingress on port 80); the gateway routes via Gateway and VirtualService into the mesh data plane where the application pod runs 2/2 with an Envoy sidecar capturing inbound traffic on port 15006, STRICT PeerAuthentication enforces mTLS, and an AuthorizationPolicy evaluates SPIFFE identities for default-deny authorization; the control plane in aks-istio-system runs the revision-suffixed istiod-asm-1-27 and the istio-shared-configmap overlay, pushing xDS config to sidecars over port 15012; and the egress path enforces REGISTRY_ONLY, allowing outbound only to hosts declared by a ServiceEntry and optionally leaving through a managed egress gateway with a static IP prefix. Five numbered badges mark the no-sidecar trap at the app pod, the 503 upstream-connect-failure when STRICT meets a plaintext client, the 403 default-deny black-hole at the AuthorizationPolicy, the STALE or wrong-namespace config problem at the control plane, and the 502 blocked external call at egress, with a legend narrating each as symptom, confirm, and fix

Real-world scenario

Vantage Pay runs its card-processing platform on a regional AKS cluster in Central India, built on Azure CNI Pod Subnet for routable pod IPs (their fraud-scoring service peers directly with an on-prem system over ExpressRoute and needed real pod addresses). The platform team is five engineers; the cluster carries roughly 90 microservices across payments, checkout, ledger and fraud namespaces, and the monthly AKS + mesh spend is about ₹2.1 lakh. Their PCI assessor handed them two non-negotiable requirements from the mesh: strict mTLS between every in-scope service, and a single, fixed source IP for outbound calls to a card-processor partner who allowlists callers by IP.

The team reached for the obvious design — a managed Istio egress gateway over Static Egress Gateway for the predictable IP — and it failed at az aks mesh enable-egress-gateway with an unsupported-configuration error. Static Egress Gateway is not supported on Pod Subnet clusters, so the Istio egress gateway isn’t available there either. The first instinct on the bridge was to re-platform the cluster off Pod Subnet onto Azure CNI Overlay, but that meant re-IP-ing every service and re-validating the ExpressRoute peering — a multi-quarter migration the fraud team would not sign off on.

The breakthrough was realising the two requirements were separable across layers that were available. The mTLS requirement is pure mesh: a mesh-wide STRICT PeerAuthentication in aks-istio-system (rolled out per-namespace first, after confirming every caller was injected and showing 2/2) satisfied the encryption mandate. They added default-deny AuthorizationPolicy objects keyed on SPIFFE principals so “encrypted” became “encrypted and authorized” — the assessor specifically wanted to see that the ledger service could only be written by checkout and payments, not by anything that happened to be in the mesh.

For the fixed egress IP, they pushed the requirement down a layer. They set REGISTRY_ONLY in the shared ConfigMap (istio-shared-configmap-asm-1-27), declared the partner host as a ServiceEntry, and forced that traffic out through Azure Firewall with a fixed public IP via UDR. Istio enforced the L7 allowlist and identity; the firewall provided the stable source IP and the packet capture the auditors wanted.

The rollout had one scary moment: the night they flipped payments to STRICT, a batch reconciliation CronJob (which nobody had injected because it lived in a sub-namespace and used istio-injection=enabled) started failing with 503 UC against the ledger API. Ten minutes of istioctl x describe pod and kubectl get pods (the job pod was 1/1) found it; the fix was relabelling with istio.io/rev and adding sidecar.istio.io/inject: "true" to the job template.

The lesson the team wrote into their platform runbook: validate add-on feature support against your cluster’s network plugin before designing around it, and separate the mesh’s job (identity + encryption) from the network’s job (fixed source IP). Pushing the fixed-IP requirement to Azure Firewall was both compliant and far cheaper than re-platforming, and it shipped in three weeks instead of three quarters.

The incident-and-rollout timeline, because the order of moves is the lesson:

Phase | Action | Result | What it should have been
Design | Plan managed egress gateway for fixed IP | Fails: unsupported on Pod Subnet | Check plugin support first
Reaction | Propose re-platform off Pod Subnet | Multi-quarter; fraud team blocks | Separate the two requirements
mTLS | Per-namespace STRICT + SPIFFE authz | Encryption + authz satisfied | Correct approach
Egress | REGISTRY_ONLY + ServiceEntry + Azure Firewall UDR | Fixed IP + packet capture, compliant | The actual fix
Cutover | Flip payments to STRICT | CronJob 503 UC (un-injected 1/1 pod) | Audit injection before flipping STRICT
Resolve | Relabel istio.io/rev + inject: "true" on job | Traffic restored in ~10 min | Pre-flight every caller

Advantages and disadvantages

The managed add-on trades control for operational relief, and it constrains you in exchange for taking the riskiest lifecycle work off your plate. Weigh it honestly:

Advantages (why the managed add-on helps) | Disadvantages (why it constrains you)
Microsoft owns istiod lifecycle, patching and CRD hygiene, the work that sinks self-managed mesh teams | You cannot set arbitrary MeshConfig; configSources and other fields are blocked
Canary revision upgrades (two control planes side by side) are wired up and supported | Only two revisions supported at a time; n-2 drops out ~6 weeks after newest n
Ingress/egress gateways are provisioned and lifecycled, including per-revision pods behind one stable IP | Egress gateway needs Static Egress Gateway, which is unsupported on Pod Subnet clusters
Patch versions auto-roll in your maintenance window, with no manual istiod patching | The add-on’s namespaces/labels diverge from upstream, breaking generic tutorials
Telemetry integrates with Managed Prometheus / Container Insights out of the box | You still configure all of security/routing/egress; none of it is safe by default
SPIFFE identity, STRICT mTLS and L7 authz are full upstream Istio capabilities | The data plane still costs ~50–150 MB and measurable CPU per sidecar
Revision tags make fleet-wide upgrades a single repoint | Relabel/repoint does nothing until you restart workloads, a constant footgun

The model is right for teams who want a production mesh on AKS without owning Istio’s control-plane lifecycle, and who can live within the add-on’s guardrails. It bites hardest on teams on Pod Subnet who need the egress gateway, teams that need deep MeshConfig customisation the add-on blocks, and anyone who treats the add-on like upstream Istio and copies the wrong namespace/label/ConfigMap. If you need full control of every mesh knob, self-managed Istio (or Istio ambient mode) is the alternative — at the cost of owning every upgrade.

Hands-on lab

Stand up the add-on on a small cluster, onboard a namespace, prove the sidecar is injected, enforce STRICT, and prove egress is locked — then tear it down. Costs are modest (a 2-node Standard_B2s cluster for an hour); delete at the end. Run in Cloud Shell (Bash).

Step 1 — Variables and a small cluster.

RG=rg-mesh-lab
LOC=eastus2
CLUSTER=aks-mesh-lab
az group create -n $RG -l $LOC -o table
az aks create -g $RG -n $CLUSTER --node-count 2 --node-vm-size Standard_B2s \
  --network-plugin azure --generate-ssh-keys -o table
az aks get-credentials -g $RG -n $CLUSTER

Step 2 — Check available revisions, then enable the add-on pinned.

az aks mesh get-revisions --location $LOC -o table
REV=asm-1-27   # use a value the previous command listed
az aks mesh enable -g $RG -n $CLUSTER --revision $REV -o table
kubectl get pods -n aks-istio-system   # expect istiod-asm-1-27-* Running

Expected: an istiod-asm-1-27-... pod in aks-istio-system showing Running.

Step 3 — Onboard a namespace the RIGHT way and deploy a sample.

ASM_REV=$(az aks show -g $RG -n $CLUSTER --query 'serviceMeshProfile.istio.revisions[0]' -o tsv)
kubectl create namespace demo
kubectl label namespace demo istio.io/rev=$ASM_REV --overwrite
kubectl apply -n demo -f https://raw.githubusercontent.com/istio/istio/release-1.27/samples/httpbin/httpbin.yaml
# curl client used for the egress tests in steps 4 to 6
kubectl apply -n demo -f https://raw.githubusercontent.com/istio/istio/release-1.27/samples/curl/curl.yaml
kubectl rollout restart deployment -n demo
kubectl get pods -n demo   # expect 2/2 once restarted

Expected: the httpbin and curl pods become 2/2 (app + istio-proxy). If one is 1/1, you forgot the restart or used the wrong label.

Step 4 — Prove egress is open (ALLOW_ANY) before you lock it.

# Run curl from the app container: traffic sent from inside istio-proxy (UID 1337) bypasses the sidecar's capture rules
kubectl exec -n demo deploy/curl -c curl -- \
  curl -sS -o /dev/null -w '%{http_code}\n' https://example.com   # expect 200 (ALLOW_ANY)

Step 5 — Lock egress with REGISTRY_ONLY via the shared ConfigMap.

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-shared-configmap-$ASM_REV
  namespace: aks-istio-system
data:
  mesh: |-
    accessLogFile: /dev/stdout
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
EOF
kubectl rollout restart deployment -n demo   # fresh pods start with the new mesh config
kubectl rollout status deployment/curl -n demo --timeout=120s
kubectl exec -n demo deploy/curl -c curl -- \
  curl -sS -o /dev/null -w '%{http_code}\n' https://example.com   # now expect 000 and a non-zero exit

Expected: the same curl now prints 000 and exits with an error, because the sidecar resets the TLS connection when no ServiceEntry declares example.com. (A plain-HTTP request to an undeclared host returns 502 instead.)

Step 6 — Allow exactly one host with a ServiceEntry.

kubectl apply -n demo -f - <<EOF
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: allow-example
  namespace: demo
spec:
  hosts: ["example.com"]
  ports: [{number: 443, name: tls, protocol: TLS}]
  resolution: DNS
  location: MESH_EXTERNAL
EOF
sleep 10
kubectl exec -n demo deploy/curl -c curl -- \
  curl -sS -o /dev/null -w '%{http_code}\n' https://example.com   # back to 200, but ONLY this host

Step 7 — Verify mesh health, then enforce STRICT.

istioctl proxy-status --istioNamespace aks-istio-system   # all SYNCED, no STALE
kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata: {name: default, namespace: demo}
spec: {mtls: {mode: STRICT}}
EOF

Step 8 — Teardown.

az aks mesh disable -g $RG -n $CLUSTER --yes
az group delete -n $RG --yes --no-wait

What each lab step proves, at a glance:

Step | Proves | If it fails…
2 | Add-on installs pinned to a revision | Region/version mismatch: re-check get-revisions
3 | Correct label + restart → sidecar | 1/1 means wrong label or no restart
4 | Default egress is wide open | (it always is; that’s the point)
5 | REGISTRY_ONLY blocks undeclared egress | 200 means ConfigMap name ≠ revision, no restart, or curl ran inside istio-proxy (which bypasses the sidecar)
6 | ServiceEntry re-allows one host | Still blocked (000/502) → wait for push / check host match
7 | Proxies synced; STRICT applied | STALE → config not converged

Common mistakes & troubleshooting

This is the differentiator. The managed add-on’s failures are silent at apply time and loud at request time. Scan the playbook table, then read the detail for whichever row matches your symptom.

# | Symptom | Root cause | Confirm (exact command) | Fix
1 | Pod is 1/1, traffic unencrypted | istio-injection=enabled used (ignored by add-on) | kubectl get ns <ns> --show-labels | Relabel istio.io/rev=asm-X-Y; rollout restart
2 | Pod still 1/1 after correct label | Workload never restarted (injection is at admission) | kubectl get pods -n <ns> | kubectl rollout restart deployment -n <ns>
3 | Mesh-wide STRICT “does nothing” | PeerAuthentication placed in istio-system | kubectl get peerauthentication -A | Move it to aks-istio-system
4 | 503 UC after enabling STRICT | A caller is un-injected (sends plaintext) | istioctl x describe pod <pod> -n <ns> | Onboard the caller, or stage PERMISSIVE first
5 | 503 UC from one client only | DestinationRule pins client to DISABLE | kubectl get destinationrule -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,TLS:.spec.trafficPolicy.tls.mode | Set mode: ISTIO_MUTUAL
6 | 403 RBAC: access denied | First ALLOW policy is default-deny | Envoy access log: rbac_access_denied | Add the missing source principals/namespaces
7 | “Applied VS, nothing changed” | Proxy config is STALE | istioctl proxy-status --istioNamespace aks-istio-system | Wait for push; rollout restart if stuck
8 | External call returns 502/000 | REGISTRY_ONLY with no ServiceEntry | kubectl exec ... -c <app-container> -- curl <host> | Add a ServiceEntry for the host
9 | Egress change ignored | Shared ConfigMap name ≠ revision | kubectl get cm -n aks-istio-system | Rename to istio-shared-configmap-asm-X-Y
10 | istioctl “no running Istio pods” | Missing --istioNamespace aks-istio-system | istioctl version --istioNamespace aks-istio-system | Always pass the add-on namespace
11 | Upgrade ran, workloads unchanged | Repointed tag but didn’t restart | istioctl proxy-status (mixed revs) | kubectl rollout restart per namespace
12 | Egress gateway enable fails | Static Egress GW unsupported on Pod Subnet | az aks show --query networkProfile | Use REGISTRY_ONLY + Azure Firewall instead
13 | Gateway has no external IP | Source-range/subnet annotation wrong, or quota | kubectl get svc -n aks-istio-ingress | Fix annotation; check public-IP quota
14 | Sidecar OOMKilled | Watching the whole cluster’s config | kubectl describe pod (OOMKilled) | Add discoverySelectors; raise proxy memory

No sidecar injected (rows 1–2)

The two most common tickets. The add-on only honours istio.io/rev; istio-injection=enabled is a silent no-op. And even the right label does nothing to running pods, because injection is a mutating admission webhook that fires at pod creation.

Confirm:

kubectl get ns payments --show-labels   # look for istio.io/rev=asm-X-Y (NOT istio-injection)
kubectl get pods -n payments            # 1/1 = no sidecar; 2/2 = injected

Fix: relabel and restart:

kubectl label namespace payments istio.io/rev=$ASM_REV --overwrite
kubectl rollout restart deployment -n payments
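Eyeballing 1/1 versus 2/2 gets unreliable once pods carry their own extra containers, so a name-based check is a useful complement. In this sketch, pods_missing_sidecar is a hypothetical helper that expects tab-separated lines of pod name, container names and init-container names, and reports pods where istio-proxy appears in neither list (checking init containers covers setups where the proxy runs as a native sidecar).

```shell
# Hypothetical helper: reads "<pod>\t<containers>\t<initContainers>" lines on
# stdin and prints pods that have no istio-proxy container in either list.
pods_missing_sidecar() {
  awk -F'\t' '$1 != "" {
    names = " " $2 " " $3 " "
    if (index(names, " istio-proxy ") == 0) print $1
  }'
}
```

A possible usage: `kubectl get pods -n payments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\t"}{.spec.initContainers[*].name}{"\n"}{end}' | pods_missing_sidecar`. Run it before flipping a namespace to STRICT so un-injected callers show up first.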

STRICT breaks traffic with 503 UC (rows 3–5)

A 503 with response flag UC (upstream connection termination) after flipping STRICT means a server now demands mTLS while a client isn’t sending it. Three distinct causes, each with a different fix:

# Is the mesh-wide policy even in the right namespace?
kubectl get peerauthentication -A

# What mTLS mode is in effect for the server pod, and which policies apply to it?
istioctl x describe pod "$(kubectl get pod -n payments -l app=productpage \
  -o jsonpath='{.items[0].metadata.name}')" -n payments \
  --istioNamespace aks-istio-system

# Is a DestinationRule forcing the client side to DISABLE?
kubectl get destinationrule -A -o yaml | grep -B3 -A2 'mode:'

The 503 UC decision table:

If the check shows… | And… | It’s probably… | Do this
Server STRICT, client 1/1 | caller has no sidecar | Un-injected client | Onboard the caller; or stage PERMISSIVE
Server STRICT, client 2/2 | a DestinationRule exists | Client pinned to DISABLE | Set DR tls.mode: ISTIO_MUTUAL
Policy not listed in aks-istio-system | mesh-wide intended | Wrong root namespace | Move policy to aks-istio-system
Both 2/2, no DR override | still failing | Stale config | istioctl proxy-status; restart

Authz black-holes traffic (403, row 6)

Your first ALLOW AuthorizationPolicy makes the workload default-deny. Anything not enumerated gets 403.

Confirm in the Envoy access log:

kubectl logs -n payments deploy/productpage -c istio-proxy | grep rbac_access_denied

Fix: enumerate every legitimate caller in the policy’s from.source.principals. During triage you can flip the policy to action: AUDIT to log-without-enforce while you discover callers, then switch back to ALLOW.
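For reference, an enumerated ALLOW policy might look like the sketch below, modelled on the ledger example from the scenario. The helper name, the app: ledger label, the ServiceAccount names and the method list are all illustrative assumptions; substitute the SPIFFE principals your callers actually run as.

```shell
# Hypothetical helper: prints an ALLOW AuthorizationPolicy that lets only two
# named principals write to the ledger workload. Names are illustrative.
render_ledger_allow_policy() {
  cat <<'EOF'
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: ledger-allow-writers
  namespace: ledger
spec:
  selector:
    matchLabels:
      app: ledger
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/checkout/sa/checkout
        - cluster.local/ns/payments/sa/payments
    to:
    - operation:
        methods: ["POST", "PUT"]
EOF
}
```

A possible usage is `render_ledger_allow_policy | kubectl apply -f -`. Once this is the only ALLOW policy selecting the workload, every request that does not match the rule is denied, including reads from the same two principals, so add further rules for any other legitimate traffic.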

Config not applied / STALE (rows 7, 11)

istioctl proxy-status is the truth oracle. A STALE row means the proxy has not received the latest config push — usually because the object is in the wrong namespace, or the proxy needs a nudge.

istioctl proxy-status --istioNamespace aks-istio-system
# SYNCED everywhere = good; STALE for CDS/LDS/RDS/EDS = that proxy is behind

Egress blocked / ignored (rows 8–9)

Under REGISTRY_ONLY, an undeclared host returns 502 from the sidecar for plain HTTP; for HTTPS the sidecar resets the connection, which curl reports as 000. If your egress change is ignored entirely, the shared ConfigMap name doesn’t match the running revision.

# The ConfigMap name MUST be istio-shared-configmap-<running-rev>
kubectl get cm -n aks-istio-system | grep shared

# Prove the block (should be 502/000), then add a ServiceEntry and re-test (200).
# Exec into an app container that has curl: traffic from istio-proxy itself (UID 1337) is not captured by the sidecar
kubectl exec -n payments deploy/productpage -c productpage -- \
  curl -sS -o /dev/null -w '%{http_code}\n' https://api.payments-partner.com

istioctl talks to nothing (row 10)

Every istioctl invocation needs --istioNamespace aks-istio-system, or it looks for a control plane in istio-system and reports no running pods. Set an alias if you run it often: alias istioctl='istioctl --istioNamespace aks-istio-system'.

Reading Envoy response flags (the real root-cause signal)

When a request fails, the Envoy access log carries a short response flag that names the failure class far more precisely than the HTTP code. These are the flags you will actually see on this add-on, and what each means:

Response flag | Meaning | Common mesh cause | Where to look next
UC | Upstream connection termination | STRICT vs plaintext/DISABLE mismatch | istioctl x describe pod; the DestinationRule
UF | Upstream connection failure | Upstream pod down / no endpoints | kubectl get endpoints; pod health
UH | No healthy upstream | All endpoints unhealthy / outlier-ejected | Outlier detection in DestinationRule
URX | Upstream retry limit exceeded | Retries exhausted on a flapping upstream | VirtualService retries; upstream stability
NR | No route configured | VirtualService/Gateway host mismatch | Host/gateways fields; proxy-status
RBAC / rbac_access_denied | Authorization denied | AuthorizationPolicy default-deny | The policy’s from.source rules
DC | Downstream connection termination | Client gave up (often a timeout above) | Client/gateway timeout settings
- (none) | No special flag | Request handled (may still be app 5xx) | App logs / Failures

Pull the flag straight from the sidecar:

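# Assumes Istio's default text access-log format, where the flags field follows the status code
kubectl logs -n payments deploy/productpage -c istio-proxy --tail=50 \
  | awk -F'"' 'NF > 2 { split($3, f, " "); print f[2] }' \
  | sort | uniq -c   # tally the response flags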

Best practices

Security notes

The mesh’s whole reason to exist is security in transit and least-privilege between services; configure it like you mean it.

| Control | Setting / mechanism | Why | Verify |
| --- | --- | --- | --- |
| Encryption in transit | STRICT PeerAuthentication in aks-istio-system | mTLS on every hop; no plaintext | istioctl x describe pod <pod> |
| Least-privilege L7 | Default-deny AuthorizationPolicy with principals | Identity-scoped, not IP-scoped | Envoy log shows enforced denies |
| Egress control | REGISTRY_ONLY + ServiceEntry (+ Firewall) | Stop data exfil; audit every external call | curl from sidecar → 502 if undeclared |
| Ingress exposure | Source-range annotation on external gateway | Limit who can reach the public edge | kubectl get svc annotation present |
| Identity | SPIFFE IDs (cluster.local/ns/<ns>/sa/<sa>) | Stable across reschedules; per-workload SA | Distinct ServiceAccount per workload |
| Secret/cert lifecycle | istiod-issued workload certs (managed) | Short-lived, auto-rotated | Managed by the add-on |
| Defense in depth | Mesh authz plus NetworkPolicy | L3/4 floor under the L7 mesh | Pair with Cilium/Azure NPM |
| Control-plane isolation | aks-istio-system is platform-managed | Tenants can’t tamper with istiod | RBAC on the namespace |

Two non-obvious points. First, mTLS proves who a caller is but not whether they are allowed — STRICT without AuthorizationPolicy still lets any meshed workload call any other, so always pair them. Second, the mesh is not a substitute for Kubernetes NetworkPolicy: a sidecar can be bypassed by a pod that opts out of injection, so keep an L3/4 default-deny NetworkPolicy (see Kubernetes Network Policies: Cilium L7 & Default-Deny) under the mesh as a floor. For workload identity at the app layer, dedicate a ServiceAccount per workload so SPIFFE IDs are meaningful (background in Kubernetes RBAC: Least-Privilege Design).
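The L3/4 floor described above can be sketched as an ingress default-deny NetworkPolicy. This is an illustration under assumptions: the policy name default-deny-ingress is hypothetical, the payments namespace is reused from the troubleshooting examples, and a real rollout needs explicit allow policies layered on top for the traffic you do want.

```shell
# Write a namespace-wide ingress default-deny as the floor under the mesh.
# Hypothetical policy name: default-deny-ingress.
manifest=netpol-default-deny.yaml
cat > "$manifest" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}        # empty selector = every pod in the namespace
  policyTypes:
  - Ingress              # no ingress rules listed, so all inbound traffic is denied
EOF
echo "wrote $manifest"
# Review, add the matching allow policies, then: kubectl apply -f "$manifest"
```

Because the policy is enforced by the CNI rather than by Envoy, it still applies to a pod that has opted out of sidecar injection.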

Cost & sizing

The add-on itself has no separate license fee — you pay for the compute the data and control planes consume, plus the Azure resources the gateways create, plus telemetry ingestion. The drivers:

| Cost driver | What it is | Rough magnitude | How to control |
| --- | --- | --- | --- |
| Sidecar CPU/memory | Envoy per meshed pod | ~0.05–0.15 vCPU, ~50–150 MB each | Right-size requests; discoverySelectors; don’t mesh everything |
| istiod footprint | Control plane pods | Scales with config/proxy count | Fewer watched namespaces; tags over many revisions |
| Ingress gateway | Standard LB + public IP | LB hourly + per-rule + public IP | Share gateways across services; internal where possible |
| Egress gateway | Static Egress GW + IP prefix | Gateway + reserved IP prefix | Only where a fixed IP is mandated |
| Azure Firewall (alt) | Firewall + public IP + per-GB | Firewall hourly + data processed | One central firewall for the whole VNet |
| Managed Prometheus | Metric ingestion/storage | Per metric sample ingested | Scope scraping; drop high-cardinality series |
| Container Insights logs | Access-log ingestion | Per GB ingested | Scope access logs via Telemetry API |

Sizing guidance as a table — the lever to pull at each cluster size:

| Cluster size | Meshed pods | Primary cost lever | Watch out for |
| --- | --- | --- | --- |
| Small (< 50 pods) | Tens | Sidecar overhead is the bulk | Don’t mesh batch/system namespaces |
| Medium (50–300) | Hundreds | discoverySelectors; shared gateways | istiod memory creep; STALE pushes |
| Large (300–1000+) | Thousands | Prune config aggressively; scope telemetry | Push convergence time; log ingestion bill |

Rough INR/USD anchors (Central India, indicative): a Standard LB for one gateway runs on the order of ₹1,500–2,500 / month (~$18–30) plus a public IP; sidecar overhead at, say, 200 meshed pods at 0.1 vCPU / 100 MB each is roughly 20 vCPU / 20 GB of cluster capacity you must provision — often a node or two. Telemetry is the sleeper cost: mesh-wide access logging on a busy cluster can dwarf the compute, which is why scoping it via the Telemetry API matters. The add-on has no free tier of its own, but a 2-node Standard_B2s lab cluster for an hour is well under ₹100. There is no charge for the canary upgrade machinery — only the brief period of running two control planes’ worth of istiod pods.
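The sidecar arithmetic above is easy to rerun for your own pod count. A small sketch, using the same illustrative inputs (200 pods, 0.1 vCPU and 100 MB per sidecar); the variable names are arbitrary and the memory figure uses decimal GB.

```shell
# Aggregate sidecar overhead for a given number of meshed pods.
pods=200             # meshed pods
cpu_millicores=100   # per-sidecar CPU request (0.1 vCPU)
mem_mb=100           # per-sidecar memory request in MB

total_vcpu=$(( pods * cpu_millicores / 1000 ))
total_gb=$(( pods * mem_mb / 1000 ))

echo "sidecar overhead at ${pods} pods: ${total_vcpu} vCPU, ~${total_gb} GB"
# prints: sidecar overhead at 200 pods: 20 vCPU, ~20 GB
```

Swap in the requests you actually set on the sidecar to see how much of a node pool the mesh consumes before any application traffic.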

Interview & exam questions

1. Why is istio-injection=enabled a no-op on the AKS managed add-on, and what do you use instead? The add-on is revision-scoped and only honours istio.io/rev=asm-X-Y (or a revision tag) for injection. istio-injection=enabled is silently ignored, producing a 1/1 pod with no sidecar and no error. You label the namespace with the running revision (or a tag) and then kubectl rollout restart the workloads.

2. Where do mesh-wide policies live on the add-on, and why does this matter? In aks-istio-system, the add-on’s root namespace — not istio-system as in upstream Istio. A selector-less PeerAuthentication or the shared MeshConfig placed in istio-system is read by nothing, which is the most common reason a “mesh-wide STRICT” change appears to do nothing.

3. Explain the difference between PeerAuthentication STRICT and a DestinationRule’s ISTIO_MUTUAL. PeerAuthentication is server-side: it controls what a workload’s sidecar accepts (STRICT = mTLS only). DestinationRule tls.mode: ISTIO_MUTUAL is client-side: it controls what the client sidecar originates. A 503 UC after enabling STRICT is usually a mismatch — a client still sending plaintext or pinned to DISABLE.

4. Why can adding your first AuthorizationPolicy cause an outage? An AuthorizationPolicy with action: ALLOW and at least one rule makes the targeted workload default-deny — anything not explicitly matched is rejected with 403. If you don’t enumerate every legitimate caller, you black-hole traffic you forgot about.

5. How do you safely migrate mTLS from PERMISSIVE to STRICT? Start PERMISSIVE (default), confirm every client of a service is injected and showing 2/2, then enforce STRICT, ideally per-namespace so the blast radius is one team at a time. Verify with istioctl x describe pod and watch for 503 UC from any straggler.

6. What does REGISTRY_ONLY do, where do you set it, and what must you add afterward? It makes Envoy block any outbound host not in the service registry. You set it in the shared ConfigMap (istio-shared-configmap-asm-X-Y), which the control plane merges over its reconciled default. After that, every external dependency must be declared as a ServiceEntry, or it returns 502 from the sidecar.

7. Walk through a canary revision upgrade. Check az aks mesh get-upgrades; copy the shared ConfigMap to the new revision name; az aks mesh upgrade start --revision asm-1-28 (runs new istiod alongside the old); repoint a revision tag to the new revision; kubectl rollout restart the namespaces you’re migrating; verify with istioctl proxy-status and dashboards; then complete (removes old) or rollback (removes canary).

8. How do minor upgrades differ from patch upgrades? Minor (revision) upgrades you initiate; they run two control planes and you migrate workloads at your pace. Patch upgrades (e.g. 1.27.2 → 1.27.3) AKS rolls automatically in your maintenance window for istiod and gateways — but your sidecars don’t update until you restart the workloads.

9. Why can’t you always use the managed Istio egress gateway, and what’s the alternative? It requires the AKS Static Egress Gateway feature, which is unsupported on Azure CNI Pod Subnet clusters — so the egress gateway isn’t available there. The alternative is REGISTRY_ONLY + ServiceEntry for the L7 allowlist plus Azure Firewall (fixed public IP via UDR) for a deterministic source IP.

10. What’s the single highest-signal command for “I applied a VirtualService and nothing changed,” and why? istioctl proxy-status --istioNamespace aks-istio-system. A STALE row means that proxy hasn’t received the latest config push — the usual root cause. SYNCED everywhere means the config is live and the problem is elsewhere (e.g. wrong host/match).

11. Why prefer SPIFFE principals over ipBlocks in authorization rules? SPIFFE identities (cluster.local/ns/<ns>/sa/<sa>) are tied to the workload’s ServiceAccount and are stable across pod reschedules and IP changes; mTLS provides exactly this identity. IP-based rules break the moment a pod is rescheduled to a new address.

12. Which certifications and exams does this material map to? It maps to AZ-305 (designing secure Azure solutions / AKS networking), the CKS (cluster security, mesh, network policy, supply chain), and Istio-specific knowledge for vendor mesh assessments. The mTLS/authz/egress patterns also appear in zero-trust architecture questions.
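The per-namespace enforcement from question 5 can be sketched as a namespace-scoped PeerAuthentication. The payments namespace is reused from the troubleshooting examples as an assumption; a selector-less PeerAuthentication placed in a workload namespace applies only to that namespace, which is what keeps the blast radius to one team.

```shell
# Enforce STRICT mTLS for a single namespace instead of mesh-wide.
manifest=peerauth-payments-strict.yaml
cat > "$manifest" <<'EOF'
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments     # namespace scope; the mesh-wide variant lives in aks-istio-system
spec:
  mtls:
    mode: STRICT
EOF
echo "wrote $manifest"
# Review, then apply: kubectl apply -f "$manifest"
```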

Quick check

  1. A pod in a labelled namespace shows 1/1. Name the two most likely causes.
  2. You set mesh-wide STRICT but nothing is enforced. Where did you probably put the PeerAuthentication, and where should it go?
  3. After enabling STRICT you get 503 UC from one specific client that is 2/2. What’s the likely culprit?
  4. Under REGISTRY_ONLY, an external API call returns 502 from the sidecar. What’s missing?
  5. You repointed the revision tag during an upgrade but workloads still run the old proxy. What step did you skip?

Answers

  1. Either the namespace was labelled istio-injection=enabled (ignored by the add-on — use istio.io/rev=asm-X-Y), or the workloads were never restarted after labelling (injection happens at admission, so kubectl rollout restart).
  2. You likely put it in istio-system; the add-on’s root namespace is aks-istio-system. Move it there.
  3. A DestinationRule for that host is pinning the client side to tls.mode: DISABLE (or SIMPLE) while the server now demands mTLS. Set it to ISTIO_MUTUAL.
  4. A ServiceEntry declaring that host. Under REGISTRY_ONLY, undeclared hosts are blocked; add a ServiceEntry (MESH_EXTERNAL, port 443).
  5. The kubectl rollout restart of the workloads. Repointing the tag/relabelling changes nothing until pods are recreated and re-injected on the new revision.
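For answer 3, the client-side fix can be sketched as a DestinationRule that originates Istio mTLS. The resource name productpage-mtls and the host are hypothetical, chosen to match the productpage workload in the payments namespace used in the troubleshooting examples.

```shell
# Make client sidecars originate Istio mTLS toward the STRICT server.
# Hypothetical names: DestinationRule "productpage-mtls", host productpage.payments.svc.cluster.local.
manifest=dr-productpage-mtls.yaml
cat > "$manifest" <<'EOF'
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: productpage-mtls
  namespace: payments
spec:
  host: productpage.payments.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
EOF
echo "wrote $manifest"
# Review, then apply: kubectl apply -f "$manifest"
```

If an existing DestinationRule for the host already sets tls.mode: DISABLE or SIMPLE, edit that rule instead of adding a second one for the same host.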

Glossary

Next steps
