Running the Managed Istio Add-on on AKS: mTLS, Ingress Gateways, and Egress Control

The AKS managed Istio add-on (Azure Service Mesh, branded as asm-X-Y revisions) takes the part of Istio that most teams get wrong — control-plane lifecycle, upgrades, and CRD hygiene — and makes it Microsoft’s problem. What it does not do is make your security posture, routing, or egress correct by default. The add-on ships with permissive mTLS and ALLOW_ANY egress out of the box; everything that makes a mesh worth running is something you still have to configure deliberately.

This guide walks the full path a platform team takes: enable the add-on, label namespaces correctly, enforce strict mTLS scoped with authorization policy, stand up managed ingress gateways, lock down egress to a registry-only allowlist, run a canary revision upgrade end to end, and wire telemetry into Managed Prometheus. Every command and manifest here is the managed-add-on variant, which differs from upstream Istio in ways that bite if you copy generic tutorials.

1. Managed add-on vs self-managed Istio: the revision model

The single most important mental model: the add-on is revision-scoped. There is no “Istio version” on the cluster — there is a revision like asm-1-27, and almost every object you touch (the istiod deployment, the shared ConfigMap, the namespace injection label, the ingress/egress gateway deployments) is suffixed or keyed by that revision string. This is what makes the canary upgrade model work, and it is why generic Istio docs lead you astray.

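Key constraints that differ from self-managed Istio, each covered in the sections below: the root namespace is aks-istio-system rather than istio-system; namespaces opt in with the revision label istio.io/rev rather than istio-injection=enabled; MeshConfig changes go into a revision-specific shared ConfigMap rather than the default one; and gateways are provisioned through az aks mesh commands rather than installed by hand.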

Check what is actually available in your region before you do anything else. Compatibility is a function of both the AKS version and the region:

az aks mesh get-revisions --location eastus2 -o table
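Because so many object names are keyed by the revision, it can help to derive them from a single variable instead of typing them out. The following is a minimal sketch, not part of the Azure CLI or istioctl: the function names are hypothetical, and the name patterns are the ones that appear in this guide (istiod-<revision> and istio-shared-configmap-<revision>).

```shell
# Hypothetical helpers (not part of az or istioctl): derive revision-keyed
# object names from a revision string such as asm-1-27.

# Succeeds only for strings shaped like asm-<major>-<minor>.
is_asm_revision() {
  printf '%s\n' "${1:-}" | grep -Eq '^asm-[0-9]+-[0-9]+$'
}

# Name of the shared MeshConfig ConfigMap for a revision.
shared_configmap_name() {
  is_asm_revision "${1:-}" || { echo "not an add-on revision: ${1:-}" >&2; return 1; }
  printf 'istio-shared-configmap-%s\n' "$1"
}

# Name prefix of the istiod deployment and pods for a revision.
istiod_name() {
  is_asm_revision "${1:-}" || { echo "not an add-on revision: ${1:-}" >&2; return 1; }
  printf 'istiod-%s\n' "$1"
}
```

A typical use would be kubectl get configmap "$(shared_configmap_name "$REV")" -n aks-istio-system, so a typo in the revision fails loudly instead of querying an object that does not exist.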

2. Enabling the add-on and the namespace labeling strategy

Enable on an existing cluster. If you omit --revision, AKS picks a current default — fine for a lab, but in production you want to pin the revision so it does not drift between environments:

export RESOURCE_GROUP=rg-platform
export CLUSTER=aks-prod-eastus2
export REV=asm-1-27

az aks mesh enable \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER \
  --revision $REV

Two hard prerequisites: Azure CLI 2.57.0+ (egress gateways need 2.80.0+), and the Open Service Mesh add-on must be removed first — the two cannot coexist. The add-on also requires Kubernetes >= 1.23.

Confirm the mesh mode and the control-plane pods, then pull the live revision so the rest of your scripts are not hard-coded:

az aks show -g $RESOURCE_GROUP -n $CLUSTER --query 'serviceMeshProfile.mode' -o tsv
# -> Istio

az aks get-credentials -g $RESOURCE_GROUP -n $CLUSTER
kubectl get pods -n aks-istio-system
# -> istiod-asm-1-27-... Running

ASM_REV=$(az aks show -g $RESOURCE_GROUP -n $CLUSTER \
  --query 'serviceMeshProfile.istio.revisions[0]' -o tsv)

Now the labeling strategy. Do not label everything. Onboard namespaces deliberately, because injection changes the pod spec and forces a restart. The label must match the running revision exactly:

# Correct for the add-on:
kubectl label namespace payments istio.io/rev=$ASM_REV --overwrite

# WRONG — silently skipped by the add-on, no sidecar injected:
# kubectl label namespace payments istio-injection=enabled

Labeling alone does nothing to running pods. Injection happens at admission, so you must restart existing workloads to get a sidecar:

kubectl rollout restart deployment -n payments
kubectl get pods -n payments
# Each pod should now show 2/2 READY (app container + istio-proxy)

A practical governance tactic at scale: pair the revision label with discoverySelectors in MeshConfig so istiod only watches mesh-labeled namespaces. On a large cluster this materially reduces istiod and Envoy memory by pruning irrelevant config from every proxy’s push.

3. Enforcing STRICT PeerAuthentication and scoping with AuthorizationPolicy

Out of the box the mesh runs PERMISSIVE mTLS: sidecars accept both plaintext and mTLS. That is the right default during onboarding (un-injected clients keep working) and the wrong default for production. The migration sequence matters: you turn on STRICT only after every client of a service is inside the mesh, or you cause an outage.
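One way to check that precondition before flipping a namespace is to list the pods that have no istio-proxy container. The helper below is a sketch under stated assumptions, not an add-on feature: the function name is hypothetical, and it expects one pod per line in the form "namespace pod containers init-containers", which the kubectl command in the comment should produce.

```shell
# Hypothetical helper: print "namespace/pod" for every pod that has no
# istio-proxy container (regular or init). Expects one pod per line on stdin,
# for example from this single command:
#   kubectl get pods -A --no-headers -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,C:.spec.containers[*].name,IC:.spec.initContainers[*].name'
pods_without_sidecar() {
  awk '$0 !~ /(^|[ \t,])istio-proxy([ \t,]|$)/ { print $1 "/" $2 }'
}
```

The output is a starting point rather than a verdict: not every un-injected pod is a client of the service you are about to lock down, so cross-check the list against the callers you know about.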

Mesh-wide STRICT goes in the root namespace (aks-istio-system), with no selector:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: aks-istio-system   # add-on root namespace, NOT istio-system
spec:
  mtls:
    mode: STRICT

A safer rollout pattern is per-namespace STRICT, so you can flip services one blast radius at a time. Pair it with an AuthorizationPolicy to move from “encrypted” to “encrypted and authorized” — mTLS proves identity, but without an authorization policy every authenticated workload can still call every other one:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity, not IP — survives pod reschedules
        principals: ["cluster.local/ns/checkout/sa/checkout-api"]
    to:
    - operation:
        methods: ["POST"]
        paths: ["/v1/charges"]

A subtle, important rule: an AuthorizationPolicy with an ALLOW action and at least one rule is default-deny for that workload — anything not explicitly matched is rejected. Adding your first ALLOW policy to a namespace can therefore black-hole traffic you forgot to enumerate. Prefer source principals and namespaces over IP-based rules; identities are stable across reschedules and are exactly what mTLS gives you.
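Principal strings are easy to mistype, and a typo in an ALLOW rule fails closed. A small helper can build them from the namespace and service account; this is a sketch with a hypothetical function name, and it assumes the trust domain is cluster.local (as in the policy above) unless you pass a different one.

```shell
# Hypothetical helper: build the string used in AuthorizationPolicy
# source.principals for a service account.
# Usage: mesh_principal <namespace> <serviceaccount> [trust-domain]
# The trust domain defaults to cluster.local (an assumption; override if yours differs).
mesh_principal() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "usage: mesh_principal <namespace> <serviceaccount> [trust-domain]" >&2
    return 1
  fi
  printf '%s/ns/%s/sa/%s\n' "${3:-cluster.local}" "$1" "$2"
}
```

For the policy above, mesh_principal checkout checkout-api prints the same value that appears in its principals list.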

4. Provisioning managed ingress gateways

The add-on provisions and lifecycles the gateways for you. Enable an external (internet-facing) and an internal (VNet-only) gateway:

az aks mesh enable-ingress-gateway \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --ingress-gateway-type external

az aks mesh enable-ingress-gateway \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --ingress-gateway-type internal

This creates two LoadBalancer services in aks-istio-ingress: aks-istio-ingressgateway-external (public IP) and aks-istio-ingressgateway-internal (an internal Standard LB IP, reachable only from the VNet). The label your Gateway selector binds to is on the gateway pods behind those services, e.g. istio: aks-istio-ingressgateway-internal.

kubectl get svc -n aks-istio-ingress

Customize the underlying Azure LB via annotations on the service. Two that almost every enterprise needs — pin the internal gateway to a dedicated subnet, and restrict the external gateway’s source ranges:

# Internal gateway -> specific subnet (must be in the cluster's VNet)
kubectl annotate svc aks-istio-ingressgateway-internal -n aks-istio-ingress \
  service.beta.kubernetes.io/azure-load-balancer-internal-subnet=snet-ingress --overwrite

# External gateway -> allow only known source CIDRs (e.g. your WAF / Front Door egress)
kubectl annotate svc aks-istio-ingressgateway-external -n aks-istio-ingress \
  service.beta.kubernetes.io/azure-allowed-ip-ranges="203.0.113.0/24,198.51.100.0/24" --overwrite

Bind a Gateway + VirtualService to the internal gateway. Note that the selector matches the label on the gateway pods, and the Gateway resource lives in the application namespace, not in aks-istio-ingress:

apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: storefront-internal
  namespace: payments
spec:
  selector:
    istio: aks-istio-ingressgateway-internal
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "shop.internal.contoso.com"
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: storefront
  namespace: payments
spec:
  hosts:
  - "shop.internal.contoso.com"
  gateways:
  - storefront-internal
  http:
  - route:
    - destination:
        host: productpage
        port:
          number: 9080

If you need the real client IP at the gateway (for WAF logging or rate limiting), set externalTrafficPolicy: Local on the external service. It preserves source IP and removes a hop, at the cost of less even traffic spreading across nodes:

kubectl patch svc aks-istio-ingressgateway-external -n aks-istio-ingress \
  --type merge -p '{"spec":{"externalTrafficPolicy":"Local"}}'

5. Routing: VirtualService, DestinationRule, and subset traffic splitting

Canary application releases (distinct from mesh-revision upgrades) are driven by a DestinationRule that declares subsets and a VirtualService that weights them. Define the subsets against pod labels, then split:

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
  namespace: payments
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # use mesh-issued mTLS to the upstream
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
  namespace: payments
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10

Setting trafficPolicy.tls.mode: ISTIO_MUTUAL in the DestinationRule is what tells the client sidecar to originate mTLS. STRICT PeerAuthentication governs the server side (what it accepts); the DestinationRule governs the client side (what it sends). When you flip a namespace to STRICT, make sure no DestinationRule is overriding the client side back to DISABLE for that host — a mismatch here is the classic “503 UC” / upstream-connect-failure you will spend an afternoon chasing.
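When a canary is shifted in steps, the two weights in the VirtualService have to move together. The sketch below computes the pair for a given canary percentage; the function name is hypothetical, and the v1/v2 names simply mirror the subsets declared in the DestinationRule above.

```shell
# Hypothetical helper: given the percentage to send to the canary subset (v2),
# print the weights for the stable (v1) and canary (v2) routes.
canary_weights() {
  case "${1:-}" in
    ''|*[!0-9]*|0[0-9]*)
      echo "canary percentage must be an integer from 0 to 100" >&2
      return 1
      ;;
  esac
  if [ "$1" -gt 100 ]; then
    echo "canary percentage must be an integer from 0 to 100" >&2
    return 1
  fi
  printf 'v1=%d v2=%d\n' "$((100 - $1))" "$1"
}
```

Stepping through canary_weights 10, 25, 50, and 100 gives the values to put into the two weight fields at each stage of the rollout.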

6. Locking down egress with ServiceEntry and REGISTRY_ONLY

By default outboundTrafficPolicy.mode is ALLOW_ANY: a compromised pod can call any host on the internet, and your mesh provides zero egress control. Flip it to REGISTRY_ONLY so Envoy blocks anything not explicitly in the service registry. You set this via the shared ConfigMap, whose name is revision-specific and which the control plane merges over its reconciled default (you never edit the default istio-asm-X-Y ConfigMap directly):

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-shared-configmap-asm-1-27   # must match your revision
  namespace: aks-istio-system
data:
  mesh: |-
    accessLogFile: /dev/stdout
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY

With that applied, every external dependency must be declared as a ServiceEntry. This turns egress into an auditable allowlist living in Git:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: contoso-payments-api
  namespace: payments
spec:
  hosts:
  - api.payments-partner.com
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

For defense-in-depth — a predictable source IP that a partner or Azure Firewall can allowlist — route that traffic through a managed Istio egress gateway, which builds on the AKS Static Egress Gateway feature. Provision it against a StaticGatewayConfiguration that owns a fixed egress IP prefix:

az aks mesh enable-egress-gateway \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --istio-egressgateway-name egress-partners \
  --istio-egressgateway-namespace aks-istio-egress \
  --gateway-configuration-name sgc-partners

Caveat worth knowing before you design around it: the Istio egress gateway requires Static Egress Gateway, which is not supported on Azure CNI Pod Subnet clusters — so the egress gateway isn’t either. On those clusters, enforce egress with REGISTRY_ONLY + ServiceEntry + Azure Firewall instead, and skip the gateway.
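That caveat can be checked before any design work. The sketch below decides whether a cluster uses Pod Subnet from a list of per-node-pool pod subnet IDs. It rests on assumptions you should verify against your own az output: the function name is hypothetical, and the suggested query field (agentPoolProfiles[].podSubnetId) is how the property is believed to appear in az aks show output.

```shell
# Hypothetical pre-check: succeeds if any node pool has a pod subnet assigned.
# Reads one value per line on stdin; empty lines and the literal "None" mean
# "no pod subnet". Possible input (field name is an assumption, verify it):
#   az aks show -g "$RESOURCE_GROUP" -n "$CLUSTER" --query 'agentPoolProfiles[].podSubnetId' -o tsv
uses_pod_subnet() {
  awk 'NF && $0 != "None" { found = 1 } END { exit (found ? 0 : 1) }'
}
```

If the check succeeds, plan for REGISTRY_ONLY plus ServiceEntry plus Azure Firewall rather than the managed egress gateway.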

7. Canary revision upgrades: tag, shift, roll back

This is where the managed add-on earns its keep. A minor revision upgrade runs the new istiod alongside the old one; you migrate workloads at your own pace and can roll back at any point before completing. You can move n+1 or skip to n+2, provided both are supported and AKS-compatible.

First, see your valid targets (if a newer revision is missing here, your AKS version is too old and must be upgraded first):

az aks mesh get-upgrades --resource-group $RESOURCE_GROUP --name $CLUSTER
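As a quick sanity check in a pipeline, the n+1 or n+2 rule can be tested on the revision strings themselves. This is only a sketch with a hypothetical function name: it checks the distance between minor versions and nothing else, so az aks mesh get-upgrades remains the source of truth for what is supported and AKS-compatible.

```shell
# Hypothetical pre-check: succeeds if the target revision is one or two minor
# versions ahead of the current one (same major). Does not check support status.
# Usage: is_valid_upgrade_hop <current-revision> <target-revision>
is_valid_upgrade_hop() {
  printf '%s %s\n' "${1:-}" "${2:-}" | awk '
    {
      if ($1 !~ /^asm-[0-9]+-[0-9]+$/ || $2 !~ /^asm-[0-9]+-[0-9]+$/) exit 1
      split($1, cur, "-"); split($2, tgt, "-")
      if (cur[2] != tgt[2]) exit 1      # different major: not handled here
      hop = tgt[3] - cur[3]
      exit ((hop == 1 || hop == 2) ? 0 : 1)
    }'
}
```

For example, is_valid_upgrade_hop asm-1-27 asm-1-28 succeeds, while a downgrade or a three-minor jump fails.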

If you set any custom MeshConfig, copy your shared ConfigMap to the new revision’s name first (e.g. istio-shared-configmap-asm-1-28) — it has to exist the moment the new control plane comes up. Then start the canary:

az aks mesh upgrade start \
  --resource-group $RESOURCE_GROUP --name $CLUSTER \
  --revision asm-1-28
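The ConfigMap copy that has to happen before the upgrade command is easy to get wrong by hand. One option is to rewrite the manifest you keep in source control rather than exporting the live object. The helper below is a sketch: the function name is hypothetical, it assumes the manifest uses the istio-shared-configmap-<revision> name shown in section 6, and it changes nothing except that name.

```shell
# Hypothetical helper: read a shared ConfigMap manifest on stdin and rewrite
# its name from the old revision to the new one. Everything else is untouched.
# Usage: retarget_shared_configmap <old-revision> <new-revision> < manifest.yaml
retarget_shared_configmap() {
  for rev in "${1:-}" "${2:-}"; do
    printf '%s\n' "$rev" | grep -Eq '^asm-[0-9]+-[0-9]+$' || {
      echo "not an add-on revision: $rev" >&2
      return 1
    }
  done
  # Match the full revision only (asm-1-2 must not match inside asm-1-27).
  sed -E "s/(istio-shared-configmap-)$1([^0-9A-Za-z-]|\$)/\1$2\2/"
}
```

Run before az aks mesh upgrade start, a pipeline such as retarget_shared_configmap asm-1-27 asm-1-28 < shared-configmap.yaml | kubectl apply -f - would create the new revision's ConfigMap; shared-configmap.yaml is a placeholder for wherever your manifest lives.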

Now both control planes are running. Rather than relabel every namespace (tedious and error-prone), use revision tags as a stable indirection. Point a tag at the old revision, label namespaces with the tag, and later you just repoint the tag:

# istioctl must target the add-on namespace
istioctl tag set prod --revision asm-1-27 --istioNamespace aks-istio-system
kubectl label namespace payments istio.io/rev=prod --overwrite

# When ready to shift, repoint the tag — all 'prod'-tagged namespaces move at once
istioctl tag set prod --revision asm-1-28 --istioNamespace aks-istio-system --overwrite

# Relabeling/repointing does nothing until you restart workloads:
kubectl rollout restart deployment -n payments

Verify both control planes — and, if ingress is enabled, the per-revision gateway pods sitting behind one shared, immutable service IP:

kubectl get pods -n aks-istio-system   # istiod-asm-1-27-* AND istiod-asm-1-28-*
kubectl get pods -n aks-istio-ingress  # gateway pods for both revisions; same LB IP

Check your dashboards, then commit or revert. Completing removes the old control plane; rollback (after repointing the tag and restarting workloads back) removes the canary:

# Healthy -> finalize
az aks mesh upgrade complete --resource-group $RESOURCE_GROUP --name $CLUSTER

# Regression -> repoint tag to old rev, restart workloads, then:
az aks mesh upgrade rollback --resource-group $RESOURCE_GROUP --name $CLUSTER

Patch versions (e.g. 1.27.2 -> 1.27.3) are different: AKS rolls them out automatically for istiod and gateways inside your planned maintenance window. Your sidecars do not update until you restart the workloads — patching the control plane alone leaves data-plane proxies on the old build.

8. Telemetry: metrics, access logs, and Managed Prometheus

Istio exposes rich Envoy metrics on each pod’s merged telemetry endpoint, port 15020 (/stats/prometheus). Azure Managed Prometheus does not scrape pod-annotation targets by default — you opt in by editing the ama-metrics-settings-configmap to enable pod-annotation-based scraping, then annotating mesh pods:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-settings-configmap
  namespace: kube-system
data:
  pod-annotation-based-scraping: |-
    podannotationnamespaceregex = "payments|checkout|aks-istio-ingress"

Annotate the mesh workloads so the agent knows where to scrape. The sidecar's Istio agent merges Envoy's metrics and the app's metrics onto 15020, so a single scrape target covers both:

# pod template annotations on your Deployments
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"
    prometheus.io/path: "/stats/prometheus"

For access logs, the accessLogFile: /dev/stdout line in the shared ConfigMap (from section 6) writes one access log entry per request to the istio-proxy container's stdout, where Container Insights picks them up. Be deliberate: mesh-wide access logging measurably increases Envoy CPU and log volume. Scope it with the Telemetry API to the namespaces that need it rather than blasting it across the fleet.

Once metrics land in your Managed Prometheus workspace, the golden signals are standard Istio series. Azure Monitor workspaces are queried with PromQL rather than KQL; a request-success-rate query over a 5-minute window:

sum(rate(istio_requests_total{response_code!~"5.."}[5m]))
/
sum(rate(istio_requests_total[5m]))

Verify

Run these after each major step; they catch the failure modes that produce confusing 503s and silent plaintext.

# 1. Sidecars actually injected (expect 2/2 READY across mesh namespaces)
kubectl get pods -n payments

# 2. mTLS is genuinely STRICT for a workload (check the effective PeerAuthentication mode)
istioctl x describe pod "$(kubectl get pod -n payments -l app=productpage \
  -o jsonpath='{.items[0].metadata.name}')" -n payments \
  --istioNamespace aks-istio-system

# 3. Config is valid and pushed to every proxy (no SYNCED=STALE)
istioctl proxy-status --istioNamespace aks-istio-system

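# 4. Egress is locked down: this MUST fail under REGISTRY_ONLY without a ServiceEntry.
#    Run it from the app container (the image needs curl). Traffic sent from the
#    istio-proxy container uses the proxy's own UID and bypasses sidecar redirection.
kubectl exec -n payments deploy/productpage -- \
  curl -sS -o /dev/null -w '%{http_code}\n' https://example.com   # expect 000 (HTTPS) or 502 (HTTP), not 200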

# 5. Ingress is reachable on the gateway's LB IP
kubectl get svc aks-istio-ingressgateway-external -n aks-istio-ingress \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

istioctl proxy-status is the highest-signal command in the set: if a proxy shows STALE for any config type, that workload is running stale routing or policy, which is the usual root cause of “I applied the VirtualService but nothing changed.”
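To turn that check into something a pipeline can act on, the table can be filtered for stale rows. This is a sketch with a hypothetical function name; it assumes only that the proxy name is the first column and that a stale entry contains the word STALE, since the exact column layout varies between istioctl versions.

```shell
# Hypothetical helper: print the first column (proxy name) of every row in
# `istioctl proxy-status` output that contains the word STALE.
# The header row is skipped.
stale_proxies() {
  awk 'NR > 1 && /(^|[ \t])STALE([ \t(]|$)/ { print $1 }'
}
```

Piping istioctl proxy-status --istioNamespace aks-istio-system into stale_proxies prints nothing when every proxy is in sync, which makes an empty result easy to assert on.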

Enterprise scenario

A payments platform team ran a regional AKS cluster on Azure CNI Pod Subnet for routable pod IPs. Their PCI scope required two things from the mesh: strict mTLS between all in-scope services, and a single, fixed source IP for outbound calls to a card-processor partner who allowlisted by IP. The obvious design — a managed Istio egress gateway over Static Egress Gateway for the predictable IP — failed at az aks mesh enable-egress-gateway, because Static Egress Gateway is not supported on Pod Subnet clusters, so the Istio egress gateway isn’t available there either.

Rather than re-platform the cluster off Pod Subnet (a multi-quarter migration), they split the two requirements across the layers that were available. Mesh-wide STRICT PeerAuthentication in aks-istio-system satisfied the encryption requirement. For the fixed egress IP, they set REGISTRY_ONLY in the shared ConfigMap, declared the partner host as a ServiceEntry, and forced that traffic out through Azure Firewall with a fixed public IP via UDR. Istio enforced the hostname allowlist and workload identity, while the firewall provided the stable source IP and packet capture for the auditors. The ServiceEntry that made it auditable in Git:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: card-processor
  namespace: payments
spec:
  hosts:
  - secure.cardprocessor.example
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

The lesson the team wrote into their platform runbook: validate add-on feature support against your cluster’s network plugin before designing around it. The mesh and the cluster network model are coupled, and the managed add-on’s constraints (egress gateway, Pod Subnet) are not interchangeable with self-managed Istio’s. Pushing the fixed-IP requirement down to Azure Firewall was both compliant and cheaper than re-platforming.
