Azure Arc-Enabled Kubernetes: GitOps, Policy, and Fleet Governance for Hybrid Clusters

A platform team running EKS in one account, GKE in another, and three on-prem clusters in a colo does not have a Kubernetes problem. It has a governance problem: there is no single place to assert “every cluster runs this GitOps config, denies privileged pods, ships logs to one workspace, and is reachable for debugging without poking inbound holes in five firewalls.” Every cluster is a snowflake with its own RBAC, its own admission controller (or none), its own log destination, and its own bastion. Azure Arc-enabled Kubernetes projects any conformant cluster into Azure Resource Manager as a Microsoft.Kubernetes/connectedClusters resource, so the same management-group hierarchy, Azure Policy assignments, and RBAC you already use for native Azure resources now reach the cluster — wherever it physically runs.

This walkthrough onboards a non-Azure cluster, then layers the four controls that actually matter at fleet scale: Flux v2 GitOps for desired-state config, Azure Policy (Gatekeeper) for admission guardrails, cluster connect for kubectl without inbound firewall changes, and Container Insights plus workload identity for observability and secretless Key Vault access. Throughout I assume you have cluster-admin on the target cluster and Owner (or sufficient RBAC) on the Azure side. The goal is not a demo of one cluster — it is the machinery that turns forty snowflakes into “one policy, one Git repo, one identity boundary.”

Arc projects the cluster; it does not run it. The control plane, scheduler, and your nodes stay exactly where they are. Arc adds a set of agents that maintain an outbound connection to Azure and reconcile ARM intent into the cluster. If Azure is unreachable, the cluster keeps serving traffic — only the management plane pauses.

What problem this solves

The pain is operational drift across a heterogeneous fleet. Without a projection layer, every governance question becomes N separate answers. “Are privileged pods blocked everywhere?” means SSHing into N clusters or trusting N different OPA setups. “Who can debug the loyalty cluster at 02:00?” means N bastions, N VPNs, and N firewall change tickets. “Where are the logs?” means N workspaces and no fleet-wide query. When an auditor asks “prove no cluster runs hostPath mounts,” you have no single control plane to answer from.

What breaks without it: configuration entropy (each cluster diverges from the golden baseline because changes are applied by hand), inconsistent security posture (one cluster forgot the admission webhook and now runs root containers), blind operations (an outage on an edge cluster is invisible until a human notices), and access sprawl (every team cuts inbound firewall holes for kubectl, each one a new attack surface). Who hits this: platform/SRE teams running multi-cloud or hybrid Kubernetes, regulated shops that must prove uniform controls, and edge fleets (retail, manufacturing, telco) where clusters sit behind carrier-grade NAT with no public ingress.

Pain without Arc	What it costs you	How Arc fixes it
Config drift across N clusters	Snowflakes; “works on cluster A, broken on B”	Flux reconciles one Git repo to every cluster, `prune=true`
No uniform admission policy	One cluster runs root pods, fails audit	Azure Policy → Gatekeeper assigned at management-group scope
Inbound firewall holes for kubectl	N attack surfaces, N change tickets	Cluster connect — outbound-only, no inbound port
Logs scattered in N places	No fleet-wide incident view	Container Insights → one Log Analytics workspace
Static secrets in manifests	Credential sprawl, no per-app audit	Workload identity + Key Vault CSI, secretless
Onboarding a cluster is manual	Days per cluster, human error	MG inheritance — new cluster self-bootstraps baseline

Learning objectives

By the end of this article you can:

Explain the Arc agent architecture and the outbound-only connectivity model, and enumerate every required FQDN — including the *.servicebus.windows.net websocket dependency that breaks cluster connect when proxied.
Onboard an on-prem or EKS/GKE cluster with az connectedk8s connect, including the proxy flags (--proxy-https, --proxy-skip-range, --proxy-cert) that locked-down networks actually need.
Configure Flux v2 GitOps via the microsoft.flux extension with correctly scoped Kustomizations, prune=true, and dependsOn ordering, identically across Arc and AKS.
Assign Azure Policy (Gatekeeper) initiatives at management-group scope, roll out safely in audit before deny, and exclude system namespaces so you do not block Arc’s own agents.
Grant cluster connect access with Azure RBAC and use az connectedk8s proxy for kubectl with zero inbound firewall changes.
Enable Container Insights with managed-identity auth (no workspace key in the cluster) and scope ingestion to control cost.
Federate a user-assigned managed identity to a Kubernetes service account so pods read Key Vault secrets with no credential in the cluster.
Operate the controls at fleet scale using management groups, tags, Azure Resource Graph inventory, and Bicep-as-intent so new clusters inherit the baseline automatically.

Prerequisites & where this fits

You should be comfortable with core Kubernetes (Deployments, namespaces, RBAC, admission webhooks), kubectl and kubeconfig contexts, and Helm at a basic level. On the Azure side you need an understanding of Azure Resource Manager, management groups, Azure RBAC role assignments, and Azure Policy assignments. Familiarity with GitOps as a concept (desired state in Git, a controller reconciles) makes section 3 land faster — if you want a refresher, the Flux CD GitOps: Monorepo, Kustomize, and Multi-Tenancy and Argo CD App-of-Apps Multi-Cluster GitOps deep-dives cover the upstream engines Arc wraps.

Where this fits in the bigger picture: Arc-enabled Kubernetes is the hybrid arm of a wider Azure governance story. Management groups and Policy initiatives are the same primitives you would use for an Azure landing zone management-group hierarchy and for Azure Policy at scale. Arc for servers is the sibling for VMs; see Azure Arc-Enabled Servers: Machine Configuration & Extended Security Updates. If your target is actually a managed Azure cluster, much of this carries over to the AKS day-two operations covered in AKS Day-Two: Upgrades & Fleet Operations.

You should already know…	Why it matters here	If shaky, read
Kubernetes RBAC + admission webhooks	Policy = Gatekeeper webhook; access = impersonation	(K8s docs)
`kubeconfig` contexts	`connect` uses the current context to deploy agents	(kubectl basics)
Azure management groups	Policy + RBAC inherit down the MG tree	azure-landing-zone-management
Azure Policy assignments	Initiatives become in-cluster constraints	azure-policy-governance-scale
GitOps reconcile model	Flux is the desired-state engine	flux-cd-gitops-monorepo-kustomize-multi-tenancy
Managed identity + federation	Workload identity = secretless Key Vault	entra-managed-identities-deep-dive-user-assigned-fic-rbac

Core concepts

Arc-enabled Kubernetes is a thin projection: a Helm release of agents inside the cluster, a resource in ARM, and a set of cluster extensions that deliver capabilities (Flux, Policy, Monitor, Key Vault). Internalize this vocabulary before the deep sections — every later table assumes it.

Concept	One-line definition	Where it lives	Why it matters
Connected cluster	The ARM resource projecting your cluster	`Microsoft.Kubernetes/connectedClusters`	The handle Policy/RBAC/extensions attach to
Arc agents	Helm release in `azure-arc` namespace	In-cluster	Maintain the outbound channel + reconcile intent
Cluster extension	A managed add-on lifecycled by Arc	`Microsoft.KubernetesConfiguration/extensions`	How Flux/Policy/Monitor/KV get installed + upgraded
`microsoft.flux`	The Flux v2 GitOps extension	Cluster extension	Delivers source/kustomize/helm controllers
`fluxConfigurations`	ARM resource describing a Git source + Kustomizations	ARM + in-cluster	Desired-state intent, applied by `config-agent`
Azure Policy add-on	Gatekeeper v3 (OPA) admission webhook	`Microsoft.PolicyInsights` extension	Turns ARM initiatives into in-cluster `Constraint`s
Cluster connect	Outbound channel for `kubectl` from anywhere	`clusterconnect-agent`	`kubectl` with no inbound port / VPN
`kube-aad-proxy`	Entra authN + user impersonation shim	In-cluster	Maps an Azure token to a K8s identity
Container Insights	Logs/metrics/inventory extension	`Microsoft.AzureMonitor.Containers`	Fleet telemetry into one workspace
Workload identity	Federated UAMI → K8s service account	Entra + cluster	Secretless Key Vault / Azure API access
Key Vault CSI	Secrets Store CSI driver + Azure provider	`Microsoft.AzureKeyVaultSecretsProvider`	Mounts vault secrets on tmpfs, no creds in-cluster
Management group	A scope above subscriptions	ARM hierarchy	Policy + RBAC inheritance to all child clusters

The two control planes

Arc gives you two distinct planes, and confusing them is the root of most early mistakes. The management plane is ARM: management groups, Policy assignments, role assignments, extension lifecycle. It is eventually consistent — Policy syncs roughly every 15 minutes, Flux on its own interval. The data plane is your cluster’s kube-apiserver, untouched and authoritative for what actually runs. Arc never inserts itself in the request path of your workloads; it only reconciles intent and brokers kubectl.

Plane	Owns	Latency	Authoritative for	If Azure is down
Management (ARM)	Policy, RBAC, extensions, GitOps intent	Eventual (~15 min Policy)	Desired state	Reconcile pauses
Data (apiserver)	Pods, services, actual admission	Real-time	Actual state	Cluster keeps serving

1. Agent architecture, connectivity, and outbound requirements

az connectedk8s connect installs a Helm release into the azure-arc namespace. The agents are all-outbound by design — there is no inbound listener Azure dials into. Each agent has a single, separable job; knowing which one owns what turns a vague “Arc is broken” into a targeted fix.

Agent	Role	Owns this failure when it breaks
`clusterconnect-agent`	Reverse proxy brokering the cluster-connect channel	`kubectl`-over-Arc hangs / times out
`kube-aad-proxy`	Entra authN on incoming connect requests, then impersonates the user	`kubectl` returns `forbidden` / authN errors
`config-agent`	Watches ARM for `fluxConfigurations` and applies them	Flux config never reconciles
`extension-manager`	Installs and lifecycles cluster extensions	Extension stuck `Creating`/`Failed`
`clusteridentityoperator`	Maintains the cluster’s MSI certificate used to auth to Azure	Cluster goes `Offline` or `Expired`, cert renewal fails
`resource-sync-agent`	Syncs cluster inventory back to the ARM resource	`connectivityStatus`/inventory stale
`cluster-metadata-operator`	Publishes cluster metadata (version, distribution) to ARM	Resource Graph shows blank distribution/version
`flux` controllers (with extension)	`source-controller`, `kustomize-controller`, `helm-controller`	Source pull / apply failures

Every agent talks outbound over HTTPS (TCP 443 unless the endpoint table lists another port), and cluster connect additionally needs websockets. The non-obvious requirement is *.servicebus.windows.net with websockets enabled on your proxy/firewall: cluster connect rides Azure Relay over that endpoint, and a Layer-7 proxy that blocks websocket upgrades will let onboarding succeed but break kubectl-over-Arc later. This single trap is a common cause of “onboarded fine but proxy hangs” tickets.

Required outbound endpoints

FQDN	Port	Purpose	Breaks if blocked
`management.azure.com`	443	ARM API (resource, extensions)	Onboarding, all management
`login.microsoftonline.com`	443	Entra ID token issuance	All auth
`mcr.microsoft.com`	443	Agent + extension container images	Agents can’t pull
`*.data.mcr.microsoft.com`	443	MCR image data edges	Image pull (CDN)
`*.dp.kubernetesconfiguration.azure.com`	443	Flux/config data plane	GitOps + extensions
`guestnotificationservice.azure.com`	443	Notifications + the allowlist API	Connect signalling
`*.servicebus.windows.net`	443	Azure Relay for cluster connect (websockets)	`kubectl`-over-Arc
`*.his.arc.azure.com`	443	Hybrid identity service (MSI cert)	Identity/cert renewal
`gbl.his.arc.azure.com`	443	Global hybrid identity endpoint	First MSI provisioning
`*.obo.arc.azure.com`	8084	On-behalf-of token exchange	Cluster connect authZ
`*.oms.opinsights.azure.com`	443	Container Insights ingestion	Log shipping
`*.monitoring.azure.com`	443	Metrics ingestion	Prometheus/metrics
`*.vault.azure.net`	443	Key Vault data plane (CSI)	Secret retrieval

The wildcard Service Bus endpoints resolve per-region; never hard-block them on a deny-by-default proxy without first expanding them for your regions. Expand with:

# Region-specific allowlist to replace the *.servicebus.windows.net wildcard
curl -s "https://guestnotificationservice.azure.com/urls/allowlist?api-version=2020-01-01&location=eastus"
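If the fleet spans several regions, the same call can be looped. A minimal sketch, assuming the endpoint and api-version shown above; the function names are illustrative, and the response is printed as returned rather than parsed, because its exact shape is not assumed here.

```shell
# Build the allowlist URL for one region (endpoint and api-version as used above).
arc_allowlist_url() {
  printf 'https://guestnotificationservice.azure.com/urls/allowlist?api-version=2020-01-01&location=%s\n' "$1"
}

# Fetch the allowlist for every region passed as an argument.
# Output is printed verbatim under a "# <region>" header line.
arc_allowlist_fetch() {
  for region in "$@"; do
    echo "# $region"
    curl -fsS "$(arc_allowlist_url "$region")" || echo "fetch failed for $region" >&2
    echo
  done
}
```

Usage: arc_allowlist_fetch eastus westeurope, then hand the returned hostnames to whoever owns the proxy allowlist.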

There is no “Azure-initiated inbound” connectivity mode for Arc Kubernetes — it is outbound-only, which is precisely why it fits locked-down on-prem and multi-cloud egress postures. Choose your egress posture deliberately:

Egress posture	What you configure	Pros	Cons
Direct outbound `:443`	Nothing extra	Simplest; least to break	Requires open egress to listed FQDNs
Explicit proxy	`--proxy-http/https/skip-range`	Centralized inspection/logging	Proxy must allow websockets to Relay
Proxy + custom root CA	add `--proxy-cert`	TLS-inspecting proxies work	Cert rotation must be maintained
Private endpoint (Arc PL)	Private endpoints for Arc data plane	Traffic stays on backbone	More setup; per-region endpoints

Connectivity status meanings

`connectivityStatus`	Meaning	Likely cause	Confirm	Fix
`Connected`	Agents heartbeating normally	—	`az connectedk8s show ... -o tsv`	(healthy)
`Offline`	No heartbeat for >15 min	Egress blocked / agents down	`kubectl get pods -n azure-arc`	Restore egress; restart agents
`Connecting`	Onboarding/handshake in progress	Just connected; provisioning	Wait; check agent logs	Usually transient
`Expired`	MSI certificate expired	`clusteridentityoperator` stuck / egress to `*.his.arc.azure.com` blocked	Check that agent’s logs	Allow HIS endpoints; restart agent
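The status table can be turned into a small triage helper. This is a sketch only: the function names are hypothetical, the hints simply restate the table, and arc_status_check assumes an authenticated az session with the connectedk8s extension installed.

```shell
# Map a connectivityStatus value to the first thing to check (mirrors the table above).
arc_status_hint() {
  case "$1" in
    Connected)  echo "healthy" ;;
    Connecting) echo "wait; check agent logs" ;;
    Offline)    echo "check egress and pods in azure-arc" ;;
    Expired)    echo "allow HIS endpoints; check clusteridentityoperator" ;;
    *)          echo "unknown status: $1" ;;
  esac
}

# Print "<status> -> <hint>" for one cluster.
# Args: <cluster-name> <resource-group>
arc_status_check() {
  conn_status=$(az connectedk8s show -n "$1" -g "$2" --query connectivityStatus -o tsv) || return 1
  echo "$conn_status -> $(arc_status_hint "$conn_status")"
}
```

Usage: arc_status_check "$CLUSTER_NAME" "$RESOURCE_GROUP".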

2. Onboard an on-prem or EKS/GKE cluster

Point your kubeconfig at the target cluster (kubectl config use-context my-eks), then prep the Azure side. Register the resource providers once per subscription — registration is asynchronous and can take ~10 minutes, so gate on it.

az extension add --name connectedk8s

az provider register --namespace Microsoft.Kubernetes
az provider register --namespace Microsoft.KubernetesConfiguration
az provider register --namespace Microsoft.ExtendedLocation

# Registration can take ~10 min; gate on it before connecting
az provider show -n Microsoft.Kubernetes --query registrationState -o tsv   # -> Registered
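In a script, "gate on it" usually means polling rather than checking once. A minimal sketch, assuming the same az provider show query as above; wait_for_provider and its attempt and sleep defaults are illustrative choices, not part of the CLI.

```shell
# Poll until a resource provider reports Registered, or give up.
# Args: <namespace> [max_attempts=60] [sleep_seconds=10]
wait_for_provider() {
  ns="$1"; attempts="${2:-60}"; pause="${3:-10}"
  while [ "$attempts" -gt 0 ]; do
    state=$(az provider show -n "$ns" --query registrationState -o tsv)
    if [ "$state" = "Registered" ]; then
      echo "$ns: Registered"
      return 0
    fi
    echo "$ns: ${state:-unknown}, waiting..." >&2
    attempts=$((attempts - 1))
    sleep "$pause"
  done
  echo "$ns: not Registered in time" >&2
  return 1
}
```

Usage: call wait_for_provider once per namespace you registered (Microsoft.Kubernetes, Microsoft.KubernetesConfiguration, Microsoft.ExtendedLocation) before running connect.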

Resource provider	Why you register it	Needed for
`Microsoft.Kubernetes`	Creates the connected-cluster resource	Onboarding (always)
`Microsoft.KubernetesConfiguration`	Flux configs + cluster extensions	GitOps, all extensions
`Microsoft.ExtendedLocation`	Custom locations on the cluster	Arc-enabled services (App Svc, data)
`Microsoft.PolicyInsights`	Azure Policy for Kubernetes	Gatekeeper guardrails
`Microsoft.OperationalInsights`	Log Analytics workspaces	Container Insights destination

Create a resource group to hold the connected-cluster resources, then connect. connect uses the current kubeconfig context to deploy the Arc agents:

export RESOURCE_GROUP=rg-arc-fleet
export LOCATION=eastus
export CLUSTER_NAME=eks-prod-use1

az group create --name $RESOURCE_GROUP --location $LOCATION -o table

# Uses the CURRENT kubeconfig context to deploy the Arc agents
az connectedk8s connect \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --location $LOCATION

connect installs its own Helm v3 binary under ~/.azure (it never touches a Helm you already have) and deploys the agents. The flags you will reach for most:

Flag	What it does	When to use	Gotcha
`--name`	Connected-cluster resource name	Always	Must be unique in the RG
`--resource-group`	Target RG	Always	RG location ≠ cluster location is fine
`--location`	ARM region for the resource	Always	Pick a region near you for control latency
`--proxy-https`	HTTPS proxy for in-cluster agents	Behind a proxy	Agents inherit it, not just your shell
`--proxy-http`	HTTP proxy	Behind a proxy	Pair with `--proxy-https`
`--proxy-skip-range`	CIDRs/suffixes to bypass the proxy	Behind a proxy	Must include service CIDR + `.svc`
`--proxy-cert`	Trusted root the proxy presents	TLS-inspecting proxy	Only for injecting a CA, not to “use a proxy”
`--distribution`	Override detected distro	Detection wrong	Improves support/telemetry accuracy
`--kube-config` / `--kube-context`	Target a specific kubeconfig/context	Multiple clusters in one config	Avoids onboarding the wrong cluster
`--disable-auto-upgrade`	Pin agent version	Change-controlled fleets	You own upgrades thereafter
`--container-log-path`	Custom container log path	Non-standard distros	For Insights log discovery

If the cluster egresses through a proxy, do not rely on HTTP_PROXY alone — pass it so the in-cluster agents inherit it. Always include the cluster’s service CIDR in --proxy-skip-range, or in-cluster service-to-service calls will be wrongly routed at the proxy:

az connectedk8s connect \
  --name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --proxy-https https://proxy.corp.local:8080 \
  --proxy-http  http://proxy.corp.local:8080 \
  --proxy-skip-range 10.0.0.0/16,kubernetes.default.svc,.svc.cluster.local,.svc \
  --proxy-cert /etc/ssl/certs/corp-root.crt

--proxy-cert is only for injecting a trusted root the proxy presents; it is not required just to use a proxy. The three flags most environments actually need are --proxy-http, --proxy-https, and --proxy-skip-range.
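Because a forgotten suffix in the skip range is easy to miss, it can help to build the value in one place. A small sketch: arc_skip_range is a hypothetical helper, and the three DNS suffixes are the ones used in the example above.

```shell
# Join the service CIDR(s) with the in-cluster DNS suffixes that must bypass the proxy.
# Args: one or more CIDRs (service CIDR first; add others if your network needs them).
arc_skip_range() {
  range=""
  for item in "$@" kubernetes.default.svc .svc.cluster.local .svc; do
    range="${range:+$range,}$item"
  done
  echo "$range"
}
```

Usage: pass --proxy-skip-range "$(arc_skip_range 10.0.0.0/16)" to az connectedk8s connect, substituting your cluster's actual service CIDR.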

Distribution support and what changes

Arc onboards any CNCF-conformant cluster. The distribution mostly affects telemetry and which extensions are validated, not whether onboarding works.

Distribution	Onboards	Notes
AWS EKS	Yes	Common multi-cloud target; works as `connectedClusters`
Google GKE	Yes	Detected as `gke`; full extension support
k3s / k0s	Yes	Edge favourite; ensure adequate node resources
RKE / RKE2	Yes	Rancher-managed; conformant
OpenShift (OKD/OCP)	Yes	SCCs may interact with policy; validate
kind / minikube	Yes (dev)	Fine for labs; not for production fleets
AKS (managed)	Use `managedClusters`	Already in Azure — Arc K8s is for non-AKS
AKS on Azure Stack HCI / Edge Essentials	Provisioned-cluster path	Slightly different onboarding

Onboarding errors you will actually hit

Symptom / error	Likely cause	Confirm	Fix
`MSI certificate is not ready`	Egress to `*.his.arc.azure.com` blocked	`clusteridentityoperator` logs	Allow HIS FQDNs; retry
Agents stuck `Pending` / `ImagePullBackOff`	`mcr.microsoft.com` blocked	`kubectl describe pod -n azure-arc`	Allow MCR + data edges
`connectivityStatus = Connecting` forever	Websocket/egress partial	Agent logs; firewall logs	Open `*.servicebus`, retry
`Helm release failed` on connect	Stale prior install in `azure-arc`	`helm list -n azure-arc`	`az connectedk8s delete` then re-connect
`Insufficient permissions`	Caller lacks RBAC on RG/sub	`az role assignment list`	Grant Contributor + the `Kubernetes Cluster - Azure Arc Onboarding` role
`Provider not registered`	RP registration incomplete	`az provider show`	Re-run register; wait for `Registered`
In-cluster calls fail post-connect	Service CIDR not in skip-range	DNS/connectivity tests	Add CIDR + `.svc` to `--proxy-skip-range`
Onboard OK, `proxy` hangs	L7 proxy strips websockets	`az connectedk8s proxy -n $CLUSTER_NAME -g $RESOURCE_GROUP --debug`	Allow Relay FQDNs with websockets
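Most rows in this table start with the same first check: which agent pods are unhealthy. A sketch of that check, assuming the default column order of kubectl get pods --no-headers (NAME READY STATUS ...); the function name is illustrative.

```shell
# List azure-arc pods that are neither Running nor Completed.
# Prints "<pod-name> <status>" per unhealthy pod; prints nothing when all are healthy.
arc_unhealthy_pods() {
  kubectl get pods -n azure-arc --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}
```

Usage: run arc_unhealthy_pods, then kubectl describe pod -n azure-arc on whatever it prints and match the status against the table.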

3. Configure Flux v2 GitOps via the Arc extension

Arc’s GitOps is Flux v2 delivered as the microsoft.flux cluster extension (it installs fluxconfig-agent and fluxconfig-controller alongside the upstream source/kustomize/helm controllers). You rarely install the extension by hand — creating your first fluxConfigurations pulls it in automatically. Register the configuration with az k8s-configuration flux create, scoped at the cluster level, with one or more Kustomizations:

# Needs the k8s-configuration CLI extension
az extension add --name k8s-configuration

az k8s-configuration flux create \
  --name fleet-baseline \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --namespace cluster-config \
  --scope cluster \
  --url https://github.com/acme-platform/fleet-gitops \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true \
  --kustomization name=apps  path=./apps/prod prune=true dependsOn=\["infra"\]

`flux create` options that change behaviour

Option	Values	Default	When to change	Trade-off / gotcha
`--scope`	`cluster` | `namespace`	`cluster`	Tenant-confined config	`namespace` can’t create CRDs/ClusterRoles
`--namespace`	any	(required)	Where Flux objects live	Created if absent
`--kind`	`git` | `bucket` | `azblob`	`git`	Non-Git sources	Auth differs per kind
`--url`	repo URL	(required)	—	`https://` or `ssh://`
`--branch` / `--tag` / `--semver` / `--commit`	a ref	`branch=main`-ish	Pin to a release	Tag/commit = immutable rollout
`--interval`	duration	`10m`	Faster/slower polls	Lower = more API + Git load
`--kustomization prune=`	`true` | `false`	`false`	Always `true` for real GitOps	Without it, Git ≠ truth
`--kustomization dependsOn=`	list	none	Order infra before apps	Cycles = stuck reconcile
`--kustomization sync_interval=`	duration	`10m`	Per-Kustomization cadence	Independent of source interval
`--kustomization retry_interval=`	duration	source interval	Faster retry on failure	Lower = more churn on broken state
`--kustomization timeout=`	duration	`10m`	Long applies (CRDs, big charts)	Too low = false failures
`--kustomization force=`	`true` | `false`	`false`	Recreate immutable fields	Can cause disruptive replace
`--https-user` / `--https-key`	string	none	Private HTTPS repo (PAT)	Stored as a secret
`--ssh-private-key` / `--ssh-private-key-file`	key	none	Private SSH repo	Add known-hosts too
`--known-hosts` / `--known-hosts-file`	string	none	SSH host verification	Omit → host-key errors
`--local-auth-ref`	secret name	none	Reference a pre-made secret	Bring-your-own auth
`--suspend`	flag	off	Freeze reconcile	Drift not corrected while set

The mechanics worth internalising:

--scope cluster lets the Kustomizations create cluster-scoped objects (CRDs, namespaces, ClusterRoles). Use --scope namespace for tenant-confined configs that may only touch their own namespace.
prune=true is non-negotiable for real GitOps: delete a manifest from Git and Flux garbage-collects the object from the cluster. Without it, Git stops being the source of truth.
dependsOn orders reconciliation — apps waits for infra to go Ready, so your ingress controller and CRDs land before the workloads that need them.
The same command works against AKS by passing --cluster-type managedClusters. That symmetry is the whole point: one Git repo, one CLI, identical config across Arc and AKS.
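The Arc/AKS symmetry is easiest to see when the cluster type is the only parameter that varies. A sketch under the same assumptions as the command above (repo URL, namespace, and Kustomization names are the article's examples); apply_fleet_baseline is a hypothetical wrapper, and the brackets in dependsOn are backslash-escaped so the shell does not treat them as a glob.

```shell
# Apply the same baseline to an Arc cluster or an AKS cluster.
# Args: <cluster-name> <resource-group> <connectedClusters|managedClusters>
apply_fleet_baseline() {
  az k8s-configuration flux create \
    --name fleet-baseline \
    --cluster-name "$1" \
    --resource-group "$2" \
    --cluster-type "$3" \
    --namespace cluster-config \
    --scope cluster \
    --url https://github.com/acme-platform/fleet-gitops \
    --branch main \
    --kustomization name=infra path=./infrastructure prune=true \
    --kustomization name=apps path=./apps/prod prune=true dependsOn=\["infra"\]
}
```

Usage: apply_fleet_baseline "$CLUSTER_NAME" "$RESOURCE_GROUP" connectedClusters for the Arc cluster, and the same call with managedClusters for an AKS cluster.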

Source kinds and how each authenticates

`--kind`	Source	Auth options	Use when
`git`	GitHub/GitLab/Azure Repos/Bitbucket	public, PAT (`--https-`), SSH (`--ssh-`)	The default — Git is source of truth
`bucket`	S3-compatible object store	access key/secret	Manifests in an S3/MinIO bucket
`azblob`	Azure Blob Storage	account key, SAS, managed identity	Azure-native artifact store + WI

For a connected (non-AKS) cluster you do not need a managed identity to read a public Git repo — the source controller pulls directly. For private repos, pass --https-user/--https-key (PAT) or SSH key material; for Azure Blob sources with workload identity, the azblob kind federates to a UAMI (see section 7).

Flux config status and reconciliation states

`complianceState` / condition	Meaning	Likely cause	Confirm	Fix
`Compliant`	Source + all Kustomizations applied	—	`az k8s-configuration flux show`	(healthy)
`Non-Compliant`	Apply failed / drift uncorrected	Manifest error, RBAC, `--scope` too narrow	`kubectl -n flux-system logs deploy/kustomize-controller`	Fix manifest/scope; re-reconcile
`Pending`	First reconcile in progress	Just created	Watch source-controller logs	Usually transient
Source `not ready`	Can’t pull the repo	Bad URL/branch, auth, host key	source-controller events	Fix URL/auth/known-hosts
Kustomization `dependency not ready`	Waiting on `dependsOn`	Upstream Kustomization not Ready	`flux show` per Kustomization	Fix the dependency first
`health check failed`	Applied but objects unhealthy	App crashing / not Ready	`kubectl get` the objects	Fix the workload

Force a reconcile without waiting for the interval by annotating the source/Kustomization (flux reconcile ... if the Flux CLI is installed), or simply bump a commit. On Arc, the config-agent will also re-pull on the next ARM sync.
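The annotation route needs nothing beyond kubectl. A minimal sketch: flux_reconcile_now is a hypothetical helper that stamps Flux's reconcile.fluxcd.io/requestedAt annotation. The object names that the extension generates are not assumed here, so list them first and pass the one you want.

```shell
# Ask Flux to reconcile an object now by stamping the requestedAt annotation.
# Args: <kind> <name> <namespace>   (kind is gitrepository or kustomization)
flux_reconcile_now() {
  kubectl annotate --overwrite "$1/$2" -n "$3" \
    "reconcile.fluxcd.io/requestedAt=$(date +%s)"
}
```

Usage: kubectl get gitrepositories,kustomizations -n cluster-config to find the names, then flux_reconcile_now gitrepository NAME cluster-config.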

4. Apply Azure Policy (Gatekeeper) at fleet scope

Azure Policy for Kubernetes extends Gatekeeper v3 (the OPA admission webhook) so you can author guardrails once in ARM and enforce them as in-cluster admission decisions across the fleet. Install the extension per cluster, then assign initiatives at a scope that covers many clusters. Register the provider and install the extension (Microsoft.PolicyInsights):

az provider register --namespace Microsoft.PolicyInsights

az k8s-extension create \
  --cluster-type connectedClusters \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --extension-type Microsoft.PolicyInsights \
  --name azurepolicy

Now assign a built-in initiative. The Pod Security baseline standards for Linux workloads initiative (a8640138-9b0a-4a28-b8cb-1666c838647d) bundles the deny rules most teams want — no privileged containers, no host namespaces, no hostPath, drop dangerous capabilities. Assign it at a management group so it lands on every connected cluster underneath, and exclude the system namespaces (otherwise you will block Arc’s own agents):

az policy assignment create \
  --name "psp-baseline-fleet" \
  --display-name "Pod Security baseline - Arc fleet" \
  --policy-set-definition "a8640138-9b0a-4a28-b8cb-1666c838647d" \
  --scope "/providers/Microsoft.Management/managementGroups/mg-arc-prod" \
  --params '{
    "effect": { "value": "deny" },
    "excludedNamespaces": { "value": ["kube-system","gatekeeper-system","azure-arc"] }
  }'
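Since the effect is the only value that changes between the audit phase and enforcement, it is worth parameterizing. A sketch, assuming that re-running create with the same name and scope overwrites the assignment's parameters (the underlying ARM call is a PUT); both function names are illustrative.

```shell
# Emit the --params JSON for the baseline initiative with the chosen effect.
# Args: <audit|deny|disabled>; excluded namespaces match the assignment above.
baseline_params() {
  case "$1" in
    audit|deny|disabled) ;;
    *) echo "unsupported effect: $1" >&2; return 1 ;;
  esac
  printf '{"effect":{"value":"%s"},"excludedNamespaces":{"value":["kube-system","gatekeeper-system","azure-arc"]}}\n' "$1"
}

# Create (or overwrite) the fleet assignment with that effect.
assign_baseline() {
  params=$(baseline_params "$1") || return 1
  az policy assignment create \
    --name "psp-baseline-fleet" \
    --display-name "Pod Security baseline - Arc fleet" \
    --policy-set-definition "a8640138-9b0a-4a28-b8cb-1666c838647d" \
    --scope "/providers/Microsoft.Management/managementGroups/mg-arc-prod" \
    --params "$params"
}
```

Usage: assign_baseline audit on day one, then assign_baseline deny once the compliance view is clean.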

Policy effects and what each does in-cluster

Effect	In-cluster behaviour	When to use	Risk
`audit`	Logs non-compliant; admits the object	Brownfield rollout, discovery	None to workloads; just visibility
`deny`	Gatekeeper rejects the admission	Steady-state enforcement	Blocks bad deploys (and false positives)
`disabled`	Policy inert	Temporarily pause a rule	Drift uncorrected

The Kubernetes add-on supports audit, deny, and disabled effects. There is no deployIfNotExists inside the cluster — remediation of K8s objects is via GitOps, not Policy mutation.

Built-in initiatives worth knowing

Initiative	Definition ID (set)	What it enforces
Pod Security Baseline (Linux)	`a8640138-9b0a-4a28-b8cb-1666c838647d`	No privileged, no host ns, no hostPath, drop caps
Pod Security Restricted (Linux)	(restricted set)	Baseline + runAsNonRoot, seccomp, no privilege-escalation
Deployment safeguards (general)	(built-in set)	Resource limits, no `:latest`, approved registries

Common single-rule built-ins (assemble custom initiatives)

Rule (policy definition)	Effect surface	Catches
No privileged containers	deny/audit	`securityContext.privileged: true`
No host network/PID/IPC	deny/audit	`hostNetwork`/`hostPID`/`hostIPC`
No `hostPath` volumes	deny/audit	Node filesystem mounts
Allowed capabilities / drop NET_RAW	deny/audit	Dangerous Linux caps
`runAsNonRoot` required	deny/audit	Root containers
CPU/memory limits required	deny/audit	Unbounded pods
Allowed container registries	deny/audit	Pulls from untrusted registries
No `:latest` image tag	deny/audit	Unpinned images
Allowed external IPs / no NodePort	deny/audit	Unexpected exposure
Read-only root filesystem	deny/audit	Writable container roots

Two operational realities to respect:

Roll out in audit before deny. Set effect to audit, watch the compliance results in Azure Policy for a week, fix the violators, then flip to deny. Flipping straight to deny on a brownfield cluster will reject existing Deployments on their next rollout and page you at 02:00.
Constraints are pulled, not instant. The add-on syncs assignments roughly every 15 minutes and writes Gatekeeper Constraint objects whose names start with azurepolicy-. Inspect them in-cluster with kubectl get constrainttemplates and kubectl get constraints.
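Once constraints have synced, a negative test confirms that deny is actually enforcing. This sketch relies on a server-side dry run passing through admission webhooks without persisting the pod, which holds for webhooks that declare no side effects (assumed here for Gatekeeper). The function name, probe pod name, and busybox image are illustrative.

```shell
# Probe admission with a privileged pod that is never persisted.
# Returns 0 if denied, 1 if it would be admitted, 2 if kubectl failed for another reason.
# Args: <namespace> (must not be one of the excluded namespaces)
expect_privileged_denied() {
  if out=$(kubectl run policy-probe --image=busybox --privileged \
             --restart=Never --dry-run=server -n "$1" 2>&1); then
    echo "ADMITTED: constraints not enforcing in $1" >&2
    return 1
  fi
  case "$out" in
    *denied*) echo "denied as expected"; return 0 ;;
    *)        echo "failed for another reason: $out" >&2; return 2 ;;
  esac
}
```

While the assignment is still in audit, this is expected to return 1, because audit admits the object and only records the violation.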

Policy troubleshooting playbook

#	Symptom	Root cause	Confirm (exact cmd / path)	Fix
1	Deploys suddenly rejected	Initiative flipped to `deny`, real violation	`kubectl get events`; Policy compliance blade	Remediate manifest; or revert to `audit`
2	Arc/system pods blocked	System namespaces not excluded	`kubectl get constraints -o yaml` (excludedNamespaces)	Add `kube-system`,`gatekeeper-system`,`azure-arc`
3	No constraints in cluster	Assignment not synced yet	`kubectl get constraints` (empty)	Wait ~15 min; check add-on `provisioningState`
4	Compliance shows “no data”	Add-on not installed / unhealthy	`az k8s-extension show --name azurepolicy`	(Re)install; check `gatekeeper-system` pods
5	Legit pod flagged non-compliant	Rule stricter than intended	Compliance reason on the resource	Tune params / switch initiative tier
6	`deny` blocks a needed exception	No per-namespace carve-out	Identify the namespace	Exclude namespace or scope assignment narrower
7	Custom rule never fires	ConstraintTemplate/Rego error	`kubectl describe constrainttemplate ...`	Fix Rego; re-publish definition
8	Webhook latency/timeouts	Gatekeeper under-resourced	`gatekeeper-system` pod CPU/mem	Raise limits; reduce constraint count
9	Negative test still admits	Constraints not synced / wrong scope	`kubectl run pwn --image=busybox --privileged` admits	Verify MG scope; wait for sync
10	Compliance lags reality	15-min add-on + 24h full scan cadence	Compare event time vs compliance time	Allow for eventual consistency

For org-specific rules beyond the built-ins (e.g. “all images must come from acme.azurecr.io”), author a custom constraint template + Rego and ship it as a custom policy definition — same assignment model, same fleet scope. If you treat policy definitions as source-controlled artifacts, the Azure Policy as Code pipeline pattern applies unchanged here.

5. Cluster connect: kubectl without inbound firewall changes

This is the feature that wins over on-prem teams. The clusterconnect-agent holds an outbound channel open; az connectedk8s proxy uses your Azure token to open a local proxy and writes a kubeconfig that targets it. No inbound port, no VPN, no bastion. First grant access. With Azure RBAC (enabled per cluster with az connectedk8s enable-features -n $CLUSTER_NAME -g $RESOURCE_GROUP --features azure-rbac), assign the user/group a built-in role at the cluster scope; no kubectl ClusterRoleBinding is required:

ARM_ID=$(az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query id -o tsv)
AAD_ID=$(az ad signed-in-user show --query id -o tsv)

# "Cluster User Role" grants the cluster-connect channel; "Viewer/Writer" grants in-cluster RBAC
az role assignment create --role "Azure Arc Enabled Kubernetes Cluster User Role" --assignee $AAD_ID --scope $ARM_ID
az role assignment create --role "Azure Arc Kubernetes Viewer" --assignee $AAD_ID --scope $ARM_ID

Arc Kubernetes built-in roles

Role	Grants	Use for
Azure Arc Enabled Kubernetes Cluster User Role	The cluster-connect channel (ability to open `proxy`)	Anyone who needs `kubectl` access at all
Azure Arc Kubernetes Viewer	Read-only in-cluster RBAC (no Secrets)	Read access across the fleet
Azure Arc Kubernetes Writer	Read/write most namespaced objects	Operators deploying via kubectl
Azure Arc Kubernetes Admin	Admin within namespaces (not cluster-scoped escalation)	Namespace owners
Azure Arc Kubernetes Cluster Admin	Full cluster-admin equivalent	Break-glass / platform owners

The Cluster User Role only opens the channel; it grants no in-cluster permissions. You must also assign a Viewer/Writer/Admin role for the request to do anything once impersonated. Granting one without the other is the classic “I can connect but everything is forbidden” mistake.
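The "connect but forbidden" mistake can be caught before anyone opens a proxy. The sketch below is a hypothetical helper (check_arc_roles is not part of the Azure CLI) that reads role names on stdin and reports which half of the access pair is missing. It assumes the input is one role name per line, as produced by a tsv query over roleDefinitionName.

```shell
# check_arc_roles: read role names (one per line) on stdin and report
# whether both halves of cluster-connect access are present.
check_arc_roles() {
  car_roles=$(cat)
  car_channel=no
  car_incluster=no
  if printf '%s\n' "$car_roles" | grep -qxF 'Azure Arc Enabled Kubernetes Cluster User Role'; then
    car_channel=yes
  fi
  if printf '%s\n' "$car_roles" | grep -qxE 'Azure Arc Kubernetes (Viewer|Writer|Admin|Cluster Admin)'; then
    car_incluster=yes
  fi
  case "$car_channel/$car_incluster" in
    yes/yes) echo "ok: channel and in-cluster role both assigned" ;;
    yes/no)  echo "missing: in-cluster role (every request will be forbidden)"; return 1 ;;
    no/yes)  echo "missing: Cluster User Role (proxy cannot open the channel)"; return 1 ;;
    *)       echo "missing: both roles"; return 1 ;;
  esac
}

# Usage (needs a logged-in az CLI; --include-inherited also picks up MG-level grants):
#   az role assignment list --assignee "$AAD_ID" --scope "$ARM_ID" --include-inherited \
#     --query "[].roleDefinitionName" -o tsv | check_arc_roles
```

Because the function only inspects text, it can run in a pipeline gate or a runbook without touching the cluster.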

Then open the proxy (it blocks the shell) and run kubectl from a second shell:

# Shell 1 - opens the proxy, blocks
az connectedk8s proxy -n $CLUSTER_NAME -g $RESOURCE_GROUP

# Shell 2 - normal kubectl, routed over the Arc channel
kubectl get pods -A

If you prefer native Kubernetes RBAC over Azure RBAC, create a service account, bind it to a Role or ClusterRole, and pass its token to the proxy command with --token $TOKEN. Either way, the request path is: your token → Azure Relay → clusterconnect-agent → kube-aad-proxy (Entra auth + user impersonation) → kube-apiserver. The impersonation step is why a fleet-wide Azure Arc Kubernetes Viewer role gives read-only kubectl on every cluster at once.
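A minimal sketch of the service-account-token path follows. It assumes kubectl 1.24 or later (for kubectl create token), a kubeconfig context that already has cluster-admin on the target cluster, and the CLUSTER_NAME and RESOURCE_GROUP variables used elsewhere in this walkthrough. The function name and the binding name are hypothetical.

```shell
# connect_with_sa_token NAMESPACE SERVICE_ACCOUNT CLUSTERROLE
# Creates a service account, binds it to an existing ClusterRole,
# mints a short-lived token, and opens the Arc proxy with that token.
connect_with_sa_token() {
  sa_ns=$1; sa_name=$2; sa_role=$3
  kubectl create serviceaccount "$sa_name" -n "$sa_ns" || return 1
  kubectl create clusterrolebinding "${sa_name}-binding" \
    --clusterrole "$sa_role" --serviceaccount "${sa_ns}:${sa_name}" || return 1
  # Short-lived token: once it expires, open a new proxy session with a fresh one
  sa_token=$(kubectl create token "$sa_name" -n "$sa_ns" --duration=1h) || return 1
  az connectedk8s proxy -n "$CLUSTER_NAME" -g "$RESOURCE_GROUP" --token "$sa_token"
}

# Usage: read-only CI access through the built-in "view" ClusterRole
#   connect_with_sa_token prod ci-reader view
```

Binding to view keeps the token read-only; swap in a narrower Role and RoleBinding if the caller only needs one namespace.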

Azure RBAC vs native Kubernetes RBAC for connect

Aspect	Azure RBAC	Native K8s RBAC
Where you grant	ARM role assignment (cluster/MG scope)	`RoleBinding`/`ClusterRoleBinding` in-cluster
Fleet-wide grant	One assignment at MG scope covers all	Per-cluster bindings
Identity	Entra users/groups/SPs	Service account token
Audit	Entra sign-in + Activity log	apiserver audit log
Proxy flag	(default)	`--token $TOKEN`
Best for	Centralized human access at scale	App/CI tokens, fine-grained in-cluster

Cluster connect failure modes

Symptom	Root cause	Confirm	Fix
`proxy` hangs / never binds	L7 proxy strips websockets to Relay	`az connectedk8s proxy -d`	Allow regional `*.servicebus.windows.net` with websockets
Connect OK, all `forbidden`	Only Cluster User Role assigned	`az role assignment list --scope $ARM_ID`	Add Viewer/Writer/Admin role
`Long running operation failed`	`clusterconnect-agent` down	`kubectl get pods -n azure-arc`	Restart agent; check egress
Token/auth error	Stale Azure CLI login	`az account show`	`az login` again
Works for you, not teammates	Their identity unassigned	Check their role assignments	Assign at group/MG scope
Intermittent drops	Relay/egress flapping	Firewall + agent logs	Stabilize egress; check proxy timeouts

6. Enable Azure Monitor Container Insights

Ship stdout/stderr logs, inventory, and container metrics from every Arc cluster into one Log Analytics workspace via the Microsoft.AzureMonitor.Containers extension. Use managed identity auth (amalogs.useAADAuth=true) so there is no workspace key sitting in the cluster:

WORKSPACE_ID="/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-fleet"

az k8s-extension create \
  --name azuremonitor-containers \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureMonitor.Containers \
  --configuration-settings \
      logAnalyticsWorkspaceResourceID=$WORKSPACE_ID \
      amalogs.useAADAuth=true

The extension deploys the ama-logs DaemonSet (every node) and the ama-logs-rs Deployment (cluster-level collection) into kube-system. To control ingestion cost on chatty clusters, scope collection to specific namespaces with dataCollectionSettings at install time:

az k8s-extension create \
  --name azuremonitor-containers \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureMonitor.Containers \
  --configuration-settings amalogs.useAADAuth=true \
      dataCollectionSettings='{"interval":"1m","namespaceFilteringMode":"Include","namespaces":["prod","ingress"],"enableContainerLogV2":true}'
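Hand-quoting the dataCollectionSettings JSON is an easy place to slip. As a small sketch, the hypothetical helper below (build_dcs is not an az command) assembles the same JSON shape from an interval, a filtering mode, and a namespace list. It assumes namespace names need no JSON escaping, which holds for valid Kubernetes names.

```shell
# build_dcs INTERVAL MODE [NAMESPACE...]
# Prints a dataCollectionSettings JSON string for --configuration-settings.
build_dcs() {
  dcs_interval=$1; dcs_mode=$2; shift 2
  dcs_list=""
  for dcs_ns in "$@"; do
    # Append each namespace as a quoted JSON string, comma-separated
    dcs_list="${dcs_list:+$dcs_list,}\"$dcs_ns\""
  done
  printf '{"interval":"%s","namespaceFilteringMode":"%s","namespaces":[%s],"enableContainerLogV2":true}\n' \
    "$dcs_interval" "$dcs_mode" "$dcs_list"
}

# Usage:
#   az k8s-extension create ... \
#     --configuration-settings amalogs.useAADAuth=true \
#         dataCollectionSettings="$(build_dcs 5m Include prod ingress)"
```

Keeping the namespace list as shell arguments lets a pipeline vary it per cluster without editing an embedded JSON literal.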

Container Insights configuration settings

Setting	Values	Default	Effect	Cost lever
`amalogs.useAADAuth`	`true`/`false`	`false`	Managed-identity auth (no workspace key)	— (security)
`logAnalyticsWorkspaceResourceID`	ARM ID	(auto)	Destination workspace	Consolidate to one
`dataCollectionSettings.interval`	`1m`–`30m`	`1m`	Metric scrape cadence	Higher = cheaper
`namespaceFilteringMode`	`Include`/`Exclude`/`Off`	`Off`	Which namespaces collect logs	Big lever
`namespaces`	list	—	Namespace allow/deny list	Trim noisy ns
`enableContainerLogV2`	`true`/`false`	varies	Richer schema, multi-line	Slightly more data
`streams`	list	all	Which tables to ingest	Drop unused streams

Key Container Insights tables (KQL)

Table	Holds	Typical query use
`ContainerLogV2`	stdout/stderr lines	Error mining across fleet
`KubePodInventory`	pod state, restarts	Crash/restart hunting
`KubeNodeInventory`	node status, conditions	NotReady nodes
`KubeEvents`	cluster events	OOMKilled, FailedScheduling
`InsightsMetrics`	container/node metrics	CPU/mem saturation
`ContainerInventory`	image, repo, ports	Image/registry audit

Once data lands, query the whole fleet from one workspace. Container logs carry the cluster identity, so a single KQL query slices across every onboarded cluster:

ContainerLogV2
| where TimeGenerated > ago(1h)
| where LogLevel in~ ("error","critical")
| summarize Errors = count() by Computer, ContainerName, _ResourceId
| sort by Errors desc

Note the migration: the legacy Helm-chart onboarding for the Container Insights agent is retired. On Arc, install via the Microsoft.AzureMonitor.Containers extension — that is the supported path and the one that participates in extension lifecycle/upgrades.

If you want metrics in Prometheus/Grafana rather than (or alongside) Log Analytics, the managed Prometheus/Grafana pattern in Azure Monitor: Managed Prometheus & Managed Grafana for AKS applies to Arc clusters via the metrics extension. For shaping ingestion with data collection rules, see Azure Monitor: Data Collection Rules, Workbooks & Alerting.

7. Workload identity and Key Vault secret access

Static secrets in manifests are the failure mode Arc lets you finally kill. The Azure Key Vault Secrets Provider extension (Microsoft.AzureKeyVaultSecretsProvider) installs the Secrets Store CSI Driver plus the Azure provider, so pods mount Key Vault secrets as files on tmpfs with no credential in the cluster:

az k8s-extension create \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type connectedClusters \
  --extension-type Microsoft.AzureKeyVaultSecretsProvider \
  --name akvsecretsprovider \
  --configuration-settings \
      secrets-store-csi-driver.enableSecretRotation=true \
      secrets-store-csi-driver.rotationPollInterval=2m \
      secrets-store-csi-driver.syncSecret.enabled=true

Key Vault CSI extension settings

Setting	Values	Default	When to change	Trade-off
`enableSecretRotation`	`true`/`false`	`false`	You rotate secrets	Polls vault; small overhead
`rotationPollInterval`	duration (e.g. `2m`)	`2m`	Faster/slower rotation pickup	Lower = more vault calls
`syncSecret.enabled`	`true`/`false`	`false`	Need a native K8s `Secret` for env vars	Env vars still need pod restart

For the auth itself, federate a user-assigned managed identity to a Kubernetes service account (workload identity) so the CSI provider exchanges the pod’s projected token for an Entra token — no client secret anywhere. A SecretProviderClass ties the service account to the vault:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-kv
  namespace: prod
spec:
  provider: azure
  parameters:
    clientID: "<USER_ASSIGNED_CLIENT_ID>"   # the federated UAMI
    keyvaultName: "kv-acme-prod"
    tenantId: "<TENANT_ID>"
    objects: |
      array:
        - |
          objectName: db-connection-string
          objectType: secret
  # Optional: project mounted secrets into a native K8s Secret for env vars
  secretObjects:
    - secretName: app-db
      type: Opaque
      data:
        - objectName: db-connection-string
          key: DB_CONN

Grant the UAMI Key Vault Secrets User on the vault via Azure RBAC, federate it to the service account’s OIDC subject, then any pod using that service account and mounting this SecretProviderClass reads the secret. Because the access is scoped per-service-account, you get least privilege and clean per-app audit instead of one node-wide credential.

Secret access patterns compared

Pattern	Credential in cluster?	Rotation	Audit granularity	Verdict
Hard-coded secret in manifest	Yes (in Git!)	Manual	None	Never
K8s `Secret` (base64)	Yes (etcd)	Manual	Per-Secret	Weak
Sealed/SOPS-encrypted in Git	Encrypted at rest	Re-encrypt	Per-Secret	OK for some
Service-principal CSI	SP client secret in-cluster	Vault-side	Per-app (if scoped)	Avoid the secret
Workload-identity CSI	No	Vault + poll	Per-service-account	Best

Workload identity federation mapping

Element	Value source	Notes
UAMI `clientID`	The federated user-assigned identity	Goes in `SecretProviderClass`
OIDC issuer	Cluster’s projected-token issuer URL	Must be reachable by Entra
Subject	`system:serviceaccount:<ns>:<sa>`	The federated credential subject
Vault RBAC	`Key Vault Secrets User` on the vault	Least-privilege data-plane role
Audience	`api://AzureADTokenExchange`	Standard WI audience
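The mapping above translates into two Azure-side steps: a federated credential on the identity and a data-plane role on the vault. The sketch below wires them together under some assumptions: the function name and the credential naming scheme are hypothetical, the issuer URL is passed in because how you obtain it depends on how the cluster exposes its OIDC issuer, and the user-assigned identity and vault already exist.

```shell
# federate_service_account UAMI_NAME UAMI_RG ISSUER_URL NAMESPACE SERVICE_ACCOUNT VAULT_ID
federate_service_account() {
  fsa_uami=$1; fsa_rg=$2; fsa_issuer=$3; fsa_ns=$4; fsa_sa=$5; fsa_vault_id=$6

  # 1) Trust tokens the cluster issues for exactly this service account
  az identity federated-credential create \
    --name "fic-${fsa_ns}-${fsa_sa}" \
    --identity-name "$fsa_uami" --resource-group "$fsa_rg" \
    --issuer "$fsa_issuer" \
    --subject "system:serviceaccount:${fsa_ns}:${fsa_sa}" \
    --audiences "api://AzureADTokenExchange" || return 1

  # 2) Least-privilege data-plane role on the vault
  fsa_principal=$(az identity show --name "$fsa_uami" --resource-group "$fsa_rg" \
    --query principalId -o tsv) || return 1
  az role assignment create --role "Key Vault Secrets User" \
    --assignee-object-id "$fsa_principal" --assignee-principal-type ServicePrincipal \
    --scope "$fsa_vault_id"
}

# Usage (all values are placeholders):
#   federate_service_account id-app-prod rg-identity "$ISSUER_URL" prod app-sa "$VAULT_ID"
```

One federated credential per namespace and service account pair keeps the audit trail per application, which matches the per-service-account scoping described earlier.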

Rotation caveat: enableSecretRotation=true refreshes the mounted file on the poll interval. Apps that read the file each request pick up new values automatically; apps that load secrets once at boot, or consume the synced Secret as env vars, still need a restart to see a rotated value. Env vars are snapshotted at pod start — the kernel cannot rewrite a running process’s environment.
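The difference between reading per request and loading at boot is easy to see in a few lines. This sketch assumes the CSI volume is mounted at /mnt/secrets-store (a common convention, but whatever mountPath your pod spec declares is what counts) and that the file is named after the objectName in the SecretProviderClass.

```shell
# Assumed mount path; override SECRETS_DIR to match the pod's volumeMount
SECRETS_DIR="${SECRETS_DIR:-/mnt/secrets-store}"

# Re-read the mounted file on every call, so a rotated value is seen
# after the next poll without restarting the process.
read_secret() {
  cat "$SECRETS_DIR/$1"
}

# Anti-pattern for comparison: a value captured once at startup never changes.
#   DB_CONN_AT_BOOT=$(read_secret db-connection-string)
```

A long-running script that calls read_secret db-connection-string each time it opens a connection picks up rotation on its own; one that stores the result in a variable at startup behaves like the env-var case and needs a restart.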

The same federation model underpins Azure Key Vault Workload Identity for Secrets and the AKS-flavoured Secrets Store CSI with Key Vault sync & rotation; the deep mechanics of federated credentials are in Entra Managed Identities: User-Assigned, FIC & RBAC. For rotation strategy across the vault itself, see Azure Key Vault Secret Rotation with Managed Identity.

8. Scale governance across many clusters

Onboarding one cluster is a demo. Governing forty is the job. Three primitives make Arc fleet-ready.

Management groups carry policy and RBAC. Place subscriptions (and therefore their connected clusters) under a management-group hierarchy and assign Policy initiatives + Arc Kubernetes roles at the MG level. A new cluster onboarded into any child subscription inherits the baseline the moment it appears — you do not touch it cluster-by-cluster.

Fleet primitive	What it gives you	Mechanism
Management-group inheritance	Policy + RBAC apply to all child clusters	Assign at MG, not per-cluster
Tags	Targeting, chargeback, inventory slicing	`--tags` on connect; ARG queries
GitOps-as-intent	New clusters self-bootstrap baseline	Bicep `fluxConfigurations`
Extension defaults	Consistent add-on versions	Pin via IaC; `--auto-upgrade` policy
Azure Resource Graph	Single-pane fleet inventory	`resources` queries

Tags drive targeting and chargeback. Tag connected clusters with environment, owner, and data-classification, then write policy assignments that key off tags or build Azure Resource Graph queries for fleet inventory:

// Every Arc cluster, its agent version, and connectivity health
resources
| where type == "microsoft.kubernetes/connectedclusters"
| project name, location,
          distribution = tostring(properties.distribution),
          k8sVersion   = tostring(properties.kubernetesVersion),
          connectivity = tostring(properties.connectivityStatus),
          agentVersion = tostring(properties.agentVersion),
          env = tostring(tags.environment)
| order by connectivity asc

Fleet inventory queries worth saving

Question	ARG `where` / `project` focus
Which clusters are Offline?	`connectivityStatus == "Offline"`
Agent version spread	`summarize count() by agentVersion`
Distribution mix	`summarize count() by distribution`
Untagged clusters	`isnull(tags.owner)`
Stale Kubernetes versions	`project kubernetesVersion` then sort
Clusters per management group	join to subscription/MG
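Saved queries are most useful when a script can run them. The sketch below wraps the "which clusters are Offline?" question in a function. It assumes the resource-graph CLI extension is installed (az extension add --name resource-graph); the function name and the projected owner column are illustrative.

```shell
# list_offline_clusters: print name, resource group, and owner tag of
# every Arc cluster whose connectivityStatus is Offline (tab-separated).
list_offline_clusters() {
  az graph query --first 1000 -q '
    resources
    | where type == "microsoft.kubernetes/connectedclusters"
    | where tostring(properties.connectivityStatus) == "Offline"
    | project name, resourceGroup, owner = tostring(tags.owner)
  ' --query "data[].[name, resourceGroup, owner]" -o tsv
}

# Usage: page the owner of anything that has dropped off
#   list_offline_clusters | while IFS="$(printf '\t')" read -r name rg owner; do
#     echo "cluster $name in $rg is Offline (owner: $owner)"
#   done
```

The explicit tostring() casts keep the projected columns as plain strings, which makes them safe to sort, group, or compare downstream.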

GitOps is the fleet rollout mechanism. Because the same az k8s-configuration flux create works across every connected cluster, codify it. The Bicep below registers the Flux config as ARM intent. Deploy it from the onboarding pipeline, or pair it with a Policy assignment that requires this configuration, and new clusters self-bootstrap their baseline:

resource fluxBaseline 'Microsoft.KubernetesConfiguration/fluxConfigurations@2023-05-01' = {
  name: 'fleet-baseline'
  scope: connectedCluster      // the Microsoft.Kubernetes/connectedClusters resource
  properties: {
    scope: 'cluster'
    namespace: 'cluster-config'
    sourceKind: 'GitRepository'
    gitRepository: {
      url: 'https://github.com/acme-platform/fleet-gitops'
      repositoryRef: { branch: 'main' }
    }
    kustomizations: {
      infra: { path: './infrastructure', prune: true }
      apps:  { path: './apps/prod', prune: true, dependsOn: ['infra'] }
    }
  }
}

The end state: a cluster joins the fleet, ARM applies the inherited Policy initiative (admission guardrails), the Flux config (desired state), the Monitor extension (telemetry), and the role assignments (kubectl access) — all without a human SSHing into the cluster.
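As a rough sketch of what that onboarding pipeline step might look like, the function below connects a cluster with tags and then deploys the Flux baseline. Everything project-specific is an assumption: flux-baseline.bicep is a hypothetical file wrapping the fluxConfigurations resource, clusterName is a hypothetical parameter on it, and the current kubeconfig context is assumed to point at the cluster being onboarded. Policy and RBAC are not applied here because they are inherited from the management group.

```shell
# onboard_cluster NAME RESOURCE_GROUP LOCATION ENVIRONMENT OWNER
onboard_cluster() {
  oc_name=$1; oc_rg=$2; oc_loc=$3; oc_env=$4; oc_owner=$5

  # Project the cluster into ARM, tagged for targeting and inventory
  az connectedk8s connect -n "$oc_name" -g "$oc_rg" -l "$oc_loc" \
    --tags "environment=$oc_env" "owner=$oc_owner" || return 1

  # Register the GitOps baseline as ARM intent (hypothetical template file)
  az deployment group create -g "$oc_rg" \
    --template-file flux-baseline.bicep \
    --parameters "clusterName=$oc_name"
}

# Usage:
#   onboard_cluster store-29 rg-retail-edge eastus prod retail-platform
```

Stopping at the first failure matters: deploying the Flux configuration against a cluster that never connected only produces a confusing second error.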

Architecture at a glance

Read the diagram left to right as the path that intent travels and telemetry returns. On the far left, the platform SRE and the Git repository are the sources of truth — humans issue az commands and assign Policy, while desired configuration lives as YAML on branch: main. That intent lands in the Azure control plane zone: a management group that carries Policy and RBAC down to every child cluster, the Azure Policy engine that compiles initiatives into Gatekeeper constraints, and the Log Analytics workspace that all clusters report into. Critically, nothing in this zone reaches into your network — it publishes intent to ARM and waits.

The Arc agents zone is the bridge, living in the azure-arc namespace inside your cluster and dialling outbound only. The clusterconnect agent (badge 1) holds the Azure Relay channel open over *.servicebus.windows.net:443 so kubectl works with no inbound port; the config + extension manager (badge 3) pulls Flux/Policy/Monitor intent and reconciles it; and kube-aad-proxy (badge 4) authenticates each kubectl caller with Entra and impersonates them against the apiserver. Finally, the hybrid cluster zone — EKS, GKE, or k3s — keeps its kube-apiserver exactly where it was, runs your workloads with prune=true GitOps, and mounts Key Vault secrets via the CSI driver (badge 5). The two return flows (badge 2 marks the Policy admission decision; the amber arrow carries inventory and logs back) close the loop: intent flows right, evidence flows left, and not one inbound firewall rule was opened.

Real-world scenario

A retail platform team ran 28 store-edge clusters (k3s on ruggedised hardware, one per regional distribution center) plus a GKE cluster for their loyalty service. Security mandated two things the existing setup could not deliver: a centrally enforced ban on privileged containers, and break-glass kubectl access for the on-call SRE without opening inbound ports on store networks — the stores sat behind carrier-grade NAT with no public ingress and a websocket-stripping Layer-7 proxy.

The constraint that bit them first was the proxy. Onboarding succeeded, Flux reconciled, Policy enforced — but az connectedk8s proxy hung, because cluster connect rides Azure Relay over *.servicebus.windows.net and the proxy silently dropped the websocket upgrade. The fix was an allow-rule for the resolved, regional Service Bus endpoints with websockets explicitly permitted, expanded from the wildcard via the guest-notification allowlist API:

# Run per store region; feed results into the proxy allowlist with websockets enabled
for region in eastus westus2 centralus; do
  curl -s "https://guestnotificationservice.azure.com/urls/allowlist?api-version=2020-01-01&location=$region"
done

With egress fixed, they assigned the Pod Security baseline initiative at the mg-retail-edge management group, in audit first. The audit results surfaced exactly the violator they expected: a legacy label-printer DaemonSet that ran privileged to access /dev. They refactored it to use a specific device plugin, then flipped the initiative to deny. New store clusters now onboard via a pipeline that runs az connectedk8s connect and inherit the deny policy and the Flux baseline automatically from the management group, with zero per-store configuration.
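The audit-then-deny flip can be scripted so that the only thing that changes between phases is the effect value. This is a sketch under assumptions: the assignment name psp-baseline is hypothetical, the initiative ID is the same one used in the hands-on lab, and re-running az policy assignment create with the same name and scope is assumed to overwrite the existing assignment's parameters.

```shell
# Pod security baseline initiative (same ID as in the hands-on lab)
BASELINE_INITIATIVE=a8640138-9b0a-4a28-b8cb-1666c838647d

# set_baseline_effect SCOPE EFFECT   (EFFECT: audit | deny | disabled)
set_baseline_effect() {
  sbe_scope=$1; sbe_effect=$2
  case "$sbe_effect" in
    audit|deny|disabled) ;;
    *) echo "unsupported effect: $sbe_effect" >&2; return 2 ;;
  esac
  az policy assignment create \
    --name psp-baseline \
    --policy-set-definition "$BASELINE_INITIATIVE" \
    --scope "$sbe_scope" \
    --params "{\"effect\":{\"value\":\"$sbe_effect\"},\"excludedNamespaces\":{\"value\":[\"kube-system\",\"gatekeeper-system\",\"azure-arc\"]}}"
}

# Usage:
#   MG=/providers/Microsoft.Management/managementGroups/mg-retail-edge
#   set_baseline_effect "$MG" audit    # phase 1: observe
#   set_baseline_effect "$MG" deny     # phase 2: enforce, after remediation
```

Keeping the namespace exclusions inside the function means the deny phase cannot accidentally drop them.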

Decision	What they chose	Why
Onboarding	Pipeline-driven `connect`	28 stores, no manual touch
Policy rollout	`audit` → fix → `deny`	Avoid breaking brownfield workloads
Access	Arc Cluster User Role + in-cluster role at MG scope	Any store, no inbound port
Egress fix	Regional Relay FQDNs + websockets	Cluster connect over CGNAT
Secrets	Workload identity + KV CSI	No keys on store hardware
Telemetry	Container Insights → one workspace	Fleet-wide error queries

On-call SREs hold Azure Arc Enabled Kubernetes Cluster User Role plus an in-cluster role (Viewer/Writer/Admin) at the MG scope, giving them az connectedk8s proxy into any store on earth without a single inbound firewall rule. The whole 28-cluster fleet went from “28 snowflakes” to “one policy, one Git repo, one identity boundary” in under a sprint. The lasting win was not any single control; it was that adding store #29 became a pipeline run, not a project.

Advantages and disadvantages

Advantages	Disadvantages
One control plane for hybrid/multi-cloud K8s	Management plane is eventually consistent (~15 min Policy)
Outbound-only — no inbound firewall holes	Hard dependency on egress to Azure FQDNs
Policy + RBAC inherit via management groups	Mis-scoped assignment can hit many clusters at once
GitOps identical across Arc and AKS	Flux/Gatekeeper add their own in-cluster footprint
Secretless Key Vault via workload identity	Federation setup is fiddly the first time
Fleet telemetry in one Log Analytics workspace	Ingestion cost grows with cluster/namespace count
New clusters self-bootstrap from MG inheritance	If Azure is unreachable, management pauses (data plane keeps running)
Works behind CGNAT / locked-down on-prem	Websocket-stripping proxies break cluster connect

When each matters: the outbound-only model is decisive for edge and regulated on-prem where inbound is simply not allowed. Management-group inheritance is the multiplier once you pass ~5 clusters — below that, the per-cluster effort is small and Arc’s value is mostly uniformity, not labour saved. The eventual-consistency caveat matters most for security expectations: do not assume a freshly assigned deny is enforced the instant you click save; budget ~15 minutes and verify with a negative test. The egress dependency is the thing that bites in practice — almost every painful Arc incident traces back to a firewall or proxy, not to Arc itself.

Hands-on lab

This lab onboards a local kind cluster (free, no cloud cost beyond minimal ARM/Log Analytics) and layers GitOps, Policy, and cluster connect on top. You need Azure CLI, kubectl, kind, Docker, and an Azure subscription.

# 0) Prereqs
az login
az extension add --name connectedk8s
az extension add --name k8s-configuration
az extension add --name k8s-extension

# 1) A throwaway local cluster
kind create cluster --name arc-lab
kubectl config use-context kind-arc-lab

# 2) Register providers (idempotent; wait for Registered)
for ns in Microsoft.Kubernetes Microsoft.KubernetesConfiguration Microsoft.ExtendedLocation Microsoft.PolicyInsights; do
  az provider register --namespace $ns
done
az provider show -n Microsoft.Kubernetes --query registrationState -o tsv   # -> Registered

# 3) Onboard
export RESOURCE_GROUP=rg-arc-lab LOCATION=eastus CLUSTER_NAME=kind-arc-lab
az group create -n $RESOURCE_GROUP -l $LOCATION -o table
az connectedk8s connect -n $CLUSTER_NAME -g $RESOURCE_GROUP -l $LOCATION

# Expected: connectivityStatus -> Connected; azure-arc pods Running
az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query connectivityStatus -o tsv
kubectl get pods -n azure-arc

# 4) GitOps against a public repo
az k8s-configuration flux create \
  --name lab-baseline -g $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME --cluster-type connectedClusters \
  --namespace cluster-config --scope cluster \
  --url https://github.com/Azure/gitops-flux2-kustomize-helm-mt \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true

# 5) Policy add-on + an audit-mode baseline at the SUBSCRIPTION scope for the lab
az k8s-extension create --cluster-type connectedClusters \
  --cluster-name $CLUSTER_NAME -g $RESOURCE_GROUP \
  --extension-type Microsoft.PolicyInsights --name azurepolicy

SUB=$(az account show --query id -o tsv)
az policy assignment create \
  --name psp-baseline-lab \
  --policy-set-definition a8640138-9b0a-4a28-b8cb-1666c838647d \
  --scope "/subscriptions/$SUB" \
  --params '{"effect":{"value":"audit"},"excludedNamespaces":{"value":["kube-system","gatekeeper-system","azure-arc"]}}'

# 6) Cluster connect — grant yourself, then proxy
ARM_ID=$(az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query id -o tsv)
ME=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --role "Azure Arc Enabled Kubernetes Cluster User Role" --assignee $ME --scope $ARM_ID
az role assignment create --role "Azure Arc Kubernetes Cluster Admin" --assignee $ME --scope $ARM_ID
# Shell 1: az connectedk8s proxy -n $CLUSTER_NAME -g $RESOURCE_GROUP
# Shell 2: kubectl get nodes      # routed over the Arc channel

# 7) TEARDOWN (avoid lingering cost)
az policy assignment delete --name psp-baseline-lab --scope "/subscriptions/$SUB"
az connectedk8s delete -n $CLUSTER_NAME -g $RESOURCE_GROUP --yes
az group delete -n $RESOURCE_GROUP --yes --no-wait
kind delete cluster --name arc-lab

Step	You should see	If you don’t
3 onboard	`Connected`; `azure-arc` pods Running	Check egress to MCR + ARM
4 GitOps	`complianceState: Compliant` after a minute	`az k8s-configuration flux show`; check the repo URL
5 Policy	`azurepolicy-*` constraints after ~15 min	`kubectl get constraints` empty → wait
6 connect	`kubectl get nodes` via proxy	Proxy hung → websocket egress
7 teardown	Resources gone	Re-run delete; `--no-wait` is async

Common mistakes & troubleshooting

These are the failure modes that actually generate tickets, in rough order of frequency. The first is responsible for more lost hours than the rest combined.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	Onboards fine, `az connectedk8s proxy` hangs	L7 proxy strips websocket upgrade to `*.servicebus.windows.net`	`az connectedk8s proxy -d` (debug); firewall logs	Allow resolved regional Service Bus FQDNs with websockets enabled
2	New `deny` policy pages you at 02:00	Flipped straight to `deny` on brownfield	Policy compliance → violating resources	Always `audit` → fix → `deny`
3	Arc agents themselves blocked by policy	System namespaces not excluded	`kubectl get constraints -o yaml`	Exclude `kube-system`,`gatekeeper-system`,`azure-arc`
4	“I can connect but everything is forbidden”	Only Cluster User Role assigned	`az role assignment list --scope $ARM_ID`	Add Viewer/Writer/Admin role too
5	Flux never reconciles	Private repo, missing/invalid auth	`az k8s-configuration flux show`; source-controller logs	Pass PAT/SSH; add `--known-hosts`
6	`prune` deletes more than expected	Wrong `path`/`--scope`, shared namespace	Inspect the Kustomization path	Narrow path; separate namespaces
7	Cluster shows `Offline`	Egress lost / MSI cert expired	`az connectedk8s show`; `clusteridentityoperator` logs	Restore egress to HIS endpoints; restart agent
8	Extension stuck `Creating`/`Failed`	`extension-manager` can’t pull / RBAC	`az k8s-extension show ... provisioningState`; pod logs	Fix egress/RBAC; delete + recreate
9	In-cluster service calls fail after connect	Service CIDR not in `--proxy-skip-range`	DNS/connectivity test in a pod	Re-connect with CIDR + `.svc` in skip-range
10	Secret rotated but app still old value	Env-var snapshot at pod start	Compare mounted file vs env var	Read file per request, or restart pods
11	Log Analytics bill spikes	Collecting all namespaces, V2 on everything	Usage by `_ResourceId`	Scope `dataCollectionSettings` namespaces
12	Negative policy test still admits privileged pod	Constraints not synced / wrong scope	`kubectl run pwn --image=nginx --privileged=true -n prod` admits	Verify MG scope; wait ~15 min
13	`connect` fails: Helm release exists	Stale prior onboarding	`helm list -n azure-arc`	`az connectedk8s delete` then re-connect
14	Resource Graph shows blank distribution/version	`cluster-metadata-operator` unhealthy	That agent’s logs	Restart agent; check egress

A fast negative test for Policy: kubectl run pwn --image=nginx --privileged=true -n prod should be denied by the Gatekeeper webhook once the baseline initiative is in deny mode. If it succeeds, your assignment scope or namespace exclusions are wrong, or the constraints have not synced yet.

# Verify each layer landed before you call a cluster "governed"
az connectedk8s show -n $CLUSTER_NAME -g $RESOURCE_GROUP --query connectivityStatus -o tsv   # -> Connected
kubectl get pods -n azure-arc          # all Running
az k8s-configuration flux show --name fleet-baseline -g $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME --cluster-type connectedClusters \
  --query "statuses[].complianceState" -o tsv
kubectl get constraints                 # azurepolicy-* present
kubectl get ds ama-logs -n kube-system  # Monitor agent shipping
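The negative test described above can also run as a gate without leaving a pod behind. The sketch below uses a server-side dry run, which sends the request through admission but does not persist the object. Two assumptions are worth stating: the Gatekeeper webhook is assumed to evaluate dry-run requests, and any rejection is treated as a policy denial, so a missing namespace or an RBAC error would also read as "denied". The function name is hypothetical.

```shell
# policy_denies_privileged NAMESPACE
# Returns 0 when the API server rejects a privileged pod, 1 when it admits it.
policy_denies_privileged() {
  if kubectl run pwn --image=nginx --privileged=true -n "$1" \
       --dry-run=server >/dev/null 2>&1; then
    echo "ADMITTED: privileged pod accepted in namespace $1"
    return 1
  fi
  echo "DENIED: privileged pod rejected in namespace $1"
}

# Usage: fail the pipeline if the deny baseline is not actually enforcing
#   policy_denies_privileged prod || exit 1
```

When the result is surprising, rerun the kubectl command by hand without the redirection to see the actual rejection message.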

Best practices

Treat egress as the dependency that decides everything. Pre-stage the FQDN allowlist (especially *.servicebus.windows.net with websockets) before onboarding, per region. Most Arc incidents are firewall incidents.
Always roll policy audit → remediate → deny. Never flip a brownfield fleet straight to deny. Watch compliance for a week first.
Exclude kube-system, gatekeeper-system, and azure-arc from every policy assignment. Forgetting this blocks Arc’s own agents.
Assign Policy and RBAC at the management-group scope, not per cluster. Inheritance is the whole point; per-cluster assignments do not scale and drift.
Set prune=true on every Kustomization. Without it, Git is not the source of truth and deletes silently leak.
Use dependsOn to land CRDs/ingress before workloads that need them, or reconciliation order will bite you on a fresh cluster.
Grant both the Cluster User Role and an in-cluster role (Viewer/Writer/Admin). One without the other is useless.
Use managed-identity auth everywhere — amalogs.useAADAuth=true for Monitor, workload identity for Key Vault. No keys in the cluster, ever.
Scope Container Insights ingestion with dataCollectionSettings on chatty clusters; default “collect everything” is a cost trap at fleet scale.
Codify the baseline as Bicep fluxConfigurations so new clusters self-bootstrap. Onboarding should be a pipeline run, not a runbook.
Tag every connected cluster (env, owner, data-classification) and keep saved Resource Graph queries for inventory and drift.
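Pin agent/extension versions in change-controlled fleets (--disable-auto-upgrade on connect for the agents; --auto-upgrade-minor-version false with an explicit --version for extensions) and roll upgrades through a ring, the same way you would for AKS day-two upgrades.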

Security notes

Arc’s security model is “outbound-only management plane + least-privilege identity,” and you should keep it that way deliberately.

Concern	Default / mechanism	Hardening action
Inbound exposure	None — agents dial out only	Do not add inbound rules “to make it work”; fix egress instead
Cluster→Azure identity	MSI cert (`clusteridentityoperator`)	Allow only HIS FQDNs; monitor cert renewal
Human `kubectl` access	Azure RBAC via Entra	Least-privilege roles; PIM for admin; group, not user, assignments
Admission guardrails	Gatekeeper via Policy	Enforce baseline/restricted at MG; deny privileged/hostPath
Secrets	Workload identity + KV CSI	No static secrets; per-service-account scope; rotation on
Workspace key in cluster	Avoided with `useAADAuth=true`	Never store the Log Analytics key in-cluster
Private Git creds	PAT/SSH stored as secret	Prefer SSH deploy keys or WI (`azblob`); rotate PATs
Egress trust	TLS to Azure	TLS-inspecting proxy → supply `--proxy-cert`; pin allowlist
Network segmentation	Per-region private endpoints (optional)	Use Arc Private Link to keep management traffic off the public internet
Audit	Entra sign-in + Activity log + apiserver audit	Centralize all three; alert on role-assignment changes

The identity boundary is the crown jewel: because human access is Entra-mediated and impersonated per request, you get one place (Entra + Activity log) to answer “who touched which cluster, when.” Protect the Arc Cluster Admin role with PIM and just-in-time elevation; a standing Cluster Admin at MG scope is a standing cluster-admin on every cluster in the fleet.

Cost & sizing

Arc-enabled Kubernetes has no per-cluster Arc fee for the core control plane (onboarding, cluster connect, GitOps, Policy). What you pay for is the value-added services that ride on top — chiefly log/metric ingestion and any Arc-enabled data/app services. Sizing is therefore mostly an observability-cost exercise plus a small in-cluster resource footprint for the agents and add-ons.

Cost driver	What it bills on	Rough figure	How to control
Arc K8s control plane	Onboarding/connect/GitOps/Policy	No core charge	—
Container Insights ingestion	GB ingested to Log Analytics	~₹230–290 / USD 2.76 per GB (pay-as-you-go)	Scope namespaces; `interval`; commitment tiers
Log Analytics retention	GB-month beyond free 31 days	Per-GB-month	Shorten retention; archive tier
Managed Prometheus metrics	Metric samples ingested	Per-sample pricing	Scrape interval; drop unused series
Arc-enabled SQL/data services	vCore/usage of the data service	Service-specific	Right-size the data workload
In-cluster agent footprint	Node CPU/mem for agents + add-ons	~0.5–1 vCPU + ~1–2 GB cluster-wide	Don’t run add-ons you don’t use
Egress/proxy infra	Your firewall/proxy capacity	Your existing infra	Allowlist precisely; no new ingress

Right-sizing the in-cluster footprint

Component	Approx. resource ask	Notes
Arc agents (`azure-arc`)	Modest; a handful of small pods	Always present
Flux controllers	CPU on reconcile spikes	Scales with repo size + interval
Gatekeeper	Scales with constraint count	Trim constraints; set limits
`ama-logs` DaemonSet	Per-node; scales with log volume	Biggest variable; scope namespaces

Practical guidance: on a 28-cluster edge fleet, the dominant line item is almost always Container Insights ingestion, not anything Arc-specific. Turn on namespaceFilteringMode: Include for just prod/ingress, raise the metric interval to 5m where 1-minute resolution is not needed, and move long-tail logs to a cheaper retention/archive tier. On the free side, Log Analytics includes a small free ingestion allowance (5 GB per billing account per month at the time of writing) and 31 days of retention at no charge, which is enough to validate the pipeline in the lab above without a meaningful bill. (INR figures approximate at ~₹84/USD and vary by region and commitment tier; treat them as order-of-magnitude.)
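To make the ingestion arithmetic concrete, here is a back-of-the-envelope estimator. The per-cluster GB/day figures in the usage lines are purely illustrative, not measurements; the price is the pay-as-you-go figure from the table above. The model is simply clusters × GB per cluster per day × 30 days × price per GB, and it ignores the free allowance and commitment tiers.

```shell
# estimate_monthly_ingestion CLUSTERS GB_PER_CLUSTER_PER_DAY PRICE_PER_GB
# Prints an approximate monthly Log Analytics ingestion cost (30-day month).
estimate_monthly_ingestion() {
  awk -v c="$1" -v g="$2" -v p="$3" \
    'BEGIN { printf "%.2f\n", c * g * 30 * p }'
}

# Usage (illustrative volumes, USD):
#   estimate_monthly_ingestion 28 2 2.76     # all namespaces      -> 4636.80
#   estimate_monthly_ingestion 28 0.5 2.76   # prod/ingress only   -> 1159.20
```

Even with made-up volumes, the shape of the result is the point: namespace scoping changes the bill roughly in proportion to the data it drops.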

Interview & exam questions

1. What does Arc-enabled Kubernetes actually add to a cluster, and what does it not touch? It installs a Helm release of outbound-only agents in the azure-arc namespace and creates a connectedClusters ARM resource; it adds Policy/GitOps/Monitor/Key Vault via cluster extensions. It does not change your control plane, scheduler, nodes, or the data path of your workloads — Arc reconciles intent and brokers kubectl, nothing more.

2. Why is *.servicebus.windows.net special in the egress allowlist? Cluster connect rides Azure Relay over that endpoint using websockets. A Layer-7 proxy that blocks websocket upgrades will let onboarding succeed but break kubectl-over-Arc, which is a notoriously confusing failure. You must allow the resolved regional FQDNs with websockets enabled.

3. Why roll Azure Policy out in audit before deny? deny causes Gatekeeper to reject non-compliant admissions, so flipping straight to deny on a brownfield cluster rejects existing Deployments on their next rollout. audit surfaces violators without blocking, letting you remediate first, then promote to deny safely.

4. Which namespaces must you exclude from policy assignments, and why? kube-system, gatekeeper-system, and azure-arc. They run system and Arc agent workloads that may legitimately need elevated settings; failing to exclude them can block Arc’s own agents and brick management.

5. Explain the cluster-connect request path. Your Azure token → Azure Relay → clusterconnect-agent → kube-aad-proxy (Entra authN + user impersonation) → kube-apiserver. Impersonation is why a fleet-wide Viewer role grants read-only kubectl on every cluster at once.

6. Cluster User Role is assigned but everything returns forbidden. Why? The Cluster User Role only opens the connect channel; it grants no in-cluster permissions. You must also assign an in-cluster role (Viewer/Writer/Admin) for the impersonated request to do anything.

7. What does prune=true change, and why is it non-negotiable? With prune=true, deleting a manifest from Git causes Flux to garbage-collect the corresponding object from the cluster. Without it, deletions never propagate, so Git stops being the authoritative source of truth.

8. How do you give every new cluster the baseline automatically? Assign Policy initiatives and Arc Kubernetes roles at a management group, and register the Flux config as Bicep fluxConfigurations. A cluster onboarded into any child subscription inherits the policy, GitOps config, and access without per-cluster work.

9. How do you read a Key Vault secret with no credential in the cluster? Install the Key Vault Secrets Provider extension, federate a user-assigned managed identity to a Kubernetes service account (workload identity), and bind a SecretProviderClass. The CSI provider exchanges the pod’s projected token for an Entra token — no client secret anywhere.

10. A secret was rotated but the app still uses the old value. Why, and what fixes it? Rotation refreshes the mounted file on tmpfs, but environment variables are snapshotted at pod start and the synced Secret-as-env path is static. Apps that read the file per request pick up changes; apps that load at boot or use env vars need a pod restart.

11. How does Arc Kubernetes differ from AKS for these controls? The controls are nearly identical — same az k8s-configuration flux, same Policy initiatives, same extensions — but Arc uses --cluster-type connectedClusters and runs on non-Azure clusters with an outbound-only agent set, while AKS uses managedClusters and is already in Azure. The symmetry is intentional.

12. Which certifications does this material map to? AZ-305 (designing governance/hybrid) and AZ-104 (Arc, Policy, RBAC), with Kubernetes depth overlapping CKA/CKS for the in-cluster admission and RBAC mechanics.
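
The rotation behaviour in question 10 is easy to reproduce without a cluster. The sketch below is an illustration under stated assumptions, not the CSI driver itself: a temporary directory stands in for the tmpfs mount, the file name db-password is hypothetical, and overwriting the file stands in for the driver refreshing it. It contrasts an app that snapshots the secret at boot with one that re-reads the mounted path on every request.

```python
import tempfile
from pathlib import Path


class BootTimeConfig:
    """Reads the secret once at construction, like an env var or load-at-boot app."""

    def __init__(self, secret_path):
        self._value = Path(secret_path).read_text().strip()

    def current(self):
        # Always returns the value captured at startup.
        return self._value


class PerRequestConfig:
    """Re-opens the mounted file by path on every call, so rotations show up."""

    def __init__(self, secret_path):
        self._path = Path(secret_path)

    def current(self):
        return self._path.read_text().strip()


def demo():
    # The temporary directory stands in for the CSI tmpfs mount.
    with tempfile.TemporaryDirectory() as mount:
        secret_file = Path(mount) / "db-password"  # hypothetical secret name
        secret_file.write_text("old-value\n")

        boot = BootTimeConfig(secret_file)
        live = PerRequestConfig(secret_file)

        # Simulate a rotation refreshing the mounted file.
        secret_file.write_text("new-value\n")

        return boot.current(), live.current()


if __name__ == "__main__":
    print(demo())  # ('old-value', 'new-value')
```

Re-opening by path matters: an app that keeps one file handle open across a rotation may keep seeing the old content, which puts it in the same position as the boot-time reader and leaves a pod restart as the fix.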

Quick check

1. What single egress endpoint, if proxied without websockets, lets onboarding succeed but breaks kubectl-over-Arc?
2. You assigned the Arc Cluster User Role but every kubectl command is forbidden. What did you forget?
3. Name the three namespaces you must exclude from a fleet policy assignment.
4. What does prune=true do when you delete a manifest from Git?
5. Why does a freshly assigned deny policy sometimes still admit a privileged pod for a few minutes?

Answers

1. *.servicebus.windows.net. Cluster connect rides Azure Relay over websockets there, so allow the resolved regional FQDNs with websockets enabled.
2. An in-cluster role. The Cluster User Role only opens the connect channel; you must also assign Viewer/Writer/Admin for the impersonated request to have permissions.
3. kube-system, gatekeeper-system, and azure-arc. Excluding them keeps the policy from blocking system and Arc agent workloads.
4. Flux garbage-collects the corresponding object from the cluster, keeping Git as the source of truth.
5. The Policy add-on syncs assignments roughly every 15 minutes and writes the azurepolicy-* Gatekeeper constraints on that cadence; until they land (or if the scope is wrong), admission is not yet enforced.
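
The prune=true answer above comes down to a set difference. The sketch below is a toy model, not Flux's implementation: the function name and the object IDs are hypothetical, and it assumes a Kustomization remembers which objects it applied on the previous pass. An object is deleted only when it was applied before, is no longer in Git, and pruning is on.

```python
def reconcile(git_objects, applied_inventory, prune):
    """Return (to_apply, to_delete) for one reconcile pass of a toy Kustomization.

    git_objects: IDs of the objects currently rendered from Git.
    applied_inventory: IDs this Kustomization applied on its previous pass.
    prune: whether objects removed from Git should be garbage-collected.
    """
    desired = set(git_objects)
    # Stale objects were applied earlier but have since been removed from Git.
    stale = set(applied_inventory) - desired
    to_delete = stale if prune else set()
    return desired, to_delete


if __name__ == "__main__":
    inventory = {"Deployment/web", "ConfigMap/legacy"}  # illustrative IDs
    git = {"Deployment/web"}                            # ConfigMap/legacy was deleted in Git

    print(reconcile(git, inventory, prune=True)[1])   # {'ConfigMap/legacy'}
    print(reconcile(git, inventory, prune=False)[1])  # set()
```

With prune off, ConfigMap/legacy stays in the cluster indefinitely even though no commit contains it, which is the drift the answer describes. Objects that were never in the inventory are not touched in either mode in this model.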

Glossary

Term	Definition
Connected cluster	The `Microsoft.Kubernetes/connectedClusters` ARM resource projecting a non-Azure cluster into Azure.
Arc agents	The outbound-only Helm release in the `azure-arc` namespace that maintains the channel and reconciles intent.
Cluster extension	A managed add-on (Flux, Policy, Monitor, Key Vault) installed and lifecycled via `Microsoft.KubernetesConfiguration`.
`microsoft.flux`	The Flux v2 GitOps cluster extension delivering source/kustomize/helm controllers.
`fluxConfigurations`	The ARM resource describing a Git source + Kustomizations that `config-agent` applies.
Kustomization	A Flux unit that applies a path from a source, with `prune`, `dependsOn`, and intervals.
Gatekeeper	The OPA admission webhook (v3) that Azure Policy uses to enforce in-cluster constraints.
Constraint	The in-cluster object (`azurepolicy-*`) Gatekeeper enforces, generated from a Policy assignment.
Cluster connect	The outbound channel that lets `az connectedk8s proxy` provide `kubectl` with no inbound port.
`kube-aad-proxy`	The in-cluster shim that performs Entra authN and impersonates the user against the apiserver.
Container Insights	The `Microsoft.AzureMonitor.Containers` extension shipping logs/metrics/inventory to Log Analytics.
Workload identity	A federated user-assigned managed identity bound to a Kubernetes service account for secretless Azure access.
`SecretProviderClass`	The Secrets Store CSI object that names the vault, the identity to use, and the secret list to mount on tmpfs; the pod's service account supplies the federated token.
Management group	An ARM scope above subscriptions through which Policy and RBAC inherit to child clusters.
Azure Relay	The Azure service (over `*.servicebus.windows.net`) that brokers the cluster-connect websocket channel.

Next steps

Azure Policy at Scale: Governance with Management Groups & Initiatives — go deeper on the policy engine you assigned here, fleet-wide.
Flux CD GitOps: Monorepo, Kustomize & Multi-Tenancy — structure the Git repo your Arc clusters reconcile from.
Azure Key Vault Workload Identity for Secrets — the federation model behind secretless secret access.
Azure Arc-Enabled Servers: Machine Configuration & Extended Security Updates — the VM sibling that completes your hybrid Arc estate.
AKS Day-Two: Upgrades & Fleet Operations — apply the same fleet discipline to managed Azure clusters.