A fleet of 600 Windows and Linux servers spread across two colo datacenters, an AWS account, and a GCP project is not a “we’ll get to it” problem. The moment your security team asks “which of these is missing a CIS baseline, which still runs an out-of-support OS, and who can touch them,” you need one control plane. Azure Arc-enabled servers projects each machine into Azure Resource Manager as a Microsoft.HybridCompute/machines resource. From there, the same management groups, Azure Policy assignments, RBAC, and Resource Graph queries you use for native Azure VMs reach into your hybrid estate — without lifting a single workload.
The pain this kills is governance fragmentation. Every server outside Azure today lives in a different tool: SCCM here, Ansible there, a spreadsheet of “boxes we should really patch.” There is no single answer to “is this fleet compliant,” no single RBAC plane, no single audit trail. Arc collapses that into ARM: one inventory, one policy engine, one identity model, one query language. The agent is read-mostly and outbound-only, so the security review is tractable; the per-machine cost for the management plane is zero (you pay only for the value-add services — Defender, Update Manager extras, ESU — you actually consume).
This walkthrough does the work a platform team actually has to do: onboard at scale non-interactively, manage extensions, enforce in-guest configuration with Machine Configuration (formerly Guest Configuration), report compliance through Azure Policy, deliver Extended Security Updates (ESU) for Windows Server 2012/2012 R2 through Arc, lock agent traffic behind a Private Link Scope, and scope RBAC so a stolen credential’s blast radius stays small. By the end you will be able to onboard a machine, prove it compliant, and explain every byte the agent puts on the wire. I assume Owner on the subscription and root/administrator on the machines.
What problem this solves
Hybrid estates rot in the dark. A server in a colo that nobody onboarded to any management plane is invisible: it does not appear in your secure-score, it does not get patched on a schedule, its drift from the CIS baseline is unknown, and the list of who can RDP to it lives in someone’s head. When the auditor or the breach forces the question, you have no answer and no tooling to get one fast. Multiply by 600 machines across three clouds and the gap is not an inconvenience — it is the finding that fails the audit.
What breaks without Arc: you run N management tools for N environments, each with its own identity model and its own blind spots. Patching is manual or scripted per-site, so a critical CVE sits unpatched on the boxes nobody remembered. Compliance is a quarterly fire drill of screenshots instead of a live Resource Graph query. Secrets sit on disk because there is no managed identity to authenticate scripts. And legacy Windows Server 2012/2012 R2 hosts — the ones you cannot migrate for 18 months — run with no security updates at all because ESU outside Azure requires a delivery channel you do not have.
Who hits this: every organization with servers outside Azure that still need Azure-grade governance — regulated enterprises (PCI, HIPAA), companies mid-migration with a long on-prem tail, and multicloud shops who refuse to run three separate governance stacks. The teams who feel it most acutely are the platform/SRE function asked to produce one compliance dashboard, and the security function that needs RBAC and audit to reach the boxes ARM cannot otherwise see.
To frame the whole field before the deep dive, here is every capability Arc adds, the on-prem pain it replaces, and where in this article it lives:
| Capability | What it gives you | Pain it replaces | Covered in |
|---|---|---|---|
| Inventory in ARM | Each server a HybridCompute/machines resource | Spreadsheets, drift, “what do we even own” | Core concepts |
| Managed identity | Per-machine MI via local IMDS, no stored secrets | Service-account passwords on disk | Core concepts, §Agent |
| Machine Configuration | DSC-style in-guest audit + remediation | Manual hardening, no drift detection | §Machine Configuration |
| Azure Policy at scale | Assign baselines at management-group scope | Per-site scripts, no central reporting | §Policy & compliance |
| Extensions | AMA, Custom Script, dependency agent like a VM | Per-tool agent sprawl | §Extensions |
| Extended Security Updates | Patches for WS2012/2012 R2 off-Azure | Unpatched out-of-support OS | §ESU |
| Private Link | Agent data plane over a private endpoint | Agent telemetry on the public internet | §Private Link |
| RBAC + audit | Purpose-built roles, ARM activity log | Credentials in someone’s head | §RBAC |
Learning objectives
By the end of this article you can:
- Explain the Connected Machine agent architecture (its services, its outbound-only 443 connectivity, the local IMDS endpoint, and the three connectivity modes) and validate reachability with azcmagent check before onboarding.
- Onboard hundreds of servers non-interactively with a least-privilege service principal, keep the secret off the command line and out of logs, and handle golden-image cloning without hostname collisions.
- Author, sign, and publish a Machine Configuration package, assign it through Azure Policy, and choose correctly between Audit, ApplyAndMonitor, and ApplyAndAutoCorrect.
- Drive Azure Policy guest assignments at management-group scope, run remediation tasks against a pre-existing fleet, and answer “what’s non-compliant” with one Resource Graph query.
- Provision and link Extended Security Updates licenses for Windows Server 2012/2012 R2 with the correct Type/Edition/Processors, manage them as code, and stop billing cleanly on decommission.
- Stand up an Azure Arc Private Link Scope with the right private DNS zones, and identify the two endpoints (Entra ID, ARM) that never traverse the scope.
- Scope RBAC with purpose-built Arc roles instead of Contributor, layer deny policy guardrails on extensions and tags, and reason about blast radius.
- Run a symptom → cause → confirm → fix playbook for the agent failures you will actually hit in production.
Prerequisites & where this fits
You should be comfortable with the Azure governance fundamentals: how management groups, subscriptions, and resource groups nest (see Azure Resource Hierarchy Explained), how Azure Policy definitions, initiatives, assignments, and effects work (see Azure Policy: Governance at Scale), and how RBAC role assignments scope down (see Azure Entra RBAC Governance Deep Dive). You should be able to run az in Cloud Shell, read JSON output, and write a small Bicep template. Familiarity with managed identities (Entra Managed Identities Deep Dive) and Private Link / private DNS (Azure Private Link & Private DNS for PaaS) will make the networking sections land faster.
This sits in the Hybrid & Governance track. Arc-enabled servers is the foundation; its siblings extend the same control plane to other resource types — Azure Arc-Enabled Kubernetes: GitOps, Policy & Fleet Management does for clusters what this article does for VMs. Downstream of onboarding sit the value-add services: Azure Update Manager: Maintenance Configurations & Patch Orchestration with Arc for fleet patching, Defender for Servers for threat protection, and the Azure Monitor data-collection pipeline for telemetry. In a full platform this all lands inside an Enterprise-Scale Landing Zone management-group hierarchy.
A quick map of who owns what during a rollout, so you assign the work correctly:
| Layer | What lives here | Who usually owns it | What goes wrong if neglected |
|---|---|---|---|
| Network egress | Firewall/NSG rules, proxy, Private Link | Network team | Onboarding hangs; agent can’t reach Azure |
| Identity | Onboarding SPN, machine MI, RBAC | Identity team | Over-privileged SPN; stolen-secret blast radius |
| Agent lifecycle | Install, connect, upgrade, agent version | Platform/Ops | Stale agents; missed CVE fixes in the agent itself |
| Policy & config | Initiatives, Machine Config packages, remediation | Governance team | “0% compliant”; drift undetected |
| Patching | Update Manager, maintenance configs, ESU | Platform/Ops | Unpatched fleet; ESU billing surprises |
| Audit & reporting | Resource Graph, Log Analytics, diagnostics | Security/Audit | No evidence trail for the auditor |
Core concepts
Five mental models make every later decision obvious.
Arc projects a server into ARM as a resource you can govern. A connected machine becomes a Microsoft.HybridCompute/machines resource with the same primitives as a native VM: tags, RBAC at the resource scope, policy assignments inherited from its resource group / subscription / management group, and a row in Resource Graph. Nothing about the workload changes — the OS, the apps, the network all stay put. What changes is that Azure’s governance plane now reaches the box. The mental shift: Arc is not a migration and not an agent that runs your workload; it is a control-plane projection.
The agent is outbound-only and identity-bearing. The Connected Machine agent is one package: the azcmagent CLI plus three services (the Hybrid Instance Metadata Service, the GuestConfig service, and the extension manager). The identity service (HIMDS) runs under a dedicated low-privilege account, while the GuestConfig and extension services run as LocalSystem/root because they install software and change OS state. All of them reach Azure over outbound HTTPS (443) only and never open an inbound port. On connect, the machine receives a system-assigned managed identity backed by a certificate the agent rotates; in-guest tooling reads tokens from a local IMDS endpoint at http://localhost:40342, so scripts authenticate with no stored secret.
Machine Configuration is the in-guest enforcement engine, and it is in-box. Azure Policy alone sees ARM-level properties; it cannot see a registry value or a sysctl setting inside the OS. Machine Configuration runs DSC-style packages inside the guest to assert that state — registry keys, file contents, installed packages, service states, secedit/sysctl. Critically, on Arc servers the engine ships in the agent — you do not deploy a separate ConfigurationforWindows/ConfigurationforLinux extension as you do on Azure VMs. That removes a whole DeployIfNotExists prerequisite from your design.
Connectivity has three modes, and two endpoints always go public. The agent connects in direct mode (outbound to public endpoints), proxy mode (the same public endpoints through an HTTP proxy), or Private Link mode (the his and guestconfiguration data planes over a private endpoint). The trap that sinks rollouts: Entra ID (login.microsoftonline.com) and Azure Resource Manager (management.azure.com) traffic always use public endpoints, even with Private Link. Block them and the agent cannot authenticate, no matter how healthy the private endpoint is.
Governance reaches in: policy, ESU, and patching all ride the projection. Once a machine is in ARM, the value flows: assign baselines via policy at management-group scope and they cascade; deliver Extended Security Updates to off-Azure WS2012/2012 R2 boxes through a license resource you link to the machine; orchestrate patches across the whole fleet with Update Manager; reach the box with az ssh arc through ARM with no inbound port. The single projection is the foundation every other capability stands on.
The vocabulary in one table
Pin down every moving part before the deep sections. The glossary at the end repeats these for lookup; this table is the model side by side:
| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Arc-enabled server | An off-Azure machine projected into ARM | Microsoft.HybridCompute/machines | The unit you govern |
| azcmagent | The Connected Machine agent CLI | On the server | Connect, check, config, show |
| Hybrid IMDS | Local metadata + token endpoint | localhost:40342 | Secretless in-guest auth |
| System-assigned MI | Per-machine identity, cert-backed | The machine resource | RBAC for in-guest scripts |
| Machine Configuration | In-guest DSC audit/remediation engine | In-box in the agent | Sees registry/file/service state |
| Guest assignment | A Machine Config package bound to a machine | guestConfigurationAssignments | Carries compliance status |
| assignmentType | Audit vs Apply behaviour of an assignment | On the guest assignment | Read-only vs self-healing |
| ESU license | A first-class ARM resource for WS2012 patches | HybridCompute/licenses | Funds the off-Azure patch channel |
| License profile | Links an ESU license to a machine | machines/licenseProfiles/default | Activates ESU on that box |
| Private Link Scope | Routes agent data plane privately | HybridCompute/privateLinkScopes | Keeps telemetry off the internet |
| Connectivity mode | direct / proxy / Private Link | Agent config | Shapes firewall design |
| Onboarding SPN | Least-privilege identity that connects machines | Entra app registration | Blast-radius control |
1. The Connected Machine agent: architecture and connectivity modes
The agent is a single package: the azcmagent CLI plus services (the Hybrid Instance Metadata Service, the GuestConfig service, the extension manager). HIMDS runs under a low-privilege service account; the GuestConfig and extension services run as LocalSystem/root because they change OS state. The agent polls Azure over outbound HTTPS (443) only and never opens an inbound port. Three facts shape every design decision:
- Identity. On connect, the machine gets a system-assigned managed identity backed by a certificate the agent rotates. In-guest tooling reads tokens from a local IMDS endpoint (http://localhost:40342), so scripts authenticate with no stored secret.
- Machine Configuration is built in. Unlike Azure VMs, Arc servers do not need the ConfigurationforWindows/ConfigurationforLinux extension deployed separately; the agent ships it in-box, removing a whole DeployIfNotExists step from your design.
- Connectivity modes: direct (outbound to public endpoints), proxy (the same endpoints through an HTTP proxy), and Private Link (his/guestconfiguration data plane over a private endpoint). Entra ID and Resource Manager traffic always use public endpoints even with Private Link, so plan firewall rules accordingly.
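To make the secretless-identity point concrete, here is a minimal bash sketch of the Hybrid IMDS challenge-response flow on a Linux Arc server: the first token request is refused with a WWW-Authenticate header naming a local key file, and only a caller privileged enough to read that file can complete the second request. The api-version and ARM resource URI follow the pattern Microsoft documents for Arc; treat them, and the helper names challenge_path and get_arm_token, as assumptions to verify against your agent version.

```bash
# Token request for ARM via the local Hybrid IMDS endpoint (assumed api-version).
IMDS_TOKEN_URL="http://localhost:40342/metadata/identity/oauth2/token?api-version=2019-11-01&resource=https%3A%2F%2Fmanagement.azure.com"

# Hypothetical helper: read HTTP response headers on stdin and print the
# challenge-file path carried in "WWW-Authenticate: Basic realm=<path>".
challenge_path() {
  grep -i '^www-authenticate:' | sed -e 's/.*realm=//' | tr -d '\r'
}

# Hypothetical helper: run as root (or a member of the himds group).
get_arm_token() {
  local path
  # Call 1 is expected to return 401 and name the challenge file.
  path=$(curl -s -D - -o /dev/null -H 'Metadata: true' "$IMDS_TOKEN_URL" | challenge_path)
  if [ ! -r "$path" ]; then
    echo "cannot read challenge file '$path' (run as root?)" >&2
    return 1
  fi
  # Call 2 proves local privilege by sending the file contents back.
  curl -s -H 'Metadata: true' -H "Authorization: Basic $(cat "$path")" "$IMDS_TOKEN_URL"
}
```

On a connected machine, get_arm_token prints the JSON token response; its access_token field is a bearer token for ARM calls, with no secret ever stored on disk by you.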
The agent’s moving parts, what each does, and what it talks to:
| Component | Role | Talks to | If it’s unhealthy you see |
|---|---|---|---|
| azcmagent (CLI) | Connect/disconnect, config, check, show | Local services | Can’t run lifecycle commands |
| Hybrid Instance Metadata Service (HIMDS) | Identity + metadata, token broker | localhost:40342, Entra ID | Scripts can’t get a token |
| GuestConfig service | Runs Machine Config packages in-guest | *.guestconfiguration.azure.com | Compliance never evaluates |
| Extension manager | Installs/updates extensions (AMA, etc.) | *.his.arc.azure.com, download CDN | Extensions stuck “Creating” |
| Auto-upgrade service | Self-updates the agent | Download CDN | Agent drifts to stale versions |
The three connectivity modes, side by side — pick per security posture:
| Mode | Data-plane path | Setup | When to use | Limitation |
|---|---|---|---|---|
| Direct | Public endpoints over 443 | None (default) | Simplest; non-regulated estates | Telemetry traverses the internet |
| Proxy | Public endpoints via HTTP proxy | azcmagent config set proxy.url | Egress only allowed via proxy | Proxy must allow the FQDN set |
| Private Link | his + guestconfiguration over private endpoint | Private Link Scope + DNS | Regulated; no public agent traffic | Entra ID + ARM still go public |
Every endpoint the agent needs, the mode that uses it, and what breaks if you block it:
| Endpoint (FQDN) | Purpose | Direct | Private Link | Block it and… |
|---|---|---|---|---|
| login.microsoftonline.com | Entra ID auth (token) | Public | Public | Agent can’t authenticate |
| pas.windows.net | Entra ID (PoP) | Public | Public | Auth flows fail |
| management.azure.com | Azure Resource Manager | Public | Public | Connect/heartbeat fails |
| *.his.arc.azure.com | Hybrid identity + metadata data plane | Public | Private | Heartbeat/MI breaks |
| *.guestconfiguration.azure.com | Machine Config data plane | Public | Private | Compliance never runs |
| *.guestnotificationservice.azure.com | Notifications (SSH, Run Command) | Public | Public | az ssh arc / Run Command fail |
| Download CDN (aka.ms, download.microsoft.com) | Agent + extension binaries | Public | Public | Install/upgrade fails |
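When you hand firewall requests to the network team, it helps to generate the public allow-list per mode from the table rather than retype it. The sketch below encodes only the rows of that table (the function name arc_public_fqdns is hypothetical); it is not an exhaustive list, so reconcile it with Microsoft’s published network requirements and with azcmagent check before opening rules.

```bash
# Hypothetical helper: print the FQDNs that need PUBLIC egress for a mode.
# MODE is one of: direct | proxy | privatelink. Encodes the table rows only.
arc_public_fqdns() {
  # Always public, even with Private Link (quoted so '*' is not globbed).
  local always=(login.microsoftonline.com pas.windows.net management.azure.com
                '*.guestnotificationservice.azure.com' aka.ms download.microsoft.com)
  # Data planes that move to the private endpoint under Private Link.
  local dataplane=('*.his.arc.azure.com' '*.guestconfiguration.azure.com')
  case "$1" in
    direct|proxy) printf '%s\n' "${always[@]}" "${dataplane[@]}" ;;  # proxy must allow all of these
    privatelink)  printf '%s\n' "${always[@]}" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

arc_public_fqdns privatelink prints the six FQDNs that still need public egress under Private Link; direct and proxy print all eight.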
Configure the proxy before connecting so onboarding itself can route out:
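```bash
azcmagent config set proxy.url "http://proxy.corp.local:3128"

# Private Link only: send the Arc data-plane FQDNs (the built-in "Arc" and
# "ArcData" bypass groups) straight to the private endpoint, not the proxy.
# Skip this when the proxy is your only egress path.
azcmagent config set proxy.bypass "Arc,ArcData"

# Verify reachability of every required endpoint BEFORE onboarding
azcmagent check --location eastus
```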
azcmagent check returns a pass/fail per required FQDN (*.his.arc.azure.com, *.guestconfiguration.azure.com, login.microsoftonline.com, management.azure.com, and the download CDN). Bake it into golden-image validation.
The azcmagent subcommands you will actually use, and when:
| Command | What it does | Run it when |
|---|---|---|
| azcmagent check --location <r> | Pre-flight every required FQDN | Before onboarding; image validation |
| azcmagent connect --config <f> | Onboard the machine to ARM | First-boot automation |
| azcmagent show | Status, mode, heartbeat, MI, agent version | Verifying health |
| azcmagent config set proxy.url <u> | Point the agent at an HTTP proxy | Proxy-only egress |
| azcmagent config list | Dump current agent configuration | Auditing a box’s settings |
| azcmagent logs | Bundle agent logs for support | Diagnosing a failed connect |
| azcmagent disconnect | Cleanly remove the ARM resource + MI | Decommissioning |
| azcmagent upgrade | Manually upgrade the agent | When not on auto-upgrade |
Two firewall facts that catch teams every single time:
| Fact | The trap | The rule |
|---|---|---|
| Entra ID + ARM are always public | “We’re on Private Link, why won’t it connect?” | Allow AzureActiveDirectory + AzureResourceManager service tags |
| Notifications use a separate FQDN | SSH/Run Command “just doesn’t work” | Allow *.guestnotificationservice.azure.com |
2. At-scale onboarding with a service principal
Interactive login does not scale to 600 servers. Create a dedicated onboarding service principal with the narrowest role for the job — Azure Connected Machine Onboarding — scoped to the single resource group that holds the machines. It can create Arc server resources and nothing else; a leaked secret cannot pivot.
```bash
# Dedicated onboarding identity, scoped to one RG, narrowest built-in role
az ad sp create-for-rbac \
  --name "sp-arc-onboarding" \
  --role "Azure Connected Machine Onboarding" \
  --scopes "/subscriptions/<sub-id>/resourceGroups/rg-arc-servers"
```
Never put the secret on a command line: arguments are visible in the process list and shell history, and azcmagent echoes arguments to logs in some failure paths. Use a config file referenced with --config; the agent reads the credential from disk and keeps it out of the console:
```bash
# /etc/arc-onboard.json (mode 0600, deleted after onboarding).
# Keys are the azcmagent connect flag names without the leading dashes.
# umask 077 in a subshell creates the file 0600 from the first byte.
( umask 077
cat > /etc/arc-onboard.json <<'JSON'
{
  "subscription-id": "<sub-id>",
  "resource-group": "rg-arc-servers",
  "location": "eastus",
  "tenant-id": "<tenant-id>",
  "service-principal-id": "<app-id>",
  "service-principal-secret": "<secret>",
  "cloud": "AzureCloud"
}
JSON
)
chmod 600 /etc/arc-onboard.json   # also tightens a pre-existing file

azcmagent connect \
  --config /etc/arc-onboard.json \
  --tags "Datacenter=COLO1,App=Payments,Owner='Platform Eng'" \
  --correlation-id "$(uuidgen)"

shred -u /etc/arc-onboard.json    # remove the secret immediately
```
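In pipelines the secret usually arrives as an environment variable rather than a literal in a heredoc. A minimal bash sketch, assuming hypothetical variable names (ARC_SUB_ID, ARC_TENANT_ID, ARC_SP_ID, ARC_SP_SECRET) and keys named after the azcmagent connect flags without their leading dashes (our reading of the --config format; verify against your agent version). It creates the file under umask 077 so there is no moment in which it is world-readable, and escapes quotes and backslashes so an unusual secret still yields valid JSON; control characters are not handled.

```bash
# Escape backslashes, then double quotes, for embedding in a JSON string.
# Assumes the value contains no control characters.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# Hypothetical helper: write_arc_config PATH RESOURCE_GROUP LOCATION
# Reads ARC_SUB_ID, ARC_TENANT_ID, ARC_SP_ID, ARC_SP_SECRET from the environment.
write_arc_config() {
  local path=$1 rg=$2 location=$3
  (
    umask 077   # new file is created 0600
    cat > "$path" <<JSON
{
  "subscription-id": "$(json_escape "$ARC_SUB_ID")",
  "resource-group": "$(json_escape "$rg")",
  "location": "$(json_escape "$location")",
  "tenant-id": "$(json_escape "$ARC_TENANT_ID")",
  "service-principal-id": "$(json_escape "$ARC_SP_ID")",
  "service-principal-secret": "$(json_escape "$ARC_SP_SECRET")",
  "cloud": "AzureCloud"
}
JSON
  )
  chmod 600 "$path"   # also tightens a file that already existed
}
```

Typical use is write_arc_config /etc/arc-onboard.json rg-arc-servers eastus, then azcmagent connect --config /etc/arc-onboard.json, then shred -u on the file.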
Windows uses the same azcmagent connect --config from an elevated session. For golden images, do not onboard before cloning — install the agent, leave it disconnected, and let first-boot automation (cloud-init, Ansible, an MDT/Intune task) run connect with a per-machine --resource-name so hostnames do not collide. Certificate-based SPN auth (--service-principal-cert) is better where you can distribute certs — it removes the long-lived secret entirely. Use --use-azcli (agent 1.59+) only for ad-hoc operator onboarding, never unattended fleets.
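For cloned images, the per-machine --resource-name can be derived at first boot. This hypothetical helper combines the lowercased hostname with the first eight characters of the machine ID and replaces anything outside letters, digits, “.”, “_” and “-”. It assumes the image was sealed with an empty /etc/machine-id so every clone regenerates a unique one, and it does not enforce ARM’s length limit, so check the naming rules for Microsoft.HybridCompute/machines before relying on it.

```bash
# Hypothetical helper: arc_resource_name HOSTNAME MACHINE_ID
# Prints "<hostname>-<first 8 chars of machine id>", lowercased and sanitized.
arc_resource_name() {
  local host=${1,,} id=${2,,}
  local name="${host}-${id:0:8}"
  # Replace every character outside [a-z0-9._-] with '-'
  printf '%s\n' "${name//[^a-z0-9._-]/-}"
}

# First-boot usage (Linux, assuming a regenerated /etc/machine-id):
#   azcmagent connect --config /etc/arc-onboard.json \
#     --resource-name "$(arc_resource_name "$(hostname)" "$(cat /etc/machine-id)")"
```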
The onboarding methods, ranked by fleet-fit:
| Method | Auth | Best for | Avoid when |
|---|---|---|---|
| SPN secret via --config | Client secret in a 0600 file | Unattended fleets | Cert distribution is feasible (use cert) |
| SPN certificate (--service-principal-cert) | Cert, no long-lived secret | Highest-security fleets | No PKI to distribute certs |
| --use-azcli | Operator’s az login | Ad-hoc, a few boxes | Any unattended/at-scale flow |
| Interactive device code | Browser login | One box, a lab | Anything past a handful |
| azcmagent connect with --access-token | Pre-fetched ARM token | Pipelines that already hold a token | Long-lived storage of the token |
The azcmagent connect flags that matter at scale, and why:
| Flag | Purpose | Default / note |
|---|---|---|
| --config <file> | Read params + secret from disk, off the CLI | Keeps the secret out of logs |
| --resource-name <name> | Set the ARM resource name explicitly | Avoids hostname collisions on cloned images |
| --tags "k=v,..." | Stamp governance tags at onboard | Pairs with deny-untagged policy |
| --correlation-id <guid> | Group a batch onboarding for support | One UUID per rollout wave |
| --private-link-scope <id> | Onboard directly into a Private Link Scope | Regulated estates |
| --cloud <name> | Target sovereign clouds | AzureCloud default |
| --service-principal-cert <path> | Cert-based auth, no secret | Preferred over --config secret |
| --correlation-id + --tags together | Auditable, attributable onboarding | Always in production |
The least-privilege role math — what each onboarding-related role can and cannot do:
| Role | Can | Cannot | Give it to |
|---|---|---|---|
| Azure Connected Machine Onboarding | Create + read Arc machine resources | Manage extensions, ESU, delete arbitrary resources | The onboarding SPN |
| Azure Connected Machine Resource Administrator | Manage machines, extensions, ESU | Touch unrelated resource types | Platform/Ops humans |
| Contributor (anti-pattern) | Everything in scope | Nothing in scope is off-limits | No one for onboarding |
| Reader on the Arc RG | Read inventory + compliance | Change anything | Auditors, monitoring |
A pre-onboarding readiness checklist, as a table you can tick:
| Check | Command / action | Pass criteria |
|---|---|---|
| Endpoints reachable | azcmagent check --location <r> | All FQDNs PASS |
| Proxy configured (if used) | azcmagent config list | proxy.url set, bypass set |
| SPN scoped to one RG | az role assignment list --assignee <app-id> --all | Single RG scope, onboarding role only |
| Secret off the CLI | Review automation | Secret only in 0600 --config file |
| Tags planned | Onboarding script | Owner, Datacenter, DataClassification present |
| Image not pre-connected | Golden image build | Agent installed, disconnected |
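The “SPN scoped to one RG” row can be automated. A minimal sketch: pipe the onboarding principal’s assignments, as tab-separated role-name/scope pairs, into a hypothetical check_onboarding_sp function that passes only when there is exactly one assignment, it is the Azure Connected Machine Onboarding role, and its scope is a single resource group. Note that az role assignment list needs --all to show assignments made below subscription scope.

```bash
# Hypothetical helper: reads "roleDefinitionName<TAB>scope" lines on stdin.
# Prints OK and returns 0 only for exactly one onboarding-role assignment
# scoped to a single resource group; otherwise prints reasons, returns 1.
check_onboarding_sp() {
  awk -F'\t' '
    NF < 2 { next }
    { n++ }
    $1 != "Azure Connected Machine Onboarding" { bad = bad "unexpected role: " $1 "\n" }
    tolower($2) !~ "^/subscriptions/[^/]+/resourcegroups/[^/]+$" {
      bad = bad "scope is not a single resource group: " $2 "\n"
    }
    END {
      if (n != 1) bad = bad "expected exactly 1 assignment, found " n+0 "\n"
      if (bad != "") { printf "%s", bad; exit 1 }
      print "OK"
    }'
}
```

Usage: az role assignment list --assignee <app-id> --all --query "[].[roleDefinitionName, scope]" -o tsv | check_onboarding_sp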
3. Machine Configuration: audit and remediation in-guest
Machine Configuration runs DSC-style packages inside the OS to assert state Azure Policy alone cannot see — registry values, file contents, installed packages, service states, sysctl/secedit settings. The engine is in-box on Arc servers, so you only assign policy.
Every configuration is a compiled MOF packaged into a .zip (optionally signed), published to Blob Storage, then referenced by a policy definition. Author it with the GuestConfiguration PowerShell module:
```powershell
Install-Module -Name GuestConfiguration -Scope CurrentUser

# Compile your DSC config (here: assert a registry value), then package it.
# Package -Type is Audit or AuditAndSet; AuditAndSet is required when the
# policy will use ApplyAndMonitor or ApplyAndAutoCorrect.
New-GuestConfigurationPackage `
  -Name 'EnforceTlsRegistry' `
  -Configuration './EnforceTlsRegistry.mof' `
  -Type 'AuditAndSet' `
  -Path './package'

# Test against the local machine before publishing
Get-GuestConfigurationPackageComplianceStatus `
  -Path './package/EnforceTlsRegistry.zip'
```
The policy -Mode you generate from the package maps directly to the assignmentType on the resulting guest assignment resource (the package -Type only decides whether Set is allowed at all):
| assignmentType | When Test returns false | Use it for |
|---|---|---|
| Audit | Report NonCompliant, do nothing | Read-only compliance reporting |
| ApplyAndMonitor | Apply once at assignment, then only report drift | One-time enforcement, manual re-apply |
| ApplyAndAutoCorrect | Run Set to remediate on every evaluation | Continuous, self-healing enforcement |
A subtlety that bites people: when a custom policy first deploys an assignment, assignmentType can briefly read Null before resolving (typically within an hour). Do not alert on that transient state.
Generate a policy definition from the package and assign it. For audit-only baselines (the CIS/STIG built-ins), the initiatives already exist — assign those directly rather than authoring your own.
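```powershell
# Generate the policy definition; -Mode becomes the assignmentType.
# Apply* modes require a package built with -Type AuditAndSet.
New-GuestConfigurationPolicy `
  -PolicyId (New-Guid) `
  -ContentUri 'https://stgarc.blob.core.windows.net/pkgs/EnforceTlsRegistry.zip' `
  -DisplayName 'Enforce TLS registry baseline' `
  -Description 'Enforce TLS registry baseline' `
  -Platform 'Windows' `
  -PolicyVersion '1.0.0' `
  -Mode 'ApplyAndAutoCorrect' `
  -Path './policy'
```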
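A cheap pre-publish guard catches a package/mode mismatch before it reaches the fleet. The sketch below assumes the pairing rule as we understand the GuestConfiguration module: a package built with -Type Audit supports only -Mode Audit, while ApplyAndMonitor and ApplyAndAutoCorrect require a package built with -Type AuditAndSet. The function name is hypothetical; feed it the values your build script passes to the two cmdlets.

```bash
# Hypothetical guard: check_gc_mode PACKAGE_TYPE POLICY_MODE
# Returns 0 when the pairing is valid under the assumed rule, 1 otherwise.
check_gc_mode() {
  local type=$1 mode=$2
  case "$mode" in
    Audit)
      [ "$type" = "Audit" ] || [ "$type" = "AuditAndSet" ] ;;
    ApplyAndMonitor|ApplyAndAutoCorrect)
      [ "$type" = "AuditAndSet" ] ;;   # Set must be allowed by the package
    *)
      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

Calling check_gc_mode AuditAndSet ApplyAndAutoCorrect || exit 1 in CI fails the build before New-GuestConfigurationPolicy ever runs on a bad pairing.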
What Machine Configuration can assert (and what it cannot)
The engine reaches deep into the OS, but it is not a general-purpose config-management tool. Know the boundary:
| Resource class | Examples it can assert | Platform |
|---|---|---|
| Registry | Key/value presence, type, data | Windows |
| File / directory | Existence, content hash, ACL | Windows + Linux |
| Service / daemon | Running/stopped, start mode | Windows + Linux |
| Installed package | Present/absent, version | Windows + Linux |
| Security policy | secedit settings, audit policy | Windows |
| Kernel/sysctl | sysctl parameters | Linux |
| Local users/groups | Membership, presence | Windows + Linux |
| Environment | Environment variables | Windows + Linux |
The authoring-to-enforcement pipeline, stage by stage:
| Stage | Tool / artifact | Output | Gotcha |
|---|---|---|---|
| 1. Author config | DSC config → .mof | Compiled MOF | Use only DSC resources available on the target platform |
| 2. Package | New-GuestConfigurationPackage | .zip package (sign with Protect-GuestConfigurationPackage if validation is enforced) | -Type (Audit or AuditAndSet) decides whether Set can run |
| 3. Test locally | Get-…PackageComplianceStatus | Pass/fail | Run on the target OS family |
| 4. Publish | Upload to Blob Storage | Public/SAS ContentUri | Lock down the container |
| 5. Generate policy | New-GuestConfigurationPolicy | Policy definition JSON | Apply modes need an AuditAndSet package |
| 6. Assign | az policy assignment create | Assignment at scope | Identity needed for Apply modes |
| 7. Remediate | az policy remediation create | Existing fleet brought in | DINE/Modify ignore existing without this |
Built-in initiatives you should assign rather than author from scratch:
| Built-in initiative | Asserts | Mode |
|---|---|---|
| CIS Microsoft Windows Server benchmark | CIS hardening controls | Audit |
| Windows machines should meet STIG requirements | DISA STIG controls | Audit |
| Linux machines should meet STIG requirements | DISA STIG (Linux) | Audit |
| Audit machines with insecure password security settings | Password policy | Audit |
| Deploy prerequisites to enable Guest Configuration | Identity + (VM) extension wiring | DeployIfNotExists |
Common authoring mistakes and their fix:
| Mistake | Symptom | Fix |
|---|---|---|
| Package unsigned where signing is required | Assignment fails to apply | Sign the package; set the signature validation policy |
| Apply mode on an Audit-type package | Apply does nothing / errors | Rebuild the package with -Type AuditAndSet, then regenerate the policy |
| ContentUri not reachable from the guest | Status stuck, never evaluates | Public/SAS URL the agent can GET |
| Tested on the wrong OS family | “Compliant” locally, fails in fleet | Test on the actual target OS |
| Alerting on transient Null assignmentType | False “broken” alerts on day one | Exclude the first hour after assign |
4. Extension management and VM-like operations
Arc servers accept the same extension model as Azure VMs through Microsoft.HybridCompute/machines/extensions. Day one you deploy the Azure Monitor Agent (telemetry to a Data Collection Rule) and, where used, the Custom Script Extension. Push them at scale with policy, or imperatively for a single box:
```bash
# Install the Azure Monitor Agent extension on an Arc server
az connectedmachine extension create \
  --resource-group "rg-arc-servers" \
  --machine-name "colo1-pay-01" \
  --location "eastus" \
  --name "AzureMonitorWindowsAgent" \
  --publisher "Microsoft.Azure.Monitor" \
  --type "AzureMonitorWindowsAgent" \
  --enable-auto-upgrade true
```
--enable-auto-upgrade true opts the extension into automatic minor-version upgrades — set it everywhere so you are not chasing CVEs in the agents themselves. Keep the agent current too via automatic agent upgrade so azcmagent self-updates.
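Because the Windows and Linux agents are separate extension types, a fleet script has to pick the right one per machine. A minimal bash sketch (function names hypothetical) that maps the OS to the extension name/type and builds the same az connectedmachine extension create call shown for a single box; a fifth argument prints the command instead of running it, which is handy for reviewing a rollout wave. The location argument is assumed to be the machine’s Arc region.

```bash
# Hypothetical helper: map an OS name to the Azure Monitor Agent extension.
ama_extension_name() {
  case "${1,,}" in
    windows) echo "AzureMonitorWindowsAgent" ;;
    linux)   echo "AzureMonitorLinuxAgent" ;;
    *)       echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: install_ama RG MACHINE OS LOCATION [DRY_RUN]
install_ama() {
  local rg=$1 machine=$2 os=$3 location=$4 dry=${5:-}
  local ext
  ext=$(ama_extension_name "$os") || return 1
  local cmd=(az connectedmachine extension create
    --resource-group "$rg" --machine-name "$machine" --location "$location"
    --name "$ext" --publisher "Microsoft.Azure.Monitor" --type "$ext"
    --enable-auto-upgrade true)
  if [ -n "$dry" ]; then
    printf '%s\n' "${cmd[*]}"   # dry run: show the command only
  else
    "${cmd[@]}"
  fi
}
```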
Beyond extensions, Arc unlocks VM-like operations: SSH/RDP over Arc (az ssh arc, through ARM with no inbound port), Azure Update Manager for cross-fleet patch orchestration, and Run Command for one-off scripts audited through ARM — replacing bastion and jump-box sprawl with RBAC-governed, logged access.
The extensions you will actually deploy on Arc servers, and what each delivers:
| Extension | Publisher / type | Delivers | Auto-upgrade? |
|---|---|---|---|
| Azure Monitor Agent (Windows) | Microsoft.Azure.Monitor / AzureMonitorWindowsAgent | Logs + metrics to a DCR | Yes; set it |
| Azure Monitor Agent (Linux) | Microsoft.Azure.Monitor / AzureMonitorLinuxAgent | Logs + metrics to a DCR | Yes; set it |
| Custom Script (Windows) | Microsoft.Compute / CustomScriptExtension | Run a script once | Manual |
| Custom Script (Linux) | Microsoft.Azure.Extensions / CustomScript | Run a script once | Manual |
| Dependency agent | Microsoft.Azure.Monitoring.DependencyAgent / DependencyAgentWindows or DependencyAgentLinux | VM Insights service map | Yes |
| Defender for Servers | via Defender plan | EDR / vuln assessment | Managed by Defender |
VM-like operations Arc unlocks, and what they replace:
| Operation | Command / entry point | Replaces | Inbound port? |
|---|---|---|---|
| SSH over Arc | az ssh arc -n <m> -g <rg> | Bastion / jump box | None |
| RDP over Arc | SSH tunnel via Arc | RDP gateway | None |
| Run Command | az connectedmachine run-command create | PsExec, ad-hoc SSH | None |
| Patch orchestration | Azure Update Manager | WSUS/SCCM per site | None |
| Inventory & changes | Azure Inventory / Change Tracking | Manual audits | None |
| Telemetry | AMA → DCR → Log Analytics | Per-tool log agents | None |
Extension lifecycle states and what each means:
| State | Meaning | Action |
|---|---|---|
| Creating | Install in progress | Wait; check after a few minutes |
| Succeeded | Installed and healthy | None |
| Failed | Install/run errored | Read extension status message; reinstall |
| Updating | Auto-upgrade applying | Wait |
| Deleting | Removal in progress | Wait |
| Stuck Creating (>15 min) | Extension manager can’t reach endpoints | Check his/CDN egress |
5. Azure Policy guest assignments and compliance reporting
The standard path is the built-in initiative “Deploy prerequisites to enable Guest Configuration policies on virtual machines.” On Azure VMs it deploys the extension and a system-assigned identity; on Arc servers the extension half is a no-op (it’s in-box) but the identity wiring still applies. Assign it at the management-group level so new machines inherit it.
DeployIfNotExists and Modify assignments change only resources created or updated after assignment; existing resources are evaluated and reported non-compliant but left untouched. To bring the existing 600 into compliance you must create a remediation task, and forgetting it is the single most common reason teams see “0% compliant” and panic. Trigger it on the assignment:
```bash
# Remediate all existing in-scope machines for one policy assignment
az policy remediation create \
  --name "remediate-machinecfg-baseline" \
  --policy-assignment "<assignment-id>" \
  --resource-discovery-mode ReEvaluateCompliance
```
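When the assignment is an initiative rather than a single policy, each DeployIfNotExists or Modify member needs its own remediation task, selected with --definition-reference-id. A minimal sketch, assuming you feed it only the reference IDs of those remediable members (one per line on stdin); the remediate-<ref> naming scheme is hypothetical, and a second argument switches to a dry run that prints each command.

```bash
# Hypothetical helper: remediate_refs ASSIGNMENT_ID [DRY_RUN]
# Reads policyDefinitionReferenceIds on stdin; one remediation task each.
remediate_refs() {
  local assignment=$1 dry=${2:-} ref
  while IFS= read -r ref; do
    [ -n "$ref" ] || continue            # skip blank lines
    local cmd=(az policy remediation create
      --name "remediate-${ref//[^A-Za-z0-9-]/-}"
      --policy-assignment "$assignment"
      --definition-reference-id "$ref"
      --resource-discovery-mode ReEvaluateCompliance)
    if [ -n "$dry" ]; then
      printf '%s\n' "${cmd[*]}"
    else
      "${cmd[@]}" </dev/null             # keep az from consuming the loop's stdin
    fi
  done
}
```

To list every member of an initiative, az policy set-definition show --name <initiative-name> --query "policyDefinitions[].policyDefinitionReferenceId" -o tsv works; filter that down to the DeployIfNotExists/Modify members before piping into remediate_refs "<assignment-id>".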
Compliance lands in Azure Resource Graph — how you answer “what’s broken” across the whole fleet in one query instead of clicking through the portal:
// Non-compliant Machine Configuration assignments across the estate
guestconfigurationresources
| where type =~ "microsoft.guestconfiguration/guestconfigurationassignments"
| extend status = tostring(properties.complianceStatus)
| extend machine = tostring(split(id, "/")[8])
| where status =~ "NonCompliant"
| project machine, name, status, lastComplianceChecked = properties.lastComplianceStatusChecked
| order by machine asc
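The split(id, "/")[8] index works because a guest configuration assignment's ID is nested under the ID of the machine it targets. The small sketch below applies the same split to an ID, which is handy when post-processing exported query results. It assumes bash and awk, and the subscription, resource group, machine, and assignment names are hypothetical.

```bash
# machine_from_assignment_id ID: print the machine name, mirroring split(id, "/")[8].
# The ID starts with "/", so splitting yields an empty first element; KQL's
# zero-based index 8 therefore corresponds to awk's one-based field 9.
machine_from_assignment_id() {
  printf '%s\n' "$1" | awk -F/ '{ print $9 }'
}

# Example with a hypothetical ID (prints web01):
# machine_from_assignment_id "/subscriptions/0000/resourceGroups/rg-arc-servers/providers/Microsoft.HybridCompute/machines/web01/providers/Microsoft.GuestConfiguration/guestConfigurationAssignments/AzureLinuxBaseline"
```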
Policy effects you will combine for an Arc estate, and what each does:
| Effect | Behaviour | Acts on existing? | Use it for |
|---|---|---|---|
| Audit | Flags non-compliance, changes nothing | Yes (reports) | CIS/STIG reporting |
| AuditIfNotExists | Audits when a related resource is missing | Yes (reports) | “MI not enabled” checks |
| DeployIfNotExists | Deploys the missing piece | No; needs remediation | Wiring identity/extensions |
| Modify | Adds/updates a property (e.g. tags) | No; needs remediation | Tag normalization |
| Deny | Blocks the create/update | N/A (preventive) | Extension allowlist, required tags |
| Disabled | Turns the rule off | N/A | Temporarily silencing |
The complianceStatus values and how to read them:
| Status | Meaning | Likely action |
|---|---|---|
| Compliant | Guest assertion passed | None |
| NonCompliant | Assertion failed (or DINE not remediated) | Run remediation / fix drift |
| Null (transient) | Assignment just created, not evaluated | Wait up to ~1 hour |
| Pending | Evaluation queued | Wait |
| Error | Package couldn’t run | Check ContentUri, agent health |
The “0% compliant and panicking” decision table:
| If you see… | It’s probably… | Do this |
|---|---|---|
| Every machine NonCompliant right after assigning DINE | Remediation never ran | az policy remediation create |
| Null status across a new assignment | First-hour transient | Wait, then re-check |
| Some machines missing entirely | MI/prereqs not wired | Assign the prerequisites initiative at MG scope |
| Error on specific machines | ContentUri unreachable or agent down | Fix egress / azcmagent show |
| Compliant locally, NonCompliant in fleet | Tested on wrong OS family | Re-test on the target OS |
Scopes and inheritance — assign high, let it cascade:
| Assign at… | Inherited by | Use for |
|---|---|---|
| Management group | All child subs + RGs + machines | Org-wide baselines (CIS) |
| Subscription | All RGs + machines in the sub | Per-environment policy |
| Resource group | Machines in that RG | The Arc-servers RG specifically |
| Resource | One machine | Exceptions (rare; prefer exclusions) |
6. Extended Security Updates for Windows Server 2012/2012 R2 through Arc
This is frequently the business case that funds the entire rollout. Windows Server 2012/2012 R2 are out of support; ESU delivers patches for up to three more years, and Arc is the delivery mechanism for machines not in Azure. You provision a license resource, then link it to each eligible server; patches flow through Windows Update / Azure Update Manager and bill monthly — no MAK keys to distribute.
The license is a first-class ARM resource (Microsoft.HybridCompute/licenses). Provision it with the CLI, attesting to Software Assurance or SPLA coverage:
# Provision a Datacenter physical-core ESU license (min 16 physical cores)
az connectedmachine license create \
--license-name "esu-ws2012-dc-colo1" \
--resource-group "rg-arc-servers" \
--location "eastus" \
--license-type "ESU" \
--state "Activated" \
--target "Windows Server 2012 R2" \
--edition "Datacenter" \
--type "pCore" \
--processors 16
Watch the licensing rules that actually cost money:
- Type is pCore or vCore. Physical-core licenses carry a mandatory 16-core minimum; virtual-core licenses a minimum of 8 per VM. The three valid combinations are Standard vCore, Standard pCore, and Datacenter pCore.
- You can resize --processors after provisioning and move licenses between resource groups/subscriptions; they are normal ARM resources, queryable in Resource Graph.
- Attesting to SA/SPLA coverage is a licensing commitment, not a checkbox.
The license parameters and their rules — get these wrong and you over- or under-pay:
| Parameter | Values | Rule / minimum | Notes |
|---|---|---|---|
| --license-type | ESU | Only ESU today | The resource type |
| --state | Activated / Deactivated | Deactivate to stop billing | PATCH to change |
| --target | Windows Server 2012 / 2012 R2 | Match the OS exactly | Mismatched target won’t link |
| --edition | Standard / Datacenter | Datacenter only with pCore | Drives price tier |
| --type | pCore / vCore | pCore min 16; vCore min 8 | Physical vs virtual cores |
| --processors | integer | ≥ minimum for the type | Resizable post-provisioning |
The three valid license combinations (anything else is invalid):
| Combination | Edition | Type | Minimum cores |
|---|---|---|---|
| Standard vCore | Standard | vCore | 8 per VM |
| Standard pCore | Standard | pCore | 16 physical |
| Datacenter pCore | Datacenter | pCore | 16 physical |
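The rules in the two tables above reduce to a small check you can run before provisioning. Below is a minimal sketch, assuming bash. It encodes only the edition×type combinations and core floors stated in this section, not prices, and the function name is hypothetical.

```bash
# esu_billed_cores EDITION TYPE CORES
# Prints the core count the license must carry (actual cores, raised to the
# floor for its type), or fails if the edition/type combination is invalid.
esu_billed_cores() {
  local edition=$1 type=$2 cores=$3 floor
  case "$edition/$type" in
    Standard/vCore) floor=8 ;;                      # minimum 8 per VM
    Standard/pCore | Datacenter/pCore) floor=16 ;;  # minimum 16 physical cores
    *) echo "invalid combination: $edition $type" >&2; return 1 ;;
  esac
  if [ "$cores" -lt "$floor" ]; then cores=$floor; fi
  echo "$cores"
}

# esu_billed_cores Standard vCore 2      # prints 8  (vCore floor)
# esu_billed_cores Standard pCore 2      # prints 16 (the pCore over-buy on a small box)
# esu_billed_cores Datacenter pCore 24   # prints 24
# esu_billed_cores Datacenter vCore 4    # fails: not a valid combination
```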
Linking is a licenseProfiles/default child on the machine. Declare it in Bicep so it lives in source control alongside the license:
@description('Resource ID of the ESU license to assign')
param esuLicenseId string
param machineName string
resource esuLink 'Microsoft.HybridCompute/machines/licenseProfiles@2023-06-20-preview' = {
name: '${machineName}/default'
location: resourceGroup().location
properties: {
esuProfile: {
assignedLicense: esuLicenseId
}
}
}
To unlink (machine decommissioned, or moved into Azure where ESU is free), PUT the same licenseProfiles/default with an empty esuProfile: {}. Deactivate a license by PATCHing its state to Deactivated so billing stops.
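Neither operation needs a dedicated CLI verb; both are plain ARM calls you can make with az rest. Below is a minimal sketch, assuming bash, the Azure CLI, and the same 2023-06-20-preview API version as the Bicep above. The license body shape (properties.licenseDetails.state) is an assumption to verify against the REST reference. Subscription, resource group, machine, and license names are placeholders, and DRY_RUN=1 prints the calls instead of sending them.

```bash
API="2023-06-20-preview"   # same API version as the Bicep example above

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# esu_unlink SUB RG MACHINE LOCATION: PUT licenseProfiles/default with an empty esuProfile.
esu_unlink() {
  run az rest --method put \
    --url "https://management.azure.com/subscriptions/$1/resourceGroups/$2/providers/Microsoft.HybridCompute/machines/$3/licenseProfiles/default?api-version=$API" \
    --body "{\"location\":\"$4\",\"properties\":{\"esuProfile\":{}}}"
}

# esu_deactivate SUB RG LICENSE: PATCH the license state so billing stops.
# Assumes the state lives at properties.licenseDetails.state.
esu_deactivate() {
  run az rest --method patch \
    --url "https://management.azure.com/subscriptions/$1/resourceGroups/$2/providers/Microsoft.HybridCompute/licenses/$3?api-version=$API" \
    --body '{"properties":{"licenseDetails":{"state":"Deactivated"}}}'
}

# Example (placeholders):
# esu_unlink "<sub-id>" rg-arc-servers web01 eastus
# esu_deactivate "<sub-id>" rg-arc-servers esu-ws2012-dc-colo1
```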
The ESU lifecycle, operation by operation:
| Operation | How | Billing effect |
|---|---|---|
| Provision license | az connectedmachine license create --state Activated | Billing starts on activation |
| Link to machine | PUT licenseProfiles/default with assignedLicense | Machine becomes eligible for patches |
| Resize cores | Update --processors | Bill follows new core count |
| Move license | Move the ARM resource to another RG/sub | No billing change |
| Unlink machine | PUT licenseProfiles/default with esuProfile: {} | Machine no longer eligible |
| Deactivate license | PATCH state = Deactivated | Billing stops |
| Delete license | Delete the ARM resource | Removed entirely |
ESU eligibility and “do I even need it” decision table:
| Situation | ESU via Arc needed? | Why |
|---|---|---|
| WS2012/2012 R2 in a colo/on-prem | Yes | Out of support; Arc is the channel |
| WS2012/2012 R2 in another cloud | Yes | Same — off-Azure |
| WS2012/2012 R2 already in Azure (IaaS) | No | ESU is free for Azure VMs |
| WS2016+ | No | Still in support |
| Migrating off 2012 within months | Maybe | Bridge until migration completes |
7. Private Link Scope for secure agent-to-Azure traffic
For regulated estates that forbid agent traffic over the public internet, an Azure Arc Private Link Scope (Microsoft.HybridCompute/privateLinkScopes) routes the his and guestconfiguration data planes through one private endpoint over ExpressRoute or VPN. One scope serves many machines; a virtual network maps to at most one scope.
# Create the scope, then a private endpoint bound to its 'hybridcompute' group
az connectedmachine private-link-scope create \
--resource-group "rg-arc-net" \
--location "eastus" \
--scope-name "pls-arc-prod" \
--public-network-access Disabled
scopeId=$(az connectedmachine private-link-scope show \
--resource-group "rg-arc-net" --scope-name "pls-arc-prod" --query id -o tsv)
az network private-endpoint create \
--resource-group "rg-arc-net" \
--name "pe-arc-prod" \
--location "eastus" \
--vnet-name "vnet-hub" \
--subnet "snet-pe" \
--private-connection-resource-id "$scopeId" \
--group-id "hybridcompute" \
--connection-name "arc-conn"
Two private DNS zones must resolve to the endpoint’s private IPs (a third only if you also run Arc-enabled Kubernetes):
privatelink.his.arc.azure.com
privatelink.guestconfiguration.azure.com
# privatelink.dp.kubernetesconfiguration.azure.com # only for Arc K8s
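Creating the zones is only half the job. Each zone must be linked to the VNet your resolvers use, and the private endpoint needs a DNS zone group so its A records are written for you. Below is a minimal sketch, assuming bash, the Azure CLI, and the resource names from the commands above; the link and zone-group names are hypothetical. Setting DRY_RUN=1 prints the commands instead of running them.

```bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# setup_arc_private_dns RG VNET PRIVATE_ENDPOINT
setup_arc_private_dns() {
  local rg=$1 vnet=$2 pe=$3 zone
  for zone in privatelink.his.arc.azure.com privatelink.guestconfiguration.azure.com; do
    # The zone itself, then a link to the hub VNet (no auto-registration of VM records)
    run az network private-dns zone create --resource-group "$rg" --name "$zone"
    run az network private-dns link vnet create --resource-group "$rg" \
      --zone-name "$zone" --name "link-${vnet}" \
      --virtual-network "$vnet" --registration-enabled false
  done
  # One zone group on the endpoint covering both zones, so its A records are managed automatically
  run az network private-endpoint dns-zone-group create --resource-group "$rg" \
    --endpoint-name "$pe" --name "arc-zones" \
    --private-dns-zone privatelink.his.arc.azure.com --zone-name his
  run az network private-endpoint dns-zone-group add --resource-group "$rg" \
    --endpoint-name "$pe" --name "arc-zones" \
    --private-dns-zone privatelink.guestconfiguration.azure.com --zone-name guestconfiguration
}

# Example:
# setup_arc_private_dns rg-arc-net vnet-hub pe-arc-prod
```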
Critically, Microsoft Entra ID (login.microsoftonline.com, pas.windows.net) and Azure Resource Manager (management.azure.com) do not traverse the scope — they keep using public endpoints. Allow those via the AzureActiveDirectory and AzureResourceManager service tags on your firewall/NSG, or servers fail to authenticate even with a healthy private endpoint. Onboard new machines with --private-link-scope <scope-resource-id>; associate existing ones afterward (up to 15 minutes to start accepting connections).
What goes private versus what stays public — the table that prevents the #1 Private Link failure:
| Traffic | Endpoint | Path with Private Link | Firewall requirement |
|---|---|---|---|
| Hybrid identity / metadata | *.his.arc.azure.com | Private (via scope) | Private DNS zone privatelink.his.arc.azure.com |
| Machine Configuration | *.guestconfiguration.azure.com | Private (via scope) | Private DNS zone privatelink.guestconfiguration.azure.com |
| Entra ID auth | login.microsoftonline.com, pas.windows.net | Public | Allow AzureActiveDirectory service tag |
| Azure Resource Manager | management.azure.com | Public | Allow AzureResourceManager service tag |
| Notifications | *.guestnotificationservice.azure.com | Public | Allow the FQDN |
| Agent/extension binaries | Download CDN | Public | Allow the CDN FQDNs |
Private DNS zones required, keyed by what you run:
| Private DNS zone | Required for | Maps to |
|---|---|---|
| privatelink.his.arc.azure.com | All Arc servers via Private Link | PE private IP |
| privatelink.guestconfiguration.azure.com | Machine Configuration | PE private IP |
| privatelink.dp.kubernetesconfiguration.azure.com | Arc-enabled Kubernetes only | PE private IP |
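From the machine itself, the quickest proof that private DNS is wired is that the data-plane names resolve to RFC 1918 addresses. Below is a minimal sketch, assuming a Linux box with bash and glibc's getent. The gbl.his.arc.azure.com name in the example is illustrative; substitute the FQDNs azcmagent check reports for your region.

```bash
# is_private_ipv4 IP: succeed if IP falls in 10/8, 172.16/12 or 192.168/16 (RFC 1918).
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_private_resolution FQDN...: report where each name resolves; non-zero if any is public.
check_private_resolution() {
  local fqdn ip rc=0
  for fqdn in "$@"; do
    ip=$(getent ahostsv4 "$fqdn" | awk 'NR == 1 { print $1 }')
    if [ -z "$ip" ]; then
      echo "FAIL $fqdn: does not resolve"; rc=1
    elif is_private_ipv4 "$ip"; then
      echo "OK   $fqdn -> $ip (private)"
    else
      echo "FAIL $fqdn -> $ip (public: check the privatelink zone and its VNet link)"; rc=1
    fi
  done
  return $rc
}

# Example (illustrative FQDN):
# check_private_resolution gbl.his.arc.azure.com
```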
Private Link Scope constraints worth internalizing:
| Constraint | Value | Implication |
|---|---|---|
| Scopes per VNet | At most 1 | Plan one scope per hub VNet |
| Machines per scope | Many | One scope serves a whole datacenter |
| Association propagation | Up to ~15 min | Don’t expect instant connect after associating |
| public-network-access | Disabled for strict | Forces all data-plane traffic private |
| Entra ID + ARM | Always public | Service-tag rules are mandatory |
8. RBAC scoping and operational guardrails
The whole point of projecting servers into ARM is that existing governance applies. Use the purpose-built Arc roles instead of broad Contributor:
| Role | Grants | Give it to |
|---|---|---|
| Azure Connected Machine Onboarding | create/read Arc server resources only | the onboarding SPN |
| Azure Connected Machine Resource Administrator | manage Arc servers, extensions, ESU | platform/ops team |
| Reader on the Arc RG | read-only inventory and compliance | auditors, monitoring |
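Granting the three roles takes one az role assignment create per principal, scoped to the Arc resource group rather than the subscription. Below is a minimal sketch, assuming bash, the Azure CLI, and object IDs for one service principal and two Entra groups (all placeholders). Setting DRY_RUN=1 prints the commands instead of running them.

```bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# grant_arc_roles RG_ID ONBOARDING_SPN_OBJECT_ID OPS_GROUP_ID AUDITOR_GROUP_ID
grant_arc_roles() {
  local scope=$1
  run az role assignment create --scope "$scope" --assignee-object-id "$2" \
    --assignee-principal-type ServicePrincipal --role "Azure Connected Machine Onboarding"
  run az role assignment create --scope "$scope" --assignee-object-id "$3" \
    --assignee-principal-type Group --role "Azure Connected Machine Resource Administrator"
  run az role assignment create --scope "$scope" --assignee-object-id "$4" \
    --assignee-principal-type Group --role "Reader"
}

# Example (placeholders):
# grant_arc_roles "$(az group show -n rg-arc-servers --query id -o tsv)" "<spn-oid>" "<ops-group-oid>" "<auditors-group-oid>"
```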
Layer policy guardrails on top so the estate stays inside the rails:
- A deny on Microsoft.HybridCompute/machines/extensions restricted to an allowlist of publishers/types; extensions run as root/SYSTEM, so stop arbitrary ones from being pushed.
- A deny on machines missing required tags (Owner, Datacenter, DataClassification) so nothing onboards anonymously.
- Diagnostic settings shipping agent and policy events to a central Log Analytics workspace.
// Deny any Arc extension not on the allowlist (policyRule fragment)
"if": {
"allOf": [
{ "field": "type", "equals": "Microsoft.HybridCompute/machines/extensions" },
{ "not": {
"field": "Microsoft.HybridCompute/machines/extensions/type",
"in": ["AzureMonitorWindowsAgent", "AzureMonitorLinuxAgent", "CustomScriptExtension"]
}}
]
},
"then": { "effect": "deny" }
The guardrail policies every Arc estate should carry, and the blast radius each contains:
| Guardrail | Effect | Contains | Without it |
|---|---|---|---|
| Extension allowlist | Deny | Arbitrary root/SYSTEM code via extensions | Any operator pushes anything |
| Required-tags | Deny | Anonymous/unowned onboarding | Ungoverned ghost machines |
| Diagnostic settings to LA | DeployIfNotExists | Missing audit trail | No central evidence |
| Allowed locations | Deny | Sprawl into unintended regions | Cost + data-residency leaks |
| Agent auto-upgrade enforced | Audit/config | Stale, vulnerable agents | CVEs in the agent itself |
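To turn the allowlist policyRule fragment above into an enforceable guardrail, wrap it in a complete rule file, create a custom definition, and assign it. Below is a minimal sketch, assuming bash and the Azure CLI. It uses --mode All because the rule targets a child resource type, and it assumes the definition and the assignment live in the current subscription; the definition name, assignment name, and scope are placeholders. With DRY_RUN=1 the az commands are only printed, but the rule file is still written.

```bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# write_allowlist_rule FILE: emit the complete policyRule from the fragment above.
write_allowlist_rule() {
  cat > "$1" <<'JSON'
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.HybridCompute/machines/extensions" },
      { "not": {
          "field": "Microsoft.HybridCompute/machines/extensions/type",
          "in": ["AzureMonitorWindowsAgent", "AzureMonitorLinuxAgent", "CustomScriptExtension"]
      }}
    ]
  },
  "then": { "effect": "deny" }
}
JSON
}

# deploy_allowlist SCOPE: create the definition, then assign it at SCOPE.
deploy_allowlist() {
  local rules
  rules=$(mktemp) || return 1
  write_allowlist_rule "$rules"
  run az policy definition create --name "arc-extension-allowlist" \
    --display-name "Arc: allowed extension types" --mode All --rules "$rules"
  run az policy assignment create --name "arc-extension-allowlist" \
    --scope "$1" --policy "arc-extension-allowlist"
  rm -f "$rules"
}

# Example (placeholder scope):
# deploy_allowlist "$(az group show -n rg-arc-servers --query id -o tsv)"
```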
Blast-radius reasoning — what a stolen credential can do, by identity:
| Compromised identity | Can do | Cannot do | Mitigation |
|---|---|---|---|
| Onboarding SPN | Create Arc machines in one RG | Manage extensions, delete, pivot | Cert auth; rotate; scope to one RG |
| Machine MI (one box) | Whatever RBAC you granted that MI | Anything you didn’t grant | Grant MI least privilege; per-machine |
| Resource Administrator human | Manage all Arc machines + extensions | Touch unrelated resource types | PIM/JIT; conditional access |
| Reader | View inventory/compliance | Change anything | Fine as-is |
Architecture at a glance
Read the diagram left to right as the control-plane projection it is. On the far left, an off-Azure server in a colo or another cloud runs the Connected Machine agent — azcmagent plus the HIMDS, GuestConfig, and extension-manager services — listening on no inbound port and reaching out only on HTTPS 443. That outbound traffic splits at the connectivity layer: with Private Link, the his and guestconfiguration data planes ride a private endpoint through your hub VNet over ExpressRoute/VPN, while Entra ID and Azure Resource Manager stay on public endpoints (the rule that trips everyone — allow the AzureActiveDirectory and AzureResourceManager service tags or nothing authenticates). Once through, the agent authenticates to Entra ID, the machine surfaces in ARM as a HybridCompute/machines resource with a system-assigned managed identity, and from there the governance plane takes over.
The right of the diagram is where the value lands. Azure Policy assigns CIS/STIG baselines and Machine Configuration packages that the in-guest engine evaluates and (optionally) auto-corrects; ESU licenses link to each WS2012/2012 R2 box to fund off-Azure patching through Update Manager; and Resource Graph + Log Analytics roll the whole fleet’s compliance and telemetry into one queryable plane. The numbered badges mark the five places this breaks in production — onboarding egress, the always-public Entra ID/ARM path, the in-box Machine Config engine, the ESU link, and the private-endpoint DNS — and the legend narrates each as symptom · confirm · fix. Follow the path once and you have the whole system: agent out on 443, identity to Entra ID, resource in ARM, governance reaching back in.
Real-world scenario
A payments platform team I worked with — call them NorthPay — ran 420 Windows Server 2012 R2 hosts across two PCI-scoped datacenters. The constraint was hard: the auditor would not accept agent telemetry crossing the public internet, and every box needed ESU because migrating the legacy payment gateway off 2012 R2 was an 18-month project they could not front-load. Their first attempt onboarded everything in direct mode and immediately failed the network review.
The fix had three parts. First, a Private Link Scope per datacenter, fronted by a private endpoint on the existing ExpressRoute-connected hub VNet, with the his and guestconfiguration zones in central private DNS. Second — the part that broke — they had blocked all outbound internet at the firewall, and onboarding hung. The agent still needs Entra ID and ARM over the public internet even behind Private Link, so they added exactly two service-tag rules and nothing else:
# The only public egress the agent needs behind Private Link
az network nsg rule create -g rg-pci-net --nsg-name nsg-arc \
--name AllowAAD --priority 150 --direction Outbound --access Allow \
--protocol Tcp --source-address-prefixes VirtualNetwork \
--destination-address-prefixes AzureActiveDirectory --destination-port-ranges 443
az network nsg rule create -g rg-pci-net --nsg-name nsg-arc \
--name AllowARM --priority 151 --direction Outbound --access Allow \
--protocol Tcp --source-address-prefixes VirtualNetwork \
--destination-address-prefixes AzureResourceManager --destination-port-ranges 443
Third, ESU as code: one Datacenter pCore license per physical host (2-socket boxes, well over the 16-core floor), provisioned and linked through a Bicep loop over an inventory file with assignedLicense referencing the license resource ID. Compliance — the CIS baseline via Machine Configuration ApplyAndMonitor and ESU coverage — rolled up into one Resource Graph dashboard the auditor could query directly. The review passed on the second pass, and the only public traffic on the wire was two service tags to identity and ARM.
The numbers told the story to the CFO. The management plane itself cost ₹0 per machine; the spend was ESU (the unavoidable cost of running an out-of-support OS for 18 more months) plus a modest Log Analytics ingestion bill and the private-endpoint hours. Against the alternative — a forced, rushed migration of the payment gateway, or a failed PCI audit — it was trivially justified. The lesson NorthPay wrote on the wall: “Private Link makes the data plane private; it does not make identity private. Allow Entra ID and ARM, or you have a beautiful endpoint nobody can authenticate through.”
The rollout as a timeline, because the order of moves is the lesson:
| Phase | Action | Result | What it taught |
|---|---|---|---|
| Week 1 | Onboard all in direct mode | Failed network review | Read the security requirement first |
| Week 2 | Private Link Scope per DC + private DNS | Data plane private | Scope-per-hub-VNet pattern |
| Week 2 | Blocked all egress → onboarding hung | Agents couldn’t connect | Entra ID + ARM are always public |
| Week 2 | Added two service-tag rules | Onboarding succeeded | Minimal public egress, nothing more |
| Week 3 | ESU as code (Bicep loop over inventory) | All 420 licensed + linked | Licenses are ARM resources — treat as code |
| Week 4 | CIS via Machine Config + Resource Graph dashboard | Auditor queried compliance directly | One queryable plane beats screenshots |
| Audit | Second-pass review | Passed | Two service tags on the wire, nothing else |
Advantages and disadvantages
Projecting servers into ARM is powerful, but it is not free of trade-offs. Weigh it honestly:
| Advantages | Disadvantages |
|---|---|
| One control plane (policy, RBAC, Resource Graph) across on-prem + multicloud | Another agent to install, version, and keep healthy on every box |
| Management plane is free per machine; pay only for value-add services | Value-add services (Defender, ESU, extra ingestion) do cost real money |
| Managed identity per machine — no service-account secrets on disk | Misunderstanding “always-public Entra ID/ARM” stalls Private Link rollouts |
| Machine Configuration sees in-guest state Azure Policy alone cannot | Authoring/signing custom packages has a learning curve |
| ESU through Arc is the only sane channel for off-Azure WS2012/2012 R2 | ESU licensing rules (16-core floor, edition×type combos) are easy to over-buy |
| Same extension model as Azure VMs (AMA, Custom Script, Defender) | Extensions run as root/SYSTEM — a real attack surface without a deny allowlist |
| az ssh arc / Run Command replace bastion + jump-box sprawl, fully audited | Outbound-only by design: no inbound management without going through ARM |
The model is right when you have servers outside Azure that genuinely need Azure-grade governance, identity, and patching — regulated estates, long migration tails, multicloud shops. It is over-engineering for a handful of boxes you will retire next quarter, or for workloads that have no compliance, identity, or patching requirement at all. The disadvantages are all manageable — but only if you know they exist, which is the point of this article.
Hands-on lab
Onboard a single machine, prove it healthy, assign an audit baseline, and tear it down — all on a free-tier-friendly Linux VM (you can use a small Azure VM as the “off-Azure” stand-in, or any Ubuntu box you control). Run the Azure-side commands in Cloud Shell (Bash); run the agent commands on the target machine.
Step 1 — Variables and resource group.
RG=rg-arc-lab
LOC=eastus
SP_NAME=sp-arc-lab-onboard
az group create -n $RG -l $LOC -o table
Expected: a resource-group row with provisioningState: Succeeded.
Step 2 — Create the least-privilege onboarding SPN.
az ad sp create-for-rbac \
--name "$SP_NAME" \
--role "Azure Connected Machine Onboarding" \
--scopes "$(az group show -n $RG --query id -o tsv)"
# Note the appId, password, tenant — you'll put them in a 0600 file on the box
Expected: JSON with appId, password, tenant. Treat the password like a secret.
Step 3 — On the target machine, install the agent. (Linux one-liner; Windows uses the MSI.)
# On the Ubuntu box (run as root)
wget https://aka.ms/azcmagent -O ~/install_linux_azcmagent.sh
bash ~/install_linux_azcmagent.sh
azcmagent version # confirm the CLI is installed
Step 4 — Pre-flight the endpoints before connecting.
azcmagent check --location eastus
# Expect PASS for his, guestconfiguration, login.microsoftonline.com, management.azure.com, CDN
If any FQDN fails, fix egress before continuing — onboarding will hang otherwise.
Step 5 — Connect using the SPN via a 0600 config file (secret off the CLI).
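# Create the file 0600 before any secret is written to it
touch /etc/arc-onboard.json && chmod 600 /etc/arc-onboard.json
# Keys match azcmagent connect's long option names
cat > /etc/arc-onboard.json <<'JSON'
{ "subscription-id":"<sub-id>","resource-group":"rg-arc-lab","location":"eastus",
"tenant-id":"<tenant>","service-principal-id":"<appId>","service-principal-secret":"<password>",
"cloud":"AzureCloud" }
JSON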
azcmagent connect --config /etc/arc-onboard.json \
--tags "Owner=Lab,Datacenter=LAB,DataClassification=None" \
--correlation-id "$(uuidgen)"
shred -u /etc/arc-onboard.json
Expected: Connected machine to Azure. The box now exists in ARM.
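In automation, it is worth refusing to connect if the config file's permissions have drifted from 0600, for example because a different tool created it. Below is a minimal sketch, assuming GNU coreutils stat (as on Ubuntu); the function name is hypothetical.

```bash
# require_mode_600 FILE: succeed only if FILE exists and its mode is exactly 600.
require_mode_600() {
  local mode
  mode=$(stat -c '%a' "$1" 2>/dev/null) || { echo "missing: $1" >&2; return 1; }
  if [ "$mode" != "600" ]; then
    echo "refusing: $1 has mode $mode, expected 600" >&2
    return 1
  fi
}

# Usage before the connect in Step 5:
# require_mode_600 /etc/arc-onboard.json && azcmagent connect --config /etc/arc-onboard.json ...
```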
Step 6 — Verify health from both sides.
azcmagent show # on the box: Status: Connected, an Agent Version, a MI principal id
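# From Cloud Shell: name the target explicitly ($(hostname) here would return Cloud Shell's own name)
MACHINE="<target-hostname>"
az connectedmachine show -g $RG -n "$MACHINE" \
--query "{status:status, agentVersion:agentVersion, mi:identity.principalId}" -o jsonc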
Expected: status: Connected, a non-null mi.
Step 7 — Assign an audit-only CIS-style baseline at the RG scope. Use a built-in audit initiative so there’s nothing to author:
# Example: assign a built-in 'audit insecure password settings' style policy at the RG
az policy assignment create \
--name "lab-audit-baseline" \
--scope "$(az group show -n $RG --query id -o tsv)" \
--policy-set-definition "<built-in-initiative-id>" # pick an Arc-applicable audit initiative
Compliance takes time to evaluate; check it in Resource Graph after ~30–60 minutes with the guestconfigurationresources query from section 5.
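To check compliance from Cloud Shell rather than the portal, the same Resource Graph query can be run with az graph query, filtered to the lab resource group. Below is a minimal sketch, assuming bash and the resource-graph CLI extension (az extension add --name resource-graph). It also assumes the guestconfigurationresources table exposes a resourceGroup column, as Resource Graph tables generally do. Setting DRY_RUN=1 prints the command instead of running it.

```bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# lab_compliance RG: per-machine compliance for one resource group.
lab_compliance() {
  local query
  query="guestconfigurationresources
| where type =~ 'microsoft.guestconfiguration/guestconfigurationassignments'
| where resourceGroup =~ '$1'
| project machine = tostring(split(id, '/')[8]), name,
          status = tostring(properties.complianceStatus)"
  run az graph query -q "$query" --first 100 -o table
}

# Example:
# lab_compliance rg-arc-lab
```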
Step 8 — Teardown (stop all billing and remove the resource).
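# On the box: disconnect (prompts for sign-in, then removes the ARM resource + MI)
azcmagent disconnect
# If the ARM resource is already gone (e.g. you deleted the RG first), clear local agent state only:
# azcmagent disconnect --force-local-only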
# From Cloud Shell: nuke the RG and the SPN
az group delete -n $RG --yes --no-wait
az ad sp delete --id "<appId>"
Expected: the machine disappears from ARM; the RG deletes asynchronously. Nothing here incurs ongoing cost once removed.
Common mistakes & troubleshooting
Most Arc incidents are one of a dozen failure modes, and each has a precise signal. Scan the playbook, then read the detail for the row that matches.
| # | Symptom | Root cause | Confirm (exact command / path) | Fix |
|---|---|---|---|---|
| 1 | Onboarding hangs / times out | Egress blocked to a required FQDN | azcmagent check --location <r> | Allow the failing FQDN / service tag |
| 2 | “Connected” but no managed identity | his data plane unreachable | azcmagent show (MI null) | Allow *.his.arc.azure.com / fix PE DNS |
| 3 | Private Link healthy, auth still fails | Entra ID/ARM blocked (always public) | NSG/firewall rules review | Allow AzureActiveDirectory + AzureResourceManager tags |
| 4 | Compliance shows 0%/all NonCompliant | DINE/Modify never remediated existing fleet | Compliance blade; assignment type | az policy remediation create |
| 5 | complianceStatus reads Null | Transient first-hour state | guestconfigurationresources query | Wait ~1 hour; don’t alert on it |
| 6 | Machine Config never evaluates | guestconfiguration data plane blocked | azcmagent check; status Error | Allow *.guestconfiguration.azure.com |
| 7 | ESU machine still unpatched | License not linked, or wrong target | licenseProfiles/default.esuProfile empty | Link license; match --target to OS |
| 8 | ESU bill higher than expected | Over-provisioned cores / wrong type | az connectedmachine license show | Resize --processors; correct pCore/vCore |
| 9 | Cloned image → duplicate/colliding names | Onboarded before cloning | Two machines, same name in ARM | Onboard at first boot with --resource-name |
| 10 | Extension stuck Creating | Extension manager can’t reach his/CDN | Extension status; egress | Allow his + download CDN FQDNs |
| 11 | Secret leaked in logs | Secret passed on the CLI | Review automation/log capture | Use --config 0600 file; rotate the secret |
| 12 | Agent on a stale version (CVE) | Auto-upgrade not enabled | azcmagent show version | Enable automatic agent upgrade |
| 13 | az ssh arc fails | Notifications FQDN blocked | Test SSH; egress | Allow *.guestnotificationservice.azure.com |
| 14 | Disconnected machine still billing ESU | License left Activated | az connectedmachine license show --query state | PATCH state to Deactivated |
Onboarding hangs (rows 1–3)
By far the most common rollout failure, and almost always egress. The agent must reach five endpoint classes; block any and connect stalls. The decision table:
| If azcmagent check fails on… | It’s probably… | Do this |
|---|---|---|
| login.microsoftonline.com / management.azure.com | Entra ID/ARM blocked (even behind Private Link) | Allow AzureActiveDirectory + AzureResourceManager tags |
| *.his.arc.azure.com | his data plane / private DNS broken | Fix PE + privatelink.his.arc.azure.com zone |
| *.guestconfiguration.azure.com | guestconfiguration data plane blocked | Allow it (or fix its private DNS zone) |
| Download CDN | Binary download blocked | Allow aka.ms / download.microsoft.com |
| Everything | No egress at all / wrong proxy | Set proxy.url; open 443 outbound |
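When every check fails because traffic must leave through a forward proxy, set the proxy in the agent's local configuration before connecting, then re-run the pre-flight. Below is a minimal sketch, assuming bash on the target machine and a hypothetical proxy URL. Setting DRY_RUN=1 prints the commands instead of running them.

```bash
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# set_arc_proxy PROXY_URL LOCATION: configure the proxy, confirm it, re-check endpoints.
set_arc_proxy() {
  run azcmagent config set proxy.url "$1"
  run azcmagent config get proxy.url
  run azcmagent check --location "$2"
}

# Example (hypothetical proxy):
# set_arc_proxy "http://proxy.corp.example:3128" eastus
```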
Compliance shows 0% (rows 4–6)
The panic moment. Nine times in ten it is a remediation task that never ran, because DeployIfNotExists and Modify only act on new/updated resources:
| Observation | Cause | Fix |
|---|---|---|
| All existing machines NonCompliant after assigning DINE | No remediation task | az policy remediation create --policy-assignment <id> |
| Brand-new assignment shows Null | First-hour transient | Wait; exclude from alerting |
| Status Error on specific boxes | guestconfiguration unreachable | Fix egress; azcmagent show |
| Some machines absent from results | MI prerequisites not assigned | Assign the prerequisites initiative at MG scope |
ESU surprises (rows 7, 8, 14)
ESU is the part that touches the invoice, so its failure modes cost money, not just availability:
| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| Machine eligible but unpatched | License not linked | licenseProfiles/default.esuProfile.assignedLicense empty | PUT the profile with the license ID |
| “Won’t link” error | --target ≠ the OS | Compare license target to OS version | Recreate license with correct target |
| Bill too high | Cores over-provisioned | az connectedmachine license show --query processors | Resize down; verify pCore vs vCore |
| Still billing after decommission | License left Activated | --query state | PATCH state = Deactivated |
Best practices
- Onboard with a dedicated, RG-scoped SPN carrying only Azure Connected Machine Onboarding; prefer certificate auth, pass any secret via a 0600 --config file, and shred it after.
- Bake azcmagent check into golden-image validation. Never ship an image that can’t pre-flight its endpoints; configure the proxy before connect.
- Never pre-connect a golden image. Install the agent disconnected, and onboard at first boot with a per-machine --resource-name to avoid name collisions.
- Tag at onboard (Owner, Datacenter, DataClassification) and enforce a deny-untagged policy so nothing onboards anonymously.
- Assign policy at the management-group scope so new machines inherit baselines automatically; reserve resource-scope assignments for genuine exceptions.
- Always run a remediation task after any DeployIfNotExists/Modify assignment against a pre-existing fleet; assignment alone never touches existing machines.
- Pick assignmentType deliberately: Audit for reporting, ApplyAndMonitor for one-time enforcement, ApplyAndAutoCorrect only where self-healing is genuinely wanted.
- Restrict extensions to an allowlist with a deny policy; extensions run as root/SYSTEM and are a real attack surface.
- Enable auto-upgrade on both the agent and every extension so you are not chasing CVEs in your own tooling.
- Treat ESU licenses as code (Bicep), with the correct edition×type combo and core count; deactivate on decommission so billing stops.
- For regulated estates, use a Private Link Scope per hub VNet, and always allow the AzureActiveDirectory + AzureResourceManager service tags for the unavoidable public egress.
- Ship agent and policy diagnostics to a central Log Analytics workspace and build the compliance view in Resource Graph, not portal screenshots.
Security notes
The security posture of an Arc estate rests on three pillars: identity blast radius, the in-guest attack surface, and network exposure. Tighten each deliberately.
| Control | Default / risk | Hardened state |
|---|---|---|
| Onboarding identity | A broad SPN can pivot if leaked | RG-scoped onboarding role, cert auth, rotation |
| Secret handling | Secret on the CLI leaks to logs | 0600 --config file, shredded after use |
| Machine MI privilege | MI inherits whatever you grant | Least-privilege RBAC per machine |
| Extensions | Run as root/SYSTEM, push anything | Deny allowlist of publishers/types |
| Inbound exposure | — | None by design; outbound 443 only |
| Agent data plane | Telemetry over the public internet | Private Link Scope + private DNS |
| Entra ID + ARM egress | Often over-broad “allow internet” | Scoped AzureActiveDirectory + AzureResourceManager tags |
| Audit trail | Scattered/none | ARM activity log + diagnostics to central LA |
| Access to boxes | Standing RDP/SSH, jump boxes | az ssh arc / Run Command via ARM, PIM-gated |
Identity-specific guidance:
| Identity | Least-privilege rule | Extra hardening |
|---|---|---|
| Onboarding SPN | Onboarding role, one RG | Certificate auth; short rotation; alert on use |
| Machine system-assigned MI | Grant only what the in-guest scripts need | Per-machine scoping; review grants quarterly |
| Operator (Resource Administrator) | Scoped to the Arc RG/sub | PIM/JIT activation; conditional access |
| Auditor (Reader) | Read-only inventory + compliance | No write paths at all |
The network exposure model in one line per layer: inbound — nothing, ever (the agent opens no port); outbound — 443 only, to a known FQDN set, ideally split into private (data plane) and tightly-scoped public (identity/ARM); lateral — a compromised box’s MI can do only what you granted it, so least-privilege the MI as if it were a user.
Cost & sizing
The headline that funds the rollout: the Arc management plane is free. There is no per-machine charge to project a server into ARM, run Machine Configuration audits, assign policy, or query Resource Graph. You pay only for value-add services you opt into. Knowing exactly what bills prevents both sticker shock and the opposite error — assuming Arc itself costs money and under-deploying.
| Component | Bills? | Driver | Rough figure |
|---|---|---|---|
| Arc management plane (inventory, policy, Resource Graph) | No | — | ₹0 |
| Machine Configuration audit/remediation | No | — | ₹0 |
| Azure Monitor Agent data ingestion | Yes | GB ingested to Log Analytics | ~₹220–280 / GB ingested |
| Log Analytics retention beyond free period | Yes | GB-months retained | Per GB-month after 31 days free |
| Defender for Servers (Plan 1/2) | Yes | Per server/hour | ~$15/server/month (Plan 2) |
| Update Manager (Arc machines) | Yes | Per Arc server/hour for patch mgmt | ~$5/Arc-server/month equiv |
| Extended Security Updates | Yes | Core count × edition × year | Year 1 lowest, rises Y2/Y3 |
| Private endpoint | Yes | Hours + GB processed | ~₹0.90/hr + per-GB |
ESU sizing is where the real money is, and it scales by cores, not machines:
| Lever | Effect on ESU bill | Right-sizing move |
|---|---|---|
| Core count (--processors) | Linear | Provision the actual cores, not a round-up |
| Type (pCore vs vCore) | pCore floors at 16 | Use vCore (min 8) for small VMs |
| Edition (Standard vs Datacenter) | Datacenter costs more | Standard unless density justifies DC |
| Year of coverage | Y1 < Y2 < Y3 | Migrate before Y3 to cap exposure |
| Deactivation on decommission | Stops billing | PATCH state=Deactivated promptly |
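ESU bills on cores, and each license type has a floor, so it helps to work out billable cores before you provision a license. This minimal sketch applies the rules stated in this article: vCore has a minimum of 8 per VM, pCore has a minimum of 16, and Datacenter is valid only with pCore. The billable_cores helper is hypothetical, not an Azure API, and it does not model the price per core.

```python
# Per-type billing floors and the valid edition/type pairs, per this article.
MINIMUM_CORES = {"vCore": 8, "pCore": 16}
VALID_COMBOS = {
    ("Standard", "vCore"),
    ("Standard", "pCore"),
    ("Datacenter", "pCore"),
}


def billable_cores(edition: str, core_type: str, actual_cores: int) -> int:
    """Cores an ESU license is billed for: the actual count, raised to the floor."""
    if (edition, core_type) not in VALID_COMBOS:
        raise ValueError(f"{edition} {core_type} is not a valid ESU combination")
    if actual_cores < 1:
        raise ValueError("actual_cores must be at least 1")
    return max(actual_cores, MINIMUM_CORES[core_type])


if __name__ == "__main__":
    # (edition, type, actual cores) for a hypothetical three-machine fleet
    fleet = [("Standard", "vCore", 2), ("Standard", "vCore", 4),
             ("Datacenter", "pCore", 24)]
    print(sum(billable_cores(*m) for m in fleet))  # 8 + 8 + 24 = 40
```

The floors are where over-buying comes from. A 2-core VM bills 8 cores on vCore but 16 on pCore. The core-count lever only pays off once a machine's actual cores exceed the floor.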
Right-sizing rules of thumb:
| If you have… | Choose | Why |
|---|---|---|
| A small 2-core VM on WS2012 R2 | vCore, Standard | pCore’s 16-core floor over-buys |
| A dense 2-socket physical host | pCore, Datacenter | DC covers unlimited VMs on the host |
| A short migration runway | ESU Year 1 only | Cap the most expensive years |
| Light telemetry needs | Tight DCR scope | Ingestion is the sleeper cost |
| Heavy compliance/threat needs | Defender Plan 2 | Worth it for EDR + vuln assessment |
The free tier in practice: onboarding, inventory, policy, Machine Configuration, and Resource Graph cost nothing — you can govern a 600-machine estate’s compliance for ₹0. The bill arrives only with ingestion (control your DCR scope), Defender (opt in where threat protection is needed), Update Manager extras, ESU (the cost of running an out-of-support OS), and private endpoints. Budget those five line items, not “Arc.”
Interview & exam questions
These map to AZ-104, AZ-500, and AZ-305, where hybrid governance, Arc, and Machine Configuration appear.
1. What does Azure Arc-enabled servers actually do to an on-prem machine? It projects the machine into Azure Resource Manager as a Microsoft.HybridCompute/machines resource, so ARM governance — RBAC, Azure Policy, Resource Graph, tags — reaches it. The workload, OS, and network are unchanged; only the control plane extends.
2. The agent is connected but in-guest scripts can’t get a token. Likely cause? The hybrid identity service (HIS) data plane (*.his.arc.azure.com) is unreachable, so the Hybrid IMDS at localhost:40342 can’t broker tokens. Confirm with azcmagent show (the managed identity shows as null) and fix the egress or the private DNS zone.
3. You enabled a Private Link Scope but onboarding still fails to authenticate. Why? Entra ID (login.microsoftonline.com) and ARM (management.azure.com) always use public endpoints, even with Private Link. You must allow the AzureActiveDirectory and AzureResourceManager service tags; the private endpoint carries only the HIS and Guest Configuration data planes.
4. Difference between Audit, ApplyAndMonitor, and ApplyAndAutoCorrect? Audit reports non-compliance and changes nothing. ApplyAndMonitor applies the configuration once at assignment, then only reports drift. ApplyAndAutoCorrect re-runs Set whenever an evaluation detects drift, so the machine continuously self-heals.
5. You assigned a DeployIfNotExists policy but the existing fleet shows 0% compliant. Fix? DINE and Modify act only on new/updated resources. Create a remediation task (az policy remediation create) to bring existing machines into scope.
6. Why don’t Arc servers need the Guest Configuration extension? The Machine Configuration engine ships in-box with the Connected Machine agent. On Azure VMs you deploy the extension separately; on Arc that half of the prerequisites initiative is a no-op (the identity wiring still applies).
7. What’s the minimum core count for an ESU pCore license, and which edition pairs with it? 16 physical cores minimum for pCore. Datacenter is only valid with pCore; the three valid combos are Standard vCore (min 8), Standard pCore (min 16), and Datacenter pCore (min 16).
8. How do you onboard 500 servers non-interactively without leaking a secret? Use a dedicated SPN with Azure Connected Machine Onboarding scoped to one RG. Pass the secret via a 0600 --config file, never as a command-line argument (which can be echoed to logs), and shred the file afterwards. Prefer certificate auth where you can distribute certs.
9. A cloned golden image produced colliding machine names in ARM. What did they do wrong? They onboarded before cloning. Install the agent disconnected in the image, and run connect at first boot with a per-machine --resource-name.
10. Which role should the onboarding identity have, and why not Contributor? Azure Connected Machine Onboarding — it can only create/read Arc machine resources. Contributor would let a leaked secret manage extensions (root/SYSTEM code), ESU, and other resources, a far larger blast radius.
11. How do you stop ESU billing for a decommissioned server? Unlink it (PUT licenseProfiles/default with an empty esuProfile: {}) and, if no other machine uses the license, PATCH the license state to Deactivated to stop billing.
12. What is the single most common reason a Private Link Arc rollout fails the first time? Blocking all outbound internet, forgetting that Entra ID and ARM must stay public. The endpoint looks healthy but nothing can authenticate until the two service tags are allowed.
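The difference in question 4 is easiest to see over several evaluation cycles. The toy model below illustrates the semantics described there; it is not the Machine Configuration engine. It makes three simplifying assumptions: there is a single setting, out-of-band drift happens between cycles, and status is reported after the engine acts. The simulate name is hypothetical.

```python
from typing import List, Optional

ASSIGNMENT_TYPES = ("Audit", "ApplyAndMonitor", "ApplyAndAutoCorrect")


def simulate(assignment_type: str, desired: str, initial: str,
             drift: List[Optional[str]]) -> List[str]:
    """Status per evaluation cycle for one setting.

    drift[i] is an out-of-band value written before cycle i (None = no change);
    the status is reported after the engine has acted in that cycle.
    """
    if assignment_type not in ASSIGNMENT_TYPES:
        raise ValueError(f"unknown assignment type: {assignment_type}")
    current = initial
    statuses = []
    for cycle, change in enumerate(drift):
        if change is not None:
            current = change  # someone changed the machine between cycles
        if assignment_type == "ApplyAndMonitor" and cycle == 0:
            current = desired  # applied once at assignment, never again
        elif assignment_type == "ApplyAndAutoCorrect" and current != desired:
            current = desired  # Set re-runs whenever drift is detected
        # Audit falls through: it only observes.
        statuses.append("Compliant" if current == desired else "NonCompliant")
    return statuses


if __name__ == "__main__":
    for kind in ASSIGNMENT_TYPES:
        print(kind, simulate(kind, "on", "off", [None, None, "off", None]))
```

Take a setting that starts "off" and someone reverts it to "off" before cycle 2. Audit reports NonCompliant throughout. ApplyAndMonitor is Compliant until the revert and NonCompliant afterwards. ApplyAndAutoCorrect stays Compliant because it corrects the drift.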
Quick check
- Which two endpoint classes always use public endpoints, even behind an Arc Private Link Scope?
- You see all existing machines as NonCompliant right after assigning a DeployIfNotExists policy. What did you forget?
- What is the mandatory minimum core count for an ESU pCore license?
- Where does in-guest tooling read managed-identity tokens from, with no stored secret?
- Why must you avoid passing the SPN secret on the azcmagent connect command line?
Answers
- Microsoft Entra ID (login.microsoftonline.com, pas.windows.net) and Azure Resource Manager (management.azure.com). Allow the AzureActiveDirectory and AzureResourceManager service tags, or the agent can’t authenticate.
- A remediation task. DeployIfNotExists/Modify act only on new/updated resources; run az policy remediation create against the assignment to bring the existing fleet in.
- 16 physical cores. (vCore minimum is 8 per VM; Datacenter is only valid with pCore.)
- The local Hybrid IMDS endpoint at http://localhost:40342. The agent’s HIMDS service serves it and issues tokens for the machine’s system-assigned managed identity.
- azcmagent can echo command-line arguments to logs in some failure paths, leaking the secret. Use a 0600 --config file and shred it after onboarding (or use certificate auth).
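The fourth answer compresses a challenge/response exchange into one step. The sketch below shows that flow using only Python's standard library:

1. An unauthenticated request to the token endpoint returns 401 with a WWW-Authenticate: Basic realm=<path> header.
2. The caller reads that key file, which only local administrators or the agent's designated group can read.
3. The caller replays the file's contents in an Authorization: Basic header.

The code reads the IDENTITY_ENDPOINT environment variable and falls back to localhost:40342. The API version string is an assumption to verify against current documentation. The helper names are hypothetical.

```python
import json
import os
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

DEFAULT_ENDPOINT = "http://localhost:40342/metadata/identity/oauth2/token"
API_VERSION = "2020-06-01"  # assumption: confirm against current Arc docs


def token_url(resource: str, endpoint: Optional[str] = None) -> str:
    """Build the token URL; the agent sets IDENTITY_ENDPOINT on Arc machines."""
    base = endpoint or os.environ.get("IDENTITY_ENDPOINT", DEFAULT_ENDPOINT)
    query = urllib.parse.urlencode({"api-version": API_VERSION, "resource": resource})
    return f"{base}?{query}"


def challenge_path(www_authenticate: str) -> str:
    """Extract the key-file path from a 'Basic realm=<path>' challenge."""
    prefix = "Basic realm="
    if not www_authenticate.startswith(prefix):
        raise ValueError(f"unexpected challenge: {www_authenticate!r}")
    return www_authenticate[len(prefix):].strip()


def get_token(resource: str = "https://management.azure.com/") -> str:
    """Two-step HIMDS flow; must run on the Arc machine as a privileged user."""
    url = token_url(resource)
    try:
        # Step 1: unauthenticated call, expected to fail with a 401 challenge.
        with urllib.request.urlopen(urllib.request.Request(url, headers={"Metadata": "true"})):
            pass
    except urllib.error.HTTPError as err:
        if err.code != 401:
            raise
        key_path = challenge_path(err.headers.get("WWW-Authenticate", ""))
    else:
        raise RuntimeError("HIMDS answered without the expected 401 challenge")
    # Step 2: prove local privilege by replaying the key file's contents.
    with open(key_path, encoding="utf-8") as key_file:
        secret = key_file.read().strip()
    request = urllib.request.Request(
        url, headers={"Metadata": "true", "Authorization": f"Basic {secret}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Run get_token() on an Arc-connected machine under an account that can read the key file. It returns a bearer token for ARM, and no secret is ever stored in the script. If step 1 hangs or is refused, the HIS egress in question 2 is the usual suspect.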
Glossary
- Azure Arc-enabled servers: Projection of an off-Azure machine into ARM as a Microsoft.HybridCompute/machines resource so Azure governance applies.
- Connected Machine agent (azcmagent): The single package (CLI plus the HIMDS, GuestConfig, and extension-manager services) that connects and manages an Arc server; outbound-only on 443.
- Hybrid Instance Metadata Service (HIMDS): Local service at localhost:40342 that brokers managed-identity tokens and metadata in-guest.
- System-assigned managed identity: Per-machine, certificate-backed identity created at connect; lets in-guest scripts authenticate to Azure with no stored secret.
- Machine Configuration: In-box DSC-style engine that audits and optionally remediates in-guest state (registry, files, services, sysctl).
- Guest assignment: A Machine Configuration package bound to a machine (guestConfigurationAssignments), carrying the compliance status.
- assignmentType: Whether a guest assignment is Audit, ApplyAndMonitor, or ApplyAndAutoCorrect.
- Extended Security Updates (ESU): Paid security patches for out-of-support Windows Server 2012/2012 R2, delivered off-Azure through Arc.
- ESU license: First-class ARM resource (HybridCompute/licenses) you provision and link to eligible machines.
- License profile: The machines/licenseProfiles/default child that links an ESU license to a machine.
- Connectivity mode: direct, proxy, or Private Link; shapes which endpoints go public vs private.
- Azure Arc Private Link Scope: HybridCompute/privateLinkScopes resource that routes the HIS and Guest Configuration data planes over a private endpoint.
- Onboarding service principal: A dedicated, RG-scoped Entra identity holding the Azure Connected Machine Onboarding role, used to connect machines non-interactively.
- Remediation task: The action that brings existing resources into compliance for DeployIfNotExists/Modify policies.
- Service tag: A named IP range (e.g. AzureActiveDirectory, AzureResourceManager) used in NSG/firewall rules to allow the always-public agent egress.
Next steps
- Azure Arc-Enabled Kubernetes: GitOps, Policy & Fleet Management — extend the same control-plane projection from servers to clusters.
- Azure Update Manager: Maintenance Configurations & Patch Orchestration with Arc — orchestrate patching (including ESU delivery) across the fleet you just onboarded.
- Defender for Servers: CWPP for Hybrid & Multicloud — add EDR and vulnerability assessment to your Arc machines.
- Azure Policy: Governance at Scale — deepen the policy assignments, initiatives, and remediation that drive Arc compliance.
- Enterprise-Scale Landing Zone: Management-Group Hierarchy Design — place your Arc estate inside a governed landing-zone hierarchy.