Active Directory Domain Services Forest Design and Domain Controller Promotion on Azure IaaS

Running Active Directory Domain Services on Azure IaaS is not “lift and shift the on-prem DC.” Forest design, FSMO placement, and how you model Azure subnets in Sites and Services determine whether authentication survives a zone outage and whether replication converges or becomes a permanent slow burn. This is the build I run for production AD DS on Azure VMs.

Scope note: this is self-managed AD DS on IaaS VMs — full forest control, schema extension, GPO, the works. It is not Microsoft Entra Domain Services (the managed, you-don’t-touch-the-DCs offering) and not Entra ID. Pick IaaS DCs when you need schema extensions, a writable forest root you own, or legacy app compatibility the managed service will not give you.

1. Forest, domain, tree, and OU design decisions

Start with the boundary questions, because they are expensive to reverse.

OU design is your real administrative model. Do not mirror the org chart; model it around what you delegate and what GPOs you target. A durable baseline:

corp.contoso.com
+-- OU=Admin            (tier-0: PAWs, admin accounts, DC-related groups)
+-- OU=Servers
|   +-- OU=Infrastructure
|   +-- OU=Application
+-- OU=Workstations
|   +-- OU=Standard
|   +-- OU=Kiosk
+-- OU=Users
|   +-- OU=Employees
|   +-- OU=ServiceAccounts
+-- OU=Groups

Keep computer and user objects out of the default CN=Computers and CN=Users containers — you cannot link a GPO to a container, only to an OU/site/domain. Use redircmp and redirusr to redirect default object creation into real OUs.
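
As a rough sketch of that redirection, assuming the corp.contoso.com domain used in the promotion steps and the OU names from the baseline tree (adjust the distinguished names to your own design), the target OUs are created first and then set as the default landing spots:

```powershell
# Assumption: domain DN and OU names follow the baseline tree in this section.
$domainDn = "DC=corp,DC=contoso,DC=com"

# Create the parent and child OUs that will receive new objects.
New-ADOrganizationalUnit -Name "Workstations" -Path $domainDn
New-ADOrganizationalUnit -Name "Standard"     -Path "OU=Workstations,$domainDn"
New-ADOrganizationalUnit -Name "Users"        -Path $domainDn
New-ADOrganizationalUnit -Name "Employees"    -Path "OU=Users,$domainDn"

# Redirect default computer and user creation (run once per domain, as a Domain Admin).
redircmp "OU=Standard,OU=Workstations,$domainDn"
redirusr "OU=Employees,OU=Users,$domainDn"
```

An OU named Users can coexist with the built-in CN=Users container because the two are different object types with different relative distinguished names.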

2. Plan DNS, namespace, and trusts before the first DC

AD will not function without correct DNS. Decide these before Install-ADDSForest runs:

# Pin the VNet to resolve via the DC private IPs (Terraform).
resource "azurerm_virtual_network" "hub" {
  name                = "vnet-identity-hub"
  resource_group_name = azurerm_resource_group.identity.name
  location            = "eastus"
  address_space       = ["10.10.0.0/16"]
  dns_servers         = ["10.10.1.4", "10.10.2.4"] # DC1, DC2 private IPs
}

resource "azurerm_subnet" "dc" {
  name                 = "snet-dc"
  resource_group_name  = azurerm_resource_group.identity.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.10.1.0/24"]
}
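
Once the first DC is up, a quick check from any VM in the VNet shows whether the custom DNS setting took effect and whether the DC Locator records resolve. This is a minimal sketch that assumes the corp.contoso.com domain name from the promotion steps; a VM usually needs a restart or DHCP lease renewal before it picks up changed VNet DNS servers.

```powershell
# Assumption: domain name is corp.contoso.com, as in the promotion steps.
$domain = "corp.contoso.com"

# Which DNS servers did the NIC pick up from the VNet settings?
Get-DnsClientServerAddress -AddressFamily IPv4 |
    Where-Object ServerAddresses |
    Select-Object InterfaceAlias, ServerAddresses

# DC Locator depends on these SRV records; both should list your DCs.
Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.$domain" -Type SRV
Resolve-DnsName -Name "_kerberos._tcp.$domain"       -Type SRV
```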

3. Size and place domain controllers across regions and zones

Plan for failure domains, not just CPU.

| Decision | Guidance |
|---|---|
| Count | Minimum two DCs per domain per region so a single DC reboot (patch night) never leaves a region without authentication. |
| Zones | Spread the two regional DCs across two Availability Zones so a single-zone outage leaves a live DC. |
| Cross-region | A second region needs its own DCs modeled as a separate AD site (Step 5), not stretched LAN. |
| VM size | Modest but memory-bound: the DIT should fit in RAM. A general-purpose D-series (e.g. D2s/D4s v5) suits most. Avoid B-series burstable for production DCs: credit starvation stalls LSASS under load. |
| Disk | Place the database (NTDS.DIT), logs, and SYSVOL on a separate managed data disk with host caching None. The AD database must not use write caching that can lose committed transactions. |

If the region supports Availability Zones, prefer zones over Availability Sets for the two DCs so a single-zone outage leaves a live DC.
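
To confirm the placement decisions after deployment, something like the following can be run from a workstation with the Az PowerShell module. It is only a sketch: the resource group and VM names are hypothetical, and it assumes you are already signed in with Connect-AzAccount.

```powershell
# Assumptions: Az.Compute module installed and signed in; names below are hypothetical.
$rg  = "rg-identity"
$dcs = "DC1-EastUS", "DC2-EastUS"

foreach ($name in $dcs) {
    $vm = Get-AzVM -ResourceGroupName $rg -Name $name
    [pscustomobject]@{
        Name      = $vm.Name
        Size      = $vm.HardwareProfile.VmSize
        Zone      = ($vm.Zones -join ',')      # expect a different zone per DC
        DataDisks = ($vm.StorageProfile.DataDisks |
                        ForEach-Object { "$($_.Name)=$($_.Caching)" }) -join '; '  # expect Caching=None
    }
}
```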

Attach and initialize the data disk before promotion:

# Bring the new data disk online and format it (run in-guest).
$disk = Get-Disk | Where-Object PartitionStyle -eq 'RAW' | Select-Object -First 1
Initialize-Disk -Number $disk.Number -PartitionStyle GPT
New-Partition -DiskNumber $disk.Number -UseMaximumSize -DriveLetter N |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel "AD-DATA" -Confirm:$false

4. Promote the first DC and additional DCs

dcpromo.exe is long dead; promotion is PowerShell. First install the role binaries:

Install-WindowsFeature AD-Domain-Services -IncludeManagementTools

First DC — create a new forest (greenfield):

$safeModePwd = Read-Host -AsSecureString -Prompt "DSRM password"

# WinThreshold = Windows Server 2016 functional level; raise as supported.
# Keep comments off the continued lines: a backtick must be the last character on its line.
Install-ADDSForest `
    -DomainName "corp.contoso.com" `
    -DomainNetbiosName "CORP" `
    -ForestMode "WinThreshold" `
    -DomainMode "WinThreshold" `
    -InstallDns `
    -DatabasePath "N:\NTDS" `
    -LogPath "N:\NTDS" `
    -SysvolPath "N:\SYSVOL" `
    -SafeModeAdministratorPassword $safeModePwd `
    -NoRebootOnCompletion:$false `
    -Force

WinThreshold is the functional-level identifier for Windows Server 2016. Windows Server 2019 and 2022 introduced no new forest or domain functional level, so 2016 is the highest level available on those releases (Windows Server 2025 adds a newer one). Choose the highest level every DC can support.

Additional DC — replica into the existing domain (this is also the extension pattern for Azure DCs joining on-prem):

$safeModePwd = Read-Host -AsSecureString -Prompt "DSRM password"

# Pre-create the Azure-EastUS site first (Step 5).
# Keep comments off the continued lines: a backtick must be the last character on its line.
Install-ADDSDomainController `
    -DomainName "corp.contoso.com" `
    -Credential (Get-Credential "CORP\domainadmin") `
    -InstallDns `
    -SiteName "Azure-EastUS" `
    -DatabasePath "N:\NTDS" `
    -LogPath "N:\NTDS" `
    -SysvolPath "N:\SYSVOL" `
    -SafeModeAdministratorPassword $safeModePwd `
    -Force

Prerequisites that bite: time must be sane (Step 8), the server being promoted must resolve the helper DC via DNS (Step 2), and AD replication ports must be open between sites. For a cross-region replica over ExpressRoute that means RPC dynamic ports (49152-65535) plus 53 (DNS), 88 and 464 (Kerberos), 135 (RPC endpoint mapper), 389/636 (LDAP/LDAPS), 3268/3269 (Global Catalog), and 445 (SMB). Run Test-ADDSDomainControllerInstallation and read its output; do not skip it.
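
A pre-flight pass along these lines can catch most of those problems before promotion. It is a sketch under assumptions: dc1.corp.contoso.com is a hypothetical helper DC name, only a subset of the fixed TCP ports is probed, and the RPC dynamic range is not tested at all.

```powershell
# Assumption: the helper DC name is a placeholder; site and domain come from this build.
$helperDc = "dc1.corp.contoso.com"

# TCP reachability to the helper DC on a subset of the fixed AD ports.
foreach ($port in 53, 88, 135, 389, 445, 3268) {
    $r = Test-NetConnection -ComputerName $helperDc -Port $port -WarningAction SilentlyContinue
    "{0,5}  {1}" -f $port, $(if ($r.TcpTestSucceeded) { "open" } else { "BLOCKED" })
}

# Dry-run the promotion; nothing is changed on the server.
Test-ADDSDomainControllerInstallation `
    -DomainName "corp.contoso.com" `
    -Credential (Get-Credential "CORP\domainadmin") `
    -SiteName "Azure-EastUS" `
    -InstallDns `
    -SafeModeAdministratorPassword (Read-Host -AsSecureString -Prompt "DSRM password") |
    Format-List Context, Status, Message
```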

For the first replica over a slow link, use Install From Media (IFM): generate an ntdsutil IFM set from a healthy DC and pass -InstallationMediaPath so promotion seeds the DIT from media instead of pulling the whole database across the WAN.
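
The IFM flow can be sketched as follows. C:\IFM is a hypothetical staging path, and the remaining parameters simply mirror the replica promotion used in this step; the media should come from a DC running the same OS version as the new server.

```powershell
# On a healthy existing DC: build an IFM set that includes SYSVOL.
# C:\IFM is a hypothetical staging path; it must be empty or absent.
ntdsutil "activate instance ntds" ifm "create sysvol full C:\IFM" quit quit

# Copy C:\IFM to the new server, then promote from media.
Install-ADDSDomainController `
    -DomainName "corp.contoso.com" `
    -Credential (Get-Credential "CORP\domainadmin") `
    -InstallDns `
    -SiteName "Azure-EastUS" `
    -InstallationMediaPath "C:\IFM" `
    -DatabasePath "N:\NTDS" -LogPath "N:\NTDS" -SysvolPath "N:\SYSVOL" `
    -SafeModeAdministratorPassword (Read-Host -AsSecureString -Prompt "DSRM password") `
    -Force
```

After promotion the new DC still replicates the changes made since the media was generated, so keep the IFM set fresh.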

5. Configure AD Sites and Services for Azure

Out of the box every DC lands in Default-First-Site-Name. That is wrong the moment you have more than one Azure region: clients use site topology to find the nearest DC, and the KCC builds replication around sites. Map every Azure subnet to an AD site.

# Create sites for each Azure region.
New-ADReplicationSite -Name "Azure-EastUS"
New-ADReplicationSite -Name "Azure-WestEU"

# Associate the VNet address ranges with their sites (each range must cover the Azure CIDRs; the most specific match wins).
New-ADReplicationSubnet -Name "10.10.0.0/16" -Site "Azure-EastUS"
New-ADReplicationSubnet -Name "10.20.0.0/16" -Site "Azure-WestEU"

# Site link controlling inter-site replication frequency.
New-ADReplicationSiteLink -Name "EastUS-WestEU" `
    -SitesIncluded "Azure-EastUS","Azure-WestEU" `
    -ReplicationFrequencyInMinutes 15 `
    -InterSiteTransportProtocol IP

Then confirm each DC’s server object is in the correct site (it follows -SiteName at promotion if you pre-created the site). Within a site the KCC handles intra-site replication automatically (change-notification based, near-real-time). Across sites it compresses and batches on the site-link schedule. Set ReplicationFrequencyInMinutes to 15 for responsive cross-region convergence; the legacy 180-minute default is a WAN-era artifact.
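
A short check of site placement, plus the two usual corrections, might look like this. The DC and site names come from this build; DEFAULTIPSITELINK is the built-in link name and is only relevant if you still use it.

```powershell
# Where did each DC's server object land?
Get-ADDomainController -Filter * |
    Select-Object Name, Site, IPv4Address

# Relocate a DC that ended up in the wrong site.
Move-ADDirectoryServer -Identity "DC2-EastUS" -Site "Azure-EastUS"

# Tighten an existing link, e.g. the built-in DEFAULTIPSITELINK.
Set-ADReplicationSiteLink -Identity "DEFAULTIPSITELINK" -ReplicationFrequencyInMinutes 15
```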

Subnet accuracy matters: a member VM whose IP is covered by no site subnet logs “no site mapping” and may authenticate cross-region, adding latency and egress cost. Cover all VNet ranges, including spokes.
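
One way to audit coverage, offered as a sketch rather than a complete tool: list what AD knows, ask a member which site it resolved to, and look for the Netlogon event that reports unmapped clients.

```powershell
# Every subnet object and the site it maps to; spokes should appear here too.
Get-ADReplicationSubnet -Filter * |
    Select-Object Name, Site |
    Sort-Object Name

# On a member VM: which site does Netlogon think this machine is in?
nltest /dsgetsite

# On a DC: event 5807 reports clients whose IPs matched no site.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 5807 } `
    -MaxEvents 5 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message
```

The individual offending client addresses are written to %SystemRoot%\debug\netlogon.log on the DC as NO_CLIENT_SITE entries.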

6. Distribute and seize FSMO roles; plan for DC failure

The five FSMO roles are single-instance — two forest-wide, three per-domain.

| Role | Scope | Notes |
|---|---|---|
| Schema Master | Forest | Schema changes only; co-locate with the PDC holder. |
| Domain Naming Master | Forest | Needed to add/remove domains. |
| PDC Emulator | Domain | Most operationally critical: time source, password chaining, GPO-edit default, lockout. Put it on your most reliable primary-region DC. |
| RID Master | Domain | Hands out RID pools; if down, new-object creation eventually stalls. |
| Infrastructure Master | Domain | Cross-domain reference updates; irrelevant in a single-domain forest where all DCs are GCs. |

View and transfer (the graceful operation) roles:

# Where do the roles live?
Get-ADDomain  | Select-Object PDCEmulator, RIDMaster, InfrastructureMaster
Get-ADForest  | Select-Object SchemaMaster, DomainNamingMaster

# Graceful transfer to a healthy DC (use when the current holder is alive).
Move-ADDirectoryServerOperationMasterRole `
    -Identity "DC2-EastUS" `
    -OperationMasterRole PDCEmulator, RIDMaster, InfrastructureMaster

Seizing is the destructive emergency operation when the holder is permanently dead and will never come back:

# SEIZE only when the old holder is gone for good. Add -Force.
Move-ADDirectoryServerOperationMasterRole `
    -Identity "DC2-EastUS" `
    -OperationMasterRole SchemaMaster, DomainNamingMaster, PDCEmulator, RIDMaster, InfrastructureMaster `
    -Force

Hard rule: a DC whose role was seized must never come back online — a returning holder that still thinks it owns the role causes a duplicate-role split brain. Wipe it. To decommission a live DC, run Uninstall-ADDSDomainController (demotion) so it cleans up its own metadata; reserve ntdsutil metadata cleanup for a DC that died without demoting.
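
For the graceful path, a demotion could be sketched like this. It assumes any FSMO roles have already been transferred off the DC and that it is not the last DC in the domain.

```powershell
# On the DC being retired: dry-run, then demote. It reboots as a member server.
$localPwd = Read-Host -AsSecureString -Prompt "New local Administrator password"

Test-ADDSDomainControllerUninstallation -LocalAdministratorPassword $localPwd |
    Format-List Context, Status, Message

Uninstall-ADDSDomainController `
    -LocalAdministratorPassword $localPwd `
    -Force
```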

7. Group Policy design

GPO is where OU design pays off. Processing order is L-S-D-OU: Local, Site, Domain, then OU (nested OUs apply parent-to-child). Last writer wins, so a GPO on a deep OU overrides a domain-linked one — unless you set Enforced on the higher link, which inverts precedence and defeats Block Inheritance. Principles I hold to:

# Create the Central Store and seed it with the local ADMX/ADML files.
$cs = "\\corp.contoso.com\SYSVOL\corp.contoso.com\Policies\PolicyDefinitions"
New-Item -Path $cs -ItemType Directory -Force
Copy-Item "$env:WINDIR\PolicyDefinitions\*" -Destination $cs -Recurse -Force

# Create a GPO, disable its user half (computer settings only), and link it to an OU.
$gpo = New-GPO -Name "Servers - Security Baseline"
$gpo.GpoStatus = 'UserSettingsDisabled'
$gpo | New-GPLink -Target "OU=Servers,DC=corp,DC=contoso,DC=com" -LinkEnabled Yes
# Populate the actual settings with Set-GPRegistryValue or GPMC.
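
To see how L-S-D-OU ordering, Enforced, and Block Inheritance resolve at a given OU, the GroupPolicy module can report the effective link list. A minimal sketch, in which "Domain - Security Floor" is a hypothetical GPO name:

```powershell
$ou = "OU=Servers,DC=corp,DC=contoso,DC=com"

# Effective link order at the OU, including inherited and enforced links.
$inh = Get-GPInheritance -Target $ou
$inh.GpoInheritanceBlocked
$inh.InheritedGpoLinks | Select-Object Order, DisplayName, Enabled, Enforced

# Enforce a domain-level link so lower OUs cannot override or block it.
# "Domain - Security Floor" is a hypothetical GPO name.
Set-GPLink -Name "Domain - Security Floor" -Target "DC=corp,DC=contoso,DC=com" -Enforced Yes
```

In the InheritedGpoLinks output, the link with the lowest Order value has the highest precedence.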

8. Operational hardening

This separates a DC that runs for years from one that quietly rots.

Time sync — the Azure-specific gotcha. Kerberos breaks beyond a 5-minute skew. The PDC Emulator of the forest root domain is the authoritative time source; every other DC and member syncs down from it. On Azure VMs the default is to sync from the host via the Hyper-V time provider (VMICTimeProvider), which fights the AD hierarchy. Point the PDC at a reliable external NTP source and disable host time integration so the AD hierarchy stays authoritative.

# On the forest-root PDC Emulator: authoritative external time.
w32tm /config /manualpeerlist:"time.windows.com,0x9" /syncfromflags:manual /reliable:yes /update
Restart-Service w32time
w32tm /resync /rediscover
w32tm /query /status   # confirm Source is the external peer, not "Local CMOS Clock"
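
Disabling host time integration is a registry change on the Windows Time service. The following is a sketch of that step, assuming you apply the registry change on each DC and the domain-hierarchy setting only on DCs that are not the forest-root PDC Emulator:

```powershell
# Stop the Hyper-V host time provider from competing with the AD hierarchy.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\VMICTimeProvider'
Set-ItemProperty -Path $key -Name Enabled -Value 0

# On every DC except the forest-root PDC Emulator: follow the domain hierarchy.
w32tm /config /syncfromflags:domhier /update

Restart-Service w32time
w32tm /query /source   # expect the external peer on the PDC, a DC name elsewhere
```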

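Secure LDAP (LDAPS). Plain LDAP on 389 sends simple binds in the clear. Issue a certificate from your enterprise CA to each DC with the Server Authentication EKU and the DC’s FQDN; the DC auto-binds it to 636. Then enforce LDAP channel binding and signing, which Microsoft has recommended since its 2020 LDAP hardening guidance (those updates added audit events and policy settings; they did not turn enforcement on for you). Audit first: Directory Service events 2887 and 2889 flag unsigned and simple binds, and event 3039 flags channel-binding failures. Only then enforce, so legacy simple-bind clients on 389 are caught before you break them.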
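
The audit phase can be sketched as below, run on each DC. The host name in the last line is a hypothetical DC name; the registry value and event IDs are the standard ones for LDAP interface logging.

```powershell
# Raise LDAP interface logging so event 2889 names each unsigned/simple-bind client.
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics' `
    -Name '16 LDAP Interface Events' -Value 2

# 2887 = daily summary count, 2889 = per-client detail (Directory Service log).
Get-WinEvent -FilterHashtable @{ LogName = 'Directory Service'; Id = 2887, 2889 } `
    -MaxEvents 50 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, Message

# Confirm the DC answers on 636 once the certificate is in place.
Test-NetConnection -ComputerName "dc1.corp.contoso.com" -Port 636
```

Set the logging value back to 0 once the client inventory is complete, since level 2 is noisy on a busy DC.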

Backup and restore. Back up System State on at least one DC per domain (Windows Server Backup or Azure Backup, which supports System State). Know the difference between an authoritative restore (ntdsutil marks objects/subtree authoritative to roll a deletion back across the domain) and a non-authoritative one (the DC catches up via replication). Treat the AD Recycle Bin as the first line for object recovery — enable it forest-wide if you have not:

Enable-ADOptionalFeature 'Recycle Bin Feature' `
    -Scope ForestOrConfigurationSet `
    -Target "corp.contoso.com"   # IRREVERSIBLE once enabled
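
Checking the feature state and restoring an object from the Recycle Bin might look like this; "jdoe" is a hypothetical account name used only for illustration.

```powershell
# Is the feature on? A non-empty EnabledScopes means yes.
Get-ADOptionalFeature -Filter "Name -eq 'Recycle Bin Feature'" |
    Select-Object Name, EnabledScopes

# Find a deleted user and restore it to its last known parent.
Get-ADObject -Filter 'isDeleted -eq $true -and sAMAccountName -eq "jdoe"' `
    -IncludeDeletedObjects -Properties sAMAccountName, lastKnownParent |
    Restore-ADObject
```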

Do not snapshot a DC for “backup.” Reverting a DC VM snapshot reintroduces old USNs and triggers a USN rollback that quarantines the DC. Use System State backup, or the safety net of multiple DCs plus the Recycle Bin. (Modern DCs have VM-GenerationID protection for the worst snapshot cases, but do not lean on it.)
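
A System State backup with the in-box tooling can be sketched as follows. E: is a hypothetical dedicated backup volume; it should not be a volume that holds the OS or the AD database.

```powershell
# Windows Server Backup provides wbadmin.
Install-WindowsFeature Windows-Server-Backup

# Back up System State to a dedicated volume. E: is a hypothetical backup target.
wbadmin start systemstatebackup -backupTarget:E: -quiet

# List available backup versions to confirm it completed.
wbadmin get versions
```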

Enterprise scenario

A retail platform team extended an on-prem corp.contoso.com forest into Azure East US over ExpressRoute, promoting two replica DCs into a new Azure-EastUS site. Promotion passed, dcdiag was clean — but within a day the help desk reported intermittent slow logons and a spike in cross-region ExpressRoute egress on the billing export. Replication was healthy; the problem was client site affinity.

The gotcha: they had created the AD site and the 10.50.0.0/16 subnet for the DC subnet, but the application spoke VNets sat in 10.60.0.0/16 and 10.70.0.0/16, which mapped to no AD site. Every member VM in those spokes fell back to a random DC — frequently the on-prem PDC across the WAN — because the DC Locator had no subnet-to-site mapping to steer it local. nltest /dsgetdc:corp.contoso.com from a spoke VM returned an on-prem DC, confirming it.

The fix was to register every spoke CIDR (including future-reserved ranges) to the correct site, then force the clients to re-locate:

# Map all spoke subnets to the Azure site so DC Locator stays in-region.
New-ADReplicationSubnet -Name "10.60.0.0/16" -Site "Azure-EastUS"
New-ADReplicationSubnet -Name "10.70.0.0/16" -Site "Azure-EastUS"

# On affected members: drop the cached locator and confirm a local DC.
nltest /dsgetdc:corp.contoso.com /force

Logons normalized immediately and ExpressRoute egress dropped back to baseline. The lesson: AD site coverage must track the whole VNet topology, spokes included — not just the subnet your DCs happen to live in.

Verify

Run these after promotion and after any topology change.

# Overall DC health: replication, FSMO, SYSVOL, DNS, services.
dcdiag /v /c /e
dcdiag /test:dns /e /v

# Replication status across all partners; look for 0 failures and recent times.
repadmin /replsummary
repadmin /showrepl
repadmin /replicate DC2-EastUS DC1-EastUS "DC=corp,DC=contoso,DC=com"

# SYSVOL / DFS-R health (modern forests use DFS-R, not FRS).
dfsrdiag ReplicationState

# Confirm FSMO placement and that all DCs are Global Catalogs.
netdom query fsmo
Get-ADDomainController -Filter * | Select Name, Site, IsGlobalCatalog

# Time hierarchy is sane on a member/DC.
w32tm /query /status

Expect repadmin /replsummary to show zero failures and last-success timestamps within your site-link frequency. Any dcdiag failure on Connectivity, Replications, NetLogons, or DNS is a stop-ship.
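
The same expectation can be checked with the ActiveDirectory module in a form that is easy to schedule. This is a sketch that assumes the 15-minute site-link interval from Step 5 and allows some slack; it inspects only the default domain partition.

```powershell
# Assumption: 15-minute site-link interval from Step 5, with some slack.
$maxAge = New-TimeSpan -Minutes 30

Get-ADReplicationPartnerMetadata -Target "corp.contoso.com" -Scope Domain |
    Where-Object {
        $_.LastReplicationResult -ne 0 -or
        ((Get-Date) - $_.LastReplicationSuccess) -gt $maxAge
    } |
    Select-Object Server, Partner, LastReplicationSuccess,
                  LastReplicationResult, ConsecutiveReplicationFailures
```

No output means every inbound partner replicated successfully within the window.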

Checklist

Pitfalls and next steps

The recurring failures I see with AD DS on Azure IaaS: VNet left on Azure-provided DNS so nothing joins; DCs on B-series VMs stalling LSASS; data disk left on default write caching, risking the DIT; subnets not mapped to sites so clients authenticate cross-region; and the catastrophic one, restoring a DC from a VM snapshot and triggering USN rollback. Each is avoidable with the steps above.

Next, layer in tiered administration (a tier-0 OU with PAWs), Microsoft LAPS for member local-admin passwords, and an ExpressRoute path with proper RPC port allowances if these DCs extend an on-prem forest. Then monitor replication continuously (repadmin /replsummary on a schedule into your SIEM) rather than discovering divergence during an incident.

