Highly Available DNS and DHCP on Windows Server, End to End

DNS and DHCP are the two services nobody notices until they break, at which point the whole estate appears to be on fire. This is the build I use for production: two domain controllers running AD-integrated DNS, a pair of DHCP servers in a failover relationship, and the scavenging, conditional-forwarding, and dynamic-update plumbing that keeps records clean without ever deleting a live host.

Scope: this assumes a single AD domain (contoso.local) with two domain controllers, dc01 (10.10.0.10) and dc02 (10.10.0.11), each also running the DNS Server role. DHCP runs on the same two boxes. Everything is done in PowerShell so it is repeatable and reviewable. Run the DNS cmdlets from the DnsServer module and DHCP cmdlets from the DhcpServer module; both ship with the respective RSAT/role features.

1. AD-integrated vs. primary/secondary, and replication scope

If your DNS servers are domain controllers, use Active Directory-integrated zones and stop thinking about primary/secondary. The reasons are not stylistic:

Multi-master writes. Every DC hosting the zone is writable. A classic primary/secondary setup has one writable primary; lose it and dynamic updates stop until you promote a secondary. AD-integrated zones have no single writable node.
Replication rides AD. The zone lives in a directory partition and replicates over the existing DC replication topology with the same security and compression. You do not configure zone transfers between DCs.
Secure dynamic updates. Only AD-integrated zones support Secure dynamic updates, which gate record writes with ACLs (Step 4). This is the whole reason scavenging and DHCP registration stay sane.

The decision that actually matters is replication scope — which partition the zone lands in:

Scope	Replicates to	Use when
`Forest`	All DNS servers on DCs in the forest	Forest-wide name (e.g. a `_msdcs` root, shared infra zone)
`Domain`	All DNS servers on DCs in this domain	The default and right answer for a domain’s own zone
`Legacy`	All DCs in the domain (Windows 2000 partition)	Never, on a modern forest
`Custom`	DCs enlisted in a named app partition	You explicitly want a subset of DCs to host it

Create the forward and reverse zones domain-wide:

# Forward lookup zone, AD-integrated, domain-wide replication, secure updates only
Add-DnsServerPrimaryZone -Name "contoso.local" `
  -ReplicationScope "Domain" `
  -DynamicUpdate "Secure" `
  -ComputerName dc01

# Reverse lookup zone for 10.10.0.0/24
Add-DnsServerPrimaryZone -NetworkId "10.10.0.0/24" `
  -ReplicationScope "Domain" `
  -DynamicUpdate "Secure" `
  -ComputerName dc01

You only run this on one DC. Replication delivers the zone to dc02 automatically; confirm with Get-DnsServerZone -ComputerName dc02 after convergence.
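A minimal sketch of that convergence check, assuming the dc01/dc02 names used in this article: it asks each DC for the forward zone and prints the properties that matter here. If dc02 has not received the zone yet, Get-DnsServerZone returns an error for that server.

```powershell
# Ask each DC for the zone and show where it replicates and how updates are gated
foreach ($dc in 'dc01', 'dc02') {
    Get-DnsServerZone -Name 'contoso.local' -ComputerName $dc |
      Select-Object @{n='Server';e={$dc}}, ZoneName, IsDsIntegrated, ReplicationScope, DynamicUpdate
}
```

Both rows should show IsDsIntegrated True, ReplicationScope Domain, and DynamicUpdate Secure.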

-DynamicUpdate "Secure" is non-negotiable for any zone DHCP writes into. NonsecureAndSecure lets any host overwrite any record, which is both a cleanup nightmare and a spoofing vector.

2. Forwarders, conditional forwarders, and root hints

Three different mechanisms, frequently confused. Get them straight:

Forwarders are where a DNS server sends queries it is not authoritative for — typically your ISP or a cloud resolver. This is your default outbound path for general internet resolution.
Conditional forwarders override that for a specific domain. “Anything for partner.example goes to these nameservers” — the standard tool for resolving a partner domain or an Azure private DNS zone across a VPN.
Root hints are the fallback when no forwarder answers: the server walks the public root servers itself. Keep them present, but with forwarders configured they are a safety net, not the primary path.

Set general forwarders (do not point a DC at itself or at the other DC as a forwarder — that creates resolution loops; forwarders are for external resolvers):

# General forwarders -> a public resolver pair (use your approved upstreams)
Set-DnsServerForwarder -IPAddress 1.1.1.1, 9.9.9.9 -UseRootHint $true -ComputerName dc01
Set-DnsServerForwarder -IPAddress 1.1.1.1, 9.9.9.9 -UseRootHint $true -ComputerName dc02

Add a conditional forwarder, and make it AD-integrated so it replicates to both DCs instead of being configured twice:

Add-DnsServerConditionalForwarderZone -Name "partner.example" `
  -MasterServers 192.0.2.53, 192.0.2.54 `
  -ReplicationScope "Forest" `
  -ComputerName dc01

-MasterServers is the list of authoritative servers for that domain. -ReplicationScope "Forest" (or Domain) makes it an AD object; omit it and you get a server-local forwarder you must replicate by hand. Verify root hints are intact with Get-DnsServerRootHint.
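Before relying on any of these paths, it helps to confirm that each upstream actually answers from the DC itself. The sketch below is one way to do that with Resolve-DnsName; the test name example.com is a placeholder assumption, and the addresses are the ones used in the examples above.

```powershell
# Map each upstream to a name it should be able to resolve
$checks = @{
    '1.1.1.1'    = 'example.com'      # general forwarder (placeholder test name)
    '9.9.9.9'    = 'example.com'      # general forwarder (placeholder test name)
    '192.0.2.53' = 'partner.example'  # conditional forwarder target
    '192.0.2.54' = 'partner.example'  # conditional forwarder target
}

foreach ($ip in $checks.Keys) {
    $name = $checks[$ip]
    try {
        # -DnsOnly keeps other name providers out of the test
        Resolve-DnsName -Name $name -Server $ip -DnsOnly -ErrorAction Stop | Out-Null
        Write-Output "$ip answered for $name"
    } catch {
        Write-Warning "$ip did not answer for ${name}: $($_.Exception.Message)"
    }
}
```

Run it on each DC, since the point is to test the outbound path from the DNS server, not from an admin workstation.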

3. Aging and scavenging without deleting live records

Scavenging deletes stale dynamic records. Done wrong, it deletes records people are still using and you spend an afternoon explaining why a file server “disappeared.” The mechanism is two intervals plus a server-level sweep:

No-refresh interval (default 7 days): after a record’s timestamp is written, the server refuses to re-stamp it. This exists to suppress AD write churn — without it every renewal would replicate.
Refresh interval (default 7 days): the window after no-refresh during which the record can be refreshed and re-stamped.
A record becomes eligible for scavenging only after no-refresh + refresh have both elapsed — 14 days by default — with no successful refresh in the refresh window.

The rule that keeps you out of trouble: no-refresh + refresh must be >= your DHCP lease duration. A client refreshes its DNS record at lease renewal (50% of lease). If the combined interval is shorter than the lease, a live machine that renews normally can still age out between renewals. With an 8-day lease, 4 + 4 is safe; with the default 7 + 7 = 14, never use a lease longer than 14 days.
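The arithmetic is easy to get backwards, so here is a small sketch that encodes it. Test-AgingAgainstLease is a hypothetical helper, not a DnsServer cmdlet; it treats a configuration as safe when no-refresh + refresh is at least the lease duration, which is the condition under which a record cannot age out while its lease is still valid.

```powershell
# Hypothetical helper: compares the aging intervals with the DHCP lease duration
function Test-AgingAgainstLease {
    param(
        [Parameter(Mandatory)] [TimeSpan] $LeaseDuration,
        [Parameter(Mandatory)] [TimeSpan] $NoRefreshInterval,
        [Parameter(Mandatory)] [TimeSpan] $RefreshInterval
    )
    # A record becomes eligible for scavenging once both intervals have elapsed
    $staleAfter = $NoRefreshInterval + $RefreshInterval
    [pscustomobject]@{
        LeaseDuration = $LeaseDuration
        RenewalAt     = [TimeSpan]::FromTicks([long]($LeaseDuration.Ticks / 2))  # clients renew at 50% of lease
        StaleAfter    = $staleAfter
        Safe          = ($staleAfter -ge $LeaseDuration)
    }
}

# 8-day lease with 4 + 4: the record goes stale no earlier than the lease expires
Test-AgingAgainstLease -LeaseDuration '8.00:00:00' -NoRefreshInterval '4.00:00:00' -RefreshInterval '4.00:00:00'

# 30-day lease with the default 7 + 7: the record goes stale on day 14, before the first renewal on day 15
Test-AgingAgainstLease -LeaseDuration '30.00:00:00' -NoRefreshInterval '7.00:00:00' -RefreshInterval '7.00:00:00'
```

The first call reports Safe as True and the second as False.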

Scavenging has to be enabled in two places, and forgetting the second is the most common reason “scavenging is on but nothing gets cleaned”:

# 1. Per-zone aging: enable, set the two intervals
Set-DnsServerZoneAging -Name "contoso.local" -Aging $true `
  -NoRefreshInterval 4.00:00:00 `
  -RefreshInterval 4.00:00:00 `
  -ComputerName dc01

# 2. Server-level scavenging: the sweep that actually deletes, plus how often it runs
#    (aging is enabled per zone above, so -ApplyOnAllZones is not used; it is a switch, not a Boolean)
Set-DnsServerScavenging -ScavengingState $true `
  -ScavengingInterval 7.00:00:00 `
  -ComputerName dc01

TimeSpan strings are days.hours:minutes:seconds, so 4.00:00:00 is four days. Enable server-level scavenging on one DC only at first; a single scavenging server avoids two DCs racing to delete the same records. The intervals on the zone replicate; the server ScavengingState is per-server.

Before you trust it, force a pass and watch the DNS event log rather than waiting a week:

Start-DnsServerScavenging -ComputerName dc01 -Force -Verbose
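Before forcing that pass, it is worth previewing which records are already past the combined interval. The following is a sketch under the assumption that static records carry no timestamp and are therefore skipped; it reads the intervals from the zone and deletes nothing.

```powershell
# Read the intervals configured on the zone and work out the staleness cutoff
$aging  = Get-DnsServerZoneAging -Name 'contoso.local' -ComputerName dc01
$cutoff = (Get-Date) - ($aging.NoRefreshInterval + $aging.RefreshInterval)

# Dynamic A records whose timestamp is older than the cutoff are scavenging candidates
Get-DnsServerResourceRecord -ZoneName 'contoso.local' -RRType A -ComputerName dc01 |
  Where-Object { $_.Timestamp -and $_.Timestamp -lt $cutoff } |
  Sort-Object Timestamp |
  Select-Object HostName, Timestamp, @{n='IP';e={ $_.RecordData.IPv4Address }}
```

If a host you know is alive appears in this list, fix the intervals or the client's registration before turning the sweep on.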

4. Secure dynamic updates and cleaning up stale records

With Secure updates, the host (or DHCP, acting on its behalf) that first registers a record becomes its owner via an ACL. Only the owner can update it. This is great until DHCP failover enters the picture (Step 6): if each DHCP server registers under its own machine account, the partner cannot update the other’s records and you accumulate stale, un-updatable duplicates.

The fix is a dedicated, low-privilege service account used by both DHCP servers for dynamic updates, so a single identity owns every DHCP-registered record:

# Create a plain user account with no special rights; it only needs to own DNS records
New-ADUser -Name "svc-dhcp-dnsupdate" `
  -SamAccountName "svc-dhcp-dnsupdate" `
  -AccountPassword (Read-Host -AsSecureString "Password") `
  -Enabled $true `
  -PasswordNeverExpires $true `
  -CannotChangePassword $true

Then point both DHCP servers at it. The credential is stored per server, so the command has to target each DHCP server:

$cred = Get-Credential "contoso\svc-dhcp-dnsupdate"
Set-DhcpServerDnsCredential -Credential $cred -ComputerName dc01
Set-DhcpServerDnsCredential -Credential $cred -ComputerName dc02

Both failover partners must use the same DNS credential. Mismatched (or absent) credentials are the textbook cause of “DHCP works but half my hosts have wrong/stale DNS records.”
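A quick way to confirm the two partners agree is to read the stored credential back from each one. This is a sketch that assumes the returned object exposes UserName and DomainName properties; the password is never returned.

```powershell
# Compare the dynamic-update identity configured on each failover partner
foreach ($server in 'dc01', 'dc02') {
    $c = Get-DhcpServerDnsCredential -ComputerName $server
    [pscustomobject]@{
        Server = $server
        Domain = $c.DomainName
        User   = $c.UserName
    }
}
```

Both rows should name svc-dhcp-dnsupdate. An empty user on either server means that partner is still registering under its own machine account.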

Finding and removing stale duplicates

To audit duplicate A records (multiple hosts on one IP, or one host with several IPs from churn), pull the resource records and group:

# A records whose IP is shared by more than one name -> likely stale duplicates
Get-DnsServerResourceRecord -ZoneName "contoso.local" -RRType "A" -ComputerName dc01 |
  Where-Object { $_.RecordData.IPv4Address } |
  Group-Object { $_.RecordData.IPv4Address.IPAddressToString } |
  Where-Object Count -gt 1 |
  Sort-Object Count -Descending |
  Format-Table Name, Count, @{n='Hosts';e={ ($_.Group.HostName) -join ', ' }} -AutoSize

Remove a confirmed-dead record explicitly rather than waiting for scavenging:

$rr = Get-DnsServerResourceRecord -ZoneName "contoso.local" -Name "oldhost" -RRType "A" -ComputerName dc01
Remove-DnsServerResourceRecord -ZoneName "contoso.local" -InputObject $rr -ComputerName dc01 -Force

Always inspect the grouped output before deleting. Scavenging is the bulk safety mechanism; manual removal is for known offenders.

5. DHCP scopes, reservations, options, and policies

Authorize the servers in AD first — an unauthorized Windows DHCP server will not hand out leases:

Add-DhcpServerInDC -DnsName "dc01.contoso.local" -IPAddress 10.10.0.10
Add-DhcpServerInDC -DnsName "dc02.contoso.local" -IPAddress 10.10.0.11

Create a scope and set options. Lease duration here is 8 days to satisfy the scavenging rule from Step 3:

Add-DhcpServerv4Scope -Name "LAN-10.10.0.0" `
  -StartRange 10.10.0.50 -EndRange 10.10.0.250 `
  -SubnetMask 255.255.255.0 `
  -LeaseDuration 8.00:00:00 `
  -State Active -ComputerName dc01

# Scope-level options: gateway (003), DNS servers (006), DNS domain (015)
Set-DhcpServerv4OptionValue -ScopeId 10.10.0.0 `
  -Router 10.10.0.1 `
  -DnsServer 10.10.0.10, 10.10.0.11 `
  -DnsDomain "contoso.local" `
  -ComputerName dc01

Reservations pin an IP to a MAC for servers and printers that need a stable address but still benefit from centrally managed options:

Add-DhcpServerv4Reservation -ScopeId 10.10.0.0 `
  -IPAddress 10.10.0.60 `
  -ClientId "AA-BB-CC-11-22-33" `
  -Name "printer-floor2" `
  -ComputerName dc01

Policy-based assignment lets one scope hand different options to different device classes — e.g. give VoIP phones (matched by MAC OUI / vendor class) a different gateway or a dedicated address range:

# Policy matching a vendor's MAC prefix; matching clients get a different gateway
# (-Condition is mandatory; the operator and the pattern are separate array elements)
Add-DhcpServerv4Policy -Name "VoIP-Phones" -ScopeId 10.10.0.0 `
  -Condition OR -MacAddress EQ,"AABBCC*" -ComputerName dc01
Set-DhcpServerv4OptionValue -PolicyName "VoIP-Phones" -ScopeId 10.10.0.0 `
  -Router 10.10.0.2 -ComputerName dc01
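Reading the configuration back is a cheap check that the scope, options, reservation, and policy all landed as intended. A minimal sketch, using the scope ID and policy name from the examples above:

```powershell
$scope = '10.10.0.0'

# Scope range, lease duration, and state
Get-DhcpServerv4Scope -ScopeId $scope -ComputerName dc01 |
  Format-List Name, StartRange, EndRange, LeaseDuration, State

# Scope-level options (expect 3, 6, and 15)
Get-DhcpServerv4OptionValue -ScopeId $scope -ComputerName dc01 |
  Format-Table OptionId, Name, Value -AutoSize

# Reservations and policies defined on the scope
Get-DhcpServerv4Reservation -ScopeId $scope -ComputerName dc01
Get-DhcpServerv4Policy -ScopeId $scope -ComputerName dc01

# Options that apply only to clients matched by the policy
Get-DhcpServerv4OptionValue -ScopeId $scope -PolicyName 'VoIP-Phones' -ComputerName dc01
```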

6. DHCP failover: load-balance vs. hot-standby and MCLT

DHCP failover replicates lease and scope data between two servers so either can serve clients. It is a per-IPv4-scope relationship (IPv6 is not supported). Two modes:

Load balance (default): both servers actively lease, split by LoadBalancePercent (commonly 50/50). Best for a single site where both servers are healthy and you want active/active.
Hot standby: one server (Active) leases; the partner (Standby) waits and takes over on failure. ReservePercent reserves a slice of the pool for the standby to lease during the MCLT window before it assumes the full pool. Best for a branch whose standby lives in another site.

The parameter that confuses everyone is MCLT (Maximum Client Lead Time). It is not the lease time. MCLT governs three things:

The temporary lease length a server grants when it has lost contact with its partner (Communication Interrupted state).
How long a server waits in Partner Down before it claims 100% of the address pool.
How long an address is held back before it can be reassigned to a new client after the partner owned it.

Smaller MCLT = faster takeover but more replication overhead in normal operation; larger MCLT = less overhead but a longer delay before the survivor controls the whole pool. An hour is a reasonable middle ground for most LANs; very latency-sensitive shops go lower.

Create a load-balance relationship across both DCs (run once; it configures both ends):

Add-DhcpServerv4Failover -Name "LAN-Failover" `
  -PartnerServer dc02.contoso.local `
  -ScopeId 10.10.0.0 `
  -LoadBalancePercent 50 `
  -MaxClientLeadTime 01:00:00 `
  -AutoStateTransition $true `
  -StateSwitchInterval 01:00:00 `
  -SharedSecret "UseAStrongSecretHere" `
  -ComputerName dc01

For hot standby instead, swap the mode-specific parameters (ServerRole + ReservePercent replace LoadBalancePercent):

Add-DhcpServerv4Failover -Name "Branch-Failover" `
  -PartnerServer dc02.contoso.local `
  -ScopeId 10.10.0.0 `
  -ServerRole Active `
  -ReservePercent 5 `
  -MaxClientLeadTime 01:00:00 `
  -AutoStateTransition $true `
  -SharedSecret "UseAStrongSecretHere" `
  -ComputerName dc01

-AutoStateTransition $true with -StateSwitchInterval lets a server automatically move from Communication Interrupted to Partner Down after the interval, instead of waiting for an admin. Add scopes to an existing relationship later with Add-DhcpServerv4FailoverScope.
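The two timers stack, which is easy to miss. The sketch below uses a hypothetical helper, Get-FailoverTakeoverTimeline, to lay out the sequence described above: contact is lost, the survivor moves to Partner Down after StateSwitchInterval, and it claims the whole pool after a further MCLT. It is a simplification that assumes the survivor notices the loss of contact immediately.

```powershell
# Hypothetical helper: when does the surviving server take over the full pool?
function Get-FailoverTakeoverTimeline {
    param(
        [Parameter(Mandatory)] [datetime] $ContactLost,
        [Parameter(Mandatory)] [timespan] $StateSwitchInterval,
        [Parameter(Mandatory)] [timespan] $MaxClientLeadTime
    )
    # Automatic transition out of Communication Interrupted
    $partnerDown = $ContactLost + $StateSwitchInterval
    [pscustomobject]@{
        CommunicationInterrupted = $ContactLost
        PartnerDown              = $partnerDown
        FullPoolClaimed          = $partnerDown + $MaxClientLeadTime  # MCLT wait inside Partner Down
    }
}

# Values from the load-balance example: one-hour switch interval, one-hour MCLT
$timeline = Get-FailoverTakeoverTimeline -ContactLost '2024-01-01T09:00:00' `
  -StateSwitchInterval '01:00:00' -MaxClientLeadTime '01:00:00'
$timeline | Format-List
```

With those values the survivor enters Partner Down at 10:00 and controls the full pool at 11:00, two hours after the failure.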

7. Diagnostics: Resolve-DnsName, nslookup, and the analytic log

Resolve-DnsName is the modern, scriptable resolver. Use -Server to test a specific DNS server (vital when verifying both DCs agree) and -DnsOnly to bypass other name providers:

Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.10 -Type A
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.11 -Type A   # confirm dc02 matches
Resolve-DnsName -Name partner.example -Server 10.10.0.10              # exercises the conditional forwarder

nslookup is still useful for interactive, low-level checks (and for proving the forwarder path):

nslookup
> server 10.10.0.10
> set type=srv
> _ldap._tcp.dc._msdcs.contoso.local

The DNS Analytic log is ETW-based, off by default, and the best tool for “who is querying what.” Audit events are on already; turn analytic on only when investigating, because at very high QPS it has measurable cost:

# Inspect current diagnostic settings
Get-DnsServerDiagnostics -ComputerName dc01

# Enable debug-log detail for queries and answers (this is the debug log, not the analytic channel; audit is already on)
Set-DnsServerDiagnostics -ComputerName dc01 `
  -Queries $true -Answers $true -ReceivePackets $true -SendPackets $true

The analytic channel itself (Microsoft-Windows-DNSServer/Analytical) is enabled via Event Viewer (DNS-Server node, Show Analytic and Debug Logs) or wevtutil, then read with ETW consumers. Disable it again once you have your answer.
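For the wevtutil route, a minimal sketch follows. It assumes an elevated session on the DNS server itself, and that the channel is disabled before reading so the captured trace is stable; adjust the event count to taste.

```powershell
$log = 'Microsoft-Windows-DNSServer/Analytical'

# Turn the analytic channel on (/q suppresses the confirmation prompt)
wevtutil.exe sl $log /e:true /q:true

# ... reproduce the problem you are investigating ...

# Turn it off again, then read what was captured
wevtutil.exe sl $log /e:false
Get-WinEvent -LogName $log -Oldest -MaxEvents 50 |
  Select-Object TimeCreated, Id, Message
```

Get-WinEvent needs -Oldest for analytic and debug logs, because they can only be read in forward chronological order.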

Enterprise scenario

A retail client ran load-balance DHCP failover (50/50) across two DCs in the same datacenter. After a switch upgrade, the two DCs ended up on opposite sides of a firewall pair that NAT-ed nothing but did stateful inspection on the failover channel (TCP 647). Failover went Communication Interrupted, then both sides independently hit Partner Down and each started serving the full /23 pool. For about 40 minutes neither server knew the other was leasing, so both handed out addresses from the same ranges. Result: duplicate IP assignments, gratuitous-ARP conflicts, and a wave of clients dropping off the network.

Root cause was the firewall silently dropping idle 647 sessions after 30 minutes, well under our MaxClientLeadTime of one hour, so the survivor claimed the pool before the partner could re-establish. The real fix was a firewall exception for the channel, but we also stopped trusting auto-transition to paper over a flapping link:

# Don't auto-jump to Partner Down on a flaky channel; require a human for full-pool takeover
Set-DhcpServerv4Failover -Name "LAN-Failover" `
  -AutoStateTransition $false `
  -MaxClientLeadTime 01:00:00 `
  -ComputerName dc01

The durable lesson: in load-balance mode an unreliable failover channel is worse than no failover, because both nodes confidently lease the same addresses. Either guarantee the channel (firewall rule, dedicated path) or switch that scope to hot-standby with a ReservePercent, where only the active node owns the bulk of the pool.
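A point-in-time reachability check of the failover port would not have caught the idle-timeout behavior in this incident, but it does rule out a plain block quickly. A sketch, assuming it is run on dc01 and then repeated from dc02 against dc01:

```powershell
# Can this server open the DHCP failover channel (TCP 647) to its partner?
$check = Test-NetConnection -ComputerName dc02.contoso.local -Port 647
if ($check.TcpTestSucceeded) {
    Write-Output 'TCP 647 to dc02 is reachable'
} else {
    Write-Warning 'TCP 647 to dc02 is blocked or the DHCP service is not listening'
}
```

To catch sessions that die when idle, schedule the failover state poll from the monitoring section instead of relying on a single probe.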

Verify

Run these after the build; every one should pass before you call it done.

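# Zones exist and replicated to both DCs, secure updates on
Get-DnsServerZone -ComputerName dc01 | Where-Object IsDsIntegrated | Select-Object ZoneName, ReplicationScope, DynamicUpdate
Get-DnsServerZone -ComputerName dc02 | Where-Object IsDsIntegrated | Select-Object ZoneName, ReplicationScope, DynamicUpdate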

# Aging/scavenging settings on the zone
Get-DnsServerZoneAging -Name "contoso.local" -ComputerName dc01

# Forwarders and conditional forwarders present
Get-DnsServerForwarder -ComputerName dc01
Get-DnsServerZone -ComputerName dc01 | Where-Object ZoneType -eq "Forwarder"

# DHCP authorized in AD
Get-DhcpServerInDC

# Failover relationship healthy on BOTH partners (state should be "Normal")
Get-DhcpServerv4Failover -ComputerName dc01
Get-DhcpServerv4Failover -ComputerName dc02

# Both DNS servers resolve a known host identically
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.10
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.11

A healthy failover relationship reports State : Normal on both servers. If one says Normal and the other says Communication Interrupted, the relationship is not actually redundant — investigate connectivity and the shared secret before trusting it.

Monitoring and a failover test runbook

Alert on the things that signal real trouble, not noise. At minimum:

DHCP failover state != Normal on either partner (poll Get-DhcpServerv4Failover; alert on Communication Interrupted or Partner Down).
Scope utilization above ~85% (Get-DhcpServerv4ScopeStatistics) — a pool about to exhaust is an outage waiting to happen, doubly so during a partner-down period.
DHCP service (DHCPServer) and DNS service (DNS) stopped.
DNS scavenging deletions spiking — sudden large deletes in the DNS event log usually mean an interval is misconfigured against your lease.
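The first two alerts can be polled with a few lines. This is a sketch, not a monitoring product: the 85 percent threshold and the use of Write-Warning as the alert sink are assumptions, and in practice the output would feed whatever alerting system you already run.

```powershell
$servers   = 'dc01', 'dc02'   # both failover partners
$threshold = 85               # percent of a scope in use

foreach ($server in $servers) {
    # Any relationship that is not Normal as seen from this partner
    Get-DhcpServerv4Failover -ComputerName $server |
      Where-Object State -ne 'Normal' |
      ForEach-Object { Write-Warning "${server}: failover '$($_.Name)' is $($_.State)" }

    # Scopes close to exhaustion
    Get-DhcpServerv4ScopeStatistics -ComputerName $server |
      Where-Object PercentageInUse -gt $threshold |
      ForEach-Object {
          Write-Warning "${server}: scope $($_.ScopeId) is $([math]::Round($_.PercentageInUse))% used"
      }
}
```

Polling both partners matters: the state is reported per server, and the two views can disagree.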

A failover test you can actually run during a maintenance window:

Record baseline: Get-DhcpServerv4Failover (expect Normal both sides) and Get-DhcpServerv4ScopeStatistics on both.
Stop DHCP on dc01: Stop-Service DHCPServer (or take the box down to simulate a real failure).
Confirm dc02 keeps leasing — release/renew on a test client (ipconfig /release then /renew) and confirm it gets a valid address with correct options.
After the configured state switch interval, confirm dc02 reports Partner Down; once a further MCLT has elapsed, confirm it is serving the full pool.
Restore dc01: Start-Service DHCPServer. Confirm both sides return to Normal and lease data re-synchronizes (Get-DhcpServerv4Failover shows matching scope counts).
Repeat in the other direction (fail dc02, verify dc01). Redundancy you have only tested in one direction is half-tested.

Pitfalls

Scavenging interval shorter than the DHCP lease. The single most common cause of live records vanishing. Keep no-refresh + refresh >= lease, always.
Mismatched DHCP DNS credentials across failover partners. Records register fine, then half of them go stale because the partner cannot update what it does not own.
Pointing a DC’s forwarder at itself or the other DC. Forwarders are for external resolvers; doing this creates resolution loops and intermittent SERVFAIL.
Trusting a relationship that is Normal on only one side. Always verify both partners; a one-sided Normal is not redundancy.
Enabling server-level scavenging on every DC at once on day one. Start with one and confirm its behavior in the event log before enabling it anywhere else. Multiple scavenging servers are fine in steady state but make first-run debugging harder.