Ansible for OS Migrations, In Depth — P2V, V2V, RHEL Major-Version Upgrades and Windows Server Upgrades
Operating system migrations are the projects that consume two engineers for nine months and deliver a 30-page wiki page that nobody reads. They do not have to. The work splits into a handful of repeatable patterns, each of which Ansible automates well: lift-and-shift conversion (P2V/V2V), major-version in-place upgrade (RHEL leapp, Windows setup.exe), and migrate-and-modernise (rebuild on a new image and replay configuration with Ansible). This lesson teaches all three, with the runbook patterns that keep migrations safe at fleet scale.
We will be opinionated. We will not cover every vendor migration tool that ever shipped — we will cover the open-source toolchain (virt-v2v, leapp, dnf, ansible-core, the community.windows collection) plus the small number of commercial pieces (vSphere, Veeam, AWS MGN) that actually integrate with Ansible runbooks. We will also be honest about the mistakes — the migrations that look clean on paper and break in production because someone forgot to reconcile UID space or DNS suffix.
Position in the curriculum. This lesson assumes Tier 1–4 fluency plus the Tier 5 compliance and DR lessons. Migrations interact with both: a migration window is a controlled mini-disaster, and the DR runbook is your safety net.
What “OS migration” means in the Ansible context
Three migration shapes account for almost all real work:
- P2V — Physical to Virtual. A bare-metal server (often a 10-year-old HP ProLiant) is captured and replayed as a VM on vSphere/KVM/AWS. Used to retire dying hardware without re-architecting the application.
- V2V — Virtual to Virtual. A VM moves between hypervisors or between vSphere clusters or out to cloud (vSphere → AWS via AWS MGN, vSphere → KVM via virt-v2v, Hyper-V → vSphere). Used to consolidate, exit a vendor, or migrate to cloud.
- In-place major-version upgrade. RHEL 7 → 8 → 9 with leapp; Windows Server 2016 → 2019 → 2022 with setup.exe /Auto:Upgrade; Ubuntu 20.04 → 22.04 → 24.04 with do-release-upgrade. The OS stays on the same VM; only the userland and kernel change.
There is a fourth pattern, rebuild-and-reconfigure, which is technically not a migration: it is a clean install of the new OS plus an Ansible playbook replay against fresh hosts, then a data cut-over. It is the safest of the four when feasible, because the new host is built from the current source of truth instead of carrying years of manual drift. We cover it last because it is the easiest to do right.
Ansible’s role in all four:
- Inventory the source estate (what’s on it, who depends on it, what data lives where).
- Pre-flight check every host (disk space, repos, kernel modules, third-party agents).
- Drive the conversion or upgrade tool (virt-v2v, leapp, setup.exe).
- Reconcile networking, DNS, identity, certificates, monitoring, and application configuration on the new OS.
- Validate end-to-end with smoke tests and synthetic transactions.
- Roll back if validation fails (snapshots, virt-v2v originals, leapp rollback).
- Decommission the source once the new host has run cleanly for the bake-in window.
The discipline is the same as DR: small, idempotent roles; each step gated; the runbook is the source of truth.
The migration repository layout
Treat the migration as its own short-lived repository (or a sub-tree of a long-lived one). Once the migration is complete, the runbook code is preserved as evidence; the inventory is archived.
os-migration-rhel7-to-rhel9/
├── ansible.cfg
├── collections/requirements.yml    # community.general, ansible.posix,
│                                   # community.crypto, redhat.rhel_system_roles,
│                                   # community.windows (for mixed estates)
├── inventory/
│   ├── source/                     # what we are leaving (RHEL 7)
│   └── target/                     # what we are arriving at (RHEL 9)
├── group_vars/
│   └── all/
│       ├── migration_window.yml
│       ├── leapp.yml
│       └── vault.yml
├── playbooks/
│   ├── 00-discover-source.yml
│   ├── 10-pre-flight-checks.yml
│   ├── 20-snapshot-and-backup.yml
│   ├── 30-leapp-preupgrade.yml
│   ├── 40-leapp-upgrade.yml
│   ├── 50-post-upgrade-reconcile.yml
│   ├── 60-validate.yml
│   ├── 70-rollback.yml             # snapshot revert
│   └── 99-decommission-source.yml
└── roles/
    ├── discover_packages/
    ├── discover_services/
    ├── discover_disk_layout/
    ├── leapp_prepare/
    ├── leapp_run/
    ├── reconcile_network/
    ├── reconcile_repos/
    ├── reconcile_selinux/
    ├── reconcile_certs/
    ├── reconcile_third_party/
    ├── post_upgrade_validate/
    └── migration_evidence/
The naming pattern (00-, 10-, …) makes the migration order obvious to a tired operator at 2am, and the role split keeps the failure radius small.
Inventory and discovery: know what you are migrating
Most migration disasters are inventory disasters. The 18-year-old Linux box had a cron job that nobody documented; it fired once a quarter; you discover this six weeks after migration when finance asks where their report went. Spend disproportionate effort on discovery; it is the cheapest insurance you will ever buy.
# playbooks/00-discover-source.yml
---
- name: Discover source estate
  hosts: source_estate
  gather_facts: true
  become: true                     # needed to read other users' crontabs
  tasks:
    - name: Capture all installed packages
      ansible.builtin.package_facts:

    - name: Capture all running services
      ansible.builtin.service_facts:

    - name: Capture cron jobs (system + user)
      ansible.builtin.shell: |
        cat /etc/crontab /etc/cron.d/* 2>/dev/null
        for u in $(cut -d: -f1 /etc/passwd); do
          crontab -u "$u" -l 2>/dev/null && echo "##USER:$u"
        done
        true   # a user without a crontab must not fail the task
      register: cron_dump
      changed_when: false

    - name: Capture systemd timers
      ansible.builtin.command: systemctl list-timers --all --no-pager
      register: timers
      changed_when: false

    # The JSON flags below assume iproute2 and util-linux versions that support
    # them; very old hosts may need plain-text capture instead.
    - name: Capture network interfaces
      ansible.builtin.command: ip -j addr
      register: net_addr
      changed_when: false

    - name: Capture routes
      ansible.builtin.command: ip -j route
      register: net_route
      changed_when: false

    - name: Capture mount table
      ansible.builtin.command: findmnt --json
      register: mounts
      changed_when: false

    - name: Capture third-party agents (everything not in package manager DB)
      ansible.builtin.find:
        paths: [/opt, /usr/local]
        file_type: directory
        recurse: false
      register: third_party

    - name: Capture listening ports
      ansible.builtin.command: ss -tulpenH
      register: ports
      changed_when: false

    - name: Persist inventory artefact on the control node
      ansible.builtin.copy:
        dest: "/var/lib/migration/discovery/{{ inventory_hostname }}.json"
        content: "{{ discovery | to_nice_json }}"   # serialise once, so the file is valid JSON
      vars:
        discovery:
          host: "{{ inventory_hostname }}"
          os: "{{ ansible_distribution }} {{ ansible_distribution_version }}"
          kernel: "{{ ansible_kernel }}"
          packages: "{{ ansible_facts.packages }}"
          services: "{{ ansible_facts.services }}"
          cron: "{{ cron_dump.stdout_lines }}"
          timers: "{{ timers.stdout_lines }}"
          net:
            addr: "{{ net_addr.stdout | from_json }}"
            route: "{{ net_route.stdout | from_json }}"
          mounts: "{{ mounts.stdout | from_json }}"
          third_party_dirs: "{{ third_party.files | map(attribute='path') | list }}"
          ports: "{{ ports.stdout_lines }}"
      delegate_to: localhost
      become: false
Run this against every host you intend to migrate. Build a small Python script (or a Jupyter notebook) over the resulting JSON to surface:
- Hosts running EOL packages that block the upgrade.
- Hosts with third-party agents in /opt (Splunk, Dynatrace, AV) that need vendor-blessed re-installation.
- Hosts that listen on weird ports (often forgotten by the application team).
- Hosts with non-standard mount layouts that will trip up the conversion tool.
Ship the analysis to the application owner before scheduling the window. The number of “wait, you can’t migrate that — it’s owned by Risk” conversations you avoid is the ROI of the discovery playbook.
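The analysis script can stay small. The following is a minimal sketch, assuming each artefact is a per-host JSON file from the discovery playbook with `third_party_dirs` and `ports` keys; `EXPECTED_PORTS` and the function names are hypothetical and exist only for illustration.

```python
import json
from pathlib import Path

# Hypothetical allow-list: ports the application team has already documented.
EXPECTED_PORTS = {22, 443, 8443}


def listening_ports(ss_lines):
    """Extract local port numbers from `ss -tulpenH` output lines."""
    ports = set()
    for line in ss_lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        # The fifth column is "Local Address:Port"; the port follows the last colon.
        port = fields[4].rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return ports


def summarise(record):
    """Return the findings worth raising with the application owner."""
    unexpected = listening_ports(record.get("ports", [])) - EXPECTED_PORTS
    return {
        "host": record["host"],
        "third_party_dirs": sorted(record.get("third_party_dirs", [])),
        "unexpected_ports": sorted(unexpected),
    }


def summarise_directory(path):
    """Summarise every <host>.json artefact found under the discovery directory."""
    return [summarise(json.loads(p.read_text())) for p in sorted(Path(path).glob("*.json"))]
```

Calling `summarise_directory("/var/lib/migration/discovery")` yields one summary per host, which is usually enough to start the conversation with the owner.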
P2V and V2V with virt-v2v
virt-v2v is the open-source converter that turns a VMware VM, a Hyper-V VM, or (through its companion tool virt-p2v) a physical machine into a KVM/qemu image, or imports it directly into RHV/oVirt. Commercial importers such as AWS MGN follow the same broad shape (copy the disks, then fix up drivers and boot configuration for the target), and virt-v2v is the converter to bet on for Linux V2V onto KVM.
The pattern with Ansible is to:
- Cleanly stop the source workload (or take a quiesced snapshot if downtime is unacceptable).
- Run virt-v2v against the source, with credentials stored in Vault.
- Boot the target on the new platform.
- Apply post-conversion reconciliation (network, repos, certificates).
- Validate; if green, decommission source after the bake-in window.
# roles/v2v_convert/tasks/main.yml
---
# vcenter_datacenter and vcenter_folder are assumed inventory variables.
- name: Snapshot source for rollback (quiesced, while the guest is still running)
  community.vmware.vmware_guest_snapshot:
    hostname: "{{ vcenter_host }}"
    username: "{{ vcenter_user }}"
    password: "{{ vcenter_pass }}"
    datacenter: "{{ vcenter_datacenter }}"
    folder: "{{ vcenter_folder }}"
    name: "{{ source_vm }}"
    state: present
    snapshot_name: "pre-v2v-{{ ansible_date_time.epoch }}"
    quiesce: true
    memory_dump: false
    validate_certs: false        # lab default; enable verification in production
  when: source_platform == "vsphere"
  no_log: true

- name: Shut down the source guest cleanly (virt-v2v converts powered-off guests)
  community.vmware.vmware_guest_powerstate:
    hostname: "{{ vcenter_host }}"
    username: "{{ vcenter_user }}"
    password: "{{ vcenter_pass }}"
    name: "{{ source_vm }}"
    state: shutdown-guest
    state_change_timeout: 600
    validate_certs: false
  when: source_platform == "vsphere"
  no_log: true

- name: Run virt-v2v from converter host
  ansible.builtin.command:
    cmd: >
      virt-v2v
      -ic vpx://{{ vcenter_user | urlencode }}@{{ vcenter_host }}/Datacenter/Cluster/host?no_verify=1
      --password-file /etc/v2v.pwd
      -o rhv -os {{ rhv_storage_domain }}
      -of raw
      -on {{ source_vm }}-converted
      {{ source_vm }}
  delegate_to: "{{ converter_host }}"
  register: v2v
  no_log: true

# virt-v2v reports to stdout/stderr rather than to a log file of its own,
# so the registered output is what gets kept as evidence.
- name: Capture v2v output for evidence
  ansible.builtin.copy:
    content: "{{ v2v.stdout }}\n{{ v2v.stderr }}\n"
    dest: "/var/lib/migration/evidence/{{ source_vm }}/v2v.log"
  delegate_to: localhost

- name: Boot converted VM (RHV)
  ovirt.ovirt.ovirt_vm:
    auth: "{{ ovirt_auth }}"
    name: "{{ source_vm }}-converted"
    state: running
    wait: true
    cluster: "{{ rhv_cluster }}"
  no_log: true
The two non-obvious points:
- Always snapshot before converting. virt-v2v rewrites the guest for its new hypervisor (for KVM targets it installs virtio drivers and adjusts the boot configuration). A driver mistake on a critical VM costs you a day if you have a snapshot, three days if you don’t.
- The converter host has its own cred file (/etc/v2v.pwd). Do not pass passwords on the command line; ps -ef will leak them.
After conversion, the new VM almost always needs network reconciliation (different MAC, different vNIC name) and often needs DNS/AD re-join. Wrap those in roles/reconcile_network and roles/reconcile_third_party.
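One way to see what the network reconciliation role has to fix is to compare `ip -j addr` output captured on the source with the same output from the converted guest. This is a rough sketch, assuming the guest keeps its IP addresses across the conversion; `interface_drift` is a hypothetical helper name.

```python
import json


def ip_to_ifname(ip_addr_json):
    """Map each configured IP address to the interface that carries it."""
    mapping = {}
    for link in json.loads(ip_addr_json):
        for addr in link.get("addr_info", []):
            if "local" in addr:
                mapping[addr["local"]] = link["ifname"]
    return mapping


def interface_drift(source_json, target_json):
    """Report addresses that moved to a differently named NIC, or vanished."""
    src = ip_to_ifname(source_json)
    dst = ip_to_ifname(target_json)
    renamed = {
        ip: (src[ip], dst[ip])
        for ip in src
        if ip in dst and src[ip] != dst[ip]
    }
    missing = sorted(ip for ip in src if ip not in dst)
    return {"renamed": renamed, "missing": missing}
```

A non-empty `renamed` map points at connection profiles or firewall zones that still reference the old interface name; `missing` lists addresses that never came up on the target.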
RHEL major-version in-place upgrades with leapp
leapp is Red Hat’s officially supported in-place upgrade engine. It runs in two phases: leapp preupgrade (analyses the system and emits a report) and leapp upgrade (the actual upgrade, requires a reboot). Crucially, leapp is opinionated: it will refuse to upgrade a system with unsupported configurations, and you must address every blocker before running the upgrade.
The standard fleet pattern with Ansible is:
- Run leapp preupgrade against every host.
- Collect the JSON reports centrally.
- Cluster the reports by blocker type.
- Fix each blocker class with a targeted role (or an exception list).
- Re-run leapp preupgrade until clean.
- Schedule the upgrade window.
- Run leapp upgrade and reboot in waves.
- Validate; record evidence.
# playbooks/30-leapp-preupgrade.yml
---
- name: Leapp preupgrade
  hosts: leapp_targets
  become: true
  tasks:
    # The repos that ship leapp must already be enabled (RHSM or Satellite).
    # The generic package module picks yum on RHEL 7 and dnf on RHEL 8.
    - name: Install leapp
      ansible.builtin.package:
        name:
          - leapp-upgrade
          - leapp
        state: present

    - name: Stage answerfile for known questions
      ansible.builtin.copy:
        dest: /var/log/leapp/answerfile
        content: |
          [remove_pam_pkcs11_module_check]
          confirm = True
          [authselect_check]
          confirm = True

    - name: Run leapp preupgrade
      ansible.builtin.command:
        cmd: leapp preupgrade --report-schema=1.2.0
      register: preup
      failed_when: preup.rc not in [0, 1]   # 1 = report has inhibitors, expected
      changed_when: true

    - name: Fetch the report JSON
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-report.json
        dest: "/var/lib/migration/leapp/{{ inventory_hostname }}.json"
        flat: true

    - name: Fetch the inhibitor list (human readable)
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-report.txt
        dest: "/var/lib/migration/leapp/{{ inventory_hostname }}.txt"
        flat: true
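Once the reports are on the control node, clustering them by blocker type (step 3 of the fleet pattern) might look like the sketch below. It assumes each report is a JSON object with an `entries` list whose items carry a `title` and either a `groups` or a `flags` list; checking both keys is a defensive assumption about schema versions, and the function names are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def is_inhibitor(entry):
    """True if a report entry is marked as blocking the upgrade."""
    labels = list(entry.get("groups", [])) + list(entry.get("flags", []))
    return "inhibitor" in labels


def cluster_inhibitors(reports):
    """Map inhibitor title -> sorted list of hosts that reported it."""
    clusters = defaultdict(set)
    for host, report in reports.items():
        for entry in report.get("entries", []):
            if is_inhibitor(entry):
                clusters[entry.get("title", "<untitled>")].add(host)
    return {title: sorted(hosts) for title, hosts in clusters.items()}


def load_reports(directory):
    """Load the <host>.json reports fetched by the preupgrade play."""
    return {p.stem: json.loads(p.read_text()) for p in Path(directory).glob("*.json")}
```

Sorting the result by cluster size shows which remediation role to write first: the inhibitor shared by the most hosts.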
The interesting work happens between preupgrade and upgrade: parsing the reports and writing remediations. Common inhibitors and their fixes:
- Missing pam_pkcs11 removal: replace with sssd PAM via authselect select sssd with-smartcard.
- Hash algorithm in /etc/login.defs: change to SHA512.
- Old kernel modules in /etc/modprobe.d: clean up before upgrade.
- Third-party agents (Splunk, AV, Dynatrace): must be uninstalled or replaced with vendor-blessed RHEL 9 versions.
- Storage driver issues: mpath configurations sometimes need updating to multipath -ll compatible defaults.
Each remediation becomes a role:
# roles/leapp_remediate_pam/tasks/main.yml
---
- name: Switch authselect to sssd
  ansible.builtin.command:
    cmd: authselect select sssd with-smartcard --force
  changed_when: true

- name: Remove pam_pkcs11
  ansible.builtin.dnf:
    name: pam_pkcs11
    state: absent
After all remediations, re-run 30-leapp-preupgrade.yml and assert that no host has inhibitors:
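- name: Assert leapp clean
  ansible.builtin.assert:
    that: leapp_inhibitor_count | int == 0
    fail_msg: "{{ inventory_hostname }} still has {{ leapp_inhibitor_count }} inhibitors"
  vars:
    # Report fetched to the control node by 30-leapp-preupgrade.yml.
    leapp_report: "{{ lookup('ansible.builtin.file', '/var/lib/migration/leapp/' ~ inventory_hostname ~ '.json') | from_json }}"
    # With --report-schema=1.2.0 the inhibitor marker lives under "groups", not "flags".
    leapp_inhibitor_count: >-
      {{ leapp_report['entries']
         | selectattr('groups', 'defined') | selectattr('groups', 'contains', 'inhibitor')
         | list | length }}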
The actual leapp upgrade run
# playbooks/40-leapp-upgrade.yml
---
- name: Leapp upgrade in waves
  hosts: leapp_targets
  serial: "20%"
  become: true
  vars:
    # ansible_facts.lvm.vgs is a dict keyed by VG name. Taking the first key
    # assumes a single-VG host; override root_vg where that does not hold.
    root_vg: "{{ ansible_facts.lvm.vgs | default({}) | list | first | default('') }}"
  tasks:
    - name: Snapshot LVM root for rollback (best effort)
      ansible.builtin.command:
        cmd: "lvcreate -s -L 10G -n root_pre_leapp /dev/{{ root_vg }}/root"
        creates: "/dev/{{ root_vg }}/root_pre_leapp"
      when: root_vg | length > 0            # not all hosts use LVM

    - name: Run leapp upgrade (downloads packages, stages the upgrade initramfs)
      ansible.builtin.command:
        cmd: leapp upgrade --report-schema=1.2.0
      register: upgrade
      async: 7200                           # long-running; poll instead of holding one session
      poll: 30

    - name: Reboot into the upgrade environment and wait for the new OS
      ansible.builtin.reboot:
        reboot_timeout: 3600                # covers the offline upgrade and its follow-up reboots
        post_reboot_delay: 120

    - name: Confirm upgraded OS major version
      ansible.builtin.setup:
        gather_subset: distribution

    - name: Assert we are now on RHEL 9
      ansible.builtin.assert:
        that:
          - ansible_distribution == "RedHat"
          - ansible_distribution_major_version == "9"

    - name: Capture post-upgrade leapp log
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-upgrade.log
        dest: "/var/lib/migration/leapp/{{ inventory_hostname }}.upgrade.log"
        flat: true
Why serial: "20%"? A batch size of 20% lets you observe failures on a slice of the fleet without committing the entire estate. The first wave should be 1–2 hosts; the second 5–10%; the third 20%; thereafter you can go faster. AAP supports this pattern via job templates with explicit limits (--limit wave1, --limit wave2).
Why LVM snapshots? They are not a substitute for backups, but they let you roll back an upgrade in minutes (lvconvert --merge, then a reboot so the merge can apply to the in-use root volume) instead of restoring from backup. The 10 GB carve-out has to absorb every block the upgrade rewrites, not just log churn, and a snapshot that fills up becomes invalid; tune it to your workload.
Why poll the upgrade command and reboot as a separate step? Because leapp upgrade only stages the upgrade; the work itself happens after the host reboots into a temporary upgrade initramfs, where SSH is unavailable. async: 7200 with poll: 30 keeps the long staging run from depending on one SSH session, and the reboot task resumes the play once the upgraded system answers again.
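The wave progression described above (a canary of 1–2 hosts, then roughly 5%, then 20%, then larger batches) can be generated rather than hand-maintained. This is an illustrative sketch; `plan_waves` and its default percentages are assumptions chosen to mirror the text, not an AAP feature.

```python
import math


def plan_waves(hosts, canary=2, ramp_percent=(5, 20), tail_percent=50):
    """Split hosts into a canary wave, ramp-up waves, then larger tail waves.

    Percentages are of the whole fleet, rounded up, and every wave holds
    at least one host.
    """
    remaining = list(hosts)
    total = len(remaining)
    waves = []

    def take(count):
        # Clamp the request to what is left, but never produce an empty wave.
        count = max(1, min(count, len(remaining)))
        waves.append(remaining[:count])
        del remaining[:count]

    if remaining:
        take(canary)
    for percent in ramp_percent:
        if remaining:
            take(math.ceil(total * percent / 100))
    while remaining:
        take(math.ceil(total * tail_percent / 100))
    return waves
```

Each returned list can become an inventory group (wave1, wave2, and so on) that the job template's limit refers to.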
Post-upgrade reconciliation (the part everyone underestimates)
A successful leapp upgrade does not mean a working host. Things that often break and must be reconciled:
- Repository configuration: Satellite/RHSM channels need to be re-attached for RHEL 9.
- SELinux contexts: restorecon -Rv / is good practice; don’t skip it because “it was working before.”
- Network configuration: NetworkManager sometimes renames interfaces; DNS resolvers may revert to defaults.
- Certificate stores: update-ca-trust on RHEL 9 is stricter; older certs in legacy folders are ignored.
- Third-party agents: monitoring, AV, backup agents need RHEL 9-compatible versions installed.
- Application-specific tweaks: Java cacerts, Python venvs (RHEL 9 default Python is 3.9 not 2.7), Postgres data directories.
Each of these gets its own role; each role is idempotent so it can be re-run. A representative reconciliation playbook:
# playbooks/50-post-upgrade-reconcile.yml
---
- name: Post-upgrade reconciliation
  hosts: leapp_targets
  become: true
  tasks:
    - ansible.builtin.import_role: { name: reconcile_repos }
    - ansible.builtin.import_role: { name: reconcile_selinux }
    - ansible.builtin.import_role: { name: reconcile_network }
    - ansible.builtin.import_role: { name: reconcile_certs }
    - ansible.builtin.import_role: { name: reconcile_third_party }
    - ansible.builtin.import_role: { name: reconcile_python }
    - ansible.builtin.import_role: { name: reconcile_app }   # last; depends on the app stack
The reconcile_app role is application-specific. If you are migrating an app that runs Postgres on the host, you re-run the regular Ansible deploy playbook — the same one used in steady state. If you are migrating a custom Java app, you have a play that ensures java-17-openjdk is installed (RHEL 9 default is 17 not 8) and reconfigures JAVA_HOME accordingly. The point is that the migration playbook delegates application reconciliation to the standard application deploy playbook, ensuring the new host is built from the current source of truth — exactly like a rebuild.
Windows Server in-place upgrades (2016 → 2019 → 2022)
Windows in-place upgrades are less reliable than RHEL leapp, but with care they work. The pattern, using modules from the ansible.windows collection:
# roles/win_inplace_upgrade/tasks/main.yml
---
- name: Stage Windows Server 2022 ISO
  ansible.windows.win_copy:
    src: /shared/iso/SERVER_2022.iso
    dest: C:\Setup\SERVER_2022.iso

- name: Mount the ISO
  ansible.windows.win_powershell:
    script: |
      $img = Mount-DiskImage -ImagePath C:\Setup\SERVER_2022.iso -PassThru
      ($img | Get-Volume).DriveLetter
  register: mount

- name: Pre-upgrade compatibility check
  ansible.windows.win_command:
    cmd: "{{ mount.output[0] }}:\\setup.exe /Auto:Upgrade /Compat:ScanOnly /Quiet /NoReboot /CopyLogs:C:\\Setup\\compat-logs"
  register: compat
  failed_when: compat.rc not in [0, 0xC1900210, 0xC1900208]
  # 0xC1900210 = compatible; no compatibility issues.
  # 0xC1900208 = compatibility issues found.
  # See https://learn.microsoft.com/windows/deployment/upgrade/upgrade-error-codes

- name: Fail if compatibility issues
  ansible.builtin.fail:
    msg: "Windows compatibility scan failed; check C:\\Setup\\compat-logs"
  when: compat.rc == 0xC1900208

- name: Run the online phase of the upgrade (no automatic reboot)
  ansible.windows.win_command:
    cmd: "{{ mount.output[0] }}:\\setup.exe /Auto:Upgrade /Quiet /NoReboot /CopyLogs:C:\\Setup\\upgrade-logs"
  async: 7200
  poll: 60

# Setup reboots several more times during the offline phase; the timeouts
# here are starting points, not measured values.
- name: Reboot to start the offline phase and wait for the host to return
  ansible.windows.win_reboot:
    reboot_timeout: 5400
    post_reboot_delay: 300

- name: Verify upgrade succeeded
  ansible.windows.win_powershell:
    script: |
      $os = Get-CimInstance Win32_OperatingSystem
      [pscustomobject]@{
        Caption = $os.Caption
        Version = $os.Version
        BuildNumber = $os.BuildNumber
      }
  register: post_os

- name: Assert Windows Server 2022
  ansible.builtin.assert:
    that:
      - "'2022' in post_os.output[0].Caption"
The Windows-specific traps:
- Domain controllers are the riskiest in-place candidates, especially FSMO role holders. The conservative path is to transfer the roles first, demote, upgrade, and re-promote. This is a common trip-hazard; document it in the migration runbook.
- Anti-virus must be in pass-through mode during upgrade; many AV vendors block the upgrade engine.
- Group Policy with disabled Windows Update will cause the upgrade to fail silently. Check before running.
- Pending reboots from Patch Tuesday must be cleared first; check with Get-PendingReboot (a community PowerShell function, not a built-in cmdlet) before kicking off.
The post-upgrade Windows reconciliation is similar to RHEL: re-attach the host to AD if it dropped, re-install monitoring agents, re-import scheduled tasks (which sometimes get reset), and verify the application service starts.
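When compatibility-scan results are post-processed outside Ansible (for example in a wave report), the two setup.exe return codes used above are easier to handle once normalised. Depending on the tooling, a Windows exit code can surface as a signed 32-bit integer, so this sketch masks it first; the function names and the three-way classification are illustrative assumptions.

```python
COMPATIBLE = 0xC1900210   # scan finished, no compatibility issues
BLOCKED = 0xC1900208      # scan finished, compatibility issues found


def normalise_rc(rc):
    """Map a possibly signed 32-bit exit code onto its unsigned value."""
    return rc & 0xFFFFFFFF


def classify_compat_scan(rc):
    """Classify a /Compat:ScanOnly return code as proceed, blocked, or error."""
    code = normalise_rc(rc)
    if code in (0, COMPATIBLE):
        return "proceed"
    if code == BLOCKED:
        return "blocked"
    return "error"
```

Any return code other than the ones the playbook already handles lands in "error", which is the safe default for a host that should not enter the upgrade wave.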
Migrate-and-modernise: rebuild and reconfigure
When the migration path is “RHEL 7 → RHEL 9 with completely new hardware” or “Windows 2016 → AWS-hosted Windows 2022”, the safest approach is rebuild: provision a fresh target host, run your steady-state Ansible playbooks against it, cut the data over, then decommission the source.
This pattern has three big advantages over in-place:
- The new host is built from current source of truth, with no inherited drift.
- The cut-over is a single network/DNS change; the rollback is a single network/DNS change.
- You rehearse the rebuild on pre-prod weeks before production, with high confidence that it will work in prod because the same playbook drives both.
The rebuild migration runbook looks like:
# playbooks/rebuild-migration.yml
---
- name: 1. Provision target host
  hosts: localhost
  tasks:
    - ansible.builtin.import_role: { name: provision_target }   # vSphere, AWS, Azure, etc.

# import_playbook is only valid at play level, so step 2 is a top-level entry.
# The imported site.yml is expected to target the new host through inventory.
- name: 2. Bootstrap target with steady-state config
  ansible.builtin.import_playbook: ../../app-platform/site.yml   # the regular play

- name: 3. Replicate data (initial sync)
  hosts: "{{ source_host }}"
  tasks:
    - ansible.builtin.import_role: { name: data_sync_initial }

# Quoted because the name contains ": ", which YAML would otherwise parse as a mapping.
- name: "4. Cut-over window: stop source, sync delta, swing DNS"
  hosts: "{{ source_host }}"
  tasks:
    - ansible.builtin.import_role: { name: stop_source }
    - ansible.builtin.import_role: { name: data_sync_delta }
    - ansible.builtin.import_role: { name: swing_dns }
      delegate_to: localhost

- name: 5. Smoke test target
  hosts: "{{ target_host }}"
  tasks:
    - ansible.builtin.import_role: { name: app_smoke_test }

- name: 6. Bake-in window (keep source on standby)
  hosts: localhost
  tasks:
    - ansible.builtin.debug:
        msg: "Source kept for {{ bake_in_days }} days. Rollback by reversing DNS swing."

- name: 7. Decommission source
  hosts: "{{ source_host }}"
  tasks:
    - ansible.builtin.import_role: { name: decommission_source }
Steps 4 and 7 are gated by manual approvals in AAP (separate job templates). The bake-in window in step 6 is non-negotiable for production; one week is the minimum, two weeks is sane, four weeks is conservative.
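The bake-in rule is simple enough to encode as a gate that the approver, or a pre-task in the decommission job, can evaluate. A minimal sketch follows, assuming a seven-day floor and a fourteen-day default taken from the guidance above; the function names are hypothetical.

```python
from datetime import date, timedelta

MINIMUM_BAKE_IN_DAYS = 7  # "one week is the minimum" for production


def earliest_decommission(cutover, bake_in_days=14):
    """Return the first date on which the source may be decommissioned."""
    if bake_in_days < MINIMUM_BAKE_IN_DAYS:
        raise ValueError("bake-in window is shorter than the one-week minimum")
    return cutover + timedelta(days=bake_in_days)


def may_decommission(cutover, today, bake_in_days=14, validation_passed=True):
    """True only if validation passed and the bake-in window has fully elapsed."""
    return validation_passed and today >= earliest_decommission(cutover, bake_in_days)
```

For a cut-over on 1 March with the default window, `earliest_decommission(date(2024, 3, 1))` gives 15 March; anything earlier should keep step 7 locked.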
Ubuntu and Debian major-version upgrades
do-release-upgrade is the analogue of leapp on Ubuntu. The pattern is identical to RHEL but the tooling differs:
- name: Keep existing config files when packages would prompt
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/50unattended-upgrades-noninteractive
    content: |
      Dpkg::Options { "--force-confdef"; "--force-confold"; }

# In this non-interactive mode the tool does not reboot on its own by default,
# so the run is polled to completion and the reboot is a separate task.
- name: Run release upgrade
  ansible.builtin.command:
    cmd: do-release-upgrade -f DistUpgradeViewNonInteractive
  async: 7200
  poll: 30

- name: Reboot into the new release
  ansible.builtin.reboot:
    reboot_timeout: 1800

- name: Confirm Ubuntu version
  ansible.builtin.setup:
    gather_subset: distribution

- name: Assert Ubuntu 24.04
  ansible.builtin.assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_version == "24.04"
Debian uses apt full-upgrade after editing /etc/apt/sources.list; the rest is the same. Both Debian and Ubuntu have weaker guarantees than RHEL leapp — the upgrade tooling is less opinionated, so your pre-flight discovery has to do more work to surface conflicts.
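For the Debian path, the sources.list edit is the step most worth scripting, because a half-edited file mixes suites. The sketch below rewrites only the suite field of active one-line `deb` and `deb-src` entries; it assumes the classic one-line format (not deb822 `.sources` files), and the function names are hypothetical. Whitespace inside rewritten lines is normalised to single spaces.

```python
def retarget_line(line, old, new):
    """Swap the suite on one sources.list line, leaving other lines untouched."""
    parts = line.split()
    if not parts or parts[0] not in ("deb", "deb-src"):
        return line  # comments, blanks, and anything unrecognised stay as-is
    index = 1
    # Skip an optional [option=value ...] block, which may span several fields.
    if index < len(parts) and parts[index].startswith("["):
        while index < len(parts) and not parts[index].endswith("]"):
            index += 1
        index += 1
    suite_index = index + 1  # parts[index] is the URI; the suite follows it
    if suite_index < len(parts):
        suite = parts[suite_index]
        if suite == old or suite.startswith((old + "-", old + "/")):
            parts[suite_index] = new + suite[len(old):]
    return " ".join(parts)


def retarget_sources(text, old, new):
    """Apply retarget_line to every line of a sources.list file body."""
    body = "\n".join(retarget_line(line, old, new) for line in text.splitlines())
    return body + ("\n" if text.endswith("\n") else "")
```

Running `retarget_sources(text, "bullseye", "bookworm")` also covers suffixed suites such as `bullseye-updates` and `bullseye-security`, while commented-out entries are left alone for a human to review.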
Validation: prove the migration succeeded
Validation is the difference between “it booted” and “the application works.” Always run end-to-end synthetic transactions, not just service checks.
# playbooks/60-validate.yml
---
- name: End-to-end validation
  hosts: target_estate
  tasks:
    - name: Service health check (basic)
      ansible.builtin.systemd:
        name: kv-app
        state: started
      check_mode: true
      register: svc
      failed_when: svc is changed   # "changed" in check mode means the service was not running

    - name: Application smoke test
      ansible.builtin.uri:
        url: "https://{{ inventory_hostname }}:8443/healthz"
        status_code: 200
        validate_certs: true
      register: health
      until: health.status == 200
      retries: 10
      delay: 10

    - name: Synthetic transaction (full path)
      ansible.builtin.uri:
        url: "https://{{ inventory_hostname }}:8443/api/v1/canary"
        method: POST
        body_format: json
        body: { canary: "{{ ansible_date_time.epoch }}" }
        status_code: 201
      no_log: true

    - name: Persist validation result
      ansible.builtin.copy:
        dest: "/var/lib/migration/validation/{{ inventory_hostname }}.json"
        content: |
          { "host": "{{ inventory_hostname }}",
            "ts": "{{ ansible_date_time.iso8601 }}",
            "health": {{ health.status }},
            "passed": true
          }
      delegate_to: localhost
If any host fails validation, that host’s job goes to playbooks/70-rollback.yml instead of 99-decommission-source.yml. The rollback playbook is platform-specific: revert vSphere snapshot, revert LVM snapshot via lvconvert --merge, restore from Veeam backup, etc.
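The routing decision itself can be derived from the persisted validation files. A minimal sketch, assuming each result carries the `passed` and `health` fields written by the validation play and that a host with no result file at all counts as failed; `triage` is a hypothetical helper.

```python
def triage(results, expected_hosts):
    """Split hosts into (ready_for_decommission, needs_rollback).

    `results` maps hostname -> parsed validation JSON. A host that never
    produced a result is treated as a failure, not silently skipped.
    """
    ready, rollback = [], []
    for host in expected_hosts:
        result = results.get(host)
        if result and result.get("passed") is True and result.get("health") == 200:
            ready.append(host)
        else:
            rollback.append(host)
    return ready, rollback
```

Driving the loop from the expected host list rather than from the files on disk is deliberate: a host whose validation play crashed before writing anything still ends up in the rollback list.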
Migration evidence and reporting
Auditors and project sponsors both want per-host migration evidence: when did it migrate, what changed, did it pass validation, who approved the wave. Capture this once, store it forever:
- name: Persist evidence bundle
  ansible.builtin.copy:
    dest: "/var/lib/migration/evidence/{{ inventory_hostname }}/{{ ansible_date_time.epoch }}.json"
    content: |
      {
        "host": "{{ inventory_hostname }}",
        "wave": "{{ wave_id }}",
        "from": "{{ source_os }}",
        "to": "{{ ansible_distribution }} {{ ansible_distribution_version }}",
        "started": "{{ run_start }}",
        "ended": "{{ ansible_date_time.iso8601 }}",
        "duration_seconds": {{ (ansible_date_time.epoch | int) - (run_start_epoch | int) }},
        "leapp_report_sha256": "{{ leapp_report_sha }}",
        "validation_passed": {{ validation_passed | bool | to_json }},
        "approved_by": "{{ approver }}",
        "ticket": "{{ change_ticket }}"
      }
  delegate_to: localhost   # keep evidence on the control node, next to the other artefacts
Land this in an immutable bucket (S3 Object Lock compliance mode, GCS retention policy, Azure Blob immutability). At the end of the migration project, you can produce a single report: “We migrated 3,412 hosts across 27 waves, with 97.4% first-pass success and a mean migration time per host of 47 minutes.” Those numbers are the kind of thing executive sponsors quote in their next budget request.
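Producing that one-line report from the evidence bundles is a small aggregation. This sketch assumes each bundle has the `host`, `wave`, `started`, `duration_seconds`, and `validation_passed` fields shown above, and that `started` timestamps share one ISO 8601 format so they sort as strings; the function name and output keys are illustrative.

```python
from statistics import mean


def summarise_evidence(bundles):
    """Aggregate per-host evidence bundles into project-level numbers."""
    # The first attempt per host is the bundle with the earliest start time.
    first_attempt = {}
    for bundle in sorted(bundles, key=lambda b: b["started"]):
        first_attempt.setdefault(bundle["host"], bundle)

    hosts = len(first_attempt)
    first_pass = sum(1 for b in first_attempt.values() if b["validation_passed"])
    durations = [b["duration_seconds"] for b in bundles if b["validation_passed"]]

    return {
        "hosts": hosts,
        "waves": len({b["wave"] for b in bundles}),
        "first_pass_success_pct": round(100 * first_pass / hosts, 1) if hosts else 0.0,
        "mean_duration_minutes": round(mean(durations) / 60, 1) if durations else None,
    }
```

Counting only the earliest bundle per host keeps retries from inflating the first-pass rate, while the mean duration is taken over successful runs only.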
Anti-patterns that kill migrations
- Skipping discovery. Migration day is not the time to learn that this VM has a custom Splunk forwarder.
- Migrating in giant waves. A 200-host wave with one shared bug becomes 200 simultaneous incidents.
- Treating leapp inhibitors as suggestions. Every inhibitor must be addressed; “we’ll fix it after upgrade” is how you brick servers.
- No bake-in window. Decommissioning the source the same day as the cut-over guarantees you discover the broken thing on day 8 with no rollback path.
- Leaving rollback unrehearsed. Test the snapshot revert in pre-prod; it usually has rough edges.
- Hand-edited /etc/sssd/sssd.conf after migration. If the host re-joined AD with different settings, a full Ansible config replay is the right cure, not a manual edit.
- Mixing P2V and major-version upgrade in one window. Each is a migration; combining them doubles your variables and quadruples your debug time.
- Using serial: 100% in the upgrade play. You will discover the leapp bug on every host simultaneously.
- Ignoring DNS TTL. The cut-over IP cached for hours by clients defeats your rollback.
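The DNS TTL point reduces to simple arithmetic: a resolver that cached the record just before you lowered the TTL keeps the old answer for the full old TTL. A small sketch under that assumption, with hypothetical function names:

```python
from datetime import datetime, timedelta


def ttl_lowering_deadline(cutover, current_ttl_seconds):
    """Latest moment to lower the TTL so every old cache entry expires before cut-over."""
    return cutover - timedelta(seconds=current_ttl_seconds)


def worst_case_rollback_delay(lowered_ttl_seconds):
    """Longest time a client can keep resolving the new IP after DNS is swung back."""
    return timedelta(seconds=lowered_ttl_seconds)
```

With a 24-hour TTL and a cut-over at 22:00 on 1 June, the record has to be lowered by 22:00 on 31 May; if the lowered TTL is 60 seconds, a rollback reaches all well-behaved clients within about a minute.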
Frequently asked questions
1. Should I prefer in-place upgrade or rebuild for RHEL 7 → RHEL 9? Rebuild, when feasible. RHEL 7 → 9 requires two leapp passes (7→8 then 8→9), and the cumulative risk of inhibitors and reconciliations is high. If your application can be redeployed cleanly from Ansible to a new host, rebuild and cut over. If it has years of manual drift that nobody can replicate, in-place leapp is your friend.
2. How long should bake-in be before decommissioning the source? At minimum, a full business cycle: month-end close, weekly batch, quarterly report. For a typical web app, 2 weeks. For a financial system, 6 weeks. The fact that the new host worked for 3 days proves nothing about the quarterly batch.
3. Can I run leapp on a host running a third-party agent like Splunk or CrowdStrike? Yes, but you must have the RHEL 9-compatible agent ready to install before the upgrade, or remove the agent before upgrade and re-install after. Some agents block the upgrade engine; check vendor documentation. Always test on a non-prod host with the same agent stack first.
4. What’s the right wave size for a fleet upgrade? Wave 1: 1–3 hosts (the canary). Wave 2: 5% of fleet. Wave 3: 20% of fleet. Wave 4 onwards: 25–50%. Leave 1–2 days between waves, and fix everything found in wave N before starting N+1.
5. How do I migrate a host with a third-party kernel module (NIC driver, GPFS, etc.)? Identify it during discovery. Get the RHEL 9-compatible version from the vendor. Stage it in your local repo. Add a remediation role that swaps the module post-upgrade. If no RHEL 9-compatible version exists, defer migration of that host and engage the vendor.
6. Do I need to re-take SAN snapshots after the upgrade? Yes — your backup baselines are now invalid because the OS, package versions, and possibly file paths have changed. Take fresh full backups within 24 hours of upgrade and re-establish your incremental chain.
7. How do I handle hosts that fail validation post-upgrade? Roll back via snapshot. Investigate the failure on a clone. Add a reconciliation role to fix the failure mode. Re-run the upgrade with the new role included. Do not “fix in place” on a failed upgrade — your source-of-truth runbook will not match what’s running.
8. What’s the right way to upgrade a database host? Almost never in-place. Build a new host with the new OS and the new database version, replicate the data using the database’s native replication (Postgres logical, MySQL replicas, MongoDB replica sets), and cut over with a DNS swing. The database migration becomes its own runbook (covered in the Tier 5 DB migrations lesson).
9. How long does a typical 1000-host RHEL 7 → RHEL 9 migration take, end-to-end? With proper Ansible discovery, remediation, and waving: 8–12 weeks elapsed. Without: 9–18 months. The ratio is the cost of skipping discovery.
10. What’s the single most underrated migration practice? The rebuild-vs-upgrade decision per workload. Rebuild is safer, rehearsable, and produces a cleaner long-term state. Many teams default to in-place because “we can’t redeploy this app from scratch.” If you cannot redeploy, that is the bug to fix first; once you can, every future migration becomes a non-event.
Hands-on lab — your first leapp upgrade
The following lab takes a freshly-spun RHEL 8 container (we use a privileged container as a stand-in for a VM) through the leapp preupgrade analysis, surfacing inhibitors so you can experience the workflow without committing a real VM.
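Prerequisites: Podman (the commands and inventory below use it; adapt them for Docker), the containers.podman collection for the connection plugin, a Red Hat developer subscription (free tier is fine), ansible-core ≥ 2.16.
mkdir -p leapp-lab/{playbooks,roles,inventory}
cd leapp-lab
ansible-galaxy collection install containers.podman
# inventory/hosts.yml
all:
  hosts:
    rhel8-canary:
      ansible_connection: containers.podman.podman
      # RHEL 8 images ship the interpreter that dnf uses at this path.
      ansible_python_interpreter: /usr/libexec/platform-python
podman run -d --name rhel8-canary --privileged registry.access.redhat.com/ubi8/ubi-init sleep infinity
podman exec rhel8-canary subscription-manager register --username "$RHN_USER" --password "$RHN_PASS" --auto-attach
# playbooks/leapp-preupgrade.yml
- name: Leapp preupgrade lab
  hosts: rhel8-canary
  become: false   # the Podman connection already runs as root inside the container
  tasks:
    - name: Install leapp
      ansible.builtin.dnf:
        name: leapp-upgrade
        state: present

    - name: Run the preupgrade analysis
      ansible.builtin.command: leapp preupgrade --target 9.4
      register: leapp
      failed_when: false   # inhibitors are the point of the exercise
      changed_when: true

    - name: Fetch the report into the lab directory
      ansible.builtin.fetch:
        src: /var/log/leapp/leapp-report.txt
        dest: "{{ playbook_dir }}/../leapp-report.txt"
        flat: true

    - name: Show the first 50 lines
      ansible.builtin.debug:
        msg: "{{ lookup('ansible.builtin.file', playbook_dir ~ '/../leapp-report.txt').splitlines()[:50] }}"
ansible-playbook -i inventory/hosts.yml playbooks/leapp-preupgrade.yml
head -100 leapp-report.txt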
The output is the leapp report — read every inhibitor. Now the exercise: write a roles/remediate_<inhibitor>/tasks/main.yml for each inhibitor you see, re-run preupgrade, and watch the inhibitor list shrink. By the time the list is empty, you have built the muscle memory for fleet leapp.
Glossary
- P2V — Physical to Virtual conversion.
- V2V — Virtual to Virtual conversion (often hypervisor-to-hypervisor).
- virt-v2v — Open-source converter for libvirt/KVM/RHV/oVirt targets.
- leapp — Red Hat’s in-place upgrade engine for RHEL major versions.
- Inhibitor — A leapp report finding that blocks the upgrade until resolved.
- Bake-in window — Period between cut-over and source decommission.
- Wave — A subset of hosts migrated together with shared monitoring.
- Cut-over — The moment traffic moves from source to target.
- Reconciliation — Post-migration step that fixes drift between source and target.
- Rebuild migration — Provision new target, replay configuration, swing data, decommission source.
Certification mapping
- EX374 — Workflow orchestration, RBAC, surveys (used for migration approvals).
- RHCE EX294 — Roles, conditionals, handlers — all used heavily in migration roles.
- Red Hat Certified Specialist in Linux Migration — direct alignment.
- AWS Migration Specialty — Lift-and-shift via MGN, virt-v2v patterns.
Next steps
You now have the mental model and the runbook patterns for the three migration shapes. The next specialist lesson covers air-gapped automation — running Ansible against environments that cannot reach the internet, and how to keep collections, EEs, leapp content, and OS repositories synchronised inside a sealed network. Air-gapped migrations are even less forgiving than connected ones, and they are where the discipline taught here pays the highest dividend.
If you only take one habit from this lesson: discover before you migrate, and rehearse before you decommission. A migration with three weeks of discovery and a week of rehearsal beats the heroic weekend migration every time, even though it looks slower on the project plan.