Ansible Lesson 32 of 42

Ansible for Hybrid & Multi-Cloud Orchestration: Coordinating On-Prem, AWS, Azure, GCP, and Kubernetes from a Single Workflow

A bank with 80 years of history runs a portfolio you can’t restart for fun: COBOL on z/OS handles ledger updates, RHEL 9 on vSphere serves the trading floor, AWS handles the mobile app, Azure handles Office 365 integration, GCP handles the data-science team’s BigQuery, and Kubernetes handles the new microservices the platform team is building. Most of these can’t move; all of them need to change weekly, and every change has to be auditable. The problem is not picking a tool — it’s picking a coordination plane that can drive all of them.

That coordination plane, for a wide swath of enterprises, is Ansible Automation Platform plus a careful approach to multi-tier inventory, automation mesh, and workflow chaining. This lesson — the capstone of the Ansible expert tier — covers how to build orchestrated, multi-environment changes that stretch across vSphere clusters, public clouds, Kubernetes, network gear, and Windows fleets, all in a single workflow with proper dependency ordering, dry-run gating, partial-failure handling, and rollback. If you’ve made it through the previous nine expert lessons, this is where they fit together.

Learning Objectives

By the end you will be able to:

Prerequisites

Mental Model: Hybrid Orchestration Done Right

1. One inventory, many sources

The right approach is not “one playbook per platform” with separate inventories. It is one logical inventory that aggregates multiple sources: a static YAML file for on-prem network gear, dynamic plugins for vSphere, AWS, Azure, and GCP, and a kubernetes.core inventory for K8s clusters. The inventory/ directory holds one file per source; AAP or ansible-inventory merges them into one host graph at runtime.

2. Automation mesh routes execution, not data

In a real enterprise, the control node sits in a management VPC. AWS prod is in a different account behind a private VPC. Azure prod is on-prem-routed via ExpressRoute. The on-prem network gear is on a management VLAN that the cloud control node can’t reach. Automation mesh is AAP’s solution: deploy execution nodes close to the targets (a small EC2 in the AWS account, a small VM in Azure, a small VM in the management VLAN), wire them to the control plane via hop nodes across firewalls, and let AAP route each Job to the right execution node based on the instance group bound to its inventory or Job Template.

3. Workflows chain Job Templates with branching logic

A Workflow Template is a DAG (directed acyclic graph) of Job Templates. Edges are conditional: on_success continues the success path, on_failure triggers cleanup, always runs regardless. A real change workflow looks like: pre-check → backup → drain → upgrade → validate → traffic-shift → confirm. Each of those is a Job Template, and the workflow stitches them with the right branching logic.

4. Identity is the hard problem in hybrid

The control node is RHEL 9. It needs to talk to: vSphere (SSO user), Windows (Kerberos, AD-joined), AWS (IRSA / OIDC), Azure (Workload Identity), GCP (Workload Identity Federation), Kubernetes (kubeconfig with short-lived tokens), network gear (TACACS+ / RADIUS / SSH key). Don’t try to unify these — instead, give AAP credentials per environment and let each Job Template pull only the credentials it needs.

5. Audit trail beats clever automation

The hardest part of multi-cloud Ansible isn’t getting it to work; it’s proving what changed, when, and by whose authority. AAP’s job-event stream is the spine of this audit story — every task on every host gets a structured record. Ship those events to Splunk/Elastic/ServiceNow, tag them with the change ticket ID (extra_vars: { change_id: CHG0001234 }), and the audit becomes searchable.

Designing a Cross-Platform Inventory

The aggregator pattern: one directory of inventory sources, each loaded by its own plugin, all merged into one inventory graph.

inventory/
├── 01-static.yml         # on-prem network gear, on-prem VMs
├── 02-vsphere.vmware.yml # community.vmware.vmware_vm_inventory (name must end in vmware.yml)
├── 03-aws.aws_ec2.yml    # amazon.aws.aws_ec2
├── 04-azure_rm.yml       # azure.azcollection.azure_rm
├── 05-gcp_compute.yml    # google.cloud.gcp_compute
├── 06-k8s.yml             # kubernetes.core.k8s
└── group_vars/
    ├── all.yml
    ├── tag_environment_prod.yml
    └── tag_role_database.yml

01-static.yml (the on-prem skeleton):

all:
  children:
    network_devices:
      hosts:
        leaf-01.dc1.corp.example.com:
        leaf-02.dc1.corp.example.com:
        spine-01.dc1.corp.example.com:
      vars:
        ansible_connection: ansible.netcommon.network_cli
        ansible_network_os: cisco.nxos.nxos
        env_name: prod   # "environment" is a reserved keyword in Ansible, so avoid it as a variable name
        location: dc1
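
The vSphere source listed in the directory tree is not shown in this lesson. A minimal sketch of what it might contain follows. The vCenter hostname, the service account, and the choice of properties are assumptions for illustration, and the password is expected in the VMWARE_PASSWORD environment variable rather than in the file. Note that this plugin only loads files whose names end in vmware.yml, vmware.yaml, vmware_vm_inventory.yml, or vmware_vm_inventory.yaml.

```yaml
# Hypothetical vSphere inventory source; hostname and username are placeholders.
plugin: community.vmware.vmware_vm_inventory
hostname: vcenter.dc1.corp.example.com   # assumption: your vCenter FQDN
username: svc-ansible@vsphere.local      # assumption: read-only inventory account
validate_certs: true
with_tags: false
# Only the properties listed here are fetched and become host variables.
properties:
  - config.name
  - config.guestId
  - guest.ipAddress
  - runtime.powerState
# Client-side filter: skip powered-off VMs.
filters:
  - runtime.powerState == "poweredOn"
hostnames:
  - config.name
compose:
  ansible_host: guest.ipAddress
keyed_groups:
  - key: config.guestId
    prefix: guest
```

Using the VM name as the inventory hostname and the guest IP as ansible_host keeps vSphere hosts readable in the merged graph while still connecting by address.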

03-aws.aws_ec2.yml:

plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
filters:
  tag:Environment: prod
  instance-state-name: running
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
  - key: placement.region
    prefix: aws_region
hostnames:
  - tag:Name
  - private-ip-address
compose:
  ansible_host: private_ip_address

04-azure_rm.yml:

plugin: azure.azcollection.azure_rm
auth_source: auto
include_vm_resource_groups:
  - prod-rg-east
  - prod-rg-west
keyed_groups:
  - key: tags.role
    prefix: role
  - key: location
    prefix: az_region
  - key: tags.environment
    prefix: env
hostnames:
  - private_ipv4_addresses[0]

05-gcp_compute.yml:

plugin: google.cloud.gcp_compute
projects:
  - corp-prod-1234
auth_kind: serviceaccount
service_account_file: /etc/ansible/sa-prod.json
filters:
  - status = RUNNING
  - labels.environment = prod
keyed_groups:
  - key: labels.role
    prefix: role
  - key: zone
    prefix: gcp_zone
hostnames:
  - private_ip

Run ansible-inventory -i inventory/ --graph and you get one tree:

@all:
  |--@aws_region_us_east_1:
  |  |--10.10.1.5
  |  |--10.10.1.6
  |--@az_region_eastus:
  |  |--10.20.1.5
  |--@gcp_zone_us_central1_a:
  |  |--10.30.1.5
  |--@network_devices:
  |  |--leaf-01.dc1.corp.example.com
  |  |--leaf-02.dc1.corp.example.com
  |  |--spine-01.dc1.corp.example.com
  |--@role_app:
  |  |--10.10.1.5
  |  |--10.20.1.5
  |--@role_db:
  |  |--10.10.1.6
  |  |--10.30.1.5
  |--@env_prod:
  |  |--10.10.1.5
  |  |--10.10.1.6
  |  |--10.20.1.5
  |  |--10.30.1.5

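role_app now contains app servers from AWS and Azure together; role_db spans AWS and GCP. A play that targets role_app addresses all of them in one pass. In AAP, though, a Job runs on a single execution node, so for network-isolated segments you split the run into per-segment Job Templates, each bound to the instance group that can reach its targets.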
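
Because the keyed groups are shared across sources, ordinary host patterns can slice the merged graph. A minimal sketch, assuming the group names from the graph above and a hypothetical health endpoint on port 8080 at /healthz:

```yaml
---
# Targets hosts that are in BOTH role_app and env_prod, regardless of cloud.
- name: Check app servers across clouds
  hosts: "role_app:&env_prod"
  gather_facts: false
  tasks:
    - name: Probe the application health endpoint (port and path are assumptions)
      ansible.builtin.uri:
        url: "http://{{ ansible_host | default(inventory_hostname) }}:8080/healthz"
        status_code: 200
      delegate_to: localhost   # the HTTP call is made from the node running the job
```

The `:&` operator is an intersection, so a host must carry both the role and the environment tag to be selected.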

Automation Mesh: Routing Execution

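Automation mesh has three node types that matter here: control nodes, which schedule and track Jobs; hop nodes, which only relay mesh traffic across network boundaries; and execution nodes, which run the playbooks against the targets.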

A real layout for a bank’s prod workflow:

[Control] --(public)--> [Hop: DMZ-1] --(private)--> [Exec: AWS-prod]
                              |
                              +--(private)-----> [Exec: Azure-prod]
                              |
                              +--(MPLS)--> [Hop: DC1-mgmt] --(VLAN-100)--> [Exec: vSphere-prod]
                                                          \
                                                           +--(VLAN-200)--> [Exec: Network-mgmt]

Each execution node has the credentials and network reachability for its segment. The control node never directly touches the prod targets — it only schedules Jobs to the right execution nodes, which do the actual SSH/WinRM/HTTPS calls.

In AAP, configure an Instance Group per region/segment, then bind inventories to instance groups:

Inventory "AWS-prod-us-east-1" → instance_group "exec-aws-prod-us-east-1"
Inventory "Azure-prod-eastus"  → instance_group "exec-azure-prod-eastus"
Inventory "vSphere-prod-dc1"   → instance_group "exec-vsphere-prod-dc1"

When you launch a Job Template against Inventory "AWS-prod-us-east-1", AAP routes it through the mesh to the AWS execution node — no separate “did you remember to switch context?” step.
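
The same binding can be kept in code. A sketch using the ansible.controller collection (awx.awx upstream), assuming the inventory and instance group names above, an organization named Default, and controller connection details supplied through the CONTROLLER_HOST and CONTROLLER_OAUTH_TOKEN environment variables:

```yaml
---
- name: Bind inventories to mesh instance groups
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Pin each inventory to the instance group that can reach it
      ansible.controller.inventory:
        name: "{{ item.inventory }}"
        organization: Default            # assumption
        instance_groups:
          - "{{ item.instance_group }}"
        state: present
      loop:
        - { inventory: AWS-prod-us-east-1, instance_group: exec-aws-prod-us-east-1 }
        - { inventory: Azure-prod-eastus, instance_group: exec-azure-prod-eastus }
        - { inventory: vSphere-prod-dc1, instance_group: exec-vsphere-prod-dc1 }
```

Keeping this in a playbook means the routing rules go through the same review process as the changes they carry.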

Workflow Templates: Chaining Job Templates

A Workflow Template is a DAG. Each node is a Job Template (or another workflow). Edges are conditional.

Example: a production app deployment that touches vSphere, AWS RDS, and Kubernetes:

[Pre-check]
  |
  +--on_success--> [Backup vSphere VMs]
                         |
                         +--on_success--> [Snapshot AWS RDS]
                                                |
                                                +--on_success--> [Drain LB traffic]
                                                                       |
                                                                       +--on_success--> [Rolling upgrade k8s]
                                                                                              |
                                                                                              +--on_success--> [Validate]
                                                                                                                    |
                                                                                                                    +--on_success--> [Restore traffic]
                                                                                                                    +--on_failure--> [Rollback k8s] --> [Restore traffic]
                                                                                              |
                                                                                              +--on_failure--> [Restore RDS] --> [Restore vSphere]

In AAP, this is a Workflow Template built from 10 distinct Job Templates (Restore traffic is reused on two paths). Each Job Template already exists; edges are configured in the Workflow visualizer.

Each Job Template can pass data to the next via set_stats:

- name: Record current image tag
  ansible.builtin.set_stats:
    data:
      previous_image_tag: "{{ current_image_tag }}"
    per_host: false

Subsequent Job Templates in the workflow can read previous_image_tag from extra_vars.
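
On the consuming side, the value arrives as an ordinary variable. A sketch of a rollback playbook that refuses to run without it; the Deployment name, namespace, and image repository are hypothetical:

```yaml
---
- name: Roll back to the image recorded by an earlier workflow node
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Fail early if the upstream node did not publish the tag
      ansible.builtin.assert:
        that:
          - previous_image_tag is defined
          - previous_image_tag | length > 0
        fail_msg: "previous_image_tag was not passed in by the workflow"

    - name: Point the Deployment back at the previous image
      kubernetes.core.k8s:
        state: patched
        definition:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: payments-api          # hypothetical
            namespace: payments         # hypothetical
          spec:
            template:
              spec:
                containers:
                  - name: payments-api
                    image: "registry.corp.example.com/payments-api:{{ previous_image_tag }}"
```

The assert makes a missing artifact a loud failure instead of a rollback to an empty tag.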

Dry-Run Gating: Diff → Approval → Apply

The single highest-leverage practice in multi-cloud Ansible: never apply without a diff. The pattern:

  1. Dry-run Job Template runs the same playbook with --check --diff.
  2. Output is rendered to a diff document and posted to a PR or Slack thread.
  3. A human approves.
  4. Apply Job Template runs the same playbook without --check.
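
For the dry run to be trustworthy, the playbook itself has to behave sensibly in check mode. A minimal sketch, assuming a hypothetical webservers group and nginx template: read-only probes are forced to run with check_mode: false, and steps that only make sense after a real change are skipped when ansible_check_mode is true.

```yaml
---
- name: Config change that is safe to dry-run
  hosts: webservers
  become: true
  tasks:
    - name: Read the running nginx version (read-only, so run it even in check mode)
      ansible.builtin.command: nginx -v
      register: nginx_version
      changed_when: false
      check_mode: false

    - name: Render the config (reports a diff under --check --diff)
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: reload nginx

    - name: Apply pending handlers before testing
      ansible.builtin.meta: flush_handlers

    - name: Smoke test (skipped during the dry run because nothing was changed)
      ansible.builtin.uri:
        url: http://localhost/
        status_code: 200
      when: not ansible_check_mode

  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```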

In AAP this is two Job Templates wired in a Workflow with an Approval Node between them:

[Dry-run] --on_success--> [Approval Node] --on_approved--> [Apply] --on_success--> [Validate]
                                          --on_denied--> [Notify-denied]

The Approval Node pauses the workflow until an authorized user approves or denies it, either in the AAP UI or through the API (which is how chat or ticketing integrations usually hook in). This is your last-line audit gate before a production change.
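
The gated workflow can also be declared with the ansible.controller collection instead of the visualizer. This is a sketch under assumptions: the Job Template names (app-dry-run, app-apply, app-validate, notify-denied) and the Default organization are hypothetical, and the option names should be checked against the collection version you have installed.

```yaml
---
- name: Define the gated change workflow
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Dry-run, approval, apply, validate
      ansible.controller.workflow_job_template:
        name: gated-app-change              # hypothetical
        organization: Default               # assumption
        state: present
        workflow_nodes:
          - identifier: dry-run
            unified_job_template:
              name: app-dry-run             # Job Template with job type Check
              type: job_template
              organization:
                name: Default
            related:
              success_nodes:
                - identifier: approval
          - identifier: approval
            unified_job_template:
              name: Approve production change
              type: workflow_approval
              timeout: 14400                # seconds (4 hours)
            related:
              success_nodes:                # approved
                - identifier: apply
              failure_nodes:                # denied or timed out
                - identifier: notify-denied
          - identifier: apply
            unified_job_template:
              name: app-apply
              type: job_template
              organization:
                name: Default
            related:
              success_nodes:
                - identifier: validate
          - identifier: validate
            unified_job_template:
              name: app-validate
              type: job_template
              organization:
                name: Default
          - identifier: notify-denied
            unified_job_template:
              name: notify-denied
              type: job_template
              organization:
                name: Default
```

For an approval node, the success edge is the approved path and the failure edge covers both denial and timeout.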

Identity Across Environments

The cleanest pattern: per-environment credential, no shared secrets.

Target        | AAP Credential Type                 | Backing identity
On-prem Linux | Machine                             | SSH key, AD-joined service account
Windows       | Machine (winrm/kerberos)            | gMSA in AD
Network gear  | Network                             | TACACS+ user
AWS           | Amazon Web Services                 | IAM Role assumed via OIDC from AAP’s K8s SA
Azure         | Microsoft Azure Resource Manager    | Workload Identity from AAP’s K8s SA
GCP           | Google Compute Engine               | WIF mapped from AAP’s K8s SA
vSphere       | VMware vCenter                      | SSO user with custom RBAC
Kubernetes    | OpenShift / Kubernetes Bearer Token | Short-lived OIDC token

Notice the pattern: AAP runs in a K8s cluster (or OpenShift), and its workload identity (the K8s ServiceAccount of the Execution Environment pod) is the root of trust for the clouds. Each cloud is configured to trust that identity via OIDC, so no long-lived cloud access keys are stored in AAP.
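
Scoping shows up concretely in how a Job Template is defined: it lists only the credentials for its own environment. A sketch with the ansible.controller collection; the template, project, playbook, and credential names are hypothetical, and the credentials are assumed to exist already in AAP.

```yaml
---
- name: Define a Job Template with per-environment credentials
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: AWS apply template sees AWS and Linux credentials only
      ansible.controller.job_template:
        name: aws-app-apply                 # hypothetical
        organization: Default               # assumption
        project: hybrid-automation          # hypothetical
        playbook: playbooks/aws_app.yml     # hypothetical
        inventory: AWS-prod-us-east-1
        job_type: run
        credentials:
          - aws-prod-oidc-role              # hypothetical credential name
          - linux-prod-ssh                  # hypothetical credential name
        instance_groups:
          - exec-aws-prod-us-east-1
        state: present
```

A compromised or buggy playbook in this template has no Azure, GCP, or vSphere credential to misuse.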

Blast-Radius-Aware Rollouts

For changes that touch many hosts, use these patterns to limit the damage radius:

Pattern 1: serial: with percentages

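- hosts: webservers
  become: true
  serial: "10%"              # 10% of the fleet at a time
  max_fail_percentage: 5     # abort if more than 5% of a batch fails
  tasks:
    - name: Apply config
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: reload nginx
  handlers:
    - name: reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded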

serial: "10%" rolls through the fleet 10% at a time. Combine it with max_fail_percentage: 5 so that the play aborts when more than 5% of the hosts in a batch fail.
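
serial also accepts a list, which gives a ramp: a single host first, then progressively larger batches. A sketch, assuming a hypothetical app_deploy role:

```yaml
---
- name: Ramp the rollout instead of using a fixed batch size
  hosts: webservers
  become: true
  serial:
    - 1          # one host as a smoke test
    - "10%"      # then a tenth of the fleet
    - "50%"      # then half at a time until every host is done
  max_fail_percentage: 0   # any failed host in a batch stops the rollout
  roles:
    - role: app_deploy     # hypothetical role
```

The last entry in the list is reused for all remaining batches, so the ramp does not need to add up to 100%.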

Pattern 2: Canary regions

- hosts: aws_region_us_east_1   # canary first
  tasks: [ ... ]

- hosts: aws_region_us_west_2   # then west
  tasks: [ ... ]

- hosts: az_region_eastus       # then Azure east
  tasks: [ ... ]

In AAP this becomes a Workflow with one Job Template per region, gated by validation checks between each.
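
Within a single playbook, the same gate can be approximated with a health check after the canary play. This is a sketch: the app_deploy role and the /healthz endpoint on port 8080 are assumptions, not part of the lesson's lab.

```yaml
---
- name: Canary region
  hosts: aws_region_us_east_1
  become: true
  any_errors_fatal: true        # a failed canary stops the whole run
  roles:
    - role: app_deploy          # hypothetical role
  post_tasks:
    - name: Wait for the service to report healthy (endpoint is an assumption)
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}:8080/healthz"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 10
      delay: 15
      delegate_to: localhost
      become: false

- name: Remaining regions
  hosts: "aws_region_us_west_2:az_region_eastus"
  become: true
  serial: "25%"
  roles:
    - role: app_deploy
```

The AAP workflow version is still preferable for production, because each region then gets its own job record, credentials, and execution node.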

Pattern 3: Ringed deployment

Tag hosts with rings: ring_0 (canary, 1%), ring_1 (early adopters, 10%), ring_2 (general, 89%). Deploy ring-by-ring with bake time between rings.
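
A ring rollout can be sketched as one play per ring with an explicit bake period in between. The ring_0 and ring_1 groups are assumed to exist in inventory (for example from a keyed group on a Ring tag), and the app_deploy role and 30-minute bake time are placeholders.

```yaml
---
- name: Ring 0 (canary)
  hosts: ring_0
  become: true
  any_errors_fatal: true
  roles:
    - role: app_deploy          # hypothetical role

- name: Bake before widening the blast radius
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Let ring 0 run under real traffic
      ansible.builtin.pause:
        minutes: 30             # assumption: tune to your alerting window

- name: Ring 1 (early adopters)
  hosts: ring_1
  become: true
  serial: "25%"
  max_fail_percentage: 5
  roles:
    - role: app_deploy
```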

Rollback Strategy

Ansible isn’t transactional. You can’t BEGIN; ... ROLLBACK; an apt install + a route change. Rollback discipline:

The rollback Job Template is a peer to the apply Job Template, not an afterthought:

[Apply] --on_failure--> [Rollback] --on_success--> [Restore traffic] --> [Notify-failure]
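
Inside a playbook, block, rescue, and a final fail give the same shape at task level: undo locally, then still report failure so the workflow takes its on_failure edge. A sketch using the nginx example from earlier; the backup path is an assumption.

```yaml
---
- name: Apply with task-level rollback
  hosts: webservers
  become: true
  serial: "10%"
  tasks:
    - name: Change config, restoring the backup on failure
      block:
        - name: Keep a copy of the working config
          ansible.builtin.copy:
            src: /etc/nginx/nginx.conf
            dest: /etc/nginx/nginx.conf.bak   # assumption: backup location
            remote_src: true

        - name: Render the new config
          ansible.builtin.template:
            src: nginx.conf.j2
            dest: /etc/nginx/nginx.conf

        - name: Validate before reloading
          ansible.builtin.command: nginx -t
          changed_when: false

        - name: Reload nginx
          ansible.builtin.service:
            name: nginx
            state: reloaded
      rescue:
        - name: Restore the previous config
          ansible.builtin.copy:
            src: /etc/nginx/nginx.conf.bak
            dest: /etc/nginx/nginx.conf
            remote_src: true

        - name: Reload nginx with the restored config
          ansible.builtin.service:
            name: nginx
            state: reloaded

        - name: Mark the host failed so the workflow follows on_failure
          ansible.builtin.fail:
            msg: "Config change failed on {{ inventory_hostname }}; previous config restored"
```

Without the final fail task, a successful rescue would make the job look green and hide the failed change from the workflow.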

Audit Trail

AAP emits Job Events as a structured stream. Each event has timestamp, host, task, status, and diff. Ship them to your audit pipeline:

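# AAP: Settings > Logging (the same keys are exposed at /api/v2/settings/logging/)
LOG_AGGREGATOR_ENABLED: true
LOG_AGGREGATOR_TYPE: splunk
LOG_AGGREGATOR_HOST: splunk-hec.corp.example.com
LOG_AGGREGATOR_PORT: 5140
LOG_AGGREGATOR_PROTOCOL: udp
LOG_AGGREGATOR_LOGGERS: [awx, activity_stream, job_events, system_tracking]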

Tag every job with the change ticket ID:

# Workflow extra_vars (prompted at launch through a survey or the API)
change_id: CHG0001234
approver: jdoe   # example value

Now your audit pipeline has: who launched the workflow, when each task ran, what changed on which host, with which inputs — all queryable.
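
Tagging only helps if it cannot be skipped. A small guard at the top of each playbook can enforce it; the ticket format (CHG followed by seven digits) is inferred from the example ID used in this lesson and is an assumption about your ticketing system.

```yaml
---
- name: Refuse to run without audit metadata
  hosts: all
  gather_facts: false
  pre_tasks:
    - name: Require a well-formed change ticket and an approver
      ansible.builtin.assert:
        that:
          - change_id is defined
          - change_id is match('^CHG[0-9]{7}$')
          - approver is defined
        fail_msg: "Launch with change_id (CHG plus 7 digits) and approver set"
      run_once: true
      delegate_to: localhost
```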

Hands-on Free Lab: Cross-Platform Inventory + Workflow

Free, runs against LocalStack (AWS), Azurite (Azure), kind (K8s), and a local Multipass VM (on-prem). The full lab is documented at github.com/example/ansible-hybrid-lab; here’s the skeleton:

mkdir -p ~/ansible-hybrid-lab && cd ~/ansible-hybrid-lab

# Build aggregator inventory
mkdir -p inventory
cat > inventory/01-static.yml <<'EOF'
all:
  children:
    onprem:
      hosts:
        local-vm:
          ansible_host: 192.168.64.10
          ansible_user: ubuntu
EOF

cat > inventory/02-aws.aws_ec2.yml <<'EOF'
plugin: amazon.aws.aws_ec2
endpoint_url: http://localhost:4566   # LocalStack
regions:
  - us-east-1
filters:
  instance-state-name: running
keyed_groups:
  - key: placement.region
    prefix: aws_region
EOF

cat > inventory/03-k8s.yml <<'EOF'
plugin: kubernetes.core.k8s
connections:
  - kubeconfig: ~/.kube/config
    context: kind-ansible-lab
EOF

# Build hybrid playbook
cat > site.yml <<'EOF'
---
- name: Phase 1 — On-prem hosts
  hosts: onprem
  tasks:
    - name: Ensure nginx
      ansible.builtin.apt:
        name: nginx
        state: present
      become: true

- name: Phase 2 — AWS hosts (LocalStack)
  hosts: aws_region_us_east_1
  gather_facts: false
  tasks:
    - name: Ping
      ansible.builtin.ping:

- name: Phase 3 — K8s namespace
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Apply baseline
      kubernetes.core.k8s:
        context: kind-ansible-lab
        state: present
        definition:
          apiVersion: v1
          kind: Namespace
          metadata:
            name: hybrid-lab
EOF

# Run
ansible-inventory -i inventory --graph
ansible-playbook -i inventory site.yml

Three different platforms, one inventory, one playbook, three plays — that’s the hybrid pattern in miniature.
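
The lab depends on two collections that are not part of ansible-core. A requirements file like the following, installed with ansible-galaxy collection install -r requirements.yml, makes that explicit. No versions are pinned in this sketch; pin them in a real project. The Python libraries boto3 and kubernetes are also needed on the machine that runs the playbook.

```yaml
---
collections:
  - name: amazon.aws        # aws_ec2 inventory plugin
  - name: kubernetes.core   # k8s inventory plugin and k8s module
```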

Common Mistakes & Troubleshooting

1. Cross-platform inventory mixes hostnames in unexpected ways. AWS uses private_ip_address while vSphere uses the VM name. When sources are merged, the same host can appear under multiple identifiers. Use compose: ansible_host: ... in each plugin to normalize the connection address, and a hostnames: setting per source that yields unique names.

2. Automation mesh hop node can’t reach execution node. Mesh listens on TCP port 27199 by default. Open that port in the firewall from the node that initiates the peer connection to the node that listens. Once the connection is up, traffic flows in both directions over it.

3. Workflow Approval Node times out. An approval node has no timeout unless you set one; when a configured timeout expires, the node fails and the denied path runs. Set the timeout per node and size it for human response time, not machine time.

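4. set_stats data isn’t available in the next Job Template. AAP collects set_stats data as job artifacts and passes it to downstream workflow nodes as extra_vars; there is no --stats switch to enable. Check that the task ran with per_host: false (per-host stats are not forwarded), that it was not skipped, and that the consuming node is downstream of the producer in the same workflow.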

5. Multi-region rollout: one region’s failure shouldn’t block others. Use parallel paths in the Workflow Template instead of a serial chain. Each region is an independent path; a failure in us-east-1 doesn’t block eu-west-1. Use a convergence node at the end to aggregate.

6. Identity rotation breaks AAP credentials. You rotated the OIDC trust relationship in AWS, and jobs that were already running fail when their tokens can no longer be refreshed. AAP fetches credentials per job, so the next job picks up the new trust configuration. Plan rotations for low-traffic windows, or pre-warm credentials.

7. Audit pipeline floods Splunk with low-value events. AAP emits an event per task per host, so a 10,000-host job with 20 tasks produces 200,000+ events. Filter at AAP or at the forwarder (for example, index only failed events), or size the Splunk index for the volume.
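
For mistake 5, the parallel-paths layout can be sketched with the ansible.controller collection. Nodes with no incoming edge start in parallel, and the final node waits for all of them. The Job Template names and the Default organization are hypothetical; check option names against your installed collection version.

```yaml
---
- name: Define a regional rollout with a convergence node
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Two independent regional paths, one end-to-end validation
      ansible.controller.workflow_job_template:
        name: regional-rollout                # hypothetical
        organization: Default                 # assumption
        state: present
        workflow_nodes:
          - identifier: deploy-us-east-1
            unified_job_template:
              name: deploy-aws-us-east-1      # hypothetical
              type: job_template
              organization:
                name: Default
            related:
              always_nodes:                   # continue whether or not this region failed
                - identifier: validate-e2e
          - identifier: deploy-eu-west-1
            unified_job_template:
              name: deploy-aws-eu-west-1      # hypothetical
              type: job_template
              organization:
                name: Default
            related:
              always_nodes:
                - identifier: validate-e2e
          - identifier: validate-e2e
            all_parents_must_converge: true   # wait for every regional path
            unified_job_template:
              name: validate-end-to-end       # hypothetical
              type: job_template
              organization:
                name: Default
```

Using always edges into the convergence node means the validation step still runs and can report which regions are healthy after a partial failure.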

Best Practices

Security Notes

Q&A — 14 Questions

Q1. Do I need AAP, or can I do this with plain ansible-playbook + GitHub Actions? For 1–10 engineers and < 1,000 hosts, plain Ansible + GitHub Actions + a shared notification channel works fine. AAP shines at: multi-team coordination, RBAC across teams, automation mesh for network-isolated environments, and central audit. If you’re spending more time on coordination than playbook authoring, switch to AAP (or AWX, the open-source upstream).

Q2. AAP vs AWX vs Ansible Tower — what’s the difference? Tower was the old name. AAP is the supported Red Hat product (subscription required). AWX is the upstream open-source project, no support, faster-moving. Same UI, same workflow concepts. Most regulated industries pick AAP for support and vendor accountability.

Q3. Can a workflow span multiple AAP instances? Not directly. AAP’s API lets a playbook in one workflow launch a Job Template on a peer instance, using the awx CLI, the ansible.controller collection, or plain HTTP calls. Pattern: a “global” AAP launches “regional” AAP workflows. This is less common now that automation mesh removes most of the need.

Q4. How does AAP handle execution environments (EEs)? EEs are container images with the collections + Python deps each playbook needs. AAP ships with default EEs, you build custom ones with ansible-builder, push to a registry, and configure Job Templates to use them. Each Job runs in a fresh EE container — no state carries over.

Q5. Should I run a single workflow that touches all clouds, or one workflow per cloud? Per cloud for simple changes (deploy a config update). Cross-cloud for changes that have actual cross-cloud dependencies (a DR exercise that fails over from AWS to Azure).

Q6. How do I handle a change where AWS succeeds but Azure fails halfway? Workflow on_failure branches that do best-effort rollback in each environment. This is partial — true atomic distributed transactions don’t exist. The discipline is: design changes to be reversible, snapshot before, fail fast.

Q7. Can I use Terraform alongside Ansible in a workflow? Yes — cloud.terraform.terraform module runs Terraform from Ansible. Common pattern: Terraform creates infra, Ansible configures it. Or AAP’s Workflow Template chains a Terraform Cloud run + an Ansible Job Template.

Q8. How do I audit who approved a workflow? AAP records the approving user in the workflow event log. Ship that to Splunk/Elastic with the change ticket — that’s your “who approved it” trail.

Q9. What’s the right way to handle “emergency change” workflows? A separate Workflow Template with a shorter approval timeout, scoped credentials (only what the emergency needs), and a required post-change audit task that posts to a #change-emergency channel. The emergency path should be more documented than the normal path, not less.

Q10. How do I orchestrate a DB schema migration across regions?

  1. Take backups (snapshot all regions in parallel).
  2. Run the migration on the canary region.
  3. Run validation.
  4. If valid, replay the migration on the remaining regions.
  5. Validate end-to-end.

Each step is a Job Template; the workflow chains them with explicit gates.

Q11. Workflow ran, jobs succeeded, but the system is in a bad state. What now? This is the gap between “task succeeded” and “outcome correct.” Bridge it with explicit validation Job Templates after each change, and end-to-end synthetic tests in the final workflow node. If validation fails, the workflow’s on_failure runs rollback.

Q12. How do I handle dependencies on external systems (load balancer, DNS, monitoring)? Each external system is a Job Template. A change that requires LB drain → app upgrade → monitoring re-arm becomes a 3-node workflow. Don’t try to put external-system calls inline in your app playbook; keep them as separate, reusable Job Templates.

Q13. What’s a “convergence node” in workflows? A node with multiple incoming edges. By default it runs as soon as any one parent path reaches it; set its convergence to All and it runs only after every upstream node has completed. Useful at the end of parallel paths: deploy to AWS and Azure and GCP in parallel, then converge to a single “validate end-to-end” node.

Q14. How big of an org needs this level of orchestration? ~50 engineers and ~5,000 hosts is the inflection point. Below that, plain ansible-playbook + simple inventory works. Above that, the coordination cost of plain Ansible exceeds AAP’s overhead. Plan for AAP when you start hearing “who deployed that?” with no answer.
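
As an illustration of Q7, a Job Template playbook can run Terraform and publish one of its outputs for later workflow nodes. This is a sketch: the project path and the app_subnet_id output name are hypothetical, and the terraform binary is assumed to be present in the execution environment.

```yaml
---
- name: Provision with Terraform, then hand off to Ansible
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Apply the Terraform project (path is hypothetical)
      cloud.terraform.terraform:
        project_path: "{{ playbook_dir }}/terraform/app-network"
        state: present
        force_init: true
      register: tf

    - name: Publish an output for downstream workflow nodes
      ansible.builtin.set_stats:
        data:
          app_subnet_id: "{{ tf.outputs.app_subnet_id.value }}"   # hypothetical output name
        per_host: false
```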

Quick Check

  1. What’s the role of automation mesh hop nodes?
  2. What does set_stats do in a Workflow?
  3. What’s the canonical pattern for safe production change?
  4. Which inventory plugins would you use for AWS, Azure, GCP, vSphere, and K8s?
  5. What’s a Workflow Approval Node?
  6. What’s the difference between AAP and AWX?
  7. Why is identity the hardest problem in hybrid orchestration?
  8. What’s the right rollback strategy for an Ansible-applied schema change?

Exercise

Design (don’t fully implement — that’s a quarter’s work) a Workflow Template for the following change scenario:

Quarterly OS upgrade: 200 RHEL 8 → RHEL 9 servers across vSphere on-prem, AWS EC2, Azure VMs, and GCP Compute Engine. The servers run a stateful application with a database on AWS RDS. The application has a load balancer in each cloud.

Produce:

  1. A workflow diagram (mermaid) showing all Job Templates and edges.
  2. A list of Job Templates with their inventory targets, credential bindings, and instance group bindings.
  3. The set_stats data passed between nodes.
  4. The Approval Node placements.
  5. The rollback paths for each failure point.
  6. The audit fields injected via extra_vars.

This is a real architecture deliverable; spend an hour or two on it. Compare to your real production change processes.

Cert Mapping

Glossary

Next Steps

You’ve completed the Ansible expert tier — 10 lessons covering every major platform Ansible drives in the enterprise, plus this capstone on weaving them together. The course’s bonus tier (Tier 5) covers automation patterns that generalize beyond Ansible: GitOps, infrastructure-as-code review processes, multi-tool orchestration (Ansible + Terraform + Pulumi), and the senior-engineering questions you’ll face in role design — when not to use Ansible, when to migrate off it, and how to evaluate the next-generation tools that will eventually replace it. But what you’ve learned in these four tiers is enough to handle any real-world Ansible challenge in any real production environment.
