Dynamic Inventory and Secure Secrets for Ansible at Cloud Scale

A static inventory.ini is a lie the moment an autoscaling group scales out. The host you so carefully tagged web-03 got terminated during a deploy, a fresh instance took its place with a new private IP, and your next playbook run targets a machine that no longer exists. At any real scale, the inventory is not a file you maintain — it is a query you run against the cloud control plane at the start of every play. Ansible has supported this for years, but the plugin-based inventory introduced in 2.4 and matured since is genuinely good now, and most teams still under-use it.

This guide wires Ansible to live AWS and Azure inventory via amazon.aws.aws_ec2 and azure.azcollection.azure_rm, shapes hosts into useful groups with keyed_groups, compose, and the constructed plugin, caches the result, and then handles the part everyone gets wrong: secrets. We will cover when Ansible Vault is the right tool and when it is a liability, retrieve credentials at runtime from HashiCorp Vault with community.hashi_vault, and close the leak paths with no_log, encrypt_string, and an audit of callback plugins. Everything assumes ansible-core 2.16+ and the relevant collections installed from Galaxy.

1. Static vs dynamic inventory and the plugin lifecycle

Ansible has two inventory mechanisms. Static inventory is the INI or YAML file you hand-write. Dynamic inventory is produced by an inventory plugin — a Python class that, when invoked, populates the in-memory inventory by talking to some source of truth.

A common misconception is that dynamic inventory means “a script that prints JSON.” That was the old inventory script interface (the --list/--host contract); it still runs through the script plugin, but it is legacy. Modern dynamic inventory uses plugins configured by a YAML file whose name must match what the plugin accepts: the cloud plugins used here require names ending in aws_ec2.yml or azure_rm.yml (or .yaml), while constructed accepts any .yml or .yaml file. The file’s top-level plugin: key names which plugin parses it.

The lifecycle for a single ansible-playbook -i inventory.aws_ec2.yml run is:

Ansible scans -i sources. For each one it tries the enabled inventory plugins in order, asking “can you parse this?” Each plugin checks the filename suffix and the plugin: key, and the first plugin that accepts and parses the source wins.
The matching plugin runs parse(): it authenticates to the cloud, lists resources, and adds hosts, groups, and host variables to the inventory object.
keyed_groups, groups, and compose rules execute, creating derived groups and computed variables.
The fully materialized inventory is handed to the play. Plugins do not re-run mid-play; the inventory is a snapshot taken at parse time.
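
The lifecycle above can be modelled in a few lines of plain Python. This is a toy sketch, not Ansible's API: the classes, the fake host data, and the IP addresses are illustrative assumptions, and only the verify_file/parse method names are borrowed from the plugin interface. It shows the two behaviours that matter: plugins are tried in order until one accepts the source, and the result is a snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    """In-memory inventory: host name -> variables, group name -> members."""
    hosts: dict = field(default_factory=dict)
    groups: dict = field(default_factory=dict)


class ToyCloudPlugin:
    """Stand-in for a cloud plugin; accepts sources by filename suffix."""
    suffixes = (".aws_ec2.yml", ".aws_ec2.yaml")

    def __init__(self, fake_api_hosts):
        # Hypothetical data standing in for a cloud API response.
        self.fake_api_hosts = fake_api_hosts

    def verify_file(self, path):
        return path.endswith(self.suffixes)

    def parse(self, inventory, path):
        for name, variables in self.fake_api_hosts.items():
            inventory.hosts[name] = dict(variables)
            role = variables.get("tags", {}).get("Role", "unowned")
            inventory.groups.setdefault(f"role_{role}", []).append(name)


class ToyIniPlugin:
    """Stand-in for the static parser; accepts .ini files and adds nothing."""

    def verify_file(self, path):
        return path.endswith(".ini")

    def parse(self, inventory, path):
        pass


def load(sources, enabled_plugins):
    """Try plugins in order for each source; the first that accepts it wins."""
    inventory = Inventory()
    for path in sources:
        for plugin in enabled_plugins:
            if plugin.verify_file(path):
                plugin.parse(inventory, path)
                break
    return inventory


api = {"10.0.1.5": {"tags": {"Role": "web"}}, "10.0.2.9": {"tags": {}}}
snapshot = load(["prod.aws_ec2.yml"], [ToyCloudPlugin(api), ToyIniPlugin()])

# A host that appears after parsing is not part of the snapshot already taken.
api["10.0.3.7"] = {"tags": {"Role": "web"}}
print(sorted(snapshot.hosts))
print(snapshot.groups)
```

The late-arriving host is absent from the printed inventory, which is the practical meaning of "snapshot taken at parse time".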

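With default settings, the auto plugin loads whichever plugin a YAML source names in its plugin: key. Once you set enable_plugins yourself, only the plugins you list will run. In ansible.cfg: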

[defaults]
inventory = ./inventory

[inventory]
enable_plugins = amazon.aws.aws_ec2, azure.azcollection.azure_rm, constructed, ini, yaml
cache = true
cache_plugin = jsonfile
cache_connection = ./.ansible_inventory_cache
cache_timeout = 600

The order in enable_plugins is the resolution order. Put the cloud plugins before ini/yaml so a misnamed file does not get silently grabbed by the wrong parser. With an explicit list like this one (which omits auto), a plugin that is not listed never runs, even if its config file is valid.

Install the collections and confirm the inventory settings before going further:

ansible-galaxy collection install amazon.aws azure.azcollection community.hashi_vault
ansible-config dump --only-changed | grep -i inventory

2. Configuring the amazon.aws.aws_ec2 plugin

The AWS EC2 plugin discovers instances via the EC2 API. Authentication follows the standard boto3 chain — environment variables, ~/.aws/credentials, an assumed role, or instance metadata — so do not put keys in the inventory file. Create inventory/prod.aws_ec2.yml:

plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1
  - us-east-1
# Only pull what you'll target. Filtering server-side is cheaper than
# listing the whole account and discarding hosts client-side.
filters:
  tag:Environment: production
  instance-state-name: running
# Optionally assume a read-only inventory role per account.
assume_role_arn: "arn:aws:iam::111122223333:role/ansible-inventory-ro"
# Use the private IP for SSH inside the VPC.
hostnames:
  - private-ip-address
# Connect over the private IP, whatever the inventory hostname is.
compose:
  ansible_host: private_ip_address
strict: false

A few decisions worth calling out. hostnames controls the inventory hostname; ordering matters, because Ansible uses the first entry that yields a value for the instance. filters map directly to EC2 DescribeInstances filters, so push as much selection as you can server-side. Tags arrive as a tags dictionary in the host variables (for example tags.Role), and instance attributes are available under names like instance_type, placement.availability_zone, and vpc_id.

Confirm what the plugin sees before writing any rules:

ansible-inventory -i inventory/prod.aws_ec2.yml --graph
# <one-host> is a name printed by --graph (a private IP with the config above)
ansible-inventory -i inventory/prod.aws_ec2.yml --host <one-host> --yaml

For Azure, create inventory/prod.azure_rm.yml. The plugin uses the standard Azure auth chain (env vars AZURE_CLIENT_ID/AZURE_SECRET/AZURE_TENANT/AZURE_SUBSCRIPTION_ID, a managed identity, or az login):

plugin: azure.azcollection.azure_rm
include_vm_resource_groups:
  - rg-prod-app
  - rg-prod-data
# Include Scale Set VMs as well as standalone VMs.
include_vmss_resource_groups:
  - rg-prod-web
# Azure tags become host vars; pick which become groups below.
plain_host_names: true
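conditional_groups:
  # Assumes the plugin reports os_disk.operating_system_type in lowercase;
  # confirm with ansible-inventory --host before relying on this group.
  azure_linux: "os_disk.operating_system_type == 'linux'"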

plain_host_names: true gives you the VM name as the inventory hostname. Without it, the plugin appends a short hash suffix to each name, which keeps names unique across resource groups but makes them harder to read.

3. Building host groups with keyed_groups, compose, and constructed

Raw cloud inventory is a flat bag of hosts. The value is in the groups, because that is what hosts: in a play targets. Three mechanisms build them.

keyed_groups creates one group per distinct value of an expression. This is the workhorse: turn the Role tag into role_web, role_api, role_db groups automatically.

keyed_groups:
  # tag:Role=web -> group "role_web"
  - key: tags.Role
    prefix: role
    separator: "_"
  # one group per AZ, e.g. "az_eu_west_1a"
  - key: placement.availability_zone
    prefix: az
  # default_value only covers an empty value; default('') also catches a
  # missing tag, so untagged hosts land in "team_unowned" instead of vanishing
  - key: tags.Team | default('')
    prefix: team
    default_value: unowned
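
To see how a key value turns into a group name, here is a small Python approximation. It is a sketch, not the plugin's code: the function name is hypothetical, and the rule that every character outside letters, digits, and underscore becomes an underscore is an assumption chosen to match the az_eu_west_1a example above.

```python
import re


def keyed_group_name(value, prefix="", separator="_", default_value=None):
    """Approximate the group name keyed_groups derives from one key value."""
    # An empty value falls back to default_value when one is configured.
    if value == "" and default_value is not None:
        value = default_value
    raw = f"{prefix}{separator}{value}"
    # Assumed sanitization: anything outside [A-Za-z0-9_] becomes "_".
    return re.sub(r"[^A-Za-z0-9_]", "_", raw)


print(keyed_group_name("web", prefix="role"))
print(keyed_group_name("eu-west-1a", prefix="az"))
print(keyed_group_name("", prefix="team", default_value="unowned"))
```

Running names through a helper like this before writing plays makes it easier to predict what hosts: patterns will actually match.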

groups creates a single named group whose membership is a boolean Jinja expression — good for cross-cutting logic that is not a simple key:

groups:
  large_instances: "instance_type in ['m5.4xlarge', 'c5.9xlarge']"
  needs_patching: "'PatchGroup' in (tags | default({}))"

compose sets host variables from Jinja, evaluated against the host’s other facts. Use it to normalize connection variables across clouds so the same play runs everywhere:

compose:
  ansible_host: private_ip_address
  ansible_user: "'ec2-user'"
  region: placement.region

compose values are bare Jinja expressions, written without {{ }}, so a string literal needs its own inner quotes ("'ec2-user'"). Leave them off and Ansible reads the bare word as a variable reference: with strict: false the variable is silently skipped, and with strict: true the inventory parse fails. This trips up everyone once.

The constructed plugin is the missing piece for cross-source grouping. The cloud plugins can only group on facts they themselves produce. constructed runs after other inventory sources, sees the merged set of host variables, and applies keyed_groups/groups/compose across all of them. Variables from host_vars/group_vars files are loaded after inventory parsing by default, so set use_vars_plugins: true if your expressions depend on them. Put constructed last so it sees everything:

# inventory/constructed.yml
plugin: constructed
strict: false
# Load host_vars/group_vars before evaluating the expressions below.
use_vars_plugins: true
keyed_groups:
  # Build groups from a var that may come from AWS, Azure, OR group_vars.
  - key: app_tier
    prefix: tier
groups:
  # Now you can mix an AWS tag with an Azure tag uniformly.
  frontends: "app_tier == 'frontend'"

4. Caching inventory and merging multiple sources

Listing thousands of instances across regions on every ansible invocation is slow and burns API quota. Inventory caching stores the parsed result and serves it until it expires.

Enable it globally (as in the ansible.cfg above) or per-plugin inside the inventory YAML, which is more explicit:

plugin: amazon.aws.aws_ec2
regions: [eu-west-1]
cache: true
cache_plugin: jsonfile
cache_connection: ./.ansible_inventory_cache
cache_timeout: 1800

The cache key is derived from the plugin name and the path of the inventory file, not from the file's contents, so editing a filter does not invalidate an existing cache on its own; flush it after any config change. Two operational notes: a stale cache will happily target dead hosts, so keep cache_timeout short enough that a destroyed instance falls out before it causes a failed play; and force a refresh in CI or after a known scaling event with:

ansible-inventory -i inventory/ --graph --flush-cache
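
The expiry rule is easy to reason about with a small model. The sketch below assumes a file-based cache whose freshness depends only on the file's age compared with cache_timeout; the function name and the cache file name are hypothetical, and this is not the jsonfile plugin's own code.

```python
import os
import tempfile
import time


def cache_is_fresh(cache_file, cache_timeout, now=None):
    """Return True if cache_file exists and is younger than cache_timeout seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(cache_file)
    except FileNotFoundError:
        return False
    return age < cache_timeout


with tempfile.TemporaryDirectory() as cache_dir:
    cache_file = os.path.join(cache_dir, "aws_ec2_example")  # hypothetical name
    with open(cache_file, "w") as handle:
        handle.write("{}")
    written = os.path.getmtime(cache_file)
    # Ten minutes after writing, a 30-minute timeout still serves the cache.
    fresh = cache_is_fresh(cache_file, 1800, now=written + 600)
    # An hour after writing, the same cache is expired and must be refetched.
    stale = cache_is_fresh(cache_file, 1800, now=written + 3600)
    missing = cache_is_fresh(os.path.join(cache_dir, "absent"), 1800)

print(fresh, stale, missing)
```

The window between a host being destroyed and the cache expiring is at most cache_timeout seconds, which is the number to weigh against your scaling frequency.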

Merging multiple sources is where the directory form pays off. Point -i (or inventory =) at a directory, and Ansible parses every recognized file in it, unioning the results. This lets you combine AWS, Azure, a static ini of bare-metal jump hosts, and the constructed overlay into one inventory:

inventory/
  prod.aws_ec2.yml
  prod.azure_rm.yml
  bastions.ini
  constructed.yml          # must be parsed last; rename it to guarantee that (see below)

Files are parsed in alphanumeric order, and that is where constructed.yml can bite you: c sorts before p, so it would be parsed before the prod.* sources, yet it must run last. Force the ordering with numeric prefixes when needed: 10-prod.aws_ec2.yml, 90-constructed.yml. When the same host appears from two sources, variables merge according to Ansible’s precedence, with later sources winning on conflict.
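
A quick way to check the parse order of a directory is to sort the file names yourself. This sketch assumes a plain alphanumeric sort of the names; the numbered file names are the illustrative ones from the text, extended with hypothetical 20- and 30- prefixes.

```python
def parse_order(filenames):
    """Return file names in plain alphanumeric order."""
    return sorted(filenames)


original = ["prod.aws_ec2.yml", "prod.azure_rm.yml", "bastions.ini", "constructed.yml"]
renamed = [
    "10-prod.aws_ec2.yml",
    "20-prod.azure_rm.yml",
    "30-bastions.ini",
    "90-constructed.yml",
]

# Without prefixes, constructed.yml lands second, ahead of both cloud sources.
print(parse_order(original))
# With prefixes, the constructed overlay is the last file in the list.
print(parse_order(renamed)[-1])
```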

5. Ansible Vault vs external secret managers

Now the secrets. There are two distinct tools with the same word in their name, and conflating them causes real incidents.

Ansible Vault is a file encryption feature built into ansible-core. It encrypts files (or individual strings) at rest with a symmetric passphrase using AES-256. The ciphertext lives in your git repo; the passphrase lives… somewhere you have to manage. It is excellent for low-churn, version-controlled secrets: a CA private key, a license string, default DB passwords for a lab.

External secret managers — HashiCorp Vault, AWS Secrets Manager, Azure Key Vault — store secrets outside the repo and serve them at runtime over an authenticated API, with rotation, dynamic generation, leasing, and audit logs. They are the correct choice for anything that rotates, anything dynamic (short-lived cloud creds), and anything that must be audited per-access.

Dimension                   Ansible Vault                 External manager (e.g. HashiCorp Vault)
--------------------------  ----------------------------  ---------------------------------------
Where the secret lives      Encrypted in git              Outside the repo, served at runtime
Rotation                    Manual re-encrypt + commit    Native, often automatic
Audit per access            None                          Full audit log
Dynamic/short-lived creds   No                            Yes (leases, TTLs)
Works fully offline         Yes                           No (needs the API)
Bootstrap problem           Manage one passphrase         Manage one auth token/identity

The decision rule I use: if the secret is static, low-value, and benefits from being versioned with the code (lab defaults, a self-signed CA), Ansible Vault is fine. If it rotates, is high-value, or must be audited, fetch it at runtime from an external manager. Never paste a production cloud key into an Ansible Vault file and call it secure — you have only moved the rotation problem, not solved it.

For the Ansible Vault cases that remain, encrypt individual variables, not whole files, so the surrounding YAML stays diff-able. Use encrypt_string:

ansible-vault encrypt_string --vault-id prod@prompt \
  's3cr3t-db-password' --name 'db_password'

This emits an !vault tagged block you paste straight into a group_vars file. The variable name is in cleartext; only the value is encrypted. Drive the passphrase non-interactively in CI with --vault-password-file pointing at an executable script that fetches the passphrase from your secret manager and prints it on stdout, so even the Ansible Vault passphrase is never on disk.
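
A password script only has to print the passphrase on stdout. The following is a minimal sketch under stated assumptions: the VAULT_PASSPHRASE_CMD variable is a hypothetical convention for naming the command that fetches the passphrase (for example a secret manager CLI call), and nothing here is a built-in Ansible setting.

```python
#!/usr/bin/env python3
"""Vault password script sketch: prints the passphrase on stdout."""
import os
import shlex
import subprocess
import sys

# Hypothetical variable holding the command that prints the passphrase.
COMMAND_ENV_VAR = "VAULT_PASSPHRASE_CMD"


def fetch_passphrase(command):
    """Run command (an argv list) and return its stdout as the passphrase."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    passphrase = result.stdout.strip()
    if not passphrase:
        raise RuntimeError("command returned an empty passphrase")
    return passphrase


def main(environ=None, out=None):
    """Write the passphrase to out; return a process exit code."""
    environ = os.environ if environ is None else environ
    out = sys.stdout if out is None else out
    command = environ.get(COMMAND_ENV_VAR, "")
    if not command:
        print(f"{COMMAND_ENV_VAR} is not set", file=sys.stderr)
        return 1
    out.write(fetch_passphrase(shlex.split(command)) + "\n")
    return 0

# When saved as an executable script, end the file with: sys.exit(main())
```

Saved under a hypothetical name such as vault-pass.py and marked executable, it would be passed as --vault-password-file ./vault-pass.py. Failing loudly on an empty result matters: a silently empty passphrase is harder to diagnose than a non-zero exit.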

6. Runtime secret retrieval with the community.hashi_vault lookup

The cleaner pattern for dynamic environments skips file encryption entirely and pulls secrets at task runtime. The community.hashi_vault collection provides the hashi_vault lookup plugin. The playbook ships zero secrets; at execution time each control node authenticates to Vault and reads exactly the paths it needs.

First, authenticate. Avoid a long-lived root token. In CI, prefer JWT/OIDC or AppRole; locally, a token from vault login works:

export VAULT_ADDR='https://vault.internal:8200'
# AppRole example: role_id is non-secret, secret_id is short-lived.
export ANSIBLE_HASHI_VAULT_AUTH_METHOD=approle
export ANSIBLE_HASHI_VAULT_ROLE_ID="$ROLE_ID"
export ANSIBLE_HASHI_VAULT_SECRET_ID="$SECRET_ID"

Then read a KV v2 secret in a play. Note KV v2 requires data/ in the path, and the result nests the key/value pairs under a data key:

- name: Configure the application
  hosts: tier_frontend
  vars:
    # Reads field "password" from secret/data/prod/app
    db_password: "{{ lookup('community.hashi_vault.hashi_vault',
                          'secret/data/prod/app').data.password }}"
  tasks:
    - name: Render app config
      ansible.builtin.template:
        src: app.conf.j2
        dest: /etc/app/app.conf
        mode: '0640'
      no_log: true
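
The KV v2 path and result shape are a frequent source of confusion, so here is a small Python model of both. The helper names and the example values are hypothetical; the structure (a data/ segment after the mount, and the secret's fields nested under a data key next to metadata) follows the KV v2 read API.

```python
def kv2_read_path(mount, secret_path):
    """Build the API path for reading a KV v2 secret: <mount>/data/<path>."""
    return f"{mount.strip('/')}/data/{secret_path.strip('/')}"


def kv2_field(read_result, field):
    """Pull one field out of a KV v2 read result, which nests values under 'data'."""
    return read_result["data"][field]


# Hypothetical result for secret/data/prod/app; the values are placeholders.
example_result = {
    "data": {"username": "app", "password": "example-only"},
    "metadata": {"version": 3},
}

print(kv2_read_path("secret", "prod/app"))
print(kv2_field(example_result, "username"))
```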

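When a secret has several fields, the vault_kv2_get module is cleaner than repeated lookups because it fetches the whole secret once and registers it. Unlike a lookup, a module runs on the target host, so delegate it to the control node, which holds the Vault credentials:

- name: Fetch app secrets once
  community.hashi_vault.vault_kv2_get:
    path: prod/app
    engine_mount_point: secret
  delegate_to: localhost   # the control node holds the Vault credentials
  run_once: true
  register: app_secrets
  no_log: true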

- name: Use a field
  ansible.builtin.debug:
    msg: "username is {{ app_secrets.secret.username }}"
  # password field deliberately not referenced here

The real win is dynamic secrets. Read from a Vault database or cloud engine and you get a credential that exists only for this run and self-revokes when its lease expires — there is nothing to rotate and nothing to leak long-term:

db_creds: "{{ lookup('community.hashi_vault.hashi_vault',
                     'database/creds/app-ro') }}"
# db_creds.username / db_creds.password are valid for the lease TTL only

7. no_log, encrypt_string, and avoiding leakage in callbacks

Encrypting a secret at rest is pointless if it then prints to stdout, lands in the JSON log of your CI, or gets shipped to a callback plugin. Closing the leak paths is non-negotiable.

no_log: true suppresses a task’s arguments and return values from output and logging. Apply it to every task that touches a secret — the template that renders it, the command that consumes it, the lookup that fetches it. Without it, a failed task helpfully dumps the module arguments, secret and all, into the error.

- name: Create database user
  community.postgresql.postgresql_user:
    name: app
    password: "{{ db_password }}"
  no_log: true

There are sharp edges to know:

no_log on a looped task censors every item, including the item label, so a with_items over a list of secrets is suppressed wholesale, which is what you want.
no_log does not cover debug output: setting ANSIBLE_DEBUG=1 can print secret-bearing data, and connection-level verbosity (-vvvv) may expose more than task results do. Never run production playbooks with either enabled.
A task that registers a secret keeps it in a variable; later debug of that variable bypasses the original no_log. Guard the debug task too, or do not debug secret-bearing vars at all.

Callback plugins are the sneaky path. Logging and notification callbacks (for example a JSON file callback, or a callback that ships results to Splunk, Datadog, or Ansible Automation Platform) receive full task results. no_log does redact the result before it reaches callbacks, so it remains your primary control, but audit which callbacks you have enabled:

ansible-config dump | grep -i callback
ansible-doc -t callback -l

If a callback writes a job log to shared storage, treat that log as secret-bearing unless you have verified no_log covers every secret task. The failure mode is a “convenient” full-output log archived to an S3 bucket the whole org can read.

Finally, diff mode (--diff) prints file before/after content, which means rendering a config from a Vault secret with --diff on will print the secret to the terminal. Set no_log: true on template/copy tasks, or set diff: false on the task to suppress just the diff while keeping the change.

8. Tying it together in a pipeline with ephemeral credentials

The end state: a CI job that authenticates with its own identity, derives a short-lived Vault token, runs Ansible against live cloud inventory, and leaves no static secret anywhere. Cloud read access for inventory comes from an assumed IAM role / managed identity, not keys.

# GitHub Actions job (sketch). OIDC -> Vault -> Ansible.
jobs:
  configure:
    runs-on: ubuntu-latest
    permissions:
      id-token: write     # mint the OIDC token
      contents: read
    steps:
      - uses: actions/checkout@v4

      # Assume an inventory-read role via OIDC; no stored AWS keys.
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/ci-ansible-inventory
          aws-region: eu-west-1

      # Exchange the CI OIDC token for a short-lived Vault token.
      - uses: hashicorp/vault-action@v3
        with:
          url: https://vault.internal:8200
          method: jwt
          role: ci-ansible
          exportToken: true     # sets VAULT_TOKEN for later steps

      - name: Install collections
        run: ansible-galaxy collection install -r requirements.yml

      - name: Run playbook against live inventory
        env:
          VAULT_ADDR: https://vault.internal:8200   # lookups need the address as well as the token
          ANSIBLE_HASHI_VAULT_AUTH_METHOD: token   # reuse the minted token
        run: |
          ansible-playbook -i inventory/ site.yml --flush-cache

Three properties make this safe. The AWS credentials are assumed per-job and expire after an hour by default. The Vault token is minted from the pipeline’s OIDC identity, scoped by a Vault role’s policies to only the paths this job needs, and expires on the TTL that role sets. And --flush-cache guarantees the run sees the current fleet, not a snapshot from a previous job. No secret is ever written to the repo, the runner’s disk, or a log, provided every secret-touching task carries no_log.

Verify

Walk these checks before trusting the setup in anger.

# 1. The plugin parses and the expected hosts/groups exist.
ansible-inventory -i inventory/ --graph

# 2. Derived groups from keyed_groups/constructed are present.
ansible-inventory -i inventory/ --graph | grep -E 'role_|tier_|az_'

# 3. A host carries the composed connection vars.
ansible-inventory -i inventory/ --host <one-host> --yaml | grep ansible_host

# 4. Caching works: the second run is served from cache and is much faster.
time ansible-inventory -i inventory/ --graph --flush-cache  # cold: refetches and repopulates
time ansible-inventory -i inventory/ --graph                # warm: served from cache

# 5. A Vault lookup resolves (run a throwaway play).
ansible -i localhost, -m debug \
  -a "msg={{ lookup('community.hashi_vault.hashi_vault','secret/data/prod/app').data.username }}" \
  all

# 6. Secrets do NOT leak: run the real play at -vv and grep the output.
ansible-playbook -i inventory/ site.yml -vv 2>&1 | grep -i 'password\|secret' || echo "clean"

If step 6 prints anything resembling a credential, a task is missing no_log. Fix it before the pipeline ever runs.
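
Grepping for the words "password" or "secret" also matches harmless task names and can miss a leaked value that carries no such label. A complementary check, sketched below in Python, searches captured output for the actual values of known test or canary secrets. The function name, the sample output, and the secret values are all hypothetical.

```python
def find_leaks(output, secrets):
    """Return the names of secrets whose values appear verbatim in output."""
    return sorted(
        name for name, value in secrets.items() if value and value in output
    )


# Hypothetical captured playbook output: one censored task, one leaking task.
captured = """\
TASK [Create database user] ***
changed: [10.0.1.5] => {"censored": "the output has been hidden", "changed": true}
TASK [Show config] ***
ok: [10.0.1.5] => {"msg": "dsn=postgres://app:hunter2-example@db/app"}
"""

# Use throwaway canary values here, never real production secrets.
canaries = {"db_password": "hunter2-example", "api_token": "tok-example-123"}

print(find_leaks(captured, canaries))
```

Running the play in a test environment seeded with canary values keeps real credentials out of the check itself.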

Written by Vinod
