Engineering Idempotent Ansible Collections with Molecule Testing

Most Ansible “roles” that ship inside an organization are not idempotent; they merely happen to converge the first time you run them on a clean box. Run them twice and they report changed on a task that did nothing. Run them in check mode and they explode. Run them on RHEL after you wrote them on Ubuntu and a package name is wrong. None of that is caught until production, because the role was never tested — it was demoed once and committed.

The fix is two disciplines that reinforce each other: package your automation as a Collection with explicit contracts (argument specs, defaults, semantic versioning), and prove every role with Molecule across a create -> converge -> idempotence -> verify lifecycle in CI. Idempotence stops being a code-review assertion and becomes a test that fails the build. This is the layout and test matrix I use for collections other teams depend on.

1. Lay out the collection: roles, plugins, module_utils, and galaxy.yml

A collection is a namespaced bundle (namespace.collection). Ansible resolves its content by fully qualified name, so the nginx role below is addressed as kloudvin.platform.nginx. The directory structure is fixed; scaffold it with ansible-galaxy collection init kloudvin.platform. A real collection fills out that skeleton like this:

kloudvin/platform/
├── galaxy.yml                 # collection metadata + dependencies
├── meta/runtime.yml           # requires_ansible, action/module redirects
├── plugins/
│   ├── modules/healthcheck.py    # -> kloudvin.platform.healthcheck
│   ├── filter/netmask.py         # custom filter plugins
│   └── module_utils/http.py      # shared code imported by modules
├── roles/
│   └── nginx/
│       ├── defaults/main.yml         # lowest-precedence, overridable vars
│       ├── meta/main.yml             # galaxy_info, role dependencies
│       ├── meta/argument_specs.yml   # the role's typed contract
│       ├── tasks/main.yml
│       ├── handlers/main.yml
│       ├── templates/nginx.conf.j2
│       └── molecule/default/         # per-role test scenarios
└── tests/sanity/ignore-2.17.txt   # documented sanity-test exceptions

galaxy.yml is the package manifest. Get the version and dependencies right; everything downstream keys off them:

namespace: kloudvin
name: platform
version: 1.4.0          # semver; bump per the rules in section 8
readme: README.md
authors:
  - Vinod H
description: Reusable platform roles, modules, and filters.
license:
  - MIT
tags:
  - infrastructure
  - web
  - linux
dependencies:
  ansible.posix: ">=1.5.0,<2.0.0"
  community.general: ">=8.0.0"
repository: https://github.com/kloudvin/platform
build_ignore:
  - .github
  - "*.tar.gz"
  - tests/output

meta/runtime.yml declares the minimum control-node Ansible and is required for the sanity tests to pass. Pin a floor you actually test against:

requires_ansible: ">=2.16.0"

Build and inspect the artifact locally before you ever push:

ansible-galaxy collection build           # -> kloudvin-platform-1.4.0.tar.gz
ansible-galaxy collection install kloudvin-platform-1.4.0.tar.gz -p ./collections --force

The dependencies map in galaxy.yml is for collection dependencies (other collections from Galaxy), not Python packages. Python deps go in requirements.txt and tests/requirements.txt; system-level requirements get documented in the README and installed in your test images.

2. Write tasks that are genuinely idempotent

Idempotence is a property of every individual task, not of the playbook as a whole. The contract: running the task when the system is already in the desired state must report ok, not changed, and must not error in check mode. Two anti-patterns break this constantly — command/shell and misuse of changed_when. Prefer a real module over command whenever one exists, because modules report change state honestly:

# Idempotent: ansible.builtin.template renders the file, compares it with dest, and only writes on diff.
- name: Deploy nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: "0644"
    validate: "nginx -t -c %s"   # fail BEFORE replacing a known-good file
  notify: Reload nginx

When you are forced to shell out, you own the change reporting. A bare command is always changed because Ansible cannot know what it did. Gate it with changed_when and make it safe in check mode:

# WRONG: reports changed on every run, breaks --check.
- name: Enable feature flag
  ansible.builtin.command: app-ctl enable telemetry

# RIGHT: idempotent guard + honest change reporting + check-mode safe.
- name: Read current feature flags
  ansible.builtin.command: app-ctl get telemetry
  register: flag_state
  changed_when: false          # a read never changes state
  check_mode: false            # safe to run even in --check (read-only)

- name: Enable telemetry
  ansible.builtin.command: app-ctl enable telemetry
  when: "'enabled' not in flag_state.stdout"
  changed_when: true           # if we got here, we changed something

The pattern is a read task (changed_when: false, check_mode: false), then a write task guarded by when: so it only fires when reality diverges from intent. That is how you make imperative commands behave declaratively. When the side effect is a file, creates/removes is the cheapest guard:

- name: Initialize the database schema once
  ansible.builtin.command: app-ctl db init
  args:
    creates: /var/lib/app/.schema-initialized
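
The read-then-guarded-write pattern can also be modeled outside Ansible to make the contract concrete. The following is a minimal sketch, not Ansible code: FakeHost and ensure_flag_enabled are hypothetical names invented for illustration, and the in-memory set stands in for whatever app-ctl would report. It shows the behavior a check-mode-aware module would aim for: the first run reports a change, the second does not, and check mode predicts the change without writing.

```python
from dataclasses import dataclass, field


@dataclass
class FakeHost:
    """Hypothetical stand-in for a managed host; `flags` plays the role of app-ctl state."""
    flags: set = field(default_factory=set)


def ensure_flag_enabled(host: FakeHost, flag: str, check_mode: bool = False) -> dict:
    """Read current state, then write only if it diverges from intent."""
    already_enabled = flag in host.flags      # the read: never counts as a change
    if already_enabled:
        return {"changed": False}
    if not check_mode:                        # check mode predicts, never writes
        host.flags.add(flag)
    return {"changed": True}


host = FakeHost()
first = ensure_flag_enabled(host, "telemetry")
second = ensure_flag_enabled(host, "telemetry")
print(first, second)
```

The second call returning changed=False is exactly what the idempotence stage in section 5 checks for on every task.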

Two more rules that catch real bugs:

3. Define the role contract: precedence, defaults, and argument_specs

A role that silently does the wrong thing when a caller fat-fingers a variable is a liability. Two mechanisms make the contract explicit: where variables live (precedence) and a typed spec that validates inputs at runtime. Put every user-tunable variable in defaults/main.yml — the lowest precedence, so callers can override it from group_vars, the play, or -e. Reserve vars/main.yml for values the role author controls and does not want overridden casually (it sits very high in precedence). The simplified order, low to high, that matters day to day:

role defaults  <  inventory/group_vars  <  host_vars  <  play vars
               <  role vars (vars/main.yml)  <  block/task vars  <  extra-vars (-e)

So: tunables in defaults/, internal constants in vars/, and remember that -e beats everything — useful for CI overrides, dangerous if you rely on it for normal config.
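
To see why the placement matters, the simplified order can be sketched as a lookup over layers. This is an illustration only: the layer names and the resolve helper are assumptions made for the example, and Ansible's real precedence list has more levels than the simplified one shown here.

```python
# Simplified precedence layers, lowest first, mirroring the order above.
PRECEDENCE = [
    "role_defaults", "group_vars", "host_vars", "play_vars",
    "role_vars", "task_vars", "extra_vars",
]


def resolve(name, layers):
    """Return (value, winning_layer); the highest layer that defines `name` wins."""
    for layer in reversed(PRECEDENCE):
        if name in layers.get(layer, {}):
            return layers[layer][name], layer
    raise KeyError(name)


layers = {
    "role_defaults": {"nginx_listen_port": 80},
    "group_vars": {"nginx_listen_port": 8080},
}
print(resolve("nginx_listen_port", layers))   # group_vars overrides the role default

layers["extra_vars"] = {"nginx_listen_port": 9090}
print(resolve("nginx_listen_port", layers))   # -e beats everything
```

A value placed in role_vars would beat group_vars in the same lookup, which is why tunables in vars/main.yml surprise callers who try to override them from inventory.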

meta/argument_specs.yml is the role’s signature. Ansible validates it automatically before the role’s tasks run, producing a clear error instead of a confusing failure 12 tasks deep:

argument_specs:
  main:
    short_description: Install and configure nginx.
    options:
      nginx_worker_processes:
        type: str
        default: "auto"
        description: Value for the worker_processes directive.
      nginx_listen_port:
        type: int
        default: 80
        description: TCP port nginx listens on.
      nginx_server_names:
        type: list
        elements: str
        required: true
        description: server_name entries for the default vhost.
      nginx_ssl:
        type: dict
        required: false
        options:
          cert_path: { type: path, required: true }
          key_path: { type: path, required: true }

With that file present, calling the role with nginx_listen_port: "eighty" fails immediately with a type error — validation runs on role entry, with nothing else to wire up. This is the single highest-leverage file in a shared role.
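
For intuition about what that validation does, here is a toy sketch covering only a small subset of the rules (required, default, int, list). It is not Ansible's validator: SPEC and validate are hypothetical, and the real implementation handles many more types, nested options, and element checks.

```python
# Toy subset of the spec above; not Ansible's implementation.
SPEC = {
    "nginx_listen_port": {"type": "int", "default": 80},
    "nginx_server_names": {"type": "list", "required": True},
}


def validate(params, spec=SPEC):
    """Return (validated_params, errors) for the caller-supplied `params`."""
    validated, errors = {}, []
    for name, rule in spec.items():
        if name not in params:
            if rule.get("required"):
                errors.append(f"missing required argument: {name}")
            elif "default" in rule:
                validated[name] = rule["default"]
            continue
        value = params[name]
        if rule["type"] == "int":
            try:
                value = int(value)            # "8080" is accepted, "eighty" is not
            except (TypeError, ValueError):
                errors.append(f"{name}: {value!r} cannot be converted to int")
                continue
        elif rule["type"] == "list" and not isinstance(value, list):
            errors.append(f"{name}: expected a list")
            continue
        validated[name] = value
    return validated, errors


_, errors = validate({"nginx_listen_port": "eighty", "nginx_server_names": ["example.test"]})
print(errors)
```

The point of the sketch is the ordering: every input is checked up front and all errors are collected before any task touches the host.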

4. Author custom modules and filter plugins inside the collection

When a task needs real logic — call an API, compute something, enforce a non-trivial idempotent state — write a module, not a 40-line shell block. Modules live in plugins/modules/ and are addressed as namespace.collection.name. The non-negotiables: supports_check_mode set, and changed reported accurately.

#!/usr/bin/python
# plugins/modules/healthcheck.py
from ansible.module_utils.basic import AnsibleModule
from ansible_collections.kloudvin.platform.plugins.module_utils.http import probe


def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(type="str", required=True),
            expected_status=dict(type="int", default=200),
        ),
        supports_check_mode=True,   # mandatory for a testable module
    )
    url = module.params["url"]
    expected = module.params["expected_status"]

    # A health check is a read; it never changes state.
    if module.check_mode:
        module.exit_json(changed=False, msg="check mode: probe skipped")

    status = probe(url)
    if status != expected:
        module.fail_json(msg=f"{url} returned {status}, expected {expected}")

    module.exit_json(changed=False, status=status)


if __name__ == "__main__":
    main()

Shared logic goes in module_utils/ and is imported with the fully qualified ansible_collections.<ns>.<coll>.plugins.module_utils.<mod> path — that exact form is what makes the code resolvable once the collection is installed. Document the module with DOCUMENTATION, EXAMPLES, and RETURN YAML blocks; the sanity tests in section 7 fail without them.
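
The module above imports probe from module_utils/http.py, which is not shown. A minimal sketch of what that helper might look like follows, using only the standard library. The signature, the timeout default, and the injectable opener parameter are all assumptions made for illustration, not the actual file.

```python
# Hypothetical plugins/module_utils/http.py; this file is not shown in the article.
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for a GET on `url`.

    `opener` is injectable so unit tests can avoid real network calls.
    Connection-level failures (urllib.error.URLError) propagate to the caller.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the module wants the code, not an exception.
        return err.code
```

Returning the status for error responses keeps the comparison against expected_status in one place, inside the module. A production helper would more likely build on ansible.module_utils.urls.open_url, which handles proxies and certificate validation consistently with the rest of Ansible.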

Filter plugins are simpler and keep templates clean. A FilterModule class exposes a filters() method that returns a dict of name-to-callable:

# plugins/filter/netmask.py
def cidr_to_netmask(cidr):
    import ipaddress
    return str(ipaddress.ip_network(cidr, strict=False).netmask)


class FilterModule(object):
    def filters(self):
        return {"cidr_to_netmask": cidr_to_netmask}

Used in a template or task as {{ '10.0.0.0/24' | kloudvin.platform.cidr_to_netmask }}.
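
Because a filter is a plain function, it can be unit-tested without Ansible at all. The sketch below repeats the function so it runs standalone; the test name is an assumption, and in the collection you would import the function from the plugin file instead.

```python
import ipaddress


def cidr_to_netmask(cidr):
    # Same logic as plugins/filter/netmask.py, repeated so the sketch runs standalone.
    return str(ipaddress.ip_network(cidr, strict=False).netmask)


def test_cidr_to_netmask():
    assert cidr_to_netmask("10.0.0.0/24") == "255.255.255.0"
    # strict=False tolerates host bits being set in the address.
    assert cidr_to_netmask("192.168.1.10/20") == "255.255.240.0"


test_cidr_to_netmask()
```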

5. Build the Molecule scenario: create, converge, idempotence, verify

Molecule wraps the test lifecycle. A scenario is a directory under roles/<role>/molecule/<scenario>/ with three files: molecule.yml, converge.yml, and a verifier file (here a Testinfra test module). Install the tooling first:

pip install "molecule>=24.0" "molecule-plugins[docker]" ansible-lint pytest-testinfra

The molecule.yml defines the driver, the platforms (the throwaway hosts), and which verifier to run:

# roles/nginx/molecule/default/molecule.yml
role_name_check: 1
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: nginx-ubuntu2204
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
    privileged: true
    cgroupns_mode: host
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
provisioner:
  name: ansible
verifier:
  name: testinfra

converge.yml is the playbook Molecule runs to apply your role:

# roles/nginx/molecule/default/converge.yml
- name: Converge
  hosts: all
  tasks:
    - name: Run the nginx role
      ansible.builtin.include_role:
        name: kloudvin.platform.nginx
      vars:
        nginx_server_names:
          - example.test

Now run the lifecycle. The idempotence step is the heart of the whole exercise:

molecule create        # provision the container(s)
molecule converge      # apply the role once
molecule idempotence   # apply AGAIN; FAILS if any task reports changed
molecule verify        # assert final state with the verifier
molecule destroy       # tear down

Or run the entire matrix end to end, which is what CI calls:

molecule test          # destroy -> create -> converge -> idempotence -> verify -> destroy

molecule idempotence works by running converge a second time and parsing the recap: if changed != 0 on any host, it fails the build. This is the mechanical enforcement of section 2. A role that passes converge but fails idempotence is broken, full stop — and now the pipeline says so instead of a reviewer guessing.
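
The recap check can be illustrated with a few lines of Python. This is a sketch of the idea, not Molecule's actual implementation; the regular expression and the hosts_with_changes helper are assumptions, and the recap text is a made-up sample in the usual PLAY RECAP shape.

```python
import re

# Matches "<host> : ok=... changed=N ..." lines from a PLAY RECAP block.
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s.*\bchanged=(?P<changed>\d+)", re.MULTILINE
)


def hosts_with_changes(output):
    """Return {host: changed_count} for every host whose recap shows changed != 0."""
    return {
        m.group("host"): int(m.group("changed"))
        for m in RECAP_LINE.finditer(output)
        if int(m.group("changed")) != 0
    }


recap = """\
PLAY RECAP *********************************************************************
nginx-rocky9               : ok=9    changed=1    unreachable=0    failed=0
nginx-ubuntu2204           : ok=9    changed=0    unreachable=0    failed=0
"""
offenders = hosts_with_changes(recap)
print(offenders)
```

An empty result means the second converge was a no-op on every platform; anything else is a failed build.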

The verify stage asserts the outcome, not the steps. With Testinfra you write Python assertions against the live container:

# roles/nginx/molecule/default/tests/test_default.py
def test_nginx_running(host):
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled


def test_listening_on_80(host):
    assert host.socket("tcp://0.0.0.0:80").is_listening


def test_config_is_valid(host):
    assert host.run("nginx -t").rc == 0

If you prefer to stay in YAML, set verifier.name: ansible and write a verify.yml playbook using ansible.builtin.assert. Both are first-class; Testinfra reads more cleanly for service/port/file assertions.

6. Cover multiple distros and swap Docker for Podman

The bug you are hunting is the package that is named nginx on both Debian and RHEL but comes from EPEL on RHEL, or the service that is nginx everywhere while the config path differs. Catch it by adding more platforms to the section 5 scenario (note Rocky’s init path differs):

platforms:
  - name: nginx-ubuntu2204
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
    command: /lib/systemd/systemd
    volumes: ["/sys/fs/cgroup:/sys/fs/cgroup:rw"]
    cgroupns_mode: host          # same as the section 5 definition
    privileged: true
  - name: nginx-rocky9
    image: geerlingguy/docker-rockylinux9-ansible:latest
    pre_build_image: true
    command: /usr/sbin/init
    volumes: ["/sys/fs/cgroup:/sys/fs/cgroup:rw"]
    cgroupns_mode: host
    privileged: true

Add a Debian 12 entry the same way. molecule test now converges and checks idempotence on all of them. That is where OS-conditional logic earns its keep:

- name: Install nginx (handles per-distro package source)
  ansible.builtin.package:
    name: nginx
    state: present
  # EPEL on EL is set up in an earlier task gated on ansible_os_family.

Switching to Podman for rootless CI runners is a one-line driver change plus the matching plugin (pip install molecule-plugins[podman]):

driver:
  name: podman

The platform definitions are otherwise identical. Podman is the default on modern RHEL-family runners and avoids needing a Docker daemon socket, which matters for least-privilege CI.

Use pre_build_image: true with the geerlingguy/docker-*-ansible images. They ship systemd already working, so service/systemd tasks behave like a real host. Building from a bare ubuntu:22.04 and trying to run systemd inside it is a notorious time sink — don’t.

7. Lint, run sanity tests, and wire up GitHub Actions

Three checks run before merge. ansible-lint enforces idempotency-adjacent rules (no bare command, no latest package versions, FQCN usage) and is configurable:

# .ansible-lint
profile: production      # the strictest built-in profile
exclude_paths:
  - molecule/
  - tests/output/

ansible-lint                       # lint roles and playbooks

The collection sanity tests are Ansible’s own structural checks on your plugins — valid DOCUMENTATION, no Python 2 leftovers, correct argument specs. Run them from inside the installed collection tree:

ansible-test sanity --docker -v

Genuine, justified exceptions are recorded per Ansible version in tests/sanity/ignore-<version>.txt; an empty or absent file means zero ignored failures, which is the goal.

Now make all of it a required check. This workflow lints and executes the full Molecule matrix; ansible-test sanity slots in as a third job that uses the same collection-path checkout as the molecule job:

# .github/workflows/ci.yml
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ansible-core ansible-lint
      - run: ansible-lint

  molecule:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        scenario: [default]
    steps:
      - name: Check out into the collection path
        uses: actions/checkout@v4
        with:
          path: ansible_collections/kloudvin/platform
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install "molecule>=24.0" molecule-plugins[docker] pytest-testinfra ansible-core
      - name: Run Molecule
        working-directory: ansible_collections/kloudvin/platform/roles/nginx
        run: molecule test -s ${{ matrix.scenario }}

The non-obvious bit: check the repo out into ansible_collections/<namespace>/<name>/. Ansible resolves collections by that path layout, so both ansible-test and any kloudvin.platform.* reference in converge.yml only work when the working tree sits there. Skip this and CI fails with cryptic “collection not found” errors on a role that passes locally.
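
A small preflight check can catch a wrong checkout path before Molecule or ansible-test produce a confusing error. The sketch below is an assumption-laden illustration: collection_root is a hypothetical helper, and the example paths are invented. It returns the parent of ansible_collections/, which is the directory a collections search path entry would point at.

```python
from pathlib import Path


def collection_root(path, namespace, name):
    """Return the parent of ansible_collections/<namespace>/<name>, or None.

    None means `path` is not inside the layout Ansible expects.
    """
    parts = Path(path).parts
    wanted = ("ansible_collections", namespace, name)
    for i in range(len(parts) - len(wanted) + 1):
        if parts[i:i + len(wanted)] == wanted:
            return Path(*parts[:i])
    return None


good = "/home/runner/work/repo/ansible_collections/kloudvin/platform/roles/nginx"
bad = "/home/runner/work/repo/roles/nginx"
print(collection_root(good, "kloudvin", "platform"))
print(collection_root(bad, "kloudvin", "platform"))
```

In CI this could run against the current working directory as the first step of the molecule job and fail fast when it returns None.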

8. Publish to Galaxy / Automation Hub with semantic versioning

The version in galaxy.yml is a promise to consumers, so follow semver strictly:

Build, then publish. To public Galaxy you need an API token from your Galaxy profile:

ansible-galaxy collection build
ansible-galaxy collection publish kloudvin-platform-1.4.0.tar.gz --api-key "$GALAXY_TOKEN"

For private Automation Hub (or any pulp/galaxy_ng server), point at its API and token in ansible.cfg rather than passing flags around:

# ansible.cfg
[galaxy]
server_list = automation_hub

[galaxy_server.automation_hub]
url = https://hub.internal.kloudvin.com/api/galaxy/content/published/
token = <hub-token>

Then ansible-galaxy collection publish kloudvin-platform-1.4.0.tar.gz resolves the server from config. Consumers pin you in their requirements.yml and get reproducible installs:

# requirements.yml
collections:
  - name: kloudvin.platform
    version: ">=1.4.0,<2.0.0"

ansible-galaxy collection install -r requirements.yml

The version range with a major-version ceiling means consumers automatically receive your bug fixes and new roles but never a breaking change without an explicit bump — which is the entire point of publishing with semver instead of telling people to track main.
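
To make the effect of that range concrete, here is a minimal sketch of range matching on plain MAJOR.MINOR.PATCH versions. It is not the resolver ansible-galaxy uses; parse and satisfies are hypothetical helpers, and pre-release tags are deliberately out of scope.

```python
def parse(version):
    """Turn '1.4.0' into (1, 4, 0); pre-release tags are out of scope here."""
    return tuple(int(part) for part in version.split("."))


OPS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def satisfies(version, spec):
    """True if `version` meets every comma-separated clause in `spec`."""
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators before one-character ones.
        op = next(o for o in (">=", "<=", "==", ">", "<") if clause.startswith(o))
        if not OPS[op](parse(version), parse(clause[len(op):])):
            return False
    return True


for candidate in ("1.3.9", "1.4.0", "1.9.2", "2.0.0"):
    print(candidate, satisfies(candidate, ">=1.4.0,<2.0.0"))
```

Only 1.4.0 and 1.9.2 fall inside the range: patch and minor releases flow through, while 2.0.0 waits for the consumer to raise the ceiling.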

Verify

Before you tag a release, run the four gates end to end:

ansible-lint                              # 1. production profile clean
ansible-test sanity --docker              # 2. no undocumented ignores
cd roles/nginx && molecule test           # 3. converge + idempotence + verify, all platforms
ansible-galaxy collection build           # 4. artifact builds cleanly

The decisive signal is step 3. A clean molecule test means every platform provisioned, the role converged, a second converge reported zero changes (proven idempotence), and the verifier asserted the real end state. Green across Ubuntu, Rocky, and Debian means the role is safe to publish. As a final smoke test, install the built tarball into a scratch path (ansible-galaxy collection install kloudvin-platform-*.tar.gz -p /tmp/verify --force) and confirm a deliberately bad variable value (nginx_listen_port: "eighty") is rejected with a type error from the argument spec.
