Most Ansible “roles” that ship inside an organization are not idempotent; they merely happen to converge the first time you run them on a clean box. Run them twice and they report changed on a task that did nothing. Run them in check mode and they explode. Run them on RHEL after you wrote them on Ubuntu and a package name is wrong. None of that is caught until production, because the role was never tested — it was demoed once and committed.
The fix is two disciplines that reinforce each other: package your automation as a Collection with explicit contracts (argument specs, defaults, semantic versioning), and prove every role with Molecule across a create -> converge -> idempotence -> verify lifecycle in CI. Idempotence stops being a code-review assertion and becomes a test that fails the build. This is the layout and test matrix I use for collections other teams depend on.
1. Lay out the collection: roles, plugins, module_utils, and galaxy.yml
A collection is a namespaced bundle (namespace.collection); Ansible resolves its content by fully qualified name, for example the role kloudvin.platform.nginx. The directory structure is fixed; scaffold it with ansible-galaxy collection init kloudvin.platform. A real collection fills out that skeleton like this:
kloudvin/platform/
├── galaxy.yml                      # collection metadata + dependencies
├── meta/runtime.yml                # requires_ansible, action/module redirects
├── plugins/
│   ├── modules/healthcheck.py      # -> kloudvin.platform.healthcheck
│   ├── filter/netmask.py           # custom filter plugins
│   └── module_utils/http.py        # shared code imported by modules
├── roles/
│   └── nginx/
│       ├── defaults/main.yml       # lowest-precedence, overridable vars
│       ├── meta/main.yml           # galaxy_info, role dependencies
│       ├── meta/argument_specs.yml # the role's typed contract
│       ├── tasks/main.yml
│       ├── handlers/main.yml
│       ├── templates/nginx.conf.j2
│       └── molecule/default/       # per-role test scenarios
└── tests/sanity/ignore-2.17.txt    # documented sanity-test exceptions
galaxy.yml is the package manifest. Get the version and dependencies right; everything downstream keys off them:
namespace: kloudvin
name: platform
version: 1.4.0 # semver; bump per the rules in section 8
readme: README.md
authors:
  - Vinod H
description: Reusable platform roles, modules, and filters.
license:
  - MIT
tags:
  - infrastructure
  - web
  - linux
dependencies:
  ansible.posix: ">=1.5.0,<2.0.0"
  community.general: ">=8.0.0"
repository: https://github.com/kloudvin/platform
build_ignore:
  - .github
  - "*.tar.gz"
  - tests/output
meta/runtime.yml declares the minimum control-node Ansible and is required for the sanity tests to pass. Pin a floor you actually test against:
requires_ansible: ">=2.16.0"
Build and inspect the artifact locally before you ever push:
ansible-galaxy collection build # -> kloudvin-platform-1.4.0.tar.gz
ansible-galaxy collection install kloudvin-platform-1.4.0.tar.gz -p ./collections --force
The dependencies map in galaxy.yml is for collection dependencies (other collections from Galaxy), not Python packages. Python deps go in requirements.txt and tests/requirements.txt; system-level requirements get documented in the README and installed in your test images.
2. Write tasks that are genuinely idempotent
Idempotence is a property of every individual task, not of the playbook as a whole. The contract: running the task when the system is already in the desired state must report ok, not changed, and must not error in check mode. Two anti-patterns break this constantly — command/shell and misuse of changed_when. Prefer a real module over command whenever one exists, because modules report change state honestly:
# Idempotent: ansible.builtin.template renders the file and only writes when the result differs.
- name: Deploy nginx config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: "0644"
    validate: "nginx -t -c %s" # fail BEFORE replacing a known-good file
  notify: Reload nginx
When you are forced to shell out, you own the change reporting. A bare command is always changed because Ansible cannot know what it did. Gate it with changed_when and make it safe in check mode:
# WRONG: reports changed on every run, breaks --check.
- name: Enable feature flag
  ansible.builtin.command: app-ctl enable telemetry

# RIGHT: idempotent guard + honest change reporting + check-mode safe.
- name: Read current feature flags
  ansible.builtin.command: app-ctl get telemetry
  register: flag_state
  changed_when: false # a read never changes state
  check_mode: false # safe to run even in --check (read-only)

- name: Enable telemetry
  ansible.builtin.command: app-ctl enable telemetry
  when: "'enabled' not in flag_state.stdout"
  changed_when: true # if we got here, we changed something
The pattern is a read task (changed_when: false, check_mode: false), then a write task guarded by when: so it only fires when reality diverges from intent. That is how you make imperative commands behave declaratively. When the side effect is a file, creates/removes is the cheapest guard:
- name: Initialize the database schema once
  ansible.builtin.command: app-ctl db init
  args:
    creates: /var/lib/app/.schema-initialized
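The read-then-write guard is easier to reason about outside YAML. The following is a minimal Python sketch of the same idea, not Ansible code: the `ensure_flag` function and the `flags` dictionary are hypothetical stand-ins for the `app-ctl` read and write commands.

```python
def ensure_flag(state, name, desired="enabled"):
    """Converge state[name] to desired; return True only if a write happened."""
    current = state.get(name)      # read step: never mutates anything
    if current == desired:         # guard: reality already matches intent
        return False               # the equivalent of reporting "ok"
    state[name] = desired          # write step: runs only on divergence
    return True                    # the equivalent of reporting "changed"


# Hypothetical stand-in for whatever app-ctl stores on the host.
flags = {"telemetry": "disabled"}

first = ensure_flag(flags, "telemetry")   # diverged, so it writes
second = ensure_flag(flags, "telemetry")  # already converged, so it does nothing
print(first, second)  # True False
```

The second call returning False is exactly what the idempotence check in section 5 looks for: a repeat run that reports no change.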
Two more rules that catch real bugs:
- Never use ignore_errors: true to paper over a non-idempotent task. Fix the task. failed_when exists to define what failure actually means.
- Loops over package lists belong in the module, not in with_items. ansible.builtin.package and friends accept a list in name: and converge in a single transaction, which is both faster and atomic.
3. Define the role contract: precedence, defaults, and argument_specs
A role that silently does the wrong thing when a caller fat-fingers a variable is a liability. Two mechanisms make the contract explicit: where variables live (precedence) and a typed spec that validates inputs at runtime. Put every user-tunable variable in defaults/main.yml — the lowest precedence, so callers can override it from group_vars, the play, or -e. Reserve vars/main.yml for values the role author controls and does not want overridden casually (it sits very high in precedence). The simplified order, low to high, that matters day to day:
role defaults < inventory/group_vars < host_vars < play vars
< role vars (vars/main.yml) < block/task vars < extra-vars (-e)
So: tunables in defaults/, internal constants in vars/, and remember that -e beats everything — useful for CI overrides, dangerous if you rely on it for normal config.
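To make the layering concrete, here is a small Python sketch that merges variable layers in the simplified order above. It is an illustration only: the layer names are labels invented for this example, and real Ansible has more precedence levels than this and resolves them internally.

```python
# Layers listed low to high, mirroring the simplified order above.
PRECEDENCE = [
    "role_defaults",
    "group_vars",
    "host_vars",
    "play_vars",
    "role_vars",
    "task_vars",
    "extra_vars",
]


def resolve(layers):
    """Merge variable layers so that higher-precedence layers win."""
    merged = {}
    for layer in PRECEDENCE:
        # Later (higher) layers overwrite keys set by earlier (lower) ones.
        merged.update(layers.get(layer, {}))
    return merged


layers = {
    "role_defaults": {"nginx_listen_port": 80, "nginx_worker_processes": "auto"},
    "group_vars": {"nginx_listen_port": 8080},
    "extra_vars": {"nginx_listen_port": 9090},
}

resolved = resolve(layers)
print(resolved)  # {'nginx_listen_port': 9090, 'nginx_worker_processes': 'auto'}
```

Dropping the `extra_vars` layer from the input yields 8080, which is the group_vars override callers normally rely on.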
meta/argument_specs.yml is the role’s signature. Ansible validates it automatically before the role’s tasks run, producing a clear error instead of a confusing failure 12 tasks deep:
argument_specs:
  main:
    short_description: Install and configure nginx.
    options:
      nginx_worker_processes:
        type: str
        default: "auto"
        description: Value for the worker_processes directive.
      nginx_listen_port:
        type: int
        default: 80
        description: TCP port nginx listens on.
      nginx_server_names:
        type: list
        elements: str
        required: true
        description: server_name entries for the default vhost.
      nginx_ssl:
        type: dict
        required: false
        options:
          cert_path: { type: path, required: true }
          key_path: { type: path, required: true }
With that file present, calling the role with nginx_listen_port: "eighty" fails immediately with a type error — validation runs on role entry, with nothing else to wire up. This is the single highest-leverage file in a shared role.
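What the validation step does conceptually can be sketched in a few lines of Python. This is not Ansible's validator, only a simplified model under stated assumptions: it handles three of the types from the spec above, ignores nested options and elements, and the `validate` function and `SPEC` dictionary are names made up for this illustration.

```python
def _as_list(value):
    # Reject anything that is not already a list (a bare string is not a list).
    if not isinstance(value, list):
        raise TypeError("not a list")
    return value


CONVERTERS = {"str": str, "int": int, "list": _as_list}

# Simplified copy of three options from the argument spec above.
SPEC = {
    "nginx_worker_processes": {"type": "str", "default": "auto"},
    "nginx_listen_port": {"type": "int", "default": 80},
    "nginx_server_names": {"type": "list", "required": True},
}


def validate(args, spec=SPEC):
    """Return (resolved_args, errors) for caller-supplied role arguments."""
    resolved, errors = {}, []
    for name, rules in spec.items():
        if name not in args:
            if rules.get("required"):
                errors.append(f"missing required argument: {name}")
            elif "default" in rules:
                resolved[name] = rules["default"]
            continue
        try:
            resolved[name] = CONVERTERS[rules["type"]](args[name])
        except (TypeError, ValueError):
            errors.append(
                f"{name}: cannot convert {args[name]!r} to {rules['type']}"
            )
    return resolved, errors


_, errors = validate(
    {"nginx_listen_port": "eighty", "nginx_server_names": ["example.test"]}
)
print(errors)  # ["nginx_listen_port: cannot convert 'eighty' to int"]
```

The point of the sketch is the ordering: every input is checked and all errors are collected before any task would run, which is why the failure message names the variable instead of a task 12 steps in.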
4. Author custom modules and filter plugins inside the collection
When a task needs real logic — call an API, compute something, enforce a non-trivial idempotent state — write a module, not a 40-line shell block. Modules live in plugins/modules/ and are addressed as namespace.collection.name. The non-negotiables: supports_check_mode set, and changed reported accurately.
#!/usr/bin/python
# plugins/modules/healthcheck.py
from ansible.module_utils.basic import AnsibleModule
from ansible_collections.kloudvin.platform.plugins.module_utils.http import probe


def main():
    module = AnsibleModule(
        argument_spec=dict(
            url=dict(type="str", required=True),
            expected_status=dict(type="int", default=200),
        ),
        supports_check_mode=True,  # mandatory for a testable module
    )
    url = module.params["url"]
    expected = module.params["expected_status"]

    # A health check is a read; it never changes state.
    if module.check_mode:
        module.exit_json(changed=False, msg="check mode: probe skipped")

    status = probe(url)
    if status != expected:
        module.fail_json(msg=f"{url} returned {status}, expected {expected}")
    module.exit_json(changed=False, status=status)


if __name__ == "__main__":
    main()
Shared logic goes in module_utils/ and is imported with the fully qualified ansible_collections.<ns>.<coll>.plugins.module_utils.<mod> path — that exact form is what makes the code resolvable once the collection is installed. Document the module with DOCUMENTATION, EXAMPLES, and RETURN YAML blocks; the sanity tests in section 7 fail without them.
Filter plugins are simpler and keep templates clean. They return a dict of name-to-callable:
# plugins/filter/netmask.py
import ipaddress


def cidr_to_netmask(cidr):
    return str(ipaddress.ip_network(cidr, strict=False).netmask)


class FilterModule(object):
    def filters(self):
        return {"cidr_to_netmask": cidr_to_netmask}
Used in a template or task as {{ '10.0.0.0/24' | kloudvin.platform.cidr_to_netmask }}.
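Because the filter is a plain function, it can be exercised without Ansible at all. The sketch below repeats the function from the plugin so it runs standalone; the sample CIDR values are chosen only for illustration.

```python
import ipaddress


def cidr_to_netmask(cidr):
    # strict=False accepts a host address such as 192.168.1.17/20
    # and derives the netmask from the prefix length.
    return str(ipaddress.ip_network(cidr, strict=False).netmask)


print(cidr_to_netmask("10.0.0.0/24"))      # 255.255.255.0
print(cidr_to_netmask("192.168.1.17/20"))  # 255.255.240.0

# Invalid input raises ValueError, which Ansible would surface as a task failure.
try:
    cidr_to_netmask("not-a-cidr")
    rejected = False
except ValueError:
    rejected = True
```

Checks like these fit naturally in a unit test file next to the plugin, so a regression in the filter fails fast instead of surfacing as a bad rendered template.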
5. Build the Molecule scenario: create, converge, idempotence, verify
Molecule wraps the test lifecycle. A scenario is a directory under roles/<role>/molecule/<scenario>/ with three files. Install the tooling first:
pip install "molecule>=24.0" "molecule-plugins[docker]" ansible-lint pytest-testinfra
The molecule.yml defines the driver, the platforms (the throwaway hosts), and which verifier to run:
# roles/nginx/molecule/default/molecule.yml
role_name_check: 1
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: nginx-ubuntu2204
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
    privileged: true
    cgroupns_mode: host
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
provisioner:
  name: ansible
verifier:
  name: testinfra
converge.yml is the playbook Molecule runs to apply your role:
# roles/nginx/molecule/default/converge.yml
- name: Converge
  hosts: all
  tasks:
    - name: Run the nginx role
      ansible.builtin.include_role:
        name: kloudvin.platform.nginx
      vars:
        nginx_server_names:
          - example.test
Now run the lifecycle. The idempotence step is the heart of the whole exercise:
molecule create # provision the container(s)
molecule converge # apply the role once
molecule idempotence # apply AGAIN; FAILS if any task reports changed
molecule verify # assert final state with the verifier
molecule destroy # tear down
Or run the entire matrix end to end, which is what CI calls:
molecule test # core stages (abridged): destroy -> create -> converge -> idempotence -> verify -> destroy
molecule idempotence works by running converge a second time and parsing the recap: if changed != 0 on any host, it fails the build. This is the mechanical enforcement of section 2. A role that passes converge but fails idempotence is broken, full stop, and now the pipeline says so instead of a reviewer guessing.
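The recap check is simple enough to sketch. The following Python is not Molecule's implementation, only a model of the idea described above; the `hosts_with_changes` function is a hypothetical name, and the sample recap text is illustrative output, not a captured run.

```python
import re

# Matches one host line of a PLAY RECAP and captures its changed= counter.
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=\d+\s+changed=(?P<changed>\d+)",
    re.MULTILINE,
)


def hosts_with_changes(output):
    """Return {host: changed_count} for hosts whose run changed anything."""
    return {
        match.group("host"): int(match.group("changed"))
        for match in RECAP_LINE.finditer(output)
        if int(match.group("changed")) > 0
    }


# Illustrative recap from a second converge; the Rocky host is not idempotent.
SAMPLE = """\
PLAY RECAP *********************************************************************
nginx-ubuntu2204           : ok=12   changed=0    unreachable=0    failed=0
nginx-rocky9               : ok=12   changed=1    unreachable=0    failed=0
"""

offenders = hosts_with_changes(SAMPLE)
print(offenders)  # {'nginx-rocky9': 1}
```

An empty result means the second run was clean on every platform; anything else is a build failure that names the host where idempotence broke.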
The verify stage asserts the outcome, not the steps. With Testinfra you write Python assertions against the live container:
# roles/nginx/molecule/default/tests/test_default.py
def test_nginx_running(host):
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled


def test_listening_on_80(host):
    assert host.socket("tcp://0.0.0.0:80").is_listening


def test_config_is_valid(host):
    assert host.run("nginx -t").rc == 0
If you prefer to stay in YAML, set verifier.name: ansible and write a verify.yml playbook using ansible.builtin.assert. Both are first-class; Testinfra reads more cleanly for service/port/file assertions.
6. Cover multiple distros and swap Docker for Podman
The bug you are hunting is the package that is named nginx on both Debian and RHEL but is pulled from EPEL on RHEL, or the service that is nginx everywhere while the config path differs. Catch it by adding more platforms to the section 5 scenario (note that Rocky’s init path differs):
platforms:
  - name: nginx-ubuntu2204
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
    command: /lib/systemd/systemd
    volumes: ["/sys/fs/cgroup:/sys/fs/cgroup:rw"]
    privileged: true
  - name: nginx-rocky9
    image: geerlingguy/docker-rockylinux9-ansible:latest
    pre_build_image: true
    command: /usr/sbin/init
    volumes: ["/sys/fs/cgroup:/sys/fs/cgroup:rw"]
    privileged: true
Add a Debian 12 entry the same way. molecule test now converges and checks idempotence on all of them. That is where OS-conditional logic earns its keep:
- name: Install nginx (handles per-distro package source)
  ansible.builtin.package:
    name: nginx
    state: present
  # EPEL on EL is set up in an earlier task gated on ansible_os_family.
Switching to Podman for rootless CI runners is a one-line driver change plus the matching plugin (pip install molecule-plugins[podman]):
driver:
  name: podman
The platform definitions are otherwise identical. Podman is the default on modern RHEL-family runners and avoids needing a Docker daemon socket, which matters for least-privilege CI.
Use pre_build_image: true with the geerlingguy/docker-*-ansible images. They ship systemd already working, so service/systemd tasks behave like a real host. Building from a bare ubuntu:22.04 and trying to run systemd inside it is a notorious time sink; don’t.
7. Lint, run sanity tests, and wire up GitHub Actions
Three checks run before merge. ansible-lint enforces idempotency-adjacent rules (no bare command, no latest package versions, FQCN usage) and is configurable:
# .ansible-lint
profile: production # the strictest built-in profile
exclude_paths:
  - molecule/
  - tests/output/

ansible-lint # lint roles and playbooks
The collection sanity tests are Ansible’s own structural checks on your plugins — valid DOCUMENTATION, no Python 2 leftovers, correct argument specs. Run them from inside the installed collection tree:
ansible-test sanity --docker -v
Genuine, justified exceptions are recorded per Ansible version in tests/sanity/ignore-<version>.txt; an empty or absent file means zero ignored failures, which is the goal.
Now make all of it a required check. This workflow lints and executes the full Molecule matrix; the sanity run can be added as a further job in the same way:
# .github/workflows/ci.yml
name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ansible-core ansible-lint
      - run: ansible-lint
  molecule:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        scenario: [default]
    steps:
      - name: Check out into the collection path
        uses: actions/checkout@v4
        with:
          path: ansible_collections/kloudvin/platform
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install "molecule>=24.0" molecule-plugins[docker] pytest-testinfra ansible-core
      - name: Run Molecule
        working-directory: ansible_collections/kloudvin/platform/roles/nginx
        run: molecule test -s ${{ matrix.scenario }}
The non-obvious bit: check the repo out into ansible_collections/<namespace>/<name>/. Ansible resolves collections by that path layout, so both ansible-test and any kloudvin.platform.* reference in converge.yml only work when the working tree sits there. Skip this and you get cryptic “collection not found” errors in CI for a role that passes locally.
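A cheap preflight check can catch a wrong checkout path before Molecule runs. The sketch below is a hypothetical helper, not part of Ansible or Molecule: it only inspects a path string for the ansible_collections/<namespace>/<name> layout, and the runner paths in the example are made up.

```python
from pathlib import PurePosixPath


def collection_name(path):
    """Return 'namespace.name' if path sits at or below
    ansible_collections/<namespace>/<name>, otherwise None."""
    parts = PurePosixPath(path).parts
    if "ansible_collections" not in parts:
        return None
    idx = parts.index("ansible_collections")
    # Both the namespace and the collection directory must follow.
    if len(parts) < idx + 3:
        return None
    return f"{parts[idx + 1]}.{parts[idx + 2]}"


# Hypothetical CI working directories.
good = "/home/runner/work/repo/ansible_collections/kloudvin/platform/roles/nginx"
bad = "/home/runner/work/repo/roles/nginx"

print(collection_name(good))  # kloudvin.platform
print(collection_name(bad))   # None
```

Running a check like this as the first CI step turns the cryptic lookup failure into an explicit message about the checkout path.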
8. Publish to Galaxy / Automation Hub with semantic versioning
The version in galaxy.yml is a promise to consumers, so follow semver strictly:
- MAJOR (a breaking change): you removed a role variable, renamed a module, raised requires_ansible, or changed default behavior. Consumers must read release notes.
- MINOR (additive and backward compatible): a new role, a new optional variable with a safe default, a new module.
- PATCH: a bug fix that changes no interface.
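The three rules map directly onto a version bump. Here is a minimal Python sketch, assuming plain MAJOR.MINOR.PATCH strings with no pre-release suffix; the `bump` function and the change labels "breaking", "feature", and "fix" are names chosen for this example.

```python
def bump(version, change):
    """Return the next version for a change of kind breaking, feature, or fix."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"           # MAJOR: reset minor and patch
    if change == "feature":
        return f"{major}.{minor + 1}.0"     # MINOR: reset patch
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"  # PATCH
    raise ValueError(f"unknown change type: {change!r}")


print(bump("1.4.0", "fix"))       # 1.4.1
print(bump("1.4.0", "feature"))   # 1.5.0
print(bump("1.4.0", "breaking"))  # 2.0.0
```

Encoding the decision this way forces whoever cuts the release to classify the change explicitly, which is where most semver mistakes happen.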
Build, then publish. To public Galaxy you need an API token from your Galaxy profile:
ansible-galaxy collection build
ansible-galaxy collection publish kloudvin-platform-1.4.0.tar.gz --api-key "$GALAXY_TOKEN"
For private Automation Hub (or any pulp/galaxy_ng server), point at its API and token in ansible.cfg rather than passing flags around:
# ansible.cfg
[galaxy]
server_list = automation_hub
[galaxy_server.automation_hub]
url = https://hub.internal.kloudvin.com/api/galaxy/content/published/
token = <hub-token>
Then ansible-galaxy collection publish kloudvin-platform-1.4.0.tar.gz resolves the server from config. Consumers pin you in their requirements.yml and get reproducible installs:
# requirements.yml
collections:
  - name: kloudvin.platform
    version: ">=1.4.0,<2.0.0"
ansible-galaxy collection install -r requirements.yml
The version range with a major-version ceiling means consumers automatically receive your bug fixes and new roles but never a breaking change without an explicit bump — which is the entire point of publishing with semver instead of telling people to track main.
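How a range such as ">=1.4.0,<2.0.0" admits fixes and features while excluding the next major can be shown with a short Python sketch. This is a simplified model, not ansible-galaxy's resolver: it assumes plain numeric MAJOR.MINOR.PATCH versions and comma-separated clauses, and the `satisfies` function is a name invented here.

```python
import operator

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def parse(version):
    """Turn '1.4.0' into (1, 4, 0) so tuples compare numerically."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies(version, spec):
    """Return True if version meets every comma-separated clause in spec."""
    candidate = parse(version)
    for clause in spec.split(","):
        clause = clause.strip()
        # Try two-character operators first so ">=" is not read as ">".
        for symbol in (">=", "<=", "==", "!=", ">", "<"):
            if clause.startswith(symbol):
                if not OPS[symbol](candidate, parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


PIN = ">=1.4.0,<2.0.0"
for released in ("1.3.9", "1.4.0", "1.4.1", "1.9.3", "2.0.0"):
    print(released, satisfies(released, PIN))
```

Only 1.4.0, 1.4.1, and 1.9.3 pass: patch and minor releases flow through automatically, while 2.0.0 requires the consumer to edit the pin.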
Verify
Before you tag a release, run the four gates end to end:
ansible-lint # 1. production profile clean
ansible-test sanity --docker # 2. no undocumented ignores
(cd roles/nginx && molecule test) # 3. converge + idempotence + verify, all platforms (subshell keeps step 4 in the collection root)
ansible-galaxy collection build # 4. artifact builds cleanly
The decisive signal is step 3. A clean molecule test means every platform provisioned, the role converged, a second converge reported zero changes (proven idempotence), and the verifier asserted the real end state. Green across Ubuntu, Rocky, and Debian means the role is safe to publish. As a final smoke test, install the built tarball into a scratch path (ansible-galaxy collection install kloudvin-platform-*.tar.gz -p /tmp/verify --force) and confirm a deliberately bad variable value (nginx_listen_port: "eighty") is rejected with a type error from the argument spec.