Tuning Ansible for Speed & Scale, In Depth: Pipelining, Forks, Fact Caching, Async & Mitogen

The first time you run a tidy ten-task playbook against three lab machines, Ansible feels instant. The first time you run that same playbook against three hundred production machines, it feels like watching paint dry — and the reason is almost never your tasks. It is the plumbing: how many hosts Ansible talks to at once, how many times it opens an SSH connection, how many forks-and-execs each module costs, and how much time every single play burns up front gathering facts you may not even use. Ansible’s defaults are deliberately conservative and beginner-safe; they are emphatically not tuned for scale. The good news is that the same engine that crawls with stock settings will fly once you pull the right levers, and none of those levers require rewriting a line of your roles.

This lesson is the exhaustive tour of those levers. We start with forks — the parallelism dial that decides how many hosts Ansible drives simultaneously — and how to size it sanely. We then go deep on the single biggest free win in Ansible: SSH connection reuse via OpenSSH’s ControlMaster/ControlPersist multiplexing, and pipelining, which collapses the several round-trips each task normally makes into roughly one (plus the requiretty caveat that bites people who turn it on blind). We cover fact gathering end to end — implicit versus explicit, turning it off, gather_subset, gather_timeout — and then fact caching with the jsonfile, redis, and memcached plugins so a fleet gathers facts once and reuses them for hours. We cover async + poll for long-running tasks and the fire-and-forget poll: 0 + ansible.builtin.async_status pattern; the free strategy for letting fast hosts race ahead; cutting work with loops and loop-level when; profiling with the profile_tasks and timer callbacks so you measure instead of guess; the Mitogen strategy plugin that can halve wall-clock time; and the connection plugins (ssh, paramiko, smart) underneath it all. Everything targets current Ansible (ansible-core 2.17+ / Ansible 10+, 2026) and uses FQCN — ansible.builtin.async_status, ansible.builtin.setup — throughout. We finish with a free, local before/after benchmark so you can see the numbers move.

Learning objectives

After working through this lesson you will be able to:

Size and set forks for your control node and fleet, and explain how forks interacts with serial and the chosen strategy.
Configure SSH multiplexing (ControlMaster, ControlPersist, ControlPath) and pipelining, explain exactly what each saves, and handle the requiretty caveat that breaks pipelining on hardened hosts.
Decide when facts are gathered — implicit vs explicit, gather_facts: false, gather_subset, gather_timeout — and stop paying for facts you do not use.
Stand up fact caching with the jsonfile, redis, or memcached cache plugins (and fact_caching_timeout) so a fleet gathers facts once and reuses them.
Run long tasks with async + poll, and use the fire-and-forget poll: 0 pattern with ansible.builtin.async_status to start work on every host and collect results later.
Choose a strategy (linear, free, host_pinned, debug) and know when free helps and when it hurts.
Profile a run with the profile_tasks, profile_roles, and timer callbacks (via ANSIBLE_CALLBACKS_ENABLED) to find the real bottleneck before tuning.
Install and enable the Mitogen strategy plugin, understand what it changes, and know its compatibility limits.
Pick the right connection plugin (ssh, paramiko, smart, local) for the job.

Prerequisites & where this fits

You should already be fluent with the run-time machinery this lesson tunes: playbooks made of plays and tasks, facts (the ansible.builtin.setup module, ansible_facts, custom facts), register for capturing results, and inventory with group_vars/host_vars. The previous lesson, Ansible Delegation, Strategies & Rolling Updates, In Depth, introduced serial, throttle, and the linear/free/host_pinned strategies in the context of control; this lesson revisits forks and free from the angle of speed and adds the connection-layer levers that make the biggest difference at scale. You will also lean on what you learned about variables and facts in Ansible Variables & Facts, In Depth, because fact gathering and caching are the same ansible_facts you already use, just timed and stored differently. This is the Execution module of the Advanced tier of the Ansible Zero-to-Hero ladder. The material maps to the RHCE (EX294) performance objectives — pipelining, forks, fact caching, and async are exactly the production-readiness skills the exam expects. Everything you need is ansible-core plus a couple of local containers or VMs; the lab runs for free.

Core concepts

Three ideas explain why Ansible is slow by default, and every lever in this lesson follows from them. Fix these in your head first.

Ansible’s work is dominated by connection overhead, not task logic. Most modules do very little real work — install a package, template a file, restart a service — but getting to the point of doing that work is expensive. For each task, on each host, vanilla Ansible: opens (or reuses) an SSH connection, creates a temporary directory on the target, copies the module (a Python file) into it, executes it with the right interpreter, captures the JSON it prints on stdout, and cleans up. The round-trips across the network — not the package install — are where the seconds go. Every performance lever in Ansible is ultimately about doing fewer round-trips, reusing connections, or doing more hosts at once. Hold that and the whole lesson coheres.

Ansible is push-based and serial-per-host by default, parallel-across-hosts by forks. Within a single host, tasks run top to bottom, one at a time (that is what makes a playbook readable and predictable). Across hosts, Ansible runs the same task on up to forks hosts simultaneously, then moves to the next task — that is the default linear strategy. So your two scaling dimensions are: how many hosts run in parallel (forks), and how cheap each host’s per-task overhead is (connection reuse, pipelining). The strategy decides how the per-host streams are scheduled relative to one another (linear keeps them in lockstep; free lets each host sprint).

The control node is the bottleneck you forget about. Every fork is a process on your control machine, and each one runs Python, holds an SSH connection, and uses memory and a file descriptor. Setting forks = 500 on a 2-vCPU laptop does not make Ansible 100× faster than forks = 5; it makes the control node thrash. Sizing forks is therefore a control-node capacity question, not a “bigger is better” knob. We will size it concretely below.

A vocabulary note you will see throughout: a connection plugin is the transport Ansible uses to reach a host (ssh, paramiko, local, winrm, …); a strategy plugin decides task scheduling across hosts (linear, free, host_pinned, debug, and the third-party mitogen_linear); a callback plugin reacts to run events and is how profiling output is produced. All three are plugins that run on the control node, and all three are tuning surfaces.

Forks: the parallelism dial

forks is the maximum number of hosts Ansible communicates with at the same time. It defaults to 5 — meaning even if your play targets 200 hosts, Ansible drives them 5 at a time for each task under the default linear strategy. This single setting is the most common reason a large run feels slow, and the easiest to fix.
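To put numbers on "5 at a time", here is a minimal Python sketch of the wave arithmetic. The helper names and the 2-second task duration are illustrative assumptions, and the model pretends every host takes the same time, which real fleets do not.

```python
import math


def task_waves(hosts, forks):
    """Number of forks-sized waves needed to run one task on every host."""
    return math.ceil(hosts / forks)


def task_seconds(hosts, forks, seconds_per_host):
    """Rough lower bound for one task when every host takes the same time."""
    return task_waves(hosts, forks) * seconds_per_host


# 200 hosts and a task that takes 2 s per host (illustrative number).
print(task_waves(200, 5), task_seconds(200, 5, 2))    # 40 waves, 80 seconds
print(task_waves(200, 50), task_seconds(200, 50, 2))  # 4 waves, 8 seconds
```

The same task drops from 40 waves to 4 when forks goes from 5 to 50, which is why this one setting dominates on large inventories.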

You set it in three places (later overrides earlier):

Where	How	Scope
`ansible.cfg`	`forks = 50` under `[defaults]`	Project/global default
Environment	`ANSIBLE_FORKS=50 ansible-playbook …`	Per-invocation
Command line	`ansible-playbook -f 50 site.yml` (also `--forks`)	Per-invocation, wins

How forks interacts with the strategy. Under linear (the default), Ansible runs task N across all targeted hosts, at most forks at a time, waits until every host has finished it, and only then starts task N+1. The play advances task by task, and the slowest host on each task sets the pace. Under free, forks still caps concurrency, but hosts no longer wait for each other between tasks; a fast host can be ten tasks ahead of a slow one. So forks is the concurrency ceiling regardless of strategy; the strategy decides whether hosts move in lockstep beneath that ceiling.

How forks interacts with serial. serial (rolling-update batch size, covered in the previous lesson) is a different cap. serial: 10 means Ansible runs the whole play against 10 hosts, finishes, then the next 10. forks then governs concurrency within that batch of 10. The effective parallelism is min(forks, serial, hosts-remaining). If serial: 10 and forks: 50, you get at most 10 hosts at once (the batch is the binding limit); if serial: 100 and forks: 25, you get 25 at once. A classic mistake is bumping forks to 100 for a rolling update and seeing no change because serial is pinning you to 10.
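The min() rule can be checked by hand. A minimal Python sketch follows; effective_parallelism is a hypothetical helper name, not an Ansible API, and serial is assumed to be a plain integer batch size (or None when unset).

```python
def effective_parallelism(forks, serial, hosts_remaining):
    """Hosts driven at once: min(forks, serial, hosts-remaining).

    Hypothetical helper for illustration; serial=None means no serial is set.
    """
    limits = [forks, hosts_remaining]
    if serial is not None:
        limits.append(serial)
    return min(limits)


# The two cases from the text, with 200 hosts still to process.
print(effective_parallelism(forks=50, serial=10, hosts_remaining=200))   # 10
print(effective_parallelism(forks=25, serial=100, hosts_remaining=200))  # 25
```

In the first case raising forks changes nothing, because serial is the binding limit.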

Sizing forks. There is no universal number; size it from the control node’s capacity, because each fork is a Python process holding an SSH connection.

Control node	Sensible starting `forks`	Reasoning
Laptop / 2 vCPU, 8 GB	10–25	Modest CPU; SSH + Python per fork add up
CI runner / 4 vCPU, 16 GB	25–50	Common sweet spot for mid fleets
Dedicated control / 8–16 vCPU, 32 GB+	50–100+	Can drive hundreds of hosts in waves
AWX/AAP execution node	tune per node + `forks` in job template	Container resources cap it

Rules of thumb: each fork costs roughly tens of MB of RAM and one file descriptor pair; CPU matters because the control node runs Jinja2 templating, JSON parsing, and connection setup for every fork. Watch top/htop on the control node during a big run — if it pegs CPU or starts swapping, you set forks too high. Also raise the OS open-files limit (ulimit -n) before pushing forks into the hundreds, or you will hit “too many open files”. Finally, more forks only helps if you actually have many hosts and your tasks are not serialised by a downstream bottleneck (a single shared database, an artifact server, a delegate_to chokepoint).
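The rules of thumb translate into a quick budget check before you raise forks. This is a rough Python sketch, not a measurement: the 50 MB per fork is an assumed stand-in for "tens of MB", the 2 descriptors per fork stand in for "one file descriptor pair", and both helper names are made up for illustration.

```python
def fork_budget(forks, mb_per_fork=50, fds_per_fork=2):
    """Rough RAM (MB) and file-descriptor needs for a given forks value.

    The per-fork costs are illustrative assumptions; measure your own.
    """
    return forks * mb_per_fork, forks * fds_per_fork


def open_files_soft_limit():
    """Soft open-files limit of this process (what `ulimit -n` shows); Unix only."""
    import resource
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    forks = 200
    ram_mb, fds = fork_budget(forks)
    limit = open_files_soft_limit()
    print(f"forks={forks}: about {ram_mb} MB RAM and {fds} descriptors")
    if limit is not None and fds > limit:
        print(f"open-files limit is {limit}: raise ulimit -n before using forks={forks}")
```

Treat the output as a sanity check only; top/htop during a real run remains the authoritative signal.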

SSH connection reuse: ControlMaster & ControlPersist (multiplexing)

Here is the biggest, cheapest win at scale, and it is pure OpenSSH. Every TCP+SSH handshake — key exchange, authentication, channel setup — costs real milliseconds, often 100–500 ms each. A playbook with 30 tasks against a host, without connection reuse, can pay that handshake dozens of times on that one host. Multiplexing opens the SSH connection once and reuses it for every subsequent task.

OpenSSH implements this with three options that Ansible’s ssh connection plugin sets for you:

OpenSSH option	What it does	Ansible’s effective default
`ControlMaster`	Lets the first connection become a master that later sessions reuse over one TCP connection	`auto`
`ControlPersist`	Keeps the master connection open in the background for N seconds after the last session closes, ready to be reused	`60s`
`ControlPath`	Filesystem path to the control socket that identifies a reusable connection (per user@host:port)	a short hash of host, port, and user under `control_path_dir` (`~/.ansible/cp`)

You configure these through Ansible, not by hand-editing ~/.ssh/config, via ssh_args in ansible.cfg:

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path_dir = ~/.ansible/cp

What each value means in practice:

ControlMaster=auto: the first SSH connection to a host opens a master; every later task to the same host rides that one socket. auto is the right choice (it also tolerates a missing master gracefully). The alternatives are yes (force master), no (never multiplex), and ask.
ControlPersist=60s: after the last task on a host finishes, OpenSSH keeps the master alive for 60 seconds. The benefit: if you re-run the playbook (or a handler fires later), the warm connection is reused with zero handshake. Bump it (ControlPersist=300s or 30m) if you run the same playbook repeatedly during development. Setting it too high just leaves idle sockets around. Note that in OpenSSH ControlPersist=0 (like yes) keeps the master open indefinitely rather than disabling it; ControlPersist=no is the value that turns persistence off, and then the master closes as soon as its first session ends, so later tasks get little or no reuse.
ControlPath: the socket path. The classic gotcha is length: a Unix socket path on Linux is limited to 108 bytes. Current Ansible names the socket with a short hash of host, port, and user, but a custom control_path built from %h (host), %p (port), and %r (remote user), or a deep control_path_dir, can blow past the limit with long hostnames. You then see “unix_listener: path too long”, the master socket cannot be created, and you lose the reuse you were counting on. The fix is a short control_path_dir (e.g. ~/.ansible/cp) so the full socket path stays under the limit.
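A quick way to see whether a given socket path is at risk is to measure it against the 108-byte limit. The Python sketch below is illustrative only: the helper names are hypothetical, the example paths are made up, and the headroom parameter is an assumed knob for any extra bytes your SSH client needs while it creates the socket.

```python
import os

LINUX_SUN_PATH_BYTES = 108  # size of sun_path on Linux, including the trailing NUL


def control_socket_path(control_dir, socket_name):
    """Join the control directory (with ~ expanded) and the socket file name."""
    return os.path.join(os.path.expanduser(control_dir), socket_name)


def fits_socket_limit(path, limit=LINUX_SUN_PATH_BYTES, headroom=0):
    """True if path plus headroom bytes fits in limit with room for the NUL."""
    return len(os.fsencode(path)) + headroom < limit


# A short hashed name versus a long host-based name (both paths are made up).
short_path = "/home/user/.ansible/cp/" + "a" * 10
long_path = "/home/user/.ansible/cp/ansible-ssh-" + "h" * 90 + "-22-user"
print(len(short_path), fits_socket_limit(short_path))
print(len(long_path), fits_socket_limit(long_path))
```

Run it against control_socket_path(your control_path_dir, your longest expected socket name) to see how much margin you have.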

How to confirm it is working. Run with -vvvv and look for ControlMaster / auto / mux lines, or check for live sockets while a play runs:

ls -la ~/.ansible/cp/
# entries like:  <hash>  -> a live control socket = multiplexing is on

The payoff: connection reuse alone can cut a multi-task run’s wall-clock time by half or more on high-latency links, because you stop paying the handshake on every task. It is on by default — your job is mainly to keep ControlPath short and tune ControlPersist to your workflow.

Pipelining: the single biggest per-task win

Multiplexing reuses the connection; pipelining reduces the number of operations per task over that connection. This is the lever that surprises people with how much it helps.

What a task costs without pipelining. For each task, Ansible normally: (1) creates a temporary directory on the target via an SSH call, (2) copies the module file into it via SFTP/SCP (another transfer), (3) executes the module via SSH, then (4) removes the temp dir. That is several round-trips per task per host.

What pipelining does. With pipelining enabled, Ansible pipes the module’s Python code straight into the interpreter’s stdin over the already-open SSH session, executing it without first writing the module to a temp file on disk. It collapses those several round-trips into roughly one execution call. Combined with multiplexing, the per-task overhead drops dramatically — frequently a 2× or better speedup on connection-heavy playbooks, and the more tasks/hosts you have, the bigger the absolute saving.
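The saving is easiest to see as arithmetic. The Python sketch below uses a deliberately crude model in which each of the four steps listed above costs one network round trip; the 50 ms round-trip time and the 30-task playbook are illustrative assumptions, not measurements.

```python
def per_host_overhead_ms(tasks, operations_per_task, round_trip_ms):
    """Connection overhead for one host: tasks x operations x round-trip time."""
    return tasks * operations_per_task * round_trip_ms


tasks, rtt_ms = 30, 50  # illustrative playbook size and latency

without = per_host_overhead_ms(tasks, 4, rtt_ms)          # mkdir, copy, execute, cleanup
with_pipelining = per_host_overhead_ms(tasks, 1, rtt_ms)  # roughly one execution call

print(without, with_pipelining)  # 6000 1500
```

Under these assumptions the per-host overhead falls from 6 seconds to 1.5 seconds, and the gap widens with latency and task count.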

Enable it in ansible.cfg:

[ssh_connection]
pipelining = True

Or per-invocation: ANSIBLE_PIPELINING=True ansible-playbook site.yml.

The requiretty caveat — read this before you enable it. Pipelining feeds the module to Python via stdin without allocating a pseudo-terminal (no -tt). On targets where sudo is configured with requiretty in /etc/sudoers (or a sudoers.d drop-in), commands run through sudo demand a TTY and will fail when pipelining is on, with errors like “sudo: sorry, you must have a tty to run sudo”. This is the one thing that breaks people who flip pipelining on across a fleet without checking. Two resolutions:

Disable requiretty on the targets (preferred for managed fleets). Remove Defaults requiretty, or scope it: Defaults:ansible_user !requiretty, where ansible_user stands for the literal account name Ansible logs in as. Modern RHEL/Ubuntu do not ship requiretty on by default, so most current systems are fine, but older or hardened images often do.
If you cannot touch sudoers, leave pipelining off for those hosts (you still keep the multiplexing win).

A second, smaller caveat: pipelining requires that the remote sudo/become can read from stdin, which the default sudo become plugin handles; some exotic become methods do not pipeline. In practice, on a modern fleet, pipelining = True is the first setting you should add to ansible.cfg — verify requiretty is off, then enjoy the win.

Lever	What it reuses/saves	Default	The catch
`ControlMaster`/`ControlPersist`	Reuses one SSH connection across tasks (and runs)	on (`auto`, 60s)	`ControlPath` 108-char limit → keep dir short
`pipelining`	Removes the per-task temp-file copy; ~1 round-trip/task	off	breaks under sudoers `requiretty`
`forks`	More hosts in parallel	5	bounded by control-node CPU/RAM/fds

Fact gathering: stop paying for facts you do not use

Before a play’s first task, Ansible runs an implicit ansible.builtin.setup against every host to collect facts — OS family, network interfaces, memory, mounts, hardware, and more. On a single host that costs a fraction of a second. Across a large fleet, or on hosts where setup is slow (lots of mounts, slow lsblk, network probes), fact gathering can be a meaningful slice of total run time — and you pay it every play, every run, whether or not your tasks read a single fact.

The levers:

Lever	Where	Effect
`gather_facts: false`	Play keyword	Skip the implicit setup entirely for this play
`gathering = smart` / `explicit` / `implicit`	`ansible.cfg` `[defaults]`	Global policy for when facts are gathered
`gather_subset:`	Play keyword, `module_defaults`, or the `setup` task	Collect only some fact subsets
`gather_timeout:`	Same	Per-fact-collection timeout (default 10s)
`fact_caching` (+ timeout)	`ansible.cfg`	Reuse facts across runs (next section)

gathering policy — set in ansible.cfg:

implicit (the default): gather facts at the start of every play unless gather_facts: false.
explicit: never gather automatically; you must add a gather_facts: true play keyword or an explicit ansible.builtin.setup task. Good for fleets where most plays do not need facts.
smart (recommended): gather facts for a host only if they are not already cached. Combined with fact caching (below), smart means the first run gathers and caches; later runs within the cache window skip gathering entirely. This is the setting you want for production.

Turning it off per play. If a play only runs a couple of commands and never touches ansible_facts, set gather_facts: false and save the entire setup cost:

- name: Quick service bounce (no facts needed)
  hosts: web
  gather_facts: false
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

If you discover mid-play that you do need a fact, gather on demand:

    - name: Gather just what I need
      ansible.builtin.setup:
        gather_subset:
          - "!all"
          - "!min"
          - network

gather_subset — collect only what you use. The setup module organises facts into subsets. You pass a list; prefix with ! to exclude. The meta-subsets are all (everything), min (a small mandatory core, always included unless you say !min), and the individual categories below.

Subset	Covers (examples)
`min`	`ansible_fqdn`, `ansible_distribution`, basic identity (cheap, near-always wanted)
`hardware`	CPU, memory, devices, mounts — often the slowest (probes `lsblk`, `/proc`)
`network`	Interfaces, IPs, default route
`virtual`	Hypervisor/container detection
`facter` / `ohai`	Pull facts from Puppet’s facter / Chef’s ohai if installed
`all`	Everything (the default if you specify nothing)

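The big lever is excluding hardware: gather_subset: ["!hardware"] (or the tighter ["!all", "!min", "network"]) can noticeably speed gathering on fleets where you only need OS family and IPs. module_defaults is scoped to a play (or block), not global, so set it on each play whose gathering you want trimmed; implicit gathering and explicit setup tasks in that play both pick it up: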

- hosts: all
  module_defaults:
    ansible.builtin.setup:
      gather_subset:
        - "!hardware"

gather_timeout is the per-collection timeout, default 10 seconds. Raise it (gather_timeout: 30) when a host has many disks/mounts, hardware facts legitimately take longer than 10s, and you see “timed out waiting for … facts”; otherwise leave it. It is set on the setup module or as a play keyword; the older DEFAULT_GATHER_TIMEOUT configuration setting is deprecated in favour of module_defaults.

Fact caching: gather once, reuse for hours

Skipping facts is great when you do not need them. Fact caching is for when you do need them but do not want to re-gather on every run. With caching on, the first run gathers facts and persists them; subsequent runs (within the cache’s lifetime) read facts from the cache instead of touching the host — and with gathering = smart, gathering is skipped entirely. On a 300-host fleet that you converge every 15 minutes, this turns “re-probe 300 hosts every time” into “probe once an hour.”

Caching is provided by cache plugins, selected with fact_caching in ansible.cfg (or ANSIBLE_CACHE_PLUGIN):

Cache plugin	Backend	Best for	Key settings
`memory`	Process RAM (default)	Single run only — not persisted across runs	none
`jsonfile`	JSON files on the control node	Simple, single-controller, no extra services	`fact_caching_connection` = directory
`yaml`	YAML files on disk	Human-readable on-disk cache	`fact_caching_connection` = directory
`redis`	Redis server	Shared cache across many controllers/AWX nodes	`fact_caching_connection` = `host:port:db` (+ keyprefix)
`memcached`	memcached server	Shared, fast, ephemeral cross-controller cache	`fact_caching_connection` = `host:port`
`pickle`	Pickled files on disk	On-disk binary cache	`fact_caching_connection` = directory

The default is memory, which caches facts only for the duration of a single ansible-playbook run — useful so a second play in the same run reuses the first play’s facts, but gone the moment the process exits. To get cross-run reuse you must choose a persistent plugin.

jsonfile — the zero-dependency choice:

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 7200

That writes one JSON file per host under the directory and treats cached facts as valid for 7200 seconds (2 hours); after that they are considered stale and re-gathered. fact_caching_timeout = 0 means never expire (cache forever until you delete it) — handy but dangerous, because a host that changes IPs will keep reporting the old one until you flush. There is no automatic invalidation on host change; the timeout is your only freshness guarantee, so pick it to match how often your fleet legitimately changes.
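Since the timeout is the only freshness guarantee, it helps to see which cached hosts are close to or past it. The Python sketch below is a hedged illustration: it assumes freshness can be judged from each cache file's modification time and that file names correspond to hosts, and is_stale and cache_report are hypothetical helpers, not part of Ansible.

```python
import os
import time


def is_stale(age_seconds, timeout):
    """Apply the timeout rule from the text: 0 means never expire."""
    if timeout == 0:
        return False
    return age_seconds > timeout


def cache_report(cache_dir, timeout, now=None):
    """Map each cache file name to (age_seconds, stale), judged by file mtime."""
    now = time.time() if now is None else now
    report = {}
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if os.path.isfile(path):
            age = now - os.path.getmtime(path)
            report[entry] = (age, is_stale(age, timeout))
    return report
```

Calling cache_report("/tmp/ansible_fact_cache", 7200) against the configuration above would list every cached host with its age and whether the next run should be expected to re-gather it.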

redis — the shared choice for multiple controllers or AWX/AAP:

[defaults]
gathering = smart
fact_caching = redis
fact_caching_connection = 127.0.0.1:6379:0
fact_caching_timeout = 3600
fact_caching_prefix = ansible_facts_

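Now every controller (or every AWX execution node) reads and writes the same fact cache, so a host gathered by one controller is instantly available to the next. memcached works the same way with host:port. Both need the matching Python client (redis / python-memcached) installed on the controller. In current Ansible only memory and jsonfile ship in ansible.builtin; redis, memcached, yaml, and pickle live in the community.general collection (bundled with the full ansible package, installable on top of ansible-core with ansible-galaxy collection install community.general), so the FQCN form is fact_caching = community.general.redis.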

A subtle but important benefit: hostvars across the fleet without gathering. Because cached facts live outside any single play, a play that targets web can read hostvars['db01'].ansible_default_ipv4.address even though this run never connected to db01 — as long as db01’s facts are in the cache from an earlier run. That makes cross-host templating (load-balancer configs, /etc/hosts generation) both possible and fast. This is the same cacheable: true mechanism you saw with ansible.builtin.set_fact: cacheable set_facts are persisted into the same store.

A security note carried from the variables lesson: a jsonfile/yaml cache is plaintext on the control node. If facts ever include anything sensitive, protect the cache directory and set a sane timeout — see Security notes below.

async & poll: long-running and fire-and-forget tasks

By default Ansible runs a task synchronously: it starts the module on the host and blocks, holding the connection open, until the module returns. Two problems follow. First, a genuinely long task (a 20-minute package compile, a big database dump) can exceed the SSH/become timeout and the connection dies. Second, while one host runs a 10-minute task, that fork is tied up and cannot do useful work elsewhere.

async + poll solve both. async: N tells Ansible “this task may run up to N seconds; start it in the background on the target and let me check on it.” poll: M tells Ansible “check whether it finished every M seconds.”

- name: Long database backup (up to 30 min), checked every 15s
  ansible.builtin.command: /usr/local/bin/full_backup.sh
  async: 1800      # max runtime in seconds
  poll: 15         # check every 15s; Ansible blocks here until done or timeout

With poll > 0, Ansible still waits for the task on that host — but it survives long runtimes (no connection timeout, because it polls a status file rather than holding the original exec open) and you get a clean failure if it exceeds async. Use this for tasks that are long but that you need to complete before the next task.

The fire-and-forget pattern: poll: 0. Set poll: 0 and Ansible starts the task on every host and immediately moves on — it does not wait. This is the key to parallelising slow, independent work across a fleet: kick the long job off on all hosts at once, do other things, then come back and collect results with ansible.builtin.async_status using the job id the task registered.

- name: Kick off a long upgrade on every host, do NOT wait
  ansible.builtin.command: /usr/local/bin/upgrade.sh
  async: 3600        # allow up to 1 hour
  poll: 0            # fire and forget — returns immediately with a job id
  register: upgrade_job

# ... do other useful work here while upgrades run in the background ...

- name: Wait for all the upgrades to finish
  ansible.builtin.async_status:
    jid: "{{ upgrade_job.ansible_job_id }}"
  register: upgrade_result
  until: upgrade_result.finished      # poll until the job reports finished
  retries: 60                         # up to 60 attempts ...
  delay: 60                           # ... 60s apart = wait up to 1 hour

async_status reports finished (1/0), failed, rc, stdout, etc. — exactly as if the task had run synchronously, but you collected it on your schedule. The until/retries/delay loop is how you wait for completion.
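When you pick retries and delay, check that the waiting budget covers the async limit, or the wait loop gives up while the job is still running. A small Python sketch of that arithmetic follows; the helper names are hypothetical and the model ignores the time each status check itself takes.

```python
def wait_budget_seconds(retries, delay):
    """Approximate time the until loop is willing to wait: retries x delay."""
    return retries * delay


def covers_async(async_seconds, retries, delay):
    """True if the wait loop can outlast the task's async limit."""
    return wait_budget_seconds(retries, delay) >= async_seconds


print(wait_budget_seconds(60, 60))  # 3600
print(covers_async(3600, 60, 60))   # True
print(covers_async(3600, 30, 60))   # False
```

The third case is the common mistake: async: 3600 with retries: 30 and delay: 60 stops waiting after about half an hour.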

The poll: 0 cleanup gotcha. When poll: 0 is used, Ansible does not automatically clean up the async job’s cache (status) file on the target, because it cannot know when you are done with it. Waiting on the job with async_status does not remove it either; once you have collected the result, call ansible.builtin.async_status again with the same jid and mode: cleanup. For truly one-shot fire-and-forget tasks (kick something off and genuinely walk away), set async to a value and never call async_status, accepting that the status file stays behind. Two more notes: not every action can run under async (action plugins that do their work on the control node, such as ansible.builtin.copy and ansible.builtin.template, do not support it, while command, shell, and most long-running modules do); and with poll: 0 nothing waits on the job, so if a host reboots or becomes unreachable before you collect the result you may never see it, which is why the pattern suits idempotent, restartable work.

Mode	Setting	Behaviour	Use for
Synchronous	(default)	Block until task returns; hold connection	Normal short tasks
Async, polled	`async: N`, `poll: M>0`	Background on host; poll status every M s; wait	Long tasks you must finish before continuing
Fire-and-forget	`async: N`, `poll: 0`	Start on all hosts, return immediately; collect later via `async_status`	Slow independent work parallelised across the fleet

The free strategy (and why default is linear)

The strategy decides how per-host task streams are scheduled. The default, linear, runs each task on all (up-to-forks) hosts and waits for every host to finish that task before starting the next — the play advances in lockstep, and the slowest host in each step sets the pace. That predictability is great for rolling updates and ordered changes, but it means one sluggish host stalls everyone.

The free strategy removes the per-task barrier: each host races through the play as fast as it individually can, never waiting for others. A fast host may be at task 20 while a slow one is still at task 5. On a heterogeneous fleet — mixed hardware, varying latency, some hosts with more work — free can dramatically cut total wall-clock time because no host idles waiting for a laggard.

- hosts: all
  strategy: free
  tasks:
    - ...

Or globally: [defaults] strategy = free.

When free hurts. Because hosts are out of step, free breaks anything that assumes order across hosts: you cannot rely on host A finishing task 3 before host B starts task 4. Handlers still flush at the end of the play per host, but cross-host coordination (e.g. “configure all DB replicas, then promote one”) is unsafe under free. And run_once/serial semantics are designed around linear. Rule: use free for independent, parallel-safe work where speed matters; keep linear for ordered or coordinated rollouts.

Strategy	Scheduling	Best for	Avoid when
`linear` (default)	Lockstep: all hosts finish task N before task N+1	Ordered changes, rolling updates, debugging	A few slow hosts stall a big fleet
`free`	Each host runs the play as fast as it can	Heterogeneous fleets, independent work, max throughput	Cross-host ordering matters
`host_pinned`	Like free, but at most forks hosts are active at once and each runs the whole play to completion before a new host starts	Keeping each host’s run uninterrupted on one worker (resource locality)	You need strict task-level lockstep
`debug`	Linear, but drops into the interactive debugger on failure	Step-through debugging	Production/automation

(The previous lesson covers serial/throttle/order and the rolling-update pattern in depth; here the point is simply that free is a speed tool.)

Reducing the work itself: loops vs many tasks

The fastest task is the one you do not run. Beyond the connection layer, you can cut real work in the play:

Use one task with a list, not many near-identical tasks. Installing ten packages as ten ansible.builtin.package tasks pays the full per-task overhead ten times. A single task with a list does it in one module invocation:
```
# SLOW: ten round-trips
- ansible.builtin.package: { name: git,  state: present }
- ansible.builtin.package: { name: vim,  state: present }
# ... eight more ...

# FAST: one round-trip — the package module installs the whole list at once
- name: Install all packages in one go
  ansible.builtin.package:
    name: [git, vim, curl, htop, jq, tmux, tree, unzip, rsync, lsof]
    state: present
```
The package, apt, and dnf modules all accept a list of names; passing the list lets the underlying package manager resolve dependencies once and is far faster than one task per package (and faster than a loop, which still invokes the module once per item). The same principle applies to file edits: prefer one blockinfile or template task over many lineinfile tasks.
Put when on the looped task rather than around a hand-unrolled set of tasks, but be aware that when on a looped task is evaluated once per item, so the loop still iterates over every item. If a single host-level condition decides whether any of the work is needed, test it once (for example on an enclosing block) instead of filtering inside the loop. For large lists, filter the list itself (loop: "{{ pkgs | select(...) | list }}") so you never iterate items you will skip.
Skip fact gathering when a play does not need it (above), and gather a subset when it needs only part.
Avoid command/shell where a module exists — not just for idempotence, but because re-running a shell command every time (when a module would no-op) is wasted work; pair unavoidable checks with changed_when: false.
Template once, not line-by-line. Twenty lineinfile tasks against one file is twenty passes; one ansible.builtin.template renders the whole file in a single task.

Profiling: measure before you tune

Do not guess where the time goes — measure. Ansible ships callback plugins that print timing, and turning them on is a one-line change. The key one is profile_tasks, which prints the wall-clock time of every task and a sorted “slowest tasks” summary at the end.

Enable callbacks in ansible.cfg:

[defaults]
callbacks_enabled = profile_tasks, profile_roles, timer

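(Older docs call this callback_whitelist; modern ansible-core uses callbacks_enabled. The env var is ANSIBLE_CALLBACKS_ENABLED. These profiling callbacks live in the ansible.posix collection, which the full ansible package bundles; on a bare ansible-core install, add it with ansible-galaxy collection install ansible.posix.)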

Callback	What it reports
`timer`	Total playbook wall-clock time at the end (the headline number)
`profile_tasks`	Per-task duration during the run and a sorted top-N slowest-tasks table at the end
`profile_roles`	Time aggregated per role — which role is the hog
`cgroup_perf_recap`	CPU/memory/PID usage per task via cgroups (resource profiling, needs setup)

A typical profile_tasks tail looks like:

=============================================================
Gathering Facts -------------------------------------- 8.42s
install packages ------------------------------------- 5.10s
render nginx.conf ------------------------------------ 0.31s
...
Playbook run took 0 days, 0 hours, 0 minutes, 23 seconds

That immediately tells you the truth most people get wrong by intuition: “Gathering Facts” is frequently the single most expensive line. If it is, the fix is gather_subset/caching, not more forks. If a command task dominates, the fix is on the target, not in Ansible. Always profile first, change one lever, profile again — the lab below does exactly this so you see the numbers move.
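If you save the summary to a file, a few lines of Python turn it into shares of the listed time, which makes the "one lever at a time" comparison easier. This sketch assumes the summary lines look like the sample above (task name, a run of dashes, seconds with an s suffix); the regex and helper names are illustrative, not part of any Ansible tooling.

```python
import re

# Matches lines such as "install packages ------------- 5.10s".
LINE = re.compile(r"^(?P<name>.+?)\s-+\s(?P<secs>\d+(?:\.\d+)?)s$")

SAMPLE = """\
=============================================================
Gathering Facts -------------------------------------- 8.42s
install packages ------------------------------------- 5.10s
render nginx.conf ------------------------------------ 0.31s
Playbook run took 0 days, 0 hours, 0 minutes, 23 seconds
"""


def parse_profile_lines(text):
    """Return (task_name, seconds) for every line that looks like a timing row."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append((match.group("name"), float(match.group("secs"))))
    return rows


def shares(rows):
    """Attach each row's fraction of the total listed time."""
    total = sum(secs for _, secs in rows)
    return [(name, secs, secs / total) for name, secs in rows] if total else []


for name, secs, share in shares(parse_profile_lines(SAMPLE)):
    print(f"{name:<20} {secs:6.2f}s {share:6.1%}")
```

Run it on the before and after summaries of the same playbook and compare the row that moved, rather than only the total.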

For a one-off run without editing config: ANSIBLE_CALLBACKS_ENABLED=profile_tasks,timer ansible-playbook site.yml.
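If you save a run's output to a file, a few lines of Python can rank the tasks and show how much of the total one task accounts for. This is a minimal sketch, assuming the recap lines look like the sample tail above (task name, a run of dashes, then seconds); slowest_tasks and share_of_total are illustrative helper names, not part of Ansible.

```python
import re

# Matches recap lines such as "Gathering Facts ------------- 8.42s".
# Assumption: the format follows the sample tail shown in this lesson.
TASK_LINE = re.compile(r"^(?P<name>.+?)\s-{2,}\s(?P<secs>\d+(?:\.\d+)?)s\s*$")


def slowest_tasks(recap_text, top=3):
    """Return (task name, seconds) pairs, slowest first; top=None returns all."""
    rows = []
    for line in recap_text.splitlines():
        match = TASK_LINE.match(line)
        if match:
            rows.append((match.group("name").strip(), float(match.group("secs"))))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


def share_of_total(recap_text, task_name):
    """Fraction of the summed task time spent in task_name (0.0 if absent)."""
    rows = slowest_tasks(recap_text, top=None)
    total = sum(secs for _, secs in rows)
    if total == 0:
        return 0.0
    return sum(secs for name, secs in rows if name == task_name) / total


SAMPLE = """\
=============================================================
Gathering Facts -------------------------------------- 8.42s
install packages ------------------------------------- 5.10s
render nginx.conf ------------------------------------ 0.31s
"""

if __name__ == "__main__":
    for name, secs in slowest_tasks(SAMPLE):
        print(f"{secs:6.2f}s  {name}")
    print(f"facts share: {share_of_total(SAMPLE, 'Gathering Facts'):.0%}")
```

Comparing that share before and after a change is a quick way to confirm that the lever you pulled actually affected the task you were targeting.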

Mitogen: the strategy plugin that can halve your runtime

Mitogen for Ansible is a third-party strategy plugin that replaces Ansible’s default execution model with a far more efficient one. Vanilla Ansible, even with pipelining, still forks-and-execs a fresh Python interpreter for many operations and shuttles data over SSH for each. Mitogen instead bootstraps a single long-lived Python process on each target and runs modules inside it as in-process function calls, reusing the interpreter and connection aggressively and routing everything over one persistent channel. The Mitogen documentation cites a 1.25×–7× wall-clock improvement, depending on the workload, with markedly less CPU on the control node.

Install and enable it:

python3 -m pip install mitogen        # provides the ansible_mitogen package

[defaults]
strategy_plugins = /path/to/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

strategy_plugins points Ansible at Mitogen’s plugin directory (find it with python3 -c "import ansible_mitogen, os; print(os.path.dirname(ansible_mitogen.__file__))" then /plugins/strategy). The strategies it adds are mitogen_linear (the lockstep default — start here), mitogen_free, and mitogen_host_pinned, mirroring the built-ins. You can also set it per-invocation with ANSIBLE_STRATEGY=mitogen_linear.
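The path lookup above can also be done without importing the package, which is handy if you only want to check whether Mitogen is installed. A minimal sketch, assuming the plugins/strategy layout described above; mitogen_strategy_dir is a hypothetical helper name.

```python
import importlib.util
import os


def mitogen_strategy_dir(package_name="ansible_mitogen"):
    """Return <package dir>/plugins/strategy, or None if the package is absent.

    Assumption: the strategy plugins sit under plugins/strategy inside the
    package directory, as this lesson describes.
    """
    spec = importlib.util.find_spec(package_name)
    # Plain modules and missing packages have no package directory to search.
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = list(spec.submodule_search_locations)[0]
    return os.path.join(package_dir, "plugins", "strategy")


if __name__ == "__main__":
    path = mitogen_strategy_dir()
    if path is None:
        print("ansible_mitogen is not installed in this interpreter")
    else:
        print(f"strategy_plugins = {path}")
```

Run it with the same python3 that runs Ansible; a different interpreter or virtualenv may report a different location or none at all.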

Trade-offs and limits — important:

Compatibility lags ansible-core. Mitogen is maintained independently and frequently does not support the very latest ansible-core immediately. Always check the Mitogen release notes against your ansible-core version before relying on it; an unsupported pairing produces obscure failures.
It changes execution semantics subtly. Because modules run in a shared in-process interpreter, occasional modules or custom modules that assume a fresh process, leak global state, or do exotic things can misbehave. Test your full playbook under Mitogen before trusting it in production.
Not a drop-in for every connection type. It is strongest over SSH to Linux; some connection plugins and become methods are unsupported.
It is not part of ansible-core or RHCE’s required toolset — treat it as a powerful optimisation you reach for once you have exhausted forks/pipelining/caching and still need more, and you can validate it.

When it works, Mitogen is the largest single lever after pipelining. When it does not, you fall back to a well-tuned stock setup — which is why the order of optimisation is: pipelining + multiplexing → forks → fact caching/subset → profiling-guided fixes → then consider Mitogen.

Connection plugins: ssh vs paramiko vs smart vs local

Underneath every remote task is a connection plugin — the transport. Picking the right one matters for both speed and capability.

Connection plugin	Transport	Multiplexing	Pipelining	When to use
`ssh`	Native OpenSSH binary (`/usr/bin/ssh`)	Yes (ControlMaster)	Yes	The default and the fast path — use it for Linux/Unix at scale
`paramiko`	Pure-Python SSH library	No (no ControlPersist)	Limited	Fallback when no `ssh` binary / for `--ask-pass` edge cases; slower
`smart`	Picks `ssh` if it supports ControlPersist, else `paramiko`	inherits	inherits	Legacy auto-detect; today effectively `ssh` everywhere
`local`	Runs on the control node, no SSH	n/a	n/a	`localhost`, `delegate_to: localhost`, `connection: local`
`winrm` / `psrp` / `ssh` (Windows)	WinRM / PowerShell Remoting / SSH	n/a	n/a	Windows targets

Set it with transport in ansible.cfg ([defaults] transport = ssh), the connection keyword on a play or task, the ansible_connection inventory var, or -c ssh on the command line.

The practical guidance: use ssh (the native binary). It is the only plugin that gives you OpenSSH multiplexing and pipelining — the two biggest levers in this lesson. paramiko exists for environments without an ssh binary or where you need pure-Python password auth without sshpass, but it cannot multiplex and is slower; only fall back to it deliberately. smart (a historical default) auto-selects between them and, on any modern system, resolves to ssh — so you lose nothing by setting ssh explicitly. Use local for the control node itself.
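To see which of these connection-layer settings a given ansible.cfg actually sets, you can read it with Python's configparser. This is a rough sketch, not how Ansible itself resolves configuration: it ignores environment variables, inventory variables, and command-line flags, and connection_settings is an illustrative name. The fallbacks for forks and pipelining are the defaults stated in this lesson.

```python
import configparser


def connection_settings(cfg_text):
    """Report transport, forks and pipelining as written in ansible.cfg text."""
    # interpolation=None so literal % characters (e.g. in ssh_args) are safe.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        "transport": parser.get("defaults", "transport", fallback="(not set)"),
        "forks": parser.getint("defaults", "forks", fallback=5),
        "pipelining": parser.getboolean(
            "ssh_connection", "pipelining", fallback=False
        ),
    }


EXAMPLE_CFG = """\
[defaults]
transport = ssh
forks = 25

[ssh_connection]
pipelining = True
"""

if __name__ == "__main__":
    for key, value in connection_settings(EXAMPLE_CFG).items():
        print(f"{key}: {value}")
```

Treat the output as a view of the file only; ansible-config dump remains the authoritative way to see what a run will use.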

The diagram traces a single run through every lever — from forks deciding how many hosts go at once, through the connection-layer wins (multiplexing + pipelining), fact gathering and caching, async backgrounding, strategy scheduling, profiling, and finally Mitogen’s in-process model — so you can see exactly where each setting bites.

Hands-on lab: a before/after benchmark

We will build a small, deliberately connection-heavy playbook, run it with stock defaults, then turn on the big levers and re-run, watching the wall-clock time drop. This costs ₹0 — it runs against local containers (or VMs). All commands use FQCN.

1. Set up a few free targets

Spin up three lightweight containers as SSH targets (Podman or Docker; adjust the image to taste). If you already have a few VMs or localhost plus containers, use those instead.

mkdir -p ~/ansible-perf-lab && cd ~/ansible-perf-lab

for n in 1 2 3; do
  podman run -d --name perf$n -p 220$n:22 \
    docker.io/rastasheep/ubuntu-sshd:18.04 >/dev/null 2>&1 \
  || docker run -d --name perf$n -p 220$n:22 \
    docker.io/rastasheep/ubuntu-sshd:18.04
done

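That image listens on SSH with root / root. Ansible modules also need Python on the target, and ansible-core 2.17+ requires Python 3.7 or newer there, while Ubuntu 18.04 packages Python 3.6; if the first run fails on a missing or unsupported interpreter, swap in an SSH-enabled image or VM with a newer Python. Create an inventory: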

cat > inventory.ini <<'EOF'
[perf]
perf1 ansible_host=127.0.0.1 ansible_port=2201
perf2 ansible_host=127.0.0.1 ansible_port=2202
perf3 ansible_host=127.0.0.1 ansible_port=2203

[perf:vars]
ansible_user=root
ansible_ssh_pass=root
ansible_ssh_common_args=-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
EOF

(sshpass is needed for ansible_ssh_pass; install it via your package manager, or swap to key auth.)

2. A connection-heavy playbook

The point is many small tasks so connection overhead dominates — exactly where tuning shows.

cat > bench.yml <<'EOF'
---
- name: Connection-heavy benchmark
  hosts: perf
  gather_facts: true
  tasks:
    - name: Touch a series of files (lots of small round-trips)
      ansible.builtin.file:
        path: "/tmp/perf_{{ item }}"
        state: touch
        mode: "0644"
      loop: "{{ range(1, 21) | list }}"

    - name: A few command checks
      ansible.builtin.command: "echo check {{ item }}"
      changed_when: false
      loop: "{{ range(1, 6) | list }}"
EOF

3. Run #1 — stock defaults, with profiling

Use a config that turns on only profiling so we get honest numbers, with everything else at defaults (pipelining off, forks 5):

cat > ansible.cfg <<'EOF'
[defaults]
inventory = inventory.ini
host_key_checking = False
callbacks_enabled = profile_tasks, timer
EOF

ANSIBLE_PIPELINING=False ansible-playbook bench.yml

Note the Playbook run took ... line and the profile_tasks table — especially how much Gathering Facts and the looped file/command tasks cost. This is your baseline.

4. Run #2 — turn on the levers

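Now enable pipelining, raise forks, keep the multiplexed master connection warm for longer (ControlPersist=120s, up from the 60s default), switch to the free strategy, and turn on smart gathering with a jsonfile fact cache: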

cat > ansible.cfg <<'EOF'
[defaults]
inventory = inventory.ini
host_key_checking = False
forks = 25
strategy = free
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 7200
callbacks_enabled = profile_tasks, timer

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=120s
control_path_dir = ~/.ansible/cp
EOF

ansible-playbook bench.yml

Compare the new Playbook run took ... line to the baseline. On a connection-heavy run you should see a clear drop: pipelining removes per-task temp-file copies, the longer-lived multiplexed connection is reused across tasks, and free lets the three hosts finish independently. Raising forks changes nothing here, because three hosts already fit within the default of 5.
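To turn the two timer lines into a single ratio, you can parse them rather than eyeball them. A minimal sketch, assuming you captured each run's output as text (for example with tee) and that the timer line has the form shown earlier in this lesson; run_seconds and speedup are hypothetical helper names.

```python
import re

# Assumption: the timer callback prints the line in this form, as in the
# sample output earlier in this lesson.
TIMER_LINE = re.compile(
    r"Playbook run took (\d+) days, (\d+) hours, (\d+) minutes, (\d+) seconds"
)


def run_seconds(output):
    """Extract the total run time in seconds from captured playbook output."""
    match = TIMER_LINE.search(output)
    if match is None:
        raise ValueError("no timer line found; is the timer callback enabled?")
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def speedup(baseline_output, tuned_output):
    """Baseline time divided by tuned time; infinity if the tuned run took 0s."""
    tuned = run_seconds(tuned_output)
    if tuned == 0:
        return float("inf")
    return run_seconds(baseline_output) / tuned
```

The timer line has one-second resolution, so on a three-container lab the ratio is coarse; repeat each run a couple of times before drawing conclusions.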

5. Run #3 — prove fact caching

Run a third time immediately. Because gathering = smart + jsonfile caching is on and the facts are fresh (within 7200s), fact gathering is skipped entirely:

ansible-playbook bench.yml
ls -la /tmp/ansible_fact_cache/      # one JSON file per host = cached facts

In the profile_tasks output, Gathering Facts should now be near-instant (or absent), shaving the setup cost off every subsequent run. That is the production win for fleets you converge repeatedly.
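You can also check how old each cached entry is relative to the timeout. A minimal sketch, assuming the jsonfile plugin keeps one file per host in the cache directory and judges freshness by the file's modification time; cache_report is an illustrative name, and the 7200-second default mirrors the lab's fact_caching_timeout.

```python
import os
import time


def cache_report(cache_dir, timeout_seconds=7200, now=None):
    """Map each cache file name to (age in seconds, still fresh?).

    Assumption: freshness is judged from the file's modification time.
    A timeout of 0 is treated as "never expire", matching the meaning of
    fact_caching_timeout = 0 described in this lesson.
    """
    now = time.time() if now is None else now
    report = {}
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        if not os.path.isfile(path):
            continue
        age = now - os.path.getmtime(path)
        fresh = timeout_seconds == 0 or age < timeout_seconds
        report[entry] = (age, fresh)
    return report


if __name__ == "__main__":
    cache_dir = "/tmp/ansible_fact_cache"  # the lab's fact_caching_connection
    if os.path.isdir(cache_dir):
        for host, (age, fresh) in cache_report(cache_dir).items():
            print(f"{host}: {age:.0f}s old, {'fresh' if fresh else 'expired'}")
```

If a host shows as expired here, the next run with gathering = smart should gather its facts again rather than reuse the cache.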

6. (Optional) Confirm multiplexing and try async

While a run is in flight, list the live control sockets, then add a fire-and-forget task:

ls -la ~/.ansible/cp/    # live sockets during a run = multiplexing working

cat >> bench.yml <<'EOF'

    - name: Fire-and-forget a slow job on every host
      ansible.builtin.command: "sleep 10"
      async: 60
      poll: 0
      register: slow

    - name: Collect the slow jobs
      ansible.builtin.async_status:
        jid: "{{ slow.ansible_job_id }}"
      register: slow_done
      until: slow_done.finished
      retries: 30
      delay: 2
EOF

ansible-playbook bench.yml

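The sleep 10 starts in the background on all three hosts and the play moves on immediately; the collection task then waits for the jobs, so the total added time is about 10s, not 30s. With three hosts and forks above three, a synchronous sleep would also overlap across hosts; what poll: 0 adds is that each host can run other tasks while the job is in flight, and on larger fleets a long job no longer occupies a fork.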

Validation

Baseline vs tuned Playbook run took numbers differ (tuned is faster).
After run #2/#3, /tmp/ansible_fact_cache/ contains a JSON file per host.
During a run, ~/.ansible/cp/ shows live control sockets.
The async section completes in roughly the single-task time, not the sum.

Cleanup

for n in 1 2 3; do podman rm -f perf$n 2>/dev/null || docker rm -f perf$n; done
rm -rf ~/ansible-perf-lab /tmp/ansible_fact_cache ~/.ansible/cp

Cost note

₹0. Everything runs in local containers (or VMs) on your own machine — no cloud resources, no managed nodes billed.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
Big fleet still slow after raising `forks`	`serial` is the binding cap, or downstream chokepoint (one DB, `delegate_to`)	Effective parallelism is `min(forks, serial, hosts)`; raise/remove `serial`; remove the chokepoint
Control node thrashes / OOM / “too many open files” at high forks	Forks exceed control-node CPU/RAM/fd limits	Lower `forks`; raise `ulimit -n`; size from control-node capacity, not host count
`sudo: you must have a tty to run sudo` after enabling pipelining	Targets have `Defaults requiretty` in sudoers	Remove/scope `requiretty`, or leave `pipelining` off for those hosts
`unix_listener: path too long` / multiplexing silently off	`ControlPath` exceeds the 108-char socket limit	Set a short `control_path_dir` (e.g. `~/.ansible/cp`)
Facts re-gathered on every run despite caching	Using default `memory` cache, or `gathering = implicit`	Set a persistent plugin (`jsonfile`/`redis`) and `gathering = smart`
Cached facts are stale (old IP/hostname)	`fact_caching_timeout` too high or `0` (never expire); no auto-invalidation	Lower the timeout; delete the cache dir to force a refresh
Long task fails with a connection/timeout error	Synchronous task exceeded SSH/become timeout	Wrap it in `async: N` with `poll` (or `poll: 0` + `async_status`)
`free` rollout breaks ordering / coordination	`free` removes the per-task barrier between hosts	Use `linear` for ordered/coordinated work; `free` only for independent tasks
Mitogen fails with obscure errors	Mitogen version doesn’t support your `ansible-core`	Match versions per Mitogen release notes; test full playbook; else fall back to stock
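The first row's rule, min(forks, serial, hosts), is easy to encode so you can see which setting is the binding cap before you change anything. A minimal sketch, assuming serial is a plain integer (Ansible also accepts percentages and lists, which this does not handle); both function names are illustrative.

```python
def effective_parallelism(forks, hosts, serial=None):
    """Hosts that can run one task at the same time: min(forks, serial, hosts)."""
    if forks < 1 or hosts < 0 or (serial is not None and serial < 1):
        raise ValueError("forks and serial must be >= 1, hosts must be >= 0")
    limits = [forks, hosts]
    if serial is not None:
        limits.append(serial)
    return min(limits)


def binding_cap(forks, hosts, serial=None):
    """Name the setting that currently limits parallelism (serial wins ties)."""
    limit = effective_parallelism(forks, hosts, serial)
    if serial is not None and serial == limit:
        return "serial"
    if forks == limit:
        return "forks"
    return "hosts"


if __name__ == "__main__":
    # The quick-check scenario from this lesson: 200 hosts, forks 50, serial 10.
    print(effective_parallelism(forks=50, hosts=200, serial=10))  # prints 10
    print(binding_cap(forks=50, hosts=200, serial=10))            # prints serial
```

If binding_cap reports serial or hosts, raising forks further cannot help; the fix lies in the play's batching or in the fleet size itself.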

Best practices

Add pipelining = True first — verify requiretty is off on your fleet, then enjoy the single biggest per-task win for free.
Keep control_path_dir short so multiplexing never silently breaks on the 108-char socket limit; raise ControlPersist if you re-run playbooks often.
Size forks from the control node, not the host count; watch htop during a big run and back off if it pegs CPU or swaps.
Profile before you tune and after every change with profile_tasks + timer; change one lever at a time so you know what moved the needle.
Use gathering = smart + a persistent fact cache (jsonfile for one controller, redis/memcached for many) on any fleet you converge repeatedly.
Gather only the subset you use (gather_subset: ["!hardware"] is a common, safe win) and set it via module_defaults so every play benefits.
Reach for async/poll: 0 + async_status to parallelise long, independent work across the fleet instead of letting one slow host hold a fork.
Pick free for independent work, linear for ordered rollouts — speed versus coordination is the trade.
Collapse work: one looped ansible.builtin.package over a list, one template over many lineinfiles, modules over command/shell.
Treat Mitogen as a validated optimisation, not a default — it is the biggest lever after pipelining when its version matches yours and your playbook passes under it.
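The advice to keep control_path_dir short can be checked numerically. A minimal sketch, assuming the 108-character Linux limit quoted in this lesson; socket_name is a hypothetical placeholder, because the real socket file name is generated at run time, so leave generous headroom rather than aiming for zero.

```python
import os

# The Linux socket-path limit quoted in this lesson; other platforms may differ.
LINUX_SOCKET_PATH_LIMIT = 108


def control_path_headroom(control_path_dir, socket_name,
                          limit=LINUX_SOCKET_PATH_LIMIT):
    """Bytes left before directory + socket name reaches the path limit.

    A negative result means the combined path is already too long.
    socket_name is a stand-in for whatever file name is generated at run time.
    """
    full_path = os.path.join(os.path.expanduser(control_path_dir), socket_name)
    return limit - len(os.fsencode(full_path))


if __name__ == "__main__":
    placeholder = "x" * 40  # hypothetical socket file name length
    for directory in ("~/.ansible/cp", "/tmp/cp"):
        print(directory, control_path_headroom(directory, placeholder))
```

A deep home directory, such as one on a network mount with a long prefix, is the usual way to run out of room, which is why a short absolute directory is the safer choice.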

Security notes

Fact caches can leak. A jsonfile/yaml cache is plaintext on the control node; if facts (or cacheable: true set_fact values) include internal IP maps, tokens, or anything sensitive, restrict the cache directory’s permissions and set a sane fact_caching_timeout. A redis/memcached cache should be on a trusted network with auth enabled — an open Redis is a data-leak waiting to happen.
Pipelining and become. Disabling requiretty to enable pipelining slightly relaxes a hardening control; do it deliberately and scope it to the automation user (Defaults:ansible_user !requiretty) rather than globally where you can.
ControlPersist leaves warm connections. A long ControlPersist keeps authenticated sockets open in the background after a run; on a shared control node, another user with access to your control socket directory could ride them. Keep control_path_dir in your own home with tight permissions, and do not set ControlPersist absurdly high on shared machines.
async status files persist with poll: 0. Fire-and-forget jobs leave their result files on the target, and Ansible does not remove them automatically; if a job’s stdout contains secrets, those files linger. Pair such tasks with no_log: true and remove the job files with ansible.builtin.async_status (mode: cleanup), or avoid poll: 0 for sensitive output.
Higher forks widen blast radius. Driving 100 hosts at once means a bad change hits 100 hosts at once; combine aggressive forks with serial/max_fail_percentage (previous lesson) on anything that mutates production.
Mitogen runs a long-lived interpreter on targets. It bootstraps a persistent Python process per host; understand and trust the channel, and prefer it on networks you control.
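The first and third notes both come down to directory permissions on the control node. A minimal sketch of a check you could run against the fact cache and control socket directories on a POSIX system; is_private is an illustrative name, and the two paths are the ones used in this lesson's lab.

```python
import os
import stat


def is_private(path):
    """True when group and others have no permission bits set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


if __name__ == "__main__":
    # Lab paths from this lesson; adjust to your own configuration.
    candidates = ["/tmp/ansible_fact_cache", os.path.expanduser("~/.ansible/cp")]
    for path in candidates:
        if not os.path.exists(path):
            print(f"{path}: not present")
        elif is_private(path):
            print(f"{path}: private (owner only)")
        else:
            print(f"{path}: accessible to group or others; consider chmod 700")
```

This only inspects the directory itself; files inside it and any parent directories carry their own permissions and deserve the same look on a shared control node.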

Interview & exam questions

What does forks control, and what is its default? The maximum number of hosts Ansible communicates with simultaneously; default 5. Under linear it is the batch size per task; it is the concurrency ceiling under any strategy.
A colleague set forks = 100 for a rolling update but sees no speedup. Why? serial is almost certainly capping the batch. Effective parallelism is min(forks, serial, hosts-remaining); with serial: 10 you get at most 10 hosts at once regardless of forks.
Explain pipelining and the one thing that breaks it. Pipelining pipes the module’s Python straight into the remote interpreter over the open SSH session instead of copying it to a temp file first, collapsing several round-trips into ~one — a major per-task speedup. It breaks when sudoers has requiretty, because pipelining allocates no TTY and sudo then refuses; the fix is to remove/scope requiretty (or leave pipelining off there).
What is SSH multiplexing and how does Ansible use it? OpenSSH ControlMaster opens one connection that subsequent sessions reuse, and ControlPersist keeps it warm for N seconds after the last use (even across runs). Ansible’s ssh plugin enables ControlMaster=auto + ControlPersist=60s by default, eliminating the handshake on every task. The gotcha is the ControlPath 108-char socket limit — keep control_path_dir short.
Default fact gathering is costing you on a 300-host fleet. List the levers. gather_facts: false where facts are unused; gather_subset (e.g. !hardware) to collect less; gather_timeout if collection legitimately exceeds 10s; and gathering = smart + a persistent fact cache (jsonfile/redis) so facts are gathered once and reused.
What does gathering = smart do, and why pair it with caching? It gathers facts for a host only if they are not already cached. With a persistent cache, the first run gathers and stores; later runs within the cache window skip gathering entirely — turning per-run probing into occasional probing.
Compare the memory, jsonfile, and redis cache plugins. memory (default) caches only within a single run — not persisted. jsonfile writes one JSON file per host to disk on the controller — simple, single-controller. redis (and memcached) is a shared network cache so many controllers/AWX nodes share one fact store. Set freshness with fact_caching_timeout.
You must run a 30-minute backup that exceeds the SSH timeout. How? Wrap it in async: 1800 with poll: 15 (or another interval): Ansible backgrounds it on the host and polls a status file rather than holding the exec open, surviving the long runtime and failing cleanly past async.
What is the fire-and-forget pattern and which module collects the result? async: N with poll: 0 starts the task on every host and returns immediately without waiting; you collect results later with ansible.builtin.async_status using the registered ansible_job_id, typically in an until: result.finished / retries / delay loop. It parallelises slow, independent work across the fleet.
linear vs free strategy — when each? linear (default) keeps hosts in lockstep (all finish task N before task N+1) — use it for ordered/coordinated rollouts; the slowest host paces each step. free lets each host run the play as fast as it can — use it for independent work on heterogeneous fleets to cut wall-clock time, but never when cross-host ordering matters.
How do you find the real bottleneck before tuning? Enable the profile_tasks and timer callbacks (callbacks_enabled = profile_tasks, timer, or ANSIBLE_CALLBACKS_ENABLED). timer gives total wall-clock; profile_tasks gives per-task timing and a sorted slowest-tasks table — which very often reveals that “Gathering Facts” is the most expensive line, pointing you at subset/caching rather than forks.
What is Mitogen and what are its risks? A third-party strategy plugin (mitogen_linear/_free/_host_pinned) that runs modules in a persistent in-process interpreter per host with one reused channel; its documentation cites 1.25×–7× speedups with less control-node CPU. Risks: it lags ansible-core compatibility, subtly changes execution semantics (test your full playbook), and does not support every connection/become combination. Treat it as a validated optimisation, not a default.
Which connection plugin gives you both multiplexing and pipelining, and what is the alternative for? The native ssh plugin (the default fast path). paramiko is a pure-Python fallback for environments without an ssh binary or for certain password-auth cases — it cannot multiplex and is slower. smart auto-selects and resolves to ssh on modern systems.

Quick check

Your play targets 200 hosts with forks: 50 and serial: 10. How many hosts run a given task at once, at most?
True or false: pipelining is on by default in ansible-core.
You enable pipelining and sudo tasks start failing with a TTY error. What sudoers setting is the culprit?
Which gathering value gathers facts only when they are not already cached?
You want a long task to start on every host and not block the play, collecting results later. Which two settings do you use, and which module reaps the result?

Answers

10. Effective parallelism is min(forks, serial, hosts) = min(50, 10, 200) = 10; serial is the binding cap.
False. Multiplexing (ControlMaster) is on by default, but pipelining is off by default — you enable it in ansible.cfg.
Defaults requiretty in /etc/sudoers. Pipelining allocates no TTY, so sudo with requiretty refuses; remove or scope it.
smart (gathering = smart) — gather only if not cached; pair it with a persistent fact cache for the full benefit.
async: N with poll: 0 (fire-and-forget), then ansible.builtin.async_status (with until: result.finished / retries / delay) to collect the result.

Exercise

Take a real (or sample) playbook of yours that targets at least three hosts and do a measured tuning pass:

Baseline. Add callbacks_enabled = profile_tasks, timer only, run it, and record the total time and the top three slowest tasks. Note specifically what “Gathering Facts” costs.
Connection layer. Enable pipelining = True (confirm requiretty is off), set ssh_args with ControlPersist=120s and a short control_path_dir, and raise forks to a sensible value for your control node. Re-run and compare.
Facts. Switch to gathering = smart with a jsonfile cache (fact_caching_timeout = 3600); add gather_subset: ["!hardware"] via module_defaults. Run twice and confirm the second run skips gathering.
Async. Convert your longest task to async/poll: 0 + ansible.builtin.async_status, and verify the play no longer blocks on it.
Strategy. If your tasks are order-independent, try strategy: free and compare wall-clock against linear.
Write up the before/after timer numbers and which lever moved the needle most. (Optional stretch: install Mitogen, run under mitogen_linear, and compare — but only if its version matches your ansible-core.)

The goal is the discipline: profile → change one lever → profile again, and end with numbers, not vibes.

Certification mapping

RHCE (EX294): Performance and production-readiness sit across several objectives. Configuring ansible.cfg for the environment — forks, pipelining, [ssh_connection] tuning — is part of “use the ansible.cfg” expectations; you should be able to set these from memory under time pressure. async/poll maps directly to “run tasks asynchronously,” and the fire-and-forget + async_status pattern is a known exam idiom for long-running tasks. Fact handling — gather_facts, gather_subset, and fact caching — overlaps the facts objectives and is a realistic production-config task. Knowing the linear/free strategies and how forks/serial interact rounds out the run-control material the exam probes.
The connection layer (multiplexing, pipelining, the requiretty caveat) and profiling (profile_tasks/timer) are not always called out explicitly but are exactly the “make this fleet fast and reliable” reasoning interviewers and graders look for — know them cold.

Glossary

forks — maximum number of hosts Ansible drives in parallel; default 5; the concurrency ceiling under any strategy.
Strategy plugin — decides how per-host task streams are scheduled (linear, free, host_pinned, debug, mitogen_*).
linear — default strategy: every host finishes task N before any starts task N+1 (lockstep); slowest host paces each step.
free — strategy where each host runs the whole play as fast as it can, never waiting for others.
host_pinned — strategy that completes a host’s play on one worker before bringing in a new host.
Connection plugin — the transport to a host (ssh, paramiko, smart, local, winrm).
Multiplexing — OpenSSH ControlMaster/ControlPersist reusing one SSH connection across many sessions/tasks (and runs).
ControlPath — filesystem socket path identifying a reusable SSH master; subject to a 108-char limit on Linux.
ControlPersist — seconds OpenSSH keeps a master connection warm after its last use.
Pipelining — feeding a module’s code into the remote interpreter over the open session instead of copying a temp file first; ~1 round-trip per task. Off by default; breaks under requiretty.
requiretty — sudoers setting demanding a TTY for sudo; conflicts with pipelining.
Fact gathering — running ansible.builtin.setup to collect host facts (ansible_facts) before tasks.
gather_subset — which fact categories to collect (min, hardware, network, virtual, all; ! excludes).
gather_timeout — per-fact-collection timeout; default 10s.
gathering — global policy: implicit (always), explicit (never auto), smart (only if not cached).
Cache plugin — backend storing facts (memory, jsonfile, yaml, redis, memcached, pickle).
fact_caching_timeout — seconds cached facts stay valid; 0 = never expire (no auto-invalidation).
async — maximum seconds a backgrounded task may run.
poll — how often Ansible checks a backgrounded task; poll: 0 = fire-and-forget.
ansible.builtin.async_status — module that checks/collects a backgrounded job by its ansible_job_id.
Callback plugin — reacts to run events; profile_tasks/profile_roles/timer produce profiling output.
callbacks_enabled — ansible.cfg setting (env ANSIBLE_CALLBACKS_ENABLED) that turns non-stdout callbacks on.
Mitogen — third-party strategy plugin running modules in a persistent remote interpreter for large speedups; version-sensitive.

Next steps

You can now make Ansible fast: parallelise with forks, eliminate connection cost with multiplexing + pipelining, stop re-gathering with gathering = smart + a fact cache, background long work with async/poll and ansible.builtin.async_status, choose free for independent work, and — crucially — profile with profile_tasks/timer so every change is measured, with Mitogen as the heavy lever once the basics are exhausted. The next lesson, Dynamic Inventory for AWS, Azure & Secrets, takes you from a fast static fleet to a fast dynamic one — generating inventory from cloud APIs so the hosts you tune here are discovered automatically. To revisit the control-side companions of these levers — serial, throttle, order, and the linear/free/host_pinned strategies used for coordination rather than raw speed — return to Ansible Delegation, Strategies & Rolling Updates, In Depth. And to refresh where the facts you are caching actually come from, see Ansible Variables & Facts, In Depth.