Ansible Lesson 14 of 42

Tuning Ansible for Speed & Scale, In Depth: Pipelining, Forks, Fact Caching, Async & Mitogen

The first time you run a tidy ten-task playbook against three lab machines, Ansible feels instant. The first time you run that same playbook against three hundred production machines, it feels like watching paint dry — and the reason is almost never your tasks. It is the plumbing: how many hosts Ansible talks to at once, how many times it opens an SSH connection, how many forks-and-execs each module costs, and how much time every single play burns up front gathering facts you may not even use. Ansible’s defaults are deliberately conservative and beginner-safe; they are emphatically not tuned for scale. The good news is that the same engine that crawls with stock settings will fly once you pull the right levers, and none of those levers require rewriting a line of your roles.

This lesson is the exhaustive tour of those levers. We start with forks, the parallelism dial that decides how many hosts Ansible drives simultaneously, and how to size it sanely. We then go deep on the single biggest free win in Ansible: SSH connection reuse via OpenSSH’s ControlMaster/ControlPersist multiplexing, and pipelining, which collapses the several round-trips each task normally makes into roughly one (plus the requiretty caveat that bites people who turn it on blind). We cover fact gathering end to end (implicit versus explicit, turning it off, gather_subset, gather_timeout) and then fact caching with the jsonfile, redis, and memcached plugins so a fleet gathers facts once and reuses them for hours. We cover async + poll for long-running tasks and the fire-and-forget poll: 0 + ansible.builtin.async_status pattern; the free strategy for letting fast hosts race ahead; cutting work with loops and loop-level when; profiling with the profile_tasks and timer callbacks so you measure instead of guess; the Mitogen strategy plugin that can halve wall-clock time; and the connection plugins (ssh, paramiko, smart) underneath it all. Everything targets current Ansible (ansible-core 2.17+ / Ansible 10+, 2026) and uses FQCNs (ansible.builtin.async_status, ansible.builtin.setup) throughout. We finish with a free, local before/after benchmark so you can see the numbers move.

Learning objectives

After working through this lesson you will be able to:

Prerequisites & where this fits

You should already be fluent with the run-time machinery this lesson tunes: playbooks made of plays and tasks, facts (the ansible.builtin.setup module, ansible_facts, custom facts), register for capturing results, and inventory with group_vars/host_vars. The previous lesson, Ansible Delegation, Strategies & Rolling Updates, In Depth, introduced serial, throttle, and the linear/free/host_pinned strategies in the context of control; this lesson revisits forks and free from the angle of speed and adds the connection-layer levers that make the biggest difference at scale. You will also lean on what you learned about variables and facts in Ansible Variables & Facts, In Depth, because fact gathering and caching are the same ansible_facts you already use, just timed and stored differently. This is the Execution module of the Advanced tier of the Ansible Zero-to-Hero ladder. The material maps to the RHCE (EX294) performance objectives — pipelining, forks, fact caching, and async are exactly the production-readiness skills the exam expects. Everything you need is ansible-core plus a couple of local containers or VMs; the lab runs for free.

Core concepts

Three ideas explain why Ansible is slow by default, and every lever in this lesson follows from them. Fix these in your head first.

Ansible’s work is dominated by connection overhead, not task logic. Most modules do very little real work — install a package, template a file, restart a service — but getting to the point of doing that work is expensive. For each task, on each host, vanilla Ansible: opens (or reuses) an SSH connection, creates a temporary directory on the target, copies the module (a Python file) into it, executes it with the right interpreter, captures the JSON it prints on stdout, and cleans up. The round-trips across the network — not the package install — are where the seconds go. Every performance lever in Ansible is ultimately about doing fewer round-trips, reusing connections, or doing more hosts at once. Hold that and the whole lesson coheres.

Ansible is push-based and serial-per-host by default, parallel-across-hosts by forks. Within a single host, tasks run top to bottom, one at a time (that is what makes a playbook readable and predictable). Across hosts, Ansible runs the same task on up to forks hosts simultaneously, then moves to the next task — that is the default linear strategy. So your two scaling dimensions are: how many hosts run in parallel (forks), and how cheap each host’s per-task overhead is (connection reuse, pipelining). The strategy decides how the per-host streams are scheduled relative to one another (linear keeps them in lockstep; free lets each host sprint).

The control node is the bottleneck you forget about. Every fork is a process on your control machine, and each one runs Python, holds an SSH connection, and uses memory and a file descriptor. Setting forks = 500 on a 2-vCPU laptop does not make Ansible 100× faster than forks = 5; it makes the control node thrash. Sizing forks is therefore a control-node capacity question, not a “bigger is better” knob. We will size it concretely below.

A vocabulary note you will see throughout: a connection plugin is the transport Ansible uses to reach a host (ssh, paramiko, local, winrm, …); a strategy plugin decides task scheduling across hosts (linear, free, host_pinned, debug, and the third-party mitogen_linear); a callback plugin reacts to run events and is how profiling output is produced. All three are plugins that run on the control node, and all three are tuning surfaces.

Forks: the parallelism dial

forks is the maximum number of hosts Ansible communicates with at the same time. It defaults to 5 — meaning even if your play targets 200 hosts, Ansible drives them 5 at a time for each task under the default linear strategy. This single setting is the most common reason a large run feels slow, and the easiest to fix.
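To make the "5 at a time" arithmetic concrete, here is a minimal Python sketch. It is a simplification, not Ansible's scheduler: it assumes every host takes the same time on a task, so the hosts fall into neat batches, and the function name is hypothetical.

```python
import math


def batches_per_task(hosts: int, forks: int) -> int:
    """Sequential batches needed to run one task on every host.

    Simplifying assumption: all hosts take equally long, so the
    worker pool behaves like fixed-size batches.
    """
    if hosts < 0 or forks < 1:
        raise ValueError("hosts must be >= 0 and forks must be >= 1")
    return math.ceil(hosts / forks)


# The 200-host play from the text, at the default and at forks = 50.
print(batches_per_task(200, 5))   # 40
print(batches_per_task(200, 50))  # 4
```

Every task in the play pays that batch count again, which is why a low forks value multiplies across a long playbook.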

You set it in three places (later overrides earlier):

| Where | How | Scope |
| --- | --- | --- |
| ansible.cfg | forks = 50 under [defaults] | Project/global default |
| Environment | ANSIBLE_FORKS=50 ansible-playbook … | Per-invocation |
| Command line | ansible-playbook -f 50 site.yml (also --forks) | Per-invocation, wins |

How forks interacts with the strategy. Under linear (default), Ansible runs task N on up to forks hosts, waits for all of them, then starts task N+1 on the next batch — the play advances task-by-task and the slowest host in each batch sets the pace. Under free, forks still caps concurrency, but hosts no longer wait for each other between tasks; a fast host can be ten tasks ahead of a slow one. So forks is the concurrency ceiling regardless of strategy; the strategy decides whether hosts move in lockstep beneath that ceiling.

How forks interacts with serial. serial (rolling-update batch size, covered in the previous lesson) is a different cap. serial: 10 means Ansible runs the whole play against 10 hosts, finishes, then the next 10. forks then governs concurrency within that batch of 10. The effective parallelism is min(forks, serial, hosts-remaining). If serial: 10 and forks: 50, you get at most 10 hosts at once (the batch is the binding limit); if serial: 100 and forks: 25, you get 25 at once. A classic mistake is bumping forks to 100 for a rolling update and seeing no change because serial is pinning you to 10.
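The min(forks, serial, hosts-remaining) rule can be checked with a few lines of Python. This is only a sketch of the rule as stated above; the function and parameter names are hypothetical and nothing here calls Ansible.

```python
from typing import Optional


def effective_parallelism(forks: int, hosts_remaining: int,
                          serial: Optional[int] = None) -> int:
    """Hosts that can run at once: the smallest of the active caps."""
    limits = [forks, hosts_remaining]
    if serial is not None:          # serial is optional on a play
        limits.append(serial)
    return min(limits)


# The two cases from the text, on a hypothetical 200-host play.
print(effective_parallelism(forks=50, hosts_remaining=200, serial=10))   # 10
print(effective_parallelism(forks=25, hosts_remaining=200, serial=100))  # 25
```

If raising forks does not change the result, serial (or the number of hosts left) is the binding limit.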

Sizing forks. There is no universal number; size it from the control node’s capacity, because each fork is a Python process holding an SSH connection.

| Control node | Sensible starting forks | Reasoning |
| --- | --- | --- |
| Laptop / 2 vCPU, 8 GB | 10–25 | Modest CPU; SSH + Python per fork add up |
| CI runner / 4 vCPU, 16 GB | 25–50 | Common sweet spot for mid fleets |
| Dedicated control / 8–16 vCPU, 32 GB+ | 50–100+ | Can drive hundreds of hosts in waves |
| AWX/AAP execution node | tune per node + forks in job template | Container resources cap it |

Rules of thumb: each fork costs roughly tens of MB of RAM and one file descriptor pair; CPU matters because the control node runs Jinja2 templating, JSON parsing, and connection setup for every fork. Watch top/htop on the control node during a big run — if it pegs CPU or starts swapping, you set forks too high. Also raise the OS open-files limit (ulimit -n) before pushing forks into the hundreds, or you will hit “too many open files”. Finally, more forks only helps if you actually have many hosts and your tasks are not serialised by a downstream bottleneck (a single shared database, an artifact server, a delegate_to chokepoint).

SSH connection reuse: ControlMaster & ControlPersist (multiplexing)

Here is the biggest, cheapest win at scale, and it is pure OpenSSH. Every TCP+SSH handshake — key exchange, authentication, channel setup — costs real milliseconds, often 100–500 ms each. A playbook with 30 tasks against a host, without connection reuse, can pay that handshake dozens of times on that one host. Multiplexing opens the SSH connection once and reuses it for every subsequent task.
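A rough back-of-the-envelope model of that cost, as a Python sketch. It assumes the worst case of exactly one handshake per task without reuse and a single handshake with reuse; real runs fall somewhere in between, and the function name is hypothetical.

```python
def handshake_overhead_ms(tasks: int, handshake_ms: int, reuse: bool) -> int:
    """Total time spent on SSH handshakes for one host, in milliseconds.

    Assumption: without reuse every task opens its own connection;
    with reuse only the first task does.
    """
    connections = 1 if reuse else tasks
    return connections * handshake_ms


# The 30-task playbook from the text at both ends of the 100-500 ms range.
for ms in (100, 500):
    without = handshake_overhead_ms(30, ms, reuse=False)
    with_mux = handshake_overhead_ms(30, ms, reuse=True)
    print(f"{ms} ms handshake: {without} ms without reuse, {with_mux} ms with reuse")
```

Under those assumptions a single host spends between 3 and 15 seconds on handshakes alone without reuse, against 0.1 to 0.5 seconds with it.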

OpenSSH implements this with three options that Ansible’s ssh connection plugin sets for you:

| OpenSSH option | What it does | Ansible’s effective default |
| --- | --- | --- |
| ControlMaster | Lets the first connection become a master that later sessions reuse over one TCP connection | auto |
| ControlPersist | Keeps the master connection open in the background for N seconds after the last session closes, ready to be reused | 60s |
| ControlPath | Filesystem path to the control socket that identifies a reusable connection (per user@host:port) | a short hash-named socket that Ansible generates under control_path_dir (~/.ansible/cp by default) |

You configure these through Ansible, not by hand-editing ~/.ssh/config, via ssh_args in ansible.cfg:

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path_dir = ~/.ansible/cp

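What each value means in practice: ControlMaster=auto lets the first session to a host become the master and lets later sessions reuse it; ControlPersist=60s keeps that master open for 60 seconds after the last session closes, so back-to-back tasks and runs reuse it; control_path_dir is the directory that holds the control sockets, and it should stay short because a socket path has a length limit.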

How to confirm it is working. Run with -vvvv and look for ControlMaster / auto / mux lines, or check for live sockets while a play runs:

ls -la ~/.ansible/cp/
# entries like:  <hash>  -> a live control socket = multiplexing is on
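The same check can be scripted. The following is a minimal Python sketch that lists socket files in the control directory; it assumes the default control_path_dir of ~/.ansible/cp, and the function name is hypothetical. A socket file being present suggests a master connection is open, but it is not proof that the master is still healthy.

```python
import stat
from pathlib import Path


def control_sockets(cp_dir: str = "~/.ansible/cp") -> list:
    """Return the names of Unix socket files found in cp_dir."""
    directory = Path(cp_dir).expanduser()
    if not directory.is_dir():
        return []
    names = []
    for entry in sorted(directory.iterdir()):
        # lstat() inspects the entry itself, so a dangling symlink cannot raise.
        if stat.S_ISSOCK(entry.lstat().st_mode):
            names.append(entry.name)
    return names


for name in control_sockets():
    print(name)
```

Run it while a play is in progress; an empty result during a run is a hint that multiplexing is not taking effect.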

The payoff: connection reuse alone can cut a multi-task run’s wall-clock time by half or more on high-latency links, because you stop paying the handshake on every task. It is on by default — your job is mainly to keep ControlPath short and tune ControlPersist to your workflow.

Pipelining: the single biggest per-task win

Multiplexing reuses the connection; pipelining reduces the number of operations per task over that connection. This is the lever that surprises people with how much it helps.

What a task costs without pipelining. For each task, Ansible normally: (1) creates a temporary directory on the target via an SSH call, (2) copies the module file into it via SFTP/SCP (another transfer), (3) executes the module via SSH, then (4) removes the temp dir. That is several round-trips per task per host.

What pipelining does. With pipelining enabled, Ansible pipes the module’s Python code straight into the interpreter’s stdin over the already-open SSH session, executing it without first writing the module to a temp file on disk. It collapses those several round-trips into roughly one execution call. Combined with multiplexing, the per-task overhead drops dramatically — frequently a 2× or better speedup on connection-heavy playbooks, and the more tasks/hosts you have, the bigger the absolute saving.

Enable it in ansible.cfg:

[ssh_connection]
pipelining = True

Or per-invocation: ANSIBLE_PIPELINING=True ansible-playbook site.yml.

The requiretty caveat: read this before you enable it. Pipelining feeds the module to Python via stdin without allocating a pseudo-terminal (no -tt). On targets where sudo is configured with requiretty in /etc/sudoers (or a sudoers.d drop-in), commands run through sudo demand a TTY and will fail when pipelining is on, with errors like “sudo: sorry, you must have a tty to run sudo”. This is the one thing that breaks people who flip pipelining on across a fleet without checking. Two resolutions: switch requiretty off for the account Ansible connects as (a Defaults line carrying !requiretty, edited with visudo), or leave pipelining off for the affected hosts by setting ansible_pipelining: false in their host or group vars.

A second, smaller caveat: pipelining requires that the remote sudo/become can read from stdin, which the default sudo become plugin handles; some exotic become methods do not pipeline. In practice, on a modern fleet, pipelining = True is the first setting you should add to ansible.cfg — verify requiretty is off, then enjoy the win.
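Verifying that requiretty is off can be automated. Here is a minimal Python sketch that scans sudoers text for Defaults lines that switch requiretty on. It is a deliberately simple line scanner, not a full sudoers parser, the function name is hypothetical, and you still have to obtain the text yourself (reading /etc/sudoers normally needs root).

```python
import re

# Matches "requiretty", capturing a leading "!" when the option is negated.
_OPTION = re.compile(r"(!\s*)?\brequiretty\b")


def requiretty_lines(sudoers_text: str) -> list:
    """Return the Defaults lines that enable requiretty."""
    hits = []
    for raw in sudoers_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line.startswith("Defaults"):
            continue
        match = _OPTION.search(line)
        if match and match.group(1) is None:  # present and not negated
            hits.append(line)
    return hits


sample = """\
# Defaults requiretty   (commented out)
Defaults    requiretty
Defaults:deploy !requiretty
"""
print(requiretty_lines(sample))
```

The deploy user in the sample is made up for illustration. An empty result for every file under /etc/sudoers and /etc/sudoers.d is a reasonable signal that pipelining will not trip over sudo.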

| Lever | What it reuses/saves | Default | The catch |
| --- | --- | --- | --- |
| ControlMaster/ControlPersist | Reuses one SSH connection across tasks (and runs) | on (auto, 60s) | ControlPath 108-char limit → keep dir short |
| pipelining | Removes the per-task temp-file copy; ~1 round-trip/task | off | breaks under sudoers requiretty |
| forks | More hosts in parallel | 5 | bounded by control-node CPU/RAM/fds |

Fact gathering: stop paying for facts you do not use

Before a play’s first task, Ansible runs an implicit ansible.builtin.setup against every host to collect facts — OS family, network interfaces, memory, mounts, hardware, and more. On a single host that costs a fraction of a second. Across a large fleet, or on hosts where setup is slow (lots of mounts, slow lsblk, network probes), fact gathering can be a meaningful slice of total run time — and you pay it every play, every run, whether or not your tasks read a single fact.

The levers:

| Lever | Where | Effect |
| --- | --- | --- |
| gather_facts: false | Play keyword | Skip the implicit setup entirely for this play |
| gathering = smart / explicit / implicit | ansible.cfg [defaults] | Global policy for when facts are gathered |
| gather_subset: | Play module_defaults or the setup task | Collect only some fact subsets |
| gather_timeout: | Same | Per-fact-collection timeout (default 10s) |
| fact_caching (+ timeout) | ansible.cfg | Reuse facts across runs (next section) |

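gathering policy, set under [defaults] in ansible.cfg: implicit (the default) gathers facts at the start of every play unless the play sets gather_facts: false; explicit gathers only when a play sets gather_facts: true; smart gathers only for hosts whose facts are not already known in this run or present in the fact cache.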

Turning it off per play. If a play only runs a couple of commands and never touches ansible_facts, set gather_facts: false and save the entire setup cost:

- name: Quick service bounce (no facts needed)
  hosts: web
  gather_facts: false
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

If you discover mid-play that you do need a fact, gather on demand:

    - name: Gather just what I need
      ansible.builtin.setup:
        gather_subset:
          - "!all"
          - "!min"
          - network

gather_subset — collect only what you use. The setup module organises facts into subsets. You pass a list; prefix with ! to exclude. The meta-subsets are all (everything), min (a small mandatory core, always included unless you say !min), and the individual categories below.

| Subset | Covers (examples) |
| --- | --- |
| min | ansible_fqdn, ansible_distribution, basic identity (cheap, near-always wanted) |
| hardware | CPU, memory, devices, mounts; often the slowest (probes lsblk, /proc) |
| network | Interfaces, IPs, default route |
| virtual | Hypervisor/container detection |
| facter / ohai | Pull facts from Puppet’s facter / Chef’s ohai if installed |
| all | Everything (the default if you specify nothing) |

The big lever is excluding hardware: gather_subset: ["!hardware"] (or the tighter ["!all", "!min", "network"]) can noticeably speed gathering on fleets where you only need OS family and IPs. Set it globally with module_defaults so every play benefits:

- hosts: all
  module_defaults:
    ansible.builtin.setup:
      gather_subset:
        - "!hardware"

gather_timeout — the per-collection timeout, default 10 seconds. Raise it (gather_timeout: 30) when a host has many disks/mounts and hardware facts legitimately take longer than 10s and you see “timed out waiting for … facts”; otherwise leave it. It is set on the setup module (or via DEFAULT_GATHER_TIMEOUT).

Fact caching: gather once, reuse for hours

Skipping facts is great when you do not need them. Fact caching is for when you do need them but do not want to re-gather on every run. With caching on, the first run gathers facts and persists them; subsequent runs (within the cache’s lifetime) read facts from the cache instead of touching the host — and with gathering = smart, gathering is skipped entirely. On a 300-host fleet that you converge every 15 minutes, this turns “re-probe 300 hosts every time” into “probe once an hour.”

Caching is provided by cache plugins, selected with fact_caching in ansible.cfg (or ANSIBLE_CACHE_PLUGIN):

| Cache plugin | Backend | Best for | Key settings |
| --- | --- | --- | --- |
| memory | Process RAM (default) | Single run only; not persisted across runs | none |
| jsonfile | JSON files on the control node | Simple, single-controller, no extra services | fact_caching_connection = directory |
| yaml | YAML files on disk | Human-readable on-disk cache | fact_caching_connection = directory |
| redis | Redis server | Shared cache across many controllers/AWX nodes | fact_caching_connection = host:port:db (+ keyprefix) |
| memcached | memcached server | Shared, fast, ephemeral cross-controller cache | fact_caching_connection = host:port |
| pickle | Pickled files on disk | On-disk binary cache | fact_caching_connection = directory |

The default is memory, which caches facts only for the duration of a single ansible-playbook run — useful so a second play in the same run reuses the first play’s facts, but gone the moment the process exits. To get cross-run reuse you must choose a persistent plugin.

jsonfile — the zero-dependency choice:

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 7200

That writes one JSON file per host under the directory and treats cached facts as valid for 7200 seconds (2 hours); after that they are considered stale and re-gathered. fact_caching_timeout = 0 means never expire (cache forever until you delete it) — handy but dangerous, because a host that changes IPs will keep reporting the old one until you flush. There is no automatic invalidation on host change; the timeout is your only freshness guarantee, so pick it to match how often your fleet legitimately changes.
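To see which cached entries a given timeout would treat as stale, you can compare file ages yourself. This is a minimal Python sketch under one stated assumption: that the jsonfile plugin judges age by each cache file's modification time. The function name is hypothetical and the code only reads file metadata.

```python
import os
import time
from typing import Optional


def cache_entry_is_stale(path: str, timeout_seconds: int,
                         now: Optional[float] = None) -> bool:
    """True if the cache file at path is older than timeout_seconds.

    A timeout of 0 mirrors fact_caching_timeout = 0: never expire.
    Assumption: age is measured from the file's mtime.
    """
    if timeout_seconds == 0:
        return False
    current = time.time() if now is None else now
    return current - os.stat(path).st_mtime > timeout_seconds


cache_dir = "/tmp/ansible_fact_cache"   # the directory used in the config above
if os.path.isdir(cache_dir):
    for name in sorted(os.listdir(cache_dir)):
        stale = cache_entry_is_stale(os.path.join(cache_dir, name), 7200)
        print(name, "stale" if stale else "fresh")
```

Deleting a host's file from the directory is the manual way to force a fresh gather for that host before the timeout elapses.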

redis — the shared choice for multiple controllers or AWX/AAP:

[defaults]
gathering = smart
fact_caching = redis
fact_caching_connection = 127.0.0.1:6379:0
fact_caching_timeout = 3600
fact_caching_prefix = ansible_facts_

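Now every controller (or every AWX execution node) reads and writes the same fact cache, so a host gathered by one controller is instantly available to the next. memcached works the same way with host:port. In current Ansible both plugins ship in the community.general collection (community.general.redis and community.general.memcached), so that collection must be installed on the controller along with the matching Python client (redis / python-memcached).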

A subtle but important benefit: hostvars across the fleet without gathering. Because cached facts live outside any single play, a play that targets web can read hostvars['db01'].ansible_default_ipv4.address even though this run never connected to db01 — as long as db01’s facts are in the cache from an earlier run. That makes cross-host templating (load-balancer configs, /etc/hosts generation) both possible and fast. This is the same cacheable: true mechanism you saw with ansible.builtin.set_fact: cacheable set_facts are persisted into the same store.

A security note carried from the variables lesson: a jsonfile/yaml cache is plaintext on the control node. If facts ever include anything sensitive, protect the cache directory and set a sane timeout — see Security notes below.

async & poll: long-running and fire-and-forget tasks

By default Ansible runs a task synchronously: it starts the module on the host and blocks, holding the connection open, until the module returns. Two problems follow. First, a genuinely long task (a 20-minute package compile, a big database dump) can exceed the SSH/become timeout and the connection dies. Second, while one host runs a 10-minute task, that fork is tied up and cannot do useful work elsewhere.

async + poll solve both. async: N tells Ansible “this task may run up to N seconds; start it in the background on the target and let me check on it.” poll: M tells Ansible “check whether it finished every M seconds.”

- name: Long database backup (up to 30 min), checked every 15s
  ansible.builtin.command: /usr/local/bin/full_backup.sh
  async: 1800      # max runtime in seconds
  poll: 15         # check every 15s; Ansible blocks here until done or timeout

With poll > 0, Ansible still waits for the task on that host — but it survives long runtimes (no connection timeout, because it polls a status file rather than holding the original exec open) and you get a clean failure if it exceeds async. Use this for tasks that are long but that you need to complete before the next task.

The fire-and-forget pattern: poll: 0. Set poll: 0 and Ansible starts the task on every host and immediately moves on — it does not wait. This is the key to parallelising slow, independent work across a fleet: kick the long job off on all hosts at once, do other things, then come back and collect results with ansible.builtin.async_status using the job id the task registered.

- name: Kick off a long upgrade on every host, do NOT wait
  ansible.builtin.command: /usr/local/bin/upgrade.sh
  async: 3600        # allow up to 1 hour
  poll: 0            # fire and forget — returns immediately with a job id
  register: upgrade_job

# ... do other useful work here while upgrades run in the background ...

- name: Wait for all the upgrades to finish
  ansible.builtin.async_status:
    jid: "{{ upgrade_job.ansible_job_id }}"
  register: upgrade_result
  until: upgrade_result.finished      # poll until the job reports finished
  retries: 60                         # up to 60 attempts ...
  delay: 60                           # ... 60s apart = wait up to 1 hour

async_status reports finished (1/0), failed, rc, stdout, etc. — exactly as if the task had run synchronously, but you collected it on your schedule. The until/retries/delay loop is how you wait for completion.
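A common slip is a retries and delay pair that gives up before the async limit is reached. The check is plain arithmetic; here it is as a small Python sketch, with hypothetical function names, using the values from the playbook above.

```python
def wait_budget_seconds(retries: int, delay: int) -> int:
    """Approximate time the until loop keeps waiting before it gives up."""
    return retries * delay


def covers_async_limit(async_seconds: int, retries: int, delay: int) -> bool:
    """True if the polling loop waits at least as long as the job may run."""
    return wait_budget_seconds(retries, delay) >= async_seconds


# The upgrade example: async 3600, retries 60, delay 60.
print(wait_budget_seconds(60, 60))        # 3600
print(covers_async_limit(3600, 60, 60))   # True
print(covers_async_limit(3600, 30, 60))   # False: the loop gives up after 30 minutes
```

If the budget is shorter than async, the wait task can fail while the job is still running happily on the target.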

The poll: 0 cleanup gotcha. When poll: 0 is used, Ansible deliberately does not clean up the async job’s status file on the target afterwards (it cannot know when you are done with it). Waiting on the job with ansible.builtin.async_status does not remove that file either; once you have collected the result, call ansible.builtin.async_status again with the same jid and mode: cleanup to delete it. For truly one-shot fire-and-forget tasks you never poll (e.g. kick off something and genuinely walk away), the status file simply stays on the target until something removes it. Two more notes: a task with async must be a module that supports backgrounding (most command/shell/package/long-running modules do); and async + poll: 0 on a host that disconnects mid-job means you lose the result, so it suits idempotent, restartable work.

| Mode | Setting | Behaviour | Use for |
| --- | --- | --- | --- |
| Synchronous | (default, no async) | Block until task returns; hold connection | Normal short tasks |
| Async, polled | async: N, poll: M>0 | Background on host; poll status every M s; wait | Long tasks you must finish before continuing |
| Fire-and-forget | async: N, poll: 0 | Start on all hosts, return immediately; collect later via async_status | Slow independent work parallelised across the fleet |

The free strategy (and why default is linear)

The strategy decides how per-host task streams are scheduled. The default, linear, runs each task on all (up-to-forks) hosts and waits for every host to finish that task before starting the next — the play advances in lockstep, and the slowest host in each step sets the pace. That predictability is great for rolling updates and ordered changes, but it means one sluggish host stalls everyone.

The free strategy removes the per-task barrier: each host races through the play as fast as it individually can, never waiting for others. A fast host may be at task 20 while a slow one is still at task 5. On a heterogeneous fleet — mixed hardware, varying latency, some hosts with more work — free can dramatically cut total wall-clock time because no host idles waiting for a laggard.

- hosts: all
  strategy: free
  tasks:
    - ...

Or globally: [defaults] strategy = free.
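The difference between the two strategies is easy to see in a toy model. The Python sketch below uses invented per-task durations and assumes forks is at least the number of hosts and that the control node adds no overhead; the function names are hypothetical and this is not how Ansible schedules internally.

```python
def linear_wall_clock(durations: dict) -> int:
    """Lockstep: each task costs as much as its slowest host."""
    per_task = zip(*durations.values())
    return sum(max(step) for step in per_task)


def free_wall_clock(durations: dict) -> int:
    """No barrier: the run ends when the slowest host finishes its own play."""
    return max(sum(tasks) for tasks in durations.values())


# Hypothetical seconds per task for three hosts running the same 3 tasks.
fleet = {
    "fast":       [1, 1, 1],
    "slow_first": [5, 1, 1],
    "slow_last":  [1, 1, 5],
}
print(linear_wall_clock(fleet))  # 11
print(free_wall_clock(fleet))    # 7
```

Under linear the two slow steps add up because they happen on different tasks; under free they overlap, so only the slowest single host matters.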

When free hurts. Because hosts are out of step, free breaks anything that assumes order across hosts: you cannot rely on host A finishing task 3 before host B starts task 4. Handlers still flush at the end of the play per host, but cross-host coordination (e.g. “configure all DB replicas, then promote one”) is unsafe under free. And run_once/serial semantics are designed around linear. Rule: use free for independent, parallel-safe work where speed matters; keep linear for ordered or coordinated rollouts.

| Strategy | Scheduling | Best for | Avoid when |
| --- | --- | --- | --- |
| linear (default) | Lockstep: all hosts finish task N before task N+1 | Ordered changes, rolling updates, debugging | A few slow hosts stall a big fleet |
| free | Each host runs the play as fast as it can | Heterogeneous fleets, independent work, max throughput | Cross-host ordering matters |
| host_pinned | Like free, but pins hosts to workers; a host completes the play before a new one starts | Keeping per-host work on one worker (resource locality) | You need strict task-level lockstep |
| debug | Linear, but drops into the interactive debugger on failure | Step-through debugging | Production/automation |

(The previous lesson covers serial/throttle/order and the rolling-update pattern in depth; here the point is simply that free is a speed tool.)

Reducing the work itself: loops vs many tasks

The fastest task is the one you do not run. Beyond the connection layer, you can cut real work in the play:

Profiling: measure before you tune

Do not guess where the time goes — measure. Ansible ships callback plugins that print timing, and turning them on is a one-line change. The key one is profile_tasks, which prints the wall-clock time of every task and a sorted “slowest tasks” summary at the end.

Enable callbacks in ansible.cfg:

[defaults]
callbacks_enabled = profile_tasks, profile_roles, timer

(Older docs call this callback_whitelist; modern ansible-core uses callbacks_enabled. The env var is ANSIBLE_CALLBACKS_ENABLED.)

| Callback | What it reports |
| --- | --- |
| timer | Total playbook wall-clock time at the end (the headline number) |
| profile_tasks | Per-task duration during the run and a sorted top-N slowest-tasks table at the end |
| profile_roles | Time aggregated per role, showing which role is the hog |
| cgroup_perf_recap | CPU/memory/PID usage per task via cgroups (resource profiling, needs setup) |

A typical profile_tasks tail looks like:

=============================================================
Gathering Facts -------------------------------------- 8.42s
install packages ------------------------------------- 5.10s
render nginx.conf ------------------------------------ 0.31s
...
Playbook run took 0 days, 0 hours, 0 minutes, 23 seconds

That immediately tells you the truth most people get wrong by intuition: “Gathering Facts” is frequently the single most expensive line. If it is, the fix is gather_subset/caching, not more forks. If a command task dominates, the fix is on the target, not in Ansible. Always profile first, change one lever, profile again — the lab below does exactly this so you see the numbers move.
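If you save the run output to a file, the summary can be ranked programmatically. This is a minimal Python sketch that assumes the row layout shown in the sample above (task name, a run of dashes, seconds with an s suffix); the function name is hypothetical and real output may need a looser pattern.

```python
import re

# One summary row: "<task name> -------- <seconds>s"
_ROW = re.compile(r"^(?P<name>.+?)\s+-{3,}\s+(?P<seconds>\d+(?:\.\d+)?)s\s*$")


def slowest_tasks(output: str, top: int = 3) -> list:
    """Return up to `top` (task name, seconds) pairs, slowest first."""
    rows = []
    for line in output.splitlines():
        match = _ROW.match(line)
        if match:
            rows.append((match.group("name"), float(match.group("seconds"))))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


sample = """\
=============================================================
Gathering Facts -------------------------------------- 8.42s
install packages ------------------------------------- 5.10s
render nginx.conf ------------------------------------ 0.31s
Playbook run took 0 days, 0 hours, 0 minutes, 23 seconds
"""
for name, seconds in slowest_tasks(sample):
    print(f"{seconds:6.2f}s  {name}")
```

Keeping the ranked list from each run makes it easy to confirm that a change moved the task you meant to speed up.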

For a one-off run without editing config: ANSIBLE_CALLBACKS_ENABLED=profile_tasks,timer ansible-playbook site.yml.

Mitogen: the strategy plugin that can halve your runtime

Mitogen for Ansible is a third-party strategy plugin that replaces Ansible’s default execution model with a far more efficient one. Vanilla Ansible, even with pipelining, still forks-and-execs a fresh Python interpreter for many operations and shuttles data over SSH for each. Mitogen instead bootstraps a single long-lived Python process on each target and runs modules inside it as in-process function calls, reusing the interpreter and connection aggressively and routing everything over one persistent channel. The result on connection/CPU-bound playbooks is commonly a 1.5×–7× wall-clock improvement, with markedly less CPU on the control node.

Install and enable it:

python3 -m pip install mitogen        # provides the ansible_mitogen package

# ansible.cfg
[defaults]
strategy_plugins = /path/to/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

strategy_plugins points Ansible at Mitogen’s plugin directory (find it with python3 -c "import ansible_mitogen, os; print(os.path.dirname(ansible_mitogen.__file__))" then /plugins/strategy). The strategies it adds are mitogen_linear (the lockstep default — start here), mitogen_free, and mitogen_host_pinned, mirroring the built-ins. You can also set it per-invocation with ANSIBLE_STRATEGY=mitogen_linear.
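The path lookup can also be done without importing the package, which avoids a traceback when Mitogen is not installed. A minimal Python sketch, with a hypothetical function name; it only builds the expected path from the package location and does not verify that the directory exists.

```python
import importlib.util
import os
from typing import Optional


def strategy_plugin_dir(package: str = "ansible_mitogen") -> Optional[str]:
    """Return <package dir>/plugins/strategy, or None if the package is absent."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return os.path.join(os.path.dirname(spec.origin), "plugins", "strategy")


path = strategy_plugin_dir()
print(path if path else "ansible_mitogen is not installed in this interpreter")
```

Run it with the same Python interpreter that runs ansible-playbook, otherwise the path may point into a different environment.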

Trade-offs and limits — important:

When it works, Mitogen is the largest single lever after pipelining. When it does not, you fall back to a well-tuned stock setup — which is why the order of optimisation is: pipelining + multiplexing → forks → fact caching/subset → profiling-guided fixes → then consider Mitogen.

Connection plugins: ssh vs paramiko vs smart vs local

Underneath every remote task is a connection plugin — the transport. Picking the right one matters for both speed and capability.

| Connection plugin | Transport | Multiplexing | Pipelining | When to use |
| --- | --- | --- | --- | --- |
| ssh | Native OpenSSH binary (/usr/bin/ssh) | Yes (ControlMaster) | Yes | The default and the fast path; use it for Linux/Unix at scale |
| paramiko | Pure-Python SSH library | No (no ControlPersist) | Limited | Fallback when no ssh binary / for --ask-pass edge cases; slower |
| smart | Picks ssh if it supports ControlPersist, else paramiko | inherits | inherits | Legacy auto-detect; today effectively ssh everywhere |
| local | Runs on the control node, no SSH | n/a | n/a | localhost, delegate_to: localhost, connection: local |
| winrm / psrp / ssh (Windows) | WinRM / PowerShell Remoting / SSH | n/a | n/a | Windows targets |

Set it with transport = ssh under [defaults] in ansible.cfg, the connection play keyword, the ansible_connection inventory var, or -c ssh on the command line.

The practical guidance: use ssh (the native binary). It is the only plugin that gives you OpenSSH multiplexing and pipelining — the two biggest levers in this lesson. paramiko exists for environments without an ssh binary or where you need pure-Python password auth without sshpass, but it cannot multiplex and is slower; only fall back to it deliberately. smart (a historical default) auto-selects between them and, on any modern system, resolves to ssh — so you lose nothing by setting ssh explicitly. Use local for the control node itself.

[Figure: Ansible performance tuning overview. Forks set parallelism across hosts; SSH ControlMaster/ControlPersist multiplexing reuses one connection while pipelining collapses per-task round-trips; the setup module gathers facts that fact caching (jsonfile/redis/memcached) persists for reuse; async/poll backgrounds long tasks and async_status collects them; the linear and free strategies schedule per-host streams; profile_tasks/timer callbacks measure where time goes; the Mitogen strategy plugin runs modules in a persistent remote interpreter.]

The diagram traces a single run through every lever — from forks deciding how many hosts go at once, through the connection-layer wins (multiplexing + pipelining), fact gathering and caching, async backgrounding, strategy scheduling, profiling, and finally Mitogen’s in-process model — so you can see exactly where each setting bites.

Hands-on lab: a before/after benchmark

We will build a small, deliberately connection-heavy playbook, run it with stock defaults, then turn on the big levers and re-run, watching the wall-clock time drop. This costs ₹0 — it runs against local containers (or VMs). All commands use FQCN.

1. Set up a few free targets

Spin up three lightweight containers as SSH targets (Podman or Docker; adjust the image to taste). If you already have a few VMs or localhost plus containers, use those instead.

mkdir -p ~/ansible-perf-lab && cd ~/ansible-perf-lab

for n in 1 2 3; do
  podman run -d --name perf$n -p 220$n:22 \
    docker.io/rastasheep/ubuntu-sshd:18.04 >/dev/null 2>&1 \
  || docker run -d --name perf$n -p 220$n:22 \
    docker.io/rastasheep/ubuntu-sshd:18.04
done

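That image listens on SSH with root / root. Ansible also needs a Python interpreter on each target, and ansible-core 2.17+ requires Python 3.7 or newer there, whereas Ubuntu 18.04 packages Python 3.6; if modules fail with an interpreter error, swap in an SSH-enabled image based on a newer release. Create an inventory: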

cat > inventory.ini <<'EOF'
[perf]
perf1 ansible_host=127.0.0.1 ansible_port=2201
perf2 ansible_host=127.0.0.1 ansible_port=2202
perf3 ansible_host=127.0.0.1 ansible_port=2203

[perf:vars]
ansible_user=root
ansible_ssh_pass=root
ansible_ssh_common_args=-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
EOF

(sshpass is needed for ansible_ssh_pass; install it via your package manager, or swap to key auth.)

2. A connection-heavy playbook

The point is many small tasks so connection overhead dominates — exactly where tuning shows.

cat > bench.yml <<'EOF'
---
- name: Connection-heavy benchmark
  hosts: perf
  gather_facts: true
  tasks:
    - name: Touch a series of files (lots of small round-trips)
      ansible.builtin.file:
        path: "/tmp/perf_{{ item }}"
        state: touch
        mode: "0644"
      loop: "{{ range(1, 21) | list }}"

    - name: A few command checks
      ansible.builtin.command: "echo check {{ item }}"
      changed_when: false
      loop: "{{ range(1, 6) | list }}"
EOF

3. Run #1 — stock defaults, with profiling

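Use a config that turns on only profiling so we get honest numbers, with everything else at defaults (pipelining off, forks 5, multiplexing on with ControlPersist=60s). The profile_tasks and timer callbacks ship in the ansible.posix collection, which the Ansible community package includes: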

cat > ansible.cfg <<'EOF'
[defaults]
inventory = inventory.ini
host_key_checking = False
callbacks_enabled = profile_tasks, timer
EOF

ANSIBLE_PIPELINING=False ansible-playbook bench.yml

Note the Playbook run took ... line and the profile_tasks table — especially how much Gathering Facts and the looped file/command tasks cost. This is your baseline.

4. Run #2 — turn on the levers

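Now enable pipelining, raise forks, extend the multiplexing master's persistence to 120s (multiplexing itself is already on by default), turn on smart gathering with a jsonfile fact cache, and switch to the free strategy: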

cat > ansible.cfg <<'EOF'
[defaults]
inventory = inventory.ini
host_key_checking = False
forks = 25
strategy = free
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache
fact_caching_timeout = 7200
callbacks_enabled = profile_tasks, timer

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=120s
control_path_dir = ~/.ansible/cp
EOF

ansible-playbook bench.yml

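Compare the new Playbook run took ... line to the baseline. On a connection-heavy run you should see a clear drop: pipelining removes per-task temp-file copies, and free lets the three hosts finish independently. With only three hosts the higher forks value changes nothing here (the default of 5 already covers them), and multiplexing was already active in the baseline, so expect most of the gain to come from pipelining.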
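To put a number on the difference between the two runs rather than eyeballing two log lines, a small shell helper can turn each timer line into seconds and compute the saving. This is a minimal sketch: the function names timer_to_seconds and percent_saved are made up for illustration, the sample timings are invented rather than measured, and it assumes the timer callback prints its summary in the form "Playbook run took D days, H hours, M minutes, S seconds".

```shell
# Hypothetical helpers (not part of Ansible) for comparing two runs.

# Convert a timer summary line into whole seconds.
# Assumed input shape: "Playbook run took 0 days, 0 hours, 1 minutes, 23 seconds"
timer_to_seconds() {
  # Extract the four numbers in order: days hours minutes seconds.
  set -- $(printf '%s\n' "$1" | sed -n \
    's/.*took \([0-9][0-9]*\) days, \([0-9][0-9]*\) hours, \([0-9][0-9]*\) minutes, \([0-9][0-9]*\) seconds.*/\1 \2 \3 \4/p')
  [ "$#" -eq 4 ] || return 1   # line did not match the assumed shape
  echo $(( $1 * 86400 + $2 * 3600 + $3 * 60 + $4 ))
}

# Percentage of wall-clock time saved, rounded down.
# Usage: percent_saved BEFORE_SECONDS AFTER_SECONDS
percent_saved() {
  [ "$1" -gt 0 ] || return 1   # avoid dividing by zero
  echo $(( ($1 - $2) * 100 / $1 ))
}

# Example with invented timings (substitute the lines from your own runs).
before=$(timer_to_seconds "Playbook run took 0 days, 0 hours, 1 minutes, 23 seconds")
after=$(timer_to_seconds "Playbook run took 0 days, 0 hours, 0 minutes, 31 seconds")
echo "before=${before}s after=${after}s saved=$(percent_saved "$before" "$after")%"
```

In the lab you might capture each run with ansible-playbook bench.yml | tee run1.log and pass the output of grep 'Playbook run took' run1.log to timer_to_seconds. Whole-second resolution is coarse for a three-host lab, so treat small differences as noise.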

5. Run #3 — prove fact caching

Run a third time immediately. Because gathering = smart + jsonfile caching is on and the facts are fresh (within 7200s), fact gathering is skipped entirely:

ansible-playbook bench.yml
ls -la /tmp/ansible_fact_cache/      # one JSON file per host = cached facts

In the profile_tasks output, Gathering Facts should now be near-instant (or absent), shaving the setup cost off every subsequent run. That is the production win for fleets you converge repeatedly.
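One way to predict which hosts will skip gathering on the next run is to look at the age of their cache files. The sketch below lists cache files younger than a given timeout. It assumes the jsonfile plugin judges freshness by each file's modification time and that your find supports -mmin (the GNU and BSD versions do); fresh_cache_files is a hypothetical helper name, not an Ansible command.

```shell
# Hypothetical helper: list cache files modified within the last TIMEOUT seconds.
# Usage: fresh_cache_files DIR TIMEOUT_SECONDS
fresh_cache_files() {
  # find's -mmin works in minutes, so convert seconds to minutes (rounding down).
  find "$1" -type f -mmin "-$(( $2 / 60 ))"
}

# With the lab's settings the call would be:
#   fresh_cache_files /tmp/ansible_fact_cache 7200
```

Any host whose file is missing from the output should show a real Gathering Facts cost on its next run.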

6. (Optional) Confirm multiplexing and try async

While a run is in flight (sockets also linger for the ControlPersist window afterwards), list the live control sockets, then add a fire-and-forget task:

ls -la ~/.ansible/cp/    # live sockets during a run = multiplexing working
cat >> bench.yml <<'EOF'

    - name: Fire-and-forget a slow job on every host
      ansible.builtin.command: "sleep 10"
      async: 60
      poll: 0
      register: slow

    - name: Collect the slow jobs
      ansible.builtin.async_status:
        jid: "{{ slow.ansible_job_id }}"
      register: slow_done
      until: slow_done.finished
      retries: 30
      delay: 2
EOF

ansible-playbook bench.yml

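The sleep 10 starts on all three hosts and returns immediately, so the play is free to move on while it runs in the background; the collect task then waits for it. Here the two tasks are adjacent, so the added time is still about 10s. The payoff shows when you place other tasks between them, because that work overlaps with the sleep instead of queuing behind it.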
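If the socket listing earlier in this step comes back empty during a run, path length is one thing worth ruling out. The helper below is a rough sketch that checks whether a candidate socket path fits under a limit. socket_path_ok is a hypothetical name, 108 is the Linux figure used in this lesson (other platforms may allow less), and the sample path and socket file name are invented.

```shell
# Hypothetical helper: succeed if a path is short enough to be a Unix socket path.
# Usage: socket_path_ok PATH [LIMIT]   (LIMIT defaults to 108, the Linux figure)
socket_path_ok() {
  limit=${2:-108}
  # ${#1} is the length of the first argument in characters.
  [ "${#1}" -lt "$limit" ]
}

# Example: an invented home directory plus an invented socket file name.
candidate="/home/deploy/.ansible/cp/0123456789"
if socket_path_ok "$candidate"; then
  echo "ok: ${#candidate} characters"
else
  echo "too long: ${#candidate} characters"
fi
```

Leave some headroom below the limit rather than aiming for the exact boundary, since deep home directories and long host names eat into it quickly.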

Validation

Cleanup

for n in 1 2 3; do podman rm -f perf$n 2>/dev/null || docker rm -f perf$n; done
cd ~ && rm -rf ~/ansible-perf-lab /tmp/ansible_fact_cache ~/.ansible/cp

Cost note

₹0. Everything runs in local containers (or VMs) on your own machine — no cloud resources, no managed nodes billed.

Common mistakes & troubleshooting

Symptom | Likely cause | Fix
Big fleet still slow after raising forks | serial is the binding cap, or downstream chokepoint (one DB, delegate_to) | Effective parallelism is min(forks, serial, hosts); raise/remove serial; remove the chokepoint
Control node thrashes / OOM / “too many open files” at high forks | Forks exceed control-node CPU/RAM/fd limits | Lower forks; raise ulimit -n; size from control-node capacity, not host count
sudo: you must have a tty to run sudo after enabling pipelining | Targets have Defaults requiretty in sudoers | Remove/scope requiretty, or leave pipelining off for those hosts
unix_listener: path too long / multiplexing silently off | ControlPath exceeds the 108-char socket limit | Set a short control_path_dir (e.g. ~/.ansible/cp)
Facts re-gathered on every run despite caching | Using default memory cache, or gathering = implicit | Set a persistent plugin (jsonfile/redis) and gathering = smart
Cached facts are stale (old IP/hostname) | fact_caching_timeout too high or 0 (never expire); no auto-invalidation | Lower the timeout; delete the cache dir to force a refresh
Long task fails with a connection/timeout error | Synchronous task exceeded SSH/become timeout | Wrap it in async: N with poll (or poll: 0 + async_status)
free rollout breaks ordering / coordination | free removes the per-task barrier between hosts | Use linear for ordered/coordinated work; free only for independent tasks
Mitogen fails with obscure errors | Mitogen version doesn’t support your ansible-core | Match versions per Mitogen release notes; test full playbook; else fall back to stock
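The rule in the table's first row, min(forks, serial, hosts), is easy to sanity-check with a few lines of shell. This only illustrates the arithmetic and is not something Ansible exposes; the function name effective_parallelism is hypothetical and it expects three positive integers.

```shell
# Hypothetical helper: the most hosts that can run a task at once.
# Usage: effective_parallelism FORKS SERIAL HOSTS
effective_parallelism() {
  min=$1
  [ "$2" -lt "$min" ] && min=$2
  [ "$3" -lt "$min" ] && min=$3
  echo "$min"
}

effective_parallelism 50 10 200    # serial is the binding cap
effective_parallelism 100 10 300   # raising forks alone changes nothing
effective_parallelism 5 50 300     # here the default forks value is the cap
```

Running the numbers before touching forks tells you which of the three values is actually holding the play back.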

Best practices

Security notes

Interview & exam questions

  1. What does forks control, and what is its default? The maximum number of hosts Ansible communicates with simultaneously; default 5. Under linear it is the batch size per task; it is the concurrency ceiling under any strategy.

  2. A colleague set forks = 100 for a rolling update but sees no speedup. Why? serial is almost certainly capping the batch. Effective parallelism is min(forks, serial, hosts-remaining); with serial: 10 you get at most 10 hosts at once regardless of forks.

  3. Explain pipelining and the one thing that breaks it. Pipelining pipes the module’s Python straight into the remote interpreter over the open SSH session instead of copying it to a temp file first, collapsing several round-trips into ~one — a major per-task speedup. It breaks when sudoers has requiretty, because pipelining allocates no TTY and sudo then refuses; the fix is to remove/scope requiretty (or leave pipelining off there).

  4. What is SSH multiplexing and how does Ansible use it? OpenSSH ControlMaster opens one connection that subsequent sessions reuse, and ControlPersist keeps it warm for N seconds after the last use (even across runs). Ansible’s ssh plugin enables ControlMaster=auto + ControlPersist=60s by default, eliminating the handshake on every task. The gotcha is the ControlPath 108-char socket limit — keep control_path_dir short.

  5. Default fact gathering is costing you on a 300-host fleet. List the levers. gather_facts: false where facts are unused; gather_subset (e.g. !hardware) to collect less; gather_timeout if collection legitimately exceeds 10s; and gathering = smart + a persistent fact cache (jsonfile/redis) so facts are gathered once and reused.

  6. What does gathering = smart do, and why pair it with caching? It gathers facts for a host only if they are not already cached. With a persistent cache, the first run gathers and stores; later runs within the cache window skip gathering entirely — turning per-run probing into occasional probing.

  7. Compare the memory, jsonfile, and redis cache plugins. memory (default) caches only within a single run — not persisted. jsonfile writes one JSON file per host to disk on the controller — simple, single-controller. redis (and memcached) is a shared network cache so many controllers/AWX nodes share one fact store. Set freshness with fact_caching_timeout.

  8. You must run a 30-minute backup that exceeds the SSH timeout. How? Wrap it in async with a ceiling comfortably above the expected runtime (for example async: 3600) and poll: 15 (or another interval): Ansible backgrounds it on the host and polls a status file rather than holding the exec open, so the task survives the long runtime and fails cleanly if it runs past the async limit.

  9. What is the fire-and-forget pattern and which module collects the result? async: N with poll: 0 starts the task on every host and returns immediately without waiting; you collect results later with ansible.builtin.async_status using the registered ansible_job_id, typically in an until: result.finished / retries / delay loop. It parallelises slow, independent work across the fleet.

  10. linear vs free strategy — when each? linear (default) keeps hosts in lockstep (all finish task N before task N+1) — use it for ordered/coordinated rollouts; the slowest host paces each step. free lets each host run the play as fast as it can — use it for independent work on heterogeneous fleets to cut wall-clock time, but never when cross-host ordering matters.

  11. How do you find the real bottleneck before tuning? Enable the profile_tasks and timer callbacks (callbacks_enabled = profile_tasks, timer, or ANSIBLE_CALLBACKS_ENABLED). timer gives total wall-clock; profile_tasks gives per-task timing and a sorted slowest-tasks table — which very often reveals that “Gathering Facts” is the most expensive line, pointing you at subset/caching rather than forks.

  12. What is Mitogen and what are its risks? A third-party strategy plugin (mitogen_linear/_free/_host_pinned) that runs modules in a persistent in-process interpreter per host with one reused channel, commonly 1.5×–7× faster with less control-node CPU. Risks: it lags ansible-core compatibility, subtly changes execution semantics (test your full playbook), and does not support every connection/become combination — treat it as a validated optimisation, not a default.

  13. Which connection plugin gives you both multiplexing and pipelining, and what is the alternative for? The native ssh plugin (the default fast path). paramiko is a pure-Python fallback for environments without an ssh binary or for certain password-auth cases — it cannot multiplex and is slower. smart auto-selects and resolves to ssh on modern systems.

Quick check

  1. Your play targets 200 hosts with forks = 50 in ansible.cfg and serial: 10 on the play. How many hosts run a given task at once, at most?
  2. True or false: pipelining is on by default in ansible-core.
  3. You enable pipelining and sudo tasks start failing with a TTY error. What sudoers setting is the culprit?
  4. Which gathering value gathers facts only when they are not already cached?
  5. You want a long task to start on every host and not block the play, collecting results later. Which two settings do you use, and which module reaps the result?

Answers

  1. 10. Effective parallelism is min(forks, serial, hosts) = min(50, 10, 200) = 10; serial is the binding cap.
  2. False. Multiplexing (ControlMaster) is on by default, but pipelining is off by default — you enable it in ansible.cfg.
  3. Defaults requiretty in /etc/sudoers. Pipelining allocates no TTY, so sudo with requiretty refuses; remove or scope it.
  4. smart (gathering = smart) — gather only if not cached; pair it with a persistent fact cache for the full benefit.
  5. async: N with poll: 0 (fire-and-forget), then ansible.builtin.async_status (with until: result.finished / retries / delay) to collect the result.

Exercise

Take a real (or sample) playbook of yours that targets at least three hosts and do a measured tuning pass:

  1. Baseline. Add callbacks_enabled = profile_tasks, timer only, run it, and record the total time and the top three slowest tasks. Note specifically what “Gathering Facts” costs.
  2. Connection layer. Enable pipelining = True (confirm requiretty is off), set ssh_args with ControlPersist=120s and a short control_path_dir, and raise forks to a sensible value for your control node. Re-run and compare.
  3. Facts. Switch to gathering = smart with a jsonfile cache (fact_caching_timeout = 3600); add gather_subset: ["!hardware"] via module_defaults. Run twice and confirm the second run skips gathering.
  4. Async. Convert your longest task to async/poll: 0 + ansible.builtin.async_status, and verify the play no longer blocks on it.
  5. Strategy. If your tasks are order-independent, try strategy: free and compare wall-clock against linear.
  6. Write up the before/after timer numbers and which lever moved the needle most. (Optional stretch: install Mitogen, run under mitogen_linear, and compare — but only if its version matches your ansible-core.)

The goal is the discipline: profile → change one lever → profile again, and end with numbers, not vibes.

Certification mapping

Glossary

Next steps

You can now make Ansible fast: parallelise with forks, eliminate connection cost with multiplexing + pipelining, stop re-gathering with gathering = smart + a fact cache, background long work with async/poll and ansible.builtin.async_status, choose free for independent work, and — crucially — profile with profile_tasks/timer so every change is measured, with Mitogen as the heavy lever once the basics are exhausted. The next lesson, Dynamic Inventory for AWS, Azure & Secrets, takes you from a fast static fleet to a fast dynamic one — generating inventory from cloud APIs so the hosts you tune here are discovered automatically. To revisit the control-side companions of these levers — serial, throttle, order, and the linear/free/host_pinned strategies used for coordination rather than raw speed — return to Ansible Delegation, Strategies & Rolling Updates, In Depth. And to refresh where the facts you are caching actually come from, see Ansible Variables & Facts, In Depth.
