Ansible Lesson 42 of 42

Ansible × Observability Capstone, In Depth: Prometheus, Grafana, Loki, OpenTelemetry, AAP Metrics & the Closed Automation Feedback Loop

This is the final lesson in the Tier 5 wave, and it deliberately closes the loop. Across nine deep-dives we’ve built up an automation platform that is compliant (D1), disaster-resilient (D2), capable of bulk migrations (D3), able to operate in air-gapped enclaves (D4), integrated with complex stacks like SAP (D5), capable of fleet operation at edge scale (D6), governed by ITSM (D7), backed up by tested immutable backups (D8), and able to migrate databases without downtime (D9). All of that is meaningless without one final ingredient: the system has to know whether it is healthy.

A platform that runs correctly 99.97% of the time but cannot tell you which 0.03% failed is not a platform — it is a black box that occasionally surprises everyone. The thesis of this lesson is that automation observability is not a “nice to have” added later; it is the foundation that makes everything else trustworthy at scale. When your CHG-gated, evidence-bundled, ServiceNow-tracked, SLA-verified automation has 50,000 runs per quarter, the only way to know it is working is metrics, logs, and traces that aggregate into a single view answerable in seconds.

The four pillars of automation observability:

  1. Metrics — counters, gauges, histograms about playbook runs, AAP control plane, hosts, and ITSM/CHG flow
  2. Logs — structured stdout/stderr from every play, every task, indexed and queryable
  3. Traces — distributed traces through multi-step orchestrations (workflow → job → host → task)
  4. Events — discrete state-change events (CHG opened, job launched, EDA rule fired) correlated with the above

The toolchain we will assemble:

| Pillar | Tool | Why this choice |
| --- | --- | --- |
| Metrics ingestion | Prometheus + AAP /api/v2/metrics/ | AAP exposes a Prometheus endpoint natively |
| Metrics storage | Mimir (or Cortex/Thanos) | Long-term, multi-tenant, queryable |
| Logs | Loki + promtail / Vector | Aligned with Grafana stack; cheap; label-based |
| Traces | Tempo + OpenTelemetry | An OpenTelemetry callback plugin for Ansible already exists (community.general) |
| Visualization | Grafana | Single pane of glass across metrics, logs, traces |
| Alerting | Alertmanager + Grafana Alerting | Routes to Slack/Teams/PagerDuty/ServiceNow |
| Closed loop | EDA rulebooks subscribed to alerts | Alerts auto-trigger remediation playbooks |

This is a widely used open-source observability stack built around CNCF projects such as Prometheus and OpenTelemetry, with the caveat that you can substitute Datadog, New Relic, Splunk, or Elastic at the storage layer without changing the patterns in this lesson. The instrumentation contract (what to emit) is the durable part; the storage choice is replaceable.


1. The four golden signals, applied to automation

Google’s SRE book defines four golden signals for any service: latency, traffic, errors, saturation. Translated to an automation platform:

| Signal | What it means for AAP/Ansible | What you measure |
| --- | --- | --- |
| Latency | How long playbooks take | p50/p95/p99 job duration; per-template, per-host |
| Traffic | How many jobs run | jobs/hour, jobs/template, jobs/inventory |
| Errors | How many fail | failure rate, error class breakdown, time-to-failure |
| Saturation | How busy the control plane is | execution-environment queue depth, capacity utilisation |

These four signals at the platform level give you the macro view. But automation has a fifth signal that conventional services don’t: convergence. Did the automation actually achieve its desired state, or did it merely “complete”?

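A playbook that “succeeds” but changes nothing because it was misconfigured is a successful run with a wrong outcome. Convergence means checking not just job.status == 'successful' but also that desired_state == observed_state after the run. job.changed_count is the supporting signal: changed_count == 0 on a host that is still out of state is exactly the silent failure described here, while changed_count == 0 on a host that was already correct is just idempotence. We’ll wire this in.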
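A minimal sketch of that convergence check, assuming a post-run verification step already produces an observed state per host. The HostRun record, its field names, and the category names are hypothetical. The sketch also treats a run with no changes as converged when the observed state already matches, which is an assumption about how idempotent runs should count.

```python
from dataclasses import dataclass, field


@dataclass
class HostRun:
    """Outcome of one host in one job. All field names are illustrative."""
    host: str
    job_status: str                      # e.g. "successful" or "failed"
    changed_count: int                   # tasks that reported "changed"
    desired_state: dict = field(default_factory=dict)
    observed_state: dict = field(default_factory=dict)


def classify(run: HostRun) -> str:
    """Separate 'the job completed' from 'the host converged'."""
    if run.job_status != "successful":
        return "failed"
    if run.desired_state == run.observed_state:
        return "converged"
    # The job succeeded, yet the host is not in the desired state.
    if run.changed_count == 0:
        return "silent_noop"   # changed nothing and is still wrong
    return "partial"           # changed something, but not enough


def convergence_ratio(runs) -> float:
    """Share of host runs that ended in the desired state."""
    if not runs:
        return 0.0
    converged = sum(1 for run in runs if classify(run) == "converged")
    return converged / len(runs)
```

Emitting convergence_ratio as a gauge next to the plain job success rate makes the gap between "completed" and "converged" visible on the same panel.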


2. The Ansible callback plugin: the foundation of instrumentation

Every metric, log, and trace in this lesson originates from the same place: a callback plugin that fires on every Ansible event. An OpenTelemetry callback plugin ships in the community.general collection (profile_tasks comes from ansible.posix):

# ansible.cfg or AAP execution environment env
[defaults]
callbacks_enabled = community.general.opentelemetry, ansible.posix.profile_tasks

[callback_opentelemetry]
otel_service_name = ansible-aap
# Names an environment variable that must be set to "true" for the plugin to activate
# (the variable name here is our choice)
enable_from_environment = ANSIBLE_OPENTELEMETRY_ENABLED
hide_task_arguments = true

# Environment variables (set in execution environment)
ANSIBLE_OPENTELEMETRY_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.kv.local:4318
OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer <token>
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
OTEL_SERVICE_NAME=ansible-aap

When this is enabled, every Ansible run produces a complete OpenTelemetry trace with this hierarchy:

playbook span (root, named after the playbook)
  └── play span (one per play)
       └── task span (one per task per host)
            ├── attributes: ansible.task.module=template, host=foo, status=ok/changed/failed
            └── events: stderr lines as span events

An nginx_install.yml playbook with 12 tasks running across 8 hosts will produce roughly 1 + 1 + 12*8 = 98 spans, all linked. In Tempo this is queryable as “show me all task failures for nginx_install in the last 7 days,” and you get back exact module name, host, task, exception text, and parent context.
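The span arithmetic above can be turned into a quick estimator for capacity planning. This is a rough sketch that assumes the hierarchy described here (one playbook span, one span per play, one span per task per host); the function name is hypothetical, and real counts vary with skipped hosts, loops, and includes.

```python
def estimate_spans(plays: int, tasks_per_play: int, hosts: int) -> int:
    """1 playbook span + one span per play + one span per task per host."""
    return 1 + plays + plays * tasks_per_play * hosts


# The nginx_install.yml example: 1 play, 12 tasks, 8 hosts.
print(estimate_spans(plays=1, tasks_per_play=12, hosts=8))  # 98
```

Running the same estimate for a large workflow before enabling tracing gives a first-order idea of how much sampling will be needed.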

The tradeoff is volume. A workflow that orchestrates 200 playbooks across 5,000 hosts can produce hundreds of thousands to millions of spans per run, since every task on every host is its own span. Sample aggressively in production:

# OTel collector config
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces-always
        type: latency
        latency: { threshold_ms: 30000 }
      - name: sample-others
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

This keeps every error trace and every slow trace, and samples 5% of the remaining traces. When errors and slow runs are rare, storage cost drops by close to 95% with effectively zero loss of debugging value.
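The decision logic of those three policies can be sketched in a few lines. This is an illustration of the policy semantics, not the collector's implementation: the TraceSummary record and function names are hypothetical, and a random draw stands in for whatever deterministic selection the collector applies.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What the sampler knows once a trace is complete (illustrative fields)."""
    trace_id: str
    has_error: bool
    duration_ms: float


def keep_trace(trace, slow_threshold_ms=30_000, sample_pct=5.0, draw=random.random):
    """A trace is kept if ANY policy matches: error, slow, or the random sample."""
    if trace.has_error:
        return True
    if trace.duration_ms > slow_threshold_ms:
        return True
    return draw() * 100.0 < sample_pct


def expected_keep_fraction(error_share, slow_share, sample_pct=5.0):
    """Expected share of traces stored, assuming error and slow traces do not overlap."""
    always_kept = error_share + slow_share
    return always_kept + (1.0 - always_kept) * sample_pct / 100.0
```

With, say, 2% error traces and 1% slow traces, expected_keep_fraction(0.02, 0.01) comes out just under 8%, so the saving is somewhat less than the headline sampling percentage suggests.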

2.1 Custom callback for non-OTel workflows

Sometimes you need metrics or logs that don’t naturally fit into trace span attributes — for example, a fleet-wide compliance score, or the count of hosts behind on patches. For these, write a thin custom callback that emits to Prometheus pushgateway or directly to a metrics endpoint:

# callback_plugins/kv_metrics.py
import os  # needed for the os.environ lookups below

from ansible.plugins.callback import CallbackBase
from prometheus_client import CollectorRegistry, Counter, Histogram, push_to_gateway

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'kv_metrics'

    def __init__(self):
        super().__init__()
        self.registry = CollectorRegistry()
        self.task_counter = Counter(
            'ansible_task_total', 'Tasks executed',
            ['template', 'play', 'task', 'status'],
            registry=self.registry,
        )
        self.task_duration = Histogram(
            'ansible_task_duration_seconds', 'Task duration',
            ['template', 'task'],
            buckets=[0.1, 0.5, 1, 5, 10, 30, 60, 300, 1800],
            registry=self.registry,
        )

    def v2_runner_on_ok(self, result):
        self._record(result, 'ok')

    def v2_runner_on_failed(self, result, ignore_errors=False):
        self._record(result, 'failed')

    def v2_runner_on_skipped(self, result):
        self._record(result, 'skipped')

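    def _record(self, result, status):
        # Build one label set per task result; keep every label low-cardinality
        labels = {
            'template': os.environ.get('TOWER_JOB_TEMPLATE_NAME', 'cli'),
            # The 'play' label carries the role name when the task belongs to a role
            'play': result._task._role._role_name if result._task._role else 'no-role',
            'task': result._task.get_name(),
            'status': status,
        }
        self.task_counter.labels(**labels).inc()
        # ...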

    def v2_playbook_on_stats(self, stats):
        push_to_gateway(
            os.environ['PROMETHEUS_PUSHGATEWAY'],
            job=os.environ.get('TOWER_JOB_TEMPLATE_NAME', 'cli'),
            registry=self.registry,
        )

Drop this in your execution environment’s callback_plugins/, list it in callbacks_enabled, set PROMETHEUS_PUSHGATEWAY=https://pushgateway.kv.local:9091, and every playbook run will emit task-level counters and histograms. This is the foundation for any custom metric you want.
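The callback above declares a duration histogram but leaves the timing itself out. One way the timing could work is a small per-host, per-task stopwatch, sketched here without any Ansible imports so it can be tested alone. The TaskTimer class is hypothetical; wiring it into a start hook and into _record is an assumption about how you would extend the plugin.

```python
import time
from typing import Optional


class TaskTimer:
    """Tracks start times keyed by (host, task id) and returns elapsed seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so tests can use a fake clock
        self._started = {}

    def start(self, host: str, task_id: str) -> None:
        self._started[(host, task_id)] = self._clock()

    def stop(self, host: str, task_id: str) -> Optional[float]:
        """Return elapsed seconds, or None if start() was never recorded."""
        started = self._started.pop((host, task_id), None)
        if started is None:
            return None
        return self._clock() - started


# Possible wiring inside the callback (assumption, not shown in the original):
#   on task start for a host -> timer.start(host_name, task_uuid)
#   inside _record()         -> elapsed = timer.stop(host_name, task_uuid)
#                               if elapsed is not None:
#                                   task_duration.labels(template=..., task=...).observe(elapsed)
```

Using a monotonic clock avoids negative durations when the wall clock is adjusted during a long run.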


3. AAP control plane metrics

AAP exposes a Prometheus-compatible metrics endpoint at /api/v2/metrics/. A minimal scrape config:

# prometheus.yml
scrape_configs:
  - job_name: aap-controller
    metrics_path: /api/v2/metrics/
    bearer_token: '{{ aap_metrics_token }}'
    scheme: https
    static_configs:
      - targets: ['aap.kv.local:443']
    relabel_configs:
      - source_labels: [__address__]
        target_label: aap_instance

The metrics AAP exposes natively (excerpts):

The two metrics most worth alerting on:

# alert: control plane saturation
- alert: AAPControlPlaneSaturated
  expr: |
    avg by (aap_instance) (awx_instance_consumed_capacity)
      / avg by (aap_instance) (awx_instance_capacity)
      > 0.85
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "AAP control plane > 85% capacity for 15 minutes"
    description: "Schedule capacity scale-up; queue depth is rising."

# alert: job failure rate spike
- alert: AAPJobFailureRateSpike
  expr: |
    sum(rate(awx_status_total{status="failed"}[5m]))
      / sum(rate(awx_status_total[5m]))
      > 0.10
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "AAP job failure rate > 10% over 10 minutes"
    description: "Investigate template or environment regression."

These are the only two AAP-level alerts most teams need. Per-template alerts are usually too noisy and end up disabled within a quarter.
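Both rules rely on the for: clause to suppress short blips, which is easy to misread. The sketch below simulates that behaviour on a list of samples: the condition must hold on every evaluation for the whole duration before the alert fires, and a single dip resets the timer. It is a simplified model for reasoning about thresholds, not Prometheus's rule evaluator; the function name is hypothetical.

```python
def first_firing_time(samples, threshold, for_seconds):
    """samples: (timestamp_seconds, value) pairs in time order.

    Returns the timestamp at which the alert would start firing, or None.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp      # alert enters "pending"
            if timestamp - pending_since >= for_seconds:
                return timestamp               # held long enough: "firing"
        else:
            pending_since = None               # any dip resets the clock
    return None


# Failure ratio sampled every 60s, threshold 10%, for: 10m
steady = [(t, 0.12) for t in range(0, 901, 60)]
print(first_firing_time(steady, threshold=0.10, for_seconds=600))  # 600
```

Replaying a week of historical ratios through a function like this is a cheap way to tune the threshold and duration before the rule goes live.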


4. The unified dashboard taxonomy

A common failure mode is “we have 200 Grafana dashboards and nobody knows which one to open during an incident.” The fix is a strict three-layer dashboard taxonomy:

| Layer | Audience | Question it answers | Example |
| --- | --- | --- | --- |
| L1: Platform health | Platform team, SRE | Is the automation platform itself healthy? | AAP control plane saturation, EDA rulebook activations, queue depth |
| L2: Workload domain | Domain owners | Is my application’s automation healthy? | Per-business-app SLO dashboards, per-team failure rates |
| L3: Investigation | On-call during incident | Why did this specific run fail? | Job-detail drill-down, traces, logs |

Every alert routes to a specific dashboard. The Slack message format is rigid:

🔴 AAP job failure rate > 10%
Severity: critical | Triggered: 14:03 | Active: 12m
Dashboard: https://grafana.kv.local/d/aap-l1-health (L1)
Investigation: https://grafana.kv.local/d/aap-l3-jobs (L3)
Runbook: https://wiki.kv.local/runbooks/aap-failure-spike

Every responder gets the same starting point. No “where do I look?” question.
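A rigid format is easiest to enforce when one function renders every alert message. The sketch below reproduces the layout shown above; the function name and the severity-to-icon mapping are assumptions.

```python
# Hypothetical mapping; only the critical icon appears in the example above.
SEVERITY_ICON = {"critical": "🔴", "warning": "🟠", "info": "🔵"}


def format_alert(summary, severity, triggered, active_minutes,
                 l1_url, l3_url, runbook_url):
    """Render the fixed five-line alert message."""
    icon = SEVERITY_ICON.get(severity, "⚪")
    return "\n".join([
        f"{icon} {summary}",
        f"Severity: {severity} | Triggered: {triggered} | Active: {active_minutes}m",
        f"Dashboard: {l1_url} (L1)",
        f"Investigation: {l3_url} (L3)",
        f"Runbook: {runbook_url}",
    ])


message = format_alert(
    "AAP job failure rate > 10%", "critical", "14:03", 12,
    "https://grafana.kv.local/d/aap-l1-health",
    "https://grafana.kv.local/d/aap-l3-jobs",
    "https://wiki.kv.local/runbooks/aap-failure-spike",
)
print(message)
```

Because all three links are required arguments, an alert without a dashboard or runbook cannot be rendered at all, which is the point of the rigid format.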

4.1 The L1 platform-health dashboard

Twelve panels:

  1. Job rate (5m): rate(awx_status_total[5m]) stacked by status — see traffic + errors at once
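  2. p95 duration by template family: 95th percentile from histogram_quantile(0.95, sum by (le, template) (rate(ansible_task_duration_seconds_bucket[5m]))); this histogram records task durations, so treat it as a proxy for job duration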
  3. Control node CPU/memory: node_cpu_seconds_total, node_memory_* filtered to AAP nodes
  4. Postgres health (AAP database): connection pool, replication lag, slow queries
  5. Receptor mesh state: AAP’s internal mesh — node count, peer health, message queue depth
  6. EDA rulebook activations: count of running rulebooks; up for each rulebook process
  7. Inventory sync health: time since last successful sync per inventory; alert > 4h
  8. Webhook receivers: HTTP rates and error rates on AAP webhook endpoints
  9. Subscription / license: managed hosts consumed vs the licensed limit (awx_license_instance_total and awx_license_instance_free)
  10. Top 10 slowest jobs (last 24h): tabular drill-down
  11. Top 10 most-failing templates (last 7d): tabular drill-down
  12. CHG-compliance metric: % of production jobs that ran with a change_request_number extra var

That last panel is the most underrated. It’s a single number that answers “is governance actually working?” Healthy organisations keep this at 100% (excluding the read-only template list). Drop below 99% and it’s a P2 incident — someone has bypassed the gate.
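The CHG-compliance number is simple to compute from job records. A minimal sketch, assuming each job is available as a dict with a template name and its extra vars; the record shape and function name are hypothetical, while change_request_number is the extra var named above.

```python
def chg_compliance_percent(jobs, read_only_templates=frozenset()):
    """Percent of in-scope production jobs that carried a change_request_number."""
    in_scope = [job for job in jobs if job["template"] not in read_only_templates]
    if not in_scope:
        return 100.0   # nothing to govern counts as compliant
    compliant = sum(
        1 for job in in_scope
        if job.get("extra_vars", {}).get("change_request_number")
    )
    return 100.0 * compliant / len(in_scope)


jobs = [
    {"template": "patch-linux", "extra_vars": {"change_request_number": "CHG0012345"}},
    {"template": "patch-linux", "extra_vars": {"change_request_number": "CHG0012346"}},
    {"template": "deploy-app", "extra_vars": {"change_request_number": "CHG0012347"}},
    {"template": "deploy-app", "extra_vars": {}},                 # bypassed the gate
    {"template": "gather-facts", "extra_vars": {}},               # read-only, excluded
]
print(chg_compliance_percent(jobs, read_only_templates={"gather-facts"}))  # 75.0
```

An empty string for change_request_number counts as missing here, which catches templates that pass the variable but leave it blank.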


5. Logs: Loki and structured AAP output

AAP emits two distinct log streams:

  1. Job execution logs — stdout/stderr of every playbook run; written to disk and accessible via API
  2. Service logs — control plane internals (web tier, task scheduler, callback receiver)

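Both should land in Loki via promtail or Vector. The crucial discipline is structured logging: rather than free-form prose, ship one parseable record per task result. The community.general.log_plays callback plugin is a reasonable starting point. It writes one log file per managed host under log_folder and serialises each task result as JSON after a short timestamp/playbook/task prefix, which promtail or Vector can split off at ingest:

[defaults]
callbacks_enabled = community.general.opentelemetry, community.general.log_plays

[callback_log_plays]
log_folder = /var/log/ansible/hosts

ansible.cfg does not expand job variables such as {{ tower_job_id }}, so attach the job ID at ingest time rather than in the file name. After parsing, each JSON record should contain ts, host, task, module, status, result.changed, result.msg. Loki labels include job_id, template_name, inventory, severity. The query language (LogQL) becomes precise: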

# Find all failed `template` module tasks across all jobs in last hour
{template_name=~".+"} 
  | json 
  | status="failed" 
  | module="template"
  | line_format "{{.task}} on {{.host}}: {{.result_msg}}"

That single query, displayed in a Grafana log panel beside the L1 metrics, lets on-call instantly see “what’s failing right now” without clicking into individual jobs.

For ad-hoc operator debugging, a pre-built saved query for “show me everything from job 12345”:

{job_id="12345"} | json | line_format "{{.host}} | {{.task}} | {{.status}} | {{.result_msg}}"

Same data, different filter. The point is that every log query from operators should start from a label selector and the parsed fields, not full-text grep. Free-text search across multi-GB log streams is prohibitively expensive at scale; a query that narrows the streams by label first and then filters on parsed fields is fast.
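The same field-based filter is handy in scripts and tests that run over exported log lines. This sketch mirrors the failed-template-tasks query above in plain Python, assuming one JSON record per line with the fields listed earlier (host, task, module, status, result.msg); the function name is hypothetical.

```python
import json


def failed_module_tasks(lines, module):
    """Return 'task on host: message' for failed tasks of the given module."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip non-JSON noise instead of failing
        if not isinstance(record, dict):
            continue
        if record.get("status") == "failed" and record.get("module") == module:
            message = (record.get("result") or {}).get("msg", "")
            matches.append(f'{record.get("task")} on {record.get("host")}: {message}')
    return matches


sample = [
    '{"host": "web1", "task": "Render vhost", "module": "template", '
    '"status": "failed", "result": {"changed": false, "msg": "undefined variable"}}',
    '{"host": "web2", "task": "Render vhost", "module": "template", '
    '"status": "ok", "result": {"changed": true, "msg": ""}}',
    'not json at all',
]
print(failed_module_tasks(sample, "template"))
```

Filtering on parsed fields rather than substrings keeps the script's results consistent with what the Grafana panel shows.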

5.1 Log retention discipline

Operations logs typically need 30-90 days of hot storage; compliance often mandates 1-7 years for production change records. Structure your Loki tiering accordingly:

Most environments use Loki’s built-in compactor to handle this; Mimir / Cortex have the same pattern for metrics. Without these tiers, observability storage cost runs away within a quarter.


6. Traces and the multi-host correlation problem

The single most useful capability traces unlock is per-host execution timeline visualisation. AAP’s UI shows a job’s task list serially. A trace shows the same job as a Gantt chart across hosts: host1 ran task A from 14:03:00 to 14:03:08, then waited 4 seconds, then ran task B; host2 ran task A from 14:03:01 to 14:03:25 (slow!) — and immediately you can see which host is the long pole.

The default Tempo + Grafana visualisation gives you this for free once OTel is wired. The skills to use it well:

Find slow tasks across a fleet: trace search with service.name = ansible-aap AND duration > 30s shows every task that took longer than 30 seconds, grouped by task name. Discover that template render on host group X is consistently slow → investigate filesystem latency on those hosts.

Find failures correlated by module: trace search with service.name = ansible-aap AND status = error AND ansible.task.module = systemd shows all systemd-related failures across the fleet, last 24h. Discover a pattern (specific service name on specific OS version) and fix it once.

Cross-system correlation: this is where traces really earn their keep. Wire OTel into your AAP webhook receiver, into Event-Driven Ansible, into the application code that triggered the workflow. Now a single trace shows: “User clicked Slack button → Slack webhook received by EDA → EDA fired remediation rulebook → AAP launched job → Ansible ran on host → host’s metric came back to normal.” That entire causal chain in one trace, linked by traceparent headers passed at every boundary.

Implementing the full chain requires:

  1. AAP webhook receiver propagates incoming traceparent into the launched job’s extra vars
  2. The Ansible callback plugin reads traceparent from extra vars and uses it as the parent context
  3. EDA’s rulebook engine, when triggering a job via API, propagates its current trace context
  4. Slack/Teams bots, when invoking AAP, set traceparent from their incoming request

This is fiddly to set up but transformative once running. Mean time to root cause for “why did remediation fail?” drops from 30 minutes of cross-system investigation to one Grafana click.
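Each boundary in that chain has to do the same two things: validate the incoming traceparent and hand a child context to the next hop. A minimal sketch of those two steps, following the W3C Trace Context header layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names are hypothetical, and a production service would normally use the OpenTelemetry SDK's propagators instead of hand-rolled parsing.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's parts as a dict, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the Trace Context layout.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Keep the trace ID, mint a new span ID; start a fresh trace if input is bad."""
    parsed = parse_traceparent(incoming) if incoming else None
    trace_id = parsed["trace_id"] if parsed else secrets.token_hex(16)
    flags = parsed["flags"] if parsed else "01"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)   # pass this on, e.g. as a job extra var
```

Starting a fresh trace on bad input, rather than raising, keeps a broken upstream header from blocking the remediation job itself.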


7. The ServiceNow event correlation

Linking ITSM to observability is the final capstone wire. Two integration directions:

ServiceNow → metrics: Every CHG, INC, and PRB record event posts to a webhook that increments Prometheus metrics. You get metrics like:

Metrics → ServiceNow: Alertmanager’s webhook receiver creates ServiceNow incidents directly:

# alertmanager.yml
receivers:
  - name: servicenow
    webhook_configs:
      - url: 'https://aap.kv.local/api/v2/job_templates/snow-create-inc/launch/'
        send_resolved: true
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/aap-token

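The “snow-create-inc” job template runs a playbook that takes the alertmanager payload, derives priority/category/assignment, and creates an INC via servicenow.itsm.incident. (The controller's launch endpoint reads variables from an extra_vars key, so in practice a thin relay or EDA's Alertmanager event source has to wrap the raw payload before launch.) Now every operationally significant alert has a ticket; every ticket auto-resolves when the alert clears.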
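The "derives priority/category/assignment" step is a plain mapping from the Alertmanager webhook payload to incident fields. A sketch under stated assumptions: the severity-to-priority table, the category label, and the use of the instance label as the CI are all hypothetical conventions, and the output is a plain dict that the playbook would still have to map onto the incident module's parameters.

```python
# Hypothetical mapping from alert severity to ServiceNow priority.
PRIORITY_BY_SEVERITY = {"critical": 1, "warning": 2, "info": 4}


def incidents_from_alertmanager(payload):
    """Turn the firing alerts in a webhook payload into incident field dicts."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue                       # resolved alerts are handled elsewhere
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append({
            "short_description": annotations.get("summary")
                                 or labels.get("alertname", "alert"),
            "priority": PRIORITY_BY_SEVERITY.get(labels.get("severity"), 3),
            "category": labels.get("category", "inquiry"),
            "cmdb_ci": labels.get("instance", ""),
        })
    return incidents


payload = {
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "DiskSpaceLow", "severity": "warning",
                   "category": "disk", "instance": "prod-app-04"},
        "annotations": {"summary": "Root filesystem below 10% free"},
    }],
}
print(incidents_from_alertmanager(payload))
```

Keeping the mapping in one pure function makes it testable without a ServiceNow instance and keeps the playbook itself thin.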

The closed-loop pattern, end-to-end:

1. Host metric crosses threshold (e.g. disk > 90%)
2. Prometheus fires alert → Alertmanager → AAP webhook
3. AAP creates ServiceNow INC, priority computed from severity
4. EDA rulebook subscribed to "INC created with category=disk" fires
5. EDA launches "INC: Disk cleanup" job template
6. Job runs cleanup, verifies disk now < 80%
7. Job posts work note + resolves INC
8. Host metric returns to normal → Alertmanager fires resolved
9. AAP webhook closes any matching open INCs (idempotent)

In a healthy organisation this loop runs hundreds of times a day, with humans involved only on the long tail of cases the automation cannot handle. The metric to track is the auto-resolution rate — the percentage of incidents that closed without human intervention. Healthy mature platforms reach 60-80%; the remaining 20-40% are the genuinely novel issues humans should focus on.


8. SLOs as the contract

The thread that holds the whole observability story together is the Service Level Objective. For an automation platform, the SLOs that matter:

| SLO | Target | Measurement |
| --- | --- | --- |
| Platform availability | 99.9% (≈ 8.8h downtime/year) | AAP /health/ returns 200 |
| Job success rate | 99% (excluding intentional failures) | sum(awx_status_total{status="successful"}) / sum(awx_status_total) |
| p95 job latency by template family | varies (e.g. patching < 30 min, config-drift < 5 min) | OTel-derived histogram |
| Time to resolution (auto-remediation) | < 5 min p95 | INC opened → INC resolved (where assignment_group matches automation) |
| Auto-resolution rate | > 60% | INC auto-resolved / INC total |
| CHG-compliance | 100% on production templates | jobs-with-chg / production-jobs |
| Backup restore drill success | 100% | drill-passed / drill-total (rolling 90d) |

The discipline: every SLO has an error budget. If your platform availability SLO is 99.9%, you have 0.1% of “budget” to spend on outages, deploys, and changes per quarter. When the budget is consumed, you must freeze risky changes until the budget recovers (typically over the next 30 days).
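The error-budget arithmetic is worth making explicit, because the numbers are smaller than intuition suggests. A minimal sketch; the function names are hypothetical and the window length is a parameter so the same code covers 30-day and quarterly budgets.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed 'bad' minutes in the window, e.g. slo=0.999 over 30 days."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining_fraction(slo: float, observed_availability: float) -> float:
    """Share of the budget still unspent; negative means the budget is blown."""
    budget = 1.0 - slo
    spent = 1.0 - observed_availability
    return 1.0 - spent / budget


def must_freeze(slo: float, observed_availability: float) -> bool:
    """Freeze risky changes once the budget is fully consumed."""
    return budget_remaining_fraction(slo, observed_availability) <= 0.0


print(round(error_budget_minutes(0.999, 30), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.999, 90), 1))   # 129.6 minutes per quarter
```

A 99.9% target over 30 days leaves about 43 minutes of budget, so a single long maintenance outage can consume all of it.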

This is the SRE playbook applied to automation. The error budget aligns the platform team’s incentives: they want to ship features, but every failure consumes budget; therefore quality and reliability work earn the right to ship the next feature. Without this, the platform team always picks features over reliability, and the platform degrades over time.

A Grafana SLO dashboard panel:

Availability SLO: 99.9% (target)
  Last 30 days: 99.94% (above target ✅)
  Error budget remaining: 40% (17m of 43m)
  Burn rate (1h): 0.4x (sustainable)
  Burn rate (24h): 1.1x (slightly over budget pace)

When burn rate exceeds 14.4x for an hour or 6x for six hours (Google’s recommended thresholds), page on-call. Otherwise the SLO panel is just a quiet, daily health check.
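Burn rate is the observed error ratio divided by the ratio the SLO allows, so 1x means the budget lasts exactly the SLO window. A small sketch of the paging rule described above; the function names are hypothetical, and the thresholds are the ones quoted in the text.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(burn_1h: float, burn_6h: float) -> bool:
    """Page when the 1-hour rate exceeds 14.4x or the 6-hour rate exceeds 6x."""
    return burn_1h > 14.4 or burn_6h > 6.0


# 2% of health checks failing against a 99.9% target burns budget at about 20x.
print(round(burn_rate(0.02, 0.999), 1))
```

The two windows catch different failures: the short one pages on sudden outages, the longer one on slower degradation that would still exhaust the budget early.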


9. The closed feedback loop in practice

The thing that makes all of this actually transformative is when alerts trigger automation that closes the alert — and the loop is observable end-to-end. The full lifecycle:

Step 1: A node_exporter metric on prod-app-04 shows node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10.

Step 2: Prometheus fires DiskSpaceLow alert. Alertmanager routes to webhook receiver.

Step 3: AAP Create-INC template runs. Creates ServiceNow INC with category=disk, priority=2, cmdb_ci=prod-app-04. Trace ID abc123.

Step 4: EDA rulebook incident-remediation polls ServiceNow every 30s, sees the new INC, matches the rule for “disk full,” calls AAP API to launch INC-disk-cleanup template. Propagates abc123 as traceparent.

Step 5: AAP runs the cleanup playbook on prod-app-04. The OTel callback plugin uses abc123 as the trace root. Tasks run, logs flow to Loki tagged with traceparent=abc123.

Step 6: Cleanup succeeds. df shows disk now at 67%. Playbook posts work note, resolves INC, sets close_code=“Solved (Permanently)”.

Step 7: 60 seconds later, node_exporter scrape shows disk back below threshold. Prometheus fires DiskSpaceLow resolved.

Step 8: Alertmanager sends resolved notification. AAP webhook receives it, looks for any open INCs matching this CI + alert; the INC is already resolved, so this is a no-op (idempotent).

The Grafana dashboard for this single incident shows:

That single screen tells the story of one auto-healed incident with zero ambiguity. Operationally, this is gold — but it is also the evidence artefact an auditor wants when asking “show me an example of how your automation responds to incidents.” One trace ID, one Grafana link, one minute to walk through the full chain.


10. Operational rituals that keep observability healthy

A surprising failure mode: organisations build great observability, then it decays over months. Three rituals prevent this:

Weekly observability review (15 min): Platform team reviews:

Monthly chaos game day: Pick one specific failure mode (control plane node down, Postgres replica lagging, Loki ingest backlogged) and verify your alerting catches it within target latency. Failures here mean the alert is misconfigured; fix it before the real outage.

Quarterly dashboard pruning: Each domain owner reviews their L2 dashboards. If a panel has produced no useful insight in 90 days, either rewrite it to be useful or remove it. Dashboards bloat over time; aggressive pruning keeps them readable.

The principle: observability is a product, not a one-time build. It needs roadmap, ownership, and continuous quality work. The orgs that get this wrong end up with massive observability bills, dashboards no one looks at, and alerts that fire constantly and are universally muted, and then they rebuild the whole thing every two years. The orgs that get it right have a stable, slowly evolving observability layer that lasts a decade.


11. Cardinality discipline (the silent killer)

A specific failure mode worth its own section: metric cardinality explosion. Prometheus is fast and cheap if you keep cardinality bounded; it falls over hard if you don’t.

Examples of cardinality bombs:

# BAD — user_id has unbounded cardinality
rate(api_request_total{user_id="$user"}[5m])

# BAD — unique trace IDs as label
counter.add(traceparent=$traceparent)

# BAD — host names without bucketing
ansible_task_total{host="host-12345.cluster.local"}

Every unique label value combination is a separate time series. 10 templates × 100 hosts × 50 tasks × 4 statuses = 200k series. Add host_ip_address as a label and now it’s 200k × 1k IPs = 200M series. Prometheus crashes.
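The multiplication above is worth running before any new label ships. A minimal sketch that computes the worst-case series count from per-label cardinalities; the function name is hypothetical, and real counts are usually lower because not every combination occurs.

```python
from math import prod


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


labels = {"template": 10, "host": 100, "task": 50, "status": 4}
print(worst_case_series(labels))                                   # 200000
print(worst_case_series({**labels, "host_ip_address": 1000}))      # 200000000
```

A review rule such as "no new label without a worst-case estimate" turns this one-liner into a gate on metric changes.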

The discipline:

A continuous monitoring metric:

# alert if any single metric exceeds 100k series
- alert: HighCardinalityMetric
  expr: |
    count by (__name__) ({__name__=~".+"}) > 100000
  for: 30m

This catches cardinality bombs within an hour of introduction, before they impact ingestion performance.


12. Budgeting and capacity

Final practical concerns. Observability is not free, and ungoverned observability costs grow faster than the workload they observe.

Rough annual costs at scale (industry typical, varies by vendor and region):

The most common pattern: an organisation spends $X on the cloud workload they’re observing and $0.5-2X on observing it. That’s normal. When it exceeds 2X you have a quality problem (label explosion, log spam, no sampling) — fix the discipline, not the budget.

Capacity planning for the observability stack itself:

Run synthetic load tests quarterly to verify the headroom. Never let any component exceed 70% utilisation in steady state — the spike on the day of an incident will push it over.


13. Where this leaves you

You have just completed Tier 5 of this Ansible course. Across ten lessons we’ve covered:

What you should walk away with is the conviction that mature automation in a regulated enterprise is not a single tool or playbook. It is an interoperable system of disciplines: governance via ITSM, content via Ansible, evidence via signed bundles, recovery via tested DR, scale via fleet patterns, and visibility via observability. None of these alone is sufficient; together they form a platform that auditors trust, executives can defend, and engineers actually want to use.

The next steps from here depend on your role:

The hardest lesson across this whole course is also the simplest: automation is a cultural and organisational artefact as much as a technical one. The playbooks are the easy part. The disciplines — change-management, evidence-bundling, restore-testing, SLO-budgeting — are what separate organisations whose automation actually works from those whose automation is a slide deck. This curriculum exists to give you the patterns. Whether they take root depends on the people, leadership, and engineering culture you build around them.

That’s what makes the journey worth it.
