This is one of the lessons that, if you implement it well, fundamentally changes how your organisation perceives “automation.” Up to this point in the series, your playbooks have been triggered by humans on a CLI, by Git pushes, or by AAP schedules. That is fine for sandbox and pre-prod. In production at a regulated enterprise (banks, insurers, healthcare, telecom, utilities, anything subject to SOX, SOC 2, ISO 27001, HIPAA or PCI-DSS) there is a hard organisational rule that automation must obey:
No production change happens without an approved Change ticket.
And a softer but equally important rule:
No production change happens silently. Operators see it in the same channel where they see everything else — usually Slack or Teams.
These two rules turn ITSM and ChatOps from “nice integrations” into the control plane of your automation. ServiceNow (or BMC Helix, or Jira Service Management) becomes the authority on what is allowed to run, against what, when, and by whom. Slack/Teams becomes the human surface of the automation: the place where engineers approve, query state, and trigger safe operations without leaving the conversation.
This lesson is the deep-dive into that wiring. We will cover the four patterns that, together, define a mature ITSM + ChatOps integration:
- ServiceNow CMDB as a dynamic inventory: the CMDB becomes Ansible’s source of truth for hosts, applications, business services, and ownership.
- Change-ticket-as-prerequisite (CHG-gate): AAP job templates refuse to run unless an approved CHG ticket exists, is in the right state, has the right CIs attached, and is inside its scheduled window.
- Event-Driven Ansible (EDA) rulebooks: incidents in ServiceNow trigger remediation playbooks; results are written back as work notes and the incident is auto-resolved when remediation succeeds.
- Slack/Teams ChatOps with real approvals: engineers can run safe ops directly from chat (@kv-bot reboot prod-app-04), and the bot proxies through ServiceNow approval rules so the audit trail still leads back to the change record.
By the end of this lesson you should be able to look at a regulated enterprise’s compliance auditor, point at the chain of evidence from “Slack message → ServiceNow CHG → AAP job → host change → ServiceNow work note → resolved CHG,” and have them sign off without a follow-up question. That is the bar.
1. Why ITSM integration is non-negotiable in regulated enterprises
There is a recurring pattern in mid-sized engineering orgs: the platform team builds beautiful Ansible automation, demos it, and then production teams refuse to adopt it. The reason given is usually “we don’t trust automation in prod.” The real reason, almost every time, is:
Production teams are personally accountable to auditors. They cannot allow a change to land in production unless they can point at a CHG ticket that authorised it.
If your automation cannot produce that ticket-shaped audit artefact, it does not get adopted. Period.
The four classes of ITSM evidence auditors look for, in order of importance:
| Evidence | What auditors want to see | How automation must produce it |
|---|---|---|
| Authorisation | A CHG ticket in Scheduled or Implement state, approved by named approver(s), referencing the affected CIs | AAP refuses to run unless a valid CHG number is supplied and validated |
| Execution window | Change occurred between start_date and end_date of the CHG | AAP refuses to run outside the window |
| Affected CIs | The CIs the playbook actually touched match the CIs listed on the CHG | AAP enforces inventory ⊆ CHG.affected_cis |
| Closure | Work notes describing what was done; CHG transitioned to Review/Closed with success/failure evidence | Playbook writes signed work notes and updates CHG state automatically |
A common mistake is to treat ITSM as a notification target — “we’ll just email ServiceNow when a job runs.” That gives auditors no enforcement, no link between change and execution, and no automatic closure. It will fail your first ITGC audit.
The pattern in this lesson treats ServiceNow as a gate, not a notification target. Without a valid, approved, in-window CHG, the playbook does not run. The job’s first task is servicenow.itsm.change_request_info, and the job fails-closed if the lookup returns anything other than an approved, scheduled, CI-matched CHG.
2. servicenow.itsm collection: the connector
Red Hat’s officially-supported collection is servicenow.itsm. It exposes modules for every ITSM table you actually need:
- change_request / change_request_info: create, update, and look up CHG tickets
- change_request_task / change_request_task_info: CHG implementation tasks
- incident / incident_info: incident records (INC)
- problem / problem_info: problem records (PRB)
- configuration_item / configuration_item_info: CI records (cmdb_ci_*)
- attachment / attachment_info: file attachments (evidence bundles, logs)
- api: generic Table API fallback for any custom table
Authentication supports both basic auth (username + password) and OAuth2 (preferred for production). For AAP, you create a custom credential type that maps to the collection’s environment variables:
# inputs schema for AAP custom credential type "ServiceNow OAuth"
fields:
  - id: instance_host
    type: string
    label: ServiceNow instance hostname
  - id: client_id
    type: string
    label: OAuth client ID
  - id: client_secret
    type: string
    label: OAuth client secret
    secret: true
  - id: username
    type: string
    label: Service account username
  - id: password
    type: string
    label: Service account password
    secret: true
required:
  - instance_host
  - client_id
  - client_secret
  - username
  - password

# injectors
env:
  SN_HOST: '{{ instance_host }}'
  SN_CLIENT_ID: '{{ client_id }}'
  SN_CLIENT_SECRET: '{{ client_secret }}'
  SN_USERNAME: '{{ username }}'
  SN_PASSWORD: '{{ password }}'
Now any AAP job template with this credential injected gets ServiceNow access via the standard servicenow.itsm env var contract — no inline secrets, no vars_prompt.
The service account itself needs specific roles in ServiceNow:
- itil: read/write incidents, problems, changes, tasks
- change_manager (optional, for state transitions like Approve → Implement)
- A custom role with read on cmdb_ci, cmdb_ci_server, cmdb_ci_database, etc., for the CMDB inventory plugin
- Never admin. Auditors will fail you for over-privileged service accounts.
3. CMDB as dynamic inventory
The first major pattern is making the CMDB authoritative for inventory. This sounds simple but has far-reaching implications: if it works, you stop maintaining hand-edited inventory files, and the relationship between “what we think we run” and “what Ansible runs against” becomes always-correct-by-construction.
The collection ships an inventory plugin: servicenow.itsm.now. A minimal config:
# inventory/servicenow.yml
---
plugin: servicenow.itsm.now

# pull all server CIs that are operational
table: cmdb_ci_server
sysparm_query: "operational_status=1^install_status=1"

# build groups from CI columns
groups:
  linux: "os.lower() is search('linux|rhel|ubuntu|debian|centos|rocky|alma|sles')"
  windows: "os.lower() is search('windows|win')"
  prod: "support_group.display_value is search('Production')"
  pci_scope: "u_pci_scope == 'true'"

# build group hierarchy from business_application
keyed_groups:
  - key: u_business_application.display_value | lower | replace(' ', '_')
    prefix: app
  - key: u_environment.display_value | lower
    prefix: env
  - key: location.display_value | lower | replace(' ', '_')
    prefix: site

# variables to attach to each host
compose:
  ansible_host: ip_address
  ansible_user: "'ansible-svc' if os.lower() is search('linux') else 'svc-ansible'"
  cmdb_sys_id: sys_id
  cmdb_owner: owned_by.display_value
  cmdb_environment: u_environment.display_value
  cmdb_business_app: u_business_application.display_value
  cmdb_pci_scope: u_pci_scope
What you get for free:
- app_payments_api, app_billing_core, env_prod, env_uat, site_dc1_frankfurt groups built automatically from CMDB metadata
- Every host has its CMDB sys_id, owner, environment, and business app available as Ansible vars, meaning playbooks can do things like when: cmdb_environment == 'production' and cmdb_pci_scope
- Adding a new server in CMDB makes it available to Ansible at the next inventory sync: no inventory file edits, no PR, no merge
The crucial discipline: CMDB must be the source of truth for hostnames and IPs. If your CMDB is inaccurate, this pattern amplifies the inaccuracy into automation. Most organisations need a 6-12 month CMDB hygiene project before this pattern becomes safe. The “discovery → reconcile → remediate” pattern (running ServiceNow Discovery alongside gather_facts and reconciling differences) is its own multi-week project.
A pragmatic compromise: start with CMDB as inventory for non-production environments where the cost of inaccuracy is low, fix CMDB through the feedback loop, and graduate to prod only when you have a clean reconciliation report.
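The reconciliation report mentioned above can be sketched in a few lines. This is a minimal illustration, assuming both sides have already been reduced to a hostname-to-IP mapping; the function name `reconcile` and the shape of its report are hypothetical, not part of any ServiceNow or Ansible API.

```python
def reconcile(cmdb, observed):
    """Compare CMDB records with observed hosts (both: hostname -> IP address)."""
    cmdb_names, seen_names = set(cmdb), set(observed)
    # Hosts that answer on the network but have no CI record
    missing_from_cmdb = sorted(seen_names - cmdb_names)
    # CIs that claim to be operational but were not observed
    stale_in_cmdb = sorted(cmdb_names - seen_names)
    # Hosts known to both sides whose IP address disagrees
    ip_mismatch = sorted(
        name for name in cmdb_names & seen_names if cmdb[name] != observed[name]
    )
    total = len(cmdb_names | seen_names)
    drifted = len(missing_from_cmdb) + len(stale_in_cmdb) + len(ip_mismatch)
    return {
        "missing_from_cmdb": missing_from_cmdb,
        "stale_in_cmdb": stale_in_cmdb,
        "ip_mismatch": ip_mismatch,
        "drift_pct": round(100 * drifted / total, 1) if total else 0.0,
    }
```

A report with an empty `missing_from_cmdb`, `stale_in_cmdb` and `ip_mismatch` is the kind of "clean reconciliation" that would justify graduating an environment to CMDB-driven inventory.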
3.1 Caching to survive ServiceNow rate limits
ServiceNow’s REST API is not designed for bursty inventory queries. With more than ~5,000 CIs and frequent AAP job runs, you will hit rate limits or timeouts. Configure aggressive inventory caching:
# ansible.cfg (the inventory plugin file accepts the same options as YAML keys, e.g. cache: true)
[inventory]
cache = true
cache_plugin = jsonfile
cache_timeout = 1800
cache_connection = /var/cache/ansible/inventory
cache_prefix = snow_
For AAP, the inventory source has an “update on launch” toggle. Disable it for fast-running playbooks; use a scheduled inventory sync (every 15-30 minutes) instead. A stale-by-15-minutes inventory is acceptable; an inventory sync that takes 4 minutes before every job run is not.
4. The CHG-gate pattern
This is the most important pattern in this lesson. The rule is:
Every production job template’s first play, before gather_facts, validates the CHG ticket. If the validation fails, the job fails. There is no override flag.
Job templates expose a survey field change_request_number (string, required, regex ^CHG\d{7,}$). The first play looks like this:
---
- name: Pre-flight CHG validation
  hosts: localhost
  gather_facts: false
  connection: local
  vars:
    # Facts are not gathered, so ansible_date_time is unavailable; use now().
    # ServiceNow returns date-times as 'YYYY-MM-DD HH:MM:SS'. The comparison
    # below assumes the service account's time zone is UTC.
    now_snow: "{{ now(utc=true).strftime('%Y-%m-%d %H:%M:%S') }}"
  tasks:
    - name: Look up the change request
      servicenow.itsm.change_request_info:
        number: "{{ change_request_number }}"
      register: chg_lookup
      no_log: false  # the CHG metadata itself is not secret

    - name: Fail-closed if CHG not found
      ansible.builtin.fail:
        msg: "CHG {{ change_request_number }} does not exist."
      when: chg_lookup.records | length == 0

    - name: Capture the CHG record
      ansible.builtin.set_fact:
        chg: "{{ chg_lookup.records[0] }}"

    - name: Fail-closed if CHG state is not Scheduled or Implement
      ansible.builtin.fail:
        msg: >
          CHG {{ chg.number }} is in state '{{ chg.state }}'.
          Required state: 'scheduled' or 'implement'.
          Current approver state: {{ chg.approval }}.
      when: chg.state not in ['scheduled', 'implement']

    - name: Fail-closed if CHG is not approved
      ansible.builtin.fail:
        msg: "CHG {{ chg.number }} is not approved (approval={{ chg.approval }})."
      when: chg.approval != 'approved'

    - name: Fail-closed if outside scheduled window
      ansible.builtin.fail:
        msg: >
          CHG {{ chg.number }} window is
          {{ chg.start_date }} → {{ chg.end_date }}.
          Current time {{ now_snow }} is outside the window.
      when: now_snow < chg.start_date or now_snow > chg.end_date

    - name: Look up the CI attached to the CHG
      servicenow.itsm.configuration_item_info:
        sys_id: "{{ chg.cmdb_ci }}"
      register: chg_cis
      when: chg.cmdb_ci | length > 0

    # cmdb_ci is only the primary CI. Additional affected CIs live in the
    # task_ci table and would need a further lookup to be included here.
    - name: Collect the sys_ids and names the CHG covers
      ansible.builtin.set_fact:
        chg_ci_sys_ids: "{{ chg_cis.records | default([]) | map(attribute='sys_id') | list }}"
        chg_ci_names: "{{ chg_cis.records | default([]) | map(attribute='name') | list }}"

    - name: Fail-closed if any inventory host is not in CHG.affected_cis
      ansible.builtin.fail:
        msg: >
          Host {{ item }} is not listed in CHG {{ chg.number }} affected CIs.
          CHG covers: {{ chg_ci_names | join(', ') }}.
      when: hostvars[item].cmdb_sys_id not in chg_ci_sys_ids
      loop: "{{ groups['target_hosts'] }}"

    - name: Transition CHG to Implement state
      servicenow.itsm.change_request:
        number: "{{ chg.number }}"
        state: implement
        # fields without a dedicated module option are passed through `other`
        other:
          work_notes: >
            AAP job {{ tower_job_id }} (template '{{ tower_job_template_name }}')
            starting at {{ now_snow }} UTC.
            Triggered by {{ tower_user_name }}.
      when: chg.state == 'scheduled'
What this gives you:
- Five independent fail-closed gates (existence, state, approval, time window, CI membership), followed by a recorded state transition
- Automatic state transition: the CHG moves to “Implement” when the job starts; auditors see precisely when the change began
- Tower job ID linkage: the work note ties the AAP job to the CHG, so auditors can cross-reference both directions
- No override flag: this is intentional. Operators who want to bypass the gate must create a CHG. There is no --skip-chg-check flag.
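The gate logic is easier to unit-test when it is expressed as a pure function. The sketch below mirrors the state, approval, window and CI-membership checks; the dictionary keys (including `affected_cis`) and the function name `validate_chg` are assumptions made for illustration, and the date format assumes ServiceNow date-times in UTC.

```python
from datetime import datetime, timezone

SNOW_FMT = "%Y-%m-%d %H:%M:%S"  # assumed: ServiceNow date-time strings, in UTC


def _parse(value):
    return datetime.strptime(value, SNOW_FMT).replace(tzinfo=timezone.utc)


def validate_chg(chg, target_ci_ids, now):
    """Return a list of gate failures; an empty list means the job may run."""
    if chg is None:
        return ["CHG does not exist"]
    failures = []
    if chg["state"] not in ("scheduled", "implement"):
        failures.append(f"state is {chg['state']}")
    if chg["approval"] != "approved":
        failures.append(f"approval is {chg['approval']}")
    if not _parse(chg["start_date"]) <= now <= _parse(chg["end_date"]):
        failures.append("outside scheduled window")
    # Every host the job targets must be covered by the change record
    uncovered = sorted(set(target_ci_ids) - set(chg["affected_cis"]))
    if uncovered:
        failures.append("hosts not on CHG: " + ", ".join(uncovered))
    return failures
```

Collecting every failure, instead of stopping at the first, gives the engineer one complete message to act on; the job still refuses to run if the list is non-empty.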
The post-play, run after the main playbook completes, closes the loop:
- name: Post-flight CHG closure
  hosts: localhost
  gather_facts: false
  connection: local
  vars:
    # ansible_failed_task is only set inside a rescue section, so the main play
    # has to wrap its tasks in block/rescue for this flag to be meaningful.
    job_succeeded: "{{ ansible_failed_task is not defined }}"
    # facts are not gathered in this play, so use now() for timestamps
    finished_at: "{{ now(utc=true).strftime('%Y-%m-%dT%H:%M:%SZ') }}"
  tasks:
    - name: Render evidence bundle path
      ansible.builtin.set_fact:
        evidence_path: "/var/lib/awx/evidence/{{ tower_job_id }}.tar.gz"

    - name: Attach evidence bundle to CHG
      servicenow.itsm.attachment_upload:
        table_name: change_request
        table_sys_id: "{{ chg.sys_id }}"
        attachments:
          - path: "{{ evidence_path }}"
      when: evidence_path is file

    - name: Write closure work note
      servicenow.itsm.change_request:
        number: "{{ chg.number }}"
        state: "{{ 'review' if job_succeeded else 'implement' }}"
        close_code: "{{ 'successful' if job_succeeded else 'unsuccessful' }}"
        close_notes: "Automated closure by AAP job {{ tower_job_id }}."
        # work_notes has no dedicated module option; pass it through `other`
        other:
          work_notes: |
            AAP job {{ tower_job_id }} completed at {{ finished_at }}.
            Status: {{ 'SUCCESS' if job_succeeded else 'FAILED' }}.
            Hosts changed: {{ groups['target_hosts'] | length }}.
            Evidence bundle: attached.
A failed job stays in Implement state — it does not auto-close as failed. That is deliberate: a failure means a human has to investigate and decide what comes next. Automatic closure of failed changes hides incidents.
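The evidence bundle referenced in the post-flight play has to come from somewhere. One possible way to build it, sketched here with only the Python standard library, is to pack the job artefacts into a tar.gz and record a SHA-256 digest that can be quoted in the work note. The function name and the idea of quoting the digest are assumptions of this sketch, not something the collection provides.

```python
import hashlib
import io
import tarfile


def build_evidence_bundle(files):
    """Pack {name: bytes} into an in-memory tar.gz; return (bundle, sha256 hex)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in sorted(files):  # stable member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    bundle = buf.getvalue()
    # The digest lets a reviewer confirm the attachment was not altered later
    return bundle, hashlib.sha256(bundle).hexdigest()
```

Writing `bundle` to the evidence path and including the digest in the closure note gives the auditor a way to check that the attachment on the CHG is the one the job produced.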
4.1 Standard changes get a streamlined path
Not every change needs a 5-day CAB approval cycle. ServiceNow has a concept of “standard changes” — pre-approved templates for low-risk, repeatable operations (e.g., “rotate TLS certificate,” “patch low-risk Linux kernel CVE”). The collection supports creating CHGs from a standard change template:
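- name: Create standard CHG for cert rotation
  servicenow.itsm.change_request:
    type: standard
    template: "Standard - TLS Certificate Rotation"
    short_description: "Rotate TLS cert for {{ inventory_hostname }}"
    assignment_group: "Platform Engineering"
    state: scheduled
    # fields without a dedicated module option are passed through `other`
    other:
      cmdb_ci: "{{ cmdb_sys_id }}"
      # 30-minute window in ServiceNow's 'YYYY-MM-DD HH:MM:SS' format (UTC assumed)
      start_date: "{{ now(utc=true).strftime('%Y-%m-%d %H:%M:%S') }}"
      end_date: "{{ '%Y-%m-%d %H:%M:%S' | strftime((now().timestamp() | int) + 30 * 60, utc=true) }}"
  delegate_to: localhost  # the API call runs on the controller, not the target
  register: created_chg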
You give engineers a self-service “rotate cert” button in Slack; the bot creates the standard CHG, immediately gets approval, and runs the job. The audit trail still exists, the CAB does not have to meet, and the cycle time drops from days to seconds. This is how mature orgs scale automation without breaking governance.
5. Event-Driven Ansible: incident → remediation → closure loop
The pattern so far is human-initiated, ITSM-gated. The complement is event-initiated, ITSM-recorded: an incident appears in ServiceNow (from monitoring, from a user ticket, from anywhere), EDA detects it, runs a remediation playbook, and writes the result back as a work note.
Event-Driven Ansible uses rulebooks — declarative YAML mapping sources (event producers) to conditions (rules) to actions (run a playbook, post to webhook, etc.).
The servicenow.itsm collection ships an EDA source plugin that polls ServiceNow’s Table API for new or updated records. A minimal rulebook:
# rulebooks/servicenow-incidents.yml
---
- name: ServiceNow incident remediation
  hosts: all
  sources:
    - servicenow.itsm.records:
        instance:
          host: "{{ SN_HOST }}"
          username: "{{ SN_USERNAME }}"
          password: "{{ SN_PASSWORD }}"
        table: incident
        query: "active=true^state=1^assignment_group.nameLIKEPlatform"
        interval: 30
  rules:
    # The Table API returns field values as strings, so priorities are
    # compared as "1", "2", "3" rather than as integers.
    - name: Disk full → run cleanup
      condition: |
        event.short_description is search("disk.*full|filesystem.*full", ignorecase=true)
        and event.priority in ["1", "2", "3"]
      action:
        run_job_template:
          name: "INC: Disk cleanup"
          organization: Default
          job_args:
            extra_vars:
              incident_number: "{{ event.number }}"
              target_host: "{{ event.cmdb_ci.display_value }}"

    - name: Service down → restart and verify
      condition: |
        event.short_description is search("service.*down|process.*not running", ignorecase=true)
        and event.priority in ["1", "2"]
      action:
        run_job_template:
          name: "INC: Service restart"
          organization: Default
          job_args:
            extra_vars:
              incident_number: "{{ event.number }}"
              target_host: "{{ event.cmdb_ci.display_value }}"
              service_name: "{{ event.short_description | regex_search('service\\s+(\\S+)', '\\1') | first }}"

    - name: Unknown high-priority incident → page on-call
      condition: |
        event.priority in ["1", "2"]
        and event.assignment_group.display_value == "Platform"
      action:
        post_event:
          event:
            type: pagerduty_trigger
            incident_number: "{{ event.number }}"
            severity: "{{ event.priority }}"
            description: "{{ event.short_description }}"
Activating this rulebook in EDA means: every 30 seconds, EDA polls ServiceNow for new incidents assigned to Platform; matching incidents trigger the right remediation job; unmatched high-priority ones page on-call.
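The routing behaviour of the rulebook can be checked offline with a small model. This sketch assumes the first matching rule wins and that priorities arrive as strings; the `RULES` table and the `route` function are hypothetical helpers for testing the patterns, not part of ansible-rulebook.

```python
import re

# (pattern on short_description, allowed priorities, job template to launch)
RULES = [
    (re.compile(r"disk.*full|filesystem.*full", re.I), {"1", "2", "3"}, "INC: Disk cleanup"),
    (re.compile(r"service.*down|process.*not running", re.I), {"1", "2"}, "INC: Service restart"),
]


def route(incident):
    """Return the action for an incident record, or None if nothing applies."""
    for pattern, priorities, template in RULES:
        if pattern.search(incident["short_description"]) and incident["priority"] in priorities:
            return template
    # Fallback rule: anything unmatched but high priority goes to a human
    if incident["priority"] in {"1", "2"}:
        return "page on-call"
    return None
```

Running a sample of last month's incident descriptions through `route` before activating the rulebook is a cheap way to see which rule each one would have hit.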
The remediation playbook itself follows a strict contract:
---
- name: Remediate disk full incident
  hosts: "{{ target_host }}"
  gather_facts: true
  # incident_number arrives as an extra var from the rulebook. Redefining it
  # under vars as "{{ incident_number }}" would be a recursive reference.
  tasks:
    - name: Acknowledge incident
      servicenow.itsm.incident:
        number: "{{ incident_number }}"
        state: in_progress
        other:
          work_notes: >
            AAP {{ tower_job_id }} starting auto-remediation at {{ ansible_date_time.iso8601 }}.
      delegate_to: localhost
      run_once: true

    - name: Find candidate paths to clean
      ansible.builtin.find:
        paths:
          - /var/log
          - /tmp
          - /var/cache
        age: 7d
        size: 100m
      register: cleanup_candidates

    - name: Compress old logs
      ansible.builtin.archive:
        path: "{{ item.path }}"
        dest: "{{ item.path }}.gz"
        format: gz
        remove: true
      loop: "{{ cleanup_candidates.files | selectattr('path', 'match', '.*\\.log$') | list }}"
      register: compressed

    - name: Re-check disk usage
      # GNU df; prints a header line followed by the Use% value for /
      ansible.builtin.command: df --output=pcent /
      register: df_after
      changed_when: false

    - name: Parse the Use% value
      # an integer comparison also handles single-digit and 100% readings
      ansible.builtin.set_fact:
        disk_pct: "{{ df_after.stdout_lines[-1] | trim | replace('%', '') | int }}"

    - name: Resolve incident
      servicenow.itsm.incident:
        number: "{{ incident_number }}"
        state: resolved
        close_code: "Solved (Permanently)"
        close_notes: |
          Auto-remediated by AAP job {{ tower_job_id }}.
          Compressed {{ compressed.results | length }} log files.
          Disk usage after cleanup:
          {{ df_after.stdout }}
      delegate_to: localhost
      run_once: true
      when: disk_pct | int < 80

    - name: Escalate if still full
      servicenow.itsm.incident:
        number: "{{ incident_number }}"
        state: in_progress
        urgency: high
        other:
          work_notes: |
            Auto-remediation insufficient. Disk still {{ disk_pct }}% full.
            Escalating to on-call.
      delegate_to: localhost
      run_once: true
      when: disk_pct | int >= 80
Key discipline points:
- Acknowledge first, work second: marks the incident as “we’re on it” so a human doesn’t simultaneously start working it
- Verify before claiming success: the playbook only resolves the incident if df shows usage dropped below 80%. Failed remediation does not auto-resolve.
- Escalate on incomplete remediation: incidents that the bot couldn’t fully fix get re-prioritised and routed to humans, not silently abandoned
This pattern collapses MTTR for known-shape incidents from 20-40 minutes (page → ack → triage → fix → resolve) to 30-90 seconds. For an organisation with ~50 such incidents per week, that’s a real and measurable reduction in toil.
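The "verify before claiming success" step hinges on reading the disk percentage correctly. A pattern such as `[0-7][0-9]%` only matches two-digit values, so a disk at 9% would wrongly be escalated; parsing the number and comparing integers avoids that. The sketch below assumes output in the style of GNU `df --output=pcent /`, and both function names are hypothetical.

```python
def parse_use_pct(df_output):
    """Extract the Use% value from `df --output=pcent /`-style output."""
    last_line = df_output.strip().splitlines()[-1]
    return int(last_line.strip().rstrip("%"))


def next_action(use_pct, threshold=80):
    """Resolve only when usage is below the threshold; otherwise escalate."""
    return "resolve" if use_pct < threshold else "escalate"
```

For example, `next_action(parse_use_pct("Use%\n 67%"))` yields `"resolve"`, while a reading of exactly 80% escalates.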
5.1 Closing the loop with problem records
Repeated incidents on the same CI within a window indicate a problem (in ITIL terms), not just incidents. A nice elaboration:
- name: "Problem detection: count disk-full incidents on this CI in the last 30 days"
  servicenow.itsm.incident_info:
    # Kept on one line on purpose: a folded scalar would insert spaces
    # between the clauses of the encoded query.
    sysparm_query: "cmdb_ci={{ cmdb_sys_id }}^opened_at>=javascript:gs.daysAgoStart(30)^short_descriptionLIKEdisk full"
  register: same_incidents
  delegate_to: localhost

- name: Open problem record if >3 incidents on same CI
  servicenow.itsm.problem:
    short_description: "Recurring disk full on {{ inventory_hostname }}"
    description: |
      {{ same_incidents.records | length }} disk-full incidents on this host in last 30 days.
      Auto-remediation working but treating symptom only.
      Likely cause: insufficient log rotation policy or runaway logging.
    impact: medium
    urgency: medium
    # cmdb_ci has no dedicated module option; pass it through `other`
    other:
      cmdb_ci: "{{ cmdb_sys_id }}"
  when: same_incidents.records | length > 3
  delegate_to: localhost
  run_once: true
Now the bot is not just fixing symptoms but flagging chronic root causes. Auditors love this. The “we automated remediation but never investigated the underlying problem” is one of the classic anti-patterns auditors look for, and this addresses it directly.
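The same threshold rule can be applied across every CI at once, for instance in a weekly report. This is a minimal sketch assuming incidents have already been fetched as dictionaries with `cmdb_ci` and a parsed `opened_at` datetime; `chronic_cis` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime, timedelta


def chronic_cis(incidents, now, window_days=30, threshold=3):
    """Return CIs with more than `threshold` incidents opened inside the window."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        incident["cmdb_ci"] for incident in incidents if incident["opened_at"] >= cutoff
    )
    return sorted(ci for ci, count in counts.items() if count > threshold)
```

Each CI returned is a candidate for a problem record; the strict "more than three" comparison matches the condition used in the playbook task.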
6. ChatOps: Slack & Teams as the human surface
The fourth pillar is making automation visible and approachable in chat. In a mature setup, an engineer types @kv-bot reboot prod-app-04 in #platform-ops and the bot:
- Recognises this is a production action
- Looks up prod-app-04 in the CMDB to find the responsible team
- Creates a standard CHG ticket
- Posts an interactive message in Slack/Teams: “🚨 Production reboot requested by @vinod for prod-app-04. Approve?”
- Routes the approval prompt to the on-call from the responsible team
- Once approved (in chat), runs the AAP job
- Streams progress back to the original thread
- Closes the CHG with the result
The Slack bot is itself an Ansible-driven service. The path:
Slack slash command / mention
→ Slack Events API webhook
→ AAP webhook receiver (or EDA webhook source)
→ AAP job template "ChatOps router"
→ Creates CHG, posts approval message, waits for response
→ On approve: runs target job template
→ On deny: posts denial reason
EDA’s ansible.eda.webhook source plugin is the entry point:
# rulebooks/chatops.yml
---
- name: ChatOps router
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000
        token: "{{ CHATOPS_WEBHOOK_TOKEN }}"
  rules:
    - name: Reboot command
      condition: |
        event.payload.command == "reboot"
        and event.payload.target is defined
        and event.payload.user_id is defined
      action:
        run_job_template:
          name: "ChatOps: Reboot"
          organization: Default  # run_job_template needs the organization
          job_args:
            extra_vars:
              chat_user: "{{ event.payload.user_id }}"
              chat_channel: "{{ event.payload.channel_id }}"
              chat_thread: "{{ event.payload.thread_ts }}"
              target_host: "{{ event.payload.target }}"

    - name: Status command (read-only, no CHG)
      condition: event.payload.command == "status"
      action:
        run_job_template:
          name: "ChatOps: Status read-only"
          organization: Default
          job_args:
            extra_vars:
              chat_channel: "{{ event.payload.channel_id }}"
              chat_thread: "{{ event.payload.thread_ts }}"
              target_host: "{{ event.payload.target }}"
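Something has to turn the chat text into the `command` / `target` payload the rulebook conditions test. The sketch below shows one way to do that with a strict pattern, so that anything that is not exactly "bot, verb, hostname" is rejected before it reaches the webhook. The literal `@kv-bot` prefix is an assumption for readability (real Slack events encode mentions as user IDs), and `parse_command` is a hypothetical helper.

```python
import re

# One verb and one hostname-like target; anything else is rejected
MENTION = re.compile(r"^@kv-bot\s+(?P<command>[a-z]+)\s+(?P<target>[A-Za-z0-9.-]+)\s*$")


def parse_command(text, user_id, channel_id, thread_ts):
    """Build the webhook payload for a chat message, or None if it is malformed."""
    match = MENTION.match(text.strip())
    if match is None:
        return None
    return {
        "command": match.group("command"),
        "target": match.group("target"),
        "user_id": user_id,
        "channel_id": channel_id,
        "thread_ts": thread_ts,
    }
```

Rejecting unexpected input at this boundary means the target that ends up in `extra_vars` can only ever look like a hostname.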
The “ChatOps: Reboot” job template runs the following playbook:
---
- name: ChatOps reboot orchestrator
  hosts: localhost
  gather_facts: false
  vars:
    # evaluated lazily, once `ci` has been captured below
    is_prod: "{{ ci.u_environment | lower == 'production' }}"
    unauthorised: "{{ (is_prod | bool) and chat_user not in approved_chatops_users }}"
  tasks:
    - name: Verify target exists in CMDB
      servicenow.itsm.configuration_item_info:
        sys_class_name: cmdb_ci_server
        sysparm_query: "name={{ target_host }}"
        # return display values so reference fields read as plain strings
        sysparm_display_value: "true"
      register: ci_lookup

    - name: Fail if target unknown
      ansible.builtin.fail:
        msg: "Host '{{ target_host }}' not found in CMDB."
      when: ci_lookup.records | length == 0

    - name: Capture CI metadata
      ansible.builtin.set_fact:
        ci: "{{ ci_lookup.records[0] }}"

    - name: Tell an unauthorised requester why the request was refused
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          channel: "{{ chat_channel }}"
          thread_ts: "{{ chat_thread }}"
          text: >
            ❌ <@{{ chat_user }}> is not authorised to reboot
            {{ target_host }} ({{ ci.u_environment }}).
            Please ask {{ ci.support_group }} to file a CHG.
      when: unauthorised | bool

    - name: Stop here if the requester is not authorised
      ansible.builtin.fail:
        msg: "{{ chat_user }} is not authorised to reboot {{ target_host }}."
      when: unauthorised | bool

    - name: Create standard CHG
      servicenow.itsm.change_request:
        type: standard
        template: "Standard - Server Reboot"
        short_description: "ChatOps reboot {{ target_host }}"
        # chat_user_email is assumed to be resolved from the chat user upstream
        requested_by: "{{ chat_user_email }}"
        assignment_group: "{{ ci.support_group }}"
        state: scheduled
        # fields without a dedicated module option are passed through `other`
        other:
          cmdb_ci: "{{ ci.sys_id }}"
          # 15-minute window, 'YYYY-MM-DD HH:MM:SS' in UTC (assumed instance time zone)
          start_date: "{{ now(utc=true).strftime('%Y-%m-%d %H:%M:%S') }}"
          end_date: "{{ '%Y-%m-%d %H:%M:%S' | strftime((now().timestamp() | int) + 15 * 60, utc=true) }}"
      register: chg

    - name: Post Slack message with approve/deny buttons
      community.general.slack:
        token: "{{ slack_bot_token }}"
        channel: "{{ chat_channel }}"
        thread_id: "{{ chat_thread }}"
        attachments:
          - text: >
              <@{{ chat_user }}> requested reboot of *{{ target_host }}*.
              CHG {{ chg.record.number }} created. Approve?
            color: warning
            # NOTE: a link button only opens the URL in a browser. Launching a
            # job needs an authenticated POST, so a real bot would use
            # interactive buttons that call back to the webhook receiver.
            actions:
              - type: button
                text: ✅ Approve
                url: "https://aap.example.com/api/v2/job_templates/42/launch/?chg={{ chg.record.number }}&approve=true"
                style: primary
              - type: button
                text: ❌ Deny
                url: "https://aap.example.com/api/v2/job_templates/42/launch/?chg={{ chg.record.number }}&approve=false"
                style: danger
      when: is_prod | bool

    - name: Auto-approve and reboot for non-prod
      ansible.builtin.uri:
        url: "https://aap.example.com/api/v2/job_templates/43/launch/"
        method: POST
        body_format: json
        body:
          extra_vars:
            change_request_number: "{{ chg.record.number }}"
            target_host: "{{ target_host }}"
            chat_thread: "{{ chat_thread }}"
            chat_channel: "{{ chat_channel }}"
        headers:
          Authorization: "Bearer {{ aap_oauth_token }}"
        status_code: 201  # a successful launch answers 201 Created
      when: not (is_prod | bool)
What this gives operators:
- Self-service for non-prod: instant reboot, audit trail still recorded as CHG
- Approval-gated for prod: bot creates CHG, posts buttons, waits — no overrides
- Routing by ownership: the approval prompt goes to the right team automatically (read from CMDB)
- All in chat: engineer never leaves Slack, but every action lands in ServiceNow
The Teams equivalent uses adaptive cards with Action.Submit buttons that POST to the AAP webhook receiver. The pattern is identical; only the rendering primitive changes.
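As a rough illustration of the Teams side, the approval prompt can be built as an Adaptive Card payload with two Action.Submit buttons. The card layout, the schema version, and the `data` keys (`chg`, `approve`) are assumptions of this sketch; the receiver would need to agree on the same keys.

```python
def approval_card(requester, target, chg_number):
    """Build an Adaptive Card dict asking for approval of a reboot."""

    def submit(title, approve):
        # Action.Submit sends `data` back to the bot when the button is clicked
        return {
            "type": "Action.Submit",
            "title": title,
            "data": {"chg": chg_number, "approve": approve},
        }

    return {
        "type": "AdaptiveCard",
        "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
        "version": "1.4",
        "body": [
            {
                "type": "TextBlock",
                "wrap": True,
                "text": f"{requester} requested reboot of {target}. {chg_number} created. Approve?",
            }
        ],
        "actions": [submit("Approve", True), submit("Deny", False)],
    }
```

Carrying the CHG number in the button data means the approval click can be tied back to the change record without the bot keeping its own state.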
6.1 Read-only commands deserve their own pattern
Commands like @kv-bot status prod-app-04 should never create a CHG, never require approval, and should run as fast as possible. These are queries, not changes. The “ChatOps: Status read-only” job template uses a credential with read-only access to hosts and posts:
prod-app-04 (Linux RHEL 9.4, prod, payments_api)
Uptime: 47 days
Load: 0.32 / 0.41 / 0.38
Memory: 14.2 GB / 32 GB used
Disk /: 67%
Last patched: 2026-05-14
CHG history (30d): 4 changes, last CHG0098765 (2026-06-19)
This single message replaces five separate ServiceNow tab clicks. Engineers will thank you.
7. Failure modes and how to handle them
A few failure modes that will happen in production. Plan for them now, not at 4am.
| Failure | Symptom | Mitigation |
|---|---|---|
| ServiceNow API down | All jobs fail at CHG validation | Fail-closed is correct here. Have a documented break-glass: a separate AAP credential that bypasses CHG for a 4-hour incident window, requires SecOps approval, and writes an INC retroactively |
| ServiceNow API rate-limited | Random job failures with HTTP 429 | Configure retry-with-backoff on all servicenow.itsm.* tasks: until: result is succeeded; retries: 5; delay: 30 |
| CMDB inventory drift | Hosts missing from inventory | Schedule daily “CMDB hygiene” reports comparing AAP inventory against actual host responses; alert when drift > 5% |
| EDA rulebook crashes | Incidents pile up unhandled | Run two EDA replicas behind a load balancer; alert if rulebook activation status != “running” for > 5 min |
| Slack bot deleted from channel | ChatOps approvals silently lost | Bot must respond to its own @channel reload command and post weekly “I’m alive” health checks |
| Standard change template misconfigured | Bot creates CHGs that auto-fail | Lock standard change templates behind code review in ServiceNow Update Sets, and validate them in a UAT instance before promotion |
| ServiceNow OAuth token expired | All jobs fail with 401 | AAP credential injector should fetch fresh tokens via the client_credentials grant; rotate every 24h |
| Approver out of office | Production CHGs sit blocked | ServiceNow CAB rules should fall back to a backup approver group; document this in the runbook |
| Bot posts message but AAP webhook is down | Approval click → silent failure | Webhook receivers must respond within 3s with an ACK; the actual job runs async, with a Slack thread update on completion |
| Engineer types wrong CHG number | Job correctly fails — but engineer doesn’t know why | Slack bot’s failure messages must include a clickable ServiceNow link to the CHG state |
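For callers outside Ansible (the Slack bot, the nightly report), the HTTP 429 mitigation from the table can be sketched as a small wrapper. Note the difference from the task keywords: `retries` and `delay` on an Ansible task wait a fixed interval, whereas this sketch doubles the wait each time. `RateLimited` and `call_with_backoff` are hypothetical names; the caller is assumed to raise `RateLimited` when the API answers 429.

```python
import time


class RateLimited(Exception):
    """Raised by the wrapped function when the API answers HTTP 429."""


def call_with_backoff(fn, retries=5, delay=30, sleep=time.sleep):
    """Call fn(); on RateLimited wait delay, 2*delay, 4*delay... then retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries: surface the failure to the caller
            sleep(delay * 2 ** attempt)
```

The `sleep` parameter exists so the wrapper can be tested without actually waiting.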
Two non-obvious lessons from running this in production:
Lesson 1 — the “approval fatigue” trap. If you make every change require Slack approval, on-calls start clicking ✅ without reading. The fix: tier your operations. Read-only → no approval. Standard non-prod changes → no approval, just notification. Standard prod changes → approval but with a 2-line summary in the message. Non-standard prod changes → approval + link to the change record + 30-second cooling-off period before the button works. This last one prevents accidental clicks.
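The tiering in Lesson 1 is simple enough to encode as a lookup the bot consults before posting anything. This is a sketch under one stated assumption: the text does not say how non-standard, non-production changes are handled, so they are treated here like standard non-production changes. The function and tier names are hypothetical.

```python
COOLING_OFF_SECONDS = 30  # delay before the approve button works (top tier only)


def approval_tier(read_only, production, standard):
    """Map an operation to the approval tier described in Lesson 1."""
    if read_only:
        return "no_approval"
    if not production:
        # Assumption: non-standard non-prod is treated like standard non-prod
        return "notify_only"
    if standard:
        return "approve_with_summary"
    return "approve_with_link_and_cooling_off"
```

Keeping the policy in one function makes it reviewable: changing who needs approval becomes a one-line diff instead of a scattered set of conditions.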
Lesson 2 — never let the bot become the bottleneck. Your Slack bot will go down at the worst possible time. There must always be a manual escape hatch: an AAP UI URL that any authorised engineer can open and run the job from. Teams that build “Slack-only” automation get held hostage by their bot. Make Slack a convenience layer, not a single point of failure.
8. Evidence trail and audit-readiness
The end-to-end evidence chain for any production change should look like this when an auditor asks:
Slack message #platform-ops 2026-06-22T14:03:11Z
→ @vinod typed "@kv-bot patch prod-db-01"
→ AAP webhook received (request_id: r-7a82b4)
→ ChatOps router job 84291 (template "ChatOps: Patch")
→ CMDB lookup confirmed prod-db-01 (sys_id: abc123)
→ CHG0102847 created (standard change, template "Patch Linux")
→ Slack approve/deny posted in thread, message_ts: 1718978591.0034
→ Approver @lina clicked Approve at 14:04:22Z
→ AAP job 84292 launched (template "Patch Linux Standard")
→ Pre-flight CHG validation: PASSED
→ Inventory: prod-db-01 (single-host)
→ Tasks executed: 47, changed: 12
→ Post-flight: evidence bundle uploaded to s3://kv-evidence/2026/06/22/job-84292.tar.gz
→ CHG0102847 transitioned to Review
→ CHG0102847 closed-successful at 14:11:44Z
→ Slack thread updated: "✅ Done in 7m 22s"
Every step has a timestamp, an actor, and a system of record. That is the chain auditors want to see, and once the wiring is in place it is produced automatically for every change. Quarterly audit prep collapses from a week of evidence-gathering to a 30-minute query.
A useful nightly compliance report:
- name: "Nightly compliance report: CHG-to-job linkage"
  hosts: localhost
  gather_facts: false
  vars:
    # 24 hours ago in UTC; computed with now() because facts are not gathered
    since: "{{ '%Y-%m-%dT%H:%M:%SZ' | strftime((now().timestamp() | int) - 24 * 3600, utc=true) }}"
  tasks:
    - name: Get all AAP jobs from last 24h
      # only the first page of results is inspected here
      ansible.builtin.uri:
        url: "https://aap.example.com/api/v2/jobs/?finished__gte={{ since }}"
        headers:
          Authorization: "Bearer {{ aap_oauth_token }}"
      register: aap_jobs

    - name: Find jobs that ran without a CHG number
      ansible.builtin.set_fact:
        # the template name lives under summary_fields; job_template is an ID
        non_compliant_jobs: >-
          {{ aap_jobs.json.results
             | rejectattr('extra_vars', 'search', 'change_request_number')
             | rejectattr('summary_fields.job_template.name', 'in', read_only_templates)
             | list }}

    - name: Open INC for non-compliant runs
      servicenow.itsm.incident:
        short_description: "Non-compliant AAP job: {{ item.name }}"
        description: "Job {{ item.id }} ran without a CHG reference. Investigate."
        impact: medium
        urgency: medium
        other:
          category: governance
      loop: "{{ non_compliant_jobs }}"
This loop catches automation that escaped the gate. In a healthy environment, the report runs nightly and finds zero offenders for months at a time. The day it finds one, you get a real signal.
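The same filter can be expressed outside Ansible, which is handy for testing it against saved API responses. The sketch assumes job records shaped like the AAP jobs API output, where `extra_vars` is a JSON-encoded string and the template name sits under `summary_fields`; `non_compliant_jobs` is a hypothetical helper name.

```python
import json


def non_compliant_jobs(jobs, read_only_templates):
    """Return IDs of jobs that changed something but carry no CHG number."""
    offenders = []
    for job in jobs:
        template = job["summary_fields"]["job_template"]["name"]
        if template in read_only_templates:
            continue  # queries never need a change record
        extra_vars = json.loads(job.get("extra_vars") or "{}")
        # An empty or missing change_request_number counts as non-compliant
        if not extra_vars.get("change_request_number"):
            offenders.append(job["id"])
    return offenders
```

Parsing `extra_vars` instead of searching it as text also catches jobs where the key is present but empty.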
9. The minimum viable maturity ladder
If you’re starting from scratch, this is the order I recommend:
- Week 1-2: Stand up the servicenow.itsm collection in AAP, configure the OAuth credential, run a manual change_request_info against an existing CHG. Prove connectivity.
- Week 3-4: Build the CHG-gate pre-flight playbook. Apply it to one low-risk job template. Run a real change through it.
- Month 2: Add the post-flight closure block. Now your one job template is fully gated and self-closing.
- Month 3: Roll the gate out to all production-touching job templates. Resist exceptions. Track the percentage of prod jobs that go through the gate; it should be 100% within a quarter.
- Month 4: Stand up CMDB-as-inventory for non-production. Prove it works. Fix CMDB hygiene problems as they surface.
- Month 5-6: Graduate CMDB-as-inventory to prod once hygiene metrics are clean.
- Month 7-8: Build the first EDA remediation rulebook. Pick the simplest, highest-volume incident shape (disk full is the canonical choice). Measure MTTR before and after.
- Month 9-10: Roll out ChatOps for read-only commands. No approvals, no CHGs needed — instant value, near-zero risk.
- Month 11-12: Roll out ChatOps for standard changes. Now you have full bidirectional integration.
Trying to do all of this in one quarter is a known failure mode. The teams that succeed do it incrementally, with each step proving value before the next is started.
10. Where this fits in the broader Tier 5 picture
The compliance lesson (D1) gave you the what — STIG, CIS, OpenSCAP, signed evidence. The DR lesson (D2) gave you the when-it-all-goes-wrong response. This lesson gives you the day-to-day governance fabric — the wiring that ensures every routine change is authorised, observed, and recorded.
The remaining specialist lessons fill in the rest of the operational picture: backup automation (D8), database migrations (D9), and the observability capstone (D10) that ties metrics, logs, traces, and AAP events into a single Grafana view of “is automation healthy?”
When ITSM, ChatOps, compliance, DR, and observability are all in place, you have built what regulators call a demonstrably-controlled automation environment — one where every change is authorised, observed, recorded, reversible, and reviewable. That is the destination of this whole course. ITSM integration is the connective tissue that makes the other pieces auditable, and ChatOps is the human-shaped surface that keeps engineers actually using the system rather than working around it.