Engineering Incident Response: Runbooks, Tabletop Exercises, and Cloud Forensics

Most incident response programs are a PDF that nobody opens during the incident. The first time anyone reads the plan is at 02:00 with the CISO on the bridge, and by then the runbook’s gaps are already costing you lost evidence and a missed RTO. Incident response is an engineering capability, not a document: it has interfaces (severity, roles, comms), idempotent procedures (runbooks), state preservation (forensics), and a test harness (tabletops). This guide builds it that way.

I assume you have detection in place — Defender XDR, Microsoft Sentinel, sign-in and audit logging flowing somewhere queryable. This is about what happens after the alert: how you declare, contain without destroying evidence, acquire forensic artifacts in cloud and identity systems, and prove the whole thing works before a real adversary tests it for you.

1. The IR lifecycle as a state machine

The NIST SP 800-61 lifecycle — preparation, detection and analysis, containment/eradication/recovery, post-incident — is not a waterfall. It is a loop with back-edges, and treating it as a state machine is what keeps a response disciplined under pressure.

Phase	Goal	Exit criterion
Preparation	Tooling, access, runbooks, retainers exist	Tabletop passed in the last quarter
Detection & analysis	Confirm true positive, set severity	Incident declared with scope and owner
Containment	Stop spread; preserve evidence	Blast radius bounded; artifacts acquired
Eradication	Remove adversary access and persistence	No live attacker access; root cause known
Recovery	Restore service; monitor for re-entry	Service healthy; heightened monitoring on
Post-incident	Learn; close readiness gaps	Action items tracked to closure

The single most expensive mistake is collapsing containment and eradication. If you rebuild the host before you snapshot it, you have destroyed the evidence that tells you how they got in — and you will see them again next week through the same unpatched door.

The back-edges matter: eradication routinely sends you back to analysis when a new persistence mechanism surfaces, and recovery sends you back to containment when monitoring shows re-entry. Build the runbooks to support that, not a linear checklist.
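
One way to make the loop concrete is to encode the phases and their edges as data. The following is a minimal Python sketch, not a prescribed implementation: the `Phase` and `Incident` names are hypothetical, the forward edges follow the table above, and the two back-edges are the ones just described.

```python
from enum import Enum


class Phase(Enum):
    ANALYSIS = 1
    CONTAINMENT = 2
    ERADICATION = 3
    RECOVERY = 4
    POST_INCIDENT = 5


# Forward edges follow the lifecycle table (preparation is left out of this
# sketch because an incident object only exists once analysis starts).
FORWARD = {
    Phase.ANALYSIS: Phase.CONTAINMENT,
    Phase.CONTAINMENT: Phase.ERADICATION,
    Phase.ERADICATION: Phase.RECOVERY,
    Phase.RECOVERY: Phase.POST_INCIDENT,
}

# Back-edges: new persistence found, or re-entry seen during recovery.
BACK = {
    Phase.ERADICATION: Phase.ANALYSIS,
    Phase.RECOVERY: Phase.CONTAINMENT,
}


class Incident:
    def __init__(self, ticket):
        self.ticket = ticket
        self.phase = Phase.ANALYSIS
        self.history = [(Phase.ANALYSIS, "declared")]

    def advance(self, exit_criterion_met):
        """Move one phase forward, but only if the exit criterion is met."""
        if self.phase not in FORWARD:
            raise ValueError(f"{self.phase.name} has no forward edge")
        if not exit_criterion_met:
            raise ValueError(f"exit criterion for {self.phase.name} not met")
        self.phase = FORWARD[self.phase]
        self.history.append((self.phase, "exit criterion met"))

    def fall_back(self, reason):
        """Follow a back-edge and record why, so the timeline shows the loop."""
        if self.phase not in BACK:
            raise ValueError(f"{self.phase.name} has no back-edge")
        self.phase = BACK[self.phase]
        self.history.append((self.phase, reason))
```

Because `advance` only ever moves one step, skipping from containment straight to recovery is impossible by construction, and every back-edge leaves a reason in `history` for the scribe.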

2. Severity, roles, and the escalation matrix

Before any scenario runbook, you need the chassis every runbook plugs into: a severity definition, named roles, and an escalation/communications matrix. Ambiguity here is what turns a Sev2 into a Sev1 at 04:00 because nobody knew they were allowed to wake the legal lead.

Define severity on impact and scope, not on the tool that fired:

Sev	Definition	Response	Comms cadence
Sev1	Confirmed breach with data exfil, ransomware, or domain/tenant compromise	Immediate, 24x7 bridge	30 min
Sev2	Confirmed compromise, contained blast radius	Business hours + on-call	2 h
Sev3	Suspected compromise, single account/host	On-call triage	Daily
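
The table above can be expressed as a small classification function so that severity comes from impact and scope rather than from the alert source. This is a sketch under assumptions: the `Assessment` fields are hypothetical names for the table's conditions, and the handling of a confirmed but not yet contained compromise is a cautious guess, since the table has no row for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    confirmed: bool                    # analysis confirmed a real compromise
    data_exfiltrated: bool
    ransomware: bool
    tenant_or_domain_compromise: bool
    contained: bool                    # blast radius is bounded


# Comms cadence per severity, copied from the table.
CADENCE = {"Sev1": "30 min", "Sev2": "2 h", "Sev3": "Daily"}


def classify(a):
    """Map impact and scope to a severity label."""
    if not a.confirmed:
        return "Sev3"  # suspected only
    if a.data_exfiltrated or a.ransomware or a.tenant_or_domain_compromise:
        return "Sev1"
    if a.contained:
        return "Sev2"
    # Assumption: confirmed but unbounded is escalated until scope is known.
    return "Sev1"
```

Note that nothing in the function refers to which tool raised the alert; the inputs are findings from analysis.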

Roles are functions, not titles — one person can hold several in a small org, but every function must be owned:

Incident Commander (IC): owns the response, makes the call, holds the timeline. Not the deepest technical expert — the decision-maker.
Investigation lead: drives analysis, hunting, and forensics.
Operations/containment lead: executes isolation, credential resets, blocks.
Comms lead: internal updates, exec briefing, and — if it escalates — regulatory and customer notification drafting.
Scribe: maintains the contemporaneous log. Non-negotiable; this becomes the legal record.

The escalation matrix is data, not prose. Keep it as code so it is versioned and queryable:

# ir-escalation-matrix.yaml — versioned in the IR repo
severity_thresholds:
  sev1_triggers:
    - "ransomware encryption observed"
    - "confirmed data exfiltration"
    - "Global Administrator or domain compromise"
roles:
  incident_commander:
    primary:   { name: "A. Rao",   phone: "+1-555-0101", pager: "ic-primary" }
    secondary: { name: "M. Singh", phone: "+1-555-0102", pager: "ic-backup" }
  legal:
    primary:   { name: "Counsel",  phone: "+1-555-0150", notify_at_sev: 1 }
notification:
  bridge_url: "https://contoso.zoom.us/j/ir-bridge"
  ticket_prefix: "SEC-IR"
  out_of_band_channel: "Signal group 'IR-Core'"   # assume primary comms are compromised

The out-of-band channel is deliberate. In a BEC or ransomware incident you must assume email and Teams/Slack are compromised or monitored by the adversary. Decide your fallback comms now, not while the attacker reads your response plan in real time.
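
To illustrate what "queryable" can mean in practice, here is a minimal Python sketch that answers "who do we notify at this severity?" from the matrix. It uses a plain dict that mirrors part of the YAML above instead of a YAML parser, the `contacts_for` helper is hypothetical, and it assumes `notify_at_sev: 1` means "notify at Sev1 or more severe" and that a role without that key is notified at every severity.

```python
MATRIX = {
    "roles": {
        "incident_commander": {
            "primary": {"name": "A. Rao", "phone": "+1-555-0101", "pager": "ic-primary"},
            "secondary": {"name": "M. Singh", "phone": "+1-555-0102", "pager": "ic-backup"},
        },
        "legal": {
            "primary": {"name": "Counsel", "phone": "+1-555-0150", "notify_at_sev": 1},
        },
    },
    "notification": {"out_of_band_channel": "Signal group 'IR-Core'"},
}

LOWEST_SEVERITY = 3  # Sev3 is the least severe level in the table


def contacts_for(matrix, sev):
    """Return (role, name, phone) for each primary contact to notify at `sev`."""
    contacts = []
    for role, slots in matrix["roles"].items():
        primary = slots["primary"]
        # Assumption: no notify_at_sev key means "notify at every severity".
        threshold = primary.get("notify_at_sev", LOWEST_SEVERITY)
        if threshold >= sev:
            contacts.append((role, primary["name"], primary["phone"]))
    return contacts
```

For a Sev1 this returns both the incident commander and legal; for a Sev2 only the incident commander. The same lookup can drive a paging script or a tabletop check that the matrix has no unowned role.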

3. Scenario runbooks: BEC, ransomware, credential compromise

A runbook is a procedure scoped to one scenario, written so a competent on-call engineer who is not a domain expert can execute it. Three scenarios cover the majority of real cloud incidents. Each follows the same skeleton: triage -> contain -> preserve -> eradicate -> recover, with explicit commands.

Business Email Compromise (BEC)

The hallmark is a mailbox with a malicious inbox rule (auto-forward, or move-to-RSS-to-hide-replies) and anomalous OAuth grants. Contain by revoking sessions and killing the rule before you reset, so the attacker cannot re-authenticate while you work.

# Connect with an admin account from a known-clean host
Connect-ExchangeOnline -UserPrincipalName ir-admin@contoso.com

# 1. Enumerate inbox rules on the affected mailbox (preserve output)
Get-InboxRule -Mailbox victim@contoso.com |
  Select-Object Name,Enabled,ForwardTo,RedirectTo,DeleteMessage,MoveToFolder |
  Export-Csv ./evidence/bec-inboxrules-victim.csv -NoTypeInformation

# 2. Disable (do NOT delete yet — keep as evidence) the malicious rule
Disable-InboxRule -Mailbox victim@contoso.com -Identity "ProcessReceipts"

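# 3. Record, then remove, any attacker-set forwarding at the mailbox level
Get-Mailbox victim@contoso.com | Select-Object Forwarding*,DeliverToMailboxAndForward |
  Export-Csv ./evidence/bec-forwarding-victim.csv -NoTypeInformation
Set-Mailbox victim@contoso.com -ForwardingSmtpAddress $null -ForwardingAddress $null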

Then revoke sessions and tokens in Entra ID so stolen refresh tokens are useless:

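# Assumes the Microsoft Graph PowerShell SDK; the scopes shown are a starting point
Connect-MgGraph -Scopes "User.ReadWrite.All","Directory.Read.All"

# Invalidate all refresh tokens / sign-in sessions for the user
Revoke-MgUserSignInSession -UserId victim@contoso.com

# Review consented OAuth apps; attacker-registered apps are a persistence vector
Get-MgUserOauth2PermissionGrant -UserId victim@contoso.com |
  Select-Object ClientId,ConsentType,Scope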

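Revoking the session without resetting the password leaves the credential valid; resetting the password without revoking sessions leaves existing refresh tokens alive for up to their lifetime. Always do both, in that order: revoke, then force a password reset with MFA re-registration. Access tokens that were already issued can remain valid until they expire (roughly an hour by default) unless the resource enforces Continuous Access Evaluation, so keep watching for activity in that window.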

Ransomware

Speed of containment beats elegance. The objective is to stop encryption from spreading laterally while preserving at least one infected host for forensics. Isolate at the network layer; do not power off the host, because that destroys volatile memory.

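# Defender for Endpoint: isolate the host but keep it forensically intact.
# Both isolation types keep the MDE sensor connected for investigation;
# "Selective" additionally leaves Outlook and Teams reachable, so "Full" is the tighter choice here.
# Prefer the portal action; the Defender for Endpoint API is shown for automation.
# Assumes $token holds a bearer token for the MDE API with Machine.Isolate permission.
$body = @{ Comment = "SEC-IR-1042 ransomware containment"; IsolationType = "Full" } | ConvertTo-Json
Invoke-RestMethod -Method Post `
  -Uri "https://api.securitycenter.microsoft.com/api/machines/$deviceId/isolate" `
  -Headers @{ Authorization = "Bearer $token" } `
  -Body $body -ContentType "application/json"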

For ransomware the runbook’s most important line is not a command — it is the decision tree: do not pay before legal and exec sign-off, do not restore from backups until you have validated the backups are clean and the persistence is removed, and confirm your immutable/offline backups exist before you touch the live environment. If recovery races eradication, you re-encrypt restored data.
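
The decision tree above can be written down as explicit gates so that nobody restores or pays on instinct. This is a minimal Python sketch; the function and parameter names are hypothetical, and each flag is assumed to be set by a human after the corresponding check, not inferred automatically.

```python
def restore_blockers(backups_validated_clean, persistence_removed,
                     immutable_backup_confirmed):
    """Return the reasons a restore must not start yet (empty list = go)."""
    blockers = []
    if not immutable_backup_confirmed:
        blockers.append("confirm immutable/offline backups exist")
    if not backups_validated_clean:
        blockers.append("validate that the backups are clean")
    if not persistence_removed:
        blockers.append("remove attacker persistence (eradication incomplete)")
    return blockers


def payment_may_be_considered(legal_signoff, exec_signoff):
    """Payment is off the table until both legal and exec have signed off."""
    return legal_signoff and exec_signoff
```

Returning the list of blockers rather than a bare boolean gives the IC something to read out on the bridge when someone asks why the restore has not started.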

Credential / identity compromise

A single compromised cloud identity can pivot to the whole tenant. Contain by disabling sign-in, revoking tokens, and rotating any secrets that identity could read.

# Entra: block sign-in immediately (az CLI, backed by Microsoft Graph)
az ad user update --id victim@contoso.com --account-enabled false

# Revoke refresh tokens for that principal; resolve the object ID for the Graph call
USER_ID=$(az ad user show --id victim@contoso.com --query id -o tsv)
az rest --method POST \
  --uri "https://graph.microsoft.com/v1.0/users/$USER_ID/revokeSignInSessions"

If the compromised principal is a service principal or had access to Key Vault, rotate every secret it could read — assume exfiltration. Disabling the identity does not invalidate secrets it already copied.

4. Containment that preserves evidence

The tension in every containment decision is stop the bleeding versus keep the evidence. Resolve it with three rules baked into every runbook.

Isolate, don’t destroy. Network isolation, sign-in block, and session revocation stop spread without erasing state. Powering off a VM discards memory; deleting a mailbox rule discards proof of the technique.
Snapshot before you change. For any compute you will remediate, take a disk snapshot and capture memory first (see section 5). For identity, export the audit/sign-in records for the principal before you reset.
Disable, then delete — later. Malicious artifacts (inbox rules, OAuth grants, rogue accounts) get disabled immediately and removed only after evidence is acquired and the IC approves.
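
The three rules can be enforced as a gate that every runbook action passes through. The sketch below is one possible shape, written in Python under stated assumptions: the action names and their grouping into three sets are hypothetical and would need to match your own runbooks.

```python
# Assumed classification of runbook actions; adjust to your own runbooks.
ALWAYS_ALLOWED = {"isolate_network", "block_sign_in", "revoke_sessions", "disable_artifact"}
NEEDS_EVIDENCE = {"reset_password", "rebuild_host"}
NEEDS_EVIDENCE_AND_IC = {"delete_artifact"}


class EvidenceGate:
    def __init__(self):
        self._preserved = {}  # target -> list of preserved artifact names

    def record_preserved(self, target, artifact):
        self._preserved.setdefault(target, []).append(artifact)

    def check(self, action, target, ic_approved=False):
        """Return (allowed, reason) for running `action` against `target`."""
        if action in ALWAYS_ALLOWED:
            return True, "non-destructive containment"
        has_evidence = bool(self._preserved.get(target))
        if action in NEEDS_EVIDENCE:
            if has_evidence:
                return True, "evidence preserved"
            return False, "preserve evidence first"
        if action in NEEDS_EVIDENCE_AND_IC:
            if not has_evidence:
                return False, "preserve evidence first"
            if not ic_approved:
                return False, "IC approval required"
            return True, "evidence preserved and IC approved"
        return False, "unknown action; classify it before use"
```

Unknown actions are denied by default, which forces new runbook steps to be classified as destructive or not before anyone can run them under pressure.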

In cloud and identity systems, evidence has a clock on it. Sign-in logs, Unified Audit Log, and many control-plane logs have finite retention. The first containment action for an identity incident should preserve logs, not mutate state:

// Preserve the full sign-in trail for the principal BEFORE remediation.
// Run in Sentinel / Log Analytics and export the result set as evidence.
SigninLogs
| where TimeGenerated > ago(30d)
| where UserPrincipalName == "victim@contoso.com"
| project TimeGenerated, IPAddress, Location, AppDisplayName,
          ResultType, ResultDescription, RiskLevelDuringSignIn,
          AuthenticationRequirement, DeviceDetail, UserAgent
| sort by TimeGenerated asc
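
Because this evidence is on a retention clock, it can help to compute how long you have before the earliest relevant record ages out. The following Python sketch shows the arithmetic; the function names are hypothetical, the dates and retention values in the usage note are made-up inputs, and it assumes retention is counted in whole days from the event date.

```python
from datetime import date, timedelta


def expiry_date(event_date, retention_days):
    """Date on which a record written on event_date ages out."""
    return event_date + timedelta(days=retention_days)


def days_left(event_date, retention_days, today):
    """Days remaining before the record ages out (negative = already gone)."""
    return (expiry_date(event_date, retention_days) - today).days


def most_urgent_first(sources, today):
    """Order {name: (earliest_event_date, retention_days)} by time remaining."""
    remaining = {
        name: days_left(event_date, retention, today)
        for name, (event_date, retention) in sources.items()
    }
    return sorted(remaining.items(), key=lambda item: item[1])
```

For example, with an earliest attacker event on 2026-03-05 and 90-day retention, `expiry_date` gives 2026-06-03, so on 2026-06-01 only two days remain to export it.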

Tag isolated resources so nobody in operations “helpfully” reboots or rebuilds your evidence:

az resource tag --ids "$VM_ID" \
  --tags "ir-status=quarantine-do-not-touch" "ir-ticket=SEC-IR-1042" --is-incremental

5. Forensic acquisition in the cloud

Cloud forensics changes the acquisition model: you rarely image a physical disk, but you do capture managed-disk snapshots, memory, and, critically, the control-plane logs that have no on-prem equivalent. Acquire in order of volatility: memory first (most perishable), then disk, then logs. The exception is any log source close to its retention limit, which should be exported in parallel rather than last.

Memory. Capture live memory from the running, isolated host before any disk operation. On Azure VMs you can run an acquisition tool through Run Command without an interactive logon; AVML is the standard Linux acquirer.

# Acquire Linux memory on the isolated VM via Run Command (no SSH needed)
az vm run-command invoke \
  --resource-group rg-prod --name vm-web-03 \
  --command-id RunShellScript \
  --scripts "cd /tmp && ./avml mem.lime && ls -l mem.lime"

Disk. Snapshot the OS and data disks. A snapshot is your forensically sound, point-in-time copy; treat the snapshot — not the live disk — as the evidence object, copy it into an isolated, access-restricted forensics resource group, and compute a hash for chain of custody.

# Identify the OS disk and create a full (non-incremental) snapshot.
# Snapshots are not immutable by default, so restrict who can touch rg-forensics.
DISK_ID=$(az vm show -g rg-prod -n vm-web-03 --query "storageProfile.osDisk.managedDisk.id" -o tsv)

az snapshot create \
  --resource-group rg-forensics \
  --name "vm-web-03-os-$(date -u +%Y%m%dT%H%M%SZ)" \
  --source "$DISK_ID" \
  --incremental false \
  --tags "ir-ticket=SEC-IR-1042" "chain-of-custody=acquired"

Chain of custody is what makes evidence admissible and your conclusions defensible. Record who acquired what, when (UTC), from where, and the hash of the artifact. A snapshot with no recorded acquirer, timestamp, and hash is data, not evidence.
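
As an illustration of the who, what, when, where, and hash fields, here is a minimal Python sketch that builds a custody record for a file-based artifact and later verifies it. The record layout and function names are hypothetical; the approach applies to exported logs and memory images, while a disk snapshot would need to be exported to a file before it can be hashed this way.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large memory images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path, acquirer, source, ticket):
    """Capture who acquired what, when (UTC), from where, and its hash."""
    path = Path(path)
    return {
        "ticket": ticket,
        "artifact": path.name,
        "source": source,
        "acquirer": acquirer,
        "acquired_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
    }


def verify(path, record):
    """True if the artifact still matches the hash recorded at acquisition."""
    return sha256_of(path) == record["sha256"]
```

Store the records somewhere other than alongside the artifacts, so that anyone who can alter an artifact cannot also rewrite its hash.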

Logs. The control-plane trail is uniquely cloud and uniquely perishable. Preserve the Unified Audit Log (Microsoft 365), Azure Activity Log, and resource diagnostic logs for the incident window before retention or an attacker’s cleanup erases them.

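# Microsoft 365 Unified Audit Log for the incident window
# -ResultSize 5000 is the per-call maximum; if exactly 5000 rows come back,
# page with -SessionId and -SessionCommand ReturnLargeSet rather than trusting the export.
Connect-ExchangeOnline -UserPrincipalName ir-admin@contoso.com
Search-UnifiedAuditLog -StartDate "2026-06-01" -EndDate "2026-06-08" `
  -UserIds "victim@contoso.com" -ResultSize 5000 |
  Export-Csv ./evidence/m365-ual-victim.csv -NoTypeInformation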

# Azure Activity Log (control-plane) for the affected subscription/window
# --max-events defaults to 50, which silently truncates; raise it to cover the window.
az monitor activity-log list \
  --start-time 2026-06-01T00:00:00Z --end-time 2026-06-08T00:00:00Z \
  --max-events 10000 \
  --query "[].{time:eventTimestamp,caller:caller,op:operationName.value,status:status.value,ip:claims.ipaddr}" \
  -o json > ./evidence/azure-activity-log.json

Hash everything you collect, immediately, and store hashes separately from the artifacts:

# Assumes mem.lime has already been copied off the VM into ./evidence
mkdir -p ./custody
sha256sum ./evidence/*.csv ./evidence/*.json ./evidence/mem.lime > ./custody/SHA256SUMS.txt

6. Investigating across the estate

With artifacts preserved, the investigation reconstructs the attacker timeline. KQL across Sentinel/Defender XDR is the workhorse because it joins identity, endpoint, and cloud signals in one place.

Start from the confirmed indicator and pivot outward. From a compromised account, find what else that IP touched — the classic lateral-movement pivot:

// From a known-bad IP, find every account and app it authenticated to
let badIp = "203.0.113.66";
SigninLogs
| where TimeGenerated > ago(14d)
| where IPAddress == badIp
| summarize attempts = count(),
            apps = make_set(AppDisplayName, 50),
            results = make_set(ResultType, 20)
        by UserPrincipalName
| sort by attempts desc

Look for the post-compromise persistence techniques attackers reach for in Entra: new credentials on apps, new federated trusts, role assignments:

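// Persistence: credentials/federation added to apps & service principals
AuditLogs
| where TimeGenerated > ago(14d)
| where OperationName in ("Add service principal credentials",
                          "Add federated identity credential",
                          "Add member to role")
    // The "Update application ... Certificates and secrets management" operation is
    // typically logged with an en dash, so match on the stable suffix instead.
    or OperationName has "Certificates and secrets management"
// Fall back to the app name when the change was made by a service principal
| extend Actor = coalesce(tostring(InitiatedBy.user.userPrincipalName),
                          tostring(InitiatedBy.app.displayName))
| extend Target = tostring(TargetResources[0].displayName)
| project TimeGenerated, Actor, OperationName, Target, Result
| sort by TimeGenerated asc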

On the endpoint side, Defender XDR advanced hunting reconstructs process lineage from the host you isolated:

// Process tree around the suspected initial-access process on the host
DeviceProcessEvents
| where Timestamp > ago(7d)
| where DeviceName startswith "vm-web-03"   // DeviceName is often the FQDN, so avoid an exact match
| project Timestamp, AccountName, FileName, ProcessCommandLine,
          InitiatingProcessFileName, InitiatingProcessCommandLine
| sort by Timestamp asc

The discipline is to build a single chronological timeline across all three sources, in the scribe’s log, with timestamps in UTC. That timeline is what you brief executives from, what scopes the breach for legal, and what drives the post-incident review.
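
A small helper can take the drudgery out of that merge. The following is a Python sketch under assumptions: it expects each source's events as ISO 8601 timestamps with an explicit offset plus a short summary, and the `build_timeline` name and input shape are hypothetical rather than any export format from Sentinel or Defender.

```python
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an ISO 8601 string and normalize it to UTC; naive values are rejected."""
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {timestamp}")
    return parsed.astimezone(timezone.utc)


def build_timeline(sources):
    """Merge {source_name: [(timestamp, summary), ...]} into one UTC-ordered list."""
    merged = []
    for source, events in sources.items():
        for timestamp, summary in events:
            merged.append((to_utc(timestamp), source, summary))
    # Sort on time first, then source name, so ties are ordered deterministically.
    return sorted(merged, key=lambda entry: (entry[0], entry[1]))
```

Rejecting timestamps without an offset is deliberate: a local time silently treated as UTC can reorder the attacker's actions and shift the exposure window that legal relies on.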

7. Running effective tabletop exercises

A runbook you have never executed is a hypothesis. Tabletops are how you test the hypothesis before production does. A good tabletop is a facilitated, scenario-driven walkthrough where participants make real decisions against injects, and the facilitator probes the seams.

Structure a 90-minute exercise:

Scenario brief (5 min): set the scene. “It is Friday 16:00. Finance reports an invoice paid to a changed bank account.”
Injects (60 min): the facilitator releases information in stages, forcing decisions. Each inject targets a specific runbook step or escalation boundary.
Hotwash (25 min): capture what worked, what broke, and concrete action items with owners.

Design injects to hit known weak points, not to make the team look good:

# bec-tabletop-injects.yaml
scenario: "BEC leading to wire fraud"
injects:
  - t_plus: "0m"
    information: "Finance: vendor invoice paid; vendor says they never sent it."
    decision_point: "Who declares? What severity? (tests section 2)"
  - t_plus: "15m"
    information: "Sign-in logs show CFO login from a foreign IP 3 days ago."
    decision_point: "Containment order: do you revoke sessions before or after evidence export? (tests sections 3-4)"
  - t_plus: "35m"
    information: "Legal asks: is this reportable? What is the exposure window?"
    decision_point: "Can comms lead produce a timeline? Is regulatory clock running? (tests comms + timeline)"
  - t_plus: "50m"
    information: "A second mailbox shows the same inbox rule."
    decision_point: "Does scope expand? Re-trigger detection sweep? (tests back-edge to analysis)"
measure:
  - "Time to declare"
  - "Time to first containment action"
  - "Was out-of-band comms used (email assumed compromised)?"
  - "Did anyone destroy evidence?"

The exercise has failed in a useful way if the team can’t find the runbook, doesn’t know who declares, or reaches for email when they should assume it’s compromised. Those are exactly the failures you want surfaced in a conference room, not on a real bridge. Treat a smooth tabletop with suspicion — your injects were too easy.
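
The `measure` list in the inject file can be scored mechanically from the facilitator's notes. This is a minimal Python sketch under assumptions: the observation kinds (`declared`, `containment`, `out_of_band`, `evidence_destroyed`) are a hypothetical vocabulary the facilitator would log, each with minutes elapsed since the scenario brief.

```python
def score(observations):
    """Summarize a tabletop from [(minutes_elapsed, kind), ...] observations."""
    def first(kind):
        times = [minute for minute, observed in observations if observed == kind]
        return min(times) if times else None

    return {
        "time_to_declare_min": first("declared"),
        "time_to_first_containment_min": first("containment"),
        "out_of_band_used": first("out_of_band") is not None,
        "evidence_destroyed": first("evidence_destroyed") is not None,
    }


def hotwash_findings(result):
    """Turn a score into candidate action items for the hotwash."""
    findings = []
    if result["time_to_declare_min"] is None:
        findings.append("nobody declared: fix the escalation matrix or training")
    if result["time_to_first_containment_min"] is None:
        findings.append("no containment action taken: walk the runbook")
    if not result["out_of_band_used"]:
        findings.append("team stayed on primary comms: rehearse the fallback channel")
    if result["evidence_destroyed"]:
        findings.append("evidence destroyed: add a preservation gate to the runbook")
    return findings
```

An empty findings list is the case to treat with suspicion, for the reason given above: the injects were probably too easy.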

Run these quarterly, rotate the scenario, and include the non-engineering roles (legal, comms, an exec) at least twice a year. The cross-functional muscle is the part that always atrophies.

8. Post-incident review and closing the gaps

The post-incident review (PIR) is blameless and mechanical: reconstruct the timeline, identify what worked and what didn’t, and convert findings into tracked action items. Run it within five business days while memory is fresh.

Anchor the PIR on metrics, because “we responded well” is not measurable:

Metric	Definition	Target signal
MTTD	Detection time minus actual compromise time	Shrinking quarter over quarter
MTTA	Acknowledge time minus detection	Inside on-call SLA
MTTC	First effective containment minus detection	Minutes for Sev1
MTTR	Recovery minus detection	Within stated RTO
Dwell time	Compromise to detection for a single incident (the per-incident value that MTTD averages)	The number that exposes detection gaps
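
The metrics in the table can be computed directly from the timestamps in the scribe's log. The following is a Python sketch, with the `IncidentTimes` fields as assumed names for the five moments the definitions refer to, and results reported in minutes.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class IncidentTimes:
    compromised: datetime
    detected: datetime
    acknowledged: datetime
    contained: datetime
    recovered: datetime


def minutes(start, end):
    return (end - start).total_seconds() / 60


def intervals(t):
    """Per-incident intervals, each following the table's definition."""
    return {
        "dwell": minutes(t.compromised, t.detected),
        "tta": minutes(t.detected, t.acknowledged),
        "ttc": minutes(t.detected, t.contained),
        "ttr": minutes(t.detected, t.recovered),
    }


def pir_metrics(incidents):
    """Average the per-incident intervals into the PIR metrics (in minutes)."""
    rows = [intervals(t) for t in incidents]
    return {
        "MTTD": mean(row["dwell"] for row in rows),
        "MTTA": mean(row["tta"] for row in rows),
        "MTTC": mean(row["ttc"] for row in rows),
        "MTTR": mean(row["ttr"] for row in rows),
        "max_dwell": max(row["dwell"] for row in rows),
    }
```

Reporting the worst dwell time next to the mean keeps one long-running compromise from hiding behind several quick detections.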

Every finding becomes an action item with an owner and a due date — and critically, action items feed back into the three earlier capabilities: a missing log source becomes a preparation task, a slow containment becomes a runbook fix, a confused escalation becomes a matrix update and the subject of the next tabletop. The PIR that produces a document and no tracked work has changed nothing.

Verify

Validate readiness in two halves: confirm the capability exists, and confirm it works under exercise.

# 1. Forensics tooling and access are ready BEFORE you need them
az group show -n rg-forensics --query "name" -o tsv          # isolated RG exists
az role assignment list --resource-group rg-forensics \
  --query "[?roleDefinitionName=='Disk Snapshot Contributor'].principalName" -o tsv

# 2. IR admin accounts can act (break-glass / dedicated, MFA-strong)
az ad user show --id ir-admin@contoso.com --query "accountEnabled"

# 3. Log retention covers your investigation window (>= 90 days recommended)
az monitor log-analytics workspace show -g rg-sec -n sentinel-ws \
  --query "retentionInDays"

Then prove the program end-to-end:

Declare: in the last tabletop, was an incident declared with severity and owner inside the SLA? If not, the matrix or training is broken.
Contain without loss: did the team revoke/isolate before mutating evidence, and snapshot before remediating? Replay the recording and check.
Acquire: can you produce a hashed memory image, a disk snapshot, and exported control-plane logs for a test host, with chain-of-custody fields populated?
Reconstruct: can the investigation lead produce a single UTC timeline joining identity, endpoint, and cloud signals?
Close the loop: are last quarter’s PIR action items tracked to closure, or did they evaporate?

If any of these fails in the calm of an exercise, it will fail catastrophically in a real incident. That is the entire point of verifying now.

Enterprise scenario

A mid-size SaaS company suffered a BEC incident in their finance team: an attacker phished a controller, set a mailbox rule to hide replies, and redirected a six-figure wire. The SOC detected the anomalous sign-in promptly and the on-call engineer did exactly what felt right — opened the mailbox, deleted the malicious inbox rule, and reset the user’s password. Contained in twelve minutes. Everyone relaxed.

The constraint surfaced four days later when legal needed to determine the regulatory notification window and the exact scope of access. The engineer had deleted the inbox rule rather than disabling and exporting it, and had reset the password without first exporting the sign-in trail. Crucially, nobody had preserved the Unified Audit Log, which in their tenant retained records for only 90 days and had already begun aging out the earliest attacker activity. They could not prove when the rule was created, which other mailboxes the same IP had touched, or whether mail had been exfiltrated. The containment was technically perfect and forensically catastrophic: they had to assume worst-case exposure and over-notify, at significant cost.

The fix was an “evidence-first” guardrail wired directly into the BEC runbook as an executable pre-containment step, so preservation happens before any mutating action — and a tabletop inject built specifically to test that engineers reach for it under time pressure:

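# bec-preserve.ps1: MUST run before any remediation action.
# Assumes an active Connect-ExchangeOnline session from a known-clean host.
param(
  [Parameter(Mandatory)][string]$Upn,
  [Parameter(Mandatory)][string]$Ticket
)
$out = "./evidence/$Ticket"; New-Item -ItemType Directory -Force -Path $out | Out-Null

# 1. Snapshot inbox rules as-is (proof of technique)
Get-InboxRule -Mailbox $Upn | Export-Clixml "$out/inboxrules.xml"
# 2. Export the Unified Audit Log for the user (perishable!). The 90-day window
#    matches the retention in this scenario; 5000 is the per-call cap, so page
#    with -SessionId/-SessionCommand if the result comes back full.
Search-UnifiedAuditLog -StartDate (Get-Date).AddDays(-90) -EndDate (Get-Date) `
  -UserIds $Upn -ResultSize 5000 | Export-Csv "$out/ual.csv" -NoTypeInformation
# 3. Hash the evidence for chain of custody. Collect hashes first so the
#    manifest is not created while the wildcard is still being enumerated.
$hashes = Get-FileHash "$out/*" -Algorithm SHA256
$hashes | Export-Csv "$out/SHA256SUMS.csv" -NoTypeInformation
Write-Host "Evidence preserved for $Ticket. Remediation is now unblocked."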

The lesson generalizes to every runbook in this guide: containment speed without an evidence-preservation gate is a liability, not a win. The cheapest place to discover that gap is a tabletop where the only cost is a slightly awkward hotwash — not a post-incident review where the cost is measured in regulatory exposure and a six-figure wire you can no longer fully scope.


Written by Vinod
