Azure Monitor End to End: Data Collection Rules, Workbooks, Metric/Log Alerts, and Action Group Automation

Most “we have Azure Monitor” stories fall apart under two questions: what exactly are you collecting, and what is it costing you per GB per month? The answer is usually a shrug, a legacy MMA agent nobody dares remove, and a Log Analytics bill that grew 40% last quarter with no new workloads. The modern stack fixes this by making collection an explicit, versioned artifact – a Data Collection Rule (DCR) – and by letting you drop or reshape data before you pay to ingest it. This piece builds the whole chain as code: DCRs and endpoints feeding the Azure Monitor Agent, ingestion-time transformations that trim cost, a workspace and table design that matches your retention economics, workbooks that turn KQL into something an on-call engineer can actually read, metric and log alerts that scale across resources, and action groups that hand off to automation instead of paging a human at 3am.

Azure Monitor is not one product; it is a pipeline with a collection plane (DCRs, DCEs, the Azure Monitor Agent, the Logs Ingestion API) that decides what lands in a table and in what shape, and a signal plane (metric alerts, scheduled-query alerts, action groups, alert processing rules) that decides what humans and automation hear about. Teams that conflate the two end up alerting on raw firehose data they should have filtered at ingestion. The rule that governs the whole article: filter low, alert high. Shape and trim at the collection plane where it is free to discard; raise signals at the top from the clean, cheap stream that remains.

By the end you will stop guessing about cost and noise. You will know exactly which DCR feeds which table, what each ingestion transform drops, which table plan each high-volume stream sits on, which alerts fire per-entity instead of as one flapping storm, and which action group fans out to which automation. You will be able to walk into a six-figure Log Analytics bill and turn “what do we collect and what does it cost” from a quarterly autopsy into a reviewed pull request.

Mental model. Azure Monitor has a collection plane and a signal plane. The collection plane (DCRs, DCEs, the Azure Monitor Agent, the Logs Ingestion API) decides what lands in a table and in what shape. The signal plane (metric alerts, scheduled-query alerts, action groups, alert processing rules) decides what humans and automation hear about. Teams that conflate the two end up alerting on raw firehose data they should have filtered at ingestion. Filter low, alert high.

What problem this solves

The pain is concrete and it shows up on an invoice. A Log Analytics workspace bills primarily on ingested GB, and the default posture of every agent-based estate is collect everything. After the migration from the legacy agent, nobody revisits the config, so a chatty Syslog stream at Information level, a verbose application diagnostic table, and health-probe 200-OK lines pour into the same expensive Analytics-plan tables you use for alerting – and almost none of it is ever queried. The bill grows with log volume, not with business value, and it grows silently.

What breaks without an explicit collection plane: you cannot answer “what are we collecting” without reverse-engineering a retired agent’s config; you cannot reduce cost without fear of deleting a signal you will wish you had during an incident; and you cannot scope access, because a workspace-per-team sprawl was the only access-control tool anyone reached for. On the signal side, the failure mode is an alert storm – one rule that fires a single giant alert that flaps as 400 VMs flicker, paging a human at 3am for something that should have auto-remediated or been suppressed during a patch window.

Who hits this: every team past the toy stage. Platform teams running fleets of VMs and AKS, security teams with regulatory retention requirements they cannot violate, and on-call engineers drowning in undifferentiated pages. The fix is almost never “turn off monitoring” – it is “make collection a versioned artifact, transform before you pay, place each table on the plan its query pattern deserves, and let the processing layer decide who gets paged and when.”

To frame the whole pipeline before the deep dive, here is every stage, what it decides, and the single highest-leverage control at that stage:

| Stage | Plane | What it decides | Highest-leverage control | Cost / noise lever |
|---|---|---|---|---|
| Azure Monitor Agent + DCR | Collection | What to read from a machine and where to send it | The DCR dataSources/dataFlows | Collect less at source |
| Ingestion transformation | Collection | The shape of each row before billing | transformKql on a data flow | Drop rows/columns pre-bill |
| Workspace + table plan | Collection | Where data rests and how it is queried | Analytics / Basic / Auxiliary plan | Per-GB ingest + retention |
| Workbook | Signal (read) | How a human reads the data | Parameter cascade + template | None (read path) |
| Metric alert | Signal | Fast threshold on a pre-aggregated stream | Multi-resource scope + dynamic threshold | Near-zero per rule |
| Scheduled-query alert | Signal | Threshold on a KQL result over logs | Dimensions + evaluation frequency | Query cost + noise |
| Action group | Signal | Who/what is notified | Reused group + common alert schema | Notification fan-out |
| Alert processing rule | Signal | Who hears it and when | Suppression + add-action-group | Page volume |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the basics of a Log Analytics workspace (the store and query engine behind Azure Monitor Logs), be comfortable reading and writing KQL (where, summarize, project, bin), and know how to run az in Cloud Shell and read JSON output. Familiarity with ARM/Bicep helps, because every artifact here is a first-class ARM resource. You do not need prior alerting experience – we build it from the metric/log split up.

This sits at the centre of the Observability track. It assumes the telemetry fundamentals from Azure Monitor & Application Insights for Observability and goes one layer deeper than the survey in Azure Monitor Deep Dive: Every Option. It pairs with Azure Monitor with Managed Prometheus & Managed Grafana for AKS when your metrics live in Prometheus, and it is the upstream of every troubleshooting playbook – the data this pipeline collects is what you query in Troubleshooting Azure App Service: 502/503 Errors, Cold Starts & Restart Loops and Azure Diagnostics with Network Watcher, Resource Health & KQL.

A quick map of who owns what across the pipeline, so you route a change to the right team:

| Layer | What lives here | Who usually owns it | Failure class it causes if wrong |
|---|---|---|---|
| DCR / DCE / AMA | What is collected, in what shape | Platform / observability | Missing data, or runaway ingest cost |
| Ingestion transform | Row/column shape pre-bill | Platform + data owner | Dropped TimeGenerated, schema mismatch |
| Workspace / table plan | Where data rests, query model | Platform / FinOps | Alert can’t read archived table |
| Workbook | How humans read it | Each app/SRE team | Hardcoded step ignores parameters |
| Metric / log alert | When a signal fires | App + SRE team | Alert storm or missed incident |
| Action group | Who is notified | On-call / SRE lead | Wrong team paged; no fan-out |
| Alert processing rule | Who hears it, when | Platform / on-call lead | Patch window pages 400 VMs |
| Automation (LA/Func) | What happens without a human | App + platform | Duplicate remediation, retries |

Core concepts

Six mental models make every later section obvious.

Collection is a versioned artifact, not an agent setting. The legacy Log Analytics agent (MMA/OMS) is retired as of 31 August 2024; the replacement is the Azure Monitor Agent (AMA), and AMA does nothing on its own – it is driven entirely by Data Collection Rules associated to a machine. A DCR is an ARM resource declaring dataSources (what to read), destinations (where to send), and dataFlows (which source maps to which destination table, plus an optional transform). The DCR is the unit of intent: change collection by changing a reviewed file, for one machine or ten thousand.

The endpoint is the ingestion door. A Data Collection Endpoint (DCE) is the entry point for ingestion. You need an explicit DCE for the Logs Ingestion API (custom logs pushed over REST) and for Private Link ingestion via an Azure Monitor Private Link Scope (AMPLS). For plain AMA collection over public networking a DCE is optional, but standardising on one keeps Private Link a config change rather than a re-architecture.

Transform before you pay. A transformation is a KQL snippet attached to a dataFlow that runs at ingestion time, before data is billed and stored. It operates on a pipeline variable named source, can drop rows and columns and redact PII, and must project columns matching the destination table schema. Because billing is on ingested volume, a transform that drops 60% of chatty Information-level syslog is a permanent line-item reduction at zero query-experience cost.

The table plan is the cost dial. Azure Monitor Logs offers three table plans: Analytics (hot, full KQL), Basic (high-volume, KQL subset, per-query billed), and Auxiliary (very high-volume, lowest ingest, limited KQL). Combined with two retention dials – interactive retention (queryable without restore) and total retention (interactive + cheap archive) – the plan is how you match each table to its real query pattern instead of paying Analytics rates for logs you read twice a year.

Metrics and logs are different planes with different physics. Metric alerts evaluate pre-aggregated, near-real-time numeric streams: cheap, fast, stateful, and capable of multi-resource scope (one rule over every VM in a scope) and dynamic thresholds (a learned band instead of a fixed number). Scheduled-query (log) alerts run KQL on a schedule against the logs store: more expressive, but they pay query latency and must be tamed with dimensions so they fire per-entity rather than as one storm.

The processing layer decouples firing from paging. An action group is the reusable fan-out target (email, SMS, push, webhook, Logic App, Function, Runbook, ITSM). An alert processing rule sits between alerts and action groups and, without touching a single alert rule, can suppress notifications on a schedule (a maintenance window) or add an action group across a scope. This is how a noisy estate stays humane: rules stay armed; processing decides who hears them and when.
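
To make the per-entity firing described for scheduled-query alerts concrete, here is a minimal Python sketch of what a dimension does conceptually: it groups the query result by one column and evaluates the threshold per group, so each breaching machine produces its own alert. This is an illustration only, not how Azure Monitor evaluates rules; the function name `evaluate_by_dimension` and the sample rows are hypothetical.

```python
from collections import defaultdict


def evaluate_by_dimension(rows, dimension, value_key, threshold):
    """Return one alert per dimension value whose average breaches the threshold."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[value_key])

    alerts = []
    for entity in sorted(groups):
        average = sum(groups[entity]) / len(groups[entity])
        if average > threshold:
            alerts.append({"entity": entity, "average": average})
    return alerts


# Hypothetical sample: three machines, only two of them are hot.
rows = [
    {"Computer": "vm-app-01", "cpu": 95.0},
    {"Computer": "vm-app-01", "cpu": 91.0},
    {"Computer": "vm-app-02", "cpu": 20.0},
    {"Computer": "vm-app-03", "cpu": 88.0},
]

alerts = evaluate_by_dimension(rows, "Computer", "cpu", threshold=85.0)
for alert in alerts:
    print(alert["entity"], alert["average"])
```

Without the grouping step the same data would yield a single fleet-wide average and, at best, one undifferentiated alert.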

The vocabulary in one table

Before the deep sections, pin every moving part side by side. The glossary at the end repeats these for lookup:

| Concept | One-line definition | Plane | Why it matters |
|---|---|---|---|
| Azure Monitor Agent (AMA) | The agent that reads machine telemetry | Collection | Does nothing without a DCR |
| Data Collection Rule (DCR) | ARM resource: sources → flows → destinations | Collection | The unit of collection intent |
| Data Collection Endpoint (DCE) | Ingestion entry point | Collection | Required for Logs Ingestion API / Private Link |
| Transformation (transformKql) | KQL on a data flow, runs at ingestion | Collection | Drops/reshapes rows before billing |
| Logs Ingestion API | REST push of custom logs | Collection | Needs a DCE + custom-log DCR |
| Table plan | Analytics / Basic / Auxiliary | Collection | Cost-vs-queryability per table |
| Interactive retention | Days queryable without restore | Collection | Alerts can only read this |
| Total retention | Interactive + cheap archive | Collection | Long-term keep for compliance |
| Workbook | Parameterised JSON report of KQL steps | Signal (read) | Reusable, not a screenshot |
| Metric alert | Threshold on a pre-aggregated metric | Signal | Fast, stateful, multi-resource |
| Dynamic threshold | ML-learned band over metric history | Signal | For metrics whose normal varies |
| Scheduled-query rule | KQL alert on a schedule | Signal | For signals only in logs |
| Dimension | Grouping column splitting one rule into many alerts | Signal | Per-entity firing, no storm |
| Action group | Reusable notification + action bundle | Signal | One place for routing |
| Alert processing rule | Suppress / add-AG across a scope | Signal | Maintenance windows, central AG |
| Common alert schema | One JSON envelope for all alert types | Signal | One parser downstream |

Data Collection Rules, endpoints, and the Azure Monitor Agent

The DCR is the heart of the collection plane. It declares three things and arms the agent only once you associate it to a resource. The shape of a DCR maps directly to those three declarations, and each carries choices worth enumerating.

What a DCR declares, field by field

| DCR element | What it is | Example value | Default / note | Gotcha |
|---|---|---|---|---|
| location | Region of the DCR resource | eastus | Must match (or pair with) the workspace region | Cross-region association has rules |
| dataCollectionEndpointId | Linked DCE | a DCE resource id | Optional for AMA-public | Required for custom logs / Private Link |
| dataSources | What to read | perf counters, syslog, events | At least one required | Stream names are fixed (Microsoft-Perf) |
| destinations | Where to send | one or more Log Analytics workspaces | At least one required | Can fan one source to many dests |
| dataFlows | Source → destination map | Microsoft-Perf → la-platform | Each flow maps streams to dests | Carries the optional transformKql |
| streamDeclarations | Custom-log column schema | Custom-AppLogs | Only for Logs Ingestion API | Must match the table schema |

The built-in dataSources you reach for most, with their stream names and the dial that controls volume:

| Data source | Stream name | Volume dial | Lands in table | When to use |
|---|---|---|---|---|
| Performance counters | Microsoft-Perf | samplingFrequencyInSeconds, counter list | Perf | VM CPU/mem/disk metrics in logs |
| Syslog (Linux) | Microsoft-Syslog | facilityNames, logLevels | Syslog | Linux daemon/auth logs |
| Windows event logs | Microsoft-Event | XPath query per channel | Event | Windows system/application/security |
| Windows perf counters | Microsoft-Perf | counter specifiers | Perf | Windows performance |
| IIS logs | Microsoft-W3CIISLog | log directory | W3CIISLog | Web server access logs |
| Text / JSON logs | Custom-* (declared) | file glob + transform | custom *_CL | App log files on disk |
| Custom (REST) | Custom-* (declared) | the Logs Ingestion API | custom *_CL | Push from anywhere |

Register the providers and create the endpoint first:

az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.OperationalInsights

# Data Collection Endpoint -- the ingestion entry point
az monitor data-collection endpoint create \
  --name dce-platform-eastus \
  --resource-group rg-observability \
  --location eastus \
  --public-network-access Enabled

When you actually need a DCE – the decision table that saves a re-architecture:

| If you are… | Do you need a DCE? | Why |
|---|---|---|
| Collecting perf/syslog via AMA over public network | Optional | AMA can ingest without an explicit DCE |
| Pushing custom logs via the Logs Ingestion API | Required | The API endpoint is the DCE |
| Ingesting over Private Link (AMPLS) | Required | DCE is the private ingestion target |
| Standardising for future Private Link | Recommended | Make it a config change later, not a redesign |
| Collecting from a region with no workspace | Region-paired | DCE/DCR region rules apply |

Authoring the DCR

This DCR collects a focused set of Linux perf counters and syslog, sending them to a workspace. Note the fixed stream names:

{
  "location": "eastus",
  "properties": {
    "dataCollectionEndpointId": "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/dataCollectionEndpoints/dce-platform-eastus",
    "dataSources": {
      "performanceCounters": [
        {
          "name": "perf-core",
          "streams": ["Microsoft-Perf"],
          "samplingFrequencyInSeconds": 60,
          "counterSpecifiers": [
            "\\Processor(_Total)\\% Processor Time",
            "\\Memory\\Available MBytes",
            "\\LogicalDisk(_Total)\\% Free Space"
          ]
        }
      ],
      "syslog": [
        {
          "name": "syslog-warn",
          "streams": ["Microsoft-Syslog"],
          "facilityNames": ["auth", "daemon", "syslog"],
          "logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"]
        }
      ]
    },
    "destinations": {
      "logAnalytics": [
        {
          "name": "la-platform",
          "workspaceResourceId": "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-platform"
        }
      ]
    },
    "dataFlows": [
      { "streams": ["Microsoft-Perf"],   "destinations": ["la-platform"] },
      { "streams": ["Microsoft-Syslog"], "destinations": ["la-platform"] }
    ]
  }
}
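
Because the DCR is meant to be a reviewed file, a small pre-merge check can catch wiring mistakes before anything is deployed. The sketch below is one possible lint, written in Python and assuming only the JSON shape shown above: it verifies that every stream in dataFlows is emitted by some data source (or declared in streamDeclarations) and that every destination name exists. The helper `lint_dcr` is hypothetical and is not part of any Azure SDK or CLI.

```python
def lint_dcr(dcr):
    """Return a list of human-readable problems found in a DCR document."""
    props = dcr.get("properties", {})
    problems = []

    # Streams that some data source (or custom stream declaration) produces.
    known_streams = set(props.get("streamDeclarations", {}))
    for sources in props.get("dataSources", {}).values():
        for source in sources:
            known_streams.update(source.get("streams", []))

    # Names of every declared destination, whatever its type.
    known_destinations = set()
    for entries in props.get("destinations", {}).values():
        for entry in entries if isinstance(entries, list) else [entries]:
            known_destinations.add(entry["name"])

    for index, flow in enumerate(props.get("dataFlows", [])):
        for stream in flow.get("streams", []):
            if stream not in known_streams:
                problems.append(f"dataFlows[{index}]: no data source emits {stream}")
        for name in flow.get("destinations", []):
            if name not in known_destinations:
                problems.append(f"dataFlows[{index}]: unknown destination {name}")
    return problems


# A trimmed-down version of the rule above, plus one deliberately broken flow.
dcr = {
    "properties": {
        "dataSources": {
            "syslog": [{"name": "syslog-warn", "streams": ["Microsoft-Syslog"]}]
        },
        "destinations": {
            "logAnalytics": [{"name": "la-platform", "workspaceResourceId": "<id>"}]
        },
        "dataFlows": [
            {"streams": ["Microsoft-Syslog"], "destinations": ["la-platform"]},
            {"streams": ["Microsoft-Perf"], "destinations": ["la-platfrom"]},
        ],
    }
}

for problem in lint_dcr(dcr):
    print(problem)
```

The second flow is flagged twice: no source emits Microsoft-Perf in the trimmed rule, and the destination name is misspelled.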

Create it and associate machines. Association is what actually arms the agent:

az monitor data-collection rule create \
  --name dcr-linux-platform \
  --resource-group rg-observability \
  --location eastus \
  --rule-file ./dcr-linux-platform.json

# Bind the DCR to a VM (repeat per machine, or drive via Policy at scale)
az monitor data-collection rule association create \
  --name dcra-vm-app-01 \
  --rule-id "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/dataCollectionRules/dcr-linux-platform" \
  --resource "/subscriptions/<sub>/resourceGroups/rg-fleet/providers/Microsoft.Compute/virtualMachines/vm-app-01"

At fleet scale you never run that association by hand. Use the built-in Azure Policy initiative that installs AMA and creates the association from a DCR parameter, assigned at a management-group scope with a DeployIfNotExists effect and a managed identity for remediation. One machine or ten thousand, the same DCR is the unit of intent. For a known set of machines, the same association can be declared in Bicep (vm and dcr are symbolic names of resources declared elsewhere in the template):

resource dcrAssoc 'Microsoft.Insights/dataCollectionRuleAssociations@2023-03-11' = {
  name: 'dcra-vm-app-01'
  scope: vm
  properties: {
    dataCollectionRuleId: dcr.id
    description: 'Associate platform DCR to the VM'
  }
}

The ways to associate, ranked by scale:

| Association method | Scale | Effort | When to use |
|---|---|---|---|
| az ... association create | 1 machine | Manual | Spot fixes, labs |
| Bicep dataCollectionRuleAssociations | A known set | IaC | Per-workload modules |
| Azure Policy DeployIfNotExists | Whole MG/sub | One assignment | Fleets; the default at scale |
| Arc-enabled servers + Policy | Hybrid/on-prem | Arc onboarding | Non-Azure machines |

The most common reasons data does not land after association – the symptom→cause→confirm→fix table for the collection plane:

| Symptom | Likely cause | Confirm | Fix |
|---|---|---|---|
| No rows in Perf/Syslog after association | AMA not installed on the VM | az vm extension list for AzureMonitorLinuxAgent | Install AMA (extension or Policy remediation) |
| Some machines collect, others don’t | Association missing on those VMs | List associations per resource | Add association / run Policy remediation |
| Rows arrive but timestamps are flat | Transform dropped TimeGenerated | Inspect transformKql project list | Keep TimeGenerated in the projection |
| Custom logs rejected | DCE missing or schema mismatch | Ingestion API 4xx; streamDeclarations | Add DCE; match declared columns |
| Private network, no data | No AMPLS / DCE not private | AMPLS scoping; DCE network access | Add DCE to AMPLS; set private access |
| Wrong table populated | dataFlows maps stream to wrong dest | Read dataFlows mapping | Correct stream→destination map |

Ingestion-time transformations and KQL filtering for cost control

This is the highest-leverage feature in the whole platform and the one most teams have never enabled. A transformation is a KQL snippet attached to a dataFlow that runs at ingestion time, before data is billed and stored. You can drop rows, drop columns, redact PII, and project new computed fields. Because billing is on ingested volume, a transformation that filters 60% of chatty Information-level syslog is a direct, permanent line-item reduction.

The transform operates on a pipeline variable named source and must project the columns that match the destination table’s schema. Add a transformKql to the relevant data flow:

"dataFlows": [
  {
    "streams": ["Microsoft-Syslog"],
    "destinations": ["la-platform"],
    "transformKql": "source | where SeverityLevel != 'info' | where ProcessName !in ('CRON','sudo') | project TimeGenerated, Computer, Facility, SeverityLevel, SyslogMessage"
  }
]
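
Before shipping a filter like this, it can help to replay it offline against a sample of exported rows and see what survives. The following Python sketch mirrors the three steps of the transformKql above (two where clauses and a project) on plain dictionaries. It is a local approximation only: the real transform runs inside the ingestion pipeline, billing is by volume rather than row count, and the sample rows and helper names here are hypothetical.

```python
DROPPED_PROCESSES = {"CRON", "sudo"}
KEPT_COLUMNS = ("TimeGenerated", "Computer", "Facility", "SeverityLevel", "SyslogMessage")


def simulate_syslog_transform(rows):
    """Mirror the where/where/project steps of the transformKql above."""
    kept = []
    for row in rows:
        if row["SeverityLevel"] == "info":
            continue  # where SeverityLevel != 'info'
        if row["ProcessName"] in DROPPED_PROCESSES:
            continue  # where ProcessName !in ('CRON','sudo')
        kept.append({column: row[column] for column in KEPT_COLUMNS})  # project
    return kept


def make_row(severity, process, message):
    """Build a hypothetical Syslog row with only the columns used here."""
    return {
        "TimeGenerated": "2024-01-01T00:00:00Z",
        "Computer": "vm-app-01",
        "Facility": "daemon",
        "SeverityLevel": severity,
        "ProcessName": process,
        "SyslogMessage": message,
    }


sample = [
    make_row("info", "sshd", "session opened"),
    make_row("warning", "CRON", "job overran"),
    make_row("err", "nginx", "upstream timed out"),
    make_row("info", "systemd", "started unit"),
    make_row("warning", "kernel", "low memory"),
]

kept = simulate_syslog_transform(sample)
dropped_share = 1 - len(kept) / len(sample)
print(f"kept {len(kept)} of {len(sample)} rows, dropped {dropped_share:.0%}")
```

Note that TimeGenerated stays in the projection and ProcessName does not, which matches the project list in the flow.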

What you can do in a transform, and what it costs

| Transform operation | KQL pattern | Effect on bill | Risk |
|---|---|---|---|
| Drop rows | where SeverityLevel != 'info' | Lower (fewer rows) | Dropping a row you needed in an incident |
| Drop columns | project A, B, C (omit the rest) | Lower (narrower rows) | Omitting a column the table requires |
| Redact PII | extend Email = "[redacted]" | Neutral | Over-redaction loses forensic value |
| Compute a field | extend Severity = case(...) | Slightly higher per row | Logic bug mis-classifies severity |
| Parse free text | parse Message with ... | Neutral | Brittle parser on format drift |
| Route by _IsBillable shaping | filter then narrow | Lower | None if schema preserved |

A few rules that bite people:

The classic transform mistakes, as a confirm/fix table:

| Mistake | Symptom | Confirm | Fix |
|---|---|---|---|
| Dropped TimeGenerated | Flat time-series; all rows same time | `Syslog \| summarize min(TimeGenerated), max(TimeGenerated)` | Keep TimeGenerated in the projection |
| Schema mismatch | Column nulls or rows dropped | Compare project to table schema | Match the destination columns exactly |
| Filter too aggressive | A signal vanished from an incident | Diff row counts before/after | Loosen the where; keep the slice |
| Transform on wrong flow | No volume change | Check which dataFlows has it | Move transformKql to the chatty flow |
| Expensive extend per row | Ingest latency creeps | Watch ingestion latency | Simplify the computed field |

For custom logs over the Logs Ingestion API, the transform is even more powerful because you control the input shape. A common pattern is to send fat JSON and let the transform split it into a normal column and a DynamicJson blob, or to compute a severity from a free-text message:

source
| extend Severity = case(
    Message has_cs "ERROR", "Error",
    Message has_cs "WARN",  "Warning",
    "Information")
| where Severity != "Information"
| project TimeGenerated = todatetime(EventTime), Computer, Severity, Message

Cost rule of thumb. Filter at ingestion for volume you will never query (debug chatter, health-probe 200s). Use a cheaper table plan (next section) for volume you query rarely but must retain. Never solve a cost problem by turning off collection you will wish you had during an incident.

A rough sense of what each filter buys, so you target the fattest stream first:

| Stream pattern | Typical share of volume | Filter to apply | Expected reduction |
|---|---|---|---|
| Information/debug syslog | 40-60% | where SeverityLevel !in ('info','debug','notice') | Often >50% of syslog |
| Health-probe 200s in IIS/app logs | 10-30% | drop probe paths / 200 status | 10-30% of web logs |
| Chatty processes (cron, kubelet) | 5-20% | where ProcessName !in (...) | Removes recurring noise |
| Verbose app diagnostic columns | varies | project only needed columns | Narrows every row |
| Duplicate/redundant fields | small | drop in project | Marginal but free |
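
The arithmetic behind "target the fattest stream first" is simple enough to script. The Python sketch below multiplies each stream's daily volume by the fraction a filter would drop and ranks the results. Every number in it is a made-up placeholder, including the per-GB price, which is an assumption and not a quoted Azure rate; substitute figures from your own workspace usage data.

```python
def monthly_gb_saved(daily_gb, drop_fraction, days=30):
    """GB no longer ingested per month if drop_fraction of a stream is filtered."""
    return daily_gb * drop_fraction * days


def rank_streams(streams, price_per_gb):
    """Sort streams by monthly saving, largest first."""
    ranked = []
    for name, daily_gb, drop_fraction in streams:
        saved_gb = monthly_gb_saved(daily_gb, drop_fraction)
        ranked.append((name, saved_gb, saved_gb * price_per_gb))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical volumes and price; replace with figures from your own workspace.
streams = [
    ("iis-health-probes", 20.0, 0.25),
    ("syslog-info", 50.0, 0.5),
    ("app-verbose-columns", 8.0, 0.125),
]
PRICE_PER_GB = 2.50  # assumed, not a quoted Azure rate

for name, saved_gb, saved_cost in rank_streams(streams, PRICE_PER_GB):
    print(f"{name}: {saved_gb:.0f} GB/month, {saved_cost:.2f} per month")
```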

Log Analytics workspace design, tables, and table-level plans

Two workspace decisions dominate the bill: how many workspaces you run, and the table plan on each table. The modern guidance is few workspaces, many tables, per-table plans – one regional platform workspace per major boundary rather than a workspace per team, because cross-workspace KQL (workspace()/union) is awkward and access control is now solvable at the table and row level.

Few workspaces or many?

| Topology | Pro | Con | Use when |
|---|---|---|---|
| One workspace per team | Simple ownership/billing split | Cross-team KQL is painful; sprawl | Hard billing isolation is mandatory |
| One per region per boundary (recommended) | Easy union, central queries | Needs table/row RBAC to scope access | The default for most estates |
| One global workspace | Simplest queries | Data-residency and blast-radius concerns | Single-region, small estate |
| Per-environment (prod/non-prod) | Clean prod isolation | Duplicated config | Strong prod/non-prod separation |

Table plans, side by side

Azure Monitor Logs offers three table plans:

| Plan | Use for | Query | Ingest cost | Retention model |
|---|---|---|---|---|
| Analytics | Hot, frequently queried signals (alerts, dashboards) | Full KQL, fast | Highest | Interactive retention (up to long term) |
| Basic | High-volume, occasionally queried logs (verbose app/network logs) | KQL subset, per-query billed | Lower | Short interactive + long-term archive |
| Auxiliary | Very high-volume, low-fidelity (raw audit, verbose firewall) | Limited KQL, lowest ingest | Lowest | Long-term, cheapest ingest |

The capability differences that decide which plan a table can tolerate – read this before you move an alerting table to Basic:

| Capability | Analytics | Basic | Auxiliary |
|---|---|---|---|
| Full KQL (joins, all operators) | Yes | Subset | Limited |
| Source for alert rules | Yes | No (not for alerting) | No |
| Source for workbooks/dashboards | Yes | Limited | Limited |
| Per-query billing | No | Yes | Yes |
| Interactive retention max | Long | Short (then archive) | Short |
| Best for | Hot signals | Rarely-queried logs | Raw, cheap, kept |

Set retention with two dials: interactive retention (queryable without restore) and total retention (interactive + cheap long-term archive). Alert rules and dashboards must read from interactive retention; archived data needs a search job or restore first.

| Retention dial | What it controls | Lower bound | Upper bound | Watch-out |
|---|---|---|---|---|
| Interactive retention | Days you can query directly | days | long term | Alerts/workbooks need data inside this |
| Total retention | Interactive + archive | interactive | very long (years) | Archive needs restore/search job to query |
| Workspace default | Applies to tables without an override | configurable | | Per-table override beats the default |

# Create the workspace
az monitor log-analytics workspace create \
  --resource-group rg-observability \
  --workspace-name law-platform \
  --location eastus \
  --retention-time 90

# Move a chatty custom table to the Basic plan and set retention split
az monitor log-analytics workspace table update \
  --resource-group rg-observability \
  --workspace-name law-platform \
  --name AppVerbose_CL \
  --plan Basic \
  --retention-time 30 \
  --total-retention-time 365

Pair this with table-level RBAC so an app team sees its own *_CL tables but not the platform security tables, instead of minting a workspace per team just to scope access. A decision table for placing any new high-volume table:

| If the table is… | Query pattern | Place it on… | Retention split |
|---|---|---|---|
| Alert/dashboard source | Frequent, interactive | Analytics | Long interactive |
| Verbose app log, rare queries | Occasional, incident-only | Basic | 30d interactive / 365d total |
| Raw firewall/audit firehose | Almost never, kept for compliance | Auxiliary | Short interactive / years total |
| Regulated auth/audit events | Alerted + 1-year keep | Analytics | Interactive ≥ retention requirement |
| Health-probe noise | Never queried | (don’t ingest) | Drop in transform |
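
The placement rules above can also be written down as a function, which is handy in a review checklist or a CI check on new table definitions. The Python sketch below encodes the decision table and the two retention facts from this section (alerts read only interactive retention; archive is total minus interactive). The function names and the four query-pattern labels are hypothetical simplifications, not Azure terminology.

```python
def place_table(alert_or_dashboard_source, query_pattern):
    """Map a table's usage to a plan, following the decision table above.

    query_pattern is one of "frequent", "occasional", "rare", "never".
    """
    if alert_or_dashboard_source or query_pattern == "frequent":
        return "Analytics"
    if query_pattern == "occasional":
        return "Basic"
    if query_pattern == "rare":
        return "Auxiliary"
    if query_pattern == "never":
        return "drop in transform"
    raise ValueError(f"unknown query pattern: {query_pattern}")


def alert_can_read(lookback_days, interactive_days):
    """Alerts only see interactive retention, so the lookback must fit inside it."""
    return lookback_days <= interactive_days


def archive_days(interactive_days, total_days):
    """Days a row spends in long-term retention after the interactive window."""
    return max(total_days - interactive_days, 0)


print(place_table(True, "frequent"))
print(place_table(False, "occasional"))
print(place_table(False, "rare"))
print(place_table(False, "never"))
print(archive_days(30, 365))
print(alert_can_read(lookback_days=45, interactive_days=30))
```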

Workbooks: parameters, queries, and reusable visual templates

A workbook is a JSON template (an ARM resource of type Microsoft.Insights/workbooks) that combines parameters, KQL query steps, text, and visualisations. The feature that makes them reusable – rather than a screenshot with extra steps – is parameters: a parameter is itself usually a KQL query, and downstream steps interpolate it with {ParamName}.

Parameter types you actually use

| Parameter type | type value | Source | Interpolates as | Typical use |
|---|---|---|---|---|
| Time range | 4 | Picker | where TimeGenerated {TimeRange} | Top-of-workbook range |
| Resource picker | 5 | ARM query | resource ids | Scope to selected resources |
| Subscription | 6 | ARG/KQL | subscription ids | Cross-subscription scoping |
| Dropdown (query) | 2 | KQL summarize by | a value | Pick a Computer / app |
| Text | 1 | Free text | a string | Ad-hoc filter |
| Multi-value | 2 (multi) | KQL | comma list | “all of these machines” |

The pattern that scales: a top-of-workbook time-range parameter plus a resource/subscription picker, then every query references both. Here is the parameter-and-query skeleton inside the workbook items array:

{
  "type": 9,
  "content": {
    "parameters": [
      {
        "name": "TimeRange",
        "type": 4,
        "isRequired": true,
        "value": { "durationMs": 3600000 }
      },
      {
        "name": "Subscription",
        "type": 6,
        "query": "summarize by subscriptionId",
        "queryType": 1,
        "crossComponentResources": ["value::all"]
      }
    ]
  }
}

A query step that consumes them. Note {TimeRange} expands into a full where TimeGenerated ... clause and the time-brush feeds the chart automatically:

Perf
| where TimeGenerated {TimeRange}
| where CounterName == "% Processor Time" and InstanceName == "_Total"
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| render timechart
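
As a mental model of the {ParamName} substitution, the Python sketch below performs the same kind of text interpolation on a query string. It is not the workbook engine: the exact text that a time-range parameter expands to is an assumption here (rendered as a "> ago(...)" comparison), and `expand_time_range`, `interpolate`, and the Computer parameter are hypothetical names used only for illustration.

```python
import re


def expand_time_range(duration_ms):
    """Render a time-range parameter as a KQL comparison (assumed format)."""
    return f"> ago({duration_ms // 1000}s)"


def interpolate(query, params):
    """Replace each {Name} placeholder with its parameter value."""
    def substitute(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"unknown workbook parameter: {name}")
        return params[name]

    return re.sub(r"\{(\w+)\}", substitute, query)


query = (
    "Perf\n"
    "| where TimeGenerated {TimeRange}\n"
    "| where Computer == '{Computer}'"
)
params = {
    "TimeRange": expand_time_range(3600000),  # same durationMs as the skeleton above
    "Computer": "vm-app-01",                  # hypothetical dropdown selection
}

print(interpolate(query, params))
```

A step that never mentions {TimeRange} has nothing to substitute, which is why its chart ignores the picker.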

Visualisations and when to reach for each

| Visual | render / step type | Best for | Avoid when |
|---|---|---|---|
| Time chart | render timechart | Trends over time | Categorical comparison |
| Bar/column | render barchart | Top-N by category | Time on the x-axis |
| Grid (table) | grid step | Per-entity detail rows | Dense trend data |
| Tiles | tiles step | KPI headline numbers | Many categories |
| Stat / big number | the “1” visualization | A single SLO number | Distribution detail |
| Map | map step | Geo-distributed signal | Non-geographic data |

Two practices keep workbooks maintainable. First, pin parameter queryType and crossComponentResources so the same template works whether it is scoped to one resource or an entire subscription. Second, template it, then publish it as a shared workbook via Bicep so every team gets the same “service health” workbook rather than forking ten copies:

resource wb 'Microsoft.Insights/workbooks@2023-06-01' = {
  name: guid('platform-health-workbook')
  location: location
  kind: 'shared'
  properties: {
    displayName: 'Platform Health'
    category: 'workbook'
    sourceId: workspaceResourceId
    serializedData: loadTextContent('./workbooks/platform-health.json')
  }
}

The workbook mistakes that turn a “reusable template” back into a screenshot:

| Mistake | Symptom | Fix |
|---|---|---|
| Step ignores {TimeRange} | Chart never changes with the picker | Add where TimeGenerated {TimeRange} to the step |
| Hardcoded resource id | Template works in only one sub | Use a resource/subscription parameter |
| crossComponentResources unset | Scope picker has no effect | Set it on parameters and queries |
| Workbook saved per-team | Ten forks drift apart | Publish one shared/gallery template via Bicep |
| Heavy query on every load | Slow workbook | Narrow the default range; pre-aggregate |

Metric alerts, dynamic thresholds, and multi-resource scoping

Metric alerts evaluate platform metrics (or custom metrics) on a near-real-time, pre-aggregated stream – they are cheap, fast, and stateful. Two capabilities make them scale. Multi-resource scope lets one alert rule watch every VM in a resource group or subscription of the same type, so you author one rule instead of one-per-VM. Dynamic thresholds replace a hand-picked number with a machine-learned band over the metric’s history, which is the only sane choice for metrics whose “normal” varies by time of day.

The metric-alert setting matrix

| Setting | Values | Default | When to change | Trade-off / limit |
|---|---|---|---|---|
| Scope | single / multi-resource | single | Author one rule over a fleet | Multi-resource limited to same type/region |
| Aggregation type | avg / min / max / total / count | avg | Match the metric’s meaning | Wrong agg hides spikes |
| Operator | >, <, >=, <= | | Direction of the breach | |
| Threshold type | static / dynamic | static | Dynamic when normal varies | Dynamic needs history to learn |
| Window size | 1m-24h | 5m | Smooth noise vs react fast | Bigger window = slower to fire |
| Evaluation frequency | 1m-1h | 1m | Cost vs responsiveness | Too frequent = noisier |
| Sensitivity (dynamic) | low / medium / high | medium | High = tighter band | High = more false positives |
| Violations (dynamic) | N of M periods | | 4 of 4 cuts noise | 1 of 1 flaps |
| Severity | Sev0-Sev4 | Sev3 | Page-worthiness | Routing depends on it |
| Auto-mitigate | on / off | on | Stateful resolve | Off means manual close |

A static, multi-resource CPU alert over an entire resource group:

az monitor metrics alert create \
  --name "vm-cpu-high" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
  --target-resource-type "Microsoft.Compute/virtualMachines" \
  --target-resource-region eastus \
  --condition "avg Percentage CPU > 85" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"

For dynamic thresholds the condition uses the dynamic operator with a sensitivity and a violation count (4 violations out of 4 periods is far less noisy than 1 of 1):

az monitor metrics alert create \
  --name "vm-cpu-dynamic" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
  --target-resource-type "Microsoft.Compute/virtualMachines" \
  --target-resource-region eastus \
  --condition "avg Percentage CPU >< dynamic medium 4 of 4" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --action "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
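
To see why the violation count matters, the sketch below replays a series of per-period breach flags through an "N of the last M periods" gate. It is a toy model of the idea, not Azure Monitor's actual evaluation engine; `firesNofM`, `countFires`, and the sample series are hypothetical names and data.

```javascript
// Toy model: the alert is "firing" in a period when at least n of the
// last m evaluation periods breached the threshold band.
// This illustrates the N-of-M idea only; it is not Azure's algorithm.
function firesNofM(breaches, n, m) {
  return breaches.map((_, i) => {
    const recent = breaches.slice(Math.max(0, i - m + 1), i + 1);
    const violations = recent.filter(Boolean).length;
    return violations >= n;
  });
}

// Count how many times the state flips from quiet to firing.
function countFires(states) {
  let fires = 0;
  for (let i = 0; i < states.length; i++) {
    if (states[i] && (i === 0 || !states[i - 1])) fires++;
  }
  return fires;
}

// Hypothetical series: three isolated blips, then a sustained shift.
const breaches = [1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1].map(Boolean);

console.log("1 of 1 fires:", countFires(firesNofM(breaches, 1, 1)));
console.log("4 of 4 fires:", countFires(firesNofM(breaches, 4, 4)));
```

On this series the 1-of-1 gate fires four times (once per blip plus the real shift), while the 4-of-4 gate fires once, only after the shift has been sustained for four periods. The price is the delay before that single fire.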

Static vs dynamic, decided by the shape of the metric:

| If the metric… | Use | Why |
|---|---|---|
| Has a hard SLA limit (disk 90%) | Static | The line is a real contract |
| Has a clear daily/weekly rhythm | Dynamic | A fixed line either over- or under-fires |
| Is brand new (no history) | Static first | Dynamic needs weeks to learn |
| Is bursty but bounded | Dynamic + high violations | Rides spikes, catches sustained shifts |
| Is a count of rare events | Static (low threshold) | Dynamic band collapses near zero |

Auto-mitigation matters. Metric alerts are stateful: a fired alert auto-resolves when the condition clears (default behaviour), and the action group is notified of resolved as well as fired. Do not build alert logic that assumes you must manually close alerts – wire your downstream automation to handle the resolved signal too.

The standard severity ladder, so routing and suppression have a consistent contract:

| Severity | Meaning | Example | Routing |
|---|---|---|---|
| Sev0 | Critical, customer-impacting outage | Region down, all instances 5xx | Page on-call immediately |
| Sev1 | Severe, imminent impact | Capacity nearly exhausted | Page on-call |
| Sev2 | Error, degraded | One node CPU pinned | Ticket + notify |
| Sev3 | Warning | Approaching a threshold | Notify, business hours |
| Sev4 | Informational | A scale event happened | Log only |

Scheduled query (log) alerts and stateful alert processing

When the signal lives in logs rather than a metric – “more than 20 5xx responses from one pod in 5 minutes,” “a privileged role was assigned” – you need a scheduled query rule (Microsoft.Insights/scheduledQueryRules, API version 2023-12-01 and later, sometimes called Log Alerts v2). It runs KQL on a schedule, compares an aggregated result to a threshold, and fires.

The scheduled-query setting matrix

| Setting | Values | Default | When to change | Gotcha |
|---|---|---|---|---|
| Query | any KQL returning an aggregate | | Always | Must aggregate to a number per dimension |
| Threshold operator | >, >=, <, <=, = | | Direction of breach | |
| Window (--window-size) | 5m-2d | 5m | Match the signal’s burst length | Must cover data latency |
| Evaluation frequency | 1m-1d | 5m | Cost vs responsiveness | Set ≥ data latency |
| Dimensions | grouping columns | none | Per-entity firing | Each value = a separate alert |
| autoMitigate | true/false | true | Stateful resolve | False leaves alerts open |
| Number of violations | N of M | 1 of 1 | Cut flapping | Higher = slower to fire |
| Severity | Sev0-Sev4 | Sev3 | Page-worthiness | Drives routing |
| Mute actions (per rule) | duration | none | After-fire cooldown | Suppresses re-notify |

The two settings that separate a good log alert from an alert storm are stateful alerts (autoMitigate) and dimensions. Dimensions split one rule into one alert per value of a grouping column – so a rule grouped by Computer fires a separate, independently-resolving alert per machine, instead of one giant alert that flaps.

az monitor scheduled-query create \
  --name "syslog-error-burst" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.OperationalInsights/workspaces/law-platform" \
  --condition "count 'errs' > 20" \
  --condition-query errs='Syslog | where SeverityLevel in ("err","crit","alert","emerg")' \
  --dimension "Computer" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --auto-mitigate true \
  --action-groups "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Insights/actionGroups/ag-platform-oncall"
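
The effect of a dimension can be sketched in a few lines. The model below is a simplification for intuition, not how the service evaluates rules: `evaluate` and the sample rows are hypothetical, and the rows stand in for the query result inside one evaluation window.

```javascript
// Toy model of one scheduled-query evaluation. Rows stand in for the
// query result inside a single window; names and counts are made up.
function evaluate(rows, threshold, dimension) {
  const counts = new Map();
  for (const row of rows) {
    // Without a dimension every row lands in one fleet-wide bucket.
    const key = dimension ? row[dimension] : "(all)";
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // One alert per bucket whose count exceeds the threshold.
  return [...counts].filter(([, n]) => n > threshold).map(([key]) => key);
}

const rows = [
  ...Array(25).fill({ Computer: "vm-a" }),
  ...Array(3).fill({ Computer: "vm-b" }),
  ...Array(2).fill({ Computer: "vm-c" }),
];

console.log(evaluate(rows, 20, null));       // one fleet-wide bucket
console.log(evaluate(rows, 20, "Computer")); // one bucket per machine
```

With no dimension the 30 rows collapse into a single "(all)" alert that names nobody. Grouped by Computer, only vm-a breaches, and the noise from vm-b and vm-c no longer contributes to anyone's alert.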

Three principal-level rules for log alerts:

  1. Aggregate inside the query, not in your head. The rule compares a single aggregated number per dimension to the threshold. summarize count() by Computer, bin(TimeGenerated, 5m) keeps the evaluation deterministic.
  2. Keep evaluation-frequency >= the data latency. Log ingestion has minutes of latency; evaluating every 1 minute against data that arrives every 3 produces false negatives and duplicate fires. Match frequency to reality.
  3. Read from interactive retention only. Alert queries cannot reach archived (long-term) data without a restore. If a table is on the Basic/Auxiliary plan with short interactive retention, your alert window must fit inside it.
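
Rules 2 and 3 lend themselves to a pre-deployment check. The following is a minimal sketch of such a check, assuming you already know your typical ingestion latency and the table's interactive retention; `checkRule` and its field names are hypothetical, not part of any Azure SDK.

```javascript
// Hypothetical pre-deployment check for a log alert rule.
// Durations are in minutes unless noted; field names are illustrative.
function checkRule({ frequencyMin, windowMin, latencyMin, interactiveDays }) {
  const problems = [];
  if (frequencyMin < latencyMin) {
    problems.push("evaluation frequency is shorter than ingestion latency");
  }
  if (windowMin < latencyMin) {
    problems.push("window does not cover ingestion latency");
  }
  if (windowMin > interactiveDays * 24 * 60) {
    problems.push("window reaches past interactive retention");
  }
  return problems;
}

// A 1-minute evaluation against data that lags by 3 minutes is flagged.
console.log(checkRule({ frequencyMin: 1, windowMin: 5, latencyMin: 3, interactiveDays: 30 }));
// A 5-minute evaluation over the same data passes.
console.log(checkRule({ frequencyMin: 5, windowMin: 5, latencyMin: 3, interactiveDays: 30 }));
```

A check like this could run in the same pull-request pipeline that reviews the alert definitions, so a too-eager frequency is caught before it produces duplicate fires.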

Metric vs scheduled-query alert – pick the cheaper, faster plane whenever the signal exists there:

| Dimension | Metric alert | Scheduled-query alert |
|---|---|---|
| Data source | Pre-aggregated metric stream | KQL over the logs store |
| Latency to fire | Seconds to ~1 min | Minutes (ingest + eval) |
| Cost | Near-zero per rule | Query cost per evaluation |
| Expressiveness | Single metric + dims | Full KQL (joins, parsing) |
| Multi-resource | Native (one rule, many resources) | Via query scope |
| Per-entity firing | Dimensions on the metric | Dimensions on the query |
| Best for | CPU/mem/latency thresholds | “20 5xx from one pod”, audit events |

The log-alert failure modes you will actually hit:

| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| Duplicate fires every few minutes | evaluation-frequency < data latency | Compare ingest delay to frequency | Raise frequency to ≥ latency |
| One giant flapping alert | No dimensions | Rule has no grouping column | Add --dimension for per-entity |
| Alert never fires though data exists | Query doesn’t aggregate to a number | Run the KQL manually | summarize to one value per dim |
| Alert returns nothing after retention change | Table archived / short interactive | Check table plan + interactive days | Widen interactive or window |
| Threshold always breached | Window too long, accumulates | Inspect window vs frequency | Shorten window; use rate |

Action groups, alert processing rules, and suppression

An action group is the reusable fan-out target: a named bundle of notifications (email, SMS, push, voice) and actions (webhook, Logic App, Function, Automation Runbook, ITSM connector). Every alert type – metric, log, activity log, Service Health – points at the same action group resource, so you manage on-call routing in one place.

Action types in a group

| Action type | Delivery | Latency | Idempotent by design? | Best for |
|---|---|---|---|---|
| Email | Inbox | Seconds-minutes | N/A | Humans, low urgency |
| SMS | Text | Seconds | N/A | On-call escalation |
| Push (Azure mobile app) | Notification | Seconds | N/A | On-call awareness |
| Voice | Phone call | Seconds | N/A | Sev0 wake-up |
| Webhook | HTTP POST | Seconds | You must make it so | Custom integrations |
| Logic App | Workflow | Seconds | Build idempotently | Orchestration, approvals |
| Azure Function | HTTP/queue | Seconds | You must make it so | Code remediation |
| Automation Runbook | Job | Seconds-minutes | You must make it so | VM/OS actions |
| ITSM / event hub | Connector | Varies | Connector-dependent | ServiceNow, SIEM |
| Secure webhook | HTTP + Entra | Seconds | You must make it so | Authenticated callouts |

Creating the group with an email, a webhook, and a Logic App action (the email address and the PagerDuty key are placeholders):

az monitor action-group create \
  --name ag-platform-oncall \
  --resource-group rg-observability \
  --short-name pltoncall \
  --action email oncall-lead "<oncall-email>" \
  --action webhook pagerduty "https://events.pagerduty.com/integration/<key>/enqueue" \
  --action logicapp incident-workflow \
    "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Logic/workflows/wf-incident" \
    "/subscriptions/<sub>/resourceGroups/rg-observability/providers/Microsoft.Logic/workflows/wf-incident/triggers/manual/paths/invoke"

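The piece teams miss is the alert processing rule (Microsoft.AlertsManagement/actionRules). It sits between alerts and action groups and does two jobs without touching a single alert rule: it suppresses notifications (RemoveAllActionGroups) and it attaches extra action groups to matching alerts (AddActionGroups).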

Alert processing rule types and filters

| Rule type | Effect | Typical scope | Schedule? |
|---|---|---|---|
| RemoveAllActionGroups | Suppress notifications | A resource group during patching | Yes (recurring window) |
| AddActionGroups | Attach an AG to matching alerts | All Sev0/1 in prod → SecOps | Optional (always-on) |
| Filtered by severity | Apply only to chosen severities | Sev2/Sev3 only | Either |
| Filtered by resource type | Apply to one service | All Microsoft.Compute/* | Either |
| Filtered by alert context | Apply by signal/monitor service | Only platform metrics | Either |

Maintenance-window suppression across a resource group:

az monitor alert-processing-rule create \
  --name "suppress-maint-window" \
  --resource-group rg-observability \
  --scopes "/subscriptions/<sub>/resourceGroups/rg-fleet" \
  --rule-type RemoveAllActionGroups \
  --filter-severity Equals Sev2 Sev3 \
  --schedule-recurrence-type Weekly \
  --schedule-recurrence-start-time "02:00:00" \
  --schedule-recurrence-end-time "04:00:00" \
  --schedule-recurrence Sunday \
  --description "Mute Sev2/Sev3 during Sunday patch window"

This is how you keep a noisy estate humane: the alert rules stay armed and the processing layer decides who hears them and when. The decision table for routing and muting:

| If you want to… | Use | Not |
|---|---|---|
| Mute pages during a patch window | Alert processing rule (RemoveAllActionGroups, scheduled) | Disabling the alert rules |
| Add SecOps to every prod Sev0/1 | Alert processing rule (AddActionGroups) | Editing every alert rule |
| Change who is on-call | Edit the action group once | Editing each alert |
| Stop one rule entirely | Disable that alert rule | A processing rule (overkill) |
| Cool down re-notification after fire | Per-rule mute / suppression duration | Deleting the alert |
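
The suppression decision itself is simple enough to model. The sketch below mirrors the Sunday 02:00-04:00, Sev2/Sev3 example; it is a mental model rather than the service's implementation, it treats all times as UTC (an assumption of this sketch), and `inWeeklyWindow` and `shouldNotify` are hypothetical names.

```javascript
// Toy check: would a weekly Sunday 02:00-04:00 window mute this alert?
// Times are treated as UTC here, which is an assumption of this sketch.
function inWeeklyWindow(date, { day, startHour, endHour }) {
  const hour = date.getUTCHours() + date.getUTCMinutes() / 60;
  return date.getUTCDay() === day && hour >= startHour && hour < endHour;
}

const patchWindow = { day: 0, startHour: 2, endHour: 4 }; // 0 = Sunday
const muted = new Set(["Sev2", "Sev3"]);

// The alert still fires; only the notification is withheld.
function shouldNotify(alert) {
  const suppressed =
    muted.has(alert.severity) && inWeeklyWindow(alert.firedAt, patchWindow);
  return !suppressed;
}

// 2024-06-02 was a Sunday, 2024-06-03 a Monday.
console.log(shouldNotify({ severity: "Sev2", firedAt: new Date("2024-06-02T03:00:00Z") })); // false
console.log(shouldNotify({ severity: "Sev0", firedAt: new Date("2024-06-02T03:00:00Z") })); // true
console.log(shouldNotify({ severity: "Sev2", firedAt: new Date("2024-06-03T03:00:00Z") })); // true
```

Note that the Sev0 alert still pages inside the window: the severity filter is what keeps a genuine outage from being muted along with the patching noise.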

Automation hooks to Logic Apps, Functions, and webhooks

The point of all of the above is to do something without a human. An action group can call a Logic App, an Azure Function, or a raw webhook, passing the alert as JSON. Use the common alert schema so every downstream gets the same envelope regardless of whether a metric or log alert fired – otherwise your Function has to parse three different payload shapes.

The common alert schema envelope

| Field path | Holds | Why you read it |
|---|---|---|
| data.essentials.alertRule | The rule name | Logging / routing |
| data.essentials.severity | Sev0-Sev4 | Decide how hard to act |
| data.essentials.monitorCondition | Fired / Resolved | Act only on Fired |
| data.essentials.alertTargetIDs | Affected resource ids | What to remediate |
| data.essentials.signalType | Metric / Log | Branch if needed |
| data.essentials.firedDateTime | When it fired | Dedup window |
| data.alertContext | Signal-specific detail | Thresholds, dimensions |
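
For orientation, here is a trimmed, made-up payload in the common alert schema shape, plus a small reader for the fields in the table. The resource ids, rule name, and timestamp are placeholders, `summarize` is a hypothetical helper, and a real payload carries more fields than shown.

```javascript
// Trimmed, made-up example of a common alert schema payload.
// Resource ids and the rule name are placeholders, not real resources.
const payload = {
  schemaId: "azureMonitorCommonAlertSchema",
  data: {
    essentials: {
      alertId: "/subscriptions/<sub>/providers/Microsoft.AlertsManagement/alerts/<alert-guid>",
      alertRule: "vm-cpu-high",
      severity: "Sev2",
      signalType: "Metric",
      monitorCondition: "Fired",
      alertTargetIDs: [
        "/subscriptions/<sub>/resourcegroups/rg-fleet/providers/microsoft.compute/virtualmachines/vm-a",
      ],
      firedDateTime: "2024-06-02T03:00:00Z",
    },
    alertContext: {}, // signal-specific detail; shape varies by alert type
  },
};

// Pull out just the fields a handler usually branches on.
function summarize(body) {
  const e = body?.data?.essentials;
  if (!e) return null; // not a common-schema payload
  return {
    rule: e.alertRule,
    severity: e.severity,
    fired: e.monitorCondition === "Fired",
    targets: e.alertTargetIDs ?? [],
  };
}

console.log(summarize(payload));
```

Because everything a handler needs sits under data.essentials, the same reader works whether a metric or a log alert produced the payload; only alertContext differs.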

A Function that auto-remediates by parsing the common schema and restarting a service (sketch, Node.js):

module.exports = async function (context, req) {
  const alert = req.body?.data?.essentials;
  if (!alert) { context.res = { status: 400, body: "no alert payload" }; return; }

  context.log(`Alert ${alert.alertRule} is ${alert.monitorCondition} (${alert.severity})`);

  // Only act on a freshly fired alert, ignore the auto-resolve callback
  if (alert.monitorCondition === "Fired") {
    const target = alert.alertTargetIDs?.[0];
    context.log(`Remediating ${target}`);
    // ... call ARM / Az SDK to restart/scale the resource ...
  }
  context.res = { status: 202, body: "accepted" };
};

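The two non-negotiables for automation handlers:

  1. Be idempotent. Action groups can retry and alerts can re-fire, so deduplicate on alertId/firedDateTime before acting.
  2. Acknowledge fast. Return 202 immediately and push slow work onto a queue so the call never exceeds the ack timeout.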

For richer orchestration – approvals, multi-step runbooks, ServiceNow tickets – a Logic App is the better target: enable the common alert schema on the action, and the trigger body is the same well-known structure, no custom parsing. (See Azure Logic Apps Standard: Stateful Workflows, VNet & B2B/EDI and Azure Functions: Serverless Patterns for building the handlers.) Choosing a target:

| Target | Reach for it when | Avoid when |
|---|---|---|
| Raw webhook | A third party expects a POST (PagerDuty) | You need orchestration/state |
| Azure Function | Code remediation, fast and cheap | You need a human approval step |
| Logic App | Approvals, ServiceNow, multi-step | A 5-line restart is enough |
| Automation Runbook | OS/VM-level PowerShell actions | Pure cloud-resource API calls |

The automation failure modes that turn auto-remediation into an incident of its own:

| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| Action runs twice per incident | Non-idempotent handler + retry | Logs show two invocations | Dedup on alertId/firedDateTime |
| Remediation fires on resolve too | Not checking monitorCondition | Payload shows Resolved | Gate on Fired only |
| Webhook retried, duplicate work | Handler blocked >ack timeout | Long duration in logs | Return 202, queue the work |
| Function can’t act on the resource | Managed identity lacks RBAC | az role assignment list | Grant least-privilege role |
| Three parsers for three alert types | Common alert schema not enabled | Payload shapes differ | Enable common alert schema on the action |
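
The first two rows of that table can be handled by one small decision function placed in front of any remediation. This is a sketch under stated assumptions: the in-memory `Set` only works within a single process, so a real handler would need a shared store, and `decide` is a hypothetical name.

```javascript
// In-memory dedup for illustration only: a real handler would need a
// shared store (table, cache), since instances do not share memory.
const seen = new Set();

function decide(body) {
  const e = body?.data?.essentials;
  if (!e) return "reject";
  if (e.monitorCondition !== "Fired") return "ignore"; // skip Resolved callbacks
  const key = `${e.alertId}|${e.firedDateTime}`;
  if (seen.has(key)) return "duplicate"; // retry or re-delivery
  seen.add(key);
  return "remediate";
}

// Made-up payloads in the common alert schema shape.
const fired = {
  data: {
    essentials: {
      alertId: "alert-1",
      monitorCondition: "Fired",
      firedDateTime: "2024-06-02T03:00:00Z",
    },
  },
};
const resolved = {
  data: { essentials: { ...fired.data.essentials, monitorCondition: "Resolved" } },
};

console.log(decide(fired));    // remediate
console.log(decide(fired));    // duplicate
console.log(decide(resolved)); // ignore
```

The handler would return 202 in every case and only enqueue work on "remediate", which keeps the HTTP call short regardless of how long the remediation takes.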

Architecture at a glance

Read the diagram left to right as the data and signal pipeline it is. On the collection plane (left), the Azure Monitor Agent on each VM is armed by a Data Collection Rule and pushes through a Data Collection Endpoint; custom logs arrive at the same DCE via the Logs Ingestion API. At the DCR’s data flow, an ingestion transformation (transformKql) drops Information-level rows and noisy processes before anything is billed – badge ❶ marks this as the first failure point, because a transform that drops TimeGenerated flat-lines every downstream chart. The clean stream lands in the Log Analytics workspace, where each table sits on the plan its query pattern deserves: hot Analytics tables for alerting, Basic for verbose app logs, Auxiliary for the raw firehose (badge ❷ – put an alerting table on Basic and the alert silently can’t read it).

From the workspace the signal plane (right) reads two ways. Metric alerts evaluate the pre-aggregated stream with multi-resource scope and dynamic thresholds; scheduled-query alerts run KQL with dimensions so they fire per-entity instead of as one flapping storm (badge ❸). Both point at a single reusable action group, but an alert processing rule sits in front of it (badge ❹) to suppress pages during a patch window or bolt on the SecOps group centrally. The action group fans out to humans and to Logic Apps / Functions that auto-remediate using the common alert schema – and badge ❺ marks the automation hop, where a non-idempotent handler double-acts on a retry. The whole picture is the article’s one rule made visual: shape and trim low on the left where it is free, and raise clean, per-entity signals high on the right.

[Figure: Azure Monitor end-to-end pipeline, from the collection plane (Azure Monitor Agent, Data Collection Rule, Logs Ingestion API, Data Collection Endpoint, transformKql, Log Analytics table plans) to the signal plane (metric and scheduled-query alerts, alert processing rule, action group, Logic App/Function automation), with failure-point badges 1-5.]

Real-world scenario

Paywave Systems runs a payments platform: roughly 1,400 VMs plus AKS across three regions (Central India primary, with Southeast Asia and West Europe), all funnelling telemetry into one Log Analytics workspace per region. The platform team is six engineers; observability is one slice of their remit. Over two quarters the combined Log Analytics spend crossed a six-figure annual run rate with no new workloads to explain it – the classic silent-growth curve.

The forensic finding came from a single Usage-table query: a chatty Syslog stream and one verbose application diagnostic table accounted for roughly 70% of ingested volume, and almost none of it was ever queried. It existed because the retired MMA config had collected everything, and nobody revisited it after the AMA migration. The constraint was hard: the security team had a regulatory requirement to retain authentication and audit events for one year, so “just stop collecting” was off the table for that slice – and the on-call team was simultaneously drowning, because a single scheduled-query rule on Syslog errors fired one giant flapping alert every time a handful of nodes flickered during nightly batch.

They fixed it on the collection plane and the processing layer, not the bill. The breakdown of the change:

| Workstream | Before | Change | After |
|---|---|---|---|
| Syslog volume | All levels, all processes | transformKql drops info/debug + cron/health noise | ~50%+ less syslog, zero query loss |
| Verbose app table | Analytics plan, 90d | Move to Basic, 30d interactive / 365d total | Sharp per-GB ingest drop |
| Regulated auth/audit | Mixed in the firehose | Split to own Analytics table, 1y retention | Security alerts + retention untouched |
| Syslog error alert | One rule, no dimensions | Add --dimension Computer, 4 of 4 violations | Per-node alerts, no storm |
| Patch-window pages | 400 VMs paged nightly | Alert processing rule, Sun 02:00-04:00 suppress | On-call sleeps through patching |
| Change process | Quarterly autopsy | DCR/table/alert as reviewed PRs | “What do we collect” = a diff |

The net was a ~45% reduction in monthly ingestion cost with no loss of any signal anyone actually used, and the nightly alert storm went silent. The load-bearing change was a few lines of KQL on a data flow:

source
| where SeverityLevel !in ("info", "debug", "notice")
| where ProcessName !in ("CRON", "systemd", "kubelet-health")
| project TimeGenerated, Computer, Facility, SeverityLevel, SyslogMessage

The 3am page that used to fan out across 400 machines now fires one alert per genuinely-affected node, auto-resolves when the node recovers, and stays muted entirely during the Sunday patch window. The lesson the team took away: in Azure Monitor, cost and noise are collection and processing design decisions, not billing surprises and not the alert rules’ fault. Once the DCR was the unit of intent and the processing rule owned the patch window, “what do we collect, what does it cost, and who gets paged” became three pull requests instead of two quarterly autopsies.

Advantages and disadvantages

The DCR-plus-processing model both causes the cost/noise problems and gives you the levers to fix them. Weigh it honestly:

| Advantages (why this model helps you) | Disadvantages (why it bites) |
|---|---|
| Collection is a versioned ARM artifact – “what do we collect” is a reviewable diff | The default posture is collect everything; you must opt into trimming |
| Ingestion transforms cut cost before billing at zero query-experience loss | A bad transform (dropped TimeGenerated, schema mismatch) silently breaks data |
| Per-table plans match each stream to its real query pattern | An alerting table mis-placed on Basic/Auxiliary silently can’t be alerted on |
| One DCR scales to a whole fleet via Policy DeployIfNotExists | Per-VM association by hand doesn’t scale and drifts |
| Metric alerts are cheap, fast, stateful, multi-resource | The signal must exist as a metric; otherwise you pay for log queries |
| Dimensions fire one alert per entity, killing the storm | Forget dimensions and one rule flaps as a single giant alert |
| Action groups centralise routing; processing rules mute/route without editing rules | The processing layer is invisible until you know it exists – teams disable rules instead |
| Common alert schema gives one envelope for all alert types | Skip it and every downstream parses three payload shapes |

The model is right for any estate past the toy stage that wants cost and noise under engineering control rather than at the mercy of defaults. It bites hardest on teams that migrated off MMA and never revisited collection, that alert on raw firehose data, and that reach for “disable the alert” or “another workspace” instead of the transform, the table plan, and the processing rule. Every disadvantage is manageable – but only if you know the lever exists, which is the entire point of this article.

Hands-on lab

Stand up a workspace, create a custom table fed by the Logs Ingestion API through a DCE and DCR with a transform, and fire a scheduled-query alert into an action group – all free-tier-friendly (Log Analytics has a generous free ingestion allowance; delete at the end). Run in Cloud Shell (Bash).

Step 1 – Variables and resource group.

RG=rg-monitor-lab
LOC=eastus
WS=law-lab-$RANDOM
az group create -n $RG -l $LOC -o table

Step 2 – Create the Log Analytics workspace.

az monitor log-analytics workspace create \
  -g $RG -n $WS -l $LOC --retention-time 30 -o table
WS_ID=$(az monitor log-analytics workspace show -g $RG -n $WS --query id -o tsv)

Expected: a workspace row; WS_ID populated.

Step 3 – Create a Data Collection Endpoint.

az monitor data-collection endpoint create \
  -g $RG -n dce-lab -l $LOC --public-network-access Enabled -o table

Expected: a DCE with a logsIngestion endpoint URL in its properties.

Step 4 – Create a custom table for the logs. A *_CL table to receive pushed rows:

az monitor log-analytics workspace table create \
  -g $RG --workspace-name $WS -n LabEvents_CL \
  --columns TimeGenerated=datetime Computer=string Severity=string Message=string

Expected: the table LabEvents_CL is created on the Analytics plan.

Step 5 – Create a DCR with a transform that drops Information rows. Author a minimal custom-log DCR (dcr-lab.json) with a streamDeclarations matching the table and a transformKql that filters, then create it. The key line is the transform:

# transformKql inside the data flow:
#   source | where Severity != 'Information' | project TimeGenerated, Computer, Severity, Message
az monitor data-collection rule create \
  -g $RG -n dcr-lab -l $LOC --rule-file ./dcr-lab.json -o table

Expected: a DCR whose data flow carries the transform; note its immutableId and the DCE’s ingestion URL for the push.
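
One way to produce dcr-lab.json for this step is to generate it. The sketch below prints a minimal custom-log DCR body; treat the overall layout (a `properties` wrapper holding `streamDeclarations`, `destinations`, and `dataFlows`) as an assumption to verify against the current DCR reference before relying on it. `buildDcr`, the `labWorkspace` destination name, and the `WS_ID` / `DCE_ID` environment variables are choices made for this example.

```javascript
// Sketch of a generator for dcr-lab.json. WS_ID and DCE_ID are read from
// the environment; the stream and destination names are arbitrary choices.
function buildDcr(workspaceId, dceId) {
  const stream = "Custom-LabEvents"; // custom input streams use the Custom- prefix
  return {
    properties: {
      dataCollectionEndpointId: dceId,
      streamDeclarations: {
        [stream]: {
          // Columns mirror the LabEvents_CL table created in Step 4.
          columns: [
            { name: "TimeGenerated", type: "datetime" },
            { name: "Computer", type: "string" },
            { name: "Severity", type: "string" },
            { name: "Message", type: "string" },
          ],
        },
      },
      destinations: {
        logAnalytics: [{ workspaceResourceId: workspaceId, name: "labWorkspace" }],
      },
      dataFlows: [
        {
          streams: [stream],
          destinations: ["labWorkspace"],
          // The transform from this step: drop Information rows pre-ingestion.
          transformKql:
            "source | where Severity != 'Information' | project TimeGenerated, Computer, Severity, Message",
          outputStream: "Custom-LabEvents_CL",
        },
      ],
    },
  };
}

const dcr = buildDcr(
  process.env.WS_ID || "<workspace-resource-id>",
  process.env.DCE_ID || "<dce-resource-id>"
);
console.log(JSON.stringify(dcr, null, 2));
```

Assuming the script is saved as make-dcr.js, exporting WS_ID and DCE_ID and running `node make-dcr.js > dcr-lab.json` writes the file; the DCE id can be read with `az monitor data-collection endpoint show -g $RG -n dce-lab --query id -o tsv`.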

Step 6 – Push two rows and watch the transform drop one. POST one Information and one Error row to the ingestion endpoint (using the DCE URL, the DCR immutableId, and a bearer token from az account get-access-token --resource https://monitor.azure.com). Then query:

LabEvents_CL
| where TimeGenerated > ago(15m)
| project TimeGenerated, Computer, Severity, Message

Expected: only the Error row appears – the transform dropped the Information row before ingestion, which is the entire cost-control mechanism in miniature.
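
The push in this step can be scripted with Node 18+ and its built-in fetch. This is a sketch, not a hardened client: `DCE_URL`, `DCR_ID`, and `TOKEN` are environment variable names chosen for the example, the stream name `Custom-LabEvents` is an assumption that must match whatever your DCR declares, and the rows are made up.

```javascript
// Sketch of the Step 6 push using Node 18+ global fetch.
// DCE_URL, DCR_ID (the immutableId) and TOKEN come from the environment;
// the stream name must match whatever the DCR declares.
function ingestUrl(dceUrl, immutableId, stream) {
  return (
    `${dceUrl.replace(/\/$/, "")}/dataCollectionRules/${immutableId}` +
    `/streams/${stream}?api-version=2023-01-01`
  );
}

// One Information row (should be dropped) and one Error row (should land).
function sampleRows(now = new Date()) {
  const t = now.toISOString();
  return [
    { TimeGenerated: t, Computer: "vm-a", Severity: "Information", Message: "heartbeat" },
    { TimeGenerated: t, Computer: "vm-a", Severity: "Error", Message: "disk failure" },
  ];
}

async function push(dceUrl, immutableId, stream, token) {
  const res = await fetch(ingestUrl(dceUrl, immutableId, stream), {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify(sampleRows()), // the API takes a JSON array of rows
  });
  return res.status; // a 2xx status means the rows were accepted
}

// Only call out to Azure when the environment is actually configured.
if (process.env.DCE_URL && process.env.DCR_ID && process.env.TOKEN) {
  push(process.env.DCE_URL, process.env.DCR_ID, "Custom-LabEvents", process.env.TOKEN)
    .then((status) => console.log("ingestion status:", status))
    .catch((err) => console.error("push failed:", err.message));
}
```

If the call is rejected as unauthorized, check that the identity behind the token holds the Monitoring Metrics Publisher role on the DCR. Newly pushed rows can take a few minutes to become queryable, so an empty result right after the push is not yet a failure.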

Step 7 – Create an action group and a scheduled-query alert.

az monitor action-group create -g $RG -n ag-lab --short-name lab \
  --action email me "<your-email>"

az monitor scheduled-query create -g $RG -n "lab-error-burst" \
  --scopes "$WS_ID" \
  --condition "count 'errs' > 0" \
  --condition-query errs='LabEvents_CL | where Severity == "Error"' \
  --dimension "Computer" \
  --window-size 5m --evaluation-frequency 5m --severity 3 \
  --auto-mitigate true \
  --action-groups $(az monitor action-group show -g $RG -n ag-lab --query id -o tsv)

Expected: the rule fires per Computer when error rows arrive, emails the action group, and auto-resolves.

Validation checklist. You built the whole chain: a DCE entry point, a DCR with an ingestion transform that dropped a row before billing, a custom table, and a per-entity scheduled-query alert into an action group. The steps mapped to the concepts:

| Step | What you did | What it proves |
|---|---|---|
| 3-5 | DCE + DCR + transform | Collection is a versioned artifact; transform runs pre-bill |
| 6 | Push 2 rows, 1 survives | The transform is a real, permanent cost cut |
| 7 | Scheduled-query with --dimension | Per-entity firing, not one storm |
| 7 | Action group + auto-mitigate | Centralised routing + stateful resolve |

Cleanup (avoid lingering ingestion/retention charges).

az group delete -n $RG --yes --no-wait

Cost note. Log Analytics includes a free ingestion allowance and the lab pushes a handful of rows; an hour of this lab is effectively free, and deleting the resource group removes the workspace, DCE, and DCR.

Common mistakes & troubleshooting

This is the playbook – the part you bookmark. First as a scannable table you can read mid-incident, then the entries that bite hardest expanded with the full confirm-command detail.

| # | Symptom | Root cause | Confirm (exact cmd / portal path) | Fix |
|---|---|---|---|---|
| 1 | Log Analytics bill grew with no new workloads | Collect-everything legacy config; chatty stream never trimmed | `Usage \| summarize sum(Quantity) by DataType` | Add a transformKql to drop noise; move rarely-queried tables to Basic/Auxiliary |
| 2 | No rows after DCR association | AMA not installed on the VM | az vm extension list --query "[?name=='AzureMonitorLinuxAgent']" | Install AMA via extension or Policy remediation |
| 3 | Every chart is a flat line at one time | Transform dropped TimeGenerated | `Syslog \| summarize min(TimeGenerated), max(TimeGenerated)` | Add TimeGenerated back to the transform’s project list |
| 4 | Custom-log push returns 4xx | DCE missing or streamDeclarations mismatch | Ingestion API response body; compare schema to table | Add DCE; align declared columns to the table |
| 5 | Alert rule “can’t find data” after a cost change | Alerting table moved to Basic/Auxiliary | az monitor log-analytics workspace table show --query plan | Keep alerting tables on Analytics |
| 6 | One giant alert flaps as nodes flicker | Scheduled-query rule has no dimensions | Inspect the rule; no grouping column | Add --dimension Computer; N of M violations |
| 7 | Duplicate alert fires every minute | evaluation-frequency < ingestion latency | Compare ingest delay to the rule frequency | Raise frequency to ≥ data latency |
| 8 | 400 VMs page during the patch window | No suppression on the maintenance window | No alert processing rule covers the scope/time | Add RemoveAllActionGroups processing rule, scheduled |
| 9 | Automation runs twice per incident | Non-idempotent handler + action-group retry | Function/Logic App logs show two invocations | Dedup on alertId/firedDateTime; ack 202 fast |
| 10 | Remediation fires on the resolve callback too | Handler ignores monitorCondition | Payload data.essentials.monitorCondition = Resolved | Gate the action on Fired only |
| 11 | Dynamic-threshold alert never fires (new metric) | No history for the band to learn | Rule created on a brand-new metric | Use a static threshold until weeks of history exist |
| 12 | Alert query returns nothing after retention cut | Data archived / interactive retention too short | Table interactive days vs the alert window | Widen interactive retention or shorten the window |
| 13 | Workbook step ignores the time picker | Hardcoded range, no {TimeRange} | The step’s KQL lacks where TimeGenerated {TimeRange} | Interpolate the parameter into the step |
| 14 | Volume didn’t drop after adding a transform | transformKql on the wrong data flow | Which dataFlows entry carries it | Move the transform onto the chatty stream |

The expanded form, for the entries that cost the most when missed:

1. Log Analytics bill grew with no new workloads. Root cause: A legacy collect-everything config keeps pouring a chatty Syslog stream and a verbose app table into expensive Analytics tables nobody queries. Confirm: Usage | where TimeGenerated > ago(7d) | summarize sum(Quantity) by DataType | order by sum_Quantity desc – the top one or two DataTypes usually dominate. Fix: Attach a transformKql to drop the noise pre-bill, and move rarely-queried tables to Basic/Auxiliary with a sane interactive/total retention split. Keep regulated tables on Analytics.

3. Every chart and time-series is a flat line stamped at one moment. Root cause: The ingestion transform dropped TimeGenerated, so every row is stamped at ingestion time. Confirm: Syslog | summarize min(TimeGenerated), max(TimeGenerated) shows a near-zero spread, or all rows share one timestamp. Fix: Add TimeGenerated back into the transform’s project list (or set it from the source field, e.g. project TimeGenerated = todatetime(EventTime), ...).

5. An alert rule reports it can’t find data right after a cost-optimisation change. Root cause: The table was moved to Basic or Auxiliary, which cannot be a source for alert rules. Confirm: az monitor log-analytics workspace table show -g rg-observability --workspace-name law-platform -n <Table> --query plan returns Basic/Auxiliary. Fix: Keep any table that feeds an alert or dashboard on Analytics; trim its volume with a transform instead of changing its plan.

6. A single alert flaps loudly as a handful of nodes flicker. Root cause: The scheduled-query rule has no dimensions, so it evaluates one aggregate across the whole fleet and fires/resolves as one giant alert. Confirm: Inspect the rule – there is no grouping column; the alert summary names many resources at once. Fix: Add --dimension Computer (or the right grouping column) so it fires per entity, and add an N of M violation count to ride brief flickers.

8. Hundreds of VMs page on-call during a planned patch window. Root cause: The alert rules are correctly armed, but nothing suppresses notifications during maintenance. Confirm: No alert processing rule with RemoveAllActionGroups covers the scope and the time window. Fix: Create a scheduled processing rule (--rule-type RemoveAllActionGroups, recurring window) over the patched scope and the chosen severities – the rules stay armed; nobody is paged.

9. Auto-remediation acts twice for the same incident. Root cause: The handler is not idempotent and the action group retried on a non-2xx, or the alert fired and re-fired. Confirm: The Function/Logic App logs show two invocations with the same alertId/firedDateTime. Fix: Deduplicate on alertId/firedDateTime, return 202 immediately, and push the slow work onto a queue so the call never exceeds the ack timeout.

Best practices

Security notes

The security controls that also keep the pipeline cheap and correct – secure and well-built pull the same way here:

| Control | Mechanism | Secures against | Also prevents |
|---|---|---|---|
| Managed identity for AMA/automation | System/user-assigned MI | Leaked workspace keys | Broken rotation taking collection down |
| Private ingestion | DCE + AMPLS | Telemetry on the public internet | Re-architecture to add Private Link later |
| PII redaction in transform | transformKql masking | Storing sensitive data | Extra storage cost for data you must scrub |
| Table-level RBAC | Per-table access | Over-broad data access | Workspace-per-team sprawl |
| Least-privilege automation | Scoped role assignment | A handler over-acting | Accidental cross-resource remediation |

Cost & sizing

The bill is driven almost entirely by the collection plane, so that is where you control it.

A rough monthly picture for a mid-size estate, and what each lever buys:

| Cost driver | What you pay for | Relative cost (illustrative) | Lever to reduce it | Watch-out |
|---|---|---|---|---|
| Analytics ingestion | Per-GB hot ingest | the bulk of the bill | transformKql + DCR scoping | Don’t drop a signal you’ll need |
| Basic-plan tables | Lower ingest, per-query billed | fraction of Analytics per GB | Move rarely-queried tables here | Can’t alert off Basic |
| Auxiliary-plan tables | Cheapest ingest, kept long | lowest per GB | Raw firehose for compliance | Very limited KQL |
| Interactive retention | Days queryable directly | scales with GB × days | Keep short; archive the rest | Alerts need data inside it |
| Archive (total) retention | Cheap long-term keep | low per GB-month | Long keep without hot cost | Restore/search job to query |
| Scheduled-query evaluations | Query cost per run | depends on frequency × size | Slower frequency; narrower query | Too slow misses incidents |
| Notifications + automation | SMS/voice + Func/LA meters | small | Email for low-urgency; idempotent handlers | Retries multiply automation cost |

Paywave landed at roughly a 45% lower monthly ingestion bill purely from a transform, two table-plan moves, and a retention split – proof that the cheapest fix is almost always to collect and keep less of what nobody queries, not to downsize something else.

Interview & exam questions

1. The legacy Log Analytics agent is retired – what replaced it and how is it configured? The Azure Monitor Agent (AMA) replaced MMA/OMS (retired 31 Aug 2024). AMA collects nothing on its own; it is driven entirely by Data Collection Rules that declare dataSources, destinations, and dataFlows, associated to each machine (by hand, by Bicep, or at fleet scale via Azure Policy DeployIfNotExists).

2. What is an ingestion-time transformation and why is it the highest-leverage cost control? A transformKql snippet attached to a data flow that runs before data is billed, operating on a source variable to drop rows/columns or redact fields. Because billing is on ingested volume, dropping chatty Information-level rows is a permanent, query-cost-free reduction – you never pay to store data you filtered at the door.

3. When do you need a Data Collection Endpoint? A DCE is required for the Logs Ingestion API (the endpoint is the DCE) and for Private Link ingestion via an AMPLS. For plain AMA collection over public networking it is optional, but standardising on one makes Private Link a later config change rather than a re-architecture.

4. Explain the three table plans and which can source an alert. Analytics (hot, full KQL, highest ingest) – the only plan that can source alerts and dashboards. Basic (high-volume, KQL subset, per-query billed) for rarely-queried logs. Auxiliary (lowest ingest, limited KQL) for the raw firehose kept for compliance. Putting an alerting table on Basic/Auxiliary silently breaks the alert.

5. Difference between interactive and total retention? Interactive retention is the window you can query directly; total retention is interactive plus a cheap long-term archive. Alert rules and dashboards can only read interactive retention – archived data needs a search job or restore first, so an alert’s window must fit inside interactive retention.

6. When do you choose a metric alert over a scheduled-query alert? Whenever the signal exists as a metric: metric alerts are cheaper, near-real-time, stateful, and natively multi-resource with optional dynamic thresholds. Use a scheduled-query alert only when the signal lives in logs (e.g. “20 5xx from one pod”, a privileged role assignment) – it is more expressive but pays query latency and cost.

7. What do dimensions do for a log alert, and why do they matter? A dimension is a grouping column that splits one rule into one independently-firing, independently-resolving alert per value – so a rule grouped by Computer fires per machine instead of one giant alert that flaps as nodes flicker. Without dimensions you get an alert storm collapsed into one noisy, unhelpful alert.

8. Why must evaluation-frequency be at least the data latency? Log ingestion has minutes of latency. Evaluating every minute against data that arrives every three minutes produces false negatives and duplicate fires. Matching frequency to real ingestion latency keeps evaluations deterministic and avoids re-firing on the same window.

9. What is an alert processing rule and what two jobs does it do? A rule (Microsoft.AlertsManagement/actionRules) that sits between alerts and action groups. It can suppress notifications across a scope on a schedule (mute a patch window so 400 VMs don’t page) or add an action group to every matching alert (attach SecOps to all prod Sev0/1) – all without editing a single alert rule.

10. Why use the common alert schema for automation? It gives every alert type (metric, log, activity log) the same JSON envelope, so a Function or Logic App parses one shape instead of three. You read data.essentials.monitorCondition to act only on Fired, and alertTargetIDs to know what to remediate.

11. What are the two non-negotiables for an alert-triggered automation handler? Idempotency (alerts fire/resolve/re-fire and action groups retry, so the handler must tolerate double invocation without doubling the effect) and fast ack / async work (return 202 immediately and queue slow remediation, or a blocking webhook gets retried and duplicates work).

12. A team’s Log Analytics bill grew 40% with no new workloads – how do you diagnose and fix it? Query Usage | summarize sum(Quantity) by DataType to find the fattest stream (usually one chatty Syslog or app table). Add a transformKql to drop the noise pre-bill, move rarely-queried tables to Basic/Auxiliary, split regulated data to its own Analytics table, and ship it all as reviewed DCR/table changes.
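Questions 10 and 11 meet in one handler shape, sketched below under some assumptions. The function name `handle_alert`, the in-memory `work_queue` and `seen_keys`, and the choice of dedup key are illustrative only; a real Function or Logic App would use a durable queue and a durable dedup store. The payload fields read here (`data.essentials.monitorCondition`, `alertTargetIDs`, `alertId`, `firedDateTime`, `alertRule`) are from the common alert schema.

```python
import queue

# In-memory stand-ins (assumption): swap for a durable queue and dedup store in production.
work_queue = queue.Queue()
seen_keys = set()


def handle_alert(payload: dict) -> int:
    """Process one common-alert-schema webhook call and return an HTTP status code."""
    essentials = payload.get("data", {}).get("essentials", {})
    condition = essentials.get("monitorCondition")
    targets = essentials.get("alertTargetIDs", [])

    # Only act on Fired; acknowledge Resolved (or anything else) without remediating.
    if condition != "Fired":
        return 200

    # Idempotency: the same firing may arrive twice because action groups retry.
    key = (essentials.get("alertId"), essentials.get("firedDateTime"))
    if key in seen_keys:
        return 202  # already queued; acknowledge again, do nothing more
    seen_keys.add(key)

    # Fast ack / async work: enqueue the slow remediation, do not run it inline.
    for target in targets:
        work_queue.put({"target": target, "rule": essentials.get("alertRule")})
    return 202


# Hypothetical payload trimmed to the fields the handler reads.
sample = {
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertId": "/subscriptions/0000/providers/Microsoft.AlertsManagement/alerts/abc",
            "alertRule": "disk-full",
            "monitorCondition": "Fired",
            "firedDateTime": "2024-01-01T03:00:00Z",
            "alertTargetIDs": ["/subscriptions/0000/resourcegroups/rg/providers/microsoft.compute/virtualmachines/vm-a"],
        }
    },
}

first = handle_alert(sample)
retry = handle_alert(sample)  # simulated action-group retry of the same firing
print(first, retry, work_queue.qsize())
```

The retry is acknowledged but enqueues nothing, so the remediation runs once however many times the webhook is delivered. Keying on the alert ID plus fired time is one possible choice; pick whatever uniquely identifies a firing in your estate.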

These questions map to AZ-104 (Administrator): monitor and maintain Azure resources, configure Log Analytics, alerts, and action groups – and to AZ-204 (Developer): instrument, monitor, and troubleshoot solutions, custom logs and automation. The design-and-cost angle touches AZ-305 (Solutions Architect). A compact cert mapping for revision:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| AMA, DCR, DCE, association | AZ-104 | Configure & manage monitoring |
| Ingestion transforms, table plans, retention | AZ-104 / AZ-305 | Design a logging/cost strategy |
| Metric vs log alerts, dimensions | AZ-104 | Configure alerts & action groups |
| Action groups, processing rules, suppression | AZ-104 | Manage alerts at scale |
| Custom logs, automation handlers | AZ-204 | Instrument & troubleshoot solutions |
| Private ingestion, identity, RBAC | AZ-305 / AZ-500 | Secure observability |

Quick check

  1. The Azure Monitor Agent is collecting nothing from a freshly onboarded VM even though the extension is installed. What is the one thing that actually arms the agent?
  2. Your verbose app table was moved to the Basic plan to save money, and now an alert that reads it reports “no data.” Why, and what’s the fix?
  3. A scheduled-query rule fires one giant alert that flaps every night during batch. What single setting fixes it?
  4. After adding an ingestion transform, every chart on that table is a flat line at one timestamp. What did the transform almost certainly drop?
  5. 400 VMs page on-call every Sunday during the patch window even though the alerts are “correct.” What do you add, and to which layer?

Answers

  1. A Data Collection Rule association. AMA does nothing until a DCR is associated to the resource (by hand, Bicep, or Policy). The installed extension is necessary but inert without the association.
  2. Basic (and Auxiliary) tables cannot be a source for alert rules – only Analytics can. Move the table back to Analytics and trim its volume with a transformKql instead of changing its plan.
  3. Add a dimension (e.g. --dimension Computer) so the rule fires one independently-resolving alert per entity instead of a single aggregate that flaps; an N-of-M violation count additionally rides out brief flickers.
  4. TimeGenerated – the transform dropped it, so every row is stamped at ingestion time. Add TimeGenerated back to the project (or set it from the source event time).
  5. An alert processing rule of type RemoveAllActionGroups on a recurring schedule, added at the processing layer (between alerts and action groups) over the patched scope – the alert rules stay armed; nobody is paged during the window.
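Answer 3 is easier to trust once you see it run. The toy evaluator below is not Azure's implementation, only a simulation of the idea: split by a dimension value, then require at least N breaches in the last M evaluation periods before a given entity counts as firing. The function name, sample data, and threshold are all hypothetical.

```python
from collections import defaultdict, deque


def evaluate(samples, threshold, n, m):
    """Simulate per-dimension N-of-M evaluation.

    samples: iterable of (period, dimension_value, metric_value), ordered by period.
    Returns {dimension_value: [firing state after each period]}, where the state is
    True when at least n of the last m periods breached the threshold.
    """
    windows = defaultdict(lambda: deque(maxlen=m))  # rolling breach history per entity
    states = defaultdict(list)
    for _period, dim, value in samples:
        windows[dim].append(value > threshold)
        states[dim].append(sum(windows[dim]) >= n)
    return dict(states)


# Hypothetical CPU readings for two machines over four evaluation periods.
samples = [
    (1, "vm-a", 95), (1, "vm-b", 10),
    (2, "vm-a", 10), (2, "vm-b", 10),
    (3, "vm-a", 95), (3, "vm-b", 10),
    (4, "vm-a", 95), (4, "vm-b", 10),
]

result = evaluate(samples, threshold=90, n=2, m=3)
for computer, states in result.items():
    print(computer, states)
```

Here vm-a does not fire on its first isolated spike, only once two of the last three periods breach, and vm-b never fires at all. One noisy machine no longer drags the whole rule into a flapping aggregate.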

Glossary

Next steps

You can now build the whole Azure Monitor pipeline as code and control cost and noise at the right layer. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
