Azure Monitor and Application Insights: Full-Stack Observability

A development team judged application health by one number: server CPU. The CPUs sat at 12%, every dashboard was green, and users were timing out at the checkout. There was no request telemetry, no distributed trace, no way to see that a downstream payment API — not their own compute — was the bottleneck. They were watching the one signal that happened to be fine. Observability is the discipline of being able to ask any question about your running system after the fact, without shipping new code to answer it; and the reason that team was blind is that they had collected exactly one signal and called it monitoring.

This is the field guide to doing it properly on Azure. Azure Monitor is the platform umbrella — it ingests metrics (numeric time-series: CPU, request rate, latency) and logs (timestamped events: exceptions, traces, audit records) from every Azure resource. Log Analytics is the store and query engine underneath it: a columnar log store you interrogate with KQL (Kusto Query Language). Application Insights is the application-performance layer that sits on top of a Log Analytics workspace and adds request, dependency, exception and distributed-tracing telemetry, plus the Failures, Performance and Live Metrics experiences that tell you what a user actually experienced. Metrics tell you that something is wrong; logs tell you why; traces tell you where in a chain of services; Application Insights ties the three to a real user transaction. You need all of them, wired together, before the incident — not during it.

By the end of this article you will stop guessing during incidents. You will know which signal answers which question, how telemetry physically flows from an SDK call to an alert on someone’s phone, where it gets sampled, dropped or made expensive, and the exact az, Bicep and KQL to confirm and fix each failure. Because this is a reference you will return to mid-incident, the data-collection options, table tiers, sampling settings, KQL patterns, alert types and cost drivers are all laid out as scannable tables — read the prose once, then keep the tables open.

What problem this solves

Cloud applications are distributed by default — a single user click can fan out across a web app, three internal APIs, a database, a cache, a queue and a third-party payment provider. When that click is slow or fails, the failure surfaces somewhere, but the cause is usually one or two hops away from where you first look. Without a unified observability stack you debug by guessing: you SSH into a box, tail a log, restart something, and hope. The mean-time-to-resolution is measured in hours and the lesson learned is wrong.

What breaks without it is specific and expensive. You alert on infrastructure metrics (CPU, memory) that look fine while users suffer, because the bottleneck is a downstream dependency your CPU never sees. You cannot answer “which of our nine services made this request slow?” because you have no correlated trace. You discover a regression from a customer tweet rather than a chart. And when you finally do instrument, you either collect nothing useful (default sampling silently dropped the one request you needed) or everything (and the ingestion bill quietly triples). Observability done badly is worse than none, because it gives false confidence.

Who hits this: every team running anything non-trivial on Azure — but it bites hardest on microservice and multi-tier apps (no single log has the whole story), high-traffic apps (where sampling and ingestion cost become load-bearing decisions), PaaS-heavy estates (App Service, Functions, AKS, where you don’t own the host and can’t just tail a file), and on-call teams drowning in noisy alerts that fire on symptoms nobody can act on. The fix is not “more dashboards.” It is the right three signals, collected deliberately, correlated by design, and alerted on the things a user would actually notice.

To frame the whole field before the deep dive, here is the question each signal answers, where it lives, and the first place to look:

Signal	The question it answers	Where it lives on Azure	First place to look	The classic trap
Metrics	Is something wrong, and when did it start?	Azure Monitor Metrics (+ custom in Log Analytics)	Metrics Explorer; a metric alert	Alerting on infra metrics that look fine while users suffer
Logs	Why did it break — what was the error/event?	Log Analytics workspace (KQL)	App Insights Failures; `exceptions`/`traces`	Logging everything → cost spike; or nothing useful
Traces (distributed)	Where in the chain of services?	Application Insights (`requests`/`dependencies`)	Transaction search; Application Map	No correlation → can’t follow one request across services
User experience	What did the user actually see?	Application Insights (browser SDK, availability)	Users/Sessions; availability tests	Watching server health, not real-user outcomes

Learning objectives

By the end of this article you can:

Distinguish metrics, logs and distributed traces, name the question each answers, and pick the right one (and the right Azure surface) for a given diagnostic question instead of staring at CPU.
Stand up the full pipeline: instrument an app with the Application Insights SDK/auto-instrumentation via its connection string, route platform telemetry with diagnostic settings and Data Collection Rules (DCRs), and land it all in a Log Analytics workspace.
Read and write the KQL you actually need in an incident — the requests, dependencies, exceptions and traces queries that find the failing operation, the slow dependency and the noisy exception.
Configure adaptive and ingestion sampling deliberately so you keep the telemetry that matters and pay for it on purpose, and explain exactly what itemCount means when you read sampled data.
Build metric, log and dynamic-threshold alert rules against user-facing SLIs and route them through action groups to email, ITSM, webhooks, Functions and runbooks — without creating pager fatigue.
Choose table plans (Analytics / Basic / Auxiliary), retention and a daily cap to control the ingestion bill, and right-size a workspace topology for a multi-team estate.
Diagnose the common observability failures — no telemetry arriving, telemetry silently sampled away, an ingestion-cost blowout, a slow or empty KQL query, and silent or noisy alerts — with the exact command/portal path to confirm and the fix.

Prerequisites & where this fits

You should be comfortable with the Azure basics: a resource group, an Azure subscription, and running az in Cloud Shell and reading its JSON output. You should know what an App Service, Function or AKS workload is at a high level, since those are the things you’ll instrument, and understand HTTP request/response and the idea of a dependency call (your app calling a database or another API). Familiarity with at least one application stack (.NET, Java, Node, Python) helps, because instrumentation differs by language, but the concepts are identical across them.

This sits at the centre of the Observability & Operations track and is the tool every other operational article leans on. It is the layer beneath incident response: the Troubleshooting Azure App Service: 502/503, Cold Starts & Restart Loops playbook is half KQL against the very telemetry this article wires up, and the same is true for Troubleshooting Azure SQL: Connectivity, Timeouts, Throttling & Blocking. If you want to go deeper specifically on the collection plumbing — DCRs, transformations, action groups end to end — that is Azure Monitor Data Collection Rules, Workbooks, Alerting & Action Groups. On the cost side it pairs with Azure FinOps: Cost Management at Scale, because ingestion is one of the sneakier line items in an Azure bill.

A quick map of the four layers of an observability stack and who owns each, so you know which team to pull into an incident:

Layer	What it does	Azure surface	Who usually owns it
Instrumentation	Emits the telemetry from code/host	App Insights SDK, AMA agent, diag settings	App / dev team
Collection & shaping	Routes, filters, samples, transforms	DCR / DCE, sampling config	Platform / SRE
Store & query	Holds telemetry; answers KQL	Log Analytics workspace, Metrics store	Platform team
Insight & action	Visualizes, alerts, responds	Workbooks, Grafana, alerts, action groups	SRE + on-call

Core concepts

Five mental models make every later decision obvious.

Metrics, logs and traces are three different shapes of data, not three brands. A metric is a number at a point in time, pre-aggregated and cheap — request count per minute, p95 latency, CPU percent. A log is a structured event with a timestamp and arbitrary fields — an exception with a stack trace, an audit record, a custom event. A trace (distributed trace) is a tree of spans describing one logical operation as it crosses service boundaries, stitched together by a shared operation ID. Metrics are for “is it healthy and trending”; logs are for “what exactly happened”; traces are for “which hop in the chain is to blame.” Application Insights stores requests and dependencies as logs that carry trace correlation, which is why you can pivot from a metric spike to the exact failing operation to its full transaction in three clicks.

Azure Monitor is the umbrella; Log Analytics is the store; Application Insights is a lens. “Azure Monitor” is the product family — it owns the Metrics store, the alerting engine, and the collection pipeline. Log Analytics is the actual log database (a Kusto cluster) where almost all logs land; you query it with KQL. Application Insights is not a separate database — a modern (workspace-based) Application Insights resource writes into a Log Analytics workspace and gives you application-shaped tables (requests, dependencies, exceptions, pageViews) plus the APM experiences. So the same workspace can hold your VM logs, your platform diagnostics and your app telemetry, all queryable together. That unification is the whole point.

Telemetry has a physical pipeline, and things happen at each stage. Telemetry is emitted (SDK or agent), collected and shaped (sampling, a Data Collection Rule’s filter/transform), stored (Log Analytics table at some retention and table-plan), queried (KQL, Workbooks, the App Insights blades) and acted on (an alert rule fires an action group). Each stage is a place where signal can be lost (sampling), made expensive (ingesting a noisy log at full price), or rendered useless (a query that scans the wrong range). Knowing the pipeline tells you exactly where to look when something is wrong with the observability itself.

Sampling is a deliberate trade of fidelity for cost — understand it or it lies to you. High-traffic apps generate more telemetry than you want to pay to store. Adaptive sampling (the App Insights SDK default) keeps a representative fraction and drops the rest, but it is consistent — it keeps or drops an entire transaction together, so traces stay intact — and it records an itemCount multiplier on each retained item so metrics are still statistically correct. The trap: if you forget sampling is on, you’ll search for one specific request, not find it, and conclude the request never happened. It happened; it was sampled out. You must reason about sampling whenever you query individual records.

Identity, network and cost are first-class design decisions, not afterthoughts. Telemetry crosses the network (the SDK calls an ingestion endpoint on 443; a private estate needs a Private Link / AMPLS path). The workspace is an access-control boundary (who can read which logs is RBAC). And ingestion is metered per gigabyte, so what you collect and at which table plan is a recurring cost decision. Treating observability as “just turn it on” is how you get either a blind spot or a surprise invoice.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters
Azure Monitor	The platform umbrella for metrics, logs, alerts	Platform service	The brand that owns the pipeline + alerting
Log Analytics workspace	The Kusto log store you query with KQL	Resource group	Where (almost) all logs land; the cost boundary
Application Insights	APM lens over a workspace (requests/deps/traces)	Backed by a workspace	What the user experienced; distributed tracing
Metric	A pre-aggregated number over time	Metrics store / custom in Log Analytics	“Is it healthy / trending”; cheap alerts
Log	A timestamped structured event	Log Analytics table	“What exactly happened”; rich but billed per GB
Distributed trace	A tree of spans for one operation	App Insights (`requests`/`dependencies`)	“Which hop is to blame”; needs correlation
KQL	Kusto Query Language	Log Analytics / App Insights	The language you debug in
Connection string	Where the SDK sends telemetry (+ keys)	App config / env var	If unset/wrong → no telemetry at all
DCR (Data Collection Rule)	Declarative collect/filter/transform	Azure Monitor	Routes + shapes agent/platform telemetry
Diagnostic setting	Sends a resource’s platform logs/metrics out	Per Azure resource	How a resource’s logs reach the workspace
Sampling	Keep a representative fraction of telemetry	SDK / ingestion	Cuts cost; can hide individual records
Table plan (tier)	Analytics / Basic / Auxiliary	Per Log Analytics table	Trades query power for ingestion price
Alert rule	A condition over metrics/logs that fires	Azure Monitor	Turns signal into a page
Action group	The fan-out of notifications/automation	Azure Monitor	Email/SMS/ITSM/webhook/Function/runbook

The three signals: metrics, logs and traces in depth

Everything starts with choosing the right shape of data for the question. Get this wrong and you collect the expensive thing to answer a question the cheap thing already answers — or worse, you answer with the signal you happen to have rather than the one that’s correct. Here is the full comparison, end to end:

Dimension	Metrics	Logs	Distributed traces
Shape	Numeric time-series, pre-aggregated	Structured timestamped events	Tree of spans (one operation)
Answers	That / when it’s wrong	Why it happened	Where in the service chain
Granularity	1-minute buckets (PT1M) for platform and custom metrics	Per-event (every record)	Per-span / per-hop
Cost model	Cheap (platform metrics largely free)	Per-GB ingested + retention	Per-GB (stored as correlated logs)
Query surface	Metrics Explorer; metric alerts	KQL over Log Analytics	App Insights Transaction search / Map
Retention default	~93 days (platform metrics)	30 days free, up to 730 days	Same as the workspace (App Insights)
Alert latency	Seconds–1 min (near-real-time)	1–5+ min (log query interval)	n/a (you trace after an alert)
Cardinality limit	Bounded (dimensions cost)	High (any field)	High
Best for	SLO dashboards, fast alerts	Forensics, audit, errors	Cross-service root cause
Worst for	Root cause (no detail)	Cheap high-frequency counters	Aggregate trend

Metrics — cheap, fast, and the right thing to alert on first

Platform metrics are emitted automatically by every Azure resource at roughly 1-minute resolution and are largely free to query and alert on. They’re pre-aggregated, so a metric alert can fire in under a minute — which is why the first line of defence is almost always a metric alert (HTTP 5xx rate, response time, CPU, queue depth), not a log query. Custom metrics (emitted by the App Insights SDK, e.g. a business counter) land alongside, and you can also emit metrics into Log Analytics. The discipline: alert on metrics that map to user experience (5xx rate, p95 latency, availability) rather than only infrastructure (CPU), because the infra can look perfect while the user suffers — exactly the trap in the opening story.

The metric properties that matter when you build a chart or an alert:

Metric property	What it controls	Typical value	Why it matters
Aggregation	How samples combine (Avg/Sum/Min/Max/Count)	Avg for latency, Sum for counts	Wrong aggregation hides the spike (Avg masks a tail)
Granularity / time grain	Bucket size	1 min (platform)	Finer grain = faster detection, more points
Dimensions / splitting	Break a metric by a property	by `instance`, by `resultCode`	Find the one bad instance/route; each dimension costs
Namespace	Which resource type the metric belongs to	`Microsoft.Web/sites`	Determines which metrics exist
Retention	How long it’s queryable	~93 days (platform)	Long-term trend needs export to a workspace
Time aggregation window	Period the alert evaluates	5 min	Too short = flaps; too long = slow alert
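
To make the aggregation row concrete, here is a minimal Python sketch of how an average can hide a latency tail. The durations are invented for illustration (94 fast requests and 6 that wait on a slow dependency), and the nearest-rank `p95` helper is a hypothetical stand-in, not how Azure Monitor computes percentiles.

```python
import statistics


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-95 * len(ordered) // 100))  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Hypothetical one-minute bucket of request durations, in milliseconds.
durations_ms = [120] * 94 + [4200] * 6

avg = statistics.mean(durations_ms)
tail = p95(durations_ms)

print(f"avg = {avg:.0f} ms, p95 = {tail} ms")
```

In this made-up bucket the average stays in the hundreds of milliseconds while the 95th percentile sits at the timeout-like value, which is the sense in which Avg masks a tail. An alert on the average would likely stay quiet here; one on p95 would not.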

# List the metric definitions a resource actually exposes (don't guess names)
az monitor metrics list-definitions \
  --resource $(az webapp show -n app-shop-prod -g rg-shop-prod --query id -o tsv) \
  --query "[].{metric:name.value, unit:unit}" -o table

# Pull HTTP 5xx for the last hour, 1-minute grain
az monitor metrics list \
  --resource $(az webapp show -n app-shop-prod -g rg-shop-prod --query id -o tsv) \
  --metric Http5xx --offset 1h --interval PT1M --aggregation Total -o table

Logs — the forensic record, billed per gigabyte

A log is the detailed event you read after a metric tells you something is wrong. Exceptions, request records, dependency calls, platform diagnostics, audit events — all land as rows in Log Analytics tables (exceptions, requests, AzureDiagnostics, AppServiceHTTPLogs, …) and you query them with KQL. Logs are where root cause actually lives, but they are billed per GB ingested plus retention, so the engineering decision is which logs at what fidelity. The most common mistakes are equal and opposite: enabling every diagnostic category “just in case” (a cost blowout), or enabling none (flying blind). The right answer is deliberate — the high-value categories at full price, the high-volume-low-value ones at a cheaper table plan or not at all.

The log tables you’ll actually open, by source:

Table	Source	What it holds	You query it for
`requests`	App Insights	Incoming HTTP requests + result code	Failing/slow operations
`dependencies`	App Insights	Outbound calls (DB, HTTP, queue)	The slow/failing downstream
`exceptions`	App Insights	Server + client exceptions	What’s throwing and where
`traces`	App Insights	App log lines (ILogger/console)	Correlated app logs for an operation
`customEvents` / `customMetrics`	App Insights	Business events / counters	Funnel/business telemetry
`pageViews` / `browserTimings`	App Insights (browser)	Real-user front-end timing	Client-side experience
`AzureDiagnostics`	Diagnostic settings	Many resources’ platform logs	Resource-level operations/errors
`AppServiceHTTPLogs`	App Service diag	Web server access logs	Status codes, latency at the edge
`AzureActivity`	Activity log	Control-plane operations	“Who changed/restarted what”
`Heartbeat`	AMA agent	Agent liveness	Is the VM/agent reporting at all

Distributed traces — following one request across the whole estate

The signal that the opening team lacked. A distributed trace stitches a single user operation across every service it touches using a propagated operation ID (Azure adopts the W3C Trace Context traceparent header). In Application Insights, an incoming request becomes a requests row and every outbound call it makes becomes a dependencies row carrying the same operation_Id — so Transaction search can reconstruct the full waterfall, and the Application Map draws the live topology with per-edge latency and failure rates. This is what lets you say “the checkout was slow because the payment dependency took 4.2 s, not our code” in seconds. The prerequisite is that every service is instrumented and propagates the context; a single un-instrumented hop breaks the chain.
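
As a rough illustration of what "propagates the context" means, the Python sketch below parses a W3C `traceparent` header and builds the header a service would send on its own outbound call. The function names are hypothetical, the example header value is the one used in the W3C Trace Context specification, and the mapping of the trace-id to `operation_Id` is stated here as an assumption about how the SDKs surface it. Real SDKs do this for you.

```python
import re

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the parsed fields, or None if the header is not usable."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # values the specification treats as invalid
    return {
        "trace_id": trace_id,      # shared by every hop (assumed to surface as operation_Id)
        "parent_id": parent_id,    # the caller's span
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_traceparent(incoming, new_span_id):
    """Keep the trace-id, replace the parent-id with this service's span ID."""
    flags = "01" if incoming["sampled"] else "00"
    return f"00-{incoming['trace_id']}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = outgoing_traceparent(incoming, "b7ad6b7169203331")
print(outgoing)
```

A hop that drops the header, or starts a fresh trace-id, is the "un-instrumented hop" that breaks the chain: everything downstream of it lands under a different trace.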

The trace correlation fields and what each is for:

Field	Meaning	Used for
`operation_Id`	The whole end-to-end trace ID	Group every span of one transaction
`operation_ParentId`	The parent span’s ID	Build the call tree / waterfall
`id`	This span’s own ID	Identify a single hop
`operation_Name`	Logical operation (e.g. `POST /checkout`)	Aggregate by route
`cloud_RoleName`	The service/app name	Which service in the Application Map
`cloud_RoleInstance`	The specific instance	Is one instance worse than the rest
`success`	Did the request/dependency succeed	Filter to failures
`resultCode`	Status/result	500 vs 200; SQL error number

// Reconstruct one transaction end-to-end from any operation_Id
let opId = "0HM...EXAMPLE";
union requests, dependencies, exceptions, traces
| where operation_Id == opId
| project timestamp, itemType, name, target, resultCode, duration, success
| order by timestamp asc
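
The waterfall that Transaction search draws can be approximated from those rows by following `operation_ParentId` back to `id`. The Python sketch below assumes a handful of invented rows shaped like the query result (the IDs, names and durations are hypothetical) and treats any row whose parent is not in the result set as a root.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the output of the union query.
rows = [
    {"id": "a1", "operation_ParentId": "op1", "itemType": "request",
     "name": "POST /checkout", "duration": 4350},
    {"id": "b2", "operation_ParentId": "a1", "itemType": "dependency",
     "name": "SQL: orders", "duration": 40},
    {"id": "c3", "operation_ParentId": "a1", "itemType": "dependency",
     "name": "POST payments/charge", "duration": 4200},
]


def build_waterfall(rows):
    """Return indented lines, one per span, children under their parent."""
    known_ids = {row["id"] for row in rows}
    children = defaultdict(list)
    roots = []
    for row in rows:
        parent = row["operation_ParentId"]
        if parent in known_ids:
            children[parent].append(row)
        else:
            roots.append(row)  # parent is outside this result set

    lines = []

    def walk(row, depth):
        lines.append(f"{'  ' * depth}{row['itemType']}: {row['name']} ({row['duration']} ms)")
        for child in children[row["id"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


waterfall = build_waterfall(rows)
slowest = max((r for r in rows if r["itemType"] == "dependency"), key=lambda r: r["duration"])
print("\n".join(waterfall))
print("slowest dependency:", slowest["name"])
```

With these sample rows the slowest child is the payment call, which is the shape of evidence behind a statement like "the dependency took most of the time, not our code".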

Application Insights end to end: instrument, correlate, explore

Application Insights is the single most useful tool in this whole stack — it is where you’ll spend the incident. Getting it wired correctly (and modern) matters more than any dashboard.

The connection string — the one setting that decides whether anything arrives

Modern Application Insights is configured with a connection string, not the legacy bare instrumentation key (iKey). The connection string carries the iKey and the regional ingestion endpoint (and Live Metrics endpoint), which is why the legacy iKey-only path is deprecated — it hard-codes the global endpoint and breaks in sovereign/regional clouds and Private Link setups. If telemetry isn’t arriving, this is the first thing to check: is APPLICATIONINSIGHTS_CONNECTION_STRING set, and is it the connection string (not just a GUID)?
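
That first check can be scripted. The Python sketch below is one possible way to classify the value of `APPLICATIONINSIGHTS_CONNECTION_STRING`; the function name and the returned labels are hypothetical, the sample values are placeholders, and it assumes the usual `Key=Value;Key=Value` layout with `InstrumentationKey` and `IngestionEndpoint` keys.

```python
import re

GUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def check_connection_string(value):
    """Classify a setting value as unset, bare-ikey, malformed, no-endpoint or ok."""
    value = (value or "").strip()
    if not value:
        return "unset"            # nothing configured, so nothing is sent
    if GUID.match(value):
        return "bare-ikey"        # a GUID alone is the legacy iKey, not a connection string
    parts = {}
    for segment in value.split(";"):
        if "=" in segment:
            key, _, val = segment.partition("=")
            parts[key.strip().lower()] = val.strip()
    if "instrumentationkey" not in parts:
        return "malformed"
    if "ingestionendpoint" not in parts:
        return "no-endpoint"      # no regional ingestion endpoint in the string
    return "ok"


# Placeholder values for illustration only.
placeholder_key = "00000000-0000-0000-0000-000000000000"
full = f"InstrumentationKey={placeholder_key};IngestionEndpoint=https://example.invalid/"

for candidate in ("", placeholder_key, full):
    print(repr(candidate[:30]), "->", check_connection_string(candidate))
```

In an app you would pass `os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING")` to the function at startup and log anything other than `ok`.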

# Set the connection string on an App Service (the modern, correct way)
az webapp config appsettings set -n app-shop-prod -g rg-shop-prod \
  --settings APPLICATIONINSIGHTS_CONNECTION_STRING="$(az monitor app-insights component show \
    -a ai-shop-prod -g rg-shop-prod --query connectionString -o tsv)"

// Workspace-based App Insights + wire its connection string into the web app
resource ai 'Microsoft.Insights/components@2020-02-02' = {
  name: 'ai-shop-prod'
  location: location
  kind: 'web'
  properties: {
    Application_Type: 'web'
    WorkspaceResourceId: law.id          // workspace-based (classic is retired)
  }
}
resource appSettings 'Microsoft.Web/sites/config@2023-12-01' = {
  parent: site
  name: 'appsettings'
  properties: {
    APPLICATIONINSIGHTS_CONNECTION_STRING: ai.properties.ConnectionString
    ApplicationInsightsAgent_EXTENSION_VERSION: '~3'   // codeless auto-instrumentation ('~3' on Linux; Windows plans use '~2')
  }
}

The connection-string and instrumentation choices, compared:

Approach	How it works	Pros	Cons / when not
Connection string (current)	iKey + ingestion + live endpoints in one string	Works in all clouds + Private Link; future-proof	None — this is the standard
Instrumentation key only (legacy)	Bare GUID, global endpoint assumed	Simplest historically	Deprecated; breaks regional/PL routing
Codeless / auto-instrumentation	Platform agent injects the SDK	No code change; fast	Less control; not every stack/feature
SDK (manual)	Add the package, configure in code	Full control, custom telemetry	You own upgrades and config
OpenTelemetry + Azure Monitor exporter	Vendor-neutral OTel → App Insights	Portable instrumentation	Newer; feature parity still maturing

The instrumentation surface: what gets captured

Once wired, the SDK/agent captures a standard set of telemetry types automatically, and you can add custom ones. Knowing the type tells you which table it lands in and how it’s billed:

Telemetry type	Captured automatically?	Lands in table	Notes
Request	Yes (server SDK/agent)	`requests`	One per incoming HTTP request
Dependency	Yes (HTTP/SQL/queue auto-collected)	`dependencies`	One per outbound call
Exception	Yes (unhandled) + manual	`exceptions`	TrackException for handled ones
Trace (log)	Via ILogger / console capture	`traces`	App log lines, severity-filtered
Custom event	Manual (`TrackEvent`)	`customEvents`	Business funnels
Custom metric	Manual (`TrackMetric`/`GetMetric`)	`customMetrics`	Pre-aggregate hot counters
Page view / browser	Browser (JS) SDK	`pageViews`	Real-user front-end
Availability	Availability tests	`availabilityResults`	Synthetic uptime probes

The experiences you live in during an incident

The portal blades are not decoration: each maps to a question. Failures groups failed requests/dependencies/exceptions by type and shows the exact failing operation and stack. Performance ranks operations by duration so you find the slow one. Live Metrics streams request/failure rate, CPU and live exceptions in near real time, with roughly one second of latency (and is not sampled), which makes it invaluable while an incident is unfolding. Transaction search reconstructs one operation’s full waterfall. The Application Map draws the topology with per-edge health.

Experience	Question it answers	Sampled?	When to reach for it
Failures	What’s failing and why?	Yes (respects sampling)	Triage a spike in errors
Performance	What’s slow, and which operation?	Yes	Latency regression
Live Metrics	What’s happening right now?	No (live stream)	During an active incident
Transaction search	What did this one request do?	Yes	Follow a specific failed transaction
Application Map	What’s the topology + per-hop health?	Aggregated	Find the unhealthy service/edge
Users / Sessions / Funnels	Real-user behaviour & impact	Yes	Blast radius, business impact
Availability	Is it up from the outside?	n/a	Synthetic uptime / SLA

// Top failing operations in the last hour with their result code
requests
| where timestamp > ago(1h) and success == false
| summarize failures = sum(itemCount) by operation_Name, resultCode
| order by failures desc

Log Analytics and KQL: the query layer you debug in

All of the above lands in a Log Analytics workspace, and KQL is how you ask it questions. You don’t need to be a Kusto wizard — a dozen patterns cover ninety percent of incidents. The structure is always the same: pick a table, filter by time first (this bounds the scan and the cost), filter by condition, then summarize or project.

The KQL operators you’ll actually use

Operator	What it does	Example fragment
`where`	Filter rows (put time first)	`where timestamp > ago(30m)`
`summarize`	Aggregate (count, avg, percentile)	`summarize count() by operation_Name`
`project` / `extend`	Select / compute columns	`project name, duration`
`join`	Correlate two tables	`join kind=inner (exceptions) on operation_Id`
`bin()`	Bucket time for trends	`summarize count() by bin(timestamp, 5m)`
`percentile()`	Latency tails (p95/p99)	`summarize percentile(duration, 95)`
`top` / `order by`	Rank	`top 10 by failures desc`
`parse` / `extract`	Pull fields from strings	`parse message with ...`
`union`	Combine tables	`union requests, dependencies`
`make-series`	Dense time-series for charts/anomaly	`make-series ... default=0`
`materialize()`	Cache a subquery reused multiple times	`let x = materialize(...)`
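
If the time-first `where` plus `summarize ... by bin(timestamp, 5m)` pattern is unfamiliar, this Python sketch mimics what it computes on a few invented rows. It assumes `bin()` floors each timestamp to a multiple of the bucket width; the row values and helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bin_floor(ts, width):
    """Round a timestamp down to the start of its bucket."""
    return EPOCH + ((ts - EPOCH) // width) * width


def summarize_by_bin(rows, width, since):
    """Roughly: where timestamp > since | summarize sum(itemCount) by bin(timestamp, width)."""
    totals = defaultdict(int)
    for row in rows:
        if row["timestamp"] <= since:   # the time filter comes first
            continue
        totals[bin_floor(row["timestamp"], width)] += row["itemCount"]
    return dict(sorted(totals.items()))


def at(hour, minute, second):
    return datetime(2024, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Hypothetical exception rows.
rows = [
    {"timestamp": at(12, 1, 10), "itemCount": 5},
    {"timestamp": at(12, 3, 59), "itemCount": 5},
    {"timestamp": at(12, 5, 0), "itemCount": 1},
    {"timestamp": at(12, 9, 30), "itemCount": 2},
]

buckets = summarize_by_bin(rows, timedelta(minutes=5), since=at(12, 0, 0))
for start, total in buckets.items():
    print(start.strftime("%H:%M"), total)
```

The four rows collapse into two five-minute buckets, which is the series a timechart or a log alert would then evaluate.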

The queries you reach for in an incident

One query per question — keep these bookmarked:

Question	Table	One-liner
Which requests are failing and where?	`requests`	`where success==false | summarize sum(itemCount) by resultCode, operation_Name`
What’s actually throwing?	`exceptions`	`summarize sum(itemCount) by problemId, outerMessage`
Which dependency is failing/slow under load?	`dependencies`	`where success==false | summarize sum(itemCount) by target, type`
Is one instance worse than the rest?	`requests`	`summarize sum(itemCount) by cloud_RoleInstance`
Are requests slow (cold start / timeout)?	`requests`	`summarize percentile(duration,95) by bin(timestamp,1m)`
What did this one transaction do?	`union`	`where operation_Id == "<id>" | order by timestamp asc`
How much am I ingesting, by source?	`Usage`	`summarize GB = sum(Quantity)/1000 by DataType`
Did sampling drop my record?	`requests`	`summarize keptRepresenting = sum(itemCount), rows = count()`

// Slow dependencies in the last hour, ranked — the "which downstream?" query
dependencies
| where timestamp > ago(1h)
| summarize calls = sum(itemCount), p95 = percentile(duration, 95) by target, type
| where p95 > 1000   // ms
| order by p95 desc

// Exception rate trend, 5-minute buckets — feeds a chart or a log alert
exceptions
| where timestamp > ago(6h)
| summarize errors = sum(itemCount) by bin(timestamp, 5m), cloud_RoleName
| render timechart

A note on itemCount: when sampling is on, each retained row represents itemCount original items, so you sum(itemCount) (not count()) to get true volumes. Forgetting this undercounts everything by the sampling factor — a real source of “the numbers don’t match the load balancer.”
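
A small worked example of that arithmetic, as a Python sketch. The retained rows and their `itemCount` values are invented for illustration; the point is only the difference between counting rows and summing the multiplier.

```python
# Hypothetical rows that survived sampling; itemCount mirrors the column of the same name.
retained = [
    {"operation_Name": "POST /checkout", "success": False, "itemCount": 20},
    {"operation_Name": "POST /checkout", "success": True, "itemCount": 20},
    {"operation_Name": "GET /cart", "success": True, "itemCount": 10},
    {"operation_Name": "GET /health", "success": True, "itemCount": 1},
]

row_count = len(retained)                                   # what count() reports
estimated_total = sum(r["itemCount"] for r in retained)     # what sum(itemCount) reports
estimated_failed = sum(r["itemCount"] for r in retained if not r["success"])

print(f"rows stored:        {row_count}")
print(f"requests estimated: {estimated_total}")
print(f"failures estimated: {estimated_failed}")
print(f"failure rate:       {estimated_failed / estimated_total:.1%}")
```

Four stored rows stand for an estimated 51 requests here, so a `count()` would report less than a tenth of the real volume, and a failure rate computed from row counts would be skewed whenever rows carry different multipliers.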

Data collection: diagnostic settings, DCRs and agents

Telemetry doesn’t collect itself. Three mechanisms feed the workspace, and choosing the right one — and shaping the data on the way in — is where you control both coverage and cost.

Diagnostic settings — the per-resource firehose

Every Azure resource can have diagnostic settings that send its platform logs (categories like AuditLogs, AppServiceHTTPLogs, SQLSecurityAuditEvents) and platform metrics to a destination — a Log Analytics workspace, a storage account (cheap archive), or an Event Hub (stream out). The decision per resource is which categories (each has a volume/value profile) and which destination. Sending high-volume categories to Log Analytics at full price is the classic bill-inflater; route those to storage or a cheaper table plan instead.

Destination	Use it for	Cost profile	Query story
Log Analytics workspace	Interactive query + alerting	Per-GB ingest + retention	Full KQL
Storage account	Long-term cheap archive / compliance	Cheapest per-GB	No KQL (export/parse)
Event Hub	Stream to SIEM / third-party	Throughput-based	External consumer
Partner / Marketplace	Datadog, etc.	Vendor billing	Vendor tooling

# Send App Service HTTP + console logs and all metrics to the workspace
az monitor diagnostic-settings create \
  --name diag-to-law \
  --resource $(az webapp show -n app-shop-prod -g rg-shop-prod --query id -o tsv) \
  --workspace $(az monitor log-analytics workspace show -g rg-obs -n law-shared --query id -o tsv) \
  --logs    '[{"category":"AppServiceHTTPLogs","enabled":true},{"category":"AppServiceConsoleLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

Data Collection Rules (DCRs) — collect, filter and transform on the way in

For agent-based and many resource-log paths, the modern control plane is the Data Collection Rule (DCR): a declarative object that says what to collect (perf counters, syslog, Windows events, custom logs), where to send it, and — crucially — an optional KQL transformation that filters or reshapes rows before ingestion. That transform is a cost lever and a privacy lever: drop debug-level noise, strip a PII column, or down-sample chatty events before you pay to store them. A Data Collection Endpoint (DCE) is the network ingress the DCR uses (and the anchor for Private Link). DCRs are covered in depth in Azure Monitor Data Collection Rules, Workbooks, Alerting & Action Groups; here is the shape and the knobs.

DCR element	What it controls	Why it matters
Data sources	Counters / syslog / events / custom	Defines what is collected
Destinations	Which workspace(s)	Routing / multi-home
Transform (KQL)	Filter/reshape pre-ingestion	Cut volume + cost; drop/mask fields
Streams	Named schema of the data	Binds a source to a transform/destination
DCE	Network ingestion endpoint	Private Link anchor; regional ingress
Association	Which resources the DCR applies to	One rule, many machines

// DCR that collects syslog (auth/daemon, Warning and above); the transform is a second guard that drops debug-level rows before ingestion
resource dcr 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
  name: 'dcr-linux-prod'
  location: location
  properties: {
    dataSources: {
      syslog: [ { name: 'sys', facilityNames: ['auth','daemon'], logLevels: ['Warning','Error','Critical'], streams: ['Microsoft-Syslog'] } ]
    }
    destinations: { logAnalytics: [ { name: 'law', workspaceResourceId: law.id } ] }
    dataFlows: [ {
      streams: ['Microsoft-Syslog']
      destinations: ['law']
      transformKql: 'source | where SeverityLevel !~ "debug"'   // case-insensitive match; cost control at ingestion
    } ]
  }
}
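To make the effect of a transform concrete, here is a minimal Python sketch that mimics the two things the paragraph above describes: filtering rows and stripping a column before anything is stored. It is a simulation only, not the DCR engine. The `client_ip` field, the sample rows and the JSON-length size proxy are all assumptions made for illustration; billed size is calculated differently.

```python
import json

def apply_transform(rows, drop_levels=("debug",), strip_fields=("client_ip",)):
    """Mimic a pre-ingestion transform: filter rows, then remove columns."""
    kept = []
    for row in rows:
        # Filter step: the equivalent of `source | where SeverityLevel != "debug"`.
        if str(row.get("SeverityLevel", "")).lower() in drop_levels:
            continue
        # Reshape step: the equivalent of `| project-away client_ip`.
        kept.append({k: v for k, v in row.items() if k not in strip_fields})
    return kept

def payload_bytes(rows):
    """Rough size proxy: length of the JSON encoding (not the billed size)."""
    return len(json.dumps(rows).encode("utf-8"))

# Hypothetical syslog rows; client_ip stands in for a PII column.
rows = [
    {"SeverityLevel": "debug", "SyslogMessage": "cache probe ok", "client_ip": "10.0.0.4"},
    {"SeverityLevel": "warning", "SyslogMessage": "disk 85% full", "client_ip": "10.0.0.5"},
    {"SeverityLevel": "err", "SyslogMessage": "auth failure", "client_ip": "10.0.0.6"},
]
kept = apply_transform(rows)
saved = 1 - payload_bytes(kept) / payload_bytes(rows)
print(f"kept {len(kept)} of {len(rows)} rows, roughly {saved:.0%} smaller")
```

The point of the sketch is the ordering: rows and columns removed here never reach the store, so they are never billed and never queryable.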

Agents — what runs inside a VM/AKS to collect host telemetry

For VMs and Kubernetes you need an in-host collector. The current one is the Azure Monitor Agent (AMA), configured by DCRs — it replaced the legacy Log Analytics agent (MMA/OMS), which is retired. The distinction matters because old docs and old templates still reference MMA; on a greenfield estate you use AMA + DCRs exclusively.

Agent	Status	Configured by	Use it when
Azure Monitor Agent (AMA)	Current	DCRs	Everything new (VMs, Arc, AKS host)
Log Analytics agent (MMA/OMS)	Retired	Workspace config	Migrate off it
Diagnostics extension (WAD/LAD)	Legacy, niche	Extension config	Specific legacy guest-metric paths
Container Insights (AKS)	Current	DCR (managed)	AKS cluster/node/pod telemetry
Dependency agent	Add-on	With AMA	Service Map / VM dependency view

Sampling and ingestion control: keeping the right data at the right price

High-traffic apps force a choice: store everything (expensive) or sample (cheaper, but you must understand what you keep). This section is where observability stops being “turn it on” and becomes engineering.

How sampling works, and the three kinds

Adaptive sampling is the App Insights SDK default for server telemetry: it dynamically keeps a target rate (e.g. ~5 items/second per instance) and drops the rest, consistently (a whole transaction is kept or dropped together, so traces stay intact) and with itemCount so aggregate metrics remain correct. Fixed-rate sampling keeps a constant percentage (good when you want predictable volume and to coordinate client+server). Ingestion sampling happens at the service after data leaves the SDK (a blunt fallback when you can’t change code). The cardinal rule: never let sampling silently drop the telemetry you most need — exclude critical types (e.g. all exceptions, or a specific high-value operation) from sampling.

Sampling type	Where it runs	Keeps	Pros	Cons
Adaptive (default)	SDK, per instance	Target items/sec, consistent	Auto-tunes to load; traces intact	Rate varies; must reason about `itemCount`
Fixed-rate	SDK (client + server)	Constant %	Predictable; coordinate end-to-end	Doesn’t adapt to spikes
Ingestion sampling	App Insights service	Constant % post-SDK	No code change	Blunt; you already paid to send it
No sampling	—	Everything	Full fidelity	Highest cost; high-volume apps can’t
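The two properties that matter in the table above (whole transactions kept or dropped together, and `itemCount` keeping aggregates honest) are easier to trust once you have seen them in miniature. The Python sketch below is a simplified fixed-rate sampler, assuming a SHA-256 hash of the operation id as the keep/drop decision; the real SDK uses its own hashing, adaptive sampling varies the rate over time, and the real `itemCount` is an integer.

```python
import hashlib

def keep(operation_id: str, rate_percent: float) -> bool:
    """Decide from a hash of the operation id, so every item in one
    transaction gets the same keep/drop decision."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate_percent / 100.0

def sample(items, rate_percent: float):
    """Return kept items, each stamped with how many originals it represents."""
    weight = 100.0 / rate_percent
    return [dict(item, itemCount=weight) for item in items
            if keep(item["operation_Id"], rate_percent)]

# 10,000 hypothetical transactions, each emitting one request and one dependency.
items = []
for i in range(10_000):
    op = f"op-{i}"
    items.append({"operation_Id": op, "type": "request"})
    items.append({"operation_Id": op, "type": "dependency"})

kept = sample(items, rate_percent=10)
rows = len(kept)
represented = sum(item["itemCount"] for item in kept)
print(f"rows={rows} represented={represented:.0f} true={len(items)}")
```

Counting rows understates the load by roughly the sampling factor, while summing `itemCount` lands close to the true volume. That is the same comparison as `count()` versus `sum(itemCount)` in KQL.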

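// ASP.NET Core: adaptive sampling, but NEVER sample exceptions (you always want those).
// ExcludedTypes cannot be set from appsettings.json in ASP.NET Core, so configure it in Program.cs
// (classic Application Insights SDK; `builder` is the usual WebApplication builder):
using Microsoft.ApplicationInsights.AspNetCore.Extensions;
using Microsoft.ApplicationInsights.Extensibility;

// Turn off the built-in default so the customised processor below is the only sampler.
builder.Services.AddApplicationInsightsTelemetry(
    new ApplicationInsightsServiceOptions { EnableAdaptiveSampling = false });
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseAdaptiveSampling(maxTelemetryItemsPerSecond: 5, excludedTypes: "Exception");
    chain.Build();
});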

Ingestion cost levers: table plans, retention, and the daily cap

Three settings move the bill more than anything else. First, the table plan (tier): Analytics (full KQL, alerting, dashboards), Basic (cheaper ingestion, query-only with limits, short interactive retention; suited to high-volume, occasionally-queried logs such as verbose app logs), and Auxiliary (cheapest, for rarely-queried archival/audit data). Second, retention: 30 days is included; beyond that you pay, up to 730 days interactive, with cheaper long-term archive after that. Third, the daily cap, which is the seatbelt: it stops ingestion (or warns) when you hit a GB ceiling, so a runaway log can’t produce a runaway invoice. Set it carefully, because a cap that’s too low drops the telemetry you need during the very incident that spiked it.

Table plan	Ingestion cost	Query	Interactive retention	Best for
Analytics	Standard (highest)	Full KQL, alerts, dashboards	up to 730 days	Security/ops logs you query + alert on
Basic	Lower	Query-only, limited operators	30 days (then archive)	High-volume verbose logs, occasional query
Auxiliary	Lowest	Limited, batch	Long (archive-first)	Rarely-queried audit/compliance

Cost lever	What it does	Range / default	Watch-out
Daily cap (GB/day)	Stops/warns ingestion at a ceiling	off by default	Too low → drops data mid-incident
Commitment tier	Discounted reserved GB/day	100/200/…/5000 GB	Under-commit forgoes the discount; over-commit pays for unused capacity
Retention (interactive)	Days queryable with full KQL	30 free → 730	Long retention multiplies storage cost
Archive	Cheap cold retention	beyond interactive	Restore/search has latency + cost
Per-table retention	Override workspace default per table	per table	Keep audit long, telemetry short
Basic/Aux tier move	Re-tier a noisy high-volume table	per table	Lose alerting on Basic tables

# Set a sane default retention and a daily cap on the workspace
az monitor log-analytics workspace update -g rg-obs -n law-shared \
  --retention-time 90 \
  --quota 50   # daily ingestion cap in GB; -1 means no cap

// Where is the volume actually going? Run this before you optimize anything.
Usage
| where TimeGenerated > ago(7d) and IsBillable == true
| summarize GB = sum(Quantity)/1000 by DataType
| order by GB desc
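If you export those `Usage` rows, the same arithmetic is easy to reproduce and extend with a warning before the daily cap is hit. The sketch below mirrors the query in Python and adds a check at 80% of the cap. The sample rows, the 50 GB cap and the 0.8 fraction are illustrative assumptions, not recommendations.

```python
from collections import defaultdict

def gb_by_datatype(usage_rows):
    """Mirror of the KQL: sum billable Quantity (MB) per DataType, as GB."""
    totals = defaultdict(float)
    for row in usage_rows:
        if row["IsBillable"]:
            totals[row["DataType"]] += row["Quantity"] / 1000
    # Largest first, like `order by GB desc`.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

def cap_warning(daily_gb, cap_gb, warn_fraction=0.8):
    """True once the day's ingestion crosses the warning fraction of the cap."""
    return daily_gb >= warn_fraction * cap_gb

usage_rows = [  # hypothetical one-day sample
    {"DataType": "AppTraces", "Quantity": 31_000.0, "IsBillable": True},
    {"DataType": "AppRequests", "Quantity": 9_500.0, "IsBillable": True},
    {"DataType": "Heartbeat", "Quantity": 120.0, "IsBillable": False},
    {"DataType": "AppTraces", "Quantity": 2_000.0, "IsBillable": True},
]
totals = gb_by_datatype(usage_rows)
daily_gb = sum(totals.values())
print(totals, "warn:", cap_warning(daily_gb, cap_gb=50))
```

The ordering tells you which table to tier or filter first; the warning tells you to act while data is still flowing.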

Alerts and action groups: turning signal into the right page

Telemetry you don’t alert on is a forensic luxury; telemetry you over-alert on is pager fatigue that trains people to ignore the page. The art is alerting on symptoms a user would notice, at thresholds that mean “act now,” routed to the right responder.

Alert rule types

Alert type	Evaluates	Latency	Cost	Best for
Metric alert	A metric vs threshold (static/dynamic)	Near-real-time (≈1 min)	Cheap (per rule)	5xx rate, latency, CPU, queue depth
Log (scheduled query) alert	A KQL query result on an interval	Minutes (query interval)	Per evaluation	Anything only logs can express
Activity log alert	Control-plane events	Minutes	Free	“Someone deleted/restarted X”
Resource health alert	Azure-reported resource health	Minutes	Free	Platform-side outages
Smart Detection (App Insights)	ML over your telemetry	Auto	Included	Anomaly/failure-rate surprises

Static vs dynamic thresholds, and severity

A static threshold is a fixed number (5xx > 1%). A dynamic threshold learns the metric’s normal pattern (including daily/weekly seasonality) and alerts on deviation — better for metrics with no obvious fixed line (traffic, latency that varies by time of day). Severity (Sev0 critical → Sev4 verbose) should map to response expectation, and you should tune evaluation frequency and aggregation window to avoid flapping.

Threshold / setting	What it does	When to use
Static threshold	Fixed value comparison	You know the SLO line (e.g. p95 < 800 ms)
Dynamic threshold	ML-learned normal band	Seasonal/variable metrics (traffic, latency)
Severity Sev0–Sev4	Criticality → response expectation	Sev0 = wake someone; Sev3/4 = ticket
Aggregation window	Period evaluated	Smooth out 1-minute spikes
Evaluation frequency	How often it’s checked	Balance speed vs flapping/cost
Auto-mitigate	Resolve when condition clears	Reduce stale alerts
Suppression / alert processing rules (formerly action rules)	Mute during maintenance; dedupe	Stop storms; respect change windows
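The difference between the two threshold styles is clearest on a metric with a daily curve. The sketch below is a deliberately naive stand-in for a dynamic threshold: it learns a mean and standard deviation per hour of day and flags values above mean plus three standard deviations. This is not the algorithm Azure Monitor uses; the request rates and the sensitivity of 3.0 are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def learn_baseline(history):
    """history: list of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def breaches_dynamic(baseline, hour, value, sensitivity=3.0):
    """Flag a value above the learned band for that hour of day."""
    mu, sigma = baseline[hour]
    return value > mu + sensitivity * sigma

def breaches_static(value, threshold):
    """Fixed-line comparison, blind to time of day."""
    return value > threshold

# Hypothetical request rates over seven days: quiet at 03:00, busy at 21:00.
history = [(3, v) for v in (95, 100, 105, 98, 102, 101, 99)] + \
          [(21, v) for v in (2100, 2200, 2300, 2150, 2250, 2180, 2220)]
baseline = learn_baseline(history)

# 600 rps at 03:00 is abnormal for that hour but sits under a static line of 2500.
print(breaches_dynamic(baseline, 3, 600), breaches_static(600, 2500))
# 2250 rps at 21:00 is a normal evening, yet a static line of 500 pages every night.
print(breaches_dynamic(baseline, 21, 2250), breaches_static(2250, 500))
```

A static line set high enough for the evening peak misses the overnight anomaly, and one set low enough to catch it fires every evening. That trade-off is what the learned band removes.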

Action groups — the fan-out

An action group is the reusable list of what happens when an alert fires: notify (email, SMS, push, voice), integrate (webhook, ITSM/ServiceNow, Logic App), or automate (Azure Function, Automation runbook). One well-built action group is referenced by many alerts. Test it before you rely on it — an untested action group is the reason a real alert pages nobody.

Action group action	Channel	Use for
Email / SMS / Push / Voice	Azure Mobile App, phone	Human notification, tiered by severity
Webhook / Secure webhook	HTTP callback	Custom integrations, ChatOps
ITSM / ServiceNow	Connector	Auto-create incidents/tickets
Logic App	Workflow	Enrich, route, multi-step response
Azure Function	Code	Auto-remediation logic
Automation runbook	PowerShell/Python	Restart/scale/heal actions

# Create an action group, then a metric alert that uses it
az monitor action-group create -g rg-obs -n ag-oncall \
  --short-name oncall \
  --action email sre sre@kloudvin.example

az monitor metrics alert create -g rg-obs -n alert-5xx \
  --scopes $(az webapp show -n app-shop-prod -g rg-shop-prod --query id -o tsv) \
  --condition "total Http5xx > 10" \
  --window-size 5m --evaluation-frequency 1m \
  --severity 1 --action ag-oncall \
  --description "HTTP 5xx spike on shop-prod"

// A log (scheduled query) alert on the exception rate, wired to the action group
resource logAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-exception-rate'
  location: location
  properties: {
    severity: 2
    enabled: true
    scopes: [ ai.id ]
    evaluationFrequency: 'PT5M'
    windowSize: 'PT5M'
    criteria: { allOf: [ {
      query: 'exceptions | summarize errors = sum(itemCount) by bin(timestamp, 5m)'
      timeAggregation: 'Total'
      metricMeasureColumn: 'errors'
      operator: 'GreaterThan'
      threshold: 50
    } ] }
    actions: { actionGroups: [ actionGroup.id ] }
  }
}

Workspace topology, identity and network

How you arrange workspaces and lock them down is an architecture decision that’s painful to change later.

One workspace or many?

The pull is between centralization (one workspace = cross-resource KQL joins, one place to query, simpler) and isolation (separate workspaces for RBAC boundaries, data residency, or per-team cost attribution). The pragmatic answer for most estates: a small number of workspaces — often one per environment (prod/non-prod) or per region — not one-per-app (which fragments your queries) and not literally one-for-everything (which muddies access and cost). For Application Insights specifically, workspace-based resources let many app components share a workspace while staying logically separate.

Topology	Pros	Cons	Fits
Single workspace	Easiest cross-query; one cost view	Coarse RBAC; residency limits; blast radius	Small/single-team estates
Per-environment (prod/non-prod)	Clean prod isolation; sane cost split	Two places to look	Most teams (recommended default)
Per-team / per-BU	Cost attribution; access boundaries	Cross-team queries need union/Lighthouse	Large multi-team orgs
Per-region	Data residency; latency	Global view needs cross-workspace	Regulated / global apps
Per-app (anti-pattern)	Tight isolation	Fragments queries; sprawl	Rarely justified

Access control and the secure ingestion path

Reading logs is RBAC: Log Analytics Reader to query, Log Analytics Contributor to manage, and table-level RBAC to scope sensitive tables. There’s also a workspace access-control mode governing whether resource-level permissions or only workspace permissions apply. On the network side, a private estate uses Private Link via Azure Monitor Private Link Scope (AMPLS) so telemetry and queries never traverse the public internet, with a DCE as the private ingress.

Control	Mechanism	Secures
Who can query	`Log Analytics Reader` role	Read access to logs
Who can manage	`Log Analytics Contributor`	Workspace/DCR management
Per-table access	Table-level RBAC	Sensitive tables (e.g. security)
Resource-vs-workspace scope	Access control mode	Whether resource perms grant log read
Private ingestion/query	AMPLS + Private Link + DCE	No public-internet telemetry path
Managed-identity ingestion	MI on agents/exporters	No keys in config
Customer-managed keys	CMK on the workspace	Encryption key ownership

Architecture at a glance

The diagram traces telemetry exactly as it flows in production, left to right, and marks the five places it most often goes wrong. Start at SOURCES: your application emits request, dependency and exception telemetry through the Application Insights SDK (configured by a connection string); Azure resources emit platform metrics and diagnostic logs via diagnostic settings; and VMs/AKS emit host telemetry through the Azure Monitor Agent, driven by DCRs. All three feed COLLECTION, where a Data Collection Rule can filter and transform rows before they cost anything, and sampling keeps a representative fraction of high-volume app telemetry. Everything then lands once in the STORE — a Log Analytics workspace for logs (30-day free, up to 730-day retention, queried with KQL) alongside the metrics store (≈93-day, 1-minute grain). From there, INSIGHT reads it back: Application Insights (Failures, Live Metrics, Transaction search) and Workbooks (KQL dashboards, also feeding Grafana). Finally ACTION: alert rules evaluate metrics and log queries, and a fired alert fans out through an action group to email, ITSM, a Function or a runbook.

Read the numbered badges as the failure map that overlays this path. (1) No telemetry at the source — the connection string is unset or egress to 443 is blocked, so the Failures and Live blades are simply empty. (2) Sampling silently drops the one record you’re searching for — the row exists but represents others via itemCount. (3) Ingestion and cost spike at the store — a noisy log blows the daily cap and data stops. (4) A KQL query times out or returns nothing because it scans the wrong range or hits a Basic-tier table. (5) An alert is silent (missed incident) or noisy (pager fatigue) because it watches the wrong signal or has no dynamic threshold. The whole method is in that overlay: localise the problem to a stage, run the named confirm, apply the fix.

Real-world scenario

Cartwheel Commerce runs a mid-size e-commerce platform on Azure: a customer-facing web app and nine internal services (catalog, cart, checkout, pricing, inventory, payments-gateway, notifications, search, recommendations) across App Service and AKS in Central India, fronted by Application Gateway. Traffic averages 600 requests/second with a 9pm spike to ~2,200 rps during sales. The SRE team is five engineers; before this project, “monitoring” was a Grafana board of CPU and memory per node, and a single Sev-everything email alias.

The triggering incident was a checkout slowdown during a Tuesday-night sale. Conversion dropped 18% in ninety minutes. The CPU/memory board was entirely green — every node comfortable. The on-call engineer did what the tooling allowed: restarted the checkout pods (no change), scaled the AKS node pool (no change), and escalated. Two hours in, with real revenue lost, someone manually grepped the payments-gateway logs on one pod and found 4-second waits calling the third-party processor. The root cause had been one un-instrumented hop away the entire time, invisible to a stack that only watched infrastructure.

The rebuild was deliberate and followed the pipeline in this article. Instrumentation: every service got the Application Insights SDK wired by connection string (not the legacy iKey), all sharing one workspace-based Application Insights backed by a single prod Log Analytics workspace, with W3C trace context propagated end to end so a checkout could be followed across all nine services. Collection & shaping: App Service and AKS diagnostic settings routed platform logs to the workspace; a DCR transform dropped debug-level syslog before ingestion, and verbose request logs were moved to a Basic table plan. Sampling was set to adaptive at 5 items/sec per instance — but with exceptions excluded from sampling, so no error could ever be sampled away. Insight: a Workbook became the live reliability board (p95 by operation, 5xx rate, top failing dependencies), and the Application Map drew the real topology with per-edge latency. Action: metric alerts on user-facing SLIs — checkout p95, overall 5xx rate, availability — with dynamic thresholds to handle the nightly traffic curve, routed through a tiered action group (Sev1 → on-call phone, Sev3 → a ticket).

The next sale told the story. Checkout latency crept up again at 9:05pm; this time a dynamic-threshold alert fired in 90 seconds, the responder opened Application Insights → Failures, and the Application Map lit the payments-gateway → external processor edge red with a 3.9 s p95. They flipped checkout to the backup processor in four minutes; conversion never moved. The KQL that confirmed it was one line — dependencies | where target contains "processor" | summarize percentile(duration,95) by bin(timestamp,1m). MTTR fell from ~2 hours to under 8 minutes. The cost surprised them in the right direction: deliberate sampling and the Basic-tier move halved ingestion versus their first “collect everything” draft, landing the whole observability bill near ₹22,000/month for the estate. The lesson on the wall: “Watch what the user feels, follow the trace to the hop, and never let sampling eat your exceptions.”

The incident, before and after, as a contrast table:

Aspect	Before (infra-only)	After (full-stack observability)
What was watched	CPU/memory per node	User-facing SLIs (checkout p95, 5xx, availability)
Time to detect	Customer/conversion drop (~30 min)	Dynamic-threshold alert (~90 s)
Time to root cause	~2 h (manual log grep)	~4 min (Failures + App Map)
Cross-service view	None	Distributed trace + Application Map
Error capture	Best-effort	Exceptions excluded from sampling (always kept)
Alerting	One Sev-everything email	Tiered action group, dynamic thresholds
Ingestion cost	~₹44k (collect-everything draft)	~₹22k (sampling + Basic tier)
MTTR	~2 h	< 8 min

Advantages and disadvantages

The unified Azure-native stack both removes the blind spots that hurt Cartwheel and introduces decisions you must make on purpose. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
One workspace holds platform, infra and app telemetry — cross-resource KQL joins in one place	Ingestion is billed per GB; “collect everything” quietly inflates the bill
Application Insights gives end-to-end distributed tracing and the exact failing operation in clicks	A single un-instrumented hop breaks the trace chain and hides the cause
KQL is expressive and fast at scale; one language across logs, traces and security data	KQL is a learning curve; a badly-scoped query is slow and can return nothing
Native, deeply integrated with every Azure resource (diag settings, DCRs, alerts)	Multi-cloud / third-party sources need extra plumbing (Event Hub, agents, OTel)
Built-in correlation (operation IDs, W3C trace context) — no custom stitching	You must propagate context everywhere or correlation silently fails
Sampling + table tiers let you tune fidelity vs cost precisely	Sampling can hide individual records if you don’t reason about `itemCount`
Alerts → action groups automate response (Functions, runbooks, ITSM)	Poorly-tuned alerts create pager fatigue that trains people to ignore pages
Defaults get you telemetry fast (auto-instrumentation, adaptive sampling)	Defaults are not free or complete — connection string, sampling and cost need tuning

The model is the right default for any Azure-centric estate where you want native integration and one query language across infrastructure, application and security telemetry. It’s less of a slam-dunk when you’re heavily multi-cloud (you’ll bridge other sources in, or run a vendor-neutral OTel pipeline), or when a specialised APM/SIEM is mandated. The disadvantages are all manageable — they’re decisions (what to collect, how to sample, what to alert on), not flaws — which is exactly why this article enumerates the knobs.

Hands-on lab

Stand up a real observability slice end to end (workspace, Application Insights, an instrumented App Service, a KQL query, and an alert) at minimal cost: B1 is a paid tier, but an hour of it plus the included telemetry allowance costs very little, and you delete everything at the end. Run in Cloud Shell (Bash).

Step 1 — Variables and resource group.

RG=rg-obs-lab
LOC=centralindia
LAW=law-obs-lab
AI=ai-obs-lab
PLAN=plan-obs-lab
APP=app-obs-$RANDOM
az group create -n $RG -l $LOC -o table

Step 2 — Create the Log Analytics workspace.

az monitor log-analytics workspace create -g $RG -n $LAW -l $LOC -o table
LAW_ID=$(az monitor log-analytics workspace show -g $RG -n $LAW --query id -o tsv)

Expected: a workspace row; note the id.

Step 3 — Create a workspace-based Application Insights resource.

az extension add -n application-insights 2>/dev/null
az monitor app-insights component create -g $RG -a $AI -l $LOC \
  --workspace "$LAW_ID" --application-type web -o table
AI_CONN=$(az monitor app-insights component show -g $RG -a $AI --query connectionString -o tsv)

Expected: a component row; AI_CONN is a full connection string (contains InstrumentationKey= and IngestionEndpoint=), not a bare GUID.
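A quick way to check the "full string, not a bare GUID" expectation is to split the value on `;` and look for both keys. The Python sketch below does that; the all-zero key and the `example.invalid` endpoint are placeholders, not real values, and the check is only a shape test, not proof that the endpoint is reachable.

```python
def parse_connection_string(conn: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict, ignoring empty segments."""
    parts = {}
    for segment in conn.split(";"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key.strip()] = value.strip()
    return parts

def looks_complete(conn: str) -> bool:
    """A full string carries both the key and the ingestion endpoint."""
    parts = parse_connection_string(conn)
    return bool(parts.get("InstrumentationKey")) and bool(parts.get("IngestionEndpoint"))

# Placeholder values for illustration only.
full = ("InstrumentationKey=00000000-0000-0000-0000-000000000000;"
        "IngestionEndpoint=https://example.invalid/")
bare = "00000000-0000-0000-0000-000000000000"
print(looks_complete(full), looks_complete(bare))
```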

Step 4 — Create a B1 Linux App Service and wire the connection string + codeless agent.

az appservice plan create -n $PLAN -g $RG --is-linux --sku B1 -o table
az webapp create -n $APP -g $RG -p $PLAN --runtime "DOTNETCORE:8.0" -o table
az webapp config appsettings set -n $APP -g $RG --settings \
  APPLICATIONINSIGHTS_CONNECTION_STRING="$AI_CONN" \
  ApplicationInsightsAgent_EXTENSION_VERSION="~3"

Step 5 — Generate traffic, then confirm telemetry arrived. Hit the site a few times so there’s something to see:

for i in $(seq 1 20); do curl -s -o /dev/null "https://$APP.azurewebsites.net/"; done

Wait 2–3 minutes (ingestion latency), then query the workspace via App Insights:

az monitor app-insights query -g $RG -a $AI \
  --analytics-query "requests | summarize count() by resultCode | order by count_ desc"

Expected: at least one row (e.g. 200). If empty after 5 minutes, re-check the connection string is the full string and the app restarted.
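If you want to script that check instead of reading JSON by eye, the result can be flattened into records. The sketch below assumes the command's JSON output has the shape `tables[0].columns[].name` plus `tables[0].rows[][]`; the sample payload is hand-written to match that assumption, not captured from a real run.

```python
import json

def rows_as_dicts(query_output: str):
    """Turn the first result table into a list of {column: value} dicts."""
    table = json.loads(query_output)["tables"][0]
    names = [col["name"] for col in table["columns"]]
    return [dict(zip(names, row)) for row in table["rows"]]

# Hand-written sample in the assumed output shape.
sample_output = json.dumps({
    "tables": [{
        "name": "PrimaryResult",
        "columns": [{"name": "resultCode", "type": "string"},
                    {"name": "count_", "type": "long"}],
        "rows": [["200", 18], ["404", 2]],
    }]
})
records = rows_as_dicts(sample_output)
total = sum(r["count_"] for r in records)
print(records, "total requests:", total)
```

An empty `rows` list is the scripted equivalent of "no telemetry yet", which is the cue to re-check the connection string.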

Step 6 — Create an action group and a metric alert on HTTP 5xx.

az monitor action-group create -g $RG -n ag-lab --short-name lab \
  --action email me you@example.com   # replace with your own address
az monitor metrics alert create -g $RG -n alert-5xx-lab \
  --scopes $(az webapp show -n $APP -g $RG --query id -o tsv) \
  --condition "total Http5xx > 0" \
  --window-size 5m --evaluation-frequency 1m \
  --severity 2 --action ag-lab \
  --description "Any 5xx on the lab app"

Expected: an alert-rule row; the action group is referenced by id.

Validation checklist. You created a workspace, a workspace-based Application Insights resource, and an instrumented app sending real request telemetry; confirmed it with KQL; and wired a metric alert through an action group. That is the entire pipeline in miniature. Each step maps to what it proves:

Step	What you did	What it proves
2–3	Workspace + workspace-based App Insights	Telemetry lands once, in a shared store
4	Connection string + codeless agent	The one setting that makes telemetry flow
5	Traffic → KQL query returns rows	The data path works end to end
6	Action group + metric alert	Signal can become a page

Cleanup (avoid lingering charges).

az group delete -n $RG --yes --no-wait

Cost note. A B1 plan is a few rupees per hour and the lab’s telemetry is well within the included allowance; an hour of this lab is under ₹50, and deleting the resource group stops everything.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark for when the observability itself misbehaves. First as a scannable table, then the entries that bite hardest expanded with the exact confirm step.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	Failures/Live Metrics blades empty; no `requests` rows	Connection string unset/wrong (or legacy iKey)	`az webapp config appsettings list --query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING']"`	Set the full connection string; restart; verify outbound 443
2	A known request is missing from search	Adaptive sampling dropped that transaction	`requests | summarize rows=count(), represented=sum(itemCount)` (represented ≫ rows)	Exclude critical types from sampling; raise sampling %
3	Aggregate counts are lower than the load balancer’s	Counting rows, not `itemCount`, under sampling	Compare `count()` vs `sum(itemCount)`	Always `sum(itemCount)` for true volumes
4	Ingestion bill spikes; data suddenly stops mid-day	Noisy log + daily cap reached	`Usage | summarize sum(Quantity)/1000 by DataType`; “daily cap reached” banner	Tier the noisy table to Basic; DCR-drop debug; raise/right-size cap
5	KQL query times out or returns nothing	Scans too wide a range, or table is Basic tier	Check time filter; the table’s plan (Analytics or Basic); error text	Put `where timestamp > ago(...)` first; summarize early; use Analytics tier
6	Distributed trace breaks at one service	That hop isn’t instrumented / doesn’t propagate context	Application Map shows a gap; missing `operation_Id` link	Instrument the hop; propagate W3C `traceparent`
7	Alert never fired during a real incident	Watching infra metric, not user SLI; or wrong threshold/window	Alert rule History = no fire; scope/condition review	Alert on 5xx/latency/availability; dynamic threshold
8	Pager storm / alert fatigue	Too many low-value alerts; no dedupe/suppression	Alerts list volume; alert processing rule config	Consolidate to SLIs; add alert processing rules (dedupe, maintenance suppression)
9	Action group never notified anyone	Untested/misconfigured receiver	Action group → Test; check receiver	Fix/verify receiver; test before relying on it
10	Logs missing for a VM/AKS node	No DCR association or AMA not installed	`Heartbeat | where Computer == "<name>"` empty; DCR associations	Install AMA; associate the DCR
11	A resource’s platform logs aren’t in the workspace	No diagnostic setting on that resource	`az monitor diagnostic-settings list --resource <id>` empty	Create a diagnostic setting → workspace
12	Two App Insights resources, split data	App reporting to the wrong/duplicate component	Compare connection strings across slots/services	Consolidate to one connection string per component
13	Browser/real-user data absent	JS (browser) SDK not added	No `pageViews` rows	Add the JavaScript SDK snippet to the front-end
14	Query works in App Insights, not in Log Analytics (or vice-versa)	Querying the wrong scope/table name	Confirm workspace-based; table availability	Query the workspace (workspace-based AI surfaces both)

The expanded form for the entries that cost the most time:

1. Failures and Live Metrics are empty; no telemetry at all. Root cause: The connection string is unset, wrong, or you’re still using a bare legacy instrumentation key; or outbound 443 to the ingestion endpoint is blocked (firewall/NSG/no Private Link path). Confirm: az webapp config appsettings list -n <app> -g <rg> --query "[?name=='APPLICATIONINSIGHTS_CONNECTION_STRING']" — is it present and a full string (with IngestionEndpoint=)? Then confirm egress to 443. Fix: Set the full connection string (from az monitor app-insights component show --query connectionString), restart the app, and ensure outbound 443 (or an AMPLS path) is open. This is the number-one “no data” cause.

2. A specific request you know happened isn’t in search. Root cause: Adaptive sampling consistently dropped that whole transaction — it’s working as designed, you just didn’t account for it. Confirm: requests | where timestamp > ago(1h) | summarize rows = count(), represented = sum(itemCount) — if represented is much larger than rows, sampling is active. Fix: Exclude critical telemetry types from sampling (always keep Exception; consider keeping a high-value operation), or raise the sampling rate / disable it for that app. Never search for a single record without remembering sampling exists.

4. Ingestion cost spikes and data stops mid-day. Root cause: A newly-noisy log source flooded the workspace and hit the daily cap, which then stopped ingestion — so you lose data during the very window you care about. Confirm: Usage | where TimeGenerated > ago(1d) and IsBillable | summarize sum(Quantity)/1000 by DataType shows the culprit; the workspace shows a “daily cap reached” banner. Fix: Move the noisy high-volume table to the Basic plan, add a DCR transform to drop low-value rows pre-ingestion, and right-size (don’t just remove) the cap with an alert before the cap, not at it.

5. A KQL query times out or returns nothing. Root cause: The query scans too much (no/late time filter), or it targets a Basic-tier table whose querying is limited (no cross-table joins, restricted operators). Confirm: Is where timestamp > ago(...) the first operation? Is the table on the Analytics or Basic plan? Read the error text — it usually names the limit. Fix: Filter by time first to bound the scan, summarize as early as possible, and keep tables you query interactively on the Analytics plan.

6. The distributed trace breaks at one service. Root cause: One hop in the chain isn’t instrumented, or doesn’t propagate the W3C traceparent header, so its spans don’t share the operation_Id. Confirm: The Application Map shows a gap/dead-end at that service; a transaction’s spans stop before that hop. Fix: Instrument that service and ensure context propagation (modern SDKs do W3C by default; custom HTTP clients may need it added). One blind hop hides everything beyond it.

7. The alert that should have caught the incident never fired. Root cause: It watched an infrastructure metric (CPU) that was fine while users suffered, or the threshold/window was wrong (too high, too long). Confirm: The alert rule’s History shows no fire during the incident window; review its scope and condition. Fix: Alert on user-facing SLIs (5xx rate, p95 latency, availability), use dynamic thresholds for variable metrics, and tune the aggregation window so a real breach actually trips it.
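Entry 6 hinges on one header, so it helps to see what "propagating W3C traceparent" means for a custom HTTP client. The sketch below parses an incoming `traceparent` value and builds the one to send downstream: the trace-id is carried over unchanged and only the span id is new. It is a simplified reading of the W3C Trace Context format (version 00 only, no `tracestate`), and the function names are hypothetical helpers, not part of any SDK.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header):
    """Return the header's fields, or None if it is missing or malformed."""
    match = TRACEPARENT.match((header or "").strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero ids as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

def outgoing_traceparent(incoming):
    """Header for the next hop: same trace-id, fresh span id as the new parent."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        # Nothing usable arrived, so this hop starts a new trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = outgoing_traceparent(incoming)
print(outgoing)
```

A hop that drops the header takes the "start a new trace" branch at the next service, which is the gap the Application Map shows in entry 6.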

Best practices

Collect all three signals, deliberately. Metrics for fast alerts, logs for forensics, distributed traces for cross-service root cause — and decide what you collect on purpose, not “everything just in case.”
Configure Application Insights with the connection string, never the legacy iKey. Verify it’s set and is the full string after every deploy; it’s the single point of “no telemetry.”
Centralize into a small number of workspaces. Per-environment (prod/non-prod) is the sane default — not one-per-app (fragments queries) and not literally-one (muddies RBAC and cost).
Propagate trace context everywhere. One un-instrumented hop breaks correlation and blinds you past it; ensure W3C traceparent flows through every service and custom client.
Sample on purpose, and never sample away your exceptions. Use adaptive sampling for cost, but exclude Exception (and any must-keep operation) from sampling, and always sum(itemCount) for true volumes.
Tier and cap your ingestion. Put high-volume/low-query logs on the Basic plan, drop debug noise in a DCR transform before it’s billed, and set a daily cap with an alert before the cap.
Alert on user-facing SLIs, not infrastructure. 5xx rate, p95/p99 latency, availability — the things a user notices. CPU/memory are supporting evidence, not the page.
Use dynamic thresholds for variable metrics. Traffic and latency have daily/weekly seasonality; a static line either flaps or misses. Reserve static thresholds for hard SLO lines.
Build and test action groups. Tier them by severity (Sev1 → phone, Sev3 → ticket), reuse one across many alerts, and test before you rely on them.
Right-size retention per table. Keep audit/security logs long, app telemetry short; 30 days is free, 730 is the ceiling, archive is for cold compliance data.
Treat dashboards as decision tools. A Workbook of SLIs and top failing dependencies beats fifty CPU charts. Observability is faster, better decisions — not more graphs.
Wire it before you need it. The whole point is to ask questions after the fact without shipping code; instrument on day one, not during the first incident.

The leading-indicator alerts worth wiring before the next incident — symptoms a user feels, not lagging “it’s down”:

Alert on	Signal	Threshold (starting point)	Why it’s the right one
Server errors	HTTP 5xx rate	> 1% of requests, 5 min	Direct user-facing failure
Latency tail	request `duration` p95	> your SLO (e.g. 800 ms), 5 min	Slowness users actually feel
Availability	availability test success	failures from 2+ test locations	Outside-in “is it up”
Dependency failures	failed `dependencies`	spike vs dynamic baseline	The downstream that’s breaking you
Exception surge	`exceptions` rate	dynamic threshold	New error class / regression
Ingestion runaway	`Usage` GB/day	> 80% of daily cap	Catch a cost blowout before the cap drops data
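The first two rows of that table reduce to a few lines of arithmetic, sketched below in Python: a 5xx rate weighted by `itemCount` (so sampling does not distort it) and a nearest-rank p95 of request duration. The request rows and the 1% / 800 ms lines are illustrative assumptions taken from the starting points above; the percentile here ignores sampling weights for brevity.

```python
def error_rate(requests):
    """5xx share of requests, weighting each row by itemCount for sampling."""
    total = sum(r["itemCount"] for r in requests)
    failed = sum(r["itemCount"] for r in requests if r["resultCode"] >= 500)
    return failed / total if total else 0.0

def p95(durations_ms):
    """Nearest-rank 95th percentile; returns None for an empty window."""
    ordered = sorted(durations_ms)
    if not ordered:
        return None
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

# Hypothetical 5-minute window of sampled request rows.
requests = (
    [{"resultCode": 200, "duration": 120, "itemCount": 5}] * 90
    + [{"resultCode": 200, "duration": 900, "itemCount": 5}] * 8
    + [{"resultCode": 503, "duration": 40, "itemCount": 5}] * 2
)
rate = error_rate(requests)
tail = p95([r["duration"] for r in requests])
print(f"5xx rate={rate:.1%} breach={rate > 0.01}; p95={tail} ms breach={tail > 800}")
```

Note that the fast-failing 503s pull the average duration down while the tail still breaches, which is why the table alerts on p95 and on error rate separately.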

Security notes

Least-privilege on log access. Reading logs is RBAC — grant Log Analytics Reader to query and Contributor only to those who manage the workspace/DCRs. Use table-level RBAC to fence off sensitive tables (e.g. security/audit) from general viewers.
Don’t leak secrets or PII into telemetry. App logs and custom events can accidentally capture tokens, connection strings or personal data; scrub at the source (telemetry processors) or strip with a DCR transform before ingestion. Logs are queryable by everyone with reader access.
Private ingestion and query paths. For a locked-down estate, use Azure Monitor Private Link Scope (AMPLS) with a Data Collection Endpoint so telemetry and KQL never traverse the public internet — and so the connection string’s ingestion endpoint resolves privately.
Managed identity over keys. Where agents/exporters support it, authenticate ingestion with managed identity rather than instrumentation keys or shared keys; rotate and avoid embedding secrets in app settings.
Protect the alerting path. Action groups can trigger Functions and runbooks that take real action (restart, scale, heal) — secure those endpoints (secure webhooks, scoped identities) so an alert can’t be spoofed into running automation.
Encryption and residency. Workspace data is encrypted at rest; use customer-managed keys where key ownership is mandated, and choose workspace region to satisfy data-residency requirements (a driver for a per-region topology).
Audit the observability plane itself. The control-plane operations on workspaces, DCRs and alerts are in AzureActivity — monitor changes to your monitoring (someone disabling a critical alert or a diagnostic setting is itself an event worth alerting on).

The security knobs that also improve observability (secure and useful pull the same way here):

Control	Mechanism	Secures against	Also improves
Table-level RBAC	Per-table role assignment	Over-broad log access	Cleaner, scoped queries per team
DCR transform (scrub/drop)	`transformKql` pre-ingestion	PII/secret leakage into logs	Lower ingestion cost
AMPLS + Private Link + DCE	Private telemetry path	Public-internet exposure	Reliable regional ingestion
Managed-identity ingestion	MI on agents/exporters	Leaked keys	Fewer secrets to rotate/break
Customer-managed keys	CMK on workspace	Key-ownership gaps	Compliance posture
Activity-log alerting	Alert on monitoring changes	Silent disabling of alerts/diag	Catches drift in coverage
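
Scrubbing at the source can be pictured with a short Python sketch. This is a generic redaction function applied to a log message before it is handed to any logger or exporter; it is not the Application Insights telemetry-processor API, and the patterns and the `scrub` name are assumptions for illustration that a real estate would tune to its own data.

```python
import re

# Illustrative patterns only; they are deliberately simple and will not catch every secret.
_PATTERNS = [
    # key=value secrets inside connection strings, e.g. "AccountKey=...;"
    (re.compile(r"(?i)\b(AccountKey|SharedAccessKey|Password|Pwd)=[^;\s]+"), r"\1=<redacted>"),
    # bearer tokens from logged Authorization headers
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9\-._~+/]+=*"), "Bearer <redacted>"),
    # email addresses, as a simple stand-in for personal data
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
]


def scrub(message: str) -> str:
    """Apply every redaction pattern in order and return the cleaned message."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    return message


print(scrub("connect failed: Server=db01;Password=hunter2;Timeout=30"))
print(scrub("Authorization: Bearer abc.def-123"))
print(scrub("payment declined for alice@example.com"))
```

Redacting before the message leaves the process means the sensitive value is never billed, stored or queryable; a DCR transform is the second line of defence for sources you cannot change.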

Cost & sizing

The bill drivers and how they interact with the design choices:

Log Analytics ingestion (per GB) dominates. You pay primarily for data ingested, then for retention beyond the free 30 days. The biggest lever is what you collect and at which table plan — moving a verbose, rarely-queried table from Analytics to Basic can cut its ingestion cost substantially, and a DCR transform that drops debug rows pre-ingestion saves the full per-GB price on what it removes.
Application Insights telemetry is billed through its workspace on the same per-GB basis, which is why sampling is a cost decision as much as a fidelity one — adaptive sampling on a high-traffic app can cut telemetry volume dramatically while keeping metrics correct (via itemCount) and traces intact.
Metrics are largely free. Platform metrics and their alerting cost little; this is why metric alerts are the cheap first line of defence and you should push as much alerting as possible onto them.
Retention is a multiplier. 30 days is included; every extra day of interactive retention multiplies stored-data cost, so set per-table retention (audit long, telemetry short) and use cheap archive for cold compliance data.
The daily cap is a seatbelt, not a budget. It prevents a runaway invoice but, set too low, drops data mid-incident — pair it with an alert before the cap. A commitment tier (reserved GB/day) discounts predictable high volume.
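
How the table plan and a pre-ingestion drop combine is easiest to see as arithmetic. The Python sketch below uses HYPOTHETICAL unit prices (they are not Azure list prices) and made-up volumes for two illustrative tables, purely to show the shape of the calculation.

```python
from dataclasses import dataclass

# HYPOTHETICAL prices in INR per GB ingested, chosen only to make the arithmetic visible.
PRICE_PER_GB = {"analytics": 200.0, "basic": 50.0}


@dataclass
class Table:
    name: str
    gb_per_day: float
    plan: str                      # "analytics" or "basic"
    dropped_fraction: float = 0.0  # share of volume a DCR transform removes before billing


def monthly_ingestion_cost(tables, days=30):
    """Sum billed GB x unit price; rows dropped pre-ingestion are never billed."""
    total = 0.0
    for t in tables:
        billed_gb = t.gb_per_day * (1.0 - t.dropped_fraction) * days
        total += billed_gb * PRICE_PER_GB[t.plan]
    return total


# "Collect everything" first draft: both tables on the Analytics plan, nothing dropped.
naive = [Table("AppTraces", 2.0, "analytics"), Table("ContainerLogV2", 4.0, "analytics")]

# Tuned: half of the trace volume dropped by a transform, container logs moved to Basic.
tuned = [Table("AppTraces", 2.0, "analytics", dropped_fraction=0.5),
         Table("ContainerLogV2", 4.0, "basic")]

print(monthly_ingestion_cost(naive), monthly_ingestion_cost(tuned))
```

With these assumed numbers the two levers together cut the ingestion line to a third; the real ratio depends entirely on your volumes and current pricing.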

A rough monthly picture for a small-to-mid production estate (a dozen services, moderate traffic): Log Analytics ingestion in the ₹12,000–30,000 range depending on collection discipline, App Insights telemetry folded into that via the workspace, plus negligible metric-alert cost. Cartwheel landed near ₹22,000 after applying sampling and the Basic-tier move — roughly half their naive “collect everything” first draft — proving the bill is a design outcome, not a fixed cost. The drivers and what each buys you:

Cost driver	What you pay for	Rough INR / month	What it buys	Watch-out
Log Analytics ingestion (Analytics)	Per-GB of full-tier logs	bulk of the bill	Full KQL + alerting	Noisy categories inflate it fast
Basic-tier tables	Per-GB, cheaper	fraction of Analytics	Cheap high-volume logs	No alerting; limited query
Retention beyond 30 days	Stored-GB-days	scales with days × GB	Longer forensics/compliance	730-day on everything is wasteful
Application Insights telemetry	Per-GB via workspace	folded into ingestion	App traces/Failures/Live	Tune with sampling
Metric alerts	Per rule (cheap)	~₹0–small	Near-real-time alerting	Effectively free — use them
Log (query) alerts	Per evaluation	small	Log-expressible conditions	Frequent eval × many rules adds up
Commitment tier	Reserved GB/day, discounted	depends on volume	Lower effective per-GB	Under/over-commit both waste

Interview & exam questions

1. What is the difference between metrics, logs and traces, and which answers which question? Metrics are pre-aggregated numbers over time and answer “is something wrong, and when?” (cheap, fast alerts). Logs are timestamped structured events and answer “why did it happen?” (forensics, billed per GB). Distributed traces are trees of spans for one operation and answer “where in a chain of services is the problem?” (cross-service root cause). You need all three, and the classic failure is alerting on infra metrics while the cause lives in a downstream dependency only a trace would reveal.

2. How are Azure Monitor, Log Analytics and Application Insights related? Azure Monitor is the umbrella product family (metrics store, alerting engine, collection pipeline). Log Analytics is the underlying log database you query with KQL. Application Insights is an APM lens that, in its modern workspace-based form, writes into a Log Analytics workspace and adds application-shaped tables (requests, dependencies, exceptions) plus Failures/Performance/Live Metrics. They’re layers, not competitors — one workspace can hold platform, infra and app telemetry together.

3. What does the Application Insights connection string contain, and why is the bare instrumentation key deprecated? The connection string carries the instrumentation key and the regional ingestion (and Live Metrics) endpoints. The legacy bare iKey assumed the global public endpoint, so it breaks in sovereign/regional clouds and Private Link setups. Always configure the connection string; an unset or wrong one is the number-one “no telemetry arriving” cause.

4. What is adaptive sampling and what is the trap when reading sampled data? Adaptive sampling keeps a representative fraction of telemetry (target items/sec per instance), dropping the rest consistently (a whole transaction together, so traces stay intact) and recording an itemCount multiplier so aggregate metrics remain correct. The trap: searching for one specific record, not finding it, and concluding it never happened — it was sampled out. Also, you must sum(itemCount), not count(), to get true volumes, and you should exclude critical types (exceptions) from sampling.
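
The itemCount trap is plain arithmetic, shown here as a small Python sketch. The rows are invented sample data shaped loosely like sampled request records; `true_volume` is a hypothetical helper that mirrors what `sum(itemCount)` does in KQL.

```python
def true_volume(rows):
    """Each stored row stands for itemCount original items; unsampled rows count as 1."""
    return sum(r.get("itemCount", 1) for r in rows)


# Invented sample: three stored rows after sampling.
rows = [
    {"name": "POST /checkout", "success": True, "itemCount": 10},
    {"name": "POST /checkout", "success": False, "itemCount": 10},
    {"name": "GET /health", "success": True, "itemCount": 1},
]

stored = len(rows)               # what count() would report
represented = true_volume(rows)  # what sum(itemCount) would report
failed = true_volume([r for r in rows if not r["success"]])

print(stored, represented, failed / represented)
```

Counting rows here would report 3 requests and a 33% failure rate; weighting by itemCount gives 21 requests and roughly 48%, which is the figure that reflects real traffic.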

5. You need to follow a single user’s checkout across nine services. What makes that possible, and what breaks it? A distributed trace correlated by a shared operation_Id propagated via the W3C trace context (traceparent) header — Application Insights stores each incoming request and outbound dependency with that ID, so Transaction search rebuilds the waterfall and the Application Map shows per-edge health. It breaks if any hop isn’t instrumented or doesn’t propagate the header, which hides everything beyond that hop.
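
What “propagating the header” means at each hop can be sketched in a few lines of Python. This is a simplified illustration of the W3C `traceparent` format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags), not the Application Insights or OpenTelemetry SDK; the function names are assumptions, the parser only accepts the exact version-00 shape, and starting a new trace as sampled is a choice made for this example.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Build the header for an outbound call: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No usable context arrived, so this hop starts a new trace (assumed sampled).
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, flags = ctx["trace_id"], ctx["flags"]
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(incoming))  # trace ID preserved, span ID new
print(child_traceparent(None))      # a hop that lost the header starts a different trace
```

The second call is the failure mode described above: a hop that drops the header mints a new trace ID, so everything downstream of it falls out of the original transaction's waterfall.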

6. What is a Data Collection Rule and why is its transform important? A DCR declaratively defines what telemetry to collect (counters, syslog, events, custom logs), where to send it, and an optional KQL transformation applied before ingestion. The transform is both a cost lever (drop debug-level/low-value rows so you don’t pay to store them) and a privacy lever (strip a PII/secret column before it lands). DCRs also drive the Azure Monitor Agent (AMA), which replaced the retired Log Analytics agent.

7. Compare the Analytics, Basic and Auxiliary table plans. Analytics is full-price, full-KQL, alertable, up to 730-day retention — for logs you query and alert on. Basic is cheaper ingestion with query-only, limited operators and short interactive retention — for high-volume, occasionally-queried logs (no alerting). Auxiliary is cheapest, for rarely-queried archival/audit data. Choosing the right plan per table is a primary cost lever.

8. How do you control Log Analytics cost without going blind? Collect deliberately (only valuable categories), shape with a DCR transform to drop noise pre-ingestion, move high-volume/low-query tables to the Basic plan, set per-table retention (audit long, telemetry short), enable adaptive sampling for app telemetry, and set a daily cap with an alert before the cap (since the cap itself stops ingestion). A commitment tier discounts predictable volume.

9. When do you use a metric alert versus a log (scheduled query) alert? Use a metric alert for anything expressible as a metric vs threshold — it’s near-real-time (≈1 min), cheap, and ideal for 5xx rate, latency, CPU, queue depth. Use a log alert when only a KQL query can express the condition (correlated/derived conditions over event detail); it runs on an interval (minutes) and costs per evaluation. Push as much as possible onto metric alerts for speed and cost.

10. What is an action group and why test it? An action group is the reusable fan-out of what happens when an alert fires — email/SMS/push/voice, webhook, ITSM/ServiceNow, Logic App, Azure Function, or Automation runbook — referenced by many alerts. You test it because an untested/misconfigured receiver is a leading reason a real alert pages nobody; the alert “fired” but nothing reached a human or the automation never ran.

11. Static vs dynamic alert thresholds: when each? A static threshold is a fixed number, right when you have a hard SLO line to defend (alert when p95 exceeds 800 ms or the 5xx rate exceeds 1%). A dynamic threshold learns the metric’s normal pattern, including daily/weekly seasonality, and alerts on deviation; it is right for variable metrics like traffic or time-of-day-dependent latency, where a fixed line either flaps or misses the real anomaly.
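
The difference can be illustrated with a deliberately naive Python sketch: a per-hour-of-day baseline (mean plus or minus k standard deviations) standing in for a learned seasonal band. This is NOT the algorithm Azure Monitor uses for dynamic thresholds; the function names, the value of k and the sample history are assumptions chosen to show why a fixed line struggles with seasonal data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def dynamic_breach(baseline, hour, value, k=3.0):
    """True when value sits more than k standard deviations from that hour's norm."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * sigma


def static_breach(value, limit):
    """True when value is above one fixed line, regardless of time of day."""
    return value > limit


# Invented requests-per-minute history: quiet at 03:00, busy at 14:00.
history = [(3, 100), (3, 110), (3, 90), (14, 1000), (14, 1100), (14, 900)]
baseline = learn_baseline(history)

# 400 req/min at 03:00 is four times normal: the band catches it, a 1500 static line misses it.
print(dynamic_breach(baseline, 3, 400), static_breach(400, 1500))
# 1050 req/min at 14:00 is ordinary: the band stays quiet, a 500 static line would flap daily.
print(dynamic_breach(baseline, 14, 1050), static_breach(1050, 500))
```

Any single static limit for this series is either above the night-time anomaly or below the normal afternoon peak, which is the “flaps or misses” trade-off in miniature.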

12. How should you design alerts to avoid pager fatigue? Alert on user-facing SLIs (5xx, latency, availability), not every infra metric; tier severity to response expectation (Sev0 wakes someone, Sev3/4 file a ticket); use dynamic thresholds and sane aggregation windows to avoid flapping; and apply alert processing rules (formerly called action rules) to suppress notifications for known events and maintenance windows so they don’t storm the pager. Fewer, higher-signal alerts beat many noisy ones.

These map primarily to AZ-204 (Developer Associate) — instrument an app with Application Insights, monitor and troubleshoot — and AZ-104 (Administrator) — monitor resources with Azure Monitor, configure Log Analytics, alerts and action groups. The design-level topology, cost and security choices touch AZ-305 (Solutions Architect), and the security-logging angle (table RBAC, scrubbing, AMPLS) touches AZ-500. A compact cert mapping for revision:

Question theme	Primary cert	Objective area
App Insights instrumentation, connection string, traces	AZ-204	Instrument, monitor & troubleshoot solutions
KQL, Log Analytics, Failures/Live Metrics	AZ-204	Troubleshoot solutions
Workspaces, alerts, action groups, DCRs	AZ-104	Monitor & maintain Azure resources
Sampling, table tiers, retention, cost	AZ-104 / AZ-305	Cost & monitoring design
Workspace topology, residency, HA	AZ-305	Design monitoring & governance
Table RBAC, scrubbing, AMPLS, CMK	AZ-500	Secure logging & data

Quick check

1. You’re staring at a green CPU dashboard while users report timeouts at checkout. Which signal are you missing, and what’s the first place to look?
2. You search Application Insights for a specific failed request you know occurred and it isn’t there. What is the most likely reason, and what should you sum() to get true volumes?
3. Telemetry stopped arriving in Application Insights entirely after a redeploy. Name the single setting to check first.
4. You want to cut Log Analytics ingestion cost on a verbose, rarely-queried log without losing it. Name two levers.
5. An alert on CPU never fired during a real user-facing outage. What should the alert have watched instead, and what threshold style suits a metric that varies by time of day?

Answers

1. You’re missing distributed traces (and request/dependency telemetry); the CPU metric can’t see a slow downstream dependency. First place to look: Application Insights → Failures / Performance, then the Application Map to find the unhealthy hop (the payment/dependency edge).
2. Adaptive sampling consistently dropped that whole transaction: it happened, it just wasn’t retained. Confirm with requests | summarize rows=count(), represented=sum(itemCount) (represented ≫ rows means sampling is active). Use sum(itemCount), not count(), for true volumes, and exclude exceptions/critical types from sampling.
3. The Application Insights connection string (APPLICATIONINSIGHTS_CONNECTION_STRING): verify it’s set, is the full string (not a bare legacy iKey/GUID), and that the app restarted; then confirm outbound 443 to the ingestion endpoint isn’t blocked.
4. (a) Move the table to the Basic table plan (cheaper ingestion); (b) add a DCR transform (transformKql) that drops the low-value rows before ingestion so you don’t pay the per-GB price on them. (Also: shorter per-table retention.)
5. It should have watched a user-facing SLI (HTTP 5xx rate, request p95 latency, or availability), not infrastructure CPU. For a metric that varies by time of day (traffic, latency), use a dynamic threshold that learns the seasonal baseline rather than a fixed static line.

Glossary

Observability — the ability to ask any question about a running system after the fact, without shipping new code to answer it; achieved by collecting metrics, logs and traces.
Azure Monitor — the platform umbrella for metrics, logs, alerting and the collection pipeline across all Azure resources.
Log Analytics workspace — the Kusto-based log store where (almost) all logs land; queried with KQL; the unit of cost and access control.
Application Insights — the application-performance lens over a workspace, adding requests/dependencies/exceptions telemetry, distributed tracing and the Failures/Performance/Live Metrics experiences.
Metric — a pre-aggregated numeric time-series (CPU, request rate, latency); cheap, fast to alert on.
Log — a timestamped structured event (exception, audit record, custom event); rich detail, billed per GB.
Distributed trace — a tree of spans describing one logical operation across services, correlated by a shared operation ID.
KQL (Kusto Query Language) — the query language for Log Analytics and Application Insights.
Connection string — the modern Application Insights config carrying the instrumentation key plus regional ingestion/Live Metrics endpoints; replaces the deprecated bare instrumentation key.
Instrumentation key (iKey) — the legacy GUID-only identifier for an App Insights resource; deprecated in favour of the connection string.
Operation ID / W3C trace context — the correlation identifier (propagated via the traceparent header) that stitches a transaction’s spans together.
itemCount — the multiplier on a sampled telemetry row indicating how many original items it represents; sum(itemCount) gives true volumes.
Adaptive sampling — the SDK default that keeps a representative, consistent fraction of telemetry (target items/sec) and records itemCount.
Diagnostic setting — per-resource config that sends platform logs/metrics to a workspace, storage or Event Hub.
Data Collection Rule (DCR) — declarative object defining what telemetry to collect, where to send it, and an optional pre-ingestion KQL transform.
Data Collection Endpoint (DCE) — the network ingress a DCR uses; the anchor for Private Link ingestion.
Azure Monitor Agent (AMA) — the current in-host telemetry collector (configured by DCRs); replaced the retired Log Analytics agent (MMA/OMS).
Table plan (tier) — Analytics (full KQL/alerting), Basic (cheap, query-limited), or Auxiliary (cheapest, archival) per Log Analytics table.
Daily cap — a GB/day ceiling that stops ingestion once it is reached, preventing a runaway bill (but dropping data if set too low); pair it with an alert that warns before the cap.
Alert rule — a condition over metrics or a KQL query that fires when breached (metric, log, activity-log, resource-health).
Action group — the reusable fan-out of notifications and automation (email/SMS/webhook/ITSM/Function/runbook) an alert triggers.
Dynamic threshold — an ML-learned normal band for a metric (handles seasonality), versus a fixed static threshold.
AMPLS (Azure Monitor Private Link Scope) — the construct that keeps telemetry ingestion and queries on a private network path.
Workbook — an interactive KQL-backed report/dashboard in Azure Monitor; the same workspace data can also be visualized in Grafana.

Next steps

You can now wire the full pipeline and ask any question of your running system. Build outward:

Next: Azure Monitor Data Collection Rules, Workbooks, Alerting & Action Groups — go deep on the collection plumbing, transforms and dashboards behind this article.
Related: Troubleshooting Azure App Service: 502/503, Cold Starts & Restart Loops — half this playbook is KQL against the telemetry you just set up.
Related: Troubleshooting Azure SQL: Connectivity, Timeouts, Throttling & Blocking — apply the same metrics/logs/traces method to the data tier.
Related: Azure FinOps: Cost Management at Scale — keep ingestion (one of the sneakier Azure line items) under control.
Related: Azure Functions: Serverless Patterns — instrument event-driven workloads where cold starts and dependencies need the same tracing.