Most “security reviews” I sit in on are a checklist copied from a vendor PDF, run once, never reopened. The Well-Architected security pillar is only worth the effort when you reverse that: start from how an attacker actually moves through the workload, then map each design principle down to a control you can write in code and a signal you can alert on. This is the end-to-end process I use to take a workload from a STRIDE threat model to layered defense in depth, with the Azure config I actually ship.
1 - Design principles and the shared responsibility model
The security pillar reduces to a handful of principles, and every control below traces back to one of them:
- Strong identity foundation - authenticate and authorize on identity, not network location.
- Apply security at all layers - defense in depth, so no single failure is fatal.
- Least privilege - grant the minimum, scope it tight, expire it.
- Protect data in transit and at rest - classify first, then encrypt accordingly.
- Automate - controls as code, remediation as code, evidence as code.
- Prepare for security events - assume breach, instrument for detection, rehearse response.
Before you design anything, draw the responsibility line. The provider secures the substrate (physical hosts, hypervisor, managed control planes); you own identity, data, network config, and workload code. The line moves with the service model, and most incidents I have reviewed lived on the customer side of a boundary the team assumed the provider covered.
| Concern | IaaS (VM) | PaaS (App Service) | SaaS (M365) |
|---|---|---|---|
| OS patching | You | Provider | Provider |
| Runtime / framework | You | Shared | Provider |
| Identity and access | You | You | You |
| Data classification | You | You | You |
| Network controls | You | Shared | Provider |
The takeaway: identity, access, and data classification never leave your side of the table. That is where you spend first.
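The table can also be kept as data, so the "what is still ours?" question has a queryable answer per service model. This is a minimal sketch; `RESPONSIBILITY`, `customer_owned`, and `always_yours` are illustrative names, not an Azure API, and the values simply mirror the table above.

```python
# Hypothetical encoding of the responsibility table; values copied from it.
RESPONSIBILITY = {
    "OS patching":         {"IaaS": "You", "PaaS": "Provider", "SaaS": "Provider"},
    "Runtime / framework": {"IaaS": "You", "PaaS": "Shared",   "SaaS": "Provider"},
    "Identity and access": {"IaaS": "You", "PaaS": "You",      "SaaS": "You"},
    "Data classification": {"IaaS": "You", "PaaS": "You",      "SaaS": "You"},
    "Network controls":    {"IaaS": "You", "PaaS": "Shared",   "SaaS": "Provider"},
}


def customer_owned(model: str) -> list[str]:
    """Concerns where the customer carries all or part of the responsibility."""
    return [
        concern
        for concern, owners in RESPONSIBILITY.items()
        if owners[model] in ("You", "Shared")
    ]


def always_yours() -> list[str]:
    """Concerns that stay with the customer under every service model."""
    return [
        concern
        for concern, owners in RESPONSIBILITY.items()
        if set(owners.values()) == {"You"}
    ]


if __name__ == "__main__":
    print("Never leaves your side:", always_yours())
    print("Still yours on PaaS:", customer_owned("PaaS"))
```

Counting "Shared" as customer-owned is a deliberate choice here: a shared row is one where an assumption about the provider is most likely to be wrong.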
2 - Threat model the workload with STRIDE and attack-path analysis
You cannot defend what you have not enumerated. STRIDE gives you six failure categories to walk every trust boundary against. Take a typical workload - a public API in front of a database with a queue and object storage - and tabulate it.
| Threat (STRIDE) | Example in this workload | Primary control |
|---|---|---|
| Spoofing | Stolen API token replayed | Entra ID auth, short-lived tokens, mTLS |
| Tampering | Mutated message on the queue | Signed payloads, TLS, integrity checks |
| Repudiation | “I never made that change” | Immutable audit logs, signed commits |
| Info disclosure | Storage account left public | Private Endpoints, encryption, RBAC |
| Denial of service | Volumetric flood on the API | WAF, rate limits, autoscale, DDoS plan |
| Elevation of privilege | App identity over-permissioned | Least-privilege RBAC, no wildcard roles |
STRIDE finds individual weaknesses; attack-path analysis finds the chain that turns three medium findings into a domain compromise. Write the path as an ordered list of hops and the privilege gained at each:
[Internet] -> exposed API (no WAF)
-> SSRF in image fetch
-> reach IMDS / managed identity endpoint
-> mint token for app's managed identity
-> Key Vault (over-broad access policy)
-> read DB connection string [GAME OVER]
Each hop might score “medium” in isolation. The path is critical. This is the single artifact that reframes the conversation from “patch this one thing” to “any of these four breaks the chain - which is cheapest?” Microsoft’s Threat Modeling Tool or any DFD-on-a-whiteboard works; the deliverable is the path list, not the diagram.
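The "which break is cheapest?" question can be made explicit by attaching a candidate control and an effort estimate to each hop. The sketch below is illustrative only: the `Hop` type, the control wording, and every `cost_days` figure are assumptions invented for the example, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    step: str         # what the attacker does at this hop
    control: str      # one control that would block the hop
    cost_days: float  # ASSUMED engineering effort, purely illustrative


# Four breakable hops from the path above, with made-up effort estimates.
ATTACK_PATH = [
    Hop("exposed API (no WAF)", "WAF in Prevention mode", 5),
    Hop("SSRF in image fetch", "allow-list outbound fetch targets", 3),
    Hop("reach IMDS / managed identity endpoint", "block metadata address from the fetcher", 2),
    Hop("Key Vault (over-broad access policy)", "scope vault access to one identity", 1),
]


def cheapest_break(path: list[Hop]) -> Hop:
    """Return the hop whose blocking control has the lowest estimated cost."""
    return min(path, key=lambda hop: hop.cost_days)


if __name__ == "__main__":
    hop = cheapest_break(ATTACK_PATH)
    print(f"Break the chain at: {hop.step} -> {hop.control}")
```

Cheapest is where you start, not where you stop: the remaining hops stay on the backlog as defense in depth.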
3 - Identity foundation: authentication, least privilege, segmentation
Identity is the new perimeter, so it is where defense in depth starts. Three moves, in order.
Kill standing privilege. Human admin access should be just-in-time and approved, not a permanent role assignment. In Entra, that is Privileged Identity Management (PIM):
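# Caution: `az role assignment create` would make an ACTIVE, standing assignment.
# Request an ELIGIBLE assignment through the PIM API instead, so elevation
# requires explicit activation, MFA, and (optionally) approval.
# The 90-day eligibility window below is an illustrative value.
SCOPE="/subscriptions/$SUB_ID/resourceGroups/rg-prod-app"
CONTRIBUTOR="b24988ac-6180-42a0-ab88-20f7382dd24c"   # built-in Contributor role ID
az rest --method put \
  --url "https://management.azure.com$SCOPE/providers/Microsoft.Authorization/roleEligibilityScheduleRequests/$(uuidgen)?api-version=2020-10-01" \
  --body "{\"properties\": {
    \"principalId\": \"$PRINCIPAL_ID\",
    \"roleDefinitionId\": \"/subscriptions/$SUB_ID/providers/Microsoft.Authorization/roleDefinitions/$CONTRIBUTOR\",
    \"requestType\": \"AdminAssign\",
    \"scheduleInfo\": {
      \"startDateTime\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
      \"expiration\": {\"type\": \"AfterDuration\", \"duration\": \"P90D\"}}}}"
# Then govern activation/MFA/approval via the PIM role settings in Entra,
# so the role stays eligible rather than permanently active.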
Enforce phishing-resistant MFA with Conditional Access. Require strong auth for admins and block legacy protocols that bypass it. Conditional Access policies are the enforcement plane:
{
  "displayName": "Require phishing-resistant MFA for admins",
  "state": "enabled",
  "conditions": {
    "clientAppTypes": ["all"],
    "users": { "includeRoles": ["62e90394-69f5-4237-9190-012177145e10"] },
    "applications": { "includeApplications": ["All"] }
  },
  "grantControls": {
    "operator": "OR",
    "authenticationStrength": { "id": "00000000-0000-0000-0000-000000000004" }
  }
}
The role GUID 62e90394-... is the built-in Global Administrator template ID; it is stable across every tenant. Scope CA policies to role templates, not named users - people change, roles do not.
Give workloads identities, not secrets. Application code should never hold a connection string. Use a managed identity and grant it a tightly scoped RBAC role:
# System-assigned identity on a web app, scoped to read ONE Key Vault.
az webapp identity assign --name app-prod --resource-group rg-prod-app
PRINCIPAL=$(az webapp identity show -n app-prod -g rg-prod-app --query principalId -o tsv)
az role assignment create \
--assignee-object-id "$PRINCIPAL" \
--assignee-principal-type ServicePrincipal \
--role "Key Vault Secrets User" \
--scope "/subscriptions/$SUB_ID/resourceGroups/rg-prod-app/providers/Microsoft.KeyVault/vaults/kv-prod-app"
Note the role is Key Vault Secrets User (read secrets only), not Contributor, and the scope is one vault, not the resource group. That one habit - narrowest role, narrowest scope - closes the Key Vault hop in the attack path above.
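The narrowest-role, narrowest-scope habit is easy to check mechanically. The sketch below assumes input shaped like the JSON from `az role assignment list` (fields `principalName`, `roleDefinitionName`, `scope`); the `BROAD_ROLES` set and the function names are illustrative choices, not an official definition of "over-broad".

```python
# ASSUMED list of roles treated as too broad for a workload identity.
BROAD_ROLES = {"Owner", "Contributor", "User Access Administrator"}


def scope_level(scope: str) -> str:
    """Classify an ARM scope string by how wide it is."""
    parts = [p.lower() for p in scope.strip("/").split("/") if p]
    if not parts or parts[0] != "subscriptions":
        return "managementGroupOrRoot"
    if "providers" in parts:
        return "resource"
    if "resourcegroups" in parts:
        return "resourceGroup"
    return "subscription"


def over_broad(assignments: list[dict]) -> list[tuple[str, list[str]]]:
    """Return (principal, reasons) for assignments that break the habit."""
    findings = []
    for a in assignments:
        reasons = []
        if a["roleDefinitionName"] in BROAD_ROLES:
            reasons.append("broad role")
        if scope_level(a["scope"]) != "resource":
            reasons.append("scope wider than a single resource")
        if reasons:
            findings.append((a["principalName"], reasons))
    return findings


if __name__ == "__main__":
    sample = [
        {"principalName": "app-prod", "roleDefinitionName": "Key Vault Secrets User",
         "scope": "/subscriptions/s/resourceGroups/rg-prod-app/providers/Microsoft.KeyVault/vaults/kv-prod-app"},
        {"principalName": "ops-team", "roleDefinitionName": "Contributor",
         "scope": "/subscriptions/s/resourceGroups/rg-prod-app"},
    ]
    for principal, reasons in over_broad(sample):
        print(principal, "->", ", ".join(reasons))
```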
4 - Layered network controls and private connectivity
Even with identity as the primary perimeter, the network is a control layer you do not get to skip - it shrinks blast radius and buys detection time. The goal is that nothing in the data tier is reachable from the internet, and lateral movement inside the VNet is denied by default.
Take PaaS data services off the public internet with Private Endpoints, and explicitly disable public access:
resource "azurerm_storage_account" "data" {
  name                          = "stproddata01"
  resource_group_name           = azurerm_resource_group.app.name
  location                      = azurerm_resource_group.app.location
  account_tier                  = "Standard"
  account_replication_type      = "ZRS"
  public_network_access_enabled = false # no public endpoint
  min_tls_version               = "TLS1_2"
}

resource "azurerm_private_endpoint" "blob" {
  name                = "pe-stproddata01-blob"
  resource_group_name = azurerm_resource_group.app.name
  location            = azurerm_resource_group.app.location
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-blob"
    private_connection_resource_id = azurerm_storage_account.data.id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }
}
Micro-segment with NSGs and a default-deny posture. The mistake I see most is allowing the whole VNet address space east-west. Deny it, then allow only the specific tier-to-tier flow:
# App subnet may reach the data subnet only on 1433; everything else east-west is denied.
az network nsg rule create -g rg-prod-app --nsg-name nsg-data \
--name allow-app-to-sql --priority 100 --direction Inbound --access Allow \
--protocol Tcp --source-address-prefixes 10.20.1.0/24 \
--destination-port-ranges 1433 --destination-address-prefixes 10.20.2.0/24
az network nsg rule create -g rg-prod-app --nsg-name nsg-data \
--name deny-vnet-inbound --priority 4000 --direction Inbound --access Deny \
--protocol "*" --source-address-prefixes VirtualNetwork \
--destination-port-ranges "*" --destination-address-prefixes "*"
Put a WAF (Azure Front Door or Application Gateway) at the edge in Prevention mode to cover the denial-of-service row of the STRIDE table and the first hop of the attack path, and enable the DDoS Protection plan on internet-facing VNets. Layering - WAF, then NSG default-deny, then Private Endpoint - means breaking the edge gets an attacker nowhere near the data.
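The default-deny posture depends on NSG evaluation order: rules are processed by ascending priority number and the first match decides. The sketch below models that with a simplified rule shape (source CIDR and destination port only). The `Rule` type, `evaluate`, and the `10.20.0.0/16` range standing in for the VirtualNetwork service tag are assumptions for illustration; real NSG rules also match on destination, protocol, and more.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int          # lower number is evaluated first
    access: str            # "Allow" or "Deny"
    source: str            # source CIDR
    port: Optional[int]    # destination port; None means any port


VNET = "10.20.0.0/16"  # ASSUMED VNet range standing in for the VirtualNetwork tag

RULES = [
    Rule("allow-app-to-sql", 100, "Allow", "10.20.1.0/24", 1433),
    Rule("deny-vnet-inbound", 4000, "Deny", VNET, None),
    Rule("AllowVnetInBound", 65000, "Allow", VNET, None),        # platform default
    Rule("DenyAllInBound", 65500, "Deny", "0.0.0.0/0", None),    # platform default
]


def evaluate(src_ip: str, port: int, rules: list[Rule] = RULES) -> Rule:
    """Return the first rule, by ascending priority, that matches the flow."""
    src = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_source = src in ipaddress.ip_network(rule.source)
        port_ok = rule.port is None or rule.port == port
        if in_source and port_ok:
            return rule
    raise LookupError("no rule matched")


if __name__ == "__main__":
    for ip, port in [("10.20.1.7", 1433), ("10.20.1.7", 22), ("10.20.3.9", 1433)]:
        rule = evaluate(ip, port)
        print(f"{ip}:{port} -> {rule.access} ({rule.name})")
```

The point of the model is the middle rule: without the explicit deny at 4000, the platform default `AllowVnetInBound` at 65000 would let every subnet in the VNet reach the data tier.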
5 - Protecting data: classification, encryption, key custody
You cannot apply the right protection until you know the sensitivity. Classify first - public, internal, confidential, restricted - then let the tier drive the encryption and key-custody decision. Over-encrypting public data wastes money and adds operational pain; under-protecting restricted data is the breach.
Encryption at rest and TLS in transit are table stakes and largely on by default in Azure. The real decision is key custody: who can destroy access to the data?
| Model | Who holds the key | Use when |
|---|---|---|
| Platform-managed (PMK) | Provider | Internal/confidential, low custody requirement |
| Customer-managed (CMK) | You, in Key Vault | Restricted data, regulatory key-control mandate |
| HSM-backed (Managed HSM) | You, FIPS 140-2 L3 | Highest assurance, crypto-erase guarantees |
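One way to keep the tier-to-custody decision consistent is to encode it once and call it from provisioning code. This is a rough sketch of the table above, not a policy recommendation: the `Tier` enum, the `key_custody` function, and the `needs_hsm_assurance` flag are hypothetical names, and treating public data as platform-managed is an assumption based on encryption at rest being on by default.

```python
from enum import Enum


class Tier(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"


def key_custody(tier: Tier, needs_hsm_assurance: bool = False) -> str:
    """Map a classification tier to a key-custody model, per the table above."""
    if tier is Tier.RESTRICTED:
        # Restricted data gets customer-held keys; HSM only when mandated.
        return "Managed HSM" if needs_hsm_assurance else "CMK in Key Vault"
    # Everything else rides on the provider's default encryption at rest.
    return "Platform-managed"


if __name__ == "__main__":
    for tier in Tier:
        print(f"{tier.value:>12}: {key_custody(tier)}")
```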
For restricted data, wire CMK with an explicit cryptographic boundary and rotation:
# Create a key with an enforced rotation policy (Premium vault / Managed HSM for HSM keys).
az keyvault key create --vault-name kv-prod-app --name cmk-storage \
--protection software --kty RSA --size 3072
az keyvault key rotation-policy update --vault-name kv-prod-app --name cmk-storage \
--value '{"lifetimeActions":[{"trigger":{"timeAfterCreate":"P350D"},
"action":{"type":"Rotate"}}],"attributes":{"expiryTime":"P365D"}}'
CMK gives you crypto-shred as a kill switch: revoke or destroy the key and the data is unrecoverable even though the ciphertext still sits in storage. That is a powerful control and a loaded gun - put the key vault behind soft-delete and purge protection so an accidental delete does not become your own denial-of-service.
--enable-purge-protection true on the vault is non-negotiable for production CMK.
Application secrets follow the same discipline: store in Key Vault, reference (never copy) from the app, rotate on a schedule, and prefer managed identity over any secret at all.
6 - Detection and response: telemetry, automated remediation, runbooks
Assume breach. The question is not whether something gets through but how fast you see it and how much of the response runs without a human. Centralize telemetry, write detections as queries, and codify the response.
Ship control-plane, data-plane, identity, and network logs into one Log Analytics workspace via diagnostic settings, then hunt with KQL. A detection I keep in every environment: a medium- or high-risk sign-in (an anonymous or Tor IP is a typical trigger) followed within 30 minutes by a role assignment, because that pattern is the attack path:
// Suspicious sign-in followed by a privileged role assignment within 30 min
let risky = SigninLogs
| where RiskLevelDuringSignIn in ("high","medium")
| project riskUser = UserPrincipalName, riskTime = TimeGenerated, IPAddress;
AuditLogs
| where OperationName has "Add member to role"
| extend actor = tostring(InitiatedBy.user.userPrincipalName)
| join kind=inner risky on $left.actor == $right.riskUser
| where TimeGenerated between (riskTime .. (riskTime + 30m))
| project TimeGenerated, actor, OperationName, IPAddress, TargetResources
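To unit-test the correlation logic outside the workspace, the same join can be sketched over in-memory events. The dictionary field names (`user`, `risk`, `time`, `ip`, `actor`, `operation`) are hypothetical stand-ins for the log columns, and the substring check only approximates KQL's term-based `has` operator.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def correlate(signins: list[dict], audits: list[dict], window: timedelta = WINDOW) -> list[tuple]:
    """Pair risky sign-ins with role assignments by the same user inside the window."""
    risky = [s for s in signins if s["risk"] in ("high", "medium")]
    hits = []
    for a in audits:
        if "Add member to role" not in a["operation"]:
            continue
        for s in risky:
            same_user = s["user"] == a["actor"]
            in_window = s["time"] <= a["time"] <= s["time"] + window
            if same_user and in_window:
                hits.append((a["time"], a["actor"], a["operation"], s["ip"]))
    return hits


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    signins = [{"user": "a@contoso.example", "risk": "high", "time": t0, "ip": "198.51.100.7"}]
    audits = [{"actor": "a@contoso.example", "operation": "Add member to role",
               "time": t0 + timedelta(minutes=10)}]
    for hit in correlate(signins, audits):
        print(hit)
```

Keeping a fixture like this next to the query makes it cheap to confirm that a later edit to the window or the risk filter still catches the pattern.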
Promote the high-signal queries to Microsoft Sentinel analytics rules so they create incidents, then attach automated remediation for the unambiguous cases. A Logic App playbook that disables a user on a confirmed token-theft signal turns a 30-minute manual scramble into a sub-minute containment. For misconfiguration, back the detection with a preventive guardrail in Azure Policy:
# Block publicly exposed storage accounts via Azure Policy. The Deny effect rejects
# non-compliant create/update requests; it does not remediate existing resources.
az policy assignment create \
--name "deny-public-blob" \
--policy "4fa4b6c0-31ca-4c0d-b10d-24b96f62a751" \
--scope "/subscriptions/$SUB_ID" \
--params '{"effect":{"value":"Deny"}}'
Built-in policy 4fa4b6c0-... is “Storage account public access should be disallowed”. Run it as Audit for a sprint to size the blast radius, then flip to Deny. Shipping straight to Deny on a brownfield estate is how you take down production on a Friday.
Then write the runbook for what automation cannot decide: who declares the incident, how you preserve forensic snapshots, the comms tree, the rollback. A playbook nobody has rehearsed is a document, not a control - run a game day against it quarterly.
7 - Embedding security in the SDLC and infra pipelines
Controls applied after deploy are remediation; controls applied in the pipeline are prevention. Shift every check left so insecure config and leaked secrets never reach an environment in the first place.
A minimal but real gate set in CI - secret scanning, IaC static analysis, and dependency CVE checks - that fails the build on a finding:
# Azure DevOps / GitHub Actions equivalent: hard gates before any deploy stage.
steps:
  - name: Secret scan
    run: gitleaks detect --source . --redact --exit-code 1
  - name: IaC static analysis
    run: |
      checkov -d ./infra --quiet --compact \
        --framework terraform \
        --soft-fail-on LOW # fail build on MEDIUM/HIGH misconfig
  - name: Dependency CVEs
    run: trivy fs --severity HIGH,CRITICAL --exit-code 1 .
Pair the pipeline gate with platform-level guardrails so nothing can drift back: deny-effect Azure Policy at the management-group scope enforces the same rules on resources created outside the pipeline (someone clicking in the portal). Pipeline catches it pre-merge; policy catches it at the control plane. Defense in depth applies to your governance, not just your runtime.
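When scanners emit findings in different shapes, a small normalizing gate keeps the pass/fail rule in one place. The sketch below assumes each finding has already been reduced to a dict with a `severity` key; `SEVERITY_ORDER`, `gate`, and the `fail_at` parameter are hypothetical names, not part of gitleaks, checkov, or trivy.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def gate(findings: list[dict], fail_at: str = "MEDIUM") -> tuple[int, list[dict]]:
    """Return (exit_code, blocking_findings) for a severity threshold."""
    threshold = SEVERITY_ORDER.index(fail_at)
    blocking = [
        f for f in findings
        if SEVERITY_ORDER.index(f["severity"]) >= threshold
    ]
    return (1 if blocking else 0), blocking


if __name__ == "__main__":
    findings = [
        {"id": "demo-low", "severity": "LOW"},
        {"id": "demo-high", "severity": "HIGH"},
    ]
    code, blocking = gate(findings)
    print("exit code:", code, "blocking:", [f["id"] for f in blocking])
```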
Verify
A control you have not proven is a control you do not have. Validate each layer:
# 1. Identity - confirm NO permanent privileged role assignments outside PIM.
az role assignment list --all --query \
"[?roleDefinitionName=='Owner' || roleDefinitionName=='Contributor'].{p:principalName,scope:scope}" -o table
# 2. Network - prove the storage public endpoint is actually closed.
az storage account show -n stproddata01 -g rg-prod-app \
--query "{public:publicNetworkAccess, tls:minimumTlsVersion}" -o table
# expect: public=Disabled, tls=TLS1_2
# 3. Data - confirm purge protection on the CMK vault (no silent kill switch).
az keyvault show -n kv-prod-app \
--query "properties.enablePurgeProtection" -o tsv # expect: true
# 4. Detection - confirm diagnostic settings are routing logs to the workspace.
az monitor diagnostic-settings list \
--resource "$STORAGE_ID" --query "[].workspaceId" -o tsv
Then go beyond config inspection: run the attack path from Step 2 as an actual test. Try the SSRF, try to reach the metadata endpoint, try to read the vault with the app identity. If any hop succeeds, the chain is live regardless of what the checklist says. Pen test the path, do not just admire the diagram.
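The "expect:" comments above can be turned into assertions so the verification runs in CI rather than by eye. This sketch assumes the storage check is re-run with `-o json`, which yields an object with the `public` and `tls` keys chosen in the `--query` expression; `EXPECTED` and `check_storage` are illustrative names.

```python
import json

# Expected values taken from the "expect:" comment on the storage check.
EXPECTED = {"public": "Disabled", "tls": "TLS1_2"}


def check_storage(cli_json: str) -> dict:
    """Return {key: (actual, expected)} for every value that does not match."""
    actual = json.loads(cli_json)
    return {
        key: (actual.get(key), want)
        for key, want in EXPECTED.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    drift = check_storage('{"public": "Enabled", "tls": "TLS1_2"}')
    if drift:
        raise SystemExit(f"storage account drifted: {drift}")
```

An empty result means the layer matches what you declared; anything else names the exact property that drifted.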
Enterprise scenario
A platform team I worked with ran a regulated multi-tenant SaaS on AKS. Their constraint was hard: a compliance mandate required that no human could read tenant data at rest, and they had to prove it to an auditor, but engineers still needed break-glass access to the cluster for incidents. The naive read - lock everyone out - was incompatible with operating the system.
The resolution was to separate cluster access from data access using two different key custodies. Tenant data used CMK in a Managed HSM that no human principal had unwrapKey rights to; only the application’s managed identity held that permission. Engineers kept break-glass to the AKS control plane through PIM, but that path could reach pods, not plaintext - the data stayed encrypted under a key humans literally could not use. Break-glass activation also fired the Sentinel rule, so every human elevation generated an audited incident.
# Only the workload identity can unwrap; the break-glass admin group cannot.
az keyvault role assignment create --hsm-name mhsm-prod \
--role "Managed HSM Crypto User" \
--assignee-object-id "$APP_IDENTITY_OID" \
--scope "/keys/cmk-tenant-data"
# Note: the engineer break-glass group is granted NO crypto role on this key.
The auditor got a clean answer - “show me a human who can decrypt tenant data” returned nobody - while the on-call kept the access they needed to actually run the platform. That is the security pillar working as designed: least privilege and assume-breach turned into a key-custody boundary, not a slide.
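The auditor's question can be phrased as a repeatable check over the HSM's role assignments. The sketch below is a simplified model: the dict fields, the `DECRYPT_ROLES` set, and the scope-matching rule are assumptions for illustration, and a real check would need to expand group membership and confirm which roles carry unwrap or decrypt permissions in your environment.

```python
# ASSUMED set of roles treated as able to unwrap/decrypt with the key.
DECRYPT_ROLES = {
    "Managed HSM Crypto User",
    "Managed HSM Crypto Service Encryption User",
}
HUMAN_TYPES = {"User", "Group"}


def covers(assignment_scope: str, key_scope: str) -> bool:
    """True if an assignment at assignment_scope applies to key_scope."""
    if assignment_scope == "/":
        return True
    prefix = assignment_scope.rstrip("/") + "/"
    return key_scope == assignment_scope or key_scope.startswith(prefix)


def humans_who_can_decrypt(assignments: list[dict],
                           key_scope: str = "/keys/cmk-tenant-data") -> list[str]:
    """List human principals holding a decrypt-capable role over the key."""
    return sorted(
        a["principal"]
        for a in assignments
        if a["principal_type"] in HUMAN_TYPES
        and a["role"] in DECRYPT_ROLES
        and covers(a["scope"], key_scope)
    )


if __name__ == "__main__":
    assignments = [
        {"principal": "app-identity", "principal_type": "ServicePrincipal",
         "role": "Managed HSM Crypto User", "scope": "/keys/cmk-tenant-data"},
    ]
    print("Humans who can decrypt:", humans_who_can_decrypt(assignments))
```

Run on a schedule, an empty list is the evidence; a non-empty one is the incident.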