Most “threat modeling” I see in the wild is a one-off whiteboard session that produces a photo nobody looks at again. That is not threat modeling; it is theater. Done properly, threat modeling is the cheapest security control you own: it finds design flaws before a single line of code is written, when the cost of changing your mind is a sticky note rather than a re-architecture. The method below is the one I run with platform and application teams. It is deliberately mechanical so a mid-level engineer can drive it without a security PhD, and it produces artifacts you can diff, track, and re-run when the system changes.
The four questions that anchor the whole exercise (the Shostack frame) are: What are we building? What can go wrong? What are we going to do about it? Did we do a good enough job? Every step maps to one of those.
1. Why threat model, and when
The economic argument is simple. A SQL injection caught in a design review is a one-line note: “parameterize this query.” The same flaw caught in pen testing is a bug ticket, a code review, a redeploy, and a regression test. Caught in production it is an incident, a forensics bill, and possibly a breach notification. Threat modeling shifts the discovery left to the point of lowest remediation cost.
Trigger a threat model when:
- A new system or service is being designed (before the API contract is frozen).
- An existing system crosses a new trust boundary — exposing an internal API to the internet, adding a third-party integration, ingesting untrusted file uploads.
- The data classification changes — you start handling PII, payment data, or regulated health records.
- Authentication, authorization, or crypto design changes.
You do not need to re-model on every commit. Model on architectural change. That distinction is what keeps the practice sustainable.
2. Scope the system and draw a data-flow diagram
You cannot threat-model what you cannot see. The foundational artifact is a data-flow diagram (DFD) — not a UML class diagram, not a network topology, but a picture of how data moves. DFDs use exactly four element types, and that constraint is a feature:
| Element | Shape (classic) | Meaning |
|---|---|---|
| External entity | Rectangle | An actor outside your control: user, third-party API, another team’s service |
| Process | Circle | Code that transforms data: a service, a Lambda, a function |
| Data store | Two parallel lines | Where data rests: database, S3 bucket, queue, cache |
| Data flow | Arrow | Data in motion between the above, labeled with protocol/payload |
The fifth thing — and the single most important — is the trust boundary: a dashed line crossing the flows where the privilege or trust level changes. Threats cluster on trust boundaries. The internet-to-DMZ edge, the app-to-database edge, the tenant-A-to-tenant-B edge: that is where an attacker’s input meets your assumptions.
Keep DFDs in version control as text so they diff. I use a Structurizr-style or PlantUML description; here is a minimal DFD-as-code for a document-upload service:
@startuml
title DFD - Document Upload Service (Level 1)

actor "Browser (user)" as user

' Trust boundaries are dashed containers: a flow that enters or leaves one crosses a boundary.
' Placement of elements inside each boundary is illustrative.
rectangle "Internet boundary (edge)" as tb1 #line.dashed {
  rectangle "CDN / WAF" as waf
}
rectangle "VPC / private boundary" as tb2 #line.dashed {
  component "Upload API\n(process)" as api
  database "Object Store\n(S3, data store)" as s3
  queue "Convert Queue\n(data store)" as q
  component "Converter\n(process)" as conv
  database "Metadata DB\n(data store)" as db
}

user --> waf : HTTPS multipart upload
waf --> api : forwarded request (auth token)
api --> s3 : PUT object (SSE-KMS)
api --> db : INSERT metadata
api --> q : enqueue job {objectKey}
q --> conv : dequeue job
conv --> s3 : GET object / PUT rendered
conv --> db : UPDATE status
@enduml
Two rules that save you later:
- Decompose only until threats stop changing. A “Level 0” context diagram (one process, the external entities, the major stores) is enough to start. Drill into a “Level 1” only for the processes where new threats emerge.
- Every flow that crosses a trust boundary gets a label describing protocol, authentication, and data sensitivity. Unlabeled crossings are where teams hand-wave.
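The labeling rule can be checked mechanically. Below is a minimal sketch, assuming each DFD element is assigned to a named zone and each flow carries three label fields; the `ZONES` mapping, the `Flow` fields, and the sample labels are illustrative assumptions, not part of any DFD standard or tool.

```python
from dataclasses import dataclass

# Hypothetical zone assignment for the upload-service DFD; adjust to your diagram.
ZONES = {
    "user": "internet",
    "waf": "edge",
    "api": "vpc",
    "s3": "vpc",
    "q": "vpc",
    "conv": "vpc",
    "db": "vpc",
}


@dataclass(frozen=True)
class Flow:
    src: str
    dst: str
    protocol: str = ""
    auth: str = ""
    sensitivity: str = ""


def crosses_boundary(flow: Flow, zones: dict[str, str]) -> bool:
    """A flow crosses a trust boundary when its endpoints sit in different zones."""
    return zones[flow.src] != zones[flow.dst]


def unlabeled_crossings(flows: list[Flow], zones: dict[str, str]) -> list[Flow]:
    """Return boundary-crossing flows missing protocol, auth, or sensitivity."""
    return [
        f for f in flows
        if crosses_boundary(f, zones)
        and not (f.protocol and f.auth and f.sensitivity)
    ]


FLOWS = [
    Flow("user", "waf", protocol="HTTPS", auth="bearer token", sensitivity="confidential"),
    Flow("waf", "api", protocol="HTTPS", auth="bearer token"),  # sensitivity missing
    Flow("api", "db"),  # same zone, so not a crossing and not reported
]

if __name__ == "__main__":
    for f in unlabeled_crossings(FLOWS, ZONES):
        print(f"unlabeled crossing: {f.src} -> {f.dst}")
```

Flows inside a single zone are deliberately ignored here; whether to require labels on them as well is a team choice.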
3. Apply STRIDE per element to enumerate threats
STRIDE is a mnemonic for six threat categories. Its power is coverage: each letter is the negation of one security property, so walking all six letters checks every property in the table rather than whichever ones come to mind:
| STRIDE | Threat | Property violated |
|---|---|---|
| S | Spoofing | Authentication |
| T | Tampering | Integrity |
| R | Repudiation | Non-repudiation |
| I | Information disclosure | Confidentiality |
| D | Denial of service | Availability |
| E | Elevation of privilege | Authorization |
The technique I trust most is STRIDE-per-element. Walk every element in the DFD and ask which STRIDE categories apply to that element type. Microsoft’s mapping is the standard starting grid:
| Element type | S | T | R | I | D | E |
|---|---|---|---|---|---|---|
| External entity | x | | x | | | |
| Process | x | x | x | x | x | x |
| Data store | | x | x* | x | x | |
| Data flow | | x | | x | x | |
(*Repudiation against a data store applies when it is a log/audit store.)
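The grid is small enough to encode directly. Here is a minimal sketch that turns an element and its type into one prompt per applicable STRIDE letter; the function name, the type keys, and the `is_audit_store` flag are assumptions made for illustration.

```python
# STRIDE-per-element starting grid: which letters apply to which DFD element type.
APPLICABLE = {
    "external_entity": "SR",
    "process": "STRIDE",
    "data_store": "TRID",  # R only counts for log/audit stores (see footnote)
    "data_flow": "TID",
}

NAMES = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}


def stride_prompts(element: str, element_type: str, is_audit_store: bool = False) -> list[str]:
    """Return one question per STRIDE category that applies to this element."""
    letters = APPLICABLE[element_type]
    if element_type == "data_store" and not is_audit_store:
        letters = letters.replace("R", "")
    return [f"{element}: how could {NAMES[letter]} happen here?" for letter in letters]


if __name__ == "__main__":
    for prompt in stride_prompts("Upload API", "process"):
        print(prompt)
```

The output is a checklist of questions, not a list of findings: each prompt still has to be answered with a concrete threat or an explicit "not applicable".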
So for the Upload API process, you generate one threat per applicable letter:
- S: Can a request spoof another user’s identity? (Stolen/forged token, missing audience check.)
- T: Can the multipart body be tampered to write outside the user’s prefix?
- R: If a malicious upload happens, can we prove who did it? (Audit log of `objectKey` + principal.)
- I: Can the API leak another tenant’s metadata via an IDOR on the object key?
- D: Can a huge or zip-bomb upload exhaust the converter?
- E: Can a crafted file trigger code execution in the converter, escalating from “uploader” to “runs code in VPC”?
Capture these in a structured, diffable format. I keep threats in YAML next to the code so they live in the same PR:
# threats/upload-api.yaml
- id: TM-UPLOAD-007
  element: "Upload API (process)"
  stride: I            # Information disclosure
  title: "IDOR allows cross-tenant metadata read"
  description: >
    objectKey is client-supplied and used directly in the metadata
    lookup. A user can request another tenant's key and read status,
    filename, and size.
  status: open         # open | mitigated | accepted | transferred
  mitigation_ref: CTRL-AUTHZ-OBJ-SCOPE
Do not skip categories because “that won’t happen here.” The discipline is to write the threat down and then explicitly mark it not applicable with a reason. An empty cell you reasoned about is very different from one you never considered.
4. Build attack trees to reason about adversary goals
STRIDE enumerates threats element by element; attack trees reason top-down from an adversary’s goal. They are complementary. Use an attack tree when a threat is serious enough that you need to understand the full set of paths to it — and therefore where a single control breaks multiple paths at once.
The root node is the attacker’s objective. Children are sub-goals joined by AND (all required) or OR (any sufficient) logic. Leaves are concrete attacker actions you can mitigate.
GOAL: Exfiltrate another tenant's documents
|
+-- OR 1. Read objects directly from the store
|   +-- AND
|       +-- Obtain object key (OR: IDOR on API; guess UUID; leaked log)
|       +-- Bypass store authz (OR: misconfigured bucket policy;
|           stolen IAM creds; SSRF to metadata svc)
|
+-- OR 2. Abuse the converter
|   +-- AND
|       +-- Upload malicious file that triggers RCE in converter
|       +-- Converter IAM role can read all tenants' objects
|
+-- OR 3. Compromise the metadata DB
    +-- OR
        +-- SQL injection on a query path
        +-- Stolen DB credential from environment / secrets store
Reading the tree pays off immediately. “Converter IAM role can read all tenants’ objects” is one of two AND-children on path 2, so removing it breaks that whole path even if the RCE still lands. An over-broad role also raises the payoff of the stolen-credential leaf on path 1. Scoping that role to a single object prefix per job therefore collapses an entire subtree. That is the kind of high-leverage control attack trees surface and per-element STRIDE alone can miss, because STRIDE looks at one element at a time while the tree shows you the chain.
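The AND/OR logic can also be evaluated in code, which makes the leverage of a single control easy to demonstrate. This is a minimal sketch under simplifying assumptions: the `Node` class and `achievable` function are hypothetical helpers, a leaf is either fully mitigated or not, and only paths 2 and 3 of the tree are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list["Node"] = field(default_factory=list)


def achievable(node: Node, mitigated: set[str]) -> bool:
    """True if the attacker can still reach this node given the mitigated leaves."""
    if node.gate == "LEAF":
        return node.label not in mitigated
    results = [achievable(child, mitigated) for child in node.children]
    # AND needs every child to succeed; OR needs any one of them.
    return all(results) if node.gate == "AND" else any(results)


ROLE_LEAF = "Converter IAM role can read all tenants' objects"

path2 = Node("Abuse the converter", "AND", [
    Node("Upload malicious file that triggers RCE in converter"),
    Node(ROLE_LEAF),
])
path3 = Node("Compromise the metadata DB", "OR", [
    Node("SQL injection on a query path"),
    Node("Stolen DB credential"),
])
goal = Node("Exfiltrate another tenant's documents", "OR", [path2, path3])

if __name__ == "__main__":
    print("path 2 open after scoping the role:", achievable(path2, {ROLE_LEAF}))
    print("goal still reachable via path 3:", achievable(goal, {ROLE_LEAF}))
```

Mitigating one AND-child closes path 2 outright, while the OR at the root keeps the goal reachable until every remaining path has at least one broken node.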
You do not need a tool. ASCII trees in a markdown file are fine and they diff. If you want tooling, Microsoft’s free Threat Modeling Tool generates STRIDE threats from a DFD automatically, and OWASP Threat Dragon does the same in-browser or via a JSON model you can commit:
# OWASP Threat Dragon model is plain JSON - keep it in the repo
git add threat-models/upload-service.json
git commit -m "threat model: upload service v2 (adds converter RCE path)"
5. Rate and prioritize threats
A list of 60 threats with no priority is noise. You need a consistent score so the riskiest items rise. Two common approaches:
DREAD scores each threat 1-3 (or 1-10) across Damage, Reproducibility, Exploitability, Affected users, Discoverability, then averages. It is simple but notoriously subjective — Discoverability in particular tempts people into security-by-obscurity. If you use it, fix a rubric so “Damage = 3” means the same thing to everyone.
Risk = (Damage + Reproducibility + Exploitability + Affected + Discoverability) / 5
TM-UPLOAD-007 (cross-tenant IDOR):
Damage 3, Reproducibility 3, Exploitability 3, Affected 3, Discoverability 2
Risk = 14 / 5 = 2.8 -> High
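The same arithmetic as a small function, which also enforces the rubric range so a stray 7 on a 1-3 scale fails loudly. This is a sketch: the band cut-offs in `dread_band` are hypothetical and should come from your own rubric.

```python
def dread_score(damage: int, reproducibility: int, exploitability: int,
                affected: int, discoverability: int, scale_max: int = 3) -> float:
    """Average the five DREAD factors after checking each is within 1..scale_max."""
    factors = (damage, reproducibility, exploitability, affected, discoverability)
    if not all(1 <= f <= scale_max for f in factors):
        raise ValueError(f"every factor must be between 1 and {scale_max}")
    return sum(factors) / len(factors)


def dread_band(score: float) -> str:
    """Map a 1-3 score to a band. Cut-offs are illustrative assumptions."""
    if score >= 2.5:
        return "High"
    if score >= 1.5:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    score = dread_score(3, 3, 3, 3, 2)  # TM-UPLOAD-007
    print(score, dread_band(score))
```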
I increasingly prefer a likelihood x impact matrix tied to your existing risk register, because it speaks the language leadership already uses and avoids DREAD’s pseudo-precision:
| Likelihood \ Impact | Low | Medium | High |
|---|---|---|---|
| High | Medium | High | Critical |
| Medium | Low | Medium | High |
| Low | Low | Low | Medium |
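The matrix translates directly into a lookup table, which keeps ratings consistent across reviewers. A minimal sketch follows; the `prioritize` helper and the `likelihood`/`impact` field names are assumptions for illustration.

```python
# (likelihood, impact) -> rating, copied from the matrix above.
RISK_MATRIX = {
    ("High", "Low"): "Medium", ("High", "Medium"): "High", ("High", "High"): "Critical",
    ("Medium", "Low"): "Low", ("Medium", "Medium"): "Medium", ("Medium", "High"): "High",
    ("Low", "Low"): "Low", ("Low", "Medium"): "Low", ("Low", "High"): "Medium",
}

SEVERITY_ORDER = ["Critical", "High", "Medium", "Low"]


def rate(likelihood: str, impact: str) -> str:
    """Look up the rating for a likelihood/impact pair."""
    return RISK_MATRIX[(likelihood, impact)]


def prioritize(threats: list[dict]) -> list[dict]:
    """Sort threats so the highest-rated come first (stable within a rating)."""
    return sorted(
        threats,
        key=lambda t: SEVERITY_ORDER.index(rate(t["likelihood"], t["impact"])),
    )


if __name__ == "__main__":
    sample = [
        {"id": "TM-A", "likelihood": "Low", "impact": "Medium"},
        {"id": "TM-B", "likelihood": "High", "impact": "High"},
    ]
    for t in prioritize(sample):
        print(t["id"], rate(t["likelihood"], t["impact"]))
```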
Whichever you pick, be consistent within a model and record the rationale. The score is an input to a decision, not the decision. For each rated threat you then choose one of four responses, and you must pick one explicitly:
- Mitigate — add a control that reduces likelihood or impact.
- Eliminate — remove the feature/flow that creates the threat.
- Transfer — push the risk elsewhere (managed service, insurance, contract).
- Accept — sign off with a named owner and an expiry date.
“Accept” without a name and a date is just “ignore.” Make it auditable.
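One way to make acceptance auditable is to check it in the same lint pass as the rest of the model. The sketch below assumes two extra fields on each threat record, `accepted_by` and `accept_expires` (an ISO date string); both names are hypothetical and not part of the threat format shown earlier.

```python
from datetime import date


def invalid_acceptances(threats: list[dict], today: date) -> list[tuple[str, str]]:
    """Return (threat id, problem) pairs for accepted risks that lack an owner or a live expiry."""
    problems = []
    for threat in threats:
        if threat.get("status") != "accepted":
            continue
        if not threat.get("accepted_by"):
            problems.append((threat["id"], "no named owner"))
        expires = threat.get("accept_expires")
        if not expires:
            problems.append((threat["id"], "no expiry date"))
        elif date.fromisoformat(expires) < today:
            problems.append((threat["id"], "acceptance expired"))
    return problems


if __name__ == "__main__":
    sample = [
        {"id": "TM-X-001", "status": "accepted", "accepted_by": "jane.doe",
         "accept_expires": "2030-01-01"},
        {"id": "TM-X-002", "status": "accepted"},
    ]
    for threat_id, problem in invalid_acceptances(sample, date.today()):
        print(f"{threat_id}: {problem}")
```

Passing `today` in explicitly keeps the check deterministic in tests; an expired acceptance then surfaces as a failure instead of silently staying accepted.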
6. Select mitigations and map them to controls
This is where threat modeling earns its keep — every open threat must trace to a concrete, testable control, and ideally to a requirement or a ticket. Keep a control catalog and reference it by ID so the same control can cover many threats:
# controls/catalog.yaml
- id: CTRL-AUTHZ-OBJ-SCOPE
  title: "Per-tenant object authorization on every store access"
  detail: >
    Resolve the caller's tenant from the validated token, derive the
    allowed key prefix server-side, and reject any objectKey outside it.
    Never trust a client-supplied key for authorization.
  verifies: [TM-UPLOAD-007]
  test: "integration test attempts cross-tenant key, expects 403"
  owner: platform-security
  status: implemented
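What the control's `detail` field describes might look like the following in application code. This is a minimal sketch, assuming keys follow a `tenants/<tenant>/...` layout and that the tenant ID comes from an already validated token; the function and exception names are hypothetical.

```python
class Forbidden(Exception):
    """Raised when a caller asks for a key outside its own tenant prefix."""


def authorize_object_key(token_tenant: str, object_key: str) -> str:
    """Return the key unchanged if it sits under the caller's tenant prefix, else raise."""
    # The prefix is derived server-side from the token, never from the request.
    prefix = f"tenants/{token_tenant}/"
    # Reject empty, "." and ".." segments outright rather than trying to normalize them.
    if any(segment in ("", ".", "..") for segment in object_key.split("/")):
        raise Forbidden("malformed object key")
    if not object_key.startswith(prefix):
        raise Forbidden("object key is outside the caller's tenant prefix")
    return object_key


if __name__ == "__main__":
    print(authorize_object_key("acme", "tenants/acme/doc.pdf"))
    try:
        authorize_object_key("acme", "tenants/globex/doc.pdf")
    except Forbidden as exc:
        print("403:", exc)
```

An API handler would map `Forbidden` to the 403 that the catalog's `test` field expects.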
For the upload service, the STRIDE pass and attack tree converge on a short, high-value control set:
- I / IDOR (TM-UPLOAD-007): derive the key prefix from the token, not the request. Authorize server-side.
- E / converter RCE + over-broad role: scope the converter IAM role to a per-job prefix and run the converter in a sandbox (separate account/namespace, no outbound egress). Defense in depth — even if RCE succeeds, blast radius is one object.
- D / zip bomb: enforce a max decompressed-size limit and a wall-clock timeout in the converter.
- T+I in transit and at rest: enforce TLS on every flow, and keep SSE-KMS encryption on the bucket for data at rest.
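For the zip-bomb item, the decompressed-size limit can be enforced by streaming the archive and counting bytes as they come out, since sizes declared in archive headers cannot be trusted. The sketch below uses only the standard library; the 50 MiB limit and the names are illustrative assumptions, and it covers the size guard only (the wall-clock timeout belongs at the process or job level).

```python
import io
import zipfile

MAX_TOTAL_BYTES = 50 * 1024 * 1024  # hypothetical limit; tune to your workload
CHUNK_SIZE = 64 * 1024


class ArchiveTooLarge(Exception):
    """Raised when decompressed content exceeds the configured limit."""


def safe_decompressed_size(data: bytes, limit: int = MAX_TOTAL_BYTES) -> int:
    """Decompress every member in chunks and abort once the running total passes limit."""
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            with archive.open(info) as member:
                while True:
                    chunk = member.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    total += len(chunk)
                    if total > limit:
                        raise ArchiveTooLarge(f"decompressed size exceeds {limit} bytes")
    return total


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", b"\0" * 1_000_000)  # compresses to a tiny archive
    print(len(buf.getvalue()), "bytes compressed")
    print(safe_decompressed_size(buf.getvalue()), "bytes decompressed")
```

Because `.docx` and `.xlsx` files are zip containers, a guard of this shape can run before the document reaches the renderer. It does not descend into nested archives.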
A minimal example of the scoped-role control in Terraform. The policy puts an IAM policy variable in the resource ARN, so the role can only touch keys under the tenant prefix named by its `tenant` principal tag:
data "aws_iam_policy_document" "converter_scoped" {
  statement {
    sid     = "ReadWriteTenantPrefixOnly"
    effect  = "Allow"
    actions = ["s3:GetObject", "s3:PutObject"]
    # "$${" escapes Terraform interpolation, so IAM receives the literal ${aws:PrincipalTag/tenant}
    resources = ["${aws_s3_bucket.docs.arn}/tenants/$${aws:PrincipalTag/tenant}/*"]
  }
}
The point is traceability: open the threat, follow mitigation_ref to the control, follow the control’s test to the assertion that proves it. No orphan threats, no orphan controls.
7. Embed threat modeling into design reviews and the SDLC
Threat modeling sticks only when it is a gate, not a heroic act. Wire it into the process at three points:
- Design-review gate. Every design doc / RFC for a system that crosses a trust boundary must link a threat model before approval. Make it a checklist item in your RFC template.
- PR gate for the model itself. The threat YAML and DFD live in the repo. A schema check in CI fails the build if a threat is `open` with no `mitigation_ref`, or a control references a nonexistent threat ID:
# .github/workflows/threat-model.yml
name: threat-model-lint
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate threat models
        run: |
          pip install yamllint
          yamllint threats/ controls/
          python scripts/check_threats.py   # fails on open threats w/o mitigation_ref
- Right-sized cadence. A full STRIDE-per-element pass for a greenfield platform; a 30-minute “delta model” for an incremental change that only touches one or two flows. Calibrate effort to the size of the architectural change, or teams will route around the process.
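The article does not show `scripts/check_threats.py`, so the following is only a guess at the core of such a check, not the actual script. It works on already parsed records (lists of dicts with the `id`, `status`, `mitigation_ref`, and `verifies` fields used in the YAML examples); loading the files, for example with PyYAML's `yaml.safe_load`, and exiting non-zero on errors are left out and assumed.

```python
def check_threats(threats: list[dict], controls: list[dict]) -> list[str]:
    """Return human-readable errors for orphan threats and dangling references."""
    errors = []
    threat_ids = {t["id"] for t in threats}
    control_ids = {c["id"] for c in controls}

    for threat in threats:
        ref = threat.get("mitigation_ref")
        if threat.get("status") == "open" and not ref:
            errors.append(f"{threat['id']}: open with no mitigation_ref")
        elif ref and ref not in control_ids:
            errors.append(f"{threat['id']}: mitigation_ref {ref} is not in the control catalog")

    for control in controls:
        for threat_id in control.get("verifies", []):
            if threat_id not in threat_ids:
                errors.append(f"{control['id']}: references unknown threat {threat_id}")

    return errors


if __name__ == "__main__":
    threats = [
        {"id": "TM-UPLOAD-007", "status": "open", "mitigation_ref": "CTRL-AUTHZ-OBJ-SCOPE"},
        {"id": "TM-UPLOAD-008", "status": "open"},
    ]
    controls = [{"id": "CTRL-AUTHZ-OBJ-SCOPE", "verifies": ["TM-UPLOAD-007"]}]
    for error in check_threats(threats, controls):
        print(error)
```

Returning a list of errors rather than raising on the first one lets a CI run report every problem in a single pass.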
Verify
How do you know the model is good enough — Shostack’s fourth question? Verification is concrete, not vibes:
# 1. Coverage: every cross-boundary flow has at least one threat recorded
python scripts/check_coverage.py --dfd dfd/upload.puml --threats threats/
# 2. No orphans: every open threat has a mitigation_ref; every control maps to a real threat
python scripts/check_threats.py --threats threats/ --controls controls/
# 3. Mitigations are tested: each implemented control has a passing test
grep -r "TM-UPLOAD" tests/ | sort # confirm the IDOR/RCE tests exist
pytest tests/security/ -q
Quality checks I run before signing off:
- Boundary coverage: every trust-boundary crossing produced at least one STRIDE threat. A clean crossing usually means you missed something, not that it is safe.
- Per-element completeness: for each element, every applicable STRIDE letter is either a recorded threat or an explicit “N/A because…”.
- Top risks have controls: every Critical/High threat is `mitigated`, `transferred`, or `accepted` with an owner and date.
- Attack-tree leaves are mitigated: for each high-value goal, at least one node on every path to it is broken by a control.
- Tests exist: every `implemented` control has an automated test that would fail if the control regressed.
Enterprise scenario
A fintech platform team I worked with ran a multi-tenant document service exactly like the example above, processing KYC documents for dozens of business customers. Their original design used a Lambda converter (LibreOffice headless) with a single execution role that had s3:GetObject on the whole bucket — convenient, and it passed code review because “it’s our own bucket.” The STRIDE-per-element pass flagged Elevation of Privilege on the converter process, and the attack tree made the chain undeniable: untrusted file in -> known LibreOffice/font-parsing RCE history -> converter role can read every tenant’s KYC documents. One uploaded malicious document equals full cross-tenant data exposure of regulated PII. That is a reportable breach and a license-threatening event.
The constraint: they could not drop LibreOffice (it was the only renderer that handled their customers’ .docx/.xlsx variety), and per-tenant Lambdas were operationally untenable at their tenant count.
The fix came straight out of the model. They (1) scoped the converter role using an `aws:PrincipalTag` policy variable so each invocation could only read and write the one tenant prefix for its job, (2) moved the converter to a per-job ephemeral execution context with no outbound network egress so even a successful RCE couldn’t exfiltrate or call home, and (3) added a decompressed-size and CPU-time guard to kill the DoS path. The session policy that scoped each invocation looked like this:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "JobScopedRead",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::kyc-docs/tenants/${aws:PrincipalTag/tenantId}/*"
  }]
}
The blast radius of the worst-case threat went from “all tenants” to “the one tenant whose document is already being processed,” a risk the business could actually accept. None of it required new tooling. It required drawing the boundary, walking STRIDE against the converter, and following one attack tree to its root. That is the whole return on the practice.