
Practical Threat Modeling: STRIDE, Data-Flow Diagrams, and Attack Trees for Real Systems

Most “threat modeling” I see in the wild is a one-off whiteboard session that produces a photo nobody looks at again. That is not threat modeling; it is theater. Done properly, threat modeling is the cheapest security control you own: it finds design flaws before a single line of code is written, when the cost of changing your mind is a sticky note rather than a re-architecture. The method below is the one I run with platform and application teams. It is deliberately mechanical so a mid-level engineer can drive it without a security PhD, and it produces artifacts you can diff, track, and re-run when the system changes.

The four questions that anchor the whole exercise (the Shostack frame) are: What are we building? What can go wrong? What are we going to do about it? Did we do a good enough job? Every step maps to one of those.

1. Why threat model, and when

The economic argument is simple. A SQL injection caught in a design review is a one-line note: “parameterize this query.” The same flaw caught in pen testing is a bug ticket, a code review, a redeploy, and a regression test. Caught in production it is an incident, a forensics bill, and possibly a breach notification. Threat modeling shifts the discovery left to the point of lowest remediation cost.

Trigger a threat model when:

You do not need to re-model on every commit. Model on architectural change. That distinction is what keeps the practice sustainable.

2. Scope the system and draw a data-flow diagram

You cannot threat-model what you cannot see. The foundational artifact is a data-flow diagram (DFD) — not a UML class diagram, not a network topology, but a picture of how data moves. DFDs use exactly four element types, and that constraint is a feature:

| Element | Shape (classic) | Meaning |
| --- | --- | --- |
| External entity | Rectangle | An actor outside your control: user, third-party API, another team’s service |
| Process | Circle | Code that transforms data: a service, a Lambda, a function |
| Data store | Two parallel lines | Where data rests: database, S3 bucket, queue, cache |
| Data flow | Arrow | Data in motion between the above, labeled with protocol/payload |

The fifth thing — and the single most important — is the trust boundary: a dashed line crossing the flows where the privilege or trust level changes. Threats cluster on trust boundaries. The internet-to-DMZ edge, the app-to-database edge, the tenant-A-to-tenant-B edge: that is where an attacker’s input meets your assumptions.

Keep DFDs in version control as text so they diff. I use a Structurizr-style or PlantUML description; here is a minimal DFD-as-code for a document-upload service:

@startuml
title DFD - Document Upload Service (Level 1)

' Trust boundaries are drawn as dashed containers so they enclose elements.
' Which element sits inside which boundary is an assumption here;
' adjust the grouping to match the real deployment.
rectangle "Internet boundary" as tb1 #line.dashed {
  actor "Browser (user)" as user
}

rectangle "CDN / WAF" as waf

rectangle "VPC / private boundary" as tb2 #line.dashed {
  component "Upload API\n(process)" as api
  database "Object Store\n(S3, data store)" as s3
  queue "Convert Queue\n(data store)" as q
  component "Converter\n(process)" as conv
  database "Metadata DB\n(data store)" as db
}

' Data flows, labeled with protocol/payload
user --> waf : HTTPS multipart upload
waf --> api : forwarded request (auth token)
api --> s3 : PUT object (SSE-KMS)
api --> db : INSERT metadata
api --> q : enqueue job {objectKey}
q --> conv : dequeue job
conv --> s3 : GET object / PUT rendered
conv --> db : UPDATE status
@enduml

Two rules that save you later:

  1. Decompose only until threats stop changing. A “Level 0” context diagram (one process, the external entities, the major stores) is enough to start. Drill into a “Level 1” only for the processes where new threats emerge.
  2. Every flow that crosses a trust boundary gets a label describing protocol, authentication, and data sensitivity. Unlabeled crossings are where teams hand-wave.
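Rule 2 is easy to check mechanically once the flows exist as data. The following is a minimal Python sketch, assuming the DFD has already been parsed into records; the `Flow` class, its field names, and the sample label values are hypothetical and not part of any tool mentioned here.

```python
from dataclasses import dataclass

# Labels that rule 2 requires on every boundary-crossing flow.
REQUIRED_LABELS = ("protocol", "authentication", "sensitivity")


@dataclass
class Flow:
    source: str
    target: str
    crosses_boundary: bool
    protocol: str = ""
    authentication: str = ""
    sensitivity: str = ""


def unlabeled_crossings(flows):
    """Return (flow, missing_labels) for boundary-crossing flows with gaps."""
    problems = []
    for flow in flows:
        if not flow.crosses_boundary:
            continue  # internal flows are not subject to rule 2
        missing = [name for name in REQUIRED_LABELS
                   if not getattr(flow, name).strip()]
        if missing:
            problems.append((flow, missing))
    return problems


# Hypothetical sample data loosely based on the upload-service DFD.
flows = [
    Flow("user", "waf", True, "HTTPS", "session cookie", "customer documents"),
    Flow("waf", "api", True, "HTTPS", "auth token"),  # sensitivity not labeled
    Flow("api", "db", False),
]

for flow, missing in unlabeled_crossings(flows):
    print(f"{flow.source} -> {flow.target}: missing {', '.join(missing)}")
```

Run against the sample data, this flags only the `waf -> api` flow, which is the hand-waved crossing the rule is meant to catch.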

3. Apply STRIDE per element to enumerate threats

STRIDE is a mnemonic for six threat categories, and its power is that it is exhaustive by construction for the properties we care about. Each letter is the negation of a security property:

| STRIDE | Threat | Property violated |
| --- | --- | --- |
| S | Spoofing | Authentication |
| T | Tampering | Integrity |
| R | Repudiation | Non-repudiation |
| I | Information disclosure | Confidentiality |
| D | Denial of service | Availability |
| E | Elevation of privilege | Authorization |

The technique I trust most is STRIDE-per-element. Walk every element in the DFD and ask which STRIDE categories apply to that element type. Microsoft’s mapping is the standard starting grid:

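| Element type | S | T | R | I | D | E |
| --- | --- | --- | --- | --- | --- | --- |
| External entity | x |   | x |   |   |   |
| Process | x | x | x | x | x | x |
| Data store |   | x | x* | x | x |   |
| Data flow |   | x |   | x | x |   |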

(*Repudiation against a data store applies when it is a log/audit store.)
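The grid can be turned into a prompt generator so that no cell is skipped during the walk. This is a rough Python sketch, assuming the commonly published STRIDE-per-element mapping (external entity: S, R; process: all six; data store: T, R, I, D; data flow: T, I, D). The element names and type keys are illustrative.

```python
STRIDE = {
    "S": "Spoofing",
    "T": "Tampering",
    "R": "Repudiation",
    "I": "Information disclosure",
    "D": "Denial of service",
    "E": "Elevation of privilege",
}

# Which letters apply to which DFD element type.
# "R" on a data store only matters when the store is a log/audit store.
APPLICABLE = {
    "external_entity": "SR",
    "process": "STRIDE",
    "data_store": "TRID",
    "data_flow": "TID",
}


def stride_prompts(elements):
    """Yield (element_name, letter, category) for every applicable cell."""
    for name, kind in elements:
        for letter in APPLICABLE[kind]:
            yield name, letter, STRIDE[letter]


# Illustrative subset of the upload-service DFD.
elements = [
    ("Browser (user)", "external_entity"),
    ("Upload API", "process"),
    ("Metadata DB", "data_store"),
    ("api -> s3", "data_flow"),
]

for name, letter, category in stride_prompts(elements):
    print(f"[{letter}] {name}: how could {category.lower()} happen here?")
```

Each printed line is a question to answer or to mark not applicable with a reason. The generator only guarantees that every cell is asked about; it does not produce the threats themselves.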

So for the Upload API process, you generate one threat per applicable letter:

Capture these in a structured, diffable format. I keep threats in YAML next to the code so they live in the same PR:

# threats/upload-api.yaml
- id: TM-UPLOAD-007
  element: "Upload API (process)"
  stride: I        # Information disclosure
  title: "IDOR allows cross-tenant metadata read"
  description: >
    objectKey is client-supplied and used directly in the metadata
    lookup. A user can request another tenant's key and read status,
    filename, and size.
  status: open      # open | mitigated | accepted | transferred
  mitigation_ref: CTRL-AUTHZ-OBJ-SCOPE

Do not skip categories because “that won’t happen here.” The discipline is to write the threat down and then explicitly mark it not applicable with a reason. An empty cell you reasoned about is very different from one you never considered.

4. Build attack trees to reason about adversary goals

STRIDE enumerates threats element by element; attack trees reason top-down from an adversary’s goal. They are complementary. Use an attack tree when a threat is serious enough that you need to understand the full set of paths to it — and therefore where a single control breaks multiple paths at once.

The root node is the attacker’s objective. Children are sub-goals joined by AND (all required) or OR (any sufficient) logic. Leaves are concrete attacker actions you can mitigate.

GOAL: Exfiltrate another tenant's documents
|
+-- OR 1. Read objects directly from the store
|     +-- AND
|         +-- Obtain object key  (OR: IDOR on API; guess UUID; leaked log)
|         +-- Bypass store authz  (OR: misconfigured bucket policy;
|                                      stolen IAM creds; SSRF to metadata svc)
|
+-- OR 2. Abuse the converter
|     +-- AND
|         +-- Upload malicious file that triggers RCE in converter
|         +-- Converter IAM role can read all tenants' objects
|
+-- OR 3. Compromise the metadata DB
      +-- OR
          +-- SQL injection on a query path
          +-- Stolen DB credential from environment / secrets store

Reading the tree pays off immediately. Notice that “Converter IAM role can read all tenants’ objects” is an AND-child on path 2 and effectively the enabling condition behind several others. Scoping that role to a single object prefix per job collapses an entire subtree. That is the kind of high-leverage control attack trees surface that per-element STRIDE alone can miss, because STRIDE looks at one element at a time while the tree shows you the chain.
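The AND/OR logic is simple enough to evaluate in code, which helps when asking "which paths stay open if we block this leaf?". The following is a small Python sketch that encodes the tree exactly as drawn above; the `Node` class and the helper names are illustrative and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)


def achievable(node, blocked):
    """True if the attacker can still reach this node given blocked leaves."""
    if node.gate == "LEAF":
        return node.label not in blocked
    results = [achievable(child, blocked) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


def open_paths(root, blocked):
    """Labels of the root's direct sub-goals that remain achievable."""
    return [c.label for c in root.children if achievable(c, blocked)]


ROLE_LEAF = "Converter IAM role can read all tenants' objects"

tree = Node("Exfiltrate another tenant's documents", "OR", [
    Node("1. Read objects directly from the store", "AND", [
        Node("Obtain object key", "OR", [
            Node("IDOR on API"), Node("Guess UUID"), Node("Leaked log"),
        ]),
        Node("Bypass store authz", "OR", [
            Node("Misconfigured bucket policy"),
            Node("Stolen IAM creds"),
            Node("SSRF to metadata svc"),
        ]),
    ]),
    Node("2. Abuse the converter", "AND", [
        Node("Upload malicious file that triggers RCE in converter"),
        Node(ROLE_LEAF),
    ]),
    Node("3. Compromise the metadata DB", "OR", [
        Node("SQL injection on a query path"),
        Node("Stolen DB credential from environment / secrets store"),
    ]),
])

print(open_paths(tree, blocked=set()))        # all three paths open
print(open_paths(tree, blocked={ROLE_LEAF}))  # path 2 is closed
```

Because path 2 is an AND node, blocking one child is enough to close it, whereas the OR nodes on paths 1 and 3 need every child blocked. This encoding covers only the tree as drawn, so it does not capture the over-broad role's indirect effect on other paths.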

You do not need a tool. ASCII trees in a markdown file are fine and they diff. If you want tooling, Microsoft’s free Threat Modeling Tool generates STRIDE threats from a DFD automatically, and OWASP Threat Dragon does the same in-browser or via a JSON model you can commit:

# OWASP Threat Dragon model is plain JSON - keep it in the repo
git add threat-models/upload-service.json
git commit -m "threat model: upload service v2 (adds converter RCE path)"

5. Rate and prioritize threats

A list of 60 threats with no priority is noise. You need a consistent score so the riskiest items rise. Two common approaches:

DREAD scores each threat 1-3 (or 1-10) across Damage, Reproducibility, Exploitability, Affected users, Discoverability, then averages. It is simple but notoriously subjective — Discoverability in particular tempts people into security-by-obscurity. If you use it, fix a rubric so “Damage = 3” means the same thing to everyone.

Risk = (Damage + Reproducibility + Exploitability + Affected + Discoverability) / 5

TM-UPLOAD-007 (cross-tenant IDOR):
  Damage 3, Reproducibility 3, Exploitability 3, Affected 3, Discoverability 2
  Risk = 14 / 5 = 2.8  -> High
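The same arithmetic can be scripted so that every threat is scored the same way. This is a minimal Python sketch, assuming the 1-3 scale; the function and key names are made up for illustration, and the mapping from score to a band such as High is left out because it depends on your rubric.

```python
DREAD_FACTORS = (
    "damage",
    "reproducibility",
    "exploitability",
    "affected_users",
    "discoverability",
)


def dread_score(ratings, scale_max=3):
    """Average the five DREAD ratings after validating them."""
    missing = [f for f in DREAD_FACTORS if f not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for factor in DREAD_FACTORS:
        if not 1 <= ratings[factor] <= scale_max:
            raise ValueError(f"{factor} must be between 1 and {scale_max}")
    return sum(ratings[f] for f in DREAD_FACTORS) / len(DREAD_FACTORS)


# The cross-tenant IDOR example from the text.
tm_upload_007 = {
    "damage": 3,
    "reproducibility": 3,
    "exploitability": 3,
    "affected_users": 3,
    "discoverability": 2,
}

print(dread_score(tm_upload_007))  # 14 / 5
```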

I increasingly prefer a likelihood x impact matrix tied to your existing risk register, because it speaks the language leadership already uses and avoids DREAD’s pseudo-precision:

| Likelihood \ Impact | Low | Medium | High |
| --- | --- | --- | --- |
| High | Medium | High | Critical |
| Medium | Low | Medium | High |
| Low | Low | Low | Medium |
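If the matrix is stored as data, ratings can be looked up rather than argued about. Here is a short Python sketch that encodes the table above, with likelihood as the first key and impact as the second; the function name is illustrative.

```python
# (likelihood, impact) -> rating, copied cell by cell from the matrix.
RISK_MATRIX = {
    ("high", "low"): "Medium",
    ("high", "medium"): "High",
    ("high", "high"): "Critical",
    ("medium", "low"): "Low",
    ("medium", "medium"): "Medium",
    ("medium", "high"): "High",
    ("low", "low"): "Low",
    ("low", "medium"): "Low",
    ("low", "high"): "Medium",
}


def rate(likelihood, impact):
    """Look up the risk rating for a likelihood/impact pair."""
    key = (likelihood.strip().lower(), impact.strip().lower())
    if key not in RISK_MATRIX:
        raise ValueError(f"unknown likelihood/impact pair: {key}")
    return RISK_MATRIX[key]


print(rate("High", "High"))
print(rate("Low", "Medium"))
```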

Whichever you pick, be consistent within a model and record the rationale. The score is an input to a decision, not the decision. For each rated threat you then choose one of four responses, and you must pick one explicitly:

“Accept” without a name and a date is just “ignore.” Make it auditable.

6. Select mitigations and map them to controls

This is where threat modeling earns its keep — every open threat must trace to a concrete, testable control, and ideally to a requirement or a ticket. Keep a control catalog and reference it by ID so the same control can cover many threats:

# controls/catalog.yaml
- id: CTRL-AUTHZ-OBJ-SCOPE
  title: "Per-tenant object authorization on every store access"
  detail: >
    Resolve the caller's tenant from the validated token, derive the
    allowed key prefix server-side, and reject any objectKey outside it.
    Never trust a client-supplied key for authorization.
  verifies: [TM-UPLOAD-007]
  test: "integration test attempts cross-tenant key, expects 403"
  owner: platform-security
  status: implemented
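In application code, the control described in `detail` might look roughly like the sketch below. It assumes the token has already been validated upstream, that the tenant lives in a claim called `tenant_id` (a hypothetical name), and that keys follow a `tenants/<tenant>/...` layout. The `Forbidden` exception is also illustrative and would map to an HTTP 403 in a real handler.

```python
class Forbidden(Exception):
    """Raised when a caller asks for an object outside its tenant prefix."""


def allowed_prefix(tenant_id):
    # Trailing slash matters: "tenants/acme/" must not match "tenants/acme-evil/".
    return f"tenants/{tenant_id}/"


def authorize_object_key(token_claims, object_key):
    """Return object_key if it is inside the caller's tenant prefix."""
    tenant_id = token_claims.get("tenant_id")
    if not tenant_id or "/" in tenant_id:
        raise Forbidden("token carries no usable tenant")
    # The prefix is derived server-side from the token, never from the request.
    prefix = allowed_prefix(tenant_id)
    if not object_key.startswith(prefix):
        raise Forbidden("object key is outside the caller's tenant prefix")
    if ".." in object_key.split("/"):
        raise Forbidden("object key contains a parent-directory segment")
    return object_key


claims = {"tenant_id": "acme"}  # assumed to come from a validated token
print(authorize_object_key(claims, "tenants/acme/doc-123.pdf"))
```

The cross-tenant request that the control's `test` field describes corresponds to calling `authorize_object_key(claims, "tenants/globex/doc-9.pdf")` and expecting `Forbidden`.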

For the upload service, the STRIDE pass and attack tree converge on a short, high-value control set:

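A minimal example of the scoped-role control in Terraform. The policy puts the aws:PrincipalTag policy variable in the resource ARN, so the role can only touch keys under the job’s tenant prefix (the doubled $$ stops Terraform from interpolating the variable itself):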

data "aws_iam_policy_document" "converter_scoped" {
  statement {
    sid       = "ReadWriteTenantPrefixOnly"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.docs.arn}/tenants/$${aws:PrincipalTag/tenant}/*"]
  }
}

The point is traceability: open the threat, follow mitigation_ref to the control, follow the control’s test to the assertion that proves it. No orphan threats, no orphan controls.

7. Embed threat modeling into design reviews and the SDLC

Threat modeling sticks only when it is a gate, not a heroic act. Wire it into the process at three points:

  1. Design-review gate. Every design doc / RFC for a system that crosses a trust boundary must link a threat model before approval. Make it a checklist item in your RFC template.
  2. PR gate for the model itself. The threat YAML and DFD live in the repo. A schema check in CI fails the build if a threat is open with no mitigation_ref, or a control references a nonexistent threat ID:
# .github/workflows/threat-model.yml
name: threat-model-lint
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate threat models
        run: |
          pip install yamllint
          yamllint threats/ controls/
          python scripts/check_threats.py --threats threats/ --controls controls/   # fails on open threats w/o mitigation_ref
  3. Right-sized cadence. A full STRIDE-per-element pass for a greenfield platform; a 30-minute “delta model” for an incremental change that only touches one or two flows. Calibrate effort to the size of the architectural change, or teams will route around the process.
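The contents of scripts/check_threats.py are not shown here, so the following is only a sketch of what the orphan check in the PR gate could do. It works on already-loaded records shaped like the YAML examples earlier (`id`, `status`, `mitigation_ref`, `verifies`). Loading the files, for example with a YAML library, is left out, and the IDs other than TM-UPLOAD-007 and CTRL-AUTHZ-OBJ-SCOPE are made up.

```python
def lint_threat_model(threats, controls):
    """Return a list of human-readable errors; an empty list means the lint passes."""
    errors = []
    threat_ids = {t["id"] for t in threats}
    control_ids = {c["id"] for c in controls}

    for threat in threats:
        ref = threat.get("mitigation_ref")
        if threat.get("status") == "open" and not ref:
            errors.append(f"{threat['id']}: open threat has no mitigation_ref")
        elif ref and ref not in control_ids:
            errors.append(f"{threat['id']}: mitigation_ref {ref} is not in the catalog")

    for control in controls:
        for threat_id in control.get("verifies", []):
            if threat_id not in threat_ids:
                errors.append(f"{control['id']}: verifies unknown threat {threat_id}")

    return errors


# Sample records; the second threat and the second "verifies" entry are
# deliberately broken to show what the lint reports.
threats = [
    {"id": "TM-UPLOAD-007", "status": "open", "mitigation_ref": "CTRL-AUTHZ-OBJ-SCOPE"},
    {"id": "TM-UPLOAD-099", "status": "open"},
]
controls = [
    {"id": "CTRL-AUTHZ-OBJ-SCOPE", "verifies": ["TM-UPLOAD-007", "TM-UPLOAD-404"]},
]

for error in lint_threat_model(threats, controls):
    print(error)
```

In CI, the script would exit non-zero whenever the returned list is non-empty, which is what makes the build fail.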

Verify

How do you know the model is good enough — Shostack’s fourth question? Verification is concrete, not vibes:

# 1. Coverage: every cross-boundary flow has at least one threat recorded
python scripts/check_coverage.py --dfd dfd/upload.puml --threats threats/

# 2. No orphans: every open threat has a mitigation_ref; every control maps to a real threat
python scripts/check_threats.py --threats threats/ --controls controls/

# 3. Mitigations are tested: each implemented control has a passing test
grep -r "TM-UPLOAD" tests/ | sort        # confirm the IDOR/RCE tests exist
pytest tests/security/ -q

Quality checks I run before signing off:

Enterprise scenario

A fintech platform team I worked with ran a multi-tenant document service exactly like the example above, processing KYC documents for dozens of business customers. Their original design used a Lambda converter (LibreOffice headless) with a single execution role that had s3:GetObject on the whole bucket — convenient, and it passed code review because “it’s our own bucket.” The STRIDE-per-element pass flagged Elevation of Privilege on the converter process, and the attack tree made the chain undeniable: untrusted file in -> known LibreOffice/font-parsing RCE history -> converter role can read every tenant’s KYC documents. One uploaded malicious document equals full cross-tenant data exposure of regulated PII. That is a reportable breach and a license-threatening event.

The constraint: they could not drop LibreOffice (it was the only renderer that handled their customers’ .docx/.xlsx variety), and per-tenant Lambdas were operationally untenable at their tenant count.

The fix came straight out of the model. They (1) scoped the converter role with the aws:PrincipalTag policy variable in the resource ARN so each invocation could only reach the one tenant prefix for its job, (2) moved the converter to a per-job ephemeral execution context with no outbound network egress so even a successful RCE couldn’t exfiltrate or call home, and (3) added a decompressed-size and CPU-time guard to kill the DoS path. The session policy that scoped each invocation looked like this:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "JobScopedRead",
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::kyc-docs/tenants/${aws:PrincipalTag/tenantId}/*"
  }]
}
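For item (3), the decompressed-size half of the guard can be sketched in a few lines, since .docx and .xlsx files are ZIP containers. This is an illustrative Python sketch, not the team's actual implementation; the function name, exception, and limits are assumptions. It reads the sizes declared in the archive headers, so a real guard would also cap bytes during extraction and enforce a CPU-time limit separately.

```python
import io
import zipfile


class ArchiveTooLarge(Exception):
    """Raised when an upload would expand beyond the allowed size."""


def check_decompressed_size(data, max_total_bytes):
    """Return the declared decompressed size, or raise if it exceeds the limit.

    Declared sizes come from the ZIP headers and can be forged, so treat
    this as a cheap pre-check rather than the only line of defense.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        total = sum(info.file_size for info in archive.infolist())
    if total > max_total_bytes:
        raise ArchiveTooLarge(
            f"declared size {total} exceeds limit {max_total_bytes}"
        )
    return total


# Build a small, highly compressible archive in memory as a stand-in upload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", b"0" * 1_000_000)
upload = buffer.getvalue()

print(len(upload), check_decompressed_size(upload, max_total_bytes=10_000_000))
```

The sample archive is far smaller on disk than its declared size, which is the property a decompression bomb exploits at much larger scale.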

The blast radius of the worst-case threat went from “all tenants” to “the one tenant whose document is already being processed”, a risk the business could actually accept. None of it required new tooling. It required drawing the boundary, walking STRIDE against the converter, and following one attack tree to its root. That is the whole return on the practice.

Checklist
