Conducting Investigations with Microsoft Purview eDiscovery (Premium): Holds, Collections, and Review Sets

eDiscovery is a forcing function for defensibility. Every step you take has to survive a hostile question from opposing counsel: did you preserve in time, did you scope the collection reasonably, can you prove you did not alter the evidence? Microsoft Purview eDiscovery (Premium) is the tool, but the tool will happily let you do indefensible things. This guide is the workflow I run for an end-to-end investigation, with the gotchas that bite platform teams.

One thing up front, because it invalidates most older blog posts: Microsoft retired all classic eDiscovery experiences on August 31, 2025 — classic Content Search, classic eDiscovery (Standard), and classic eDiscovery (Premium). The only supported surface now is the new unified eDiscovery in the Microsoft Purview portal (purview.microsoft.com). Standard and Premium are now tiers of one product, not separate apps. Export to your own Azure Storage account was also retired on that date. If a guide tells you to open compliance.microsoft.com or set a SAS token for export, it is stale.

1. eDiscovery tiers compared: Content Search, Standard, and Premium

Three capability tiers, one UI. Pick the lowest tier that satisfies the legal requirement; Premium costs more to run and carries heavier licensing.

| Capability | Search (was Content Search) | eDiscovery (Standard) | eDiscovery (Premium) |
|---|---|---|---|
| Search all M365 locations | Yes | Yes | Yes |
| Case container, role-scoped access | No | Yes | Yes |
| Legal hold on locations | No | Yes (case holds) | Yes (custodian + non-custodial holds) |
| Custodian management + hold notifications | No | No | Yes |
| Collections with statistics + draft preview | No | No | Yes |
| Review sets (static, taggable, redactable) | No | No | Yes |
| Analytics: near-dup, email threading, themes | No | No | Yes |
| Predictive coding (ML relevance) | No | No | Yes |
| Export with load files | Basic | Yes | Yes (review-set export) |

Rule of thumb: Search answers “does this content exist and roughly how much.” Standard adds preservation and a case boundary. Premium is for anything that will be reviewed by humans under legal scrutiny — it is the only tier with custodian communications, draft collections, review sets, and analytics. Premium requires E5 / E5 Compliance (or the eDiscovery & Audit add-on) for the custodians whose data you process.

Everything below is Premium. You can drive it from the portal, but the durable, auditable path is the Microsoft Graph eDiscovery API via the Microsoft.Graph.Security PowerShell module. The legacy Security & Compliance cmdlets (New-ComplianceCase, New-CaseHoldPolicy) still exist for the hold object model but the search/collection/review-set flow is Graph-only now. Connect once:

# Modern eDiscovery automation uses the Graph SDK, not Connect-IPPSSession
Install-Module Microsoft.Graph.Security -Scope CurrentUser
Install-Module Microsoft.Graph.Beta.Security -Scope CurrentUser  # legal hold lives in beta

Connect-MgGraph -Scopes "eDiscovery.ReadWrite.All"

2. Case creation, role groups, and the data source model

Create the case first — it is the security and audit boundary for everything else.

$case = New-MgSecurityCaseEdiscoveryCase -BodyParameter @{
    displayName = "Project Falcon - Trade Secret Inquiry"
    description = "Departed-employee IP exfiltration review, filed 2026-06-02"
    externalId  = "LEGAL-2026-0147"   # tie to your matter-management system
}
$case.Id

Role groups before people touch data. Access in eDiscovery (Premium) is role-based and scoped per case. Do not hand reviewers the global eDiscovery Administrator role — that grants access to every case in the tenant, which is exactly what an internal-investigations team cannot have. The clean model:

Set case-level permissions on the case itself (Settings -> Permissions inside the case, or via role-group membership). A Manager who is not a member of the case cannot open it even though they hold the role.

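The data source model is the part people get wrong. There are two kinds: custodial sources, which belong to a person you add to the case as a custodian (their mailbox and OneDrive), and non-custodial sources, such as a SharePoint site or group that is not tied to any one custodian.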

Add a custodian (this is what makes Premium “Premium”):

New-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -BodyParameter @{
    email = "dana.okafor@kloudvin.com"
    applyHoldToSources = $true   # preserve at the moment of association - critical
}

applyHoldToSources = $true preserves the custodian’s sources at association time. The most common defensibility failure I see is associating custodians for weeks while “scoping,” with no hold — content ages out of retention and is gone. Preserve first, scope second.
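
When a matter has more than one custodian, the same call can be looped so that nobody is associated without a hold. The following is a minimal sketch, assuming `$case` from the step above and an active Graph session; the `New-CustodianBody` helper and the second address are hypothetical and exist only for illustration.

```powershell
# Hypothetical helper: builds the request body for one custodian, always with
# the hold flag set so association and preservation happen in the same call.
function New-CustodianBody {
    param([Parameter(Mandatory)][string]$Email)
    @{
        email              = $Email.Trim().ToLowerInvariant()
        applyHoldToSources = $true
    }
}

# Illustrative list; the second address is an assumption, not part of the matter.
$custodianEmails = @(
    'dana.okafor@kloudvin.com',
    'second.custodian@kloudvin.com'
)

# De-duplicate first so a repeated address does not trigger a second request.
$bodies = @($custodianEmails | Sort-Object -Unique | ForEach-Object { New-CustodianBody -Email $_ })

# Only call Graph when a case object is present (needs Connect-MgGraph first).
if ($case) {
    foreach ($body in $bodies) {
        New-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -BodyParameter $body
    }
}
```

Keeping the hold flag inside the helper, rather than passing it per call, makes it hard to add an unheld custodian by accident.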

3. Placing and communicating legal holds, tracking acknowledgement

Two distinct preservation mechanisms, and you usually want both:

  1. Custodian hold — set by applyHoldToSources above; preserves everything tied to the custodian, broadly.
  2. Query-based legal hold — preserves only items matching a query across specified sources. Narrower, defensible when over-preservation is a cost or privacy concern.

A query-based hold (note: the legal hold object is in the beta Graph module today):

$hold = New-MgBetaSecurityCaseEdiscoveryCaseLegalHold -EdiscoveryCaseId $case.Id -BodyParameter @{
    displayName  = "Falcon - source code and pricing"
    description  = "Preserve mail/files mentioning Falcon, pricing models, or the repo"
    contentQuery = '(Falcon OR "pricing model" OR repo-falcon) AND (Date>=2025-12-01)'
}

Then bind sources to the hold (user sources for mailbox/OneDrive, site sources for SharePoint):

New-MgBetaSecurityCaseEdiscoveryCaseLegalHoldUserSource `
    -EdiscoveryCaseId $case.Id -EdiscoveryHoldPolicyId $hold.Id `
    -BodyParameter @{ "@odata.type" = "#microsoft.graph.security.userSource"; email = "dana.okafor@kloudvin.com" }

New-MgBetaSecurityCaseEdiscoveryCaseLegalHoldSiteSource `
    -EdiscoveryCaseId $case.Id -EdiscoveryHoldPolicyId $hold.Id `
    -BodyParameter @{ "@odata.type" = "#microsoft.graph.security.siteSource";
                      site = @{ webUrl = "https://kloudvin.sharepoint.com/sites/pricing" } }

A query-based hold is a preservation filter, not a deletion gate. Items that do not match the query are not held — but neither are they deleted by the hold. Do not assume a query-based hold quietly culls anything; it only adds protection to the matching set.

Hold notifications and acknowledgement are a custodian-only feature and a legal requirement in most jurisdictions (you must put custodians on notice). In the portal, open the case -> Custodians -> select custodians -> Manage hold notifications. You compose a notice (a templated email), issue it, optionally send reminders and escalations, and the system tracks acknowledgement — who received it, who opened it, who clicked “I acknowledge.” That acknowledgement log is your evidence that custodians were instructed to preserve. Set a recurring reminder cadence (e.g., every 14 days) for non-responders; the gap between issued and acknowledged is the first thing opposing counsel will probe.
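
The gap between issued and acknowledged can also be watched from the API side. The sketch below assumes the custodian objects returned by `Get-MgSecurityCaseEdiscoveryCaseCustodian` expose `AcknowledgedDateTime` and `CreatedDateTime`, and it uses the custodian's creation time as a stand-in for when the notice was issued. That substitution is an assumption; if the notice went out later, the day counts will be overstated. The function name and the 14-day default are illustrative.

```powershell
# Hypothetical helper: lists custodians with no recorded acknowledgement
# once a grace period (default 14 days) has passed since they were added.
function Get-UnacknowledgedCustodian {
    param(
        [Parameter(Mandatory)][object[]]$Custodian,
        [int]$GraceDays = 14,
        [datetime]$AsOf = [datetime]::UtcNow
    )
    $Custodian |
        Where-Object {
            -not $_.AcknowledgedDateTime -and
            $_.CreatedDateTime -and
            ($AsOf - [datetime]$_.CreatedDateTime).TotalDays -ge $GraceDays
        } |
        Select-Object Email, CreatedDateTime,
            @{ Name = 'DaysOutstanding'
               Expression = { [int][math]::Floor(($AsOf - [datetime]$_.CreatedDateTime).TotalDays) } }
}

# Usage against a live case (requires Connect-MgGraph and $case):
if ($case) {
    $all = Get-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -All
    Get-UnacknowledgedCustodian -Custodian $all | Sort-Object DaysOutstanding -Descending
}
```

The output is a chase list for reminders; the portal's acknowledgement log remains the record of evidence.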

4. Building search queries, statistics, and the draft collection preview

This is where Premium’s “draft-then-commit” discipline matters. A collection in Premium is not a search that dumps results into a bucket — it is a two-phase operation: first an estimate (a non-destructive preview), then a commit (ingestion into a review set). You iterate on the estimate until the scope is right, then commit once.

Create the collection estimate as a search over the case’s custodial sources (DataSourceScopes) plus any extra sources:

$search = New-MgSecurityCaseEdiscoveryCaseSearch -EdiscoveryCaseId $case.Id -BodyParameter @{
    displayName      = "Falcon collection v1"
    description      = "Estimate before commit"
    contentQuery     = '(Falcon OR "pricing model") AND (Sent>=2025-12-01 AND Sent<=2026-06-02)'
    dataSourceScopes = "allCaseCustodians"   # all custodians added to the case
}

Add a non-custodial source (a site, group, or extra user) to the same search:

New-MgSecurityCaseEdiscoveryCaseSearchAdditionalSource `
    -EdiscoveryCaseId $case.Id -EdiscoverySearchId $search.Id `
    -BodyParameter @{ "@odata.type" = "#microsoft.graph.security.siteSource";
                      site = @{ webUrl = "https://kloudvin.sharepoint.com/sites/pricing" } }

The query language is KQL (Keyword Query Language) — the same property:value syntax used across Microsoft Search. High-value operators for investigations:

participants:dana.okafor@kloudvin.com     # any party on the message (to/from/cc/bcc)
sent>=2025-12-01 AND sent<=2026-06-02     # date-bounded
kind:email OR kind:docs                    # restrict item kinds
attachmentnames:"falcon_pricing.xlsx"     # by attachment name
NOT subject:"out of office"               # exclude noise
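
Because the estimate gets rerun several times with small changes, it can help to assemble the query from parts instead of hand-editing one long string. This is a sketch only: `New-InvestigationQuery` is a hypothetical helper, and it covers just the keyword, date-range, and subject-exclusion patterns shown above.

```powershell
# Hypothetical helper: builds a KQL string from keywords, a sent-date window,
# and optional subject exclusions.
function New-InvestigationQuery {
    param(
        [Parameter(Mandatory)][string[]]$Keyword,
        [Parameter(Mandatory)][datetime]$From,
        [Parameter(Mandatory)][datetime]$To,
        [string[]]$ExcludeSubject = @()
    )
    # Quote multi-word keywords so they are matched as exact phrases.
    $terms = $Keyword | ForEach-Object {
        if ($_ -match '\s') { '"{0}"' -f $_ } else { $_ }
    }
    $query = '({0}) AND (sent>={1:yyyy-MM-dd} AND sent<={2:yyyy-MM-dd})' -f ($terms -join ' OR '), $From, $To
    foreach ($subject in $ExcludeSubject) {
        $query += ' AND NOT subject:"{0}"' -f $subject
    }
    $query
}

New-InvestigationQuery -Keyword 'Falcon', 'pricing model' `
    -From '2025-12-01' -To '2026-06-02' -ExcludeSubject 'out of office'
# (Falcon OR "pricing model") AND (sent>=2025-12-01 AND sent<=2026-06-02) AND NOT subject:"out of office"
```

Storing the parameters for each version (v1, v2, v3) alongside the case notes gives you a readable history of how the scope was narrowed.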

Run the estimate, then read the statistics — do not skip this. The estimate returns: total data volume, the content locations that contained hits, and hit counts per query condition. Statistics are how you tell a $50/document review from a $5,000 one. If the estimate says 40 GB across 600 locations, your query is too broad — tighten the date range or add NOT clauses and rerun. The estimate is cheap; the review is not. Iterate the search and rerun until the numbers are defensible and proportional.
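
The "is this too broad?" check can be made explicit. The sketch below calls the Graph REST endpoints for `estimateStatistics` and `lastEstimateStatisticsOperation` through `Invoke-MgGraphRequest`, then compares the result with thresholds. It rests on assumptions: the property names (`indexedItemsSize`, `unindexedItemsSize`, `mailboxCount`, `siteCount`) follow my reading of the estimate operation resource, sizes are treated as bytes, and the 10 GB / 100-location limits are placeholders to replace with whatever your matter treats as proportional.

```powershell
# Hypothetical helper: summarizes an estimate and flags whether it fits
# within caller-supplied size and location limits.
function Test-EstimateProportional {
    param(
        [Parameter(Mandatory)]$Estimate,
        [double]$MaxGB = 10,          # placeholder threshold
        [int]$MaxLocations = 100      # placeholder threshold
    )
    $bytes     = [double]$Estimate.indexedItemsSize + [double]$Estimate.unindexedItemsSize
    $sizeGB    = [math]::Round($bytes / 1GB, 2)
    $locations = [int]$Estimate.mailboxCount + [int]$Estimate.siteCount
    [pscustomobject]@{
        SizeGB       = $sizeGB
        Locations    = $locations
        Proportional = ($sizeGB -le $MaxGB) -and ($locations -le $MaxLocations)
    }
}

# Live usage (requires Connect-MgGraph, $case and $search from earlier steps).
if ($case -and $search) {
    $base = "https://graph.microsoft.com/v1.0/security/cases/ediscoveryCases/$($case.Id)/searches/$($search.Id)"
    Invoke-MgGraphRequest -Method POST -Uri "$base/estimateStatistics"
    # The estimate runs asynchronously; wait for the job to finish before reading it.
    $estimate = Invoke-MgGraphRequest -Method GET -Uri "$base/lastEstimateStatisticsOperation"
    Test-EstimateProportional -Estimate $estimate
}
```

With the 40 GB across 600 locations example from the text, `Proportional` comes back `False`, which is the signal to tighten the query and rerun.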

In the portal you also get a sample preview of actual hit documents on the estimate — eyeball ten of them. If your top hits are all calendar invites and OOF replies, your keywords are wrong, and you just learned that for free instead of after committing 40 GB.

5. Committing collections to a review set, with cloud attachments and versions

When the estimate is right, commit it to a review set. Commit is the irreversible-ish step: it ingests content, reprocesses it (text extraction, OCR of images, deep/partial-index handling), and makes it static. A committed collection cannot be edited or rerun — only copied or deleted — which is exactly the immutability you want for a defensible record.

At commit time you choose what related content to pull in. These options are the difference between a complete production and a sanctionable gap:

During commit, child items (inline images, signatures) are extracted, OCR’d, and their text folded into the parent rather than added as separate review items — keeping the review set from drowning in immaterial logos.

After commit, the collection page retains a permanent summary: the query used, the locations collected from, deep-index statistics, and item counts. That summary is chain-of-custody gold — it is the contemporaneous record of exactly what you collected and how.
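
For a scripted commit, the shape is roughly as follows. This is a sketch under assumptions: it creates a review set with `New-MgSecurityCaseEdiscoveryCaseReviewSet`, then posts to the `addToReviewSet` action through `Invoke-MgGraphRequest`. The `additionalDataOptions` values `linkedFiles` (cloud attachments) and `allVersions` (document versions) are taken from my reading of the Graph reference, so verify them against the current documentation before relying on them. `New-CommitBody` and the review set name are hypothetical.

```powershell
# Hypothetical helper: builds the addToReviewSet request body.
# Option names are assumptions to check against the Graph reference.
function New-CommitBody {
    param(
        [Parameter(Mandatory)][string]$SearchId,
        [switch]$CloudAttachments,
        [switch]$AllVersions
    )
    $options = @()
    if ($CloudAttachments) { $options += 'linkedFiles' }
    if ($AllVersions)      { $options += 'allVersions' }

    $body = @{ search = @{ id = $SearchId } }
    if ($options.Count -gt 0) { $body.additionalDataOptions = $options -join ',' }
    $body
}

# Live usage (requires Connect-MgGraph, $case and $search from earlier steps).
if ($case -and $search) {
    $reviewSet = New-MgSecurityCaseEdiscoveryCaseReviewSet -EdiscoveryCaseId $case.Id `
        -BodyParameter @{ displayName = 'Falcon review set' }

    $uri  = "https://graph.microsoft.com/v1.0/security/cases/ediscoveryCases/$($case.Id)/reviewSets/$($reviewSet.Id)/addToReviewSet"
    $json = New-CommitBody -SearchId $search.Id -CloudAttachments -AllVersions | ConvertTo-Json -Depth 5
    Invoke-MgGraphRequest -Method POST -Uri $uri -Body $json -ContentType 'application/json'
}
```

Because a committed collection cannot be edited, it is worth logging the exact body you posted next to the case notes.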

6. Review-set analytics: near-duplicate detection, email threading, themes

A raw review set is unreviewable at scale — duplicates, near-duplicates, and 12-deep reply chains where every email repeats the one below it. Analytics culls that without losing information. First configure analytics settings for the case (Settings -> Search & analytics), then in the review set: Analytics -> Run document & email analytics. Watch progress on the case Jobs tab.

Three engines run: near-duplicate detection, email threading, and themes.

The payoff is an auto-generated filter named [AutoGen] For Review. Select it from Saved filter queries and the review set collapses to representative, unique, inclusive items only. Its actual query is worth understanding:

(((FileClass="Email") AND (IsInclusive="True") AND (MarkAsRepresentative="Unique"))
 OR ((FileClass="Attachment") AND (MarkAsRepresentative="Unique"))
 OR ((FileClass="Document") AND (MarkAsRepresentative="Unique"))
 OR ((FileClass="Conversation") AND (MarkAsRepresentative="Unique"))
 OR ((FileClass="Attachment" OR FileClass="Conversation" OR FileClass="Document" OR FileClass="Email")
     AND NOT (Exists:MarkAsRepresentative)))

That last clause is the important safety net: it includes items analytics could not score (no MarkAsRepresentative value). Those are not silently dropped — a human still has to look at them. Reducing a 200,000-item set to ~30,000 “For Review” items is a routine, defensible 80-85% cull.

7. Tagging, redaction, and predictive coding to cull the review set

Now the human review, structured so it is auditable.

Tagging. Build a tag panel (Manage tags) before reviewers start — a mutually-exclusive group for the responsiveness call (Responsive / Not Responsive / Needs Further Review) and independent toggles for issues (Privileged, Confidential, Hot Doc). Mutually-exclusive groups stop a document being both Responsive and Not Responsive. Reviewers tag from the document viewer; tags are queryable, so a lead can pull “everything marked Privileged” instantly and tags travel into the export load file.
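
The tag panel can be created ahead of time from a script so every case starts with the same structure. The sketch below uses `New-MgSecurityCaseEdiscoveryCaseTag`; `childSelectability = 'One'` is what makes a group mutually exclusive. Attaching children by passing `parent = @{ id = ... }` is an assumption about the request shape, so test it against a scratch case first. `New-TagBody` is a hypothetical helper.

```powershell
# Hypothetical helper: builds a tag request body. The 'parent' shape is an
# assumption and should be verified against the Graph reference.
function New-TagBody {
    param(
        [Parameter(Mandatory)][string]$DisplayName,
        [ValidateSet('One', 'Many')][string]$ChildSelectability = 'Many',
        [string]$ParentId
    )
    $body = @{
        displayName        = $DisplayName
        childSelectability = $ChildSelectability
    }
    if ($ParentId) { $body.parent = @{ id = $ParentId } }
    $body
}

# Live usage (requires Connect-MgGraph and $case).
if ($case) {
    # Group tag: only one child may be selected per document.
    $group = New-MgSecurityCaseEdiscoveryCaseTag -EdiscoveryCaseId $case.Id `
        -BodyParameter (New-TagBody -DisplayName 'Responsiveness' -ChildSelectability One)

    foreach ($name in 'Responsive', 'Not Responsive', 'Needs Further Review') {
        New-MgSecurityCaseEdiscoveryCaseTag -EdiscoveryCaseId $case.Id `
            -BodyParameter (New-TagBody -DisplayName $name -ParentId $group.Id)
    }
}
```

Issue tags such as Privileged or Hot Doc would be created the same way, without a mutually exclusive parent.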

Redaction. Open a document in the viewer and apply redactions; eDiscovery generates a redacted PDF alongside the native. At export you choose to replace the redacted natives with the converted PDFs so privileged content never leaves in native form. This is how you produce a partially-privileged document without handing over the privileged passages.

Predictive coding (ML-assisted relevance). For large sets, do not brute-force-review everything. Create a predictive coding model in the review set, hand-label a training set (a few hundred documents as relevant / not relevant), and the model scores every remaining document 0-100 for relevance. Sort by score and review high-confidence-relevant first; defensibly down-prioritize the long low-score tail. The model improves as you label more rounds. Predictive coding plus the For Review filter is what makes a million-document matter tractable on a human budget — and the labeling rounds are logged, which is your defensibility for why you stopped reviewing at a given score.

8. Export, load files, and chain of custody

Export produces a package for outside counsel or a third-party review platform (Relativity, etc.). In the review set: select items (or apply a filter) -> Action -> Export. Key options:

The package structure (condensed):

Falcon-export 1 of 2.zip
├── Export_load_file_1 of 2.csv     # metadata load file (one row/doc, columns of fields, native + text paths)
├── Warnings and errors 1 of 2.csv  # items that failed to export - READ THIS
├── NativeFiles/                    # native docs (or redacted PDFs if you chose replacement)
├── Extracted_text_files/           # per-document extracted text
└── Error_files/                    # ExtractionError / ProcessingError items
Summary.csv                          # Total / Actual / Errors / Skipped Processing / Export Containers

The load file (Export_load_file_*.csv) is the spine: a metadata CSV (the equivalent of a Relativity/Concordance-style DAT file) with one row per document, a column for every metadata field, and the relative paths to each native and text file so the review platform can rebuild the set. Note two operational facts: exports partition into ZIPs of at most 75 GB uncompressed each (an export over 75 GB yields multiple ZIPs that recombine on extraction), and you have 30 days to download after the job completes, though the job record is retained for the life of the case.

Chain of custody is not one feature; it is the sum of artifacts you have accumulated:

Remember Azure Storage export was retired on August 31, 2025. The supported paths are now direct download (Reports only / Loose files and PSTs / Condensed directory structure). Do not architect a pipeline around an export-to-blob handoff — it no longer exists.
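
Since the package is now downloaded directly, one inexpensive addition to the custody record is a hash manifest taken the moment the ZIPs land on disk. This is a local sketch, not a Purview feature: `New-ExportHashManifest` and the manifest file name are hypothetical, and it relies only on the built-in `Get-FileHash` cmdlet.

```powershell
# Hypothetical helper: records name, size, and SHA-256 for every ZIP in a folder
# and writes the result to a CSV manifest next to the packages.
function New-ExportHashManifest {
    param(
        [Parameter(Mandatory)][string]$ExportFolder,
        [string]$ManifestPath = (Join-Path $ExportFolder 'hash-manifest.csv')
    )
    Get-ChildItem -Path $ExportFolder -Filter '*.zip' -File |
        Sort-Object Name |
        ForEach-Object {
            $hash = Get-FileHash -Path $_.FullName -Algorithm SHA256
            [pscustomobject]@{
                FileName  = $_.Name
                SizeBytes = $_.Length
                SHA256    = $hash.Hash
                HashedUtc = (Get-Date).ToUniversalTime().ToString('o')
            }
        } |
        Export-Csv -Path $ManifestPath -NoTypeInformation

    # Return the manifest rows so the caller can log or compare them.
    Import-Csv -Path $ManifestPath
}

# Example: New-ExportHashManifest -ExportFolder 'D:\Matters\LEGAL-2026-0147\export'
```

Hand the manifest over with the package; the recipient can rerun `Get-FileHash` and confirm nothing changed in transit.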

Verify

Run these to confirm the investigation is sound before you hand anything over.

# 1. Case exists and is open
Get-MgSecurityCaseEdiscoveryCase -EdiscoveryCaseId $case.Id |
    Select-Object displayName, status, externalId

# 2. Every intended custodian is associated AND held
Get-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id |
    Select-Object email, status, holdStatus

# 3. The legal hold is applied (not just created)
Get-MgBetaSecurityCaseEdiscoveryCaseLegalHold -EdiscoveryCaseId $case.Id |
    Select-Object displayName, status, contentQuery
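
Check 2 is easier to trust when it fails loudly instead of relying on someone reading a table. The sketch below filters for custodians whose `HoldStatus` is anything other than `applied`; the function name is hypothetical, and treating every other status (including in-progress ones) as a gap is a deliberately conservative assumption.

```powershell
# Hypothetical helper: returns custodians whose hold is not fully applied.
function Get-CustodianHoldGap {
    param([Parameter(Mandatory)][AllowEmptyCollection()][object[]]$Custodian)
    $Custodian |
        Where-Object { $_.HoldStatus -ne 'applied' } |
        Select-Object Email, Status, HoldStatus
}

# Live usage (requires Connect-MgGraph and $case).
if ($case) {
    $gaps = @(Get-CustodianHoldGap -Custodian (
        Get-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -All))
    if ($gaps.Count -gt 0) {
        $gaps | Format-Table
        Write-Warning "$($gaps.Count) custodian(s) are not fully held."
    }
}
```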

Then in the portal:

Enterprise scenario

A platform/security team at a mid-size SaaS company got a litigation hold notice on a Friday for a departed senior engineer suspected of taking pricing models and source-code references to a competitor. The constraint was brutal on two axes: the custodian’s mailbox was under an aggressive 30-day delete retention (Legal had not yet been looped in when he left), and a large share of the responsive evidence was cloud attachments — Teams and Outlook messages linking to OneDrive/SharePoint files rather than embedding them. An earlier, panicked Content Search had returned ~38 GB across 400+ locations and nobody could afford to review that.

What went wrong on the first pass: they ran a collection and committed it without checking Cloud attachments, so the review set was full of messages reading “see the model attached” with a dead link and no file. They had also waited four days before placing any hold while “scoping,” and two days of the custodian’s deleted-items had already aged past the 30-day window.

The fix, redone properly:

  1. Preserve first. Add the custodian with applyHoldToSources = $true immediately, before any further scoping. (The lesson: never scope an at-risk custodian unheld.)
  2. Tighten the estimate, do not commit blind. Iterate the search to a defensible scope using statistics, then commit with the right options:
# Add and HOLD the custodian in one shot - stop the bleeding
New-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -BodyParameter @{
    email = "former.eng@saas.example"
    applyHoldToSources = $true
}

# Scoped, date-bounded estimate over all case custodians + the pricing site
$s = New-MgSecurityCaseEdiscoveryCaseSearch -EdiscoveryCaseId $case.Id -BodyParameter @{
    displayName      = "Falcon scoped v3"
    contentQuery     = '(Falcon OR "pricing model" OR repo-falcon) AND (sent>=2025-12-01 AND sent<=2026-06-02)'
    dataSourceScopes = "allCaseCustodians"
}
  3. Commit with Cloud attachments, document versions, and partially indexed items checked, so the linked OneDrive files (and prior versions showing the edit trail) actually landed in the review set.
  4. Run analytics + [AutoGen] For Review, which collapsed the scoped set from ~210,000 items to ~24,000 inclusive/representative items, then predictive coding to rank the remainder so reviewers hit the high-relevance pricing documents first.

The outcome: the two lost days of mailbox content were genuinely unrecoverable and had to be disclosed (the cost of the four-day delay). But the cloud-attachment evidence — the actual pricing spreadsheets and their version history — was fully preserved and produced, and the review budget dropped by roughly an order of magnitude versus reviewing the raw collection. The two durable lessons that went into their runbook: hold the custodian the moment the case opens, and Cloud attachments is non-optional in any modern M365 collection — without it you are collecting envelopes, not letters.

Checklist
