eDiscovery is a forcing function for defensibility. Every step you take has to survive a hostile question from opposing counsel: did you preserve in time, did you scope the collection reasonably, can you prove you did not alter the evidence? Microsoft Purview eDiscovery (Premium) is the tool, but the tool will happily let you do indefensible things. This guide is the workflow I run for an end-to-end investigation, with the gotchas that bite platform teams.
One thing up front, because it invalidates most older blog posts: Microsoft retired all classic eDiscovery experiences on August 31, 2025 — classic Content Search, classic eDiscovery (Standard), and classic eDiscovery (Premium). The only supported surface now is the new unified eDiscovery in the Microsoft Purview portal (purview.microsoft.com). Standard and Premium are now tiers of one product, not separate apps. Export to your own Azure Storage account was also retired on that date. If a guide tells you to open compliance.microsoft.com or set a SAS token for export, it is stale.
1. eDiscovery tiers compared: Content Search, Standard, and Premium
Three capability tiers, one UI. Pick the lowest tier that satisfies the legal requirement; Premium costs more to run and carries heavier licensing.
| Capability | Search (was Content Search) | eDiscovery (Standard) | eDiscovery (Premium) |
|---|---|---|---|
| Search all M365 locations | Yes | Yes | Yes |
| Case container, role-scoped access | No | Yes | Yes |
| Legal hold on locations | No | Yes (case holds) | Yes (custodian + non-custodial holds) |
| Custodian management + hold notifications | No | No | Yes |
| Collections with statistics + draft preview | No | No | Yes |
| Review sets (static, taggable, redactable) | No | No | Yes |
| Analytics: near-dup, email threading, themes | No | No | Yes |
| Predictive coding (ML relevance) | No | No | Yes |
| Export with load files | Basic | Yes | Yes (review-set export) |
Rule of thumb: Search answers “does this content exist and roughly how much.” Standard adds preservation and a case boundary. Premium is for anything that will be reviewed by humans under legal scrutiny — it is the only tier with custodian communications, draft collections, review sets, and analytics. Premium requires E5 / E5 Compliance (or the eDiscovery & Audit add-on) for the custodians whose data you process.
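The table reads as a lookup: list the capabilities a matter needs and take the lowest tier that covers all of them. A minimal sketch of that rule follows. The capability labels and the function name Get-MinimumEdiscoveryTier are invented for illustration and are not part of any Microsoft module.

```powershell
# Hypothetical helper: map required capabilities to the lowest tier in the table above.
# Capability names are illustrative labels, not Microsoft identifiers.
$TierByCapability = @{
    Search           = 'Search'
    CaseContainer    = 'Standard'
    LegalHold        = 'Standard'
    CustodianNotices = 'Premium'
    DraftCollections = 'Premium'
    ReviewSets       = 'Premium'
    Analytics        = 'Premium'
    PredictiveCoding = 'Premium'
}
$TierRank = @{ Search = 0; Standard = 1; Premium = 2 }

function Get-MinimumEdiscoveryTier {
    param([Parameter(Mandatory)][string[]]$Capability)
    $best = 'Search'
    foreach ($c in $Capability) {
        if (-not $TierByCapability.ContainsKey($c)) { throw "Unknown capability: $c" }
        $tier = $TierByCapability[$c]
        # Keep the highest tier demanded by any single capability.
        if ($TierRank[$tier] -gt $TierRank[$best]) { $best = $tier }
    }
    return $best
}

Get-MinimumEdiscoveryTier -Capability Search, LegalHold       # Standard
Get-MinimumEdiscoveryTier -Capability LegalHold, ReviewSets   # Premium
```

The useful property is that one Premium-only requirement (say, hold notifications) pushes the whole matter to Premium, which is usually how the decision goes in practice.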
Everything below is Premium. You can drive it from the portal, but the durable, auditable path is the Microsoft Graph eDiscovery API via the Microsoft.Graph.Security PowerShell module. The legacy Security & Compliance cmdlets (New-ComplianceCase, New-CaseHoldPolicy) still exist for the hold object model but the search/collection/review-set flow is Graph-only now. Connect once:
# Modern eDiscovery automation uses the Graph SDK, not Connect-IPPSSession
Install-Module Microsoft.Graph.Security -Scope CurrentUser
Install-Module Microsoft.Graph.Beta.Security -Scope CurrentUser # legal hold lives in beta
Connect-MgGraph -Scopes "eDiscovery.ReadWrite.All"
2. Case creation, role groups, and the data source model
Create the case first — it is the security and audit boundary for everything else.
$case = New-MgSecurityCaseEdiscoveryCase -BodyParameter @{
displayName = "Project Falcon - Trade Secret Inquiry"
description = "Departed-employee IP exfiltration review, filed 2026-06-02"
externalId = "LEGAL-2026-0147" # tie to your matter-management system
}
$case.Id
Role groups before people touch data. Access in eDiscovery (Premium) is role-based and scoped per case. Do not hand reviewers the global eDiscovery Administrator role — that grants access to every case in the tenant, which is exactly what an internal-investigations team cannot have. The clean model:
- eDiscovery Manager — creates and manages their own cases. Assign litigation owners here.
- eDiscovery Administrator — superset; can access all cases and reassign orphaned ones. Two people, max, both in Legal/IT-Security.
- Reviewer — can only see review sets they are explicitly added to; cannot run collections or place holds. This is your outside-counsel role.
Set case-level permissions on the case itself (Settings -> Permissions inside the case, or via role-group membership). A Manager who is not a member of the case cannot open it even though they hold the role.
The data source model is the part people get wrong. There are two kinds:
- Custodial data sources — bound to a person (a custodian). Adding a custodian auto-associates their mailbox, OneDrive, and (optionally) their Teams 1:1 chats. The point of a custodian is accountability and hold notification, not just data — you can prove you preserved that individual’s content and told them about it.
- Non-custodial data sources — content not owned by a single person: a SharePoint site, a shared/Group mailbox, a Teams channel. No notification, no acknowledgement, because there is no person to notify.
Add a custodian (this is what makes Premium “Premium”):
New-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -BodyParameter @{
email = "dana.okafor@kloudvin.com"
applyHoldToSources = $true # preserve at the moment of association - critical
}
applyHoldToSources = $true preserves the custodian’s sources at association time. The most common defensibility failure I see is associating custodians for weeks while “scoping,” with no hold — content ages out of retention and is gone. Preserve first, scope second.
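A quick way to catch the "associated but not held" failure is to filter custodians on their hold status. The sketch below assumes custodian objects expose Email and HoldStatus properties (the same ones selected in the Verify section) and that applied is the value meaning fully preserved. The function name and the second sample address are made up, and the sample objects stand in for live Graph output so the snippet runs offline.

```powershell
# Hypothetical check: flag custodians whose hold is not fully applied.
# In a live tenant, feed it the output of
#   Get-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id
function Get-UnheldCustodian {
    param([Parameter(Mandatory)][object[]]$Custodian)
    # Assumption: 'applied' is the only status that means "fully preserved".
    $Custodian | Where-Object { $_.HoldStatus -ne 'applied' }
}

# Stand-in data for illustration only.
$sample = @(
    [pscustomobject]@{ Email = 'dana.okafor@kloudvin.com';      HoldStatus = 'applied' }
    [pscustomobject]@{ Email = 'second.custodian@kloudvin.com'; HoldStatus = 'notApplied' }
)

$unheld = @(Get-UnheldCustodian -Custodian $sample)
$unheld | ForEach-Object {
    Write-Warning "No hold on $($_.Email) (status: $($_.HoldStatus))"
}
```

Run something like this daily while a case is being scoped; an empty result is the state you want before anyone starts iterating on queries.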
3. Placing and communicating legal holds, tracking acknowledgement
Two distinct preservation mechanisms, and you usually want both:
- Custodian hold: set by applyHoldToSources above; preserves everything tied to the custodian, broadly.
- Query-based legal hold: preserves only items matching a query across specified sources. Narrower, and defensible when over-preservation is a cost or privacy concern.
A query-based hold (note: the legal hold object is in the beta Graph module today):
$hold = New-MgBetaSecurityCaseEdiscoveryCaseLegalHold -EdiscoveryCaseId $case.Id -BodyParameter @{
displayName = "Falcon - source code and pricing"
description = "Preserve mail/files mentioning Falcon, pricing models, or the repo"
contentQuery = '(Falcon OR "pricing model" OR repo-falcon) AND (Date>=2025-12-01)'
}
Then bind sources to the hold (user sources for mailbox/OneDrive, site sources for SharePoint):
New-MgBetaSecurityCaseEdiscoveryCaseLegalHoldUserSource `
-EdiscoveryCaseId $case.Id -EdiscoveryHoldPolicyId $hold.Id `
-BodyParameter @{ "@odata.type" = "#microsoft.graph.security.userSource"; email = "dana.okafor@kloudvin.com" }
New-MgBetaSecurityCaseEdiscoveryCaseLegalHoldSiteSource `
-EdiscoveryCaseId $case.Id -EdiscoveryHoldPolicyId $hold.Id `
-BodyParameter @{ "@odata.type" = "#microsoft.graph.security.siteSource";
site = @{ webUrl = "https://kloudvin.sharepoint.com/sites/pricing" } }
A query-based hold is a preservation filter, not a deletion gate. Items that do not match the query are not held — but neither are they deleted by the hold. Do not assume a query-based hold quietly culls anything; it only adds protection to the matching set.
Hold notifications and acknowledgement are a custodian-only feature and a legal requirement in most jurisdictions (you must put custodians on notice). In the portal, open the case -> Custodians -> select custodians -> Manage hold notifications. You compose a notice (a templated email), issue it, optionally send reminders and escalations, and the system tracks acknowledgement — who received it, who opened it, who clicked “I acknowledge.” That acknowledgement log is your evidence that custodians were instructed to preserve. Set a recurring reminder cadence (e.g., every 14 days) for non-responders; the gap between issued and acknowledged is the first thing opposing counsel will probe.
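The reminder cadence is simple date arithmetic, and it helps to see it written down. This is a sketch only: the portal tracks the real acknowledgement data, and the record shape here (Email, IssuedOn, AcknowledgedOn, LastReminderOn), the function name, and the example addresses are all invented for illustration.

```powershell
# Hypothetical acknowledgement tracker: who is due a reminder as of a given date?
function Get-ReminderDue {
    param(
        [Parameter(Mandatory)][object[]]$Notice,
        [datetime]$AsOf = (Get-Date),
        [int]$CadenceDays = 14
    )
    foreach ($n in $Notice) {
        if ($n.AcknowledgedOn) { continue }   # already acknowledged, nothing to do
        # Measure from the last reminder if one was sent, else from the issue date.
        $last = if ($n.LastReminderOn) { $n.LastReminderOn } else { $n.IssuedOn }
        $days = [int][math]::Floor(($AsOf - $last).TotalDays)
        if ($days -ge $CadenceDays) {
            [pscustomobject]@{ Email = $n.Email; DaysSinceLastNotice = $days }
        }
    }
}

# Illustrative records only.
$notices = @(
    [pscustomobject]@{ Email = 'a@example.com'; IssuedOn = [datetime]'2026-06-02'; AcknowledgedOn = [datetime]'2026-06-03'; LastReminderOn = $null }
    [pscustomobject]@{ Email = 'b@example.com'; IssuedOn = [datetime]'2026-06-02'; AcknowledgedOn = $null; LastReminderOn = $null }
    [pscustomobject]@{ Email = 'c@example.com'; IssuedOn = [datetime]'2026-06-02'; AcknowledgedOn = $null; LastReminderOn = [datetime]'2026-06-10' }
)

$due = @(Get-ReminderDue -Notice $notices -AsOf ([datetime]'2026-06-16'))
$due
```

With these records only b@example.com is due: a has acknowledged, and c was reminded six days before the as-of date.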
4. Building search queries, statistics, and the draft collection preview
This is where Premium’s “draft-then-commit” discipline matters. A collection in Premium is not a search that dumps results into a bucket — it is a two-phase operation: first an estimate (a non-destructive preview), then a commit (ingestion into a review set). You iterate on the estimate until the scope is right, then commit once.
Create the collection estimate as a search over the case’s custodial sources (dataSourceScopes) plus any extra sources:
$search = New-MgSecurityCaseEdiscoveryCaseSearch -EdiscoveryCaseId $case.Id -BodyParameter @{
displayName = "Falcon collection v1"
description = "Estimate before commit"
contentQuery = '(Falcon OR "pricing model") AND (Sent>=2025-12-01 AND Sent<=2026-06-02)'
dataSourceScopes = "allCaseCustodians" # all custodians added to the case
}
Add a non-custodial source (a site, group, or extra user) to the same search:
New-MgSecurityCaseEdiscoveryCaseSearchAdditionalSource `
-EdiscoveryCaseId $case.Id -EdiscoverySearchId $search.Id `
-BodyParameter @{ "@odata.type" = "#microsoft.graph.security.siteSource";
site = @{ webUrl = "https://kloudvin.sharepoint.com/sites/pricing" } }
The query language is KQL (Keyword Query Language) — the same property:value syntax used across Microsoft Search. High-value operators for investigations:
participants:dana.okafor@kloudvin.com # any party on the message (to/from/cc/bcc)
sent>=2025-12-01 AND sent<=2026-06-02 # date-bounded
kind:email OR kind:docs # restrict item kinds
attachmentnames:"falcon_pricing.xlsx" # by attachment name
NOT subject:"out of office" # exclude noise
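Because you will rerun the estimate several times, it is worth generating the query string rather than hand-editing it. The helper below is a sketch under my own naming (New-KqlQuery is hypothetical); it only composes the operators listed above and does not validate the result against the service.

```powershell
# Hypothetical helper: assemble a KQL string from keywords, a sent-date window,
# and optional subject exclusions.
function New-KqlQuery {
    param(
        [Parameter(Mandatory)][string[]]$Keyword,
        [Parameter(Mandatory)][datetime]$From,
        [Parameter(Mandatory)][datetime]$To,
        [string[]]$ExcludeSubject = @()
    )
    $inv = [cultureinfo]::InvariantCulture
    # Quote multi-word terms so they are treated as phrases.
    $terms = foreach ($k in $Keyword) {
        if ($k -match '\s') { '"{0}"' -f $k } else { $k }
    }
    $fromText = $From.ToString('yyyy-MM-dd', $inv)
    $toText   = $To.ToString('yyyy-MM-dd', $inv)
    $window   = 'sent>={0} AND sent<={1}' -f $fromText, $toText
    $query    = '({0}) AND ({1})' -f ($terms -join ' OR '), $window
    foreach ($s in $ExcludeSubject) {
        $query += ' NOT subject:"{0}"' -f $s
    }
    return $query
}

$kql = New-KqlQuery -Keyword 'Falcon', 'pricing model' `
    -From '2025-12-01' -To '2026-06-02' -ExcludeSubject 'out of office'
$kql
```

Keeping the inputs (keywords, window, exclusions) as data also gives you a clean record of how each estimate version differed from the last.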
Run the estimate, then read the statistics — do not skip this. The estimate returns: total data volume, the content locations that contained hits, and hit counts per query condition. Statistics are how you tell a $50/document review from a $5,000 one. If the estimate says 40 GB across 600 locations, your query is too broad — tighten the date range or add NOT clauses and rerun. The estimate is cheap; the review is not. Iterate the search and rerun until the numbers are defensible and proportional.
In the portal you also get a sample preview of actual hit documents on the estimate — eyeball ten of them. If your top hits are all calendar invites and OOF replies, your keywords are wrong, and you just learned that for free instead of after committing 40 GB.
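Per-condition hit counts tell you which term is inflating the estimate. The sketch below ranks conditions by hits and flags any that dominate. The field names, the 50% threshold, and every number are invented for illustration; note also that conditions can overlap on the same item, so the shares are a rough guide rather than an exact partition.

```powershell
# Hypothetical triage of estimate statistics: which query condition is the noisy one?
function Get-NoisyCondition {
    param(
        [Parameter(Mandatory)][object[]]$Statistic,
        [double]$ShareThreshold = 0.5   # assumption: over half of all hits counts as noisy
    )
    $total = ($Statistic | Measure-Object -Property Hits -Sum).Sum
    if ($total -le 0) { return }
    $Statistic |
        Sort-Object -Property Hits -Descending |
        ForEach-Object {
            $share = $_.Hits / $total
            [pscustomobject]@{
                Condition = $_.Condition
                Hits      = $_.Hits
                SharePct  = [math]::Round(100 * $share, 1)
                Noisy     = ($share -ge $ShareThreshold)
            }
        }
}

# Illustrative numbers only.
$stats = @(
    [pscustomobject]@{ Condition = 'Falcon';          Hits = 1200 }
    [pscustomobject]@{ Condition = '"pricing model"'; Hits = 800 }
    [pscustomobject]@{ Condition = 'repo-falcon';     Hits = 38000 }
)

$ranked = @(Get-NoisyCondition -Statistic $stats)
$ranked | Format-Table -AutoSize
```

Here repo-falcon carries 95% of the hits, which is the cue to narrow or drop that term before rerunning the estimate.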
5. Committing collections to a review set, with cloud attachments and versions
When the estimate is right, commit it to a review set. Commit is the irreversible-ish step: it ingests content, reprocesses it (text extraction, OCR of images, deep/partial-index handling), and makes it static. A committed collection cannot be edited or rerun — only copied or deleted — which is exactly the immutability you want for a defensible record.
At commit time you choose what related content to pull in. These options are the difference between a complete production and a sanctionable gap:
- Cloud attachments: modern attachments are links to files in OneDrive/SharePoint, not embedded bytes. If you do not check this, you collect an email that says “see attached” with a URL and none of the actual document. The collection resolves the link and pulls the file. Which version you get depends on configuration: the version as shared at send time is available only if it was retained (for example by a retention label applied to cloud attachments); otherwise expect the current version. Almost always required.
- Document versions: SharePoint/OneDrive keep version history. Check this to collect all versions, not just the current one. Relevant when the edit history is the evidence.
- Conversation transcripts / contextual Teams messages: for a responsive Teams message, also pull the surrounding conversation so a reviewer sees context, not a decontextualized line.
- Partially indexed items: files Microsoft 365 could not fully index (corrupt, unsupported, encrypted, oversized). They are invisible to the keyword query, so if you exclude them you may miss responsive content hiding in an unindexable file. Defensible practice for a serious matter: collect them and let analytics/review sort them out.
During commit, child items (inline images, signatures) are extracted, OCR’d, and their text folded into the parent rather than added as separate review items — keeping the review set from drowning in immaterial logos.
After commit, the collection page retains a permanent summary: the query used, the locations collected from, deep-index statistics, and item counts. That summary is chain-of-custody gold — it is the contemporaneous record of exactly what you collected and how.
6. Review-set analytics: near-duplicate detection, email threading, themes
A raw review set is unreviewable at scale — duplicates, near-duplicates, and 12-deep reply chains where every email repeats the one below it. Analytics culls that without losing information. First configure analytics settings for the case (Settings -> Search & analytics), then in the review set: Analytics -> Run document & email analytics. Watch progress on the case Jobs tab.
Three engines run:
- Near-duplicate detection groups documents that are textually similar. One document per group is the pivot; the rest are scored by similarity. You review the pivot and spot-check the rest, instead of reading 40 versions of the same contract.
- Email threading analyzes reply chains and marks the inclusive emails. An email is IsInclusive=True when it contains all unique content from the thread up to that point, i.e., the most complete message in the branch. Review the inclusive emails and you have read the entire conversation without opening every reply individually. Branches (forks, differing cc lists) get their own inclusive messages.
- Themes clusters documents by topic using ML, so you can navigate the set conceptually and assign reviewers by subject matter.
The payoff is an auto-generated filter named [AutoGen] For Review. Select it from Saved filter queries and the review set collapses to representative, unique, inclusive items only. Its actual query is worth understanding:
(((FileClass="Email") AND (IsInclusive="True") AND (MarkAsRepresentative="Unique"))
OR ((FileClass="Attachment") AND (MarkAsRepresentative="Unique"))
OR ((FileClass="Document") AND (MarkAsRepresentative="Unique"))
OR ((FileClass="Conversation") AND (MarkAsRepresentative="Unique"))
OR ((FileClass="Attachment" OR FileClass="Conversation" OR FileClass="Document" OR FileClass="Email")
AND NOT (Exists:MarkAsRepresentative)))
That last clause is the important safety net: it includes items analytics could not score (no MarkAsRepresentative value). Those are not silently dropped — a human still has to look at them. Reducing a 200,000-item set to ~30,000 “For Review” items is a routine, defensible 80-85% cull.
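To make the filter logic concrete, here is the same predicate expressed over plain objects. This is a sketch, not how the service evaluates it: the function name is mine, the sample items are invented, and NotUnique is just a placeholder for any MarkAsRepresentative value other than Unique.

```powershell
# Hypothetical re-implementation of the [AutoGen] For Review predicate.
function Test-ForReview {
    param([Parameter(Mandatory)]$Item)
    $known = 'Email', 'Attachment', 'Document', 'Conversation'
    if ($known -notcontains $Item.FileClass) { return $false }
    # Safety net: analytics produced no MarkAsRepresentative value, so keep the item.
    if ([string]::IsNullOrEmpty($Item.MarkAsRepresentative)) { return $true }
    if ($Item.MarkAsRepresentative -ne 'Unique') { return $false }
    # Emails must additionally be inclusive.
    if ($Item.FileClass -eq 'Email') { return ($Item.IsInclusive -eq 'True') }
    return $true
}

# Illustrative items only.
$items = @(
    [pscustomobject]@{ Id = 1; FileClass = 'Email';      IsInclusive = 'True';  MarkAsRepresentative = 'Unique' }
    [pscustomobject]@{ Id = 2; FileClass = 'Email';      IsInclusive = 'False'; MarkAsRepresentative = 'Unique' }
    [pscustomobject]@{ Id = 3; FileClass = 'Document';   IsInclusive = $null;   MarkAsRepresentative = 'Unique' }
    [pscustomobject]@{ Id = 4; FileClass = 'Attachment'; IsInclusive = $null;   MarkAsRepresentative = 'NotUnique' }
    [pscustomobject]@{ Id = 5; FileClass = 'Document';   IsInclusive = $null;   MarkAsRepresentative = $null }
)

$kept    = @($items | Where-Object { Test-ForReview -Item $_ })
$cullPct = 100 * ($items.Count - $kept.Count) / $items.Count
"Kept $($kept.Count) of $($items.Count) items ($cullPct% culled)"
```

Item 5 is the one to notice: it has no analytics score at all and still lands in the review queue, which is exactly the behavior the last clause guarantees.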
7. Tagging, redaction, and predictive coding to cull the review set
Now the human review, structured so it is auditable.
Tagging. Build a tag panel (Manage tags) before reviewers start — a mutually-exclusive group for the responsiveness call (Responsive / Not Responsive / Needs Further Review) and independent toggles for issues (Privileged, Confidential, Hot Doc). Mutually-exclusive groups stop a document being both Responsive and Not Responsive. Reviewers tag from the document viewer; tags are queryable, so a lead can pull “everything marked Privileged” instantly and tags travel into the export load file.
Redaction. Open a document in the viewer and apply redactions; eDiscovery generates a redacted PDF alongside the native. At export you choose to replace the redacted natives with the converted PDFs so privileged content never leaves in native form. This is how you produce a partially-privileged document without handing over the privileged passages.
Predictive coding (ML-assisted relevance). For large sets, do not brute-force-review everything. Create a predictive coding model in the review set, hand-label a training set (a few hundred documents as relevant / not relevant), and the model scores every remaining document 0-100 for relevance. Sort by score and review high-confidence-relevant first; defensibly down-prioritize the long low-score tail. The model improves as you label more rounds. Predictive coding plus the For Review filter is what makes a million-document matter tractable on a human budget — and the labeling rounds are logged, which is your defensibility for why you stopped reviewing at a given score.
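The prioritization step itself is just a sort and a cutoff. The sketch below assumes each document carries a Score on the 0-100 scale described above; the function name, the cutoff of 50, and the sample scores are illustrative, and the defensible cutoff for a real matter is a legal judgment, not a default.

```powershell
# Hypothetical split of a scored review set into "review first" and "deprioritized".
function Split-ByRelevance {
    param(
        [Parameter(Mandatory)][object[]]$Document,
        [int]$Cutoff = 50   # assumption: illustrative threshold only
    )
    $ranked = @($Document | Sort-Object -Property Score -Descending)
    [pscustomobject]@{
        ReviewFirst   = @($ranked | Where-Object { $_.Score -ge $Cutoff })
        Deprioritized = @($ranked | Where-Object { $_.Score -lt $Cutoff })
    }
}

# Illustrative scores only.
$docs = @(
    [pscustomobject]@{ Id = 'D1'; Score = 12 }
    [pscustomobject]@{ Id = 'D2'; Score = 91 }
    [pscustomobject]@{ Id = 'D3'; Score = 64 }
    [pscustomobject]@{ Id = 'D4'; Score = 37 }
)

$queue = Split-ByRelevance -Document $docs
$queue.ReviewFirst.Id      # D2, then D3
$queue.Deprioritized.Id    # D4, then D1
```

Record the cutoff and the date you applied it alongside the labeling rounds; that pair is what you will be asked to justify.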
8. Export, load files, and chain of custody
Export produces a package for outside counsel or a third-party review platform (Relativity, etc.). In the review set: select items (or apply a filter) -> Action -> Export. Key options:
- Export these documents: Selected documents only, All filtered documents, or All documents in the review set.
- Expand selection: Include associated family items (pull the parent email of a responsive attachment, and siblings, via shared FamilyId) and Include associated conversation items (the full Teams conversation via shared ConversationId). Almost always include family items; producing an orphaned attachment without its transmittal email is a defensibility problem.
- Output options: Reports only (summary + load file, no content), Loose files and PSTs (email goes into PSTs, files in native directory structure), or Condensed directory structure (everything flat with a load file pointing to each file’s location). Condensed is what review platforms ingest.
- Include: Tags (fold your responsiveness/privilege calls into the load file), Text files (extracted text for the platform’s search index), and Replace redacted natives with converted PDFs.
The package structure (condensed):
Falcon-export 1 of 2.zip
├── Export_load_file_1 of 2.csv # metadata load file (one row/doc, columns of fields, native + text paths)
├── Warnings and errors 1 of 2.csv # items that failed to export - READ THIS
├── NativeFiles/ # native docs (or redacted PDFs if you chose replacement)
├── Extracted_text_files/ # per-document extracted text
└── Error_files/ # ExtractionError / ProcessingError items
Summary.csv # Total / Actual / Errors / Skipped Processing / Export Containers
The load file (Export_load_file_*.csv) is the spine — a metadata CSV (the Relativity/Concordance-style DAT equivalent) with one row per document, columns for every metadata field, and the relative paths to each native and text file so the review platform can rebuild the set. Note two operational facts: exports partition into ZIPs at a maximum 75 GB uncompressed each (a >75 GB export yields multiple ZIPs that recombine on extraction), and you have 30 days to download after the job completes, though the job record is retained for the life of the case.
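Those two limits are worth turning into numbers before you kick off a large export. The sketch below uses the 75 GB and 30-day figures quoted above as defaults; the function name is hypothetical, and the ZIP count is a planning estimate since the service decides the actual partition boundaries.

```powershell
# Hypothetical planning helper: estimated ZIP count and download deadline for an export.
function Get-ExportPlan {
    param(
        [Parameter(Mandatory)][double]$UncompressedGB,
        [Parameter(Mandatory)][datetime]$CompletedOn,
        [double]$MaxZipGB = 75,
        [int]$DownloadWindowDays = 30
    )
    [pscustomobject]@{
        # At least one ZIP, otherwise round up to whole partitions.
        ZipCount         = [int][math]::Max(1, [math]::Ceiling($UncompressedGB / $MaxZipGB))
        DownloadDeadline = $CompletedOn.AddDays($DownloadWindowDays)
    }
}

$plan = Get-ExportPlan -UncompressedGB 180 -CompletedOn '2026-06-20'
$plan   # 3 ZIPs, download by 2026-07-20
```

Put the deadline on someone's calendar the day the job finishes; a lapsed download means rerunning the export and explaining a second job in the audit trail.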
Chain of custody is not one feature; it is the sum of artifacts you have accumulated:
- the hold acknowledgement log (custodians were told to preserve, with timestamps);
- the committed collection summary (the exact query, locations, and counts — immutable);
- the Summary.csv reconciliation (Total vs Actual vs Errors vs Skipped; these numbers must tie out and be explainable);
- the warnings-and-errors file (what failed and why; never ship an export without reading it);
- the underlying Purview audit log, which records every eDiscovery action (case created, hold placed, collection committed, export run) with actor and timestamp.
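The Summary.csv reconciliation in that list comes down to one subtraction. The sketch below assumes the relationship Actual = Total - Errors - Skipped; the function name and the figures are invented, and in practice you would read the four numbers from Summary.csv rather than type them in.

```powershell
# Hypothetical reconciliation check for an export summary.
function Test-ExportReconciliation {
    param(
        [Parameter(Mandatory)][int]$Total,
        [Parameter(Mandatory)][int]$Actual,
        [Parameter(Mandatory)][int]$Errors,
        [Parameter(Mandatory)][int]$Skipped
    )
    # Assumption: every item is either exported, errored, or skipped.
    $expected = $Total - $Errors - $Skipped
    [pscustomobject]@{
        ExpectedActual = $expected
        Actual         = $Actual
        Gap            = $Actual - $expected
        TiesOut        = ($Actual -eq $expected)
    }
}

# Illustrative figures only.
$ok  = Test-ExportReconciliation -Total 24000 -Actual 23950 -Errors 35 -Skipped 15
$bad = Test-ExportReconciliation -Total 24000 -Actual 23900 -Errors 35 -Skipped 15
$ok, $bad | Format-Table -AutoSize
```

A non-zero Gap is not automatically a defect, but it is a number you need a written explanation for before the package leaves the building.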
Remember Azure Storage export was retired on August 31, 2025. The supported paths are now direct download (Reports only / Loose files and PSTs / Condensed directory structure). Do not architect a pipeline around an export-to-blob handoff — it no longer exists.
Verify
Run these to confirm the investigation is sound before you hand anything over.
# 1. Case exists and is open
Get-MgSecurityCaseEdiscoveryCase -EdiscoveryCaseId $case.Id |
Select-Object displayName, status, externalId
# 2. Every intended custodian is associated AND held
Get-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id |
Select-Object email, status, holdStatus
# 3. The legal hold is applied (not just created)
Get-MgBetaSecurityCaseEdiscoveryCaseLegalHold -EdiscoveryCaseId $case.Id |
Select-Object displayName, status, contentQuery
Then in the portal:
- Custodians -> hold notifications: every custodian shows Acknowledged, or you have a reminder cadence and a documented reason for any gap.
- Collections: the committed collection shows a final summary with non-zero, plausible counts; the saved query matches what you intended.
- Review set -> Jobs: analytics completed; the [AutoGen] For Review filter exists and the culled count is sane.
- Export -> Summary.csv: Actual reconciles against Total minus Errors/Skipped, and you can explain every error. Privileged-tagged items did not export in native form.
- Audit (Purview -> Audit): search RecordType for eDiscovery activities and confirm case create, hold, commit, and export are all logged with the right actors.
Enterprise scenario
A platform/security team at a mid-size SaaS company got a litigation hold notice on a Friday for a departed senior engineer suspected of taking pricing models and source-code references to a competitor. The constraint was brutal on two axes: the custodian’s mailbox was under an aggressive 30-day delete retention (Legal had not yet been looped in when he left), and a large share of the responsive evidence was cloud attachments — Teams and Outlook messages linking to OneDrive/SharePoint files rather than embedding them. An earlier, panicked Content Search had returned ~38 GB across 400+ locations and nobody could afford to review that.
What went wrong on the first pass: they ran a collection and committed it without checking Cloud attachments, so the review set was full of messages reading “see the model attached” with a dead link and no file. They had also waited four days before placing any hold while “scoping,” and two days of the custodian’s deleted items had already aged past the 30-day window.
The fix, redone properly:
- Preserve first. Add the custodian with applyHoldToSources = $true immediately, before any further scoping. (The lesson: never scope an at-risk custodian unheld.)
- Tighten the estimate, do not commit blind. Iterate the search to a defensible scope using statistics, then commit with the right options:
# Add and HOLD the custodian in one shot - stop the bleeding
New-MgSecurityCaseEdiscoveryCaseCustodian -EdiscoveryCaseId $case.Id -BodyParameter @{
email = "former.eng@saas.example"
applyHoldToSources = $true
}
# Scoped, date-bounded estimate over all case custodians + the pricing site
$s = New-MgSecurityCaseEdiscoveryCaseSearch -EdiscoveryCaseId $case.Id -BodyParameter @{
displayName = "Falcon scoped v3"
contentQuery = '(Falcon OR "pricing model" OR repo-falcon) AND (sent>=2025-12-01 AND sent<=2026-06-02)'
dataSourceScopes = "allCaseCustodians"
}
- Commit with Cloud attachments, document versions, and partially indexed items checked, so the linked OneDrive files (and prior versions showing the edit trail) actually landed in the review set.
- Run analytics + [AutoGen] For Review, which collapsed the scoped set from ~210,000 items to ~24,000 inclusive/representative items, then predictive coding to rank the remainder so reviewers hit the high-relevance pricing documents first.
The outcome: the two lost days of mailbox content were genuinely unrecoverable and had to be disclosed (the cost of the four-day delay). But the cloud-attachment evidence — the actual pricing spreadsheets and their version history — was fully preserved and produced, and the review budget dropped by roughly an order of magnitude versus reviewing the raw collection. The two durable lessons that went into their runbook: hold the custodian the moment the case opens, and Cloud attachments is non-optional in any modern M365 collection — without it you are collecting envelopes, not letters.