Secret sprawl is not a single incident; it is a steady-state condition. Tokens land in commit history, get echoed into CI logs, bake into container layers, and sit in Terraform state long after anyone remembers putting them there. The dangerous mental model is “we’ll scan and clean it up.” The correct model is a program: a left-to-right pipeline that blocks new leaks at the keyboard, detects the ones that slip through, and rotates the credential the moment it is exposed. This walkthrough builds that program with pre-commit scanning, GitHub push protection and secret scanning, pipeline and artifact scanning, and a remediation runbook whose first and only goal is revoke and rotate - not a heroic git filter-repo.
One principle frames everything below: once a secret reaches a place you do not fully control - a shared remote, a CI runner, a pulled image - it is compromised and must be rotated. Cleaning history is hygiene; rotation is the fix. Treat them as separate, non-substitutable steps.
1. The secret-sprawl problem: where credentials leak and why rotation is non-negotiable
Credentials leak from more surfaces than most inventories capture. The ones that actually bite:
| Surface | Typical leak | Why it persists |
|---|---|---|
| Commit history | AWS_SECRET_ACCESS_KEY removed in a later commit | Still reachable via the old SHA and every fork/clone |
| CI/CD logs | echo $TOKEN or a tool printing config | Logs are retained, cached, and often world-readable in the org |
| Container layers | ENV API_KEY=... or a copied .env | Squashing the final layer does not remove earlier ones |
| IaC + state | Provider creds in .tf; secrets in terraform.tfstate | State is plaintext JSON, frequently committed by accident |
| Build caches / artifacts | .npmrc, settings.xml, signing keys | Attached to pipeline runs and package registries |
The reason rotation is non-negotiable: you cannot prove a negative. Once a key is in a clone, a fork, a CI cache, or a screen-share, you have no way to confirm it was never copied. The only deterministic control is to make the leaked value worthless by revoking it. Everything else in this article exists to (a) stop new secrets from reaching those surfaces and (b) shrink the time between exposure and rotation.
2. Deploying pre-commit secret scanning with custom regex and entropy detection
The cheapest place to stop a leak is before it becomes a commit object. Two tools dominate here: gitleaks (Go, fast, great rule engine) and detect-secrets (Python, baseline-driven, strong entropy heuristics). I run gitleaks as the org default and reach for detect-secrets where teams want an auditable baseline file.
Wire it through the pre-commit framework so it is reproducible and CI-enforceable:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: ["--baseline", ".secrets.baseline"]
pip install pre-commit
pre-commit install # installs the git hook
pre-commit run --all-files # one-time sweep of the existing tree
The default rules catch cloud keys, but your highest-value leaks are usually internal tokens with predictable shapes. Add custom rules with regex and Shannon-entropy gating so you catch them without drowning in noise:
# .gitleaks.toml
title = "kloudvin custom rules"
[extend]
useDefault = true # keep all the built-in cloud/provider rules
[[rules]]
id = "kloudvin-internal-pat"
description = "KloudVin internal platform PAT"
# Anchored prefix beats a loose entropy match: fewer false positives.
regex = '''kv_pat_[0-9a-zA-Z]{40}'''
keywords = ["kv_pat_"]
entropy = 3.5
[[rules]]
id = "generic-high-entropy-assignment"
description = "High-entropy value assigned to a secret-ish variable"
regex = '''(?i)(api[_-]?key|secret|token|passwd|password)\s*[:=]\s*['"]?([0-9a-zA-Z\/+=_-]{32,})['"]?'''
secretGroup = 2 # gate entropy on the captured value, not the variable name
entropy = 4.3
[allowlist]
description = "Known-safe test fixtures and docs"
paths = ['''(.*)?_test\.go$''', '''testdata/''', '''docs/examples/''']
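To make the entropy gate less abstract, here is a minimal Python sketch of the idea: a regex match only counts as a finding when its characters look random enough. The pattern and the 3.5 threshold are copied from the custom rule above; the function names and the two sample tokens are made up for illustration, and the real scanner may measure a different part of the match than this sketch does.

```python
import math
import re
from collections import Counter

# Same shape as the internal PAT rule; 3.5 mirrors its `entropy` setting.
INTERNAL_PAT = re.compile(r"kv_pat_[0-9a-zA-Z]{40}")
ENTROPY_THRESHOLD = 3.5


def shannon_entropy(value: str) -> float:
    """Bits of entropy per character, computed from character frequencies."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_finding(line: str) -> bool:
    """True when the line matches the pattern AND the match looks random."""
    match = INTERNAL_PAT.search(line)
    return bool(match) and shannon_entropy(match.group(0)) >= ENTROPY_THRESHOLD


# Hypothetical samples: a placeholder-style value and a random-looking one.
placeholder = "token=kv_pat_" + "0" * 40
realistic = "token=kv_pat_" + "aB3dE5gH7jK9mN1pQ2rS4tU6vW8xY0zA1bC3dE5f"

print(is_finding(placeholder), is_finding(realistic))
```

Both samples match the regex, but only the random-looking one clears the threshold. That is the trade the gate makes: documentation placeholders stop generating noise, and a planted test token has to be random to trigger the rule.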
For detect-secrets, generate and commit a baseline so existing (audited) findings do not block every commit, while new ones still fail:
detect-secrets scan --all-files > .secrets.baseline
detect-secrets audit .secrets.baseline # mark each as true/false positive
Pre-commit hooks are advisory by design - a developer can skip them with git commit --no-verify. That is fine. The hook exists to catch honest mistakes fast and cheaply. The enforcement boundary is push protection and server-side scanning, covered next. Never treat a local hook as a security control you can audit.
3. Enabling push protection and repository secret scanning at organization scale
GitHub Advanced Security (whose secret-scanning features are now licensed as GitHub Secret Protection) provides two server-side controls that a developer cannot bypass with --no-verify:
- Secret scanning - scans the full history and ongoing pushes for known partner patterns and your custom patterns, and raises alerts.
- Push protection - rejects a push at the server when it contains a detectable secret, so the credential never lands on the remote.
Turn both on for every new repo at the org level rather than chasing them one by one:
# Enable org-wide defaults for newly created repos
gh api --method PATCH /orgs/KloudVin \
-F secret_scanning_push_protection_enabled_for_new_repositories=true \
-F secret_scanning_enabled_for_new_repositories=true
For existing repos, enable per repository via the security-and-analysis settings:
gh api --method PATCH /repos/KloudVin/platform-core \
-f 'security_and_analysis[secret_scanning][status]=enabled' \
-f 'security_and_analysis[secret_scanning_push_protection][status]=enabled'
Add custom patterns for your internal token shapes at the org level so every repo inherits them. Define the secret format, optionally with before/after context to cut false positives, and use the dry run before publishing so you can see the blast radius against real history:
Secret format: kv_pat_[0-9a-zA-Z]{40}
Before secret: (?:authorization|token|pat)\s*[:=]\s*["']?
Test string: token=kv_pat_AbC123... (40 alnum chars)
Run every new custom pattern in dry run first. A sloppy pattern published live can push-protect an entire monorepo into a standstill and flood the security tab with thousands of historical alerts. Validate the match count, then publish.
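Before the server-side dry run, it can help to preview locally how much the before-secret context narrows a pattern. The sketch below is only an approximation: it uses Python's `re` module, whereas GitHub evaluates custom patterns with its own engine, and the sample lines and helper name are invented for illustration.

```python
import re

# Fields copied from the custom pattern above.
SECRET_FORMAT = r"kv_pat_[0-9a-zA-Z]{40}"
BEFORE_SECRET = r"(?:authorization|token|pat)\s*[:=]\s*[\"']?"

bare = re.compile(SECRET_FORMAT)
with_context = re.compile(BEFORE_SECRET + SECRET_FORMAT)


def count_matches(pattern: re.Pattern, lines: list[str]) -> int:
    """Number of lines in which the pattern matches at least once."""
    return sum(1 for line in lines if pattern.search(line))


# Hypothetical corpus standing in for real history.
token = "kv_pat_" + "aB3dE5gH7jK9mN1pQ2rS4tU6vW8xY0zA1bC3dE5f"
sample_lines = [
    f'token="{token}"',                       # assignment in config
    f"authorization: {token}",                # header-style usage
    f"# example id {token} in a changelog",   # prose mention, no assignment
    "token=kv_pat_short",                     # wrong length, never matches
]

print(count_matches(bare, sample_lines), count_matches(with_context, sample_lines))
```

The gap between the two counts is a rough measure of the alerts the context requirement would suppress. Whether suppressing the prose mention is acceptable is a policy decision, not a regex one.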
When push protection blocks a developer, GitHub gives them an inline path: remove the secret, or bypass with a reason (it’s a test value, a false positive, or “I’ll fix it later”). Route every bypass to your security team. The bypass reason is the single best signal of where your patterns are too aggressive or where a real secret almost shipped.
4. Scanning CI/CD pipeline logs, container layers, and IaC for embedded secrets
Repository scanning misses three high-leak surfaces: pipeline logs, build artifacts, and what your IaC actually provisions. Cover them explicitly.
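Pipeline logs and the working tree. Add a scan step that fails the build, scoped to the diff on PRs so it stays fast, and full-history on the default branch. The gitleaks project publishes a GitHub Action (a third-party action, not a GitHub first-party one; check its README for the license key it requires on organization accounts); for cross-platform use, run the binary directly: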
# .github/workflows/secret-scan.yml
name: secret-scan
on: [pull_request, push]
permissions:
  contents: read
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so the scan sees prior commits
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
The most common own-goal is a pipeline that prints a secret. Two defenses: register secrets so the runner masks them, and never echo environment in CI.
- name: Mask a derived secret
  run: echo "::add-mask::$DERIVED_TOKEN"
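To show what masking buys you, the following Python sketch mimics the effect of a registered mask on log output. It illustrates the concept only and is not how the runner implements it; the function name and sample values are hypothetical.

```python
def mask_secrets(text: str, secrets: list[str], placeholder: str = "***") -> str:
    """Replace every occurrence of each known secret value in `text`."""
    # Longest first, so a secret that contains another one is masked whole.
    for value in sorted(filter(None, secrets), key=len, reverse=True):
        text = text.replace(value, placeholder)
    return text


# Hypothetical log line and secret values.
log_line = "curl -H 'X-Token: abc123' https://internal.example/api"
print(mask_secrets(log_line, ["abc123"]))
```

The limitation is visible in the code: only values registered in advance are redacted. A token that is base64-encoded, split across lines, or derived at runtime without its own mask still reaches the log, which is why the second defense (never echo the environment) matters.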
Container layers. A secret added in an early layer survives even if you delete the file in a later one - docker history and a layer extractor will surface it. Scan the built image, not just the Dockerfile. Trivy walks every layer’s filesystem for secrets:
trivy image --scanners secret --exit-code 1 \
ghcr.io/kloudvin/api:${GIT_SHA}
The structural fix is to never let the secret enter a layer. Use BuildKit secret mounts so the value is available at build time but is not persisted in any layer:
# syntax=docker/dockerfile:1.7
FROM node:20-slim
RUN --mount=type=secret,id=npmtoken \
NPM_TOKEN="$(cat /run/secrets/npmtoken)" npm ci
docker build --secret id=npmtoken,env=NPM_TOKEN -t kloudvin/api:dev .
IaC and state. Scan .tf for hardcoded provider credentials, and treat terraform.tfstate as radioactive - it stores resource attributes, including secrets, in plaintext JSON. Filesystem scanning catches both:
trivy fs --scanners secret,misconfig .
Then make committing state impossible and move it to an encrypted, locked backend:
# backend.tf - remote state, never local/committed
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "stkloudvintfstate"
    container_name       = "state"
    key                  = "platform-core.tfstate"
    use_azuread_auth     = true # identity-based, no storage key in config
  }
}
# .gitignore - belt and braces
*.tfstate
*.tfstate.*
*.tfvars
.terraform/
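To see why state deserves this treatment, here is a small Python sketch that walks a state document and lists populated attributes whose names look secret-like. It assumes the version 4 state layout (resources, then instances, then attributes). The hint list, function name, and sample resource are illustrative, and a name-based check like this complements a real scanner rather than replacing one.

```python
import json

# Hypothetical substrings that suggest an attribute holds a credential.
SECRET_HINTS = ("password", "secret", "token", "private_key", "access_key")


def secretish_attributes(state_json: str) -> list[str]:
    """Return 'type.name.attribute' for populated, secret-looking attributes."""
    state = json.loads(state_json)
    hits = []
    for resource in state.get("resources", []):
        for instance in resource.get("instances", []):
            for key, value in instance.get("attributes", {}).items():
                if value and any(hint in key.lower() for hint in SECRET_HINTS):
                    hits.append(f'{resource["type"]}.{resource["name"]}.{key}')
    return hits


# Minimal made-up state; the password is a placeholder, not a real value.
sample_state = json.dumps({
    "version": 4,
    "resources": [{
        "type": "azurerm_mssql_server",
        "name": "main",
        "instances": [{"attributes": {
            "administrator_login": "sqladmin",
            "administrator_login_password": "example-only-not-a-real-password",
            "location": "westeurope",
        }}],
    }],
})

print(secretish_attributes(sample_state))
```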
5. Triaging alerts: validity checking, ownership routing, and false-positive tuning
A scanner that fires a thousand alerts no one triages is worse than no scanner - it trains people to ignore the channel. Three disciplines keep the signal alive.
Validity checking first. GitHub’s secret scanning performs validity checks for many partner patterns, calling the provider to label an alert active, inactive, or unknown. Triage active first - those are live, working credentials. Pull the queue programmatically and sort:
gh api "/repos/KloudVin/platform-core/secret-scanning/alerts?state=open" \
--jq '.[] | {number, secret_type, validity, created_at}'
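Once the alerts are exported as JSON, the triage order can be applied in a few lines. This Python sketch assumes each alert is a dict with the `validity` and `created_at` fields selected above. The ordering (active, then unknown, then inactive, oldest first within each group) is one reasonable policy rather than anything GitHub prescribes, and the sample alerts are invented.

```python
# Lower number = triage sooner. Unknown sits between active and inactive
# because it may still be a live credential.
VALIDITY_ORDER = {"active": 0, "unknown": 1, "inactive": 2}


def triage_order(alerts: list[dict]) -> list[dict]:
    """Sort alerts by validity, then oldest first (ISO 8601 UTC strings sort lexically)."""
    return sorted(
        alerts,
        key=lambda a: (VALIDITY_ORDER.get(a.get("validity"), 1), a["created_at"]),
    )


# Hypothetical queue.
alerts = [
    {"number": 7, "validity": "inactive", "created_at": "2024-05-01T10:00:00Z"},
    {"number": 9, "validity": "active", "created_at": "2024-05-03T08:00:00Z"},
    {"number": 8, "validity": "unknown", "created_at": "2024-05-02T09:00:00Z"},
    {"number": 4, "validity": "active", "created_at": "2024-04-20T12:00:00Z"},
]

print([a["number"] for a in triage_order(alerts)])
```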
Ownership routing. Map the file path or repo to a team via CODEOWNERS and your service catalog, then auto-assign. An alert with no owner is an alert that ages forever. The routing rule is simple: the team that owns the code owns the rotation, with the security team owning the SLA and the audit trail.
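A routing rule can be as small as a longest-prefix lookup. The sketch below is a deliberate simplification: real CODEOWNERS files use gitignore-style patterns where the last match wins, whereas this uses a hypothetical service-catalog map from directory prefix to team, with the security team as the fallback so no alert is left unowned.

```python
# Hypothetical catalog: directory prefix -> owning team.
CATALOG = {
    "services/api/": "team-api",
    "services/api/billing/": "team-billing",
    "infra/": "team-platform",
}


def owner_for(path: str, catalog: dict[str, str], fallback: str = "security-team") -> str:
    """Return the team for the most specific (longest) matching prefix."""
    matches = [prefix for prefix in catalog if path.startswith(prefix)]
    if not matches:
        return fallback
    return catalog[max(matches, key=len)]


print(owner_for("services/api/billing/config.yaml", CATALOG))
print(owner_for("README.md", CATALOG))
```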
False-positive tuning. Every false positive is a tax on attention. Resolve it in a way that suppresses the class, not just the instance: refine the custom-pattern’s before/after context, add an allowlist path in .gitleaks.toml, or mark it in the detect-secrets baseline. Track your false-positive rate as a first-class metric; if it climbs above roughly 20% for a given rule, the rule - not the developers - needs fixing.
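The per-rule false-positive rate is simple to compute from closed alerts. This sketch assumes you can export (rule id, resolution) pairs and that false positives carry the resolution `false_positive`. The 20% threshold is the rough figure from the paragraph above, and the sample data is invented.

```python
from collections import defaultdict

FP_THRESHOLD = 0.20  # rough ceiling from the text, not a hard standard


def noisy_rules(resolutions, threshold: float = FP_THRESHOLD) -> dict[str, float]:
    """Map rule id -> false-positive rate, for rules above the threshold."""
    totals: dict[str, int] = defaultdict(int)
    false_positives: dict[str, int] = defaultdict(int)
    for rule_id, resolution in resolutions:
        totals[rule_id] += 1
        if resolution == "false_positive":
            false_positives[rule_id] += 1
    rates = {rule: false_positives[rule] / totals[rule] for rule in totals}
    return {rule: rate for rule, rate in rates.items() if rate > threshold}


# Hypothetical closed alerts: 1 of 10 false for the PAT rule, 3 of 10 for the generic one.
closed = (
    [("kloudvin-internal-pat", "revoked")] * 9
    + [("kloudvin-internal-pat", "false_positive")]
    + [("generic-high-entropy-assignment", "revoked")] * 7
    + [("generic-high-entropy-assignment", "false_positive")] * 3
)

print(noisy_rules(closed))
```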
6. The remediation runbook: revoke, rotate, and verify without history rewrites
When a real secret is confirmed exposed, follow a fixed order. The instinct to “scrub the history first” is the wrong first move - it is slow, breaks every open PR and clone, and does nothing to a credential that is already copied.
Order of operations: Revoke -> Rotate -> Verify -> (optionally) Purge history. The credential’s exposure ends at revoke, not at history rewrite. Do the math on attacker dwell time: the key is exploitable from the second it hits the remote until the second you revoke it. Optimize that window, nothing else.
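The dwell-time arithmetic is worth writing down, because it shows that the time of a history rewrite never enters the calculation. A minimal sketch, assuming UTC timestamps in the ISO 8601 form the GitHub API returns; the function name and the times are made up.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def exposure_minutes(pushed_at: str, revoked_at: str) -> float:
    """Minutes during which the leaked credential was usable."""
    start = datetime.strptime(pushed_at, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(revoked_at, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return (end - start).total_seconds() / 60


# Hypothetical incident: pushed at 09:00, revoked at 09:45.
# A history purge finishing hours later would not change this number.
print(exposure_minutes("2024-05-03T09:00:00Z", "2024-05-03T09:45:00Z"))
```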
- Revoke / disable the exposed credential at the provider immediately. This is the only step that actually stops the bleed.
  # AWS access key - disable first (instant), delete after rotation confirmed
  aws iam update-access-key --access-key-id AKIA... --status Inactive \
    --user-name svc-platform
- Rotate - issue a fresh credential, deliver it to the workload through your vault (next section), and confirm the application is healthy on the new value.
- Verify the old credential is dead. Do not assume Inactive means gone - prove it fails:
  AWS_ACCESS_KEY_ID=AKIA... AWS_SECRET_ACCESS_KEY=... \
    aws sts get-caller-identity # expect: InvalidClientTokenId
- Close the alert with a resolution of revoked, linking the rotation change. This is your audit trail.
  gh api --method PATCH \
    /repos/KloudVin/platform-core/secret-scanning/alerts/42 \
    -f state=resolved -f resolution=revoked
- History purge is optional and last. If compliance requires the value gone from history, use git filter-repo - but only after rotation, and with full awareness that it rewrites SHAs and forces every collaborator to re-clone.
  git filter-repo --replace-text <(echo 'AKIA...==>REDACTED')
The takeaway worth internalizing: a rotated credential in your history is an annoyance; a live credential in your history is an incident. Spend your urgency accordingly.
7. Migrating discovered secrets into a vault with managed-identity references
Each remediated secret is an opportunity to delete the category of leak, not just the instance. The end state is no secret value in source, CI, or app settings - only a reference resolved at runtime by a managed identity.
Store the rotated value in Key Vault:
az keyvault secret set \
--vault-name kv-kloudvin-prod \
--name svc-platform-db-password \
--value "$NEW_PASSWORD"
Reference it from an App Service / Function App by managed identity, so the deployed configuration holds a pointer, never the secret. The platform resolves the reference at startup using the app’s identity:
az webapp config appsettings set \
--name app-kloudvin-api --resource-group rg-platform-prod \
--settings "DbPassword=@Microsoft.KeyVault(VaultName=kv-kloudvin-prod;SecretName=svc-platform-db-password)"
In IaC, this is a Key Vault reference wired to the app’s identity - the secret value never appears in the template:
resource site 'Microsoft.Web/sites@2023-12-01' = {
  name: 'app-kloudvin-api'
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    siteConfig: {
      appSettings: [
        {
          name: 'DbPassword'
          value: '@Microsoft.KeyVault(VaultName=${kvName};SecretName=svc-platform-db-password)'
        }
      ]
    }
  }
}
For CI/CD, close the loop completely with workload identity federation (OIDC) so the pipeline holds no stored secret to leak in the first place - the runner exchanges a short-lived OIDC token for cloud access at job time. That removes the most-rotated, most-leaked credential class (pipeline service-principal passwords) from the board entirely.
Verify
Prove the controls work before you call the program done.
# 1. Push protection actually blocks - test with a known partner pattern
# (use a documented test token, never a real one), expect rejection:
git commit -am "test push protection" && git push # expect: push rejected
# 2. Pre-commit catches a planted internal token
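# Use a random suffix: a zero-padded value would fall below the rule's 3.5 entropy gate
printf 'token=kv_pat_%s\n' "$(LC_ALL=C tr -dc 'A-Za-z0-9' </dev/urandom | head -c 40)" > leak.txt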
git add leak.txt && pre-commit run gitleaks --files leak.txt # expect: FAIL
# 3. Container scan finds a secret baked into a layer
trivy image --scanners secret --exit-code 1 ghcr.io/kloudvin/api:test
# 4. No open, active secret-scanning alerts remain
gh api "/repos/KloudVin/platform-core/secret-scanning/alerts?state=open" \
--jq '[.[] | select(.validity=="active")] | length' # expect: 0
# 5. A revoked credential is genuinely dead (provider-specific)
aws sts get-caller-identity # with the old key set: expect InvalidClientTokenId
Confirm in the GitHub Security tab that push-protection bypasses are flowing to your security team, and check your metrics dashboard that mean-time-to-rotate is being recorded per alert.
Measuring mean-time-to-rotate and driving secrets out of source control
What you do not measure regresses. Two metrics drive the program:
- Mean-time-to-rotate (MTTR-secret) - from alert creation to the alert being resolved as revoked. This is your real exposure-window proxy. Drive it down relentlessly; a healthy target is hours, not days.
- Secret density - active leaks per 1,000 repos, trending toward zero as coverage and prevention mature.
Pull the rotation latency straight from the alert lifecycle:
gh api "/repos/KloudVin/platform-core/secret-scanning/alerts?state=resolved" \
--jq '.[] | select(.resolution=="revoked")
| {number, created_at, resolved_at}'
If you centralize alerts into a SIEM, compute the distribution rather than eyeballing it. In KQL against a custom table:
SecretAlerts_CL
| where Resolution_s == "revoked"
| extend RotateMinutes = datetime_diff('minute', ResolvedAt_t, CreatedAt_t)
| summarize p50=percentile(RotateMinutes, 50),
p90=percentile(RotateMinutes, 90),
count() by Repo_s
| sort by p90 desc
Watch p90, not the mean - the long tail is where a live credential sat exploitable for days. That tail is the work.
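If the alerts are not in a SIEM yet, the same distribution can be computed from the API export. The sketch below uses the nearest-rank percentile definition, which may differ slightly from the estimator your query engine uses. The helper names and sample durations are invented.

```python
import math
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def rotate_minutes(alert: dict) -> float:
    """Minutes from alert creation to resolution."""
    created = datetime.strptime(alert["created_at"], TIMESTAMP_FORMAT)
    resolved = datetime.strptime(alert["resolved_at"], TIMESTAMP_FORMAT)
    return (resolved - created).total_seconds() / 60


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical rotation times in minutes; one alert sat open for three days.
durations = [30, 45, 60, 60, 90, 120, 180, 240, 600, 4320]

print(percentile(durations, 50), percentile(durations, 90), max(durations))
```

In this made-up sample, half the alerts were rotated within 90 minutes, yet the slowest one took three days. That spread is the tail the text says to watch, and a single summary number would hide it.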
Enterprise scenario
A platform team running roughly 1,400 repos across a single GitHub org enabled custom-pattern secret scanning and immediately hit a wall: their default-branch protection required PRs, but push protection was rejecting pushes to long-lived release branches that legitimately carried encrypted, sealed-secret manifests. The sealed values were high-entropy base64 blobs that tripped a generic entropy rule, blocking every release. The constraint was hard: they could not disable push protection (a SOC 2 control), and they could not rewrite the sealed-secrets workflow on a deadline.
The fix was surgical, not a blanket exception. They scoped the noisy rule out by path - sealed secrets live in a known directory and carry a known wrapper annotation - rather than turning protection off:
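[[rules.allowlist]]
description = "Bitnami sealed-secrets ciphertext is safe to commit"
# Intent: only suppress inside the sealed-secrets directory, nowhere else.
# Check this against your gitleaks version: by default paths and regexes act as
# alternative criteria (either one suppresses), and newer releases spell the
# rule-level table [[rules.allowlists]] with condition = "AND" to require both.
paths = ['''(.*/)?sealed-secrets/.*\.ya?ml$''']
regexes = ['''AgB[0-9A-Za-z+/=]{100,}'''] # sealed-secrets ciphertext prefix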
For the server-side GitHub custom pattern, they added a must-not-match condition on the same AgB... ciphertext prefix so the published pattern stopped flagging sealed values org-wide, validated via dry run against the release branches before publishing. Result: zero false positives on sealed secrets, push protection stayed fully on, and the one genuine plaintext token the dry run did surface in an unrelated repo got rotated the same afternoon. The lesson the team wrote into their runbook: when push protection fights a legitimate workflow, narrow the rule to the exact safe artifact - never widen the bypass.