Most teams reach for Spinnaker when one CI tool’s “deploy” step stops being enough: you are shipping the same artifact to a Kubernetes account in GCP, an EKS cluster in AWS, and a legacy ASG, each with its own approvers, metrics backend, and blast-radius rules. Spinnaker’s value is not that it deploys – anything deploys – it is that it makes promotion a governed, observable state machine. A pipeline bakes once, fans out across clouds, pauses on a manual judgment where policy demands a human, runs automated canary analysis against real telemetry, and rolls back on the evidence rather than on a pager. This article builds that pipeline end to end: providers, the stage model, the three deployment strategies, Kayenta wired to Prometheus/Stackdriver/Datadog, templated triggers, Fiat authorization, and rollback automation.
1. The microservice architecture and why it matters operationally
Spinnaker is a set of JVM microservices behind one UI. You cannot operate it well without knowing which service owns which failure mode.
| Service | Responsibility | When it is the culprit |
|---|---|---|
| Deck | Browser UI (AngularJS/React single-page app) | Nothing renders; check Gate connectivity |
| Gate | API gateway (all UI/CLI traffic) | 401/403s, auth proxy issues |
| Orca | Orchestration engine; runs pipelines stage by stage | Pipelines stuck “RUNNING”, stage transitions |
| Clouddriver | All mutating calls to cloud providers; caches deployed resources | Deploys fail, stale infra in the UI |
| Front50 | Persists applications, pipelines, projects, notifications | Pipelines vanish, save failures (S3/GCS bucket) |
| Rosco | Bakery; produces images via Packer, and Helm/Kustomize manifest bakes | Bake stage failures |
| Igor | Polls CI (Jenkins/Travis) and registries; emits triggers | Triggers do not fire on new images/builds |
| Echo | Eventing bus; notifications, webhooks, cron triggers | No Slack/email, scheduled pipelines silent |
| Fiat | Authorization (accounts, applications, roles) | Users see nothing, “not authorized” on accounts |
| Kayenta | Automated canary analysis (ACA) | Canary stage errors, metric queries fail |
The single most useful operational fact: Clouddriver indexes the world on a cache cycle. A resource created out-of-band can be invisible until the next caching agent run, and a pipeline that “cannot find” a cluster is frequently caching lag, not a real failure. Scale Clouddriver and tune its cache intervals before you blame the cloud.
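One way to keep caching lag from failing automation is to poll before treating a missing cluster as an error. The sketch below is illustrative only: `wait_until_cached` and its parameters are hypothetical names, and `list_names` stands in for whatever call you use to list what Clouddriver currently has cached.

```python
import time
from typing import Callable, Iterable


def wait_until_cached(
    list_names: Callable[[], Iterable[str]],
    wanted: str,
    attempts: int = 5,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `wanted` appears in the listing, False after `attempts` tries."""
    for attempt in range(attempts):
        if wanted in set(list_names()):
            return True
        if attempt < attempts - 1:
            # Give the next caching-agent cycle time to run before retrying.
            sleep(delay_seconds)
    return False
```

Passing `sleep` as a parameter keeps the helper testable without real waits; the delay should roughly match your cache interval.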
2. Configuring cloud providers with Halyard or the Operator
There are two config models. Halyard (hal) is the original CLI that owns a halconfig and renders service settings; it is functional but now in maintenance, so new installs increasingly prefer the Spinnaker Operator (a Kubernetes operator that reconciles a SpinnakerService CR). Pick one and stay on it – mixing them corrupts state.
With Halyard, enabling providers and adding accounts is declarative through the hal command tree:
# Kubernetes provider: enable, then add one account per cluster/context
hal config provider kubernetes enable
hal config provider kubernetes account add gke-prod-us \
--provider-version v2 \
--context gke_my-project_us-central1_prod \
--only-spinnaker-managed true
# AWS provider with a separate account (assume-role into a target account)
hal config provider aws enable
hal config provider aws account add aws-prod \
--account-id 111122223333 \
--assume-role role/spinnakerManaged \
--regions us-east-1,eu-west-1
# Google (GCE) provider
hal config provider google enable
hal config provider google account add gcp-prod \
--project my-gcp-project \
--json-path /home/spinnaker/.gcp/spinnaker-sa.json
# Set the deployment topology to distributed on Kubernetes, then apply
hal config deploy edit --type distributed --account-name gke-prod-us
hal deploy apply
Inspect what Halyard will render before you apply, and keep the generated config under review:
hal config # show the full deployment config
hal config provider kubernetes account list
hal config generate # render service .yml files without applying
hal deploy apply # roll the config out to the running services
With the Operator, the same intent is a versioned CR you commit to Git. Accounts can live in the CR or in a kustomize-managed secret, and the operator reconciles drift:
apiVersion: spinnaker.io/v1alpha2
kind: SpinnakerService
metadata:
  name: spinnaker
spec:
  spinnakerConfig:
    config:
      version: 1.35.1
      providers:
        kubernetes:
          enabled: true
          accounts:
            - name: gke-prod-us
              context: gke_my-project_us-central1_prod
              onlySpinnakerManaged: true
        # aws / google providers follow the same accounts[] shape
      deploymentEnvironment:
        size: SMALL
        type: Distributed
Treat accounts as your multi-cloud boundary. An “account” is one credential into one cloud/cluster, and it is the unit Fiat authorizes against (section 7). Name them by environment and cloud (gke-prod-us, aws-prod-eu) so permissions and pipeline targeting read cleanly.
3. The application, pipeline, and stage model
Spinnaker’s top-level object is an application – a logical service that owns its clusters, firewalls, load balancers, and pipelines across every account. Inside it, a pipeline is an ordered graph of stages:
- Bake – Rosco turns source (a Packer template, or a Helm/Kustomize chart) into an immutable artifact: an AMI, a GCE image, or a rendered Kubernetes manifest.
- Deploy – Clouddriver creates a new server group (an ASG, a managed instance group, or a Kubernetes ReplicaSet) using a chosen strategy.
- Manual Judgment – pauses until an authorized human selects continue/stop. Your governed gate.
- Canary Analysis – runs Kayenta ACA and yields a score (sections 5-6).
- Webhook – calls an external system (change-management API, synthetic test runner) and optionally polls for completion.
Pipelines are JSON. You edit most of it in the UI, but the JSON is the source of truth and what you template (section 8). A trimmed deploy-then-judge skeleton:
{
"application": "checkout",
"name": "promote-to-prod",
"stages": [
{
"refId": "1",
"type": "deployManifest",
"name": "Deploy canary (GKE)",
"account": "gke-prod-us",
"cloudProvider": "kubernetes",
"moniker": { "app": "checkout" },
"source": "text",
"manifests": [ "<rendered manifest injected by bake artifact>" ]
},
{
"refId": "2",
"requisiteStageRefIds": ["1"],
"type": "manualJudgment",
"name": "Release approval",
"instructions": "Confirm canary score and change ticket before promotion.",
"judgmentInputs": [ { "value": "promote" }, { "value": "halt" } ],
"failPipeline": true
}
]
}
requisiteStageRefIds is the DAG edge: stage 2 runs only after 1. Branches sharing no requisiteStageRefIds run in parallel – precisely how you fan a single bake out to multiple cloud accounts at once.
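To see how the edges translate into parallelism, here is a minimal sketch that groups stages into "waves" from their requisiteStageRefIds. It is a simplification under stated assumptions: the real orchestrator starts each stage as soon as its own prerequisites finish rather than in lockstep waves, and `execution_waves` plus the sample refIds are hypothetical.

```python
from typing import Dict, List, Set


def execution_waves(stages: List[dict]) -> List[List[str]]:
    """Group stage refIds into waves; stages in the same wave can run in parallel."""
    deps: Dict[str, Set[str]] = {
        s["refId"]: set(s.get("requisiteStageRefIds", [])) for s in stages
    }
    done: Set[str] = set()
    waves: List[List[str]] = []
    while len(done) < len(deps):
        # A stage is ready when every stage it depends on has already finished.
        ready = sorted(r for r, d in deps.items() if r not in done and d <= done)
        if not ready:
            raise ValueError("cycle or unknown refId in requisiteStageRefIds")
        waves.append(ready)
        done.update(ready)
    return waves


# Illustrative pipeline: one bake fanned out to two accounts, then one judgment.
stages = [
    {"refId": "bake", "type": "bakeManifest"},
    {"refId": "gke", "type": "deployManifest", "requisiteStageRefIds": ["bake"]},
    {"refId": "eks", "type": "deployManifest", "requisiteStageRefIds": ["bake"]},
    {"refId": "judge", "type": "manualJudgment", "requisiteStageRefIds": ["gke", "eks"]},
]
print(execution_waves(stages))  # [['bake'], ['eks', 'gke'], ['judge']]
```

The two deploy stages land in the same wave because neither lists the other as a prerequisite.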
4. Deployment strategies: red/black, rolling, and highlander
Spinnaker bakes the deployment pattern into the Deploy stage as a strategy on the server group. The three you must know differ entirely in how they treat the previous server group:
| Strategy | What happens | Rollback | Cost during deploy |
|---|---|---|---|
| redblack (blue/green) | New server group created; once healthy, old one is disabled (removed from the LB) but kept | Re-enable the old group – instant | 2x capacity until cleanup |
| rollingredblack | New group is scaled up and old group scaled down in increments (one or two at a time) | Reverse the roll; partial exposure | ~1x + increment |
| highlander | New server group created; once healthy, old one is destroyed | None automatic – old group is gone | 1x; cheapest |
A red/black Kubernetes deploy stage with an explicit rollback policy:
{
"type": "deployManifest",
"name": "Deploy prod (red/black)",
"account": "gke-prod-us",
"cloudProvider": "kubernetes",
"strategy": "redblack",
"maxRemainingAsgs": 2,
"trafficManagement": {
"enabled": true,
"options": {
"enableTraffic": true,
"services": ["service checkout"]
}
},
"rollback": { "onFailure": true }
}
Choose by failure shape. Use redblack for instant rollback when you can afford double capacity briefly. Use rollingredblack when 2x capacity is too expensive or the workload cannot tolerate a sudden full cutover, accepting partial exposure mid-roll. Use highlander only where capacity is precious and fast rollback is not a requirement – it leaves you nothing to roll back to.
maxRemainingAsgs (the Kubernetes maxRemaining equivalent) controls how many server groups, counting the new one, Spinnaker keeps after cleanup. Set it to at least 2 for red/black so the deploy that creates N+1 still leaves N as a rollback target.
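The effect of that limit is easy to reason about with a small model. The sketch below assumes the limit counts the newly deployed group too and that cleanup simply keeps the newest groups; `prune_server_groups` and the version names are hypothetical, not Spinnaker's implementation.

```python
from typing import List, Tuple


def prune_server_groups(groups: List[str], max_remaining: int) -> Tuple[List[str], List[str]]:
    """Split groups (ordered oldest to newest) into (kept, destroyed).

    Simplifying assumption: the limit includes the newly deployed group,
    so the newest `max_remaining` groups survive cleanup.
    """
    if max_remaining < 1:
        raise ValueError("max_remaining must be at least 1")
    kept = groups[-max_remaining:]
    destroyed = groups[:-max_remaining]
    return kept, destroyed


# After deploying v003 with a limit of 2, v002 survives as the rollback target.
kept, destroyed = prune_server_groups(
    ["checkout-v001", "checkout-v002", "checkout-v003"], max_remaining=2
)
```

With a limit of 1 the model behaves like highlander: only the new group remains and there is nothing to re-enable.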
5. Automated Canary Analysis with Kayenta
Kayenta compares a control (baseline) against an experiment (canary) – ideally two freshly deployed server groups of the same size, so the only variable is the code, not warm caches or instance age. It runs your queries against both, scores the divergence with a judge, and returns a numeric canary score the pipeline thresholds.
A canary config declares metrics, the groups they belong to, per-group weights, and the scoreThresholds that decide pass/marginal/fail:
{
"name": "checkout-aca",
"judge": { "name": "NetflixACAJudge-v1.0", "judgeConfigurations": {} },
"metrics": [
{
"name": "error-rate",
"query": {
"type": "prometheus",
"serviceType": "prometheus",
"metricName": "http_requests_total",
"customInlineTemplate": "PromQL:sum(rate(http_requests_total{job=\"checkout\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"checkout\"}[5m]))"
},
"groups": ["quality"],
"analysisConfigurations": {
"canary": { "direction": "increase", "critical": true }
},
"scopeName": "default"
},
{
"name": "p99-latency",
"query": {
"type": "prometheus",
"serviceType": "prometheus",
"customInlineTemplate": "PromQL:histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"checkout\"}[5m])) by (le))"
},
"groups": ["latency"],
"analysisConfigurations": { "canary": { "direction": "increase" } },
"scopeName": "default"
}
],
"classifier": {
"groupWeights": { "quality": 60, "latency": 40 },
"scoreThresholds": { "marginal": 75, "pass": 95 }
}
}
The mechanics that matter:
- direction tells the judge which way is bad: increase fails the metric when the canary rises above baseline (error rate, latency), decrease suits throughput, and either suits metrics that should track baseline in both directions.
- critical: true makes one failing metric fail the whole analysis regardless of the weighted score. Use it for error rate, never for a noisy gauge.
- groupWeights must sum to 100. The judge scores each group, weights the group scores, and compares the result to scoreThresholds: below marginal fails, at or above pass succeeds, and anything in between is marginal (treated as a failure in a gated pipeline).
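The weighting and threshold rules above can be sketched in a few lines. This is a simplified model, not Kayenta's judge: it takes per-group scores as given (the real judge derives them by comparing metric distributions), and `classify` with its parameters is a hypothetical name.

```python
from typing import Dict


def classify(
    group_scores: Dict[str, float],
    group_weights: Dict[str, float],
    marginal: float,
    passing: float,
    critical_failed: bool = False,
) -> str:
    """Return 'pass', 'marginal', or 'fail' for one judgment."""
    if abs(sum(group_weights.values()) - 100) > 1e-9:
        raise ValueError("groupWeights must sum to 100")
    if critical_failed:
        # A failing critical metric overrides the weighted score entirely.
        return "fail"
    score = sum(group_scores[g] * w / 100 for g, w in group_weights.items())
    if score >= passing:
        return "pass"
    if score >= marginal:
        return "marginal"
    return "fail"


weights = {"quality": 60, "latency": 40}
verdict = classify({"quality": 100, "latency": 50}, weights, marginal=75, passing=95)
```

In this example the weighted score is 60 + 20 = 80, which sits between the two thresholds and is therefore marginal.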
In the pipeline, the Canary Analysis stage runs this config repeatedly over a window, pointing each interval at the two scopes:
{
"type": "kayentaCanary",
"name": "ACA: checkout",
"canaryConfig": {
"metricsAccountName": "prometheus-prod",
"storageAccountName": "gcs-canary",
"canaryConfigId": "checkout-aca",
"scopes": [
{
"scopeName": "default",
"controlScope": "checkout-baseline",
"experimentScope": "checkout-canary",
"step": 60,
"extendedScopeParams": { "type": "cluster" }
}
],
"scoreThresholds": { "marginal": "75", "pass": "95" },
"lifetimeDuration": "PT1H",
"analysisIntervalMins": 15
}
}
That config samples for one hour (PT1H) in four 15-minute judgments, each scoring checkout-canary against checkout-baseline. If any judgment falls below marginal, the stage fails and – wired correctly – triggers rollback (section 8).
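The interval arithmetic is worth checking whenever you change either setting. A minimal sketch, assuming the lifetime uses only hours and minutes in ISO-8601 form (PT1H, PT1H30M, PT45M); `judgment_count` is a hypothetical helper.

```python
import re


def judgment_count(lifetime_duration: str, interval_mins: int) -> int:
    """Number of whole analysis intervals that fit in a PTnHnM lifetime."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", lifetime_duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {lifetime_duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return (hours * 60 + minutes) // interval_mins


count = judgment_count("PT1H", 15)  # 60 minutes / 15 minutes per interval
```

For the stage above this gives four judgments; shortening the interval increases the number of chances for the stage to fail early, at the cost of fewer data points per judgment.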
6. Wiring Prometheus, Stackdriver, and Datadog as metric sources
Kayenta needs two account types: a metrics account (reads telemetry) and a storage account (S3/GCS/MinIO, persists configs and results). Enable the canary feature and add the metrics backend you run. With Halyard:
hal config canary enable
# Prometheus
hal config canary prometheus enable
hal config canary prometheus account add prometheus-prod \
--base-url http://prometheus.monitoring:9090
# Google Stackdriver (Cloud Monitoring)
hal config canary google enable
hal config canary google account add stackdriver-prod \
--project my-gcp-project --json-path /home/spinnaker/.gcp/sa.json \
--supported-types METRICS_STORE
# Datadog
hal config canary datadog enable
hal config canary datadog account add datadog-prod \
--base-url https://api.datadoghq.com
hal deploy apply
On the Operator, the same accounts live under spinnakerConfig.profiles.kayenta (rendered into kayenta.yml), with one account marked supportedTypes: [METRICS_STORE] for telemetry and a GCS/S3 account marked [OBJECT_STORE] for results.
The serviceType/type in a metric’s query (section 5) must match an enabled metrics account’s provider. A Datadog query against a Spinnaker that only has Prometheus enabled fails at runtime with an opaque error – enable the provider first, then author the canary config. Datadog API and app keys go in the Kayenta profile secrets, never in the canary config JSON.
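Because the mismatch only surfaces at runtime, a pre-flight check on the canary config can catch it earlier. The sketch below is an assumption-laden helper, not part of Kayenta: `unsupported_metrics` is a hypothetical name, and the set of enabled providers is something you would supply from your own configuration.

```python
from typing import Iterable, List


def unsupported_metrics(canary_config: dict, enabled_providers: Iterable[str]) -> List[str]:
    """Return names of metrics whose query.serviceType has no enabled provider."""
    enabled = set(enabled_providers)
    return [
        metric["name"]
        for metric in canary_config.get("metrics", [])
        if metric.get("query", {}).get("serviceType") not in enabled
    ]


config = {
    "metrics": [
        {"name": "error-rate", "query": {"serviceType": "prometheus"}},
        {"name": "apdex", "query": {"serviceType": "datadog"}},  # illustrative metric
    ]
}
problems = unsupported_metrics(config, enabled_providers=["prometheus"])
```

Running a check like this in CI against the canary configs in Git would flag the Datadog metric before a pipeline ever reaches the canary stage.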
7. Authorization with Fiat: pipeline permissions and audited gates
Fiat is Spinnaker’s authorization service. It resolves a user’s roles (LDAP, SAML, GitHub teams, Google Groups), then gates which accounts a user can deploy to and which applications they can read, write, or execute. Enable it alongside an auth provider:
hal config security authz enable
hal config security authz edit --type google # or ldap / github / file
hal config security authz google edit \
--admin-username admin@acme.com \
--credential-path /home/spinnaker/.gcp/fiat-sa.json \
--domain acme.com
hal deploy apply
Permissions are set per account (a WRITE role gating who can deploy to that account, enforced at execution time) and per application. The application EXECUTE permission is the key governance lever – it decides who can run a pipeline at all:
// Per-application permissions (Front50 application config)
{
"name": "checkout",
"permissions": {
"READ": ["dev-team", "sre"],
"WRITE": ["dev-team"],
"EXECUTE": ["prod-deployers"]
}
}
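The role check behind those permissions amounts to a set intersection. The following is a deliberately simplified model: it ignores Fiat details such as admin users and applications with no permissions defined, and `is_allowed` is a hypothetical function.

```python
from typing import Dict, Iterable, List


def is_allowed(user_roles: Iterable[str], permissions: Dict[str, List[str]], action: str) -> bool:
    """True when the user holds at least one role listed for the action."""
    required = permissions.get(action, [])
    return bool(set(user_roles) & set(required))


checkout_permissions = {
    "READ": ["dev-team", "sre"],
    "WRITE": ["dev-team"],
    "EXECUTE": ["prod-deployers"],
}

dev_can_run = is_allowed(["dev-team"], checkout_permissions, "EXECUTE")
deployer_can_run = is_allowed(["prod-deployers"], checkout_permissions, "EXECUTE")
```

Under this model a developer can edit the checkout pipeline but cannot run it, which is the separation of duties the permission block is meant to express.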
For an audited gate, combine a Manual Judgment stage with notifications so the approval is enforced and recorded. Echo emits the judgment event; route it to your audit sink:
{
"type": "manualJudgment",
"name": "Prod release sign-off",
"instructions": "Approve only with a linked CHG ticket. This action is audited.",
"judgmentInputs": [ { "value": "approve" }, { "value": "reject" } ],
"notifications": [
{
"type": "slack",
"address": "release-approvals",
"when": ["manualJudgment", "manualJudgmentContinue", "manualJudgmentStop"]
}
],
"failPipeline": true
}
A manual judgment only governs anything if EXECUTE on the application is restricted. If everyone can run the pipeline, the human gate is theater. Lock EXECUTE to the approver role, and the judgment becomes a real, attributable control – Echo records who clicked which option and when.
8. Triggers, templated pipelines, and rollback automation
Triggers start pipelines automatically. The canonical “promote on new image” pattern is a docker trigger (polled by Igor) firing on a new tag matching a regex:
"triggers": [
{
"type": "docker",
"account": "dockerhub",
"organization": "acme",
"repository": "acme/checkout",
"tag": "^v\\d+\\.\\d+\\.\\d+$",
"enabled": true
}
]
A jenkins trigger (fields: master, job) fires on CI completion, and a git trigger (source: github, project, slug, branch) fires on commit – both follow the same shape.
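It is worth testing a tag regex against real tag names before relying on it. The sketch below uses Python's `re` module as a stand-in for the trigger's regex evaluation, on the assumption that a pattern this simple behaves the same way; note that the doubled backslashes in the JSON are JSON escaping, so the effective pattern is `^v\d+\.\d+\.\d+$`. `triggering_tags` is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Effective pattern once the JSON string escapes are decoded.
TAG_PATTERN = re.compile(r"^v\d+\.\d+\.\d+$")


def triggering_tags(tags: Iterable[str]) -> List[str]:
    """Return only the tags that would match the trigger's regex."""
    return [tag for tag in tags if TAG_PATTERN.fullmatch(tag)]


matched = triggering_tags(["v1.2.3", "v1.2.3-rc1", "latest", "1.2.3", "v10.0.42"])
```

Release candidates, floating tags such as latest, and tags without the v prefix are all filtered out, so only full semantic-version releases start the pipeline.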
To avoid copy-pasting pipelines across dozens of services, use Managed Pipeline Templates (spin CLI). A template declares variables and a pipeline body; each service publishes a small configuration binding the variables:
# pipeline-template.yml
schema: "v2"
id: standard-promote
variables:
  - name: app
    type: string
  - name: targetAccount
    type: string
pipeline:
  name: "promote-${ templateVariables.app }"
  stages:
    - name: Canary
      type: kayentaCanary
      canaryConfig:
        canaryConfigId: "${ templateVariables.app }-aca"
        metricsAccountName: prometheus-prod
        storageAccountName: gcs-canary
        scoreThresholds: { marginal: "75", pass: "95" }
# Publish the template and a binding via the spin CLI
spin pipeline-templates save --file pipeline-template.yml
spin pipeline save --file checkout-config.json # binds app=checkout, targetAccount=gke-prod-us
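Rollback automation ties it together in two layers: a rollback.onFailure on the deploy stage handles a failed deploy, and an explicit undoRolloutManifest (Kubernetes) or rollbackServerGroup stage handles a post-deploy failure caught by canary or a verification webhook. Gate that stage on the canary result. For the rollback stage to run at all, the canary stage must be set to continue on failure (the "ignore the failure" option), because a stage that fails the pipeline halts execution before anything downstream starts. A stage that failed but was allowed to continue reports FAILED_CONTINUE rather than TERMINAL, so that is the status to test:
{
"type": "undoRolloutManifest",
"name": "Rollback on canary fail",
"account": "gke-prod-us",
"cloudProvider": "kubernetes",
"location": "checkout",
"manifestName": "deployment checkout",
"numRevisionsBack": 1,
"requisiteStageRefIds": ["aca-stage"],
"stageEnabled": {
"type": "expression",
"expression": "${ #stage('ACA: checkout').status.toString() == 'FAILED_CONTINUE' }"
}
}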
Add a post-deploy verification stage – a webhook to a synthetic test runner that must return success before the pipeline reports green. With waitForCompletion, Spinnaker polls the returned status URL until statusJsonPath matches a success status:
{
"type": "webhook",
"name": "Synthetic smoke",
"url": "https://synthetics.acme.com/run/checkout",
"method": "POST",
"waitForCompletion": true,
"statusJsonPath": "$.status",
"successStatuses": "SUCCEEDED",
"terminalStatuses": "FAILED"
}
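The polling behaviour can be modelled as a small loop. This is a sketch of the idea rather than Spinnaker's implementation: `poll_webhook`, its return values, and the timeout handling are assumptions, and `fetch_status` stands in for the HTTP call to the status URL.

```python
import time
from typing import Callable, Sequence


def poll_webhook(
    fetch_status: Callable[[], dict],
    success_statuses: Sequence[str] = ("SUCCEEDED",),
    terminal_statuses: Sequence[str] = ("FAILED",),
    max_polls: int = 60,
    interval_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status field reaches a success or terminal value."""
    for _ in range(max_polls):
        # Reading the "status" key stands in for evaluating statusJsonPath "$.status".
        status = fetch_status().get("status")
        if status in success_statuses:
            return "SUCCEEDED"
        if status in terminal_statuses:
            return "TERMINAL"
        sleep(interval_seconds)
    return "TIMED_OUT"
```

Any status that is in neither list keeps the loop going, which is why a test runner that reports an unexpected value can leave a stage running until it times out.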
Enterprise scenario
A platform team ran one Spinnaker controlling deploys to GKE (gke-prod-eu) and EKS (eks-prod-us) for a checkout service. Their canary looked correct – error-rate and p99-latency, pass: 95 – yet a release with a real EU-region regression scored 96 and promoted. Two compounding mistakes. First, the canary’s controlScope pointed at the existing production server group, not a freshly deployed baseline, so the canary (cold caches, new pods) was compared against a warm, hours-old fleet; the latency delta got written off as “new pod warmup” and absorbed into the score. Second, their PromQL had no region label, so the query aggregated both clusters – healthy US traffic statistically drowned the EU regression.
The fix: deploy a dedicated baseline of the current version at the same time and size as the canary (the standard control/experiment pattern), and scope every metric to one cluster via extendedScopeParams plus an explicit label binding. They also marked error-rate as critical so an error spike could fail the analysis alone, independent of the weighted score:
{
"name": "error-rate",
"query": {
"type": "prometheus",
"serviceType": "prometheus",
"customInlineTemplate": "PromQL:sum(rate(http_requests_total{job=\"checkout\",region=\"${scope}\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"checkout\",region=\"${scope}\"}[5m]))"
},
"groups": ["quality"],
"analysisConfigurations": { "canary": { "direction": "increase", "critical": true } },
"scopeName": "default"
}
With per-cluster scoping and a same-age baseline, the next bad EU release scored 41 and auto-rolled-back via the undoRolloutManifest stage before meaningful customer impact. The lesson every Kayenta adopter learns: the judge is only as honest as your scopes. Compare like with like, and scope each metric to exactly the population the canary serves.
Verify
Confirm providers, the canary pipeline, and the rollback path actually behave as configured.
# Providers and accounts are registered (Halyard)
hal config provider kubernetes account list
hal config provider aws account list
# Clouddriver actually sees the accounts (bypasses caching-lag confusion)
curl -s http://localhost:7002/credentials | jq '.[].name'
# Kayenta metrics + storage accounts are live
curl -s http://localhost:8090/credentials | jq '.[] | {name, types: .supportedTypes}'
# Fiat resolved a user's roles and account access
curl -s http://localhost:7003/authorize/<username> | jq '.accounts[].name'
# Validate the canary's PromQL in Prometheus BEFORE trusting the gate
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
--data-urlencode 'query=sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))/sum(rate(http_requests_total{job="checkout"}[5m]))' \
| jq '.data.result'
A healthy setup shows every account in both Clouddriver’s and Kayenta’s /credentials, Fiat returning only the accounts a user’s roles permit, and the canary PromQL returning a value when run directly. Then exercise the failure path on a non-prod app: deploy a deliberately broken canary, watch the kayentaCanary stage drop below marginal, and confirm undoRolloutManifest fires and traffic returns to the prior revision.
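When scripting the PromQL check, "returns a value" should mean more than an HTTP 200. A minimal sketch, assuming the standard Prometheus instant-query response shape for vector results; `query_has_data` is a hypothetical helper, and it treats NaN (for example, from dividing by a zero request rate) as no data.

```python
import math


def query_has_data(response: dict) -> bool:
    """True when an instant-query response holds at least one non-NaN sample."""
    if response.get("status") != "success":
        return False
    samples = response.get("data", {}).get("result", [])
    # Each vector sample carries its value as [timestamp, "string-number"].
    values = [float(sample["value"][1]) for sample in samples if "value" in sample]
    return any(not math.isnan(value) for value in values)


healthy = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {}, "value": [1700000000, "0.012"]}]},
}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
```

An empty result usually means a label in the query does not match any series, which is exactly the kind of silent scoping error that makes a canary look healthy.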
Pitfalls and next steps
The failures I see are rarely Spinnaker bugs. They are governance gaps and dishonest canaries. A manual judgment with open EXECUTE permissions is decorative; a canary comparing a cold new stack against a warm old one, or aggregating metrics across regions, will green-light regressions with a straight face. Lock execution to roles, deploy a real baseline, scope every query to the exact population under test – and validate that query in the metrics backend before a release depends on it.
Two extensions pay off quickly. Push all pipeline JSON and canary configs into Git and drive them through spin and Managed Pipeline Templates, so a new service inherits the governed promote path instead of reinventing it. And wire Echo’s events into your observability and change-management systems, so every deploy, judgment, and canary verdict is an auditable record – which is what turns Spinnaker from a deploy tool into a release-engineering control plane your auditors and on-call both trust.