Multi-Cloud Deployment Pipelines with Spinnaker and Automated Canary Analysis

Most teams reach for Spinnaker when one CI tool’s “deploy” step stops being enough: you are shipping the same artifact to a Kubernetes account in GCP, an EKS cluster in AWS, and a legacy ASG, each with its own approvers, metrics backend, and blast-radius rules. Spinnaker’s value is not that it deploys – anything deploys – it is that it makes promotion a governed, observable state machine. A pipeline bakes once, fans out across clouds, pauses on a manual judgment where policy demands a human, runs automated canary analysis against real telemetry, and rolls back on the evidence rather than on a pager. This article builds that pipeline end to end: providers, the stage model, the three deployment strategies, Kayenta wired to Prometheus/Stackdriver/Datadog, templated triggers, Fiat authorization, and rollback automation.

1. The microservice architecture and why it matters operationally

Spinnaker is a set of JVM microservices behind one UI. You cannot operate it well without knowing which service owns which failure mode.

Service	Responsibility	When it is the culprit
Deck	Web UI (single-page app served to the browser)	Nothing renders; check Gate connectivity
Gate	API gateway (all UI/CLI traffic)	401/403s, auth proxy issues
Orca	Orchestration engine; runs pipelines stage by stage	Pipelines stuck “RUNNING”, stage transitions
Clouddriver	All mutating calls to cloud providers; caches deployed resources	Deploys fail, stale infra in the UI
Front50	Persists applications, pipelines, projects, notifications	Pipelines vanish, save failures (S3/GCS bucket)
Rosco	Bakery; produces images via Packer, and Helm/Kustomize manifest bakes	Bake stage failures
Igor	Polls CI (Jenkins/Travis) and registries; emits triggers	Triggers do not fire on new images/builds
Echo	Eventing bus; notifications, webhooks, cron triggers	No Slack/email, scheduled pipelines silent
Fiat	Authorization (accounts, applications, roles)	Users see nothing, “not authorized” on accounts
Kayenta	Automated canary analysis (ACA)	Canary stage errors, metric queries fail

The single most useful operational fact: Clouddriver indexes the world on a cache cycle. A resource created out-of-band can be invisible until the next caching agent run, and a pipeline that “cannot find” a cluster is frequently caching lag, not a real failure. Scale Clouddriver and tune its cache intervals before you blame the cloud.
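
One way to keep caching lag from showing up as a false failure in your own tooling is to poll before giving up. The sketch below is an illustration only: lookup is a hypothetical callable you would supply (for example a wrapper around a Gate API call), and the attempt count and delay are arbitrary assumptions, not Spinnaker defaults.

```python
import time
from typing import Callable, Optional


def wait_until_cached(
    lookup: Callable[[], Optional[dict]],
    attempts: int = 5,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call lookup() until it returns a resource or the attempts run out.

    lookup is a hypothetical callable that returns None while the resource
    is not yet visible in Clouddriver's cache.
    """
    for attempt in range(attempts):
        resource = lookup()
        if resource is not None:
            return resource
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give the caching agents time to run
    return None
```

Treat a None result as "still not indexed after several cache cycles", which is a much stronger signal of a real problem than a single miss.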

2. Configuring cloud providers with Halyard or the Operator

There are two config models. Halyard (hal) is the original CLI that owns a halconfig and renders service settings; it is functional but now in maintenance, so new installs increasingly prefer the Spinnaker Operator (a Kubernetes operator that reconciles a SpinnakerService CR). Pick one and stay on it – mixing them corrupts state.

With Halyard, you enable providers and add accounts through the hal command tree, which edits the halconfig for you:

# Kubernetes provider: enable, then add one account per cluster/context
hal config provider kubernetes enable
hal config provider kubernetes account add gke-prod-us \
  --provider-version v2 \
  --context gke_my-project_us-central1_prod \
  --only-spinnaker-managed true

# AWS provider with a separate account (assume-role into a target account)
hal config provider aws enable
hal config provider aws account add aws-prod \
  --account-id 111122223333 \
  --assume-role role/spinnakerManaged \
  --regions us-east-1,eu-west-1

# Google (GCE) provider
hal config provider google enable
hal config provider google account add gcp-prod \
  --project my-gcp-project \
  --json-path /home/spinnaker/.gcp/spinnaker-sa.json

# Set the deployment topology to distributed on Kubernetes, then apply
hal config deploy edit --type distributed --account-name gke-prod-us
hal deploy apply

Inspect what Halyard will render before you apply, and keep the generated config under review:

hal config                      # show the full deployment config
hal config provider kubernetes account list
hal config generate             # render service .yml files without applying
hal deploy apply                # roll the config out to the running services

With the Operator, the same intent is a versioned CR you commit to Git. Accounts can live in the CR or in a kustomize-managed secret, and the operator reconciles drift:

apiVersion: spinnaker.io/v1alpha2
kind: SpinnakerService
metadata:
  name: spinnaker
spec:
  spinnakerConfig:
    config:
      version: 1.35.1
      providers:
        kubernetes:
          enabled: true
          accounts:
            - name: gke-prod-us
              context: gke_my-project_us-central1_prod
              onlySpinnakerManaged: true
        # aws / google providers follow the same accounts[] shape
      deploymentEnvironment:
        size: SMALL
        type: Distributed

Treat accounts as your multi-cloud boundary. An “account” is one credential into one cloud/cluster, and it is the unit Fiat authorizes against (section 7). Name them by environment and cloud (gke-prod-us, aws-prod-eu) so permissions and pipeline targeting read cleanly.

3. The application, pipeline, and stage model

Spinnaker’s top-level object is an application – a logical service that owns its clusters, firewalls, load balancers, and pipelines across every account. Inside it, a pipeline is an ordered graph of stages:

Bake – Rosco turns source (a Packer template, a Helm chart, or a Kustomize overlay) into an immutable artifact: an AMI, a GCE image, or a rendered Kubernetes manifest.
Deploy – Clouddriver creates a new server group (an ASG, a managed instance group, or a Kubernetes ReplicaSet) using a chosen strategy.
Manual Judgment – pauses until an authorized human selects continue/stop. Your governed gate.
Canary Analysis – runs Kayenta ACA and yields a score (sections 4-5).
Webhook – calls an external system (change-management API, synthetic test runner) and optionally polls for completion.

Pipelines are JSON. You edit most of it in the UI, but the JSON is the source of truth and what you template (section 6). A trimmed deploy-then-judge skeleton:

{
  "application": "checkout",
  "name": "promote-to-prod",
  "stages": [
    {
      "refId": "1",
      "type": "deployManifest",
      "name": "Deploy canary (GKE)",
      "account": "gke-prod-us",
      "cloudProvider": "kubernetes",
      "moniker": { "app": "checkout" },
      "source": "text",
      "manifests": [ "<rendered manifest injected by bake artifact>" ]
    },
    {
      "refId": "2",
      "requisiteStageRefIds": ["1"],
      "type": "manualJudgment",
      "name": "Release approval",
      "instructions": "Confirm canary score and change ticket before promotion.",
      "judgmentInputs": [ { "value": "promote" }, { "value": "halt" } ],
      "failPipeline": true
    }
  ]
}

requisiteStageRefIds is the DAG edge: stage 2 runs only after 1. Stages that do not depend on each other run in parallel, including several stages that list the same requisite – precisely how you fan a single bake out to multiple cloud accounts at once.
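
To see the fan-out concretely, the following sketch groups stages into "waves" from their refId and requisiteStageRefIds fields. It is a simplification for reading a pipeline, not how Orca schedules work (Orca starts each stage as soon as its own requisites finish), and the stage names are hypothetical.

```python
from typing import Dict, List


def execution_waves(stages: List[dict]) -> List[List[str]]:
    """Group stage refIds into waves; stages in one wave can run in parallel."""
    deps: Dict[str, set] = {
        s["refId"]: set(s.get("requisiteStageRefIds", [])) for s in stages
    }
    done: set = set()
    waves: List[List[str]] = []
    while len(done) < len(deps):
        # A stage is ready once every stage it depends on has completed.
        ready = sorted(r for r, d in deps.items() if r not in done and d <= done)
        if not ready:
            raise ValueError("cycle or unknown refId in requisiteStageRefIds")
        waves.append(ready)
        done.update(ready)
    return waves


stages = [
    {"refId": "bake"},
    {"refId": "deploy-gke", "requisiteStageRefIds": ["bake"]},
    {"refId": "deploy-eks", "requisiteStageRefIds": ["bake"]},
    {"refId": "judge", "requisiteStageRefIds": ["deploy-gke", "deploy-eks"]},
]
print(execution_waves(stages))
```

Both deploy stages list the same requisite and land in the same wave, while the judgment stage waits for both.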

4. Deployment strategies: red/black, rolling, and highlander

Spinnaker bakes the deployment pattern into the Deploy stage as a strategy on the server group. The three you must know differ entirely in how they treat the previous server group:

Strategy	What happens	Rollback	Cost during deploy
`redblack` (blue/green)	New server group created; once healthy, old one is disabled (removed from the LB) but kept	Re-enable the old group – instant	2x capacity until cleanup
`rollingredblack`	New group is scaled up and old group scaled down in configured increments	Reverse the roll; partial exposure	~1x + increment
`highlander`	New server group created; once healthy, old one is destroyed	None automatic – old group is gone	Briefly 2x at cutover, then 1x; cheapest

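A red/black Kubernetes deploy stage. For deployManifest the strategy is set under trafficManagement.options; maxRemainingAsgs and rollback.onFailure come from the server-group deploy stage (AWS, GCE), so confirm that your Spinnaker version honors them for manifests before relying on them:

{
  "type": "deployManifest",
  "name": "Deploy prod (red/black)",
  "account": "gke-prod-us",
  "cloudProvider": "kubernetes",
  "maxRemainingAsgs": 2,
  "trafficManagement": {
    "enabled": true,
    "options": {
      "enableTraffic": true,
      "services": ["service checkout"],
      "strategy": "redblack"
    }
  },
  "rollback": { "onFailure": true }
}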

Choose by failure shape. Use redblack for instant rollback when you can afford double capacity briefly. Use rollingredblack when 2x capacity is too expensive or the workload cannot tolerate a sudden full cutover, accepting partial exposure mid-roll. Use highlander only where capacity is precious and fast rollback is not a requirement – it leaves you nothing to roll back to.

maxRemainingAsgs controls how many server groups Spinnaker keeps in the cluster after a deploy, the new one included; anything older is cleaned up. Set it to at least 2 for red/black so the deploy that creates N+1 still leaves N as a rollback target.
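
The retention rule is easy to reason about with a small model. This sketch assumes cleanup keeps the newest groups up to the limit and that the limit counts the group just deployed; the server group names are hypothetical.

```python
from typing import List, Tuple


def split_by_max_remaining(
    server_groups: List[str], max_remaining: int
) -> Tuple[List[str], List[str]]:
    """Return (kept, destroyed) for server groups listed oldest first.

    Assumption: the newest max_remaining groups survive, and that count
    includes the server group created by the current deploy.
    """
    if max_remaining < 1:
        raise ValueError("max_remaining must be at least 1")
    kept = server_groups[-max_remaining:]
    destroyed = server_groups[:-max_remaining]
    return kept, destroyed


kept, destroyed = split_by_max_remaining(
    ["checkout-v001", "checkout-v002", "checkout-v003"], max_remaining=2
)
print(kept, destroyed)
```

With a limit of 2, checkout-v002 survives as the rollback target for checkout-v003; with a limit of 1 nothing older survives, which is effectively highlander.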

5. Automated Canary Analysis with Kayenta

Kayenta compares a control (baseline) against an experiment (canary) – ideally two freshly deployed server groups of the same size, so the only variable is the code, not warm caches or instance age. It runs your queries against both, scores the divergence with a judge, and returns a numeric canary score the pipeline thresholds.

A canary config declares metrics, the groups they belong to, per-group weights, and the scoreThresholds that decide pass/marginal/fail:

{
  "name": "checkout-aca",
  "judge": { "name": "NetflixACAJudge-v1.0", "judgeConfigurations": {} },
  "metrics": [
    {
      "name": "error-rate",
      "query": {
        "type": "prometheus",
        "serviceType": "prometheus",
        "metricName": "http_requests_total",
        "customInlineTemplate": "PromQL:sum(rate(http_requests_total{job=\"checkout\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"checkout\"}[5m]))"
      },
      "groups": ["quality"],
      "analysisConfigurations": {
        "canary": { "direction": "increase", "critical": true }
      },
      "scopeName": "default"
    },
    {
      "name": "p99-latency",
      "query": {
        "type": "prometheus",
        "serviceType": "prometheus",
        "customInlineTemplate": "PromQL:histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"checkout\"}[5m])) by (le))"
      },
      "groups": ["latency"],
      "analysisConfigurations": { "canary": { "direction": "increase" } },
      "scopeName": "default"
    }
  ],
  "classifier": {
    "groupWeights": { "quality": 60, "latency": 40 },
    "scoreThresholds": { "marginal": 75, "pass": 95 }
  }
}

The mechanics that matter:

direction tells the judge which way is bad. increase fails the metric when the canary rises above baseline (error rate, latency); decrease for throughput; either for metrics that should track baseline both ways.
critical: true makes one failing metric fail the whole analysis regardless of weighted score – use it for error rate, never a noisy gauge.
groupWeights must sum to 100. The judge scores each group, weights them, and compares to scoreThresholds: below marginal fails, at/above pass succeeds, in between is marginal (treated as fail in a gated pipeline).
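
The weighting and threshold arithmetic can be sketched in a few lines. This is a deliberately simplified model: the real judge derives each group score from statistical comparisons of the metrics, whereas here the per-group scores are simply given as inputs, and the function names are illustrative.

```python
from typing import Dict


def weighted_score(
    group_scores: Dict[str, float], group_weights: Dict[str, float]
) -> float:
    """Combine per-group scores (0-100) using weights that sum to 100."""
    if sum(group_weights.values()) != 100:
        raise ValueError("groupWeights must sum to 100")
    return sum(group_scores[g] * w for g, w in group_weights.items()) / 100.0


def classify(
    score: float, marginal: float, pass_threshold: float, critical_failed: bool = False
) -> str:
    """Map a score onto fail / marginal / pass; a failed critical metric wins."""
    if critical_failed or score < marginal:
        return "fail"
    if score >= pass_threshold:
        return "pass"
    return "marginal"


score = weighted_score({"quality": 100, "latency": 50}, {"quality": 60, "latency": 40})
print(score, classify(score, marginal=75, pass_threshold=95))
```

With the weights from the config above, a perfect quality group and a half-failing latency group land at 80: above marginal, below pass, and therefore not a promotion in a gated pipeline.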

In the pipeline, the Canary Analysis stage runs this config repeatedly over a window, pointing each interval at the two scopes:

{
  "type": "kayentaCanary",
  "name": "ACA: checkout",
  "canaryConfig": {
    "metricsAccountName": "prometheus-prod",
    "storageAccountName": "gcs-canary",
    "canaryConfigId": "checkout-aca",
    "scopes": [
      {
        "scopeName": "default",
        "controlScope": "checkout-baseline",
        "experimentScope": "checkout-canary",
        "step": 60,
        "extendedScopeParams": { "type": "cluster" }
      }
    ],
    "scoreThresholds": { "marginal": "75", "pass": "95" },
    "lifetimeDuration": "PT1H",
    "analysisIntervalMins": 15
  }
}

That config samples for one hour (PT1H) in four 15-minute judgments, each scoring checkout-canary against checkout-baseline. If any judgment falls below marginal, the stage fails and – wired correctly – triggers rollback (section 8).
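
The interval arithmetic and the fail-fast rule can be checked with a short sketch. It parses only the PT#H#M subset of ISO-8601 durations, and it assumes the stage fails as soon as any interval drops below marginal and otherwise compares the last score with pass; verify that against the behavior of your Spinnaker version.

```python
import re
from typing import List


def duration_minutes(iso: str) -> int:
    """Parse the small ISO-8601 subset PT#H#M (hours and/or minutes)."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", iso)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {iso}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def interval_count(lifetime: str, interval_mins: int) -> int:
    """Number of whole analysis intervals that fit in the lifetime."""
    return duration_minutes(lifetime) // interval_mins


def stage_outcome(scores: List[float], marginal: float, pass_threshold: float) -> str:
    """Fail on any interval below marginal; otherwise judge the last score."""
    for score in scores:
        if score < marginal:
            return "fail"
    return "pass" if scores and scores[-1] >= pass_threshold else "fail"


print(interval_count("PT1H", 15))
print(stage_outcome([96, 97, 60, 99], marginal=75, pass_threshold=95))
```

The second call fails even though the final score is 99, because the third interval fell below marginal.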

6. Wiring Prometheus, Stackdriver, and Datadog as metric sources

Kayenta needs two account types: a metrics account (reads telemetry) and a storage account (S3/GCS/MinIO, persists configs and results). Enable the canary feature and add the metrics backend you run. With Halyard:

hal config canary enable

# Prometheus
hal config canary prometheus enable
hal config canary prometheus account add prometheus-prod \
  --base-url http://prometheus.monitoring:9090

# Google Stackdriver (Cloud Monitoring)
hal config canary google enable
hal config canary google account add stackdriver-prod \
  --project my-gcp-project --json-path /home/spinnaker/.gcp/sa.json
hal config canary google edit --stackdriver-enabled true

# Datadog
hal config canary datadog enable
hal config canary datadog account add datadog-prod \
  --base-url https://api.datadoghq.com

hal deploy apply

On the Operator, the same accounts live under spinnakerConfig.profiles.kayenta (rendered into kayenta.yml), with one account marked supportedTypes: [METRICS_STORE] for telemetry and a GCS/S3 account marked [OBJECT_STORE] for results.

The serviceType/type in a metric’s query (section 5) must match an enabled metrics account’s provider. A Datadog query against a Spinnaker that only has Prometheus enabled fails at runtime with an opaque error – enable the provider first, then author the canary config. Datadog API and app keys go in the Kayenta profile secrets, never in the canary config JSON.
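
A cheap pre-flight check catches the mismatch before runtime. The sketch below is an assumption-laden illustration: it takes a canary config as a parsed dict and a list of provider names you know to be enabled, and reports metrics whose query targets anything else. The apdex metric is hypothetical.

```python
from typing import Iterable, List


def unmatched_metrics(canary_config: dict, enabled_providers: Iterable[str]) -> List[str]:
    """List 'metric: provider' pairs whose provider is not enabled."""
    enabled = set(enabled_providers)
    problems = []
    for metric in canary_config.get("metrics", []):
        query = metric.get("query", {})
        provider = query.get("serviceType") or query.get("type")
        if provider not in enabled:
            problems.append(f"{metric.get('name')}: {provider}")
    return problems


config = {
    "metrics": [
        {"name": "error-rate", "query": {"type": "prometheus", "serviceType": "prometheus"}},
        {"name": "apdex", "query": {"type": "datadog", "serviceType": "datadog"}},
    ]
}
print(unmatched_metrics(config, ["prometheus"]))
```

Running this in CI against every canary config in Git turns an opaque runtime error into a readable review comment.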

7. Authorization with Fiat: pipeline permissions and audited gates

Fiat is Spinnaker’s authorization service. It resolves a user’s roles (LDAP, SAML, GitHub teams, Google Groups), then gates which accounts a user can deploy to and which applications they can read, write, or execute. Enable it alongside an auth provider:

hal config security authz enable
hal config security authz edit --type google   # or file / ldap / github
hal config security authz google edit \
  --admin-username admin@acme.com \
  --credential-path /home/spinnaker/.gcp/fiat-sa.json \
  --domain acme.com
hal deploy apply

Permissions are set per account (a WRITE role gating who can deploy to that account, enforced at execution time) and per application. The application EXECUTE permission is the key governance lever – it decides who can run a pipeline at all:

// Per-application permissions (Front50 application config)
{
  "name": "checkout",
  "permissions": {
    "READ":    ["dev-team", "sre"],
    "WRITE":   ["dev-team"],
    "EXECUTE": ["prod-deployers"]
  }
}

For an audited gate, combine a Manual Judgment stage with notifications so the approval is enforced and recorded. Echo emits the judgment event; route it to your audit sink:

{
  "type": "manualJudgment",
  "name": "Prod release sign-off",
  "instructions": "Approve only with a linked CHG ticket. This action is audited.",
  "judgmentInputs": [ { "value": "approve" }, { "value": "reject" } ],
  "notifications": [
    {
      "type": "slack",
      "address": "release-approvals",
      "when": ["manualJudgment", "manualJudgmentContinue", "manualJudgmentStop"]
    }
  ],
  "failPipeline": true
}

A manual judgment only governs anything if EXECUTE on the application is restricted. If everyone can run the pipeline, the human gate is theater. Lock EXECUTE to the approver role, and the judgment becomes a real, attributable control – Echo records who clicked which option and when.
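
The role check itself is a set intersection. This sketch models only that core idea, using the permissions block shown earlier; it ignores Fiat admins, unrestricted applications, and any fallback rules your installation may configure, and the function name is illustrative.

```python
from typing import Dict, Iterable, List


def can(user_roles: Iterable[str], permissions: Dict[str, List[str]], action: str) -> bool:
    """True if any of the user's roles is listed for the action."""
    allowed = set(permissions.get(action, []))
    return bool(allowed & set(user_roles))


checkout_permissions = {
    "READ": ["dev-team", "sre"],
    "WRITE": ["dev-team"],
    "EXECUTE": ["prod-deployers"],
}

print(can(["dev-team"], checkout_permissions, "EXECUTE"))
print(can(["prod-deployers"], checkout_permissions, "EXECUTE"))
```

A developer who can edit the pipeline still cannot run it, which is the separation that makes the judgment stage meaningful.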

8. Triggers, templated pipelines, and rollback automation

Triggers start pipelines automatically. The canonical “promote on new image” pattern is a docker trigger (polled by Igor) firing on a new tag matching a regex:

"triggers": [
  {
    "type": "docker",
    "account": "dockerhub",
    "organization": "acme",
    "repository": "acme/checkout",
    "tag": "^v\\d+\\.\\d+\\.\\d+$",
    "enabled": true
  }
]
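
It is worth testing the tag regex before a release depends on it. The backslashes are doubled only because of JSON string escaping; the sketch below decodes the JSON string and tries the pattern against a few sample tags. Spinnaker evaluates the pattern with Java regex, and Python's engine is used here on the assumption that the two agree for a pattern this simple. The sample tags are hypothetical.

```python
import json
import re
from typing import List

# The "tag" value exactly as it appears in the pipeline JSON.
pattern_text = json.loads(r'"^v\\d+\\.\\d+\\.\\d+$"')
TAG_PATTERN = re.compile(pattern_text)


def matching_tags(tags: List[str]) -> List[str]:
    """Return the tags that would match the trigger's tag pattern."""
    return [t for t in tags if TAG_PATTERN.search(t)]


print(pattern_text)
print(matching_tags(["v1.2.3", "v1.2", "latest", "v10.0.1", "v1.2.3-rc1"]))
```

Release candidates and floating tags such as latest do not match, so only full semantic-version tags start the pipeline.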

A jenkins trigger (fields: master, job) fires on CI completion, and a git trigger (source: github, project, slug, branch) fires on commit – both follow the same shape.

To avoid copy-pasting pipelines across dozens of services, use Managed Pipeline Templates (spin CLI). A template declares variables and a pipeline body; each service publishes a small configuration binding the variables:

# pipeline-template.yml
schema: "v2"
id: standard-promote
variables:
  - name: app
    type: string
  - name: targetAccount
    type: string
pipeline:
  name: "promote-${ templateVariables.app }"
  stages:
    - name: Canary
      type: kayentaCanary
      canaryConfig:
        canaryConfigId: "${ templateVariables.app }-aca"
        metricsAccountName: prometheus-prod
        storageAccountName: gcs-canary
        scoreThresholds: { marginal: "75", pass: "95" }

# Publish the template and a binding via the spin CLI
spin pipeline-templates save --file pipeline-template.yml
spin pipeline save --file checkout-config.json   # binds app=checkout, targetAccount=gke-prod-us

Rollback automation ties it together in two layers: a rollback.onFailure on the deploy stage handles a failed deploy, and an explicit undoRolloutManifest (Kubernetes) or rollbackServerGroup stage handles a post-deploy failure caught by canary or a verification webhook. A stage that ends TERMINAL stops everything downstream of it, so configure the canary stage to ignore the failure and continue (failPipeline: false, continuePipeline: true); it then reports FAILED_CONTINUE, and the rollback stage gates on that status:

{
  "type": "undoRolloutManifest",
  "name": "Rollback on canary fail",
  "account": "gke-prod-us",
  "cloudProvider": "kubernetes",
  "location": "checkout",
  "manifestName": "deployment checkout",
  "numRevisionsBack": 1,
  "requisiteStageRefIds": ["aca-stage"],
  "stageEnabled": {
    "type": "expression",
    "expression": "${ #stage('ACA: checkout').status.toString() == 'FAILED_CONTINUE' }"
  }
}

Add a post-deploy verification stage – a webhook to a synthetic test runner that must return success before the pipeline reports green. With waitForCompletion, Spinnaker polls the returned status URL until statusJsonPath matches a success status:

{
  "type": "webhook",
  "name": "Synthetic smoke",
  "url": "https://synthetics.acme.com/run/checkout",
  "method": "POST",
  "waitForCompletion": true,
  "statusJsonPath": "$.status",
  "successStatuses": "SUCCEEDED",
  "terminalStatuses": "FAILED"
}
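
The polling decision reduces to reading one field and comparing it with two lists. This sketch models that logic under simplifying assumptions: it supports only plain dotted paths such as $.status (no wildcards or array indexes), treats successStatuses and terminalStatuses as comma-separated strings, and uses illustrative verdict names.

```python
from typing import Any


def read_json_path(document: Any, path: str) -> Any:
    """Resolve a dotted path such as $.status or $.result.state."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    current = document
    for key in filter(None, path[1:].split(".")):
        current = current[key]
    return current


def poll_verdict(body: dict, status_json_path: str, success: str, terminal: str) -> str:
    """Classify one polled response as SUCCEEDED, TERMINAL, or RUNNING."""
    status = str(read_json_path(body, status_json_path))
    if status in {s.strip() for s in success.split(",")}:
        return "SUCCEEDED"
    if status in {s.strip() for s in terminal.split(",")}:
        return "TERMINAL"
    return "RUNNING"  # keep polling


print(poll_verdict({"status": "PENDING"}, "$.status", "SUCCEEDED", "FAILED"))
print(poll_verdict({"status": "FAILED"}, "$.status", "SUCCEEDED", "FAILED"))
```

Any status outside both lists keeps the stage polling, so make sure the test runner eventually reports one of the listed values or the stage will wait until it times out.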

Enterprise scenario

A platform team ran one Spinnaker controlling deploys to GKE (gke-prod-eu) and EKS (eks-prod-us) for a checkout service. Their canary looked correct – error-rate and p99-latency, pass: 95 – yet a release with a real EU-region regression scored 96 and promoted. Two compounding mistakes. First, the canary’s controlScope pointed at the existing production server group, not a freshly deployed baseline, so the canary (cold caches, new pods) was compared against a warm, hours-old fleet; the latency delta got written off as “new pod warmup” and absorbed into the score. Second, their PromQL had no region label, so the query aggregated both clusters – healthy US traffic statistically drowned the EU regression.

The fix: deploy a dedicated baseline of the current version at the same time and size as the canary (the standard control/experiment pattern), and scope every metric to one cluster via extendedScopeParams plus an explicit label binding. They also marked error-rate as critical so an error spike could fail the analysis alone, independent of the weighted score:

{
  "name": "error-rate",
  "query": {
    "type": "prometheus",
    "serviceType": "prometheus",
    "customInlineTemplate": "PromQL:sum(rate(http_requests_total{job=\"checkout\",region=\"${scope}\",code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"checkout\",region=\"${scope}\"}[5m]))"
  },
  "groups": ["quality"],
  "analysisConfigurations": { "canary": { "direction": "increase", "critical": true } },
  "scopeName": "default"
}
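
It helps to render the template by hand for both scopes and read the two queries side by side. The sketch below assumes Kayenta replaces ${scope} with the control and experiment scope values from the stage; Python's string.Template happens to use the same placeholder syntax. The scope values are hypothetical, and whatever you pass must be a real value of the label you bind it to, otherwise both queries return nothing.

```python
from string import Template

# Query text from the metric above, without the "PromQL:" prefix.
ERROR_RATE = Template(
    'sum(rate(http_requests_total{job="checkout",region="${scope}",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="checkout",region="${scope}"}[5m]))'
)


def render(scope: str) -> str:
    """Return the query as it would look for one scope value."""
    return ERROR_RATE.substitute(scope=scope)


control_query = render("gke-prod-eu-baseline")      # hypothetical control scope
experiment_query = render("gke-prod-eu-canary")     # hypothetical experiment scope
print(control_query)
print(experiment_query)
```

If the two rendered queries are identical, or neither returns data in the metrics backend, the judge is comparing a population with itself or with nothing.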

With per-cluster scoping and a same-age baseline, the next bad EU release scored 41 and auto-rolled-back via the undoRolloutManifest stage before meaningful customer impact. The lesson every Kayenta adopter learns: the judge is only as honest as your scopes. Compare like with like, and scope each metric to exactly the population the canary serves.

Verify

Confirm providers, the canary pipeline, and the rollback path actually behave as configured.

# Providers and accounts are registered (Halyard)
hal config provider kubernetes account list
hal config provider aws account list

# Clouddriver actually sees the accounts (bypasses caching-lag confusion)
curl -s http://localhost:7002/credentials | jq '.[].name'

# Kayenta metrics + storage accounts are live
curl -s http://localhost:8090/credentials | jq '.[] | {name, types: .supportedTypes}'

# Fiat resolved a user's roles and account access
curl -s http://localhost:7003/authorize/<username> | jq '.accounts[].name'

# Validate the canary's PromQL in Prometheus BEFORE trusting the gate
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))/sum(rate(http_requests_total{job="checkout"}[5m]))' \
  | jq '.data.result'

A healthy setup shows every account in both Clouddriver’s and Kayenta’s /credentials, Fiat returning only the accounts a user’s roles permit, and the canary PromQL returning a value when run directly. Then exercise the failure path on a non-prod app: deploy a deliberately broken canary, watch the kayentaCanary stage drop below marginal, and confirm undoRolloutManifest fires and traffic returns to the prior revision.
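
The account check is easy to automate once the /credentials output is saved. This sketch assumes the response is a JSON array of objects with a name field, as the jq filters above imply; the sample payload and expected account names are illustrative.

```python
import json
from typing import Iterable, List


def missing_accounts(credentials_json: str, expected: Iterable[str]) -> List[str]:
    """Return expected account names absent from a /credentials response."""
    present = {entry["name"] for entry in json.loads(credentials_json)}
    return sorted(set(expected) - present)


sample = '[{"name": "gke-prod-us"}, {"name": "aws-prod"}]'
print(missing_accounts(sample, ["gke-prod-us", "aws-prod", "gcp-prod"]))
```

An empty list means every expected account is registered; anything else names exactly which credential to investigate.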

Checklist

Pitfalls and next steps

The failures I see are rarely Spinnaker bugs. They are governance gaps and dishonest canaries. A manual judgment with open EXECUTE permissions is decorative; a canary comparing a cold new stack against a warm old one, or aggregating metrics across regions, will green-light regressions with a straight face. Lock execution to roles, deploy a real baseline, scope every query to the exact population under test – and validate that query in the metrics backend before a release depends on it.

Two extensions pay off quickly. Push all pipeline JSON and canary configs into Git and drive them through spin and Managed Pipeline Templates, so a new service inherits the governed promote path instead of reinventing it. And wire Echo’s events into your observability and change-management systems, so every deploy, judgment, and canary verdict is an auditable record – which is what turns Spinnaker from a deploy tool into a release-engineering control plane your auditors and on-call both trust.

Written by Vinod
