Why Cloud CLI Automation From Shell Has a Specific Set of Failure Modes
Your CI script lists S3 buckets across 12 regions, tags each one, runs nightly. It worked for six months. Then one night it ran 47 minutes instead of 4, you got a ThrottlingException from AWS, the script half-finished, and the next morning half your buckets are tagged and the other half aren’t.
Or: a deploy script reads gcloud auth list to confirm the right service account is active, then calls gcloud compute instances list. CI changed its environment, the active account is now default-runner@.. instead of deploy@.., and the script lists the wrong project’s instances and continues happily.
Or: an az vm list returns 2000 VMs across subscriptions, you forget pagination, the response is 80 MB, your jq pipeline OOMs the runner.
Cloud CLI automation has six specific failure modes:
| Failure mode | Symptom | Cost |
|---|---|---|
| Wrong credential resolved | Script operates on wrong account/project | Wrong resource modified, or auth fails confusingly |
| Unpaginated listing | Truncated results, “missing” resources | Half-tagged buckets, half-deleted instances |
| Rate limit hit | Throttling / 429 errors mid-run | Partial state, retries that re-do work |
| Wrong output format | Brittle parsers break on text/table output | Script fails with no clear error |
| Long-running session timeout | Token expires mid-paginate | “Unauthorized” 30 minutes into a 60-minute job |
| Region/zone defaulting | Script runs in default region; your resource is elsewhere | “Resource not found” in a region that has 0 of them |
This lesson is the cross-cloud pattern set. We treat AWS, Azure, and GCP as variations on the same problem, with a lib/cloud.sh that abstracts away the dialect differences.
The Credential Resolution Chain (All Three Clouds)
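Every cloud CLI resolves credentials via a chain: it tries each source in order and stops at the first one that yields credentials. It does not check that they are the credentials you intended, or that they still work, so a stale key early in the chain shadows a good one later. Knowing the chain lets you reason about which credential is actually in use.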
AWS credential chain
1. Environment vars: AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY (+ AWS_SESSION_TOKEN for STS)
2. AWS_PROFILE → ~/.aws/credentials [profile] section, plus ~/.aws/config
3. AWS_PROFILE with sso_session in ~/.aws/config → SSO cache in ~/.aws/sso/cache/*.json
4. AWS_WEB_IDENTITY_TOKEN_FILE + AWS_ROLE_ARN → STS AssumeRoleWithWebIdentity (IRSA, GitHub OIDC)
5. ECS task role (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)
6. EC2 IMDSv2 (instance metadata at http://169.254.169.254/latest/meta-data/iam/...)
The chain stops at the first source. If AWS_ACCESS_KEY_ID is set in your env, no other source is consulted, even if ~/.aws/credentials has different keys for the same profile. This is the most common confusion.
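To see which source is likely to win before a script touches anything, you can walk the same order yourself. The sketch below is a heuristic that only inspects environment variables in the order listed above; `aws_credential_source_hint` is a hypothetical helper name, and it does not reproduce the CLI's internal resolution (it cannot see SSO caches or instance metadata).

```shell
# Heuristic only: mirrors the order listed above using environment variables.
# It does not call AWS and cannot detect SSO caches or IMDS.
aws_credential_source_hint() {
  if [[ -n "${AWS_ACCESS_KEY_ID:-}" && -n "${AWS_SECRET_ACCESS_KEY:-}" ]]; then
    echo "env-keys"
  elif [[ -n "${AWS_PROFILE:-}" ]]; then
    echo "profile:${AWS_PROFILE}"
  elif [[ -n "${AWS_WEB_IDENTITY_TOKEN_FILE:-}" && -n "${AWS_ROLE_ARN:-}" ]]; then
    echo "web-identity"
  elif [[ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ]]; then
    echo "ecs-task-role"
  else
    echo "default-profile-or-imds"
  fi
}
```

Logging this hint next to the output of aws sts get-caller-identity makes the "env keys shadow the profile" case visible in CI logs.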
Azure credential chain (az CLI)
1. AZURE_CLIENT_ID + AZURE_CLIENT_SECRET + AZURE_TENANT_ID (service principal)
2. AZURE_CLIENT_ID + AZURE_USERNAME + AZURE_PASSWORD (resource owner password — discouraged)
3. Managed identity (when running in Azure VM/Functions/AKS)
4. ~/.azure/azureProfile.json (persisted login from `az login`)
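az login writes a token cache to ~/.azure/. Access tokens expire after roughly 60–90 minutes, so long CI jobs must refresh the login or authenticate as a service principal. Note that the environment variables in items 1 and 2 are read by the Azure SDKs, not by az itself: a script that calls az has to run az login --service-principal with those values before its first command.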
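For CI, a service-principal login can be wrapped in a small function. This is a minimal sketch, assuming the CI system provides AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID; `az_login_service_principal` is a hypothetical helper name. It points AZURE_CONFIG_DIR at a fresh directory so each job gets its own token cache instead of sharing ~/.azure/.

```shell
# Assumes AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID come from CI.
az_login_service_principal() {
  local var
  for var in AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID; do
    if [[ -z "${!var:-}" ]]; then
      echo "az_login_service_principal: $var is not set" >&2
      return 1
    fi
  done
  # A private config dir keeps this job's token cache away from other jobs.
  AZURE_CONFIG_DIR="$(mktemp -d)" || return 1
  export AZURE_CONFIG_DIR
  az login --service-principal \
    --username "$AZURE_CLIENT_ID" \
    --password "$AZURE_CLIENT_SECRET" \
    --tenant "$AZURE_TENANT_ID" \
    --output none
}
```

Call it in the current shell (not inside `$(...)`) so the exported AZURE_CONFIG_DIR stays in effect for later az commands. The secret is passed as an argument here, so it is briefly visible in the process list on shared runners.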
GCP credential chain (gcloud)
1. GOOGLE_APPLICATION_CREDENTIALS → JSON key file path
2. gcloud config config-helper / active gcloud account
3. Compute metadata service (running on GCE/GKE)
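GOOGLE_APPLICATION_CREDENTIALS belongs to ADC (Application Default Credentials), which the SDKs and client libraries read; gcloud commands themselves run as the account activated with gcloud auth login or gcloud auth activate-service-account. For ADC, use gcloud auth application-default login for humans and a service-account key file for automation.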
The “is the right credential active?” check
Always run a “who am I?” probe at the top of any script that uses cloud APIs:
# AWS
aws sts get-caller-identity --output json
# {
# "UserId": "AIDAEXAMPLE",
# "Account": "123456789012",
# "Arn": "arn:aws:iam::123456789012:user/build-bot"
# }
# Azure
az account show -o json
# {
# "id": "00000000-0000-0000-0000-000000000000",
# "user": { "name": "build-bot@example.onmicrosoft.com", "type": "servicePrincipal" }
# }
# GCP
gcloud config list --format=json
# {
# "core": {
# "account": "build-bot@my-project.iam.gserviceaccount.com",
# "project": "my-project"
# }
# }
A short preflight for the top of every cloud script:
preflight_aws() {
  local got_account got_arn want_account="${1:-}"
  # If the call fails (no credentials, expired token), read sees empty input and returns non-zero.
  if ! read -r got_account got_arn < <(aws sts get-caller-identity --query '[Account,Arn]' --output text); then
    echo "ERROR: could not resolve an AWS identity" >&2
    exit 1
  fi
  echo "AWS account: $got_account, identity: $got_arn"
  if [[ -n "$want_account" && "$got_account" != "$want_account" ]]; then
    echo "ERROR: expected account $want_account, got $got_account" >&2
    exit 1
  fi
}
preflight_aws 123456789012 # fail fast if the wrong account is active
This is the single highest-leverage pattern in this lesson. Production catastrophes from “wrong account active” are entirely preventable.
Output Format Discipline: Always JSON for Automation
The three CLIs ship with different default output formats, and any operator can change the default in their own config, so automation should never rely on it:
| CLI | Default | JSON | Set globally |
|---|---|---|---|
| aws | JSON, unless a profile or AWS_DEFAULT_OUTPUT overrides it | --output json | export AWS_DEFAULT_OUTPUT=json or ~/.aws/config |
| az | JSON | -o json (default) | az config set core.output=json |
| gcloud | YAML or table text, depending on the command | --format=json | gcloud config set core/format json |
Rule: always pass --output json (or equivalent) explicitly. Even if it’s the default, future versions or different operator profiles can change the default and your script will silently start emitting tables that break parsers.
# WRONG: relies on default; breaks if user has 'output=table' set.
aws ec2 describe-instances | grep i-
# RIGHT: explicit format, parse with jq.
aws ec2 describe-instances --output json \
| jq -r '.Reservations[].Instances[].InstanceId'
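One way to make the rule hard to forget is a thin wrapper that appends the JSON flag for whichever CLI is being called. The sketch below assumes the caller does not pass a format flag of their own; `cloud_json` is a hypothetical helper name, not part of any CLI.

```shell
# cloud_json CLI ARGS...: run the command with that CLI's JSON flag appended.
# Assumes the caller has not already passed --output / -o / --format.
cloud_json() {
  case "${1:-}" in
    aws)    "$@" --output json ;;
    az)     "$@" --output json ;;
    gcloud) "$@" --format=json ;;
    *)
      echo "cloud_json: unsupported CLI '${1:-}'" >&2
      return 2
      ;;
  esac
}
```

Usage looks like `cloud_json aws ec2 describe-instances | jq -r '.Reservations[].Instances[].InstanceId'`; the parser downstream then never depends on the operator's configured default.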
Use server-side filtering when available
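AWS --filters is evaluated by the service, so only matching resources are returned and paged. --query (JMESPath, on both aws and az) runs on the client after the response has been downloaded: it trims what your script has to parse, but not what the API sends. gcloud uses --filter and --format; depending on the command, --filter may also be applied on the client. Prefer the server-side option where one exists, and use --query to project the fields you need.
# AWS: --filters runs server-side (running instances in one VPC); --query then projects just the IDs.
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" "Name=vpc-id,Values=vpc-12345" \
--query 'Reservations[].Instances[].InstanceId' \
--output json
# Azure: --query is client-side. powerState is only populated with -d/--show-details.
az vm list -d --query "[?powerState=='VM running'].name" -o json
# GCP: --filter uses gcloud's filter syntax (see gcloud topic filters); --format='value(...)' prints bare values.
gcloud compute instances list \
--filter='status:RUNNING AND zone:us-central1-a' \
--format='value(name)'
Server-side filters shrink the API response itself (on large accounts this can be 100x), which reduces pagination, throttling risk, and parse time. Client-side --query only reduces what you parse.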
Pagination: The Single Biggest Source of Silent Truncation
By default, all three CLIs paginate, but they paginate differently:
AWS pagination
aws CLI v2 automatically paginates and concatenates results into a single response unless you pass --no-paginate (first page only) or cap the output with --max-items. So aws ec2 describe-instances on a 5,000-instance account returns all 5,000 in one response (which can be 30+ MB). Two failure modes:
- Memory blowup in subsequent jq pipelines.
- Long wall-clock time: the CLI issues one API call per page, and with JSON output nothing is printed until the last page arrives, so a caller with a 60s timeout can kill it mid-run.
To process incrementally:
# Paginate manually: --max-items caps each invocation, and the CLI prints a
# NextToken that the next invocation passes back via --starting-token.
# (--no-paginate cannot be combined with --starting-token.)
list_all_instances() {
  local token="" out result_file="${1:-/tmp/instance-ids.txt}"
  : > "$result_file"
  while :; do
    if [[ -z "$token" ]]; then
      out=$(aws ec2 describe-instances --max-items 100 --output json) || return 1
    else
      out=$(aws ec2 describe-instances --max-items 100 --starting-token "$token" --output json) || return 1
    fi
    # Append IDs to the result file (process incrementally).
    jq -r '.Reservations[].Instances[].InstanceId' <<<"$out" >> "$result_file"
    token=$(jq -r '.NextToken // empty' <<<"$out")
    [[ -z "$token" ]] && break
  done
}
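Or keep auto-pagination and shrink each request with --page-size. (--max-items is different: it caps the total number of items returned and emits a NextToken for the rest, so on its own it truncates.)
# Smaller pages mean smaller individual API calls, which helps avoid per-call timeouts; the CLI still returns every item.
aws ec2 describe-instances --page-size 100 --output json \
| jq -r '.Reservations[].Instances[].InstanceId'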
Azure pagination
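az CLI follows continuation links and returns the full list by default. Pagination controls vary by command, so check az <command> --help: many newer commands accept --max-items together with --next-token, and some accept --top.
# Cap the number of items returned (only on commands that list --max-items in their help).
az vm list --max-items 1000 -o json
Where --top exists it means “max results”; the default everywhere is “all results.”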
GCP pagination
gcloud paginates differently per command. gcloud compute instances list paginates by default; --limit N caps results; --page-size N controls page size.
gcloud compute instances list --limit 5000 --page-size 500 --format=json
For very large lists, gcloud also supports --uri for “just give me the URLs” — light and fast.
Retry and Backoff: The Must-Have Wrapper
Every cloud API rate-limits. AWS surfaces ThrottlingException, Throttling, RequestLimitExceeded. Azure uses HTTP 429 with Retry-After. GCP uses 429 with quota errors.
Always wrap cloud CLI calls in retry-with-backoff. The CLIs sometimes have built-in retry but the defaults are conservative; explicit retry gives you control and visibility.
# Generic retry with exponential backoff.
cloud_retry() {
  local max=${CLOUD_RETRY_MAX:-5}
  local base=${CLOUD_RETRY_BASE_MS:-500}
  local attempt=1 wait_ms rc
  while (( attempt <= max )); do
    # Capture the exit code immediately. After `if cmd; then ...; fi` with no
    # branch taken, $? is 0, which would report every failure as success.
    "$@" && return 0
    rc=$?
    if (( attempt == max )); then
      echo "cloud_retry: giving up after $max attempts" >&2
      return "$rc"
    fi
    # Exponential backoff with jitter: base * 2^(attempt-1) ± 25%.
    wait_ms=$(( base * (2 ** (attempt - 1)) ))
    wait_ms=$(( wait_ms + (RANDOM % (wait_ms / 2)) - (wait_ms / 4) ))
    echo "attempt $attempt failed (rc=$rc); retrying in ${wait_ms}ms" >&2
    sleep "$(awk "BEGIN { printf \"%.3f\", $wait_ms / 1000 }")"
    attempt=$(( attempt + 1 ))
  done
}
# Usage:
cloud_retry aws s3api put-bucket-tagging --bucket my-bucket --tagging file://tags.json
For AWS, you can also leverage built-in retry config:
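# In ~/.aws/config or env:
export AWS_RETRY_MODE=adaptive # legacy | standard | adaptive
export AWS_MAX_ATTEMPTS=10
adaptive mode adds client-side rate limiting (a token bucket that adjusts to observed throttling) on top of standard mode's backoff. AWS documents it as experimental, so treat it as an option to test for steady-state bulk jobs rather than a guaranteed improvement; standard is the CLI v2 default.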
Bounded Parallelism: GNU parallel + xargs Patterns
You have 200 buckets to tag. Sequential = 200 × 200ms = 40s. Parallel with concurrency 10 = 4s. Parallel with no limit = throttled.
# WRONG: unbounded parallel; trips throttling.
aws s3api list-buckets --query 'Buckets[].Name' --output text \
| tr '\t' '\n' \
| xargs -P 0 -I {} aws s3api put-bucket-tagging --bucket {} --tagging file://tags.json
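# RIGHT: bounded concurrency. -P 10 = 10 parallel workers.
# xargs can only exec programs, so export the function and call it through bash -c.
export -f cloud_retry
aws s3api list-buckets --query 'Buckets[].Name' --output text \
| tr '\t' '\n' \
| xargs -P 10 -I {} bash -c 'cloud_retry aws s3api put-bucket-tagging --bucket "$1" --tagging file://tags.json' _ {}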
For more sophisticated patterns, GNU parallel:
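# Process per-region with up to 5 parallel jobs, retry on failure.
# Use a regional API here: s3api list-buckets is account-wide and would return the same list for every region.
aws ec2 describe-regions --query 'Regions[].RegionName' --output text \
| tr '\t' '\n' \
| parallel -j 5 --retries 3 'aws --region {} ec2 describe-instances --query "Reservations[].Instances[].InstanceId" --output json'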
--retries 3 retries on non-zero exit (basic; not throttle-aware). For throttle-aware retry, wrap your own retry function in the parallel command.
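A throttle-aware retry can be sketched by capturing stderr and retrying only when it looks like a throttling response. Everything here is an assumption for illustration: `is_throttle_error` and `retry_on_throttle` are hypothetical names, and the pattern list is a starting point to extend for the services you call, not an exhaustive catalogue.

```shell
# Returns 0 if the captured stderr text looks like a throttling response.
# The patterns are assumptions; extend them for the APIs you use.
is_throttle_error() {
  grep -Eqi 'Throttling|RequestLimitExceeded|TooManyRequests|Too Many Requests|(^|[^0-9])429([^0-9]|$)' <<<"$1"
}

# Retry "$@" only when it fails with a throttling error; any other failure
# (bad arguments, access denied) is returned immediately.
retry_on_throttle() {
  local max="${CLOUD_RETRY_MAX:-5}" base_ms="${CLOUD_RETRY_BASE_MS:-500}"
  local attempt=1 rc err err_file wait_ms
  err_file="$(mktemp)" || return 1
  while :; do
    if "$@" 2>"$err_file"; then
      rm -f "$err_file"
      return 0
    else
      rc=$?
    fi
    err="$(<"$err_file")"
    if [[ -n "$err" ]]; then
      printf '%s\n' "$err" >&2   # keep the original error visible
    fi
    if (( attempt >= max )) || ! is_throttle_error "$err"; then
      rm -f "$err_file"
      return "$rc"
    fi
    wait_ms=$(( base_ms * (2 ** (attempt - 1)) ))
    sleep "$(awk -v ms="$wait_ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
    attempt=$(( attempt + 1 ))
  done
}
```

The design choice is to fail fast on everything that is not throttling: retrying an access-denied error five times only delays the real diagnosis.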
Concurrency limit by API quota
AWS rate limits per-region per-API. Roughly: list APIs 100 req/s, mutating APIs 5–20 req/s. Keeping -P 10 for read-only listings is safe; for mutating, use -P 4 and add retries.
Azure rate limits per-subscription with reads at ~12,000/hour and writes at ~1,200/hour. Bursts of 100+ in seconds will hit throttling.
GCP rate limits per-project per-API; quotas are visible in gcloud compute project-info describe. For most CLI operations, -P 8 with retries is a safe baseline.
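The -P values above follow from simple arithmetic: sustained request rate is roughly workers divided by per-call latency, so the worker count that stays under a limit is about rate × latency. The sketch below turns that into a helper; `max_workers` is a hypothetical name, the 200 ms latency is the figure used earlier in this lesson, and real quotas vary by API, so treat the result as a starting point.

```shell
# max_workers RATE_PER_SEC LATENCY_MS [HEADROOM_PCT]
# workers ≈ rate × latency, scaled by a headroom percentage (default 100).
# Always returns at least 1.
max_workers() {
  local rate="$1" latency_ms="$2" headroom_pct="${3:-100}"
  local workers=$(( rate * latency_ms * headroom_pct / 100000 ))
  (( workers < 1 )) && workers=1
  echo "$workers"
}
```

For a mutating API limited to 20 req/s with 200 ms calls, `max_workers 20 200` prints 4, which matches the -P 4 guidance; passing a headroom of 50 halves that to leave room for other jobs sharing the account.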
A Drop-In Library: lib/cloud.sh
# lib/cloud.sh — cross-cloud helpers. Detects which CLI to use; wraps retry.
# ─── Configuration ─────────────────────────────────────────────────────────
: "${CLOUD_RETRY_MAX:=5}"
: "${CLOUD_RETRY_BASE_MS:=500}"
: "${CLOUD_PARALLEL:=8}"
# ─── Retry with exponential backoff + jitter ──────────────────────────────
cloud_retry() {
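# Defaults are repeated here because child shells (bash -c under xargs) do not inherit the unexported settings above.
local max="${CLOUD_RETRY_MAX:-5}" base="${CLOUD_RETRY_BASE_MS:-500}" attempt=1 wait_ms rc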
while (( attempt <= max )); do
"$@" && return 0
rc=$?
(( attempt == max )) && return "$rc"
wait_ms=$(( base * (2 ** (attempt - 1)) ))
wait_ms=$(( wait_ms + (RANDOM % (wait_ms / 2)) - (wait_ms / 4) ))
echo "[cloud_retry] attempt $attempt failed (rc=$rc); sleep ${wait_ms}ms" >&2
sleep "$(awk "BEGIN { printf \"%.3f\", $wait_ms / 1000 }")"
attempt=$(( attempt + 1 ))
done
}
# ─── Identity preflight (call at top of every cloud script) ────────────────
aws_whoami() {
aws sts get-caller-identity --query '[Account,Arn]' --output text
}
aws_assert_account() {
local want="$1" got
got=$(aws sts get-caller-identity --query 'Account' --output text)
if [[ "$got" != "$want" ]]; then
echo "AWS account mismatch: want=$want got=$got" >&2
return 1
fi
}
az_assert_subscription() {
local want="$1" got
got=$(az account show --query 'id' -o tsv)
if [[ "$got" != "$want" ]]; then
echo "Azure subscription mismatch: want=$want got=$got" >&2
return 1
fi
}
gcp_assert_project() {
local want="$1" got
got=$(gcloud config get-value project 2>/dev/null)
if [[ "$got" != "$want" ]]; then
echo "GCP project mismatch: want=$want got=$got" >&2
return 1
fi
}
# ─── Pagination wrappers ───────────────────────────────────────────────────
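# AWS: paginated foreach. Calls $cmd_func once per page, passing the page's JSON as $1.
# Pages are capped with --max-items (CLOUD_PAGE_ITEMS, default 100); the CLI's
# NextToken feeds --starting-token. --no-paginate cannot be combined with --starting-token.
aws_paginate() {
  local cmd_func="$1"; shift
  local token="" out args
  while :; do
    args=(--max-items "${CLOUD_PAGE_ITEMS:-100}" --output json)
    [[ -n "$token" ]] && args+=(--starting-token "$token")
    out=$(cloud_retry "$@" "${args[@]}") || return 1
    "$cmd_func" "$out" || return 1
    token=$(jq -r '.NextToken // empty' <<<"$out")
    [[ -z "$token" ]] && break
  done
}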
# ─── Bounded parallel apply ────────────────────────────────────────────────
# Read items from stdin, run "$@ <item>" under cloud_retry, max $CLOUD_PARALLEL concurrent.
# Each item is passed as an argument, never spliced into shell code.
cloud_parallel() {
  export -f cloud_retry
  xargs -P "$CLOUD_PARALLEL" -I {} bash -c 'cloud_retry "$@"' _ "$@" {}
}
# ─── Multi-region foreach (AWS) ────────────────────────────────────────────
aws_foreach_region() {
local cmd_func="$1"
local regions
regions=$(aws ec2 describe-regions \
--query 'Regions[].RegionName' --output text)
local region
for region in $regions; do
AWS_REGION="$region" "$cmd_func"
done
}
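# Parallel multi-region. Functions are exported so the child bash can call them;
# the region is passed as an argument and exported inside the child.
aws_foreach_region_parallel() {
  local cmd_func="$1"
  export -f cloud_retry "$cmd_func"
  aws ec2 describe-regions --query 'Regions[].RegionName' --output text \
    | tr '\t' '\n' \
    | xargs -P "$CLOUD_PARALLEL" -I {} bash -c 'export AWS_REGION="$1"; "$2"' _ {} "$cmd_func"
}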
# ─── Output validation ─────────────────────────────────────────────────────
require_jq() { command -v jq >/dev/null || { echo "jq required" >&2; exit 1; }; }
# Validate that JSON output has expected shape.
require_json_field() {
local input="$1" field="$2"
if ! jq -e "$field" <<<"$input" >/dev/null 2>&1; then
echo "missing required field: $field" >&2
return 1
fi
}
Real-World Recipes
Recipe 1: Inventory all S3 buckets across all regions
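. lib/cloud.sh
aws_assert_account 123456789012 || exit 1 # preflight
# list-buckets is account-wide: one call returns buckets from every region,
# so resolve each bucket's region separately instead of looping over regions.
bucket_region() {
  local name="$1" created="$2" region
  region=$(cloud_retry aws s3api get-bucket-location --bucket "$name" \
    --query 'LocationConstraint' --output text) || return 1
  [[ "$region" == "None" ]] && region="us-east-1" # a null constraint means us-east-1
  printf '%s\t%s\t%s\n' "$region" "$name" "$created"
}
export -f cloud_retry bucket_region
export CLOUD_RETRY_MAX CLOUD_RETRY_BASE_MS
aws s3api list-buckets --query 'Buckets[].[Name,CreationDate]' --output text \
  | xargs -P "$CLOUD_PARALLEL" -L 1 bash -c 'bucket_region "$1" "$2"' _ \
  | sort > all_buckets.tsv
echo "found $(wc -l < all_buckets.tsv) buckets"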
Recipe 2: Tag all running EC2 instances missing an Owner tag
list_untagged_running() {
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[?!not_null(Tags[?Key==`Owner`].Value | [0])].InstanceId' \
--output text \
| tr '\t' '\n'
}
tag_instance() {
  local id="$1"
  # Report success only if the call (after retries) actually succeeded.
  cloud_retry aws ec2 create-tags --resources "$id" \
    --tags Key=Owner,Value=unknown || return 1
  echo "tagged $id"
}
export -f cloud_retry tag_instance
list_untagged_running \
  | xargs -P 10 -I {} bash -c 'tag_instance "$1"' _ {}
The export -f pattern is needed so child shells (spawned by xargs) can see the function. Alternative: use parallel.
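The same pattern can be packaged once and reused. A minimal sketch, assuming bash is the child shell and input arrives one item per line; `apply_parallel` is a hypothetical helper name. Each item reaches the function as a positional argument rather than being substituted into the command string, so shell metacharacters in an item are not executed.

```shell
# apply_parallel FUNC [WORKERS]: run exported function FUNC once per stdin line,
# with at most WORKERS (default 4) running concurrently.
apply_parallel() {
  local func="$1" workers="${2:-4}"
  export -f "$func"
  # "$1" is the function name and "$2" is the item inside the child shell.
  xargs -P "$workers" -I {} bash -c '"$1" "$2"' _ "$func" {}
}
```

With the functions from this recipe, `list_untagged_running | apply_parallel tag_instance 10` would do the same job, provided cloud_retry is exported as well. Output order is not guaranteed when more than one worker runs.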
Recipe 3: Multi-cloud secret rotation (AWS Secrets Manager + Azure Key Vault)
# Note: both functions pass the new value on the command line, where it is
# briefly visible in the process list. Avoid shared runners for rotation jobs.
rotate_aws_secret() {
  local name="$1"
  local new_value
  new_value=$(openssl rand -base64 32)
  cloud_retry aws secretsmanager update-secret \
    --secret-id "$name" \
    --secret-string "$new_value" || return 1
  echo "rotated AWS secret: $name"
}
rotate_az_secret() {
  local vault="$1" name="$2"
  local new_value
  new_value=$(openssl rand -base64 32)
  cloud_retry az keyvault secret set \
    --vault-name "$vault" --name "$name" --value "$new_value" \
    -o none || return 1
  echo "rotated Azure secret: $vault/$name"
}
rotate_aws_secret "myapp/db-password"
rotate_az_secret "myapp-vault" "db-password"
Recipe 4: Cost-attribution: spend per tag for last month
# AWS Cost Explorer.
aws ce get-cost-and-usage \
--time-period "Start=$(date -d 'first day of last month' +%F),End=$(date -d 'first day of this month' +%F)" \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=TAG,Key=CostCenter \
--output json \
| jq -r '
.ResultsByTime[].Groups[]
| "\(.Keys[0])\t\(.Metrics.BlendedCost.Amount)\t\(.Metrics.BlendedCost.Unit)"
' \
| sort -t$'\t' -k2 -n -r
Recipe 5: Drift check: declared vs actual VM count
# Fail CI if a Terraform-managed deployment has unexpected drift.
expected=$(terraform output -json instance_ids | jq -r '.[]' | sort)
actual=$(aws ec2 describe-instances \
--filters "Name=tag:ManagedBy,Values=terraform" "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].InstanceId' --output text \
| tr '\t' '\n' | sort)
if ! diff <(echo "$expected") <(echo "$actual") >/dev/null; then
echo "DRIFT detected:"
diff <(echo "$expected") <(echo "$actual")
exit 1
fi
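When drift is detected, it helps to say in which direction. The sketch below uses comm to split the difference into IDs that are declared but not running and IDs that are running but not declared; `drift_report` is a hypothetical helper name. It assumes both inputs are sorted files with one ID per line, and it takes real files rather than process substitutions because each input is read twice.

```shell
# drift_report EXPECTED_FILE ACTUAL_FILE
# Both files must be sorted, one ID per line. Prints "missing<TAB>id" for IDs
# only in EXPECTED and "unexpected<TAB>id" for IDs only in ACTUAL.
# Returns 1 if any drift was found, 0 otherwise.
drift_report() {
  local expected="$1" actual="$2" report
  report="$(
    comm -23 "$expected" "$actual" | awk '{ print "missing\t" $0 }'
    comm -13 "$expected" "$actual" | awk '{ print "unexpected\t" $0 }'
  )"
  [[ -z "$report" ]] && return 0
  printf '%s\n' "$report"
  return 1
}
```

In the recipe above, that would mean writing `$expected` and `$actual` to temporary files first and then calling `drift_report "$expected_file" "$actual_file" || exit 1`.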
Footgun List
- Default region is implicit. AWS resolves it from AWS_REGION / AWS_DEFAULT_REGION / the profile's region and, for some commands, falls back to us-east-1. Always set it explicitly: aws --region us-west-2 ... or export AWS_REGION=us-west-2.
- AWS_PROFILE does not override AWS_ACCESS_KEY_ID. If the key env vars are set, a profile selected through AWS_PROFILE is ignored. Common when CI sets keys for one account and you export AWS_PROFILE=other thinking it switches. An explicit --profile flag on the command line does take precedence over the env keys.
- Output format is per-call, not per-session. --output json on one call doesn't apply to the next. Set AWS_DEFAULT_OUTPUT=json for the session.
- jq -r prints a JSON null as the literal text null, which then flows into later commands as if it were a value. Use jq -r '. // empty' to get empty output for null.
- aws s3 ls is not the same as aws s3api list-objects-v2. The first belongs to the high-level s3 command set, the second is the raw API. Different output formats, different pagination behavior.
- gcloud compute is regional/zonal. Without --zone or --region, gcloud often errors or asks interactively. In scripts, always specify.
- az login writes credentials to a global file. Two scripts running concurrently can race on token refresh. For parallel automation, log in as a service principal and give each job its own AZURE_CONFIG_DIR.
- CLI v2's auto-pager breaks scripts that trip its TTY detection. Disable it in CI with export AWS_PAGER="" (AWS_PAGER=cat also works).
- gcloud auth print-access-token returns a short-lived token. A token captured in a variable is not refreshed, so long-running scripts should fetch a new one before it expires.
- Rate limits are per region for AWS and per subscription for Azure. Hitting throttling in one region doesn't necessarily fail you in another, but if you parallelize across regions, you can hit per-account limits too.
- aws s3 sync and gsutil rsync do their own retry logic that you can't easily inspect. For reliable transfers at scale, prefer dedicated tools (or wrap aws s3 cp in cloud_retry for fine-grained control).
- Tagging APIs are eventually consistent. Tag a resource, immediately list-by-tag, and the list may not include the freshly tagged resource for several seconds. Don't rely on read-your-writes.
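The first footgun can be turned into a guard. A minimal sketch, assuming the script should refuse to run unless a region was chosen through the environment; `require_aws_region` is a hypothetical helper name, and it deliberately ignores any region stored in a profile so that the choice is always visible in the job definition.

```shell
# Fail unless a region is set explicitly in the environment.
# Exports both variable names so older and newer tooling agree on the value.
require_aws_region() {
  local region="${AWS_REGION:-${AWS_DEFAULT_REGION:-}}"
  if [[ -z "$region" ]]; then
    echo "require_aws_region: set AWS_REGION or AWS_DEFAULT_REGION explicitly" >&2
    return 1
  fi
  export AWS_REGION="$region" AWS_DEFAULT_REGION="$region"
  echo "$region"
}
```

Call it next to the identity preflight, for example `require_aws_region >/dev/null || exit 1`, so a missing region stops the script before the first API call.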
Quick-Reference Card
┌─ CREDENTIAL CHAIN PRIORITY ───────────────────────────────────────────┐
│ AWS: env > profile > SSO > IRSA > ECS task > IMDSv2 │
│ Azure: service-principal env > managed identity > az login cache │
│ GCP: GOOGLE_APPLICATION_CREDENTIALS > gcloud account > metadata │
└────────────────────────────────────────────────────────────────────────┘
┌─ PREFLIGHT (RUN AT TOP OF EVERY CLOUD SCRIPT) ────────────────────────┐
│ aws sts get-caller-identity │
│ az account show │
│ gcloud config list │
│ Assert account/subscription/project matches expected │
└────────────────────────────────────────────────────────────────────────┘
┌─ OUTPUT FORMAT (ALWAYS EXPLICIT) ─────────────────────────────────────┐
│ aws ... --output json │
│ az ... -o json │
│ gcloud ... --format=json (or --format='value(field)' for tab) │
│ Pipe to jq for transformations │
└────────────────────────────────────────────────────────────────────────┘
┌─ SERVER-SIDE FILTERING ───────────────────────────────────────────────┐
│ aws --filters "Name=tag:Env,Values=prod" │
│ aws --query 'Reservations[].Instances[].[InstanceId,Tags]' │
│ az --query "[?location=='eastus'].name" │
│ gcloud --filter='status:RUNNING' --format='value(name)' │
└────────────────────────────────────────────────────────────────────────┘
┌─ PAGINATION ──────────────────────────────────────────────────────────┐
│ AWS: --page-size N (auto-paginates) or --max-items + token loop │
│ Azure: --max-items N (auto-paginates by default) │
│ GCP: --limit N --page-size N │
│ ALWAYS process incrementally; large unpaginated responses OOM │
└────────────────────────────────────────────────────────────────────────┘
┌─ RETRY & BACKOFF ─────────────────────────────────────────────────────┐
│ AWS: AWS_RETRY_MODE=adaptive AWS_MAX_ATTEMPTS=10 │
│ Generic: cloud_retry wrapper with exponential + jitter │
│ Always handle: ThrottlingException, 429, RequestLimitExceeded │
└────────────────────────────────────────────────────────────────────────┘
┌─ PARALLELISM ─────────────────────────────────────────────────────────┐
│ Read-only listing: -P 10 (xargs) │
│ Mutating ops: -P 4 with retry │
│ Multi-region: parallel -j 5 over regions │
│ Per-region: rate limits separate; parallelize by region for scale │
└────────────────────────────────────────────────────────────────────────┘
┌─ CI-SAFE ENV ─────────────────────────────────────────────────────────┐
│ export AWS_PAGER="" disable v2 pager (breaks no-TTY) │
│ export AWS_DEFAULT_OUTPUT=json consistent format │
│ export AWS_REGION=us-west-2 explicit region │
│ export AWS_RETRY_MODE=adaptive better throttle handling │
│ unset AWS_PROFILE if using env keys (prevent profile interference) │
└────────────────────────────────────────────────────────────────────────┘
What’s Next
Cloud CLIs are the operator’s keyboard for the platform. The next layer wraps your shell scripts as proper Linux services, integrated with the system: timer-based scheduling, restart-on-failure, watchdogs, logging integration. The next lesson, Writing systemd Units That Wrap Shell Scripts Properly: Type, Restart, Hardening, Watchdogs, covers the Unit/Service/Timer file structure, choosing Type=simple vs oneshot vs notify, sandboxing with ProtectSystem and PrivateTmp, watchdog integration, and the difference between “the script ran” and “the service is healthy.”