Serving an open-weight model on AKS is where a lot of platform teams discover that “just add a GPU node pool” is three problems wearing a trenchcoat: capacity you cannot get, drivers that fight your container runtime, and a bill that keeps running at 3 a.m. because nothing scales to zero. KAITO (Kubernetes AI Toolchain Operator) closes most of that gap by treating an inference deployment as a declarative Workspace that provisions its own right-sized GPU nodes. This runbook walks the full path: picking GPU SKUs and getting quota, the driver decision, tainting and scheduling, installing KAITO, deploying a preset model, raising utilization with time-slicing and MIG, and clamping cost with scale-to-zero and consolidation. Everything below is real and tested against AKS on Kubernetes 1.30+.
1. Select GPU SKUs, request quota, and choose a capacity strategy
The SKU choice is downstream of the model. A 7B-class model in fp16 needs roughly 14-16 GB of VRAM just for weights, plus KV cache headroom; a 70B model in fp16 needs ~140 GB and forces you onto multi-GPU A100/H100 nodes. Map the model to the smallest VM family that holds it.
| VM family | GPU | Total VRAM | Typical fit |
|---|---|---|---|
| Standard_NC4as_T4_v3 | 1x T4 | 16 GB | Small models, quantized 7B, dev |
| Standard_NC24ads_A100_v4 | 1x A100 80GB | 80 GB | 7B-34B fp16, MIG candidate |
| Standard_NC96ads_A100_v4 | 4x A100 80GB | 320 GB | 70B fp16, tensor-parallel |
| Standard_ND96isr_H100_v5 | 8x H100 | 640 GB | Large models, high throughput |
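The sizing rule above (weights take parameters times bytes per parameter, then add KV cache headroom) can be sketched in a few lines. This is a rough estimate, not a capacity planner: the 15% headroom default, the helper names, and the idea of summing VRAM across GPUs (which presumes tensor parallelism) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


@dataclass(frozen=True)
class Sku:
    name: str
    gpus: int
    vram_per_gpu_gb: int

    @property
    def total_vram_gb(self) -> int:
        return self.gpus * self.vram_per_gpu_gb


# The four VM sizes from the table above.
SKUS = [
    Sku("Standard_NC4as_T4_v3", 1, 16),
    Sku("Standard_NC24ads_A100_v4", 1, 80),
    Sku("Standard_NC96ads_A100_v4", 4, 80),
    Sku("Standard_ND96isr_H100_v5", 8, 80),
]


def weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """VRAM for weights only: billions of params * bytes per param = GB."""
    return params_billion * BYTES_PER_PARAM[dtype]


def smallest_fit(params_billion: float, dtype: str = "fp16",
                 headroom: float = 0.15) -> Optional[Sku]:
    """Smallest SKU whose total VRAM holds weights plus assumed KV headroom."""
    need = weight_gb(params_billion, dtype) * (1 + headroom)
    for sku in sorted(SKUS, key=lambda s: s.total_vram_gb):
        if sku.total_vram_gb >= need:
            return sku
    return None  # nothing in the table is large enough


if __name__ == "__main__":
    for size in (7, 34, 70):
        sku = smallest_fit(size)
        print(size, weight_gb(size), sku.name if sku else "no fit")
```

Under these assumptions an fp16 7B model lands on the single A100 rather than the T4, because 14 GB of weights plus headroom does not fit in 16 GB; a 4-bit quantized 7B does fit the T4, which matches the table.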
GPU VM sizes are gated by a per-family vCPU quota that defaults to zero in most subscriptions. Check it and request increases before you write any YAML, because a quota ticket can take hours to days.
# What GPU quota do you actually have in this region/family?
# Match loosely on the GPU model; family display names vary in punctuation.
az vm list-usage --location eastus2 -o table \
  | grep -iE "A100|T4|H100"
# Request more cores for the A100 v4 family (cores, not VMs)
# Requires the quota CLI extension: az extension add --name quota
az quota update \
  --resource-name standardNCADSA100v4Family \
  --scope "/subscriptions/<sub-id>/providers/Microsoft.Compute/locations/eastus2" \
  --limit-object value=96 limit-object-type=LimitValue
Quota is per region and per family. Standard_NC24ads_A100_v4 and Standard_NC96ads_A100_v4 draw from the same standardNCADSA100v4Family bucket, but T4 and H100 are separate buckets. Plan headroom for the autoscaler: if a workspace needs two nodes and you only have quota for one, KAITO’s provisioning silently stalls.
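The headroom check is simple arithmetic: quota is counted in vCPUs, and the number embedded in an Azure VM size name is its vCPU count. A minimal sketch, assuming you read the limit and current usage from az vm list-usage yourself; the function names are hypothetical.

```python
import re


def vcpus_of(sku: str) -> int:
    """Read the vCPU count out of a VM size name, e.g. NC24ads -> 24."""
    match = re.match(r"Standard_[A-Z]+(\d+)", sku)
    if match is None:
        raise ValueError(f"unrecognized VM size: {sku}")
    return int(match.group(1))


def cores_needed(sku: str, nodes: int) -> int:
    """vCPUs the family quota must cover for this many nodes."""
    return vcpus_of(sku) * nodes


def quota_shortfall(limit: int, in_use: int, sku: str, nodes: int) -> int:
    """Cores missing to add `nodes` more VMs; 0 means the request fits."""
    return max(0, in_use + cores_needed(sku, nodes) - limit)


if __name__ == "__main__":
    # A pool capped at --max-count 4 of NC24ads needs the full 96-core limit.
    print(cores_needed("Standard_NC24ads_A100_v4", 4))
    # Two nodes against a 24-core limit: the second node can never arrive.
    print(quota_shortfall(24, 0, "Standard_NC24ads_A100_v4", 2))
```

Running the check against the pool's maximum node count, not its current count, is what surfaces the stall before the autoscaler hits it.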
On-demand vs. Spot. Inference that backs a user-facing API belongs on on-demand capacity. Spot GPUs are 60-90% cheaper but get evicted with 30 seconds’ notice, which is fine for batch scoring or async queues and ruinous for synchronous serving. A common split is on-demand for the steady-state replica and a Spot pool for burst, fronted by a queue that tolerates eviction.
2. The driver decision: managed GPU image vs. NVIDIA device plugin
AKS gives you two supported ways to get CUDA drivers and the Kubernetes device plugin onto GPU nodes. Picking one and not mixing them is the difference between a clean node and a nvidia-smi: command not found page.
Option A - Managed GPU image (recommended default). AKS ships a node image with the NVIDIA driver and device plugin pre-installed and lifecycle-managed by Microsoft. You opt in per node pool. Drivers are patched with node-image upgrades, so you do not own that toil.
# Create a GPU node pool using the AKS managed GPU image + driver
az aks nodepool add \
--resource-group rg-ml \
--cluster-name aks-inference \
--name gpua100 \
--node-vm-size Standard_NC24ads_A100_v4 \
--node-count 0 \
--enable-cluster-autoscaler --min-count 0 --max-count 4 \
--node-taints sku=gpu:NoSchedule \
--labels accelerator=nvidia gpu-sku=a100 \
--gpu-driver Install
The --gpu-driver Install flag (default on GPU SKUs) requests the managed driver. Set it to None only when you intend to manage drivers yourself with the NVIDIA GPU Operator.
Option B - NVIDIA GPU Operator / device plugin (you own drivers). When you need a specific driver version, MIG-aware management, DCGM metrics, or features ahead of the AKS image, skip the managed driver and install the operator via Helm. This is the path most teams take once they need MIG or time-slicing knobs.
# Pool with NO managed driver; operator will manage it
az aks nodepool add \
--resource-group rg-ml --cluster-name aks-inference \
--name gpuop --node-vm-size Standard_NC24ads_A100_v4 \
--node-count 0 --enable-cluster-autoscaler --min-count 0 --max-count 4 \
--node-taints sku=gpu:NoSchedule --labels accelerator=nvidia \
--gpu-driver None
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Quote the indexed --set keys so the shell does not treat [0] as a glob
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set operator.defaultRuntime=containerd \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/etc/containerd/config.toml' \
  --set-string 'daemonsets.tolerations[0].key=sku' \
  --set-string 'daemonsets.tolerations[0].operator=Equal' \
  --set-string 'daemonsets.tolerations[0].value=gpu' \
  --set-string 'daemonsets.tolerations[0].effect=NoSchedule'
Do not run the managed driver and the GPU Operator’s driver on the same node. They both try to load kernel modules and you get a node that flaps between Ready and NotReady. Pick A or B per pool.
Either way, the contract that matters to schedulers is the nvidia.com/gpu extended resource appearing on the node. Confirm it later in the Verify section.
3. Taint GPU pools and schedule with tolerations and nodeSelectors
GPU nodes are expensive; nothing that does not need a GPU should ever land on one. The pattern is a taint on the pool plus matching tolerations and a nodeSelector on the workloads.
The pool above carries sku=gpu:NoSchedule. A pod that wants the GPU must both tolerate the taint and select the label, and request the resource:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama }
  template:
    metadata:
      labels: { app: vllm-llama }
    spec:
      nodeSelector:
        accelerator: nvidia
        gpu-sku: a100
      tolerations:
        - key: sku
          operator: Equal
          value: gpu
          effect: NoSchedule
      containers:
        - name: server
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
          resources:
            limits:
              nvidia.com/gpu: 1 # whole-GPU request
            requests:
              cpu: "4"
              memory: 24Gi
The nvidia.com/gpu limit is integer and non-overcommittable by default - you cannot request 0.5. That constraint is exactly what time-slicing and MIG exist to relax (Step 6). KAITO writes these tolerations and selectors for you, but you will hand-author them for any non-KAITO sidecar.
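To make the three-part contract concrete (select the label, tolerate the taint, request a whole GPU), here is a simplified model of the scheduler's filtering. It is a sketch of the matching rules, not the scheduler's code: the dict shapes and the gpus / free_gpus fields are invented for illustration.

```python
from typing import Any, Dict


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """True if one toleration matches one taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect would have matched any effect
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        return key == "" or key == taint["key"]
    return (key == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(pod: Dict[str, Any], node: Dict[str, Any]) -> bool:
    """Apply nodeSelector, hard taints, and whole-GPU capacity in turn."""
    labels = node.get("labels", {})
    if any(labels.get(k) != v for k, v in pod.get("nodeSelector", {}).items()):
        return False
    for taint in node.get("taints", []):
        if taint["effect"] == "PreferNoSchedule":
            continue  # soft taint: a preference, not a filter
        if not any(tolerates(t, taint) for t in pod.get("tolerations", [])):
            return False
    # GPUs are integers: a pod asks for whole devices or none.
    return pod.get("gpus", 0) <= node.get("free_gpus", 0)


GPU_NODE = {
    "labels": {"accelerator": "nvidia", "gpu-sku": "a100"},
    "taints": [{"key": "sku", "value": "gpu", "effect": "NoSchedule"}],
    "free_gpus": 1,
}
VLLM_POD = {
    "nodeSelector": {"accelerator": "nvidia", "gpu-sku": "a100"},
    "tolerations": [{"key": "sku", "operator": "Equal",
                     "value": "gpu", "effect": "NoSchedule"}],
    "gpus": 1,
}
```

Dropping the toleration from VLLM_POD makes can_schedule return False even though the selector matches, which is the usual cause of a GPU pod stuck in Pending on a tainted pool.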
4. Install the KAITO operator and read the Workspace CRD
KAITO has two controllers: the workspace controller (reconciles Workspace objects into deployments) and the gpu-provisioner or Karpenter-based node controller (creates the GPU nodes a workspace needs). On AKS the cleanest install is the managed add-on, which wires identity and node provisioning for you.
# Enable the managed KAITO add-on (AI toolchain operator)
az aks update \
--resource-group rg-ml --name aks-inference \
--enable-ai-toolchain-operator
# The add-on creates the kube-system controllers and a federated identity.
kubectl get pods -n kube-system -l app.kubernetes.io/name=kaito
If you prefer Helm (self-managed, e.g. for non-AKS or pinned versions):
helm install kaito-workspace \
oci://mcr.microsoft.com/aks/kaito/workspace \
--namespace kaito-workspace --create-namespace
The Workspace CRD is the whole point. A minimal inference workspace looks like this:
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-llama-3-1-8b
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: llama-3-1-8b
inference:
  preset:
    name: "llama-3.1-8b-instruct"
Three fields carry the weight. resource.instanceType is the GPU VM the provisioner will create. resource.labelSelector tags the nodes so the inference deployment binds to them. inference.preset.name references a curated, validated model image - KAITO maintains presets (Llama, Phi, Mistral, Falcon, Qwen, and more) with the right runtime, GPU count, and serving args baked in, so you are not guessing tensor-parallel degree.
Presets encode the minimum GPU footprint. If you set an instanceType too small for the preset, the workspace condition reports the resource as insufficient rather than OOM-crashing at load time. Read status.conditions before assuming the model is wedged.
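Reading status.conditions is easy to script. The sketch below takes the JSON you would get from kubectl get workspace <name> -o json, already parsed into a dict, and lists every condition that is not True. The sample object, including its reason and message strings, is made up for illustration; only the ResourceReady and InferenceReady names come from the columns shown in this runbook.

```python
from typing import Any, Dict, List


def blocking_conditions(workspace: Dict[str, Any]) -> List[str]:
    """Return one line per condition whose status is not "True"."""
    lines = []
    for cond in workspace.get("status", {}).get("conditions", []):
        if cond.get("status") != "True":
            lines.append("{}: {} - {}".format(
                cond.get("type", "?"),
                cond.get("reason", "?"),
                cond.get("message", ""),
            ))
    return lines


# Hypothetical workspace object; the reason/message text is illustrative only.
SAMPLE = {
    "metadata": {"name": "workspace-llama-3-1-8b"},
    "status": {
        "conditions": [
            {"type": "ResourceReady", "status": "False",
             "reason": "exampleReason", "message": "example: node not ready"},
            {"type": "InferenceReady", "status": "False",
             "reason": "exampleReason", "message": "example: waiting on resource"},
        ]
    },
}

if __name__ == "__main__":
    for line in blocking_conditions(SAMPLE):
        print(line)
```

An empty list means every reported condition is True; anything else is the first place to look before restarting pods.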
5. Deploy a preset workspace and watch nodes get provisioned
Apply the workspace and follow the reconcile. The interesting part is that you never created a node pool for this - the provisioner does it on demand.
kubectl apply -f workspace-llama.yaml
# Watch the workspace march through ResourceReady -> InferenceReady
kubectl get workspace workspace-llama-3-1-8b -w
NAME                     INSTANCE                   RESOURCEREADY   INFERENCEREADY   AGE
workspace-llama-3-1-8b   Standard_NC24ads_A100_v4   False           False            20s
workspace-llama-3-1-8b   Standard_NC24ads_A100_v4   True            False            5m
workspace-llama-3-1-8b   Standard_NC24ads_A100_v4   True            True             9m
Behind those two booleans: the provisioner files a node claim, Azure brings up the A100 VM (3-6 minutes is normal), the managed driver lands, the device plugin advertises nvidia.com/gpu, then the inference pod pulls the (large) model image and loads weights into VRAM. The first deploy is slow because of the image pull and cold node; subsequent scale-ups reuse the warm image cache on existing nodes.
KAITO exposes the model behind a ClusterIP Service with an OpenAI-compatible API. Smoke-test it from inside the cluster:
kubectl run curl --rm -it --image=curlimages/curl --restart=Never -- \
curl -s http://workspace-llama-3-1-8b.default.svc.cluster.local:80/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama-3.1-8b-instruct","prompt":"AKS in one line:","max_tokens":32}'
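The same smoke test from application code needs nothing beyond the standard library. This is a minimal sketch that mirrors the curl call above; the function names are hypothetical, and it assumes the response follows the OpenAI completions shape (choices[0].text), which is what an OpenAI-compatible server is expected to return.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build the POST /v1/completions request without sending it."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str,
             max_tokens: int = 32, timeout: float = 60.0) -> str:
    """Send the request and return the first completion's text."""
    req = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

From a pod in the cluster, complete("http://workspace-llama-3-1-8b.default.svc.cluster.local:80", "llama-3.1-8b-instruct", "AKS in one line:") would exercise the same path as the curl command. The generous timeout is deliberate, since the first request after a cold start can be slow.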
6. Raise utilization with time-slicing and MIG
A single A100 80GB serving an 8B model leaves enormous capacity idle. Two mechanisms reclaim it; they are mutually exclusive on a given GPU.
Time-slicing lets multiple pods share one physical GPU by round-robining the CUDA context. There is no memory isolation - pods can OOM each other - so it suits dev, bursty low-traffic models, and CI. Configure it through the GPU Operator with a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4 # 1 physical GPU advertised as 4
# Point the operator at the config; "default" applies the a100 entry to GPU nodes without per-node labels
kubectl patch clusterpolicy/cluster-policy -n gpu-operator --type merge \
-p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"a100"}}}}'
After this, the node advertises 4x nvidia.com/gpu, so four pods each requesting one GPU schedule onto one card. Throughput per pod drops and tail latency rises under contention - measure it.
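Because time-slicing multiplies the advertised device count without partitioning memory, it is worth checking that the pods you plan to co-locate actually fit in VRAM together. A small sketch of that arithmetic; the per-pod memory figures are numbers you would measure yourself, and the function names are hypothetical.

```python
from typing import List


def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Devices the node reports once each GPU is sliced `replicas` ways."""
    return physical_gpus * replicas


def memory_overcommitted(pod_vram_gb: List[float], card_vram_gb: float) -> bool:
    """True if the co-located pods can collectively exceed the card's VRAM."""
    return sum(pod_vram_gb) > card_vram_gb


if __name__ == "__main__":
    print(advertised_gpus(1, 4))                      # one A100, replicas: 4
    print(memory_overcommitted([18.0] * 4, 80.0))     # 72 GB on an 80 GB card
    print(memory_overcommitted([24.0] * 4, 80.0))     # 96 GB on an 80 GB card
```

The scheduler sees only the advertised count, so the second case schedules without complaint and fails later with a CUDA out-of-memory error in whichever pod allocates last.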
MIG (Multi-Instance GPU) is the production answer on A100/H100. It hardware-partitions one GPU into isolated instances with dedicated memory and compute, so a noisy tenant cannot starve a neighbor. A100 80GB supports profiles like 1g.10gb (7 instances), 2g.20gb, 3g.40gb. Enable it via the operator’s MIG manager:
# Single MIG layout across the whole GPU (7 x 1g.10gb on A100)
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb --overwrite
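With the operator's MIG strategy set to mixed, the node then advertises nvidia.com/mig-1g.10gb: 7 and pods request that resource instead of nvidia.com/gpu (under the single strategy the slices are exposed as plain nvidia.com/gpu):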
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
Rule of thumb: MIG for multi-tenant serving where isolation and predictable SLOs matter; time-slicing for trusted, bursty, cost-sensitive dev. Never time-slice a customer-facing inference endpoint - one bad request pattern degrades every co-tenant.
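Choosing a MIG profile is the same fit-the-smallest-box exercise as choosing a SKU. The sketch below covers only the three A100 80GB profiles named above; the instance counts for 2g.20gb and 3g.40gb (3 and 2) are taken from NVIDIA's published MIG geometry, memory sizes are nominal, and the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

# (profile, nominal memory in GB, max instances per A100 80GB), smallest first.
A100_80GB_PROFILES: List[Tuple[str, int, int]] = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
]


def smallest_profile(need_gb: float) -> Optional[str]:
    """Smallest listed profile whose memory covers the requirement."""
    for name, mem_gb, _ in A100_80GB_PROFILES:
        if mem_gb >= need_gb:
            return name
    return None  # needs more than 40 GB: use the whole GPU instead


def tenants_per_gpu(profile: str) -> int:
    """How many isolated instances of this profile one card can host."""
    for name, _, count in A100_80GB_PROFILES:
        if name == profile:
            return count
    raise KeyError(profile)


def mig_resource(profile: str) -> str:
    """Extended resource name a pod requests under the mixed strategy."""
    return "nvidia.com/mig-" + profile
```

For example, a model needing about 16 GB maps to 2g.20gb, three of which fit on a card; leave some margin, since usable memory in a slice is slightly below the nominal figure.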
7. Scale to zero, cost guardrails, and pool consolidation
The cluster autoscaler scales a GPU pool down to its --min-count, so for GPU pools set that floor to 0 unless you are deliberately keeping a warm node. With KAITO’s provisioner, nodes follow the Workspace lifecycle and are released when the workspace is deleted; for hand-rolled pools, set the floor explicitly and tune the scale-down timer so a node does not idle at A100 prices.
az aks nodepool update \
--resource-group rg-ml --cluster-name aks-inference --name gpua100 \
--update-cluster-autoscaler --min-count 0 --max-count 4
# Aggressive scale-down so idle GPU nodes die quickly
az aks update --resource-group rg-ml --name aks-inference \
--cluster-autoscaler-profile \
scale-down-unneeded-time=5m \
scale-down-delay-after-add=5m \
skip-nodes-with-system-pods=false
Two cost traps to engineer around:
- Scale-to-zero adds cold-start latency. The first request after a scale-down pays 3-6 minutes for node boot, plus the image pull and weight load. If your SLO cannot absorb that, keep one warm on-demand replica on a small GPU and let Spot handle burst, or protect the warm replica with a PodDisruptionBudget and schedule a scale-up before peak.
- Stray pods pin nodes alive. The cluster autoscaler ignores DaemonSet pods when it evaluates a node, but a bare pod with no controller, a pod that a PodDisruptionBudget forbids evicting, or a long-grace-period pod blocks or delays scale-down. Audit with kubectl get pods -A --field-selector spec.nodeName=<node> before blaming the autoscaler.
For steady, predictable load, an Azure Reservation or savings plan on the GPU family cuts 30-60% off on-demand, but only commit once utilization is proven - reserving idle A100s is the most expensive mistake in this stack.
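The trade-off between always-on and scale-to-zero is easy to put numbers on. The sketch below counts billed GPU-node hours per week for a pool that runs only during busy hours, adding the scale-down timer and an optional pre-warm window to each day. The 10-hour weekday pattern and 15-minute pre-warm are example inputs, and the hourly rate is a placeholder you must replace with your own price.

```python
HOURS_PER_WEEK = 7 * 24


def billed_hours_per_week(busy_hours_per_day: float, busy_days: int,
                          scale_down_minutes: float = 5.0,
                          prewarm_minutes: float = 0.0) -> float:
    """Node-hours billed when the pool empties outside busy hours."""
    per_day = busy_hours_per_day + (scale_down_minutes + prewarm_minutes) / 60.0
    return min(float(HOURS_PER_WEEK), per_day * busy_days)


def savings_vs_always_on(billed_hours: float) -> float:
    """Fraction of an always-on week that is no longer paid for."""
    return 1.0 - billed_hours / HOURS_PER_WEEK


def weekly_cost(billed_hours: float, hourly_rate: float) -> float:
    """hourly_rate is a placeholder: look up your region's actual price."""
    return billed_hours * hourly_rate


if __name__ == "__main__":
    hours = billed_hours_per_week(10, 5, scale_down_minutes=5, prewarm_minutes=15)
    print(round(hours, 1), round(savings_vs_always_on(hours), 2))
```

This counts one node and ignores the minutes a node bills while booting, so treat the result as an upper bound on savings rather than a forecast.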
Verify
Walk these in order; each gates the next.
# 1. Node advertises the GPU resource (managed image or operator working)
kubectl get nodes -l accelerator=nvidia \
-o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
# 2. Driver actually loaded inside a pod on the node
kubectl run gpu-test --rm -it --restart=Never \
--overrides='{"spec":{"nodeSelector":{"accelerator":"nvidia"},"tolerations":[{"key":"sku","operator":"Equal","value":"gpu","effect":"NoSchedule"}]}}' \
--image=nvidia/cuda:12.4.1-base-ubuntu22.04 -- nvidia-smi
# 3. KAITO workspace fully ready
kubectl get workspace -o wide
kubectl describe workspace workspace-llama-3-1-8b | grep -A5 Conditions
# 4. Inference endpoint answers (latency sanity); uses a throwaway pod
time kubectl run curl-verify --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s http://workspace-llama-3-1-8b.default.svc.cluster.local/v1/models
For throughput, drive concurrent load and read vLLM’s own metrics. The serving container exposes Prometheus metrics on /metrics; the numbers that matter for capacity planning are vllm:num_requests_running, vllm:gpu_cache_usage_perc, and time-to-first-token. If KV cache usage pins at ~100% you are KV-bound and need a larger MIG slice (or MIG disabled), more VRAM, or a smaller context window - not more replicas.
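A quick way to apply that KV-bound rule is to scrape /metrics and look at the two gauges directly. This sketch parses the Prometheus text format by hand, assuming samples carry no trailing timestamps and that vllm:gpu_cache_usage_perc is reported as a fraction where 1 means 100%. The 0.95 threshold, the function names, and the sample payload are illustrative.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Keep the highest sample per metric name, ignoring labels and comments."""
    values: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        sample = float(value)
        values[name] = max(values.get(name, sample), sample)
    return values


def kv_bound(metrics: Dict[str, float], threshold: float = 0.95) -> bool:
    """True when KV cache usage sits at or above the threshold."""
    return metrics.get("vllm:gpu_cache_usage_perc", 0.0) >= threshold


# Illustrative payload, not captured from a real server.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama-3.1-8b-instruct"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama-3.1-8b-instruct"} 0.98
"""

if __name__ == "__main__":
    m = parse_metrics(SAMPLE)
    print(m["vllm:num_requests_running"], kv_bound(m))
```

A single scrape is only a snapshot; sample it repeatedly under sustained load before concluding the cache is the bottleneck.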
// Driver / OOM failure patterns from container logs (Log Analytics)
ContainerLogV2
| where TimeGenerated > ago(1h)
| where LogMessage has_any ("CUDA out of memory",
"NVML", "no CUDA-capable device",
"Failed to initialize NVML")
| summarize count() by PodName, tostring(LogMessage)
| order by count_ desc
Common failure modes
- no CUDA-capable device is detected: device plugin not running, or managed driver and operator are both installed. Check kubectl get pods -n kube-system | grep nvidia-device-plugin and ensure only one driver source per pool.
- CUDA out of memory at load time: instanceType too small for the preset, or KV cache too large. Move up a VM size or cap --max-model-len.
- Workspace stuck at ResourceReady: False: GPU quota exhausted in the family, or the instanceType is not available in the region. Re-check az vm list-usage.
- Node never scales down: a pod that cannot be evicted (no controller, or blocked by a PodDisruptionBudget) or a long terminationGracePeriod is pinning it. Inspect with --field-selector spec.nodeName.
Enterprise scenario
A fintech platform team ran an internal document-Q&A service on a 34B model behind a single, always-on Standard_NC96ads_A100_v4 pool (4x A100). The constraint was brutal economics: the service saw heavy traffic 08:00-18:00 on weekdays and near-zero otherwise, yet the four-GPU node ran 24x7 because the model spanned all four cards via tensor parallelism and could not scale below one node. Monthly GPU spend was dominated by ~110 idle hours a week.
They restructured around two ideas. First, they split traffic: the synchronous “ask a question” path stayed on a guaranteed on-demand workspace, while overnight bulk re-indexing moved to a Spot GPU pool feeding an async queue, tolerating eviction. Second - the bigger win - they decomposed the workload. The 34B summarizer genuinely needed multi-GPU, but the high-volume embedding model was a 1.5B that had been wastefully sharing the A100s. They carved the A100s into MIG 3g.40gb instances so the embedder ran in an isolated partition with guaranteed memory, freeing whole cards, and put the summarizer workspace on cluster-autoscaler scale-to-zero with a scheduled pre-warm at 07:45 so the first user of the day never hit a cold node.
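# Scheduled pre-warm: scale the workload up before business hours.
# Keep the autoscaler floor at zero so the pool can empty out after hours.
az aks nodepool update -g rg-ml --cluster-name aks-inference \
  --name gpua100 --update-cluster-autoscaler --min-count 0 --max-count 4
# CronJob scales the summarizer to one replica at 07:45 on weekdays; the pending
# pod makes the autoscaler add a GPU node. The schedule runs in the controller
# manager's time zone (usually UTC) unless spec.timeZone is set, and the job
# needs a ServiceAccount allowed to scale deployments. The pool only drains once
# the replica count drops back to 0 in the evening.
kubectl create cronjob prewarm --schedule="45 7 * * 1-5" \
  --image=bitnami/kubectl -- \
  kubectl scale deploy/vllm-summarizer --replicas=1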
The result: the summarizer paid for GPUs only during business hours, the embedder stopped stealing A100 capacity, and the Spot pool absorbed re-indexing at a fraction of on-demand cost. Net GPU spend fell roughly 55% with no change to user-facing latency, because the pre-warm hid every cold start behind the morning ramp.