When five product teams all want GPT-4o “by next sprint,” the failure mode is predictable: a sprawl of Cognitive Services accounts, public endpoints, hard-coded API keys in app settings, and a single noisy team draining the regional token quota at 9am every day. This guide builds the alternative — a shared Azure OpenAI platform with private networking, an API Management (APIM) gateway for auth and throttling, quota planning, and per-team chargeback — the way it’s done in a regulated enterprise.
The shared-service problem
Three constraints make Azure OpenAI different from a normal PaaS rollout:
- Quota is regional and per-model. Tokens-per-minute (TPM) and requests-per-minute (RPM) are allocated to your subscription per region and per model, and every deployment draws from that allocation. There is no infinite tap. If team A deploys gpt-4o with 300K TPM in East US, that draws from the same regional pool team B wants.
- 429s are a capacity signal, not a bug. Pay-as-you-go deployments throttle hard under burst. Without a gateway and a multi-region strategy, every consumer feels another consumer’s spike.
- Keys leak. A model that ships its own account key to every app is a credential-sprawl incident waiting to happen, and it makes chargeback impossible — you cannot attribute spend to a key everyone shares.
The platform pattern that solves all three: a small number of Azure OpenAI accounts (one or two per region for redundancy), private-only, fronted by one APIM instance that authenticates callers, enforces token budgets, load-balances across backends, and emits usage telemetry per consumer.
Mental model: Azure OpenAI accounts are capacity. APIM is the control plane — auth, quota, routing, observability. Application teams never touch the accounts directly; they hold an APIM subscription key (or, better, a token) and call the gateway.
Step 1 — Private-only Azure OpenAI accounts
Create the account, disable public access, and reach it only over a private endpoint. The kind is OpenAI; the SKU is S0.
RG=rg-aoai-platform
LOC=eastus
ACCT=aoai-platform-eus
az cognitiveservices account create \
--name "$ACCT" \
--resource-group "$RG" \
--location "$LOC" \
--kind OpenAI \
--sku S0 \
--custom-domain "$ACCT" \
--yes
A custom subdomain (--custom-domain) is mandatory for private endpoints and for Entra ID (AAD) token auth; the shared regional endpoint (for example eastus.api.cognitive.microsoft.com) supports neither. Now lock it down:
az resource update \
--resource-group "$RG" \
--name "$ACCT" \
--resource-type "Microsoft.CognitiveServices/accounts" \
--set properties.publicNetworkAccess=Disabled \
properties.networkAcls.defaultAction=Deny
Private endpoint and DNS
Next, make the account resolve to a private IP inside your hub-and-spoke network. Two pieces: a private endpoint in a subnet, and a Private DNS zone so the account’s FQDN resolves privately. The correct zone for Azure OpenAI is privatelink.openai.azure.com (other Cognitive Services kinds use privatelink.cognitiveservices.azure.com).
resource "azurerm_private_dns_zone" "aoai" {
  name                = "privatelink.openai.azure.com"
  resource_group_name = azurerm_resource_group.platform.name
}
resource "azurerm_private_dns_zone_virtual_network_link" "aoai" {
  name                  = "link-aoai-hub"
  resource_group_name   = azurerm_resource_group.platform.name
  private_dns_zone_name = azurerm_private_dns_zone.aoai.name
  virtual_network_id    = azurerm_virtual_network.hub.id
}
# One private endpoint only: declaring a second resource with the same name
# would collide in Azure, so the DNS zone group lives on this endpoint.
resource "azurerm_private_endpoint" "aoai" {
  name                = "pe-aoai-platform-eus"
  resource_group_name = azurerm_resource_group.platform.name
  location            = "eastus"
  subnet_id           = azurerm_subnet.privatelink.id
  private_service_connection {
    name                           = "psc-aoai"
    private_connection_resource_id = azurerm_cognitive_account.aoai.id
    subresource_names              = ["account"]
    is_manual_connection           = false
  }
  # Attach the zone to the endpoint so the A record is auto-created.
  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.aoai.id]
  }
}
Pitfall: if you centralize DNS in a connectivity-hub subscription (the CAF pattern), do not also create a private_dns_zone_group on the spoke. Link the zone once in the hub and resolve via the hub’s DNS. Two A records for the same FQDN in different zones is the single most common “it worked in dev, times out in prod” failure with private OpenAI.
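A quick way to catch the DNS pitfall is to check what a host inside the VNet actually resolves. The following is a minimal sketch using only the Python standard library; the helper names `resolve_ipv4` and `classify` are illustrative, not part of any Azure tooling.

```python
import ipaddress
import socket


def resolve_ipv4(fqdn: str) -> list[str]:
    """Return the distinct IPv4 addresses the local resolver gives for fqdn."""
    infos = socket.getaddrinfo(fqdn, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def classify(addresses: list[str]) -> str:
    """Classify a resolution result as private, public, mixed, or unresolved."""
    if not addresses:
        return "unresolved"
    flags = {ipaddress.ip_address(a).is_private for a in addresses}
    if flags == {True}:
        return "private"
    if flags == {False}:
        return "public"
    # A mix of private and public answers usually points at conflicting zones.
    return "mixed"
```

Run `classify(resolve_ipv4("aoai-platform-eus.openai.azure.com"))` from a host in a peered VNet. Anything other than "private" suggests the zone link or the A record is not what the endpoint expects.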
Step 2 — Deploy models and plan quota
A Cognitive Services account is empty until you create a deployment: a named instance of a model with allocated capacity. What --sku-capacity means depends on the SKU type.
az cognitiveservices account deployment create \
--resource-group "$RG" \
--name "$ACCT" \
--deployment-name gpt-4o-prod \
--model-name gpt-4o \
--model-version "2024-08-06" \
--model-format OpenAI \
--sku-name Standard \
--sku-capacity 100
For a Standard (pay-as-you-go) deployment, --sku-capacity is expressed in units of 1,000 TPM — so 100 here means 100K tokens-per-minute, drawn from your regional quota for that model. RPM is derived from TPM at a fixed ratio per model.
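The capacity arithmetic can be sketched in a few lines. The 1,000 TPM per unit comes from the paragraph above. The RPM ratio (6 RPM per 1,000 TPM) is an assumption used for illustration only, so check the quota page for your model before relying on it.

```python
import math

TOKENS_PER_CAPACITY_UNIT = 1_000  # from the text: 1 unit = 1,000 TPM on Standard
ASSUMED_RPM_PER_UNIT = 6          # assumption: the real ratio varies by model


def standard_capacity(sku_capacity: int) -> dict:
    """Translate a Standard --sku-capacity value into TPM and (assumed) RPM."""
    return {
        "tpm": sku_capacity * TOKENS_PER_CAPACITY_UNIT,
        "rpm": sku_capacity * ASSUMED_RPM_PER_UNIT,
    }


def units_needed(peak_rpm: int, avg_tokens_per_request: int) -> int:
    """Smallest --sku-capacity that covers both the token and the request rate."""
    by_tokens = math.ceil(peak_rpm * avg_tokens_per_request / TOKENS_PER_CAPACITY_UNIT)
    by_requests = math.ceil(peak_rpm / ASSUMED_RPM_PER_UNIT)
    return max(by_tokens, by_requests)
```

For example, `units_needed(120, 1500)` is token-bound, while a chatty workload of many small requests such as `units_needed(600, 100)` is bound by the request rate instead.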
PTU vs. pay-as-you-go
| Dimension | Standard (PayGo) | Provisioned (PTU) |
|---|---|---|
| Billing | Per token consumed | Per provisioned unit per hour |
| Latency | Variable under load | Predictable, reserved |
| 429 behavior | Throttles on burst vs. quota | Throttles only past provisioned throughput |
| Best for | Spiky / dev / unpredictable | Steady, latency-sensitive production |
| Commitment | None | Reservation gives the real discount |
Provisioned Throughput Units (PTUs) reserve dedicated capacity. You pay hourly whether or not you use it, so PTUs only pay off for predictable, sustained workloads — and the steep discount comes from a reservation (monthly or yearly), not the on-demand PTU price. Use Microsoft’s capacity calculator in the AI Foundry / portal to convert “this workload does ~X input + Y output tokens at Z RPM” into a PTU count; do not guess, because the minimum deployable PTU and the per-PTU throughput differ by model.
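Before opening the capacity calculator, a rough break-even check helps frame the conversation. This is a sketch under stated assumptions: every price passed in is a placeholder you supply from the current rate card, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def paygo_monthly_cost(input_tokens: float, output_tokens: float,
                       usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    """Monthly pay-as-you-go cost for a given token volume."""
    return (input_tokens / 1_000_000) * usd_per_1m_input \
        + (output_tokens / 1_000_000) * usd_per_1m_output


def ptu_monthly_cost(ptu_count: int, usd_per_ptu_hour: float,
                     hours: int = HOURS_PER_MONTH) -> float:
    """Monthly PTU cost: billed hourly whether or not the capacity is used."""
    return ptu_count * usd_per_ptu_hour * hours


def cheaper_option(paygo_usd: float, ptu_usd: float) -> str:
    """Name the cheaper billing model for the month."""
    return "ptu" if ptu_usd < paygo_usd else "paygo"
```

The comparison only covers cost. PTUs may still be the right call for a latency-sensitive path even when pay-as-you-go is cheaper on paper.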
A pragmatic split for a multi-team platform:
- PTU deployment for the latency-sensitive production path (gpt-4o-prod above, but --sku-name ProvisionedManaged).
- Standard deployment as spillover for burst and for dev/test, so a spike does not exhaust the reserved PTUs.
We will wire APIM to fail over from PTU to Standard in Step 5.
Step 3 — The gateway: API Management in front of OpenAI
APIM is the seam. Deploy it into the VNet (Developer or Premium SKU for VNet injection; Standard v2 supports VNet integration) so it can reach the private endpoint, then import the Azure OpenAI REST surface.
Give APIM a system-assigned managed identity and grant it the Cognitive Services OpenAI User role on the account — this is how the gateway calls OpenAI without any key:
APIM=apim-aoai-platform
APIM_MI=$(az apim show -g "$RG" -n "$APIM" --query identity.principalId -o tsv)
ACCT_ID=$(az cognitiveservices account show -g "$RG" -n "$ACCT" --query id -o tsv)
az role assignment create \
--assignee-object-id "$APIM_MI" \
--assignee-principal-type ServicePrincipal \
--role "Cognitive Services OpenAI User" \
--scope "$ACCT_ID"
Inside the API, the inbound policy authenticates to the backend with that managed identity and forwards the token as a bearer credential. The Azure AD audience for Cognitive Services is https://cognitiveservices.azure.com:
<inbound>
<base />
<authentication-managed-identity
resource="https://cognitiveservices.azure.com"
output-token-variable-name="aoai-token" />
<set-header name="Authorization" exists-action="override">
<value>@("Bearer " + (string)context.Variables["aoai-token"])</value>
</set-header>
</inbound>
Now no application — and no developer — ever holds an Azure OpenAI key. The key surface area of the whole platform collapses to APIM subscription keys, which we can rotate and attribute.
Step 4 — Per-team chargeback and token throttling
Model each consuming team as an APIM product with its own subscription. The subscription key identifies the caller; the product carries the policy (quota, priority). This is the unit of chargeback.
APIM ships a purpose-built policy for LLM cost control: llm-token-limit (and the matching emit policy below). It throttles on tokens, not just requests, which is what actually maps to OpenAI spend.
<inbound>
<base />
<!-- per-subscription token budget; counter keyed by the caller's sub key -->
<llm-token-limit
counter-key="@(context.Subscription.Id)"
tokens-per-minute="20000"
estimate-prompt-tokens="true"
remaining-tokens-header-name="x-tokens-remaining"
tokens-consumed-header-name="x-tokens-consumed" />
</inbound>
estimate-prompt-tokens="true" lets APIM enforce the limit before dispatching to the backend, returning a 429 from the gateway instead of burning model capacity. Each team’s product gets a different tokens-per-minute, which is your throttling-as-allocation lever: team A buys 20K TPM of platform budget, team B buys 60K.
Why this beats native quota: Azure OpenAI’s own TPM quota is per-deployment and blind to who is calling. APIM’s token limit is per-subscription, so one team can never starve another even though they share the same backend deployment.
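To see why a per-subscription counter isolates teams, here is a deliberately simplified model: a fixed one-minute window keyed by subscription id. It is not APIM's actual algorithm, and `TokenLimiter` is a hypothetical name used only for this sketch.

```python
from collections import defaultdict


class TokenLimiter:
    """Fixed one-minute token window, one counter per subscription id."""

    def __init__(self, limits: dict[str, int]):
        self.limits = limits          # subscription id -> tokens per minute
        self.used = defaultdict(int)  # (subscription id, minute index) -> tokens used

    def try_consume(self, subscription_id: str, tokens: int, now_s: float) -> tuple[int, int]:
        """Return (status_code, remaining_tokens) for a request arriving at now_s."""
        window = (subscription_id, int(now_s // 60))
        remaining = self.limits[subscription_id] - self.used[window]
        if tokens > remaining:
            # Rejected at the gateway: the backend never sees this request.
            return 429, remaining
        self.used[window] += tokens
        return 200, remaining - tokens
```

Because the counter key includes the subscription id, team A exhausting its 20K budget leaves team B's 60K untouched even though both hit the same deployment.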
Step 5 — Load-balancing across regions to dodge 429s
The point of multiple accounts (East US + Sweden Central, say) is resilience against regional throttling. APIM expresses this with a backend pool plus a retry policy. Define backends with circuit-breaker rules, group them into a pool, and let APIM round-robin / fail over.
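# Create one backend per region/deployment through the ARM REST API (az rest).
# Circuit-breaker rules go under properties.circuitBreaker; the pool is created separately.
APIM_ID=$(az apim show -g "$RG" -n "$APIM" --query id -o tsv)
az rest --method put \
--url "https://management.azure.com${APIM_ID}/backends/aoai-eus?api-version=2024-05-01" \
--body '{"properties":{"url":"https://aoai-platform-eus.openai.azure.com/openai","protocol":"http"}}'
az rest --method put \
--url "https://management.azure.com${APIM_ID}/backends/aoai-swc?api-version=2024-05-01" \
--body '{"properties":{"url":"https://aoai-platform-swc.openai.azure.com/openai","protocol":"http"}}'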
Group them into a Pool-type backend (priority + weight are set on the pool members so PTU is tried first, Standard/secondary region is the fallback). Then the inbound policy targets the pool and the backend section retries on 429/5xx:
<inbound>
<base />
<set-backend-service backend-id="aoai-pool" />
</inbound>
<backend>
<retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
count="2" interval="1" first-fast-retry="true">
<forward-request buffer-request-body="true" />
</retry>
</backend>
When aoai-eus returns a 429, the circuit breaker trips that backend, the pool routes to aoai-swc, and the retry re-issues the request, invisibly to the caller. Do not retry instantly in a tight loop: first-fast-retry="true" makes only the first retry immediate, and interval="1" with a small count spaces out the rest. Where the backend returns a Retry-After header, configure the circuit breaker to honor it so a throttled region rests for as long as it asks.
Pitfall: retries multiply load. Cap count at 2-3 and keep first-fast-retry for the transient case only. An aggressive retry storm across a backend pool turns one region’s throttling into a self-inflicted outage across all of them.
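The same discipline applies to any client-side retry a consuming team writes. Below is a minimal sketch of a capped retry that honors Retry-After. The `send` callable and its `(status, headers)` return shape are assumptions for illustration, and only the integer-seconds form of Retry-After is handled.

```python
import time
from typing import Callable


def call_with_retry(send: Callable[[], tuple[int, dict]],
                    max_retries: int = 2,
                    default_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> tuple[int, int]:
    """Call send() until it succeeds or retries run out.

    Returns (final_status, attempts). Retries only on 429 and 5xx.
    """
    attempts = 0
    while True:
        status, headers = send()
        attempts += 1
        retryable = status == 429 or status >= 500
        if not retryable or attempts > max_retries:
            return status, attempts
        delay = default_delay_s
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # Prefer the backend's own backpressure signal over our default.
            delay = float(retry_after)
        sleep(delay)
```

With `max_retries=2`, one caller request becomes at most three backend requests. That factor is the load multiplier to keep in mind when sizing quota.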
Step 6 — Guardrails: no keys, content safety, PII
Three controls turn a working gateway into a governed one:
- Managed identity, no keys (done in Step 3). As a belt-and-braces measure, disable local (key) auth on the account so even an account key cannot be used:
az resource update -g "$RG" -n "$ACCT" \
--resource-type "Microsoft.CognitiveServices/accounts" \
--set properties.disableLocalAuth=true
- Content safety. Azure OpenAI applies a default content filter; for an enterprise platform, attach a custom content filter / RAI policy to each deployment tuned to your risk tolerance (e.g. stricter on hate/self-harm, with annotate-and-block on jailbreak attempts). APIM can additionally enforce structural guardrails (max prompt size, blocked patterns) at the edge before a request ever reaches the model.
- PII and prompt logging. Decide deliberately what you log. Logging full prompts/completions is gold for debugging and abuse investigation but is a data-residency and privacy liability. A common stance: log metadata and token counts always, log content only to a restricted workspace with short retention, and run an APIM policy to redact obvious PII (emails, card numbers) from anything that lands in general logs.
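The redaction step can be prototyped outside the gateway before it is translated into a policy expression. This is a rough sketch, not a complete PII detector: the two patterns and the Luhn check cover only the "obvious" cases named above, and the `redact` helper is hypothetical.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to avoid redacting arbitrary long numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace emails and Luhn-valid card numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_ok(digits) else match.group()

    return CARD_RE.sub(_card, text)
```

The Luhn check trades a little recall for far fewer false positives on order numbers and timestamps, which matters when the redacted logs are still used for debugging.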
Step 7 — Observability and FinOps
Spend is invisible until you measure tokens per consumer. APIM’s llm-emit-token-metric policy (azure-openai-emit-token-metric is the Azure OpenAI-specific equivalent) publishes prompt, completion, and total token counts to Application Insights, dimensioned by whatever you choose. Make one dimension the subscription/team so the same data drives the chargeback report.
<inbound>
  <base />
  <llm-emit-token-metric namespace="openai">
    <dimension name="Team" value="@(context.Subscription.Name)" />
    <dimension name="Deployment" value="@(context.Request.MatchedParameters.GetValueOrDefault("deployment-id",""))" />
    <dimension name="ApiId" value="@(context.Api.Id)" />
  </llm-emit-token-metric>
</inbound>
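Pair that with diagnostic settings shipping APIM and Cognitive Services logs to Log Analytics ($LAW_ID is the workspace’s resource ID; the command below covers the OpenAI account, so create a matching setting for the APIM resource), and a budget so Finance is paged before, not after, the overrun. The budget command only creates the budget; attach alert thresholds to it in Cost Management: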
az monitor diagnostic-settings create \
--name diag-aoai \
--resource "$ACCT_ID" \
--workspace "$LAW_ID" \
--logs '[{"categoryGroup":"audit","enabled":true},{"categoryGroup":"allLogs","enabled":true}]' \
--metrics '[{"category":"AllMetrics","enabled":true}]'
az consumption budget create \
--budget-name bd-aoai-platform \
--amount 15000 \
--category Cost \
--time-grain Monthly \
--start-date 2026-06-01 \
--end-date 2026-12-31
A token-per-team dashboard (KQL over the emitted metric in App Insights) is the artifact that ends the “who is spending all the money” argument. Group by your Team dimension, sum prompt and completion tokens separately (they are priced differently), multiply each by the published per-1K price for the model, and you have a defensible internal bill.
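The bill itself is simple arithmetic once the token counts are exported. A minimal sketch follows, assuming rows shaped like the emitted metric (team, model, prompt and completion tokens). The prices in `PRICE_PER_1K` are placeholders, not a quote of any real rate card.

```python
from collections import defaultdict

# Hypothetical USD prices per 1,000 tokens; substitute the published rates.
PRICE_PER_1K = {"gpt-4o": {"prompt": 0.0025, "completion": 0.01}}


def chargeback(rows: list[dict]) -> dict[str, float]:
    """Sum the cost per team, pricing prompt and completion tokens separately."""
    bill = defaultdict(float)
    for row in rows:
        price = PRICE_PER_1K[row["model"]]
        bill[row["team"]] += (
            row["prompt_tokens"] / 1000 * price["prompt"]
            + row["completion_tokens"] / 1000 * price["completion"]
        )
    return {team: round(cost, 2) for team, cost in bill.items()}
```

Keeping the price table in one place means a rate change or a new model only touches `PRICE_PER_1K`, and last month's bill can be reproduced from the stored token counts.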
Enterprise scenario
A retail bank ran the exact pattern above: private accounts in East US 2 + Sweden Central behind one Premium APIM, a gpt-4o PTU deployment for the production fraud-summarization path, Standard as spillover. Two weeks after go-live, the production path started seeing intermittent 429s even though the PTU dashboard showed utilization under 60%. The retry policy then failed over every throttled request to the Standard backend, which quietly inflated the monthly bill by ~40%.
The gotcha: their set-backend-service targeted the pool, but the pool members had equal weight and no priority, so APIM round-robined PTU and Standard instead of treating Standard as fallback. Worse, the PTU “429s” were actually the deployment’s own dynamic spillover behavior interacting with a too-low tokens-per-minute on the llm-token-limit policy — the gateway was rejecting bursts the PTU could have absorbed.
Fix was two-part. Set explicit priority on the pool so PTU is exhausted before Standard is touched:
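# Pool-type backends are defined through the ARM REST API; members are referenced by resource ID.
APIM_ID=$(az apim show -g "$RG" -n "$APIM" --query id -o tsv)
az rest --method put \
--url "https://management.azure.com${APIM_ID}/backends/aoai-pool?api-version=2024-05-01" \
--body "{\"properties\":{\"type\":\"Pool\",\"pool\":{\"services\":[
{\"id\":\"${APIM_ID}/backends/aoai-ptu-eus2\",\"priority\":1,\"weight\":100},
{\"id\":\"${APIM_ID}/backends/aoai-std-swc\",\"priority\":2,\"weight\":100}]}}}"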
Then they raised the per-team llm-token-limit to match real PTU throughput and added azure-openai-emit-token-metric split by Backend so spillover became visible. After that, Standard carried only genuine overflow, the 40% leak disappeared, and the fraud path held p95 latency under its SLA. Priority-based pools plus a backend dimension on token metrics are non-negotiable when PTU and PayGo share a gateway.
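The difference between the broken and the fixed pool is easy to model. The sketch below is a simplified illustration of priority-then-weight selection, not APIM's implementation: the lowest priority value among healthy members wins, and weights only matter within that tier. `pick_backend` and the member dicts are hypothetical.

```python
import random


def pick_backend(members: list[dict], healthy: set[str], rng=random):
    """Pick a backend id: lowest priority value first, weighted within the tier."""
    candidates = [m for m in members if m["id"] in healthy]
    if not candidates:
        return None  # every circuit is open
    top = min(m["priority"] for m in candidates)
    tier = [m for m in candidates if m["priority"] == top]
    chosen = rng.choices(tier, weights=[m["weight"] for m in tier], k=1)[0]
    return chosen["id"]


FIXED_POOL = [
    {"id": "aoai-ptu-eus2", "priority": 1, "weight": 100},
    {"id": "aoai-std-swc", "priority": 2, "weight": 100},
]
# The misconfiguration from the scenario: same priority, equal weight.
FLAT_POOL = [dict(m, priority=1) for m in FIXED_POOL]
```

With `FIXED_POOL`, Standard is chosen only when the PTU backend is unhealthy. With `FLAT_POOL`, roughly half of all traffic lands on Standard even when the PTU has headroom, which is the cost leak described above.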
Verify
Walk the data path end to end:
# 1. From a peered VNet host, the account FQDN must resolve to a PRIVATE IP.
nslookup aoai-platform-eus.openai.azure.com
# -> 10.x.x.x (a public IP here means DNS is wrong)
# 2. Public access is refused from outside the VNet (run from your laptop):
curl -s -o /dev/null -w "%{http_code}\n" \
"https://aoai-platform-eus.openai.azure.com/openai/deployments?api-version=2024-10-21"
# -> 403 (Forbidden / blocked by network ACL)
# 3. A team calls THROUGH apim with only its subscription key (no AOAI key):
curl -s "https://apim-aoai-platform.azure-api.net/openai/deployments/gpt-4o-prod/chat/completions?api-version=2024-10-21" \
-H "api-key: $TEAM_SUBSCRIPTION_KEY" \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"ping"}]}' -i | grep -i "x-tokens-remaining"
# -> x-tokens-remaining: <number> (proves token throttling + gateway path)
Then confirm telemetry: in App Insights, the customMetrics namespace openai should show token counts split by your Team dimension within a minute or two of the call above.
Pitfalls and next steps
The three mistakes that sink these platforms: double DNS records for the private FQDN (Step 1’s callout), retry storms amplifying a single region’s throttling into a global one (Step 5), and logging full prompts everywhere until a privacy review forces an emergency rollback (Step 6). Design around all three from day one.
From here, treat the gateway as the place to add capability without touching consumers: semantic caching in APIM to cut token spend on repeated prompts, prompt-shield / jailbreak detection policies, and an internal model catalog so teams choose a logical name (chat-default, embeddings) that the gateway maps to the current best deployment. The accounts are capacity; APIM is where your platform’s policy, economics, and safety actually live.