Deploy Your First Azure OpenAI Model: Resource, Deployment, and Calling GPT-4o from REST and the SDK

You have a working OpenAI curl command and an API key, and now someone has said “but it has to run on Azure.” That sentence changes more than the hostname. On Azure OpenAI — Microsoft’s hosted offering of the OpenAI models, billed through your Azure subscription and governed by Azure identity, networking and policy — you do not call a model by its name. You call a deployment: a named instance of a specific model (say gpt-4o, version 2024-11-20) that you create inside your resource, with your quota and region. The endpoint is your resource’s hostname, the auth is a resource key or a Microsoft Entra ID token, and the URL embeds the deployment name, not the model. Get those four things right — resource, deployment, endpoint, auth — and the first 200 OK comes back in under fifteen minutes. Muddle them and you stare at DeploymentNotFound or 401, wondering why the payload that worked on api.openai.com fails here.

This guide takes you from an empty subscription to a working GPT-4o chat call three ways: a raw REST call with curl, the Python SDK, and the JavaScript SDK. Each is shown with the simple api-key header and with the production-correct keyless Microsoft Entra ID token. The resource and deployment are built three ways as well: in the portal, with the az CLI, and in Bicep. Every option that matters (deployment types, the tokens-per-minute (TPM) quota, model versions, the GA 2024-10-21 inference API, the RBAC roles) sits in a scannable table beside the commands, so you can debug the next person’s 404 too.

The mental shift to internalise up front: on Azure OpenAI the deployment name is the unit of everything. It is what goes in the URL, what carries the quota, and what RBAC, the playground and your code all reference. The model is what you put into a deployment. Lose that distinction and nothing lines up; hold it and the service clicks into place.

What problem this solves

Teams reach for Azure OpenAI for reasons that have nothing to do with the model weights, which are identical to OpenAI’s. They want an Azure service’s data-handling posture (your prompts are not used to train models; residency is controllable), RBAC and SSO instead of a shared key, private networking so the endpoint never touches the public internet, Azure Policy and cost governance over who deploys what, and one consolidated bill. The capability is the same; the control plane is the reason to be here.

That control plane is what trips up a first deployment. On api.openai.com you authenticate with one bearer key, name the model in the body, and you are done. On Azure none of that holds: there is no “the API key” — keys belong to a resource you create first; the model name in the body is ignored in favour of a deployment name in the URL; the endpoint is your resource’s unique host, not a shared one; and the most common failure is not a bug but a quota of zero tokens-per-minute, or a key-based call against an org that has disabled keys. None of this is hard, but each is a place a newcomer loses an hour.

Who hits it: every developer porting a prototype from OpenAI, every platform team standing up a governed AI landing zone, and anyone whose security team said “no API keys in app settings” and now needs managed-identity auth. The fix is always the same — understand the four-part contract (resource, deployment, endpoint, auth), then express it in code or IaC.

To frame the field before we build, here is the four-part contract and where each piece comes from:

| Piece | What it is | Where it comes from | The mistake it causes |
|---|---|---|---|
| Resource | A Microsoft.CognitiveServices account, kind OpenAI | You create it (portal / az / Bicep) | None yet, but without a resource there is no endpoint and no keys |
| Deployment | A named instance of one model + version + capacity | You create it inside the resource | Putting the model name (gpt-4o) in the URL → DeploymentNotFound |
| Endpoint | Your resource’s hostname https://<name>.openai.azure.com | Generated when the resource is created | Reusing api.openai.com → connection/401 errors |
| Auth | api-key header or Entra ID Authorization: Bearer <token> | Resource keys, or an RBAC role assignment | Key when keys are disabled, or a token without the role → 401/403 |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You need an Azure subscription with permission to create resources in a resource group (Contributor on the RG is enough; we note where extra roles matter), the az CLI signed in or Cloud Shell (which has az, python and node preinstalled), and for the SDK steps Python 3.8+ or Node.js 18+. Comfort with HTTP, JSON and a terminal is assumed; no ML background is needed — this is an integration task.

One real prerequisite is access to Azure OpenAI itself. Most subscriptions now have it by default, but some (certain trial/sponsored ones) do not, and you discover this when resource creation fails — that gate, not your command, is usually why.

This sits at the start of the AI/ML on Azure track and underpins every later topic: once you can call a deployment, you add retrieval with Azure AI Search: Vector, Hybrid & Semantic Ranking for RAG, lock the endpoint down with Azure Private Link & Private DNS for PaaS, keep keys out of config with Azure Key Vault: Secrets, Keys & Certificates, and govern the estate as an Azure OpenAI Enterprise Landing Zone. A quick map of who owns what during a first rollout:

| Concern | Lives in | Usually owned by | What it blocks if wrong |
|---|---|---|---|
| Subscription access to Azure OpenAI | Subscription / Microsoft | Cloud platform team | Resource creation outright |
| Resource + region choice | Resource group | You / app team | Endpoint, model availability |
| TPM quota for the model | Subscription, per region | Platform / FinOps | Deployment capacity (429 if zero) |
| RBAC role for keyless auth | Resource IAM | Security / platform | Entra ID calls (403 without the role) |
| Private networking | VNet / Private DNS | Network team | Reachability if keys/public access locked |
| Content filter / responsible AI | Resource (Foundry) | AI governance | Whether prompts/responses are blocked |

Core concepts

Five ideas make every command in this guide obvious.

The resource is a Cognitive Services account, not a special “OpenAI” object. An Azure OpenAI resource is Microsoft.CognitiveServices/accounts with kind: OpenAI and SKU S0 — which is why the CLI verb is az cognitiveservices account create, not az openai. The resource owns the endpoint (https://<name>.openai.azure.com), two keys, the managed-identity options, the network rules, and the content filters.

A deployment binds one model + one version + one capacity. A deployment is a Microsoft.CognitiveServices/accounts/deployments object where you pick the model (gpt-4o), version (2024-11-20), deployment type (the sku.name, e.g. GlobalStandard), and capacity (TPM in thousands), and give it a name — often the same as the model for clarity. That name is the deployment id in the URL. You can run several deployments of one model (a gpt-4o-prod and a gpt-4o-canary) with different quotas.

The URL embeds the deployment; the body does not name the model. A chat call is POST https://<name>.openai.azure.com/openai/deployments/<deployment-id>/chat/completions?api-version=2024-10-21. The model is implied by the deployment id in the path — unlike OpenAI’s API, a model field in the JSON body is not how routing happens. This one difference is behind most “worked on OpenAI, not Azure” confusion.
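A minimal sketch of how that URL is assembled, assuming the default `<name>.openai.azure.com` host pattern described above. The function name and the example resource name `oai-lab-demo` are hypothetical, chosen only for illustration.

```python
from urllib.parse import quote, urlencode


def chat_completions_url(resource_name: str, deployment_id: str,
                         api_version: str = "2024-10-21") -> str:
    """Build the chat-completions URL for one deployment.

    The deployment id goes in the path; no model name appears anywhere.
    """
    host = f"https://{resource_name}.openai.azure.com"
    # Percent-encode the deployment id so an unusual name cannot break the path.
    path = f"/openai/deployments/{quote(deployment_id, safe='')}/chat/completions"
    return f"{host}{path}?{urlencode({'api-version': api_version})}"


# Hypothetical resource and deployment names.
url = chat_completions_url("oai-lab-demo", "chat-prod")
```

Note that the deployment here is called `chat-prod`, yet nothing in the URL says which model it serves. That is the routing difference from OpenAI's API in one line.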

Auth is one of two modes, and orgs increasingly forbid the easy one. Either put a resource key in the api-key header (trivial, but a long-lived shared secret), or present a Microsoft Entra ID token in Authorization: Bearer <token> for scope https://cognitiveservices.azure.com/.default, where the caller — a user or managed identity — holds the Cognitive Services OpenAI User role. Keyless is production-correct, and many subscriptions disable key auth entirely, so know both.
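The two modes differ only in which header carries the credential. The sketch below shows that difference in isolation; the helper name `auth_headers` is hypothetical, and acquiring the token itself (via a credential library) is deliberately left out.

```python
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None,
                 bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Return request headers for exactly one of the two auth modes."""
    if (api_key is None) == (bearer_token is None):
        raise ValueError("pass exactly one of api_key or bearer_token")
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Key mode: the resource key travels in the api-key header.
        headers["api-key"] = api_key
    else:
        # Keyless mode: an Entra ID token for the Cognitive Services scope.
        headers["Authorization"] = f"Bearer {bearer_token}"
    return headers
```

Forcing the caller to choose one mode makes it harder to ship code that silently falls back to a key once local auth has been disabled.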

Capacity is tokens-per-minute, and the default can be zero. A Standard deployment’s throughput is a TPM quota granted per subscription, per region, per model; you assign a deployment some of it as capacity (in thousands — capacity 30 ≈ 30,000 TPM). Requests-per-minute (RPM) is derived from TPM, not set separately. If the quota is exhausted or never granted, every call returns 429 Too Many Requests — the most common “why won’t it work” after the URL mistake.
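To make the capacity arithmetic concrete, here is a rough budgeting sketch. It only applies the "capacity is TPM in thousands" rule stated above; the function names are hypothetical, and real throttling also depends on the derived RPM limit and on how the service counts tokens, which this does not model.

```python
def capacity_to_tpm(capacity: int) -> int:
    """Convert deployment capacity units to tokens per minute (1 unit = 1,000 TPM)."""
    return capacity * 1000


def max_requests_per_minute(capacity: int, tokens_per_request: int) -> int:
    """Upper bound on requests per minute that fit inside the TPM budget."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return capacity_to_tpm(capacity) // tokens_per_request
```

With an assumed 2,500 tokens per request, capacity 1 fits zero requests per minute and capacity 30 fits twelve, which is why a deployment sized at the minimum can throttle almost immediately under long prompts.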

The vocabulary in one table

Pin these down before the steps; the glossary repeats them for lookup.

| Term | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Resource (account) | Microsoft.CognitiveServices account, kind OpenAI, SKU S0 | Resource group | Owns endpoint, keys, identity, network |
| Endpoint | https://<name>.openai.azure.com | Generated with the resource | The host every call targets |
| Model | The weights (gpt-4o, gpt-4o-mini) | Microsoft’s catalogue | What a deployment serves |
| Model version | Dated snapshot (2024-11-20) | Chosen at deploy time | Pins behaviour/features (e.g. Structured Outputs) |
| Deployment | Named instance of model+version+capacity | Inside the resource | The id in the URL; carries quota |
| Deployment type (SKU) | Standard / GlobalStandard / provisioned / batch | sku.name on the deployment | Scope, billing, latency profile |
| TPM capacity | Tokens-per-minute allotment (in thousands) | On the deployment | Throughput; zero → 429 |
| api-key | Resource key in an HTTP header | Resource → Keys and Endpoint | Simple auth (a shared secret) |
| Entra ID token | Bearer token for cognitiveservices.azure.com | Acquired via a credential | Keyless auth (needs RBAC role) |
| api-version | Query param selecting the API contract | The request URL | GA 2024-10-21; preview dates change shapes |

Resource, model, deployment: getting the relationship right

Why the deployment name, not the model name, is in the URL

On OpenAI’s API you write "model": "gpt-4o" in the body and the platform routes to its shared gpt-4o. Azure inverts this: you first deploy gpt-4o into your resource under a name you choose, then address that name in the path. The body’s model field is irrelevant to routing — Azure already knows the model from the deployment id. Name the deployment gpt-4o and the two match, hiding the distinction; name it chat-prod and it becomes vivid — the URL reads .../deployments/chat-prod/... yet still reaches a GPT-4o model.

This is why DeploymentNotFound is the signature first error: a developer copies an OpenAI snippet, puts gpt-4o in the path expecting it to mean the model, but never created a deployment named gpt-4o. The fix is never in the body — deploy the model and use the deployment’s exact, case-sensitive name in the URL.

Picking the model and version for a first chat app

For a first general-purpose chat or assistant, gpt-4o (multimodal, fast, strong) or the cheaper gpt-4o-mini are the right starting points — both take text and images and support JSON Mode and tool calling. Versions are dated snapshots; pin one explicitly rather than drift. The GPT-4o lineage you choose between:

| Model | Version | Context (input tokens) | Max output (tokens) | Notable additions | Pick it when |
|---|---|---|---|---|---|
| gpt-4o | 2024-05-13 | 128,000 | 4,096 | First GPT-4o; text+image, JSON Mode, parallel tools | You need the original 4o behaviour |
| gpt-4o | 2024-08-06 | 128,000 | 16,384 | Adds Structured Outputs; larger output | You want schema-guaranteed JSON |
| gpt-4o | 2024-11-20 | 128,000 | 16,384 | Latest 4o; better writing/accuracy | Default for new chat apps |
| gpt-4o-mini | 2024-07-18 | 128,000 | 16,384 | Cheap, fast; text+image, JSON Mode, tools | High volume / cost-sensitive |
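One practical use of pinning a version is knowing its output ceiling before you send a request. The sketch below encodes the max-output column of the table above as a lookup; the dictionary and function names are hypothetical, and the limits are only as current as that table.

```python
from typing import Dict, Tuple

# (model, version) -> max output tokens, copied from the table above.
MAX_OUTPUT_TOKENS: Dict[Tuple[str, str], int] = {
    ("gpt-4o", "2024-05-13"): 4096,
    ("gpt-4o", "2024-08-06"): 16384,
    ("gpt-4o", "2024-11-20"): 16384,
    ("gpt-4o-mini", "2024-07-18"): 16384,
}


def clamp_max_tokens(model: str, version: str, requested: int) -> int:
    """Cap a requested max_tokens at the pinned version's output limit."""
    try:
        limit = MAX_OUTPUT_TOKENS[(model, version)]
    except KeyError:
        raise ValueError(f"unknown model/version: {model} {version}") from None
    return min(requested, limit)
```

For example, asking for 8,000 output tokens is fine on 2024-11-20 but would be capped to 4,096 on 2024-05-13, which is one reason to pin a version explicitly instead of letting it drift.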

A note on currency: Microsoft ships newer flagship families over time, and your resource’s model catalogue (az cognitiveservices account list-models or the portal) is the source of truth for what your region can deploy today. The mechanics here — deploy a name, call the path — are identical whichever chat model you pick; GPT-4o is simply the broadly available, well-documented choice to learn on.

Choosing a deployment type

The deployment type (the sku.name on the deployment) decides where data is processed, how you pay, and your latency profile. For learning and most pay-as-you-go workloads, GlobalStandard is the default: highest quota, broadest availability, pay-per-token. Use data-zone or regional types only when compliance pins processing to a geography, and provisioned/batch only at scale. Every type, side by side:

| Deployment type | sku.name | Data processed | Billing | Use it for |
|---|---|---|---|---|
| Global Standard | GlobalStandard | Any Azure region | Pay-per-token | General workloads; highest quota (default) |
| Standard (regional) | Standard | The deployment region | Pay-per-token | Single-region data residency, low volume |
| Data Zone Standard | DataZoneStandard | Within US or EU data zone | Pay-per-token | EU/US zone compliance, higher quota than regional |
| Global Provisioned | GlobalProvisionedManaged | Any Azure region | Reserved PTU | Predictable high throughput, low latency variance |
| Regional Provisioned | ProvisionedManaged | The deployment region | Reserved PTU | Region-pinned + guaranteed throughput |
| Data Zone Provisioned | DataZoneProvisionedManaged | US or EU data zone | Reserved PTU | Zone compliance + guaranteed throughput |
| Global Batch | GlobalBatch | Any Azure region | ~50% off, 24-hr async | Large offline jobs (no real-time SLA) |
| Data Zone Batch | DataZoneBatch | US or EU data zone | ~50% off, 24-hr async | Large offline jobs with zone compliance |

The decision rule as a table — match your constraint to the type:

| If your constraint is… | Choose | Why |
|---|---|---|
| “Just get me running, cheapest to start” | GlobalStandard | Highest quota, pay-per-token, broadest models |
| “Data must stay in the EU (or US)” | DataZoneStandard (EU/US region) | Processing pinned to the data zone |
| “Single Azure region, regional residency” | Standard | Processed in the deployment’s region only |
| “Steady high volume, predictable latency” | GlobalProvisionedManaged | Reserved PTUs guarantee throughput |
| “Millions of rows overnight, cost-sensitive” | GlobalBatch | ~50% cheaper, async 24-hr turnaround |
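The two tables above reduce to a lookup on two questions: where may data be processed, and how do you pay. Here is a sketch of that decision under those assumptions. The function and the category labels (`global`, `region`, `data_zone`, `pay_per_token`, `provisioned`, `batch`) are hypothetical names; only the returned sku.name strings come from the table.

```python
from typing import Dict, Tuple

# (residency, billing) -> sku.name, taken from the deployment-type table.
_SKU_BY_CONSTRAINT: Dict[Tuple[str, str], str] = {
    ("global", "pay_per_token"): "GlobalStandard",
    ("region", "pay_per_token"): "Standard",
    ("data_zone", "pay_per_token"): "DataZoneStandard",
    ("global", "provisioned"): "GlobalProvisionedManaged",
    ("region", "provisioned"): "ProvisionedManaged",
    ("data_zone", "provisioned"): "DataZoneProvisionedManaged",
    ("global", "batch"): "GlobalBatch",
    ("data_zone", "batch"): "DataZoneBatch",
}


def choose_deployment_type(residency: str = "global",
                           billing: str = "pay_per_token") -> str:
    """Map a residency and billing constraint to a deployment sku.name."""
    try:
        return _SKU_BY_CONSTRAINT[(residency, billing)]
    except KeyError:
        # The table lists no region-pinned batch type, for example.
        raise ValueError(f"no deployment type for {residency!r} + {billing!r}") from None
```

Calling it with no arguments returns `GlobalStandard`, matching the default recommended for learning and most pay-as-you-go workloads.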

The endpoint and authentication contract

The two auth modes, in detail

Once the deployment exists, a call needs the endpoint, the deployment id, an api-version, and auth — the part with two faces. The trade-offs:

| Aspect | api-key header | Microsoft Entra ID (keyless) |
|---|---|---|
| What you send | api-key: <resource key> | Authorization: Bearer <token> |
| Secret to manage | A long-lived shared key | None; token minted on demand, short-lived |
| Who can call | Anyone holding the key | A principal with the right RBAC role |
| Token scope | n/a | https://cognitiveservices.azure.com/.default |
| Required role | n/a | Cognitive Services OpenAI User (to infer) |
| Rotation | Manual (two keys to rotate) | Automatic (tokens expire ~1 hr) |
| Works if local auth disabled | No | Yes |
| Best for | Quick local tests | Production, CI, managed identities |

Learn the keyless path properly — it is what production uses and what disableLocalAuth: true forces. The shape is always: acquire a token for the Cognitive Services scope via a credential (your az login identity locally; a managed identity in Azure), and the SDK attaches it as a bearer token. The caller must hold an RBAC role granting the inference data-action.

The RBAC roles that matter

Azure OpenAI has a small set of built-in roles. The two you use constantly are Cognitive Services OpenAI User (call the model and playground; cannot see keys or create deployments) and Cognitive Services Contributor (create the resource and read keys; but, crucially, cannot infer with Entra ID). That asymmetry surprises people — the role that builds the resource is not the role that calls it keyless. The map:

| Role | Call inference (Entra ID) | View/regenerate keys | Create/edit deployments | Create the resource | View quota |
|---|---|---|---|---|---|
| Cognitive Services OpenAI User | ✅ | ❌ | ❌ | ❌ | ❌ |
| Cognitive Services OpenAI Contributor | ✅ | ❌ | ✅ | ❌ | ❌ |
| Cognitive Services Contributor | ❌ | ✅ | ✅ (via API/Foundry) | ✅ | ❌ |
| Cognitive Services Usages Reader | ❌ | ❌ | ❌ | ❌ | ✅ (subscription scope) |

So: an app’s managed identity that calls GPT-4o needs OpenAI User (not Contributor — it cannot infer); a platform engineer building resources needs Cognitive Services Contributor plus OpenAI Contributor to create deployments; and viewing TPM quota needs Usages Reader assigned at subscription scope, not the resource.
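The role guidance above can be kept as a small lookup so that a review can check an assignment against the task it is meant to enable. This is a sketch that encodes only what the paragraph above states; the task labels and function name are hypothetical.

```python
from typing import Dict, Iterable, List

# Task -> built-in role, as described in the paragraph above.
ROLE_FOR_TASK: Dict[str, str] = {
    "call_inference_keyless": "Cognitive Services OpenAI User",
    "create_deployment": "Cognitive Services OpenAI Contributor",
    "create_resource": "Cognitive Services Contributor",
    "read_keys": "Cognitive Services Contributor",
    # Assigned at subscription scope, not on the resource.
    "view_quota": "Cognitive Services Usages Reader",
}


def roles_needed(tasks: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated roles required for a set of tasks."""
    return sorted({ROLE_FOR_TASK[task] for task in tasks})
```

For an app identity that only calls the model, `roles_needed(["call_inference_keyless"])` yields the single OpenAI User role, which is the least-privilege assignment the text recommends.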

API versions: GA vs preview, and the new /openai/v1/ path

The api-version query parameter selects the contract. For stable chat completions, use the GA 2024-10-21. Preview versions (dated 2025-…-preview) unlock features earlier but can change shapes between releases — fine to experiment with, risky to pin in production. Microsoft also offers a next-generation /openai/v1/ path that mirrors OpenAI’s API style and reduces constant api-version bumps; know it exists, but the 2024-10-21 deployment-path call here is the dependable baseline. The versions you will meet:

| api-version | Status | Use it for |
|---|---|---|
| 2024-10-21 | GA (data plane) | Stable chat completions; the baseline this guide uses |
| 2025-…-preview | Preview (data plane) | Newest features early; expect shape changes |
| 2025-06-01 | GA (control plane) | Resource/deployment management (ARM), not inference |
| /openai/v1/ (GA) | Next-gen data-plane path | OpenAI-style surface; fewer version bumps |

Architecture at a glance

Read the diagram left to right as the life of one chat request. On the left, a caller — your laptop running curl or an app running the SDK — holds one of two credentials: a resource key for the api-key header, or the production path, a short-lived Microsoft Entra ID token minted from its managed identity. The request hits your Azure OpenAI resource at https://<name>.openai.azure.com on the path /openai/deployments/<id>/chat/completions?api-version=2024-10-21. The resource is the control point: it validates the credential (key, or token plus Cognitive Services OpenAI User role), checks the content filter, and looks up the deployment named in the path. That deployment — the box that matters — is bound to a model (gpt-4o), a version, and a slice of TPM quota. Only after auth, filter and quota pass does the model run and stream tokens back, with a usage block tallying what you are billed.

The key thing the picture teaches: the deployment sits inside the resource and is what the URL addresses — the model hangs off the deployment, not the reverse. The numbered badges show where a first call dies: a wrong path (1) never finds the deployment; a bad/disabled key or a token missing the role (2) fails auth at the resource boundary; zero TPM (3) throttles every call; and a blocked prompt (4) is stopped by the content filter before the model sees it. Same path, four failure points — and the error string tells you which.

Figure: Azure OpenAI request architecture, read left to right. A caller holds one of two credentials (a resource api-key, or a keyless Microsoft Entra ID token minted from a managed identity) and sends the request to the Azure OpenAI resource at https://<name>.openai.azure.com. The resource authenticates the request, applies the content filter, and routes by deployment id on the path /openai/deployments/<id>/chat/completions to a GPT-4o deployment bound to a model version and a tokens-per-minute quota, which runs the model and returns choices plus a usage token tally. Numbered badges mark the four first-call failure points: (1) wrong URL path → DeploymentNotFound; (2) bad or disabled key, or a token missing the Cognitive Services OpenAI User role → 401 or 403; (3) zero TPM quota → 429; (4) prompt blocked by the content filter.
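The four failure points above lend themselves to a small triage helper. The sketch below maps an HTTP status and error code to the badge it most likely corresponds to. The function name and return labels are hypothetical, and treating an error code of `content_filter` as the signal for a blocked prompt is an assumption, since the text does not spell out that response.

```python
def diagnose_first_call(status: int, error_code: str = "") -> str:
    """Classify a first-call result into one of the four failure points."""
    code = error_code.lower()
    if 200 <= status < 300:
        return "ok"
    if status == 404 or code == "deploymentnotfound":
        return "wrong_path"       # badge 1: deployment name in the URL is wrong
    if status in (401, 403):
        return "auth"             # badge 2: bad/disabled key or missing role
    if status == 429:
        return "quota"            # badge 3: TPM capacity is zero or exhausted
    if code == "content_filter":  # assumed error code for a blocked prompt
        return "content_filter"   # badge 4: stopped before the model ran
    return "unknown"
```

A helper like this is mainly useful in logs: it turns a raw status into the question to ask next (check the URL, check the role, check the quota, check the prompt).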

Real-world scenario

Saral Health, a 60-person telemedicine startup in Bengaluru, runs a patient-triage assistant that summarises symptom intake and drafts a clinician note. The prototype used OpenAI’s public API with a key pasted into the app’s environment. Signing an enterprise hospital customer triggered a security review with two hard requirements: no third-party data egress (patient text stays in Azure under their tenant) and no static API keys in any config. Two engineers had a week.

Day one went sideways. They created the resource in Central India, copied their OpenAI curl, swapped the hostname — and got DeploymentNotFound. An hour later they realised they had never deployed a model, assuming gpt-4o in the body would route as it did on OpenAI. They created a deployment named gpt-4o, and the api-key call worked — but they had just hardcoded a key, the exact thing forbidden.

Day two hit the quota wall. Once the API moved to a containerised App Service, a load test returned 429 on every third call. The deployment had been created with capacity 1 (≈1,000 TPM), the default offered, and long intake transcripts blew through it instantly. Raising capacity to 30 against their GlobalStandard quota cleared it.

The real work was keyless auth. They enabled a system-assigned managed identity on the App Service, set disableLocalAuth: true to satisfy the “no keys” rule, and found that their instinct — granting the app Cognitive Services Contributor — produced 403 on every call. The fix was the roles-table asymmetry: Contributor builds resources but cannot infer with Entra ID; the app needed Cognitive Services OpenAI User. After assigning that role and switching the SDK to DefaultAzureCredential, the app called GPT-4o with zero secrets in config.

The week ended clean: resource in Central India, one gpt-4o (2024-11-20) GlobalStandard deployment at 30K TPM, local auth disabled, the App Service identity holding Cognitive Services OpenAI User, and — week two — a Private Endpoint removing public network access entirely. Spend was usage-driven, roughly ₹14,000, dominated by output tokens on the summaries. The runbook lesson: “You deploy a name and call the name. Keys are a crutch; the role that builds the resource is not the role that calls it; quota is a number you set, and its default is too small.” All four day-one mistakes map to a badge in the diagram above.

Advantages and disadvantages

The Azure-hosted model — same weights, Azure control plane — is the right call for governed, compliance-bound, identity-centric workloads, and overhead you would skip for a weekend hack. Weigh it honestly:

| Advantages | Disadvantages |
|---|---|
| Enterprise data handling: your prompts/completions are not used to train models; residency is controllable via deployment type | More moving parts than OpenAI’s single key + model name, so a steeper first deployment |
| Keyless auth via Entra ID + managed identity: no secrets in config, automatic rotation | The deployment-vs-model distinction trips newcomers (DeploymentNotFound) |
| RBAC and Azure Policy govern who can deploy and call what | Two different roles for building vs calling the resource; easy to mis-assign |
| Private networking (Private Endpoint) keeps the model endpoint off the public internet | Quota (TPM) is per-subscription-per-region and can default low/zero → 429 |
| One Azure bill; cost lands in Cost Management with your other resources | Model/version availability varies by region; the newest models land on Azure slightly later |
| Regional + data-zone options for sovereignty needs | Provisioned throughput (PTUs) for scale adds capacity-planning complexity |

When each side matters: the advantages dominate for anything customer-facing, regulated, or running inside an enterprise tenant — which is most reasons you are on Azure at all. The disadvantages are almost entirely first-time friction (the four day-one mistakes in the scenario) plus the genuine, ongoing need to manage quota as you scale. None are blockers; they are the things this guide exists to pre-empt.

Hands-on lab

This is the centrepiece: from an empty subscription to a streaming GPT-4o chat call, validated at every step, then torn down. Create the resource and deployment either in the portal (Part A) or with the az CLI (Part B); the two are alternatives and both are shown, with a Bicep version added for repeatability. Then call the deployment from curl, Python and JavaScript, with both auth modes. It is pay-as-you-go cheap: a handful of test calls cost a few rupees, and teardown removes everything. Run it in Cloud Shell (Bash) unless noted.

Part A — Create the resource and deployment in the Azure portal

Step A1 — Open Azure OpenAI. In the Azure portal, search Azure OpenAI and select + Create. Expected: the create blade with Basics.

Step A2 — Fill Basics. Choose your subscription, a resource group (create rg-openai-lab), a region (e.g. Central India or East US; regions differ in model availability), a globally unique Name (e.g. oai-lab-<yourinitials>; it becomes the endpoint subdomain), and Pricing tier Standard S0. Expected: validation passes; if the model later turns out to be greyed out in that region, switch regions.

Step A3 — Network and finish. On Network, leave All networks for the lab (you would pick a Private Endpoint in production). Skip tags, Review + create, then Create. Expected: deployment completes in 1–2 minutes; Go to resource.

Step A4 — Open Foundry and deploy a model. On the resource, click Go to Azure AI Foundry portal (or Explore/Model deployments → Manage Deployments). In Foundry, go to Deployments → + Deploy model → Deploy base model, pick gpt-4o, and select Confirm. Expected: the deploy dialog showing model, version, and deployment-type fields.

Step A5 — Name the deployment and set capacity. Set Deployment name to gpt-4o (this becomes the URL id), Model version to 2024-11-20, Deployment type to Global Standard, and Tokens per Minute Rate Limit to a small value like 30K. Click Deploy. Expected: the deployment appears with state Succeeded; note the Target URI and that the deployment name is gpt-4o.

Step A6 — Grab the endpoint and key. Back on the resource, open Keys and Endpoint. Expected: an Endpoint like https://oai-lab-xxx.openai.azure.com/ and KEY 1 / KEY 2. Copy the endpoint and KEY 1 for Part C.

Step A7 — Smoke-test in the playground. In Foundry, open Chat playground, confirm your gpt-4o deployment is selected, type “Say hello in one short sentence,” and Send. Expected: a one-line reply. This proves the deployment works before any code.

What you just built, mapped to the four-part contract:

| Step | Portal action | Contract piece it created | Validates |
|---|---|---|---|
| A2–A3 | Create resource (S0, region) | Resource + endpoint | Resource exists, endpoint minted |
| A4–A5 | Deploy gpt-4o named gpt-4o | Deployment (model+version+TPM) | The URL id and quota |
| A6 | Read Keys and Endpoint | Auth (key) + endpoint host | Credentials for the call |
| A7 | Playground chat | End-to-end path | Model responds at all |

Part B — Same thing with the az CLI (and Bicep)

This is the repeatable path. It assumes az login is done and the cognitiveservices commands are available (they ship with the CLI).

Step B1 — Variables and resource group.

RG=rg-openai-lab
LOC=eastus                      # a region with gpt-4o availability
ACCT=oai-lab-$RANDOM            # globally-unique resource name
DEP=gpt-4o                      # the deployment id you will call
az group create -n $RG -l $LOC -o table

Step B2 — Create the Azure OpenAI resource. It is a Cognitive Services account, kind OpenAI, SKU S0:

az cognitiveservices account create \
  --name $ACCT --resource-group $RG --location $LOC \
  --kind OpenAI --sku S0 \
  --custom-domain $ACCT \
  --yes -o table

Expected: a JSON/table row with provisioningState: Succeeded. The --custom-domain makes the endpoint https://$ACCT.openai.azure.com.

Step B3 — Confirm the endpoint and that gpt-4o is available here.

az cognitiveservices account show -n $ACCT -g $RG \
  --query "properties.endpoint" -o tsv

# List deployable models in this region; confirm gpt-4o is present
az cognitiveservices account list-models -n $ACCT -g $RG \
  --query "[?contains(name,'gpt-4o')].{model:name, version:version, format:format}" -o table

Expected: the endpoint URL, and a row for gpt-4o. If gpt-4o is absent, your region lacks it — recreate in another region (e.g. swedencentral).

Step B4 — Create the GPT-4o deployment. Bind model + version + type + capacity:

az cognitiveservices account deployment create \
  --name $ACCT --resource-group $RG \
  --deployment-name $DEP \
  --model-name gpt-4o --model-version "2024-11-20" --model-format OpenAI \
  --sku-name GlobalStandard --sku-capacity 30 \
  -o table

Expected: a deployment row, provisioningState: Succeeded, sku.name: GlobalStandard, sku.capacity: 30 (≈30,000 TPM). If you get a quota error, lower --sku-capacity, switch region, or request a quota increase.

Step B5 — Verify the deployment.

az cognitiveservices account deployment list -n $ACCT -g $RG \
  --query "[].{name:name, model:properties.model.name, version:properties.model.version, sku:sku.name, tpm:sku.capacity}" -o table

Expected: one row, name: gpt-4o. That name is your URL id.

Step B6 — The Bicep version (idempotent, review-friendly). Save as openai.bicep:

@description('Azure OpenAI account name (also the endpoint subdomain)')
param accountName string
param location string = resourceGroup().location
@description('Disable api-key auth; require Entra ID (set true for production)')
param disableLocalAuth bool = false

resource account 'Microsoft.CognitiveServices/accounts@2024-10-01' = {
  name: accountName
  location: location
  kind: 'OpenAI'
  sku: { name: 'S0' }
  identity: { type: 'SystemAssigned' }          // for the resource's own MI if needed
  properties: {
    customSubDomainName: accountName             // makes <name>.openai.azure.com
    disableLocalAuth: disableLocalAuth           // true → keys off, Entra ID only
    publicNetworkAccess: 'Enabled'               // 'Disabled' + Private Endpoint in prod
  }
}

resource gpt4o 'Microsoft.CognitiveServices/accounts/deployments@2024-10-01' = {
  parent: account
  name: 'gpt-4o'                                 // the deployment id used in the URL
  sku: { name: 'GlobalStandard', capacity: 30 }  // ≈30,000 TPM
  properties: {
    model: { format: 'OpenAI', name: 'gpt-4o', version: '2024-11-20' }
    versionUpgradeOption: 'OnceNewDefaultVersionAvailable'
  }
}

output endpoint string = account.properties.endpoint
output deploymentName string = gpt4o.name

Deploy and capture the outputs:

az deployment group create -g $RG \
  --template-file openai.bicep \
  --parameters accountName=$ACCT \
  --query "properties.outputs" -o json

Expected: endpoint and deploymentName outputs. Re-running is a no-op if nothing changed — that idempotency is the point of IaC. Set disableLocalAuth=true to do the whole lab keyless from the start.

Part C — Call it with curl (api-key)

Step C1 — Export endpoint, key, deployment.

ENDPOINT=$(az cognitiveservices account show -n $ACCT -g $RG --query "properties.endpoint" -o tsv)
API_KEY=$(az cognitiveservices account keys list -n $ACCT -g $RG --query "key1" -o tsv)
DEP=gpt-4o
API_VERSION=2024-10-21

Step C2 — Make the chat call. The deployment id is in the path; the key is in the api-key header:

curl -sS "${ENDPOINT}openai/deployments/${DEP}/chat/completions?api-version=${API_VERSION}" \
  -H "Content-Type: application/json" \
  -H "api-key: ${API_KEY}" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a terse assistant."},
          {"role": "user", "content": "Name three Azure regions in India."}
        ],
        "max_tokens": 100,
        "temperature": 0.2
      }'

Expected: a JSON body with choices[0].message.content listing regions (Central India, South India, West India), a finish_reason of stop, and a usage object with prompt_tokens, completion_tokens, total_tokens. That usage block is your bill in miniature.

Step C3 — Read the response shape. The fields you care about:

| Field | Meaning | Watch for |
|---|---|---|
| choices[0].message.content | The model’s reply text | Empty + content_filter → blocked prompt |
| choices[0].finish_reason | Why it stopped | length = hit max_tokens (raise it) |
| usage.prompt_tokens | Tokens you sent | Long context = higher cost |
| usage.completion_tokens | Tokens generated | The pricier half on most models |
| usage.total_tokens | Sum of prompt + completion tokens | For cost, price each half at its own per-token rate |
| model | The serving model+version | Confirms which version answered |
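To see those fields in use, the sketch below parses a response body and estimates cost. The sample JSON is hypothetical (shaped like the fields in the table above, not captured from a real call), and the per-1,000-token prices are placeholder parameters, not published rates.

```python
import json
from typing import Any, Dict

# Hypothetical response body, shaped like the fields described above.
SAMPLE_RESPONSE: Dict[str, Any] = json.loads("""
{
  "model": "gpt-4o-2024-11-20",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {"role": "assistant",
                  "content": "Central India, South India, West India."}
    }
  ],
  "usage": {"prompt_tokens": 27, "completion_tokens": 13, "total_tokens": 40}
}
""")


def summarize(response: Dict[str, Any],
              input_price_per_1k: float,
              output_price_per_1k: float) -> Dict[str, Any]:
    """Pull out the reply, stop reason flags, and an estimated cost."""
    choice = response["choices"][0]
    usage = response["usage"]
    # Prompt and completion tokens are priced separately.
    cost = (usage["prompt_tokens"] / 1000) * input_price_per_1k \
        + (usage["completion_tokens"] / 1000) * output_price_per_1k
    return {
        "text": choice["message"].get("content") or "",
        "truncated": choice["finish_reason"] == "length",
        "filtered": choice["finish_reason"] == "content_filter",
        "total_tokens": usage["total_tokens"],
        "estimated_cost": cost,
    }


# Placeholder prices per 1,000 tokens; substitute your own rates.
summary = summarize(SAMPLE_RESPONSE, input_price_per_1k=0.0025,
                    output_price_per_1k=0.01)
```

Checking `truncated` after every call is a cheap way to notice that `max_tokens` is too low before users see cut-off answers.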

Part D — Call it from Python (api-key, then keyless)

Step D1 — Install the SDK.

pip install openai azure-identity

Step D2 — api-key version. The openai package ships an AzureOpenAI client. Save chat_key.py:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],   # https://<name>.openai.azure.com/
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21",
)

resp = client.chat.completions.create(
    model="gpt-4o",                                        # the DEPLOYMENT name, not the model
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Explain a deployment in Azure OpenAI in one sentence."},
    ],
    max_tokens=120,
    temperature=0.2,
)
print(resp.choices[0].message.content)
print("tokens:", resp.usage.total_tokens)

Run it:

export AZURE_OPENAI_ENDPOINT=$ENDPOINT
export AZURE_OPENAI_KEY=$API_KEY
python chat_key.py

Expected: one sentence printed, then a token count. Note the SDK quirk: the model= argument is the deployment name — the Azure client maps it onto the URL path for you.

Step D3 — Keyless (Entra ID) version — the production path. First, grant your own signed-in user the inference role so DefaultAzureCredential works locally:

ACCT_ID=$(az cognitiveservices account show -n $ACCT -g $RG --query id -o tsv)
ME=$(az ad signed-in-user show --query id -o tsv)
az role assignment create --assignee $ME \
  --role "Cognitive Services OpenAI User" \
  --scope $ACCT_ID

Expected: a role-assignment JSON. (Propagation can take a minute.) Now chat_keyless.py:

import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Mint Entra ID tokens for the Cognitive Services scope; no key anywhere.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,                # ← instead of api_key
    api_version="2024-10-21",
)

resp = client.chat.completions.create(
    model="gpt-4o",                                        # deployment name
    messages=[{"role": "user", "content": "Say 'keyless works' and nothing else."}],
    max_tokens=20,
)
print(resp.choices[0].message.content)

Run it (note: no key exported):

export AZURE_OPENAI_ENDPOINT=$ENDPOINT
python chat_keyless.py

Expected: keyless works. If you get 403, the role has not propagated yet or you assigned the wrong role (Contributor cannot infer; assign OpenAI User). In Azure, DefaultAzureCredential picks up the app’s managed identity automatically, so the same code runs with zero secrets once that identity holds the role.

Part E — Call it from JavaScript

Step E1 — Install.

npm install openai @azure/identity

Step E2 — Keyless chat.mjs (the recommended path; api-key shown in a comment):

import { AzureOpenAI } from "openai";
import { DefaultAzureCredential, getBearerTokenProvider } from "@azure/identity";

const scope = "https://cognitiveservices.azure.com/.default";
const azureADTokenProvider = getBearerTokenProvider(new DefaultAzureCredential(), scope);

const client = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,   // https://<name>.openai.azure.com/
  azureADTokenProvider,                          // keyless; or: apiKey: process.env.AZURE_OPENAI_KEY
  apiVersion: "2024-10-21",
  deployment: "gpt-4o",                           // the deployment id
});

const resp = await client.chat.completions.create({
  messages: [{ role: "user", content: "Reply with a single word: ready." }],
  max_tokens: 10,
});
console.log(resp.choices[0].message.content);
console.log("tokens:", resp.usage.total_tokens);

Run it:

export AZURE_OPENAI_ENDPOINT=$ENDPOINT
node chat.mjs

Expected: ready and a token count.

Part F — Turn on streaming

For chat UX you want tokens as they arrive. In Python, add stream=True and iterate (chat_stream.py):

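import os
from openai import AzureOpenAI

# Same client as chat_key.py; swap in azure_ad_token_provider for keyless.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-10-21",
)

stream = client.chat.completions.create(
    model="gpt-4o",                                        # deployment name
    messages=[{"role": "user", "content": "List 5 uses for GPT-4o, one per line."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()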

Expected: text printing incrementally rather than all at once. Streaming chunks carry delta.content instead of a full message; the final chunk has the finish_reason.
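If you also need the complete reply after streaming (for logging or storage), the deltas can be joined as they arrive. The sketch below is an illustration only: collect_stream and fake_chunk are hypothetical helpers, and the fake chunks merely imitate the choices[0].delta.content and finish_reason attributes so the logic can be exercised without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join delta.content pieces and remember the last finish_reason seen."""
    parts, finish_reason = [], None
    for chunk in chunks:
        if not chunk.choices:          # some chunks carry no choices; skip them
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(content=None, finish_reason=None):
    """Stand-in for an SDK chunk, used only to exercise the helper offline."""
    delta = SimpleNamespace(content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


demo = [
    SimpleNamespace(choices=[]),       # a chunk with no choices
    fake_chunk("Hel"),
    fake_chunk("lo"),
    fake_chunk(None, "stop"),          # final chunk: no text, only the reason
]
text, reason = collect_stream(demo)
print(text, reason)
```

With the real SDK you would pass the stream object from the call above in place of demo.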

Validation checklist

You proved the whole contract end to end. The lab steps and what each one demonstrates:

Step | What you did | What it proves
--- | --- | ---
A4–A5 / B4 | Deploy gpt-4o named gpt-4o | The deployment is the URL id; quota is a number you set
A7 | Playground chat | Path works before any code
C2 | curl with api-key | The raw HTTP contract (path + header)
D2 / E2 | SDK call, model = deployment | SDK maps deployment → path for you
D3 | Keyless with DefaultAzureCredential + OpenAI User | Production auth: zero secrets, role-gated
B6 | Bicep deploy | Repeatable, idempotent IaC
F | stream=True | Token-by-token UX

Teardown

Delete the resource group to stop all charges and remove the deployment, resource and role assignment scoped to it:

az group delete -n $RG --yes --no-wait

Expected: the command returns immediately; deletion completes in the background. Cost note: pay-as-you-go means you were billed only per token — a dozen tiny test calls are a few rupees. There is no idle/hourly charge for a Standard deployment, but deleting cleans up and prevents accidental future usage.

Common mistakes & troubleshooting

The first-deployment failure modes, as a scannable table, then the detail for the ones that bite hardest. Each is symptom → root cause → confirm → fix.

# | Symptom | Root cause | Confirm | Fix
--- | --- | --- | --- | ---
1 | 404 DeploymentNotFound | Model name in URL instead of the deployment name (or wrong case) | az ...deployment list shows the real name | Use the exact deployment id in the path
2 | 401 Unauthorized (Access denied…api key) | Wrong/rotated key, or wrong resource’s endpoint | az ...keys list; check endpoint matches resource | Use the matching key+endpoint, or switch to Entra ID
3 | 401/403 on a keyless call | Token missing the role, or local auth disabled and you sent a key | az role assignment list --assignee <id> | Assign Cognitive Services OpenAI User to the caller
4 | 403 though you’re an Owner-ish role | You hold Contributor, which cannot infer with Entra ID | Check the role; it lacks the inference DataAction | Add Cognitive Services OpenAI User explicitly
5 | 429 Too Many Requests from the first call | Deployment TPM capacity too low / quota exhausted | Deployment sku.capacity; quota blade | Raise capacity, switch region, or request quota
6 | 400 content filter / empty content | Prompt or completion hit the content filter | Response finish_reason: content_filter | Rephrase; adjust filter (with approval)
7 | model field “ignored”, wrong model answers | Body model doesn’t route on Azure; the deployment does | Which deployment id is in the URL | Point the URL at the intended deployment
8 | SDK error: unknown api_version shape | Mismatched/old preview api-version for the feature | The api_version string | Use GA 2024-10-21 (or the right preview)
9 | Could not create resource at step 1 | Subscription lacks Azure OpenAI access, or region full | Portal error; try another region | Use an enabled subscription/region
10 | finish_reason: length, reply truncated | max_tokens too small for the answer | The finish_reason value | Raise max_tokens (≤ model’s max output)
11 | DefaultAzureCredential fails locally | Not signed in / no credential in the chain | az account show | az login; or set a service-principal env credential
12 | Works locally, 403 from App Service | Managed identity lacks the role (your user had it) | Identity’s role assignments on the resource | Grant the identity OpenAI User, not just your user

1. 404 DeploymentNotFound — you put the model name in the path expecting OpenAI-style routing, but no deployment by that exact, case-sensitive name exists (or you named it chat-prod and used gpt-4o). Confirm with az cognitiveservices account deployment list -n $ACCT -g $RG -o table; fix by using the deployment name in the path. The body’s model field does not route on Azure — the URL does.

3 & 4. Keyless 401/403 and the Contributor trap — Entra ID auth needs a role with the inference DataAction. OpenAI User has it; Cognitive Services Contributor does not (it builds the resource and reads keys but cannot infer with a token) — the classic mis-assignment. Confirm with az role assignment list --assignee <principalId> --scope <accountId>; fix with az role assignment create --assignee <id> --role "Cognitive Services OpenAI User" --scope <accountId> and wait for propagation.

5. 429 on the very first request — the deployment’s TPM capacity is too small (a default of 1 ≈ 1,000 TPM is easy to exhaust with a long prompt), or the subscription’s quota for that model/region is used up. Check sku.capacity and the Quotas blade (used vs available TPM); fix by raising --sku-capacity, picking a free-quota region, switching to GlobalStandard, or requesting an increase — and add backoff retry regardless.

6. Content filter blocks a prompt or completion — the content filters (hate, sexual, violence, self-harm) flag inputs/outputs, returning a policy error or an empty completion with finish_reason: content_filter. Confirm via that finish_reason or the named category; fix by rephrasing, or for genuine false-positives request a tuned filter through approval — never disable safety for one prompt.

12. Works on your laptop, 403 in Azure — locally DefaultAzureCredential used your user (which has OpenAI User); in Azure it uses the app’s managed identity, which you never granted the role. Confirm the identity’s role assignments on the resource; fix by assigning Cognitive Services OpenAI User to the managed identity, not just your account.
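The backoff retry recommended for the 429 case (item 5) can be sketched as exponential backoff with jitter. Everything named here is an assumption for illustration: RateLimited stands in for whatever your client raises on HTTP 429 (with the openai package that would be openai.RateLimitError), and the delay values are arbitrary starting points, not recommended settings.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for whatever your client raises on HTTP 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); on RateLimited, wait with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts:
                raise                      # out of attempts: surface the 429
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))   # jitter spreads retries


# Offline demo: fail twice, then succeed. Sleeps are recorded, not performed.
attempts = {"n": 0}
waits = []


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```

Capping the attempts matters as much as the waiting: an uncapped retry loop against an exhausted quota only prolongs the 429s.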

Best practices

Security notes

Cost & sizing

A Standard/Global Standard deployment is purely usage-based — no hourly charge for existing. You pay per 1,000 tokens, separately for input (prompt) and output (completion, materially pricier); gpt-4o-mini is several times cheaper than gpt-4o. The levers are which model, tokens per call (prompt length + max_tokens), and call volume. Provisioned (PTU) flips this to a fixed reserved cost — worth it only at steady high volume; Batch is ~50% cheaper for async jobs tolerating a 24-hour turnaround. The cost drivers:

Cost driver | What you pay for | How to control it | Watch-out
--- | --- | --- | ---
Input (prompt) tokens | Tokens you send, per 1K | Trim context; summarise history; cache | Long RAG context inflates every call
Output (completion) tokens | Tokens generated, per 1K (pricier) | Cap max_tokens; ask for concise output | finish_reason: length = you capped too low or paid the cap
Model choice | 4o vs 4o-mini rate | Use gpt-4o-mini where quality allows | Over-using the flagship for trivial tasks
Deployment type | Pay-per-token vs PTU vs batch | Standard to start; PTU only at scale; batch for bulk | PTUs are a fixed monthly commitment
Call volume | Number of calls × tokens | Cache, dedupe, batch | A retry storm on 429 multiplies cost

For sizing: there is no free tier, but the floor is effectively zero — an idle Standard deployment costs nothing, so a learning resource left deployed (without calls) does not bill. A light internal assistant might run ₹3,000–15,000/month depending on traffic and whether it is gpt-4o or gpt-4o-mini; Saral Health’s clinical-summary workload landed near ₹14,000 (output-token-heavy). Right-sizing is mostly prompt hygiene (shorter context, capped output, the smaller model where it suffices) before it is anything architectural. Set a Cost Management budget + alert on the resource so a runaway loop surfaces fast.
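The arithmetic behind such an estimate is simple enough to sketch. The rates below are placeholders invented for the example, in no particular currency, and are not Azure prices; substitute the current per-1K rates for your model and region from the pricing page. The function name estimate_cost is likewise an assumption.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_per_1k: float, output_per_1k: float) -> float:
    """Cost of one call: input and output tokens are priced separately."""
    return ((prompt_tokens / 1000) * input_per_1k
            + (completion_tokens / 1000) * output_per_1k)


# PLACEHOLDER rates for illustration only; look up the real ones.
INPUT_PER_1K = 0.25
OUTPUT_PER_1K = 1.00

# One call with a 2,000-token prompt and a 500-token reply.
per_call = estimate_cost(2000, 500, INPUT_PER_1K, OUTPUT_PER_1K)

# 1,000 such calls a day for 30 days.
monthly = per_call * 1000 * 30
print(per_call, monthly)
```

Feeding the usage block from each response into a function like this, rather than guessing averages, keeps the forecast tied to what the workload actually sends and receives.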

Interview & exam questions

1. What goes in the request URL — the model name or the deployment name, and why? The deployment name. The path /openai/deployments/<name>/chat/completions routes by that name; the model is implied by the deployment. A model field in the body does not select the model as it does on OpenAI’s API — which is why DeploymentNotFound is the classic first error.

2. Difference between a resource, a model, and a deployment? The resource is a Microsoft.CognitiveServices account (kind OpenAI, SKU S0) owning the endpoint, keys and network. A model (gpt-4o) is the weights in Microsoft’s catalogue. A deployment is a named instance binding one model + version + capacity inside the resource — the unit you call, carrying the TPM quota.

3. Name the two authentication modes and when to use each. The api-key header (a long-lived shared secret, fine for quick tests) and Microsoft Entra ID bearer tokens (keyless, short-lived, role-gated — production). Keyless is mandatory when disableLocalAuth: true, and needs the Cognitive Services OpenAI User role and a token for scope https://cognitiveservices.azure.com/.default.

4. A Contributor on the resource gets 403 calling the model with a token. Why? Cognitive Services Contributor can build the resource and read keys but cannot infer with Entra ID — it lacks the inference DataAction. Assign Cognitive Services OpenAI User (or OpenAI Contributor) to the caller. The role that builds the resource is not the role that calls it.

5. What unit is quota measured in, and what if it is too low? Tokens-per-minute (TPM), granted per subscription/region/model; a deployment gets some as capacity (in thousands), and RPM derives from it. Too low or exhausted → 429 Too Many Requests (the most common failure after the URL mistake). Raise capacity, change region, or request an increase.

6. Sensible default deployment type for a new pay-as-you-go chat app, and why? GlobalStandard — pay-per-token, highest default quota, broadest model availability, data processed in any Azure region. Choose DataZoneStandard/Standard only when compliance pins a geography, and provisioned/batch only at scale.

7. What is the GA data-plane inference API version, and where does /openai/v1/ fit? 2024-10-21 is the stable data-plane version for chat completions. The next-generation /openai/v1/ path mirrors OpenAI’s API style and reduces frequent api-version bumps. Preview api-version values unlock features earlier but can change shapes.

8. In the SDKs, what do you pass as the model argument? The deployment name, not the model id — the AzureOpenAI client maps it onto the deployment path (the JS client also takes deployment). A frequent point of confusion when porting OpenAI code.

9. How do you call Azure OpenAI with no secrets at all? Enable a managed identity, grant it Cognitive Services OpenAI User on the resource, and use DefaultAzureCredential with a bearer-token provider for the Cognitive Services scope — the SDK mints and attaches Entra ID tokens automatically. Combine with disableLocalAuth: true.

10. What does the usage object tell you, and why care? It reports prompt_tokens, completion_tokens and total_tokens — the exact basis of your bill (output tokens cost more). Logging it per call tracks and forecasts spend; a finish_reason of length warns the answer was truncated by max_tokens.

11. You ported a working OpenAI curl and got DeploymentNotFound. Walk through the fix. The snippet named the model in the body and hit a shared host. On Azure: (a) target your endpoint https://<name>.openai.azure.com, (b) create a deployment of gpt-4o, and (c) put that deployment’s exact name in the path with ?api-version=2024-10-21. The body’s model field is not how routing works.

12. Standard vs Provisioned vs Batch — billing, one line each. Standard/GlobalStandard: pay-per-token, best-effort latency, bursty workloads. Provisioned (PTU): reserved capacity at fixed cost, guaranteed throughput, steady high volume. Batch: ~50% cheaper for async jobs with a 24-hour turnaround and no real-time SLA.

These map most directly to AI-102 (Azure AI Engineer Associate), under "plan and manage an Azure AI solution" and "implement generative AI solutions with Azure OpenAI"; the fundamentals appear in AI-900. The identity and networking angles (managed identity, RBAC, Private Endpoint) touch AZ-204 and AZ-500. A compact cert map:

Question theme | Primary cert | Objective area
--- | --- | ---
Resource/model/deployment, REST/SDK calls | AI-102 | Implement Azure OpenAI solutions
Deployment types, TPM quota, versions | AI-102 | Provision & manage Azure OpenAI
Keyless auth, managed identity, RBAC | AI-102 / AZ-500 | Secure AI services; manage identity
Generative-AI fundamentals on Azure | AI-900 | Generative AI workloads
Private networking for the endpoint | AZ-500 / AZ-700 | Secure & connect Azure services

Quick check

  1. On Azure OpenAI, you call POST .../openai/deployments/<X>/chat/completions. Is <X> the model name or the deployment name?
  2. Your keyless call returns 403, yet your account has Contributor on the resource. What role do you actually need?
  3. The first request to a brand-new deployment returns 429. What is the most likely cause and one fix?
  4. In the Python/JS SDK, what value do you pass as the model argument?
  5. Name the two authentication modes, and which one still works when disableLocalAuth is true.

Answers

  1. The deployment name — the one you chose when you deployed the model. The model is implied by the deployment; a model field in the body does not route on Azure.
  2. Cognitive Services OpenAI User (or OpenAI Contributor). Plain Contributor can build the resource and read keys but cannot infer with Entra ID — it lacks the inference DataAction.
  3. The deployment’s TPM capacity is too low / quota exhausted. Fix by raising --sku-capacity within available quota (or switching to a region/GlobalStandard with free quota, or requesting a quota increase), and add backoff retry.
  4. The deployment name, not the model id. The Azure client maps it onto the URL path (and the JS client can also take deployment on the client).
  5. The api-key header and Microsoft Entra ID bearer tokens. Only Entra ID works when local (key) auth is disabled.

Glossary

Next steps

You can now stand up Azure OpenAI and call a GPT-4o deployment three ways with both auth modes. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
