Build Your First Data Factory Pipeline: Copy Data from Blob Storage into Azure SQL

A CSV lands in Blob Storage every night — an export from a partner, a dump from an old system, a file someone drops in a folder — and somebody needs it inside a database table so a report can run against it. The hand-rolled answer is a script on a VM with a cron job, a connection string in a config file, and a pager when it silently stops. The managed answer is Azure Data Factory (ADF) — Azure’s cloud ETL/ELT (extract-transform-load) and data-integration service — and its single most important building block, the Copy activity, which moves data with no servers for you to run. This article builds exactly that: a pipeline that copies a CSV from Azure Blob Storage into a table in Azure SQL Database, the “hello world” of data engineering on Azure.

We build it three ways, so you understand what each piece is, not just where to click. First in the Azure portal (the ADF Studio visual canvas), because seeing the boxes connect makes the model concrete. Then with the az CLI to script it. Then as Bicep so the whole thing — factory, linked services, datasets, pipeline — is infrastructure-as-code you commit and redeploy. Along the way you meet the four nouns every ADF pipeline is made of: a linked service (a connection to a store), a dataset (a named pointer at data inside it), the Copy activity (the verb that moves bytes), and the integration runtime (the compute that does the copying). Get these four right once and every future pipeline is a variation on the theme.

By the end you will have a working pipeline you can trigger on demand or on a schedule, know how to confirm a run succeeded and read its rows-copied count, and be able to debug the four failures that bite every beginner: a firewall blocking the copy, a wrong-credential auth error, a CSV-to-table schema mismatch, and a malformed source file. Get the first copy right and the mapping data flows, parameterised pipelines and triggers all follow.

What problem this solves

Moving data between stores sounds trivial until you do it in production. The script-on-a-VM approach breaks in slow, expensive ways: the VM reboots mid-copy, the connection string sits in plain text, there is no retry on a transient blip, no record of what ran when, no alert when last night’s load silently produced zero rows, and somebody must keep the VM alive forever. Multiply that by a dozen partner feeds and you have a fragile cottage industry of cron jobs nobody fully understands.

Data Factory replaces all of it with a managed, serverless service. You declare what to copy and where; Azure runs it on compute it manages (the integration runtime), retries transient failures, records every run with row counts and durations, alerts on failure, and keeps credentials in managed identity or Key Vault. You pay per activity run and per data-movement-hour, not for an idle VM. The same Copy activity spans 90+ connectors — Blob, ADLS Gen2, SQL, Cosmos DB, Oracle, S3, Snowflake, SFTP, REST — so the pattern generalises to almost any source-to-sink.

Who hits this: every data engineer, every analytics team, every “we just need this file in the warehouse” request. CSV-to-SQL is the most common starting point because it is the most common real need — a flat file becoming rows in a relational table before anything downstream can use it. Learn this one shape and you have the skeleton for ingestion across the platform.

Learning objectives

By the end of this article you can:

Explain the four core ADF building blocks — linked service, dataset, Copy activity, integration runtime — and how they fit.
Author a Copy pipeline moving a CSV from Blob Storage into an Azure SQL table in the portal, the az CLI, and Bicep.
Choose the right authentication for each linked service (managed identity vs key vs SQL auth) and grant least privilege.
Configure the source and sink datasets: delimited-text settings, first-row-as-header, and table mapping.
Run the pipeline with Debug and a trigger, then confirm success and read the rows-read / rows-copied metrics from the monitoring view and CLI.
Diagnose the four beginner failures — firewall block, auth failure, schema/type mismatch, malformed source — using the exact error string and precise fix.
Add a schedule trigger, understand what each option costs, and tear the lab down so it stops billing.

Prerequisites & where this fits

You need an Azure subscription with rights to create resources (Contributor on a resource group), the az CLI or Cloud Shell, and a SQL client for verification — sqlcmd, Azure Data Studio, or the portal’s Query editor. Basic comfort with a CSV (rows, a header, comma-separated values) and a relational table (columns with types) is assumed. You do not need to know ADF already — that is the point.

This sits at the start of the Data engineering track. Upstream are the two stores it connects: Azure Storage Account Fundamentals: Blobs, Files, Queues and Tables for the source, and Your First Azure SQL Database: Create, Configure Firewall Rules and Connect Securely for the sink — read those first if either is unfamiliar, because half the beginner failures here are really storage or SQL firewall problems in an ADF costume. Downstream, this Copy pipeline is the foundation for mapping data flows, parameterised pipelines, and event/schedule triggers. For sizing the target later, Azure SQL Database Purchasing Models: DTU vs vCore and How to Pick Without Overpaying covers it. Everything you build — factory, linked services, datasets, pipeline — lives inside the Data Factory in your resource group; the two stores it connects (the Blob container and the SQL table) live in their own services and are usually owned by the source-system and DBA teams respectively.

Core concepts

Four nouns and one verb make up every ADF pipeline. Internalise these and the rest is detail.

A linked service is a connection string with a name. It tells ADF how to reach a store — the storage account or SQL server, plus the credential — and you reuse it across many datasets. The credential can be a managed identity (the factory’s own Azure AD identity — the key-free default), an account key or SAS (Blob), or SQL authentication (username/password, SQL). This is the single most common place auth and firewall problems surface.

A dataset is a named pointer at data inside a linked service. Where the linked service says “this storage account,” the dataset says “this container, this file, comma-delimited with a header row”; for SQL it says “this table.” It references exactly one linked service. You create two: a source (the CSV in Blob) and a sink (the table in SQL).

The Copy activity is the verb that moves bytes from a source dataset to a sink dataset. It reads the source, optionally maps columns, writes the sink, and is where you set column mapping, write behaviour (append vs upsert), batching and fault tolerance. It lives inside a pipeline (an ordered collection of activities — here, just the one Copy). The vast majority of ingestion is “a Copy activity from A to B.”

The integration runtime (IR) is the compute that performs the copy. For cloud-to-cloud you rarely create one — the AutoResolveIntegrationRuntime (Azure-managed, serverless) handles Blob-to-SQL automatically. You only need a Self-Hosted Integration Runtime (SHIR) when a source sits on-premises or in a private network ADF cannot reach over the internet. The IR burns the data-movement compute you pay for.

A trigger is what starts a pipeline run. Run on demand (Debug or a manual trigger), on a schedule (nightly at 02:00), on a tumbling window (fixed contiguous slices — good for backfills), or on an event (a blob is created). You run yours manually first, then add a schedule trigger.

The four building blocks side by side

Pin the vocabulary down as a grid before the build:

Building block	Definition	References	Example	Why it matters
Linked service	A connection + credential to a store	(the root)	`ls_blob_source`, `ls_sql_sink`	Where auth/firewall errors live
Dataset	A named pointer at data in a store	One linked service	`ds_csv_orders`, `ds_sql_orders`	Where schema/format settings live
Copy activity	Moves data source → sink	Two datasets	`CopyOrdersCsvToSql`	The verb; mapping lives here
Pipeline	Ordered set of activities	The activities	`pl_copy_orders`	The unit you run and schedule
Integration runtime	Compute that runs activities	(used by activities)	`AutoResolveIntegrationRuntime`	What you pay for; default is fine
Trigger	What starts a run	The pipeline	manual, then `trg_daily_2am`	On-demand vs scheduled vs event

How a Copy activity actually runs

Knowing the order of a run lets you place a failure on the right step. The trigger creates a pipeline run; the IR connects and authenticates to the source (reading the CSV) and the sink (writing rows), then reports rows read/copied, volume and duration. The takeaway: every failure is at the source (cannot reach Blob, file missing, malformed row) or the sink (firewall, auth, type mismatch) — the error names which side — and the IR needs network and credential access to both stores, so the classic mistake is fixing one side and forgetting the other. The phases, what fails in each, and where you confirm:

Phase	What happens	Can fail because	Where you see it
Trigger	Run created with a run ID	Trigger disabled / bad schedule	Monitor → Trigger runs
Source connect	IR authenticates to Blob	Wrong key/SAS; network rule blocks IR	Activity error: source side
Source read	CSV parsed row by row	File missing, bad delimiter, malformed row	Activity error: `DelimitedText`
Sink connect	IR authenticates to SQL	Firewall blocks IR; bad credential	`SqlFailedToConnect`
Sink write	Rows inserted in batches	Type mismatch, table missing, not-null	`SqlOperationFailed`
Complete	Metrics reported	(success)	Monitor → activity output

Choosing authentication for each linked service

Auth is where most first pipelines stall, so decide it deliberately. For Blob: an account key, a SAS token, or managed identity. For SQL: SQL authentication or managed identity mapped to a database user. The secure default is managed identity on both sides — no secrets — needing two grants (Storage Blob Data Reader on the account, a contained DB user in SQL). Key/SQL-auth is faster for a first lab; this article shows that path, and the managed-identity grants in Security notes. The options and when to pick which:

Store	Auth option	Secret?	Effort	When to use
Blob	Account key	Yes	Lowest	Quick labs
Blob	SAS token	Yes (expires)	Low	Scoped, time-bound access
Blob	Managed identity	No	Medium	Production default
SQL	SQL authentication	Yes	Lowest	Quick labs; legacy apps
SQL	Managed identity	No	Medium	Production default
Either	Key Vault-backed	In vault	Medium	Centralised secret governance

The single rule: never paste a key-bearing connection string into pipeline JSON that lands in git. Use managed identity or reference Key Vault. The lab keeps the key inside the linked service (ADF stores it encrypted); the Bicep references the storage key via listKeys() at deploy time rather than hardcoding it.

The source and sink datasets in detail

The two datasets carry the format and schema decisions, and two settings cause the most confusing beginner failures.

Source (CSV) dataset (type DelimitedText): the column delimiter (comma default), the quote character ("), the encoding (UTF-8 default; Windows exports are often UTF-8-BOM or Latin-1), and crucially First row as header — if the first line is column names this must be checked, or ADF treats the header as a data row and types every column as text. Point a dataset at a folder and ADF copies every file in it.

Sink (SQL) dataset (type AzureSqlTable, one table): the table name (schema-qualified, dbo.Orders), whether to auto-create the table (convenient, but it infers nvarchar(max), usually unwanted), and the column mapping — matching names auto-map, differing names you map by hand. The settings most likely to cause a silent wrong result or a hard error:

Setting	Lives on	Default	When to change	Gotcha if wrong
First row as header	Source	off	CSV has a header line	Header copied as data; columns all text
Column delimiter	Source	`,`	TSV / pipe / semicolon files	Whole row in one column
Quote character	Source	`"`	Non-standard quoting	Fields with commas split wrong
Encoding	Source	UTF-8	Latin-1 / UTF-8-BOM	Garbled accents; BOM in first column
Table name	Sink	(you set it)	Always set	Wrong/missing → `SqlOperationFailed`
Auto create table	Copy sink	off	Quick load of a new table	Types inferred `nvarchar(max)`
Write behaviour	Copy sink	Insert	Upsert / overwrite needed	Duplicate rows on re-run

Architecture at a glance

The pipeline is a short left-to-right path. On the left, the source zone: a Blob Storage container holding orders.csv, reached through a linked service and read through a DelimitedText dataset. In the middle, the Data Factory: a pipeline with a single Copy activity on the AutoResolveIntegrationRuntime, the serverless compute that streams rows source to sink. On the right, the sink zone: an Azure SQL Database with the dbo.Orders table, reached through its own linked service and written through an AzureSqlTable dataset.

Follow the flow: a trigger starts a run; the Copy activity, on the Azure IR, authenticates to Blob and streams the CSV in (parsing it with the delimited-text settings); it simultaneously authenticates to SQL — where the SQL firewall must allow Azure services or the IR’s IP, the number-one beginner failure — and writes the rows in batches, finishing with rows-read and rows-copied metrics. The IR holds two independent connections, one per store, each needing its own network path and credential. The numbered badges mark the four hops where beginners get stuck — Blob auth, the SQL firewall, the SQL insert/type mapping, and the header/column mapping — and the legend says how to confirm and fix each.

Real-world scenario

Brightleaf Logistics is a freight broker in Pune with a four-person analytics team. Every night their largest carrier drops a shipments export — shipments_YYYYMMDD.csv, around 180,000 rows / 40 MB — into an SFTP folder, and a finance dashboard on Azure SQL needs those rows by 06:00 for the morning margin report. For two years this ran as a Python script on a B2s VM: paramiko to pull, pandas to read, pyodbc to insert. It worked until it didn’t.

The failures were the classic VM-script failures. The VM rebooted for patching at 02:30 one Tuesday and the load simply did not run; nobody noticed until the 09:00 standup when the margin numbers were a day stale. A month later the carrier added a column and the pandas-to-SQL insert started throwing on a column-count mismatch — no error handling, so it crash-looped and filled the disk with stack traces. And the database password sat in plain text in /etc/shipments/config.ini, flagged in a security review. The team spent more time nursing the script than analysing freight.

They rebuilt it in Data Factory in an afternoon. No Self-Hosted IR was needed — the carrier agreed to drop the file into a Blob container via SAS instead of SFTP (one connector change). The pipeline became a single Copy activity from a DelimitedText source to an AzureSqlTable sink, authenticated with the factory’s managed identity on both sides — no password anywhere. A schedule trigger fired it at 02:00, fault tolerance skipped and logged incompatible rows rather than failing the load, and an alert on the pipeline-failed metric paged the on-call. The VM was decommissioned.

The payoff was immediate. The patch-reboot failure vanished — there was no VM to reboot. When the carrier added another column three months later, the Copy activity’s explicit mapping ignored it cleanly instead of crash-looping, and the engineer added it in five minutes. Cost dropped from a continuously-running B2s VM (~₹3,200/month) to ADF’s per-run pricing — roughly 30 activity runs/month plus ~0.2 data-movement hours per run, near ₹350/month — an order of magnitude cheaper, with retries, logging, alerting and no secrets. The lesson on the wall: “The copy was never the hard part. The VM, the password and the silence on failure were — and ADF deletes all three.”

Advantages and disadvantages

The managed Copy activity is the right default for ingestion, but not free of trade-offs:

Advantages	Disadvantages
Serverless — no VM to patch, scale or babysit	Learning curve (linked service vs dataset vs IR confuses beginners)
90+ connectors; one Copy pattern moves almost anything	Per-run + per-DIU-hour pricing can surprise on high-frequency tiny copies
Built-in retries, logging, row counts and failure alerting	Row-level transformation needs mapping data flows (Spark, more cost)
Managed identity / Key Vault keep secrets out of config	On-prem / private sources need a Self-Hosted IR you maintain
Runs are recorded and auditable (run ID, duration, rows)	Debugging means reading ADF error codes, not your own logs
IaC-friendly (ARM/Bicep/Terraform); fits CI/CD	The generated JSON is verbose in a diff if hand-edited

The model wins decisively for scheduled, recurring movement between supported stores where you want operational simplicity and auditability over a hand-built script. It is a weaker fit for heavy in-flight transformation (reach for mapping data flows or Databricks), extreme cadence (thousands of tiny copies a minute — batch them), or a source unreachable from Azure without a Self-Hosted IR. For the CSV-to-SQL shape here, ADF is almost always the right answer.

Hands-on lab

This is the centerpiece. You build the full pipeline three ways — portal, az CLI, and Bicep — copying a CSV from Blob into Azure SQL, validating, and tearing it down. Everything here costs a few rupees. Do the portal walkthrough first to see the model, then the CLI/Bicep to script it. Run CLI steps in Cloud Shell (Bash).

Part 0 — Shared setup (storage, SQL, and the sample CSV)

You need a storage account with a CSV and a SQL database with a target table. Set up both first; the portal/CLI/Bicep parts reuse them.

Step 1 — Variables and resource group.

RG=rg-adf-lab
LOC=centralindia
SA=stadflab$RANDOM                 # storage account, globally unique, lowercase
SQLSRV=sql-adf-lab-$RANDOM         # SQL logical server, globally unique
SQLDB=db-orders
SQLADMIN=adflabadmin
SQLPASS='P@ssw0rd-Change-Me-2026!' # use a strong unique password
ADF=adf-lab-$RANDOM               # data factory, globally unique
az group create -n $RG -l $LOC -o table

Step 2 — Create the storage account and container.

az storage account create -n $SA -g $RG -l $LOC --sku Standard_LRS --kind StorageV2 -o table
KEY=$(az storage account keys list -n $SA -g $RG --query "[0].value" -o tsv)
az storage container create -n incoming --account-name $SA --account-key "$KEY" -o table

Expected: the account row with sku.name = Standard_LRS, then "created": true for the container.

Step 3 — Create the sample CSV and upload it as orders.csv:

cat > orders.csv <<'CSV'
OrderId,Customer,Amount,OrderDate
1001,Acme Corp,1499.50,2026-06-01
1002,Globex,250.00,2026-06-02
1003,Initech,9999.99,2026-06-03
1004,Umbrella,42.00,2026-06-04
CSV
az storage blob upload --account-name $SA --account-key "$KEY" \
  --container-name incoming --name orders.csv --file orders.csv --overwrite -o table

Expected: an upload result with the blob name and a lastModified timestamp (four data rows plus a header).

Step 4 — Create the Azure SQL logical server and database.

az sql server create -n $SQLSRV -g $RG -l $LOC \
  --admin-user $SQLADMIN --admin-password "$SQLPASS" -o table
az sql db create -s $SQLSRV -g $RG -n $SQLDB \
  --service-objective S0 --backup-storage-redundancy Local -o table

Expected: the server row, then the database row with currentServiceObjectiveName = S0 (a cheap DTU tier, fine for a lab).

Step 5 — Open the SQL firewall for Azure services AND your client. The most important step — without it the IR cannot connect (the number-one beginner error). The rule 0.0.0.0–0.0.0.0 means “allow Azure services” (including the ADF IR):

# Allow Azure services (the ADF integration runtime) to reach the server
az sql server firewall-rule create -s $SQLSRV -g $RG \
  -n AllowAzureServices --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0 -o table

# Allow YOUR current public IP so you can run sqlcmd to verify
MYIP=$(curl -s https://api.ipify.org)
az sql server firewall-rule create -s $SQLSRV -g $RG \
  -n AllowMyClient --start-ip-address $MYIP --end-ip-address $MYIP -o table

Expected: two firewall-rule rows. The AllowAzureServices rule with 0.0.0.0/0.0.0.0 is the one that lets the Copy activity connect.

Step 6 — Create the target table via the portal Query editor (log in with $SQLADMIN/$SQLPASS) or sqlcmd:

CREATE TABLE dbo.Orders (
  OrderId     INT          NOT NULL PRIMARY KEY,
  Customer    NVARCHAR(100) NOT NULL,
  Amount      DECIMAL(10,2) NOT NULL,
  OrderDate   DATE          NOT NULL
);

Expected: “Commands completed successfully.” Column names match the CSV header exactly, so ADF auto-maps.

Part A — Build it in the Azure portal (ADF Studio)

Step A1 — Create the Data Factory. Portal → Create a resource → search Data Factory → Create. Pick rg-adf-lab, a globally-unique name, Central India, V2. Leave Git config disabled for the lab. Review + create, then Open Azure Data Factory Studio.

Step A2 — Create the Blob source linked service. In Studio: Manage → Linked services → + New → Azure Blob Storage → Continue. Name it ls_blob_source. Authentication Account key, select your subscription and the stadflab… account. Test connection → Connection successful → Create.

Step A3 — Create the SQL sink linked service. + New → Azure SQL Database → Continue. Name ls_sql_sink, server sql-adf-lab-…, database db-orders. Authentication SQL authentication, user adflabadmin, password as set. Test connection → Connection successful → Create. (If it fails, the Step 5 firewall rule is missing — see troubleshooting.)

Step A4 — Create the source dataset (the CSV). Author → Datasets → + → New dataset → Azure Blob Storage → DelimitedText → Continue. Name ds_csv_orders, linked service ls_blob_source, File path container incoming / file orders.csv. Check “First row as header.” Import schema From connection/store. OK.

Step A5 — Create the sink dataset (the table). + → New dataset → Azure SQL Database → Continue. Name ds_sql_orders, linked service ls_sql_sink, table dbo.Orders, OK.

Step A6 — Build the pipeline with a Copy activity. Pipelines → + → Pipeline, name pl_copy_orders. From Activities → Move and transform → drag Copy data onto the canvas; name it CopyOrdersCsvToSql.

Source tab: dataset ds_csv_orders.
Sink tab: dataset ds_sql_orders; leave write behaviour at the default (Insert/append).
Mapping tab: click Import schemas — shared names mean all four columns auto-map. Verify Amount maps to a decimal and OrderDate to a date.

Step A7 — Validate, Debug, and read the result. Click Validate (no errors expected), then Debug to run. Watch the Output tab: the activity goes Queued → In progress → Succeeded. Click the activity’s details (glasses) icon and read Rows read: 4, Rows written: 4, Data read, Throughput, Duration.

Step A8 — Publish and add a schedule trigger. Click Publish all to save everything to the factory. Then Add trigger → New/Edit → + New → type Schedule, recurrence Every 1 Day at 02:00, your time zone → OK → Publish. The pipeline now runs nightly. (For the lab, leave it disabled or delete it in teardown.)

You built, in order: two linked services (A2/A3 — the firewall step gated A3), two datasets (A4/A5), a pipeline + Copy activity (A6), a run (A7 — 4 read, 4 written), and a schedule trigger (A8).

Part B — Build it with the `az` CLI

The CLI defines each object from a small JSON file with az datafactory, reusing Part 0’s storage, SQL and table. (Install the extension once: az extension add --name datafactory.)

Step B1 — Create the factory.

az extension add --name datafactory 2>/dev/null
az datafactory create --resource-group $RG --factory-name $ADF --location $LOC -o table

Expected: the factory row with provisioningState = Succeeded.

Step B2 — Create the Blob linked service (JSON references the storage key):

cat > ls_blob.json <<JSON
{ "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=$SA;AccountKey=$KEY;EndpointSuffix=core.windows.net"
  } }
JSON
az datafactory linked-service create --resource-group $RG --factory-name $ADF \
  --linked-service-name ls_blob_source --properties @ls_blob.json -o table

Step B3 — Create the SQL linked service.

cat > ls_sql.json <<JSON
{ "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "Server=tcp:$SQLSRV.database.windows.net,1433;Database=$SQLDB;User ID=$SQLADMIN;Password=$SQLPASS;Encrypt=true;TrustServerCertificate=false;Connection Timeout=30;"
  } }
JSON
az datafactory linked-service create --resource-group $RG --factory-name $ADF \
  --linked-service-name ls_sql_sink --properties @ls_sql.json -o table

Step B4 — Create the source and sink datasets.

cat > ds_csv.json <<'JSON'
{ "type": "DelimitedText",
  "linkedServiceName": { "referenceName": "ls_blob_source", "type": "LinkedServiceReference" },
  "typeProperties": {
    "location": { "type": "AzureBlobStorageLocation", "container": "incoming", "fileName": "orders.csv" },
    "columnDelimiter": ",", "firstRowAsHeader": true, "quoteChar": "\"" } }
JSON
az datafactory dataset create --resource-group $RG --factory-name $ADF \
  --dataset-name ds_csv_orders --properties @ds_csv.json -o table

cat > ds_sql.json <<'JSON'
{ "type": "AzureSqlTable",
  "linkedServiceName": { "referenceName": "ls_sql_sink", "type": "LinkedServiceReference" },
  "schema": [], "typeProperties": { "schema": "dbo", "table": "Orders" } }
JSON
az datafactory dataset create --resource-group $RG --factory-name $ADF \
  --dataset-name ds_sql_orders --properties @ds_sql.json -o table

Step B5 — Create the pipeline with the Copy activity.

cat > pl_copy.json <<'JSON'
{ "activities": [ {
    "name": "CopyOrdersCsvToSql", "type": "Copy",
    "inputs":  [ { "referenceName": "ds_csv_orders", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "ds_sql_orders", "type": "DatasetReference" } ],
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "sink":   { "type": "AzureSqlSink", "writeBehavior": "insert" },
      "enableStaging": false } } ] }
JSON
az datafactory pipeline create --resource-group $RG --factory-name $ADF \
  --name pl_copy_orders --pipeline @pl_copy.json -o table

Step B6 — Run it on demand and read the result.

RUN_ID=$(az datafactory pipeline create-run --resource-group $RG --factory-name $ADF \
  --name pl_copy_orders --query runId -o tsv)
echo "Run: $RUN_ID"

# Poll the run status until it leaves InProgress/Queued
az datafactory pipeline-run show --resource-group $RG --factory-name $ADF \
  --run-id $RUN_ID --query "{status:status, start:runStart, durationMs:durationInMs}" -o table

Expected: status moves to Succeeded (re-run the show command until it does). For rows copied, query the activity run:

# Activity-level output includes rowsRead / rowsCopied
az datafactory activity-run query-by-pipeline-run --resource-group $RG --factory-name $ADF \
  --run-id $RUN_ID \
  --last-updated-after $(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --last-updated-before $(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v+1H +%Y-%m-%dT%H:%M:%SZ) \
  --query "value[0].output.{read:rowsRead, copied:rowsCopied, dataRead:dataRead}" -o table

Expected: read = 4, copied = 4; verify with SELECT COUNT(*) FROM dbo.Orders; → 4. Same six objects as the portal, now scriptable: create, linked-service create ×2, dataset create ×2, pipeline create, then pipeline create-run.

Part C — Build it as Bicep (infrastructure-as-code)

The same objects declared declaratively, so the whole factory is reviewable and redeployable. This assumes the storage account, SQL server/database and table exist (from Part 0); it pulls the storage key at deploy time with listKeys() and takes the SQL password as a secure parameter.

@description('Existing storage account name (source).')
param storageAccountName string
@description('Existing SQL logical server name (sink).')
param sqlServerName string
param sqlDatabaseName string = 'db-orders'
param sqlAdminUser string = 'adflabadmin'
@secure()
param sqlAdminPassword string
param location string = resourceGroup().location
param factoryName string = 'adf-lab-${uniqueString(resourceGroup().id)}'

// Reference the existing storage account to read its key at deploy time
resource sa 'Microsoft.Storage/storageAccounts@2023-05-01' existing = {
  name: storageAccountName
}

resource adf 'Microsoft.DataFactory/factories@2018-06-01' = {
  name: factoryName
  location: location
  identity: { type: 'SystemAssigned' } // enable for later managed-identity hardening
}

resource lsBlob 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: adf
  name: 'ls_blob_source'
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};AccountKey=${sa.listKeys().keys[0].value};EndpointSuffix=${environment().suffixes.storage}'
    }
  }
}

resource lsSql 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: adf
  name: 'ls_sql_sink'
  properties: {
    type: 'AzureSqlDatabase'
    typeProperties: {
      connectionString: 'Server=tcp:${sqlServerName}${environment().suffixes.sqlServerHostname},1433;Database=${sqlDatabaseName};User ID=${sqlAdminUser};Password=${sqlAdminPassword};Encrypt=true;TrustServerCertificate=false;Connection Timeout=30;'
    }
  }
}

resource dsCsv 'Microsoft.DataFactory/factories/datasets@2018-06-01' = {
  parent: adf
  name: 'ds_csv_orders'
  properties: {
    type: 'DelimitedText'
    linkedServiceName: { referenceName: lsBlob.name, type: 'LinkedServiceReference' }
    typeProperties: {
      location: { type: 'AzureBlobStorageLocation', container: 'incoming', fileName: 'orders.csv' }
      columnDelimiter: ','
      firstRowAsHeader: true
      quoteChar: '"'
    }
  }
}

resource dsSql 'Microsoft.DataFactory/factories/datasets@2018-06-01' = {
  parent: adf
  name: 'ds_sql_orders'
  properties: {
    type: 'AzureSqlTable'
    linkedServiceName: { referenceName: lsSql.name, type: 'LinkedServiceReference' }
    typeProperties: { schema: 'dbo', table: 'Orders' }
  }
}

resource pl 'Microsoft.DataFactory/factories/pipelines@2018-06-01' = {
  parent: adf
  name: 'pl_copy_orders'
  properties: {
    activities: [
      {
        name: 'CopyOrdersCsvToSql'
        type: 'Copy'
        inputs:  [ { referenceName: dsCsv.name, type: 'DatasetReference' } ]
        outputs: [ { referenceName: dsSql.name, type: 'DatasetReference' } ]
        typeProperties: {
          source: { type: 'DelimitedTextSource' }
          sink:   { type: 'AzureSqlSink', writeBehavior: 'insert' }
          enableStaging: false
        }
      }
    ]
  }
}

output factory string = adf.name
output principalId string = adf.identity.principalId // grant this Storage Blob Data Reader to go key-free

Deploy, then trigger a run:

az deployment group create -g $RG --template-file adf.bicep \
  --parameters storageAccountName=$SA sqlServerName=$SQLSRV sqlAdminPassword="$SQLPASS" -o table

# Then run it exactly as in Part B, step B6 (create-run + poll + read rows)

Expected: a successful deployment, then a run that copies 4 rows. The principalId output is the factory’s managed identity — grant it Storage Blob Data Reader and create a contained SQL user for it to drop the keys entirely (see Security notes).

Validation checklist and teardown

Validate the whole thing end to end:

# 1) The run reported success and copied 4 rows (Part B/C step 6 output)
# 2) The rows are actually in SQL — run in Query editor or sqlcmd:
#    SELECT COUNT(*) FROM dbo.Orders;   -> 4
#    SELECT * FROM dbo.Orders ORDER BY OrderId;  -> the four orders

What to check	How	Expected
Run succeeded	Monitor / `pipeline-run show`	status = Succeeded
Rows copied	activity output `rowsCopied`	4
Rows in SQL	`SELECT COUNT(*) FROM dbo.Orders`	4
Types correct	`SELECT * FROM dbo.Orders`	Amount decimal, OrderDate a date
Trigger created	Monitor → Trigger runs	the schedule trigger listed

Teardown — stop all billing. Delete the whole resource group when done:

az group delete -n $RG --yes --no-wait

This removes the factory, storage, SQL server/database and any trigger in one shot. ADF has no idle charge, but the SQL S0 database and stored CSV bill while they exist, so delete the group. Helper files (*.json, adf.bicep, orders.csv) you rm separately.

Common mistakes & troubleshooting

These four (plus a few) account for nearly every failed first pipeline. Scan the table, then read the detail for your row — each gives the exact error string and the precise fix.

#	Symptom / error	Root cause	Confirm	Fix
1	`SqlFailedToConnect` / network error to SQL	SQL firewall blocks the ADF IR	SQL server → Networking; no `0.0.0.0` allow-Azure rule	`az sql server firewall-rule create … --start-ip 0.0.0.0 --end-ip 0.0.0.0`
2	`Login failed for user` / Blob `403`	Wrong credential in the linked service	Linked service → Test connection names the side	Re-enter password / refresh key; or managed identity
3	`… cannot be converted to type Decimal` / `SqlOperationFailed`	Type mismatch (CSV text doesn’t fit the column)	Activity error names the column + target type	Fix column type or value; map explicitly on Mapping tab
4	All columns text / header is a data row	First row as header unchecked	Source dataset → Connection → checkbox off	Check it; re-import schema; re-run
5	`Column count mismatch` / extra-or-missing cols	CSV columns ≠ table columns after a source change	Compare CSV header to table; Mapping tab	Add/remove the column; enable explicit mapping
6	`BlobNotFound` / blob does not exist	Wrong container/file path, or not uploaded	Dataset path vs `az storage blob list`	Fix container/fileName; upload the file
7	Garbled accents / `?` or BOM in first column	Wrong encoding on the source dataset	Check the CSV’s encoding	Set dataset encoding (UTF-8 / UTF-8-BOM / Latin-1)
8	Duplicate rows after re-running	Write behaviour is insert (append)	Row count doubled in SQL	Upsert (set keys) or truncate via pre-copy script
9	`RequestSizeExceeded` / throttling on a big file	Batch size / DTU mismatch	Activity duration; SQL DTU pinned	Tune `writeBatchSize`; raise SQL tier; enable staging

Two of these deserve the reasoning the table can’t carry:

The firewall block (SqlFailedToConnect, number-one failure). The IR connects from Azure’s IP space, so unless “Allow Azure services and resources to access this server” is on (the 0.0.0.0–0.0.0.0 rule), every Copy to SQL fails at sink-connect. Confirm in SQL server → Networking; fix with az sql server firewall-rule create … --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0. For production prefer a Private Endpoint + managed VNet IR over opening to all Azure services — see Azure Private Endpoint vs Service Endpoint: Secure PaaS Access.

Header copied as data (everything is text). First row as header is unchecked on the source dataset, so ADF reads the column-name line as data and types every column as string. Confirm on the dataset’s Connection tab (the preview shows OrderId/Customer/... as a data row); check the box, re-import the schema, re-run. This one-checkbox bug accounts for a surprising share of “why is my data all wrong” tickets.

Best practices

Use managed identity, not keys, for both linked services beyond a quick lab — no secrets to store, rotate or leak.
Set “First row as header” deliberately on every DelimitedText dataset; it is the most common silent-wrong-result bug.
Define explicit column mappings rather than relying on name-matching, so a new source column does not silently break or pollute the load.
Prefer a real, typed target table over “auto-create table” — auto-create infers nvarchar(max) and you lose type safety.
Make re-runs idempotent: upsert with a key, or truncate/overwrite, so re-running never duplicates rows.
Turn on fault tolerance (skip + log incompatible rows) for messy external files, so one bad row doesn’t fail a 180k-row load.
Parameterise file path, table and container once you have more than one feed, so one pipeline serves many files.
Keep the factory in Git so changes are reviewed and the published JSON is versioned — never author against Live mode in production.
Co-locate the IR with your stores to cut data-movement cost and latency, and alert on pipeline failure (PipelineFailedRuns) so a silent zero-row night pages you instead of surfacing at standup.

Security notes

The secure target is zero secrets in the pipeline and least-privilege access to each store. Three moves.

Use the factory’s managed identity for both stores. Enable a system-assigned identity (the Bicep does this), then grant the minimum role: Storage Blob Data Reader on the account (read-only), and a contained database user in SQL with just db_datawriter. The grants:

# Grant the factory's identity read on the storage account (least privilege)
ADF_PRINCIPAL=$(az datafactory show -g $RG -n $ADF --query identity.principalId -o tsv)
SA_ID=$(az storage account show -n $SA -g $RG --query id -o tsv)
az role assignment create --assignee $ADF_PRINCIPAL \
  --role "Storage Blob Data Reader" --scope $SA_ID

-- In the target DB: create a user for the factory's identity and grant only write
CREATE USER [adf-lab-xxxx] FROM EXTERNAL PROVIDER;   -- use the factory name
ALTER ROLE db_datawriter ADD MEMBER [adf-lab-xxxx];

Prefer Key Vault for any credential you cannot avoid. If a connector only supports a key/secret, store it in Azure Key Vault and reference it from the linked service, so the secret lives in the vault (audited, rotatable), never in the JSON. See Azure Key Vault: Secrets, Keys and Certificates Done Right.

Lock the network down for production. The lab’s 0.0.0.0 allow-Azure SQL rule is broad — it admits any Azure tenant’s services. For production, put SQL and storage behind Private Endpoints and run the copy on a managed VNet IR, then remove the rule. Azure Private Link and Private DNS: Keeping PaaS Off the Public Internet covers the pattern. The posture, lab to production:

Control	Lab default	Production target
Blob auth	Account key	Managed identity + Storage Blob Data Reader
SQL auth	SQL username/password	Managed identity + contained DB user (least role)
SQL network	`0.0.0.0` allow-Azure rule	Private Endpoint + managed VNet IR; rule removed
Secret storage	Encrypted in the factory	Key Vault-referenced where unavoidable
Audit	Run history	Run history + Monitor alerts + diagnostic logs

Cost & sizing

ADF Copy pricing has two components, both small here: per activity run (pennies per 1,000) and per data-integration unit hour (DIU-hour) of movement (IR compute time). A DIU measures the IR’s CPU/memory/network; the Azure IR auto-allocates them (~4 to start) and you pay time × DIUs. A small CSV copy costs a fraction of a rupee; the nightly 180k-row scenario job ran near ₹350/month. There is no idle charge — a factory that runs nothing costs nothing.

The two stores cost more than the pipeline. Azure SQL at S0 (10 DTU) is ~₹1,200–1,300/month if left running; scale to Basic (~₹350) for a lab, or delete. Blob for a few small CSVs is a few rupees. The dominant lab cost is the SQL database — which is why teardown deletes the resource group. The drivers and how to control each:

Cost driver	What you pay for	Rough INR / month	How to control it
ADF activity runs	Per pipeline/activity run	a few ₹	Batch files per run; avoid tiny frequent runs
Data movement (DIU-hours)	IR time × DIUs during copy	₹50–400	Co-locate regions; right DIU
Azure SQL sink	The S0/Basic database	~₹350 (Basic) – ₹1,300 (S0)	Scale down or delete when idle
Blob storage	CSV bytes + transactions	a few ₹	Lifecycle-tier or delete old files
Self-Hosted IR (if used)	The VM you run it on	VM cost (not needed)	Only for private/on-prem sources

For the SQL target, a nightly batch into a small reporting table is comfortable on Basic/S0; heavier or concurrent loads want more DTUs or a vCore tier — Azure SQL Database Purchasing Models: DTU vs vCore and How to Pick Without Overpaying covers choosing without overspending.

Interview & exam questions

1. The four core building blocks of an ADF pipeline? A linked service (connection + credential to a store), a dataset (pointer at data inside a linked service, carrying format/schema), the Copy activity (the verb moving data source → sink), and the integration runtime (compute that runs activities). A pipeline groups activities; a trigger starts a run.

2. Difference between a linked service and a dataset? A linked service says how to reach a store; a dataset says which data and in what shape (this file, comma-delimited with a header; or this table). A dataset references exactly one linked service, and many can share one.

3. When do you need a Self-Hosted IR instead of the default Azure IR? When a source/sink is on-premises or in a private network the Azure-managed IR cannot reach over the internet. The default AutoResolveIntegrationRuntime handles cloud-to-cloud copies (Blob-to-Azure-SQL); a SHIR is one you install on a machine with line-of-sight to the private store.

4. A Copy to Azure SQL fails with a connection error. Most likely cause for a first pipeline? The SQL firewall is not allowing the ADF IR, which connects from Azure’s IP space. Fix: enable “Allow Azure services” (the 0.0.0.0–0.0.0.0 rule), or use a Private Endpoint with a managed VNet IR in production.

5. Your data has the header in the first data row and every column typed as text. What is wrong? First row as header is unchecked on the source dataset, so ADF reads the header as data and types all columns as strings. Check it, re-import the schema, re-run.

6. What does write behaviour control, and why does re-running duplicate rows? It determines how rows land — insert/append (the default) just adds rows, so re-running appends them again. Use upsert (with keys) or truncate/overwrite for idempotent re-runs.

7. How is ADF Copy priced, and does an idle factory cost money? Per activity run and per DIU-hour of data-movement compute (IR time × DIUs). An idle factory has no standing charge — you pay only when pipelines run.

8. How do you avoid storing any secret in the pipeline? Use the factory’s managed identity for both linked services — Storage Blob Data Reader on the account and a least-privilege contained DB user in SQL. For connectors needing a key, store it in Key Vault and reference it rather than embedding it.

9. A nightly load of a large CSV throttles the SQL database. What can you tune? Lower the write batch size, temporarily raise the SQL tier for the load window, enable staging (PolyBase/COPY for bulk where supported), and keep the IR in the same region as the database. The mismatch is copy throughput vs the database’s DTU/vCore headroom.

10. Which trigger types does ADF support and when do you use each? Schedule (fixed times — nightly batches), tumbling window (contiguous non-overlapping slices — backfills/dependencies), event (blob created/deleted — file-arrival ingestion), and manual/on-demand (Debug or a manual trigger).

11. Difference between Debug and triggering a published pipeline? Debug runs the in-progress, unpublished version from the canvas for quick iteration (with activity breakpoints); triggering runs the published pipeline — the version scheduled/automated runs use. Debug to develop, publish to make it real, then trigger.

These map to DP-203 (Azure Data Engineer Associate) — ingest and transform data, orchestration, and integration runtimes — and the data-movement portions of DP-900 (Azure Data Fundamentals). A compact cert mapping for revision:

Question theme	Primary cert	Objective area
Building blocks, linked service vs dataset	DP-900 / DP-203	Core data integration concepts
Integration runtimes (Azure vs SHIR)	DP-203	Design & implement data processing
Copy activity, write behaviour, mapping	DP-203	Ingest and transform data
Triggers (schedule/tumbling/event)	DP-203	Orchestrate data processing
Security (managed identity, Key Vault, firewall)	DP-203 / AZ-500	Secure data; manage identities
Cost (DIU, activity runs, no idle charge)	DP-203	Optimize and manage data solutions

Quick check

Name the four core building blocks of an ADF pipeline and what each one is for.
Your Copy to Azure SQL fails to connect even though the credentials are correct. What is the single most likely cause, and the one-line fix?
After a copy, every column is text and the header is in the first row. Which one setting fixes it?
Do you need a Self-Hosted Integration Runtime to copy from Blob Storage into Azure SQL? Why or why not?
You re-ran your pipeline and now have duplicate rows in the table. Why, and what write behaviour avoids it?

Answers

Linked service (connection + credential), dataset (pointer at data, with format/schema), Copy activity (the verb moving data source → sink), and integration runtime (the compute). A pipeline groups activities; a trigger starts a run.
The SQL firewall is blocking the ADF IR. Enable “Allow Azure services” — az sql server firewall-rule create … --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0 (or a Private Endpoint + managed VNet IR in production).
First row as header is unchecked on the source dataset, so ADF reads the header as data and types everything as text. Check it and re-import the schema.
No. The default AutoResolveIntegrationRuntime handles cloud-to-cloud copies like Blob-to-Azure-SQL. You only need a Self-Hosted IR for on-premises/private-network stores the Azure IR cannot reach.
Write behaviour is insert/append (the default), so re-running adds the rows again. Use upsert (with keys) or truncate/overwrite before the copy for idempotent re-runs.

Glossary

Azure Data Factory (ADF) — Azure’s managed, serverless ETL/ELT and data-integration service for moving data between stores.
Linked service — a named connection and credential to a store (a storage account or SQL database); reused across datasets.
Dataset — a named pointer at data inside a linked service, carrying format and schema (a delimited file with a header, or a SQL table).
Copy activity — the activity that moves data source dataset → sink dataset, with optional column mapping and write behaviour.
Integration runtime (IR) — the compute that runs activities; the AutoResolveIntegrationRuntime is the Azure-managed, serverless default.
Self-Hosted Integration Runtime (SHIR) — an IR you install on a machine to reach on-premises or private-network stores.
DelimitedText — the dataset type for CSV/TSV flat files; carries delimiter, quote, encoding and header settings.
AzureSqlTable — the dataset type pointing at a single table in an Azure SQL Database.
DIU (Data Integration Unit) — a measure of the Azure IR’s power applied to a copy; you pay DIUs × time.
Write behaviour — how the Copy activity lands rows: insert/append, upsert (with keys), or overwrite.
Trigger — what starts a run: schedule, tumbling window, event (blob created), or manual/on-demand.
Managed identity — the factory’s own Azure AD identity, used to authenticate without storing a secret.
Fault tolerance — a Copy option to skip and log incompatible/duplicate rows instead of failing the activity.
“Allow Azure services” rule — the SQL firewall 0.0.0.0–0.0.0.0 rule that lets Azure services (including the ADF IR) connect.

Next steps

You can now build, run, secure and schedule a Blob-to-SQL Copy pipeline. Build outward:

Next: Your First Azure SQL Database: Create, Configure Firewall Rules and Connect Securely — go deeper on the sink: firewall, auth and secure connection.
Related: Azure Storage Account Fundamentals: Blobs, Files, Queues and Tables — the source store, containers, access tiers and SAS.
Related: Azure SQL Database Purchasing Models: DTU vs vCore and How to Pick Without Overpaying — right-size the target so a nightly load doesn’t overpay or throttle.
Related: Azure Key Vault: Secrets, Keys and Certificates Done Right — store any unavoidable connector secret safely and reference it from linked services.
Related: Azure Private Endpoint vs Service Endpoint: Secure PaaS Access — lock the copy’s network path down for production instead of the broad allow-Azure rule.

Build Your First Data Factory Pipeline: Copy Data from Blob Storage into Azure SQL

What problem this solves

Learning objectives

Prerequisites & where this fits

Core concepts

The four building blocks side by side

How a Copy activity actually runs

Choosing authentication for each linked service

The source and sink datasets in detail

Architecture at a glance

Real-world scenario

Advantages and disadvantages

Hands-on lab

Part 0 — Shared setup (storage, SQL, and the sample CSV)

Part A — Build it in the Azure portal (ADF Studio)

Part B — Build it with the `az` CLI

Part C — Build it as Bicep (infrastructure-as-code)

Validation checklist and teardown

Common mistakes & troubleshooting

Best practices

Security notes

Cost & sizing

Interview & exam questions

Quick check

Answers

Glossary

Next steps

Written by Vinod

Comments

Keep Reading

How an AVD Session Actually Connects: Broker, Gateway, and the Reverse-Connect Transport, Step by Step

Personal vs Pooled Host Pools: A Decision Framework for Picking the Right AVD Desktop Model and Sizing It

AzCopy Essentials: Reliable Copy, Sync and Resume for Large Data Transfers

Build Your First Data Factory Pipeline: Copy Data from Blob Storage into Azure SQL

What problem this solves

Learning objectives

Prerequisites & where this fits

Core concepts

The four building blocks side by side

How a Copy activity actually runs

Choosing authentication for each linked service

The source and sink datasets in detail

Architecture at a glance

Real-world scenario

Advantages and disadvantages

Hands-on lab

Part 0 — Shared setup (storage, SQL, and the sample CSV)

Part A — Build it in the Azure portal (ADF Studio)

Part B — Build it with the az CLI

Part C — Build it as Bicep (infrastructure-as-code)

Validation checklist and teardown

Common mistakes & troubleshooting

Best practices

Security notes

Cost & sizing

Interview & exam questions

Quick check

Answers

Glossary

Next steps

Written by Vinod

Comments

Keep Reading

How an AVD Session Actually Connects: Broker, Gateway, and the Reverse-Connect Transport, Step by Step

Personal vs Pooled Host Pools: A Decision Framework for Picking the Right AVD Desktop Model and Sizing It

AzCopy Essentials: Reliable Copy, Sync and Resume for Large Data Transfers

Part B — Build it with the `az` CLI