Azure Data

Build Your First Data Factory Pipeline: Copy Data from Blob Storage into Azure SQL

A CSV lands in Blob Storage every night — an export from a partner, a dump from an old system, a file someone drops in a folder — and somebody needs it inside a database table so a report can run against it. The hand-rolled answer is a script on a VM with a cron job, a connection string in a config file, and a pager when it silently stops. The managed answer is Azure Data Factory (ADF) — Azure’s cloud ETL/ELT (extract-transform-load) and data-integration service — and its single most important building block, the Copy activity, which moves data with no servers for you to run. This article builds exactly that: a pipeline that copies a CSV from Azure Blob Storage into a table in Azure SQL Database, the “hello world” of data engineering on Azure.

We build it three ways, so you understand what each piece is, not just where to click. First in the Azure portal (the ADF Studio visual canvas), because seeing the boxes connect makes the model concrete. Then with the az CLI to script it. Then as Bicep so the whole thing — factory, linked services, datasets, pipeline — is infrastructure-as-code you commit and redeploy. Along the way you meet the four nouns every ADF pipeline is made of: a linked service (a connection to a store), a dataset (a named pointer at data inside it), the Copy activity (the verb that moves bytes), and the integration runtime (the compute that does the copying). Get these four right once and every future pipeline is a variation on the theme.

By the end you will have a working pipeline you can trigger on demand or on a schedule, know how to confirm a run succeeded and read its rows-copied count, and be able to debug the four failures that bite every beginner: a firewall blocking the copy, a wrong-credential auth error, a CSV-to-table schema mismatch, and a malformed source file. Get the first copy right and the mapping data flows, parameterised pipelines and triggers all follow.

What problem this solves

Moving data between stores sounds trivial until you do it in production. The script-on-a-VM approach breaks in slow, expensive ways: the VM reboots mid-copy, the connection string sits in plain text, there is no retry on a transient blip, no record of what ran when, no alert when last night’s load silently produced zero rows, and somebody must keep the VM alive forever. Multiply that by a dozen partner feeds and you have a fragile cottage industry of cron jobs nobody fully understands.

Data Factory replaces all of it with a managed, serverless service. You declare what to copy and where; Azure runs it on compute it manages (the integration runtime), retries transient failures, records every run with row counts and durations, alerts on failure, and keeps credentials in managed identity or Key Vault. You pay per activity run and per data-movement-hour, not for an idle VM. The same Copy activity spans 90+ connectors — Blob, ADLS Gen2, SQL, Cosmos DB, Oracle, S3, Snowflake, SFTP, REST — so the pattern generalises to almost any source-to-sink.

Who hits this: every data engineer, every analytics team, every “we just need this file in the warehouse” request. CSV-to-SQL is the most common starting point because it is the most common real need — a flat file becoming rows in a relational table before anything downstream can use it. Learn this one shape and you have the skeleton for ingestion across the platform.

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You need an Azure subscription with rights to create resources (Contributor on a resource group), the az CLI or Cloud Shell, and a SQL client for verification — sqlcmd, Azure Data Studio, or the portal’s Query editor. Basic comfort with a CSV (rows, a header, comma-separated values) and a relational table (columns with types) is assumed. You do not need to know ADF already — that is the point.

This sits at the start of the Data engineering track. Upstream are the two stores it connects: Azure Storage Account Fundamentals: Blobs, Files, Queues and Tables for the source, and Your First Azure SQL Database: Create, Configure Firewall Rules and Connect Securely for the sink — read those first if either is unfamiliar, because half the beginner failures here are really storage or SQL firewall problems in an ADF costume. Downstream, this Copy pipeline is the foundation for mapping data flows, parameterised pipelines, and event/schedule triggers. For sizing the target later, Azure SQL Database Purchasing Models: DTU vs vCore and How to Pick Without Overpaying covers it. Everything you build — factory, linked services, datasets, pipeline — lives inside the Data Factory in your resource group; the two stores it connects (the Blob container and the SQL table) live in their own services and are usually owned by the source-system and DBA teams respectively.

Core concepts

Four nouns and one verb make up every ADF pipeline. Internalise these and the rest is detail.

A linked service is a connection string with a name. It tells ADF how to reach a store — the storage account or SQL server, plus the credential — and you reuse it across many datasets. The credential can be a managed identity (the factory’s own Azure AD identity — the key-free default), an account key or SAS (Blob), or SQL authentication (username/password, SQL). This is the single most common place auth and firewall problems surface.

A dataset is a named pointer at data inside a linked service. Where the linked service says “this storage account,” the dataset says “this container, this file, comma-delimited with a header row”; for SQL it says “this table.” It references exactly one linked service. You create two: a source (the CSV in Blob) and a sink (the table in SQL).

The Copy activity is the verb that moves bytes from a source dataset to a sink dataset. It reads the source, optionally maps columns, writes the sink, and is where you set column mapping, write behaviour (append vs upsert), batching and fault tolerance. It lives inside a pipeline (an ordered collection of activities — here, just the one Copy). The vast majority of ingestion is “a Copy activity from A to B.”

The integration runtime (IR) is the compute that performs the copy. For cloud-to-cloud you rarely create one — the AutoResolveIntegrationRuntime (Azure-managed, serverless) handles Blob-to-SQL automatically. You only need a Self-Hosted Integration Runtime (SHIR) when a source sits on-premises or in a private network ADF cannot reach over the internet. The IR burns the data-movement compute you pay for.

A trigger is what starts a pipeline run. Run on demand (Debug or a manual trigger), on a schedule (nightly at 02:00), on a tumbling window (fixed contiguous slices — good for backfills), or on an event (a blob is created). You run yours manually first, then add a schedule trigger.

The four building blocks side by side

Pin the vocabulary down as a grid before the build:

Building block Definition References Example Why it matters
Linked service A connection + credential to a store (the root) ls_blob_source, ls_sql_sink Where auth/firewall errors live
Dataset A named pointer at data in a store One linked service ds_csv_orders, ds_sql_orders Where schema/format settings live
Copy activity Moves data source → sink Two datasets CopyOrdersCsvToSql The verb; mapping lives here
Pipeline Ordered set of activities The activities pl_copy_orders The unit you run and schedule
Integration runtime Compute that runs activities (used by activities) AutoResolveIntegrationRuntime What you pay for; default is fine
Trigger What starts a run The pipeline manual, then trg_daily_2am On-demand vs scheduled vs event

How a Copy activity actually runs

Knowing the order of a run lets you place a failure on the right step. The trigger creates a pipeline run; the IR connects and authenticates to the source (reading the CSV) and the sink (writing rows), then reports rows read/copied, volume and duration. The takeaway: every failure is at the source (cannot reach Blob, file missing, malformed row) or the sink (firewall, auth, type mismatch) — the error names which side — and the IR needs network and credential access to both stores, so the classic mistake is fixing one side and forgetting the other. The phases, what fails in each, and where you confirm:

Phase What happens Can fail because Where you see it
Trigger Run created with a run ID Trigger disabled / bad schedule Monitor → Trigger runs
Source connect IR authenticates to Blob Wrong key/SAS; network rule blocks IR Activity error: source side
Source read CSV parsed row by row File missing, bad delimiter, malformed row Activity error: DelimitedText
Sink connect IR authenticates to SQL Firewall blocks IR; bad credential SqlFailedToConnect
Sink write Rows inserted in batches Type mismatch, table missing, not-null SqlOperationFailed
Complete Metrics reported (success) Monitor → activity output

Choosing authentication for each linked service

Auth is where most first pipelines stall, so decide it deliberately. For Blob: an account key, a SAS token, or managed identity. For SQL: SQL authentication or managed identity mapped to a database user. The secure default is managed identity on both sides — no secrets — needing two grants (Storage Blob Data Reader on the account, a contained DB user in SQL). Key/SQL-auth is faster for a first lab; this article shows that path, and the managed-identity grants in Security notes. The options and when to pick which:

Store Auth option Secret? Effort When to use
Blob Account key Yes Lowest Quick labs
Blob SAS token Yes (expires) Low Scoped, time-bound access
Blob Managed identity No Medium Production default
SQL SQL authentication Yes Lowest Quick labs; legacy apps
SQL Managed identity No Medium Production default
Either Key Vault-backed In vault Medium Centralised secret governance

The single rule: never paste a key-bearing connection string into pipeline JSON that lands in git. Use managed identity or reference Key Vault. The lab keeps the key inside the linked service (ADF stores it encrypted); the Bicep references the storage key via listKeys() at deploy time rather than hardcoding it.

The source and sink datasets in detail

The two datasets carry the format and schema decisions, and two settings cause the most confusing beginner failures.

Source (CSV) dataset (type DelimitedText): the column delimiter (comma default), the quote character ("), the encoding (UTF-8 default; Windows exports are often UTF-8-BOM or Latin-1), and crucially First row as header — if the first line is column names this must be checked, or ADF treats the header as a data row and types every column as text. Point a dataset at a folder and ADF copies every file in it.

Sink (SQL) dataset (type AzureSqlTable, one table): the table name (schema-qualified, dbo.Orders), whether to auto-create the table (convenient, but it infers nvarchar(max), usually unwanted), and the column mapping — matching names auto-map, differing names you map by hand. The settings most likely to cause a silent wrong result or a hard error:

Setting Lives on Default When to change Gotcha if wrong
First row as header Source off CSV has a header line Header copied as data; columns all text
Column delimiter Source , TSV / pipe / semicolon files Whole row in one column
Quote character Source " Non-standard quoting Fields with commas split wrong
Encoding Source UTF-8 Latin-1 / UTF-8-BOM Garbled accents; BOM in first column
Table name Sink (you set it) Always set Wrong/missing → SqlOperationFailed
Auto create table Copy sink off Quick load of a new table Types inferred nvarchar(max)
Write behaviour Copy sink Insert Upsert / overwrite needed Duplicate rows on re-run

Architecture at a glance

The pipeline is a short left-to-right path. On the left, the source zone: a Blob Storage container holding orders.csv, reached through a linked service and read through a DelimitedText dataset. In the middle, the Data Factory: a pipeline with a single Copy activity on the AutoResolveIntegrationRuntime, the serverless compute that streams rows source to sink. On the right, the sink zone: an Azure SQL Database with the dbo.Orders table, reached through its own linked service and written through an AzureSqlTable dataset.

Follow the flow: a trigger starts a run; the Copy activity, on the Azure IR, authenticates to Blob and streams the CSV in (parsing it with the delimited-text settings); it simultaneously authenticates to SQL — where the SQL firewall must allow Azure services or the IR’s IP, the number-one beginner failure — and writes the rows in batches, finishing with rows-read and rows-copied metrics. The IR holds two independent connections, one per store, each needing its own network path and credential. The numbered badges mark the four hops where beginners get stuck — Blob auth, the SQL firewall, the SQL insert/type mapping, and the header/column mapping — and the legend says how to confirm and fix each.

Azure Data Factory Copy pipeline architecture moving a CSV from Blob Storage into Azure SQL Database: a source zone with a Blob container holding orders.csv reached via a linked service and a DelimitedText dataset; a central Data Factory zone with a pipeline containing one Copy activity running on the AutoResolveIntegrationRuntime; a sink zone with an Azure SQL Database dbo.Orders table reached via its linked service and an AzureSqlTable dataset; a trigger starts the run, the IR streams rows source to sink, and numbered badges mark the four beginner failure points — Blob auth, SQL firewall, SQL auth and schema mapping — with rows-copied metrics reported on completion

Real-world scenario

Brightleaf Logistics is a freight broker in Pune with a four-person analytics team. Every night their largest carrier drops a shipments export — shipments_YYYYMMDD.csv, around 180,000 rows / 40 MB — into an SFTP folder, and a finance dashboard on Azure SQL needs those rows by 06:00 for the morning margin report. For two years this ran as a Python script on a B2s VM: paramiko to pull, pandas to read, pyodbc to insert. It worked until it didn’t.

The failures were the classic VM-script failures. The VM rebooted for patching at 02:30 one Tuesday and the load simply did not run; nobody noticed until the 09:00 standup when the margin numbers were a day stale. A month later the carrier added a column and the pandas-to-SQL insert started throwing on a column-count mismatch — no error handling, so it crash-looped and filled the disk with stack traces. And the database password sat in plain text in /etc/shipments/config.ini, flagged in a security review. The team spent more time nursing the script than analysing freight.

They rebuilt it in Data Factory in an afternoon. No Self-Hosted IR was needed — the carrier agreed to drop the file into a Blob container via SAS instead of SFTP (one connector change). The pipeline became a single Copy activity from a DelimitedText source to an AzureSqlTable sink, authenticated with the factory’s managed identity on both sides — no password anywhere. A schedule trigger fired it at 02:00, fault tolerance skipped and logged incompatible rows rather than failing the load, and an alert on the pipeline-failed metric paged the on-call. The VM was decommissioned.

The payoff was immediate. The patch-reboot failure vanished — there was no VM to reboot. When the carrier added another column three months later, the Copy activity’s explicit mapping ignored it cleanly instead of crash-looping, and the engineer added it in five minutes. Cost dropped from a continuously-running B2s VM (~₹3,200/month) to ADF’s per-run pricing — roughly 30 activity runs/month plus ~0.2 data-movement hours per run, near ₹350/month — an order of magnitude cheaper, with retries, logging, alerting and no secrets. The lesson on the wall: “The copy was never the hard part. The VM, the password and the silence on failure were — and ADF deletes all three.”

Advantages and disadvantages

The managed Copy activity is the right default for ingestion, but not free of trade-offs:

Advantages Disadvantages
Serverless — no VM to patch, scale or babysit Learning curve (linked service vs dataset vs IR confuses beginners)
90+ connectors; one Copy pattern moves almost anything Per-run + per-DIU-hour pricing can surprise on high-frequency tiny copies
Built-in retries, logging, row counts and failure alerting Row-level transformation needs mapping data flows (Spark, more cost)
Managed identity / Key Vault keep secrets out of config On-prem / private sources need a Self-Hosted IR you maintain
Runs are recorded and auditable (run ID, duration, rows) Debugging means reading ADF error codes, not your own logs
IaC-friendly (ARM/Bicep/Terraform); fits CI/CD The generated JSON is verbose in a diff if hand-edited

The model wins decisively for scheduled, recurring movement between supported stores where you want operational simplicity and auditability over a hand-built script. It is a weaker fit for heavy in-flight transformation (reach for mapping data flows or Databricks), extreme cadence (thousands of tiny copies a minute — batch them), or a source unreachable from Azure without a Self-Hosted IR. For the CSV-to-SQL shape here, ADF is almost always the right answer.

Hands-on lab

This is the centerpiece. You build the full pipeline three ways — portal, az CLI, and Bicep — copying a CSV from Blob into Azure SQL, validating, and tearing it down. Everything here costs a few rupees. Do the portal walkthrough first to see the model, then the CLI/Bicep to script it. Run CLI steps in Cloud Shell (Bash).

Part 0 — Shared setup (storage, SQL, and the sample CSV)

You need a storage account with a CSV and a SQL database with a target table. Set up both first; the portal/CLI/Bicep parts reuse them.

Step 1 — Variables and resource group.

RG=rg-adf-lab
LOC=centralindia
SA=stadflab$RANDOM                 # storage account, globally unique, lowercase
SQLSRV=sql-adf-lab-$RANDOM         # SQL logical server, globally unique
SQLDB=db-orders
SQLADMIN=adflabadmin
SQLPASS='P@ssw0rd-Change-Me-2026!' # use a strong unique password
ADF=adf-lab-$RANDOM               # data factory, globally unique
az group create -n $RG -l $LOC -o table

Step 2 — Create the storage account and container.

az storage account create -n $SA -g $RG -l $LOC --sku Standard_LRS --kind StorageV2 -o table
KEY=$(az storage account keys list -n $SA -g $RG --query "[0].value" -o tsv)
az storage container create -n incoming --account-name $SA --account-key "$KEY" -o table

Expected: the account row with sku.name = Standard_LRS, then "created": true for the container.

Step 3 — Create the sample CSV and upload it as orders.csv:

cat > orders.csv <<'CSV'
OrderId,Customer,Amount,OrderDate
1001,Acme Corp,1499.50,2026-06-01
1002,Globex,250.00,2026-06-02
1003,Initech,9999.99,2026-06-03
1004,Umbrella,42.00,2026-06-04
CSV
az storage blob upload --account-name $SA --account-key "$KEY" \
  --container-name incoming --name orders.csv --file orders.csv --overwrite -o table

Expected: an upload result with the blob name and a lastModified timestamp (four data rows plus a header).

Step 4 — Create the Azure SQL logical server and database.

az sql server create -n $SQLSRV -g $RG -l $LOC \
  --admin-user $SQLADMIN --admin-password "$SQLPASS" -o table
az sql db create -s $SQLSRV -g $RG -n $SQLDB \
  --service-objective S0 --backup-storage-redundancy Local -o table

Expected: the server row, then the database row with currentServiceObjectiveName = S0 (a cheap DTU tier, fine for a lab).

Step 5 — Open the SQL firewall for Azure services AND your client. The most important step — without it the IR cannot connect (the number-one beginner error). The rule 0.0.0.00.0.0.0 means “allow Azure services” (including the ADF IR):

# Allow Azure services (the ADF integration runtime) to reach the server
az sql server firewall-rule create -s $SQLSRV -g $RG \
  -n AllowAzureServices --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0 -o table

# Allow YOUR current public IP so you can run sqlcmd to verify
MYIP=$(curl -s https://api.ipify.org)
az sql server firewall-rule create -s $SQLSRV -g $RG \
  -n AllowMyClient --start-ip-address $MYIP --end-ip-address $MYIP -o table

Expected: two firewall-rule rows. The AllowAzureServices rule with 0.0.0.0/0.0.0.0 is the one that lets the Copy activity connect.

Step 6 — Create the target table via the portal Query editor (log in with $SQLADMIN/$SQLPASS) or sqlcmd:

CREATE TABLE dbo.Orders (
  OrderId     INT          NOT NULL PRIMARY KEY,
  Customer    NVARCHAR(100) NOT NULL,
  Amount      DECIMAL(10,2) NOT NULL,
  OrderDate   DATE          NOT NULL
);

Expected: “Commands completed successfully.” Column names match the CSV header exactly, so ADF auto-maps.

Part A — Build it in the Azure portal (ADF Studio)

Step A1 — Create the Data Factory. Portal → Create a resource → search Data Factory → Create. Pick rg-adf-lab, a globally-unique name, Central India, V2. Leave Git config disabled for the lab. Review + create, then Open Azure Data Factory Studio.

Step A2 — Create the Blob source linked service. In Studio: ManageLinked services+ NewAzure Blob Storage → Continue. Name it ls_blob_source. Authentication Account key, select your subscription and the stadflab… account. Test connectionConnection successfulCreate.

Step A3 — Create the SQL sink linked service. + NewAzure SQL Database → Continue. Name ls_sql_sink, server sql-adf-lab-…, database db-orders. Authentication SQL authentication, user adflabadmin, password as set. Test connectionConnection successfulCreate. (If it fails, the Step 5 firewall rule is missing — see troubleshooting.)

Step A4 — Create the source dataset (the CSV). AuthorDatasets+New datasetAzure Blob StorageDelimitedText → Continue. Name ds_csv_orders, linked service ls_blob_source, File path container incoming / file orders.csv. Check “First row as header.” Import schema From connection/store. OK.

Step A5 — Create the sink dataset (the table). +New datasetAzure SQL Database → Continue. Name ds_sql_orders, linked service ls_sql_sink, table dbo.Orders, OK.

Step A6 — Build the pipeline with a Copy activity. Pipelines+Pipeline, name pl_copy_orders. From ActivitiesMove and transform → drag Copy data onto the canvas; name it CopyOrdersCsvToSql.

Step A7 — Validate, Debug, and read the result. Click Validate (no errors expected), then Debug to run. Watch the Output tab: the activity goes Queued → In progress → Succeeded. Click the activity’s details (glasses) icon and read Rows read: 4, Rows written: 4, Data read, Throughput, Duration.

Step A8 — Publish and add a schedule trigger. Click Publish all to save everything to the factory. Then Add triggerNew/Edit+ New → type Schedule, recurrence Every 1 Day at 02:00, your time zone → OK → Publish. The pipeline now runs nightly. (For the lab, leave it disabled or delete it in teardown.)

You built, in order: two linked services (A2/A3 — the firewall step gated A3), two datasets (A4/A5), a pipeline + Copy activity (A6), a run (A7 — 4 read, 4 written), and a schedule trigger (A8).

Part B — Build it with the az CLI

The CLI defines each object from a small JSON file with az datafactory, reusing Part 0’s storage, SQL and table. (Install the extension once: az extension add --name datafactory.)

Step B1 — Create the factory.

az extension add --name datafactory 2>/dev/null
az datafactory create --resource-group $RG --factory-name $ADF --location $LOC -o table

Expected: the factory row with provisioningState = Succeeded.

Step B2 — Create the Blob linked service (JSON references the storage key):

cat > ls_blob.json <<JSON
{ "type": "AzureBlobStorage",
  "typeProperties": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=$SA;AccountKey=$KEY;EndpointSuffix=core.windows.net"
  } }
JSON
az datafactory linked-service create --resource-group $RG --factory-name $ADF \
  --linked-service-name ls_blob_source --properties @ls_blob.json -o table

Step B3 — Create the SQL linked service.

cat > ls_sql.json <<JSON
{ "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "Server=tcp:$SQLSRV.database.windows.net,1433;Database=$SQLDB;User ID=$SQLADMIN;Password=$SQLPASS;Encrypt=true;TrustServerCertificate=false;Connection Timeout=30;"
  } }
JSON
az datafactory linked-service create --resource-group $RG --factory-name $ADF \
  --linked-service-name ls_sql_sink --properties @ls_sql.json -o table

Step B4 — Create the source and sink datasets.

cat > ds_csv.json <<'JSON'
{ "type": "DelimitedText",
  "linkedServiceName": { "referenceName": "ls_blob_source", "type": "LinkedServiceReference" },
  "typeProperties": {
    "location": { "type": "AzureBlobStorageLocation", "container": "incoming", "fileName": "orders.csv" },
    "columnDelimiter": ",", "firstRowAsHeader": true, "quoteChar": "\"" } }
JSON
az datafactory dataset create --resource-group $RG --factory-name $ADF \
  --dataset-name ds_csv_orders --properties @ds_csv.json -o table

cat > ds_sql.json <<'JSON'
{ "type": "AzureSqlTable",
  "linkedServiceName": { "referenceName": "ls_sql_sink", "type": "LinkedServiceReference" },
  "schema": [], "typeProperties": { "schema": "dbo", "table": "Orders" } }
JSON
az datafactory dataset create --resource-group $RG --factory-name $ADF \
  --dataset-name ds_sql_orders --properties @ds_sql.json -o table

Step B5 — Create the pipeline with the Copy activity.

cat > pl_copy.json <<'JSON'
{ "activities": [ {
    "name": "CopyOrdersCsvToSql", "type": "Copy",
    "inputs":  [ { "referenceName": "ds_csv_orders", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "ds_sql_orders", "type": "DatasetReference" } ],
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "sink":   { "type": "AzureSqlSink", "writeBehavior": "insert" },
      "enableStaging": false } } ] }
JSON
az datafactory pipeline create --resource-group $RG --factory-name $ADF \
  --name pl_copy_orders --pipeline @pl_copy.json -o table

Step B6 — Run it on demand and read the result.

RUN_ID=$(az datafactory pipeline create-run --resource-group $RG --factory-name $ADF \
  --name pl_copy_orders --query runId -o tsv)
echo "Run: $RUN_ID"

# Poll the run status until it leaves InProgress/Queued
az datafactory pipeline-run show --resource-group $RG --factory-name $ADF \
  --run-id $RUN_ID --query "{status:status, start:runStart, durationMs:durationInMs}" -o table

Expected: status moves to Succeeded (re-run the show command until it does). For rows copied, query the activity run:

# Activity-level output includes rowsRead / rowsCopied
az datafactory activity-run query-by-pipeline-run --resource-group $RG --factory-name $ADF \
  --run-id $RUN_ID \
  --last-updated-after $(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --last-updated-before $(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v+1H +%Y-%m-%dT%H:%M:%SZ) \
  --query "value[0].output.{read:rowsRead, copied:rowsCopied, dataRead:dataRead}" -o table

Expected: read = 4, copied = 4; verify with SELECT COUNT(*) FROM dbo.Orders; → 4. Same six objects as the portal, now scriptable: create, linked-service create ×2, dataset create ×2, pipeline create, then pipeline create-run.

Part C — Build it as Bicep (infrastructure-as-code)

The same objects declared declaratively, so the whole factory is reviewable and redeployable. This assumes the storage account, SQL server/database and table exist (from Part 0); it pulls the storage key at deploy time with listKeys() and takes the SQL password as a secure parameter.

@description('Existing storage account name (source).')
param storageAccountName string
@description('Existing SQL logical server name (sink).')
param sqlServerName string
param sqlDatabaseName string = 'db-orders'
param sqlAdminUser string = 'adflabadmin'
@secure()
param sqlAdminPassword string
param location string = resourceGroup().location
param factoryName string = 'adf-lab-${uniqueString(resourceGroup().id)}'

// Reference the existing storage account to read its key at deploy time
resource sa 'Microsoft.Storage/storageAccounts@2023-05-01' existing = {
  name: storageAccountName
}

resource adf 'Microsoft.DataFactory/factories@2018-06-01' = {
  name: factoryName
  location: location
  identity: { type: 'SystemAssigned' } // enable for later managed-identity hardening
}

resource lsBlob 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: adf
  name: 'ls_blob_source'
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};AccountKey=${sa.listKeys().keys[0].value};EndpointSuffix=${environment().suffixes.storage}'
    }
  }
}

resource lsSql 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: adf
  name: 'ls_sql_sink'
  properties: {
    type: 'AzureSqlDatabase'
    typeProperties: {
      connectionString: 'Server=tcp:${sqlServerName}${environment().suffixes.sqlServerHostname},1433;Database=${sqlDatabaseName};User ID=${sqlAdminUser};Password=${sqlAdminPassword};Encrypt=true;TrustServerCertificate=false;Connection Timeout=30;'
    }
  }
}

resource dsCsv 'Microsoft.DataFactory/factories/datasets@2018-06-01' = {
  parent: adf
  name: 'ds_csv_orders'
  properties: {
    type: 'DelimitedText'
    linkedServiceName: { referenceName: lsBlob.name, type: 'LinkedServiceReference' }
    typeProperties: {
      location: { type: 'AzureBlobStorageLocation', container: 'incoming', fileName: 'orders.csv' }
      columnDelimiter: ','
      firstRowAsHeader: true
      quoteChar: '"'
    }
  }
}

resource dsSql 'Microsoft.DataFactory/factories/datasets@2018-06-01' = {
  parent: adf
  name: 'ds_sql_orders'
  properties: {
    type: 'AzureSqlTable'
    linkedServiceName: { referenceName: lsSql.name, type: 'LinkedServiceReference' }
    typeProperties: { schema: 'dbo', table: 'Orders' }
  }
}

resource pl 'Microsoft.DataFactory/factories/pipelines@2018-06-01' = {
  parent: adf
  name: 'pl_copy_orders'
  properties: {
    activities: [
      {
        name: 'CopyOrdersCsvToSql'
        type: 'Copy'
        inputs:  [ { referenceName: dsCsv.name, type: 'DatasetReference' } ]
        outputs: [ { referenceName: dsSql.name, type: 'DatasetReference' } ]
        typeProperties: {
          source: { type: 'DelimitedTextSource' }
          sink:   { type: 'AzureSqlSink', writeBehavior: 'insert' }
          enableStaging: false
        }
      }
    ]
  }
}

output factory string = adf.name
output principalId string = adf.identity.principalId // grant this Storage Blob Data Reader to go key-free

Deploy, then trigger a run:

az deployment group create -g $RG --template-file adf.bicep \
  --parameters storageAccountName=$SA sqlServerName=$SQLSRV sqlAdminPassword="$SQLPASS" -o table

# Then run it exactly as in Part B, step B6 (create-run + poll + read rows)

Expected: a successful deployment, then a run that copies 4 rows. The principalId output is the factory’s managed identity — grant it Storage Blob Data Reader and create a contained SQL user for it to drop the keys entirely (see Security notes).

Validation checklist and teardown

Validate the whole thing end to end:

# 1) The run reported success and copied 4 rows (Part B/C step 6 output)
# 2) The rows are actually in SQL — run in Query editor or sqlcmd:
#    SELECT COUNT(*) FROM dbo.Orders;   -> 4
#    SELECT * FROM dbo.Orders ORDER BY OrderId;  -> the four orders
What to check How Expected
Run succeeded Monitor / pipeline-run show status = Succeeded
Rows copied activity output rowsCopied 4
Rows in SQL SELECT COUNT(*) FROM dbo.Orders 4
Types correct SELECT * FROM dbo.Orders Amount decimal, OrderDate a date
Trigger created Monitor → Trigger runs the schedule trigger listed

Teardown — stop all billing. Delete the whole resource group when done:

az group delete -n $RG --yes --no-wait

This removes the factory, storage, SQL server/database and any trigger in one shot. ADF has no idle charge, but the SQL S0 database and stored CSV bill while they exist, so delete the group. Helper files (*.json, adf.bicep, orders.csv) you rm separately.

Common mistakes & troubleshooting

These four (plus a few) account for nearly every failed first pipeline. Scan the table, then read the detail for your row — each gives the exact error string and the precise fix.

# Symptom / error Root cause Confirm Fix
1 SqlFailedToConnect / network error to SQL SQL firewall blocks the ADF IR SQL server → Networking; no 0.0.0.0 allow-Azure rule az sql server firewall-rule create … --start-ip 0.0.0.0 --end-ip 0.0.0.0
2 Login failed for user / Blob 403 Wrong credential in the linked service Linked service → Test connection names the side Re-enter password / refresh key; or managed identity
3 … cannot be converted to type Decimal / SqlOperationFailed Type mismatch (CSV text doesn’t fit the column) Activity error names the column + target type Fix column type or value; map explicitly on Mapping tab
4 All columns text / header is a data row First row as header unchecked Source dataset → Connection → checkbox off Check it; re-import schema; re-run
5 Column count mismatch / extra-or-missing cols CSV columns ≠ table columns after a source change Compare CSV header to table; Mapping tab Add/remove the column; enable explicit mapping
6 BlobNotFound / blob does not exist Wrong container/file path, or not uploaded Dataset path vs az storage blob list Fix container/fileName; upload the file
7 Garbled accents / ? or BOM in first column Wrong encoding on the source dataset Check the CSV’s encoding Set dataset encoding (UTF-8 / UTF-8-BOM / Latin-1)
8 Duplicate rows after re-running Write behaviour is insert (append) Row count doubled in SQL Upsert (set keys) or truncate via pre-copy script
9 RequestSizeExceeded / throttling on a big file Batch size / DTU mismatch Activity duration; SQL DTU pinned Tune writeBatchSize; raise SQL tier; enable staging

Two of these deserve the reasoning the table can’t carry:

The firewall block (SqlFailedToConnect, number-one failure). The IR connects from Azure’s IP space, so unless “Allow Azure services and resources to access this server” is on (the 0.0.0.00.0.0.0 rule), every Copy to SQL fails at sink-connect. Confirm in SQL server → Networking; fix with az sql server firewall-rule create … --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0. For production prefer a Private Endpoint + managed VNet IR over opening to all Azure services — see Azure Private Endpoint vs Service Endpoint: Secure PaaS Access.

Header copied as data (everything is text). First row as header is unchecked on the source dataset, so ADF reads the column-name line as data and types every column as string. Confirm on the dataset’s Connection tab (the preview shows OrderId/Customer/... as a data row); check the box, re-import the schema, re-run. This one-checkbox bug accounts for a surprising share of “why is my data all wrong” tickets.

Best practices

Security notes

The secure target is zero secrets in the pipeline and least-privilege access to each store. Three moves.

Use the factory’s managed identity for both stores. Enable a system-assigned identity (the Bicep does this), then grant the minimum role: Storage Blob Data Reader on the account (read-only), and a contained database user in SQL with just db_datawriter. The grants:

# Grant the factory's identity read on the storage account (least privilege)
ADF_PRINCIPAL=$(az datafactory show -g $RG -n $ADF --query identity.principalId -o tsv)
SA_ID=$(az storage account show -n $SA -g $RG --query id -o tsv)
az role assignment create --assignee $ADF_PRINCIPAL \
  --role "Storage Blob Data Reader" --scope $SA_ID
-- In the target DB: create a user for the factory's identity and grant only write
CREATE USER [adf-lab-xxxx] FROM EXTERNAL PROVIDER;   -- use the factory name
ALTER ROLE db_datawriter ADD MEMBER [adf-lab-xxxx];

Prefer Key Vault for any credential you cannot avoid. If a connector only supports a key/secret, store it in Azure Key Vault and reference it from the linked service, so the secret lives in the vault (audited, rotatable), never in the JSON. See Azure Key Vault: Secrets, Keys and Certificates Done Right.

Lock the network down for production. The lab’s 0.0.0.0 allow-Azure SQL rule is broad — it admits any Azure tenant’s services. For production, put SQL and storage behind Private Endpoints and run the copy on a managed VNet IR, then remove the rule. Azure Private Link and Private DNS: Keeping PaaS Off the Public Internet covers the pattern. The posture, lab to production:

Control Lab default Production target
Blob auth Account key Managed identity + Storage Blob Data Reader
SQL auth SQL username/password Managed identity + contained DB user (least role)
SQL network 0.0.0.0 allow-Azure rule Private Endpoint + managed VNet IR; rule removed
Secret storage Encrypted in the factory Key Vault-referenced where unavoidable
Audit Run history Run history + Monitor alerts + diagnostic logs

Cost & sizing

ADF Copy pricing has two components, both small here: per activity run (pennies per 1,000) and per data-integration unit hour (DIU-hour) of movement (IR compute time). A DIU measures the IR’s CPU/memory/network; the Azure IR auto-allocates them (~4 to start) and you pay time × DIUs. A small CSV copy costs a fraction of a rupee; the nightly 180k-row scenario job ran near ₹350/month. There is no idle charge — a factory that runs nothing costs nothing.

The two stores cost more than the pipeline. Azure SQL at S0 (10 DTU) is ~₹1,200–1,300/month if left running; scale to Basic (~₹350) for a lab, or delete. Blob for a few small CSVs is a few rupees. The dominant lab cost is the SQL database — which is why teardown deletes the resource group. The drivers and how to control each:

Cost driver What you pay for Rough INR / month How to control it
ADF activity runs Per pipeline/activity run a few ₹ Batch files per run; avoid tiny frequent runs
Data movement (DIU-hours) IR time × DIUs during copy ₹50–400 Co-locate regions; right DIU
Azure SQL sink The S0/Basic database ~₹350 (Basic) – ₹1,300 (S0) Scale down or delete when idle
Blob storage CSV bytes + transactions a few ₹ Lifecycle-tier or delete old files
Self-Hosted IR (if used) The VM you run it on VM cost (not needed) Only for private/on-prem sources

For the SQL target, a nightly batch into a small reporting table is comfortable on Basic/S0; heavier or concurrent loads want more DTUs or a vCore tier — Azure SQL Database Purchasing Models: DTU vs vCore and How to Pick Without Overpaying covers choosing without overspending.

Interview & exam questions

1. The four core building blocks of an ADF pipeline? A linked service (connection + credential to a store), a dataset (pointer at data inside a linked service, carrying format/schema), the Copy activity (the verb moving data source → sink), and the integration runtime (compute that runs activities). A pipeline groups activities; a trigger starts a run.

2. Difference between a linked service and a dataset? A linked service says how to reach a store; a dataset says which data and in what shape (this file, comma-delimited with a header; or this table). A dataset references exactly one linked service, and many can share one.

3. When do you need a Self-Hosted IR instead of the default Azure IR? When a source/sink is on-premises or in a private network the Azure-managed IR cannot reach over the internet. The default AutoResolveIntegrationRuntime handles cloud-to-cloud copies (Blob-to-Azure-SQL); a SHIR is one you install on a machine with line-of-sight to the private store.

4. A Copy to Azure SQL fails with a connection error. Most likely cause for a first pipeline? The SQL firewall is not allowing the ADF IR, which connects from Azure’s IP space. Fix: enable “Allow Azure services” (the 0.0.0.00.0.0.0 rule), or use a Private Endpoint with a managed VNet IR in production.

5. Your data has the header in the first data row and every column typed as text. What is wrong? First row as header is unchecked on the source dataset, so ADF reads the header as data and types all columns as strings. Check it, re-import the schema, re-run.

6. What does write behaviour control, and why does re-running duplicate rows? It determines how rows land — insert/append (the default) just adds rows, so re-running appends them again. Use upsert (with keys) or truncate/overwrite for idempotent re-runs.

7. How is ADF Copy priced, and does an idle factory cost money? Per activity run and per DIU-hour of data-movement compute (IR time × DIUs). An idle factory has no standing charge — you pay only when pipelines run.

8. How do you avoid storing any secret in the pipeline? Use the factory’s managed identity for both linked services — Storage Blob Data Reader on the account and a least-privilege contained DB user in SQL. For connectors needing a key, store it in Key Vault and reference it rather than embedding it.

9. A nightly load of a large CSV throttles the SQL database. What can you tune? Lower the write batch size, temporarily raise the SQL tier for the load window, enable staging (PolyBase/COPY for bulk where supported), and keep the IR in the same region as the database. The mismatch is copy throughput vs the database’s DTU/vCore headroom.

10. Which trigger types does ADF support and when do you use each? Schedule (fixed times — nightly batches), tumbling window (contiguous non-overlapping slices — backfills/dependencies), event (blob created/deleted — file-arrival ingestion), and manual/on-demand (Debug or a manual trigger).

11. Difference between Debug and triggering a published pipeline? Debug runs the in-progress, unpublished version from the canvas for quick iteration (with activity breakpoints); triggering runs the published pipeline — the version scheduled/automated runs use. Debug to develop, publish to make it real, then trigger.

These map to DP-203 (Azure Data Engineer Associate)ingest and transform data, orchestration, and integration runtimes — and the data-movement portions of DP-900 (Azure Data Fundamentals). A compact cert mapping for revision:

Question theme Primary cert Objective area
Building blocks, linked service vs dataset DP-900 / DP-203 Core data integration concepts
Integration runtimes (Azure vs SHIR) DP-203 Design & implement data processing
Copy activity, write behaviour, mapping DP-203 Ingest and transform data
Triggers (schedule/tumbling/event) DP-203 Orchestrate data processing
Security (managed identity, Key Vault, firewall) DP-203 / AZ-500 Secure data; manage identities
Cost (DIU, activity runs, no idle charge) DP-203 Optimize and manage data solutions

Quick check

  1. Name the four core building blocks of an ADF pipeline and what each one is for.
  2. Your Copy to Azure SQL fails to connect even though the credentials are correct. What is the single most likely cause, and the one-line fix?
  3. After a copy, every column is text and the header is in the first row. Which one setting fixes it?
  4. Do you need a Self-Hosted Integration Runtime to copy from Blob Storage into Azure SQL? Why or why not?
  5. You re-ran your pipeline and now have duplicate rows in the table. Why, and what write behaviour avoids it?

Answers

  1. Linked service (connection + credential), dataset (pointer at data, with format/schema), Copy activity (the verb moving data source → sink), and integration runtime (the compute). A pipeline groups activities; a trigger starts a run.
  2. The SQL firewall is blocking the ADF IR. Enable “Allow Azure services” — az sql server firewall-rule create … --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0 (or a Private Endpoint + managed VNet IR in production).
  3. First row as header is unchecked on the source dataset, so ADF reads the header as data and types everything as text. Check it and re-import the schema.
  4. No. The default AutoResolveIntegrationRuntime handles cloud-to-cloud copies like Blob-to-Azure-SQL. You only need a Self-Hosted IR for on-premises/private-network stores the Azure IR cannot reach.
  5. Write behaviour is insert/append (the default), so re-running adds the rows again. Use upsert (with keys) or truncate/overwrite before the copy for idempotent re-runs.

Glossary

Next steps

You can now build, run, secure and schedule a Blob-to-SQL Copy pipeline. Build outward:

AzureData FactoryETLBlob StorageAzure SQLCopy ActivityData IntegrationPipelines
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading