Loading Multiple Tables With Azure Data Factory

There arn’t many articles out there that discuss Azure Data Factory design patterns. One big concern I’ve encountered with customers is that there appears to be a requirement to create multiple pipelines/activities for every table you need to copy. This conjures up images of massive, convoluted data factories that are a nightmare to manage.

Whilst the ForEach functionality has existed in ADF V2 for a while now, it’s recently been enhanced to enable mass copying of tables without having to specify a target Linked Services and datasets for each table.

In this simple post I’m going to show you how to get started with this new functionality.

Using the Copy Data tool (and the Pipeline Editor), you can specify your target data source, then select multiple source tables. When you specify the target data source, Data Factory generates a ForEach activity that in turn calls a Copy activity that’s parameterized for each source table.

To get started we need to have an Azure Data Factory created, along with a Source and Target. In this example, I’ve used the Azure SQL Database with the sample AdventureWorks database and Azure Blob Storage as my target. See the following for assistance in getting setup –

Create A Data Factory

Create An Azure SQL Database

Create An Azure Blob Storage Account

With that all done, we launch our newly created Data Factory from the Azure portal, and select the Copy Data wizard –



Copy Data

Which takes us to our Copy Data wizard. Fill in the the Task name and leave the rest as is.


Give The Pipeline A Name

Click Next, and click “Create New Connection” to create our Source Linked Service –


Choose A Source Data Store

For this example, choose “Azure SQL Database” –


Linked Services

With your Linked Service chosen, populate your source details (Server, Credentials etc) from the SQL Database you created earlier and click “Continue”. This takes you to the “Select Tables” screen, and select all the tables you wish to copy –


Choose Tables For Extraction

Define your destination data store in the same way as you created the source data store. In this example I’m using Azure Blob Storage as part of an ELT (Extract, Load & Transform) pipeline, and is called “staging” in my example. It’s my storage account which will act as the landing/staging area for incoming data.


Destination Data Store

In your Destination Linked Service, complete the necessary details as prompted. As I’m using Blob Storage I simply have to designate the path, which in this instance is my “raw” container within my landing area. If it was a database source this would ask for me to map source tables to target tables.


Output Destination (This Screen Will Vary Depending On Your Destination)

Finish the wizard and run the completed pipeline –adfcopy11finalscreen

Once completed, we can see in our blob storage landing area that all my files have been successfully copied.


So far, so amazing, right? The even better bit is when we go to view the Data Factory pipeline we created, and instead of seeing loads of Copy activities, we simply see one, innocuous ForEach activity –


Just One Activity – What Witchcraft Is This?

Expanding “Activities” shows us there’s just been one “Copy” activity created, and is in fact parameterized with the relevant values (source table, target destination etc) being passed as a JSON array.



ForEach Activity Parameters

Which then instructs our Copy activity what to do –


Copy Activity Source


And we’re done. Much simpleness. For those of you have gone through the pain of creating endless packages in SSIS for simple copy/1:1 data flows, and have previously been creating similar amounts of Copy activities in Data Factory, hopefully this new release will make your ETL/ELT development a much simpler process.

Happy Factory-ing!






Incremental ETL Processing With Azure Data Factory v2

Data Factory V2 was announced at Ignite 2017 and brought with it a host of new capabilities:

  • Lift your SSIS workloads into Data Factory and run using the new Integrated Runtime (IR)
  • Ability to schedule  Data Factory using wall-clock timers or on-demand via event generation
  • Introducing the first proper separation of Control Flow and Data Flow to allow more complex orchestrations such as looping,  branching and other conditional flows
  • Incremental Loads using the new Lookup Activity

And it’s this last item that today’s article is about.

It’s fair to say that in its initial incarnation, Data Factory didn’t allow for more traditional ETL workloads without some complex coding (more than you were used to if you came from the world of SSIS and similar ETL tools). For the big data focused ELT workloads where data is moved between data services (SQL Server, Blob Storage, HDInsight and so forth) and activities applied whilst the data is in place (SQL queries, Hive, USQL, Spark) Data Factory V1 really excelled, but for those who wanted to move their traditional ETL delta extracts to Data Factory, it wasn’t quite there.

In this new public preview the product team have taken great steps in remedying this, allowing ETL developers to implement proper incremental loading patterns with their relational data warehouses.  It’s also worth knowing I’ve seen where Data Factory is going, and it’s looking amazing…

The following example is based on the official tutorial here. The differences in this example are based on the scenario where you wish to perform incremental extracts from a source database to a staging area inside another database. This example uses Azure SQL Database as both the source and sink, but can be adapted for other data sources.

The solution files used in this example can be found here


This example assumes you have previous experience with Data Factory, and doesn’t spend time explaining core concepts. For an overview of Data Factory concepts, please see here.

So for today, we need the following prerequisites:

  • An Azure Subscription
  • An Azure SQL Database instance setup using the AdventureWorksLT sample database

That’s it!

Incremental Loading in Data Factory v2

The diagram demonstrates a high level overview of the process, and should be familiar to those who have built similar data flows with other ETL tools such as SSIS:


Incremental Loading Overview


There are four main steps to this process to create an end to end incremental extract.

  1. Identify the trigger column. In the Data Factory documentation this is referred to as the Watermark column so I’ll use this latter term for consistency going forwards. This column is one that can be used to filter new or updated records for each run. This column is normally an automatically updating datetime column (e.g. Modified_Date_Time) or an ever increasing integer. We use the maximum value of this column as the watermark.
  2. Create a data store to hold the watermark value. In this example we store the watermark value in our SQL Database.
  3. Create a pipeline that does the following:
    • Creates two Lookup Activities. The first looks up the last watermark value and the second retrieves the current watermark value.
    • Create a Copy activity copies rows from the source where any watermark values are greater than the old watermark value. These delta rows are then written to our target table.
  4. Create a stored procedure that updates the watermark value, ready for the next run.

Building the Solution

Create a Staging Table

In this example, we’ll use the SalesOrderDetail table in the AdventureWorksLT database:



And we will use the ModifiedDate as the watermark column.

We therefore need to create a target or “staging” table to load our delta rows into.

Run the following SQL in your created database to create a staging version of the SalesOrderDetails table:

CREATE TABLE [SalesLT].[SalesOrderDetail_Staging]


[SalesOrderID] [int] NOT NULL,

[SalesOrderDetailID] [int] NOT NULL,

[OrderQty] [smallint] NOT NULL,

[ProductID] [int] NOT NULL, [UnitPrice] [money] NOT NULL,

[UnitPriceDiscount] [money] NOT NULL,

[LineTotal] [money] NULL,

[rowguid] [uniqueidentifier] NOT NULL,

[ModifiedDate] [datetime] NOT NULL,

[ExtractDate] [datetime] Default Getdate()


Create a table for our watermark values

Run the following command in your SQL Database to create a table named watermark that will store our watermark value:

create table watermark

TableName varchar(255),
WatermarkValue datetime,

Set the default value of the watermark with the name of our source table. In this example, the table name SalesOrderDetail.

Insert into watermarktable
values (‘SalesLT.SalesOrderDetail’, ‘1-jan-2017’)

Create a stored procedure in our SQL Database:

This stored procedure is used to update the watermark table when new rows have been extracted.

CREATE PROCEDURE sp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)


UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName


Create Some Trigger Data

Run the following command on your SQL database to update a couple of rows in the source table. This is to validate the process is working correctly as it marks the rows as modified with today’s date, which is greater than the row in our watermark table. If the process is successful it will extract these changed rows.

update [SalesLT].[SalesOrderDetail]
set modifieddate = getdate()
where salesorderid = 71774

Create a Data Factory

Create a new Data Factory. For ease, do it via the portal following this guide.

(Ensure you create it using ADFv2):


Creating a Data Factory via the Azure Portal

Create your Data Factory Artifacts

You will need to create the following (I’ve included my own samples in the link at the beginning of this article):

  • AzureSQLLinkedService
  • A source SQL dataset (for our source table)
  • A target SQL dataset (for our destination table)
  • A watermark dataset (that stores our watermark value entries
  • A pipeline that coordinates the activities

Modify the files to suit your needs and setup for your database server and database.

I won’t show the different JSON here, but just to highlight, the key area is the new Lookup Activity shown below:



Create a folder on your root C Drive to hold the created JSON files

I created a folder called “adf” and put all my JSON files there:


Deploy Data Factory Artifacts

For the current preview, you cannot deploy using the GUI, so you need to deploy programmatically. For this example I’m using Powershell. Note you will need to update your Powershell install with the latest cmdlets by running the following command:

Install-Module AzureRM

After updating the AzureRM cmdlets, run the following Powershell script below (attached at the start). This will deploy the necessary artifacts to your Data Factory.


Run the pipeline

Once deployed, you can run the deployed pipeline via Powershell with the following command:

$dataFactoryName = “[Your DATA FACTORY NAME]”
$resourceGroupName = “[Your RESOURCE GROUP NAME]”

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName “mainPipeline” -ResourceGroup $resourceGroupName -dataFactoryName $dataFactoryName

This fires off the pipeline. You can monitor its progress using Monitor & Manage:


Monitor Data Factory

And this allows you to see the status of the running pipelines:


With the pipeline processed successfully,  I’ll check my database tables to check everything has updated correctly:


Delta Rows Extracted


And the watermark table has been updated successfully:


Updated Watermark Table

This completes this article covering a simple ETL incremental extraction using the new Data Factory. The roadmap for this product is very exciting, and I encourage you to check out the links below for further reading:

Introduction to Data Factory v2

Data Factory Ignite 2017