Extract, Transform & Load Options In Azure

Extract, Transform & Load (ETL), or ELT, ETLT or whatever term we use nowadays, has been around for many decades. In terms of “what it is” – although I’m sure many database professionals will be familiar with the concept – I will defer to Wikipedia for a formal definition –

In computing, extract, transform, load (ETL) is a process in database usage to prepare data for analysis, especially in data warehousing[1]. The ETL process became a popular concept in the 1970s.[2] Data extraction involves extracting data from homogeneous or heterogeneous sources, while data transformation processes data by transforming them into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, or a data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions

Courtesy Of – https://en.wikipedia.org/wiki/Extract,_transform,_load

Essentially, we take some data stuff, pull it out from its home, mash it into a nicer shape, then serve it to welcoming consumers. See below for a simple example I’m sure we’ve all seen before –

So, that'll do for a definition, but for this post I'm not going into the deep details of how ETL processes are built. Instead, I'm going to provide some details of the available options in Azure, and when and why you would use them. There are many choices right now, so it felt like a good time to summarize the offerings.

SSIS

SSIS (SQL Server Integration Services) has been around for many, many years. It's very mature and has lots of supporting books, blogs and articles covering how to set it up and develop with it. Part of the SQL Server product stack, it provides connectors to many different sources and contains many different transformation tasks that can handle pretty much any kind of traditional ETL workflow. Whilst showing its age a little in terms of the UI, recent additions such as the Azure Feature Pack have meant it can be used for large-scale ETL scenarios into Azure, and an extensive community backing has led to the emergence of automated ETL build tools for SSIS such as BIML.

For on-premises ETL workloads, SSIS is still the go-to tool for those invested in the Microsoft BI stack. It's fully integrated into SQL Server Data Tools and is well equipped to handle most workloads.

For those moving to Azure who have sunk a lot of development time into SSIS, have no fear. Whilst you can simply lift your SQL Server virtual machine running SSIS into the cloud and keep running as before, for those who want to leverage more PaaS-based approaches we now offer the facility to publish your SSIS packages directly into Data Factory and call them in a similar way to the Master-Child package pattern used previously in SSIS.

SSIS Control Flow

So, for those projects that are born in the cloud, can we use SSIS? Yes, sure you can. But if you're using mainly cloud-born data or a hybrid, or if you need to move data at scale, then you probably need to look at something else, which leads us nicely to…

Data Factory

Now, up until recently, I wouldn't have called Data Factory an ETL tool. I would instead have called it a Cloud-Based Data Orchestration Tool or an EPL (Extract, Process, Load) tool. The simple reason is that Data Factory excelled (and still does) at moving data in bulk into the cloud and then orchestrating a series of processes underneath. An example would be to copy terabytes of data from your on-premises databases, land it in Azure Blob Storage or Azure Data Lake, then call a processing engine on top to cleanse, transform and curate the data. This would be done using tools such as Azure Data Lake Analytics, HDInsight, PolyBase via SQL Data Warehouse or, more recently, Azure Databricks.

This kind of pattern (land -> process -> move on) evolved as part of the big data architecture pattern. It was becoming increasingly difficult to do traditional ETL using tools like SSIS, as the scale, size and shape of the data to be processed meant you'd be hard pressed to fit it all in memory and process it. What Data Factory allows you to do is copy the data at massive scale into your data lake, and then use processing tools more appropriate to the job to transform the data ready for usage downstream. This pattern is in fact not dissimilar to a common pattern seen in SSIS, wherein developers simply used SSIS as the orchestrator and called a series of SQL statements (often stored procedures) to handle the processing. When I used to build these packages on a weekly basis, this was the pattern I preferred, as I liked to take advantage of the processing power of the underlying database engine. (The difference now is that it's not just relational tables and SQL code we're dealing with. It's Parquet, ORC and Avro combined with SQL and Python, mixed with a healthy dose of JSON, NoSQL, key-value pairs and graph databases, plus a sprinkle of Spark. For those of us who cut our teeth on SQL and "simple" ETL patterns – this isn't Kansas anymore.)

This pattern with Data Factory is in reality an ELT (Extract, Load & Transform) approach: we extract the data, load it into the data lake, then process it with the relevant tooling. This is in contrast to traditional ETL, where the data is transformed in flight, in memory, within the ETL tool. With Data Factory we process the data where it is, using the engine best suited to that particular task.

Data Factory Pipeline
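To make the land-then-process pattern concrete, here's a minimal sketch of what such an orchestration might look like as a Data Factory v2 pipeline definition. The dataset, linked service and notebook names are purely illustrative (they aren't from my actual factory), and the JSON is trimmed to the structural essentials rather than being a complete, deployable definition:

{
  "name": "LandAndProcessPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySourceToLake",
        "type": "Copy",
        "inputs": [ { "referenceName": "OnPremSalesDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "LakeStagingDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "SqlSource" },
          "sink": { "type": "BlobSink" }
        }
      },
      {
        "name": "TransformInPlace",
        "type": "DatabricksNotebook",
        "dependsOn": [ { "activity": "CopySourceToLake", "dependencyConditions": [ "Succeeded" ] } ],
        "linkedServiceName": { "referenceName": "DatabricksLinkedService", "type": "LinkedServiceReference" },
        "typeProperties": { "notebookPath": "/etl/curate-sales" }
      }
    ]
  }
}

The Copy activity does the bulk landing into the lake, and the downstream activity (here a Databricks notebook, but it could equally be a stored procedure, U-SQL or Hive activity) does the transformation where the data sits.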

Now, Data Factory has recently taken a massive step towards being a true ETL tool with the announcement of Azure Data Factory Data Flows. Currently in private preview, this aims to bring true cloud-scale visual ETL into Data Factory by allowing you to build data pipelines without having to write SQL, Python, R, Scala or Java as part of your activities. Instead, Data Flows allows developers to build a data flow using a visual environment, with the actual work pushed down into an Azure Databricks cluster underneath. This high-level abstraction allows developers to focus on building the actual pipeline without having to code specific routines. This visual approach, combined with using Apache Spark (Databricks) as the processing engine under the hood, essentially means you get the best of both worlds: a visual ETL design experience to simplify development, combined with a processing engine that can deal with the massive scale and variety of data often encountered when building modern data warehouses in the cloud.

Check out the preview video here, with a (not great) screen grab below –

Data Factory Data Flows

Data Flows are shaping to be an awesome addition to Data Factory and I’m really looking forward to the innovative patterns that emerge when it’s out in the big wide world.

Power BI Data Flows

Recently, we announced (or more specifically, re-announced with a new name) Power BI Dataflows (not to be confused with Data Factory Data Flows). For those familiar with Power BI, it's essentially the Power Query part (a really powerful data preparation and cleansing engine) that has been taken out and added to the Power BI Service as a cloud-based, self-service data prep tool. The aim is to allow BI developers and analysts to build re-usable dataflows that are published into Power BI. These flows can then be consumed as a data source by Power BI Desktop (and soon, other tools). It also – and this is big – allows the output of the dataflow to be written back into Azure Data Lake (Gen 2), thus allowing the prepped data to be picked up and used by other tools. This latter functionality essentially democratizes ETL into the hands of the BI developer or analyst. Whereas previously it was often on the head of ETL developers or data engineers to build the pipelines for the analysts to consume, Dataflows now allows them to build these pipelines themselves, giving them a full, end-to-end analytics platform capability.

Essentially, with Power BI Dataflows, you can build the ETL, model the data, transform it into a reporting-friendly structure (think Kimball-style dimensional models) and serve it, either directly within Power BI or for consumption by other tools.

Another benefit of Power BI Dataflows is that it allows data transformation logic that was previously locked inside a specific Power BI workbook to be taken out and published, allowing it to be re-used by other consumers. So rather than having duplicated logic hidden in Power BI workbooks, you can simply build once and reuse many times.

Power BI Dataflows

This sounds awesome, I hear you cry. Why do I need anything else? Why can’t I just use Power BI end to end and be done with it? Let me dump my cumbersome data warehouse and just build the entire thing in Power BI.

Can I do that?

Well, in some cases, yes you can. But in others, I would probably stay away from it.

For team and departmental BI, Power BI Dataflows is a fantastic addition. No longer do you need to invest in a complex ETL and data warehouse build or, conversely, be stuck in manual data manipulation hell using hand-cranked SQL code, Excel VLOOKUPs and other time-consuming and error-prone activities. Dataflows allows you to automate the data processing, centralise it, and then let the team do what adds most value: producing insights.

For complex data integration projects involving multiple teams, agile processes, automated testing and all the other components that comprise such a project, it's probably not the best fit. There's no formal source control or DevOps integration to start with, and I'm yet to see how it scales to massive datasets. It's very early in its life though, so who knows where it's going to go. It's a powerful addition to the Power BI armoury, and I can definitely see some self-contained, pure Power BI-only solutions being created, with the ETL, semantic layer and dashboards all built in the same platform.

3rd Party Offerings

As well as the first-party options that Microsoft supplies, there is also a plethora of tools available in the Azure Marketplace, provided by big hitters in the ETL space and fully supported in Azure. This article doesn't cover those, but for some examples check out Informatica and Datameer.

Summary

As always, there isn't a black and white answer as to which tooling you should use; it often depends on several different factors, including team size, skills, data source complexity and scale, plus many others. Below I try to summarise where I feel each solution fits best, but over the coming weeks and months, as Power BI Dataflows and the upcoming Data Factory Data Flows are picked up by more and more data developers, it'll be interesting to see what new patterns emerge.

It's also fair to say that you could pretty much adapt any of them to suit your needs depending on preference. The table below shows some of my own opinions on what's best for what, but feel free to experiment with each and see what works.

ETL Summary

Further Reading

ADF and SSIS Compared

Introducing Power BI Dataflows

Building ETL With Azure Data Factory (Whitepaper)

Loading Multiple Tables With Azure Data Factory

There aren't many articles out there that discuss Azure Data Factory design patterns. One big concern I've encountered with customers is that there appears to be a requirement to create multiple pipelines/activities for every table you need to copy. This conjures up images of massive, convoluted data factories that are a nightmare to manage.

Whilst the ForEach functionality has existed in ADF V2 for a while now, it's recently been enhanced to enable mass copying of tables without having to specify a target Linked Service and dataset for each table.

In this simple post I’m going to show you how to get started with this new functionality.

Using the Copy Data tool (and the Pipeline Editor), you can point at a source data store, select multiple source tables, and then specify your target data store. Data Factory generates a ForEach activity that in turn calls a single Copy activity, parameterized for each source table.

To get started we need to have an Azure Data Factory created, along with a Source and Target. In this example, I’ve used the Azure SQL Database with the sample AdventureWorks database and Azure Blob Storage as my target. See the following for assistance in getting setup –

Create A Data Factory

Create An Azure SQL Database

Create An Azure Blob Storage Account

With that all done, we launch our newly created Data Factory from the Azure portal, and select the Copy Data wizard –

 

ADFCopyData1.png

Copy Data

Which takes us to the Copy Data wizard. Fill in the Task name and leave the rest as is.

adfcopy2.PNG

Give The Pipeline A Name

Click Next, and click “Create New Connection” to create our Source Linked Service –

adfcopy3createnewconnection

Choose A Source Data Store

For this example, choose “Azure SQL Database” –

adfcopy4choosesqldb

Linked Services

With your Linked Service chosen, populate your source details (server, credentials etc.) from the SQL Database you created earlier and click "Continue". This takes you to the "Select Tables" screen, where you select all the tables you wish to copy –

adfcopy6choosetables

Choose Tables For Extraction

Define your destination data store in the same way as you created the source data store. In this example I'm using Azure Blob Storage as part of an ELT (Extract, Load & Transform) pipeline; the storage account, called "staging" in my example, will act as the landing/staging area for incoming data.

adfcopy7stroagetarget

Destination Data Store

In your Destination Linked Service, complete the necessary details as prompted. As I'm using Blob Storage I simply have to designate the path, which in this instance is my "raw" container within my landing area. If the destination were a database, this screen would ask me to map source tables to target tables.

adfcopy8blobrawfolder

Output Destination (This Screen Will Vary Depending On Your Destination)

Finish the wizard and run the completed pipeline –

adfcopy11finalscreen

Once completed, we can see in our blob storage landing area that all the files have been successfully copied.

adfcopy15storageview

So far, so amazing, right? The even better bit is when we go to view the Data Factory pipeline we created, and instead of seeing loads of Copy activities, we simply see one innocuous ForEach activity –

adfcopy12inadfForEach

Just One Activity – What Witchcraft Is This?

Expanding "Activities" shows us that just one Copy activity has been created, which is in fact parameterized, with the relevant values (source table, target destination etc.) passed in as a JSON array.

 

adfcopy16jsonarray

ForEach Activity Parameters
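For reference, the items array the wizard passes to the ForEach activity looks something along these lines. This is a simplified sketch rather than the wizard's exact output (property names vary with the tool version and the chosen sink), with one element per source table:

[
  { "source": { "schema": "SalesLT", "table": "Customer" }, "destination": { "fileName": "SalesLTCustomer.txt" } },
  { "source": { "schema": "SalesLT", "table": "Product" }, "destination": { "fileName": "SalesLTProduct.txt" } },
  { "source": { "schema": "SalesLT", "table": "SalesOrderDetail" }, "destination": { "fileName": "SalesLTSalesOrderDetail.txt" } }
]

At run time the ForEach activity hands each element to the parameterized Copy activity, which builds its source query and target path from those values.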

Which then instructs our Copy activity what to do –

adfcopy14CopyhActivity

Copy Activity Source

 

And we're done. Much simpleness. For those of you who have gone through the pain of creating endless packages in SSIS for simple copy/1:1 data flows, and have previously been creating similar numbers of Copy activities in Data Factory, hopefully this new release will make your ETL/ELT development a much simpler process.

Happy Factory-ing!


Incremental ETL Processing With Azure Data Factory v2

Data Factory V2 was announced at Ignite 2017 and brought with it a host of new capabilities:

  • Lift your SSIS workloads into Data Factory and run them using the new Integration Runtime (IR)
  • Ability to schedule Data Factory using wall-clock timers or on-demand via event generation (see the trigger sketch after this list)
  • Introducing the first proper separation of Control Flow and Data Flow to allow more complex orchestrations such as looping, branching and other conditional flows
  • Incremental loads using the new Lookup activity
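As an example of the new wall-clock scheduling, a daily schedule trigger attached to a pipeline looks roughly like this. This is a sketch only – the trigger name, start time and pipeline name are illustrative, and the trigger still has to be deployed and started against your own factory:

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2017-12-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "mainPipeline", "type": "PipelineReference" } }
    ]
  }
}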

And it’s this last item that today’s article is about.

It's fair to say that in its initial incarnation, Data Factory didn't allow for more traditional ETL workloads without some complex coding (more than you were used to if you came from the world of SSIS and similar ETL tools). For big data focused ELT workloads, where data is moved between data services (SQL Server, Blob Storage, HDInsight and so forth) and activities are applied whilst the data is in place (SQL queries, Hive, U-SQL, Spark), Data Factory V1 really excelled, but for those who wanted to move their traditional ETL delta extracts to Data Factory, it wasn't quite there.

In this new public preview the product team have taken great steps towards remedying this, allowing ETL developers to implement proper incremental loading patterns with their relational data warehouses. It's also worth knowing that I've seen where Data Factory is going, and it's looking amazing…

The following example is based on the official tutorial here. The differences in this example come from the scenario where you wish to perform incremental extracts from a source database to a staging area inside another database. This example uses Azure SQL Database as both the source and sink, but it can be adapted for other data sources.

The solution files used in this example can be found here

Prerequisites

This example assumes you have previous experience with Data Factory, and doesn’t spend time explaining core concepts. For an overview of Data Factory concepts, please see here.

So for today, we need the following prerequisites:

  • An Azure Subscription
  • An Azure SQL Database instance setup using the AdventureWorksLT sample database

That’s it!

Incremental Loading in Data Factory v2

The diagram below gives a high-level overview of the process, and should be familiar to those who have built similar data flows with other ETL tools such as SSIS:

advv2incloadingoverview

Incremental Loading Overview

Process

There are four main steps to creating an end-to-end incremental extract.

  1. Identify the trigger column. In the Data Factory documentation this is referred to as the watermark column, so I'll use that term for consistency going forwards. This is a column that can be used to filter new or updated records for each run – normally an automatically updating datetime column (e.g. Modified_Date_Time) or an ever-increasing integer. We use the maximum value of this column as the watermark.
  2. Create a data store to hold the watermark value. In this example we store the watermark value in our SQL Database.
  3. Create a pipeline that does the following:
    • Creates two Lookup activities. The first looks up the last watermark value and the second retrieves the current watermark value.
    • Creates a Copy activity that copies rows from the source where the watermark value is greater than the old watermark value (and not greater than the new one). These delta rows are then written to our target table – see the query sketch after this list.
  4. Create a stored procedure that updates the watermark value, ready for the next run.
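To make step 3 concrete, the source query issued by the Copy activity is conceptually nothing more than the following. This is a sketch – in the real pipeline the two watermark values are injected by the Lookup activities via Data Factory expressions rather than T-SQL variables:

-- Conceptual delta query run by the Copy activity source
-- @OldWatermarkValue comes from the watermark table (first Lookup)
-- @NewWatermarkValue is the current MAX(ModifiedDate) in the source (second Lookup)
SELECT *
FROM SalesLT.SalesOrderDetail
WHERE ModifiedDate > @OldWatermarkValue
  AND ModifiedDate <= @NewWatermarkValue;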

Building the Solution

Create a Staging Table

In this example, we’ll use the SalesOrderDetail table in the AdventureWorksLT database:

salesorderdetail

SalesOrderDetail

And we will use the ModifiedDate as the watermark column.

We therefore need to create a target or “staging” table to load our delta rows into.

Run the following SQL in your created database to create a staging version of the SalesOrderDetails table:

CREATE TABLE [SalesLT].[SalesOrderDetail_Staging]
(
    [SalesOrderID] [int] NOT NULL,
    [SalesOrderDetailID] [int] NOT NULL,
    [OrderQty] [smallint] NOT NULL,
    [ProductID] [int] NOT NULL,
    [UnitPrice] [money] NOT NULL,
    [UnitPriceDiscount] [money] NOT NULL,
    [LineTotal] [money] NULL,
    [rowguid] [uniqueidentifier] NOT NULL,
    [ModifiedDate] [datetime] NOT NULL,
    [ExtractDate] [datetime] DEFAULT GETDATE()
)

Create a table for our watermark values

Run the following command in your SQL Database to create a table named watermarktable that will store our watermark value:

CREATE TABLE watermarktable
(
    TableName varchar(255),
    WatermarkValue datetime
);

Seed the watermark table with the name of our source table and an initial watermark value. In this example, the source table is SalesLT.SalesOrderDetail.

Insert into watermarktable
values ('SalesLT.SalesOrderDetail', '1-jan-2017')

Create a stored procedure in our SQL Database:

This stored procedure is used to update the watermark table when new rows have been extracted.

CREATE PROCEDURE sp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create Some Trigger Data

Run the following command on your SQL database to update a couple of rows in the source table. This is to validate the process is working correctly as it marks the rows as modified with today’s date, which is greater than the row in our watermark table. If the process is successful it will extract these changed rows.

update [SalesLT].[SalesOrderDetail]
set modifieddate = getdate()
where salesorderid = 71774

Create a Data Factory

Create a new Data Factory. For ease, do it via the portal following this guide.

(Ensure you create it using ADFv2):

adfv2

Creating a Data Factory via the Azure Portal

Create your Data Factory Artifacts

You will need to create the following (I’ve included my own samples in the link at the beginning of this article):

  • AzureSQLLinkedService
  • A source SQL dataset (for our source table)
  • A target SQL dataset (for our destination table)
  • A watermark dataset (that stores our watermark value entries)
  • A pipeline that coordinates the activities

Modify the files to suit your needs and setup for your database server and database.

I won’t show the different JSON here, but just to highlight, the key area is the new Lookup Activity shown below:

pipelinelookupactivity
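As a rough sketch of what that Lookup activity looks like in the pipeline JSON (the activity and dataset names here follow my sample files, so treat the snippet as illustrative rather than definitive):

{
  "name": "LookupOldWatermark",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "SqlSource",
      "sqlReaderQuery": "SELECT TableName, WatermarkValue FROM watermarktable WHERE TableName = 'SalesLT.SalesOrderDetail'"
    },
    "dataset": {
      "referenceName": "WatermarkDataset",
      "type": "DatasetReference"
    }
  }
}

A second Lookup activity does the same against the source table to retrieve MAX(ModifiedDate), and the Copy activity then references the output of both in its source query.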

 

Create a folder on your root C Drive to hold the created JSON files

I created a folder called “adf” and put all my JSON files there:

adffolder

Deploy Data Factory Artifacts

For the current preview, you cannot deploy using the GUI, so you need to deploy programmatically. For this example I'm using PowerShell. Note that you will need to update your PowerShell install with the latest cmdlets by running the following command:

Install-Module AzureRM

After updating the AzureRM cmdlets, run the PowerShell script attached at the start of this article. This will deploy the necessary artifacts to your Data Factory.

powershellex2
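If you'd rather not open the attached script, the deployment boils down to a handful of Set-AzureRmDataFactoryV2* calls along these lines. This is a sketch – the dataset names and file paths here are illustrative, so adjust them to match your own JSON files:

# Variables for the target factory
$resourceGroupName = "[Your RESOURCE GROUP NAME]"
$dataFactoryName = "[Your DATA FACTORY NAME]"

# Deploy the linked service, datasets and pipeline from their JSON definitions
Set-AzureRmDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "AzureSQLLinkedService" -DefinitionFile "C:\adf\AzureSQLLinkedService.json"

Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "SourceDataset" -DefinitionFile "C:\adf\SourceDataset.json"

Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "SinkDataset" -DefinitionFile "C:\adf\SinkDataset.json"

Set-AzureRmDataFactoryV2Dataset -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "WatermarkDataset" -DefinitionFile "C:\adf\WatermarkDataset.json"

Set-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "mainPipeline" -DefinitionFile "C:\adf\mainPipeline.json"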

Run the pipeline

Once deployed, you can run the pipeline via PowerShell with the following command:

$dataFactoryName = "[Your DATA FACTORY NAME]"
$resourceGroupName = "[Your RESOURCE GROUP NAME]"

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName "mainPipeline" -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName

This fires off the pipeline. You can monitor its progress using Monitor & Manage:

minotoradf

Monitor Data Factory

And this allows you to see the status of the running pipelines:

pipelineruns
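If you prefer to stay in PowerShell rather than use the portal, the run ID returned by Invoke-AzureRmDataFactoryV2Pipeline can also be passed to Get-AzureRmDataFactoryV2PipelineRun to poll the run status (a quick sketch):

# Check the status of the pipeline run we just kicked off
Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $RunId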

With the pipeline processed successfully, I'll check my database tables to confirm everything has updated correctly:

stagingdemo

Delta Rows Extracted

Bingo!

And the watermark table has been updated successfully:

watermarkupdate

Updated Watermark Table

That completes this article covering a simple incremental ETL extraction using the new Data Factory. The roadmap for this product is very exciting, and I encourage you to check out the links below for further reading:

Introduction to Data Factory v2

Data Factory Ignite 2017

 

Optimizing SSIS Data Loads with Azure SQL Data Warehouse & PolyBase

To gain the best loading performance with Azure SQL Data Warehouse, you want to leverage the power of PolyBase, which allows for massively parallel data loading into the underlying storage.

In standard scenarios, ETL developers using SSIS will build a data flow that takes data from a source (database, file, etc.) and loads it directly into the target database:

SSIS Simple Data Flow

Simple SSIS Data Flow (Table to Table)

Unfortunately, adopting this pattern with an SSIS data flow into Azure SQL Data Warehouse will not naturally leverage the power of PolyBase. In order to understand this we need to take a little peek under the covers of SQL Data Warehouse:

Azure SQL DWH Overview

Azure SQL Data Warehouse Overview

As part of the underlying architecture, SQL DW has at its head the Control node. This node manages and optimizes queries, and serves as the access point for client applications accessing the SQL DW cluster. What this means for SSIS is that when it makes its connection to SQL DW using the standard data connectors, that connection is made to the Control node:

SSIS Front Load

SSIS Front Loading

This method introduces a bottleneck, as the Control node serves as the single throughput point for the data flow. And since the Control node does not scale out, this limits the speed of your ETL flow.

As stated earlier, the most efficient way of loading into Azure SQL DW is to leverage the power of PolyBase in a process known as “Back Loading”:

SSIS Back Load

Back loading via Polybase

Using this method, data is bulk loaded in parallel from staging files held in Blob Storage or Azure Data Lake Store, which allows for the full data processing power of PolyBase to be used.

Leveraging Polybase Within SSIS

Previously, ingesting data from files into SQL DW using PolyBase was a slightly drawn-out process of uploading files to storage, then defining credentials, file formats and external table definitions before finally loading the data into SQL DW using a CTAS (Create Table As Select) statement. This works perfectly well, but introduces a lot of extra SQL coding on top of the norm.
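For context, that manual route looked roughly like this. It's a heavily abbreviated sketch – the credential secret, storage account, container paths and column list are placeholders, and a real load would need the full column definition and more options:

-- 1. Credential for the storage account holding the staged files (one-time setup)
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL StagingStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage account key>';

-- 2. External data source pointing at the blob container
CREATE EXTERNAL DATA SOURCE StagingStorage
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://staging@<storageaccount>.blob.core.windows.net',
      CREDENTIAL = StagingStorageCredential);

-- 3. File format describing the staged files
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- 4. External table over the staged files (column list abbreviated for illustration)
CREATE EXTERNAL TABLE dbo.SalesOrderDetail_Ext
(
    SalesOrderID int NOT NULL,
    OrderQty smallint NOT NULL,
    ProductID int NOT NULL
)
WITH (LOCATION = '/salesorderdetail/',
      DATA_SOURCE = StagingStorage,
      FILE_FORMAT = CsvFormat);

-- 5. Finally, load into SQL DW in parallel with a CTAS
CREATE TABLE dbo.SalesOrderDetail
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM dbo.SalesOrderDetail_Ext;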

Thankfully, with the SSIS Azure Feature Pack, the above is now made much, much easier with the introduction of the Azure SQL DW Upload task. This task automates the creation of the SQL scripts required, allowing you to seamlessly incorporate it into an SSIS ETL process that fully leverages PolyBase.

Azure DW Upload Control Flow

Azure SQL DW Upload Control Flow

Azure DW Upload Task

Azure SQL DW Upload Task

Further Reading

Azure SQL DW Best Practices

Azure SQL DW With Azure Data Lake Store (ADLS)

Polybase Guide