Extract, Transform & Load Options In Azure

Extract, Transform & Load (ETL), or ELT, ETLT or whatever term we use nowadays, has been around for many decades. In terms of “what it is” – although I’m sure many database professionals will be familiar with the concept – I will defer to Wikipedia for a formal definition –

In computing, extract, transform, load (ETL) is a process in database usage to prepare data for analysis, especially in data warehousing[1]. The ETL process became a popular concept in the 1970s.[2] Data extraction involves extracting data from homogeneous or heterogeneous sources, while data transformation processes data by transforming them into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, or a data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions

Courtesy Of – https://en.wikipedia.org/wiki/Extract,_transform,_load

Essentially, we take some data stuff, pull it out from its home, mash it into a nicer shape, then serve it to welcoming consumers. See below for a simple example I’m sure we’ve all seen before –

So, that’ll do for a definition, but for this post I’m not going into the deep details of how ETL processes are built. Instead, I’m going to provide some details of the available options in Azure, and when/why you would use them. There are many choices right now, so it felt like a good time to summarize the offerings.

SSIS

SSIS (SQL Server Integration Services) has been around for many, many years. It’s very mature, with lots of supporting books, blogs and articles on how to set it up and develop with it. Part of the SQL Server product stack, it provides connectors to many different sources and contains many different transformation tasks that can handle pretty much any kind of traditional ETL workflow. Whilst showing its age a little in terms of the UI, recent additions such as the Azure Feature Pack mean it can be used for large scale ETL scenarios into Azure, and an extensive community backing has led to the emergence of automated ETL build tools for SSIS such as BIML.

For on-premise ETL workloads, SSIS is still the go-to tool of choice for those invested in the Microsoft BI Stack. It’s fully integrated into SQL Server Data Tools and is well equipped to handle most workloads.

For those moving to Azure who have sunk a lot of development time into SSIS, have no fear. Whilst you can simply lift your SQL Server virtual machine running SSIS into the cloud and keep running as before, for those who want to leverage more PaaS-based approaches we now offer the facility to publish your SSIS packages directly into Data Factory and call them from there, in a similar way to the SSIS Master-Child package patterns used previously.

SSIS Control Flow

So, for those projects that are born in the cloud, can we use SSIS? Yes, sure you can. But if you’re using mainly cloud-born data or a hybrid, or if you need to move data at scale, then you probably need to look at something else, which leads us nicely to…

Data Factory

Now, up until recently, I wouldn’t have called Data Factory an ETL tool. I would instead have called it a Cloud-Based Data Orchestration Tool or an EPL (Extract, Process, Load) tool. The simple reason is that Data Factory excelled (and still does) at moving data in bulk into the cloud and then orchestrating a series of processes underneath. An example would be to copy terabytes of data from your on-premise databases, land it in Azure Blob Storage or Data Lake, then call a processing engine on top to cleanse, transform and curate the data. This would be done using tools such as Azure Data Lake Analytics, HDInsight, Polybase via SQL Data Warehouse or, more recently, with Azure Databricks.

This kind of pattern (land -> process -> move on) evolved as part of the big data architecture pattern. It was becoming increasingly difficult to do traditional ETL using tools like SSIS, as the scale, size and shape of the data to be processed meant that you’d be hard pressed to fit it all in memory and process it. What Data Factory allows you to do is copy the data at massive scale into your data lake, and then use processing tools more appropriate to the job to transform the data ready for usage downstream. This pattern is not dissimilar in fact to a common pattern seen in SSIS wherein developers simply used SSIS as the orchestrator and called a series of SQL Statements (often Stored Procedures) to handle the processing. When I used to build these packages on a weekly basis, this was the pattern I preferred as I liked to take advantage of the processing power of the underlying database engine. (The difference now is that it’s not just relational tables we’re dealing with and SQL code. It’s parquet, orc and avro combined with SQL and Python, mixed with a healthy dose of JSON, NoSQL, Key Value pairs and Graph databases plus a sprinkle of Spark. For those of us who cut our teeth on SQL and “simple” ETL patterns – this isn’t Kansas anymore)

This pattern with Data Factory is in reality an ELT (Extract, Load & Transform) approach. This means we extract the data, load it into the data lake, then process it with the relevant tooling. This is compared to traditional ETL where the data is transformed in “flight” in the ETL tool in memory. With Data Factory we process the data where it is, using the engine best suited for that particular task.
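To make that a little more concrete, below is a minimal PySpark sketch of the “transform it where it lands” step, the kind of code you might run in a Databricks notebook called from a Data Factory pipeline. The mount point, paths and column names are all assumptions for illustration rather than a prescribed implementation.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw CSV files that the Data Factory copy activity landed in the lake
raw_sales = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/mnt/datalake/raw/sales/*.csv"))

# Transform the data in place in the lake rather than "in flight"
curated_sales = (raw_sales
                 .dropDuplicates(["SalesOrderID", "SalesOrderDetailID"])
                 .filter(F.col("OrderQty") > 0)
                 .withColumn("LoadDate", F.current_timestamp()))

# Write the curated output back to the lake for downstream consumers
curated_sales.write.mode("overwrite").parquet("/mnt/datalake/curated/sales")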

Data Factory Pipeline

Now, Data Factory has recently taken a massive step forwards to being a true ETL tool with the announcement of Azure Data Factory Data Flows. Currently in private preview, this aims to bring true cloud-scale visual ETL into Data Factory by allowing you to build data pipelines without having to write SQL, Python, R, Scala or Java as part of your activities. Instead, Data Flows allows developers to build a data flow using a visual environment, with the actual work pushed down into an Azure Databricks cluster underneath. This high level abstraction allows developers to focus on building the actual pipeline without having to code specific routines. This visual approach, combined with using Apache Spark (Databricks) as the processing engine under the hood, essentially means you get the best of both worlds. A visual ETL design experience to simplify development combined with a processing engine that can deal with the massive scale and variety of data often encountered when building modern data warehouses in the cloud.

Check out the preview video here, with a (not great) screen grab below –

Data Factory Data Flows

Data Flows are shaping up to be an awesome addition to Data Factory and I’m really looking forward to the innovative patterns that emerge when it’s out in the big wide world.

Power BI Data Flows

Recently, we announced (or more specifically re-announced with a new name) Power BI Data Flows (not to be confused with Data Factory Data Flows). For those familiar with Power BI, it’s essentially the Power Query part (a really powerful data preparation and cleansing engine) that’s now been taken out and added to the Power BI Service as a cloud-based self-serve data prep tool. The aim of this is to allow BI Developers and Analysts to build re-usable data flows that are published into Power BI. These flows can then be consumed as a data source by Power BI Desktop (and soon, other tools). It also – and this is big – allows the output of the Data Flow to be written back into Azure Data Lake (Gen 2), thus allowing the prepped data to be picked up and used by other tools. This latter functionality essentially democratizes ETL into the hands of the BI Developer or Analyst. Whereas previously it was often on the head of ETL developers or data engineers to build the pipelines for the analysts to consume, Dataflows now allows them to build these pipelines themselves, essentially giving them a full, end-to-end analytics platform capability.

Essentially, with Power BI Dataflows, you can build the ETL, model the data, transform it into a reporting friendly structure (think Kimball-style dimensional models) and serve it, either within Power BI directly or to be consumed by other tools.

Another benefit of Power BI Data Flows is that it allows data transformation logic that had previously been locked to a specific Power BI Workbook to be taken out and published, allowing it to be re-used by other consumers. So rather than having duplicated logic hidden in Power BI workbooks you can simply build once and reuse many times.

Power BI Dataflows

This sounds awesome, I hear you cry. Why do I need anything else? Why can’t I just use Power BI end to end and be done with it? Let me dump my cumbersome data warehouse and just build the entire thing in Power BI.

Can I do that?

Well, in some cases, yes you can. But in others, I would probably stay away from it.

For team and departmental BI, Power BI Dataflows is a fantastic addition. No longer do you need to invest in a complex ETL and data warehouse build or conversely, be stuck in manual data manipulation hell using hand cranked SQL Code, Excel Vlookups and other time consuming and error-prone activities. Dataflows allows you to automate the data processing, centralise it, and then let the team do what adds most value, producing insights.

For complex data integration projects involving multiple teams, agile processes, automated testing and all the other components that comprise such a project, it’s probably not the best fit for that space. There’s no formal source control or devops integration to start with, and I’m yet to see how it scales to massive datasets. It’s very early in its life though, so who knows where it’s going to go. It’s a powerful addition to the Power BI armoury, and I can definitely see some pure Power BI-only self-contained BI solutions being created with ETL, Semantic Layer and Dashboards all built in the same platform.

3rd Party Offerings

As well as the first party options that Microsoft supply, there are also a plethora of tools available in the Azure Marketplace provided by real big hitters in the ETL space that are fully supported in Azure. This article doesn’t cover those, but for some examples check out Informatica and Datameer.

Summary

As always, there isn’t a black and white answer about what tooling you should use, and the choice often depends on several different factors including team size, skills, data source complexity and scale, plus many others. Below I try and summarise where I feel each solution fits best, but over the coming weeks and months, as Power BI Data Flows and the upcoming Data Factory Data Flows are picked up by more and more data developers, it’ll be interesting to see what new patterns emerge.

It’s also fair to say that you could pretty much adapt any of them to suit your needs depending on preference. The table below shows some of my own opinions of what’s best for what, but feel free to experiment with each and see what works.

ETL Summary

Further Reading

ADF and SSIS Compared

Introducing Power BI Dataflows

Building ETL With Azure Data Factory (Whitepaper)

Migrating Your SQL Server Estate To Azure

Many customers want to move their SQL Server databases into the Azure cloud. Whilst not discussed here, moving to a PaaS (Platform As A Service) database platform provides many benefits over having to manage your on-premise or IaaS (Infrastructure As A Service) SQL Server databases yourself.

Whilst many want to, the actual process isn’t traditionally easy. There are many steps to go through, including identifying what you need to migrate, assessing compatibility with newer SQL versions, making any necessary changes, and that’s even before you attempt the migration itself! I’ve had several DBA roles in the past and the challenge is all too real.

Thankfully, there’s a handful of tools in the Azure space that can make this whole process a lot easier. In this blog I’ll touch on each of these – what to use and when, and will also provide a high level process that links them all together.

A lot of the information I’m going to talk about can be found in the excellent Azure Data Migration Guide, which is recommended reading for any migration to Azure. It also covers other sources such as MySQL and PostgreSQL as well as good old SQL Server.

migration process

Data Migration Process (Azure Data Migration Guide)

Discover

In this phase you are looking to identify and inventorise your SQL Server estate so that you know what you actually have running. The MAP toolkit is a great free tool to get started, and will give a breakdown of the SQL Server Editions and the features in use.

MAP Summary

Sample MAP Summary (Data Migration Guide)

 

Assess

Now that we’ve documented our estate and know what we need to migrate, we need to assess these source databases to see if there are any breaking or deprecated features that would need changing before attempting to move to the latest version of SQL.

In our kit bag this time we have the Database Migration Assistant (DMA). Newly updated (October 2018) to include Azure SQL Managed Instance as a target, DMA simply needs pointing at your source database; choose “Assessment” and it will spit out any issues with SQL Server feature parity or compatibility that you need to be aware of.

DMA

Database Migration Assistant

Migrate

Okay, so we’ve run our assessment using the DMA. We fixed or amended any potential blockers to moving our source database into our target SQL database. We’re now ready to move onto the final part of the migration, the actual move itself.

There are several methods of doing this (depending on your source and target), from the good old back up, copy and restore through to more advanced methods. You can also use the DMA itself to do a migration, however the DMA is better suited to single, one-off migrations or to testing before the real thing.

For production scale migrations we recommend the use of the Azure Database Migration Service (DMS), which is a fully managed database migration service that allows us to not only migrate databases from SQL Server in bulk, but also lets us do these migrations online, meaning that, following the restore to the target database, the DMS establishes a continuous sync between source and target. This feature means that you don’t have to do the cutover straight away if you’re not ready, and can make that move further down the line whilst minimizing the operational impact.

dms-create-project3

Example DMS Migration Project

Summary

In this article I’ve barely skimmed the surface on some of the processes and tools required to migrate your database into Azure SQL. However, with the Database Migration Guide, coupled with the Database Migration Assistant and the Database Migration Service, you’ve now got an excellent kit bag to make that challenging migration just a little bit easier (and less stressful!).

Useful Reading/Links

Azure Data Migration Guide

MAP Toolkit

Database Migration Assistant

Database Migration Service

Choosing Your Database Migration Path (Whitepaper)

Loading Multiple Tables With Azure Data Factory

There aren’t many articles out there that discuss Azure Data Factory design patterns. One big concern I’ve encountered with customers is that there appears to be a requirement to create multiple pipelines/activities for every table you need to copy. This conjures up images of massive, convoluted data factories that are a nightmare to manage.

Whilst the ForEach functionality has existed in ADF V2 for a while now, it’s recently been enhanced to enable mass copying of tables without having to specify target Linked Services and datasets for each table.

In this simple post I’m going to show you how to get started with this new functionality.

Using the Copy Data tool (and the Pipeline Editor), you can specify your target data source, then select multiple source tables. When you specify the target data source, Data Factory generates a ForEach activity that in turn calls a Copy activity that’s parameterized for each source table.

To get started we need to have an Azure Data Factory created, along with a source and a target. In this example, I’ve used Azure SQL Database with the sample AdventureWorks database as my source and Azure Blob Storage as my target. See the following for assistance in getting set up –

Create A Data Factory

Create An Azure SQL Database

Create An Azure Blob Storage Account

With that all done, we launch our newly created Data Factory from the Azure portal, and select the Copy Data wizard –

 

ADFCopyData1.png

Copy Data

Which takes us to our Copy Data wizard. Fill in the Task name and leave the rest as is.

adfcopy2.PNG

Give The Pipeline A Name

Click Next, and click “Create New Connection” to create our Source Linked Service –

adfcopy3createnewconnection

Choose A Source Data Store

For this example, choose “Azure SQL Database” –

adfcopy4choosesqldb

Linked Services

With your Linked Service chosen, populate your source details (Server, Credentials etc) from the SQL Database you created earlier and click “Continue”. This takes you to the “Select Tables” screen, where you can select all the tables you wish to copy –

adfcopy6choosetables

Choose Tables For Extraction

Define your destination data store in the same way as you created the source data store. In this example I’m using Azure Blob Storage as part of an ELT (Extract, Load & Transform) pipeline; it’s called “staging” in my example, and this storage account will act as the landing/staging area for incoming data.

adfcopy7stroagetarget

Destination Data Store

In your Destination Linked Service, complete the necessary details as prompted. As I’m using Blob Storage I simply have to designate the path, which in this instance is my “raw” container within my landing area. If the destination were a database, this screen would instead ask me to map source tables to target tables.

adfcopy8blobrawfolder

Output Destination (This Screen Will Vary Depending On Your Destination)

Finish the wizard and run the completed pipeline –

adfcopy11finalscreen

Once completed, we can see in our blob storage landing area that all my files have been successfully copied.

adfcopy15storageview

So far, so amazing, right? The even better bit is when we go to view the Data Factory pipeline we created, and instead of seeing loads of Copy activities, we simply see one, innocuous ForEach activity –

adfcopy12inadfForEach

Just One Activity – What Witchcraft Is This?

Expanding “Activities” shows us there’s just been one “Copy” activity created, which is in fact parameterized with the relevant values (source table, target destination etc) being passed as a JSON array.

 

adfcopy16jsonarray

ForEach Activity Parameters

Which then instructs our Copy activity what to do –

adfcopy14CopyhActivity

Copy Activity Source

 

And we’re done. Much simpleness. For those of you who have gone through the pain of creating endless packages in SSIS for simple copy/1:1 data flows, and have previously been creating similar amounts of Copy activities in Data Factory, hopefully this new release will make your ETL/ELT development a much simpler process.

Happy Factory-ing!

Building The Data Warehouse-less Data Warehouse (Part 2 of 2)

Part 1 of this series can be found here.

In my previous post I discussed and explored the feasibility of building a simplified reporting platform in Microsoft Azure that did away with the need for a relational data warehouse. In that post, I proposed that we land, process and present curated datasets (both dimensional files for our data warehouse “layer” and other data assets for our data scientists to work with) within Azure Data Lake Store, with the final step being to produce a series of dimension and fact files to be consumed by our semantic layer. The diagram below highlights this approach:

conceptualmodeldwlessdw

At the end of the previous post I’d produced our data warehouse files (dimensions and facts), and in this second and final part I will show how we consume these files with our semantic layer (Azure Analysis Services) to finally present a business-friendly reporting layer without a data warehouse in sight.

Semantic Layer

A semantic layer presents the business view of the data model, and allows users to interact with the layer without needing knowledge of the underlying schema or even knowledge of writing SQL code. In the words of my colleague Christian Wade, it’s “clicky-clicky draggy droppy” reporting that provides a single version of the truth without risk of users creating inaccurate measures by a misplaced join or incorrect Group By statement.

Microsoft’s semantic layer offering is Azure Analysis Services, which allows users to connect to models built with Analysis Services using any compliant tool, such as Power BI and Tableau.

I create an Azure Analysis Services project in Visual Studio, and connect to my Azure Data Lake Store from Part 1 (Ensure you change your model to use the latest SQL Server 2017/Azure Analysis Services Compatibility Level):

aasdatalakestore

Azure Analysis Services Get Data

In the Query Editor I create queries that pull in the csv files that I created earlier for DimProduct, DimCustomer and FactSales:

aasimport

Query Editor

Note, whilst it’s relatively straight forward to import csv files into Azure Analysis Services from Data Lake Store, my colleague Kay Unkroth wrote a great article that makes this much easier, and I use this method in my solution. Please see this article for further details.

Once the tables have been imported into Azure Analysis Services, it’s then a simple feat to define our star schema and create a couple of measures:

simpleschema

Simple Star Schema (No Date Dimension Yet!)

We then publish our Analysis Services model to the Azure Analysis Services Server we created in part 1, and connect to it using Power BI:

powerbiexample

Power BI Example

That’s it, all done!

Not quite…

Refresh and Orchestration

So we’ve shown now that you can ingest, process and serve data as dimensional constructs using Databricks, Data Lake Store and Analysis Services. However this isn’t at all useful if the pattern can’t be repeated on a schedule. From Part 1, we use Azure Data Factory to copy data from our sources and also to call our Databricks notebook that does the bulk of the processing. With our Analysis Services model now published, we simply need to extend our Data Factory pipeline to automate processing the model.

Logic Apps

There are a few methods out there for refreshing an Azure Analysis Services cube, including this one here. However I particularly like the use of Azure Logic Apps for a code-lite approach to orchestration. Using Logic Apps I can call the Azure Analysis Services API on demand to process the model (refresh it with the latest data from our data store). The Logic App presents a URI that I can then POST to, which triggers the processing.

Jorg Klein did an excellent post on this subject here, and it’s his method I use in the following example:

logicapps

Logic App Example
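Before wiring it into Data Factory, it’s worth checking that the trigger fires by simply issuing a POST to the Logic App’s URI. A rough Python sketch of that test is below; the URI is hypothetical and should be replaced with the HTTP POST URL copied from your own Logic App trigger.

import requests

# Hypothetical trigger URL: copy the real one from your Logic App's HTTP trigger
logic_app_uri = "https://prod-00.northeurope.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke?<signature>"

# An empty JSON body is fine if the trigger doesn't define a request schema
response = requests.post(logic_app_uri, json={})
print(response.status_code)  # 200 or 202 indicates the Logic App fired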

Once you’ve verified that the Logic App can call the Azure Analysis Services refresh API successfully, you simply need to embed it into the Data Factory workflow. This is simply a matter of using the Data Factory “Web” activity to call the URI obtained from the Logic App you created above:

logicappPost

Logic App Post URL

Our final (simplified for this blog post) Data Factory looks like this, with the Web Activity highlighted.

datafactorywithweb

A simple test of the Data Factory pipeline verifies that all is working.

Conclusion

So, there you have it. My aim in this post was to see if we could create a simplified data warehouse-like approach that did away with a relational data warehouse platform yet still provided the ability to serve the various workloads of a modern data platform. By keeping the data all in one location (our data lake), we minimize the amount of data movement, thus simplifying many aspects, including governance and architecture complexity.

In terms of how we did it:

  1. Ingested data from source systems using Azure Data Factory, landing these as CSV files in Azure Data Lake Store
  2. Azure Databricks was then used to process the data and create our dimensional model, writing back the data files into Azure Data Lake Store
  3. Azure Analysis Services ingested the dimensional files into its in-memory engine, presenting a user friendly view that can be consumed by BI tools
  4. Refresh of the Analysis Services model was achieved using Azure Logic Apps, with this component being added to our data pipeline in Azure Data Factory

Is This A Viable Approach?

Simply put, I believe the approach can work, however I think it is definitely dependent on specific scenarios. You can’t, or at least, not very easily, create “traditional” data warehouse elements such as Slowly Changing Dimensions in this approach. The example proposed in these articles is a simple star schema model, with a “rebuilt-every-load” approach being taken as our data sets are very small. For large, enterprise scale data warehouse solutions you need to work in different ways with Data Lake Store than we would do with a traditional data warehouse pipeline. There are many other factors to discuss that would affect your decision but these are out of scope for this particular article.

So, can we build a datawarehouse-less data warehouse?

Yes we can.

Should we build them this way?

It depends, and it’s definitely not for everyone. But the joy of cloud is you can try things out quickly and see if they work. If they don’t, tear it down and build it a different way. One definite benefit of this particular solution is that it allows you to get started quickly for an alpha or POC. Sure you might need a proper RDBMS data warehouse further down the line, but to keep things simple get the solution up and running using an approach such as suggested in this article and “back fill” in with a more robust pipeline once you’ve got your transformation code nailed down.

Happy building.

Further Reading

Azure Analysis Services With Azure Data Lake Store

Process Azure Analysis Services Using Logic Apps

Operationalize Databricks Notebooks Using Azure Data Factory

Building The Data Warehouse-less Data Warehouse (Part 1 of 2)

*Update – part 2 of this series is now live, and can be found here*

In times of yore, those who maketh the houses of data would bring forth vast swathes of tables and hurl them down in the forts of staging. Here the ancient priests of Ee, Tee and El would perform arcane magicks and transform these rows of chaos into purest order. This order would be rebuilt into the fabled Data Warehouses, and all who looked upon them would gasp in awe and wonder.

But that was then. Today’s data world is different, isn’t it? Data is varied. It’s big and small, fast and slow. It’s tougher to wrangle and make into shapes fit for use. But through all of this, we still need the hallowed, fabled, data warehouse.

Or do we?

This post is basically me exploring a different approach to building a data warehouse that’s based on the data lake paradigm and a form of ELT (Extract, Load and Transform), but leaves out the actual data warehouse part. I’m still going to build a star schema, but it’s going to be file based, using modern data engineering tools to do the munging and “schema-tizing” before sucking into a semantic layer for reporting. It’s also me exploring the capabilities of my current favourite tool – Azure Databricks.

What are the potential benefits of this approach? A few spring to mind, such as cost, flexibility and simplicity. Keeping all the processing within the data lake means it’s easier to control and govern, and the reduced data movement (you’re not copying into a data warehouse) makes for an altogether simpler structure.

Conversely, I’m well aware that this approach brings its own challenges. Traditional data warehouse constructs like Slowly Changing Dimensions, Surrogate Keys and other elements of the Kimball checklist will be harder or even not possible with this, so it won’t suit every scenario.

My aim here though was simple – can we build a BI solution without the data warehouse element and is it a viable approach for certain scenarios?

In short, the solution looks like this:

conceptualmodeldwlessdw

File-based Data Warehouse Conceptual Model

The above construct largely follows the traditional ETL model with data flowing in from source systems, and comprises the following:

  1. Data ingested as raw files into a staging zone in the data lake
  2. Files processed using data engineering platform into cleansed outputs
  3. Scrubbed files are then shaped and moved into the relevant serving zones. In this example I’ve kept it simple, with one zone for the star schema files and one for a data assets folder that provides cleansed, curated data sets to be consumed by analysts and data scientists.
  4. The star schema files are subsequently loaded into a semantic layer for enterprise reporting and ad hoc slice n’ dice functionality, whilst the asset files are consumed using the variety of tools preferred by today’s analysts.

To bring this to life, I’ll be using a handful of Azure data services for the relevant solution components –

solutioncomps

Solution Components

In this example, I’ll be using an Azure SQL Database as my data source with the AdventureworksLT database.

Prerequisites

For each of the above, ensure you put everything in the same Azure region.

Extract

There are plenty of options in this space that could move data from source to my lake, including ETL tools such as SSIS, Talend and Azure Data Factory. For this example I’m using Azure Data Factory (version 2), with copy activities moving data from my source SQL database and dropping as *.csv files. I’m also taking advantage of the new Databricks functionality built into Azure Data Factory that allows me to call a Databricks Notebook as part of the data pipeline.

datafactory

Data Factory Copy Activity

The above data factory copies the source tables into my “raw” directory, and from there I process the files, with the end result being to create my dimension and fact files ready to be loaded into Azure Analysis Services.

files

Load Files Into Spark DataFrames

With these files safely copied into our raw landing zone, I can now extract the source tables directly into Spark DataFrames. An example is shown below for the product table. We create a DataFrame for each source file.

dataframesrpoduct

Example DataFrame Creation
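As the screenshots don’t reproduce well here, the DataFrame creation is roughly along the lines of the sketch below. The mount point and file names are assumptions based on my environment, and spark is the session that Databricks provides in every notebook.

# Load the raw CSV extracts into Spark DataFrames (paths are assumptions)
product_df = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv("/mnt/adls/raw/Product.csv"))

customer_df = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("/mnt/adls/raw/Customer.csv"))

salesorderdetail_df = (spark.read
                       .option("header", "true")
                       .option("inferSchema", "true")
                       .csv("/mnt/adls/raw/SalesOrderDetail.csv"))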

Transform

With the DataFrames created, I then create temporary SQL tables from them so that we can use SQL code to define my dimensions. You can of course manipulate them natively as DataFrames using Python/Scala/R if you’re more familiar with those languages. I’m a SQL dude, and am familiar with building data warehouse routines in SQL code, so have opted for that method here.

createview

Create Temporary Views From DataFrames
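The view creation itself is a one-liner per DataFrame, something like the following sketch, reusing the DataFrames created above.

# Register the DataFrames as temporary views so they can be queried with SQL
product_df.createOrReplaceTempView("product")
customer_df.createOrReplaceTempView("customer")
salesorderdetail_df.createOrReplaceTempView("salesorderdetail")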

With these views created we can use good old SQL to create DataFrames that reflect the dimensions and facts we want to load into Azure Analysis Services.

exampledim

Dimension Example
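As a flavour of what that looks like, a DimProduct-style query might be along the lines of the sketch below; the column choices are illustrative rather than a copy of my actual notebook.

# Illustrative DimProduct built with Spark SQL over the temporary views
dim_product_df = spark.sql("""
    SELECT  p.ProductID,
            p.Name AS ProductName,
            p.ProductNumber,
            p.Color,
            p.ListPrice
    FROM    product p
""")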

and fact tables:

examplefact

Fact Example
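Again, purely as an illustration, a FactSales-style query might look like this; in practice you would probably also join to the order header to pick up keys such as CustomerID.

# Illustrative FactSales built from the sales order detail view
fact_sales_df = spark.sql("""
    SELECT  sod.SalesOrderID,
            sod.ProductID,
            sod.OrderQty,
            sod.UnitPrice,
            sod.LineTotal
    FROM    salesorderdetail sod
""")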

When I run my notebook now I have my dimensional and fact tables created as Spark dataframes. In order for them to be consumed by Azure Analysis Services I need to write the dataframes into CSV files that can then be imported into my tabular model.

Output Dimension and Fact Files

Writing files back from Spark into csv wasn’t as seamless as I thought it would be. Whilst the commands are fairly straightforward, as Spark is a distributed system it writes multiple files as an output, including crc and SUCCESS metadata files. It also doesn’t name them as the file you specify, but instead names them based on the partition name.

We don’t need these files, and need a proper, consistent filename, so I wrote some extra code to rename and move the files back under our Data Warehouse directory.

writetofile

Writing Out To CSV
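For reference, the write-and-rename step is along the lines of the sketch below. The paths are assumptions, coalesce(1) forces a single part file, and dbutils is the file system utility available in Databricks notebooks.

# Write the dimension out as a single CSV part file under a temporary folder
output_tmp = "/mnt/adls/tmp/DimProduct"
target_file = "/mnt/adls/datawarehouse/DimProduct.csv"

(dim_product_df
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .csv(output_tmp))

# Find the part-*.csv file Spark produced, move it to a predictable filename,
# then remove the temporary folder along with its _SUCCESS and .crc files
part_file = [f.path for f in dbutils.fs.ls(output_tmp) if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, target_file)
dbutils.fs.rm(output_tmp, True)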

With all this done, it leaves us nicely with a set of files ready for ingestion into Azure Analysis Services, which is the subject of Part 2 of this series.

outputfiles

Outputted Dimension And Fact Files

Coming in the second and final part of this series…

  • Build an Azure Analysis Services model directly off the dimension and fact files within the Data Lake Store
  • String it all together using Azure Data Factory

Part 2 is now live, and can be found here….

Introducing The Azure Data Architecture Guide

Designing data-centric solutions in the cloud can be a challenge. Not only do we have the standard challenge of meeting customer requirements, but we also have to deal with an ever evolving data landscape of tools, paradigms and platforms. In the good old days of data warehousing for example, we had relational stores, often coupled with flat files/rogue Access databases etc, which were then fed nicely into another database using an ETL tool and then made reporting ready, often following one of the prevalent design methodologies for these solutions (Kimball, Inmon, Data Vault etc).

Fast forward to now (2018), and whilst the data warehouse is far from dead, the modern data platform is a much evolved beast featuring fast data, slow data, big data, small data, cold data, funny-shaped data, not-funny – you get the picture. All of it needs to be processed in a way that can derive actionable insights and intelligence for us poor humans.

To help with this, Microsoft have released the Azure Data Architecture Guide, and it provides, to quote directly –

…a structured approach for designing data-centric solutions on Microsoft Azure. It is based on proven practices derived from customer engagements.

The guide’s main focus is the difference between relational and non-relational data, the design paradigms of both and the best practices, challenges and processes for designing these kinds of solutions.

It is by no means a deep dive into the nuances of each component, but it gives good, broad content on this subject, allowing us to understand and appreciate the core concepts before diving into specific focus areas as required.

Recommended Reading.

Find it here – https://docs.microsoft.com/en-us/azure/architecture/data-guide/

And a PDF can be downloaded here.

 

Exam Tips For Getting Your MCSA (BI Reporting)

To get your MCSA (BI Reporting) you need to take 2 exams – Analyzing and Visualizing Data With Power BI (70-778) and Analyzing and Visualizing Data with Excel (70-779). I recently took both exams at Microsoft Ready, and thought, with them fresh in mind, it was a good opportunity to “pass it on” and provide some tips and advice on how to achieve the certification. I wasn’t actually planning to take 70-779, but in a prep session we were informed that 70-779 is very similar to 70-778, so I decided to take both.

It was good advice, and I managed to pass both in the same day, achieving my MCSA (BI Reporting) in the process.

mcsa-bi

MCSA (BI Reporting)

With this in mind, I’ve added a couple of study recommendations plus some stand out items from both exams that I’d recommend you double down on when preparing.

Study Recommendation 1 – Take the Analyzing and Visualizing Data with Power BI on Edx. It’s free to take the course (or audit it, as it’s referred to on the site), and it provides an excellent foundation in all the skills tested in 70-778.

Study Recommendation 2 – Take the Analyzing and Visualizing Data with Excel on Edx. This is basically the sister course of the Power BI one, and it provides an excellent foundation in all the skills tested in 70-779.

Study Recommendation 3 – Go through the Power BI Guided Learning. Like the above course, this provides an excellent Zero to Hero progression in Power BI, developing the hands on skills that are essential to knowledge retention.

After completing the above, below are some of my specific recommendations based on what I encountered in the exams:

  1. Know your DAX. Especially creating measures, calculated columns and calculated tables. Know the difference between CALCULATE, SUMX and similar operators. Here is a good place to start.
  2. Know your M, specifically around manipulating string values in columns. Check out the M language reference and play around with a few commands to get the feel for it.
  3. Know the difference between Import to Data Model and Import to Worksheet (70-779).
  4. Relationships in the Data Model – Know when you would use Bidirectional Filters, the difference between 1-1 and 1-many joins and specifically the impact of how the filter direction affects the visuals produced. There’s a great whitepaper on this subject here.
  5. Refreshing Power BI models in the Power BI Service – Understand what you can refresh from Onedrive, the difference between DirectQuery and Import modes.
  6. Pivot Tables and Pivot Charts in Excel. Understand the structure (Rows vs Columns vs Values). (70-779)
  7. Slicers and Filters. Swot up on Report Interactions and understand how to turn this off and on to achieve the desired effect. Go through this tutorial.
  8. Know the main differences between Power BI Pro, Power BI Free and Power BI Premium. Know what each licencing option gives you, and what it doesn’t. Start here.
  9. Sharing in Power BI – know how to do it, know the differences between the different sharing options. Read this for a good overview.
  10. Row Level Security – Understand the main concepts of Roles and Rules and how to implement it via the data model.
  11. Data Model sizing and performance best practices. MVP Avi Singh provides a great video on this here.
  12. Understand the kind of data sources supported in Power BI. I’d suggest playing through connecting to and manipulating the following sources to get a feel for the steps you undertake – SQL table, JSON file and Excel workbook.

These are the items I encountered when I took 70-778 and 70-779. There will obviously be some variation in what you will encounter, but hopefully this provides some good pointers on where to focus. Hands down though, the best preparation is going through both of the MOOCs identified at the beginning. They cover massive chunks of the material tested in the exam so I 100% recommend you work through these as a minimum.

Good luck!

*Addendum* Shout out to my colleague Dustin Ryan who’s produced a similar post for 70-778. I saw this posted on LinkedIn recently when this one was nearly done, so figured I’d still release it. Read both. They can only help you in the long run.

https://sqldusty.com/2017/08/23/preparing-for-microsoft-certification-exam-70-778-analyzing-and-visualizing-data-with-microsoft-power-bi/

 

SQL Server Options In Azure

The data platform options in Azure are vast, and they grow and change with each month. Choosing the right database platform for your workloads can be a challenge in itself. This post aims to give some clarity as to the options available in the SQL Server space.

Before diving in, it’s worth clarifying the underlying platform options to be discussed. SQL Server in Azure comes in both Infrastructure As a Service (IaaS) and Platform As a Service (PaaS) flavors, with each providing different benefits depending on your specific needs. This is best summarized in the graphic below:

sqliaasoptions

SQL Server IaaS & PaaS

Infrastructure As A Service Options

SQL Server In Azure VM

As simple as it sounds, you can run SQL Server “as is”, running the full engine on a Windows or  Linux virtual machine, and also in a Docker container. This option is exactly the same as the one you’d run on premise, with the full SQL Server feature set, including High Availability, Security and various other features depending on the edition you choose.

If you need full control over/access to the OS, need to run apps or agents alongside the database, and basically want to manage all aspects of your solution, then SQL Server IaaS is the right solution.

This SQL Server option can be accessed from the Azure marketplace, with many editions available depending on your needs.

Billing in this model comes in two flavors too, Bring Your Own Licence (BYOL) where you provide your own SQL licence and just pay for the underlying compute/storage, or Pay As You Go where you pay the SQL licence per minute the machine is running for.

Platform as a Service (PaaS) Options

Along with the full control of SQL Server IaaS, you also have several PaaS options. These provide much better total cost of ownership (TCO) as much of the daily administration is handled by the Azure platform, so things like backups, high availability, performance tuning and monitoring are all abstracted away, allowing you to just focus on building/using your application.

 

sqlpassoptions

SQL PaaS Options

SQL Database

The first of the pure PaaS offerings, (sometimes called Database As a Service (DBaaS)), SQL Database offers the power of the SQL Server engine, but without the management overhead that comes with maintaining a full SQL Server instance.

As a DBaaS, SQL Database brings with it many features, including dynamic provisioning and resizing, built in High Availability, automatic backups, point-in-time restore and active geo-replication. As Microsoft assumes much of the daily maintenance work, this frees you up to realize cost or operational benefits that you wouldn’t have experienced with your on-premise or hosted solution.

Unlike the IaaS option, where you choose the virtual machine that your instance will reside on (with associated cpus, ram etc), SQL Database instead comes in different tiers, from Basic through to Premium. Rather than choosing specific hardware items like RAM, SQL Database tiers are measured in Database Throughput Units (DTUs), with the underlying specifications abstracted away.

Read here for more details on how to choose the best tier for your application.

There is a small subset of features not available or not applicable in SQL Database compared to SQL Server. Please check here and here.

SQL Database Managed Instances (Preview)

Designed to minimize the challenges of migrating applications to a SQL Database environment without having to perform application rewrites, SQL Managed Instance is an extension to SQL Database that offers the full SQL Server programming surface and includes several native SQL Server features:

  • Native Backup and Restore
  • Cross Database queries and transactions
  • Security features – Transparent Data Encryption, SQL Audit, Always Encrypted and Dynamic Data Masking
  • SQL Agent, DBMail and Alerts
  • Change Data Capture, Service Broker, Transactional Replication and CLR
  • DMVs, XEvents and Query Store

On top of this, SQL Managed Instance offers full security and isolation, with SQL Managed Instance being behind their own virtual network (vnet) within Azure (this is now also available for SQL Database).

Currently in private preview, with public preview due shortly, SQL Managed Instance is a great way to get started moving data to the cloud that combines the benefits of both PaaS and IaaS SQL Server models without having to make changes to the affected application. Microsoft have also created the Azure Data Migration Service to make the migration as seamless as possible.

SQL Database Managed Instance also provides an added incentive to move to the cloud with the Azure Hybrid Cloud Benefit for SQL Server. This allows you to move your on-premise SQL Servers to Azure, and only pay for the compute and storage. See here for more details.

SQL Database Managed Instance is going to be a real game changer in this space in my opinion. My colleague James Serra has created an excellent deck that goes into more details here.

SQL Database Elastic Pools

Whilst not technically a different type of SQL Database offering, Elastic Pools provides the ability to manage multiple SQL Databases that have a variable and unpredictable workload. SQL Databases in an Elastic Pool are allocated elastic Database Throughput Units (eDTUs) that dynamically scale the databases within the pool to meet the required performance demands.

Elastic Pools are ideal for multi-tenancy environments where the workload can’t be predicted, but you don’t want to have to over provision for those “just in case” moments.

SQL Data Warehouse

SQL Data Warehouse, whilst “relational” in that it has tables with relations between them, is a different concept to the options above. SQL Data Warehouse is what is known as an MPP (Massively Parallel Processing) solution, designed for heavy duty data processing and querying at scale, the likes you’d see in a data warehouse or similar analytics solution.

SQL Data Warehouse could easily be the subject of a full article all on its own. This platform provides a fully scalable analytics engine where the compute and storage are scaled independently, meaning you’re not fixed into certain configurations that are over specced, thus providing greater TCO benefits. With the underlying Polybase engine supporting it, SQL Data Warehouse is a must-have component in many modern data warehouse architectures within Azure.

Read here for more information about SQL Data Warehouse use cases and design patterns.

Honorable Mentions

This post is specifically around SQL Server options within Azure, however I wouldn’t be doing justice to the platform if I didn’t mention the other database options available, which will be covered in future posts:

  • Cosmos DB – A globally distributed, multi-model database service that supports SQL, Graph, Cassandra, MongoDB and Table APIs.
  • Azure Database for MySQL – Exactly as it says on the tin. This is another PaaS offering that provisions a MySQL instance.
  • Azure Database for PostgreSQL – As above, this PaaS offering provides PostgreSQL functionality, all managed by the Azure platform.

Which Should I Choose?

There are many factors to consider when choosing a relational database service in Azure, including cost, control, workload type, governance requirements and many more. As a general rule, if you need to have full control over your environment, with a full feature set, then SQL Server 2017 as a virtual machine or container is the way to go. For applications born in the cloud that just need the database engine and don’t want or need the hassle of maintaining the database, then SQL Database or SQL Database Managed Instance are excellent options. Again, if you need that elasticity that comes with hosting a multi tenant or highly variable environment, then SQL Database Elastic Pools is the option for you.

The real game changer now is SQL Managed Instances. Offering a full SQL Server feature set, but with all the benefits of PaaS, this option is great for those looking for a seamless move from on-premise to the cloud.

Further Reading

SQL Database vs SQL Server

SQL Data Warehouse vs SQL Database

SQL Managed Instances

Getting Started with SQL Operations Studio

I’m a fan of new shiny things, and as a former SQL Server DBA the release of SQL Operations Studio definitely piqued my interest.

So, in the words of the Cat from Red Dwarf, “So what is it?”

(It’s not a White Hole…just in case you were wondering…)

The best description comes straight from the product team themselves, see here:

SQL Operations Studio (preview) offers a modern, keyboard-focused T-SQL coding experience that makes your everyday tasks easier with built-in features, such as multiple tab windows, a rich T-SQL editor, IntelliSense, keyword completion, code snippets, code navigation, and source control integration (Git). Run on-demand T-SQL queries, view and save results as text, JSON, or Excel. Edit data, organize your favorite database connections, and browse database objects in a familiar object browsing experience.

So, it’s a lightweight, slimmed down, better looking cousin of SQL Server Management Studio. In fact, my first thoughts when I saw this made me think of the approach that’s been taken with Visual Studio Code, wherein we’ve stripped out all the “bloat” of Visual Studio to focus primarily on the code side of things, making it a much faster, simpler experience.

It’s still in preview, but is already looking really slick. I’ve had a play around with it on a project I’m working on right now and here’re some stand out thoughts:

It Looks Really Nice

Server Dashboard

Server Dashboard

Like Visual Studio Code it’s got a fresh, clean feel about it. You can quickly dive into deeper content (similar to SQL Server Management Studio) and also customize the look and feel to suit your personal tastes.

Intellisense Works (I used to have so much trouble with this – thank the heavens for Red Gate)

intellisense

Intellisense Example

Intellisense works out of the box, as should be expected.

You Can Create Your Own Insight Widgets and Dashboards

addedwidget

Database Dashboard Showing Slow Query Widget

You can create your own custom widgets, and there are also some out-of-the-box ones already available. The widget above shows the Top Worst Performing Queries, and is already available to use. See here for information on setting up the widget.

Which can be Further Customized:

sqlopswithdatabasetasks

Database Dashboard

In the above example, I’ve customized the Database Dashboard for my TutorialDB to show both the Tasks widget and Slow Queries widget. This is based on the tutorial here.

Source Control Integration

A must have is source control integration, and the preview offering of SQL Operations Studio supports Git for version/source control.

GitIntegration

Git Integration

See here for more information on setting up with Git.

Integrated Terminal

And finally, just like Visual Studio Code, you have support for Integrated Terminal, so you can interact directly with Powershell, Command Prompt and GitBash (at the time of writing). A lot of my Azure work involves interacting with Powershell so this is a really useful feature, plus the ability to work natively with SQLCMD, bcp etc is great.

In the example below I’ve also customized the theme as I prefer the darker themes in VS to the light ones.

TerminalSQLOps

Integrated Terminal Set For Powershell

Other Stuff

I’ve only lightly touched on the features available at Preview. As well as those above, the current version includes, amongst others:

  • Code Snippets
  • Save Results as Text, JSON or Excel
  • Server Groups
  • Edit Data

 

Summary

SQL Operations Studio is in public preview (my version is 0.23) still, so it’s not fair to compare it directly to SQL Server Management Studio (SSMS) at this point. I think even when generally available, it should still be considered a separate beast to SSMS, as Visual Studio is considered different to Visual Studio Code.

As it stands right now, it’s very clean and slick, and has some great functionality already. I can see definite use cases for this where developers and DBAs want to work in a lightweight code-focused editor that provides all the key functionality that’s required to work with and administer SQL instances. Time will tell if it’s sufficient to pull people away from the tried and trusted SSMS, and I’ll look to do a follow up post when we’re at GA.

Further Reading

SQL Operations Studio (Preview)

A Lap Around SQL Operations Studio – Channel 9

So What’s SQL Operations Studio?

SQL Operations Studio FAQs

Bugs/Suggestions

Incremental ETL Processing With Azure Data Factory v2

Data Factory V2 was announced at Ignite 2017 and brought with it a host of new capabilities:

  • Lift your SSIS workloads into Data Factory and run them using the new Integration Runtime (IR)
  • Ability to schedule  Data Factory using wall-clock timers or on-demand via event generation
  • Introducing the first proper separation of Control Flow and Data Flow to allow more complex orchestrations such as looping,  branching and other conditional flows
  • Incremental Loads using the new Lookup Activity

And it’s this last item that today’s article is about.

It’s fair to say that in its initial incarnation, Data Factory didn’t allow for more traditional ETL workloads without some complex coding (more than you were used to if you came from the world of SSIS and similar ETL tools). For the big data focused ELT workloads where data is moved between data services (SQL Server, Blob Storage, HDInsight and so forth) and activities applied whilst the data is in place (SQL queries, Hive, USQL, Spark) Data Factory V1 really excelled, but for those who wanted to move their traditional ETL delta extracts to Data Factory, it wasn’t quite there.

In this new public preview the product team have taken great steps in remedying this, allowing ETL developers to implement proper incremental loading patterns with their relational data warehouses.  It’s also worth knowing I’ve seen where Data Factory is going, and it’s looking amazing…

The following example is based on the official tutorial here. The differences in this example are based on the scenario where you wish to perform incremental extracts from a source database to a staging area inside another database. This example uses Azure SQL Database as both the source and sink, but can be adapted for other data sources.

The solution files used in this example can be found here

Prerequisites

This example assumes you have previous experience with Data Factory, and doesn’t spend time explaining core concepts. For an overview of Data Factory concepts, please see here.

So for today, we need the following prerequisites:

  • An Azure Subscription
  • An Azure SQL Database instance setup using the AdventureWorksLT sample database

That’s it!

Incremental Loading in Data Factory v2

The diagram demonstrates a high level overview of the process, and should be familiar to those who have built similar data flows with other ETL tools such as SSIS:

advv2incloadingoverview

Incremental Loading Overview

Process

There are four main steps to this process to create an end to end incremental extract.

  1. Identify the trigger column. In the Data Factory documentation this is referred to as the Watermark column so I’ll use this latter term for consistency going forwards. This column is one that can be used to filter new or updated records for each run. This column is normally an automatically updating datetime column (e.g. Modified_Date_Time) or an ever increasing integer. We use the maximum value of this column as the watermark.
  2. Create a data store to hold the watermark value. In this example we store the watermark value in our SQL Database.
  3. Create a pipeline that does the following:
    • Creates two Lookup Activities. The first looks up the last watermark value and the second retrieves the current watermark value.
    • Creates a Copy activity that copies rows from the source where the watermark column value is greater than the old watermark value (and not beyond the new one). These delta rows are then written to our target table (see the sketch after this list for the equivalent logic written out in plain code).
  4. Create a stored procedure that updates the watermark value, ready for the next run.
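To make the pattern explicit, here’s a plain-Python sketch of the same logic run end to end against the sample objects used below; this is not how Data Factory implements it internally, just the equivalent steps written out. The connection string is a placeholder, and the table, column and stored procedure names match the objects created later in this article.

import pyodbc

# Placeholder connection string: point it at your Azure SQL Database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-server>.database.windows.net;DATABASE=<your-db>;"
    "UID=<user>;PWD=<password>"
)
cur = conn.cursor()

# 1. Look up the old watermark and the current (new) watermark
old_wm = cur.execute(
    "SELECT WatermarkValue FROM watermarktable WHERE TableName = ?",
    "SalesLT.SalesOrderDetail").fetchval()
new_wm = cur.execute(
    "SELECT MAX(ModifiedDate) FROM SalesLT.SalesOrderDetail").fetchval()

# 2. Copy only the delta rows (changed since the old watermark) into staging
cur.execute("""
    INSERT INTO SalesLT.SalesOrderDetail_Staging
        (SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice,
         UnitPriceDiscount, LineTotal, rowguid, ModifiedDate)
    SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice,
           UnitPriceDiscount, LineTotal, rowguid, ModifiedDate
    FROM   SalesLT.SalesOrderDetail
    WHERE  ModifiedDate > ? AND ModifiedDate <= ?""", old_wm, new_wm)

# 3. Move the watermark forward, ready for the next run
cur.execute("EXEC sp_write_watermark @LastModifiedtime = ?, @TableName = ?",
            new_wm, "SalesLT.SalesOrderDetail")
conn.commit()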

Building the Solution

Create a Staging Table

In this example, we’ll use the SalesOrderDetail table in the AdventureWorksLT database:

salesorderdetail

SalesOrderDetail

And we will use the ModifiedDate as the watermark column.

We therefore need to create a target or “staging” table to load our delta rows into.

Run the following SQL in your created database to create a staging version of the SalesOrderDetails table:

CREATE TABLE [SalesLT].[SalesOrderDetail_Staging]
(
[SalesOrderID] [int] NOT NULL,
[SalesOrderDetailID] [int] NOT NULL,
[OrderQty] [smallint] NOT NULL,
[ProductID] [int] NOT NULL,
[UnitPrice] [money] NOT NULL,
[UnitPriceDiscount] [money] NOT NULL,
[LineTotal] [money] NULL,
[rowguid] [uniqueidentifier] NOT NULL,
[ModifiedDate] [datetime] NOT NULL,
[ExtractDate] [datetime] DEFAULT GETDATE()
)

Create a table for our watermark values

Run the following command in your SQL Database to create a table named watermarktable that will store our watermark value:

create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);

Seed the table with the name of our source table and an initial watermark value. In this example, the source table is SalesLT.SalesOrderDetail.

Insert into watermarktable
values ('SalesLT.SalesOrderDetail', '1-jan-2017')

Create a stored procedure in our SQL Database:

This stored procedure is used to update the watermark table when new rows have been extracted.

CREATE PROCEDURE sp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS

BEGIN

UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName

END

Create Some Trigger Data

Run the following command on your SQL database to update a couple of rows in the source table. This is to validate the process is working correctly, as it marks the rows as modified with today’s date, which is greater than the value in our watermark table. If the process is successful it will extract these changed rows.

update [SalesLT].[SalesOrderDetail]
set modifieddate = getdate()
where salesorderid = 71774

Create a Data Factory

Create a new Data Factory. For ease, do it via the portal following this guide.

(Ensure you create it using ADFv2):

adfv2

Creating a Data Factory via the Azure Portal

Create your Data Factory Artifacts

You will need to create the following (I’ve included my own samples in the link at the beginning of this article):

  • AzureSQLLinkedService
  • A source SQL dataset (for our source table)
  • A target SQL dataset (for our destination table)
  • A watermark dataset (that stores our watermark value entries)
  • A pipeline that coordinates the activities

Modify the files to suit your needs and setup for your database server and database.

I won’t show the different JSON here, but just to highlight, the key area is the new Lookup Activity shown below:

pipelinelookupactivity

 

Create a folder on your root C Drive to hold the created JSON files

I created a folder called “adf” and put all my JSON files there:

adffolder

Deploy Data Factory Artifacts

For the current preview, you cannot deploy using the GUI, so you need to deploy programmatically. For this example I’m using Powershell. Note you will need to update your Powershell install with the latest cmdlets by running the following command:

Install-Module AzureRM

After updating the AzureRM cmdlets, run the following Powershell script below (attached at the start). This will deploy the necessary artifacts to your Data Factory.

powershellex2

Run the pipeline

Once deployed, you can run the deployed pipeline via Powershell with the following command:

$dataFactoryName = "[Your DATA FACTORY NAME]"
$resourceGroupName = "[Your RESOURCE GROUP NAME]"

$RunId = Invoke-AzureRmDataFactoryV2Pipeline -PipelineName "mainPipeline" -ResourceGroup $resourceGroupName -dataFactoryName $dataFactoryName

This fires off the pipeline. You can monitor its progress using Monitor & Manage:

minotoradf

Monitor Data Factory

And this allows you to see the status of the running pipelines:

pipelineruns

With the pipeline processed successfully, I’ll check my database tables to confirm everything has updated correctly:

stagingdemo

Delta Rows Extracted

Bingo!

And the watermark table has been updated successfully:

watermarkupdate

Updated Watermark Table

This completes this article covering a simple ETL incremental extraction using the new Data Factory. The roadmap for this product is very exciting, and I encourage you to check out the links below for further reading:

Introduction to Data Factory v2

Data Factory Ignite 2017