Extract, Transform & Load Options In Azure

Extract, Transform & Load (ETL), or ELT, ETLT or whatever term we use nowadays, has been around for many decades. In terms of “what it is” – although I’m sure many database professionals will be familiar with the concept – I will defer to Wikipedia for a formal definition –

In computing, extract, transform, load (ETL) is a process in database usage to prepare data for analysis, especially in data warehousing[1]. The ETL process became a popular concept in the 1970s.[2] Data extraction involves extracting data from homogeneous or heterogeneous sources, while data transformation processes data by transforming them into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database such as an operational data store, a data mart, or a data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions

Courtesy Of – https://en.wikipedia.org/wiki/Extract,_transform,_load

Essentially, we take some data stuff, pull it out from its home, mash it into a nicer shape, then serve it to welcoming consumers. See below for a simple example I’m sure we’ve all seen before –

So, that’ll do for a definition, but for this post I’m not going into the deep details of how ETL processes are built; instead I’m going to provide some details of the available options in Azure, and when/why you would use them. There are many choices right now, so it felt like a good time to summarize the offerings.

SSIS

SSIS (SQL Server Integration Services) has been around for many, many years. It’s very mature, with lots of supporting books, blogs and articles on how to set it up and develop with it. Part of the SQL Server product stack, it provides connectors to many different sources and contains many different transformation tasks that can handle pretty much any kind of traditional ETL workflow. Whilst showing its age a little in terms of the UI, recent additions such as the Azure Feature Pack mean it can be used for large-scale ETL scenarios into Azure, and an extensive community backing has led to the emergence of automated ETL build tools for SSIS such as BIML.

For on-premises ETL workloads, SSIS is still the go-to tool of choice for those invested in the Microsoft BI Stack. It’s fully integrated into SQL Server Data Tools and is well equipped to handle most workloads.

For those people moving to Azure who have sunk a lot of development time into SSIS, have no fear. You can simply lift your SQL Server virtual machine running SSIS into the cloud and keep running as before, but for those who want to leverage more PaaS-based approaches we now offer the facility to publish your SSIS packages directly into Data Factory, and call your packages from Data Factory pipelines in a similar way to the Master-Child package patterns previously used in SSIS.

SSIS Control Flow

So, for those projects that are born in the cloud, can we use SSIS? Yes, sure you can. But if you’re using mainly cloud-born data or a hybrid, or if you need to move data at scale, then you probably need to look at something else, which leads us nicely to…

Data Factory

Now, up until recently, I wouldn’t have called Data Factory an ETL tool. I would instead have called it a Cloud-Based Data Orchestration Tool or an EPL (Extract, Process, Load) tool. The simple reason is that Data Factory excelled (and still does) at moving data in bulk into the cloud and then orchestrating a series of processes underneath. An example would be to copy terabytes of data from your on-premises databases, land it in Azure Blob Storage or Data Lake, then call a processing engine on top to cleanse, transform and curate the data. This would be done using tools such as Azure Data Lake Analytics, HDInsight, Polybase via SQL Data Warehouse or, more recently, Azure Databricks.

This kind of pattern (land -> process -> move on) evolved as part of the big data architecture pattern. It was becoming increasingly difficult to do traditional ETL using tools like SSIS, as the scale, size and shape of the data to be processed meant that you’d be hard pressed to fit it all in memory and process it. What Data Factory allows you to do is copy the data at massive scale into your data lake, and then use processing tools more appropriate to the job to transform the data ready for usage downstream. This pattern is in fact not dissimilar to a common pattern seen in SSIS, wherein developers simply used SSIS as the orchestrator and called a series of SQL statements (often stored procedures) to handle the processing. When I used to build these packages on a weekly basis, this was the pattern I preferred, as I liked to take advantage of the processing power of the underlying database engine. (The difference now is that it’s not just relational tables and SQL code we’re dealing with. It’s Parquet, ORC and Avro combined with SQL and Python, mixed with a healthy dose of JSON, NoSQL, key-value pairs and graph databases, plus a sprinkle of Spark. For those of us who cut our teeth on SQL and “simple” ETL patterns – this isn’t Kansas anymore.)

This pattern with Data Factory is in reality an ELT (Extract, Load & Transform) approach. We extract the data, load it into the data lake, then process it with the relevant tooling. Compare this to traditional ETL, where the data is transformed in-flight, in memory, within the ETL tool. With Data Factory we process the data where it is, using the engine best suited to that particular task.

Data Factory Pipeline
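
To make the land -> process pattern concrete, below is a minimal PySpark sketch of the kind of transformation step Data Factory would orchestrate, for example in a Databricks notebook. The mount point, file paths and column names are purely illustrative placeholders, not a prescribed implementation:

```python
# Illustrative ELT "process it where it is" step, e.g. run from a Databricks notebook
# that Data Factory calls after the copy activity has landed the raw files.
# The /mnt/datalake mount point and the column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-curation").getOrCreate()

raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/mnt/datalake/raw/sales/*.csv"))

curated = (raw
           .dropDuplicates(["OrderId"])                      # a basic data quality rule
           .withColumn("OrderDate", F.to_date("OrderDate"))  # conform the data types
           .withColumn("LoadedAt", F.current_timestamp()))   # add an audit column

# Write the curated data back to the lake in a columnar format for downstream engines
curated.write.mode("overwrite").parquet("/mnt/datalake/curated/sales/")
```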

Now, Data Factory has recently taken a massive step towards being a true ETL tool with the announcement of Azure Data Factory Data Flows. Currently in private preview, this aims to bring true cloud-scale visual ETL into Data Factory by allowing you to build data pipelines without having to write SQL, Python, R, Scala or Java as part of your activities. Instead, Data Flows allows developers to build a data flow using a visual environment, and the actual work is pushed down into an Azure Databricks cluster underneath. This high-level abstraction allows developers to focus on building the actual pipeline without having to code specific routines. This visual approach, combined with using Apache Spark (Databricks) as the processing engine under the hood, essentially means you get the best of both worlds: a visual ETL design experience to simplify development, combined with a processing engine that can deal with the massive scale and variety of data often encountered when building modern data warehouses in the cloud.

Check out the preview video here, with a (not great) screen grab below –

Data Factory Data Flows

Data Flows are shaping up to be an awesome addition to Data Factory and I’m really looking forward to the innovative patterns that emerge when it’s out in the big wide world.

Power BI Data Flows

Recently, we announced (or more specifically re-announced with a new name) Power BI Data Flows (not to be confused with Data Factory Data Flows). For those familiar with Power BI, it’s essentially the Power Query part (a really powerful data preparation and cleansing engine) that’s now been taken out and added to the Power BI Service as a cloud-based, self-serve data prep tool. The aim is to allow BI developers and analysts to build re-usable data flows that are published into Power BI. These flows can then be consumed as a data source by Power BI Desktop (and soon, other tools). It also – and this is big – allows the output of the Data Flow to be written back into Azure Data Lake Storage (Gen 2), thus allowing the prepped data to be picked up and used by other tools. This latter functionality essentially democratizes ETL into the hands of the BI developer or analyst. Whereas previously it was often on the head of ETL developers or data engineers to build the pipelines for the analysts to consume, Dataflows now allows them to build these pipelines themselves, essentially giving them a full, end-to-end analytics platform capability.

Essentially, with Power BI Dataflows, you can build the ETL, model the data, transform it into a reporting-friendly structure (think Kimball-style dimensional models) and serve it, either within Power BI directly or to be consumed by other tools.

Another benefit of Power BI Data Flows is that it allows data transformation logic that had previously been locked to a specific Power BI workbook to be taken out and published, allowing it to be re-used by other consumers. So rather than having duplicated logic hidden in Power BI workbooks, you can simply build once and reuse many times.

Power BI Dataflows
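
As a quick, hedged illustration of that write-back point – other tools picking up the dataflow output from the lake – here’s a short Python sketch that reads a CSV entity a dataflow has written to Azure Data Lake Storage Gen2 into a pandas DataFrame. The storage account, file system and folder names are hypothetical, and it assumes a recent pandas plus the adlfs package so that abfss:// paths can be read directly:

```python
# Illustrative only: read a CSV entity written by a Power BI dataflow to ADLS Gen2.
# The storage account, file system and folder layout below are hypothetical placeholders;
# requires pandas with fsspec/adlfs installed so that abfss:// URLs resolve.
import pandas as pd

entity_url = (
    "abfss://powerbi@mydatalakegen2.dfs.core.windows.net/"
    "myworkspace/SalesPrep/DimCustomer.csv"
)

df = pd.read_csv(
    entity_url,
    storage_options={"account_key": "<storage-account-key>"},  # or Azure AD credentials
)

print(df.head())
```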

This sounds awesome, I hear you cry. Why do I need anything else? Why can’t I just use Power BI end to end and be done with it? Let me dump my cumbersome data warehouse and just build the entire thing in Power BI.

Can I do that?

Well, in some cases, yes you can. But in others, I would probably stay away from it.

For team and departmental BI, Power BI Dataflows is a fantastic addition. No longer do you need to invest in a complex ETL and data warehouse build or, conversely, be stuck in manual data manipulation hell using hand-cranked SQL code, Excel VLOOKUPs and other time-consuming and error-prone activities. Dataflows allows you to automate the data processing, centralise it, and then let the team do what adds most value: producing insights.

For complex data integration projects involving multiple teams, agile processes, automated testing and all the other components that comprise such a project, it’s probably not the best fit. There’s no formal source control or DevOps integration to start with, and I’m yet to see how it scales to massive datasets. It’s very early in its life though, so who knows where it’s going to go. It’s a powerful addition to the Power BI armoury, and I can definitely see some pure Power BI-only, self-contained BI solutions being created, with ETL, semantic layer and dashboards all built in the same platform.

3rd Party Offerings

As well as the first party options that Microsoft supply, there are also a plethora of tools available in the Azure Marketplace provided by real big hitters in the ETL space that are fully supported in Azure. This article doesn’t cover those, but for some examples check out Informatica and Datameer.

Summary

As always, there isn’t a black and white answer as to what tooling you should use; it often depends on several different factors, including team size, skills, data source complexity and scale, plus many others. Below I try and summarise where I feel each solution fits best, but over the coming weeks and months, as Power BI Data Flows and the upcoming Data Factory Data Flows are picked up by more and more data developers, it’ll be interesting to see what new patterns emerge.

It’s also fair to say that you could pretty much adapt any of them to suit your needs, depending on preference. The table below shows some of my own opinions of what’s best for what, but feel free to experiment with each and see what works.

ETL Summary

Further Reading

ADF and SSIS Compared

Introducing Power BI Dataflows

Building ETL With Azure Data Factory (Whitepaper)

Building The Data Warehouse-less Data Warehouse (Part 2 of 2)

Part 1 of this series can be found here.

In my previous post I discussed and explored the feasibility of building a simplified reporting platform in Microsoft Azure that did away with the need for a relational data warehouse. In that post, I proposed that we land, process and present curated datasets (both dimensional files for our data warehouse “layer” and other data assets for our data scientists to work with) within Azure Data Lake Store, with the final step being to produce a series of dimension and fact files to be consumed by our semantic layer. The diagram below highlights this approach:

Conceptual Model – The Data Warehouse-less Data Warehouse

At the end of the previous post I’d produced our data warehouse files (dimensions and facts), and in this second and final part I will show how we consume these files with our semantic layer (Azure Analysis Services) to finally present a business-friendly reporting layer without a data warehouse in sight.

Semantic Layer

A semantic layer presents the business view of the data model, and allows users to interact with it without needing knowledge of the underlying schema or even how to write SQL code. In the words of my colleague Christian Wade, it’s “clicky-clicky draggy droppy” reporting that provides a single version of the truth, without the risk of users creating inaccurate measures through a misplaced join or an incorrect Group By statement.

Microsoft’s semantic layer offering is Azure Analysis Services, which allows users to connect to models built with Analysis Services using any compliant tool, such as Power BI and Tableau.

I create an Azure Analysis Services project in Visual Studio, and connect to my Azure Data Lake Store from Part 1 (ensure you change your model to use the latest SQL Server 2017/Azure Analysis Services compatibility level):

Azure Analysis Services Get Data

In the Query Editor I create queries that pull in the CSV files I created earlier for DimProduct, DimCustomer and FactSales:

Query Editor

Note, whilst it’s relatively straightforward to import CSV files into Azure Analysis Services from Data Lake Store, my colleague Kay Unkroth wrote a great article that makes this much easier, and I use this method in my solution. Please see this article for further details.

Once the tables have been imported into Azure Analysis Services, it’s then a simple feat to define our star schema and create a couple of measures:

Simple Star Schema (No Date Dimension Yet!)

We then publish our Analysis Services model to the Azure Analysis Services Server we created in part 1, and connect to it using Power BI:

Power BI Example

That’s it, all done!

Not quite…

Refresh and Orchestration

So we’ve now shown that you can ingest, process and serve data as dimensional constructs using Databricks, Data Lake Store and Analysis Services. However, this isn’t much use if the pattern can’t be repeated on a schedule. From Part 1, we use Azure Data Factory to copy data from our sources and also to call our Databricks notebook that does the bulk of the processing. With our Analysis Services model now published, we simply need to extend our Data Factory pipeline to automate processing the model.

Logic Apps

There are a few methods out there for refreshing an Azure Analysis Services cube, including this one here. However, I particularly like the use of Azure Logic Apps for a code-lite approach to orchestration. Using Logic Apps I can call the Azure Analysis Services API on demand to process the model (i.e. refresh it with the latest data from our data store). The Logic App presents a URI that I can then POST to, which triggers the processing.
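
As an aside, the call that triggers the Logic App is just an HTTP POST against the URL generated by its HTTP trigger, so it’s easy to test outside of Data Factory first. Here’s a minimal Python sketch; the URL is a placeholder for your own Logic App trigger URL, and in the pipeline itself the same POST is issued by a Data Factory Web activity, as shown later:

```python
# Quick test of the Logic App HTTP trigger that kicks off the Analysis Services refresh.
# The URL is a placeholder for the trigger URL generated by your own Logic App;
# in the pipeline itself the same POST is made by the Data Factory "Web" activity.
import requests

logic_app_url = "https://<your-logic-app-http-trigger-url>"

response = requests.post(logic_app_url, json={})  # empty body - the Logic App handles the rest
response.raise_for_status()

print("Refresh triggered, status code:", response.status_code)
```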

Jorg Klein did an excellent post on this subject here, and it’s his method I use in the following example:

Logic App Example

Once you’ve verified that the Logic App can call the Azure Analysis Services refresh API successfully, you simply need to embed it into the Data Factory workflow. This is a matter of using the Data Factory “Web” activity to call the URI obtained from the Logic App you created above:

Logic App Post URL

Our final (simplified for this blog post) Data Factory looks like this, with the Web Activity highlighted.

Data Factory Pipeline with Web Activity

A simple test of the Data Factory pipeline verifies that all is working.

Conclusion

So, there you have it. My aim in this post was to see if we could create a simplified data warehouse-like approach that did away with a relational data warehouse platform yet still provided the ability to serve the various workloads of a modern data platform. By keeping the data all in one location (our data lake), we minimize the amount of data movement, thus simplifying many aspects, including governance and architecture complexity.

In terms of how we did it:

  1. Ingested data from source systems using Azure Data Factory, landing these as CSV files in Azure Data Lake Store
  2. Azure Databricks was then used to process the data and create our dimensional model, writing the data files back into Azure Data Lake Store (a minimal sketch of this step follows the list)
  3. Azure Analysis Services ingested the dimensional files into its in-memory engine, presenting a user-friendly view that can be consumed by BI tools
  4. Refresh of the Analysis Services model was achieved using Azure Logic Apps, with this component being added to our data pipeline in Azure Data Factory
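
For reference, here’s a minimal sketch of what step 2 above might look like inside the Databricks notebook: deriving a dimension with a surrogate key and a matching fact table, then writing both back to the lake as CSV files for Analysis Services to import. The mount point and column names are illustrative placeholders rather than the exact code from Part 1:

```python
# Illustrative sketch of step 2: build DimCustomer and FactSales from the raw extract
# and write them back to the data lake as CSV files for Analysis Services to import.
# The mount point and column names are placeholders, not the exact Part 1 code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

sales_raw = spark.read.option("header", "true").csv("/mnt/datalake/raw/sales/*.csv")

# Dimension: distinct customers with a simple surrogate key
dim_customer = (sales_raw
                .select("CustomerCode", "CustomerName", "Country")
                .dropDuplicates(["CustomerCode"])
                .withColumn("CustomerKey",
                            F.row_number().over(Window.orderBy("CustomerCode"))))

# Fact: transactions keyed on the dimension's surrogate key
fact_sales = (sales_raw
              .join(dim_customer.select("CustomerCode", "CustomerKey"), "CustomerCode")
              .select("CustomerKey", "OrderDate", "SalesAmount", "Quantity"))

# Rebuilt-every-load: overwrite the output files on each run (fine for small data sets)
(dim_customer.write.mode("overwrite")
 .option("header", "true").csv("/mnt/datalake/curated/DimCustomer"))
(fact_sales.write.mode("overwrite")
 .option("header", "true").csv("/mnt/datalake/curated/FactSales"))
```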

Is This A Viable Approach?

Simply put, I believe the approach can work; however, I think it is definitely dependent on specific scenarios. You can’t, or at least not very easily, create “traditional” data warehouse elements such as Slowly Changing Dimensions in this approach. The example proposed in these articles is a simple star schema model, with a “rebuilt-every-load” approach taken as our data sets are very small. For large, enterprise-scale data warehouse solutions you need to work with Data Lake Store in different ways than you would with a traditional data warehouse pipeline. There are many other factors that would affect your decision, but these are out of scope for this particular article.

So, can we build a data warehouse-less data warehouse?

Yes we can.

Should we build them this way?

It depends, and it’s definitely not for everyone. But the joy of cloud is that you can try things out quickly and see if they work. If they don’t, tear it down and build it a different way. One definite benefit of this particular solution is that it allows you to get started quickly for an alpha or POC. Sure, you might need a proper RDBMS data warehouse further down the line, but to keep things simple, get the solution up and running using an approach such as the one suggested in this article and “back-fill” with a more robust pipeline once you’ve got your transformation code nailed down.

Happy building.

Further Reading

Azure Analysis Services With Azure Data Lake Store

Process Azure Analysis Services Using Logic Apps

Operationalize Databricks Notebooks Using Azure Data Factory

Exam Tips For Getting Your MCSA (BI Reporting)

To get your MCSA (BI Reporting) you need to take two exams – Analyzing and Visualizing Data with Power BI (70-778) and Analyzing and Visualizing Data with Excel (70-779). I recently took both exams at Microsoft Ready, and thought that with them fresh in mind it was a good opportunity to “pass it on” and provide some tips and advice on how to achieve the certification. I wasn’t actually planning to take 70-779, but in a prep session we were informed that 70-779 is very similar to 70-778, so I decided to take both.

It was good advice, and I managed to pass both in the same day, achieving my MCSA (BI Reporting) in the process.

MCSA (BI Reporting)

With this in mind, I’ve added a couple of study recommendations plus some stand-out items from both exams that I’d recommend you double down on when preparing.

Study Recommendation 1 – Take the Analyzing and Visualizing Data with Power BI course on edX. It’s free to take (or audit, as it’s referred to on the site), and it provides an excellent foundation in all the skills tested in 70-778.

Study Recommendation 2 – Take the Analyzing and Visualizing Data with Excel course on edX. This is basically the sister course of the Power BI one, and it provides an excellent foundation in all the skills tested in 70-779.

Study Recommendation 3 – Go through the Power BI Guided Learning. Like the above courses, this provides an excellent zero-to-hero progression in Power BI, developing the hands-on skills that are essential to knowledge retention.

After completing the above, below are some of my specific recommendations based on what I encountered in the exams:

  1. Know your DAX. Especially creating measures, calculated columns and calculated tables. Know the difference between CALCULATE, SUMX and similar functions. Here is a good place to start.
  2. Know your M, specifically around manipulating string values in columns. Check out the M language reference and play around with a few commands to get the feel for it.
  3. Know the difference between Import to Data Model and Import to Worksheet (70-779).
  4. Relationships in the Data Model – Know when you would use Bidirectional Filters, the difference between 1-1 and 1-many joins and specifically the impact of how the filter direction affects the visuals produced. There’s a great whitepaper on this subject here.
  5. Refreshing Power BI models in the Power BI Service – Understand what you can refresh from OneDrive, and the difference between DirectQuery and Import modes.
  6. Pivot Tables and Pivot Charts in Excel. Understand the structure (Rows vs Columns vs Values). (70-779)
  7. Slicers and Filters. Swot up on Report Interactions and understand how to turn these off and on to achieve the desired effect. Go through this tutorial.
  8. Know the main differences between Power BI Free, Power BI Pro and Power BI Premium. Know what each licensing option gives you, and what it doesn’t. Start here.
  9. Sharing in Power BI – know how to do it, know the differences between the different sharing options. Read this for a good overview.
  10. Row Level Security – Understand the main concepts of Roles and Rules and how to implement it via the data model.
  11. Data Model sizing and performance best practices. MVP Avi Singh provides a great video on this here.
  12. Understand the kind of data sources supported in Power BI. I’d suggest playing through connecting to and manipulating the following sources to get a feel for the steps you undertake – SQL table, JSON file and Excel workbook.

These are the items I encountered when I took 70-778 and 70-779. There will obviously be some variation in what you encounter, but hopefully this provides some good pointers on where to focus. Hands down, though, the best preparation is going through both of the MOOCs identified at the beginning. They cover massive chunks of the material tested in the exams, so I 100% recommend you work through these as a minimum.

Good luck!

*Addendum* Shout out to my colleague Dustin Ryan, who’s produced a similar post for 70-778. I saw it posted on LinkedIn recently when this one was nearly done, so I figured I’d still release it. Read both. They can only help you in the long run.

https://sqldusty.com/2017/08/23/preparing-for-microsoft-certification-exam-70-778-analyzing-and-visualizing-data-with-microsoft-power-bi/

 

Power BI Premium Vs Analysis Services

What? Is that even a valid question?

At first glance this isn’t an obvious topic, but I’ve had a couple of conversations recently with partners that echoed some of my own thoughts about where and how Power BI Premium sits with Analysis Services.

Let’s recap what Power BI Premium brings in terms of performance enhancements (read about its wider capabilities here):

  • Increased dataset limits
  • Real time cache/pin to memory
  • Secondary model copies
  • Geo-replicas
  • Incremental/Dedicated Refreshes

With the new support for very large data sets, incremental refreshes etc, this has led to the following question being asked:

“If I have Power BI Premium with all these enhancements, why would I need Analysis Services?”

To me, there’s one main reason above all others why you would use Analysis Services in conjunction with Power BI, and you can skip to the summary to see it, but in order to answer this, let’s first look at the two products in more detail:

Power BI

Until recently, we had Power BI Pro (or Free) and Analysis Services. Both had quite a clear, defined place in the BI ecosystem. Power BI is described as the following:

“…a suite of business analytics tools to analyze data and share insights”

These tools include Power BI Desktop (a powerful desktop data mashup and reporting tool), Power BI Mobile (mobile apps allowing you to access reports from anywhere) and the Power BI Service (allows you to publish and share reports throughout the organisation).

So, Power BI allows users to ingest data from a variety of sources, mash it up, build some reports and publish them to the Power BI Service for sharing and collaboration.

Power BI Premium is an add-on to Power BI Pro that brings dedicated capacity, performance and scalability (see earlier in this article).

Analysis Services

With Analysis Services, you can mashup and combine data from multiple sources, define metrics, and secure your data in a single, trusted semantic data model.

Sound familiar?

In many ways, Power BI and Analysis Services are very similar. In fact, Power BI has at its heart a Tabular Analysis Services engine. When you ingest and mash up data into Power BI Desktop, you’re actually populating a local version of Analysis Services.

In fact with both tools you have very similar features:

  • Create hierarchies
  • Measures
  • Calculated columns
  • Row Level Security
  • Live Connection/Direct Querying

Data Visualization and reporting aside (Power BI does this, Analysis Services doesn’t – it’s not a reporting tool) there are also differences, but the lines have become blurred in recent times.

The Main Difference

The key difference is that a data model created directly in Power BI is essentially a closed model, available to Power BI only. Users can create pages (reports) in Power BI Desktop with multiple visualizations and then publish this workbook to the Power BI Service. Within the Service itself, users can edit the published reports on this dataset as well as adding new ones:

Edit Report In Power BI Service

The point here is that this model and its data can be reused within and by Power BI only. Users must log in to Power BI to edit the report, or they can publish a Power BI Desktop file as a central model and connect to it using a Power BI Service Live Connection. All of this functionality is pure Power BI.

Where Analysis Services differs is that it is essentially an open model. Yes, you create a data model by ingesting data and creating a user-friendly reporting model in exactly the same way you do with Power BI, but that’s where you stop. You publish the Analysis Services model as a central, trusted data source and – and this is the crux of the matter – ANY tool that can submit DAX or MDX can use your Analysis Services model and get exactly the same answer. This is the ultimate “single version of the truth” that is the goal of so many BI projects. You create your central model, publish it, then connect to it from any of the following tools, plus many more:

 

Summary

In the end, Power BI Premium on its own is a compelling option if you’re just a Power BI shop (and Excel, sort of) and nothing else. Your analysts and developers can develop Power BI models in Desktop, and these models can be used to create reports in the Service, or from Power BI Desktop using the Power BI Service Live Connection. My advice in this scenario is to start with Power BI for your data models, especially if your models are small or moderate in size, then consider moving to Analysis Services as a scale-up option. This is especially appealing now with the release of the Analysis Services Web Designer, which allows you to publish Power BI workbooks to Analysis Services and create a data model from them. This is ideal where you’re running a PoC and want to promote it to something more “enterprise”.

Scale Up Power BI to Analysis Services

 

However, if you want a central data model to act as the single version of the truth and have many different tools in use across the organization, then Analysis Services is your friend:

Power BI & Other Tools With Analysis Services

In conclusion, I recommend Analysis Services as a key component of any relational data warehouse platform in combination with Power BI. Whilst the decision isn’t as clear-cut if you only use Power BI, the ability to create a completely abstracted semantic layer that serves as your golden view of data, regardless of what visualization tool is used, is a point that can’t be oversold.

Agree? Disagree? Please feel free to comment below.

Related Links

Power BI Premium

Azure Analysis Services

Power BI Service Live Connection

Why Analysis Services

Connecting to Power BI Datasets with Power BI Service Live Connection

One of the most useful features to come to Power BI in recent months is the Power BI Service Live Connection. You can find out more here.

What is it?

In a nutshell, it allows developers to connect to an already published Power BI dataset and create new reports without having to redo all the queries and data ingestion.

This scenario is great as an Analysis Services “lite” approach, allowing teams to better collaborate and reduce duplication of effort by sharing standardized models and content. As most of the effort involved in self-service BI tools is spent doing the data mashing, with this approach you can “model once and reuse many times”.

How does it work?

As Power BI Service Live Connection is a preview feature, you first have to enable it in “options”:

Power BI Options – Power BI Service Live Connection

Once enabled, restart Power BI and create a new dataset using the normal Get Data functionality. This becomes your data model to publish and share. Instead of building any reports, simply publish the pbix file to the Power BI Service:

Adventure Works Data Model In Power BI

 

Blank Power BI Workbook with Data Model

Once published, this dataset can then be connected to from any other Power BI workbook using the new functionality. Developers no longer need to do all the heavy lifting of data mashing, and can instead just use the already-built data model to create their own reports and dashboards.

Get Data – Power BI Service Live Connection

Browse to the published data set:

workspace

You then have access to the published data model, and can build your own content. Note the “Edit Queries” option is grayed out.

New Power BI Workbook Connecting to the Published Dataset

This new functionality is a great addition to the Power BI toolset, and allows for a much improved sharing and collaboration experience within teams.

Find Out More:

Blog Announcement

Step by Step Guide