Building The Data Warehouse-less Data Warehouse (Part 2 of 2)

Part 1 of this series can be found here.

In my previous post I explored the feasibility of building a simplified reporting platform in Microsoft Azure that did away with the need for a relational data warehouse. There, I proposed that we land, process and present curated datasets (both dimensional files for our data warehouse “layer” and other data assets for our data scientists to work with) within Azure Data Lake Store, with the final step being to produce a series of dimension and fact files to be consumed by our semantic layer. The diagram below highlights this approach:

[Figure: Conceptual model of the data warehouse-less data warehouse]

At the end of the previous post I’d produced our data warehouse files (dimensions and facts), and in this second and final part I will show how we consume these files with our semantic layer (Azure Analysis Services) to finally present a business-friendly reporting layer without a data warehouse in sight.

Semantic Layer

A semantic layer presents the business view of the data model, allowing users to interact with it without needing to understand the underlying schema or write SQL. In the words of my colleague Christian Wade, it’s “clicky-clicky draggy droppy” reporting that provides a single version of the truth, without the risk of users creating inaccurate measures through a misplaced join or an incorrect GROUP BY clause.

Microsoft’s semantic layer offering is Azure Analysis Services, which allows users to connect to models built with Analysis Services using any compliant tool, such as Power BI or Tableau.

I create an Azure Analysis Services project in Visual Studio and connect to my Azure Data Lake Store from Part 1 (ensure you change your model to use the latest SQL Server 2017/Azure Analysis Services compatibility level):

[Figure: Azure Analysis Services Get Data]

In the Query Editor I create queries that pull in the CSV files I created earlier for DimProduct, DimCustomer and FactSales:

[Figure: Query Editor]

Note: whilst it’s relatively straightforward to import CSV files into Azure Analysis Services from Data Lake Store, my colleague Kay Unkroth wrote a great article describing a method that makes this much easier, and it’s the one I use in my solution. Please see that article for further details.

Once the tables have been imported into Azure Analysis Services, it’s then a simple task to define our star schema and create a couple of measures:

[Figure: Simple Star Schema (No Date Dimension Yet!)]

We then publish our Analysis Services model to the Azure Analysis Services server we created in Part 1, and connect to it using Power BI:

[Figure: Power BI Example]

That’s it, all done!

Not quite…

Refresh and Orchestration

So we’ve now shown that you can ingest, process and serve data as dimensional constructs using Databricks, Data Lake Store and Analysis Services. However, this isn’t much use if the pattern can’t be repeated on a schedule. In Part 1, we used Azure Data Factory to copy data from our sources and to call the Databricks notebook that does the bulk of the processing. With our Analysis Services model now published, we simply need to extend our Data Factory pipeline to automate processing the model.

Logic Apps

There are a few methods out there for refreshing an Azure Analysis Services model, including this one here. However, I particularly like Azure Logic Apps for a code-light approach to orchestration. Using Logic Apps I can call the Azure Analysis Services API on demand to process the model (that is, refresh it with the latest data from our data store). The Logic App exposes a URI that I can POST to in order to trigger the processing.
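
As a quick illustration, here’s a minimal sketch of that trigger call from Python using the requests library. The URL is entirely hypothetical; you’d copy the real HTTP trigger URI (including its signature parameter) from your own Logic App:

```python
import requests

# Hypothetical HTTP trigger URI copied from the Logic App designer;
# a real one includes your workflow id, api-version and a sig=... parameter.
LOGIC_APP_URI = (
    "https://prod-00.westeurope.logic.azure.com/workflows/<workflow-id>"
    "/triggers/manual/paths/invoke?api-version=2016-10-01&sig=<signature>"
)

# POST to the trigger; the Logic App in turn calls the Azure Analysis
# Services API to process (refresh) the model.
response = requests.post(LOGIC_APP_URI, json={})
response.raise_for_status()
print("Model refresh triggered:", response.status_code)
```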

Jorg Klein did an excellent post on this subject here, and it’s his method I use in the following example:

[Figure: Logic App Example]

Once you’ve verified that the Logic App can call the Azure Analysis Services refresh API successfully, you simply need to embed it into the Data Factory workflow. This is a matter of using the Data Factory “Web” activity to call the URI obtained from the Logic App you created above:
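
For reference, the Web activity definition in the pipeline JSON takes roughly the shape below, expressed here as a Python dict. The activity name and URL are placeholders rather than values from my actual pipeline:

```python
# Sketch of a Data Factory "Web" activity definition, mirroring the
# pipeline JSON. Paste the URI copied from the Logic App into "url".
web_activity = {
    "name": "ProcessTabularModel",        # hypothetical activity name
    "type": "WebActivity",
    "typeProperties": {
        "url": "<logic-app-trigger-uri>",  # from the Logic App above
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": "{}",                      # the trigger needs no payload
    },
}
```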

[Figure: Logic App Post URL]

Our final (simplified for this blog post) Data Factory looks like this, with the Web Activity highlighted.

[Figure: Data Factory pipeline with Web activity]

A simple test of the Data Factory pipeline verifies that all is working.

Conclusion

So, there you have it. My aim in this post was to see if we could create a simplified, data warehouse-like approach that did away with a relational data warehouse platform yet still provided the ability to serve the various workloads of a modern data platform. By keeping all the data in one location (our data lake), we minimize data movement, which simplifies many aspects, including governance, and reduces architectural complexity.

In terms of how we did it:

  1. Ingested data from source systems using Azure Data Factory, landing these as CSV files in Azure Data Lake Store
  2. Azure Databricks was then used to process the data and create our dimensional model, writing back the data files into Azure Data Lake Store
  3. Azure Analysis Services ingested the dimensional files into its in-memory engine, presenting a user-friendly view that can be consumed by BI tools
  4. Refresh of the Analysis Services model was achieved using Azure Logic Apps, with this component being added to our data pipeline in Azure Data Factory

Is This A Viable Approach?

Simply put, I believe the approach can work; however, it is definitely dependent on the scenario. You can’t, or at least not very easily, create “traditional” data warehouse elements such as slowly changing dimensions in this approach. The example proposed in these articles is a simple star schema model, with a “rebuild-every-load” approach taken because our data sets are very small. For large, enterprise-scale data warehouse solutions you would need to work with Data Lake Store in different ways than you would with a traditional data warehouse pipeline. There are many other factors that would affect your decision, but these are out of scope for this particular article.

So, can we build a data warehouse-less data warehouse?

Yes we can.

Should we build them this way?

It depends, and it’s definitely not for everyone. But the joy of the cloud is that you can try things out quickly and see if they work; if they don’t, tear them down and build them a different way. One definite benefit of this particular solution is that it allows you to get started quickly for an alpha or POC. Sure, you might need a proper RDBMS data warehouse further down the line, but to keep things simple, get the solution up and running using an approach such as the one suggested in this article, then “back-fill” with a more robust pipeline once you’ve got your transformation code nailed down.

Happy building.

Further Reading

Azure Analysis Services With Azure Data Lake Store

Process Azure Analysis Services Using Logic Apps

Operationalize Databricks Notebooks Using Azure Data Factory

SQL Server Options In Azure

The data platform options in Azure are vast, and they grow and change every month. Choosing the right database platform for your workloads can be a challenge in itself. This post aims to give some clarity on the options available in the SQL Server space.

Before diving in, it’s worth clarifying the underlying platform options to be discussed. SQL Server in Azure comes in both Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) flavors, with each providing different benefits depending on your specific needs. This is best summarized in the graphic below:

[Figure: SQL Server IaaS & PaaS]

Infrastructure As A Service Options

SQL Server In Azure VM

As simple as it sounds, you can run SQL Server “as is”, running the full engine on a Windows or Linux virtual machine, or in a Docker container. This option is exactly the same as the one you’d run on-premises, with the full SQL Server feature set, including high availability, security and various other features depending on the edition you choose.

If you need full control over and access to the OS, need to run apps or agents alongside the database, and basically want to manage all aspects of your solution, then SQL Server IaaS is the right choice.

This SQL Server option can be accessed from the Azure marketplace, with many editions available depending on your needs.

Billing in this model comes in two flavors too: Bring Your Own Licence (BYOL), where you provide your own SQL Server licence and just pay for the underlying compute/storage, or Pay As You Go, where you pay for the SQL Server licence per minute the machine is running.

Platform as a Service (PaaS) Options

Alongside the full control of SQL Server IaaS, you also have several PaaS options. These provide a much better total cost of ownership (TCO), as much of the daily administration is handled by the Azure platform: backups, high availability, performance tuning and monitoring are all abstracted away, allowing you to focus on building and using your application.


[Figure: SQL PaaS Options]

SQL Database

The first of the pure PaaS offerings (sometimes called Database as a Service, or DBaaS), SQL Database offers the power of the SQL Server engine, but without the management overhead that comes with maintaining a full SQL Server instance.

As a DBaaS, SQL Database brings with it many features, including dynamic provisioning and resizing, built-in high availability, automatic backups, point-in-time restore and active geo-replication. As Microsoft assumes much of the daily maintenance work, you’re freed up to realize cost and operational benefits that you wouldn’t have experienced with an on-premises or hosted solution.

Unlike the IaaS option, where you choose the virtual machine that your instance will reside on (with its associated CPUs, RAM, etc.), SQL Database comes in different tiers, from Basic through to Premium. Rather than specifying hardware items like RAM, SQL Database tiers are measured in Database Throughput Units (DTUs), with the underlying specifications abstracted away.

Read here for more details on how to choose the best tier for your application.

There is a small subset of features not available or not applicable in SQL Database compared to SQL Server. Please check here and here.

SQL Database Managed Instances (Preview)

Designed to minimize the challenges of migrating applications to a SQL Database environment without having to perform application rewrites, SQL Managed Instance is an extension to SQL Database that offers the full SQL Server programming surface and includes several native SQL Server features:

  • Native Backup and Restore
  • Cross Database queries and transactions
  • Security features – Transparent Data Encryption, SQL Audit, Always Encrypted and Dynamic Data Masking
  • SQL Agent, DBMail and Alerts
  • Change Data Capture, Service Broker, Transactional Replication and CLR
  • DMVs, XEvents and Query Store

On top of this, SQL Managed Instance offers full security and isolation, with each Managed Instance sitting behind its own virtual network (VNet) within Azure (this is now also available for SQL Database).

Currently in private preview, with public preview due shortly, SQL Managed Instance is a great way to get started moving data to the cloud, combining the benefits of both the PaaS and IaaS SQL Server models without requiring changes to the affected application. Microsoft have also created the Azure Database Migration Service to make the migration as seamless as possible.

SQL Database Managed Instance also provides an added incentive to move to the cloud with the Azure Hybrid Benefit for SQL Server. This allows you to move your on-premises SQL Servers to Azure and pay only for the compute and storage. See here for more details.

SQL Database Managed Instance is going to be a real game changer in this space, in my opinion. My colleague James Serra has created an excellent deck that goes into more detail here.

SQL Database Elastic Pools

Whilst not technically a different type of SQL Database offering, Elastic Pools provide the ability to manage multiple SQL Databases that have variable and unpredictable workloads. SQL Databases in an Elastic Pool are allocated elastic Database Throughput Units (eDTUs) that dynamically scale the databases within the pool to meet the required performance demands.

Elastic Pools are ideal for multi-tenancy environments where the workload can’t be predicted, but you don’t want to have to over-provision for those “just in case” moments.

SQL Data Warehouse

SQL Data Warehouse, whilst “relational” in that it has tables with relationships between them, is a different concept to the options above. SQL Data Warehouse is what is known as an MPP (Massively Parallel Processing) solution, designed for heavy-duty data processing and querying at scale, the likes of which you’d see in a data warehouse or similar analytics solution.

SQL Data Warehouse could easily be the subject of a full article all on its own. The platform provides a fully scalable analytics engine in which compute and storage are scaled independently, meaning you’re not locked into over-specced configurations, thus providing greater TCO benefits. With the underlying PolyBase engine supporting it, SQL Data Warehouse is a must-have component in many modern data warehouse architectures within Azure.
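
To illustrate that independent scaling, compute can be resized with a single T-SQL statement run against the logical server’s master database. Here’s a minimal sketch using pyodbc; the server, warehouse and credential names are all hypothetical:

```python
import pyodbc

# Hypothetical connection details; connect to the master database on
# the logical server that hosts the warehouse.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=master;UID=myadmin;PWD=<password>",
    autocommit=True,  # ALTER DATABASE can't run inside a transaction
)

# Scale compute up to DW400; storage is unaffected and billed separately.
conn.cursor().execute(
    "ALTER DATABASE MyDataWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400');"
)
```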

Read here for more information about SQL Data Warehouse use cases and design patterns.

Honorable Mentions

This post is specifically about the SQL Server options within Azure; however, I wouldn’t be doing justice to the platform if I didn’t mention the other database options available, which will be covered in future posts:

  • Cosmos DB – A globally distributed, multi-model database service that supports SQL, Graph, Cassandra, MongoDB and Table APIs.
  • Azure Database for MySQL – Exactly as it says on the tin. This is another PaaS offering that provisions a MySQL instance.
  • Azure Database for PostgreSQL – As above, this PaaS offering provides PostgreSQL functionality, all managed by the Azure platform.

Which Should I Choose?

There are many factors to consider when choosing a relational database service in Azure, including cost, control, workload type, governance requirements and many more. As a general rule, if you need full control over your environment, with the full feature set, then SQL Server 2017 as a virtual machine or container is the way to go. For applications born in the cloud that just need the database engine, without the hassle of maintaining the database, SQL Database or SQL Database Managed Instance are excellent options. And if you need the elasticity that comes with hosting a multi-tenant or highly variable environment, then SQL Database Elastic Pools is the option for you.

The real game changer now is SQL Managed Instance. Offering the full SQL Server feature set, but with all the benefits of PaaS, this option is great for those looking for a seamless move from on-premises to the cloud.

Further Reading

SQL Database vs SQL Server

SQL Data Warehouse vs SQL Database

SQL Managed Instances


Azure Analysis Services Web Designer

I’m a long-time fan of Analysis Services, and this latest feature is a really cool addition to the offering. Currently in preview, Azure Analysis Services Web Designer offers the following functionality that extends Azure Analysis Services, all through a simple web UI:

  • Add a new logical server in Azure
  • Create a new Analysis Services model from SQL DB, Azure SQL DW and… Power BI workbooks! (More data sources to come)
  • Browse an existing Analysis Services model and add new measures for quick validation
  • Open an existing Analysis Services model as a Visual Studio Project, in Excel, or in Power BI
  • Edit an existing Analysis Services model using TMSL (Tabular Model Scripting Language) – see the sketch below
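
To give a flavour of TMSL, here’s a minimal sketch of a full-refresh command, built as a Python dict and serialized to JSON; the database name is a placeholder:

```python
import json

# A TMSL "refresh" command of type "full" against a hypothetical model.
tmsl_refresh = {
    "refresh": {
        "type": "full",
        "objects": [{"database": "AdventureWorksModel"}],
    }
}

# The serialized JSON is the script you would run against the model.
print(json.dumps(tmsl_refresh, indent=2))
```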

Example – Creating a Model from Power BI Desktop

In this example, I’ve created a simple Power BI workbook (.pbix file) that connects to an Azure SQL DB instance running the sample AdventureWorks database:

[Figure: Simple Power BI Report]

To connect to the new service, I can access it via the Azure portal or go directly to https://analysisservices.azure.com:

[Figure: Azure Analysis Services Web Designer Portal]

On this occasion, I’ll use the server I already have and instead go straight to adding a new model:

[Figure: Importing a Model from Power BI]

Once created, I can browse the model in the Designer:

[Figure: Browse Model]

Or open it using Power BI, Excel or Visual Studio:

[Figure: Open Model Options]

[Figure: Power BI From Published Model]

This is in preview right now, with many features still to come. The ability to import Power BI models into Analysis Services is a massive feature in its own right, but aside from that it already shows how you can quickly create and modify models without having to delve into Visual Studio/SQL Server Data Tools. New features are coming every month, so keep an eye on the team blog (below) to follow its progress.

Further Reading

Introducing Azure Analysis Services Web Designer

Analysis Services Team Blog

Optimizing SSIS Data Loads with Azure SQL Data Warehouse & PolyBase

To gain the best loading performance with Azure SQL Data Warehouse, you want to leverage the power of PolyBase, which allows for massively parallel data loading into the underlying storage.

In standard scenarios, ETL developers using SSIS will leverage a data flow that takes data from a source (database, file, etc.) and loads it directly into the target database:

[Figure: Simple SSIS Data Flow (Table to Table)]

Unfortunately, adopting this pattern with an SSIS data flow into Azure SQL Data Warehouse will not naturally leverage the power of PolyBase. To understand why, we need to take a little peek under the covers of SQL Data Warehouse:

[Figure: Azure SQL Data Warehouse Overview]

As part of the underlying architecture, SQL DW has at its head the Control Node. This node manages and optimizes queries, and serves as the access point for client applications accessing the SQL DW cluster. What this means for SSIS is that when it connects to SQL DW using the standard data connectors, that connection is made to the Control Node:

[Figure: SSIS Front Loading]

This method introduces a bottleneck, as the Control Node serves as the single throughput point for the data flow. Because the Control Node does not scale out, this limits the speed of your ETL flow.

As stated earlier, the most efficient way of loading into Azure SQL DW is to leverage the power of PolyBase in a process known as “Back Loading”:

[Figure: Back Loading via PolyBase]

Using this method, data is bulk loaded in parallel from staging files held in Blob Storage or Azure Data Lake Store, which allows for the full data processing power of PolyBase to be used.

Leveraging PolyBase Within SSIS

Previously, ingesting data from files into SQL DW using PolyBase was a slightly drawn-out process: uploading files to storage, then defining credentials, file formats and external table definitions, before finally loading the data into SQL DW using a CTAS (Create Table As Select) statement. This works perfectly well, but introduces a lot of extra SQL coding on top of the norm.
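
For context, here’s a minimal sketch of that manual process, driven from Python via pyodbc. Every object name, the storage account and the connection string are hypothetical, and it assumes a database master key already exists in the warehouse:

```python
import pyodbc

# The classic manual PolyBase load: credential -> external data source
# -> file format -> external table -> CTAS. All names are hypothetical.
POLYBASE_LOAD = """
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE StagingStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://staging@<account>.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

CREATE EXTERNAL TABLE dbo.FactSales_Ext (
    SaleKey     INT,
    ProductKey  INT,
    CustomerKey INT,
    SalesAmount DECIMAL(18, 2)
)
WITH (
    LOCATION = '/factsales/',
    DATA_SOURCE = StagingStore,
    FILE_FORMAT = CsvFormat
);

-- Finally, land the data in parallel with a CTAS statement.
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.FactSales_Ext;
"""

conn = pyodbc.connect("<sql-dw-connection-string>", autocommit=True)
conn.cursor().execute(POLYBASE_LOAD)
```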

Thankfully, with the SSIS Azure Feature Pack, the above is now made much, much easier with the introduction of the Azure SQL DW Upload task. This task automates the creation of the SQL scripts required, allowing you to seamlessly incorporate it into an SSIS ETL process that fully leverages PolyBase.

[Figure: Azure SQL DW Upload Control Flow]

[Figure: Azure SQL DW Upload Task]

Further Reading

Azure SQL DW Best Practices

Azure SQL DW With Azure Data Lake Store (ADLS)

Polybase Guide