*Update – part 2 of this series is now live, and can be found here*
In times of yore, those who maketh the houses of data would bring forth vast swathes of tables and hurl them down in the forts of staging. Here the ancient priests of Ee, Tee and El would perform arcane magicks and transform these rows of chaos into purest order. This order would be rebuilt into the fabled Data Warehouses, and all who looked upon them would gasp in awe and wonder.
But that was then. Today’s data world is different, isn’t it? Data is varied. It’s big and small, fast and slow. It’s tougher to wrangle and make into shapes fit for use. But through all of this, we still need the hallowed, fabled, data warehouse.
Or do we?
This post is basically me exploring a different approach to building a data warehouse that’s based on the data lake paradigm and a form of ELT (Extract, Transform and Load), but leaves out the actual data warehouse part. I’m still going to build a star schema, but it’s going to be file based, using modern data engineering tools to do the munging and “schema-tizing” before sucking into a semantic layer for reporting. It’s also me exploring the capabilities of my current favourite tool – Azure Databricks.
What are the potential benefits of this approach? A few spring to mind, such as cost, flexibility and simplicity. By keeping all the processing within the data lake means it’s easier to control and govern, and the reduced data movement (you’re not copying into a data warehouse) makes an altogether more simpler structure.
Conversely, I’m well aware that this approach brings it’s own challenges. Traditional data warehouse constructs like Slowly Changing Dimensions, Surrogate Keys and other elements of the Kimball checklist will be harder or even not possible with this, so it won’t suit every scenario.
My aim here though was simple – can we build a BI solution without the data warehouse element and is it a viable approach for certain scenarios?
In short, the solution looks like this:
The above construct largely follows the traditional ETL model with data flowing in from source systems, and comprises the following:
- Data ingested as raw files into a staging zone in the data lake
- Files processed using data engineering platform into cleansed outputs
- Scrubbed files then shaped and moved into relevant serving zones. In this example I’ve kept it simple, with one zone for the star schema files and one for a data assets folder that provides cleansed, curated data sets to be consumed by analysts and data scientists.
- The star schema files are subsequently loaded into a semantic layer for enterprise reporting and ad hoc slice n’ dice functionality, whilst the asset files are consumed using the variety of tools preferred by today’s analysts.
To bring this to life, I’ll be using a handful of Azure data services for the relevant solution components –
In this example, I’ll be using an Azure SQL Database as my data source with the AdventureworksLT database.
- An Azure Data Lake Store
- An Azure Databricks Workspace
- Azure Analysis Services Server
- An Azure Data Factory Account
- Azure SQL Database with AdventureWorksLT sample database installed
For each of the above, ensure you put everything in the same Azure region.
There are plenty of options in this space that could move data from source to my lake, including ETL tools such as SSIS, Talend and Azure Data Factory. For this example I’m using Azure Data Factory (version 2), with copy activities moving data from my source SQL database and dropping as *.csv files. I’m also taking advantage of the new Databricks functionality built into Azure Data Factory that allows me to call a Databricks Notebook as part of the data pipeline.
The above data factory copies the source tables into my “raw” directory, and from there I process the files, with the end result being to create my dimension and fact files ready to be loaded into Azure Analysis Services.
Load Files Into Spark DataFrames
With this files safely copied into our raw landing zone, I can now extract the source tables directly into Spark DataFrames. An example is shown below for the product table. We create a DataFrame for each source file.
With the DataFrames created, I then create temporary SQL tables from them so that we can use SQL code to define my dimension. You can of course manipulate them natively as DataFrames using Python/Scala/R if you’re more familiar with those languages. I’m a SQL dude, and am familiar in building data warehouse routines in SQL code so have opted for that method here.
With these views created we can use good ‘old SQL to create DataFrames that reflect the dimensions and facts we want to load into Azure Analysis Services.
and fact tables:
When I run my notebook now I have my dimensional and fact tables created as Spark dataframes. In order for them to be consumed by Azure Analysis Services I need to write the dataframes into CSV files that can then be imported into my tabular model.
Output Dimension and Fact Files
Writing files back from Spark into csv wasn’t as seamless as I thought it would be. Whilst the commands are fairly straight forward, as Spark is a distributed system it writes multiple files as an output, including crc and SUCCESS metadata files. It also doesn’t name them as the file you specify, but instead names it based on the partition name.
We don’t need these files, and need a proper, consistent filename, so I wrote some extra code to rename and move the files back under our Data Warehouse directory.
With all this done, it leaves us nicely with a set of files ready for ingestion into Azure Analysis Services, which is the subject of Part 2 of this series.
Coming in the second and final part of this series…
- Build an Azure Analysis Services model directly off the dimension and fact files within the Data Lake Store
- String it all together using Azure Data Factory
Part 2 is now live, and can be found here….