Implementing a data warehouse in the cloud using Microsoft Azure technologies can help companies of all sizes move to a modern cloud data platform while leveraging the value created in an existing on-premises data warehouse.
Benefits of a data lake
Azure Data Lake Generation 1 and 2
While Azure Data Lake Generation 1 provides a solid solution for storing and organizing source data, Generation 2 (1) offers an improved product:
General architecture
What is Azure Data Factory?
Azure Data Factory (3) is a cloud tool for orchestrating events in a logical progression using “pipelines.” It allows users to move data, call events, send notifications, look up data and even call other pipelines. This can all be done in sequence, in loops, in parallel, or dependent on actions occurring within the pipeline. For this scenario, it is important to know that Azure Data Factory allows calls to Azure Databricks with parameters.
As a basic example of an ETL platform, a user can set up a pipeline that copies data to a staging area in the data lake, runs an Azure Databricks notebook to transform the data as needed, and finally sends the transformed data to a separate landing zone.
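The three-step pipeline described above can be outlined in code. The following is a minimal sketch that builds a simplified pipeline definition as a Python dictionary, loosely modeled on the JSON that Azure Data Factory uses for pipelines. The pipeline, activity, dataset, and linked service names, the notebook path, and the `load_date` parameter are all hypothetical, and the Parquet source and sink types are only an assumption about the file format. A deployable definition would need more detail than is shown here.

```python
import json


def depends_on(activity_name):
    """Dependency entry: run only after the named activity has succeeded."""
    return [{"activity": activity_name, "dependencyConditions": ["Succeeded"]}]


def dataset_ref(name):
    """Reference to a dataset defined elsewhere in the factory (hypothetical names)."""
    return {"referenceName": name, "type": "DatasetReference"}


def copy_activity(name, source, sink, after=None):
    """Copy step from one dataset to another, optionally chained after a prior step."""
    return {
        "name": name,
        "type": "Copy",
        "dependsOn": depends_on(after) if after else [],
        "inputs": [dataset_ref(source)],
        "outputs": [dataset_ref(sink)],
        # Assumed file format; other source and sink types are possible.
        "typeProperties": {
            "source": {"type": "ParquetSource"},
            "sink": {"type": "ParquetSink"},
        },
    }


def notebook_activity(name, notebook_path, parameters, after):
    """Databricks notebook step that receives parameters from the pipeline."""
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "dependsOn": depends_on(after),
        "linkedServiceName": {
            "referenceName": "AzureDatabricksLinkedService",  # hypothetical name
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            "baseParameters": parameters,
        },
    }


def build_pipeline():
    """Stage the data, transform it in a notebook, then copy it to the landing zone."""
    activities = [
        copy_activity("CopyToStaging", "SourceData", "StagingData"),
        notebook_activity(
            "TransformInDatabricks",
            "/etl/transform_sales",        # hypothetical notebook path
            {"load_date": "2020-01-31"},   # hypothetical parameter
            after="CopyToStaging",
        ),
        copy_activity(
            "CopyToLanding", "TransformedData", "LandingData",
            after="TransformInDatabricks",
        ),
    ]
    return {"name": "BasicEtlPipeline", "properties": {"activities": activities}}


if __name__ == "__main__":
    print(json.dumps(build_pipeline(), indent=2))
```

Each activity names the step it depends on, so the three steps run in sequence and a later step is skipped if an earlier one fails.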
Benefits of Azure Data Factory
There are several benefits of using Azure Data Factory for a cloud data warehouse:
Orchestration only
While Data Factory’s Data Flow is available as a transform tool, the practice for this type of project has been to write standardized and custom Databricks notebooks.
This keeps the areas of concern separate, keeps the business logic for the build in one place, allows the use of Databricks’ notebook version control for development releases, and lets the organization easily change where the final data sets are surfaced if necessary.
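One way to picture a standardized notebook is as a small parameter-reading step followed by a transformation function that holds the business logic. The sketch below uses plain Python as a stand-in for notebook code. The parameter names, column names, and cleaning rules are hypothetical. In a Databricks notebook, parameters passed from Data Factory are typically read with `dbutils.widgets.get`, and the transformation would normally operate on Spark DataFrames rather than lists of dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class NotebookParams:
    """Parameters the orchestrator is assumed to pass in (hypothetical names)."""
    source_path: str
    target_path: str
    load_date: str


def read_params(get_widget: Callable[[str], str]) -> NotebookParams:
    """Read parameters through any name -> value lookup.

    In Databricks this could be dbutils.widgets.get; a dict lookup works for local runs.
    """
    return NotebookParams(
        source_path=get_widget("source_path"),
        target_path=get_widget("target_path"),
        load_date=get_widget("load_date"),
    )


def transform(rows: Iterable[Dict[str, str]], load_date: str) -> List[Dict[str, str]]:
    """Example business logic: drop rows without an id, tidy names, stamp the load date."""
    cleaned = []
    for row in rows:
        customer_id = (row.get("customer_id") or "").strip()
        if not customer_id:
            continue  # skip records that cannot be keyed
        cleaned.append({
            "customer_id": customer_id,
            "name": (row.get("name") or "").strip().title(),
            "load_date": load_date,
        })
    return cleaned


if __name__ == "__main__":
    widgets = {
        "source_path": "/staging/customers",
        "target_path": "/landing/customers",
        "load_date": "2020-01-31",
    }
    params = read_params(widgets.__getitem__)
    sample = [
        {"customer_id": " 42 ", "name": "ada lovelace"},
        {"customer_id": "", "name": "no id"},
    ]
    print(transform(sample, params.load_date))
```

Because the logic sits in a function that only sees its inputs and parameters, the orchestration layer can change where data is read from or written to without touching the business rules.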
What is Azure Databricks?
Databricks is a cloud platform built around Apache Spark. Spark is an open-source distributed processing framework that enables fast processing of big data with multiple improvements over the original MapReduce paradigm (4). Databricks provides additional features for ease of use such as managed clusters, collaborative workspaces and a notebook-style interface. Azure Databricks (5) is the Databricks platform hosted on Azure.
Why use Databricks?
Azure Databricks is currently the technology of choice for handling data engineering on Azure.
Key resources to use Azure Databricks:
For a cloud-first data warehouse, Azure Databricks provides the tools and flexibility needed to integrate a variety of data types. It also provides the computational power needed to process big data without requiring data engineering teams to develop additional code for cluster management.
What is Delta Lake?
Open source
Parquet
Why use Delta Lake?
Data Lake
Malleable
Transactional
Efficient
The Microsoft Azure data platform stack provides flexibility, ease of scaling and speed to deliver on the promise of the data warehouse in the cloud. Ingestion, storage and transformation of data occur using Azure Data Factory, Data Lake Storage and Databricks. Additional complementary Azure services can build on this framework to address many of the most common use cases.
The Microsoft analytics stack then provides additional capabilities on top of the data warehouse to deliver the analytics, insights and predictions that lead to data-driven decision-making.
References
1. https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction
2. https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
3. https://azure.microsoft.com/en-us/services/data-factory
4. https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-platform-that-crushed-hadoop.html
5. https://www.infoworld.com/article/3236869/what-is-apache-spark-the-big-data-platform-that-crushed-hadoop.html
6. https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html
7. https://databricks.com/product/databricks-delta
8. https://parquet.apache.org
9. https://docs.databricks.com/delta/optimizations.html
10. https://docs.databricks.com/delta/optimizations.html