The ‘journey towards Data Warehouse’ might look like a walk in the park with all the popular cloud vendors offering data infrastructure and BI capabilities. Data Management technologies have surely become more familiar topics, but businesses are still struggling to timetable their data journey and bring together the important abilities to acquire any substantial benefits from them. As our data experts always say,
Business Intelligence is not a state but a continuous practice that involves constant application, learning, and re-learning.
According to the business and data analysts, the first practical step is to have a data repository for holding all info such as transactions, customer details, feedback reviews, financials, etc. This repository with blobs of raw and unorganized data is called Data Lake. But can’t a business’s data journey directly start with Data Warehouse? There are digital-native companies who took advantage of SQL or Postgres, started with a few tables without bothering much about design – well, technically it is a functional Data Warehouse. But the inability of Data Warehouses to store unstructured data would lessen your chances of exploratory data analysis in the future when there is no Data Lake. Last time, we talked about why business executives should back the Data Lake investments. In this article, let’s discuss the journey towards Data Warehouse via Data Lake, the pros & cons of each Data Management approach, and how joining both in the right proportion can make your BI investments financially rewarding.
Just like many new technological concepts, the terms ‘Data Warehouse’ and ‘Data Lake’ are often used interchangeably. But the concept of Data Warehousing has been around for a while. Just that the new technologies such as Cloud storage, AI, and ML changed its reputation of being expensive and unmanageable. Once you dip your toes in the waters, you’ll know both concepts have distinct and very dissimilar abilities.
Data Warehouse | Data Lake |
Analyzing structured data | Analyzing all types of data (structured, semi-structured, unstructured) |
Online Transaction Processing (OLTP) | Online analytical processing (OLAP) |
Schema-on-write | Schema-on-read |
SQL | API-connectivity |
Delivering ACID-compliant database transactions | Low-cost object storage systems for data in open file formats |
Before we discuss why the journey towards Data Warehouse is more sustainable when it is backed by a Data Lake, let us first look at the basic architectures of both these Data Management Technologies.
Figure: Basic Architecture of Data Warehouse
Data Warehouses are designed for specific use cases and the usual practice is to build one for each business unit or function focused on the metrics that matter to them. They follow schema-on-write model) and ingest structured data which is then connected to analytical tools that support BI initiatives. This means the data has to fit into a predefined schema to be able to enter your Data Warehouse. What about the data that doesn’t fit into the schema? It simply stays as a blob of binary in your Database and doesn’t make it to the other side of the BI wall.
Figure: Basic Architecture of Data Lake
Unlike Data Warehouse, Data Lake is about aggregating all the data in real-time and historic, from various sources - be it structured, unstructured, or semi-structured - into secured and scalable cloud storage. No prior data cleansing, cataloging, or schema is involved while collecting the data. This allows the data scientists to query the data in numerous ways and offers flexibility to explore experimental and investigative use cases. It accommodates massive amounts of data irrespective of its formats, sizes, and structures.
As already mentioned, it is totally possible to have a Data Warehouse before a Data Lake. In fact, a Data Lake wouldn’t even be a practical choice of Data Management practice unless your business holds gigabytes of historic data or deal with a crazy flow of real-time data. That is why the experts are suggesting a hybrid Data Management model that routes your journey towards Data Warehouse via Data Lake to take advantage of both long-term as well as short-term Business Intelligence (BI) goals.
Remember the Hype Cycle for Analytics and Business Intelligence, 2021 by Gartner?
Figure: Hype Cycle for Analytics and Business Intelligence, Gartner
According to this graph, Gartner predicted the concept of ‘Logical Data Warehouse’ to be on the Plateau of Productivity with a high benefit rating. By definition, it is all about combining multiple physical analytics engines into a logically integrated unit. LDW is praised as one of the current best practices for analytical systems design to enable reliable real-time Business Intelligence flow. It not only breaks the Data silos by combining the architectures of Data Warehouse and Data Lake, but also brings out the best of various analytical technologies including SQL, OLAP, Graph, Geospatial, machine learning, statistical, and more.
This hybrid Data Management setup is already being practiced in a few digitally forward enterprises and so far, proved to be a robust model to save enterprises from high-cost and low-value data integration cycles. The ideal strategy of Enterprise Data Management of the 21st century and nearly every Data & Analytics (D&A) professional’s toolbox consists of RDBMS, data warehouses and marts, Hadoop clusters, and Data Lakes. This is how it starts to work.
- List out all your data sources such as ERP, CRM, marketing analytics tools, customer support channels, social media platforms, geospatial platforms, etc.
- Irrespective of SQL or No SQL, ingest all that raw data into the Data Lake.
- This is where the D&A team comes into the picture to look up relevant datasets and sources based on the metadata, create customized algorithms for anomaly prediction, predictive maintenance, or fraud detection.
- Certain segments of the data will be sent through the predefined schemas for an ETL process and will be uploaded into the cloud-based Data Warehouse.
- The self-service BI tools, data visualization software and other analytics tools will pick up the data that passed the schema to generate ready-to-use analytics and reports.
- It goes without saying, find a trusted data transformation partner who is experienced in digital business transformations to consult, experiment, and implement.
Data Lake and Data Warehouse go hand in hand – one serves your immediate analytical needs and the other stores your data, so nothing is overlooked when you need to take a serious business decision. But they both are different in more than one way. On top of my head, I can think of 5.
- They assist different kinds of data.
- Their user groups/consumers are different.
- The way they both maintain data differs.
- The speed at which they deliver insights is different.
- The purpose they solve is different.
If you are on a journey towards Data Warehouse and chose to add Data Lake to your architecture, you’d want all the pros and cons of the setup.
Conceptually, the whole idea of unifying Data Warehouse and Data Lake is to get the best of both worlds but mistakes in the process would not only cost you time and money but also the data. Business Intelligence through Data Management technologies requires proficiency in AI and ML. At Qentelli, we are working relentlessly to bring in data literacy in enterprises in efficient, cost-optimal, and sustainable ways. Would you like to know more about our learnings from those projects? Or would you like a bit of expert advice on how to start your journey towards Data Warehouse? We always love a good chat discussing newer challenges. Write us: [email protected]