Introduction
EDW (Enterprise Data Warehousing) is a traditional methodology that has been developed and refined over the past three decades; it is highly mature, with its own best practices. Hadoop emerged only a decade ago, yet it has one of the fastest-growing ecosystems. Merging these two systems without attention to the details of both worlds can easily produce a scenario where the strengths of neither are utilized, eventually leading to frustration and failure. Most EDW migrations fail because the fundamentals of EDW and Hadoop are ignored.
EDW
An Enterprise Data Warehouse is a set of tools and techniques that enable a business to answer compelling questions about what happened. It lets businesses view the facts of various business processes from the perspective of different dimensions.
Example: What is the profit generated from customers in Texas during the Christmas season in each of the past 5 years?
In the above, profit is a fact (a measure of some form of transaction), while customer, the state of Texas, and the Christmas season are dimensions. A data warehouse thus enables businesses to make informed decisions based on facts.
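To make this concrete, the question above maps to a simple star-schema query. This is a minimal sketch; the table and column names (fact_sales, dim_customer, dim_date) are hypothetical, not a prescribed model:

  -- Profit from Texas customers during the Christmas season, by year
  SELECT d.calendar_year,
         SUM(f.profit_amount) AS total_profit
  FROM fact_sales f
  JOIN dim_customer c ON f.customer_key = c.customer_key
  JOIN dim_date d     ON f.date_key = d.date_key
  WHERE c.state = 'TX'
    AND d.season = 'Christmas'
    AND d.calendar_year >= YEAR(CURRENT_DATE) - 5
  GROUP BY d.calendar_year;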
A data warehouse is also defined as a subject-oriented, time-variant, non-volatile, and integrated data store [2].
Traditionally, an EDW is built using one of the following approaches:
Bottom up
Build data marts with facts and dimensions and combine them; together they eventually form an EDW [1].
Top down
Build an EDW on a 3NF model that tries to represent the whole business [2]. The sketch below contrasts the two approaches.
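As a minimal, hypothetical illustration (Hive-style DDL), here is the same customer data modeled both ways:

  -- Bottom up (Kimball): a denormalized dimension, ready for a data mart
  CREATE TABLE dim_customer (
    customer_key  BIGINT,
    customer_name STRING,
    city          STRING,
    state         STRING   -- geography flattened into the row, no joins needed
  );

  -- Top down (Inmon): the same data normalized to 3NF for the EDW
  CREATE TABLE customer (customer_id BIGINT, customer_name STRING, city_id BIGINT);
  CREATE TABLE city     (city_id BIGINT, city_name STRING, state_id BIGINT);
  CREATE TABLE state    (state_id BIGINT, state_name STRING);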
ETL
In a data warehouse, ETL provides the skeleton of the framework and typically consumes about 80% of the effort; the remainder goes into the presentation layer. ETL covers data modeling, ETL architecture, metadata management, and ETL development and maintenance. Note that the core treatment of data differs between Hadoop and an EDW: Hadoop is ingested with raw data, with processing and quality as later concerns, whereas an EDW stores cleaned, integrated, and tested data that can be consumed with confidence.
ETL Tools
ETL tools have been a significant part of building ETL infrastructure. They make it much easier to design, build, test, and maintain ETL flows within a lower, feasible budget of developer hours; keep in mind that in a data warehouse, the budget spent on ETL development and maintenance is significant. Before stable ETL tools evolved, ETL was carried out with SQL, PL/SQL, and shell scripts.
ETL tools evolved because of the advantages they offered over hand-written code. Thus, complex ETL logic built on SQL, PL/SQL, and shell scripts was migrated to ETL tools, which addressed the development and maintenance cost of those scripts.
Well, why so much concern about ETL tools when discussing EDW migration?
Because most EDW migration efforts fail by underestimating the importance of ETL tools: bypassing them, losing all of their advantages, and going back to a scripting approach on Hadoop. The key point is that ETL tools were adopted to bring down the budget spent on man hours.
So should I keep my ETL tool untouched?
Hadoop should be used at the points where an ETL tool struggles. It is tough for ETL tools to handle enterprise-size loads on their own engines, so they come up with two options: pushing part of the processing, or all of it, down to the underlying platform.
A full push-down approach, with the ETL tool running entirely on top of Hadoop, would be ideal, but the maturity of this approach has to be considered.
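As an illustration of what push down means here (a generic sketch, not any particular vendor's feature, with hypothetical table names): instead of pulling rows into its own engine, the ETL tool generates SQL such as the following and submits it to Hive, so the heavy lifting runs on the cluster:

  -- Aggregation executed by Hive on the cluster rather than
  -- row-by-row inside the ETL server's engine
  INSERT OVERWRITE TABLE daily_sales_agg
  SELECT store_id,
         sale_date,
         SUM(amount) AS total_amount
  FROM stg_sales
  GROUP BY store_id, sale_date;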
Scenarios to consider an EDW-Hadoop merge
Let us classify data warehouses into the following categories by data volume and computational needs.
Small (Don’t consider merging this with Hadoop now)
The warehouse runs on a single medium- or high-powered server, and users enjoy low latency on current reports. The rate of growth in data size and computational needs is well below the rate of improvement in single-server hardware in terms of number of cores, memory, and storage.
Example: a departmental warehouse of a few hundred GB running on a single RDBMS server.
Medium (Strongly consider Hadoop)
At present, the scenario is as described in the previous case, but the rate of growth (in data volume and computational needs) is high, and management is considering a move from an RDBMS to an MPP system.
Large (Must merge with Hadoop)
This is an enterprise-wide data warehouse running on an MPP system, a typical scenario in many enterprises across industries. This kind of data warehouse can grow from a few hundred terabytes to a few petabytes.
How to merge EDW with Hadoop
An EDW is made up of a staging layer, a transformation layer, and a consumption layer.
Consumption layer
There is a misconception that latency in a data warehouse is measured in seconds while in Hadoop it is measured in minutes. That holds for smaller warehouses of a few hundred GB that are hostable on a single server; for enterprise warehouses of a few hundred TB hosted on an MPP system, average latency easily reaches 5 to 8 minutes on ad-hoc reports. Also note that latency is mainly a concern for ad-hoc reports; in a typical data warehouse, 40 to 70% of reports are canned reports, for which latency is not a concern at all.
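This is also why canned reports tolerate batch latency: they can be pre-computed. A minimal sketch, reusing the hypothetical tables from the earlier example, where the aggregate is refreshed once per load cycle and the BI tool reads only the small result table:

  -- Refresh once per batch load; canned reports then read this small
  -- pre-aggregated table instead of scanning the fact table
  CREATE TABLE rpt_profit_by_state_year AS
  SELECT c.state,
         d.calendar_year,
         SUM(f.profit_amount) AS total_profit
  FROM fact_sales f
  JOIN dim_customer c ON f.customer_key = c.customer_key
  JOIN dim_date d     ON f.date_key = d.date_key
  GROUP BY c.state, d.calendar_year;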
Staging layer
This is another data store, sitting before the consumption layer; data extracted from the different source systems lands here, and ETL tools use this layer to clean and integrate the data. It can be the next layer to move into Hadoop; both HDFS and Hive can be used for it. Traditionally, data has been staged on Linux servers, an RDBMS, or a mainframe.
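For example, files landed in HDFS by an extraction job can be exposed to downstream ETL as a Hive external table. A minimal sketch, with a hypothetical path and layout:

  -- EXTERNAL keeps Hive from taking ownership of the staged files;
  -- dropping the table leaves the landed data in place
  CREATE EXTERNAL TABLE stg_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    STRING,
    amount      DECIMAL(18,2)
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/staging/orders/';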
Transformation layer
This layer processes the data in staging to clean it and integrate it with the data in the consumption layer. Being the core layer holding the business logic, it is one of the most effort-consuming layers. It should be the last layer to move into Hadoop, and only after careful consideration: the move makes sense when its economics are justified against the current ETL tool approach.
The ideal is to keep the business logic in the ETL tool and leverage cheaper computational power from Hadoop through the push-down approach.
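When a transformation is pushed down (or rewritten) for Hadoop, it typically reads the staging table, applies the cleaning rules, and loads the consumption layer. A hedged sketch, reusing the hypothetical stg_orders table above and a hypothetical dw_orders target:

  -- Drop records failing a basic quality rule, keep only the latest
  -- version of each order, then load the consumption-layer table
  INSERT OVERWRITE TABLE dw_orders
  SELECT order_id, customer_id, order_ts, amount
  FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY order_id
                              ORDER BY order_ts DESC) AS rn
    FROM stg_orders s
    WHERE s.amount IS NOT NULL
  ) t
  WHERE t.rn = 1;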
Why would an "EDW migration into Hadoop" fail?
As discussed above: by ignoring the fundamentals of both worlds, by bypassing ETL tools and falling back to scripting, and by moving the transformation layer before its economics are justified.
References
[1] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition.
[2] W. H. Inmon, Building the Data Warehouse.