
Introduction

EDW (Enterprise Data Warehouse) is a traditional methodology, developed and grown over the past three decades and highly mature, with its own best practices. Hadoop emerged only a decade ago, yet it is one of the fastest-growing ecosystems. Merging these two systems without attention to the details of both worlds can easily produce a scenario where the strengths of neither are utilized, eventually leading to frustration and failure. Most EDW migrations fail because the fundamentals of both EDW and Hadoop are ignored.

EDW

An Enterprise Data Warehouse is a set of tools and techniques that enables a business to answer compelling questions about what happened. It lets businesses view the facts of various business processes from the perspective of different dimensions.

Example: What profit was generated by customers from Texas during the Christmas season in each of the past 5 years?

In the above, profit is a fact, or "some form of transaction"; customer, the state Texas, and the Christmas season are dimensions. A data warehouse thus enables businesses to make informed decisions based on facts.

A data warehouse is also defined as a subject-oriented, time-variant, non-volatile and integrated data store.

Traditionally, an EDW can be built using either of the following approaches.

Bottom up

Build data marts with facts and dimensions, then combine them; together they eventually form an EDW.

  • A fact could take the following forms:
    • Transaction: made up of individual transactions.
    • Accumulating: captures a process flow. (example: the cycle from order placed to product sold for a retail store)
    • Periodic: data captured at periodic intervals. (example: in a bank, the balance of every account at the beginning of every month)
  • A dimension could be:
    • Date and time
    • Customer
      • Classified by attributes such as geography, demography, age, etc.
      • Tracks changes:
        • This data set should store the history of a customer.
        • Example: for customer John, where he lived during different periods of time.
    • Product
    • Employee
    • Cost center
    • Profit center
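The fact/dimension split above can be sketched as a toy star schema. In this sketch, in-memory SQLite stands in for the warehouse engine, and all table, column and customer names are illustrative; the query answers the earlier example question (profit from Texas customers during the Christmas season, per year).

```python
import sqlite3

# In-memory SQLite stands in for the warehouse engine; schema names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT);
    CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, year INTEGER, season TEXT);
    CREATE TABLE fact_sales   (customer_id INTEGER, date_id INTEGER, profit REAL);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "John", "TX"), (2, "Mary", "CA")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(10, 2016, "Christmas"), (11, 2017, "Christmas")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 100.0), (1, 11, 150.0), (2, 10, 80.0)])

# "Profit generated by Texas customers during the Christmas season, per year"
rows = conn.execute("""
    SELECT d.year, SUM(f.profit)
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    JOIN dim_date d     ON f.date_id = d.date_id
    WHERE c.state = 'TX' AND d.season = 'Christmas'
    GROUP BY d.year ORDER BY d.year
""").fetchall()
print(rows)  # [(2016, 100.0), (2017, 150.0)]
```

The fact table holds only keys and measures; all descriptive attributes live in the dimensions, which is what lets one fact be sliced by any combination of dimensions.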

Top down

Build an EDW with a 3NF (third normal form) model that tries to represent the whole business. A typical 3NF model will include:

  • People
  • Asset holdings
  • Contracts
  • Product
  • Event
  • Campaign
  • Medium
  • Location
  • Organization

ETL

In a data warehouse, ETL provides the skeleton of the framework and consumes about 80% of the effort; the rest goes to the presentation layer. ETL includes data modeling, ETL architecture, metadata management, and ETL development and maintenance. The core handling of data thus differs between Hadoop and an EDW: Hadoop is ingested with raw data, with processing and quality a later concern, whereas an EDW stores cleaned, integrated and tested data that can be consumed with confidence.
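The "cleaned, integrated, tested" contract of an EDW can be sketched as a minimal quality gate: raw records come in, and only rows that pass every rule reach the consumption layer. Field names and rules here are illustrative, not from any particular tool.

```python
# Minimal sketch of an EDW-style cleansing step; field names and rules are illustrative.
RAW = [
    {"cust_id": "1", "state": "tx ", "profit": "100.0"},
    {"cust_id": "2", "state": "CA", "profit": "bad"},   # fails the type-quality rule
]

def clean(record):
    """Type-cast and standardize a raw record; return None if it fails a quality rule."""
    try:
        return {"cust_id": int(record["cust_id"]),
                "state": record["state"].strip().upper(),
                "profit": float(record["profit"])}
    except ValueError:
        return None   # in a real pipeline, rejected rows would land in an error table

loaded = [r for r in (clean(rec) for rec in RAW) if r is not None]
print(loaded)  # only the record that passed every rule reaches the consumption layer
```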

ETL Tools

ETL tools have been a significant part of building ETL infrastructure. They make it far easier to design, build, test and maintain ETL flows, at a lower or at least feasible budget in terms of developer hours; keep in mind that in a data warehouse, the budget spent on ETL development and maintenance is significant. Before stable ETL tools evolved, ETL was carried out with SQL, PL/SQL and shell scripts.

ETL tools evolved because of the following advantages:

  • Graphical user interface
  • Short learning curve
  • Ease of development
  • Ease of maintenance

Thus complex ETL logic built on SQL, PL/SQL and shell scripts was migrated to ETL tools, which addressed two pain points:

  • Complexity in data-processing logic was reduced by replacing complex SQL.
  • Complexity in workflow design was reduced by replacing complex shell scripts and scheduling logic.

Well, why should ETL tools be such a concern when discussing EDW migration?

Because many EDW migration efforts fail by undermining the importance of ETL tools: bypassing them, losing all their advantages, and falling back to a scripting approach on Hadoop. The key point is that ETL tools were adopted precisely to bring down the budget spent on man hours.

So should I keep my ETL tool untouched?

Hadoop should be used at the points where an ETL tool struggles. It is tough for ETL tools to handle enterprise-size loads, so they offer two options:

  • Running ETL logic on a grid of servers (example: Informatica grid).
  • Leveraging the power of an MPP system (example: with Informatica's full push-down optimization, ETL logic is built and maintained in Informatica, whereas the entire processing is carried out in Teradata).

A full push-down approach of an ETL tool on top of Hadoop would be ideal, but the maturity of this approach has to be considered.
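The push-down idea can be sketched in a few lines: the tool owns the transformation logic, but instead of pulling rows out and processing them itself, it can render the same logic as SQL to execute where the data lives. The filter spec, table name and SQL shape below are illustrative, not any vendor's actual API.

```python
# Sketch of "push down": the tool holds the logic, but can render it as SQL so the
# heavy lifting runs inside the MPP/Hadoop engine. All names are illustrative.
FILTER = {"column": "state", "equals": "TX"}

def run_locally(rows, spec):
    """Tool-side execution: pull the rows and filter them in the ETL engine itself."""
    return [r for r in rows if r[spec["column"]] == spec["equals"]]

def push_down_sql(table, spec):
    """Engine-side execution: render the same logic as SQL to run where the data lives."""
    return f"SELECT * FROM {table} WHERE {spec['column']} = '{spec['equals']}'"

rows = [{"state": "TX"}, {"state": "CA"}]
print(run_locally(rows, FILTER))       # [{'state': 'TX'}]
print(push_down_sql("sales", FILTER))  # SELECT * FROM sales WHERE state = 'TX'
```

Either path produces the same result; the difference is purely where the computation happens, which is why push-down preserves the ETL tool's maintainability while borrowing the engine's horsepower.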

Scenarios to consider for an EDW-Hadoop merge

Let us classify a data warehouse or EDW into the following categories by data volume and computational needs.

Small (Don’t consider merging this with Hadoop now)

  • The data size occupied by the consumption layer can be handled by an RDBMS running on a single medium- or high-powered server, and users enjoy low latency on current reports. The rate of growth in data size and computational needs is well below the rate of hardware improvement for a single server in terms of number of cores, memory and storage.

Example

  • Data marts built to serve a small and limited set of users
  • An EDW for a small business, or for a business that is not data-oriented (example: a law firm)

Medium (Strongly consider Hadoop)

At present the scenario is as in the previous case, but the rate of growth (in data volume and computational needs) is high, and management is considering a move from an RDBMS to an MPP system.

Large (Must merge with Hadoop)

This is an enterprise-wide data warehouse running on an MPP system, a typical scenario in many enterprises across industries. This kind of data warehouse grows from a few hundred terabytes to a few petabytes.

How to merge EDW with Hadoop

An EDW is made up of staging, transformation and consumption layers.

Consumption layer

  • This is the data repository that stores ready-to-consume data. This layer can be the first to go into Hadoop using Hive, as Hive supports most of the features of an MPP system:
    • Parallel computation
    • Support for SQL syntax
    • Mature drivers to connect with leading BI tools
    • Ability to load and unload huge volumes of data in a short time
    • Latency similar to, or slightly higher than, the existing MPP system

There is a misconception that latency in a data warehouse is in seconds while in Hadoop it is in minutes. This is true for smaller warehouses of a few hundred GB that are hostable on a single server, but for enterprise warehouses of a few hundred TB hosted on an MPP system, the average latency easily reaches 5 to 8 minutes for ad-hoc reports. Note also that latency is a concern only for ad-hoc reports; in a typical data warehouse, 40 to 70% of reports are canned reports, and for that category latency is not a concern at all.

Staging layer

This is another data store before the consumption layer; data extracted from different source systems lands here. This layer is used by ETL tools to clean and integrate data. It can be the next to move into Hadoop, using either HDFS or Hive. Traditionally, data was staged on Linux servers, an RDBMS or a mainframe.
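One common way to organize a staging layer on HDFS is a directory per source system and load date, so each day's extract lands in its own partition-style path. The base path and source name below are hypothetical, purely to illustrate the layout.

```python
from datetime import date
from pathlib import PurePosixPath

# Sketch of a common HDFS staging layout: one directory per source system and
# load date. The base path and source names are illustrative, not a standard.
def staging_path(base, source, load_date):
    """Build the landing directory for one source system's daily extract."""
    return PurePosixPath(base) / source / f"load_date={load_date.isoformat()}"

p = staging_path("/data/staging", "crm", date(2017, 12, 26))
print(p)  # /data/staging/crm/load_date=2017-12-26
```

The `load_date=...` naming mirrors Hive's partition-directory convention, so a staged directory tree can later be exposed as a partitioned Hive external table with no reshuffling.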

Transformation layer

This layer processes the data in staging to clean it and integrate it with data in the consumption layer. Being the core layer holding the business logic, it is one of the most effort-consuming layers. It should be the last layer to go into Hadoop, and only with careful consideration: when the economics of moving it into Hadoop are justified over the current ETL tool approach.

The ideal is to keep business logic in ETL tools and leverage Hadoop's cheaper computation power through the push-down approach.

  • Fact
    • Transaction (insert once, read many): a Hive table can solve this.
    • Accumulating (insert once, update a row multiple times, read many): Hive with a workaround for update logic can solve this.
    • Periodic (insert once, read many): a Hive table can solve this.
  • Dimension
    • Date and time: a plain Hive table.
    • Customer (insert once, update a row occasionally): a Hive table can solve this.
    • Product (insert once, update a row occasionally): a Hive table can solve this.
    • Cost center (insert once, update a row occasionally): a Hive table can solve this.
    • Profit center (insert once, update a row occasionally): a Hive table can solve this.
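The "workaround for update logic" mentioned above usually means rewriting a table or partition rather than updating rows in place, since classic (pre-ACID) Hive had no row-level UPDATE: the current snapshot is merged with the changed rows, keyed by the dimension key, and the result replaces the old data. A minimal sketch of that merge, with illustrative data and key names:

```python
# Sketch of the Hive rewrite workaround for "insert once, update occasionally"
# dimensions: overlay changed rows on the current snapshot, keyed by the
# dimension key, then rewrite the table. Data and key names are illustrative.
def rewrite_merge(current, changes, key):
    """Return the new snapshot: changed rows replace current rows with the same key."""
    merged = {row[key]: row for row in current}
    merged.update({row[key]: row for row in changes})   # changed rows win
    return sorted(merged.values(), key=lambda r: r[key])

current = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Dallas"}]
changes = [{"id": 2, "city": "Houston"}, {"id": 3, "city": "Plano"}]
merged_rows = rewrite_merge(current, changes, "id")
print(merged_rows)
# [{'id': 1, 'city': 'Austin'}, {'id': 2, 'city': 'Houston'}, {'id': 3, 'city': 'Plano'}]
```

In Hive terms this corresponds to writing the merged result to a new table or partition and swapping it in, which is why dimensions that update only occasionally are a good fit while heavily updated facts need more care.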

Why would an “EDW migration into Hadoop” fail?

  • Seeing the data warehouse as merely a voluminous data store. (A data warehouse is actually built from defined data-processing rules and tested data pipelines, through the confluence of business processes and SMEs.)
  • Trying to migrate all possible pieces and mechanisms into Hadoop instead of moving only what is feasible.
  • Aggressively saving money on the cost of data storage and computation power while neglecting the cost in man hours to clean, integrate and maintain business data.
  • Migrating pieces from the EDW into Hadoop based on what is possible instead of what is feasible.

References

[1] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition.

[2] W. H. Inmon, Building the Data Warehouse.

Last updated: 12-26-2017