Scalable, inexpensive storage and parallel processing are at the foundation of Hadoop. These capabilities allow Hadoop to play a key role in complementing and optimizing traditional Enterprise Data Warehouse (EDW) workloads. In addition, recent technologies like Hive LLAP (in-memory, long-running execution threads) with ACID merge, and AtScale's proprietary software (virtual OLAP cubes with aggregation caching in HDFS), now allow fast BI directly against TB- and PB-scale data on Hadoop for the first time, using well-known BI tools (e.g., Tableau).
This article describes reference architectures for three use cases of EDW optimization:
1. Data archival: offload aged data from the EDW to Hadoop.
2. ETL offload: move staging data and ETL processing from the EDW to Hadoop.
3. BI on Hadoop: augment or replace the EDW with OLAP directly on Hadoop.
Note that any of the architectures below can be implemented alone or in combination, depending on your needs and strategic roadmap. In each diagram below, red represents the EDW optimization data architecture and black represents the existing data architecture.
In this use case, aged data is offloaded to Hadoop instead of being stored on the EDW or on archival storage such as tape. Offload can be performed with tools such as Sqoop (native to Hadoop) or an ETL tool like Syncsort DMX-h (proprietary, integrated with the Hadoop framework, including YARN and MapReduce); a sketch of a Sqoop-based offload follows below.
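As an illustration, here is a minimal sketch of offloading aged rows with a Sqoop import driven from Python. The JDBC URL, table, column names, and cutoff date are hypothetical placeholders, not part of the original architecture; a production offload would be parameterized and scheduled (e.g., via Oozie or cron).

```python
import subprocess

# Hypothetical EDW connection details and table names -- adjust for your environment.
JDBC_URL = "jdbc:oracle:thin:@edw-host:1521/EDWDB"
SOURCE_TABLE = "SALES_FACT"
CUTOFF_DATE = "2015-01-01"               # rows older than this are considered "aged"
TARGET_DIR = "/data/archive/sales_fact"  # landing directory in HDFS

# Build a Sqoop import that pulls only the aged rows into HDFS.
# --where limits the extract; -m sets the number of parallel map tasks.
sqoop_cmd = [
    "sqoop", "import",
    "--connect", JDBC_URL,
    "--username", "etl_user",
    "--password-file", "/user/etl_user/.edw_password",
    "--table", SOURCE_TABLE,
    "--where", "TXN_DATE < DATE '{}'".format(CUTOFF_DATE),
    "--target-dir", TARGET_DIR,
    "--as-parquetfile",                  # columnar files keep the archive queryable
    "-m", "4",
]

# Run the import; a non-zero exit code raises CalledProcessError.
subprocess.run(sqoop_cmd, check=True)
```

Once the files land in HDFS, an external Hive table can be defined over the target directory so the archived rows remain queryable without ever returning to the EDW.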
Benefits:
In this use case, both staging data and ETL are offloaded from the EDW to Hadoop. Raw data is stored on the data lake and processed into cleaned and standardized data that feeds Hive LLAP tables (a sketch of this Hive-side merge step follows below). The cleaned and standardized data is then transformed into structures that are exported to the existing EDW. BI users continue to use the existing EDW unchanged, unaware that the plumbing beneath has changed.
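The Hive-side merge step might look like the following minimal sketch, which assumes HiveServer2 Interactive (LLAP) is reachable through PyHive and that the staging and curated tables already exist; the host, database, table, and column names are hypothetical.

```python
from pyhive import hive

# Hypothetical LLAP endpoint and schema -- adjust for your cluster.
conn = hive.connect(host="llap-host", port=10500, database="curated")
cursor = conn.cursor()

# Upsert the latest staged records into a transactional (ACID) Hive table.
# The target must be an ORC, bucketed table created with transactional=true.
cursor.execute("""
    MERGE INTO curated.customer t
    USING staging.customer_delta s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET
        name = s.name,
        email = s.email,
        last_updated = s.load_ts
    WHEN NOT MATCHED THEN INSERT VALUES
        (s.customer_id, s.name, s.email, s.load_ts)
""")

cursor.close()
conn.close()
```

The final transformation into EDW-facing structures and the export back to the warehouse could then be handled by a Sqoop export or by the ETL tool, so BI users downstream notice no change.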
Benefits:
This use case is identical to the EDW offload described above, except that the EDW is either fully replaced or augmented by OLAP on Hadoop. For greenfield environments, replacement (i.e., avoiding a traditional OLAP deployment in the first place) is particularly attractive; the sketch below shows the kind of query this architecture serves.
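To make the idea concrete, the sketch below issues the kind of OLAP-style rollup a BI tool such as Tableau would push down, running it directly against Hive LLAP through PyHive. The star-schema tables and columns are hypothetical; an OLAP engine such as Druid, Jethro, or AtScale would serve the same class of query from pre-built aggregates or indexes.

```python
from pyhive import hive

# Hypothetical LLAP endpoint and star-schema tables -- substitute your own.
conn = hive.connect(host="llap-host", port=10500, database="curated")
cursor = conn.cursor()

# A rollup over the fact table; WITH CUBE returns every grouping combination
# (by region, by category, by both, and the grand total), as an OLAP cube would.
cursor.execute("""
    SELECT s.region, p.product_category, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN product_dim p ON f.product_id = p.product_id
    JOIN store_dim s ON f.store_id = s.store_id
    GROUP BY s.region, p.product_category WITH CUBE
""")

for region, category, total in cursor.fetchall():
    print(region, category, total)

cursor.close()
conn.close()
```

Because LLAP keeps long-running executors and cached data in memory, this class of query can come back at interactive speed even over very large fact tables, which is what makes BI directly on Hadoop practical.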
Benefits:
Notes on OLAP on Hadoop:
Traditional Enterprise Data Warehouses are feeling the strain of the modern Big Data era:
- They are unsustainably expensive.
- The majority of their storage and processing is typically dedicated to prep work for BI queries, not to the queries themselves (the purpose of the warehouse).
- It is difficult to store varied data such as semi-structured social and clickstream feeds.
- They are constrained in how much data they can hold, for both cost and scaling reasons.
Hadoop's scalable, inexpensive storage and parallel processing can be used to optimize existing EDWs by offloading data and ETL to the platform. Moreover, recent technologies like Hive LLAP, Druid, and Jethro allow you to move your warehouse to Hadoop and run your BI tooling (Tableau, MicroStrategy, etc.) directly against TBs and PBs of data on Hadoop. The reference architectures in this article show how to arrange your data on and off of Hadoop to achieve significant gains in cost, performance, and Big Data strategy. Why wait?
References:
- Hive LLAP and ACID merge
- Syncsort DMX-h
- Jethro
- Druid