rbiswas1 · 07-13-2017

Design approach

Both designs build on the work done in the Jira below, which models a DataNode as a collection of heterogeneous storage types with different durability and performance requirements.

https://issues.apache.org/jira/browse/HDFS-2832

Design 1

1) A hot-data table whose partitions are wholly hosted by HDFS.

2) A cold-data table whose partitions are wholly hosted by S3.

3) A view that unions these two tables. The view is the live table exposed to end users.
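As a rough illustration of Design 1, the view in step 3 could be defined with a HiveQL `UNION ALL` over the two tables. The sketch below only prints the statement; the names `db.sales`, `db.sales_hot` and `db.sales_cold` are hypothetical and do not come from the article, and both tables are assumed to share the same column layout.

```shell
#!/usr/bin/env bash
# Sketch only: builds the HiveQL for a Design 1 style view.
# All table and view names passed in are hypothetical placeholders.

build_union_view_sql() {
  local view="$1" hot="$2" cold="$3"
  # Assumes the hot and cold tables have identical schemas.
  cat <<EOF
CREATE VIEW IF NOT EXISTS ${view} AS
SELECT * FROM ${hot}
UNION ALL
SELECT * FROM ${cold};
EOF
}

# Print the statement for a hypothetical pair of tables.
build_union_view_sql "db.sales" "db.sales_hot" "db.sales_cold"
```

The printed statement could then be passed to `hive -e` or saved to a file and run with `hive -f`. Applications would have to query the view instead of the original table, which is the porting cost discussed under Design 2.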

Design 2

1) Hot data with partitions that are wholly hosted by HDFS.

2) Cold data with partitions that are wholly hosted by S3.

3) Both hot and cold partitions belong to the same table; only the storage location of each partition differs.

Design 2 is chosen over Design 1 because Design 1 is not transparent to the application layer: replacing the original table with a view would push some porting and integration work onto the application.
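Under Design 2, a single table can mix HDFS and S3 partitions because Hive stores a location per partition. A minimal sketch of the statement that repoints one partition is shown here; it only prints the HiveQL. The partition column `dt`, its value, and the bucket `example-bucket` are assumptions made for illustration, and this is not taken from the automation script itself.

```shell
#!/usr/bin/env bash
# Sketch only: builds the HiveQL that repoints one partition to a new location.
# The partition spec and S3 path used below are hypothetical.

build_set_location_sql() {
  local table="$1" partition_spec="$2" location="$3"
  printf "ALTER TABLE %s PARTITION (%s) SET LOCATION '%s';\n" \
    "$table" "$partition_spec" "$location"
}

# Print the statement for one hypothetical cold partition.
build_set_location_sql "schema_name.test_table" "dt='2017-01-01'" \
  "s3a://example-bucket/warehouse/test_table/dt=2017-01-01"
```

Because the table name does not change, queries issued by the application keep working without modification; only the partition metadata points somewhere else.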

Architecture Diagram

High Level Design

Automation Flow Diagram

Code

Automation tool codebase

https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/hive_hybrid_storage.sh

Example configuration file

https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/test_table.conf

Setup & Run

Setup

cd /root/scripts/dataCopy
vi hive_hybrid_storage.sh    # paste the script here
chmod 755 hive_hybrid_storage.sh
cd /root/scripts/dataCopy/conf
vi test_table.conf           # list the cold partition names here

Run

Option 1

Retain the HDFS partition and delete it manually after verifying the data.

./hive_hybrid_storage.sh schema_name.test_table test_table.conf retain

Option 2

Delete the HDFS partition as part of the script. The script deletes it only after the data has been copied to S3, so you can still copy the data back if you want to revert the partition's location to HDFS.

./hive_hybrid_storage.sh schema_name.test_table test_table.conf delete
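The difference between the two options can be sketched as a dry-run planner that prints the commands for one partition instead of running them. This is not the contents of hive_hybrid_storage.sh; the function name, paths, bucket, and partition spec are hypothetical, and the sequence (copy, repoint, then optionally delete) is an assumption based on the description above.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the actual hive_hybrid_storage.sh logic.
# Prints the commands for moving one partition to S3; nothing is executed.
# All paths, the bucket name, and the partition spec are hypothetical.

plan_partition_move() {
  local table="$1" spec="$2" hdfs_dir="$3" s3_dir="$4" mode="$5"

  case "$mode" in
    retain|delete) ;;
    *) echo "mode must be 'retain' or 'delete'" >&2; return 1 ;;
  esac

  # 1. Copy the partition directory from HDFS to S3.
  echo "hadoop distcp ${hdfs_dir} ${s3_dir}"
  # 2. Point the partition metadata at the S3 copy.
  echo "hive -e \"ALTER TABLE ${table} PARTITION (${spec}) SET LOCATION '${s3_dir}'\""
  # 3. Only in delete mode: remove the HDFS copy, after the steps above.
  if [ "$mode" = "delete" ]; then
    echo "hdfs dfs -rm -r ${hdfs_dir}"
  fi
}

# Show the plan for one hypothetical partition in retain mode.
plan_partition_move "schema_name.test_table" "dt='2017-01-01'" \
  "hdfs:///apps/hive/warehouse/test_table/dt=2017-01-01" \
  "s3a://example-bucket/test_table/dt=2017-01-01" retain
```

In retain mode the HDFS directory is left in place for manual cleanup after verification; in delete mode the removal is the last step, so the S3 copy already exists if the partition ever needs to be moved back.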

For part 1 of this article, refer to the following link:

https://community.hortonworks.com/content/kbentry/113932/hive-hybrid-storage-mechanism-to-reduce-sto...

Cloudera Community