Design approach

Both designs build on the work tracked in the Jira below (HDFS-2832), in which a DataNode is modeled as a collection of heterogeneous storages with different durability and performance requirements.

https://issues.apache.org/jira/browse/HDFS-2832

Design 1

1) A table for hot data, with partitions that are wholly hosted by HDFS.

2) A table for cold data, with partitions that are wholly hosted by S3.

3) A view that unions these two tables. This view is the live table exposed to end users.
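
As a rough illustration of Design 1, the helper below prints the HiveQL for a view that unions a hot table and a cold table. It is a minimal sketch, not part of the automation tool: the function name `union_view_ddl` and the table and view names in the example call are hypothetical, and it assumes both tables share the same schema.

```shell
#!/usr/bin/env bash
# Print HiveQL for a view that unions a hot (HDFS) table and a cold (S3) table.
# All names passed in are hypothetical placeholders; both tables are assumed
# to have identical columns.
union_view_ddl() {
  local view="$1" hot="$2" cold="$3"
  cat <<EOF
CREATE VIEW IF NOT EXISTS ${view} AS
SELECT * FROM (
  SELECT * FROM ${hot}
  UNION ALL
  SELECT * FROM ${cold}
) unioned;
EOF
}

# Example: print the statement for review before running it in Hive.
union_view_ddl demo.events_live demo.events_hot demo.events_cold
```

Applications would then have to query `demo.events_live` instead of the original table, which is the integration cost discussed for this design.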

Design 2

1) Hot data with partitions that are wholly hosted by HDFS.

2) Cold data with partitions that are wholly hosted by S3.

3) Both hot and cold data are in the same table.

Design 2 is chosen over Design 1 because Design 1 is not transparent to the application layer: switching from the existing table to the view would push some porting and integration work onto the application.
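
To make Design 2 concrete, one way to keep hot and cold data in the same table is to repoint individual partitions at S3 with Hive's `ALTER TABLE ... PARTITION ... SET LOCATION`. The helper below only prints such a statement. It is a sketch under assumptions: the function name, table, partition column `dt`, and bucket are hypothetical, and the source does not confirm that the automation tool issues exactly this statement.

```shell
#!/usr/bin/env bash
# Print the HiveQL that repoints one partition of an existing table to S3.
# Table name, partition spec, and S3 path are hypothetical placeholders.
cold_partition_ddl() {
  local table="$1" part_spec="$2" s3_path="$3"
  printf "ALTER TABLE %s PARTITION (%s) SET LOCATION '%s';\n" \
    "$table" "$part_spec" "$s3_path"
}

# Example: the table name stays the same, only the partition's storage changes.
cold_partition_ddl demo.events "dt='2016-01-01'" \
  "s3a://example-bucket/warehouse/events/dt=2016-01-01"
```

Because the table name does not change, existing queries keep working without modification, which is the transparency argument made above.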

Architecture Diagram

High Level Design

19417-architecture.jpg

Automation Flow Diagram

19418-flow.jpg

Code

Automation tool codebase

https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/hive_hybrid_storage.sh

Example configuration file

https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/test_table.conf

Setup & Run

Setup

  1. cd /root/scripts/dataCopy
  2. vi hive_hybrid_storage.sh ##Put the script here
  3. chmod 755 hive_hybrid_storage.sh
  4. cd /root/scripts/dataCopy/conf
  5. vi test_table.conf ##This is where the cold partition names are placed

Run

Option 1

Retain the HDFS partition and delete it manually after data verification.

./hive_hybrid_storage.sh schema_name.test_table test_table.conf retain

Option 2

Delete the HDFS partition as part of the script. The partition is deleted only after the data has been copied to S3, so you can still copy it back if you want to revert the partition's location to HDFS.

./hive_hybrid_storage.sh schema_name.test_table test_table.conf delete
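
The difference between the two modes can be sketched as a dry run that prints the commands for one partition instead of executing them. This is not the logic of `hive_hybrid_storage.sh`; the function name, paths, bucket, and partition column are hypothetical, and the copy, repoint, and delete steps are an assumed ordering based on the descriptions above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the commands a copy-then-repoint flow might run for
# one partition. Nothing is executed; all paths and names are hypothetical.
plan_partition_move() {
  local table="$1" part_spec="$2" hdfs_path="$3" s3_path="$4" mode="$5"

  case "$mode" in
    retain|delete) ;;
    *) echo "mode must be 'retain' or 'delete'" >&2; return 1 ;;
  esac

  # 1. Copy the partition directory from HDFS to S3.
  echo "hadoop distcp ${hdfs_path} ${s3_path}"
  # 2. Repoint the Hive partition to the S3 copy.
  echo "hive -e \"ALTER TABLE ${table} PARTITION (${part_spec}) SET LOCATION '${s3_path}';\""
  # 3. In delete mode only, remove the HDFS copy after the steps above.
  if [ "$mode" = "delete" ]; then
    echo "hdfs dfs -rm -r ${hdfs_path}"
  fi
}

# Example: 'retain' prints two commands, 'delete' would print a third.
plan_partition_move demo.events "dt='2016-01-01'" \
  /apps/hive/warehouse/demo.db/events/dt=2016-01-01 \
  s3a://example-bucket/events/dt=2016-01-01 retain
```

Printing the plan first gives a chance to check the source and target paths before any data is removed from HDFS.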

For part 1 of this article, see the following link:

https://community.hortonworks.com/content/kbentry/113932/hive-hybrid-storage-mechanism-to-reduce-sto...
