Introduction
A traditional data warehouse archival strategy involves moving old data to offsite tapes.
This no longer fits modern analytics applications, because the archived data is unavailable for real-time business analytics. Mature Hadoop clusters need a modern data archival strategy to keep storage expenses in check as data volumes grow exponentially.
The term hybrid here designates an archival solution that is always available as well as completely transparent to the application layer.
This document will cover:
Use case
Requirements
Storage cost analysis
Design approach
Architecture diagram
Code
How to set up and run the code
Use case
The entire business data set is in HDFS (HDP clusters) backed by Amazon EBS.
A disaster recovery solution is in place. Amazon claims S3 storage delivers 99.999999999% durability; in the rare case of data loss from S3, the data has to be recovered from the disaster recovery site.
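Part 2 covers the actual design approach, but as a rough illustration of the underlying mechanism, cold data can be copied from HDFS to S3 with hadoop distcp. This is a minimal sketch; the paths, bucket name, and directory layout are hypothetical:

```python
# Hypothetical sketch: push a cold HDFS directory to S3 with distcp.
# The source path and target bucket are illustrative assumptions.
import subprocess

COLD_HDFS_DIR = "hdfs:///warehouse/sales/year=2014"
S3_TARGET = "s3a://my-archive-bucket/warehouse/sales/year=2014"

def archive_to_s3(src: str, dst: str) -> None:
    """Copy an HDFS directory to S3 using hadoop distcp.

    -update skips files that are already identical at the target,
    so the job can be re-run safely if it is interrupted.
    """
    subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)

if __name__ == "__main__":
    archive_to_s3(COLD_HDFS_DIR, S3_TARGET)
```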
Requirements
Decrease storage costs.
Archived data should be available for analytics 24x7.
Access hot and cold (archived) data simultaneously from the application.
The solution should be transparent to the application layer. In other words, absolutely no change should be required from the application layer after the hybrid archival strategy is implemented.
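One common way to satisfy the transparency requirement (a sketch, not necessarily the exact approach of part 2) is to keep a single Hive table whose hot partitions live on HDFS and whose cold partitions are repointed at their S3 copies. The table name, partition, JDBC URL, and bucket below are hypothetical:

```python
# Hypothetical sketch: repoint a cold Hive partition at its S3 copy.
import subprocess

JDBC_URL = "jdbc:hive2://hiveserver2:10000/default"  # assumed HiveServer2 URL
DDL = (
    "ALTER TABLE sales PARTITION (year=2014) "
    "SET LOCATION 's3a://my-archive-bucket/warehouse/sales/year=2014';"
)

# Hive resolves each partition's location at query time, so hot partitions
# can stay on HDFS while cold ones live on S3 within the same table.
subprocess.run(["beeline", "-u", JDBC_URL, "-e", DDL], check=True)
```

Because the partition location is resolved at query time, existing queries against the table continue to work with no change in the application layer.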
Storage cost analysis
EBS is provisioned storage, whereas S3 is pay-as-you-use.
In other words, suppose you provision 1 TB of EBS storage for future data growth. You pay for 100% of it regardless of whether you are using 0% or 90% of it.
With S3, you pay only for the storage you actually use: for 2 GB you pay for 2 GB, and for 500 GB you pay for 500 GB. Hence the S3 figure in the cost calculation is roughly halved, approximating how S3 usage grows over time relative to the fully provisioned HDFS EBS storage.
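To make this concrete, here is a back-of-the-envelope comparison. The per-GB prices are illustrative assumptions (roughly 2017-era us-east-1 list prices), not a quote:

```python
# Illustrative monthly cost comparison; the prices below are assumptions.
EBS_GP2_PER_GB_MONTH = 0.10   # provisioned: billed whether used or not
S3_STD_PER_GB_MONTH = 0.023   # billed only for bytes actually stored

provisioned_gb = 1024  # 1 TB of EBS provisioned for future growth
used_gb = 500          # data volume actually stored today

ebs_cost = provisioned_gb * EBS_GP2_PER_GB_MONTH  # full 1 TB is billed
s3_cost = used_gb * S3_STD_PER_GB_MONTH           # only 500 GB is billed

print(f"EBS: ${ebs_cost:.2f}/month for {provisioned_gb} GB provisioned")
print(f"S3 : ${s3_cost:.2f}/month for {used_gb} GB stored")
# EBS: $102.40/month, S3: $11.50/month
```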
Please refer to part 2 for the architecture of the proposed solution and the codebase: