Introduction
A traditional data warehouse archival strategy involves moving old data to offsite tapes.
This no longer fits modern analytics applications, because the archived data is unavailable for real-time business analytics. Mature Hadoop clusters need a modern data archival strategy to keep storage expenses in check as data volumes grow exponentially.
The term hybrid here designates an archival solution that is always available as well as completely transparent to the application layer.
This document will cover:
Use case
Requirements
Storage cost analysis
Design approach
Architecture diagram
Code
How to set up and run the code
Use case
The entire business data set is in HDFS (HDP clusters) backed by Amazon EBS.
A disaster recovery solution is in place. Amazon claims S3 storage delivers 99.999999999% durability; in the rare case of data loss from S3, the data has to be recovered from the disaster recovery site.
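Part 2 covers the actual design approach, but as a rough illustration of the underlying mechanism, cold data can be copied from HDFS to S3 with hadoop distcp. This is a minimal sketch; the paths, bucket name, and directory layout are hypothetical:

```python
# Hypothetical sketch: push a cold HDFS directory to S3 with distcp.
# The source path and target bucket are illustrative assumptions.
import subprocess

COLD_HDFS_DIR = "hdfs:///warehouse/sales/year=2014"
S3_TARGET = "s3a://my-archive-bucket/warehouse/sales/year=2014"

def archive_to_s3(src: str, dst: str) -> None:
    """Copy an HDFS directory to S3 using hadoop distcp.

    -update skips files that are already identical at the target,
    so the job can be re-run safely if it is interrupted.
    """
    subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)

if __name__ == "__main__":
    archive_to_s3(COLD_HDFS_DIR, S3_TARGET)
```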
Requirements
Decrease storage costs.
Archived data should be available for analytics 24x7.
Access hot and cold (archived) data simultaneously from the application.
The solution should be transparent to the application layer. In other words, absolutely no change should be required from the application layer after the hybrid archival strategy is implemented.
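One common way to satisfy the transparency requirement (a sketch, not necessarily the exact approach of part 2) is to keep a single Hive table whose hot partitions live on HDFS and whose cold partitions are repointed at their S3 copies. The table name, partition, JDBC URL, and bucket below are hypothetical:

```python
# Hypothetical sketch: repoint a cold Hive partition at its S3 copy.
import subprocess

JDBC_URL = "jdbc:hive2://hiveserver2:10000/default"  # assumed HiveServer2 URL
DDL = (
    "ALTER TABLE sales PARTITION (year=2014) "
    "SET LOCATION 's3a://my-archive-bucket/warehouse/sales/year=2014';"
)

# Hive resolves each partition's location at query time, so hot partitions
# can stay on HDFS while cold ones live on S3 within the same table.
subprocess.run(["beeline", "-u", JDBC_URL, "-e", DDL], check=True)
```

Because the partition location is resolved at query time, existing queries against the table continue to work with no change in the application layer.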
Storage cost analysis
EBS is provisioned storage, whereas S3 is pay-as-you-use.
In other words, suppose you provision 1 TB of EBS storage for future data growth. You pay for 100% of it regardless of whether you are using 0% or 90% of it.
With S3, you pay only for the storage you actually use: for 2 GB you pay for 2 GB, and for 500 GB you pay for 500 GB. Hence the S3 figure in the cost calculation is roughly halved, approximating how S3 usage grows over time relative to the fully provisioned HDFS EBS storage.
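To make this concrete, here is a back-of-the-envelope comparison. The per-GB prices are illustrative assumptions (roughly 2017-era us-east-1 list prices), not a quote:

```python
# Illustrative monthly cost comparison; the prices below are assumptions.
EBS_GP2_PER_GB_MONTH = 0.10   # provisioned: billed whether used or not
S3_STD_PER_GB_MONTH = 0.023   # billed only for bytes actually stored

provisioned_gb = 1024  # 1 TB of EBS provisioned for future growth
used_gb = 500          # data volume actually stored today

ebs_cost = provisioned_gb * EBS_GP2_PER_GB_MONTH  # full 1 TB is billed
s3_cost = used_gb * S3_STD_PER_GB_MONTH           # only 500 GB is billed

print(f"EBS: ${ebs_cost:.2f}/month for {provisioned_gb} GB provisioned")
print(f"S3 : ${s3_cost:.2f}/month for {used_gb} GB stored")
# EBS: $102.40/month, S3: $11.50/month
```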
Please refer to part 2 for the architecture of the proposed solution and the codebase: