Created on 07-13-2017 01:11 AM - edited 08-17-2019 11:58 AM
Design approach
Both designs depend on the work done in the Jira below, where a DataNode is conceptualized as a collection of heterogeneous storage with different durability and performance requirements.
Design 1
1) Hot data with partitions that are wholly hosted by HDFS.
2) Cold data with partitions that are wholly hosted by S3.
3) A view that unions these two tables. This view is the live table that we expose to end users.
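As a rough sketch of the view in this first design, the snippet below builds the HiveQL for a view over a hot table and a cold table. The names `sales_live`, `sales_hot`, and `sales_cold` are hypothetical and used only for illustration; it also assumes both tables share the same schema and that the Hive version in use accepts a top-level `UNION ALL` in a view definition.

```python
def union_view_ddl(view: str, hot_table: str, cold_table: str) -> str:
    """Build HiveQL for a view that unions the hot and cold tables."""
    return (
        f"CREATE VIEW IF NOT EXISTS {view} AS\n"
        f"SELECT * FROM {hot_table}\n"
        f"UNION ALL\n"
        f"SELECT * FROM {cold_table}"
    )


# Hypothetical names: the article does not name the tables or the view.
ddl = union_view_ddl("sales_live", "sales_hot", "sales_cold")
print(ddl)
```

End users would query the view rather than either underlying table, which is why this design requires applications to switch from the old table name to the view.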
Design 2
1) Hot data with partitions that are wholly hosted by HDFS.
2) Cold data with partitions that are wholly hosted by S3.
3) Both hot and cold data are in the same table.
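One way to keep hot and cold partitions in a single table is to point individual partitions at different storage locations. The sketch below builds a Hive `ALTER TABLE ... PARTITION ... SET LOCATION` statement for that purpose. The table name `sales`, the partition key `dt`, and the bucket `example-bucket` are assumptions made for illustration, not values from the original setup.

```python
def set_partition_location(table: str, partition: dict, location: str) -> str:
    """Build HiveQL that repoints one partition of a table to a new location."""
    # Render the partition spec, e.g. {"dt": "2017-01-01"} -> dt='2017-01-01'
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return f"ALTER TABLE {table} PARTITION ({spec}) SET LOCATION '{location}'"


# Hypothetical table, partition, and bucket names.
stmt = set_partition_location(
    "sales",
    {"dt": "2017-01-01"},
    "s3a://example-bucket/warehouse/sales/dt=2017-01-01",
)
print(stmt)
```

Because the table name does not change, queries from the application layer keep working regardless of where a given partition is stored.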
Design 2 is chosen over Design 1 because Design 1 is not transparent to the application layer. Switching from the old table to the view would inherently push some porting and integration work onto the application.
The script deletes the HDFS partition, but only after the data has been copied to S3. If you want to revert the location of a partition to HDFS, you can copy the data back from S3.
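The ordering described above (copy first, delete last) can be sketched as a list of shell commands. This is only an illustration under stated assumptions, not the original script: the use of `hadoop distcp` for the copy, the explicit `ALTER TABLE` repointing step, and all table names and paths are hypothetical.

```python
def tiering_steps(table: str, partition_spec: str, hdfs_path: str, s3_path: str) -> list:
    """Commands to move one partition to S3; the HDFS delete comes last."""
    return [
        # 1. Copy the partition data to S3 (assumes distcp is the copy tool).
        f"hadoop distcp {hdfs_path} {s3_path}",
        # 2. Repoint the partition at the S3 copy (assumed step).
        f"hive -e \"ALTER TABLE {table} PARTITION ({partition_spec}) SET LOCATION '{s3_path}'\"",
        # 3. Only now remove the HDFS copy.
        f"hdfs dfs -rm -r {hdfs_path}",
    ]


def revert_steps(table: str, partition_spec: str, hdfs_path: str, s3_path: str) -> list:
    """Commands to bring a partition back to HDFS from its S3 copy."""
    return [
        f"hadoop distcp {s3_path} {hdfs_path}",
        f"hive -e \"ALTER TABLE {table} PARTITION ({partition_spec}) SET LOCATION '{hdfs_path}'\"",
    ]


# Hypothetical table, partition, and paths.
args = (
    "sales",
    "dt='2017-01-01'",
    "hdfs://nameservice1/warehouse/sales/dt=2017-01-01",
    "s3a://example-bucket/warehouse/sales/dt=2017-01-01",
)
steps = tiering_steps(*args)
undo = revert_steps(*args)
for command in steps:
    print(command)
```

Keeping the delete as the final step means a failed copy leaves the HDFS data untouched, and the S3 copy remains available as the source for a later revert.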