I'm reading about the cloud connector integrations supported by HDP and I have some questions/doubts.
Context: Our environment is on-prem: an HDP 2.6.4 cluster with 5 datanodes and 2 namenodes, in which the datanodes are filling up with rarely accessed data (touched roughly once a month or less) that we want to offload to cheaper cloud storage (e.g., Azure Blob Storage). Our HDFS data is pretty much all mapped to Hive tables.
Simplified Requirements I'm trying to validate:
1) Offloaded data should remain seamlessly accessible to user workloads (usually Spark and Hive/Tez jobs), including cases where data from both local storage and cloud storage is used in the same workload.
2) The URIs of offloaded data in the local HDFS namespace should be preserved (users should not be aware that data has been offloaded). E.g., a Hive table would have some of its partitions in local storage and others in the cloud (say the partition key is yyyymmdd).
3) Hadoop storage policies, or some other data-tier management tool, should handle the offloading process.
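For requirement 2), from what I've read it looks like per-partition locations might get close, since Hive allows a partition's LOCATION to point at a different filesystem than the table's default. A sketch of what I have in mind (the `events` table, the `coldstore` container, and the `myaccount` storage account are all hypothetical names):

```sql
-- Sketch only: table, container, and account names are hypothetical.
-- Point one partition of an existing table at Azure Blob storage,
-- while the other partitions keep their default HDFS locations.
ALTER TABLE events PARTITION (yyyymmdd='20170101')
  SET LOCATION 'wasb://coldstore@myaccount.blob.core.windows.net/events/yyyymmdd=20170101';
```

If this works the way I hope, a single table could span HDFS and Blob storage transparently, though I'm unsure how well Spark and Tez handle mixed-filesystem tables in practice.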
Has anyone gone down this path who can shed some light?
I'm afraid that if, for example, requirement 2) can't be guaranteed, I'll need two Hive tables (one for local partitions and another for cloud partitions), which will have some user-experience impact, and the offload process will have to ensure that moved partitions are automatically picked up by the external table.
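In that two-table fallback, I imagine the offload step would look roughly like this (a sketch only: the paths, the `events`/`events_archive` table names, and the `coldstore`/`myaccount` names are hypothetical):

```shell
# Sketch of the fallback offload flow, assuming hypothetical names/paths.
# 1) Copy the cold partition to Azure Blob storage with DistCp:
hadoop distcp \
  hdfs:///warehouse/events/yyyymmdd=20170101 \
  wasb://coldstore@myaccount.blob.core.windows.net/events_archive/yyyymmdd=20170101

# 2) Register the moved partition on the external table so users pick it up:
hive -e "ALTER TABLE events_archive ADD IF NOT EXISTS PARTITION (yyyymmdd='20170101')
         LOCATION 'wasb://coldstore@myaccount.blob.core.windows.net/events_archive/yyyymmdd=20170101';"

# 3) Drop the partition from the local table (for a managed table this
#    also removes the underlying HDFS data):
hive -e "ALTER TABLE events DROP IF EXISTS PARTITION (yyyymmdd='20170101');"
```

Even automated like this, users would still have to know which table (or a UNION view over both) to query, which is exactly the experience impact I'd like to avoid.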