Is there a best practice for Hadoop storage tiering in HDP?
What would be the recommended replication factor for hot, warm, and cold storage? Which architecture would be recommended: separate data nodes for warm and cold storage, or hybrid disks across the entire platform? Is it possible to provision separate data nodes in an HDP cluster through Ambari, or does it need some customization at the Hadoop layer?
Hi @Ali, there's really no single right way to use HDFS tiered storage. It's a flexible framework that can be implemented to suit your particular usage requirements. A graphic recently added to some of the Data Lake 3.0 discussions is quite relevant here, and at least gives some view into one set of potential options.
The number of replicas should always be 3 for production HDFS.
I personally prefer and recommend tying tiers to specific storage node types, e.g. a hot node with 12 x SSD, a warm node with 12 x 2 TB HDD, and so on. That way, if I need to add capacity to a tier I just add a node of that type, and if a node goes down I know exactly what type I need to replace. It keeps things simpler.
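As a sketch of what that looks like at the HDFS layer (the mount paths here are hypothetical; in practice you would set these per node type via Ambari config groups), each DataNode's data directories can be tagged with a storage type in `hdfs-site.xml`:

```xml
<!-- Hot node: all directories tagged SSD (example paths) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]/grid/0/hdfs/data,[SSD]/grid/1/hdfs/data</value>
</property>

<!-- Warm/cold node: spinning disks tagged DISK or ARCHIVE (example paths) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]/grid/0/hdfs/data,[ARCHIVE]/grid/1/hdfs/data</value>
</property>
```

Directories without a tag default to DISK, which is why untiered clusters behave as they always have.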
It's all managed via Ambari, and with Erasure Coding arriving in 3.x we will see yet another potential layer appear in this design, as shown above.
I prefer to keep things simple whenever possible. You can of course mix one replica on hot, two on warm, and so on, but in my experience that gets overcomplicated very quickly.
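To tie data to those tiers you assign an HDFS storage policy per path, then run the Mover to migrate existing blocks. A minimal sketch (the `/apps/...` paths are hypothetical examples):

```shell
# List the built-in policies (HOT, WARM, COLD, ALL_SSD, ONE_SSD, LAZY_PERSIST)
hdfs storagepolicies -listPolicies

# Pin a hot dataset to SSD-tagged directories
hdfs storagepolicies -setStoragePolicy -path /apps/hot/events -policy ALL_SSD

# Age a dataset out to ARCHIVE-tagged directories
hdfs storagepolicies -setStoragePolicy -path /apps/cold/events -policy COLD

# Move already-written blocks to match their policies
hdfs mover -p /apps/hot/events /apps/cold/events
```

Note the policy only affects block placement; the replication factor stays whatever you set for the path, which is why keeping 3 replicas everywhere remains the simple option.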
I hope that helps.
Hi David Russel,
We are trying to test this scenario. Where are you defining the data-age configuration? I want to know how HDFS knows that data is 30 days old.
Thanks in advance.