I'm currently running some proofs of concept in a new HDP 2.4 cluster that I'm virtualizing on an all-flash SAN back-end.
I've been reading articles such as http://www.bluedata.com/blog/2015/12/separating-hadoop-compute-and-storage/ and http://www.infostor.com/disk-arrays/hadoop-storage-options-time-to-ditch-das.html.
My question is: are there any concerns or design considerations when doing this with HDP? Would this essentially mean having one set of nodes running purely HDFS DataNodes (and RegionServers for HBase) as the storage tier, and another set of nodes running YARN and MR for compute processing? The whole concept of splitting compute and storage is very new to me; I'm used to having all machines be identical, with the DAS approach I currently use in my production environment. Also, what would the configuration files look like for this? I assume the DataNode directories parameter (dfs.datanode.data.dir) would point at the shared SAN-backed storage for HDFS? That would mean the SAN would have to be set up with native HDFS volumes. Correct me if I'm wrong.
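For example, I'm imagining something like this hdfs-site.xml fragment on the storage-tier nodes (the mount points are hypothetical), where each DataNode directory sits on a SAN-backed LUN that has been formatted with a local filesystem (ext4/XFS) and mounted on the node — as I understand it, HDFS runs on top of a local filesystem rather than on special volumes on the array:

```xml
<!-- hdfs-site.xml fragment; /grid/0 and /grid/1 are hypothetical mount
     points for SAN-backed LUNs mounted locally on the storage nodes.
     dfs.datanode.data.dir is simply a comma-separated list of those mounts. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/grid/0/hadoop/hdfs/data,/grid/1/hadoop/hdfs/data</value>
</property>
```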
I realize this may go against the past fundamentals people have on how to use this software, but like I said, this is purely for R&D PoC testing.
@Dezka Dex As you mentioned, this goes against the fundamentals of Hadoop. The first question I'd ask is about your workloads. A SAN is remote: what kind of pipe is attached to it, and what other applications are sharing it? You mention MR and YARN. If you are still doing heavy batch processing, consider the following (please substitute your own reasonable assumptions):
50 MB/s per disk throughput. Assume 1 job using 20 disks in parallel. That's 1000 MB/s throughput per job.
Is it reasonable to assume 5 jobs running concurrently on your cluster?
That is 5 GB/s of required throughput, or 5 x 8 = 40 Gb/s.
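As a quick sanity check, the arithmetic above can be sketched as follows (all the input numbers are my assumptions; plug in your own measurements):

```python
# Back-of-envelope SAN bandwidth estimate for batch MR jobs.
# All inputs are assumptions from the discussion above.
disk_throughput_mb_s = 50   # sustained MB/s per disk
disks_per_job = 20          # disks one job reads in parallel
concurrent_jobs = 5         # jobs running at the same time

per_job_mb_s = disk_throughput_mb_s * disks_per_job   # 1000 MB/s per job
total_mb_s = per_job_mb_s * concurrent_jobs           # 5000 MB/s = 5 GB/s
total_gbit_s = (total_mb_s / 1000) * 8                # bytes -> bits: 40 Gb/s

print(f"Required SAN bandwidth: {total_gbit_s:.0f} Gb/s")
```

If the pipe to the SAN (and its share of the array's controllers) can't sustain that figure alongside the other applications, batch jobs will queue on I/O.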
This is only for Hadoop, and only for moving data from storage to compute. I am assuming you are sharing the SAN with other applications; those applications will likely be impacted.
If my assumptions are way off, please plug in your own numbers. Since this is R&D, you can run the POC to find out. That said, in your POC, see if you can also run the other applications that will be sharing the SAN and measure the impact.
So it's not only about the performance of the Hadoop applications, but also of the other applications that will be sharing the SAN.
Thanks for the reply. The SAN is a Nimble array and the VMs hosting Hadoop sit on a Cisco UCS environment. We have Fibre Channel connectivity between the UCS and the SAN, so essentially my throughput is around 16 Gb/s, running around 1.6k IOPS as far as Hadoop is concerned. I've already run a few of the jobs we run in production in this environment and had great results, so performance and latency aren't a concern at this point. We use the UCS / SAN with other environments, with resource / storage policies in place to segregate resources as much as possible.
I'm just trying to take this further by separating the storage and compute nodes, and I'm curious what that configuration / architecture would look like for an HDP deployment. Thanks for your input.