Created 02-08-2016 06:15 PM
I am looking to deepen my understanding of the types of storage disks used for data nodes. Outside of the single point of failure (the SAN box goes down), are there any other reasons not to use SAN storage on data nodes? Can spindles be dedicated on the SAN, and is that even possible? How does performance compare between SAN and DAS (direct attached storage)? Any insights you can share would be appreciated.
Created 02-08-2016 06:25 PM
Hadoop is a shared-nothing architecture, and SAN storage usually goes against the grain for distributed storage in a distributed compute environment. The only central storage we support so far is Isilon, because we did some joint engineering with them. Even then, DAS has its advantages (as well as disadvantages, mainly because of 3-factor replication).
The main issue is data locality. YARN spins up containers on the compute nodes, and if the data lives on a separate SAN, every query or access has to travel at network speeds instead of being spread across the spindles local to the storage nodes. That not only increases access time, it introduces more points of failure through the switches and creates additional potential for bottlenecks.
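As a concrete illustration of that locality point, the standard HDFS FileSystem API will tell you which datanodes hold each block of a file. With DAS those hosts are the same machines YARN schedules containers on; with SAN-backed storage the hostnames no longer say anything about where the bytes physically sit. A minimal sketch (the file path is just a placeholder):

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml on the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.csv");   // placeholder path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the datanodes holding a replica; YARN tries to
            // schedule containers on (or close to) those hosts.
            System.out.println(Arrays.toString(block.getHosts()));
        }
    }
}
```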
Normally I would compromise a bit for master nodes, but I just came from a client who ran the master nodes as VMs on SAN. Performance started out great, but once multiple users came on board and the master nodes needed to handle more blocks, performance tanked. We wasted a week and a half moving the master components to physical nodes on a cluster that already held data. Painful.
See a good discussion here: http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster
Created 02-08-2016 06:21 PM
SAN is terrible for Hadoop; go with direct attached storage or Isilon NAS. SAN suffers from the noisy-neighbor problem, and since it is a shared pool of storage outside the NameNode's control, blocks can move around and lose data locality. Latency can also be an issue. One final thought: direct attached storage is made redundant by having many disks, so you tolerate failure by simply having more of them. A quick search led to this http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/ and more http://www.infoworld.com/article/2609694/application-development/never--ever-do-this-to-hadoop.html Here's our official doc http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_cluster-planning-guide/content/hardware-f...
I guess one more thing to mention: the noisy-neighbor problem also works in reverse; your Hadoop workload will affect the other applications running on your SAN.
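To expand on the "redundant by many disks" point: on a DAS datanode each physical disk is normally its own mount point in dfs.datanode.data.dir, and the datanode can be told to keep serving even after a few of those volumes die. A minimal sketch of reading those settings; the example values in the comments are typical layouts, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class DataDirSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");  // make sure the HDFS settings are loaded

        // Typical JBOD layout: one directory per physical disk, e.g.
        //   dfs.datanode.data.dir = /grid/0/hdfs/data,/grid/1/hdfs/data,...
        System.out.println(conf.get("dfs.datanode.data.dir"));

        // How many of those volumes may fail before the datanode takes itself offline.
        System.out.println(conf.getInt("dfs.datanode.failed.volumes.tolerated", 0));
    }
}
```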
Created 02-08-2016 06:32 PM
Also, the cost of SAN-backed disk compared to typical direct attached disk would be prohibitive. @Sunile Manjee
Created 02-08-2016 06:46 PM
That said, I have heard the argument that, over time, the cost of replacing and managing DAS disks under 3-factor replication makes SAN the cheaper option from a TCO perspective.
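For anyone running that TCO comparison themselves, the raw-capacity side of it is simple arithmetic. The numbers below are purely illustrative assumptions (100 TB working set, 3x replication, ~25% headroom for intermediate data, 8 TB drives):

```java
public class RawCapacityEstimate {
    public static void main(String[] args) {
        double dataTb = 100.0;          // illustrative working-set size, in TB
        int replicationFactor = 3;      // HDFS default replication
        double tempOverhead = 1.25;     // assumed ~25% extra for intermediate/temp data
        double driveTb = 8.0;           // assumed drive size, in TB

        double rawTb = dataTb * replicationFactor * tempOverhead;
        long drives = (long) Math.ceil(rawTb / driveTb);

        // => 375 TB raw and ~47 drives for 100 TB of data: the capacity overhead
        //    that the SAN-TCO argument weighs against cheaper commodity disk.
        System.out.printf("Raw DAS capacity needed: %.0f TB (~%d drives)%n", rawTb, drives);
    }
}
```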
Created 02-08-2016 06:25 PM
While it is possible (and it makes sense in some cases) to use SAN for master nodes, I would strongly encourage you not to do this with datanodes. Use bare-metal machines with directly attached storage for datanodes to optimize throughput and performance.
We have seen some very poor performance in environments where the Datanodes used SAN.
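If you want to quantify that difference on your own hardware rather than take our word for it, the simplest check is to time a large sequential write and read against HDFS on each storage layout (the stock TestDFSIO benchmark is the more rigorous option). A rough sketch using only the standard FileSystem API; the path and sizes are made-up values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsThroughputProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path testFile = new Path("/tmp/throughput-probe.dat");  // illustrative path
        byte[] buf = new byte[1024 * 1024];                     // 1 MB buffer
        long mbToWrite = 1024;                                  // ~1 GB total

        long t0 = System.nanoTime();
        try (FSDataOutputStream out = fs.create(testFile, true)) {
            for (long i = 0; i < mbToWrite; i++) {
                out.write(buf);
            }
        }
        long writeMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        try (FSDataInputStream in = fs.open(testFile)) {
            while (in.read(buf) > 0) {
                // drain the stream; we only care about elapsed time
            }
        }
        long readMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.printf("write: %d MB in %d ms, read: %d MB in %d ms%n",
                mbToWrite, writeMs, mbToWrite, readMs);
        fs.delete(testFile, false);
    }
}
```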
Created 02-08-2016 09:05 PM
@Sunile Manjee Though I have no personal experience with them, there are companies like BlueData who abstract the storage component and provide an interesting private-cloud experience based on containers. An interesting read on this subject is the Google book The Datacenter as a Computer.
Created 02-09-2016 03:08 AM
This answerhub thread is an example of how AWESOME answerhub is. Thanks all for great great great info.
Created 02-09-2016 03:41 AM
@Sunile Manjee I think Ancil's answer is the best one 😉 You are the judge.