SAN vs DAS(JBOD) on data node

I am looking to deepen my understanding of the type of storage used for data nodes. Apart from the single point of failure (the SAN box going down), are there any other reasons not to use SAN storage on data nodes? Can spindles be dedicated on a SAN? Is that even possible? How does SAN performance compare with DAS (direct-attached storage)? Any insights you can share would be appreciated.

1 ACCEPTED SOLUTION

Re: SAN vs DAS(JBOD) on data node

Hadoop is a shared-nothing architecture. SAN storage usually goes against the grain for distributed storage in a distributed compute environment. The only central storage we support so far is Isilon, because we did some joint engineering with them. Even then, DAS has its advantages (as well as disadvantages, mainly because of 3x replication).

The main issue is the compute nodes, where YARN spins up containers. If the data sits on separate SAN disks, every query or data access has to go over the network and is no longer distributed across the spindles on the storage nodes. That not only slows access, it also introduces more points of failure through switches and creates additional potential for bottlenecks.
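
To make the spindle argument concrete, here is a rough back-of-envelope sketch in Python. Every number in it (10 nodes, 12 disks per node, 100 MB/s per disk, two shared 10 Gbit/s SAN links) is an assumption chosen for illustration, not a measurement from a real cluster, and the function names are made up for this example.

```python
def local_read_mb_s(nodes, disks_per_node, mb_s_per_disk):
    """Aggregate sequential read bandwidth when each node reads its own local disks."""
    return nodes * disks_per_node * mb_s_per_disk


def san_read_mb_s(link_gbit_s, links):
    """Upper bound on read bandwidth when all reads cross shared SAN links."""
    # 1 Gbit/s = 1000 Mbit/s = 125 MB/s, ignoring protocol overhead.
    return link_gbit_s * 125 * links


if __name__ == "__main__":
    # All figures below are hypothetical.
    local = local_read_mb_s(nodes=10, disks_per_node=12, mb_s_per_disk=100)
    san = san_read_mb_s(link_gbit_s=10, links=2)
    print(f"local JBOD aggregate: {local} MB/s")
    print(f"shared SAN ceiling:   {san} MB/s")
```

With these assumed figures the local disks add up to 12000 MB/s against a 2500 MB/s ceiling on the shared links. The local number also grows with every node you add, while the SAN ceiling stays where it is unless the fabric is upgraded.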

Normally I would have compromised a bit for master nodes, but I just came from a client who ran the master nodes as VMs on SAN. Performance started out great, but once multiple users came on board and the master nodes had to handle more blocks, performance tanked. We wasted a week and a half moving the master components to physical nodes on a cluster with data. Painful.

See a good discussion here: http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster

Re: SAN vs DAS(JBOD) on data node

@Sunile Manjee

SAN is terrible for Hadoop; go with direct-attached storage or Isilon NAS. Besides being a shared pool of storage, SAN suffers from the busy-neighbor problem. It is also outside the NameNode's control, so blocks can move around and you lose data locality. Latency can be an issue too. A final thought: direct-attached storage gets its redundancy from many disks, so you can tolerate failures by having more disks. A quick search led to this: http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/ and more: http://www.infoworld.com/article/2609694/application-development/never--ever-do-this-to-hadoop.html Here's our official doc: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_cluster-planning-guide/content/hardware-f...

One more thing to mention: the busy-neighbor problem also works in reverse. You will affect the other applications running on your SAN.

Re: SAN vs DAS(JBOD) on data node

@Sunile Manjee Compared with typical disks, SAN-backed disks would be cost prohibitive.

Re: SAN vs DAS(JBOD) on data node

Although I have heard the argument that, over time, the cost of replacing disks and managing DAS with 3x replication makes SAN cheaper from a TCO perspective.
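
The replication side of that TCO argument can be put into numbers. This is only a rough sketch: the function name is made up, and the 25% headroom for temporary and intermediate data is an assumed value, not an official sizing rule. The replication factor of 3 is the HDFS default (`dfs.replication`).

```python
def raw_capacity_tb(usable_tb, replication=3, headroom=0.25):
    """Raw disk (TB) needed to hold usable_tb of data in HDFS.

    replication: copies kept of each block (HDFS default is 3).
    headroom: fraction of raw space left free (assumed value).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return usable_tb * replication / (1 - headroom)


if __name__ == "__main__":
    # Hypothetical data set of 100 TB.
    print(f"raw TB needed: {raw_capacity_tb(100)}")
```

Under these assumptions, 100 TB of data needs 400 TB of raw DAS. That multiplier is the figure to carry into any per-terabyte cost comparison against SAN.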

Re: SAN vs DAS(JBOD) on data node

While it is possible (and makes sense in some cases) to use SAN for Master Nodes I would strongly encourage you not to do this with Datanodes. Use bare metal machines with directly attached storage for Datanodes to optimize throughput and performance.

We have seen some very poor performance in environments where the Datanodes used SAN.

Re: SAN vs DAS(JBOD) on data node

@Sunile Manjee Though I have no personal experience with them, there are companies like BlueData that abstract the storage component and provide an interesting private cloud experience based on containers. An interesting read on this subject is a book from Google called The Datacenter as a Computer.

Re: SAN vs DAS(JBOD) on data node

This answerhub thread is an example of how AWESOME answerhub is. Thanks all for great great great info.

Re: SAN vs DAS(JBOD) on data node

@Sunile Manjee I think Ancil's answer is the best one ;) You are the judge.