Should we use RAID with Hadoop?


@Sheetal Sharma

Not for data nodes. For some master node processes, such as the Hive Metastore, yes. Also use RAID for all OS disks: you don't want to lose a node just because one OS disk fails.

Data nodes don't need RAID because HDFS keeps three copies of each block (by default) on different machines. In fact, RAID will reduce performance, because an array's throughput is limited by its slowest disk. The same goes for ZooKeeper and the Quorum Journal Manager: they run redundant processes on three different nodes, each with its own disks, so you don't need RAID there either.
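To make the ZooKeeper and Quorum Journal Manager point concrete, here is a rough sketch of the majority-quorum arithmetic both rely on. The function names are made up for illustration and are not part of any Hadoop or ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 5):
    # Prints: ensemble size, quorum size, tolerated failures
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 5 3 2
```

With three nodes, one whole node (including its disks) can fail and the service keeps running, which is the redundancy RAID would otherwise be providing.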


HDFS clusters do not benefit from using RAID for data storage: the redundancy that RAID provides is not needed, because HDFS already replicates data across different data nodes.

RAID striping (RAID 0), which is normally used to increase performance, turns out to be slower than the JBOD (just a bunch of disks) configuration used by HDFS, which round-robins across all disks. That's because in a striped array, read/write operations are limited by the slowest disk in the array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk.
If a disk fails in JBOD, HDFS can continue to operate without it, but if a disk fails in a striped array, the whole array becomes unavailable.
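
As a back-of-the-envelope illustration of the two points above, here is a toy model. It is a simplified sketch, not a benchmark: the disk speeds are hypothetical, and real arrays are affected by controllers, caching, and workload.

```python
def striped_throughput(disk_speeds):
    """Striping: every stripe touches all disks, so each disk is
    effectively held to the pace of the slowest one."""
    return len(disk_speeds) * min(disk_speeds)


def jbod_throughput(disk_speeds):
    """JBOD: disks work independently, so their speeds simply add up."""
    return sum(disk_speeds)


def usable_disks_after_failure(disk_count, striped):
    """Disks still serving data after a single disk fails."""
    return 0 if striped else disk_count - 1


speeds = [100, 120, 80, 110]  # hypothetical per-disk speeds in MB/s

print(striped_throughput(speeds))            # 320
print(jbod_throughput(speeds))               # 410
print(usable_disks_after_failure(4, True))   # 0
print(usable_disks_after_failure(4, False))  # 3
```

In this model the JBOD disks average 102.5 MB/s each, while the striped array is pinned to the 80 MB/s of its slowest member.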

RAID is recommended for the NameNode, to protect its metadata against corruption.