Should we use RAID with Hadoop?
Labels: Apache Hadoop
Created on ‎09-26-2017 11:17 AM - edited ‎09-16-2022 05:18 AM
Created ‎09-26-2017 12:26 PM
Not for DataNodes. For some master-node processes, such as the Hive Metastore, yes. Also use RAID for all OS disks; you don't want a node to fail just because one OS disk fails.
As for DataNodes, HDFS keeps three copies of the data on different machines by default, so you don't need RAID. In fact, RAID will reduce performance, because the throughput of an array is limited by its slowest disk. The same goes for ZooKeeper and the Quorum Journal Manager: they run redundant processes on three different nodes with three different disks, so you don't need RAID there either.
Created ‎09-27-2017 05:40 AM
HDFS clusters do not benefit from RAID for data storage: the redundancy that RAID provides is unnecessary, because HDFS already handles it by replicating data across different DataNodes.
RAID striping (RAID 0), which is meant to increase performance, turns out to be slower than the JBOD (just a bunch of disks) layout used by HDFS, which round-robins across all disks. That's because in a striped array, read/write operations are limited by the slowest disk in the array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk.
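To put rough numbers on the slowest-disk argument, here is a minimal sketch of a deliberately simplified throughput model. The function names and the disk speeds are made up for illustration; real arrays and HDFS workloads are affected by controllers, caching, and access patterns that this ignores.

```python
def striped_throughput(disk_mb_per_s):
    """Simplified RAID 0 model: every stripe spans all disks,
    so each disk is effectively paced by the slowest one."""
    return len(disk_mb_per_s) * min(disk_mb_per_s)


def jbod_throughput(disk_mb_per_s):
    """Simplified JBOD model: disks are used independently,
    so aggregate throughput is the sum of the individual speeds."""
    return sum(disk_mb_per_s)


# Hypothetical node: three healthy disks and one degraded disk (MB/s).
disks = [150, 150, 150, 90]

print(striped_throughput(disks))  # 360
print(jbod_throughput(disks))     # 540
```

With identical disks the two models give the same figure; the gap only appears once one disk is slower than the rest, which becomes more likely as disks age.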
If a disk fails in JBOD, HDFS can continue to operate without it, but in a striped (RAID 0) array a single disk failure makes the whole array unavailable.
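The practical cost of that difference is how much data HDFS has to re-replicate from other nodes after one disk dies. A rough sketch, assuming every disk is full and the same size; the function name, layout labels, and the 12 x 4 TB node are hypothetical:

```python
def data_lost_on_one_disk_failure_tb(layout, n_disks, disk_tb):
    """Data (in TB) that becomes unavailable on a node when one disk fails,
    assuming all disks are full and of equal size."""
    if layout == "jbod":
        # Only the blocks stored on the failed disk are gone.
        return disk_tb
    if layout == "raid0":
        # The stripe set is broken, so the whole array is gone.
        return n_disks * disk_tb
    raise ValueError(f"unknown layout: {layout}")


# Hypothetical DataNode with 12 disks of 4 TB each.
print(data_lost_on_one_disk_failure_tb("jbod", 12, 4))   # 4
print(data_lost_on_one_disk_failure_tb("raid0", 12, 4))  # 48
```

In both cases the data still exists as replicas on other nodes; the difference is how much of it the cluster has to copy again to get back to full replication.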
RAID is recommended for the NameNode, to protect its metadata against loss or corruption.
