I am building a Hadoop cluster on new hardware. Each data node has fourteen 4 TB disks.
Do I understand correctly that one should build a file system (ext3, ext4 or xfs) on each disk separately and mount each disk before configuring HDFS?
Cloudera documentation seems to suggest that ext3 is the most tested file system for HDFS. Am I really better off with ext3 or should I use ext4 or xfs? Reliability is more important to me than performance.
Also, what are the recommended options for mounting in /etc/fstab?
You can start building your cluster using any of the Cloudera CDH supported file systems: ext3, ext4, and XFS. Avoid LVM partitioning (the default partitioning method in CentOS 6 and 7); use manual disk partitioning instead.
And yes, the recommended options for mounting in /etc/fstab are just as you stated:
/dev/sdb1 /data1 ext4 defaults,noatime 0 0
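Since there are 14 data disks, a small script can generate all the fstab lines at once. This is only a sketch under assumed device names (sdb through sdo mapped to /data1 through /data14); confirm the actual names with lsblk before using it:

```shell
#!/bin/sh
# Print an fstab line for each assumed data disk sdb..sdo.
# Review the output, append it to /etc/fstab, then run 'mount -a'.
n=1
for d in b c d e f g h i j k l m n o; do
  echo "/dev/sd${d}1 /data$n ext4 defaults,noatime 0 0"
  n=$((n+1))
done
```

Each mount point (/data1 through /data14) must exist (mkdir -p) before 'mount -a' will succeed.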
For more information, please take a look at this article.
mkfs.ext4 /dev/sdb, or should I use some extra options? For example, considering that HDFS uses a block size of 128 MB by default, might it make sense to use a bigger block size for the underlying ext4? Thank you, Igor
To create a partition in Linux, you'd need to run fdisk on the disk first. In your example, sdb is the disk, so you'd need to create the partition (sdb1):
After that, you'd need to format the new partition as ext4:
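To keep it safe to experiment with, here is the formatting step run against an image file; on the real partition you'd point mkfs.ext4 at /dev/sdb1 instead. The explicit -b 4096 only shows where a block-size option would go; note that ext4's block size cannot usefully exceed 4 KiB, since most kernels cannot mount a filesystem whose block size exceeds the page size, so a 128 MB ext4 block to match HDFS is not an option anyway:

```shell
# Format a throwaway image instead of the real /dev/sdb1.
truncate -s 100M fs.img
# -F: allow operating on a regular file; -q: quiet;
# -b 4096: explicit 4 KiB block size (the practical maximum on most kernels).
mkfs.ext4 -F -q -b 4096 fs.img
# Confirm the block size that was used:
tune2fs -l fs.img | grep 'Block size'
```

Some operators also lower the reserved-blocks percentage (mkfs.ext4 -m) on data-only disks, but that is a tuning choice, not a requirement.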
Make sure you mount it correctly in /etc/fstab, just as I stated in my first response; running 'mount -a' is a good way to examine your fstab entries.
In regards to the HDFS block size: HDFS blocks are just a logical layer built over the physical blocks of the ext4 filesystem. HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk is significantly longer than the time to seek to the start of the block.
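That trade-off is easy to put numbers on. With assumed figures of a 10 ms average seek and 100 MB/s sustained transfer (both made up for illustration), the seek is well under 1% of the time needed to read one 128 MB HDFS block:

```shell
# seek share = seek / (seek + block_size/throughput), with assumed
# figures: 10 ms seek, 100 MB/s transfer, 128 MB HDFS block.
awk 'BEGIN {
  seek = 0.010                 # seconds
  xfer = 128 / 100             # 128 MB at 100 MB/s = 1.28 s
  printf "seek share: %.1f%%\n", 100 * seek / (seek + xfer)
}'
```

With these assumed numbers the seek share comes out to roughly 0.8%, which is why HDFS favors large blocks.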
If there are any additional questions, please let me know.