
Adding DataNode directories of different sizes

Rising Star

Hi All,

We currently have the following data node directories

/data01, /data02, /data03 of 10 TB each. Now we are adding new file systems /data04, /data05, /data06 of 16 TB each.

Is the size difference going to create any performance issues?

Do we need to shrink the file system to match our existing ones?

Any inputs would be appreciated.

1 ACCEPTED SOLUTION

Master Guru

@Shihab

DataNodes typically have the same disk sizes; however, this is not required, nor is it necessarily a best practice. There will be no performance issues unless you have slow-performing hardware (heterogeneous hardware). In that case, when you launch a job it will run map tasks on any node that has capacity to run them, and it may select a node that performs worse than the others. Heterogeneous storage itself is fine with Hortonworks, since you pre-configure storage policies based on the storage type. This way you can avoid performance issues related to heterogeneous hardware.
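As a sketch of what "pre-configure the storage policies based on the storage type" looks like in practice (the `[DISK]`/`[ARCHIVE]` volume tags and the `hdfs storagepolicies` command are standard HDFS features, but the mount points and the `/apps/archive` path here are just examples):

```shell
# hdfs-site.xml: tag each DataNode volume with its storage type.
# (Example paths; adjust to your own mount points.)
#
#   <property>
#     <name>dfs.datanode.data.dir</name>
#     <value>[DISK]/data01,[DISK]/data02,[DISK]/data03,[ARCHIVE]/data04,[ARCHIVE]/data05,[ARCHIVE]/data06</value>
#   </property>

# Pin an HDFS directory to a policy that targets a storage type:
hdfs storagepolicies -setStoragePolicy -path /apps/archive -policy COLD

# Verify which policy is in effect:
hdfs storagepolicies -getStoragePolicy -path /apps/archive
```

With this in place, blocks written under `/apps/archive` land on the ARCHIVE-tagged volumes, so mixed disk types stop competing for the same workloads.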

View solution in original post

3 REPLIES


Rising Star

Thanks, that clears up a lot 🙂

Expert Contributor

@Shihab

There are a couple of things to consider in this scenario.

1. Time to recover when a machine fails: With these new disks you will have 78 TB of raw capacity on each DataNode. Depending on the network speed and how many DataNodes you have in the cluster, re-replicating that data can take some time.

2. Booting up the cluster: If all your DataNodes have such huge capacity, the block reports could get quite large. This would increase startup times, especially during the initial bootup of a DataNode. Disk scans can also be expensive, but I am hoping that these 16 TB disks are all fast SSDs.

3. Data imbalance: If you are adding these disks because you are running out of space, then the older disks have far less free space than the new ones. If you are running a round-robin volume choosing policy (which you most probably are), it is possible to get errors from these nodes, since the DataNode will keep trying to write to the older, fuller disks. The Balancer may not solve this issue, because it aims for good data distribution across the cluster, not between the disks within a node. If you run into this problem, there are two ways to fix it.

1. Decommission the node and let it rejoin. Rebuilding the DataNode that way will create an even distribution of data across all of its disks.

2. Run Disk Balancer, a tool that is still a work in progress, tracked in HDFS-1312.
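If the intra-node imbalance does bite, here is a sketch of the two relevant knobs (the configuration property and the `hdfs diskbalancer` subcommands are standard HDFS features; the hostname and plan path are placeholders):

```shell
# hdfs-site.xml: switch the DataNode from round-robin to the
# available-space volume choosing policy, so new blocks favor
# the emptier 16 TB disks:
#
#   <property>
#     <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
#     <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
#   </property>

# Disk Balancer (HDFS-1312) workflow, once it is available in your release:
hdfs diskbalancer -plan datanode1.example.com      # generate a move plan for one node
hdfs diskbalancer -execute datanode1.example.com.plan.json   # run the plan on that node
hdfs diskbalancer -query datanode1.example.com     # check progress
```

The policy change only affects newly written blocks; Disk Balancer is what moves existing blocks between a node's disks.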

So generally, what @Sunile Manjee said is correct. While this should not cause any performance issues, it does have the potential to cause operational issues.