Support Questions
Find answers, ask questions, and share your expertise

Uneven data distribution on HDFS cluster

New Contributor

We have provisioned a 4 node HDFS cluster for the purpose of testing.

There are 4 data nodes: node1, node2, node3 and node4

node3 doubles up as a Name node as well.

The replication factor is set to 2.

The applications that write data run on node5 (outside the cluster). Data is written to HDFS in multiple ways:

i) Java application using the HDFS api

ii) Java application which copies data from a text table to ORC table using "insert into select * from"

On loading 100 GB of data, we've noticed a highly uneven distribution of the data across the 4 data nodes.

node1 - 16.66 GB

node2 - 16.66 GB

node3 - 16.66 GB

node4 - 50 GB

I am unable to figure out the reason for this skew. All the nodes have been in operation from start of the load. We are not writing the data from one of the data nodes. But it seems that one copy of every block is being stored on node4 while the second copy is being distributed across other 3 nodes.

Is there something about the block placement policy that i am missing ? Is there a way i could debug this ?

1 ACCEPTED SOLUTION

Accepted Solutions

New Contributor

We were able to resolve this. The node managers on the other 3 nodes were missing/down. Bring up the Node Manager on the other three nodes lead to an even distribution.

View solution in original post

1 REPLY 1

New Contributor

We were able to resolve this. The node managers on the other 3 nodes were missing/down. Bring up the Node Manager on the other three nodes lead to an even distribution.

View solution in original post