We have provisioned a 4 node HDFS cluster for the purpose of testing.
There are 4 data nodes: node1, node2, node3 and node4
node3 doubles up as a Name node as well.
The replication factor is set to 2.
The applications that write data run on node5 (outside the cluster). Data is written to HDFS in multiple ways:
i) Java application using the HDFS api
ii) Java application which copies data from a text table to ORC table using "insert into select * from"
On loading 100 GB of data, we've noticed a highly uneven distribution of the data across the 4 data nodes.
node1 - 16.66 GB
node2 - 16.66 GB
node3 - 16.66 GB
node4 - 50 GB
I am unable to figure out the reason for this skew. All the nodes have been in operation from start of the load. We are not writing the data from one of the data nodes. But it seems that one copy of every block is being stored on node4 while the second copy is being distributed across other 3 nodes.
Is there something about the block placement policy that i am missing ? Is there a way i could debug this ?