Force block redistribution for some particular file or directory

Rising Star

Hi dear experts!

 

I'm wondering whether there is any way to force block redistribution for a particular file or directory.

My case is:

1) Load a file from a node that runs a DataNode process, with replication factor 1.

2) Increase the replication factor by executing: hdfs dfs -setrep 3 /tmp/path/to/my/file

3) Check the distribution with a custom Java tool:

hadoop jar FileDistribution.jar /tmp/path/to/my/file

 

and got:
-----------------------------------
Files distribution in directory across cluster is : {scaj31bda05.us.oracle.com=400, scaj31bda03.us.oracle.com=183, scaj31bda04.us.oracle.com=156, scaj31bda01.us.oracle.com=151, scaj31bda02.us.oracle.com=154, scaj31bda06.us.oracle.com=156}

 

It's clear that the first node listed (scaj31bda05) holds 400 blocks, that is, one replica of every block. The other 400*2=800 replicas are spread roughly evenly across the remaining five nodes.
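The arithmetic can be checked directly from the tool output. This is only a small sketch that parses the map printed above; the variable names are made up for illustration, and the numbers are copied verbatim from the output.

```shell
# Per-node replica counts, copied from the tool output above.
counts='scaj31bda05.us.oracle.com=400, scaj31bda03.us.oracle.com=183, scaj31bda04.us.oracle.com=156, scaj31bda01.us.oracle.com=151, scaj31bda02.us.oracle.com=154, scaj31bda06.us.oracle.com=156'

# One host=count pair per line.
pairs=$(printf '%s\n' "$counts" | tr -d ' ' | tr ',' '\n')

# Total replicas, the largest per-node count, and the host that holds it.
total=$(printf '%s\n' "$pairs" | awk -F= '{ s += $2 } END { print s }')
max=$(printf '%s\n' "$pairs" | awk -F= 'BEGIN { m = 0 } $2 > m { m = $2 } END { print m }')
busiest=$(printf '%s\n' "$pairs" | awk -F= 'BEGIN { m = 0 } $2 > m { m = $2; h = $1 } END { print h }')

echo "total replicas:      $total"            # 1200 = 400 blocks * 3 replicas
echo "busiest node:        $busiest ($max)"   # scaj31bda05.us.oracle.com (400)
echo "replicas elsewhere:  $((total - max))"  # 800, about 160 per remaining node
```

So one node holds a third of all replicas, while each of the other five holds between 151 and 183.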

Is there any way to force block redistribution to make it even?

 

thanks!

 

1 ACCEPTED SOLUTION

Cloudera Employee
When you ingest data from an edge node that also runs the DataNode role, the first replica is always written to that local DataNode, so it uses space much faster than any other DataNode. To redistribute space usage among all DataNodes, you must run the HDFS balancer (hdfs balancer).
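A minimal sketch of running the balancer is shown here. The wrapper function run_balancer and the HDFS_CMD variable are hypothetical names added for illustration (HDFS_CMD makes a dry run possible); the only real commands are hdfs balancer with its -threshold option. Keep in mind that the balancer evens out disk utilization across DataNodes for the cluster as a whole, so it is not guaranteed to spread the blocks of one particular file evenly.

```shell
# Hypothetical wrapper; set HDFS_CMD="echo hdfs" to print the command instead of running it.
HDFS_CMD=${HDFS_CMD:-hdfs}

# Run the balancer until each DataNode's utilization is within
# <threshold> percentage points of the cluster average (10 is the default).
run_balancer() {
  threshold=${1:-10}
  $HDFS_CMD balancer -threshold "$threshold"
}
```

For example, run_balancer 5 asks for a tighter balance than the default; a lower threshold usually means a longer run.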


2 REPLIES

Mentor
This is an expected side-effect of loading data from a DN host. While there's no 'even distribution' tool today, you can perhaps try to get a more random effect going by raising the replication factor (to 4 or 5) and then lowering it back again.
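The raise-then-lower trick could look roughly like the sketch below. The function name reshuffle_replicas and the HDFS_CMD variable are hypothetical; the underlying command is the same hdfs dfs -setrep used in the question, with -w added so each step waits until the target replication is reached. Which replicas the NameNode drops when the factor is lowered is its own decision, so treat the result as "more random", not guaranteed even.

```shell
# Hypothetical wrapper; set HDFS_CMD="echo hdfs" to print the commands instead of running them.
HDFS_CMD=${HDFS_CMD:-hdfs}

# Temporarily raise the replication factor, then lower it back.
#   $1 = path, $2 = temporary factor (default 5), $3 = final factor (default 3)
reshuffle_replicas() {
  path=$1
  high=${2:-5}
  final=${3:-3}
  # -w blocks until the requested replication is reached; this can take a long time.
  $HDFS_CMD dfs -setrep -w "$high" "$path" &&
  $HDFS_CMD dfs -setrep -w "$final" "$path"
}
```

Usage would be something like reshuffle_replicas /tmp/path/to/my/file 5 3, followed by re-running the distribution check from the question.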
