Created on 05-07-2015 08:09 AM - edited 09-16-2022 02:28 AM
I have run the fsck command on my HDFS and I am seeing a high number of under-replicated blocks (over 30%)!
My HDFS replication factor is set to 2.
What are the best practices / recommended methods to fix this issue?
1) Should I use "hadoop fs -setrep" to change the replication factor of certain files?
2) What's the manual way to 'force' the affected blocks to replicate themselves?
3) Should I permanently remove certain types of files?
For instance, in the fsck log report I am seeing a lot of files of this type:
/user/hue/.Trash/150507010000/user/hue/.cloudera_manager_hive_metastore_canary/hive0_hms/cm_test_table1430446320640/p1=p1/p2=421 <dir>
/user/hue/.Trash/150507010000/user/hue/.cloudera_manager_hive_metastore_canary/hive0_hms/cm_test_table1430446620772 <dir>
/user/hue/.Trash/150507010000/user/hue/.cloudera_manager_hive_metastore_canary/hive0_hms/cm_test_table1430446620772/p1=p0 <dir>
4) How about the /tmp/logs/ files? Do I reset their replication setting or periodically remove them?
5) I also have quite a few Accumulo tables reporting under-replicated blocks.
Created 05-10-2015 08:10 PM
Hi TS, are you still facing this issue? Have you changed back to 3 replicas, or are you still configured with 2?
1) Should I use "hadoop fs -setrep" to change the replication factor of certain files?
JMS: No. Keep it the way it is for now.
2) What's the manual way to 'force' the affected blocks to replicate themselves?
JMS: It depends... If they are configured to replicate 100 times, you might not have enough nodes and you cannot force that. How many nodes do you have in your cluster? Can you paste part of the fsck output here?
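For reference, a quick way to pull out just the problem entries (standard HDFS commands, nothing cluster-specific assumed; narrow the path if you only care about part of the namespace):

# overall health summary, including the under-replicated block count
hdfs fsck /
# only the files/blocks that fsck flags as under-replicated
hdfs fsck / | grep -i "under replicated" | head -20

Each of those lines also shows the target replication for the file, which tells you whether it was written with a factor higher than your cluster-wide setting of 2.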
3) Should I permanently remove certain types of files?
For instance, in the fsck log report I am seeing a lot of files of this type:
/user/hue/.Trash/150507010000/user/hue/.cloudera_manager_hive_metastore_canary/hive0_hms/cm_test_table1430446320640/p1=p1/p2=421 <dir>
/user/hue/.Trash/150507010000/user/hue/.cloudera_manager_hive_metastore_canary/hive0_hms/cm_test_table1430446620772 <dir>
/user/hue/.Trash/150507010000/user/hue/.cloudera_manager_hive_metastore_canary/hive0_hms/cm_test_table1430446620772/p1=p0 <dir>
JMS: This is the trash. If you don't need those files, clean the trash.
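If you do decide to clear it, something like the following works (run as the owning user; -skipTrash makes the delete immediate and permanent, so double-check the path first):

# force a trash checkpoint/cleanup for the current user
hadoop fs -expunge
# or permanently delete a specific trash directory right away
hdfs dfs -rm -r -skipTrash /user/hue/.Trash

Trash also expires on its own based on fs.trash.interval, so this only matters if the space or the fsck noise bothers you now.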
4) How about the /tmp/logs/ files? Do I reset their replication setting or periodically remove them?
JMS: Same thing. Temporary files. Can you list them to make sure? You might be able to delete them.
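To see what is actually sitting there before deleting anything (assuming /tmp/logs is your YARN log-aggregation directory, which is the usual default; <username> is a placeholder):

# space used per user by aggregated application logs
hdfs dfs -du -h /tmp/logs
# list one user's applications to decide what can go
hdfs dfs -ls /tmp/logs/<username>/logs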
5) I also have quite a few Accumulo tables reporting under-replicated blocks.
JMS: Here again, please paste the logs here. This one is the most concerning. They should use the default, unless Accumulo has set the replication to something higher than the factor of 2 you configured.
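Two hedged ways to check, assuming the default Accumulo directory in HDFS (/accumulo) and the standard table.file.replication property (0 means "use the HDFS default"):

# which Accumulo files fsck flags as under-replicated
hdfs fsck /accumulo | grep -i "under replicated"
# in the Accumulo shell: check for a per-table replication override
# config -t <tablename> -f table.file.replication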
JMS
Created 05-10-2015 08:18 PM
Hi JM, thank you again!
The issue (under-replicated & corrupt blocks) started when I added 2 new nodes into an existing CDH 5.4 cluster.
I went and selectively removed and restored files back into HDFS.
HDFS now is HEALTHY.
However, I haven't pinpointed the root cause!
I have opened up another thread listing more details about the corrupted blocks issue.
I'll close this one and continue with the other one.
Thanks for all your help.
Happy Mother's Day 🙂
Created 07-07-2015 05:36 AM
Hi TS,
How come adding the nodes made blocks under-replicated?
Did you run the balancer?
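For reference, the balancer is usually kicked off like this (the 10 percent threshold is just a common choice, not anything from this thread):

hdfs balancer -threshold 10

It evens out disk usage across DataNodes, though as the next reply points out, it does not create missing replicas.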
Created 07-07-2015 06:23 AM
Hi Siddesh,
Adding a node is not related to the under-replicated blocks.
The under-replicated blocks were most probably already there before the new nodes were added.
By default, MapReduce sets the replication factor for job JARs and submission files to 10. So if there are fewer than 10 nodes in the cluster you will almost always have under-replicated blocks.
You should check with fsck to see what is missing. Balancer will not help for that.
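A quick way to confirm whether the JAR default is the culprit (standard commands; the property name below is the Hadoop 2.x one):

# replication used for MR job submission files -- defaults to 10
hdfs getconf -confKey mapreduce.client.submit.file.replication
# files whose target replication is 10, i.e. likely job/distributed-cache files
hdfs fsck / | grep -i "target replicas is 10"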
JM
Created 07-07-2015 06:31 AM
I may have missed something above, but the JARs set the replication to 10? Where is that mentioned in the post?
Created 07-07-2015 06:34 AM
Hi Siddesh,
It is not mentioned, but it's a potential cause of the under-replicated blocks. It could also be something else.
JMS
Created 07-07-2015 06:36 AM
Well absolutely. He needs to check his config files.
Created 02-20-2017 07:09 PM
Hi,
When I run fsck on my cluster, it reports several under-replicated blocks with a target replication of 3, even though I changed dfs.replication to 2 on the NameNode, DataNodes, and client servers, and also changed mapred.submit.replication to 2.
I also tried:
<property>
<name>dfs.replication</name>
<value>2</value>
<final>true</final>
</property>
I also restarted all services on my cluster, including Oozie.
Looking at the configuration of one of the running jobs, I see the following properties still set to a replication factor of 3:
mapreduce.client.submit.file.replication
s3.replication
kfs.replication
dfs.namenode.replication.interval
ftp.replication
s3native.replication
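One thing to keep in mind when chasing this: dfs.replication is applied when a file is written, so existing files keep whatever target replication they were created with, and lowering the config does not touch them. A minimal sketch for bringing old files down to 2 and double-checking what the client actually picks up (replace /path/flagged/by/fsck with the directories fsck reported):

# lower the replication factor of already-written files; -w waits until done
hadoop fs -setrep -w 2 /path/flagged/by/fsck
# confirm the effective client-side settings
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey mapreduce.client.submit.file.replication

New jobs will still write their submission files at mapreduce.client.submit.file.replication unless that is lowered too.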