Under-replicated blocks + why do we get this warning on a new scratch installation?


We installed a new Ambari cluster with the following details (we moved to Redhat 7.5 instead of 7.2):

 

Redhat – 7.5
HDP version – 2.6.4
Ambari – 2.6.2

 

After we completed the installation, we noticed very strange behavior (please note that this is a new cluster).

 

On the HDFS status summary, I see the following messages about under-replicated blocks.

 

We see that the number of under-replicated blocks is 12 (while it should be 0 on a new installation).

 

Any suggestion why this happens?

I just want to say that this behavior does not appear on Redhat 7.2.

Michael-Bronson

6 REPLIES

Master Mentor

@mike_bronson7 

Under replicated blocks

There are a couple of potential sources of the problem that trigger this alert. HDP versions earlier than 3.x all use the standard default replication factor of 3, for reasons you know well: the ability to rebuild the data in any failure scenario, as opposed to the new erasure coding policies in Hadoop 3.0.

Secondly, the cluster will rebalance itself if you give it time 🙂

Having said that, the first question is: how many data nodes were set up in this new cluster, and did you enable rack awareness?
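If you want a quick way to answer that from the command line, here is a minimal sketch using hdfs dfsadmin (the exact wording of the summary lines can vary slightly between Hadoop versions):

# Count the DataNodes the NameNode currently sees as live
$ hdfs dfsadmin -report | grep -i 'live datanodes'

# Show the rack topology the NameNode is using; everything sitting under
# /default-rack usually means rack awareness has not been configured
$ hdfs dfsadmin -printTopology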


This usually means that some files are “asking” for a specific number of target replicas that are not present, or are not able to get the replicas. So the question is: how do I know which files are asking for a number of replicas that is not available?

The first option is to use hdfs fsck:

$ hdfs fsck / -storagepolicies

****** output ******
Connecting to namenode via http://xxx.com:50070/fsck?ugi=hdfs&storagepolicies=1&path=%2F
FSCK started by hdfs (auth:SIMPLE) from /192.168.0.94 for path / at Sat Oct 26 23:03:24 CEST 2019

/user/zeppelin/notebook/2EC24FF9U/note.json:
Under replicated BP-2067995211-192.168.0.101-1537740712051:blk_1073751507_10767.
Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).

******

Change the replication

$ hdfs dfs -setrep -w 1 /user/zeppelin/notebook/2EC24FF9U/note.json
Replication 1 set: /user/zeppelin/notebook/2EC24FF9U/note.json
Waiting for /user/zeppelin/notebook/2EC24FF9U/note.json ... done


You also need to check dfs.replication in hdfs-site.xml; the default is 3. Note that if you upload files through Ambari, the file actually gets a replication factor of 3.
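To double-check the value the client actually picks up (rather than reading hdfs-site.xml by hand), a minimal sketch:

# Print the effective dfs.replication from the client configuration
$ hdfs getconf -confKey dfs.replication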

HTH 


Dear Shelton

These are the results that we get from

 hdfs fsck / -storagepolicies


FSCK started by hdfs (auth:SIMPLE) from /192.9.200.217 for path / at Sun Oct 27 05:49:31 UTC 2019
..................
/hdp/apps/2.6.4.0-91/hive/hive.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741831
/hdp/apps/2.6.4.0-91/hive/hive.tar.gz: MISSING 1 blocks of total size 106475099 B..
/hdp/apps/2.6.4.0-91/mapreduce/hadoop-streaming.jar: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741834
/hdp/apps/2.6.4.0-91/mapreduce/hadoop-streaming.jar: MISSING 1 blocks of total size 105758 B..
/hdp/apps/2.6.4.0-91/mapreduce/mapreduce.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741825
/hdp/apps/2.6.4.0-91/mapreduce/mapreduce.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741826
/hdp/apps/2.6.4.0-91/mapreduce/mapreduce.tar.gz: MISSING 2 blocks of total size 212360343 B..
/hdp/apps/2.6.4.0-91/pig/pig.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741829
/hdp/apps/2.6.4.0-91/pig/pig.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741830
/hdp/apps/2.6.4.0-91/pig/pig.tar.gz: MISSING 2 blocks of total size 135018554 B..
/hdp/apps/2.6.4.0-91/slider/slider.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741828
/hdp/apps/2.6.4.0-91/slider/slider.tar.gz: MISSING 1 blocks of total size 47696340 B..
/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741832
/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741833
/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz: MISSING 2 blocks of total size 189992674 B..
/hdp/apps/2.6.4.0-91/tez/tez.tar.gz: CORRUPT blockpool BP-2095386762-192.9.201.8-1571956239762 block blk_1073741827
/hdp/apps/2.6.4.0-91/tez/tez.tar.gz: MISSING 1 blocks of total size 53236968 B......
/user/ambari-qa/.staging/job_1571958926657_0001/job.jar: Under replicated BP-2095386762-192.9.201.8-1571956239762:blk_1073741864_1131. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.
/user/ambari-qa/.staging/job_1571958926657_0001/job.split: Under replicated BP-2095386762-192.9.201.8-1571956239762:blk_1073741865_1132. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
...Status: CORRUPT

 

Yes, we checked the replication factor - yes, it's 3.

 

Based on those results, can we just delete the corrupted blocks?

Michael-Bronson

Master Mentor

@mike_bronson7 

Regarding under-replicated blocks: HDFS is supposed to recover them automatically (by creating the missing copies to fulfill the replication factor). In your case the cluster-wide replication factor is 3, but the target for those files is 10 while you only have 5 data nodes, which leads to the under-replication alert!

According to the output you have 2 distinct problems, with 2 different solutions:
(a) Under-replicated blocks: Target Replicas is 10 but found 5 live replica(s) [last 2 lines of the output]
(b) Corrupt blocks
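As a side note, and an assumption on my part rather than something your output states directly: .staging job files usually ask for 10 replicas because MapReduce submits job files with mapreduce.client.submit.file.replication, which defaults to 10. A quick way to see whether it has been overridden, assuming the standard HDP client config directory /etc/hadoop/conf:

# If this prints nothing, the default of 10 is in effect for submitted job files
$ grep -A1 'mapreduce.client.submit.file.replication' /etc/hadoop/conf/mapred-site.xml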

Solution 1: under-replicated blocks

You could force the 2 blocks to align with the cluster-wide replication factor by adjusting them with -setrep:

$ hdfs dfs -setrep -w 3 [File_name]
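For example, taking the two .staging paths from your fsck output (the job ID is of course specific to your run):

$ hdfs dfs -setrep -w 3 /user/ambari-qa/.staging/job_1571958926657_0001/job.jar
$ hdfs dfs -setrep -w 3 /user/ambari-qa/.staging/job_1571958926657_0001/job.split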

Validate by listing the file; you should now see a 3 after the file permissions, before the user:group, like below:

$ hdfs dfs -ls [File_name]

-rw-r--r-- 3  analyst hdfs 1068028 2019-10-27 12:30 /flighdata/airports.dat

Then wait for the excess replicas to be deleted, or run the snippets below sequentially to find and fix all under-replicated files:

$ hdfs fsck / | grep 'Under replicated'

$ hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files

$ for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 3 $hdfsfile; done

Solution 2: corrupt files

$ hdfs fsck / | egrep -v '^\.+$' | grep -i corrupt

............... Example output ...............
/user/analyst/test9: CORRUPT blockpool BP-762603225-192.168.1.2-1480061879099 block blk_1055741378
/user/analyst/data1: CORRUPT blockpool BP-762603225-192.168.1.2-1480061879099 block blk_1056741378
/user/analyst/data2: MISSING 3 blocks of total size 338192920 B.Status: CORRUPT
CORRUPT FILES: 9
CORRUPT BLOCKS: 18
Corrupt blocks: 18
The filesystem under path '/' is CORRUPT


Locate the corrupted blocks:

$ hdfs fsck / | egrep -v '^\.+$' | grep -i "corrupt blockpool"| awk '{print $1}' |sort |uniq |sed -e 's/://g' >corrupted.flst

Get the locations of each file listed in corrupted.flst:

$ hdfs fsck /user/analyst/xxxx -locations -blocks -files
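If the list is long, a minimal sketch to run the same check for every path collected in corrupted.flst:

# Print files, blocks and locations for each corrupted path found above
$ for f in `cat corrupted.flst`; do echo "== $f =="; hdfs fsck $f -locations -blocks -files; done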

Remove the corrupted files (each HDFS path listed in corrupted.flst):

$ hdfs dfs -rm /path/to/corrupt_filename

Skip the trash to permanently delete

$ hdfs dfs -rm -skipTrash /path/to/corrupt_filename
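And if you decide to remove everything in the list in one go, a minimal sketch (review corrupted.flst first; -skipTrash is not recoverable):

# Permanently delete every corrupted file collected in corrupted.flst
$ for f in `cat corrupted.flst`; do hdfs dfs -rm -skipTrash $f; done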

 

You should give the cluster some time to rebalance in the case of under-replicated files.
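To watch the under-replicated count go down while you wait, a minimal sketch (re-run it periodically; the exact label wording can vary between Hadoop versions):

# The NameNode summary in the report includes the under-replicated block count
$ hdfs dfsadmin -report | grep -i 'under replicated blocks'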


About the corrupted files: why not just use the following?

 

hdfs fsck / -delete

Michael-Bronson

Master Mentor

@mike_bronson7 

Surely you can use hdfs fsck / -delete, but remember it will be put in the trash!


May I return to my first question.

While we were using Redhat 7.2 everything was OK; after each scratch installation we never saw this behavior.

But since we jumped to Redhat 7.5, every cluster we create comes up with corrupted files - any HINT why?

Michael-Bronson