Member since: 12-02-2014
Posts: 8
Kudos Received: 1
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3171 | 04-16-2021 06:19 PM |
07-16-2024 03:25 PM
@GangWar @wert_1311 I have found HDFS files that are persistently under-replicated despite being over a year old. They are rare, but vulnerable to loss from a single disk failure.

To be clear, `hdfs dfs -ls filename` shows the replication target, not the actual replica count. The actual count can be found with:

    hdfs fsck filename -files -blocks

In theory this situation should be transient, but I have found cases where it persists. In the example below, the file is 3 blocks long and one of the blocks has only one live replica:

    # hdfs fsck /tmp/part-m-03752 -files -blocks
    /tmp/part-m-03752: Under replicated BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
    /tmp/part-m-03752: Replica placement policy is violated for BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. Block should be additionally replicated on 1 more rack(s).
    0. BP-955733439-1.2.3.4-1395362440665:blk_1967769089_1100461809406 len=134217728 Live_repl=3
    1. BP-955733439-1.2.3.4-1395362440665:blk_1967769276_1100461809593 len=134217728 Live_repl=3
    2. BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792 len=40324081 Live_repl=1

    Status: HEALTHY
     Total size:                    308759537 B
     Total dirs:                    0
     Total files:                   1
     Total symlinks:                0
     Total blocks (validated):      3 (avg. block size 102919845 B)
     Minimally replicated blocks:   3 (100.0 %)
     Over-replicated blocks:        0 (0.0 %)
     Under-replicated blocks:       1 (33.333332 %)
     Mis-replicated blocks:         1 (33.333332 %)
     Default replication factor:    3
     Average block replication:     2.3333333
     Corrupt blocks:                0
     Missing replicas:              2 (22.222221 %)
     Number of data-nodes:          30
     Number of racks:               3
    The filesystem under path '/tmp/part-m-03752' is HEALTHY

    # hadoop fs -ls /tmp/part-m-03752
    -rw-r--r--   3 wuser hadoop  308759537 2021-12-11 16:58 /tmp/part-m-03752
Presumably, the file was incorrectly replicated when it was written because of some failure, and the defaults for the dfs.client.block.write.replace-datanode-on-failure properties were such that new DataNodes were not obtained at write time to replace the ones that failed. The puzzling thing here is why the block has not been re-replicated after all this time.
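To audit for files like this in bulk, the fsck text output can be scanned for under-replication messages. Below is a minimal, hedged sketch (not an official HDFS tool) that parses the message format shown in the output above; the regex is an assumption based on that sample and may need adjusting for other Hadoop versions.

```python
import re

# Matches lines of the form shown in the fsck output above, e.g.:
#   "... Under replicated <block>. Target Replicas is 3 but found 1 live replica(s) ..."
UNDER_RE = re.compile(
    r"Under replicated (?P<block>\S+)\."
    r"\s*Target Replicas is (?P<target>\d+) but found (?P<live>\d+) live"
)

def under_replicated_blocks(fsck_output: str):
    """Return (block_id, target_replicas, live_replicas) tuples
    for every under-replicated block reported in fsck text output."""
    return [
        (m.group("block"), int(m.group("target")), int(m.group("live")))
        for m in UNDER_RE.finditer(fsck_output)
    ]

# Sample line taken from the fsck output in the post above.
SAMPLE = (
    "/tmp/part-m-03752: Under replicated "
    "BP-955733439-1.2.3.4-1395362440665:blk_1967769468_1100461809792. "
    "Target Replicas is 3 but found 1 live replica(s), "
    "0 decommissioned replica(s), 0 decommissioning replica(s)."
)

print(under_replicated_blocks(SAMPLE))
```

For a file this script flags, one way to nudge the NameNode into re-replicating is the standard `hdfs dfs -setrep -w 3 <path>` command, which waits until the target replication is actually reached.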
04-16-2021 06:19 PM
Update: I moved SM to a host with a typical load of 7-8 instead of 24. After a day on the new machine, there have been no alerts about SM being slow and no gaps in the charts. Conclusion: SM works best on a machine with low load.