Member since: 03-14-2016 | Posts: 67 | Kudos Received: 29 | Solutions: 3
12-11-2023
03:39 PM
High level Problem Explanation
The main issue impacting Namenode performance is the global FSNamesystem lock taken during read and write operations. Operations on large directories, especially those containing millions of files, can hold the lock for a long time and slow down the whole Namenode: extensive file listings, retrieving block locations, large deletions, DU (usage) calculations, and BDR snapshot activities.

------------ (Global lock - NameSystem) --------
|                          |                          |
Directory set (InodeMap)   Block set (BlockMap)   Cluster information (DatanodeManager)

Detailed Technical Explanation

RPC request #getListing:
1. The DFSClient initiates getListing for the given path by invoking DistributedFileSystem#listStatus. Note that this process is not atomic; the directory entries may be retrieved from the Namenode in multiple calls.
2. The DFSClient retrieves the initial batch of entries in the directory and then iterates to obtain the next batch.
3. FSNamesystem#getListing is called on the Namenode and acquires the read lock.
4. The Namenode returns directory entries and releases the lock once it reaches dfs.ls.limit, which defaults to 1000 items.
5. The client validates whether it has obtained all entries for the directory. If there are remaining items, it repeats steps 1 to 4.
The issue is step 3, where the read lock is acquired. This works efficiently and quickly for small directories of fewer than 1000 files, but it becomes problematic for larger directories, where the client aggressively keeps calling back for the remaining items.

RPC request #getBlockLocations:
1. DFSClient#getBlockLocations requests the block locations for a specified file (or for all files within a given directory).
2. FSNamesystem#getBlockLocations is called on the Namenode to retrieve the block locations; essentially, it returns the list of Datanode hostnames where that file's blocks were written.
The issue is step 2, where the read lock is acquired. Obtaining the location of a single block is not a problem, but acquiring all block locations for a large directory is, because of the lock. Spark and Impala jobs, for instance, fetch a substantial set of block locations, particularly during initial job submission.

RPC request #delete:
1. The client initiates DFSClient#delete for a file or directory, optionally in recursive mode.
2. FSNamesystem#delete is called on the Namenode. For large directories, deletion takes place incrementally: all the blocks associated with those directories are gathered and deleted in small batches. For a small directory or file, deletion typically happens in a single batch, but larger directories are the problem.
The issue is step 2, where the write lock is acquired. Additionally, every namespace modification must be recorded in the edit log. Example edit record for a delete:
<RECORD>
<OPCODE>OP_DELETE</OPCODE>
<DATA>
<TXID>69947</TXID>
<LENGTH>0</LENGTH>
<PATH>/user/hdfs/.Trash/Current/tmp/test</PATH>
<TIMESTAMP>1660859400205</TIMESTAMP>
<RPC_CLIENTID>9a5ec8c2-8ca2-4195-b788-d39c52278d0c</RPC_CLIENTID>
<RPC_CALLID>5</RPC_CALLID>
</DATA>
</RECORD>

RPC request #getSnapshotDiffReport:
BDR relies heavily on the snapshotDiff report. FSNamesystem#getSnapshotDiffReport calculates the differences between two snapshots and generates the report while holding the read lock. The issue is that the comparison of large snapshots happens under the lock. The lock time is visible in the Namenode log. For example, the following diff operation is extensive and causes the Namenode to pause or hold the lock for 7.6 seconds; it involves 104k directories and 3.8 million files.
INFO SnapshotDiffReport distcp-797-xxxx-old' to 'distcp-yy-xxx-new'. Total comparison dirs: 0/104527, files: 0/3858293. Time snapChildrenListing: 0.012s, actual: 7.631s, total: 12.565s.

RPC request #snapshotDelete:
Much like the previous scenario, BDR relies heavily on the deleteSnapshot operation. FSNamesystem#deleteSnapshot performs the deletion under the write lock, so deleting a sizable snapshot can impact readers.

RPC request #DU:
In the DU operation, a read lock is acquired to determine the size of a specified directory. The read lock is released after processing 5000 items (as defined by dfs.content-summary.limit). The thread then sleeps for 500 microseconds (dfs.content-summary.sleep-microsec) before reacquiring the read lock to process the remaining items. The problem is the aggressive reacquisition of the read lock for the remaining items.

Verification of Problem
Check whether the following traces are consistently repeated in the RUNNABLE state across all jstacks. Additionally, verify the corresponding path from the audit logs.

For getListing: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing
IPC Server handler 20 on 8020" #189 daemon prio=5 os_prio=0 tid=0x00007f88d6612800 nid=0x2cf3 runnable [0x00007f6f27896000]
java.lang.Thread.State: RUNNABLE
at java.lang.String.split(String.java:2354)
at java.lang.String.split(String.java:2422)
at org.apache.hadoop.fs.permission.AclEntry.parseAclEntry(AclEntry.java:268)
at org.apache.sentry.hdfs.SentryAuthorizationInfo.getAclEntries(SentryAuthorizationInfo.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListing(FSDirStatAndListingOp.java:257)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListingInt(FSDirStatAndListingOp.java:83)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:3735)

For getBlockLocations: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations
IPC Server handler 92 on 8020" #261 daemon prio=5 os_prio=0 tid=0x00007f88d65c2800 nid=0x2d3b runnable [0x00007f6f2304e000]
java.lang.Thread.State: RUNNABLE
at java.lang.String.intern(Native Method)
at org.apache.hadoop.hdfs.protocol.ExtendedBlock.<init>(ExtendedBlock.java:46)
..
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1338)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:175)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)

For delete: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete
IPC Server handler 162 on 8020" #221 daemon prio=5 os_prio=0 tid=0x00007f88d6345782 nid=0x2d8c runnable [0x00007f64f55g6700]
java.lang.Thread.State: RUNNABLE
...
org.apache.hadoop.hdfs.server.namenode.INode.getLocalName(INode.java:575)
org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:984)
org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1022)
org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:591)
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.isAccessAllowed(RangerHdfsAuthorizer.java:480)
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:322)
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1956)
org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:92)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3999)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1078)

For getSnapshotDiffReport: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport
IPC Server handler 8 on 8020" #177 daemon prio=5 os_prio=0 tid=0x00007f88d6b42800 nid=0x2ce7 runnable [0x00007f6f284a2000]
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiffRecursively(DirectorySnapshottableFeature.java:367)
..
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiffRecursively(DirectorySnapshottableFeature.java:360)
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(DirectorySnapshottableFeature.java:282)
at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:458)
at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(FSDirSnapshotOp.java:160)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:6486)

For deleteSnapshot: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1604)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:6543)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1813)

For DU: org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary
IPC Server handler 26 on 8020 - priority:10 - threadId:0x00007f13d911c800 - nativeId:0x8c81 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hdfs.server.namenode.INodeFile.computeContentSummary(INodeFile.java:556)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INode.computeAndConvertContentSummary(INode.java:448)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2166)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4401)

Resolution
Refrain from executing these commands on larger directories. Explore alternative approaches such as batch deletion or batch listing (see the sketch below).
Consider the OfflineImageViewer for read-only scenarios, although it does not provide a real-time view.
Optimize BDR snapshots, considering factors such as subdirectory layout and timing of execution. If these operations are unavoidable, schedule them during off-peak hours; performance challenges may still persist.
If getListing is the problem, consider reducing dfs.ls.limit from 1000 to 250.
Crucially, either disable or delay the update of file access time. By default, dfs.namenode.accesstime.precision is set to a one-hour interval: the first access to a file triggers an update to the edit log, and subsequent updates for the same file are deferred until the next hour. If you access many unique files, your read requests effectively turn into writes because of these edit-log updates, which is costly and a well-documented performance impact. If your use case does not require tracking access time, disable it, or delay the update to occur once daily instead of every hour. Note that the access time update is per file, not a single global update.
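As a rough illustration of the batch listing and batch deletion approach suggested above, here is a minimal Java sketch against the public Hadoop FileSystem API. The target path, batch size, and pause are hypothetical placeholders; this is a sketch of the idea, not a production tool.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch: list a huge directory incrementally and delete its children in
// small batches instead of issuing one recursive delete of the parent.
public class BatchedCleanup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path bigDir = new Path("/data/huge-dir");   // hypothetical path

    // listStatusIterator fetches entries in server-side pages (dfs.ls.limit),
    // so the client never asks the Namenode for the whole listing at once.
    RemoteIterator<FileStatus> it = fs.listStatusIterator(bigDir);
    int inBatch = 0;
    while (it.hasNext()) {
      FileStatus entry = it.next();
      // Delete one child at a time so each call holds the Namenode write lock
      // for a much shorter time. If a child is itself huge, recurse into it
      // with the same pattern instead of deleting it in one call.
      fs.delete(entry.getPath(), true);
      if (++inBatch == 100) {                  // hypothetical batch size
        inBatch = 0;
        Thread.sleep(500);                     // give other RPCs a chance
      }
    }
    fs.delete(bigDir, false);                  // finally remove the now-empty parent
  }
}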
09-05-2023
03:33 PM
3 Kudos
High level Problem Explanation
The primary garbage collection challenge usually arises when the heap configuration of the Namenode or Datanode is inadequate. GC pauses can trigger Namenode crashes, failovers, and performance bottlenecks. In larger clusters, the Namenode may be completely unresponsive while transitioning to the other Namenode.
Detailed Technical Explanation
The Namenode comprises three layers of management:
Inode Management
Block Management
Snapshot Management
All of these layers are kept in the Namenode's in-memory heap, and any substantial growth or modification of them leads to increased heap usage. Similarly, the Datanode maintains its block mapping information in memory.
Example Of Inode Reference
<inode>
<id>26890</id>
<type>DIRECTORY</type>
<name>training</name>
<mtime>1660512737451</mtime>
<permission>hdfs:supergroup:0755</permission>
<nsquota>-1</nsquota><dsquota>-1</dsquota>
</inode>
<inode>
<id>26893</id>
<type>FILE</type>
<name>file1</name>
<replication>3</replication>
<mtime>1660512801514</mtime>
<atime>1660512797820</atime>
<preferredBlockSize>134217728</preferredBlockSize>
<permission>hdfs:supergroup:0644</permission>
<storagePolicyId>0</storagePolicyId>
</inode>
Example Of Block Reference
<blocks>
<block><id>1073751671</id><genstamp>10862</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073751672</id><genstamp>10863</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073751673</id><genstamp>10864</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073751674</id><genstamp>10865</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073751675</id><genstamp>10866</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073751676</id><genstamp>10867</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073751677</id><genstamp>10868</genstamp><numBytes>116452794</numBytes></block>
</blocks>
Example Of Snapshot Reference
<SnapshotSection>
<snapshotCounter>1</snapshotCounter>
<numSnapshots>1</numSnapshots>
<snapshottableDir>
<dir>26890</dir>
</snapshottableDir>
<snapshot>
<id>0</id>
<root>
<id>26890</id>
<type>DIRECTORY</type>
<name>snap1</name>
<mtime>1660513260644</mtime>
<permission>hdfs:supergroup:0755</permission>
<nsquota>-1</nsquota>
<dsquota>-1</dsquota>
</root>
</snapshot>
</SnapshotSection>
High heap usage can be caused by several factors, including:
Inadequate heap size for the current dataset.
Lack of proper heap tuning configuration, for example not setting "-XX:CMSInitiatingOccupancyFraction" (default value 92).
Gradual growth in the number of inodes/files over time.
Sudden spikes in HDFS data volumes.
Sudden changes in HDFS snapshot activity.
Jobs that generate millions of small files.
System slowness, including CPU or IO bottlenecks.
Symptoms
Namenode failures and Namenode transitions.
Namenode logs showing JVM pause metrics; any pause exceeding 10 seconds is a concern.
WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 143694ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=2 time=143973ms
Occasionally, GC pauses are induced by system overhead, such as CPU spikes, IO contention, slow disks, or kernel delays. Look for the phrase "No GCs detected" in the Namenode log.
WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 83241ms No GCs detected
Do not misinterpret the FSNamesystem lock message that follows; the root cause is the GC pause that occurred before this event.
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Number of suppressed read-lock reports: 0
Longest read-lock held at xxxx for 143973ms via java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:187)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1684)
org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:134)
..
Relevant Data to Check
To gain insight into usage patterns, regularly collect heap histograms. If you have already enabled -XX:+PrintClassHistogramBeforeFullGC and -XX:+PrintClassHistogramAfterFullGC in the heap configuration, no additional histogram collection is needed.
Metrics: Cloudera Manager -> Charts -> Chart Builder ->
SELECT jvm_gc_rate WHERE roleType = NAMENODE and hostname = "<problematic namenode hostname>"
SELECT jvm_max_memory_mb, jvm_heap_used_mb WHERE roleType = NAMENODE and hostname = "<problematic namenode hostname>"
Minor pauses are generally tolerable, but when pause times exceed the "ha.health-monitor.rpc-timeout.ms" timeout, the risk of a Namenode crash increases, particularly if the Namenode writes edits after a GC pause. Heap usage reaching 85% of the total heap can also lead to issues, and even small pauses of 2-5 seconds can be a problem if they occur frequently.
Resolution
To size the heap effectively, consider the current system load and anticipated future workloads. Given the diverse range of customer use cases, our standard guidelines may not align perfectly with every scenario; factors such as file size distribution, snapshot utilization, and encryption zones all influence requirements. Instead, analyze the current number of files, directories, and blocks together with the existing heap utilization. For instance, if you have 30 million files and directories, 40 million blocks, 200 snapshots, and your current heap usage is 45GB out of a total heap of 50GB, you can estimate the additional heap required for a 50% load increase as current_heap * additional_load_percentage / 100, which works out to roughly 23GB of additional heap (a small worked example follows below). You can find current heap usage in the Namenode web UI or through Cloudera Manager (CM) charts by querying: SELECT jvm_max_memory_mb, jvm_heap_used_mb WHERE roleType = NAMENODE and hostname = "<problematic Namenode hostname>".
To ensure smooth operation, allocate an additional 10-15% of heap beyond the total configured heap, and add a 10-20% buffer if Sentry is configured. Keep the total count of files and directories in the cluster below 350 million; Namenode performance tends to decline past this threshold, and we generally recommend not exceeding 300 million files per cluster. Cross-reference this guidance with the official heap sizing documents provided by CDP.
If G1 garbage collection is configured for the Namenode and there is no notable improvement in GC performance, consider switching back to CMS. During our internal testing we observed no significant Namenode performance improvement with G1, but validate this with your own testing.
For Datanode performance, ensure the Java heap is set to 1GB per 1 million blocks. Additionally, verify that all Datanodes are balanced, with a roughly equal distribution of blocks; avoid scenarios where one node holds a disproportionately high number of blocks compared to the others.
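As a quick illustration of the sizing arithmetic above, here is a minimal Java sketch using the example figures from this post; the 15% buffer applied at the end and the method name are illustrative assumptions, not an official sizing formula.

// Minimal sketch of the heap estimate described above.
public class HeapEstimate {
  static double additionalHeapGb(double currentHeapUsedGb, double loadIncreasePercent) {
    return currentHeapUsedGb * loadIncreasePercent / 100.0;
  }

  public static void main(String[] args) {
    double currentUsed = 45.0;                            // GB used today (example from the post)
    double extra = additionalHeapGb(currentUsed, 50.0);   // ~22.5 GB for a 50% load increase
    double target = (currentUsed + extra) * 1.15;         // plus an assumed ~15% operational buffer
    System.out.printf("additional ~%.1f GB, suggested total heap ~%.0f GB%n", extra, target);
  }
}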
06-22-2023
10:08 AM
2 Kudos
High level Problem Explanation
"Block missing" and "block corruption" alerts appear in the Cloudera Manager dashboard or the namenode web UI. There is no single common cause; these alerts can be due to any of the issues outlined below.
Detailed Technical Explanation
When the namenode does not receive a report for any replica of a block, it marks that block as missing. Similarly, when all of a block's replicas are corrupted, the namenode marks the block as corrupt. However, if at least one replica (for example, 1 of 3) is reported healthy, no missing or corruption alert is raised.
There are multiple scenarios in which blocks can be marked as missing and corrupt.
Scenario 1
During a major upgrade, users may encounter missing-block errors. This issue arises from insufficient datanode heap.
The upgrade process migrates the disk layout and requires additional datanode heap space. Users upgrading to CDP 7.1.7 and above will not experience this issue. Other users need to double the datanode heap from its current value for the duration of the upgrade; once the upgrade is complete, the heap size can be reverted to its original value.
The heap adjustment is required specifically for the layout upgrade from -56 (256x256) to -57 (32x32). During this process, each volume's blocks are scanned, and a list of LinkArgs and file objects is stored as Strings, resulting in significant heap utilization.
Example of a file object stored as a String:
/data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825_1001.meta
/data01/dfs/dn/current/BP-586623041-127.0.0.1-1617017575175/current/finalized/subdir0/subdir0/blk_1073741825
If a datanode crashes with an Out of Memory error during the upgrade, you will see missing blocks: the blocks are still in the old layout, but the datanode assumes they are in the new layout. If you encounter this issue, contact Cloudera for assistance in fixing the block layout.
Scenario 2
A user manually replaces the fsimage in the namenode's current directory with an old one, which results in inconsistent behavior and shows missing and corrupt blocks after restart.
Scenario 3
In rare instances, users may encounter a scenario where dfs.checksum.type is set to NULL, which disables the built-in safety mechanisms and allows block corruption to go undetected.
HDFS incorporates a built-in checksum mechanism that is enabled by default. When a client writes data, it calculates a CRC checksum for the data. Additionally, the last datanode in the write pipeline also calculates a checksum for the received data. If the checksums from the client and datanode match, the data is considered valid and is stored on disk along with the checksum information.
Checksums play a crucial role in identifying various hardware issues, including problems with network cables, incompatible MTU network settings, and faulty disks or controllers. It is highly recommended to enable checksums on an HDFS cluster to ensure data integrity.
However, when the checksum.type is set to NULL, all these safety mechanisms are effectively disabled. Without checksums, HDFS is unable to detect corruption. This means that data can be silently corrupted during tasks such as data migration between hosts or due to disk failures.
To identify such block corruption, you can use the "hdfs debug verify" command to verify a block against its checksum metadata (a small client-side sketch follows at the end of this scenario).
hdfs debug verify -meta <metadata-file> [-block <block-file>]
For assistance in efficiently checking the block checksums of the entire HDFS filesystem, we recommend reaching out to the Cloudera storage team. We have developed a specialized tool designed to perform this task quickly and effectively.
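As an additional client-side spot check (my own suggestion, not part of the steps above), the following Java sketch reads the effective dfs.checksum.type from the client configuration and requests a file-level checksum through the public FileSystem API; the sample path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: confirm checksums are not disabled and fetch a file checksum.
public class ChecksumSpotCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // A value of "NULL" here means the write-path safety net described above is off.
    System.out.println("dfs.checksum.type = " + conf.get("dfs.checksum.type", "CRC32C"));

    FileSystem fs = FileSystem.get(conf);
    FileChecksum sum = fs.getFileChecksum(new Path("/user/hdfs/some-file"));  // hypothetical path
    // Returns null for files or filesystems without checksum support.
    System.out.println("file checksum = " + (sum == null ? "none" : sum));
  }
}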
Scenario 4
When the size of a block report exceeds the limit set by "ipc.maximum.data.length" (default 64 MB), the namenode rejects those block reports, which results in missing blocks. The rejection is typically accompanied by log entries in both the namenode and datanode logs indicating "Protocol message was too large."
In scenarios where a single datanode holds over 15 million blocks, it becomes more likely to encounter block reports that surpass the configured limit. This can result in the rejection of such block reports by the namenode due to their excessive size.
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: RemoteException in offerService org.apache.hadoop.ipc.RemoteException(java.io.IOException): java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException:
Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.runBlockOp(BlockManager.java:5174)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1574)
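For reference, the effective limit can be read from the configuration as in the minimal Java sketch below; whether raising the limit is appropriate for a given cluster is my assumption to mention, not guidance stated in this scenario.

import org.apache.hadoop.conf.Configuration;

// Sketch: print the RPC payload limit that block reports must fit under.
// 64 MB (67108864 bytes) is the default for ipc.maximum.data.length.
public class IpcLimitCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    int limit = conf.getInt("ipc.maximum.data.length", 64 * 1024 * 1024);
    System.out.println("ipc.maximum.data.length = " + limit + " bytes");
  }
}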
Scenario 5
There are instances where an administrator accidentally deletes on-disk data to reclaim disk space without fully understanding how Hadoop stores data on disk. This leads to lost blocks.
Scenario 6
The filesystem appears healthy, but running FSCK with snapshots included reports a missing block. This arises from a snapshot reference whose blocks are missing.
hdfs fsck / -includeSnapshots -list-corruptfileblocks
Example:
A file named "f1" is created, and its corresponding block ID is "blk_100005".
A snapshot named "s1" is taken for "f1".
Running FSCK shows that the filesystem is healthy, as expected.
However, at a later point, "blk_100005" goes missing due to unknown reasons.
Consequently, the filesystem becomes corrupted, and FSCK accurately reports the corrupt filename and its block.
The next step involves removing the corrupt file "f1" using the command: hdfs dfs -rm -skipTrash f1.
Create or reload a new file, "f1", which creates a new block labeled "blk_100006".
The problem arises at this stage: FSCK continues to identify file "f1" as corrupt even after reloading it with a new block.
This scenario is a false positive: FSCK should report whether a missing reference stems from live data or only from a snapshot. The behavior described here is tracked as HDFS-15770, a minor improvement that still requires a fix.
Scenario 7
A "Too many open files" exception in the DataNode can lead to block deletion and ultimately to missing blocks. This issue has been addressed in CDP.
Scenario 8
In a rare scenario that has not been conclusively proven yet, a user experienced a loss of multiple racks in their cluster due to rack failure. After restarting the DataNode following a power event, the blocks were found in the "rbw" (Replica Being Written) directory.
All the blocks from the "rbw" directory transitioned into the "RWR" (Replica Waiting to be Recovered) state. As a result, the NameNode marked these blocks as corrupt, citing the reason as "Reason.INVALID_STATE".
However, the issue lies in the fact that the NameNode only removes a replica from the corrupt map if the reason for corruption is "GENSTAMP_MISMATCH". In this particular scenario, the reason is different, causing a challenge in removing the replicas from the corrupt map.
Scenario 9
There is an issue where certain information in an incremental block report (IBR) gets replaced with existing stored information while queuing the block report on a standby Namenode. This can result in false block corruption reports.
In a specific incident, a Namenode transitioned to an active state and started reporting missing blocks with the reason "SIZE_MISMATCH" for corruption. However, upon investigation, it was discovered that the blocks were actually appended, and the sizes on the datanodes were correct.
The root cause was identified as the standby Namenode queuing IBRs with modified information.
Users experiencing this false block corruption issue are advised to trigger a manual block report, as it can rectify the erroneous reports and restore accurate block information. This issue is addressed in CDP 7 SP1 and SP2.
hdfs dfsadmin -triggerBlockReport [-incremental] <datanode_host:ipc_port>
Verification of Problem
As a first step, verify whether the fsck command reports any corrupt blocks:
hdfs fsck / -list-corruptfileblocks -files -blocks -locations
Connecting to namenode via https://host:20102/fsck?ugi=hdfs&listcorruptfileblocks=1&files=1&blocks=1&locations=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files
Note: A healthy filesystem should always have zero corrupt files.
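The same check can also be scripted against the public FileSystem API, as in the hedged Java sketch below; the root path is just an example, and on a healthy filesystem the loop prints nothing.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch: programmatic equivalent of "hdfs fsck / -list-corruptfileblocks".
public class CorruptFileCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    RemoteIterator<Path> corrupt = fs.listCorruptFileBlocks(new Path("/"));
    while (corrupt.hasNext()) {
      System.out.println("corrupt file block in: " + corrupt.next());
    }
  }
}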
Determine the number of missing blocks that have only one replica. These blocks are of critical concern, as the chances of recovery are minimal.
hdfs dfsadmin -report > dfsadmin.report
cat dfsadmin.report | grep "Missing blocks" | grep "with replication factor 1"
Verify the operational status of all datanodes in the cluster and check for any dead datanodes. If three or more datanodes are offline, dead, or not functioning correctly, missing and corrupt blocks can result.
Verify if there are multiple disk failures across the datanodes. In such cases, the presence of missing and corrupt blocks is expected.
Check whether the storage capacity of the datanode is fully utilized, reaching 100%. In such instances, the datanode may become unresponsive and fail to send block reports.
Verify whether a block report sent by a datanode was rejected by the namenode. This appears in the namenode log as a warning or error message containing: "is longer than maximum configured RPC length ... RPC came from ...".
Confirm if the user has made changes to the node configuration groups in Cloudera Manager (CM). Such modifications could potentially result in the omission of several data disk directories, leading to issues with missing data.
Verify whether the block still exists on the local filesystem or has been unintentionally removed by users:
find <dfs.datanode.data.dir> -type f -iname <blkid>*
Repeat the same step on all datanodes.
Ensure that you check if the mount point has been unmounted due to a filesystem failure. It is important to verify the datanode and system logs for any relevant information in this regard.
Verify whether the block has been written into the root volume as a result of disk auto unmount. It is essential to consider that the data might be concealed if you remount the filesystem on top of the existing datanode directory.
Note: Examine all the provided scenarios and their corresponding actions. If a block is physically present on any of the datanodes, it is generally recoverable, except in cases where the block itself is corrupt.
10-04-2018
11:25 AM
2 Kudos
We have seen performance and stability issues on heavily loaded systems caused by disk latency and shared disks: frequent namenode failovers, longer startup times, slower checkpoints, slower logging, and high fsync times causing session expiry. We never recommend a shared disk for the Namenode, JournalNode, and ZooKeeper; each of these services should have a dedicated disk. You can configure the following disk types according to your HDFS workload.
[dfs.namenode.name.dir]
Namenode fsimage directory, dedicated disk => HDD 15K RPM
[dfs.journalnode.edits.dir]
Namenode edit log directory, dedicated disk => SSD
[dataDir from zoo.cfg]
ZooKeeper snapshot and transaction logs, for normal usage from ZKFC and HBase => HDD 15K RPM
ZooKeeper snapshot and transaction logs, if used by NiFi/Accumulo/Kafka/Storm/HBase and ZKFC => SSD
If you are using RAID for the metadata directories (dfs.namenode.name.dir and dfs.journalnode.edits.dir), disable RAID and compare the non-RAID performance. The metadata already has strong redundancy: the fsimage and edits are also available from the Standby NameNode and the remaining JournalNodes. If RAID cannot be disabled for the JournalNodes, consider a different RAID level; RAID 1 or RAID 10 suits dfs.journalnode.edits.dir better than RAID 5, which increases write latency for small writes. If you don't have fast disks, don't configure multiple fsimage directories (fsimage replication); it impacts write performance even if only one of the disks is slow.
10-04-2018
09:53 AM
@Elias Abacioglu You can refer to the guidance below for configuring the service RPC port. https://community.hortonworks.com/articles/223817/how-do-you-enable-namenode-service-rpc-port-withou.html
10-04-2018
09:42 AM
1 Kudo
The service RPC port gives the DataNodes a dedicated port to report their status via block reports and heartbeats. The port is also used by the ZooKeeper Failover Controllers for the periodic health checks of the automatic failover logic. The port is never used by client applications, so it reduces RPC queue contention between client requests and DataNode messages.
Steps:
1) Ambari -> HDFS -> Configs -> Advanced -> Custom hdfs-site -> Add Property
dfs.namenode.servicerpc-address.<dfs.internal.nameservices>.nn1=<namenode1 host:rpc port>
dfs.namenode.servicerpc-address.<dfs.internal.nameservices>.nn2=<namenode2 host:rpc port>
dfs.namenode.service.handler.count=(dfs.namenode.handler.count / 2)
This RPC port receives all DataNode and ZKFC requests such as block reports, heartbeats, liveness reports, etc.
Example from hdfs-site.xml:
dfs.nameservices=shva
dfs.internal.nameservices=shva
dfs.ha.namenodes.shva=nn1,nn2
dfs.namenode.rpc-address.shva.nn1=hwxunsecure2641.openstacklocal:8020
dfs.namenode.rpc-address.shva.nn2=hwxunsecure2642.openstacklocal:8020
dfs.namenode.handler.count=200
Service RPC host, port and handler threads:
dfs.namenode.servicerpc-address.shva.nn1=hwxunsecure2641.openstacklocal:8040
dfs.namenode.servicerpc-address.shva.nn2=hwxunsecure2642.openstacklocal:8040
dfs.namenode.service.handler.count=100
2) Restart the Standby Namenode. You must wait until the Standby Namenode is out of safemode. Note: You can check the safemode status in the Standby Namenode UI.
3) Restart the Active Namenode.
4) Stop the Standby Namenode ZKFC controller.
5) Stop the Active Namenode ZKFC controller.
6) Log in to the Active Namenode and reset the Namenode HA state.
#su - hdfs
$hdfs zkfc -formatZK
7) Log in to the Standby Namenode and reset the Namenode HA state.
#su - hdfs
$hdfs zkfc -formatZK
8) Start the Active Namenode ZKFC controller.
9) Start the Standby Namenode ZKFC controller.
10) Rolling restart the Datanodes.
Note: The NodeManager should not be installed on the Namenode host because it uses the same port 8040. If it is installed, change the service RPC port from 8040 to a different port.
Ref: scaling-the-hdfs-namenode
10-04-2018
07:41 AM
@Muthukumar Somasundaram Formatting is not an ideal option to solve this issue; you would lose all of your data.
09-21-2018
10:02 AM
@Santanu Ghosh The 1.x branch does not have QJN-based namenode HA. It is production ready and available only from hadoop-2.x onwards. You can refer to the HDFS-HA umbrella JIRA, HDFS-3278.
09-11-2018
10:44 AM
2 Kudos
You can safely ignore this warning if you have not enabled the service RPC port. This is a dedicated port on the Namenode to which datanodes send liveness and block reports. The port is never used by client applications, so it reduces RPC queue contention between client requests and DataNode messages. You can refer to this link for more: Scaling Namenode.
09-11-2018
09:22 AM
@Jagatheesh Ramakrishnan Appreciate your effort in writing this data recovery article. Can you please add a note to it? The Namenode should be stopped immediately after the file deletion; otherwise recovery is hard, because the namenode will already have sent block deletion requests to the datanodes, and the physical blocks may already have been deleted.