Community Articles

Find and share helpful community-sourced technical articles.
avatar
Expert Contributor

High level Problem Explanation

The main issue impacting Namenode performance is the global FSNamesystem lock during read and write operations. Handling tasks in larger directories, especially those involving millions of files, can lead to high lock times and overall slowness in operations like extensive file listings, retrieving block locations, handling large deletions, performing DU (Usage) and BDR snapshot activities.

           ------------    (Global lock - NameSystem) --------
          |                        |                         |
  	                                          
Directory set (InodeMap)     Block set(BlockMap)  Cluster information(DatanodeManager)

Detailed Technical Explanation

RPC request #getListing:

  1. DFSClient initiates getList for the given path, invoking DistributedFileSystem#listStatus. It is important to note that this process isn't atomic, and the directory entries may be retrieved from the NameNode multiple times.
  2. DFSClient retrieves the initial batch of entries in the directory and iterates through them to obtain another set.
  3. FSNamesystem#getListing called on the Namenode, and acquiring a read lock. The Namenode returns the number of directory items and release lock once it reach dfs.ls.limit, default set at 1000 items. 
  4. The client validates whether it has obtained all entries for the directory. If there are remaining items, it repeats steps 1 to 4.

The issue is step 3, where it acquires read lock. While this works efficiently and quickly for small directory sets less than 1000 files, but it becomes problematic for larger directory sets. In such cases, the client aggressively lookup for the remaining items. 

RPC request #getBlockLocations:

  1. DFSClient#getBlockLocations requests the block locations for a specified file (or the locations of all files within a given directory).

  2. FSNamesystem#getBlockLocations is called in the Namenode to retrieve block locations. Essentially, it obtains list of all Datanode hostnames where that file was written. 

The issue is step 2 where it acquires read lock. Obtaining the location of a single block shouldn't be any issue. However, acquiring all block locations for a more extensive directory becomes problematic due to a lock. There are instances in spark/impala where a substantial set of block locations are obtained, particularly during the initial job submission.

RPC request #delete:

  1. The client initiates DFSClient#delete for a file or directories, with a recursive mode option.

  2. FSNamesystem#delete is called in the namenode. When dealing with large directories, deletion takes place incrementally. All the blocks associated with those directories are gathered and deleted in small batches.

  3. Typically, deletion occurs in a single batch for a small directory or file. But the problem comes for larger directories. The issue is step 2 where it acquires write lock. Additionally, every namespace modification needs updates in the edits. For delete example:

 

 

<RECORD>
  <OPCODE>OP_DELETE</OPCODE>
  <DATA>
    <TXID>69947</TXID>
    <LENGTH>0</LENGTH>
    <PATH>/user/hdfs/.Trash/Current/tmp/test</PATH>
    <TIMESTAMP>1660859400205</TIMESTAMP>
    <RPC_CLIENTID>9a5ec8c2-8ca2-4195-b788-d39c52278d0c</RPC_CLIENTID>
    <RPC_CALLID>5</RPC_CALLID>
  </DATA>
</RECORD>

 

 

RPC request #getSnapshotDiffReport:

BDR heavily relies on the snapshotDiff report. FSNamesystem#getSnapshotDiffReport calculates the differences between two snapshots, generating this report while holding a read lock. The issues is comparing this large snapshots under lock.

This lock information is available in namenode log.  For example, the following lookup operation is extensive, causing the namenode to pause or lock for 7.6 seconds. This diff operation involves 104k directories and 3.8 million files.

 

 

INFO SnapshotDiffReport distcp-797-xxxx-old' to 'distcp-yy-xxx-new'. Total comparison dirs: 0/104527, files: 0/3858293. Time snapChildrenListing: 0.012s, actual: 7.631s, total: 12.565s.

 

 

RPC request #snapshotDelete:

Much like the previous scenario, BDR heavily relies on the deleteSnapshot operation. FSNamesystem#deleteSnapshot handles this deletion under the write lock, meaning that deleting a sizable snapshot could potentially impact readers.

RPC request #DU:

In the DU operation, a read lock is acquired to determine the size of a specified directory. The read lock is released after processing 5000 items (as defined by dfs.content-summary.limit). Following this, there is a 500-microsecond sleep (as specified by dfs.content-summary.sleep-microsec) before attempting to reacquire the read lock to calculate remaining items. However, problem is aggressive seeks to acquiring read lock for remaining items.

Verification of Problem

Check whether the following trace is consistently repeated in the RUNNABLE state across all jstacks. Additionally, verify the corresponding path from the audit logs.

For getListing: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing

 

 

IPC Server handler 20 on 8020" #189 daemon prio=5 os_prio=0 tid=0x00007f88d6612800 nid=0x2cf3 runnable [0x00007f6f27896000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.String.split(String.java:2354)
        at java.lang.String.split(String.java:2422)
        at org.apache.hadoop.fs.permission.AclEntry.parseAclEntry(AclEntry.java:268)
        at org.apache.sentry.hdfs.SentryAuthorizationInfo.getAclEntries(SentryAuthorizationInfo.java:319)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListing(FSDirStatAndListingOp.java:257)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getListingInt(FSDirStatAndListingOp.java:83)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:3735)

 

 

For getBlockLocations: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations

 

 

IPC Server handler 92 on 8020" #261 daemon prio=5 os_prio=0 tid=0x00007f88d65c2800 nid=0x2d3b runnable [0x00007f6f2304e000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.String.intern(Native Method)
        at org.apache.hadoop.hdfs.protocol.ExtendedBlock.<init>(ExtendedBlock.java:46)
        ..
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1338)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:175)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)

 

 

For delete: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete

 

 

IPC Server handler 162 on 8020" #221 daemon prio=5 os_prio=0 tid=0x00007f88d6345782 nid=0x2d8c runnable [0x00007f64f55g6700]
   java.lang.Thread.State: RUNNABLE
    ...
    org.apache.hadoop.hdfs.server.namenode.INode.getLocalName(INode.java:575)
    org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:984)
    org.apache.hadoop.hdfs.server.namenode.FSDirectory.getFullPathName(FSDirectory.java:1022)
    org.apache.hadoop.hdfs.server.namenode.INode.getFullPathName(INode.java:591)
    org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.isAccessAllowed(RangerHdfsAuthorizer.java:480)
    org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:322)
    org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
    org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1956)
    org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:92)
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3999)
    org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:1078)

 

 

For getSnapshotDiffReport: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport

 

 

IPC Server handler 8 on 8020" #177 daemon prio=5 os_prio=0 tid=0x00007f88d6b42800 nid=0x2ce7 runnable [0x00007f6f284a2000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiffRecursively(DirectorySnapshottableFeature.java:367)
        ..
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiffRecursively(DirectorySnapshottableFeature.java:360)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(DirectorySnapshottableFeature.java:282)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:458)
        at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(FSDirSnapshotOp.java:160)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:6486)

 

 

For deleteSnapshot: org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot

 

 

org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1604)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteSnapshot(FSNamesystem.java:6543)
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.deleteSnapshot(NameNodeRpcServer.java:1813)

 

 

For DU: org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary

 

 

IPC Server handler 26 on 8020 - priority:10 - threadId:0x00007f13d911c800 - nativeId:0x8c81 - state:RUNNABLE
stackTrace:
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hdfs.server.namenode.INodeFile.computeContentSummary(INodeFile.java:556)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:632)
at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:618)
at org.apache.hadoop.hdfs.server.namenode.INode.computeAndConvertContentSummary(INode.java:448)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getContentSummary(FSDirectory.java:2166)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:4401)

 

 

Resolution

  1. Refrain from executing these commands on larger directories. Explore alternative approaches such as batch deletion or batch listing, etc.
  2. Consider OfflineImageViewer for your read scenario, although it won't provide real-time.
  3. Optimize BDR snapshots, considering factors like subdirectories, timing of execution, etc.
  4. If unavoidable during peak times, schedule these operations to run during non-peak hours; however, performance challenges may persist.
  5. If getListing a problem, consider reducing the "dfs.ls.limit" to 250 for improvements.
  6. Crucially, either disable or delay the update of file access time. The default setting for dfs.namenode.accesstime.precision is configured for a one-hour interval. This means that if a file is accessed, it triggers an update to the edit file, but subsequent updates for the same file are deferred until the next hour. For instance, if you're accessing multiple unique files, your read requests may transform into writes due to the edit update, incurring significant costs and a well-documented performance impact. If your use case doesn't necessitate tracking access time, consider delaying the update to occur once daily instead of every hour. Note that this access time update is specific to each file and not a generic update.
2,471 Views
0 Kudos