Member since
08-11-2015
13
Posts
0
Kudos Received
1
Solution
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| 204 | 05-22-2026 07:38 AM |
06-08-2026
12:27 AM
Thank you for the great insights These are the disctp options distcp -skipcrccheck -copybuffersize 32768 -update -pugp -bandwidth 50 -strategy dynamic and yes, total execution time increases almost linearly with nr of files/directories
... View more
06-05-2026
03:04 AM
Anyway, everything seems related to logs collection for JobHistory
... View more
06-05-2026
03:03 AM
And checking RM there is a huge number of call to the AM , one each 10 seconds: 2026-06-05 11:09:39,050 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=arcgis OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1780210407885_1976 CONTAINERID=container_e22_1780210407885_1976_01_000014 RESOURCE=<memory:2048, vCores:1> QUEUENAME=arcgis
2026-06-05 11:09:40,102 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:09:50,116 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
.....
2026-06-05 11:17:41,013 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1780210407885_1976_000001 with final state: FINISHING, and exit status: -1000
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from RUNNING to FINAL_SAVING on event = UNREGISTERED
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1780210407885_1976 with final state: FINISHING
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1780210407885_1976 State change from RUNNING to FINAL_SAVING on event = ATTEMPT_UNREGISTERED
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1780210407885_1976
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from FINAL_SAVING to FINISHING on event = ATTEMPT_UPDATE_SAVED
... View more
06-05-2026
02:31 AM
Checking node manager logs the time spent is between container removal and container success declaration 2026-06-05 11:08:15,452 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_e22_1780210407885_1976_01_000013]
2026-06-05 11:17:55,672 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_e22_1780210407885_1976_01_000001 succeeded
... View more
06-05-2026
02:19 AM
I'm copying several directories with several files each Yarn application full termination seems affected by the number of object transferred In this case we have 9 minutes between 100% completion and job termination 26/06/05 10:47:20 INFO mapreduce.Job: map 100% reduce 0%
26/06/05 10:56:25 INFO mapreduce.Job: Job job_1780210407885_1954 completed successfully
with
Files Copied=68744
DIR_COPY=23058 while here few seconds 26/06/05 10:43:11 INFO mapreduce.Job: map 100% reduce 0%
26/06/05 10:43:32 INFO mapreduce.Job: Job job_1780210407885_1958 completed successfully
with
Files Copied=12300
DIR_COPY=191 Is the any explanation for such a behaviour?
... View more
Labels:
- Labels:
-
Apache YARN
05-22-2026
07:38 AM
Self resolved The switch between direct single threaded copy and distcp depends on file size gt hive.exec.copyfile.maxsize The default value is 32MB
... View more
05-22-2026
03:26 AM
I'm exporting several tables and I observe that some files are copied to the target path using DistCp (slow) while other files with some other (fast) mean There is no evidence of the rational behind the choice but the other odd thing is that even if a table is made up of multiple files, hive starts 1 DistCp for each file instead of passing the entire directory Is there any option to drive the behaviour?
... View more
Labels:
- Labels:
-
Apache Hive
02-23-2026
02:40 AM
On the data node the typical stack trace were these 2026-02-20 12:01:41,486 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Lock held time above threshold: lock identifier: FsDatasetRWLock lockHeldTimeMs=8582 ms. Supp ressed 0 lock warnings. Longest suppressed LockHeldTimeMs=0. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:160) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:220) org.apache.hadoop.util.InstrumentedReadLock.unlock(InstrumentedReadLock.java:78) org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84) org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1920) org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:376) org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:719) org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:872) java.lang.Thread.run(Thread.java:748) 2026-02-20 12:01:41,486 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Waited above threshold to acquire lock: lock identifier: FsDatasetRWLock waitTimeMs=7442 ms. Suppressed 3 lock wait warnings. Longest suppressed WaitTimeMs=414. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.util.InstrumentedLock.logWaitWarning(InstrumentedLock.java:171) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:222) org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:105) org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1646) org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212) org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1303) org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:762) org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:178) org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:112) org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291) java.lang.Thread.run(Thread.java:748) 2026-02-20 11:06:02,845 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Waited above threshold to acquire lock: lock identifier: FsDatasetRWLock waitTimeMs=688 ms. S uppressed 5 lock wait warnings. Longest suppressed WaitTimeMs=397. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.util.InstrumentedLock.logWaitWarning(InstrumentedLock.java:171) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:222) org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:105) org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1750) org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:997) org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:899) org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:178) org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:112) org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291) java.lang.Thread.run(Thread.java:748) and this 2026-02-20 11:11:44,500 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Waited above threshold to acquire lock: lock identifier: FsDatasetRWLock waitTimeMs=443 ms. Suppressed 1 lock wait warnings. Longest suppressed WaitTimeMs=412. The stack trace is: java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.util.InstrumentedLock.logWaitWarning(InstrumentedLock.java:171) org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:222) org.apache.hadoop.util.InstrumentedLock.lock(InstrumentedLock.java:105) org.apache.hadoop.util.AutoCloseableLock.acquire(AutoCloseableLock.java:67) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaMap.get(ReplicaMap.java:115) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.validateBlockFile(FsDatasetImpl.java:2036) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReplica(FsDatasetImpl.java:808) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReplica(FsDatasetImpl.java:801) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getLength(FsDatasetImpl.java:794) org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkBlock(FsDatasetImpl.java:1988) org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlock(DataNode.java:2315) org.apache.hadoop.hdfs.server.datanode.DataNode.transferBlocks(DataNode.java:2372) org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:726) org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:684) org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1334) org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1380) org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1307) org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1290)
... View more
02-23-2026
02:31 AM
Hi @Asfahan yes, heap should be around 300GB but these is what NN say on webui Heap Memory used 111.53 GB of 169.41 GB Heap Memory. Max Heap Memory is 169.41 GB. For what concerns handlers, dfs_namenode_handler_count is 70 (it should be 80 with 17 datanodes) while dfs_datanode_handler_count is at it's default value of 3 On a different cluster I had this set to 24 this is the stack trace for a write-lock held in active NN 2026-02-20 11:01:44,596 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of suppressed write-lock reports: 0 Longest write-lock held at 1972-02-11 21:18:16,333+0100 for 6157ms via java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1058) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:262) org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:226) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1696) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.processBlocksInternal(DatanodeAdminManager.java:703) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.pruneReliableBlocks(DatanodeAdminManager.java:644) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:572) org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:506) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748) Total suppressed write-lock held time: 0.0
... View more
02-23-2026
12:49 AM
Hello @Asfahan thank you for the answer, yes, I understand that the cluster it'a little oversized About the topic, I don't find any "Block report queue full" message but several write-lock with long duration but, strange enough, not during hdfs service startup What I find is a number of request coming via NFS Gateway (around 3000/minute) and several GC (Allocation Failure) in gc log in the first 20 minutes of startup and several about a the end when all the datanodes reported thei blocks The NN has 160GB of heap and DN 30GB What I found strange is dfs_datanode_handler_count set to 3, that might be the cause of the original issue that forced me to restart the service In fact, I was decommissioning one node and when I started, suddenly I've experience a huge performance degradation, even if network, hdfs and disk I/O were not so critical (cluster Net I/O peak was 280 MB/s, hdfs I/O 190 MB/s, disk I/O write peak of 300 MB/s)
... View more