Created on 10-25-2019 07:45 AM - edited 10-25-2019 11:34 AM
Ambari 2.7.3
HDP 3.1.0
Hello team. My service went down 15 days ago and I'm still not able to fix this issue. If someone has face it before, please let me know.
From last logs:
Caused by: java.net.ConnectException: Call to atsserver/ip:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atsserver/ip:17020
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:307)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1377)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:929)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:269)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:263)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atserver/ip:17020
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hbase.thirdparty.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 7 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
2019-10-10 10:03:05,448 ERROR reader.TimelineReaderServer (LogAdapter.java:error(75)) - RECEIVED SIGNAL 15: SIGTERM
2019-10-10 10:03:05,457 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.w.WebAppContext@7357a011{/,null,UNAVAILABLE}{/timeline}
2019-10-10 10:03:05,460 INFO server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped ServerConnector@1bae316d{HTTP/1.1,[http/1.1]}{0.0.0.0:8198}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@4bff7da0{/static,jar:file:/usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-common-3.1.1.3.1.0.0-78.jar!/webapps/static,UNAVAILABLE}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@3e62d773{/logs,file:///usr/logs/hadoop-yarn/yarn/,UNAVAILABLE}
2019-10-10 10:03:05,462 INFO storage.HBaseTimelineReaderImpl (HBaseTimelineReaderImpl.java:serviceStop(108)) - closing the hbase Connection
2019-10-10 10:03:05,462 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x783a467b to zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181
2019-10-10 10:03:05,465 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x0ba2f4ec to zookeeper quorum
2019-10-10 10:03:05,467 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x26daa4f60630022 closed
2019-10-10 10:03:05,467 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down
2019-10-10 10:03:05,468 INFO reader.TimelineReaderServer (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TimelineReaderServer at atsserver/ip
************************************************************/
I've followed the steps on this workaround and I am getting a new error message
https://community.cloudera.com/t5/Support-Questions/ATS-hbase-does-not-seem-to-start/td-p/235155 - guillaume_roger
We use to hace kerberized this cluster but it's not anymore a a few months ago. That's why this message is weird
2019-10-25 08:58:31,650 - HdfsResource['/atsv2/hbase/data'] {'security_enabled': False, 'hadoop_bin_dir': '/usr/hdp/3.1.0.0-78/hadoop/bin', 'keytab': '', 'dfs_type': 'HDFS', 'default_fs': 'hdfs://hdpnndev', 'hdfs_resource_ignore_file': '/var/lib/ambari-agent/data/.hdfs_resource_ignore', 'hdfs_site': ..., 'kinit_path_local': '/usr/bin/kinit', 'principal_name': '', 'user': 'hdfs', 'owner': 'yarnats', 'hadoop_conf_dir': '/usr/hdp/3.1.0.0-78/hadoop/conf', 'type': 'directory', 'action': ['create_on_execute'], 'immutable_paths': [u'/mr-history/done', u'/warehouse/tablespace/managed/hive', u'/warehouse/tablespace/external/hive', u'/app-logs', u'/tmp']}
2019-10-25 08:58:31,652 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://masternode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZs_d1F 2>/tmp/tmpNf5VJV''] {'quiet': False}
Then , it got stuck several minutes on this part
2019-10-25 08:58:31,791 - call returned (0, '')
2019-10-25 08:58:31,792 - get_user_call_output returned (0, u'{\n "beans" : [ {\n "name" : "Hadoop:service=NameNode,name=FSNamesystem",\n "modelerType" : "FSNamesystem",\n "tag.Context" : "dfs",\n "tag.HAState" : "active",\n "tag.TotalSyncTimes" : "103 16 ",\n "tag.Hostname" : "zookeeper",\n "MissingBlocks" : 0,\n "MissingReplOneBlocks" : 0,\n "ExpiredHeartbeats" : 0,\n "TransactionsSinceLastCheckpoint" : 162564,\n "TransactionsSinceLastLogRoll" : 97,\n "LastWrittenTransactionId" : 136802255,\n "LastCheckpointTime" : 1571986118000,\n "CapacityTotal" : 94452535787520,\n "CapacityTotalGB" : 87966.0,\n "CapacityUsed" : 17431687303220,\n "CapacityUsedGB" : 16235.0,\n "CapacityRemaining" : 72211133116924,\n "ProvidedCapacityTotal" : 0,\n "CapacityRemainingGB" : 67252.0,\n "CapacityUsedNonDFS" : 0,\n "TotalLoad" : 128,\n "SnapshottableDirectories" : 1,\n "Snapshots" : 1,\n "NumEncryptionZones" : 0,\n "LockQueueLength" : 0,\n "BlocksTotal" : 985264,\n "NumFilesUnderConstruction" : 32,\n "NumActiveClients" : 21,\n "FilesTotal" : 1560325,\n "PendingReplicationBlocks" : 0,\n "PendingReconstructionBlocks" : 0,\n "UnderReplicatedBlocks" : 20,\n "LowRedundancyBlocks" : 20,\n "CorruptBlocks" : 0,\n "ScheduledReplicationBlocks" : 0,\n "PendingDeletionBlocks" : 597,\n "LowRedundancyReplicatedBlocks" : 20,\n "CorruptReplicatedBlocks" : 0,\n "MissingReplicatedBlocks" : 0,\n "MissingReplicationOneBlocks" : 0,\n "HighestPriorityLowRedundancyReplicatedBlocks" : 0,\n "HighestPriorityLowRedundancyECBlocks" : 0,\n "BytesInFutureReplicatedBlocks" : 0,\n "PendingDeletionReplicatedBlocks" : 597,\n "TotalReplicatedBlocks" : 985264,\n "LowRedundancyECBlockGroups" : 0,\n "CorruptECBlockGroups" : 0,\n "MissingECBlockGroups" : 0,\n "BytesInFutureECBlockGroups" : 0,\n "PendingDeletionECBlocks" : 0,\n "TotalECBlockGroups" : 0,\n "ExcessBlocks" : 0,\n "NumTimedOutPendingReconstructions" : 0,\n "PostponedMisreplicatedBlocks" : 24,\n "PendingDataNodeMessageCount" : 0,\n "MillisSinceLastLoadedEdits" : 0,\n "BlockCapacity" : 2097152,\n "NumLiveDataNodes" : 4,\n "NumDeadDataNodes" : 0,\n "NumDecomLiveDataNodes" : 0,\n "NumDecomDeadDataNodes" : 0,\n "VolumeFailuresTotal" : 0,\n "EstimatedCapacityLostTotal" : 0,\n "NumDecommissioningDataNodes" : 0,\n "StaleDataNodes" : 0,\n "NumStaleStorages" : 48,\n "TotalSyncCount" : 76,\n "NumInMaintenanceLiveDataNodes" : 0,\n "NumInMaintenanceDeadDataNodes" : 0,\n "NumEnteringMaintenanceDataNodes" : 0\n } ]\n}', u'')
2019-10-25 08:58:31,793 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpmLa9e3 2>/tmp/tmpcGNRPR''] {'quiet': False}
Finally it finishes with this message
2019-10-25 09:17:53,022 - call returned (143, "-bash: line 1: 28392 Terminated curl -s 'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' > /tmp/tmpmLa9e3 2> /tmp/tmpcGNRPR")
2019-10-25 09:17:53,022 - call['hdfs haadmin -ns hdpnndev -getServiceState nn2'] {'logoutput': True, 'user': 'hdfs'}
Created on 10-26-2019 03:38 AM - edited 10-26-2019 03:39 AM
HDP3 uses HBase as a backend for the timeline service. You can either use an external HBase or have a system HBase running on Yarn (the default).
With a server crash, your ATSv2 HBase Application data could be corrupted this can impact performance because of timeouts from the ATS
To fix that follow:
Check for any error in the yarn logs usually in /var/log/hadoop-yarn/yarn/ for anything clear to spot, for instance, not enough yarn memory (and then fix it if relevant) see these 2 files
hadoop-yarn-timelineserver-<host_name>.out
hadoop-yarn-timelineserver-<host_name>.log
Clean up hdfs ATS data as described in Remove ats-HBase before switching between clusters note there are different steps for secure [Kerberized] and unsecure clusters
Clean up zookeeper ATS data (the example here is for insecure clusters, you will probably have another znode for kerberised clusters): zookeeper-client rmr /atsv2-hbase-unsecure
Log on zookeeper here I am on an secure HDP 3.1 single node cluster
# /usr/hdp/3.1.0.0-78/zookeeper/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[cluster, registry, controller, brokers, storm, zookeeper, infra-solr, hbase-unsecure, admin, isr_change_notification, log_dir_event_notification, controller_epoch, hiveserver2, hiveserver2-leader, rmstore, atsv2-hbase-unsecure, consumers, ambari-metrics-cluster, latest_producer_id_block, config]
Go for the ATSv2 entry
[zk: localhost:2181(CONNECTED) 1] ls /atsv2-hbase-unsecure
[rs, splitWAL, backup-masters, table-lock, draining, master-maintenance, table]
Delete the entry
[zk: localhost:2181(CONNECTED) 2] rmr /atsv2-hbase-unsecure
Restart *all* YARN services,
Restart ambari server
You lose your ATS history with the above hdfs & zookeeper cleaning steps like (ie. job names, timing, logs…), but your actual data is perfectly safe, nothing else will be lost.
Please revert !!
Created 10-28-2019 02:39 PM
Hello Shelton.
I've followed entire given steps and service still is not coming up. Below attached outputs.
From: /var/lib/ambari-agent/data/errors-31848.txt
resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarnats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/var/lib/ambari-agent:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.1.0.0-78/hadoop-yarn/timelineservice/*; /usr/hdp/3.1.0.0-78/hbase/bin/hbase --config /usr/hdp/3.1.0.0-78/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds
From: /var/lib/ambari-agent/data/output-31848.txt
2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854744889
2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854735924
2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854772106
2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854768736
2019-10-22 16:41:17,992 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854749025
==> /usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/gc.log-201910110639 <==
Java HotSpot(TM) 64-Bit Server VM (25.60-b23) for linux-amd64 JRE (1.8.0_60-b27), built on Aug 4 2015 12:19:40 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 131732324k(11848752k free), swap 8388604k(8279292k free)
CommandLine flags: -XX:ErrorFile=/usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/hs_err_pid%p.log -XX:InitialHeapSize=2107717184 -XX:MaxHeapSize=3435134976 -XX:MaxNewSize=1145044992 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:OnOutOfMemoryError=kill -9 %p -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
Heap
par new generation total 618048K, used 197776K [0x00000006f3400000, 0x000000071d2a0000, 0x0000000737800000)
eden space 549376K, 36% used [0x00000006f3400000, 0x00000006ff5243a8, 0x0000000714c80000)
from space 68672K, 0% used [0x0000000714c80000, 0x0000000714c80000, 0x0000000718f90000)
to space 68672K, 0% used [0x0000000718f90000, 0x0000000718f90000, 0x000000071d2a0000)
concurrent mark-sweep generation total 1373568K, used 0K [0x0000000737800000, 0x000000078b560000, 0x00000007c0000000)
Metaspace used 11629K, capacity 11810K, committed 11904K, reserved 1060864K
class space used 1251K, capacity 1316K, committed 1408K, reserved 1048576K
==> /usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/gc.log-201910100851 <==
Java HotSpot(TM) 64-Bit Server VM (25.60-b23) for linux-amd64 JRE (1.8.0_60-b27), built on Aug 4 2015 12:19:40 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 131732324k(1591264k free), swap 8388604k(8280060k free)
CommandLine flags: -XX:CMSInitiatingOccupancyFraction=70 -XX:ErrorFile=/usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/hs_err_pid%p.log -XX:InitialHeapSize=3435134976 -XX:MaxHeapSize=3435134976 -XX:MaxNewSize=1145044992 -XX:MaxTenuringThreshold=6 -XX:NewSize=1145044992 -XX:OldPLABSize=16 -XX:OldSize=2290089984 -XX:OnOutOfMemoryError=kill -9 %p -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ReservedCodeCacheSize=268435456 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
2019-10-10T08:51:22.325-0700: 2.213: [GC (CMS Initial Mark) [1 CMS-initial-mark: 0K(2236416K)] 715687K(3242816K), 0.1832180 secs] [Times: user=0.48 sys=0.07, real=0.19 secs]
2019-10-10T08:51:22.508-0700: 2.396: [CMS-concurrent-mark-start]
2019-10-10T08:51:22.509-0700: 2.397: [CMS-concurrent-mark: 0.001/0.001 secs] [Times: user=0.01 sys=0.01, real=0.00 secs]
2019-10-10T08:51:22.509-0700: 2.397: [CMS-concurrent-preclean-start]
2019-10-10T08:51:22.513-0700: 2.400: [CMS-concurrent-preclean: 0.003/0.003 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-10-10T08:51:22.513-0700: 2.400: [CMS-concurrent-abortable-preclean-start]
2019-10-10T08:51:22.827-0700: 2.715: [GC (Allocation Failure) 2019-10-10T08:51:22.827-0700: 2.715: [ParNew: 894592K->37233K(1006400K), 0.0334809 secs] 894592K->37233K(3242816K), 0.0335760 secs] [Times: user=0.17 sys=0.03, real=0.03 secs]
Heap
par new generation total 1006400K, used 577717K [0x00000006f3400000, 0x0000000737800000, 0x0000000737800000)
eden space 894592K, 60% used [0x00000006f3400000, 0x00000007143d0fc8, 0x0000000729da0000)
from space 111808K, 33% used [0x0000000730ad0000, 0x0000000732f2c758, 0x0000000737800000)
to space 111808K, 0% used [0x0000000729da0000, 0x0000000729da0000, 0x0000000730ad0000)
concurrent mark-sweep generation total 2236416K, used 0K [0x0000000737800000, 0x00000007c0000000, 0x00000007c0000000)
Metaspace used 52260K, capacity 52701K, committed 53168K, reserved 1095680K
class space used 5905K, capacity 6041K, committed 6096K, reserved 1048576K
2019-10-10T08:51:24.359-0700: 4.247: [CMS-concurrent-abortable-preclean: 1.100/1.847 secs] [Times: user=4.75 sys=0.27, real=1.85 secs]
Command failed after 1 tries
From: yarn-timelineserver-gc.log
Total 180427 20331968
, 0.0109813 secs]
25.966: [GC (Allocation Failure) [PSYoungGen: 766976K->26024K(1047040K)] 786831K->45895K(2136576K), 0.0205774 secs] [Times: user=0.18 sys=0.02, real=0.02 secs]
27.877: [GC (Allocation Failure) [PSYoungGen: 987560K->37814K(1176576K)] 1007431K->57702K(2266112K), 0.0452135 secs] [Times: user=0.31 sys=0.03, real=0.05 secs]
29.872: [GC (Allocation Failure) [PSYoungGen: 1128886K->40013K(1176576K)] 1148774K->59908K(2266112K), 0.0376384 secs] [Times: user=0.25 sys=0.02, real=0.04 secs]
31.621: [GC (Allocation Failure) [PSYoungGen: 1131085K->41607K(1708032K)] 1150980K->61510K(2797568K), 0.0426743 secs] [Times: user=0.19 sys=0.02, real=0.04 secs]
34.381: [GC (Allocation Failure) [PSYoungGen: 1702023K->52721K(1713152K)] 1721926K->75113K(2802688K), 0.0671733 secs] [Times: user=0.32 sys=0.06, real=0.07 secs]
544.663: [GC (Allocation Failure) [PSYoungGen: 1713137K->24633K(2550784K)] 1735529K->55561K(3640320K), 0.0502315 secs] [Times: user=0.37 sys=0.08, real=0.05 secs]
1744.725: [GC (Allocation Failure) [PSYoungGen: 2550329K->6583K(2657792K)] 2581257K->37803K(3747328K), 0.0109360 secs] [Times: user=0.06 sys=0.05, real=0.01 secs]
3364.582: [GC (Allocation Failure) [PSYoungGen: 2603959K->7333K(2513408K)] 2635179K->38561K(3602944K), 0.0106033 secs] [Times: user=0.05 sys=0.05, real=0.01 secs]
4564.508: [GC (Allocation Failure) [PSYoungGen: 2513061K->7397K(2425856K)] 2544289K->38633K(3515392K), 0.0098975 secs] [Times: user=0.05 sys=0.05, real=0.01 secs]
5944.468: [GC (Allocation Failure) [PSYoungGen: 2425573K->7432K(2342400K)] 2456809K->38676K(3431936K), 0.0100541 secs] [Times: user=0.05 sys=0.04, real=0.01 secs]
6904.427: [GC (Allocation Failure) [PSYoungGen: 2342152K->7814K(2263040K)] 2373396K->39065K(3352576K), 0.0100246 secs] [Times: user=0.06 sys=0.05, real=0.01 secs]
7624.583: [GC (Allocation Failure) [PSYoungGen: 2262662K->7335K(2186240K)] 2293913K->38595K(3275776K), 0.0126832 secs] [Times: user=0.07 sys=0.03, real=0.01 secs]
8524.740: [GC (Allocation Failure) [PSYoungGen: 2185895K->7238K(2113536K)] 2217155K->38505K(3203072K), 0.0110849 secs] [Times: user=0.06 sys=0.05, real=0.01 secs]
9604.461: [GC (Allocation Failure) [PSYoungGen: 2113094K->7415K(2043904K)] 2144361K->38690K(3133440K), 0.0187939 secs] [Times: user=0.11 sys=0.07, real=0.02 secs]
10864.545: [GC (Allocation Failure) [PSYoungGen: 2043639K->7287K(1977344K)] 2074914K->38570K(3066880K), 0.0131232 secs] [Times: user=0.12 sys=0.04, real=0.01 secs]
From: yarn-timelineserver-gc.log
2019-10-28 14:35:06,753 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_331 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,753 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_332 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,754 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_332 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,754 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_333 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,754 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_333 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,755 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_334 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,755 WARN timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_334 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,755 INFO timeline.LogInfo (LogInfo.java:parseForStore(116)) - Parsed 1338 entities from hdfs://hdpnndev/ats/active/application_1572286585508_0006/appattempt_1572286585508_0006_000001/summarylog-appattempt_1572286585508_0006_000001 in 314 msec
From: hadoop-yarn-resourcemanager-server.log
TARGET=ClientRMService RESULT=SUCCESS
2019-10-28 14:36:08,043 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1572286585508_0005_000001 container=null queue=batchq1 clusterResource=<memory:411648, vCores:128> type=RACK_LOCAL requestedPartition=
2019-10-28 14:36:08,043 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from NEW to ALLOCATED
2019-10-28 14:36:08,043 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e247_1572286585508_0005_01_000398 of capacity <memory:3072, vCores:1> on host server:45454, which has 6 containers, <memory:70656, vCores:6> used and <memory:32256, vCores:26> available after allocation
2019-10-28 14:36:08,043 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572286585508_0005 CONTAINERID=container_e247_1572286585508_0005_01_000398 RESOURCE=<memory:3072, vCores:1>
2019-10-28 14:36:08,043 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=batch usedCapacity=0.14925392 absoluteUsedCapacity=0.11940298 used=<memory:49152, vCores:13> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,043 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=root usedCapacity=0.40298507 absoluteUsedCapacity=0.40298507 used=<memory:165888, vCores:17> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,043 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Allocation proposal accepted
2019-10-28 14:36:08,103 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from ALLOCATED to ACQUIRED
2019-10-28 14:36:08,300 INFO allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1572286585508_0006_000001 container=null queue=batchq1 clusterResource=<memory:411648, vCores:128> type=OFF_SWITCH requestedPartition=
2019-10-28 14:36:08,300 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0006_01_000647 Container Transitioned from NEW to ALLOCATED
2019-10-28 14:36:08,300 INFO fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e247_1572286585508_0006_01_000647 of capacity <memory:3072, vCores:1> on host server:45454, which has 5 containers, <memory:18432, vCores:5> used and <memory:84480, vCores:27> available after allocation
2019-10-28 14:36:08,300 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572286585508_0006 CONTAINERID=container_e247_1572286585508_0006_01_000647 RESOURCE=<memory:3072, vCores:1>
2019-10-28 14:36:08,300 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=batch usedCapacity=0.15858229 absoluteUsedCapacity=0.12686567 used=<memory:52224, vCores:14> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,300 INFO capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=root usedCapacity=0.41044775 absoluteUsedCapacity=0.41044775 used=<memory:168960, vCores:18> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,300 INFO capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Allocation proposal accepted
2019-10-28 14:36:08,354 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from ACQUIRED to RELEASED
2019-10-28 14:36:08,354 INFO resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive IP=10.10.81.14 OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572286585508_0005 CONTAINERID=container_e247_1572286585508_0005_01_000398 RESOURCE=<memory:3072, vCores:1>
2019-10-28 14:36:08,354 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0005
2019-10-28 14:36:08,485 INFO rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0006_01_000647 Container Transitioned from ALLOCATED to ACQUIRED
2019-10-28 14:36:08,736 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0006
2019-10-28 14:36:08,987 INFO scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0006
This are not the complete logs, just a glimpse. I hope it helps to come up with any idea.
It gives me the impression it's heap memory issue. But...
AppTimelineServer Java heap size = 8G , therefore any thought is appreciated.
Regards!