Support Questions

JLo_Hernandez · ‎10-25-2019

Ambari 2.7.3

HDP 3.1.0

Hello team. My service went down 15 days ago and I'm still not able to fix this issue. If someone has face it before, please let me know.

From last logs:

Caused by: java.net.ConnectException: Call to atsserver/ip:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atsserver/ip:17020
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:307)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1377)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:929)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:269)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:263)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atserver/ip:17020
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hbase.thirdparty.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 7 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
2019-10-10 10:03:05,448 ERROR reader.TimelineReaderServer (LogAdapter.java:error(75)) - RECEIVED SIGNAL 15: SIGTERM
2019-10-10 10:03:05,457 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.w.WebAppContext@7357a011{/,null,UNAVAILABLE}{/timeline}
2019-10-10 10:03:05,460 INFO server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped ServerConnector@1bae316d{HTTP/1.1,[http/1.1]}{0.0.0.0:8198}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@4bff7da0{/static,jar:file:/usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-common-3.1.1.3.1.0.0-78.jar!/webapps/static,UNAVAILABLE}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@3e62d773{/logs,file:///usr/logs/hadoop-yarn/yarn/,UNAVAILABLE}
2019-10-10 10:03:05,462 INFO storage.HBaseTimelineReaderImpl (HBaseTimelineReaderImpl.java:serviceStop(108)) - closing the hbase Connection
2019-10-10 10:03:05,462 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x783a467b to zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181
2019-10-10 10:03:05,465 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x0ba2f4ec to zookeeper quorum
2019-10-10 10:03:05,467 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x26daa4f60630022 closed
2019-10-10 10:03:05,467 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down
2019-10-10 10:03:05,468 INFO reader.TimelineReaderServer (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TimelineReaderServer at atsserver/ip
************************************************************/

I've followed the steps on this workaround and I am getting a new error message

https://community.cloudera.com/t5/Support-Questions/ATS-hbase-does-not-seem-to-start/td-p/235155 - guillaume_roger

We use to hace kerberized this cluster but it's not anymore a a few months ago. That's why this message is weird

2019-10-25 08:58:31,650 - HdfsResource['/atsv2/hbase/data'] {'security_enabled': False, 'hadoop_bin_dir': '/usr/hdp/3.1.0.0-78/hadoop/bin', 'keytab': '', 'dfs_type': 'HDFS', 'default_fs': 'hdfs://hdpnndev', 'hdfs_resource_ignore_file': '/var/lib/ambari-agent/data/.hdfs_resource_ignore', 'hdfs_site': ..., 'kinit_path_local': '/usr/bin/kinit', 'principal_name': '', 'user': 'hdfs', 'owner': 'yarnats', 'hadoop_conf_dir': '/usr/hdp/3.1.0.0-78/hadoop/conf', 'type': 'directory', 'action': ['create_on_execute'], 'immutable_paths': [u'/mr-history/done', u'/warehouse/tablespace/managed/hive', u'/warehouse/tablespace/external/hive', u'/app-logs', u'/tmp']}
2019-10-25 08:58:31,652 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://masternode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZs_d1F 2>/tmp/tmpNf5VJV''] {'quiet': False}

Then , it got stuck several minutes on this part

2019-10-25 08:58:31,791 - call returned (0, '')
2019-10-25 08:58:31,792 - get_user_call_output returned (0, u'{\n  "beans" : [ {\n    "name" : "Hadoop:service=NameNode,name=FSNamesystem",\n    "modelerType" : "FSNamesystem",\n    "tag.Context" : "dfs",\n    "tag.HAState" : "active",\n    "tag.TotalSyncTimes" : "103 16 ",\n    "tag.Hostname" : "zookeeper",\n    "MissingBlocks" : 0,\n    "MissingReplOneBlocks" : 0,\n    "ExpiredHeartbeats" : 0,\n    "TransactionsSinceLastCheckpoint" : 162564,\n    "TransactionsSinceLastLogRoll" : 97,\n    "LastWrittenTransactionId" : 136802255,\n    "LastCheckpointTime" : 1571986118000,\n    "CapacityTotal" : 94452535787520,\n    "CapacityTotalGB" : 87966.0,\n    "CapacityUsed" : 17431687303220,\n    "CapacityUsedGB" : 16235.0,\n    "CapacityRemaining" : 72211133116924,\n    "ProvidedCapacityTotal" : 0,\n    "CapacityRemainingGB" : 67252.0,\n    "CapacityUsedNonDFS" : 0,\n    "TotalLoad" : 128,\n    "SnapshottableDirectories" : 1,\n    "Snapshots" : 1,\n    "NumEncryptionZones" : 0,\n    "LockQueueLength" : 0,\n    "BlocksTotal" : 985264,\n    "NumFilesUnderConstruction" : 32,\n    "NumActiveClients" : 21,\n    "FilesTotal" : 1560325,\n    "PendingReplicationBlocks" : 0,\n    "PendingReconstructionBlocks" : 0,\n    "UnderReplicatedBlocks" : 20,\n    "LowRedundancyBlocks" : 20,\n    "CorruptBlocks" : 0,\n    "ScheduledReplicationBlocks" : 0,\n    "PendingDeletionBlocks" : 597,\n    "LowRedundancyReplicatedBlocks" : 20,\n    "CorruptReplicatedBlocks" : 0,\n    "MissingReplicatedBlocks" : 0,\n    "MissingReplicationOneBlocks" : 0,\n    "HighestPriorityLowRedundancyReplicatedBlocks" : 0,\n    "HighestPriorityLowRedundancyECBlocks" : 0,\n    "BytesInFutureReplicatedBlocks" : 0,\n    "PendingDeletionReplicatedBlocks" : 597,\n    "TotalReplicatedBlocks" : 985264,\n    "LowRedundancyECBlockGroups" : 0,\n    "CorruptECBlockGroups" : 0,\n    "MissingECBlockGroups" : 0,\n    "BytesInFutureECBlockGroups" : 0,\n    "PendingDeletionECBlocks" : 0,\n    "TotalECBlockGroups" : 0,\n    "ExcessBlocks" : 0,\n    "NumTimedOutPendingReconstructions" : 0,\n    "PostponedMisreplicatedBlocks" : 24,\n    "PendingDataNodeMessageCount" : 0,\n    "MillisSinceLastLoadedEdits" : 0,\n    "BlockCapacity" : 2097152,\n    "NumLiveDataNodes" : 4,\n    "NumDeadDataNodes" : 0,\n    "NumDecomLiveDataNodes" : 0,\n    "NumDecomDeadDataNodes" : 0,\n    "VolumeFailuresTotal" : 0,\n    "EstimatedCapacityLostTotal" : 0,\n    "NumDecommissioningDataNodes" : 0,\n    "StaleDataNodes" : 0,\n    "NumStaleStorages" : 48,\n    "TotalSyncCount" : 76,\n    "NumInMaintenanceLiveDataNodes" : 0,\n    "NumInMaintenanceDeadDataNodes" : 0,\n    "NumEnteringMaintenanceDataNodes" : 0\n  } ]\n}', u'')
2019-10-25 08:58:31,793 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpmLa9e3 2>/tmp/tmpcGNRPR''] {'quiet': False}

Finally it finishes with this message

2019-10-25 09:17:53,022 - call returned (143, "-bash: line 1: 28392 Terminated              curl -s 'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' > /tmp/tmpmLa9e3 2> /tmp/tmpcGNRPR")
2019-10-25 09:17:53,022 - call['hdfs haadmin -ns hdpnndev -getServiceState nn2'] {'logoutput': True, 'user': 'hdfs'}

Shelton · ‎10-26-2019

@JLo_Hernandez

HDP3 uses HBase as a backend for the timeline service. You can either use an external HBase or have a system HBase running on Yarn (the default).

With a server crash, your ATSv2 HBase Application data could be corrupted this can impact performance because of timeouts from the ATS

To fix that follow:

Check for any error in the yarn logs usually in /var/log/hadoop-yarn/yarn/ for anything clear to spot, for instance, not enough yarn memory (and then fix it if relevant) see these 2 files

hadoop-yarn-timelineserver-<host_name>.out
hadoop-yarn-timelineserver-<host_name>.log

Clean up hdfs ATS data as described in Remove ats-HBase before switching between clusters note there are different steps for secure [Kerberized] and unsecure clusters

Clean up zookeeper ATS data (the example here is for insecure clusters, you will probably have another znode for kerberised clusters): zookeeper-client rmr /atsv2-hbase-unsecure

Log on zookeeper here I am on an secure HDP 3.1 single node cluster

# /usr/hdp/3.1.0.0-78/zookeeper/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[cluster, registry, controller, brokers, storm, zookeeper, infra-solr, hbase-unsecure, admin, isr_change_notification, log_dir_event_notification, controller_epoch, hiveserver2, hiveserver2-leader, rmstore, atsv2-hbase-unsecure, consumers, ambari-metrics-cluster, latest_producer_id_block, config]

Go for the ATSv2 entry

[zk: localhost:2181(CONNECTED) 1] ls /atsv2-hbase-unsecure
[rs, splitWAL, backup-masters, table-lock, draining, master-maintenance, table]

Delete the entry

[zk: localhost:2181(CONNECTED) 2] rmr /atsv2-hbase-unsecure

Restart *all* YARN services,

Restart ambari server

You lose your ATS history with the above hdfs & zookeeper cleaning steps like (ie. job names, timing, logs…), but your actual data is perfectly safe, nothing else will be lost.

Please revert !!

JLo_Hernandez · ‎10-28-2019

Hello Shelton.

I've followed entire given steps and service still is not coming up. Below attached outputs.

From: /var/lib/ambari-agent/data/errors-31848.txt

resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarnats -l -s /bin/bash -c 'export  PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/var/lib/ambari-agent:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.1.0.0-78/hadoop-yarn/timelineservice/*; /usr/hdp/3.1.0.0-78/hbase/bin/hbase --config /usr/hdp/3.1.0.0-78/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds

From: /var/lib/ambari-agent/data/output-31848.txt

2019-10-22 16:41:17,992 WARN  [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854744889
2019-10-22 16:41:17,992 WARN  [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854735924
2019-10-22 16:41:17,992 WARN  [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854772106
2019-10-22 16:41:17,992 WARN  [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854768736
2019-10-22 16:41:17,992 WARN  [main-EventThread] coordination.ZKSplitLogManagerCoordination$CreateRescanAsyncCallback: rc=NONODE for /atsv2-hbase-unsecure/splitWAL/RESCAN remaining retries=9223372036854749025
==> /usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/gc.log-201910110639 <==
Java HotSpot(TM) 64-Bit Server VM (25.60-b23) for linux-amd64 JRE (1.8.0_60-b27), built on Aug  4 2015 12:19:40 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 131732324k(11848752k free), swap 8388604k(8279292k free)
CommandLine flags: -XX:ErrorFile=/usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/hs_err_pid%p.log -XX:InitialHeapSize=2107717184 -XX:MaxHeapSize=3435134976 -XX:MaxNewSize=1145044992 -XX:MaxTenuringThreshold=6 -XX:OldPLABSize=16 -XX:OnOutOfMemoryError=kill -9 %p -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
Heap
 par new generation   total 618048K, used 197776K [0x00000006f3400000, 0x000000071d2a0000, 0x0000000737800000)
  eden space 549376K,  36% used [0x00000006f3400000, 0x00000006ff5243a8, 0x0000000714c80000)
  from space 68672K,   0% used [0x0000000714c80000, 0x0000000714c80000, 0x0000000718f90000)
  to   space 68672K,   0% used [0x0000000718f90000, 0x0000000718f90000, 0x000000071d2a0000)
 concurrent mark-sweep generation total 1373568K, used 0K [0x0000000737800000, 0x000000078b560000, 0x00000007c0000000)
 Metaspace       used 11629K, capacity 11810K, committed 11904K, reserved 1060864K
  class space    used 1251K, capacity 1316K, committed 1408K, reserved 1048576K
==> /usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/gc.log-201910100851 <==
Java HotSpot(TM) 64-Bit Server VM (25.60-b23) for linux-amd64 JRE (1.8.0_60-b27), built on Aug  4 2015 12:19:40 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 131732324k(1591264k free), swap 8388604k(8280060k free)
CommandLine flags: -XX:CMSInitiatingOccupancyFraction=70 -XX:ErrorFile=/usr/logs/hadoop-yarn/embedded-yarn-ats-hbase/hs_err_pid%p.log -XX:InitialHeapSize=3435134976 -XX:MaxHeapSize=3435134976 -XX:MaxNewSize=1145044992 -XX:MaxTenuringThreshold=6 -XX:NewSize=1145044992 -XX:OldPLABSize=16 -XX:OldSize=2290089984 -XX:OnOutOfMemoryError=kill -9 %p -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ReservedCodeCacheSize=268435456 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 
2019-10-10T08:51:22.325-0700: 2.213: [GC (CMS Initial Mark) [1 CMS-initial-mark: 0K(2236416K)] 715687K(3242816K), 0.1832180 secs] [Times: user=0.48 sys=0.07, real=0.19 secs] 
2019-10-10T08:51:22.508-0700: 2.396: [CMS-concurrent-mark-start]
2019-10-10T08:51:22.509-0700: 2.397: [CMS-concurrent-mark: 0.001/0.001 secs] [Times: user=0.01 sys=0.01, real=0.00 secs] 
2019-10-10T08:51:22.509-0700: 2.397: [CMS-concurrent-preclean-start]
2019-10-10T08:51:22.513-0700: 2.400: [CMS-concurrent-preclean: 0.003/0.003 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2019-10-10T08:51:22.513-0700: 2.400: [CMS-concurrent-abortable-preclean-start]
2019-10-10T08:51:22.827-0700: 2.715: [GC (Allocation Failure) 2019-10-10T08:51:22.827-0700: 2.715: [ParNew: 894592K->37233K(1006400K), 0.0334809 secs] 894592K->37233K(3242816K), 0.0335760 secs] [Times: user=0.17 sys=0.03, real=0.03 secs] 
Heap
 par new generation   total 1006400K, used 577717K [0x00000006f3400000, 0x0000000737800000, 0x0000000737800000)
  eden space 894592K,  60% used [0x00000006f3400000, 0x00000007143d0fc8, 0x0000000729da0000)
  from space 111808K,  33% used [0x0000000730ad0000, 0x0000000732f2c758, 0x0000000737800000)
  to   space 111808K,   0% used [0x0000000729da0000, 0x0000000729da0000, 0x0000000730ad0000)
 concurrent mark-sweep generation total 2236416K, used 0K [0x0000000737800000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 52260K, capacity 52701K, committed 53168K, reserved 1095680K
  class space    used 5905K, capacity 6041K, committed 6096K, reserved 1048576K
2019-10-10T08:51:24.359-0700: 4.247: [CMS-concurrent-abortable-preclean: 1.100/1.847 secs] [Times: user=4.75 sys=0.27, real=1.85 secs]

Command failed after 1 tries

From: yarn-timelineserver-gc.log

Total        180427       20331968
, 0.0109813 secs]
25.966: [GC (Allocation Failure) [PSYoungGen: 766976K->26024K(1047040K)] 786831K->45895K(2136576K), 0.0205774 secs] [Times: user=0.18 sys=0.02, real=0.02 secs]
27.877: [GC (Allocation Failure) [PSYoungGen: 987560K->37814K(1176576K)] 1007431K->57702K(2266112K), 0.0452135 secs] [Times: user=0.31 sys=0.03, real=0.05 secs]
29.872: [GC (Allocation Failure) [PSYoungGen: 1128886K->40013K(1176576K)] 1148774K->59908K(2266112K), 0.0376384 secs] [Times: user=0.25 sys=0.02, real=0.04 secs]
31.621: [GC (Allocation Failure) [PSYoungGen: 1131085K->41607K(1708032K)] 1150980K->61510K(2797568K), 0.0426743 secs] [Times: user=0.19 sys=0.02, real=0.04 secs]
34.381: [GC (Allocation Failure) [PSYoungGen: 1702023K->52721K(1713152K)] 1721926K->75113K(2802688K), 0.0671733 secs] [Times: user=0.32 sys=0.06, real=0.07 secs]
544.663: [GC (Allocation Failure) [PSYoungGen: 1713137K->24633K(2550784K)] 1735529K->55561K(3640320K), 0.0502315 secs] [Times: user=0.37 sys=0.08, real=0.05 secs]
1744.725: [GC (Allocation Failure) [PSYoungGen: 2550329K->6583K(2657792K)] 2581257K->37803K(3747328K), 0.0109360 secs] [Times: user=0.06 sys=0.05, real=0.01 secs]
3364.582: [GC (Allocation Failure) [PSYoungGen: 2603959K->7333K(2513408K)] 2635179K->38561K(3602944K), 0.0106033 secs] [Times: user=0.05 sys=0.05, real=0.01 secs]
4564.508: [GC (Allocation Failure) [PSYoungGen: 2513061K->7397K(2425856K)] 2544289K->38633K(3515392K), 0.0098975 secs] [Times: user=0.05 sys=0.05, real=0.01 secs]
5944.468: [GC (Allocation Failure) [PSYoungGen: 2425573K->7432K(2342400K)] 2456809K->38676K(3431936K), 0.0100541 secs] [Times: user=0.05 sys=0.04, real=0.01 secs]
6904.427: [GC (Allocation Failure) [PSYoungGen: 2342152K->7814K(2263040K)] 2373396K->39065K(3352576K), 0.0100246 secs] [Times: user=0.06 sys=0.05, real=0.01 secs]
7624.583: [GC (Allocation Failure) [PSYoungGen: 2262662K->7335K(2186240K)] 2293913K->38595K(3275776K), 0.0126832 secs] [Times: user=0.07 sys=0.03, real=0.01 secs]
8524.740: [GC (Allocation Failure) [PSYoungGen: 2185895K->7238K(2113536K)] 2217155K->38505K(3203072K), 0.0110849 secs] [Times: user=0.06 sys=0.05, real=0.01 secs]
9604.461: [GC (Allocation Failure) [PSYoungGen: 2113094K->7415K(2043904K)] 2144361K->38690K(3133440K), 0.0187939 secs] [Times: user=0.11 sys=0.07, real=0.02 secs]
10864.545: [GC (Allocation Failure) [PSYoungGen: 2043639K->7287K(1977344K)] 2074914K->38570K(3066880K), 0.0131232 secs] [Times: user=0.12 sys=0.04, real=0.01 secs]

From: yarn-timelineserver-gc.log

2019-10-28 14:35:06,753 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_331 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,753 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_332 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,754 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_332 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,754 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_333 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,754 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_333 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,755 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_334 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,755 WARN  timeline.EntityGroupFSTimelineStore (LogInfo.java:doParse(208)) - Error putting entity: dag_1572286585508_0006_334 (TEZ_DAG_ID): 6
2019-10-28 14:35:06,755 INFO  timeline.LogInfo (LogInfo.java:parseForStore(116)) - Parsed 1338 entities from hdfs://hdpnndev/ats/active/application_1572286585508_0006/appattempt_1572286585508_0006_000001/summarylog-appattempt_1572286585508_0006_000001 in 314 msec

From: hadoop-yarn-resourcemanager-server.log

TARGET=ClientRMService  RESULT=SUCCESS
2019-10-28 14:36:08,043 INFO  allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1572286585508_0005_000001 container=null queue=batchq1 clusterResource=<memory:411648, vCores:128> type=RACK_LOCAL requestedPartition=
2019-10-28 14:36:08,043 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from NEW to ALLOCATED
2019-10-28 14:36:08,043 INFO  fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e247_1572286585508_0005_01_000398 of capacity <memory:3072, vCores:1> on host server:45454, which has 6 containers, <memory:70656, vCores:6> used and <memory:32256, vCores:26> available after allocation
2019-10-28 14:36:08,043 INFO  resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive    OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1572286585508_0005    CONTAINERID=container_e247_1572286585508_0005_01_000398 RESOURCE=<memory:3072, vCores:1>
2019-10-28 14:36:08,043 INFO  capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=batch usedCapacity=0.14925392 absoluteUsedCapacity=0.11940298 used=<memory:49152, vCores:13> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,043 INFO  capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=root usedCapacity=0.40298507 absoluteUsedCapacity=0.40298507 used=<memory:165888, vCores:17> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,043 INFO  capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Allocation proposal accepted
2019-10-28 14:36:08,103 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from ALLOCATED to ACQUIRED
2019-10-28 14:36:08,300 INFO  allocator.AbstractContainerAllocator (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - assignedContainer application attempt=appattempt_1572286585508_0006_000001 container=null queue=batchq1 clusterResource=<memory:411648, vCores:128> type=OFF_SWITCH requestedPartition=
2019-10-28 14:36:08,300 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0006_01_000647 Container Transitioned from NEW to ALLOCATED
2019-10-28 14:36:08,300 INFO  fica.FiCaSchedulerNode (FiCaSchedulerNode.java:allocateContainer(169)) - Assigned container container_e247_1572286585508_0006_01_000647 of capacity <memory:3072, vCores:1> on host server:45454, which has 5 containers, <memory:18432, vCores:5> used and <memory:84480, vCores:27> available after allocation
2019-10-28 14:36:08,300 INFO  resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive    OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1572286585508_0006    CONTAINERID=container_e247_1572286585508_0006_01_000647 RESOURCE=<memory:3072, vCores:1>
2019-10-28 14:36:08,300 INFO  capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=batch usedCapacity=0.15858229 absoluteUsedCapacity=0.12686567 used=<memory:52224, vCores:14> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,300 INFO  capacity.ParentQueue (ParentQueue.java:apply(1336)) - assignedContainer queue=root usedCapacity=0.41044775 absoluteUsedCapacity=0.41044775 used=<memory:168960, vCores:18> cluster=<memory:411648, vCores:128>
2019-10-28 14:36:08,300 INFO  capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(2900)) - Allocation proposal accepted
2019-10-28 14:36:08,354 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0005_01_000398 Container Transitioned from ACQUIRED to RELEASED
2019-10-28 14:36:08,354 INFO  resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(200)) - USER=hive    IP=10.10.81.14  OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1572286585508_0005    CONTAINERID=container_e247_1572286585508_0005_01_000398 RESOURCE=<memory:3072, vCores:1>
2019-10-28 14:36:08,354 INFO  scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0005
2019-10-28 14:36:08,485 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(490)) - container_e247_1572286585508_0006_01_000647 Container Transitioned from ALLOCATED to ACQUIRED
2019-10-28 14:36:08,736 INFO  scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0006
2019-10-28 14:36:08,987 INFO  scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(367)) - checking for deactivate of application :application_1572286585508_0006

This are not the complete logs, just a glimpse. I hope it helps to come up with any idea.

It gives me the impression it's heap memory issue. But...

AppTimelineServer Java heap size = 8G , therefore any thought is appreciated.

Regards!

Cloudera Community

Support Questions

Yarn Timeline Service V2 not starting