Support Questions



Yarn Timeline Service V2 not starting

Expert Contributor

Ambari 2.7.3

HDP 3.1.0

Hello team. My service went down 15 days ago and I'm still not able to fix the issue. If someone has faced this before, please let me know.

 

From the latest logs:

Caused by: java.net.ConnectException: Call to atsserver/ip:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atsserver/ip:17020
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:307)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1377)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:929)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:269)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:263)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atserver/ip:17020
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hbase.thirdparty.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 7 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
2019-10-10 10:03:05,448 ERROR reader.TimelineReaderServer (LogAdapter.java:error(75)) - RECEIVED SIGNAL 15: SIGTERM
2019-10-10 10:03:05,457 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.w.WebAppContext@7357a011{/,null,UNAVAILABLE}{/timeline}
2019-10-10 10:03:05,460 INFO server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped ServerConnector@1bae316d{HTTP/1.1,[http/1.1]}{0.0.0.0:8198}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@4bff7da0{/static,jar:file:/usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-common-3.1.1.3.1.0.0-78.jar!/webapps/static,UNAVAILABLE}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@3e62d773{/logs,file:///usr/logs/hadoop-yarn/yarn/,UNAVAILABLE}
2019-10-10 10:03:05,462 INFO storage.HBaseTimelineReaderImpl (HBaseTimelineReaderImpl.java:serviceStop(108)) - closing the hbase Connection
2019-10-10 10:03:05,462 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x783a467b to zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181
2019-10-10 10:03:05,465 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x0ba2f4ec to zookeeper quorum
2019-10-10 10:03:05,467 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x26daa4f60630022 closed
2019-10-10 10:03:05,467 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down
2019-10-10 10:03:05,468 INFO reader.TimelineReaderServer (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TimelineReaderServer at atsserver/ip
************************************************************/
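The root cause in the stack trace is a plain TCP "Connection refused" on atsserver:17020, which on HDP 3.x is typically the port of the embedded ats-hbase service, i.e. nothing is listening there. As a first check (a minimal sketch; the host and port are taken from the stack trace above, not verified against your cluster), you can probe the port directly:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and DNS resolution failures
        return False

# Host/port as they appear in the stack trace; substitute your real hostname
# port_open("atsserver", 17020)
```

If the port is closed, the next step on HDP 3.x is usually to check the ats-hbase YARN service itself (for example with `yarn app -status ats-hbase` as the yarn-ats user) before touching the Timeline Reader.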

 

I've followed the steps in this workaround, and now I am getting a new error message:

 

https://community.cloudera.com/t5/Support-Questions/ATS-hbase-does-not-seem-to-start/td-p/235155 - guillaume_roger 

This cluster used to be kerberized, but it was de-kerberized a few months ago. That's why this message is weird:

 

2019-10-25 08:58:31,650 - HdfsResource['/atsv2/hbase/data'] {'security_enabled': False, 'hadoop_bin_dir': '/usr/hdp/3.1.0.0-78/hadoop/bin', 'keytab': '', 'dfs_type': 'HDFS', 'default_fs': 'hdfs://hdpnndev', 'hdfs_resource_ignore_file': '/var/lib/ambari-agent/data/.hdfs_resource_ignore', 'hdfs_site': ..., 'kinit_path_local': '/usr/bin/kinit', 'principal_name': '', 'user': 'hdfs', 'owner': 'yarnats', 'hadoop_conf_dir': '/usr/hdp/3.1.0.0-78/hadoop/conf', 'type': 'directory', 'action': ['create_on_execute'], 'immutable_paths': [u'/mr-history/done', u'/warehouse/tablespace/managed/hive', u'/warehouse/tablespace/external/hive', u'/app-logs', u'/tmp']}
2019-10-25 08:58:31,652 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://masternode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZs_d1F 2>/tmp/tmpNf5VJV''] {'quiet': False}

 

 

Then it got stuck for several minutes on this part:

 

2019-10-25 08:58:31,791 - call returned (0, '')
2019-10-25 08:58:31,792 - get_user_call_output returned (0, u'{\n  "beans" : [ {\n    "name" : "Hadoop:service=NameNode,name=FSNamesystem",\n    "modelerType" : "FSNamesystem",\n    "tag.Context" : "dfs",\n    "tag.HAState" : "active",\n    "tag.TotalSyncTimes" : "103 16 ",\n    "tag.Hostname" : "zookeeper",\n    "MissingBlocks" : 0,\n    "MissingReplOneBlocks" : 0,\n    "ExpiredHeartbeats" : 0,\n    "TransactionsSinceLastCheckpoint" : 162564,\n    "TransactionsSinceLastLogRoll" : 97,\n    "LastWrittenTransactionId" : 136802255,\n    "LastCheckpointTime" : 1571986118000,\n    "CapacityTotal" : 94452535787520,\n    "CapacityTotalGB" : 87966.0,\n    "CapacityUsed" : 17431687303220,\n    "CapacityUsedGB" : 16235.0,\n    "CapacityRemaining" : 72211133116924,\n    "ProvidedCapacityTotal" : 0,\n    "CapacityRemainingGB" : 67252.0,\n    "CapacityUsedNonDFS" : 0,\n    "TotalLoad" : 128,\n    "SnapshottableDirectories" : 1,\n    "Snapshots" : 1,\n    "NumEncryptionZones" : 0,\n    "LockQueueLength" : 0,\n    "BlocksTotal" : 985264,\n    "NumFilesUnderConstruction" : 32,\n    "NumActiveClients" : 21,\n    "FilesTotal" : 1560325,\n    "PendingReplicationBlocks" : 0,\n    "PendingReconstructionBlocks" : 0,\n    "UnderReplicatedBlocks" : 20,\n    "LowRedundancyBlocks" : 20,\n    "CorruptBlocks" : 0,\n    "ScheduledReplicationBlocks" : 0,\n    "PendingDeletionBlocks" : 597,\n    "LowRedundancyReplicatedBlocks" : 20,\n    "CorruptReplicatedBlocks" : 0,\n    "MissingReplicatedBlocks" : 0,\n    "MissingReplicationOneBlocks" : 0,\n    "HighestPriorityLowRedundancyReplicatedBlocks" : 0,\n    "HighestPriorityLowRedundancyECBlocks" : 0,\n    "BytesInFutureReplicatedBlocks" : 0,\n    "PendingDeletionReplicatedBlocks" : 597,\n    "TotalReplicatedBlocks" : 985264,\n    "LowRedundancyECBlockGroups" : 0,\n    "CorruptECBlockGroups" : 0,\n    "MissingECBlockGroups" : 0,\n    "BytesInFutureECBlockGroups" : 0,\n    "PendingDeletionECBlocks" : 0,\n    "TotalECBlockGroups" : 0,\n    
"ExcessBlocks" : 0,\n    "NumTimedOutPendingReconstructions" : 0,\n    "PostponedMisreplicatedBlocks" : 24,\n    "PendingDataNodeMessageCount" : 0,\n    "MillisSinceLastLoadedEdits" : 0,\n    "BlockCapacity" : 2097152,\n    "NumLiveDataNodes" : 4,\n    "NumDeadDataNodes" : 0,\n    "NumDecomLiveDataNodes" : 0,\n    "NumDecomDeadDataNodes" : 0,\n    "VolumeFailuresTotal" : 0,\n    "EstimatedCapacityLostTotal" : 0,\n    "NumDecommissioningDataNodes" : 0,\n    "StaleDataNodes" : 0,\n    "NumStaleStorages" : 48,\n    "TotalSyncCount" : 76,\n    "NumInMaintenanceLiveDataNodes" : 0,\n    "NumInMaintenanceDeadDataNodes" : 0,\n    "NumEnteringMaintenanceDataNodes" : 0\n  } ]\n}', u'')
2019-10-25 08:58:31,793 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpmLa9e3 2>/tmp/tmpcGNRPR''] {'quiet': False}
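For context, what Ambari is doing here is querying each NameNode's /jmx endpoint to find the active one; the blob above is the FSNamesystem bean, and the field it checks is tag.HAState ("active" for the first NameNode). A small sketch of extracting that field from a JMX response (the function name is mine, not Ambari's):

```python
import json

FSNAMESYSTEM_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"

def namenode_ha_state(jmx_text: str):
    """Return the tag.HAState value ("active"/"standby") from a /jmx response."""
    for bean in json.loads(jmx_text).get("beans", []):
        if bean.get("name") == FSNAMESYSTEM_BEAN:
            return bean.get("tag.HAState")
    return None

sample = '{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem", "tag.HAState": "active"}]}'
# namenode_ha_state(sample) → "active"
```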

 

 

Finally, it finishes with this message:

 

2019-10-25 09:17:53,022 - call returned (143, "-bash: line 1: 28392 Terminated              curl -s 'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' > /tmp/tmpmLa9e3 2> /tmp/tmpcGNRPR")
2019-10-25 09:17:53,022 - call['hdfs haadmin -ns hdpnndev -getServiceState nn2'] {'logoutput': True, 'user': 'hdfs'}
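Exit code 143 means the curl was killed with SIGTERM after hanging for roughly 19 minutes, which points at yarnatsserver:50070 being unreachable (firewall, DNS, or the NameNode web UI being down) rather than at the JMX query itself. A sketch of the same probe with an explicit timeout instead of an indefinite hang (the URL mirrors the one in the log; adjust host/port to your cluster):

```python
import socket
import urllib.error
import urllib.request

def fetch_jmx(host: str, port: int = 50070, timeout: float = 10.0):
    """Fetch the FSNamesystem JMX bean; return None on failure instead of hanging."""
    url = f"http://{host}:{port}/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout, OSError):
        return None
```

If this returns None for one NameNode but not the other, the problem is reaching that NameNode's web UI, not ats-hbase itself; `hdfs haadmin -getServiceState`, as in the log above, confirms the HA state from the CLI.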

 

 
