Created on 10-25-2019 07:45 AM - edited 10-25-2019 11:34 AM
Ambari 2.7.3
HDP 3.1.0
Hello team. My YARN Timeline Service v2 Reader (ats-hbase) went down 15 days ago and I still haven't been able to fix the issue. If someone has faced this before, please let me know.
From the latest logs:
Caused by: java.net.ConnectException: Call to atsserver/ip:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atsserver/ip:17020
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:165)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at org.apache.hadoop.hbase.ipc.BufferCallBeforeInitHandler.userEventTriggered(BufferCallBeforeInitHandler.java:92)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:307)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1377)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:329)
at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:315)
at org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:929)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.failInit(NettyRpcConnection.java:179)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$500(NettyRpcConnection.java:71)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:269)
at org.apache.hadoop.hbase.ipc.NettyRpcConnection$3.operationComplete(NettyRpcConnection.java:263)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:500)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:479)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: atserver/ip:17020
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hbase.thirdparty.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 7 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
2019-10-10 10:03:05,448 ERROR reader.TimelineReaderServer (LogAdapter.java:error(75)) - RECEIVED SIGNAL 15: SIGTERM
2019-10-10 10:03:05,457 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.w.WebAppContext@7357a011{/,null,UNAVAILABLE}{/timeline}
2019-10-10 10:03:05,460 INFO server.AbstractConnector (AbstractConnector.java:doStop(318)) - Stopped ServerConnector@1bae316d{HTTP/1.1,[http/1.1]}{0.0.0.0:8198}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@4bff7da0{/static,jar:file:/usr/hdp/3.1.0.0-78/hadoop-yarn/hadoop-yarn-common-3.1.1.3.1.0.0-78.jar!/webapps/static,UNAVAILABLE}
2019-10-10 10:03:05,460 INFO handler.ContextHandler (ContextHandler.java:doStop(910)) - Stopped o.e.j.s.ServletContextHandler@3e62d773{/logs,file:///usr/logs/hadoop-yarn/yarn/,UNAVAILABLE}
2019-10-10 10:03:05,462 INFO storage.HBaseTimelineReaderImpl (HBaseTimelineReaderImpl.java:serviceStop(108)) - closing the hbase Connection
2019-10-10 10:03:05,462 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x783a467b to zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181,zookeeper:2181
2019-10-10 10:03:05,465 INFO zookeeper.ReadOnlyZKClient (ReadOnlyZKClient.java:close(342)) - Close zookeeper connection 0x0ba2f4ec to zookeeper quorum
2019-10-10 10:03:05,467 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x26daa4f60630022 closed
2019-10-10 10:03:05,467 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(524)) - EventThread shut down
2019-10-10 10:03:05,468 INFO reader.TimelineReaderServer (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TimelineReaderServer at atsserver/ip
************************************************************/
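In case it helps with troubleshooting: the trace shows the TimelineReaderServer failing to connect to the ats-hbase RegionServer on port 17020 and then receiving a SIGTERM. These are the checks I can run on the ATS host (just a rough sketch, assuming ats-hbase runs in the default YARN service mode; hostnames are the placeholders from the logs):
# Is anything listening on the ats-hbase RegionServer port?
ss -tlnp | grep 17020
# State of the ats-hbase YARN service, run as the yarn-ats user
sudo -u yarn-ats yarn app -status ats-hbase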
I've followed the steps in the workaround below, and now I'm getting a new error message:
https://community.cloudera.com/t5/Support-Questions/ATS-hbase-does-not-seem-to-start/td-p/235155 - guillaume_roger
This cluster used to be Kerberized, but Kerberos was disabled a few months ago, which is why this message looks odd to me:
2019-10-25 08:58:31,650 - HdfsResource['/atsv2/hbase/data'] {'security_enabled': False, 'hadoop_bin_dir': '/usr/hdp/3.1.0.0-78/hadoop/bin', 'keytab': '', 'dfs_type': 'HDFS', 'default_fs': 'hdfs://hdpnndev', 'hdfs_resource_ignore_file': '/var/lib/ambari-agent/data/.hdfs_resource_ignore', 'hdfs_site': ..., 'kinit_path_local': '/usr/bin/kinit', 'principal_name': '', 'user': 'hdfs', 'owner': 'yarnats', 'hadoop_conf_dir': '/usr/hdp/3.1.0.0-78/hadoop/conf', 'type': 'directory', 'action': ['create_on_execute'], 'immutable_paths': [u'/mr-history/done', u'/warehouse/tablespace/managed/hive', u'/warehouse/tablespace/external/hive', u'/app-logs', u'/tmp']}
2019-10-25 08:58:31,652 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://masternode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZs_d1F 2>/tmp/tmpNf5VJV''] {'quiet': False}
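Since the cluster is no longer Kerberized, I also want to confirm what the client configuration actually reports (a quick check; it should print 'simple' on a non-secure cluster):
# Should return 'simple' when Kerberos is disabled
hdfs getconf -confKey hadoop.security.authentication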
Then it got stuck on this part for several minutes:
2019-10-25 08:58:31,791 - call returned (0, '')
2019-10-25 08:58:31,792 - get_user_call_output returned (0, u'{\n "beans" : [ {\n "name" : "Hadoop:service=NameNode,name=FSNamesystem",\n "modelerType" : "FSNamesystem",\n "tag.Context" : "dfs",\n "tag.HAState" : "active",\n "tag.TotalSyncTimes" : "103 16 ",\n "tag.Hostname" : "zookeeper",\n "MissingBlocks" : 0,\n "MissingReplOneBlocks" : 0,\n "ExpiredHeartbeats" : 0,\n "TransactionsSinceLastCheckpoint" : 162564,\n "TransactionsSinceLastLogRoll" : 97,\n "LastWrittenTransactionId" : 136802255,\n "LastCheckpointTime" : 1571986118000,\n "CapacityTotal" : 94452535787520,\n "CapacityTotalGB" : 87966.0,\n "CapacityUsed" : 17431687303220,\n "CapacityUsedGB" : 16235.0,\n "CapacityRemaining" : 72211133116924,\n "ProvidedCapacityTotal" : 0,\n "CapacityRemainingGB" : 67252.0,\n "CapacityUsedNonDFS" : 0,\n "TotalLoad" : 128,\n "SnapshottableDirectories" : 1,\n "Snapshots" : 1,\n "NumEncryptionZones" : 0,\n "LockQueueLength" : 0,\n "BlocksTotal" : 985264,\n "NumFilesUnderConstruction" : 32,\n "NumActiveClients" : 21,\n "FilesTotal" : 1560325,\n "PendingReplicationBlocks" : 0,\n "PendingReconstructionBlocks" : 0,\n "UnderReplicatedBlocks" : 20,\n "LowRedundancyBlocks" : 20,\n "CorruptBlocks" : 0,\n "ScheduledReplicationBlocks" : 0,\n "PendingDeletionBlocks" : 597,\n "LowRedundancyReplicatedBlocks" : 20,\n "CorruptReplicatedBlocks" : 0,\n "MissingReplicatedBlocks" : 0,\n "MissingReplicationOneBlocks" : 0,\n "HighestPriorityLowRedundancyReplicatedBlocks" : 0,\n "HighestPriorityLowRedundancyECBlocks" : 0,\n "BytesInFutureReplicatedBlocks" : 0,\n "PendingDeletionReplicatedBlocks" : 597,\n "TotalReplicatedBlocks" : 985264,\n "LowRedundancyECBlockGroups" : 0,\n "CorruptECBlockGroups" : 0,\n "MissingECBlockGroups" : 0,\n "BytesInFutureECBlockGroups" : 0,\n "PendingDeletionECBlocks" : 0,\n "TotalECBlockGroups" : 0,\n "ExcessBlocks" : 0,\n "NumTimedOutPendingReconstructions" : 0,\n "PostponedMisreplicatedBlocks" : 24,\n "PendingDataNodeMessageCount" : 0,\n "MillisSinceLastLoadedEdits" : 0,\n "BlockCapacity" : 2097152,\n "NumLiveDataNodes" : 4,\n "NumDeadDataNodes" : 0,\n "NumDecomLiveDataNodes" : 0,\n "NumDecomDeadDataNodes" : 0,\n "VolumeFailuresTotal" : 0,\n "EstimatedCapacityLostTotal" : 0,\n "NumDecommissioningDataNodes" : 0,\n "StaleDataNodes" : 0,\n "NumStaleStorages" : 48,\n "TotalSyncCount" : 76,\n "NumInMaintenanceLiveDataNodes" : 0,\n "NumInMaintenanceDeadDataNodes" : 0,\n "NumEnteringMaintenanceDataNodes" : 0\n } ]\n}', u'')
2019-10-25 08:58:31,793 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpmLa9e3 2>/tmp/tmpcGNRPR''] {'quiet': False}
Finally, it finished with this message:
2019-10-25 09:17:53,022 - call returned (143, "-bash: line 1: 28392 Terminated curl -s 'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' > /tmp/tmpmLa9e3 2> /tmp/tmpcGNRPR")
2019-10-25 09:17:53,022 - call['hdfs haadmin -ns hdpnndev -getServiceState nn2'] {'logoutput': True, 'user': 'hdfs'}
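The call that hangs is the JMX probe against the second NameNode (yarnatsserver:50070), which is only killed after roughly 20 minutes (return code 143). To narrow this down, I plan to repeat the same probe by hand with a timeout and check the HA state of both NameNodes (same commands Ambari runs; nn1/nn2 and the hostnames are the placeholders from the logs):
# Same JMX probe Ambari runs, but with a 30-second timeout so it fails fast
curl -s -m 30 'http://yarnatsserver:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
# HA state of both NameNodes in the hdpnndev nameservice
sudo -u hdfs hdfs haadmin -ns hdpnndev -getServiceState nn1
sudo -u hdfs hdfs haadmin -ns hdpnndev -getServiceState nn2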