Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Why Metastore Crash triggered Services Shutdown?

Why Metastore Crash triggered Services Shutdown?

Explorer

We have a CDH 5.8.0 setup running on EC2 that is provisioned by Director scripts. We use an external postgresql as the metastore.

 

Yesterday, the ec2 instance running postgresql went down and caused the following DB connection error. And after the DB instance recovered, kafka Broker and HBase Region Server turned RED. In the log, we found that services (Kafka broker, DataNode) shutdown were triggered right after the incident.

 

Today we rebooted our DB instance and reproduced the issue. HBase Region Server went down after the DB connection failure.

 

Could you please help us understand this behavior?

 

 

DB connection error:

 

ip-10-0-1-121.us-west-2.compute.internalWARNApril 28, 2017 12:16 AMCommandPusherThread
Failed to update command
javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Could not open connection
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387)
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310)
	at org.hibernate.ejb.AbstractEntityManagerImpl.throwPersistenceException(AbstractEntityManagerImpl.java:1397)
	at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:62)
	at com.cloudera.enterprise.AbstractWrappedEntityManager.beginForRollbackAndReadonly(AbstractWrappedEntityManager.java:85)
	at com.cloudera.cmf.persist.CmfEntityManager.beginForRollbackAndReadonly(CmfEntityManager.java:338)
	at com.cloudera.server.cmf.CommandPusherThread.innerLoop(CommandPusherThread.java:178)
	at com.cloudera.server.cmf.CommandPusherThread.run(CommandPusherThread.java:156)
Caused by: org.hibernate.exception.GenericJDBCException: Could not open connection
	at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:54)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:125)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:110)
	at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.obtainConnection(LogicalConnectionImpl.java:221)
	at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.getConnection(LogicalConnectionImpl.java:157)
	at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doBegin(JdbcTransaction.java:67)
	at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.begin(AbstractTransactionImpl.java:160)
	at org.hibernate.internal.SessionImpl.beginTransaction(SessionImpl.java:1426)
	at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:59)
	... 4 more
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:374)
	at org.postgresql.jdbc2.AbstractJdbc2Connection.execSQLUpdate(AbstractJdbc2Connection.java:263)
	at org.postgresql.jdbc2.AbstractJdbc2Connection.setTransactionIsolation(AbstractJdbc2Connection.java:829)
	at com.mchange.v2.c3p0.impl.NewProxyConnection.setTransactionIsolation(NewProxyConnection.java:701)
	at org.hibernate.service.jdbc.connections.internal.C3P0ConnectionProvider.getConnection(C3P0ConnectionProvider.java:86)
	at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:292)
	at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.obtainConnection(LogicalConnectionImpl.java:214)
	... 9 more
View Log File

 

......

    
ip-10-0-1-121.us-west-2.compute.internalWARNApril 28, 2017 12:16 AMAgentProtocolImpl
Internal failure of Heartbeat processing for host id: d9417f7c-2025-4d11-8b14-2ddee310f45f
javax.persistence.PersistenceException: org.hibernate.exception.GenericJDBCException: Could not open connection
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387)
	at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310)
	at org.hibernate.ejb.AbstractEntityManagerImpl.throwPersistenceException(AbstractEntityManagerImpl.java:1397)
	at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:62)
	at com.cloudera.enterprise.AbstractWrappedEntityManager.begin(AbstractWrappedEntityManager.java:65)
	at com.cloudera.cmf.persist.CmfEntityManager.begin(CmfEntityManager.java:293)
	at com.cloudera.server.cmf.AgentProtocolImpl.heartbeat(AgentProtocolImpl.java:198)
	at sun.reflect.GeneratedMethodAccessor1073.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.avro.ipc.specific.SpecificResponder.respond(SpecificResponder.java:88)
	at org.apache.avro.ipc.Responder.respond(Responder.java:149)
	at org.apache.avro.ipc.Responder.respond(Responder.java:99)
	at com.cloudera.enterprise.SafeAvroResponderServlet.doPost(SafeAvroResponderServlet.java:53)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:401)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.hibernate.exception.GenericJDBCException: Could not open connection
	at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:54)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:125)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:110)
	at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.obtainConnection(LogicalConnectionImpl.java:221)
	at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.getConnection(LogicalConnectionImpl.java:157)
	at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doBegin(JdbcTransaction.java:67)
	at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.begin(AbstractTransactionImpl.java:160)
	at org.hibernate.internal.SessionImpl.beginTransaction(SessionImpl.java:1426)
	at org.hibernate.ejb.TransactionImpl.begin(TransactionImpl.java:59)
	... 25 more
Caused by: org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
	at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2102)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1835)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:500)
	at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:374)
	at org.postgresql.jdbc2.AbstractJdbc2Connection.execSQLUpdate(AbstractJdbc2Connection.java:263)
	at org.postgresql.jdbc2.AbstractJdbc2Connection.setTransactionIsolation(AbstractJdbc2Connection.java:829)
	at com.mchange.v2.c3p0.impl.NewProxyConnection.setTransactionIsolation(NewProxyConnection.java:701)
	at org.hibernate.service.jdbc.connections.internal.C3P0ConnectionProvider.getConnection(C3P0ConnectionProvider.java:86)
	at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:292)
	at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.obtainConnection(LogicalConnectionImpl.java:214)
	... 30 more
View Log File
......    
ip-10-0-1-121.us-west-2.compute.internalERRORApril 28, 2017 12:16 AMSqlExceptionHelper
FATAL: terminating connection due to administrator command

 

 

 

Kafka Shutdown:

 

ip-10-0-1-119.us-west-2.compute.internalINFOApril 28, 2017 12:16 AMKafkaServer
[Kafka Server 33], Starting controlled shutdown
View Log File
ip-10-0-1-119.us-west-2.compute.internalINFOApril 28, 2017 12:16 AMKafkaServer
[Kafka Server 33], shutting down
View Log File
.......    
ip-10-0-1-31.us-west-2.compute.internalINFOApril 28, 2017 12:16 AMReplicaStateMachine
[Replica state machine on controller 35]: Invoking state change to OfflineReplica for replicas [Topic=idEvent,Partition=0,Replica=33]
View Log File
ip-10-0-1-31.us-west-2.compute.internalWARNApril 28, 2017 12:16 AMRequestSendThread
[Controller-35-to-broker-33-send-thread], Controller 35 epoch 113 fails to send request {controller_id=35,controller_epoch=113,delete_partitions=0,partitions=[{topic=idEvent,partition=0}]} to broker Node(33, ip-10-0-1-119.us-west-2.compute.internal, 9092). Reconnecting to broker.
java.io.IOException: Connection to 33 was disconnected before the response was read
	at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:87)
	at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1$$anonfun$apply$1.apply(NetworkClientBlockingOps.scala:84)
	at scala.Option.foreach(Option.scala:257)
	at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:84)
	at kafka.utils.NetworkClientBlockingOps$$anonfun$blockingSendAndReceive$extension$1.apply(NetworkClientBlockingOps.scala:80)
	at kafka.utils.NetworkClientBlockingOps$.recurse$1(NetworkClientBlockingOps.scala:129)
	at kafka.utils.NetworkClientBlockingOps$.kafka$utils$NetworkClientBlockingOps$$pollUntilFound$extension(NetworkClientBlockingOps.scala:139)
	at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:80)
	at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:180)
	at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:171)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
View Log File
ip-10-0-1-31.us-west-2.compute.internalINFOApril 28, 2017 12:16 AMKafkaController
[Controller 35]: New leader and ISR for partition [idEvent,0] is {"leader":35,"leader_epoch":94,"isr":[34,35]}
View Log File
...    

 

 

DataNode Shutdown:

p-10-0-1-119.us-west-2.compute.internalERRORApril 28, 2017 12:16 AMDataNode
RECEIVED SIGNAL 15: SIGTERM
View Log File
ip-10-0-1-119.us-west-2.compute.internalINFOApril 28, 2017 12:16 AMDataNode
SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at ip-10-0-1-119.us-west-2.compute.internal/10.0.1.119
************************************************************/