Support Questions


Hive Metastore alert due to Hive Metastore canary

Explorer

Hi Team,

 

We have two Hive Metastore Server services configured on the cluster, on node1 and node2. The Hive Metastore on node1 is alerting with the message below.

 

"Hive Metastore Canary"

 The Hive Metastore canary failed to create a database.

 

I am seeing the errors below in the Service Monitor log file.

 

2022-10-12 22:40:06,657 WARN com.cloudera.cmf.cdh6client.hive.MetastoreClientImpl: (2 skipped) Could not drop hive database: cloudera_manager_metastore_canary_test_db_hive_HIVEMETASTORE_0137966f79e5f15b3b5d4dec61b7592e
com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:770)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:757)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropDatabase(HiveMetaStoreClient.java:940)
at com.cloudera.cmf.cdh6client.hive.MetastoreClientImpl.dropDatabase(MetastoreClientImpl.java:163)
at com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary.cleanUpFromPreviousRuns(HiveMetastoreCanary.java:484)
at com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary.doWorkWithClientConfig(HiveMetastoreCanary.java:175)
at com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary.doWorkWithClientConfig(HiveMetastoreCanary.java:52)
at com.cloudera.cmon.firehose.polling.AbstractCdhWorkUsingClientConfigs.doWork(AbstractCdhWorkUsingClientConfigs.java:45)
at com.cloudera.cmon.firehose.polling.CdhTask$InstrumentedWork.doWork(CdhTask.java:230)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper.runTask(ImpersonatingTaskWrapper.java:72)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper.access$000(ImpersonatingTaskWrapper.java:21)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper$1.run(ImpersonatingTaskWrapper.java:107)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at com.cloudera.cmf.cdh6client.security.UserGroupInformationImpl.doAs(UserGroupInformationImpl.java:42)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper.doWork(ImpersonatingTaskWrapper.java:104)
at com.cloudera.cmf.cdhclient.CdhExecutor$1.call(CdhExecutor.java:125)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
... 27 more

 

Please help me with this issue.

 


8 REPLIES

Explorer

Can anyone please help me with this?

Expert Contributor

Hi @hanumanth, I assume this is a CDH 6 cluster. Do you have Sentry enabled as well? Does this always happen, or only at certain times?

Have you tested in beeline how long it takes to drop an example database? Does that also fail with a timeout?

I suspect it takes more than 60 seconds (the Service Monitor's default timeout); since the default HiveServer2-to-HMS timeout is 5 minutes, the operation itself actually succeeds there.
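A timed drop of a throwaway database from beeline might look like the following (a sketch that requires a live cluster; the JDBC URL and database name are placeholders, and on a Kerberized cluster the principal would need to be appended to the URL):

```shell
# Time a metastore round-trip from the client side. The Service Monitor
# canary times out after about 60 seconds by default, so anything close
# to that is suspect. URL and database name below are placeholders.
time beeline -u 'jdbc:hive2://hs2-host.example.com:10000/default' \
  -e 'CREATE DATABASE IF NOT EXISTS smoke_test_db; DROP DATABASE smoke_test_db;'
```

If the same statement is fast in beeline but the canary still times out, the bottleneck is more likely on the metastore side than in HiveServer2.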

Thanks, Miklos  

Explorer

The Hive Metastore Server health on node1 shows unhealthy, and because of this, jobs are failing.

 

How do I troubleshoot this issue?

Expert Contributor

The canary just tests whether basic operations work in the Hive Metastore. If it shows "unhealthy", that does not necessarily mean jobs are failing because the Hive Metastore is not functioning (it may just be slow, for example); it is, however, a warning sign that something is wrong.

Please connect to HiveServer2 with beeline and verify what is working and what is failing, then check the HiveServer2 and Hive Metastore logs.
You can file a support case (where you can share far more detail) if this is urgent.

Explorer

Hey @mszurap 

 

While accessing Hive from beeline, it hung, and the log shows the messages below.

 

2022-10-12 09:41:26,507 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5107ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=3349ms
2022-10-12 09:41:43,321 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5098ms
GC pool 'PS MarkSweep' had collection(s): count=3 time=9940ms
2022-10-12 09:42:20,627 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 6398ms
GC pool 'PS MarkSweep' had collection(s): count=6 time=23371ms
2022-10-12 09:42:31,927 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5057ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=3303ms
2022-10-12 09:45:46,227 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 6036ms
GC pool 'PS MarkSweep' had collection(s): count=2 time=7653ms
2022-10-12 09:48:53,560 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5099ms
GC pool 'PS MarkSweep' had collection(s): count=40 time=140485ms
2022-10-12 09:54:03,673 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5401ms
GC pool 'PS MarkSweep' had collection(s): count=31 time=111314ms
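To gauge the scale of the problem, the `time=` values on these lines can be totaled with a one-liner. This is a sketch: two of the sample lines from above are piped in here, but in practice you would pipe `grep "GC pool" <path-to-hms-log>` into the `awk` instead.

```shell
# Total the GC collection time reported on "GC pool" lines.
# Replace the printf with `grep "GC pool" <path-to-hms-log>` for a real log.
printf "GC pool 'PS MarkSweep' had collection(s): count=1 time=3349ms\nGC pool 'PS MarkSweep' had collection(s): count=3 time=9940ms\n" \
  | awk -F'time=' '{sub(/ms.*/, "", $2); total += $2}
                   END {printf "total GC time: %d ms\n", total}'
# prints "total GC time: 13289 ms"
```

Minutes of accumulated full-GC time per log window, as in the `count=40 time=140485ms` line above, points strongly at an undersized heap.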

Expert Contributor

These are pretty long GC pauses; I assume they are from the HMS logs. With GC pauses this long, every operation will suffer and be slow, and eventually the Service Monitor's request will time out. Please review the HMS heap size and consider increasing it until you get stable performance (without such GC pauses).
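For reference, the heap is raised in Cloudera Manager under the Hive service's configuration. The setting below is how it appears in CM 6; the value is purely illustrative, and the role must be restarted afterwards:

```
Cloudera Manager -> Hive -> Configuration ->
  "Java Heap Size of Hive Metastore Server in Bytes"
(e.g. raise from 4 GiB to 8 GiB; size it based on observed usage)
Restart the Hive Metastore Server role after changing it.
```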

Cloudera Employee

Hello @hanumanth 

In addition to @mszurap's response on HMS heap size tuning, please review the document below and make sure the underlying host has enough memory; otherwise you will end up with a memory overcommit issue.

https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_hive_tuning.html
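A rough overcommit sanity check is to compare the sum of the configured JVM heaps on the host against its physical RAM. The numbers below are illustrative placeholders, not recommendations:

```shell
# Rough overcommit check: if the configured heaps alone exceed physical RAM,
# the host will swap or the kernel may OOM-kill a role process.
awk 'BEGIN {
  ram_gb   = 32            # physical memory on the host (placeholder)
  heaps_gb = 8 + 12 + 16   # e.g. HMS + HS2 + other role heaps (placeholders)
  printf "configured heaps: %dGB of %dGB RAM -> %s\n",
         heaps_gb, ram_gb, (heaps_gb > ram_gb ? "overcommitted" : "ok")
}'
# prints "configured heaps: 36GB of 32GB RAM -> overcommitted"
```

Remember that the JVM needs headroom beyond `-Xmx` (metaspace, threads, native buffers), so heaps should sum to comfortably less than total RAM.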

Community Manager

@hanumanth, have any of the replies helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager

