Created 10-12-2022 11:07 PM
Hi Team,
We have two Hive Metastore server instances configured on the cluster, on node1 and node2. On node1, the Hive Metastore is raising the alert below.
"Hive Metastore Canary"
The Hive Metastore canary failed to create a database.
I am seeing the errors below in the Service Monitor log file.
2022-10-12 22:40:06,657 WARN com.cloudera.cmf.cdh6client.hive.MetastoreClientImpl: (2 skipped) Could not drop hive database: cloudera_manager_metastore_canary_test_db_hive_HIVEMETASTORE_0137966f79e5f15b3b5d4dec61b7592e
com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:770)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:757)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.dropDatabase(HiveMetaStoreClient.java:940)
at com.cloudera.cmf.cdh6client.hive.MetastoreClientImpl.dropDatabase(MetastoreClientImpl.java:163)
at com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary.cleanUpFromPreviousRuns(HiveMetastoreCanary.java:484)
at com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary.doWorkWithClientConfig(HiveMetastoreCanary.java:175)
at com.cloudera.cmon.firehose.polling.hive.HiveMetastoreCanary.doWorkWithClientConfig(HiveMetastoreCanary.java:52)
at com.cloudera.cmon.firehose.polling.AbstractCdhWorkUsingClientConfigs.doWork(AbstractCdhWorkUsingClientConfigs.java:45)
at com.cloudera.cmon.firehose.polling.CdhTask$InstrumentedWork.doWork(CdhTask.java:230)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper.runTask(ImpersonatingTaskWrapper.java:72)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper.access$000(ImpersonatingTaskWrapper.java:21)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper$1.run(ImpersonatingTaskWrapper.java:107)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at com.cloudera.cmf.cdh6client.security.UserGroupInformationImpl.doAs(UserGroupInformationImpl.java:42)
at com.cloudera.cmf.cdhclient.util.ImpersonatingTaskWrapper.doWork(ImpersonatingTaskWrapper.java:104)
at com.cloudera.cmf.cdhclient.CdhExecutor$1.call(CdhExecutor.java:125)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at com.cloudera.cdh6client.hive.shaded.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
... 27 more
Please help me with this issue.
Created 10-12-2022 11:10 PM
Can anyone please help me with this?
Created 10-13-2022 01:20 AM
Hi @hanumanth, I assume this is a CDH 6 cluster. Do you have Sentry enabled as well? Is this happening all the time, or only occasionally?
Have you tested in beeline how long it takes to drop an example database? Does that also fail with a timeout?
My guess is that it takes more than 60 seconds (the Service Monitor's default timeout), and since the default timeout from HS2 to HMS is 5 minutes, the operation itself actually succeeds.
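For example, you can time the operation directly in beeline, which prints the elapsed time after every statement (a rough sketch; the host in the JDBC URL and the database name are placeholders, and add ;principal=... to the URL if the cluster is Kerberized):
beeline -u "jdbc:hive2://<hs2-host>:10000/default"
CREATE DATABASE IF NOT EXISTS timing_test_db;
-- beeline prints something like "No rows affected (0.8 seconds)" after each statement
DROP DATABASE timing_test_db;
If the DROP regularly takes more than about 60 seconds, that would explain the canary timeout.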
Thanks, Miklos
Created 10-13-2022 02:49 AM
The Hive Metastore server health on node1 shows unhealthy, and because of this, jobs are failing.
How do I troubleshoot this issue?
Created 10-13-2022 03:29 AM
The Canary just tests whether basic operations are working in the Hive Metastore. If it shows "unhealthy", that does not necessarily mean jobs are failing because the Hive Metastore is not functioning (it may just be slow, for example); it is, however, a warning sign that something is not right.
Please connect with beeline to the HiveServer2 and verify what is working and what is failing, then check the HiveServer2 logs and Hive Metastore logs.
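A minimal smoke test from beeline could look like this (a sketch; the JDBC URL is a placeholder and smoke_test_db is just an illustrative name):
beeline -u "jdbc:hive2://<hs2-host>:10000/default"
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS smoke_test_db;
SHOW TABLES IN smoke_test_db;
DROP DATABASE smoke_test_db;
Whichever statement hangs or fails tells you which metastore operation to look for in the logs.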
You can file a support case (where you can share much more detail) if this is an urgent issue.
Created 10-13-2022 05:50 AM
Hey @mszurap
While accessing Hive from beeline, the session hung; I see the messages below in the log.
2022-10-12 09:41:26,507 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5107ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=3349ms
2022-10-12 09:41:43,321 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5098ms
GC pool 'PS MarkSweep' had collection(s): count=3 time=9940ms
2022-10-12 09:42:20,627 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 6398ms
GC pool 'PS MarkSweep' had collection(s): count=6 time=23371ms
2022-10-12 09:42:31,927 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5057ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=3303ms
2022-10-12 09:45:46,227 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 6036ms
GC pool 'PS MarkSweep' had collection(s): count=2 time=7653ms
2022-10-12 09:48:53,560 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5099ms
GC pool 'PS MarkSweep' had collection(s): count=40 time=140485ms
2022-10-12 09:54:03,673 INFO org.apache.hadoop.hive.common.JvmPauseMonitor: [org.apache.hadoop.hive.common.JvmPauseMonitor$Monitor@78267545]: Detected pause in JVM or host machine (eg GC): pause of approximately 5401ms
GC pool 'PS MarkSweep' had collection(s): count=31 time=111314ms
Created 10-13-2022 06:13 AM
These are pretty long GC pauses; I assume they are from the HMS logs. With GC pauses this long, every operation will of course suffer and be slow, and eventually the SMON's request will time out. Kindly review the HMS heap size and consider increasing it until you get stable performance (without such GC pauses).
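To confirm the GC pressure while it is happening, you could watch the HMS JVM with jstat (a rough sketch; the pgrep pattern is illustrative, and you may need to run this as the user that owns the HMS process, often "hive"):
# Find the Hive Metastore PID and sample GC stats every 5 seconds
HMS_PID=$(pgrep -f org.apache.hadoop.hive.metastore.HiveMetaStore | head -1)
jstat -gcutil "$HMS_PID" 5000
If the old-generation column (O) stays near 100% and the full-GC time (FGCT) keeps climbing, the heap is undersized. In Cloudera Manager the heap is typically set under Hive > Configuration > "Java Heap Size of Hive Metastore Server in Bytes"; the role needs a restart after changing it.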
Created 10-15-2022 01:58 AM
Hello @hanumanth
In addition to @mszurap's response about HMS heap size tuning, please review the document below and make sure the underlying host has enough memory; otherwise you will end up with a memory overcommit issue.
https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_hive_tuning.html
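As a quick sanity check (a minimal sketch; this only gives a rough tally of JVM heaps on the host, not other memory use):
# Physical memory on the host
free -g
# Rough tally of the -Xmx settings of the Java processes running here
ps -eo args | grep -o 'Xmx[0-9]*[gGmM]' | sort | uniq -c
If the configured heaps plus OS and non-JVM memory exceed physical RAM, the host may start swapping, and swapping shows up as exactly these kinds of JVM pauses.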
Created 10-18-2022 07:37 AM
@hanumanth, have any of the replies helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur,