
Slow or stalled queries under highly concurrent write workloads when Sentry + Hive are used, caused



We have been seeing the Hive Metastore canary duration and open connection count go high for some time. While debugging, we came across "lock wait timeout exceeded" errors in the logs and were able to correlate those timeouts with the peaks to a certain extent.
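For reference, a minimal sketch of one way to find the blocking transaction on the backing database when those timeouts appear (this assumes the metastore database is MySQL/InnoDB, which is what the "lock wait timeout exceeded" message suggests; the JDBC URL, schema and credentials below are placeholders):

// Minimal sketch, assuming a MySQL/InnoDB metastore backing database.
// Connection details are placeholders, not our real environment.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FindBlockingTrx {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:mysql://metastore-db-host:3306/hive";          // placeholder host/schema
    try (Connection c = DriverManager.getConnection(url, "hive", "secret"); // placeholder credentials
         Statement s = c.createStatement();
         // join the InnoDB lock-wait and transaction views to pair each
         // waiting statement with the statement blocking it
         ResultSet rs = s.executeQuery(
             "SELECT r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query, " +
             "       b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query " +
             "FROM information_schema.innodb_lock_waits w " +
             "JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id " +
             "JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id")) {
      while (rs.next()) {
        System.out.println(rs.getString("waiting_query")
            + " is blocked by " + rs.getString("blocking_query"));
      }
    }
  }
}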

 

As suggested in https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_hive_ki.html#tsb_2018_..., we planned to upgrade CDH to 5.14.4 and did so in our dev environment. Dev has been running 5.14.4 for the past 10 days. In the meantime we hit a similar incident in prod, where those two metrics went high but there were no lock wait timeout errors in the logs. We took thread dumps a couple of times and restarted the HMS as a temporary fix in prod. On analysing those prod thread dumps, we observed the following:

 

1. 10 threads were in BLOCKED state:

pool-5-thread-152 - priority:5 - threadId:0x00007f35915b1800 - nativeId:0x1fb45 - state:BLOCKED
stackTrace:
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hive.hcatalog.listener.DbNotificationListener.enqueue(DbNotificationListener.java:332)
- waiting to lock <0x00000007039ef4c0> (a java.lang.Object)
at org.apache.hive.hcatalog.listener.DbNotificationListener.onAddPartition(DbNotificationListener.java:169)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier$10.notify(MetaStoreListenerNotifier.java:87)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:167)
at org.apache.hadoop.hive.metastore.MetaStoreListenerNotifier.notifyEvent(MetaStoreListenerNotifier.java:197)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partition_core(HiveMetaStore.java:2711)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.add_partition_with_environment_context(HiveMetaStore.java:2755)
at sun.reflect.GeneratedMethodAccessor92.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy10.add_partition_with_environment_context(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$add_partition_with_environment_context.getResult(ThriftHiveMetastore.java:10316)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$add_partition_with_environment_context.getResult(ThriftHiveMetastore.java:10300)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:110)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:106)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:118)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) 
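To make the pattern in that stack clearer: DbNotificationListener.enqueue() is waiting on a single java.lang.Object monitor, so every metastore event (add_partition, alter_table, ...) gets written to the notification log one at a time and the other worker threads block behind it. A minimal sketch of that shape (our simplification, not the actual Hive source; class and field names other than enqueue() are placeholders):

// Simplified sketch only; identifiers other than enqueue() are placeholders.
public class NotificationListenerSketch {
  // one static lock object per metastore process; this is the
  // "waiting to lock <0x...> (a java.lang.Object)" seen in the dump
  private static final Object NOTIFICATION_LOCK = new Object();

  void enqueue(Object event) {
    synchronized (NOTIFICATION_LOCK) {            // all HMS worker threads queue here
      long id = nextEventId();                    // placeholder: fetch/bump the event id in the backing DB
      writeToNotificationLog(id, event);          // placeholder: the insert holds the monitor while it runs
    }
  }

  private long nextEventId() { return 0L; }                        // placeholder
  private void writeToNotificationLog(long id, Object e) { }       // placeholder
}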
 
2. 700+ threads were in WAITING state:
 
pool-5-thread-794 - priority:5 - threadId:0x00007f359238f800 - nativeId:0x128a9 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007044a0f10> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:275)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:682)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:660)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_all_functions(HiveMetaStore.java:6189)
at sun.reflect.GeneratedMethodAccessor82.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
at com.sun.proxy.$Proxy10.get_all_functions(Unknown Source)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_all_functions.getResult(ThriftHiveMetastore.java:11977)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$get_all_functions.getResult(ThriftHiveMetastore.java:11961)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:110)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:106)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:118)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
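The second pattern is different: these threads are parked on the ReentrantLock that ObjectStore.setConf() takes while a new RawStore is initialised, so every request that needs a fresh ObjectStore queues behind whichever thread currently holds that lock. Again only a rough sketch (names other than setConf() are our placeholders):

// Simplified sketch of the second pattern; only setConf() mirrors a real
// Hive method name, the rest are placeholders.
import java.util.concurrent.locks.ReentrantLock;

public class ObjectStoreSketch {
  // one process-wide lock guarding persistence-layer setup
  private static final ReentrantLock INIT_LOCK = new ReentrantLock();

  public void setConf(Object conf) {
    INIT_LOCK.lock();                  // the 700+ WAITING threads were parked here
    try {
      initPersistenceFactory(conf);    // placeholder: JDO/DataNucleus setup against the RDBMS
    } finally {
      INIT_LOCK.unlock();
    }
  }

  private void initPersistenceFactory(Object conf) { }  // placeholder
}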
 
Since threads were blocked in DbNotificationListener, we assumed that both issues were related and that the bug fix above would solve the problem. But dev, which has been running 5.14.4, hit a similar issue last week with no lock wait errors in the logs. Though we didn't get a chance to take a thread dump on dev last week, I am a bit suspicious and losing confidence that this upgrade will fix the HMS issue. Can you please share your thoughts?
 
Thanks,
Mani
 
 
1 REPLY 1

Re: Slow or stalled queries under highly concurrent write workloads when Sentry + Hive are used, caused


We upgraded CDH to 5.14.4 in prod a couple of weeks back. Last week, when we performed data movement activities on a table that has a huge number of partitions, we came across this issue again, and the log showed a lock wait timeout on "INSERT INTO PARTITIONS..." etc. Can you take this issue up with your technical team?
