Created on 10-08-2016 08:10 AM - edited 09-16-2022 03:43 AM
We run Cloudera Express 5.7.0 on a cluster of 29 machines (2 name nodes, 22 data nodes, 5 dedicated ZooKeeper servers). All machines run CentOS 6.7.
We have an external table in Hive which we use in three different ways; one of them is a periodic cleanup that drops old partitions.
Every now and then we notice the cleanup failing because it cannot acquire a lock on the table. Specifically, the drop is issued from bash as
hive -e "ALTER TABLE table DROP PARTITION (...)"
and what we would get back is
Unable to acquire IMPLICIT, EXCLUSIVE lock default@table@...(partitions info) after 100 attempts. FAILED: Error in acquiring locks: Locks on the underlying objects cannot be acquired. retry after some time
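Incidentally, the "100 attempts" in the error matches Hive's default lock-retry budget. As a stopgap only (it does not fix whatever is leaking the lock), the retry settings can be raised for the cleanup session. A sketch, assuming the standard HiveConf properties hive.lock.numretries (default 100) and hive.lock.sleep.between.retries (default 60s):

```shell
# Stopgap sketch, not a fix: give the cleanup session a larger lock-retry
# budget before issuing the DROP PARTITION. Property names and defaults are
# from HiveConf; the table name and partition spec are elided as in the post.
hive -e "
SET hive.lock.numretries=300;
SET hive.lock.sleep.between.retries=30s;
ALTER TABLE table DROP PARTITION (...);
"
```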
In order to work around the issue, we restart the whole Hive service, and the problem goes away. Annoying!
I tried to debug the issue a bit, but I am having a hard time figuring out how to continue. Since it happened again today, I thought I would share my findings with you all and seek some help 🙂 I apologize in advance for the very long text.
- If I run show locks default.table partition (...) in hive, I see there is an active lock:
1 default@table@(partitions info) SHARED
3 LOCK_TIME:1475828283643 --- Fri, 07 Oct 2016 08:18:03.643 GMT
5 LOCK_QUERYSTRING: ...
By the look of things, the LOCK_QUERYSTRING is a query submitted via Hue by one of the cluster's users.
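As an aside, the GMT rendering of LOCK_TIME above can be reproduced from bash, since the value is just milliseconds since the epoch (GNU date assumed):

```shell
# Convert Hive's epoch-millisecond LOCK_TIME into a human-readable UTC date.
lock_time_ms=1475828283643
LC_ALL=C date -u -d "@$((lock_time_ms / 1000))" '+%a, %d %b %Y %H:%M:%S'
# prints: Fri, 07 Oct 2016 08:18:03   (the .643 is the millisecond remainder)
```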
- In HiveHistoryServer, I can find a job, submitted at Fri Oct 07 08:19:52 UTC 2016, that runs the exact query from LOCK_QUERYSTRING. It is a bit weird that the submission time is slightly later than the LOCK_TIME, even though they are very close; is it really this job? I am unsure. Assuming the job is the right one, it is flagged as State: SUCCEEDED.
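To quantify "a bit later": the gap between LOCK_TIME (08:18:03.643 GMT) and the job's submission time (08:19:52 UTC) works out to about 109 seconds, e.g.:

```shell
# Seconds between lock acquisition and job submission (GNU date assumed).
lock_s=$((1475828283643 / 1000))                       # LOCK_TIME in whole seconds
job_s=$(LC_ALL=C date -u -d '2016-10-07 08:19:52' +%s) # job submission time
echo $((job_s - lock_s))
# prints: 109
```

Almost two minutes apart, so it may well be a different job that merely ran around the same time.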
- In ZooKeeper, via zookeeper-client, I see the lock:
[zk: localhost:2181(CONNECTED) 5] get /hive_zookeeper_namespace_hive/default/table/(partitions info)/LOCK-SHARED-0000000000
- I can find the machine in the ZooKeeper ensemble, let us call it ZK1, which hosts the client connection holding the lock, by running echo cons | ncat 127.0.0.1 2181
- The machine+port number, i.e. 10.0.0.6:38266, points to HiveServer2. No queries are running.
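For pulling the client endpoints out of the 'cons' output mechanically, something like the following sed works; the sample line is a mock-up of the ZooKeeper 3.4 'cons' format (an assumption), with the host:port taken from above:

```shell
# Extract host:port of a client session from ZooKeeper's 'cons' output.
# In practice the input would come from: echo cons | ncat ZK1 2181
sample_cons_line=' /10.0.0.6:38266[1](queued=0,recved=1234,sent=1234)'
echo "$sample_cons_line" | sed -n 's|^ */\([0-9.]*:[0-9]*\).*|\1|p'
# prints: 10.0.0.6:38266
```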
- if I take a stack trace of HiveServer2 I see, among other things,
Thread 59 (HiveServer2-Handler-Pool: Thread-47-EventThread):
Blocked count: 3
Waited count: 76
Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@8e6f2b0
Thread 58 (HiveServer2-Handler-Pool: Thread-47-SendThread(ZK1)):
Blocked count: 1925
Waited count: 0
I.e., it looks like threads in HiveServer2 are busy with something related to the ZooKeeper machine involved in the issue.
- After restarting Hive, the stack trace no longer shows these threads, and the cleanup succeeds.
I Googled a bit for epollWait and found bugs like http://bugs.java.com/view_bug.do?bug_id=6403933 , i.e. a bug in the Java libraries could be the reason for the observed behavior. On the other hand, it seems that bug mainly manifests as high CPU utilization, and I have no evidence of that (though I did not pay attention to it at the time).
Given all this, do you have any hint for debugging the issue further?
Thanks and regards,