Member since: 10-03-2015
Posts: 18
Kudos Received: 0
Solutions: 0
04-05-2017
07:48 AM
Hi, I just want to say that we successfully added two extra JournalNodes (JNs) to our HA cluster* by following the three-step procedure suggested here. For copying the content over to the new JNs we used rsync, very handy (a rough sketch is below)! Regards, Filippo

* running CDH 5.9.1 on CentOS 6.7 machines
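For anyone following the same procedure, this is roughly what the copy step looked like for us. It is only a sketch: the edits directory assumes the common CDH default (dfs.journalnode.edits.dir = /dfs/jn), and jn3.example.com stands in for the new JournalNode host.

# Run from an existing JournalNode host, with the new JournalNode role already
# added in Cloudera Manager but not yet started.
sudo rsync -a --delete /dfs/jn/ root@jn3.example.com:/dfs/jn/

# Make sure the copied files are owned by the user running the JournalNode
# (hdfs on a default CDH install):
ssh root@jn3.example.com 'chown -R hdfs:hdfs /dfs/jn'

# Then start the new JournalNode role and watch the NameNode / JournalNode
# logs to confirm it is accepting edits.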
11-14-2016
09:43 AM
Hi, I think we might have the same issue. Any update on this?

We are running Cloudera 5.8.2 and are not able to see running jobs in Job Browser. I checked the <blabla>:8888/logs page, nothing interesting. I checked /var/log/hue and found the following, which *might* be related:

[14/Nov/2016 17:16:43 +0100] decorators ERROR Error running <function check_status at 0x7f60c99bf2a8>
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hue/desktop/libs/notebook/src/notebook/decorators.py", line 81, in decorator
    return func(*args, **kwargs)
  File "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hue/desktop/libs/notebook/src/notebook/api.py", line 167, in check_status
    response['query_status'] = get_api(request, snippet).check_status(notebook, snippet)
  File "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hue/desktop/libs/notebook/src/notebook/connectors/hiveserver2.py", line 67, in decorator
    return func(*args, **kwargs)
  File "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hue/desktop/libs/notebook/src/notebook/connectors/hiveserver2.py", line 248, in check_status
    raise QueryError(operation.errorMessage)
QueryError: No error message, please check the logs.

Regards, Filippo
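In case it helps anyone digging through the same logs, this is roughly the kind of search I ran; /var/log/hue is the default log directory on a CDH parcel install, so adjust the path if your layout differs.

# Find recent check_status failures in the Hue logs and show the surrounding
# traceback (the path and context size are just examples):
grep -n -A 12 "Error running <function check_status" /var/log/hue/*.log | tail -n 60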
10-13-2016
12:44 PM
Hi, I'll take the opportunity for a quick update:

i) We upgraded both CM and CDH to 5.8.2. I am curious to see whether the problem occurs again.

ii) I just took a stack trace of HiveServer2: what I thought could have been related to the issue seems in fact to be part of the normal behavior:

Thread 1213 (HiveServer2-Handler-Pool: Thread-64-EventThread):
  State: WAITING
  Blocked count: 0
  Waited count: 2
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@12bdc49c
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:494)

Thread 1212 (HiveServer2-Handler-Pool: Thread-64-SendThread(zookeeper2.int....:2181)):
  State: RUNNABLE
  Blocked count: 423
  Waited count: 0
  Stack:
    sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:338)
    org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Cheers, Filippo
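For completeness, this is one common way to take such a thread dump; it is a sketch rather than the exact commands I used, and it assumes CDH defaults (HiveServer2 web UI on port 10002, process running as the hive user, hostname is a placeholder).

# Option 1: the /stacks page exposed by the HiveServer2 web UI:
curl -s http://hiveserver2.example.com:10002/stacks > hs2-stacks.txt

# Option 2: jstack against the HiveServer2 process (run as the process owner):
sudo -u hive jstack "$(pgrep -f org.apache.hive.service.server.HiveServer2)" > hs2-jstack.txt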
10-10-2016
11:10 PM
Hi, thanks for your reply! I can see how to do that with the zookeeper-client command (a sketch of what I have in mind is below). However, I am a bit concerned about a few aspects of the proposed strategy:

- before deleting anything, we have to identify the offending lock. That requires some manual fiddling, which takes time and is error prone. Automating the whole process is an option, but it feels like investing energy in the wrong direction;
- once we delete the znode on the ZooKeeper side, we are deliberately violating the expected semantics of the ZooKeeper locking library. How does it handle a lock disappearing all of a sudden? Is this even supposed to happen?

Let's see if anyone else lands on this page and is willing to share thoughts with us 🙂

Btw, with "I would open a ticket for Cloudera support", were you referring to the opportunity Cloudera Enterprise users have to talk directly to Cloudera people? Because unfortunately we are not at that stage... yet!

Thanks once more and regards, Filippo
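For reference, this is the kind of manual removal I understand you are suggesting. It is only a sketch: the znode path is a placeholder standing in for the lock identified via SHOW LOCKS, and hive_zookeeper_namespace_hive is the CDH default namespace.

# Placeholder path for the offending lock znode (the real path comes from
# SHOW LOCKS / zookeeper-client inspection):
LOCK_ZNODE='/hive_zookeeper_namespace_hive/default/mytable/dt=2016-10-07/LOCK-SHARED-0000000000'

# Double-check what the znode records (query id, timestamp, query string):
zookeeper-client -server zk1.example.com:2181 get "$LOCK_ZNODE"

# Only if the recorded query is confirmed dead, remove it -- this is exactly
# the "violating the library's semantics" step I am uneasy about:
zookeeper-client -server zk1.example.com:2181 rmr "$LOCK_ZNODE"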
10-08-2016
08:10 AM
Hi, we run Cloudera Express 5.7.0 on a cluster of 29 machines (2 name nodes, 22 data nodes, 5 dedicated ZooKeeper servers). All machines run CentOS 6.7.

We have an external table in Hive, which we use in three different ways: we ingest new data at a regular pace throughout the day; we run queries/workflows for data mining; and we perform cleanup, a procedure that triggers once per day and deletes data older than x days.

Every now and then, the cleanup fails because it cannot acquire the lock on the table. Specifically, the cleanup is invoked from bash as

hive -e "ALTER TABLE table DROP PARTITION (...)"

and what we get back is

Unable to acquire IMPLICIT, EXCLUSIVE lock default@table@...(partitions info) after 100 attempts.
FAILED: Error in acquiring locks: Locks on the underlying objects cannot be acquired. retry after some time

To work around the issue, we restart the whole Hive service and the problem goes away. Annoying! I tried to debug the issue a bit, but I have a hard time figuring out how to continue. Since it happened again today, I thought about sharing my findings with you all and seeking some help 🙂 I apologize in advance for the very long text.

- If I run show locks default.table partition (...) in Hive, I see there is an active lock:

  default@table@(partitions info) SHARED
  LOCK_QUERYID:hive_20161007081717_9543c9b3-dde1-4b8d-942c-979a091b3eae
  LOCK_TIME:1475828283643 --- Fri, 07 Oct 2016 08:18:03.643 GMT
  LOCK_MODE:IMPLICIT
  LOCK_QUERYSTRING: ...

  By the look of things, LOCK_QUERYSTRING is a query submitted via Hue by one of the cluster users.

- In HiveHistoryServer, I can find a job, submitted at Fri Oct 07 08:19:52 UTC 2016, that runs that specific query (LOCK_QUERYSTRING). Note that it is a bit weird that the submit time is slightly later than LOCK_TIME, even though they are very close; is it really this job? Unsure. Assuming it is, the job is flagged as State: SUCCEEDED.

- In ZooKeeper, via zookeeper-client, I can see the lock:

  [zk: localhost:2181(CONNECTED) 5] get /hive_zookeeper_namespace_hive/default/table/(partitions info)/LOCK-SHARED-0000000000
  hive_20161007081717_9543c9b3-dde1-4b8d-942c-979a091b3eae:1475828283643: IMPLICIT: (LOCK_QUERYSTRING)

- I can find the machine in the ZooKeeper ensemble, let us call it ZK1, which hosts the connection holding the lock, by running echo cons | ncat 127.0.0.1 2181:

  /10.0.0.6:38266[1](queued=0,recved=412004,sent=412078,sid=0x456d6bab545a8ef,lop=PING,est=1474622333769,to=40000,lcxid=0x4cfac,lzxid=0xffffffffffffffff,lresp=1475916179148,llat=0,minlat=0,avglat=0,maxlat=19)

  The machine+port, i.e. 10.0.0.6:38266, points to HiveServer2. No queries are running.

- If I take a stack trace of HiveServer2 I see, among other things:

  Thread 59 (HiveServer2-Handler-Pool: Thread-47-EventThread):
    State: WAITING
    Blocked count: 3
    Waited count: 76
    Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@8e6f2b0
    Stack:
      sun.misc.Unsafe.park(Native Method)
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
      java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
      org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:494)

  Thread 58 (HiveServer2-Handler-Pool: Thread-47-SendThread(ZK1)):
    State: RUNNABLE
    Blocked count: 1925
    Waited count: 0
    Stack:
      sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
      sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
      sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
      sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
      sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
      org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:338)
      org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

  I.e., it looks like HiveServer2 threads are doing something related to the ZooKeeper machine involved in the issue.

- After restarting Hive, the stack trace no longer shows such threads, and the cleanup succeeds.

I googled a bit for epollWait and found bugs like http://bugs.java.com/view_bug.do?bug_id=6403933 , i.e. a bug in the Java libraries could be the reason for the observed behavior. On the other hand, it seems like that bug impacts CPU utilization, and I have no evidence of that (I did not pay attention to it).

Given all this, do you have any hints for debugging the issue further? (A short recap of the commands I used is below.)

Thanks and regards, Filippo
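For anyone who wants to reproduce the checks quickly, here is a rough recap of the commands behind the findings above. Table name, partition spec and hostnames are placeholders for our real ones.

# 1. Check for active locks on the partition:
hive -e "SHOW LOCKS default.mytable PARTITION (dt='2016-10-07');"

# 2. Find which ZooKeeper server hosts the client session behind the lock:
#    run the 'cons' four-letter command on each server and grep for the
#    HiveServer2 address (10.0.0.6 in my case).
for zk in zk1.example.com zk2.example.com zk3.example.com zk4.example.com zk5.example.com; do
  echo "== $zk =="
  echo cons | ncat "$zk" 2181 | grep '10.0.0.6'
done

# 3. Inspect the lock znode itself (hive_zookeeper_namespace_hive is the CDH
#    default namespace); its content records the query id, timestamp and query:
zookeeper-client -server zk1.example.com:2181 get \
  /hive_zookeeper_namespace_hive/default/mytable/dt=2016-10-07/LOCK-SHARED-0000000000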
Labels:
- Apache Hive
- Apache Zookeeper
- Cloudera Hue