While executing an SQL query using Livy and PySpark in Zeppelin:
%livy2.pyspark
sqlContext.sql("use vasig_switch")
turnoutResults = sqlContext.sql("select * from roadmaster where roadmaster.roadmaster.name = 'Sv152_rechts_ZV3' and roadmaster.roadmaster.ts = 1492140329000") # this call works
turnoutResults.collect()
the following exception is thrown:
An error occurred while calling o58.collectToPython. : org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    ...
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    ...
I've checked the YARN application log and could not find any errors. I've also checked the Zeppelin, Livy, Hive, and HBase logs, but without luck. Do you know what could cause this exception, or where I could find more information about it?
We are using Spark in yarn-cluster mode.
The following configuration settings are available in spark-env.sh
Could it be that this configuration is not appropriate for yarn-cluster mode? Or is something missing?
I am also not sure whether the problem lies with Spark, Hive, or HBase. If I only define the query, without calling collect(), the DataFrame is created. Calling turnoutResults.printSchema(), for example, also works. The problem only occurs with actions such as collect(), count(), first()...
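This pattern matches Spark's lazy evaluation: sqlContext.sql() and printSchema() only build the query plan and read table metadata from the Hive metastore, while actions such as collect() are what actually trigger the scan against the HBase region servers, so a connectivity failure surfaces only at that point. A minimal pure-Python analogy using a generator (this is only an illustration of the lazy/eager split, not Spark code; the names are made up):

```python
def lazy_scan():
    # Analogy for a Spark transformation: merely calling this function
    # builds a generator (a "plan") and reads nothing, just like
    # sqlContext.sql(...) builds a DataFrame without touching HBase.
    raise ConnectionError("Can't get the locations")  # stand-in for an unreachable region server
    yield  # the yield makes this a generator function

rows = lazy_scan()             # "transformation": succeeds, nothing is read yet
print(type(rows).__name__)     # a generator exists, like the DataFrame

try:
    list(rows)                 # "action", like collect(): forces the scan
except ConnectionError as e:
    print("action failed:", e)
```

If the analogy holds, the root cause would be that the Spark executors cannot locate the HBase regions at scan time (e.g. missing or wrong hbase-site.xml / ZooKeeper quorum on the executor classpath in yarn-cluster mode), even though the metadata-only calls succeed.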
I hope someone can help with this problem, because I have been trying to solve it for several days.
PS: the same exception is thrown if I execute the query from Jupyter with Sparkmagic.