
DataFrame collect method does not load data from Hive (executed in Zeppelin Notebook)

Hello,

While executing a SQL query using Livy and PySpark in Zeppelin:

%livy2.pyspark
sqlContext.sql("use vasig_switch")
turnoutResults = sqlContext.sql("select * from roadmaster where roadmaster.roadmaster.name = 'Sv152_rechts_ZV3' and roadmaster.roadmaster.ts = 1492140329000")  # this call works
turnoutResults.collect()  # the exception below is thrown here

the following exception is thrown:

An error occurred while calling o58.collectToPython.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
...
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
...

I've checked the application log and could not find any error. I've also checked the Zeppelin, Livy, Hive and HBase logs, but without luck. Do you know what could cause this exception, or where I could find more information about it?

We are using Spark in yarn-cluster mode.

The following settings are configured in spark-env.sh:

SPARK_DRIVER_MEMORY="4G"
SPARK_EXECUTOR_MEMORY="2G"

export HADOOP_YARN_HOME=/usr/hdp/current/hadoop-yarn-client
export YARN_CONF_DIR="${YARN_CONF_DIR:-$HADOOP_YARN_HOME/conf}"
export SPARK_CONF_DIR=${SPARK_CONF_DIR:-/usr/hdp/current/spark2-thriftserver/conf}
export SPARK_LOG_DIR=/var/log/spark2
export SPARK_PID_DIR=/var/run/spark2
export SPARK_DAEMON_MEMORY=1024m

SPARK_IDENT_STRING=$USER

SPARK_NICENESS=0

export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/current/hadoop-client}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/current/hadoop-client/conf}

export HBASE_CONF_DIR=${HBASE_CONF_DIR:-/usr/hdp/current/hbase-client/conf}
export HBASE_HOME=${HBASE_HOME:-/usr/hdp/current/hbase-client}

export HBASE_CLASSPATH=${HBASE_CLASSPATH}

export JAVA_HOME=/usr/jdk64/jdk1.8.0_112

Could it be that this configuration is not appropriate for yarn-cluster mode? Or is something missing?
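
One thing I am wondering about (this is only a guess on my part, not something I have verified): in yarn-cluster mode the driver runs on a YARN node, so perhaps hbase-site.xml from /usr/hdp/current/hbase-client/conf needs to be shipped to the driver and executors explicitly, for example via the %livy2 interpreter settings in Zeppelin or spark-defaults.conf, along these lines:

# Candidate settings for the livy2 interpreter (standard Spark properties with the
# "livy." prefix that Zeppelin's Livy interpreter uses); the paths come from the
# spark-env.sh above. This is an assumption, not a verified fix.
livy.spark.files /usr/hdp/current/hbase-client/conf/hbase-site.xml
livy.spark.driver.extraClassPath /usr/hdp/current/hbase-client/conf
livy.spark.executor.extraClassPath /usr/hdp/current/hbase-client/conf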

I am also not sure whether the problem is in Spark, Hive or HBase. If I only execute the query, without calling collect(), the DataFrame is created. Calling turnoutResults.printSchema(), for example, also works. The problem only appears with actions such as collect(), count() and first().
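
To illustrate the behaviour, here is a minimal sketch with the same table as above (judging from the stack trace I assume the Hive table is backed by HBase):

%livy2.pyspark
sqlContext.sql("use vasig_switch")

# Building the DataFrame and inspecting its schema only touches the Hive
# metastore, so these calls succeed:
df = sqlContext.sql("select * from roadmaster")
df.printSchema()   # works

# Actions launch a Spark job that actually scans the table, and this is
# where the RetriesExhaustedException ("Can't get the locations") is thrown:
df.collect()       # fails
df.count()         # fails
df.first()         # fails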

I hope someone can help with this problem, because I have been trying to solve it for several days.

Thank you,

Roxana

PS: the same exception is thrown if I try executing the query from Jupyter with Sparkmagic.