Read HBase with pyspark from jupyter notebook

Expert Contributor

Hi,

I'm trying to run Python code that uses SHC (the Spark HBase Connector) to connect to HBase from a PySpark script.

Here is a simple example to illustrate:

# readExample.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

# SHC data source
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

# Catalog mapping the HBase table "firsttable" to a DataFrame schema:
# the row key is exposed as "firstcol", and column "colname" of
# column family "d" is exposed as "secondcol".
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"firsttable"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"d", "col":"colname", "type":"string"}
    }
}""".split())

df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()

df.select("secondcol").show()

To run this properly, I successfully executed the following command line:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml readExample.py

Great 🙂

Now, I would like to run this exact same example from my jupyter notebook...

After a while, I finally figured out how to pass the required "package" to Spark by adding the following cell at the beginning of my notebook:

import os
import findspark

os.environ["SPARK_HOME"] = '/usr/hdp/current/spark-client'
findspark.init('/usr/hdp/current/spark-client')

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--repositories http://repo.hortonworks.com/content/groups/public/ "
    "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
    "pyspark-shell"
)

...But when I ran all the cells of my notebook, I got the following exception:

Py4JJavaError: An error occurred while calling o50.showString.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
	at org.apache.hadoop.hbase.client.MetaScanner.listTableRegionLocations(MetaScanner.java:343)
	at org.apache.hadoop.hbase.client.HRegionLocator.listRegionLocations(HRegionLocator.java:142)
	at org.apache.hadoop.hbase.client.HRegionLocator.getStartEndKeys(HRegionLocator.java:118)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:109)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:108)
	at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:77)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource.releaseOnException(HBaseResources.scala:88)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource.<init>(HBaseResources.scala:108)
	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:61)


From what I understand, this exception probably comes up because the HBase client cannot find the right hbase-site.xml (which defines the ZooKeeper quorum, among other things).

I tried adding "--files /etc/hbase/conf/hbase-site.xml" to the PYSPARK_SUBMIT_ARGS environment variable, but that did not change anything...
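
For reference, the cell then looked roughly like this (same as above, with the --files option added):

import os
import findspark

os.environ["SPARK_HOME"] = '/usr/hdp/current/spark-client'
findspark.init('/usr/hdp/current/spark-client')

# Same submit args as before, with --files pointing at hbase-site.xml
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--repositories http://repo.hortonworks.com/content/groups/public/ "
    "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
    "--files /etc/hbase/conf/hbase-site.xml "
    "pyspark-shell"
)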

Any idea how to pass hbase-site.xml properly?

Thanks for your help

5 REPLIES

New Contributor

I just got this working (inside some Docker containers) after seeing similar issues due to an inability to reach the ZooKeeper quorum properly. I tried adding the hbase-site.xml file via PYSPARK_SUBMIT_ARGS and also via a SparkConf object - no joy. What did work was to manually copy hbase-site.xml into $SPARK_HOME/conf.
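
Concretely, something along these lines (just a sketch; the Spark home path here is the one from your question, adjust it to your own installation):

# Sketch of the workaround: copy hbase-site.xml into Spark's conf directory.
# The Spark home path is assumed from the question above.
import shutil

shutil.copy('/etc/hbase/conf/hbase-site.xml',
            '/usr/hdp/current/spark-client/conf/')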

Expert Contributor

Thanks for your help!

Just an additional question: did you have to manually copy hbase-site.xml into the $SPARK_HOME/conf folder on ALL nodes of the cluster?

Expert Contributor

And also, should I keep using the "--files" option with hbase-site.xml on the command line, or not?

New Contributor

My response went inline above - not sure why. An example of the setup I am using is included below.

New Contributor

I did copy the file to all of the nodes, but only because all of the nodes (containers) are based on an identical image. I have not tested pushing only to a single node.

My launch code now looks like:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
# Put the HBase client jars on the driver and executor classpaths
conf.set('spark.driver.extraClassPath', '/usr/local/hbase-1.2.6/lib/*')
conf.set('spark.executor.extraClassPath', '/usr/local/hbase-1.2.6/lib/*')

sc = SparkContext(master='spark://master:7077', conf=conf)
sqlcontext = SQLContext(sc)
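
With hbase-site.xml sitting in $SPARK_HOME/conf and this context in place, the read itself is the same as in your original example (assuming the same catalog string):

# Assumes the same `catalog` JSON string as in the original post
df = sqlcontext.read \
    .options(catalog=catalog) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.select("secondcol").show()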

I am figuring this out as I work through my use case, so hopefully this works on your side as well.