Read HBase with pyspark from jupyter notebook

Expert Contributor

Hi,

I'm trying to run Python code that uses SHC (the Spark HBase Connector) to connect to HBase from a PySpark script.

Here is a simple example to illustrate:

# readExample.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

# SHC data source
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

# Catalog mapping the HBase table "firsttable" to a DataFrame schema:
# the row key is exposed as "firstcol", and column "colname" of
# column family "d" is exposed as "secondcol".
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"firsttable"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"d", "col":"colname", "type":"string"}
    }
}""".split())

df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()

df.select("secondcol").show()

To run this properly, I successfully executed the following command line:

spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml readExample.py

Great 🙂

Now, I would like to run this exact same example from my jupyter notebook...

After a while, I finally figured out how to pass the required "package" to Spark by adding the following cell at the beginning of my notebook:

import os
import findspark

os.environ["SPARK_HOME"] = '/usr/hdp/current/spark-client'
findspark.init('/usr/hdp/current/spark-client')

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--repositories http://repo.hortonworks.com/content/groups/public/ "
    "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
    "pyspark-shell"
)

...But when I ran all the cells of my notebook, I got the following exception:

Py4JJavaError: An error occurred while calling o50.showString.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
	at org.apache.hadoop.hbase.client.MetaScanner.listTableRegionLocations(MetaScanner.java:343)
	at org.apache.hadoop.hbase.client.HRegionLocator.listRegionLocations(HRegionLocator.java:142)
	at org.apache.hadoop.hbase.client.HRegionLocator.getStartEndKeys(HRegionLocator.java:118)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:109)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:108)
	at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:77)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource.releaseOnException(HBaseResources.scala:88)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource.<init>(HBaseResources.scala:108)
	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:61)


From what I understand, this exception probably comes up because the HBase client cannot find the right hbase-site.xml (which defines the ZooKeeper quorum, among other things).

I tried adding "--files /etc/hbase/conf/hbase-site.xml" to the PYSPARK_SUBMIT_ARGS environment variable, but that did not change anything...
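
For reference, the cell then looked roughly like this (same as above, with the --files option added):

import os
import findspark

os.environ["SPARK_HOME"] = '/usr/hdp/current/spark-client'
findspark.init('/usr/hdp/current/spark-client')

# Same submit args as before, with --files pointing at hbase-site.xml
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--repositories http://repo.hortonworks.com/content/groups/public/ "
    "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
    "--files /etc/hbase/conf/hbase-site.xml "
    "pyspark-shell"
)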

Any idea how to pass hbase-site.xml properly?

Thanks for your help

5 REPLIES

New Contributor

I just got this working (inside some Docker containers) after seeing similar issues due to an inability to reach the ZooKeeper quorum properly. I tried adding the hbase-site.xml file via PYSPARK_SUBMIT_ARGS and also via a SparkConf object - no joy. What did work was to manually copy hbase-site.xml into $SPARK_HOME/conf.
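
Concretely, something along these lines (just a sketch; the Spark home path here is the one from your question, adjust it to your own installation):

# Sketch of the workaround: copy hbase-site.xml into Spark's conf directory.
# The Spark home path is assumed from the question above.
import shutil

shutil.copy('/etc/hbase/conf/hbase-site.xml',
            '/usr/hdp/current/spark-client/conf/')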

Expert Contributor

Thanks for your help!

Just an additional question: did you have to manually copy hbase-site.xml into the $SPARK_HOME/conf folder on ALL nodes of the cluster?

Expert Contributor

And also, should I keep using the "--files" option with hbase-site.xml on the command line, or not?

New Contributor

My response went inline above - not sure why. An example of the setup I am using is included below.

New Contributor

I did copy the file to all of the nodes, but only because all of the nodes (containers) are based on an identical image. I have not tested pushing only to a single node.

My launch code now looks like:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
# Put the HBase client jars on the driver and executor classpaths
conf.set('spark.driver.extraClassPath', '/usr/local/hbase-1.2.6/lib/*')
conf.set('spark.executor.extraClassPath', '/usr/local/hbase-1.2.6/lib/*')

sc = SparkContext(master='spark://master:7077', conf=conf)
sqlcontext = SQLContext(sc)
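
With hbase-site.xml sitting in $SPARK_HOME/conf and this context in place, the read itself is the same as in your original example (assuming the same catalog string):

# Assumes the same `catalog` JSON string as in the original post
df = sqlcontext.read \
    .options(catalog=catalog) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.select("secondcol").show()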

I am figuring this out as I work through my use case, so hopefully this works on your side as well.