Created 10-19-2017 03:23 PM
Hi,
I'm trying to use SHC (the Spark-HBase connector) to connect to HBase from a Python Spark script.
Here is a simple example to illustrate:
# readExample.py
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

catalog = ''.join("""{
    "table":{"namespace":"default", "name":"firsttable"},
    "rowkey":"key",
    "columns":{
        "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
        "secondcol":{"cf":"d", "col":"colname", "type":"string"}
    }
}""".split())

df = sqlc.read \
    .options(catalog=catalog) \
    .format(data_source_format) \
    .load()
df.select("secondcol").show()
To execute this properly, I successfully ran the following command line:
spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml readExample.py
Great 🙂
Now, I would like to run this exact same example from my Jupyter notebook...
After a while, I finally figured out how to pass the required "package" to Spark by adding the following cell at the beginning of my notebook:
import os
import findspark

os.environ["SPARK_HOME"] = '/usr/hdp/current/spark-client'
findspark.init('/usr/hdp/current/spark-client')

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--repositories http://repo.hortonworks.com/content/groups/public/ "
    "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
    "pyspark-shell")
...But when I ran all the cells in my notebook, I got the following exception:
Py4JJavaError: An error occurred while calling o50.showString.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
	at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:312)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
	at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
	at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
	at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
	at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
	at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
	at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
	at org.apache.hadoop.hbase.client.MetaScanner.listTableRegionLocations(MetaScanner.java:343)
	at org.apache.hadoop.hbase.client.HRegionLocator.listRegionLocations(HRegionLocator.java:142)
	at org.apache.hadoop.hbase.client.HRegionLocator.getStartEndKeys(HRegionLocator.java:118)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:109)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource$$anonfun$1.apply(HBaseResources.scala:108)
	at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:77)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource.releaseOnException(HBaseResources.scala:88)
	at org.apache.spark.sql.execution.datasources.hbase.RegionResource.<init>(HBaseResources.scala:108)
	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD.getPartitions(HBaseTableScan.scala:61)
From what I understand, this exception probably came up because the HBase client could not find the right hbase-site.xml (which defines the ZooKeeper quorum, among other settings).
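One way to check this hypothesis - a minimal sketch, assuming the HBase client jars are already on the driver classpath via the --packages dependency - is to ask the JVM side, through the Py4J gateway, which quorum the client configuration actually resolves:

# Diagnostic sketch: HBaseConfiguration.create() loads hbase-default.xml and
# hbase-site.xml from the driver's classpath; if hbase-site.xml is missing,
# hbase.zookeeper.quorum falls back to "localhost".
hbase_conf = sc._jvm.org.apache.hadoop.hbase.HBaseConfiguration.create()
print(hbase_conf.get("hbase.zookeeper.quorum"))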
I tried adding "--files /etc/hbase/conf/hbase-site.xml" to the PYSPARK_SUBMIT_ARGS environment variable, but that did not change anything...
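For reference, the attempted variant looked roughly like this (same repositories and packages as above, with --files added):

# Attempted variant with --files (did not resolve the issue)
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--repositories http://repo.hortonworks.com/content/groups/public/ "
    "--packages com.hortonworks:shc-core:1.1.1-1.6-s_2.10 "
    "--files /etc/hbase/conf/hbase-site.xml "
    "pyspark-shell")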
Any idea how to pass the hbase-site.xml properly?
Thanks for your help
Created 11-01-2017 09:05 PM
I just got this working after seeing similar issues due to an inability to reach the ZooKeeper quorum properly. I tried adding the hbase-site.xml file using PYSPARK_SUBMIT_ARGS and also via a SparkConf object - no joy. What did work was to manually copy the hbase-site.xml file into $SPARK_HOME/conf.
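For completeness, the copy itself can even be done from the notebook - a minimal sketch, assuming the same paths mentioned in this thread (adjust for your install):

import os
import shutil

# Copy the cluster's hbase-site.xml into Spark's conf directory so the
# HBase client picks it up from the classpath (paths from this thread).
shutil.copy('/etc/hbase/conf/hbase-site.xml',
            os.path.join(os.environ['SPARK_HOME'], 'conf'))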
Created 11-06-2017 08:11 AM
Thanks for your help!
Just one additional question: did you have to manually copy hbase-site.xml into the $SPARK_HOME/conf folder on ALL nodes of the cluster?
Created 11-06-2017 03:19 PM
And also, should I keep using the "--files" option with hbase-site.xml on the command line, or not?
Created 11-06-2017 04:36 PM
My response went inline above - not sure why. Included there is an example of the launch that I am using.
Created 11-06-2017 04:35 PM
I did copy the file to all of the nodes, but only because all of the nodes (containers) are based on an identical image. I have not tested pushing only to a single node.
My launch code now looks like:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.set('spark.driver.extraClassPath', '/usr/local/hbase-1.2.6/lib/*')
conf.set('spark.executor.extraClassPath', '/usr/local/hbase-1.2.6/lib/*')

sc = SparkContext(master='spark://master:7077', conf=conf)
sqlcontext = SQLContext(sc)
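With the HBase libs on the classpath and hbase-site.xml in $SPARK_HOME/conf (and assuming the SHC package is still pulled in via PYSPARK_SUBMIT_ARGS as earlier in this thread), the original read should then work unchanged from the notebook - a sketch reusing the catalog defined above:

# Same read as in the original readExample.py, now driven from the notebook;
# 'catalog' is the same JSON mapping defined earlier in this thread.
df = sqlcontext.read \
    .options(catalog=catalog) \
    .format('org.apache.spark.sql.execution.datasources.hbase') \
    .load()
df.select("secondcol").show()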
I am figuring this out as I work through my use case, so hopefully this works on your side as well.