
Using spark-hbase-connector Package with Pyspark

Expert Contributor

Hi all, I wanted to experiment with the "it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3" package (you can find it at spark-packages.org). It's an interesting add-on that gives Spark RDD-based read/write access to HBase tables.

 

If I run this extension library in a standard spark-shell (with Scala support), everything works smoothly:

spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>

scala> import it.nerdammer.spark.hbase._
import it.nerdammer.spark.hbase._


If I try to run it in a PySpark shell, however (my goal is to use the extension from Python), I can't import the functions, and so I can't use anything:

PYSPARK_DRIVER_PYTHON=ipython pyspark --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 \
--conf spark.hbase.host=<HBASE_HOST>

In [1]: from it.nerdammer.spark.hbase import *
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-37dd5a5ffba0> in <module>()
----> 1 from it.nerdammer.spark.hbase import *

ImportError: No module named it.nerdammer.spark.hbase


I have tried different combinations of environment variables, parameters, etc., when launching PySpark, but to no avail.

 

Maybe I'm just doing something deeply wrong here, or maybe there is simply no Python API for this library. As a matter of fact, the examples on the package's home page are all in Scala (although they say you can load the package in PySpark too, with the classic "--packages" parameter).
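For what it's worth, my current understanding (an assumption on my part, not something the package documents): "--packages" only puts the jar on the JVM classpath, while Python's import statement searches sys.path for Python modules, which this connector doesn't ship. That would explain the failure even with the package correctly loaded:

```python
import importlib


def importable_from_python(name):
    """Return True if `name` resolves to a Python module on sys.path."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# The connector is a pure Scala/JVM artifact: --packages adds its jar to the
# JVM classpath, but installs no Python package, so this import must fail
# regardless of how the shell is launched.
print(importable_from_python("it.nerdammer.spark.hbase"))  # False
```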

Can anybody help out with the "ImportError: No module named it.nerdammer.spark.hbase" error message?

 

Thanks for any insight.

1 ACCEPTED SOLUTION

Mentor
Here's one example that uses the native hbase-spark module via DataFrames in PySpark: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/Include-latest-hbase-spark-in-CDH/m-p/43...
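A minimal sketch of that DataFrame-based approach (the table name "books", the column family "info", and the column mapping below are illustrative assumptions; only the data source format string comes from the hbase-spark module):

```python
def hbase_read_options(table, mapping):
    # The hbase-spark DataFrame source is configured entirely through reader
    # options, so no Python-side bindings are needed -- only the hbase-spark
    # jar on the classpath of the pyspark shell.
    return {"hbase.table": table, "hbase.columns.mapping": mapping}


# Mapping entries are "<sparkColumn> <TYPE> <family:qualifier>";
# the special qualifier ":key" maps the HBase row key.
opts = hbase_read_options(
    "books",
    "id STRING :key, title STRING info:title, year INT info:year",
)

# Inside a pyspark shell launched with the hbase-spark jar:
# df = sqlContext.read.format("org.apache.hadoop.hbase.spark").options(**opts).load()
# df.filter(df.year > 2000).show()
```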


2 REPLIES

Expert Contributor

Thanks. It seems a good alternative; as a matter of fact, I was not aware of its availability in CDH 5.7.

 

Marking the thread as solved, even though I don't yet know whether all the features I'd need are available in the native hbase-spark connector.