PySpark - Connecting to HBASE using PySpark - Package import failing

Facing an issue when connecting to HBase using PySpark; it fails with the error given below:

py4j.protocol.Py4JJavaError: An error occurred while calling o42.load.

: java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.execution.datasources.hbase. Please find packages at http://spark.apache.org/third-party-projects.html

Versions :

HDP Version : 2.6.4.0-91

Spark Ver: 2.2.0.2.6.4.0-91

Python: 2.7.5

Jar used: /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar

PySpark Shell

pyspark --jars /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar
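
Note: SHC also needs the HBase client configuration visible to the driver and executors, so a common launch variant ships hbase-site.xml alongside the jar. This is a sketch only; the config path below is an assumption for a typical HDP 2.6.4 install:

```shell
# Sketch: pass the SHC jar plus the HBase client config so the connector
# can locate the cluster; /etc/hbase/conf/hbase-site.xml is an assumed path.
pyspark --jars /usr/hdp/2.6.4.0-91/shc/shc-core-1.1.0.2.6.4.0-91.jar \
        --files /etc/hbase/conf/hbase-site.xml
```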

This opens the PySpark shell with the prompt, but the connection attempt to HBase then fails with the error mentioned above.

Sample Code Executed:

catalog = ''.join("""{"table": {"namespace": "default", "name": "books"}, "rowkey": "key", "columns": {"title": {"cf": "rowkey", "col": "key", "type": "string"}, "author": {"cf": "info", "col": "author", "type": "string"}}}""".split())

df = sqlContext.read.options(catalog=catalog).format('org.apache.spark.sql.execution.datasources.hbase').load()

Error: Failing with the error given below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named org.apache.spark.sql.execution.datasources.hbase
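
The string passed to format() must name a data source class that is actually on the classpath, and the SHC catalog must be valid JSON (double-quoted keys and values). One way to rule out quoting mistakes is to build the catalog with json.dumps instead of a hand-written string. This is a minimal sketch; the table and column names match the snippet above, and the read call itself still requires a shell launched with the SHC jar:

```python
import json

# Build the SHC catalog as guaranteed-valid JSON rather than a hand-edited
# string; the "books" table layout matches the sample code above.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "books"},
    "rowkey": "key",
    "columns": {
        "title": {"cf": "rowkey", "col": "key", "type": "string"},
        "author": {"cf": "info", "col": "author", "type": "string"},
    },
})

# Inside a pyspark shell started with the SHC jar on the classpath:
# df = sqlContext.read.options(catalog=catalog) \
#         .format("org.apache.spark.sql.execution.datasources.hbase").load()
```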