Hi Friends,
I am trying to read and write some data to HBase from a Spark DataFrame using pyspark, but I am running into an issue. I suspect a version mismatch (I have tried different combinations of HBase connectors, but none resolved it). If this is a version incompatibility, could anyone please share the spark-hbase connector version that is compatible with the HDP, HBase, and Spark versions listed below?
These are the versions I am currently using:
Name                Version
Hortonworks (HDP)   3.0.1.0-187
HBase               2.0.0
Spark2              2.3.1
This is the sample pyspark code I am trying (example.py):
--------------------------------------------------------------------------------------------------------------------
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

# SHC (the Hortonworks spark-hbase connector) data source.
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'

# Two-row sample DataFrame; col0 will become the HBase row key.
df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# ''.join(...split()) strips the whitespace so the catalog ends up as a
# single-line JSON string. It maps col0 to the row key and col1 to a
# column in column family 'cf'.
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"tblEmployee", "tableCoder":"PrimitiveType"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())

# Writing (newtable=5 asks the connector to create the table with 5 regions
# if it does not already exist)
df.write.options(catalog=catalog, newtable=5).format(data_source_format).save()

# Reading
df = sqlc.read.options(catalog=catalog, newtable=5).format(data_source_format).load()
-------------------------------------------------------------------------------------------------------------------
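In case the entry point matters: SQLContext is deprecated in Spark 2.x, so here is the SparkSession-based variant of the same write/read that I understand to be equivalent (it reuses the catalog string from above; spark is a session I build here, not something from the original script):
--------------------------------------------------------------------------------------------------------------------
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('shc-example').getOrCreate()

# Same two-row sample, built through the session instead of SQLContext.
df = spark.createDataFrame([('a', '1.0'), ('b', '2.0')], schema=['col0', 'col1'])

# Write and read back through SHC, exactly as above.
df.write.options(catalog=catalog, newtable=5) \
    .format('org.apache.spark.sql.execution.datasources.hbase').save()
df2 = spark.read.options(catalog=catalog) \
    .format('org.apache.spark.sql.execution.datasources.hbase').load()
df2.show()
--------------------------------------------------------------------------------------------------------------------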
I am using the spark-submit command below to run the program:
sudo spark-submit --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/3.0.1.0-187/0/hbase-site.xml example.py
--------------------------------------------------------------------------------------------------------------------------
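If I read the connector artifact name correctly, 1.0.0-1.6-s_2.10 means SHC 1.0.0 built against Spark 1.6 and Scala 2.10, while my cluster runs Spark 2.3.1, so this looks like exactly the kind of mismatch I suspect. Following the same naming scheme, something like com.hortonworks:shc-core:1.1.1-2.1-s_2.11 (a Spark 2.x / Scala 2.11 build) looks closer to my stack, but I have not been able to confirm which build matches HDP 3.0.1 / HBase 2.0.0, which is why I am asking. A quick way to double-check what the cluster itself reports (sc is the SparkContext from example.py; the second line reaches into the JVM through the private _jvm handle via py4j, so treat it as a best-effort probe):
--------------------------------------------------------------------------------------------------------------------
# Print the Spark and Scala versions the running cluster was built with.
print('Spark version:', sc.version)
print('Scala version:', sc._jvm.scala.util.Properties.versionString())
--------------------------------------------------------------------------------------------------------------------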
The error message I am getting is as follows:
Traceback (most recent call last):
  File "/home/ec2-user/src/example.py", line 23, in <module>
    df.write.options(catalog=catalog, newtable=5).format(data_source_format).save()
  File "/usr/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 732, in save
  File "/usr/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.
: java.lang.NoClassDefFoundError: org/apache/spark/Logging
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357).................
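For what it's worth, org.apache.spark.Logging no longer exists in Spark 2.x (it moved to org.apache.spark.internal.Logging in Spark 2.0), so the NoClassDefFoundError above is consistent with a connector compiled against Spark 1.6 being loaded on Spark 2.3.1. A small probe along the same lines (again reaching into the JVM through the private _jvm handle, so only a sketch) shows whether the old class is visible on the driver:
--------------------------------------------------------------------------------------------------------------------
# Probe for the Spark 1.x Logging class; on Spark 2.x this should fail,
# matching the NoClassDefFoundError in the stack trace above.
try:
    sc._jvm.java.lang.Class.forName('org.apache.spark.Logging')
    print('org.apache.spark.Logging is on the classpath (Spark 1.x API present)')
except Exception as err:
    print('org.apache.spark.Logging not found:', err)
--------------------------------------------------------------------------------------------------------------------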
Thanks!