Community Articles

dkozlowski · ‎04-21-2017

Here are the steps to get pyspark working on SHC

a) add the following into Ambari -> Spark -> Configs -> Advanced spark-env -> spark-env template

export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/spark/conf/

b) kinit as e.g. hbase user

c) run

$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/

d) call each line separately:

from pyspark.sql import Row
data = range(0,255)
rdd = sc.parallelize(data).map(lambda i : Row(name=i,age=i))
import json
cat = json.dumps({"table":{"namespace":"default", "name":"dk", "tableCoder":"PrimitiveType"},"rowkey":"key","columns":{"name":{"cf":"rowkey", "col":"key", "type":"string"},"age":{"cf":"cf1", "col":"age", "type":"string"}}})
print(cat)
rdd.toDF().write.option("catalog",cat).option("newtable","5").format("org.apache.spark.sql.execution.datasources.hbase").save()

e) test from the HBase shell

[root@dan2 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.3.0-37, rcb8c969d1089f1a34e9df11b6eeb96e69bcf878d, Tue Nov 29 18:48:22 UTC 2016

hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.4220 seconds

=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW                                 COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x00   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00\x01   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x01
 \x00\x00\x00\x00\x00\x00\x00\x02   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x02
 \x00\x00\x00\x00\x00\x00\x00\x03   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x03
 \x00\x00\x00\x00\x00\x00\x00\x04   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x04
 \x00\x00\x00\x00\x00\x00\x00\x05   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x05
 ...
  \x00\x00\x00\x00\x00\x00\x00\xFC   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFC
 \x00\x00\x00\x00\x00\x00\x00\xFD   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFD
 \x00\x00\x00\x00\x00\x00\x00\xFE   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.5950 seconds

hbase(main):003:0>

zaz · ‎05-09-2017

hello,could you please help me to fix this error,

I tried your code but it's dosn't work for me!

i get this the error :

File "<stdin>", line 1, in <module> File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/readwriter.py", line 395, in save self._jwrite.save() File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.

dkozlowski · ‎05-10-2017

@azelmad zakaria

As this is an article, raise a separate question in HCC, refer to this one and provide the full stack trace from your console

Cloudera Community

Community Articles

Looking for a sample python code for Spark-On-HBase - HDP 2.5 with Kerberos enabled

Apache Spark

Re: Looking for a sample python code for Spark-On-HBase - HDP 2.5 with Kerberos enabled

Re: Looking for a sample python code for Spark-On-HBase - HDP 2.5 with Kerberos enabled