Here are the steps to get PySpark working with SHC (the Spark-HBase Connector):

a) Add the following line in Ambari -> Spark -> Configs -> Advanced spark-env -> spark-env template:

export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/spark/conf/
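The classpath above is just a colon-joined list of HBase client jars plus the Spark conf directory. A small sketch that assembles and prints it, which makes the individual entries easier to audit (the jar versions and the HDP build path `2.5.3.0-37` are specific to the cluster in this article and will differ on other installs):

```python
# Assemble the SPARK_CLASSPATH value from its parts.
# NOTE: jar versions and the /usr/hdp/2.5.3.0-37 path are HDP-build-specific.
hbase_lib = "/usr/hdp/current/hbase-client/lib"
jars = [
    "hbase-common.jar",
    "hbase-client.jar",
    "hbase-server.jar",
    "hbase-protocol.jar",
    "guava-12.0.1.jar",
    "htrace-core-3.1.0-incubating.jar",
]
entries = ["%s/%s" % (hbase_lib, j) for j in jars] + ["/usr/hdp/2.5.3.0-37/spark/conf/"]
spark_classpath = ":".join(entries)
print("export SPARK_CLASSPATH=%s" % spark_classpath)
```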

b) kinit as a user with HBase write access, e.g. the hbase user

c) Run PySpark with the SHC package:

$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/

d) In the PySpark shell, run each line separately:

from pyspark.sql import Row
import json

# 255 rows; "name" becomes the HBase rowkey, "age" goes to column family cf1
data = range(0, 255)
rdd = sc.parallelize(data).map(lambda i: Row(name=i, age=i))

# SHC catalog: maps DataFrame columns to the HBase table "dk"
cat = json.dumps({"table":{"namespace":"default", "name":"dk", "tableCoder":"PrimitiveType"},"rowkey":"key","columns":{"name":{"cf":"rowkey", "col":"key", "type":"string"},"age":{"cf":"cf1", "col":"age", "type":"string"}}})
print(cat)

# "newtable":"5" tells SHC to create the table with 5 regions if it does not exist
rdd.toDF().write.option("catalog",cat).option("newtable","5").format("org.apache.spark.sql.execution.datasources.hbase").save()
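The catalog one-liner is easier to read and modify when built as a nested dict first and then serialized. A plain-Python sketch with the same table and column definitions as above:

```python
import json

# Same SHC catalog as in the steps above, built as a dict for readability:
# "name" maps to the rowkey; "age" lives in column family cf1.
catalog = {
    "table": {"namespace": "default", "name": "dk", "tableCoder": "PrimitiveType"},
    "rowkey": "key",
    "columns": {
        "name": {"cf": "rowkey", "col": "key", "type": "string"},
        "age":  {"cf": "cf1",    "col": "age", "type": "string"},
    },
}
cat = json.dumps(catalog)

# Round-trip check: the JSON string parses back to the same structure
assert json.loads(catalog := json.loads(cat)) if False else json.loads(cat) == catalog
print(cat)
```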

e) Verify from the HBase shell:

[root@dan2 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.3.0-37, rcb8c969d1089f1a34e9df11b6eeb96e69bcf878d, Tue Nov 29 18:48:22 UTC 2016

hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.4220 seconds

=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW                                 COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x00   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00\x01   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x01
 \x00\x00\x00\x00\x00\x00\x00\x02   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x02
 \x00\x00\x00\x00\x00\x00\x00\x03   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x03
 \x00\x00\x00\x00\x00\x00\x00\x04   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x04
 \x00\x00\x00\x00\x00\x00\x00\x05   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x05
 ...
 \x00\x00\x00\x00\x00\x00\x00\xFC   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFC
 \x00\x00\x00\x00\x00\x00\x00\xFD   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFD
 \x00\x00\x00\x00\x00\x00\x00\xFE   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.5950 seconds

hbase(main):003:0>
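The `\x00...\x05`-style rowkeys and values in the scan output appear to be 8-byte big-endian longs: `toDF()` infers LongType for the Python ints, and each long is stored as 8 raw bytes. A sketch (my interpretation of the output above, not something the article states) decoding one such value with the standard `struct` module:

```python
import struct

# An HBase cell value as printed by `scan` in escaped-hex notation
raw = b"\x00\x00\x00\x00\x00\x00\x00\x05"

# Decode as a big-endian signed 64-bit integer (">q")
(value,) = struct.unpack(">q", raw)
print(value)  # -> 5
```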
Comments

Hello, could you please help me fix this error? I tried your code but it doesn't work for me. I get this error:

File "<stdin>", line 1, in <module>
File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/readwriter.py", line 395, in save
    self._jwrite.save()
File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.

@azelmad zakaria

As this is an article, please raise a separate question in HCC, refer back to this article, and include the full stack trace from your console.