Here are the steps to get pyspark working with SHC (the Spark-HBase Connector).

a) Add the following to Ambari -> Spark -> Configs -> Advanced spark-env -> spark-env template:

export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/spark/conf/
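
Note that the jar versions above (guava-12.0.1, htrace-core-3.1.0-incubating) and the 2.5.3.0-37 path segment are specific to this HDP release; confirm the exact filenames on your cluster before pasting, for example:

$ ls /usr/hdp/current/hbase-client/lib/ | egrep 'hbase-(common|client|server|protocol)|guava|htrace'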

b) kinit as a principal with write access to HBase, e.g. the hbase user
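
A typical keytab-based kinit looks like the line below; the keytab path and principal are placeholders, substitute the ones for your cluster:

$ kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase@EXAMPLE.COM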

c) run

$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/
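
If the shell cannot reach HBase (e.g. it fails to find the ZooKeeper quorum), it may also help to ship the HBase client configuration with the job; the variant below assumes the usual HDP location of hbase-site.xml, adjust as needed:

$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /etc/hbase/conf/hbase-site.xml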

d) in the pyspark shell, run each line separately:

# build a 255-row test DataFrame
from pyspark.sql import Row
data = range(0, 255)
rdd = sc.parallelize(data).map(lambda i: Row(name=i, age=i))
# SHC catalog mapping the DataFrame to HBase table 'dk':
# 'name' becomes the rowkey, 'age' lands in column family cf1
import json
cat = json.dumps({"table":{"namespace":"default", "name":"dk", "tableCoder":"PrimitiveType"},"rowkey":"key","columns":{"name":{"cf":"rowkey", "col":"key", "type":"string"},"age":{"cf":"cf1", "col":"age", "type":"string"}}})
print(cat)
# write through SHC; 'newtable' pre-splits the newly created table into 5 regions
rdd.toDF().write.option("catalog",cat).option("newtable","5").format("org.apache.spark.sql.execution.datasources.hbase").save()
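
You can also verify the write from the same pyspark session by reading the table back through SHC with the same catalog; a minimal sketch, using the sqlContext that the pyspark shell provides in Spark 1.6:

df = sqlContext.read.option("catalog", cat).format("org.apache.spark.sql.execution.datasources.hbase").load()
df.show(5)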

e) test from the HBase shell

[root@dan2 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.3.0-37, rcb8c969d1089f1a34e9df11b6eeb96e69bcf878d, Tue Nov 29 18:48:22 UTC 2016

hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.4220 seconds

=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW                                 COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x00   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00\x01   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x01
 \x00\x00\x00\x00\x00\x00\x00\x02   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x02
 \x00\x00\x00\x00\x00\x00\x00\x03   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x03
 \x00\x00\x00\x00\x00\x00\x00\x04   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x04
 \x00\x00\x00\x00\x00\x00\x00\x05   column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x05
 ...
 \x00\x00\x00\x00\x00\x00\x00\xFC   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFC
 \x00\x00\x00\x00\x00\x00\x00\xFD   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFD
 \x00\x00\x00\x00\x00\x00\x00\xFE   column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.5950 seconds

hbase(main):003:0>
Comments

Hello, could you please help me fix this error? I tried your code but it doesn't work for me. This is the error I get:

File "<stdin>", line 1, in <module> File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/readwriter.py", line 395, in save self._jwrite.save() File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.

@azelmad zakaria

As this is an article, please raise a separate question on HCC, link back to this article, and include the full stack trace from your console.