
Here are the steps to get PySpark working with SHC (the Spark HBase Connector).

a) add the following into Ambari -> Spark -> Configs -> Advanced spark-env -> spark-env template

export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.6.0.3-8/spark/conf/
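Before restarting Spark, it can help to confirm that every entry on that classpath actually exists on the node, since jar file names (for example guava-12.0.1.jar) can differ between HDP releases. The following is only a minimal sketch and is not part of SHC or Ambari; `missing_classpath_entries` is a hypothetical helper name.

```python
import os

def missing_classpath_entries(classpath, exists=os.path.exists):
    """Return the classpath entries that do not exist on this machine."""
    entries = [e for e in classpath.split(":") if e]
    return [e for e in entries if not exists(e)]

if __name__ == "__main__":
    # Reads the value exported in spark-env; empty if the variable is unset.
    for entry in missing_classpath_entries(os.environ.get("SPARK_CLASSPATH", "")):
        print("missing: " + entry)
```

Run it on the node where pyspark is started, with SPARK_CLASSPATH exported as above; no output means every entry was found.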

b) kinit as a user with access to HBase, e.g. the hbase user

c) run

$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/

d) run each line separately in the pyspark shell:

from pyspark.sql import Row
import json
# 255 rows; the same integer is used as the row key (name) and the value (age)
data = range(0, 255)
rdd = sc.parallelize(data).map(lambda i: Row(name=i, age=i))
# SHC catalog: "name" maps to the HBase row key, "age" to column cf1:age of table default:dk
cat = json.dumps({"table":{"namespace":"default", "name":"dk", "tableCoder":"PrimitiveType"},"rowkey":"key","columns":{"name":{"cf":"rowkey", "col":"key", "type":"string"},"age":{"cf":"cf1", "col":"age", "type":"string"}}})
print(cat)
# "newtable" = "5" asks SHC to create the table with 5 regions
rdd.toDF().write.option("catalog",cat).option("newtable","5").format("org.apache.spark.sql.execution.datasources.hbase").save()
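To check the write from the same pyspark session, the table can be read back through the same data source and catalog. This is a minimal sketch, assuming the Spark 1.6 shell where `sqlContext` is predefined and `cat` is the catalog string built above; `read_hbase_table` is a hypothetical helper name.

```python
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"

def read_hbase_table(sql_context, catalog_json):
    """Load the HBase table described by the catalog as a DataFrame via SHC."""
    return (sql_context.read
            .options(catalog=catalog_json)
            .format(SHC_FORMAT)
            .load())

# In the pyspark shell started in step c):
#   df = read_hbase_table(sqlContext, cat)
#   df.printSchema()
#   df.show(5)
```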

NOTE: when running the last command above, the following error comes up:

17/04/18 15:39:57 INFO ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x15b42da54290012, negotiated timeout = 60000 
17/04/18 15:39:57 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/usr/hdp/current/spark-client/python/pyspark/sql/readwriter.py", line 395, in save 
self._jwrite.save() 
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ 
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco 
return f(*a, **kw) 
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. 
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations 
...

To get around the problem, point HBase at the znode /hbase (the HBase default), which is where the client is looking:

go to Ambari -> HBase -> Configs -> Advanced tab -> Advanced hbase-site
change the value of zookeeper.znode.parent
FROM /hbase-unsecure
TO /hbase
save the changes
restart all required services
exit pyspark and re-run steps c) and d)
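The error above is consistent with the client and the HBase master disagreeing on the ZooKeeper parent znode. One way to see which value a given hbase-site.xml carries is to parse it. This is a minimal sketch: the file path in the usage comment is an assumption about a typical HDP layout, and `znode_parent` is a hypothetical helper name. When the property is absent, HBase falls back to /hbase.

```python
import xml.etree.ElementTree as ET

DEFAULT_ZNODE_PARENT = "/hbase"  # HBase default when the property is not set

def znode_parent(hbase_site_xml):
    """Return zookeeper.znode.parent from the contents of an hbase-site.xml."""
    root = ET.fromstring(hbase_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "zookeeper.znode.parent":
            return (prop.findtext("value") or "").strip()
    return DEFAULT_ZNODE_PARENT

# Usage (the path is an assumption; adjust it to where your client config lives):
#   with open("/etc/hbase/conf/hbase-site.xml", "rb") as f:
#       print(znode_parent(f.read()))
```

Comparing the value seen by the Spark client with the one configured for the HBase master shows whether the two sides agree.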

e) test from the HBase shell

[root@dan261 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.6.0.3-8, r3307790b5a22cf93100cad0951760718dee5dec7, Sat Apr  1 21:41:47 UTC 2017

hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.3880 seconds

=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW                                 COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x00   column=cf1:age, timestamp=1492595613501, value=\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00\x01   column=cf1:age, timestamp=1492595613501, value=\x00\x00\x00\x00\x00\x00\x00\x01
 \x00\x00\x00\x00\x00\x00\x00\x02   column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x02
 \x00\x00\x00\x00\x00\x00\x00\x03   column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x03
 \x00\x00\x00\x00\x00\x00\x00\x04   column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x04
...
 \x00\x00\x00\x00\x00\x00\x00\xFA   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFA
 \x00\x00\x00\x00\x00\x00\x00\xFB   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFB
 \x00\x00\x00\x00\x00\x00\x00\xFC   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFC
 \x00\x00\x00\x00\x00\x00\x00\xFD   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFD
 \x00\x00\x00\x00\x00\x00\x00\xFE   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.8570 seconds

hbase(main):003:0>
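
The row keys and values in the scan are printed as escaped bytes. In this output they look like 8-byte big-endian integers, which would match the integers written in step d). A minimal sketch for decoding one such key in Python, assuming that encoding; `decode_long` is a hypothetical helper name.

```python
import struct

def decode_long(raw):
    """Decode an 8-byte big-endian signed integer, as printed in the scan."""
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    return struct.unpack(">q", raw)[0]

# Last row key shown in the scan output
print(decode_long(b"\x00\x00\x00\x00\x00\x00\x00\xFE"))  # prints 254
```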
Comments

Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x1c84d64d0x0, quorum=localhost:2181, baseZNode=/hbase

->ERROR ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper. It should have been written by the master. Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.

I am getting the error above while trying the example code from the article. My hbase-site.xml contains zookeeper.znode.parent = '/hbase-unsecure'.

What might be the problem?

Thanks in advance.