Here are the steps to get PySpark working with SHC (the Spark-HBase Connector) on HDP.

a) Add the following line in Ambari -> Spark -> Configs -> Advanced spark-env -> spark-env template:

export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.6.0.3-8/spark/conf/
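
Before restarting Spark it can be worth checking that every jar on that classpath actually exists on the node. A minimal sketch in Python (the paths are the ones from the export above; the version-suffixed file names may differ on your HDP build):

import os

# Jars referenced in the SPARK_CLASSPATH export above -- adjust the
# version-specific names (guava, htrace) to whatever ships with your cluster
jars = [
    "/usr/hdp/current/hbase-client/lib/hbase-common.jar",
    "/usr/hdp/current/hbase-client/lib/hbase-client.jar",
    "/usr/hdp/current/hbase-client/lib/hbase-server.jar",
    "/usr/hdp/current/hbase-client/lib/hbase-protocol.jar",
    "/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar",
    "/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar",
]
for jar in jars:
    print(("OK      " if os.path.isfile(jar) else "MISSING ") + jar)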

b) On a secured cluster, kinit as a user with write access to HBase, e.g. the hbase user.

c) Start the PySpark shell with the SHC package:

$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/
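
Once the shell is up, you can optionally confirm that the connector was actually pulled in before typing the longer snippet in step d). A hedged check (the class name org.apache.spark.sql.execution.datasources.hbase.DefaultSource is an assumption based on the format string used below):

# Throws ClassNotFoundException if SHC is not on the driver classpath
sc._jvm.java.lang.Class.forName("org.apache.spark.sql.execution.datasources.hbase.DefaultSource")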

d) Run the following in the PySpark shell, line by line:

from pyspark.sql import Row
import json

# Build a 255-element RDD of Rows; 'name' will become the HBase row key
data = range(0, 255)
rdd = sc.parallelize(data).map(lambda i: Row(name=i, age=i))

# SHC catalog: table 'dk' in the default namespace, row key from 'name', 'age' in column family 'cf1'
cat = json.dumps({"table":{"namespace":"default", "name":"dk", "tableCoder":"PrimitiveType"},"rowkey":"key","columns":{"name":{"cf":"rowkey", "col":"key", "type":"string"},"age":{"cf":"cf1", "col":"age", "type":"string"}}})
print(cat)

# "newtable" = "5" asks SHC to create the table with 5 regions if it does not exist
rdd.toDF().write.option("catalog", cat).option("newtable", "5").format("org.apache.spark.sql.execution.datasources.hbase").save()
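
The same catalog string can be reused to read the table back as a DataFrame, which is a quick in-session check of the write. A minimal sketch (assumes the sqlContext object the Spark 1.6 PySpark shell creates for you):

# Read the table back through SHC using the catalog defined above
df = sqlContext.read.option("catalog", cat).format("org.apache.spark.sql.execution.datasources.hbase").load()
df.show(5)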

NOTE: in some environments, the final save() above fails with the following error:

17/04/18 15:39:57 INFO ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x15b42da54290012, negotiated timeout = 60000 
17/04/18 15:39:57 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/usr/hdp/current/spark-client/python/pyspark/sql/readwriter.py", line 395, in save 
self._jwrite.save() 
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ 
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco 
return f(*a, **kw) 
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o63.save. 
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations 
...

The connection log shows the client using baseZNode=/hbase (the HBase default), while an unsecured HDP cluster keeps its state under /hbase-unsecure, so the two must be brought in line. To get around the problem, do the following (a sketch for inspecting the znodes follows these steps):

go to Ambari -> HBase -> Configs -> Advanced tab -> Advanced hbase-site
change the value of zookeeper.znode.parent
FROM /hbase-unsecure
TO /hbase
save the changes
restart all required services
re-run steps c) and d)
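
To see which base znode actually exists before and after the change, one option is a quick look at ZooKeeper from Python. A sketch using the kazoo client (an assumption: kazoo is pip-installed, and ZooKeeper listens on localhost:2181 as in the log above):

from kazoo.client import KazooClient

# Connect to the ZooKeeper quorum from the connection log above
zk = KazooClient(hosts="localhost:2181")
zk.start()
for znode in ("/hbase", "/hbase-unsecure"):
    print(znode, "exists" if zk.exists(znode) else "is missing")
zk.stop()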

e) Test from the HBase shell:

[root@dan261 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.6.0.3-8, r3307790b5a22cf93100cad0951760718dee5dec7, Sat Apr  1 21:41:47 UTC 2017

hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.3880 seconds

=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW                                 COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x00   column=cf1:age, timestamp=1492595613501, value=\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00\x01   column=cf1:age, timestamp=1492595613501, value=\x00\x00\x00\x00\x00\x00\x00\x01
 \x00\x00\x00\x00\x00\x00\x00\x02   column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x02
 \x00\x00\x00\x00\x00\x00\x00\x03   column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x03
 \x00\x00\x00\x00\x00\x00\x00\x04   column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x04
...
 \x00\x00\x00\x00\x00\x00\x00\xFA   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFA
 \x00\x00\x00\x00\x00\x00\x00\xFB   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFB
 \x00\x00\x00\x00\x00\x00\x00\xFC   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFC
 \x00\x00\x00\x00\x00\x00\x00\xFD   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFD
 \x00\x00\x00\x00\x00\x00\x00\xFE   column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.8570 seconds

hbase(main):003:0>
Comments

I am getting the error below while trying the above example code. My hbase-site.xml contains zookeeper.znode.parent = '/hbase-unsecure':

Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x1c84d64d0x0, quorum=localhost:2181, baseZNode=/hbase

ERROR ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper. It should have been written by the master. Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.

What might be the problem?

Thanks in advance.