from pyspark.sql import Row
data = range(0, 255)
rdd = sc.parallelize(data).map(lambda i: Row(name=i, age=i))
import json
cat = json.dumps({
    "table": {"namespace": "default", "name": "dk", "tableCoder": "PrimitiveType"},
    "rowkey": "key",
    "columns": {
        "name": {"cf": "rowkey", "col": "key", "type": "string"},
        "age": {"cf": "cf1", "col": "age", "type": "string"}
    }
})
print(cat)
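# Optional sanity check (not part of the original post): preview a few rows
# of the DataFrame before writing; toDF() and show() are standard PySpark calls.
rdd.toDF().show(3)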
rdd.toDF().write \
    .option("catalog", cat) \
    .option("newtable", "5") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()
NOTE: running the last command above produces the following error:
17/04/18 15:39:57 INFO ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x15b42da54290012, negotiated timeout = 60000
17/04/18 15:39:57 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark-client/python/pyspark/sql/readwriter.py", line 395, in save
self._jwrite.save()
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.save.
: org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
...
To get around the problem (the client is looking for the HBase root znode at /hbase, while the cluster registers it under /hbase-unsecure), do the following:
go to Ambari -> HBase -> Configs -> Advanced tab -> Advanced hbase-site
change the value of zookeeper.znode.parent
FROM /hbase-unsecure
TO /hbase
save the changes
restart all required services
restart the pyspark shell and re-run points c) and d); you can also verify the write directly from PySpark, as sketched below
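Before testing from the HBase shell, the write can be verified from PySpark by reading the table back with the same catalog. A minimal sketch, assuming the same SHC connector package used for the write is on the classpath:

df = sqlContext.read \
    .option("catalog", cat) \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .load()
df.show(5)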
e) test from the HBase shell
[root@dan261 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.6.0.3-8, r3307790b5a22cf93100cad0951760718dee5dec7, Sat Apr 1 21:41:47 UTC 2017
hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.3880 seconds
=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW COLUMN+CELL
\x00\x00\x00\x00\x00\x00\x00\x00 column=cf1:age, timestamp=1492595613501, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x01 column=cf1:age, timestamp=1492595613501, value=\x00\x00\x00\x00\x00\x00\x00\x01
\x00\x00\x00\x00\x00\x00\x00\x02 column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x02
\x00\x00\x00\x00\x00\x00\x00\x03 column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x03
\x00\x00\x00\x00\x00\x00\x00\x04 column=cf1:age, timestamp=1492595613488, value=\x00\x00\x00\x00\x00\x00\x00\x04
...
\x00\x00\x00\x00\x00\x00\x00\xFA column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFA
\x00\x00\x00\x00\x00\x00\x00\xFB column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFB
\x00\x00\x00\x00\x00\x00\x00\xFC column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFC
\x00\x00\x00\x00\x00\x00\x00\xFD column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFD
\x00\x00\x00\x00\x00\x00\x00\xFE column=cf1:age, timestamp=1492577972182, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.8570 seconds
hbase(main):003:0>
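The row keys and values print as 8-byte binary strings because the Python ints are written as big-endian longs. Decoding one of the values from the scan output with Python's struct module (illustrative only) recovers the original number:

import struct
raw = b'\x00\x00\x00\x00\x00\x00\x00\xFE'   # value of the last row shown above
print(struct.unpack('>q', raw)[0])          # prints 254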
I am getting the following error while trying the above example code; my hbase-site.xml has zookeeper.znode.parent = '/hbase-unsecure':
ERROR ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper. It should have been written by the master. Check the value configured in 'zookeeper.znode.parent'. There could be a mismatch with the one configured in the master.
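If the cluster actually registers HBase under /hbase-unsecure (the HDP default on non-Kerberized clusters), the client and the master must agree on zookeeper.znode.parent. To confirm which parent znode really exists, a hedged sketch using the kazoo package (an assumption; zkCli.sh from the ZooKeeper client works just as well):

from kazoo.client import KazooClient   # assumes the kazoo package is installed
zk = KazooClient(hosts='localhost:2181')
zk.start()
print(zk.get_children('/'))            # look for 'hbase' vs. 'hbase-unsecure'
zk.stop()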