Created on 04-21-2017 01:03 PM
Here are the steps to get PySpark working with SHC (the Spark-HBase connector)
a) add the following to Ambari -> Spark -> Configs -> Advanced spark-env -> spark-env template:
export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar:/usr/hdp/2.5.3.0-37/spark/conf/
b) kinit as, for example, the hbase user
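A minimal sketch of this step, assuming the default HDP headless keytab path and a placeholder realm (both are assumptions; adjust the keytab, principal, and realm to your cluster):

# hypothetical example: obtain a Kerberos ticket as the hbase user
$ kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase@EXAMPLE.COM
$ klist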
c) run
$ pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/
d) call each line separately:
from pyspark.sql import Row
data = range(0,255)
rdd = sc.parallelize(data).map(lambda i : Row(name=i,age=i))
import json
cat = json.dumps({"table":{"namespace":"default", "name":"dk", "tableCoder":"PrimitiveType"},"rowkey":"key","columns":{"name":{"cf":"rowkey", "col":"key", "type":"string"},"age":{"cf":"cf1", "col":"age", "type":"string"}}})
print(cat)
rdd.toDF().write.option("catalog",cat).option("newtable","5").format("org.apache.spark.sql.execution.datasources.hbase").save()
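As a quick sanity check before dropping into the HBase shell, the same catalog can be reused to read the table back through SHC. This is a minimal sketch assuming the same pyspark session, where sqlContext and cat are still defined (it mirrors the format and "catalog" option used in the write above):

# read the HBase table back via SHC using the same catalog mapping
df = sqlContext.read.option("catalog", cat).format("org.apache.spark.sql.execution.datasources.hbase").load()
df.show(5)
df.count()  # expect 255 rows, matching the range written above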
e) test from the HBase shell
[root@dan2 ~]# hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.3.0-37, rcb8c969d1089f1a34e9df11b6eeb96e69bcf878d, Tue Nov 29 18:48:22 UTC 2016

hbase(main):001:0> list 'dk'
TABLE
dk
1 row(s) in 0.4220 seconds

=> ["dk"]
hbase(main):002:0> scan 'dk'
ROW                               COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x00 column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\x00\x00\x00\x00\x01 column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x01
 \x00\x00\x00\x00\x00\x00\x00\x02 column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x02
 \x00\x00\x00\x00\x00\x00\x00\x03 column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x03
 \x00\x00\x00\x00\x00\x00\x00\x04 column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x04
 \x00\x00\x00\x00\x00\x00\x00\x05 column=cf1:age, timestamp=1492525198948, value=\x00\x00\x00\x00\x00\x00\x00\x05
 ...
 \x00\x00\x00\x00\x00\x00\x00\xFC column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFC
 \x00\x00\x00\x00\x00\x00\x00\xFD column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFD
 \x00\x00\x00\x00\x00\x00\x00\xFE column=cf1:age, timestamp=1492525198941, value=\x00\x00\x00\x00\x00\x00\x00\xFE
255 row(s) in 0.5950 seconds

hbase(main):003:0>
Created on 05-09-2017 09:00 PM
Hello, could you please help me fix this error?
I tried your code but it doesn't work for me.
I get the following error:
File "<stdin>", line 1, in <module> File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/readwriter.py", line 395, in save self._jwrite.save() File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ File "/usr/hdp/2.5.0.0-1245/spark/python/pyspark/sql/utils.py", line 45, in deco return f(*a, **kw) File "/usr/hdp/2.5.0.0-1245/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o64.save.
Created on 05-10-2017 04:58 AM
As this is an article, please raise a separate question in HCC, refer back to this one, and provide the full stack trace from your console.