
PySpark HBase integration with saveAsNewAPIHadoopDataset()


I am trying to bulk-load data into HBase from PySpark. Below is my code:

conf = {"hbase.zookeeper.qourum":"ip",\ "zookeeper.znode.parent": "/hbase-secure",\ "hbase.mapred.outputtable": "emp",\ "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",\ "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\ "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"

valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

lines = sc.textFile("sldev/poc/data")

load_rdd = lines.flatMap(lambda line : line.split("\n")).map(lambda line : parse(line))#Convert the CSV line to key value pairs

load_rdd.saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
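For reference, parse() builds the records that StringListToPutConverter expects, i.e. (rowkey, [row, columnFamily, qualifier, value]) tuples where every element is a string, as in the hbase_outputformat.py example that ships with Spark. A simplified sketch of it (the column family "cf", qualifier "col1", and two-field CSV layout here are placeholders, not the real schema):

# simplified sketch of parse(); every element must be a string
def parse(line):
    fields = line.split(",")
    rowkey = fields[0]
    # "cf" / "col1" are placeholder column family and qualifier names
    return (rowkey, [rowkey, "cf", "col1", fields[1]])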

When I run this code via spark-submit in YARN cluster mode, it hangs with the following info:

19/01/13 16:55:53 INFO Client: Application report for application_1542783151658_5714 (state: RUNNING)

After some time, if I open the YARN log, I see the following errors:

INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16

ERROR yarn.ApplicationMaster: User application exited with status 143

ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM

INFO spark.SparkContext: Invoking stop() from shutdown hook

I have determined that this happens only when saveAsNewAPIHadoopDataset() is called. Exit status 143 corresponds to SIGTERM (128 + 15), so the container appears to be killed externally. Any ideas on how to proceed further?
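For completeness, the job is submitted roughly like this (the jar path and script name are placeholders); the pythonconverters classes live in the spark-examples jar, so it is passed via --jars:

spark-submit --master yarn --deploy-mode cluster \
    --jars /path/to/spark-examples.jar \
    hbase_load.py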