
spark streaming json to hive

mark_hadoop — Tue, 31 Jul 2018 20:10:22 GMT

Hi All,

I am a beginner with Spark and would like to do the following.

A process on port 55500 sends JSON objects as a stream (e.g. {"one":"1","two":"2"}{"three":"3","four":"4"}).

I have an ORC table in Hive with the following columns:

one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value

I want to load the streamed values into the Hive ORC table.

Could you please guide me on how to achieve this?

Thank you for your support.

Re: spark streaming json to hive

falbani — Tue, 31 Jul 2018 20:44:53 GMT

@Mark

I suggest you take the NetworkWordCount example as a starting point. To transform the stream RDD into a DataFrame, I recommend looking into flatMap, since it lets you map a single-column RDD into multiple columns after parsing the JSON content of each object. Finally, when saving to HDFS, choose a batch size and repartitioning that avoid producing small files.
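To make the flatMap idea concrete, here is a minimal sketch in plain Python of the function you could hand to flatMap. It assumes each received line may contain several JSON objects back to back, as in the question's sample, and that each object should become its own row, with missing columns left empty. The names `split_json_objects` and `to_rows` are made up for illustration.

```python
import json

# JSON-derived columns, taken from the table definition in the question.
JSON_COLUMNS = ["one", "two", "three", "four"]


def split_json_objects(text):
    """Return every JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def to_rows(text):
    """One tuple per JSON object; columns absent from an object become None."""
    return [
        tuple(obj.get(col) for col in JSON_COLUMNS)
        for obj in split_json_objects(text)
    ]


sample = '{"one":"1","two":"2"}{"three":"3","four":"4"}'
rows = to_rows(sample)
print(rows)
```

In a streaming job this would be used as `lines.flatMap(to_rows)`. Malformed input raises `json.JSONDecodeError`, so you may want to catch it and drop or log the bad line instead of failing the batch.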

1. The NetworkWordCount code on GitHub is located here:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala

2. Here is an example of how to parse JSON using map and flatMap:

https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/BasicParseJson.scala

3. Saving a DataFrame as ORC is well documented. Just avoid writing small files, as they put pressure on the NameNode and hurt HDFS overall.
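Putting the three steps together, a rough PySpark sketch could look like the one below. Treat it as a starting point under several assumptions, not a tested job: the table name `default.json_events`, the 60-second batch interval, and the `parse_line` callback (a function that turns one text line into a list of 4-tuples) are hypothetical. It also assumes the two time columns hold the start and end of each batch, that `partition_value` is a date-based Hive partition column, and that the sender ends each record with a newline, because socketTextStream reads line by line.

```python
from datetime import datetime

TABLE = "default.json_events"  # hypothetical table name
BATCH_SECONDS = 60             # assumed batch interval
COLUMNS = ["one", "two", "three", "four",
           "spark_streaming_startingtime", "spark_streaming_endingtime",
           "partition_value"]


def add_batch_columns(row, start, end):
    """Append batch start, batch end and a date partition value to a parsed row."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return tuple(row) + (start.strftime(fmt), end.strftime(fmt),
                         start.strftime("%Y-%m-%d"))


def run_stream(parse_line, host="localhost", port=55500):
    # Spark imports live here so the helper above works without Spark installed.
    from datetime import timedelta
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructField, StructType
    from pyspark.streaming import StreamingContext

    spark = (SparkSession.builder.appName("JsonToHiveOrc")
             .enableHiveSupport().getOrCreate())
    # Assumption: needed only if partition_value is a dynamic Hive partition.
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    ssc = StreamingContext(spark.sparkContext, BATCH_SECONDS)

    # Explicit all-string schema, so None values do not break type inference.
    schema = StructType([StructField(c, StringType(), True) for c in COLUMNS])

    def save(batch_time, rdd):
        if rdd.isEmpty():
            return  # skip empty batches instead of writing empty files
        start = batch_time - timedelta(seconds=BATCH_SECONDS)
        rows = rdd.map(lambda r: add_batch_columns(r, start, batch_time))
        df = spark.createDataFrame(rows, schema)
        # coalesce(1) writes one file per batch; insertInto matches by position.
        df.coalesce(1).write.mode("append").insertInto(TABLE)

    ssc.socketTextStream(host, port).flatMap(parse_line).foreachRDD(save)
    ssc.start()
    ssc.awaitTermination()
```

You would call `run_stream(your_parser)` from a script launched with spark-submit. A longer batch interval means fewer, larger ORC files, which is the main lever for the small-files problem. Because insertInto matches columns by position, the DataFrame column order has to match the table definition.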

HTH
