Spark streaming JSON to Hive

Expert Contributor

Hi All,

I am a beginner to Spark and want to do the following.

A socket on port 55500 is sending JSON objects as a stream (e.g. {"one":"1","two":"2"}{"three":"3","four":"4"}).

I have an ORC table in Hive with the columns given below:

one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value

I want to load the streaming values into the Hive ORC table.

Can you please guide me on how to achieve this?

Thank you for your support.

4 REPLIES


@Mark

I suggest you take the NetworkWordCount example as a starting point. To transform the stream RDD into a DataFrame, look into flatMap: you can turn a single-column RDD into multiple columns after parsing the JSON content of each object. Finally, when saving to HDFS, choose a sensible batch size and repartition before writing, so you avoid accumulating small files in HDFS. A minimal sketch that puts these pieces together follows the list below.

1. The NetworkWordCount code on GitHub is located here:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/stream...

2. Here is an example of how to parse JSON using map and flatMap:

https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsp...

3. Saving a DataFrame as ORC is well documented. Just avoid writing many small files, as this hurts the NameNode and your HDFS overall.
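
To make that concrete, here is a minimal, untested sketch of the whole flow in Scala. It assumes Spark 2.x with Hive support, a 60-second batch interval, localhost as the stream host, and a hypothetical target table mydb.json_stream; the brace-boundary regex for splitting back-to-back JSON objects is also an assumption based on your sample input. Rather than hand-parsing each object with map as in the linked example, this sketch leans on spark.read.json with an explicit schema to turn each batch into a DataFrame.

import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{date_format, lit}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object JsonStreamToOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JsonStreamToOrc")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    // Raw text arriving on the socket.
    val lines = ssc.socketTextStream("localhost", 55500)

    // Your sample shows objects back to back ({...}{...}), so split on the
    // boundary between a closing and an opening brace: one element per object.
    val jsons = lines.flatMap(_.split("(?<=\\})(?=\\{)"))

    // Fixed schema: fields missing from an object simply come out as null.
    val schema = StructType(
      Seq("one", "two", "three", "four").map(StructField(_, StringType, nullable = true)))

    jsons.foreachRDD { (rdd, batchTime) =>
      if (!rdd.isEmpty()) {
        val start = new Timestamp(batchTime.milliseconds)
        val df = spark.read.schema(schema).json(rdd.toDS())
          .withColumn("spark_streaming_startingtime", lit(start))
          .withColumn("spark_streaming_endingtime",
            lit(new Timestamp(System.currentTimeMillis())))
          .withColumn("partition_value", date_format(lit(start), "yyyy-MM-dd"))

        // coalesce(1) writes one ORC file per batch instead of one per task,
        // which keeps the small-file count down. insertInto matches columns by
        // position and uses the table's existing ORC format; if the table is
        // partitioned on partition_value, enable dynamic partitioning first
        // (hive.exec.dynamic.partition.mode=nonstrict).
        df.coalesce(1)
          .write
          .mode("append")
          .insertInto("mydb.json_stream") // hypothetical table name
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

With a 60-second batch and coalesce(1) you get at most one ORC file per minute; tune both to your data volume.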

HTH

*** If this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.

Expert Contributor
@Felix Albani

Thank you for the quick response; I will go through the given info.

Expert Contributor
@Felix Albani

Can you help me with the PySpark version of the above, please?


@Mark Sure, here is the link to the PySpark network word count example:

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
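
The DStream pieces in the Scala sketch above (socketTextStream, flatMap, foreachRDD) all have direct PySpark equivalents, and the DataFrame side (spark.read.json, withColumn, insertInto) is the same API in Python, so the sketch should translate almost line for line.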

HTH