Created 07-31-2018 01:10 PM
Hi All,
I am a beginner with Spark and would like to do the following.
A process on port 55500 sends JSON objects as a stream (e.g. {"one":"1","two":"2"}{"three":"3","four":"4"}).
I have an ORC table in Hive with the columns given below:
one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value
I want to load the streamed values into the Hive ORC table.
Can you please guide me on how to achieve this?
Thank you for your support.
Created 07-31-2018 01:44 PM
I suggest you take the NetworkWordCount example as a starting point. Then, to transform the stream RDD into a DataFrame, I recommend you look into flatMap, since it lets you map a single-column RDD into multiple columns after parsing the JSON content of each object. Finally, when saving to HDFS, choose a batch size and repartitioning that avoid creating many small files.
1. The NetworkWordCount code on GitHub is located here:
2. Here is an example of how to parse JSON using map and flatMap
3. Saving a DataFrame as ORC is very well documented. Just avoid writing small files, as this will hurt the NameNode and your HDFS overall.
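As a rough illustration of the parsing step, here is a minimal sketch in plain Python of a function that could be passed to flatMap. It assumes the objects arrive back to back in one string, as in the question's example. The name split_json_objects is made up for this sketch and is not part of Spark.

```python
import json

def split_json_objects(text):
    """Yield each JSON object found in a string of back-to-back objects."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed object and the index where it stopped.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

line = '{"one":"1","two":"2"}{"three":"3","four":"4"}'
records = list(split_json_objects(line))
print(records)
# [{'one': '1', 'two': '2'}, {'three': '3', 'four': '4'}]
```

With a DStream of text lines, something like lines.flatMap(split_json_objects) should then give one dict per JSON object.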
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
Created 07-31-2018 01:51 PM
Thank you for the quick response. I will go through the information provided.
Created 08-10-2018 09:08 AM
Can you help me with the PySpark version of the above, please?
Created 08-10-2018 12:15 PM
@Mark sure, here is the link to the pyspark network word count example:
https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
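Building on that example, here is a hedged sketch of how the socket stream might be wired to the existing Hive ORC table. Several things are assumptions, not facts from this thread: the host localhost, the table name default.json_stream, the 10-second batch interval, one JSON object per line (several objects on one line would need to be split first, for example with json.JSONDecoder.raw_decode), the batch window used as the start and end times, the batch date used as partition_value, and all columns being strings. It also assumes partition_value is the table's partition column and comes last, since insertInto matches columns by position. Depending on the cluster, Hive dynamic partitioning may need to be enabled.

```python
import json
from datetime import timedelta

# Column order taken from the table description in the question.
COLUMNS = ["one", "two", "three", "four",
           "spark_streaming_startingtime", "spark_streaming_endingtime",
           "partition_value"]

def to_row(line, batch_start, batch_end):
    """Map one JSON line to a tuple in table column order; missing keys become None."""
    record = json.loads(line)
    return (record.get("one"), record.get("two"),
            record.get("three"), record.get("four"),
            batch_start, batch_end,
            batch_start[:10])  # assumption: partition by the batch date (YYYY-MM-DD)

def run(host="localhost", port=55500, table="default.json_stream", batch_seconds=10):
    # Imported here so to_row can be tried out without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType
    from pyspark.streaming import StreamingContext

    spark = (SparkSession.builder.appName("JsonSocketToHiveOrc")
             .enableHiveSupport().getOrCreate())
    # Explicit all-string schema, so None values do not break type inference.
    schema = StructType([StructField(name, StringType(), True) for name in COLUMNS])

    ssc = StreamingContext(spark.sparkContext, batch_seconds)
    lines = ssc.socketTextStream(host, port)  # newline-delimited text from the socket

    def save_batch(batch_time, rdd):
        if rdd.isEmpty():
            return
        fmt = "%Y-%m-%d %H:%M:%S"
        end = batch_time.strftime(fmt)
        start = (batch_time - timedelta(seconds=batch_seconds)).strftime(fmt)
        rows = rdd.filter(lambda l: l.strip()).map(lambda l: to_row(l, start, end))
        df = spark.createDataFrame(rows, schema)
        # coalesce(1) keeps each batch to a single output file (fewer small files).
        df.coalesce(1).write.insertInto(table)

    lines.foreachRDD(save_batch)
    ssc.start()
    ssc.awaitTermination()
```

Calling run() from a script submitted with spark-submit should start the stream. A longer batch interval gives fewer, larger files per write.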
HTH