Support Questions

spark streaming json to hive

Solved

Expert Contributor

Hi All,

I am a beginner to Spark and want to do the following.

A port (55500) is sending JSON objects as a stream, e.g. {"one":"1","two":"2"}{"three":"3","four":"4"}.

I have an ORC table in Hive with the columns below:

one, two, three, four, spark_streaming_startingtime, spark_streaming_endingtime, partition_value

I want to load the streaming values into the Hive ORC table.

Can you please guide me on how to achieve this?

Thank you for your support.

1 ACCEPTED SOLUTION

Re: spark streaming json to hive

@Mark

I suggest you take the NetworkWordCount example as a starting point. Then, to transform the stream's RDDs into a DataFrame, I recommend you look into flatMap: you can map a single-column RDD into multiple columns after parsing the JSON content of each object. Finally, when saving to HDFS, you should choose a good batch size/repartitioning to avoid creating many small files in HDFS.

1. The NetworkWordCount code on GitHub is located here:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/stream...

2. Here is an example of how to parse JSON using map and flatMap:

https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsp...

3. Saving a DataFrame as ORC is very well documented. Just avoid writing small files, as this will hurt the NameNode and your HDFS overall.
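One wrinkle with the sample stream in the question is that the JSON objects arrive back-to-back with no delimiter ({"one":"1","two":"2"}{"three":"3","four":"4"}), so a plain json.loads on each received chunk fails. A minimal pure-Python parser you could pass to flatMap might look like this (a sketch; the function name is my own, not from any of the linked examples):

```python
import json

def parse_concatenated_json(line):
    """Split a string like '{"one":"1"}{"three":"3"}' into a list of dicts.

    json.loads rejects back-to-back objects, so raw_decode is used to
    consume one object at a time, tracking the position in the string.
    """
    decoder = json.JSONDecoder()
    records, pos = [], 0
    line = line.strip()
    while pos < len(line):
        obj, pos = decoder.raw_decode(line, pos)
        records.append(obj)
        # skip any whitespace between adjacent objects
        while pos < len(line) and line[pos].isspace():
            pos += 1
    return records
```

Because flatMap flattens the returned list, each JSON object becomes its own record in the resulting RDD.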

HTH

*** If this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.

4 REPLIES


Re: spark streaming json to hive

Expert Contributor
@Felix Albani

Thank you for the quick response. I will go through the given info.

Re: spark streaming json to hive

Expert Contributor
@Felix Albani

Can you help me with the PySpark version of the above, please?

Re: spark streaming json to hive

@Mark sure, here is the link to the pyspark network word count example:

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
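Adapting that word-count skeleton to the original question, a rough PySpark (2.x DStream API) sketch could look like the following. The table name, host, batch interval, and helper names are illustrative assumptions, not tested against a cluster; only the pure-Python helpers run as written.

```python
import json
import time

def parse_concatenated_json(line):
    # The sample stream concatenates objects ('{...}{...}'),
    # so consume them one at a time with raw_decode.
    decoder = json.JSONDecoder()
    records, pos = [], 0
    line = line.strip()
    while pos < len(line):
        obj, pos = decoder.raw_decode(line, pos)
        records.append(obj)
        while pos < len(line) and line[pos].isspace():
            pos += 1
    return records

def with_audit_columns(record, batch_start, batch_end):
    # Fill the extra columns from the question's table schema.
    out = dict(record)
    out["spark_streaming_startingtime"] = batch_start
    out["spark_streaming_endingtime"] = batch_end
    out["partition_value"] = time.strftime("%Y-%m-%d", time.gmtime(batch_start))
    return out

def run_stream():
    # Illustrative wiring only; requires a Spark + Hive installation.
    from pyspark.sql import Row, SparkSession
    from pyspark.streaming import StreamingContext

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())
    ssc = StreamingContext(spark.sparkContext, 10)  # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 55500)
    records = lines.flatMap(parse_concatenated_json)

    def save(batch_time, rdd):
        if rdd.isEmpty():
            return
        start = time.mktime(batch_time.timetuple())
        rows = rdd.map(lambda r: Row(**with_audit_columns(r, start, time.time())))
        # repartition(1) keeps each micro-batch to one ORC file,
        # per the earlier advice about avoiding small files.
        (spark.createDataFrame(rows).repartition(1)
             .write.mode("append").insertInto("my_orc_table"))

    records.foreachRDD(save)
    ssc.start()
    ssc.awaitTermination()
```

Note that insertInto writes into the existing Hive table using its declared format, so the table's ORC definition is respected; batching several micro-batches before writing would reduce file counts further.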

HTH
