<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: saving TCP stream in to hive using pyspark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219275#M82037</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/23208/hadoopuserhadoop.html" nodeid="23208"&gt;@Mark&lt;/A&gt; One last thing: you may want to reconsider saving files every minute. If the files are small, you will end up causing problems for the HDFS NameNode in the long term. This is a known issue:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html" target="_blank"&gt;https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;We recommend avoiding lots of small files; instead, try to keep each file at least the size of an HDFS block.&lt;/P&gt;</description>
    <pubDate>Tue, 14 Aug 2018 19:42:48 GMT</pubDate>
    <dc:creator>falbani</dc:creator>
    <dc:date>2018-08-14T19:42:48Z</dc:date>
    <item>
      <title>saving TCP stream in to hive using pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219271#M82033</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am receiving data from TCP as a JSON stream using pyspark.&lt;/P&gt;&lt;P&gt;I want to save the files (appending, with basically one file per minute, e.g. yyyyMMddHHmm, so that all messages in one minute go to the corresponding file), and in parallel I want to save the JSON to an ORC Hive table.&lt;/P&gt;&lt;P&gt;I have two questions:&lt;/P&gt;&lt;P&gt;1.&lt;/P&gt;&lt;P&gt;*[path : '/folder/file']&lt;/P&gt;&lt;P&gt;When I receive data in a DStream, I flatMap with split("\n") and then repartition(1).saveAsTextFiles(path, "json"):&lt;/P&gt;&lt;PRE&gt;lines = ssc.socketTextStream("localhost", 9999)
flat_map = lines.flatMap(lambda x: x.split("\n"))
flat_map.repartition(1).saveAsTextFiles(path, "json")
&lt;/PRE&gt;&lt;P&gt;This saves to the given path, but instead of one single file per minute saved into the folder, it makes three folders, each with a _SUCCESS file and a part-00000 file, which is not what I expect.&lt;/P&gt;&lt;P&gt;How can I get the expected layout: one folder per day and one file per minute under that folder?&lt;/P&gt;&lt;P&gt;2. If I want to save the JSON to an ORC Hive table, can I do it directly from a DStream, or do I have to convert the DStream to RDDs and then perform some processing to save it as ORC?&lt;/P&gt;&lt;P&gt;As I am new to pyspark, please help with the above, ideally with some examples.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Aug 2018 21:14:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219271#M82033</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2018-08-13T21:14:52Z</dc:date>
    </item>
    <item>
      <title>Re: saving TCP stream in to hive using pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219272#M82034</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11048/falbani.html" nodeid="11048"&gt;@Felix Albani&lt;/A&gt; could you please advise?&lt;/P&gt;</description>
      <pubDate>Tue, 14 Aug 2018 15:46:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219272#M82034</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2018-08-14T15:46:21Z</dc:date>
    </item>
    <item>
      <title>Re: saving TCP stream in to hive using pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219273#M82035</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/23208/hadoopuserhadoop.html" nodeid="23208"&gt;@Mark&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;Here are my suggestions:&lt;/P&gt;&lt;P&gt;1. Before saving the RDD, I recommend transforming it to a DataFrame and using the DataFrameWriter:&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter" target="_blank"&gt;https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As for your requirement to avoid the directory and part file names, I believe this is not possible out of the box. You can write a single part file, but the directory will be created by default. You can read more here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/142479/pyspark-creating-directory-when-trying-to-rdd-as-s.html" target="_blank"&gt;https://community.hortonworks.com/questions/142479/pyspark-creating-directory-when-trying-to-rdd-as-s.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;One possible solution is to write to a temporary directory and then move the single file, renaming it into the appropriate folder. You can have a single file created inside the temporary directory by using the coalesce method, like this:&lt;/P&gt;&lt;P&gt;df.coalesce(1).write.format("json").mode("overwrite").save("temp_dir/test.json")&lt;/P&gt;&lt;P&gt;2. For saving the JSON to an ORC Hive table: unless you plan to store it as a single string column, you will need to parse the JSON and use flatMap to produce the columns you want to store. You can review the DataFrameWriter saveAsTable method and example:&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter@saveAsTable(tableName:String):Unit" target="_blank"&gt;https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter@saveAsTable(tableName:String):Unit&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Also check out this article, which shows how to append to an ORC table:&lt;/P&gt;&lt;P&gt;&lt;A href="http://jugsi.blogspot.com/2017/12/append-data-with-spark-to-hive-oarquet.html" target="_blank"&gt;http://jugsi.blogspot.com/2017/12/append-data-with-spark-to-hive-oarquet.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As always, if this answer addressed your question, please take a moment to log in and click the "accept" link on the answer.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
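The write-to-a-temporary-directory-then-rename approach described above could be sketched roughly like this (a minimal sketch, not from the thread: it assumes Spark 2.x with a SparkSession named `spark`, and the base path `/data/stream` and helper names are illustrative only):

```python
import datetime

def batch_dir(base, time):
    # One folder per day, one sub-directory per minute, matching the
    # yyyyMMddHHmm naming the question asks for.
    return "{0}/{1}/{2}".format(
        base, time.strftime("%Y%m%d"), time.strftime("%Y%m%d%H%M"))

def save_batch(time, rdd):
    # Called once per micro-batch; pyspark passes `time` as a datetime.
    if rdd.isEmpty():
        return                     # skip empty batches entirely
    df = spark.read.json(rdd)      # parse each JSON line into columns
    # coalesce(1) yields a single part file, but the writer still
    # creates the directory; rename/move the part file afterwards if a
    # bare per-minute file name is required.
    df.coalesce(1).write.mode("overwrite").json(batch_dir("/data/stream", time))

def attach(lines):
    # Wire the handler to the DStream from ssc.socketTextStream(...).
    lines.foreachRDD(save_batch)
```

The renaming step itself (e.g. via an HDFS client call) is left out, since how the final file is moved depends on the environment.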
      <pubDate>Tue, 14 Aug 2018 19:29:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219273#M82035</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-08-14T19:29:14Z</dc:date>
    </item>
    <item>
      <title>Re: saving TCP stream in to hive using pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219274#M82036</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/23208/hadoopuserhadoop.html" nodeid="23208"&gt;@Mark&lt;/A&gt; Sorry, I just realized you were looking for a pyspark solution and I provided the Scala references instead. Everything I mentioned above also applies to pyspark, and the DataFrameWriter API link is here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter" target="_blank"&gt;https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter&lt;/A&gt;&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
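In pyspark, the saveAsTable route referred to above might look roughly like this (a hedged sketch only: the table name `default.logs_orc` and the assumption of a Hive-enabled SparkSession named `spark` are illustrative, not from the thread):

```python
def append_to_hive(time, rdd):
    # Intended to be called per micro-batch via lines.foreachRDD(append_to_hive).
    if rdd.isEmpty():
        return
    df = spark.read.json(rdd)    # infer columns from the JSON lines
    (df.write
       .format("orc")            # store the data as ORC files under the table
       .mode("append")           # add each batch to the table, never replace it
       .saveAsTable("default.logs_orc"))
```

With `mode("append")`, each micro-batch adds new ORC files to the table rather than overwriting earlier batches.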
      <pubDate>Tue, 14 Aug 2018 19:36:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219274#M82036</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-08-14T19:36:15Z</dc:date>
    </item>
    <item>
      <title>Re: saving TCP stream in to hive using pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219275#M82037</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/23208/hadoopuserhadoop.html" nodeid="23208"&gt;@Mark&lt;/A&gt; One last thing: you may want to reconsider saving files every minute. If the files are small, you will end up causing problems for the HDFS NameNode in the long term. This is a known issue:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html" target="_blank"&gt;https://community.hortonworks.com/questions/167615/what-is-small-file-problem-in-hdfs.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;We recommend avoiding lots of small files; instead, try to keep each file at least the size of an HDFS block.&lt;/P&gt;</description>
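One common mitigation for the small-file problem described above is a periodic compaction job that rewrites the per-minute files into a few files near the HDFS block size. A rough sketch (the 128 MB default block size, the function names, and the idea of reading the input size from `hdfs dfs -du -s` are all assumptions for illustration):

```python
def target_partitions(total_bytes, block_bytes=128 * 1024 * 1024):
    # Aim for output files of at least one HDFS block each.
    return max(1, total_bytes // block_bytes)

def compact_day(spark, day_dir, out_dir, total_bytes):
    # total_bytes: size of the day's input, e.g. obtained via `hdfs dfs -du -s`.
    # Read all the small per-minute files for one day and rewrite them
    # as a handful of block-sized files.
    n = target_partitions(total_bytes)
    spark.read.json(day_dir).coalesce(n).write.mode("overwrite").json(out_dir)
```

Running such a job once per day (or per hour) keeps the NameNode's file count bounded while the streaming job continues to write minute-level files.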
      <pubDate>Tue, 14 Aug 2018 19:42:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219275#M82037</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-08-14T19:42:48Z</dc:date>
    </item>
    <item>
      <title>Re: saving TCP stream in to hive using pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219276#M82038</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/11048/falbani.html" nodeid="11048"&gt;@Felix Albani&lt;/A&gt;&lt;P&gt;Thanks for the helping hand. I will go through them and may ask for further suggestions if required.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Aug 2018 22:17:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/saving-TCP-stream-in-to-hive-using-pyspark/m-p/219276#M82038</guid>
      <dc:creator>mark_hadoop</dc:creator>
      <dc:date>2018-08-14T22:17:26Z</dc:date>
    </item>
  </channel>
</rss>

