<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark terasort : How to compress output data written to HDFS in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144687#M44216</link>
    <description>&lt;P&gt;saveAsTextFile also does the job. I used the following code:&lt;/P&gt;&lt;PRE&gt;rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])&lt;/PRE&gt;</description>
    <pubDate>Wed, 26 Oct 2016 17:48:30 GMT</pubDate>
    <dc:creator>steevan_rodrigu</dc:creator>
    <dc:date>2016-10-26T17:48:30Z</dc:date>
    <item>
      <title>Spark terasort : How to compress output data written to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144684#M44213</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am running a Terasort application on 1G of data using HDP 2.5 on a 3-node cluster with Spark 1.6.&lt;/P&gt;&lt;P&gt;The sort completes fine, but the output data is not compressed: the output written to the HDFS file system is also 1G in size.&lt;/P&gt;&lt;P&gt;I tried the following Spark configuration options while submitting the Spark job:&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapred.output.compress=true&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapred.output.compression.codec=true&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapred.output.compression.type=BLOCK&lt;/P&gt;&lt;P&gt;This did not help.&lt;/P&gt;&lt;P&gt;I also tried the following, and it did not help either:&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec&lt;/P&gt;&lt;P&gt;--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK&lt;/P&gt;&lt;P&gt;How do I get the output data written to HDFS in compressed format?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Steevan&lt;/P&gt;</description>
      <pubDate>Sat, 22 Oct 2016 18:14:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144684#M44213</guid>
      <dc:creator>steevan_rodrigu</dc:creator>
      <dc:date>2016-10-22T18:14:08Z</dc:date>
    </item>
    <item>
      <title>Re: Spark terasort : How to compress output data written to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144685#M44214</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/13777/steevanrodrigues.html" nodeid="13777"&gt;@Steevan Rodrigues&lt;/A&gt;&lt;/P&gt;&lt;P&gt;saveAsTextFile takes a parameter for setting the codec to compress with:&lt;/P&gt;&lt;PRE&gt;rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")&lt;/PRE&gt;</description>
      <pubDate>Tue, 25 Oct 2016 05:40:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144685#M44214</guid>
      <dc:creator>jwiden</dc:creator>
      <dc:date>2016-10-25T05:40:03Z</dc:date>
    </item>
    <item>
      <title>Re: Spark terasort : How to compress output data written to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144686#M44215</link>
      <description>&lt;P&gt;Thank you for the response. I will try the saveAsTextFile() API.&lt;/P&gt;&lt;P&gt;The Terasort Scala code has the following line, which does not generate compressed output by default:&lt;/P&gt;&lt;PRE&gt;sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)&lt;/PRE&gt;&lt;P&gt;Replacing that line with the following code solved the issue for me:&lt;/P&gt;&lt;PRE&gt;sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])&lt;/PRE&gt;</description>
      <pubDate>Tue, 25 Oct 2016 16:43:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144686#M44215</guid>
      <dc:creator>steevan_rodrigu</dc:creator>
      <dc:date>2016-10-25T16:43:01Z</dc:date>
    </item>
    <item>
      <title>Re: Spark terasort : How to compress output data written to HDFS</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144687#M44216</link>
      <description>&lt;P&gt;saveAsTextFile also does the job. I used the following code:&lt;/P&gt;&lt;PRE&gt;rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])&lt;/PRE&gt;</description>
      <pubDate>Wed, 26 Oct 2016 17:48:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-terasort-How-to-compress-output-data-written-to-HDFS/m-p/144687#M44216</guid>
      <dc:creator>steevan_rodrigu</dc:creator>
      <dc:date>2016-10-26T17:48:30Z</dc:date>
    </item>
  </channel>
</rss>