Created 10-22-2016 11:14 AM
Hi
I am running a Terasort application on 1 GB of data using HDP 2.5 on a 3-node cluster with Spark 1.6.
Sorting completes fine, but the output data is not compressed: the output written to HDFS is also 1 GB in size.
I tried the following Spark configuration options when submitting the Spark job:
--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=true
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
--conf spark.hadoop.mapred.output.compression.type=BLOCK
This did not help.
I also tried the following, which did not help either:
--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true
--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK
How can I get the output data written to HDFS in compressed format?
Thanks
Steevan
Created 10-24-2016 10:40 PM
When calling saveAsTextFile, you can pass a parameter that sets the compression codec (PySpark syntax shown):
rdd.saveAsTextFile(filename,compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
Created 10-25-2016 09:43 AM
Thank you for the response. I will try the saveAsTextFile() API.
The Terasort Scala code has the following line, which does not generate compressed output by default:
sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)
Replacing the above line with the following code solved the issue for me:
sorted.saveAsHadoopFile(outputFile,classOf[Text],classOf[IntWritable], classOf[TextOutputFormat[Text,IntWritable]], classOf[org.apache.hadoop.io.compress.SnappyCodec])
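For anyone who prefers to stay on the new Hadoop API (saveAsNewAPIHadoopFile) rather than switch to saveAsHadoopFile, here is a minimal sketch. It assumes an output format that honors the standard FileOutputFormat compression settings, such as the new-API TextOutputFormat; whether TeraOutputFormat honors them is not confirmed in this thread. The object name, sample data, and default output path are hypothetical, and GzipCodec is used only as an example codec.

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of the Terasort code.
object NewApiCompressedOutput {
  def main(args: Array[String]): Unit = {
    // Assumed default path; pass a real HDFS directory as the first argument.
    val outputDir = args.headOption.getOrElse("/tmp/new-api-compressed-output")

    val conf = new SparkConf()
      .setAppName("NewApiCompressedOutput")
      .setIfMissing("spark.master", "local[2]") // only used when no master is supplied
    val sc = new SparkContext(conf)

    try {
      // Copy the cluster's Hadoop configuration into a Job and enable
      // output compression on that copy.
      val job = Job.getInstance(sc.hadoopConfiguration)
      FileOutputFormat.setCompressOutput(job, true)
      FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])

      // Small illustrative key/value data set converted to Writable types.
      val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2))
        .map { case (k, v) => (new Text(k), new IntWritable(v)) }

      // Pass the modified configuration explicitly so the output format sees it.
      pairs.saveAsNewAPIHadoopFile(
        outputDir,
        classOf[Text],
        classOf[IntWritable],
        classOf[TextOutputFormat[Text, IntWritable]],
        job.getConfiguration)
    } finally {
      sc.stop()
    }
  }
}
```

The key point is that the compression settings travel in the Configuration object handed to the save call, and that the output format class has to read them; a custom format that writes raw bytes directly would ignore them.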
Created 10-26-2016 10:48 AM
saveAsTextFile also does the job. I used the following code:
rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])
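For completeness, here is a self-contained sketch of that call with the imports it needs. The object name, sample data, and default output path are hypothetical. GzipCodec is swapped in for SnappyCodec here on the assumption that it is more likely to work on a machine without the native Hadoop libraries; on the cluster either codec class can be passed.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object illustrating saveAsTextFile with a codec.
object CompressedTextOutput {
  def main(args: Array[String]): Unit = {
    // Assumed default path; pass a real HDFS directory as the first argument.
    // The directory must not already exist.
    val outputDir = args.headOption.getOrElse("/tmp/compressed-text-output")

    val conf = new SparkConf()
      .setAppName("CompressedTextOutput")
      .setIfMissing("spark.master", "local[2]") // only used when no master is supplied
    val sc = new SparkContext(conf)

    try {
      // Small illustrative data set, sorted before writing.
      val rdd = sc.parallelize(Seq("c", "a", "b")).sortBy(s => s)

      // The second argument is the codec class used to compress each part file.
      rdd.saveAsTextFile(outputDir, classOf[GzipCodec])
    } finally {
      sc.stop()
    }
  }
}
```

With GzipCodec the part files in the output directory should carry a .gz extension, which is a quick way to check that compression was applied.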