Spark TeraSort: How to compress output data written to HDFS

Hi

I am running a TeraSort application on 1 GB of data using HDP 2.5 on a 3-node cluster with Spark 1.6.

Sorting completes fine, but the output data is not compressed: the output written to HDFS is also 1 GB in size.

I tried the following Spark configuration options when submitting the Spark job:

--conf spark.hadoop.mapred.output.compress=true

--conf spark.hadoop.mapred.output.compression.codec=true

--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

--conf spark.hadoop.mapred.output.compression.type=BLOCK

This did not help.

I also tried the following, and it did not help either:

--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true

--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK

How can I get the output data written to HDFS in compressed format?

Thanks

Steevan

1 ACCEPTED SOLUTION

Super Collaborator

@Steevan Rodrigues

saveAsTextFile takes an optional parameter that sets the codec to compress with (shown here in PySpark syntax):

rdd.saveAsTextFile(filename, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

3 REPLIES

Thank you for the response. I will try the saveAsTextFile() API.

The TeraSort Scala code has the following line, which does not generate compressed output by default:

sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)

Replacing the above line with the following code solved this issue for me:

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])
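
For anyone who would rather stay on the new Hadoop API, here is a minimal sketch of switching compression on through the Hadoop configuration instead of passing a codec class. It assumes the output format actually consults the standard FileOutputFormat compression properties. The new-API TextOutputFormat does; TeraOutputFormat apparently does not, judging by the behavior described above, which I have not verified against its source. The object name, the sample data and the use of args(0) as the output path are made up for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not part of the TeraSort code.
object CompressedNewApiOutput {
  def main(args: Array[String]): Unit = {
    val outputFile = args(0) // output directory, must not exist yet
    val sc = new SparkContext(new SparkConf().setAppName("CompressedNewApiOutput"))

    // Work on a copy so the shared Hadoop configuration stays untouched.
    val conf = new Configuration(sc.hadoopConfiguration)
    // Hadoop 2.x property names read by the new-API FileOutputFormat.
    conf.set("mapreduce.output.fileoutputformat.compress", "true")
    conf.set("mapreduce.output.fileoutputformat.compress.codec",
      classOf[GzipCodec].getName)

    // Placeholder key/value data standing in for the sorted records.
    val pairs = sc.parallelize(Seq("a" -> "1", "b" -> "2"))
      .map { case (k, v) => (new Text(k), new Text(v)) }

    // Pass the modified configuration explicitly to the save call.
    pairs.saveAsNewAPIHadoopFile(outputFile, classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]], conf)

    sc.stop()
  }
}
```

If the part files in the output directory do not end in .gz after this runs, the output format in use is most likely ignoring these properties, and passing the codec class directly as shown above is the safer route.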

saveAsTextFile also does the job. I used the following code:

rdd.saveAsTextFile(outputFile, classOf[org.apache.hadoop.io.compress.SnappyCodec])
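
In case a complete program is useful, here is a minimal sketch of the same call with its imports, plus a read-back to confirm the output is usable. It swaps in GzipCodec because Snappy needs the native Hadoop libraries on every node, which I am only assuming may not be set up everywhere. The object name, the generated sample records and the use of args(0) as the output path are placeholders.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object for illustration only.
object CompressedTextOutput {
  def main(args: Array[String]): Unit = {
    val outputFile = args(0) // output directory, must not exist yet
    val sc = new SparkContext(new SparkConf().setAppName("CompressedTextOutput"))

    // Placeholder records standing in for the real data set.
    val rdd = sc.parallelize(1 to 1000).map(i => s"record-$i")

    // The second argument selects the codec; part files get a .gz suffix.
    rdd.saveAsTextFile(outputFile, classOf[GzipCodec])

    // textFile picks the decompressor from the file extension.
    val readBack = sc.textFile(outputFile).count()
    println(s"re-read $readBack records from $outputFile")

    sc.stop()
  }
}
```

Comparing the directory size with hdfs dfs -du -h before and after the change is a quick way to check that compression actually took effect.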