Spark terasort : How to compress output data written to HDFS


Hi

I am running a TeraSort application on 1 GB of data on a 3-node HDP 2.5 cluster with Spark 1.6.

Sorting completes fine, but the output data is not compressed: the output written to the HDFS file system is also 1 GB in size.

I tried the following Spark configuration options while submitting the Spark job:

--conf spark.hadoop.mapred.output.compress=true

--conf spark.hadoop.mapred.output.compression.codec=true

--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

--conf spark.hadoop.mapred.output.compression.type=BLOCK

This did not help.

I also tried the following, and it did not help either:

--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true

--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK
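For context, a sketch of what those flags do (assuming a live SparkContext named sc): the spark.hadoop.* prefix simply copies the rest of the key into the job's Hadoop Configuration, so the --conf flags above are equivalent to setting the properties directly. These properties only take effect if the OutputFormat consults them, and TeraSort's TeraOutputFormat appears to write records directly without checking them, which would explain why the flags have no effect here.

```scala
// Sketch: equivalent to the --conf spark.hadoop.* flags above.
// Assumes a live SparkContext `sc`. These properties are honored only by
// OutputFormats that consult them (e.g. TextOutputFormat), not by
// TeraOutputFormat.
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
sc.hadoopConfiguration.set("mapreduce.output.fileoutputformat.compress.codec",
  "org.apache.hadoop.io.compress.GzipCodec")
```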

How can I get the output data written to HDFS in compressed format?

Thanks

Steevan

ACCEPTED SOLUTION

Re: Spark terasort : How to compress output data written to HDFS

Expert Contributor

@Steevan Rodrigues

saveAsTextFile takes a parameter for setting the codec to compress with:

rdd.saveAsTextFile(filename,compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
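The call above uses the PySpark signature (the compressionCodecClass keyword argument). In Scala, a sketch of the equivalent call passes the codec class directly to the two-argument saveAsTextFile overload:

```scala
// Sketch (Scala equivalent of the PySpark call above); `rdd` is the RDD
// to write. GzipCodec needs no native Hadoop libraries, unlike SnappyCodec.
rdd.saveAsTextFile(outputFile, classOf[org.apache.hadoop.io.compress.GzipCodec])
```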

Re: Spark terasort : How to compress output data written to HDFS

Thank you for the response. I will try the saveAsTextFile() API.

The TeraSort Scala code has the following line, which does not generate compressed output by default:

sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)

Replacing that line with the following code solved the issue for me:

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.TextOutputFormat

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])
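As a quick sanity check (a sketch, assuming the same sc and outputFile as above): the part files should now carry the codec's .snappy extension, and Spark decompresses known codecs transparently, so the compressed output can be read straight back:

```scala
// Sketch: sc.textFile recognizes the codec from the .snappy file
// extension and decompresses on read, so the record count should
// match what was written.
val readBack = sc.textFile(outputFile)
println(readBack.count())
```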

Re: Spark terasort : How to compress output data written to HDFS

saveAsTextFile also does the job. I used the following code:

rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])