Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Spark Terasort: How to compress output data written to HDFS

Hi

I am running a Terasort application on 1 GB of data using HDP 2.5 on a 3-node cluster with Spark 1.6.

Sorting completes fine, but the output data is not compressed: the output written to HDFS is also 1 GB in size.

I tried the following Spark configuration options when submitting the job:

--conf spark.hadoop.mapred.output.compress=true

--conf spark.hadoop.mapred.output.compression.codec=true

--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

--conf spark.hadoop.mapred.output.compression.type=BLOCK

This did not help.

I also tried the following, which did not help either:

--conf spark.hadoop.mapreduce.output.fileoutputformat.compress=true

--conf spark.hadoop.mapreduce.map.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

--conf spark.hadoop.mapred.output.fileoutputformat.compress.type=BLOCK

How can I get the output data written to HDFS in compressed format?

Thanks

Steevan

1 ACCEPTED SOLUTION

Super Collaborator

@Steevan Rodrigues

saveAsTextFile takes a parameter for setting the compression codec (shown here in PySpark keyword-argument form):

rdd.saveAsTextFile(filename,compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

Thank you for the response. I will try the saveAsTextFile() API.

The Terasort Scala code has the following line, which does not generate compressed output by default:

sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)

Replacing the above line with the following code solved the issue for me:

sorted.saveAsHadoopFile(outputFile, classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]],
  classOf[org.apache.hadoop.io.compress.SnappyCodec])
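
For anyone who wants to stay on the new Hadoop API (saveAsNewAPIHadoopFile), here is a minimal sketch of setting the compression properties on the SparkContext's Hadoop configuration instead of passing a codec class. It assumes an output format that honors the FileOutputFormat compression settings, such as the new-API TextOutputFormat. Whether TeraOutputFormat reads these settings is not confirmed in this thread; a custom output format that opens its own output stream can ignore them, which might be why the spark.hadoop.* options in the question had no effect. The object name, the helper compressionProps, and the sample data are hypothetical.

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CompressedNewApiOutput {
  // Hadoop 2 property names read by the new-API FileOutputFormat.
  // The "type" setting only matters for SequenceFile output.
  def compressionProps(codecClass: String): Map[String, String] = Map(
    "mapreduce.output.fileoutputformat.compress" -> "true",
    "mapreduce.output.fileoutputformat.compress.codec" -> codecClass,
    "mapreduce.output.fileoutputformat.compress.type" -> "BLOCK"
  )

  def main(args: Array[String]): Unit = {
    // Hypothetical output directory; it must not exist yet.
    val outputDir = args(0)
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("compressed-output-sketch"))
    try {
      // Same effect as passing each key with a spark.hadoop. prefix via --conf.
      compressionProps("org.apache.hadoop.io.compress.GzipCodec")
        .foreach { case (key, value) => sc.hadoopConfiguration.set(key, value) }

      // Stand-in for the sorted Terasort records (assumption: plain text lines).
      val pairs = sc.parallelize(Seq("b", "a", "c"))
        .sortBy(identity)
        .map(line => (NullWritable.get(), new Text(line)))

      // Uses sc.hadoopConfiguration, so the properties set above apply.
      pairs.saveAsNewAPIHadoopFile[TextOutputFormat[NullWritable, Text]](outputDir)
    } finally {
      sc.stop()
    }
  }
}
```

If the part files in the output directory end in .gz, the settings were picked up by the output format.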

saveAsTextFile also does the job. I used the following code:

rdd.saveAsTextFile(outputFile, classOf[SnappyCodec])
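
For completeness, a self-contained sketch of the same call with its imports, plus a read-back check. It uses GzipCodec rather than SnappyCodec only because Gzip does not depend on the native Hadoop libraries being installed; swapping in classOf[SnappyCodec] should work on a cluster where the native Snappy library is available. The object name, the helper writeSorted, and the sample data are hypothetical.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

object CompressedTextOutput {
  // Sorts the lines and writes them as Gzip-compressed text part files.
  def writeSorted(sc: SparkContext, lines: Seq[String], outputDir: String): Unit = {
    sc.parallelize(lines, 2)
      .sortBy(identity)
      .saveAsTextFile(outputDir, classOf[GzipCodec])
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical output directory; it must not exist yet.
    val outputDir = args(0)
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("compressed-text-sketch"))
    try {
      writeSorted(sc, Seq("b", "a", "c"), outputDir)
      // textFile picks the decompression codec from the .gz file extension.
      val readBack = sc.textFile(outputDir).collect()
      println(s"Read back ${readBack.length} lines from $outputDir")
    } finally {
      sc.stop()
    }
  }
}
```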