
Spark on YARN: codec params


Contributor

The YARN configuration page in Cloudera Manager has the following parameters:

 

Compression Codec of MapReduce Job Output

mapreduce.output.fileoutputformat.compress.codec

 

and

 

Compression Codec of MapReduce Map Output

mapreduce.map.output.compress.codec

 

These allow the user to specify codecs for MapReduce jobs, but they don't seem to be used when running a Spark job. Are there corresponding parameters for Spark jobs on YARN?

 

Thanks.

 


Re: Spark on YARN: codec params

Expert Contributor

This would depend on what writer you are using within Spark. If you are writing a DataFrame to Parquet, you would use spark.sql.parquet.compression.codec; Avro has a similar setting.

 

Writing an RDD uses the same settings as the MapReduce ones you mentioned, since it runs the same code underneath. You can also specify the codec directly in the function call when writing the file, via the saveAsTextFile API.
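A minimal sketch of both approaches (assumes an existing SparkContext sc, e.g. from spark-shell; the data, paths, and the choice of Snappy are just placeholders):

import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// DataFrame/Parquet path: the Parquet writer has its own codec setting,
// independent of the MapReduce output-compression properties.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
df.write.parquet("/tmp/out_parquet")

// RDD path: pass the codec class straight to the writer.
val rdd = sc.parallelize(Seq("alpha", "beta"))
rdd.saveAsTextFile("/tmp/out_text", classOf[SnappyCodec])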

Re: Spark on YARN: codec params

Contributor

Thanks for your reply.

 

I'm already setting spark.io.compression.codec (through a --conf argument to spark-submit).

It's taking effect. 

 

However, if I don't set this and only set the parameters I mentioned earlier, it doesn't take effect, i.e. the new codec isn't invoked at all. Is there a boolean flag to turn compression on or off for YARN?

 

Thanks

Re: Spark on YARN: codec params

Expert Contributor

YARN is a resource manager, and these settings aren't specific to YARN itself; the same would apply if Spark were running in standalone mode or on Mesos.

 

To enable compression of the output, you should also set mapred.output.compress=true. Note that if you are setting the configuration through spark-submit command-line arguments, you should prefix those settings with spark.hadoop.
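The same thing can also be done programmatically; a minimal sketch, assuming you build the SparkConf yourself (the app name and the LZ4 codec class are just example values):

import org.apache.spark.{SparkConf, SparkContext}

// Anything prefixed with "spark.hadoop." is handed to the job's Hadoop Configuration,
// so this is equivalent to passing the same keys with --conf on spark-submit.
val conf = new SparkConf()
  .setAppName("compressed-output-example")                // placeholder app name
  .set("spark.hadoop.mapred.output.compress", "true")
  .set("spark.hadoop.mapred.output.compression.codec",
       "org.apache.hadoop.io.compress.Lz4Codec")
val sc = new SparkContext(conf)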

Re: Spark on YARN: codec params

Contributor

Thanks again.

 

From your reply, I would infer that a) and b) below are equivalent.

a) Provide --conf arguments to spark-submit as follows:

 

spark-submit --conf spark.hadoop.mapred.output.compress=true \

--conf spark.hadoop.mapreduce.map.output.compress.codec=lz4

 

b) Through the YARN configuration GUI of Cloudera Manager, set

 

 mapred.output.compress=true

 mapreduce.map.output.compress.codec=lz4

 

Please confirm that they are indeed equivalent.

 

Thanks.

 

Re: Spark on YARN: codec params

Expert Contributor

The codec configuration would be spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec

 

Yes, they are equivalent. Spark takes every configuration that starts with spark.hadoop, removes the spark.hadoop prefix from the name, and adds the rest to the Hadoop configuration that is displayed in the GUI.
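A quick way to see this, as a sketch (assumes a SparkContext sc created with the spark.hadoop.* settings from the earlier example):

// The "spark.hadoop." prefix has been stripped; only the bare Hadoop keys remain.
val hadoopConf = sc.hadoopConfiguration
println(hadoopConf.get("mapred.output.compress"))           // expected: "true"
println(hadoopConf.get("mapred.output.compression.codec"))  // expected: the codec class name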

Re: Spark on YARN: codec params

Contributor

One other clarification:

 

I'm changing the codec in the spark-submit as follows:

 

spark-submit --conf spark.io.compression.codec=lz4 (to compress internal data)

I will now also add --conf spark.hadoop.mapred.output.compression.codec=lz4 (to compress the output)

 

Is there any other configuration option to specify a compression codec, or are these the only 2?

 

Thanks.

Re: Spark on YARN: codec params

Expert Contributor

You'll need spark.hadoop.mapred.output.compress=true as well to enable output compression. Also, the spark.hadoop.mapred.output.compression.codec setting takes the fully qualified class name, not just the short name. As for other codec params, there are settings to enable/disable compression for shuffle output and shuffle spills, both of which are enabled by default.
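For reference, a sketch of those Spark-side compression settings (the values shown for the two shuffle settings are their defaults; the codec value is just an example):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")        // compress map/shuffle output (default: true)
  .set("spark.shuffle.spill.compress", "true")  // compress data spilled during shuffles (default: true)
  .set("spark.io.compression.codec", "lz4")     // codec used for this internal data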

Re: Spark on YARN: codec params

Contributor

 

On the YARN configuration menu, I made the following changes:

- mapreduce.output.fileoutputformat.compress true

- mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.SnappyCodec

- mapreduce.output.fileoutputformat.compress.type BLOCK

 

I'm running a Terasort job using Spark on a Cloudera 5.5.1 cluster.

I expected the sorted output files to be compressed, but they are not; the files are just as big as without the above changes.

 

What am I missing?

 

Thanks

 

Re: Spark on YARN: codec params

Expert Contributor

What are you using to write the files? Some of the file-writing functions exposed by Spark accept a compression codec directly, so setting these properties may not be necessary. saveAsTextFile also uses the old MapReduce API, so the relevant configurations are the mapred ones rather than mapreduce (e.g. mapred.output.compress instead of mapreduce.output.fileoutputformat.compress). Also, if you are setting these on the command line, you will need to prefix them with "spark.hadoop.".
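As a concrete sketch of passing the codec to the writer itself (the app name, paths, and data are placeholders):

import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("terasort-output-example"))  // placeholder app name
val sortedRdd = sc.textFile("/user/example/terasort-input").sortBy(identity)      // placeholder input path

// Passing the codec class directly to the writer sidesteps the mapred/mapreduce
// output-compression properties entirely.
sortedRdd.saveAsTextFile("/user/example/terasort-output", classOf[SnappyCodec])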