The YARN configuration page in Cloudera Manager has the following parameters:
Compression Codec of MapReduce Job Output
Compression Codec of MapReduce Map Output
These allow the user to specify codecs for MapReduce jobs. They don't seem to be used when running a Spark job. Are there corresponding parameters for Spark jobs on YARN?
This would depend on which writer you are using within Spark. If you are writing a DataFrame to Parquet, you would use spark.sql.parquet.compression.codec; Avro has a similar setting.
Writing an RDD uses the same settings as the MapReduce settings you mentioned, since it uses the same code underneath. You can also specify the codec directly in the function call when writing the file, e.g. via the saveAsTextFile API.
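For example, the two cases above might look like this in PySpark (a sketch assuming a modern PySpark is available; the paths, DataFrame contents, and app name are placeholders):

```python
# Sketch: choosing a compression codec per writer in PySpark.
# Paths and data below are placeholders, not from the original thread.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-sketch").getOrCreate()

# DataFrame -> Parquet: the Parquet writer honors this Spark SQL setting,
# not the MapReduce output-compression settings.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("/tmp/out_parquet")

# RDD -> text: the codec class can also be passed directly to
# saveAsTextFile, bypassing the Hadoop configuration entirely.
rdd = spark.sparkContext.parallelize(["one", "two"])
rdd.saveAsTextFile(
    "/tmp/out_text",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
```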
Thanks for your reply.
I'm already setting spark.io.compression.codec (through a --conf argument to spark-submit).
It's taking effect.
However, if I don't set this and only set the parameters I mentioned earlier, I don't see them taking effect; the new codec isn't being invoked at all. Is there a boolean flag to turn compression on/off for YARN?
YARN is a resource manager, and these settings aren't specific to YARN itself; the same settings would apply if Spark were running standalone or on Mesos.
To enable compression on output, you should also set mapred.output.compress=true. Note that if you are setting the configuration through Spark CLI arguments, you should prepend those settings with spark.hadoop.
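A full invocation might look like this (a sketch; the application class and JAR names are placeholders):

```shell
# Sketch: enabling output compression for Hadoop-API writers from spark-submit.
# The spark.hadoop. prefix routes these entries into the Hadoop Configuration.
# com.example.MyApp and myapp.jar are placeholders.
spark-submit \
  --conf spark.hadoop.mapred.output.compress=true \
  --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  --class com.example.MyApp \
  myapp.jar
```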
From your reply, I would infer that a) and b) below are equivalent.
a) Provide --conf arguments to spark-submit as follows:
spark-submit --conf spark.hadoop.mapred.output.compress=true \
b) Through the YARN configuration GUI of CDH Manager, choose
Please confirm that they are indeed equivalent.
The codec configuration would likewise be prefixed with spark.hadoop.
Yes, they are equivalent. Spark takes every configuration that starts with spark.hadoop., removes spark.hadoop. from the configuration name, and adds the result to the Hadoop configuration that is displayed in the GUI.
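The prefix handling can be sketched in plain Python (this mimics the behavior described above; it is not Spark's actual implementation):

```python
def extract_hadoop_conf(spark_conf):
    """Return the entries Spark would copy into the Hadoop Configuration.

    Spark takes every key starting with "spark.hadoop.", drops that
    prefix, and sets the remainder on the Hadoop Configuration
    (a behavioral sketch, not Spark's real code).
    """
    prefix = "spark.hadoop."
    return {key[len(prefix):]: value
            for key, value in spark_conf.items()
            if key.startswith(prefix)}

conf = {
    "spark.hadoop.mapred.output.compress": "true",
    "spark.io.compression.codec": "lz4",  # not copied: no spark.hadoop. prefix
}
print(extract_hadoop_conf(conf))
# -> {'mapred.output.compress': 'true'}
```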
One other clarification:
I'm changing the codec in the spark-submit as follows:
spark-submit --conf spark.io.compression.codec=lz4 (to compress internal data)
I will now also add --conf spark.hadoop.mapred.output.compression.codec=lz4 (to compress the output)
Is there any other configuration option to specify a compression codec, or are these the only 2?
You'll need spark.hadoop.mapred.output.compress=true as well to enable output compression. Also, the spark.hadoop.mapred.output.compression.codec setting takes the fully qualified class name, not just the short name. As for other codec parameters, there are settings to enable/disable compression for shuffle and spills, both of which are enabled by default.
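Putting the corrections above together, the command might look like this (a sketch; the JAR name is a placeholder):

```shell
# Sketch: both the enable flag and the fully qualified codec class are needed
# for output compression; spark.io.compression.codec covers internal data only.
# myapp.jar is a placeholder.
spark-submit \
  --conf spark.io.compression.codec=lz4 \
  --conf spark.hadoop.mapred.output.compress=true \
  --conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec \
  myapp.jar
```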
On the YARN configuration menu, I made the following changes:
- mapreduce.output.fileoutputformat.compress true
- mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.SnappyCodec
- mapreduce.output.fileoutputformat.compress.type BLOCK
I'm running a Terasort job using Spark, on a Cloudera 5.5.1 cluster.
I expected the sorted output files to be compressed, but they are not; the files are just as big as without the above changes.
What am I missing?
What are you using to write the files? Some of the functions Spark exposes for writing files accept the compression codec directly, so this may not be necessary. saveAsTextFile also uses the old Hadoop API, so the configuration names would be mapred instead of mapreduce (e.g. mapred.output.compress instead of mapreduce.output.fileoutputformat.compress). Also, if you are setting these on the command line, you will need to prepend "spark.hadoop." to the settings.