Thanks for your reply, and for pointing out these subtleties that I wasn't aware of.
We're using saveAsNewAPIHadoopFile.
Is this the wrong API to use for compressed output? I think the TeraSort code that we're using is public domain, rather than our proprietary code.
Sorry, I missed that you mentioned you were using TeraSort. It isn't provided with Spark; are you referring to https://github.com/ehiggs/spark-terasort?
If so, that project provides a custom OutputFormat, which would not compress the data even with the compression configurations set. The TextOutputFormat included with Spark/Hadoop does use those configs to compress its output.
Yes, this is the TeraSort we're using, with its custom output format. So, if I understand correctly, you're suggesting that I change the last line from TeraOutputFormat to TextOutputFormat, and the configurations will then take effect?
I'm not completely familiar with this library or why it uses a custom OutputFormat; it appears the goal is to keep the file as compact as possible (no whitespace). It looks like you could simply replace TeraOutputFormat with org.apache.hadoop.mapreduce.lib.output.TextOutputFormat within TeraGen when producing the data. You would need to make an equivalent change within TeraSort, replacing TeraInputFormat, if you want to read back the data you produced in the previous step.
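For illustration, here is a minimal sketch of writing compressed output through saveAsNewAPIHadoopFile with TextOutputFormat. It is not the spark-terasort code: the object name, the tiny in-memory dataset, and the choice of GzipCodec are all assumptions made for the example. One caveat I'm fairly sure of: TextOutputFormat writes each key and value via toString unless they are Text, separated by a tab and ended with a newline, so records held as raw byte arrays would likely need converting first, and binary data may not round-trip cleanly as text.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-alone job; run via spark-submit with the output path as args(0).
object CompressedTextOutputSketch {
  def main(args: Array[String]): Unit = {
    val outputPath = args(0)
    val sc = new SparkContext(new SparkConf().setAppName("CompressedTextOutputSketch"))

    // Copy the context's Hadoop configuration and enable output compression.
    // These are the standard FileOutputFormat keys that TextOutputFormat reads.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.setBoolean("mapreduce.output.fileoutputformat.compress", true)
    hadoopConf.set(
      "mapreduce.output.fileoutputformat.compress.codec",
      classOf[GzipCodec].getName) // codec choice is an assumption

    // Stand-in data (assumption): plain strings rather than TeraGen records.
    val records = sc.parallelize(Seq("key1" -> "value1", "key2" -> "value2"))

    records
      // Wrap in Text so the raw bytes are written rather than toString output.
      .map { case (k, v) => (new Text(k), new Text(v)) }
      .saveAsNewAPIHadoopFile(
        outputPath,
        classOf[Text],
        classOf[Text],
        classOf[TextOutputFormat[Text, Text]],
        hadoopConf)

    sc.stop()
  }
}
```

With these settings the part files in the output directory should carry the codec's extension (.gz here), which is a quick way to check that the configuration took effect.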