
Spark on YARN: codec params

Re: Spark on YARN: codec params

Contributor

Thanks for your reply, and for pointing out these subtleties that I wasn't aware of.

We're using saveAsNewAPIHadoopFile.

Is this the wrong API to use for compressed output? I think the TeraSort code that we're using is public domain, rather than our proprietary code.

Thanks.

Re: Spark on YARN: codec params

Expert Contributor

Sorry, I missed that you mentioned you were using TeraSort. It isn't provided with Spark; are you referring to https://github.com/ehiggs/spark-terasort?

If so, it provides a custom OutputFormat that would not compress the data even with those configurations set. The TextOutputFormat included with Spark/Hadoop does use the configs to compress its output.
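
For reference, the compression settings that Hadoop's file-based output formats read can be sketched as below. This is only an illustration, not the configuration used in this thread: the application name and the choice of SnappyCodec are assumptions, and the `spark.hadoop.` prefix is Spark's way of passing properties through to the Hadoop configuration.

```scala
import org.apache.spark.SparkConf

object CodecParams {
  // Properties prefixed with "spark.hadoop." are copied by Spark into the
  // Hadoop Configuration that output formats such as TextOutputFormat read.
  def compressedOutputConf(): SparkConf =
    new SparkConf()
      .setAppName("codec-params-sketch") // hypothetical application name
      .set("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
      // Assumption: Snappy is chosen for illustration; any codec class
      // available on the cluster could be named here instead.
      .set("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec")
}
```

The same two keys can also be passed on the command line with `--conf` when submitting the job. An OutputFormat that never consults them, as described above, will ignore them either way.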

Re: Spark on YARN: codec params

Contributor

Yes, this is the TeraSort we're using, with its custom output format. So, if I understand correctly, you're suggesting that I change the last line from TeraOutputFormat to TextOutputFormat, and the configurations will then take effect?

Thanks.

Re: Spark on YARN: codec params

Expert Contributor

I'm not completely familiar with this library or why it uses a custom OutputFormat. It appears the goal is to keep the file as compact as possible (no whitespace), so it looks like you could simply replace TeraOutputFormat with org.apache.hadoop.mapreduce.lib.output.TextOutputFormat within TeraGen when producing the data. You would need to make an equivalent change to the TeraInputFormat within TeraSort if you wanted to read the data produced in the previous step.
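
A rough sketch of what that swap could look like with saveAsNewAPIHadoopFile follows. It is not taken from the spark-terasort code: the object name, the sample records, and the Gzip codec are assumptions made for illustration. TextOutputFormat writes anything that is not a Text via toString, so the records are converted to Text here, which is also an assumption about how you would adapt the data.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
import org.apache.spark.{SparkConf, SparkContext}

object CompressedTextOutput {
  def main(args: Array[String]): Unit = {
    val outputPath = args(0) // output directory supplied by the caller
    val sc = new SparkContext(new SparkConf().setAppName("CompressedTextOutput"))

    // Job.getInstance copies the context's Hadoop configuration, so the
    // compression settings below apply only to this save call.
    val job = Job.getInstance(sc.hadoopConfiguration)
    FileOutputFormat.setCompressOutput(job, true)
    FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])

    // Hypothetical sample data standing in for the generated records.
    // Text objects are created immediately before the save, with no
    // shuffle in between, because Text is not Java-serializable.
    val records = sc.parallelize(Seq("key1" -> "value1", "key2" -> "value2"))
      .map { case (k, v) => (new Text(k), new Text(v)) }

    records.saveAsNewAPIHadoopFile(
      outputPath,
      classOf[Text],
      classOf[Text],
      classOf[TextOutputFormat[Text, Text]],
      job.getConfiguration)

    sc.stop()
  }
}
```

With Gzip the part files in the output directory should carry a `.gz` suffix, which is a quick way to check that compression took effect.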