
Spark: Coalesce or CombineFileInputFormat



New Contributor


I have a use case where I have to handle millions of small files in Spark. The files are stored in S3, and I have no control over the application that generates them. So far I have found three options for handling small files in Spark:

1. S3DistCp

2. Using Coalesce

3. Using CombineFileInputFormat


Here is my performance analysis of all 3:

S3DistCp: Inefficient, since it copies all the data locally and then moves it back to S3.

Coalesce: From the code, it looks like it groups partitions arbitrarily, so it might not work in some scenarios. I couldn't find a way to partition the data so that all partitions are equally sized.

CombineFileInputFormat: From the code, it looks like it should work well, but I would like to hear about the community's experience with it.


Q. What are the performance implications of Coalesce and CombineFileInputFormat? When should I pick one over the other? Is there any way to use coalesce to get equally sized partitions?




Re: Spark: Coalesce or CombineFileInputFormat

Expert Contributor

The two will perform similarly, since they rely on similar mechanics. Both require some data to be read across the network, but that is unavoidable whenever files are combined.
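
To make the comparison concrete, here is a rough PySpark sketch of both routes. It assumes `sc` is an existing SparkContext and `spark` an existing SparkSession, that the input is plain text, and that `CombineTextInputFormat` (the text flavor of CombineFileInputFormat that ships with Hadoop) is on the classpath. The helper names, the path you pass in, and the 128 MB cap are placeholders I made up for illustration, not tuned values.

```python
# Sketch only: `sc` and `spark` are passed in by the caller, so this file
# has no Spark import of its own. Helper names and defaults are hypothetical.

COMBINE_TEXT_INPUT_FORMAT = (
    "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat"
)


def combine_conf(max_split_bytes):
    """Hadoop setting that caps how many bytes go into one combined split."""
    return {
        "mapreduce.input.fileinputformat.split.maxsize": str(max_split_bytes),
    }


def read_combined(sc, path, max_split_bytes=128 * 1024 * 1024):
    """Route 1: pack many small files into each split while reading.

    Returns an RDD of text lines.
    """
    pairs = sc.newAPIHadoopFile(
        path,
        COMBINE_TEXT_INPUT_FORMAT,
        "org.apache.hadoop.io.LongWritable",  # key: byte offset of the line
        "org.apache.hadoop.io.Text",          # value: the line itself
        conf=combine_conf(max_split_bytes),
    )
    return pairs.values()


def read_coalesced(spark, path, num_partitions):
    """Route 2: read normally, then merge partitions without a full shuffle.

    Returns a DataFrame with a single string column named `value`.
    """
    return spark.read.text(path).coalesce(num_partitions)
```

In the first route the grouping happens when the input splits are computed, and the split size is bounded by the configured maximum. In the second, Spark plans the read first and then merges the resulting partitions down to the requested count.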


I would suggest keeping it simple, since the difference between the two will be minimal. Spark supports coalesce out of the box, while CombineFileInputFormat takes a bit more work to set up, so you may want to use coalesce.
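
If you go with coalesce, something along these lines may be a reasonable starting point. It is only a sketch: it assumes you already know the total input size in bytes (for example from a listing of the bucket), and the function names and the 128 MB target are hypothetical. On the equal-size question, coalesce avoids a full shuffle and does not promise even partitions. `repartition` does a full shuffle and spreads rows roughly evenly, at the cost of moving more data.

```python
import math


def target_partition_count(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """How many partitions are needed so each holds about the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def compact(df, total_bytes, balanced=False):
    """Reduce the partition count of an existing DataFrame `df`.

    balanced=False: coalesce, no full shuffle, partition sizes may be uneven.
    balanced=True:  repartition, full shuffle, rows spread roughly evenly.
    """
    n = target_partition_count(total_bytes)
    # Note: coalesce never increases the partition count; if `df` already
    # has fewer than `n` partitions, it is left as it is.
    return df.repartition(n) if balanced else df.coalesce(n)
```

For example, with about 10 GiB of input and the default target, `target_partition_count(10 * 1024**3)` works out to 80 partitions.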
