I have a use case where I have to handle millions of small files in Spark. The files are stored in S3 and I have no control over the application generating them. So far I have found 3 options for handling small files in Spark:

1. Using S3DistCp
2. Using Coalesce
3. Using CombineFileInputFormat
Here is my performance analysis of all 3:
S3DistCp: Inefficient, since it copies all the data to the cluster only to write it back to S3.

Coalesce: From the code, it looks like it merges existing partitions without regard to their sizes, so it might not work in some scenarios. I couldn't find a way to partition the data so that all partitions are equi-sized.

CombineFileInputFormat: From the code, it looks like it should work well, but I would like to hear from the community's experience with it.
Q. What are the performance implications of Coalesce and CombineFileInputFormat? When should I pick one over the other? Is there any way to use coalesce to get equi-sized partitions?
The performance of the two will be similar, as the mechanics are similar: both need to read some data across the network, and combining files will always require this.

I would suggest keeping it simple, since the difference between the two should be minimal. Spark supports coalesce out of the box, while CombineFileInputFormat takes a bit more work to wire up, so you may want to use coalesce.