
Handling Small Files in Spark

How can we handle the small-files issue in Spark?

Based on my understanding, we can use coalesce or repartition to reduce the number of small files being created. But the problem is that I won't know in advance how much data I will be getting on a daily basis. In that case, a fixed partition count will sometimes work and sometimes won't.
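One common workaround for an unknown daily volume (an assumption on my part, not something stated in the question) is to compute the partition count at runtime from the actual input size, so the number of output files adapts each day. A minimal sketch, where the 128 MB target file size and the `target_partitions` helper name are my own choices:

```python
def target_partitions(input_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Return how many output files of roughly target_file_bytes the input needs."""
    # Ceiling division, with a floor of 1 so empty input still yields one file.
    return max(1, -(-input_bytes // target_file_bytes))

# Example: 1.3 GB of input at a 128 MB target -> 11 output files.
print(target_partitions(1300 * 1024 * 1024))  # -> 11

# In Spark you would obtain input_bytes from the source (for example via
# Hadoop's FileSystem.getContentSummary on the input path) and then:
#   df.repartition(target_partitions(input_bytes)).write.parquet(out_path)
```

This keeps file sizes near the target regardless of how much data arrives, at the cost of one extra pass to measure the input.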

Also, reduceBy, distributeBy, and clusterBy can help. But again, I want the data to be distributed evenly: if I distribute the data based on some fields, a few distribution keys will be skewed. Is there any other way, or a best approach, to handle small files here, such that files are created as needed but we don't end up with lots of small files?
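For the skewed-key concern, one technique (again my suggestion, not from the original post) is "salting": append a random suffix to the hot key so its rows spread over several partitions instead of piling into one. A sketch, where `NUM_SALTS` and the helper name are assumptions:

```python
import random

NUM_SALTS = 8  # assumed fan-out per key; tune to the observed skew

def salted_key(key: str, num_salts: int = NUM_SALTS) -> str:
    """Turn 'hot_key' into e.g. 'hot_key_3', spreading it over num_salts buckets."""
    return f"{key}_{random.randrange(num_salts)}"

# In Spark this corresponds to repartitioning on a derived salt column, e.g.:
#   from pyspark.sql.functions import rand
#   df.withColumn("salt", (rand() * NUM_SALTS).cast("int")) \
#     .repartition("key", "salt")
```

The trade-off is that any aggregation over the original key then needs a second pass to merge the per-salt partial results.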
