Based on my understanding, we can use coalesce or repartition to reduce the number of small files being created. But the problem is that I won't know how much data I will be getting on a daily basis. In that case, the number of partitions we specify will work on some days and not on others.
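One common workaround (a sketch, not something from your pipeline) is to derive the partition count from the day's input size instead of hard-coding it, so output files land near a target size. The helper name and the 128 MB target below are assumptions for illustration:

```python
import math

# Hypothetical helper: pick a repartition count from the estimated input
# size so each output file is roughly target_file_bytes, rather than using
# a fixed partition count that only suits some days.
def target_partitions(input_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    # At least one partition; round up so files stay at or below the target.
    return max(1, math.ceil(input_bytes / target_file_bytes))

# Example: a 5 GB day needs more partitions than a 50 MB day.
print(target_partitions(5 * 1024 ** 3))   # 40
print(target_partitions(50 * 1024 ** 2))  # 1
```

You would then pass this computed value to `repartition(n)` (or `coalesce(n)` when only reducing partitions) before the write, using the input size estimated from the source files.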
reduceBy, distributeBy, and clusterBy will also help. But I still want the data to be well distributed: if I distribute the data based on some fields, a few distribution keys will be skewed. Is there any other way or best approach to handle the small files here, such that files are created as needed but we don't end up with many small files?
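For the skewed-key concern, one technique people often suggest (an assumed approach, not from your setup) is salting: append a small random suffix to the hot key so its rows spread across several buckets instead of one oversized partition. A minimal pure-Python sketch of the idea:

```python
import random

# Number of salt buckets to spread a hot key across (tuning assumption).
N_BUCKETS = 8

def salted_key(key: str, n_buckets: int = N_BUCKETS) -> str:
    # Append a random bucket id so identical keys hash to different partitions.
    return f"{key}_{random.randrange(n_buckets)}"

# A heavily skewed key: all 100 rows share the same distribution key.
rows = ["hot_key"] * 100
distinct_salted = {salted_key(k) for k in rows}

# The single hot key is now spread over up to N_BUCKETS salted variants,
# so no one partition receives all of its rows.
print(len(distinct_salted) <= N_BUCKETS)
```

In Spark you would add the salt as an extra column before distributing, then drop it after the write. Spark 3.x's Adaptive Query Execution can also coalesce small shuffle partitions automatically, which addresses the same problem without manual salting in many cases.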