06-08-2018 01:20 PM
Is there a routine or best practice to determine the right number of output partitions?
When you do a write, Spark writes out one file per partition of the data being written, so you get as many output files as there were partitions. If the resultant output is small, you can end up with lots of small files that waste space and clog up the NameNode.
Spark provides the coalesce and repartition commands to remedy this; however, you have to determine the number of partitions to repartition or coalesce to. That requires knowing the end data size AFTER compression.
The easy but expensive way is to write out the data and divide the total file size by 128 MB. The more efficient way is to get the data size while in memory (doable), estimate the result after compression (also doable, but questionable), and divide that by 128 MB. It's clunky.
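For what it's worth, here's a minimal sketch of that arithmetic. The helper name `estimate_partitions` and the 30% compression ratio are my own placeholders, not anything Spark provides; the ratio in particular varies a lot by format and codec, so treat the result as a starting point:

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2  # target file size: 128 MB (typical HDFS block size)

def estimate_partitions(in_memory_bytes, compression_ratio=0.3):
    """Estimate how many output partitions yield ~128 MB files.

    compression_ratio is a guess at (on-disk size / in-memory size);
    it depends heavily on your format and codec, so this is only a
    rough starting point, not an exact answer.
    """
    on_disk_bytes = in_memory_bytes * compression_ratio
    return max(1, math.ceil(on_disk_bytes / BLOCK_SIZE))

# Example: a cached DataFrame occupying ~10 GB in memory
# (you can read the in-memory size off the Spark UI's Storage tab)
n = estimate_partitions(10 * 1024 ** 3)  # -> 24 at the assumed 30% ratio
# df.coalesce(n).write.parquet(path)     # then coalesce before writing
```

It still leaves the hard part (a trustworthy compression estimate) unsolved, which is exactly why I'm hoping someone has a better approach.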
With so many users likely to hit this, I expected to find guidance or even stock code to make the calculation, but no luck so far. Any insights would be appreciated.