@Eugene Koifman That helped reduce the spilled_rows from 11 billion to 5 billion. I was under the impression that inserting data into a partition is faster with a distribute by. This was useful.
Also, I heard compressing the intermediary files helps reduce the spilled_rows. Is that correct?
set
mapreduce.map.output.compress = true
set
mapreduce.output.fileoutputformat.compress = true
Or anything else we can do to optimize the query?