I wrote a query to copy a very large table (ORC compressed -> ORC compressed).
The query took a very long time to run, and I noticed that only 238 reducers were spun up, which was the bottleneck: the average reduce time was almost 20 minutes, whereas the average map time was only 8 minutes (874 mappers).
In the end, the new copy of the table is stored across 238 files in HDFS, which explains the 238 reducers.
But I'm trying to understand why 238.
I see that the HDFS files backing the table follow an interesting pattern: they alternate between roughly 530 MB and roughly 1.6 GB.
I have my "max reducers" setting in Hive set to 416, which is 13 data nodes x 16 disks per node x 2.
So, I would expect to see 416 reducers.
However, I also see a "data per reducer" setting in Hive. It looks like someone recently increased this to 1GB.
The table I copied is about 238GB, so it makes sense to me that, based on this setting, I got 238 reducers.
So my question is: what is the relationship between:
hdfs block size (ours is set to 64 MB)
max reducers (ours is 416)
and "data per reducer"?
It seems like "data per reducer" trumps everything, and the hdfs block size is not taken into account at all.
Should "data per reducer" be the same as the hdfs block size, or a small multiple of it?
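For what it's worth, the arithmetic I'm assuming can be sketched like this (a minimal Python sketch of the documented behavior, not Hive's actual source; the properties behind "data per reducer" and "max reducers" are hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max):

```python
import math

# Rough sketch of how Hive picks a reducer count when one isn't
# hard-coded (e.g. via mapreduce.job.reduces): divide the input size
# by hive.exec.reducers.bytes.per.reducer, then cap the result at
# hive.exec.reducers.max.
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return min(max(wanted, 1), max_reducers)

GB = 2 ** 30
# 238 GB table, 1 GB per reducer, max 416 -> 238 reducers,
# matching what I observed.
print(estimate_reducers(238 * GB, 1 * GB, 416))  # 238
```

With those numbers the 1 GB "data per reducer" setting kicks in well before the 416 cap does, which is consistent with the 238 files I ended up with.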
You are right that the hdfs block size is completely ignored here. Every reducer (and the reducer count is determined either by the data-per-reducer setting or by a hard-coded reducer count) writes one file. Block size determines the number of mappers. (Tez also has grouping sizes that can merge several blocks into one task.)
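To make the mapper side concrete, here is a hedged sketch of that relationship (assumed behavior: one split per HDFS block, optionally merged into larger grouped splits by Tez; the ~280 MB group size below is just an illustrative value I picked, not a measured setting):

```python
import math

# Sketch: block size drives the raw split count, and Tez grouping
# (tez.grouping.min-size / tez.grouping.max-size) may merge several
# blocks into one mapper task.
def estimate_mappers(total_bytes, block_bytes, group_bytes=None):
    if group_bytes and group_bytes > block_bytes:
        # Tez merges blocks until each grouped split reaches the target
        return math.ceil(total_bytes / group_bytes)
    return math.ceil(total_bytes / block_bytes)

MB, GB = 2 ** 20, 2 ** 30
# 238 GB of input with 64 MB blocks -> 3808 raw splits...
print(estimate_mappers(238 * GB, 64 * MB))  # 3808
# ...but grouping toward ~280 MB per task lands near the 874 mappers
# seen in the question.
print(estimate_mappers(238 * GB, 64 * MB, 280 * MB))  # 871
```

So the block size only matters on the map side, and even there Tez grouping can override the one-mapper-per-block default.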
For more background on fast ORC creation, this presentation I made should help: