I have a map-only MapReduce job that runs against the same local input files on 2 different HDP clusters. Both clusters have exactly the same config.
In cluster1, HDFS: Number of Read Operations = number of mappers * 9
In cluster2, HDFS: Number of Read Operations = number of mappers * 10.
The job on cluster1 runs ~30% faster than the one on cluster2.
The above multiplication factors stay the same even when the number of mappers is increased or decreased. I have checked pretty much all the other configs and couldn't find any difference in the configs or files.
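To be explicit about how I get the factor: it is just the job-level "HDFS: Number of Read Operations" counter divided by the number of mappers. A minimal sketch, assuming the two totals are copied by hand from the job's counter summary; the function name `read_ops_per_mapper` and the sample totals are hypothetical and only illustrate the arithmetic, they are not measured values.

```python
def read_ops_per_mapper(hdfs_read_ops: int, num_mappers: int) -> float:
    """Return the average number of HDFS read operations per mapper."""
    if num_mappers <= 0:
        raise ValueError("num_mappers must be positive")
    return hdfs_read_ops / num_mappers


# Illustrative totals only (assumed, not taken from a real job run).
print(read_ops_per_mapper(900, 100))   # cluster1-style ratio: 9.0
print(read_ops_per_mapper(1000, 100))  # cluster2-style ratio: 10.0
```

On both clusters this ratio comes out as a whole number and does not change with the mapper count, which is why I suspect one extra HDFS read call per task on cluster2.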
I am curious to know what determines this multiplication factor (9 on cluster1 and 10 on cluster2). I have already spent some time trying to get the performance equal on both clusters. Any input is appreciated. Thanks.