When we run an MR job on a large dataset, the Mapper generates a large amount of intermediate data, and the framework passes this intermediate data to the Reducer for further processing. This can lead to enormous network congestion.
The MapReduce framework provides a function known as the Combiner that plays a vital role in reducing this network congestion. In an MR job, the Combiner performs local aggregation on the mapper output. This helps minimise the data transferred between the mapper and the reducer, and therefore increases the efficiency of a MapReduce program.
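To make the idea concrete, here is a minimal plain-Java sketch of local aggregation, not the actual Hadoop Combiner API; the class and method names are illustrative. The mapper emits one (word, 1) pair per occurrence, and the combiner sums counts per key locally, so far fewer records would need to cross the network to the reducer.

```java
import java.util.*;

public class CombinerSketch {
    // Simulated mapper output: one (word, 1) pair per word occurrence.
    static List<Map.Entry<String, Integer>> map(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) pairs.add(Map.entry(w, 1));
        return pairs;
    }

    // Combiner: sum counts per key locally, before anything is shuffled.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> summed = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            summed.merge(p.getKey(), p.getValue(), Integer::sum);
        return summed;
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "c", "a", "b"};
        List<Map.Entry<String, Integer>> raw = map(words);
        Map<String, Integer> combined = combine(raw);
        System.out.println(raw.size());       // 6 records without a combiner
        System.out.println(combined.size());  // 3 records after local aggregation
        System.out.println(combined);         // {a=3, b=2, c=1}
    }
}
```

In real Hadoop code the same summing logic would live in a Reducer subclass registered as the job's combiner; the point here is only that local aggregation shrinks the intermediate data before the shuffle.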
Back to the original question: whether a Combiner function is invoked depends largely on the size of the input file. The larger the input data, the larger the intermediate output from the Mapper. If this entire output were sent directly to the reducer, it would take more time to process such a large amount of data.
So, basically, the combiner is invoked to reduce the intermediate output from the mapper, so that the reducer has to process less data and can produce the final output in less time. The combiner can be executed zero, one, or many times, so a given MR job must not depend on how often the combiner runs and should always produce the same result. The number of combiner invocations is not predefined; it can be zero or more, depending on the size of the data.
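The "zero, one, or many times" requirement can be checked directly: if the combiner's operation is associative and commutative (like summing counts), extra combiner passes cannot change the final result. The sketch below is illustrative plain Java, not Hadoop API code.

```java
import java.util.*;

public class CombinerRuns {
    // Both the combiner and the reducer sum values per key. Summing is
    // associative and commutative, so applying it 0, 1, or N times
    // before the final reduce yields the same totals.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        return out;
    }

    static List<Map.Entry<String, Integer>> toPairs(Map<String, Integer> m) {
        return new ArrayList<>(m.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOut = List.of(
            Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1), Map.entry("a", 1));

        // Reduce directly (combiner never ran), after one combiner pass,
        // and after two combiner passes:
        Map<String, Integer> zeroRuns = sumByKey(mapperOut);
        Map<String, Integer> oneRun   = sumByKey(toPairs(sumByKey(mapperOut)));
        Map<String, Integer> twoRuns  = sumByKey(toPairs(sumByKey(toPairs(sumByKey(mapperOut)))));

        System.out.println(zeroRuns.equals(oneRun) && oneRun.equals(twoRuns)); // true
    }
}
```

This is also why an operation like computing an average cannot be used directly as a combiner: averaging averages gives a different answer than averaging the raw values.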
There is no specific rule in Hadoop about how many times a combiner will be called. It may not be called at all, or it may run once, twice, or more, depending on the number and size of the output files generated by the mapper.