I have a basic question about the number of mappers. I want to understand how the number of mappers is determined.
I have a partitioned sequential Hive table. The table is partitioned by date and cut-off number.
I read 3 days of data, about 634 GB in total. With a split size of 256 MB I expected 2536 mappers, and the system created 2304, which is close to the expected number.
When I increased the split size to 1 GB, I expected 634 mappers and 622 were created, again close to the expected number.
But when I increased the split size to 2 GB, I expected around 300 mappers, and the actual count was 512.
Could you help me understand why increasing the split size did not reduce the number of mappers as expected?
Is it because the last file, whose size is less than the logical split size, creates a separate mapper, or is there something else?
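For reference, the expected counts above are simply the total input size divided by the split size. A minimal Python sketch of that arithmetic follows, together with a second estimate that assumes every file is split on its own, so that a file (or the tail of a file) smaller than the split size still gets its own mapper. The function names are made up for illustration, and the per-file rule is an assumption about how the input format behaves, not something confirmed for this table.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def expected_mappers(total_bytes: int, split_bytes: int) -> int:
    """Naive estimate: total input size divided by the split size."""
    return ceil_div(total_bytes, split_bytes)


def per_file_mappers(file_sizes: list[int], split_bytes: int) -> int:
    """Estimate assuming a split never spans two files (assumption):
    each file contributes at least one mapper, however small it is."""
    return sum(ceil_div(size, split_bytes) for size in file_sizes if size > 0)


if __name__ == "__main__":
    total = 634 * GB
    for split in (256 * MB, 1 * GB, 2 * GB):
        print(split // MB, "MB split:", expected_mappers(total, split), "mappers")

    # Hypothetical layout: four files of 1.5 GB each (6 GB in total).
    files = [3 * GB // 2] * 4
    print("naive:", expected_mappers(sum(files), 2 * GB))
    print("per file:", per_file_mappers(files, 2 * GB))
```

Under the per-file assumption, the mapper count can never drop below the number of non-empty files, which is one possible reason a larger split size stops helping beyond a certain point.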
I would like to understand this because I have around 14 TB of data to process, which creates around 80,000 mappers, and the application master is not created at all. I am analyzing how to reduce the number of mappers.
Could you suggest any other options to reduce the number of mappers? The map and reduce sizes are 6 GB and 8 GB, and the application master is 10 GB.
One more question: would the number of mappers be any different when the table is accessed through HCatalog?
Is this ORC data? To get the Application Master to run, you can set the following property and then run the query. This should reduce the memory pressure on the AM, which has to go through all the HDFS files to figure out how many mappers are needed.
How many partitions do you have in total for the 14 TB of data?
hive.exec.orc.split.strategy
Default Value: HYBRID
Added In: Hive 1.2.0 with HIVE-10114
What strategy ORC should use to create splits for execution. The available options are "BI", "ETL" and "HYBRID".
The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.
Another article on how mappers are determined: