1) While are using a custom InputFormat which extends ‘org.apache.hadoop.mapred.FileInputFormat’ and having ‘isSplitable’ as false.
Expected : 7 splits [As FileInputFormat doesn't split file smaller than blockSize (128 MB) so there should be one split per file]
Actual: 4 splits
2) Default value for 'hive.input.format' is CombineHiveInputFormat.
After setting ‘set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;’, there are 7 splits as expected.
From above two points, it looks hive uses ‘CombineHiveInputFormat’ on top of the custom InputFormat to determine number of splits.
How splits were calculated:
For deciding the number of mappers when using CombineInputFormat, data locality plays a role. Now to find where those files belong we can get it from command:
Hive is picking up blocks from these 4 DNs. Files on 1 DN are combined into 1 task. If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behavior in Hadoop: each block is a locally processed split.
The reason it has picked the first block location for each blocks while combining is any Hadoop Client will use the first block location and will consider the next only if reading the first fails. Usually NameNode will return the block locations of a block sorted based upon the distance between the client and location. NameNode will give all block locations but CombineHiveInputFormat / Hadoop Client / MapReduce Program uses the first block location.