A quick question: I've been running some tests with both Hive on Tez and Hive on MapReduce, reading compressed (zip) input files from an external table. (I am using HDP 2.5.3.)
With Tez as the execution engine, Hive can merge the compressed input files once I tune the tez.grouping.* parameters. With MR as the execution engine, however, I cannot get the input files merged with CombineHiveInputFormat, even after setting the mapreduce.input.fileinputformat.split.* parameters.
Long story short: Tez can merge zip files, whereas MR cannot? Is there any input format that will allow MR to merge the compressed input files?
Thank you very much for taking a look at my question!
MapReduce runs fine. The problem is that it generates as many mappers as there are input files, which adds a lot of overhead to the processing time: for 50 files it generates 50 mappers, while Tez generates around 15 (in accordance with the tez.grouping.* parameters that I have defined).
When you run the Hive query with MR as the engine, can you please verify which file input format is being used?
What you need to do is explicitly set CombineFileInputFormat in the Hive query. The choice of combine format will depend on the type of your files; for text, for example, you can use CombineTextInputFormat.
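As a rough sketch of what that could look like in a Hive session: the statements below first print the input format currently in effect, then switch to the CombineHiveInputFormat mentioned in the question and set split-size bounds. The byte values are placeholders picked for illustration, not recommendations, and whether compressed files actually get combined may still depend on the codec and the Hive version.

```sql
-- Run these in the same Hive session, before the query itself.
set hive.execution.engine=mr;

-- With no value given, this prints the input format currently in effect.
set hive.input.format;

-- Ask Hive to combine small input files into fewer splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Split-size bounds in bytes. 256 MB and 128 MB are placeholder values
-- chosen for this example only; tune them to your file sizes.
set mapreduce.input.fileinputformat.split.maxsize=268435456;
set mapreduce.input.fileinputformat.split.minsize.per.node=134217728;
set mapreduce.input.fileinputformat.split.minsize.per.rack=134217728;
```

If the job still launches one mapper per file after this, comparing the number of splits reported in the job log against the number of input files should show whether the combine format was actually applied.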