Merge compressed input files

Explorer

One quick question. I've been running some tests, with both Hive on Tez and Hive on MapReduce, on reading compressed (zip) input files from an external table. (I am using HDP 2.5.3.)

With Tez as the execution engine, Hive can merge the compressed input files when I tune the tez.grouping.* parameters. With MR as the execution engine, however, I cannot get the input files merged, even when using CombineHiveInputFormat together with the mapreduce.input.fileinputformat.split.* parameters.
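For readers who want to reproduce the comparison, the two setups described above might look roughly like the following session settings. This is only a sketch: the property names are standard Hive, Tez, and Hadoop settings, but the byte values are illustrative assumptions rather than the values used in these tests, and whether the MR settings actually combine compressed files depends on the Hive version and the codec.

```sql
-- Setup A: Tez, with split grouping bounded by size.
-- The byte values are assumed for illustration only.
SET hive.execution.engine=tez;
SET tez.grouping.min-size=67108864;
SET tez.grouping.max-size=268435456;

-- Setup B: MapReduce, with the combining input format and split size bounds.
-- The byte values are assumed for illustration only.
SET hive.execution.engine=mr;
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapreduce.input.fileinputformat.split.minsize=67108864;
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
```

Running the same query once under each setup and comparing the number of map tasks shows whether the input files were grouped.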

Long story short: can Tez merge zip files, whereas MR cannot? Is there any input format that allows MR to merge the compressed input files?

Much appreciated!

3 REPLIES
Re: Merge compressed input files

@Void Messiah

Could you please let us know whether the MapReduce job fails, or whether it runs but does not merge the files?

Re: Merge compressed input files

Explorer

Hi Sindhu,

Thank you very much for taking notice of my question!

MapReduce runs fine. The problem is that it generates as many mappers as there are input files, which adds significant overhead to the processing time: for 50 files it generates 50 mappers, while Tez generates around 15 (in accordance with the tez.grouping.* parameters that I have defined).

Re: Merge compressed input files

When you run the Hive query with MR as the engine, can you please verify which input format is being used?
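One way to check this, as a rough sketch: issuing SET with a property name and no value prints its current setting, and DESCRIBE FORMATTED lists the InputFormat declared for the table. The table name my_external_table is a placeholder, not the poster's actual table.

```sql
-- Print the input format Hive will use for the current session.
SET hive.input.format;

-- Print the engine currently in effect (mr or tez).
SET hive.execution.engine;

-- Show table metadata, including its declared InputFormat.
-- my_external_table is a hypothetical name.
DESCRIBE FORMATTED my_external_table;
```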

What you need to do is explicitly set a CombineFileInputFormat in the Hive query. The choice of combine format depends on the type of your files; for example, for text you can use CombineTextInputFormat.
