07-07-2014 08:23 AM
Has anyone been using hive.hadoop.supports.splittable.combineinputformat with Hive 0.10 (CDH4.3.2) and non-indexed (non-splittable) LZO-compressed files?
We recently tried out this parameter with non-indexed LZO files, and at first it appeared to work very well: it reduced the number of mappers needed to read the input data and did a good job of merging data into files of roughly the max input split size.
Unfortunately, we then found that some Hive queries run against non-indexed LZO files with this parameter enabled returned different results than they did without it.
Here are a couple of older JIRA issues that seemed to indicate we could use this parameter with non-splittable compressed files.
Perhaps this is a new bug or we did something wrong.
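For reference, here is a minimal sketch of the kind of session settings involved. Only hive.hadoop.supports.splittable.combineinputformat comes from the description above; the input format line and the split size value are assumptions for illustration, not our exact configuration.

```sql
-- Parameter under discussion: whether to combine small input files
-- so that fewer mappers are spawned.
SET hive.hadoop.supports.splittable.combineinputformat=true;

-- Assumption: the combining input format is in use (it is the Hive default).
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Assumption: an illustrative maximum split size of 256 MB, in bytes.
SET mapred.max.split.size=268435456;
```

Setting the first parameter back to false and re-running the same query is a simple way to check whether the results differ.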
07-07-2014 09:15 AM
We found this Jira: https://issues.apache.org/jira/browse/MAPREDUCE-5537
It was filed to fix a bug that seems similar to what we saw; however, that was back in Hive 0.8.
Perhaps the bug made its way back into Hive 0.10?