Eric,

Thanks for the quick response. We were taking a rather simplistic measurement: the total runtime shown on the Hadoop JobTracker page, so the times are all-inclusive. Our input sources are approximately 50GB in size. The test that runs faster, the "homegrown" container, reads numerous files, while the true Avro container input is a single 50GB file, which added to my surprise; I expected the single large file to handily outperform many smaller files. We are going to rerun the test, this time with one big file for the "homegrown" case as well. Again, I suspect something is wrong somewhere, as the numbers just don't make sense.

Thanks,
Jason
We are currently in the early stages of our Hadoop implementation and are trying to make decisions about how we will store our data. So far we have tried two formats: a custom wrapper around Protobuf-encoded objects written into Avro container files, and straight Avro container files containing Avro-encoded objects. The former packs 64MB of Protobuf data into each record and serves no other purpose; we simply use custom code in each Mapper to deserialize the 64MB payload into an array of objects.

Currently we are seeing, much to my surprise, a 100% performance benefit from the custom wrapper approach. This leads me to believe there is something wrong in the Avro job configuration that causes twice the amount of deserialization, but I have no idea where to start; documentation and examples are sparse at best. We use the new mapreduce API in our code.

Any ideas on how to begin diagnosing this issue would be greatly appreciated. I can provide code samples once I know which pieces of the code may be relevant to the problem.
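For reference, here is a minimal sketch of how we understand a straight Avro container job is normally wired up with the new mapreduce API, in case the problem is in this configuration. MyRecord is a placeholder for a generated Avro class, and the mapper body is illustrative only; the class names come from the avro-mapred library.

import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AvroContainerJob {

  // Mapper receives one deserialized Avro record per map() call.
  public static class MyMapper
      extends Mapper<AvroKey<MyRecord>, NullWritable, Text, NullWritable> {
    @Override
    protected void map(AvroKey<MyRecord> key, NullWritable value, Context context)
        throws java.io.IOException, InterruptedException {
      MyRecord record = key.datum();
      // Placeholder processing: emit the record's string form.
      context.write(new Text(record.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "avro-container-read");
    job.setJarByClass(AvroContainerJob.class);

    // Input: Avro container files, read through AvroKeyInputFormat with the reader schema.
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, MyRecord.getClassSchema());
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Map-only job writing plain text output.
    job.setMapperClass(MyMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If our setup deviates from this pattern (for example, in how the input schema is set or which input format is used), that would be a good place to start looking.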