We are exploring options outside of Hive for processing some truly huge XML records. We keep running into OOM errors all over the place, even with 16GB containers. Currently we store our ginormous XML inside Avro sequence files in a Hive table, and we have an MR job that uses HCatInputFormat to read the data. Performance is subpar at best: because our containers are so large we can only run 8 per node, and even so we OOM often enough to warrant increasing them further.
We have ideas for writing out to ORC for the destination data (which will still be in Hive) so that the output side is less likely to error or OOM, so our focus is now on the input side and reducing memory allocation there. Our current thinking is to move to a streaming, SAX-style parse of the XML so that the entire record is never resident in memory on the input side. The actual result object that comes out of the parse will be, and the outbound serialization will remain. Looking through the various InputFormats in Hadoop, I think that, the way things are written, there are at least 2 copies of the record in RAM long before we see it in the mapper and make our own copy plus the subsequent serialization copy. Sadly, I've traced this copying all the way down into the sequence file readers themselves, and I don't see any good way to remove it and still use a sequence file.
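To make the idea concrete, here is a minimal sketch of the kind of streaming parse we have in mind. It uses the JDK's StAX pull parser (javax.xml.stream) rather than SAX callbacks, purely for brevity. StreamingXmlScan and countElements are made-up names, and counting element names is only a stand-in for the real work of building the result object.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Hadoop, Hive, or HCatalog.
final class StreamingXmlScan {
    private StreamingXmlScan() {}

    /**
     * Pulls parse events straight off the stream and counts start tags per
     * element name. The document is never materialized as a String or byte[].
     */
    static Map<String, Integer> countElements(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities; the records are treated as plain data.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        // Leave coalescing off so the parser may deliver large text nodes in pieces
        // instead of gluing them into one big string.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);

        Map<String, Integer> counts = new LinkedHashMap<>();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    counts.merge(reader.getLocalName(), 1, Integer::sum);
                }
            }
        } finally {
            reader.close(); // releases parser resources; does not close the InputStream
        }
        return counts;
    }
}
```

The method takes any InputStream, so in principle it could be fed whatever stream a thin custom RecordReader exposes. On the parsing side, memory use should then depend on the parser's buffers and whatever we choose to keep, not on the size of the record.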
So my current thought is to move all the work down into a custom InputFormat that is the thinnest possible layer around an InputStream. Looking into the FileInputFormat classes, most of them look unusable as well, since they typically use LineRecordReader underneath, and that does a full byte copy of the entire record.
With that stated, is there a way to unit test an InputFormat/RecordReader? MRUnit doesn't have anything obvious for testing an InputFormat.
Just some thoughts here, before you write your own custom input format. It seems like your XML file is incredibly large and you want to prevent the entire XML from being loaded into memory. StAX is definitely one way, but you lose parallelism. If you want to preprocess your XML, try: