Created 09-28-2015 01:07 PM
When using Hive (v.14) on Avro, org.apache.avro.file.DataFileReader throws "java.io.IOException: Not a data file" when it encounters a 0-byte file. The 0-byte file is the result of file rotation during Storm bolt writes to HDFS.
"The issue is that org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader creates a new org.apache.avro.file.DataFileReader, and DataFileReader throws an exception when trying to read an empty file (because the empty file lacks the magic number marking it as Avro). It seems like it would be straightforward to modify AvroGenericRecordReader to detect an empty file and then behave sensibly. For example, next() would always return false, getPos() would return zero, etc."
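To make the quoted idea concrete, here is a minimal sketch in plain Java of a guard that checks the file length before opening the real reader. RecordSource, EmptyFileGuard and Opener are hypothetical names made up for illustration; this is not the Hive AvroGenericRecordReader code, and it uses java.nio on a local path as a stand-in for HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical minimal reader contract, loosely modelled on next()/getPos().
interface RecordSource {
    boolean next() throws IOException;
    long getPos() throws IOException;
}

final class EmptyFileGuard {
    // Reader used for zero-byte files: never has a record, position stays at 0.
    static final RecordSource EMPTY = new RecordSource() {
        @Override public boolean next() { return false; }
        @Override public long getPos() { return 0L; }
    };

    // Hypothetical factory for the "real" reader (e.g. one backed by DataFileReader).
    interface Opener {
        RecordSource open(Path file) throws IOException;
    }

    // Only call the real opener when the file has content, so the
    // "Not a data file" exception is never triggered by an empty file.
    static RecordSource open(Path file, Opener realOpener) throws IOException {
        if (Files.size(file) == 0) {
            return EMPTY;
        }
        return realOpener.open(file);
    }

    private EmptyFileGuard() {}
}
```

With this shape the caller does not need a special case: an empty rotated file simply yields zero records.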
Is altering AvroGenericRecordReader feasible here?
Kris
Created 10-09-2015 03:46 PM
It was suggested that such files be skipped in Avro's native reader itself, but the Avro project declined that option in https://issues.apache.org/jira/browse/AVRO-1530 and suggested that clients ignore zero-length files.
The issue has been patched on the Hive side:
https://issues.apache.org/jira/browse/HIVE-11977
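For clients that cannot pick up the patch, the "ignore zero-length files" advice can be sketched as a filter over the directory listing. This is only an illustration under assumptions: NonEmptyFiles is a hypothetical helper, and it uses java.nio on a local directory; on HDFS the equivalent check would be on the length reported by FileStatus.getLen().

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class NonEmptyFiles {
    // Returns the regular files in dir that contain at least one byte,
    // sorted by path, so zero-length rotation leftovers are never read.
    static List<Path> list(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && Files.size(p) > 0) {
                    result.add(p);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    private NonEmptyFiles() {}
}
```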
-Darwin
Created 09-28-2015 03:21 PM
From what you described, the issue should be dealt with in the Storm bolt by not writing empty files. Hive is, in some sense, doing the right thing by throwing an error on an empty file. From a fix standpoint, I would think modifying the Storm side would be easier, since you only need to recompile your topology with the fix rather than recompile all of Hive.
Created 10-13-2015 11:25 PM
Yes, I have been working with Aaron on this one.