I am collecting HTTP/JSON data and eventually converting it into ORC files so that a Hive table can read them. This works without a problem, but it generates a lot of ORC files and Hive queries are slow, so I decided to add a MergeContent processor. Once I placed MergeContent before the ConvertCSVToAvro processor, ConvertCSVToAvro sporadically fails to convert some records, logs a warning message, and that data is lost. The behavior is not consistent: sometimes all the data is processed correctly and sometimes a few records are not processed. I tested with the same data set, and a different record fails every time.
Can you add your MergeContent configuration here? I suspect the last line of one file and the first line of the next are being put on the same line. Can you verify whether the last line of each CSV file ends with a line delimiter? If not, you may need to add one before running MergeContent.
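To make the suspected failure mode concrete, here is a minimal Python sketch, not NiFi code, of what happens when two CSV fragments are concatenated and the first one lacks a trailing line delimiter. The sample data and the `ensure_trailing_newline` helper are hypothetical and exist only for illustration.

```python
def ensure_trailing_newline(data: bytes, delimiter: bytes = b"\n") -> bytes:
    """Return data unchanged if it already ends with the delimiter, else append it."""
    if data and not data.endswith(delimiter):
        return data + delimiter
    return data


# Two hypothetical CSV fragments; the first one has no trailing newline.
first = b"id,value\n1,2.5"
second = b"2,0\n3,1.75\n"

# Naive concatenation fuses the last row of `first` with the first row of `second`.
naive = first + second

# Normalizing each fragment before concatenating keeps every row on its own line.
safe = ensure_trailing_newline(first) + ensure_trailing_newline(second)

print(naive.splitlines())  # [b'id,value', b'1,2.52,0', b'3,1.75']
print(safe.splitlines())   # [b'id,value', b'1,2.5', b'2,0', b'3,1.75']
```

The fused row `1,2.52,0` has an extra column, which is the kind of record a CSV converter would reject. Inside NiFi, setting a newline as the Demarcator property of MergeContent may serve the same purpose as the helper above, assuming the incoming files do not already end with a delimiter.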
Hi Matt, this happens randomly. I ran my sample data set multiple times, and every time it fails at a different point. The interesting thing is that when I change the number of bins to 1, it does not merge anything at all, and when I remove the MergeContent processor, everything works fine. Attachments: screen-shot-2017-10-18-at-52930-pm.png, screen-shot-2017-10-18-at-53030-pm.png
Matt, thanks for helping me with this. The real problem was with the InferAvroSchema processor, which uses Kite to determine the data type of each field. If a field contains nulls or zeros, InferAvroSchema is not consistent: when a merged bin contains some values of double or float type and some zeros, ConvertJSONToAvro fails because the inferred schema is incorrect. It would be wise to configure the schema manually in the ConvertJSONToAvro processor instead of using InferAvroSchema, if that makes sense.
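A rough Python sketch of why sample-based inference can disagree from one bin to the next, and what a hand-written schema might look like. The `infer_type` function is a toy stand-in and NOT Kite's actual algorithm; the record name `Reading` and the field names `id` and `value` are hypothetical.

```python
import json


def infer_type(values):
    """Toy inference from sample values. This is NOT Kite's algorithm."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "null"
    if all(isinstance(v, int) for v in non_null):
        return "int"
    return "double"


# The same field sampled from two different bins (hypothetical data).
bin_a = [0, 0, None]     # only zeros and nulls were seen
bin_b = [0, 12.5, 3.75]  # real doubles were seen

print(infer_type(bin_a))  # int
print(infer_type(bin_b))  # double

# A manually written Avro schema removes the dependence on the sample.
# The nullable double union covers nulls, zeros, and fractional values.
EXPLICIT_SCHEMA = {
    "type": "record",
    "name": "Reading",  # hypothetical record name
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "value", "type": ["null", "double"], "default": None},
    ],
}

schema_text = json.dumps(EXPLICIT_SCHEMA, indent=2)
print(schema_text)
```

The printed JSON is the sort of text that could be pasted into the converter's schema property, assuming the field names are adjusted to match the real data.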