When I fetch 49 records at a time, 49 flow files converted to ORC go into the PutHDFS processor, but the result is one small file in HDFS. When I run select count(*), there is only 1 record in the external table. No failed flow files come out of the PutHDFS processor.
I then added a MergeContent processor after the ORC conversion and before PutHDFS. A bigger file of merged content does land in HDFS, but selecting count(*) from the external table then fails with the error below. I suspect the merged file is corrupted, but putting the original 49 files into HDFS doesn't appear to work properly either. The external table is simply defined with its fields, STORED AS ORC LOCATION 'hdfspath'. I tried the DDL with and without orc.compress='SNAPPY', and even removed all compression from every processor. The result was the same: the merged file throws the error below every time I select from the external table.
Vertex failed, vertexName=Map 1, vertexId=vertex_1502140036873_0283_2_00, diagnostics=[Task failed, taskId=task_1502140036873_0283_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
Caused by: java.lang.RuntimeException: java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
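For reference, here is a minimal sketch of the kind of DDL and query described above. The table name and columns are hypothetical placeholders rather than the real schema, and 'hdfspath' stands in for the actual HDFS directory. The TBLPROPERTIES clause is the variant with Snappy compression; the other variant simply omits it.

```sql
-- Hypothetical table and column names, for illustration only.
-- 'hdfspath' is a placeholder for the directory PutHDFS writes to.
CREATE EXTERNAL TABLE example_orc_table (
  id   INT,
  name STRING
)
STORED AS ORC
LOCATION 'hdfspath'
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- The query that returns 1 for the unmerged files
-- and fails with the error above for the merged file.
SELECT COUNT(*) FROM example_orc_table;
```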