NiFi: Converting JSON to Avro to ORC and saving in HDFS


When I fetch 49 records at a time, 49 flow files are converted to ORC and sent to the PutHDFS processor, but the result is a single small file in HDFS. Running select count(*) against the external table returns only 1 record. No flow files are routed to failure by the PutHDFS processor.

I then added a MergeContent processor after the ORC conversion and before PutHDFS. A larger merged file is written to HDFS, but select count(*) on the external table then fails with the error below. I suspect the merged file is corrupted, but writing the original 49 files to HDFS doesn't appear to work properly either. The external table is simply defined with its fields, STORED AS ORC LOCATION 'hdfspath'. I tried the DDL with and without orc.compress='SNAPPY', and even removed all compression from every processor. The result was the same: the merged file throws the error below every time I select from the external table.

Status: Failed

Vertex failed, vertexName=Map 1, vertexId=vertex_1502140036873_0283_2_00, diagnostics=[Task failed, taskId=task_1502140036873_0283_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).

Caused by: java.lang.RuntimeException: java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).

1 ACCEPTED SOLUTION


Are there any failures in the PutHDFS processor? Unless the flow files share the same filename and the Conflict Resolution Strategy is "append", you should end up with 49 small files in HDFS (not that that's ideal).
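To illustrate why the filename matters, here is a minimal Python sketch. It is not PutHDFS itself, only a toy model of several writes landing in one directory. The strategy names mirror the processor's Conflict Resolution Strategy options, while `put_files` and the filenames are hypothetical and exist only for this example.

```python
def put_files(flow_files, conflict_strategy="replace"):
    """Toy stand-in for a directory write: maps filename -> content.

    This is NOT PutHDFS; it only models what happens when several
    writes target the same path under a given conflict strategy.
    """
    directory = {}
    failed = []
    for name, content in flow_files:
        if name not in directory:
            directory[name] = content
        elif conflict_strategy == "replace":
            directory[name] = content      # last writer wins
        elif conflict_strategy == "append":
            directory[name] += content     # bytes are concatenated
        elif conflict_strategy == "ignore":
            pass                           # keep the existing file, no failure
        else:                              # "fail"
            failed.append(name)
    return directory, failed


# 49 flow files that all carry the same (hypothetical) filename
same_name = [("data.orc", f"record-{i};".encode()) for i in range(49)]
# 49 flow files with unique (hypothetical) filenames
unique = [(f"data-{i}.orc", f"record-{i};".encode()) for i in range(49)]

files, failed = put_files(same_name, "replace")
print(len(files), len(failed))   # 1 0

files, failed = put_files(unique, "replace")
print(len(files), len(failed))   # 49 0
```

In this model, identical filenames with "replace" leave one file and report no failures, while unique filenames leave all 49 files in place.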

You won't be able to use MergeContent with ORC files as there is no strategy for that (same goes for MergeRecord until an OrcRecordSetWriter is implemented). If your flow files are Avro (going into ConvertAvroToORC), you could try MergeContent before ConvertAvroToORC and use the Avro merge strategy.
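As a rough illustration of why byte-level merging breaks a format like ORC, whose metadata sits at the end of the file, here is a toy Python sketch. The layout below is invented for the example and is not the real ORC format. It only borrows the idea that a reader starts from the tail of the file, and `write_toy` and `read_toy` are hypothetical helpers.

```python
import struct


def write_toy(records):
    """Serialize records as: body + footer + 1-byte footer length.

    Toy layout only (NOT real ORC): the footer stores the record count
    and body length, and the final byte says how long the footer is.
    """
    body = b"".join(struct.pack(">H", len(r)) + r for r in records)
    footer = struct.pack(">II", len(records), len(body))
    return body + footer + bytes([len(footer)])


def read_toy(data):
    """Read from the tail, the way a footer-based reader would."""
    footer_len = data[-1]
    count, body_len = struct.unpack(">II", data[-1 - footer_len:-1])
    body, pos, records = data[:body_len], 0, []
    for _ in range(count):
        (n,) = struct.unpack_from(">H", body, pos)
        records.append(body[pos + 2:pos + 2 + n])
        pos += 2 + n
    return records


a = write_toy([b"r1", b"r2", b"r3"])
b = write_toy([b"r4"])
merged = a + b                      # plain byte concatenation

print(read_toy(a))       # [b'r1', b'r2', b'r3']
print(read_toy(merged))  # [b'r1']
```

In this toy, the reader honors only the last footer, so the merged file appears to hold a single record, and the wrong one at that. With a real format, the misplaced metadata may simply fail to parse. Merging while the data is still Avro avoids this because MergeContent has a merge format that understands Avro.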


2 REPLIES

I moved the MergeContent processor ahead of the ORC conversion, so it now merges the Avro flow files, and made sure the filenames are unique.