Splitting a NiFi flowfile into multiple flowfiles

Expert Contributor

Hi All,

I have the following requirement:

Split a single NiFi flowfile into multiple flowfiles, so that the contents of each resulting flowfile (once extracted) can be inserted as a separate row in a Hive table.

Sample input flowfile:

MESSAGE_HEADER | A | B | C

LINE|1 | ABCD | 1234

LINE|2 | DEFG | 5678

LINE|3 | HIJK | 9012

.

.

.

Desired output files:

Flowfile 1:

MESSAGE_HEADER | A | B | C

LINE|1 | ABCD | 1234

Flowfile 2:

MESSAGE_HEADER | A | B | C

LINE|2 | DEFG | 5678

Flowfile 3:

MESSAGE_HEADER | A | B | C

LINE|3 | HIJK | 9012

.

.

.

The number of lines in the flowfile is not known ahead of time.

I would like to know the best way to accomplish this with the available NiFi processors. The splitting can be done at the flowfile level, or after the contents are extracted from the flowfile but before the Hive insert statements are created.

Thanks.

1 ACCEPTED SOLUTION

@Raj B The SplitText processor has a "Header Line Count" property. If you set this to 1, you should be able to generate multiple flowfiles, each with the same header. That said, if you intend to insert these into Hive, you could also use ConvertCSVToAvro with the delimiter set to '|'; you'd then have the data in batches, which should give you better throughput.
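
To make the expected result concrete, here is a minimal Python sketch that imitates the effect of that setting outside NiFi. It is an illustration only, not NiFi's implementation: the function name `split_with_header` and its parameters are hypothetical, and it assumes the blank lines in the sample can be skipped.

```python
def split_with_header(text, header_line_count=1, line_split_count=1):
    """Imitate a header-preserving split: each output chunk starts with
    the header lines, followed by `line_split_count` body lines.

    Assumption: blank lines are ignored, which is a simplification made
    for this sketch.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[:header_line_count]
    body = lines[header_line_count:]

    splits = []
    for start in range(0, len(body), line_split_count):
        chunk = body[start:start + line_split_count]
        splits.append("\n".join(header + chunk) + "\n")
    return splits


sample = (
    "MESSAGE_HEADER | A | B | C\n"
    "LINE|1 | ABCD | 1234\n"
    "LINE|2 | DEFG | 5678\n"
    "LINE|3 | HIJK | 9012\n"
)

splits = split_with_header(sample)
for i, content in enumerate(splits, start=1):
    print(f"Flowfile {i}:")
    print(content)
```

Because the loop runs over however many body lines exist, the number of lines does not need to be known ahead of time, which matches the requirement in the question.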

4 REPLIES

Expert Contributor

@jfrazee Thank you. I'm going the SplitText route for now, and it seems to work.

For the purpose of saving the split files for later reference, how do I assign different names to the child (split) flowfiles? I'm thinking of prepending or appending a UUID to the file name. When I looked, all of the child flowfiles had the same name as the parent flowfile, which causes them to overwrite one another.

Contributor

@jfrazee @Raj B

How did you save it to a file? GetFile -> SplitText -> PutFile?

Expert Contributor

@mel mendoza, in my case I was doing further processing on the split files after splitting them. If your requirement is to store the split files, you could use PutFile to write to the local file system or PutHDFS to write to HDFS.
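
On the earlier question about child flowfiles overwriting each other when written out: the thread does not confirm a fix, but one common approach is to rename each split before PutFile, for example with an UpdateAttribute processor that sets `filename` to something like `${filename}-${fragment.index}` or appends `${uuid}`. The Python sketch below only illustrates that naming idea; `child_filename` is a hypothetical helper, not a NiFi API.

```python
import os
import uuid


def child_filename(parent_name, index, unique_id=None):
    """Build a distinct name for a split file from the parent's name.

    `index` plays the role of a per-split counter and `unique_id`
    defaults to a random UUID, so two splits of the same parent never
    share a name. Both parameters are assumptions for this sketch.
    """
    if unique_id is None:
        unique_id = str(uuid.uuid4())
    stem, ext = os.path.splitext(parent_name)
    return f"{stem}-{index}-{unique_id}{ext}"


# Three splits of the same (hypothetical) parent file get three names.
names = [child_filename("messages.txt", i) for i in range(1, 4)]
for name in names:
    print(name)
```

Using the split index alone is enough to keep siblings apart; adding a UUID also guards against collisions between different parents that happen to share a file name.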