
How to run a processor once on many FlowFiles

Expert Contributor

Hi,

I have an ingest flow which initially ingests approximately 50 files of about 300 MB each. After the ingest I want to run some Hive commands to create various Hive tables for display, but I only need to do that once, not 50 times. I have been searching for some kind of trigger to start a PutHiveQL processor once, without a useful hit.

How could I accomplish that?

1 ACCEPTED SOLUTION

Super Mentor

@Simon Jespersen

You need to trigger the PutHiveQL processor only once after ingesting all files? If that is the case, the approach that comes to mind is as follows:

Route the "success" relationship of your ingest processor twice. Route one as you normally would for your existing dataflow, and route the second "success" relationship to a ReplaceText processor. This does not introduce duplicate data or much additional I/O: the file content is still ingested only once, but there are now two FlowFiles pointing at the same content.

The "success" relationship that feeds into the ReplaceText processor will be your PutHiveQL trigger flow. We are going to use the ReplaceText processor to remove the content of those FlowFiles, down that path only. (In the background, new FlowFiles are created at this point, but since they are all zero bytes there is little I/O involved.) Then you can use a MergeContent processor to merge all those zero-byte FlowFiles into one FlowFile. Finally, route the "merged" relationship of the MergeContent processor to your PutHiveQL processor.

So the Flow would look something like this:

[Screenshot: example flow layout — 15488-screen-shot-2017-05-17-at-100912-am.png]

The ReplaceText would be configured as follows:

[Screenshot: ReplaceText configuration — 15489-screen-shot-2017-05-17-at-101033-am.png]
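The screenshot is not rendered in this text. As a rough sketch of a configuration that empties the FlowFile content (the exact values shown in the original screenshot may differ; these are typical settings for this pattern, not taken from the image):

```
ReplaceText properties (sketch):
  Replacement Strategy : Always Replace
  Evaluation Mode      : Entire text
  Replacement Value    : (empty — use "Empty string set")
```

With "Always Replace" and "Entire text", the whole content is replaced by the empty Replacement Value, producing the zero-byte FlowFiles described above.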

and your MergeContent processor would be configured something like this:

[Screenshot: MergeContent configuration — 15511-screen-shot-2017-05-17-at-101127-am.png]
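Again, the screenshot is not visible here; a plausible configuration, assuming the ~50 ingested files from the question (the entry counts and bin age are assumptions you would tune to your flow):

```
MergeContent properties (sketch):
  Merge Strategy            : Bin-Packing Algorithm
  Merge Format              : Binary Concatenation
  Minimum Number of Entries : 50    (number of ingested files)
  Maximum Number of Entries : 50
  Max Bin Age               : 5 min (safety valve so the bin is
                                     still released if fewer files arrive)
```

Setting both minimum and maximum entries to the expected file count makes MergeContent wait for all the zero-byte trigger FlowFiles and emit exactly one merged FlowFile; the Max Bin Age keeps the flow from stalling forever if a file goes missing.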

As always, NiFi provides many ways to accomplish a variety of different dataflow needs. This is just one suggestion.
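One detail worth noting: PutHiveQL executes the content of the incoming FlowFile as the HiveQL statement(s), so the merged zero-byte FlowFile would likely need its content set first, for example with a second ReplaceText between MergeContent and PutHiveQL that writes your DDL as the new content. What that content might look like (everything here is a hypothetical placeholder, not taken from the original thread):

```sql
-- Hypothetical DDL the trigger FlowFile could carry;
-- the table name, columns, and location are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS ingested_data (
  id      STRING,
  payload STRING
)
STORED AS TEXTFILE
LOCATION '/data/ingest';
```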

Thanks,

Matt


2 REPLIES


Expert Contributor

@Matt Clarke Thank you very much; your answer was very useful to me.