Created 05-17-2017 01:12 PM
Hi,
I have an ingest flow which initially ingests approximately 50 files, each about 300 MB. After the ingest I want to run some Hive commands to create various Hive tables for display, but I only need to do that once, not 50 times. I have been searching for some kind of trigger to start a PutHiveQL processor once, without a useful hit.
How could I accomplish that?
Created on 05-17-2017 02:12 PM - edited 08-18-2019 02:53 AM
You need to trigger the PutHiveQL processor only once after ingesting all files? If that is the case, the approach that comes to mind is as follows:
Route the "success" relationship of your ingest processor twice: once as you normally would for your existing dataflow, and a second time to a ReplaceText processor. This does not introduce duplicate data or much additional I/O; the file content is still ingested only once, but there are two FlowFiles pointing at the same content.
The "success" relationship that feeds into the ReplaceText processor will be your PutHiveQL trigger flow. We use the ReplaceText processor to remove the content of those FlowFiles (down that path only; new FlowFiles are created at this point, but since they are all zero bytes there is little I/O involved). Then you can use a MergeContent processor to merge all those zero-byte FlowFiles into one FlowFile. Finally, route the "merged" relationship of the MergeContent processor to your PutHiveQL processor.
So the Flow would look something like this:
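In text form, the trigger path described above is roughly (processor names are illustrative, not from an actual template):

```
IngestProcessor
  |-- success --> (existing dataflow, unchanged)
  `-- success --> ReplaceText        (strip content -> zero-byte FlowFiles)
                    `-- success --> MergeContent   (bin 50 FlowFiles into 1)
                                      `-- merged --> PutHiveQL
```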
The ReplaceText would be configured as follows:
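A sketch of the key ReplaceText properties, based on the behavior described above (empty out every FlowFile on the trigger path); property names are standard ReplaceText properties, the values are what this use case calls for:

```
Replacement Strategy : Always Replace
Evaluation Mode      : Entire text
Replacement Value    : (empty string set)
```

Note that since PutHiveQL executes the HiveQL found in the incoming FlowFile's content, you could alternatively set "Replacement Value" to the HiveQL statement you want to run, instead of leaving the FlowFile empty.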
and your MergeContent processor would be configured something like this:
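A sketch of the MergeContent configuration for this scenario; the entry counts assume the ~50 files mentioned in the question, and "Max Bin Age" is an optional safety net so the bin still merges if fewer files arrive:

```
Merge Strategy             : Bin-Packing Algorithm
Merge Format               : Binary Concatenation
Minimum Number of Entries  : 50
Maximum Number of Entries  : 50
Max Bin Age                : 10 min   (optional safety net)
```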
As always, NiFi provides many ways to accomplish a variety of different dataflow needs. This is just one suggestion.
Thanks,
Matt
Created 05-19-2017 08:17 AM
@Matt Clarke Thank you very much; your answer was very useful to me.