Created 05-17-2017 01:12 PM
Hi,
I have an ingest flow which initially ingests approximately 50 files, each about 300 MB. After the ingest I want to run some Hive commands to create various Hive tables for display, but I only need to do that once, not 50 times. I have been searching for some kind of trigger to start a PutHiveQL processor once, without a useful hit.
How could I accomplish that?
Created on 05-17-2017 02:12 PM - edited 08-18-2019 02:53 AM
You need to trigger the PutHiveQL processor only once after ingesting all files? If that is the case, the approach that comes to mind is as follows:
Route the "success" relationship of your ingest processor twice: once as you normally would for your existing dataflow, and a second time to a ReplaceText processor. This does not introduce duplicate data or much additional I/O; the file content is still ingested only once, but there are two FlowFiles pointing at the same content.
The "success" relationship that feeds into the ReplaceText processor will be your PutHiveQL trigger flow. We use the ReplaceText processor to remove the content of those FlowFiles (down that path only; new FlowFiles are created at this point, but since they are all zero bytes there is little I/O involved). Then you can use a MergeContent processor to merge all those zero-byte FlowFiles into one FlowFile. Finally, route the "merged" relationship of the MergeContent processor to your PutHiveQL processor.
So the Flow would look something like this:
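In text form, the trigger path described above is roughly (processor names are illustrative, not from an actual template):

```
IngestProcessor
  |-- success --> (existing dataflow, unchanged)
  `-- success --> ReplaceText        (strip content -> zero-byte FlowFiles)
                    `-- success --> MergeContent   (bin 50 FlowFiles into 1)
                                      `-- merged --> PutHiveQL
```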
The ReplaceText would be configured as follows:
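A sketch of the key ReplaceText properties, based on the behavior described above (empty out every FlowFile on the trigger path); property names are standard ReplaceText properties, the values are what this use case calls for:

```
Replacement Strategy : Always Replace
Evaluation Mode      : Entire text
Replacement Value    : (empty string set)
```

Note that since PutHiveQL executes the HiveQL found in the incoming FlowFile's content, you could alternatively set "Replacement Value" to the HiveQL statement you want to run, instead of leaving the FlowFile empty.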
and your MergeContent processor would be configured something like this:
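A sketch of the MergeContent configuration for this scenario; the entry counts assume the ~50 files mentioned in the question, and "Max Bin Age" is an optional safety net so the bin still merges if fewer files arrive:

```
Merge Strategy             : Bin-Packing Algorithm
Merge Format               : Binary Concatenation
Minimum Number of Entries  : 50
Maximum Number of Entries  : 50
Max Bin Age                : 10 min   (optional safety net)
```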
As always, NiFi provides many ways to accomplish a variety of different dataflow needs. This is just one suggestion.
Thanks,
Matt
Created 05-19-2017 08:17 AM
@Matt Clarke Thank you very much; your answer was very useful to me.