Support Questions

patelismail23 · ‎07-13-2017

Context

I have configured a multi-node NiFi cluster with three nodes. On each flow run, the flow creates Hive partitions named after the timestamp at which the job starts. An UpdateAttribute processor placed after the GetFile processor sets a timestamp attribute, which is later used to create the partitions in Hive.

GetFile -> MergeContent (into a single file) -> SetTimeStamp -> Create HDFS directories from timestamp -> PutHDFS -> CreateHivePartition

Problem:

Each node sets its own current timestamp and creates a partition, because there is a difference of a few milliseconds between the nodes when the flow file reaches the UpdateAttribute processor. As a result, the number of partitions created for a single scheduled ingestion equals the number of nodes.
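The effect can be reproduced outside NiFi with a small sketch. Everything here is an assumption made for illustration: the node names, the millisecond offsets, the `/data/ingest` path, and the `partition_path` helper are hypothetical and not taken from the actual flow.

```python
from datetime import datetime, timedelta

def partition_path(ts: datetime) -> str:
    """Build a hypothetical partition directory name with millisecond precision."""
    millis = ts.microsecond // 1000
    return ts.strftime("/data/ingest/%Y-%m-%d_%H-%M-%S") + f"-{millis:03d}"

# Assumed start of one scheduled run and assumed per-node delays (in ms)
# before the flow file reaches the timestamp step on each node.
run_start = datetime(2017, 7, 13, 10, 0, 0)
node_delays_ms = {"node1": 0, "node2": 4, "node3": 9}

# Each node stamps its own flow file: one distinct partition per node.
per_node = {partition_path(run_start + timedelta(milliseconds=ms))
            for ms in node_delays_ms.values()}

# One timestamp captured once and reused by every node: a single partition.
shared = {partition_path(run_start) for _ in node_delays_ms}

print(len(per_node), "partitions with per-node timestamps")
print(len(shared), "partition with one shared timestamp")
```

With three nodes the first set holds three paths and the second holds one, which matches the behaviour described above.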

I want to capture a single timestamp, that of the flow file that reaches SetTimeStamp first, and use it for all flow files across the cluster for that scheduled job. That way I will get a single partition.

To do this, I configured the SetTimeStamp processor to run on the primary node only. This worked fine for flow files on the primary node, but on the other nodes flow files get queued in front of SetTimeStamp, so the ingestion is only partial.

Why do flow files get queued?

How do I bypass the SetTimeStamp processor for flow files on non-primary nodes?

@Matt Clarke Any help would be appreciated.

Wynner · ‎07-18-2017

@ismail patel

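Flow files are queued because the processor is running only on the primary node. Each node works on its own flow files, so flow files that sit on the other nodes have no running instance of the processor to pick them up.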
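A toy model of this, not NiFi code: each node is given its own queue, and a processor restricted to the primary node drains only that node's queue. The node names, the flow file labels, and the `run_processor` function are hypothetical.

```python
from collections import deque

def run_processor(queues, primary, primary_only):
    """Drain the queues the processor is allowed to run against.

    queues maps a node name to that node's local queue of flow files.
    When primary_only is True, only the primary node's queue is drained.
    """
    processed = []
    for node, queue in queues.items():
        if primary_only and node != primary:
            continue  # the processor is not scheduled on this node
        while queue:
            processed.append(queue.popleft())
    return processed

# Assumed state: one flow file waiting on each of three nodes.
queues = {
    "node1": deque(["flowfile-a"]),
    "node2": deque(["flowfile-b"]),
    "node3": deque(["flowfile-c"]),
}

done = run_processor(queues, primary="node1", primary_only=True)
still_queued = {node: list(q) for node, q in queues.items() if q}

print("processed:", done)
print("still queued:", still_queued)
```

In this model the flow files on `node2` and `node3` stay in their queues until a processor runs on those nodes, which is the partial ingestion described in the question.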

GetFile in a cluster is not the best way to get data into your flow. It would be better to use a ListFile processor followed by a FetchFile processor.

Either configure every processor in that flow to run on the primary node only, or have two flow paths: one that runs on the primary node only and a second that runs on all nodes.

Cloudera Community

How to handle Flow files that get queued on the nodes other than primary node for the processors configured to run on primary node