Member since 05-05-2017
4 Posts · 4 Kudos Received · 0 Solutions
07-13-2017
07:04 PM
2 Kudos
Context: I have configured a multi-node NiFi cluster with three nodes. I have a flow that, on each run, creates Hive partitions with a timestamp of when the job starts. This is basically an UpdateAttribute processor after a GetFile processor; it sets a timestamp attribute that is later used to create the partitions in Hive:

GetFile -> MergeContent (into a single file) -> SetTimeStamp -> Create HDFS directories from timestamp -> PutHDFS -> CreateHivePartition

Problem: Each node sets the current timestamp and creates its own partition, because there is a difference of a few milliseconds between nodes when the flow file reaches UpdateAttribute. So the number of partitions created for a single scheduled ingestion equals the number of nodes. I want to capture a single timestamp from the flow file that reaches SetTimeStamp first, so that it is the same for flow files across the cluster for that scheduled job; this way I would get a single partition. For this I configured the SetTimeStamp processor to run on the primary node only. This worked fine for the primary node's flow files, but on the other nodes flow files get queued in front of SetTimeStamp, so ingestion is only partial.

Why do the flow files get queued? How do I bypass the UpdateAttribute processor for flow files on non-primary nodes? @Matt Clarke Any help would be appreciated.
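For reference, the timestamp attribute described above is typically set in UpdateAttribute with NiFi Expression Language. A minimal sketch, assuming a property named partition.ts (the property name, table name, and paths here are illustrative, not from the original flow):

```
# UpdateAttribute processor — add a dynamic property:
#   partition.ts = ${now():format('yyyy-MM-dd-HH')}
#
# Downstream processors can then reuse the same attribute, e.g.:
#   PutHDFS Directory:  /data/landing/${partition.ts}
#   Hive DDL:           ALTER TABLE events ADD IF NOT EXISTS
#                       PARTITION (ts='${partition.ts}');
```

Because ${now()} is evaluated independently on each node, each node produces its own value, which is exactly the multiple-partition symptom described above.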
Labels:
- Apache NiFi
05-08-2017
10:29 AM
This worked; it distributed files across nodes. But why do we need RouteOnAttribute to set back pressure? I connected ListHDFS directly to the RPG, set back pressure to 5, and flow files were distributed across nodes.
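To clarify the question above: back pressure is a property of the connection, not of any particular processor, so any connection feeding the RPG can carry it. A sketch of the configuration described in the post (the object threshold of 5 is the value mentioned; the size threshold shown is NiFi's usual default):

```
Connection: ListHDFS -> Remote Process Group
  Back Pressure Object Threshold:     5
  Back Pressure Data Size Threshold:  1 GB
```

With a low object threshold, the source processor pauses once the queue fills, which keeps flow files from piling up on a single node before Site-to-Site can distribute them.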
05-05-2017
05:07 PM
Thanks, Matt, this is really helpful. I tried a flow where a GenerateFlowFile processor generates 100 files of 1 GB each, connected to an RPG. The RPG graph shows both node 1 and node 2, but the files always end up on node 1. When I disconnected node 1, the part of the flow that was supposed to run in parallel stopped. At the connection between GenerateFlowFile and the RPG, 80 GB of flow files accumulated, and even though node 2 was up it did not take any files from the queue. Only when node 1 was reconnected did it resume consuming flow files from the queue.
05-05-2017
03:34 PM
2 Kudos
I have a 2-node NiFi cluster with the same configuration on both nodes. I created a simple flow:

1. ListHDFS, set to run on the primary node, which creates the flow files and sends them to an RPG whose address is set to one of the nodes in the cluster.
2. A GetHDFS group (supposed to run on each node) fed by the RPG.

What I understood from the docs is that in clustered mode flow files get distributed across the nodes, and each node operates on some non-empty set of flow files received from the RPG. But in my case it is always one node that gets 100% of the flow files, and in a few cases the other node gets just 1 flow file; this is true even for 1000 files of 100 MB each. How can I get GetHDFS to run in parallel on both machines, each operating on a different set of flow files?
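The pattern being attempted here (list once, fetch everywhere) is usually wired as the following flow outline; this is a sketch, and the port name "From Cluster" is an assumption, not from the original flow:

```
# On the primary node only:
ListHDFS                       (Scheduling > Execution: Primary node)
  -> Remote Process Group      (URL points at this same cluster)

# Received on every node via Site-to-Site:
Input Port "From Cluster"      (at the root process group)
  -> FetchHDFS                 (runs on all nodes, each on its own
                                subset of the listed files)
```

The key point is that the RPG's Site-to-Site client load-balances across all connected nodes, so each node's FetchHDFS only downloads the files whose listing entries landed on that node.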
Labels:
- Apache NiFi