Created 03-20-2019 03:16 PM
Hi All,
@mattclarke,@mattburgess,@markpayne
I want to generate sequence number in my nifi cluster (3 nodes), I was using update attribute processor with store state locally option , but this is not serving my purpose as each node is generating its own value incrementally and this is creating duplicate values while loading data to target table.I would be grateful if i can get alternate solution to achieve this batchid generation in cluster mode.
Thanks in advance!!
Created 03-22-2019 01:45 PM
If you need atomic sequencing and still want to use a paralell system you are gonna have to push that sequencing off onto a system capable of atomic sequencing.
probably the easiest way is to write a stored procedure with a transaction in an rdbms. and do an executesql in your flow.
Don't use a cache as MC says caches are not designed for transactional atomic stuff. Only use cache for actually caching stuff (you can get the value but its expensive and cheaper just to store for a bit in the cache.
Created 03-22-2019 01:09 AM
Instead of using UpdateAttribute processor's state use DistributedMapCache and you can fetch the stored value across the cluster.
Use PutDistributedMapCache processor to store the value that got assigned recently then use FetchDistributedMapCache processor to Fetch the store value then apply your logic(increment..etc) to assign new value then overwrite the already stored value in DistributedMapCache using PutDistributedMapCache processor.
Use this and this links as references for configuring Distributedmapcache processors/controllers.
Created 03-22-2019 04:50 AM
@shu,@Mattclarke,@markpayne How do i generate the sequence number to be used as a stored value as you suggested.As per my knowledge there is only one processor in nifi to generate sequence number and that is update attribute which in cluster mode will again produce different values across all nodes.
Created 03-22-2019 12:18 PM
The issue with this distributed cache solution is timing. You have a bit of a race condition to consider here. Each of the nodes in a NiFi cluster runs their own copy of the flow.xml.gz and process their own set of FlowFiles. While the Distributed cache allows you to have all your NiFi nodes reading and writing to the same Distributed Map Cache server, they are doing it at their own pace. This means that multiple nodes may end up pulling the same cache value incrementing it and writing it back to the cache server so you do not end up with a true count. You may also have a condition where node 1 fetches value X from cache server and for some reason the flow on that node 1 is delayed getting to putDistributeCache to to update value. Mean time some other node has already fetched and updated the cache multiple times. Now node 1 puts and overwrites newer value with an older count.
-
Dealing with such a "race" condition is going to be difficult here because of how NiFi clustering works.
Created 03-22-2019 12:36 PM
-
The question here what is the use case for needing to have a sequence number across entire cluster. Why not generate a sequential number per node to keep track of per node batches. Maybe use the NiFi hostname in the sequence number identifier?
-
Maybe you can share some more details on the full use case for these sequence numbers. Why you are generating and what they are being used for.
-
If you use the distributedMapCache, you could keep three different sequential number cached values (each node has its own sequence number stored in a cache entry by hostname.
-
You could then build a flow that fetches all three value add adds them together for you on an hourly/daily/weekly schedule?
-
Thank you,
Matt
Created 04-03-2019 01:59 PM
Thanks Matt for your view on this,the ask is to generate a batchid which should be a sequence number, so whenever querydb processor fetches records from source db (sqlserver) a batchid should be added to the flowfile so that all records have a same batchid when loaded to target table,this will help in auditing of records.but here in cluster mode it seems difficult to achieve this using updateattribute processor.i liked your idea of appending node hostname with the sequence but if i could generate atomic values across all nodes it would be much better.
Created 03-22-2019 01:45 PM
If you need atomic sequencing and still want to use a paralell system you are gonna have to push that sequencing off onto a system capable of atomic sequencing.
probably the easiest way is to write a stored procedure with a transaction in an rdbms. and do an executesql in your flow.
Don't use a cache as MC says caches are not designed for transactional atomic stuff. Only use cache for actually caching stuff (you can get the value but its expensive and cheaper just to store for a bit in the cache.
Created 04-03-2019 01:55 PM
Thanks David, Idea looks good ,I will try this.