How to append to an HDFS file using PutHDFS when NiFi is running in cluster mode?


Hi,

My NiFi is hosted on a 3-node cluster. My requirement is to append data to the end of a file in HDFS. Since NiFi is running in cluster mode, how can we make sure that only one node writes data to the file, so there is no conflict in the write operation?

Thanks,


4 REPLIES

Master Guru

@Rahoul A

You can use a ControlRate processor before the PutHDFS processor and configure it to release one FlowFile per desired time interval (for example, one per minute).

If you need to append data to a file, you need to make sure every FlowFile carries the same filename. To get the same filename every time, use an UpdateAttribute processor to set the filename attribute, and in the PutHDFS processor configure the property below:

Conflict Resolution Strategy

append //if the processor finds an existing file with the same name, it appends the data to it.
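
As an example, a minimal UpdateAttribute configuration could look like the following (the target filename daily_data.txt is purely illustrative, not something from the original post):

UpdateAttribute (dynamic property):

filename = daily_data.txt //every FlowFile now targets the same HDFS file

PutHDFS then handles the name collision according to the Conflict Resolution Strategy above.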


ControlRate processor configuration:

[Screenshot: controlrate.png, showing the ControlRate processor properties]
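
In case the screenshot does not render, the relevant settings would look roughly like this (these are standard ControlRate property names, but the exact values shown in the screenshot are my assumption based on the description below):

Rate Control Criteria = flowfile count

Maximum Rate = 1

Time Duration = 1 min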

With these settings, the processor releases one FlowFile per minute, so at any point in time only one node is writing/appending data to the file.

Flow:-

other Processors --> ControlRate Processor --> PutHDFS


@Shu, thanks for your answer.

We are currently doing something similar: we have set up time-driven flags to make sure only one node can write the data to HDFS. I was hoping for something other than a workaround, since there are performance implications. If there is a recommended way of doing this, please let me know.

Super Mentor (Accepted Solution)

@Rahoul A

Unfortunately, you can only have one client writing/appending to the same file in HDFS at a time. The nature of this append capability in HDFS does not mesh well with the NiFi architecture of concurrent, parallel operations across multiple nodes. NiFi nodes each run their own copy of the dataflow and work on their own unique set of FlowFiles. While NiFi nodes do communicate health and status heartbeats to the elected cluster coordinator, dataflow-specific information, such as which node is currently appending to a particular filename in the same target HDFS cluster, is not shared. And from a performance standpoint, it makes sense not to share it.
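
To see this single-writer constraint outside of NiFi, here is a minimal sketch using the Python hdfs package (a WebHDFS client); the namenode URL, user, and file path are hypothetical:

from hdfs import InsecureClient

# Connect over WebHDFS; port 9870 is the Hadoop 3 namenode HTTP default.
client = InsecureClient('http://namenode-host:9870', user='nifi')

# Appending acquires an exclusive lease on the file. A second client that
# tries to append to the same path before this lease is released will get
# an error (e.g. AlreadyBeingCreatedException) back from the namenode.
# Note that append=True requires the file to already exist.
with client.write('/data/daily_data.txt', append=True) as writer:
    writer.write(b'one record at a time\n')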

-

So, aside from the workaround above, which reduces the likelihood of conflict, you can also:

1. After whatever preprocessing you perform on the data in NiFi before pushing to HDFS, route all data to a dedicated node in your cluster for the final step of appending to your target HDFS (with a failover node; think PostHTTP with its failure relationship feeding another PostHTTP — see the sketch after this list).

2. Install a standalone edge instance of NiFi that simply receives the processed data from your NiFi cluster and writes/appends it to HDFS.
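
As a rough sketch of option 1 (the exact wiring is my illustration of the idea, not a tested flow; PostHTTP and ListenHTTP are the standard NiFi HTTP processors):

on every cluster node: preprocessing --> PostHTTP (pointing at the dedicated node)

PostHTTP failure relationship --> PostHTTP (pointing at the failover node)

on the dedicated node: ListenHTTP --> PutHDFS (Conflict Resolution Strategy = append)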

-

Thanks,

Matt


@Matt Clarke, that seems like quite a refined approach. Happy to see your response.