I have a NiFi cluster with 4 nodes.
The defined dataflow (image attached) has a MergeContent processor, which gathers incoming flowfiles into a zip every minute, and a PutHDFS processor, which writes the zip file to HDFS.
I expected only one zip file to be created in HDFS, containing all the flowfiles from the last minute. What actually happens is that every node creates its own zip file and tries to put it. Since the zip filename (set in the UpdateAttribute processor) is the same across the whole cluster, the files try to overwrite each other, and I get error messages.
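To illustrate what I am seeing, here is a toy model in Python (not NiFi code). It assumes each node merges only the flowfiles queued on that node and that every node ends up with the same filename. The node names, the `merge_per_node` helper, the payloads, and the `example_1202.zip` name are all hypothetical and only there for illustration.

```python
from collections import defaultdict


def merge_per_node(flowfiles, filename):
    """Toy model: each node bundles only the flowfiles it holds.

    flowfiles: list of (node, payload) pairs.
    Returns one (node, filename, payloads) bundle per node.
    """
    by_node = defaultdict(list)
    for node, payload in flowfiles:
        by_node[node].append(payload)
    # One bundle per node, all carrying the same target filename.
    return [(node, filename, payloads) for node, payloads in sorted(by_node.items())]


# Hypothetical data: four nodes, two flowfiles each, within the same minute.
flowfiles = [(f"node{n}", f"record-{n}-{i}") for n in range(1, 5) for i in range(2)]
bundles = merge_per_node(flowfiles, "example_1202.zip")

# Collect the distinct target filenames across all bundles.
target_names = {name for _, name, _ in bundles}
print(len(bundles), "bundles,", len(target_names), "distinct filename(s)")
```

In this model the four nodes produce four separate bundles that all target the same path, which matches the collisions I get in HDFS.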
I tried setting the Conflict Resolution Strategy property (in the PutHDFS processor) to append, but I get the following error:
PutHDFS[id=f696f-05b-100-9b2-51019b97c5] Failed to write to HDFS due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutHDFS[id=f6e6968f-015b-1000-95b2-510198b97c50]: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /example/2017-06-01/.example_1202.zip (inode 32117): File does not exist. Holder DFSClient_NONMAPREDUCE_-6309245_9 does not have any open files.
The objective of the flow is that only one zip file is put into HDFS, which contains the flowfiles collected during the last minute in the 4 cluster nodes. Is this not possible? Where is the mistake in my dataflow? Any idea of how to do it?
Thanks in advance.
This could be accomplished by running all processors on the primary node only. Of course, this would mean everything runs on a single node while the others sit idle. Alternatively, you could set up two clusters, one for ingest and a single node for egress. The ingest cluster would handle processing and then use site-to-site to send the data to the egress cluster, which would then zip it and write to HDFS.
I am not a fan of either solution, honestly. Is it possible to write the raw data to HDFS and then use HAR to compress?
The egress "cluster" should not be a NiFi cluster. You will get your best performance by installing a standalone NiFi instance rather than a one-node cluster. Clusters require ZooKeeper, while standalone NiFi instances do not. The Site-to-Site protocol will load balance across all nodes in a target cluster.