NiFi Cluster with PutHDFS - append error
Labels: Apache Hadoop, Apache NiFi
Created on 06-01-2017 01:44 PM - edited 08-17-2019 11:44 PM
Hi there,
I have a NiFi cluster with 4 nodes.
The defined dataflow (image attached) has a MergeContent processor, which gathers incoming FlowFiles into a zip every minute, and a PutHDFS processor, which puts the zip file into HDFS.
The result I was expecting was that only one zip file would be created in HDFS, with all the flowfiles of the last minute:
Example: /example/2017-06-01/example_1202.zip
What actually happens is that every node creates its own zip file and tries to put it into HDFS. Since the zip filename (set in the UpdateAttribute processor) is the same on every node in the cluster, the files try to overwrite each other, and I get error messages.
I tried setting the Conflict Resolution Strategy property (in the PutHDFS processor) to append, but I get the following error:
PutHDFS[id=f696f-05b-100-9b2-51019b97c5] Failed to write to HDFS due to org.apache.nifi.processor.exception.ProcessException: IOException thrown from PutHDFS[id=f6e6968f-015b-1000-95b2-510198b97c50]: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /example/2017-06-01/.example_1202.zip (inode 32117): File does not exist. Holder DFSClient_NONMAPREDUCE_-6309245_9 does not have any open files.
The objective of the flow is that files received on any of the 4 nodes are collected every minute, compressed into a .zip file, and put into HDFS. Is my dataflow not valid? Where is the mistake? Any idea how to do it?
Thanks in advance.
Created 06-01-2017 01:53 PM
Every node in a NiFi cluster runs its own copy of the cluster flow, has its own repositories, and works on its own set of FlowFiles. Nodes in a NiFi cluster are unaware of any FlowFiles being processed by other nodes in the cluster.
What you are seeing is the normal, expected behavior of your dataflow.
Thanks,
Matt
Created 06-01-2017 02:40 PM
Multiple nodes can write to the same path in HDFS, but not the same file at the same time.
The lease error you saw above is most likely the result of one node finishing its write of .example_1202.zip and then renaming it to example_1202.zip. In between, a different node saw .example_1202.zip and tried to start appending to it, but the file was moved/renamed before that could happen. It essentially becomes a race condition, since nodes do not communicate this kind of information to one another.
You could write 4 zip files to HDFS every minute instead. Just give each file a unique name based on the hostname of the NiFi node writing it.
Thanks, Matt
Created 06-01-2017 02:58 PM
The only way to create one single zip file is to have one node perform the zipping of all the files. This sounds less than ideal. How large is each of these individual zip files, and how many FlowFiles on average go into each one?
Created 06-01-2017 03:10 PM
Thanks @Matt Clarke
Each zip could be around 200 MB and contains approximately 40K FlowFiles.
Is there any way the cluster can route the FlowFiles to the Primary Node? That way, the primary node would be the only one responsible for writing to HDFS.
Alvaro
Created 06-01-2017 03:23 PM
The primary node could change at any time.
You could use the PostHTTP and ListenHTTP processors to route FlowFiles from multiple nodes to a single node. My concern would be the heap usage needed to merge (zip) 160K FlowFiles on a single NiFi node. The FlowFile metadata for all the FlowFiles being zipped would be held in heap memory until the zip is complete.
Any objection to having a zip of zips?
In other words, you could still create 4 unique zip files (1 per node, each with a unique filename), then send these zip files to one node to be zipped once more into a new zip with the single name you want written to HDFS.
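As a rough illustration of the zip-of-zips idea outside NiFi, here is a minimal Python sketch. The node names, file names, and the `zip_bytes` helper are all hypothetical; in NiFi the same effect would presumably come from a second MergeContent running on the single receiving node.

```python
import io
import zipfile


def zip_bytes(entries, compression=zipfile.ZIP_DEFLATED):
    """Build a zip archive in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression) as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical per-node output: each node zips only its own FlowFiles and
# names the result after itself, so no two nodes share a filename.
node_zips = {
    f"node{n}_example_1202.zip": zip_bytes(
        {f"flowfile_{n}.txt": f"data from node {n}".encode()}
    )
    for n in range(1, 5)
}

# One node then wraps the four node-level zips into the final archive.
# ZIP_STORED avoids recompressing data that is already compressed.
final_zip = zip_bytes(node_zips, compression=zipfile.ZIP_STORED)

with zipfile.ZipFile(io.BytesIO(final_zip)) as outer:
    print(outer.namelist())
```

The node doing the final merge only handles four entries rather than every individual FlowFile, which is the point of the suggestion.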
Thanks,
Matt
Created 06-01-2017 03:15 PM
You could add something specific to the host machine into the filename through expression language.
One option would be to update the filename to ${hostname(false)}${filename}
Or you could define a variable in the bootstrap.conf of each node like java.arg.16=-Dsystem.id=<SOME_NUMBER> and then set the filename to be ${system.id}-${filename}
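To see why either option avoids the collision, here is a small Python sketch of the naming scheme. It is not NiFi code: `node_filename` and its `node_id` parameter are hypothetical stand-ins for the expression language above, with `node_id` playing the role of the `system.id` property.

```python
import socket


def node_filename(filename, node_id=None):
    """Prefix a filename with a node-specific value.

    node_id stands in for a per-node setting such as system.id; when it is
    omitted, the local short hostname is used instead.
    """
    if node_id is None:
        node_id = socket.gethostname().split(".")[0]
    return f"{node_id}-{filename}"


# Four hypothetical nodes that all produce the same base filename.
names = [node_filename("example_1202.zip", node_id=str(n)) for n in range(1, 5)]
print(names)
```

Each node now writes to its own path, so no two nodes ever target the same HDFS file.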
Also, appending to zip files will not work even if everything were running on a single NiFi node. An HDFS append just writes raw bytes to the end of a file, so you'll end up with a single file that actually contains the bytes of multiple zip files and can't be processed by tools that expect to read a zip file.
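The effect of a raw byte append can be reproduced locally. The following Python sketch builds two small zip archives in memory and concatenates them, which is an assumption about what an append-mode write would leave on disk; the file names and helpers are made up for illustration.

```python
import io
import zipfile


def zip_bytes(name, payload):
    """Build a one-entry zip archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def readable_entries(data):
    """Return the entry names a zip reader can see, or [] if it rejects the data."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.namelist()
    except zipfile.BadZipFile:
        return []


first = zip_bytes("a.txt", b"written by node 1")
second = zip_bytes("b.txt", b"appended by node 2")

# An append adds the second archive's raw bytes after the first archive.
appended = first + second

print(readable_entries(first))
print(readable_entries(appended))
```

Whether a given tool rejects the combined file or reads only part of it depends on the tool, but with Python's `zipfile` the entry from the first archive no longer shows up in the listing.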