Created 10-03-2016 08:04 PM
I want to write to HDFS with NiFi, but NiFi is on a different network, so I have to go over WebHDFS (via Knox). I'm trying to use the InvokeHTTP processor and am testing with a simple upstream GetFile. I've tried setting Follow Redirects to true and including the file in the PUT body, but that fails, presumably because the processor can't follow the redirect properly, as outlined in https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE. So I am going down the path of two InvokeHTTP calls: the first creates the inode (with Follow Redirects false and no body in the PUT), and the second PUTs the body to the Location returned in the first response.
The first call works and I get back a Location header naming the datanode that will write my file. But I can't figure out how to pull that Location string out of the response header. The response code (307) and a few other fields are accessible, but not the Location string (since it's in the header, and the body is empty). The only reason I know it's coming back at all is from turning on NiFi debug logging and poring over nifi-app.log (i.e., it's definitely not an attribute on the flowfile).
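For reference, the two-step CREATE dance described above can be sketched outside NiFi. This is a minimal illustration against a tiny stub server standing in for the NameNode/DataNode pair; a real cluster would use the Knox gateway URL, and all hostnames, ports, and paths here are made up. The key move is disabling redirect-following so the 307's Location header can be read:

```python
import http.server
import threading
import urllib.request
import urllib.error

class StubWebHdfs(http.server.BaseHTTPRequestHandler):
    """Mimics WebHDFS: CREATE answers 307 + Location, the follow-up PUT 201."""
    received = {}

    def do_PUT(self):
        if "op=CREATE" in self.path and "datanode" not in self.path:
            # NameNode role: redirect to the "datanode" URL without a body.
            self.send_response(307)
            self.send_header("Location",
                             f"http://127.0.0.1:{self.server.server_port}"
                             f"/datanode{self.path}")
            self.end_headers()
        else:
            # DataNode role: accept the file content.
            length = int(self.headers.get("Content-Length", 0))
            StubWebHdfs.received["body"] = self.rfile.read(length)
            self.send_response(201)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), StubWebHdfs)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface the 307 instead of silently following it

opener = urllib.request.build_opener(NoRedirect)

# Step 1: PUT with an empty body; the 307 is raised as an HTTPError,
# and the Location header is read off the error response.
req = urllib.request.Request(f"{base}/webhdfs/v1/tmp/test.txt?op=CREATE",
                             data=b"", method="PUT")
try:
    opener.open(req)
    location = None
except urllib.error.HTTPError as e:
    location = e.headers["Location"]

# Step 2: PUT the real content to the DataNode URL from Location.
req2 = urllib.request.Request(location, data=b"hello hdfs\n", method="PUT")
status = urllib.request.urlopen(req2).status

server.shutdown()
```

The sketch makes the NiFi gap concrete: whatever does the first PUT must expose that Location header to the second one, which is exactly what InvokeHTTP doesn't appear to do here.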
This is NiFi 1.0, HDP 2.2, and Java 1.8.0_77.
Any ideas?
Created 10-05-2016 04:51 AM
I have not tried this yet, but just a suggestion: @Oliver Meyn, have you tried using the PutHDFS processor? Pull the core-site.xml and hdfs-site.xml from the target cluster and store them in a location on your NiFi cluster, then reference them in the processor. Verify that DNS resolves; if it cannot be resolved, use IPs in the site XMLs.
Created 10-05-2016 01:11 PM
Because NiFi is in a different network, the access rules block it from even seeing the cluster machines. On top of that, it can't see the KDC for the cluster network (so it couldn't authenticate). It has to go through Knox (which means WebHDFS). I'm surprised this appears to be an edge case - I'd have thought many orgs have separate, heavily firewalled networks talking to their clusters.
Created 10-05-2016 01:53 PM
There was a discussion about this at one point which resulted in this JIRA:
https://issues.apache.org/jira/browse/NIFI-1924
It was determined that rather than creating new processors, it should be possible to change the scheme of the filesystem from hdfs:// to webhdfs:// and still use the existing processors.
It is unclear to me whether this ended up fully working or not.
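If that approach panned out, the idea (untested here) would be to point the existing HDFS processors at a core-site.xml whose default filesystem uses the webhdfs scheme, something like the hypothetical fragment below; the host and port are made up, though WebHDFS conventionally listens on the NameNode HTTP port (50070 on HDP 2.x):

```xml
<!-- Hypothetical core-site.xml fragment; hostname is an assumption. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://namenode.example.com:50070</value>
  </property>
</configuration>
```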
Created 10-05-2016 02:42 PM
Nice find @Bryan Bende - not an instant solution but there is hope.
Created 10-09-2017 12:06 PM
@Oliver Meyn: I'm sitting in front of the same problem: NiFi -> WebHDFS. Did you find a solution?
Created 10-09-2017 05:24 PM
Sadly no, @Tilmann Piffl. We ended up with one NiFi outside the HDP cluster network and one inside the cluster network. Then we had the two talk to each other over Site-to-Site, and the internal one could write to HDFS directly with PutHDFS.
Created 10-10-2017 06:36 AM
Thanks, @Oliver Meyn, I hadn't thought about this approach. It might be a last resort for us, too.