Created 04-13-2018 11:22 PM
Can NiFi PutHDFS support calls to HDFS via Knox? I can get the processor to use webhdfs:// and swebhdfs://, but the Knox gateway requires that it use https://. When I try to use https I get the following error:
Caused by: java.io.IOException: No FileSystem for scheme: https
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
    at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$1.run(AbstractHadoopProcessor.java:304)
When I try to use swebhdfs:// instead, I get 404 errors:
2018-04-13 15:33:01,886 ERROR [Timer-Driven Process Thread-5] o.apache.nifi.processors.hadoop.PutHDFS PutHDFS[id=9fe036d7-0161-1000-6e71-1c8d864a510b] Failed to write to HDFS due to java.io.IOException: server:8443: Unexpected HTTP response: code=404 != 200, op=GETFILESTATUS, message=Not Found: {}
java.io.IOException: server:8443: Unexpected HTTP response: code=404 != 200, op=GETFILESTATUS, message=Not Found
If I run the same call from the same computer using curl, it returns fine:
curl -i -k -u guest:guest-password -X GET 'https://server:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS'
Any help would be greatly appreciated, thanks.
Created 04-16-2018 12:00 AM
Hi @Damon Jones,
PutHDFS is the right choice when you connect to HDFS through the native Hadoop client libraries, whereas Knox exposes HDFS operations through REST calls, so it is more convenient to use the PostHTTP or InvokeHTTP processor.
Hope this helps !!
Created 04-16-2018 11:05 PM
Thanks for the response @bkosaraju. If I call Knox using PostHTTP or invokeHTTP processors, would I need to implement an instance of webhdfs protocol behind them? Is that a simple task?
Created 04-16-2018 11:44 PM
Hi @Damon Jones,
There is no need to implement anything, as Knox does that for you. You just need to send the request the same way the curl calls do, which is what InvokeHTTP or PostHTTP does.
Here is the reference, with some examples that may help:
https://cwiki.apache.org/confluence/display/KNOX/Examples+WebHDFS
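For anyone following along, here is a minimal sketch of how the Knox WebHDFS URL is put together, based only on the URL pattern in the curl call earlier in this thread. The function name `knox_webhdfs_url` and its arguments are made up for illustration; the resulting string is what you would pass to curl, or (as an assumption about your flow) set as the Remote URL of InvokeHTTP.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a Knox WebHDFS URL from its parts.
# The /gateway/<topology>/webhdfs/v1 layout is taken from the curl call in this thread.
knox_webhdfs_url() {
  local gateway="$1"    # gateway base URL, e.g. https://server:8443
  local topology="$2"   # Knox topology name, e.g. default
  local hdfs_path="$3"  # absolute HDFS path, e.g. /tmp/out1.txt
  local op="$4"         # WebHDFS operation, e.g. GETFILESTATUS
  # ${gateway%/} drops one trailing slash so the result has no double slash.
  printf '%s/gateway/%s/webhdfs/v1%s?op=%s\n' "${gateway%/}" "$topology" "$hdfs_path" "$op"
}
```

Called as `knox_webhdfs_url https://server:8443 default / GETFILESTATUS`, it produces the same URL as the working curl command in the original question.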
Created 04-17-2018 12:03 AM
InvokeHTTP should be preferred, @Damon Jones.
Created 04-19-2018 07:40 PM
Thanks @bkosaraju and @Otto Fowler that looks like it will work. I'll give it a try.
Created 04-20-2018 03:59 AM
So in trying to write a file using curl, I cannot seem to get it to work. I run the first command:
curl -i -k -u guest:guest-password -X PUT 'https://server:8443/gateway/default/webhdfs/v1/tmp/out1.txt?user.name=client1&op=CREATE'
I get back the following response:
HTTP/1.1 307 Temporary Redirect
Date: Fri, 20 Apr 2018 03:42:05 GMT
Set-Cookie: JSESSIONID=11clfd3vvp60x18w677o1nh79m;Path=/gateway/default;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/default; Max-Age=0; Expires=Thu, 19-Apr-2018 03:42:06 GMT
Cache-Control: no-cache
Expires: Fri, 20 Apr 2018 03:42:06 GMT
Date: Fri, 20 Apr 2018 03:42:06 GMT
Pragma: no-cache
Expires: Fri, 20 Apr 2018 03:42:06 GMT
Date: Fri, 20 Apr 2018 03:42:06 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Location: https://server:8443/gateway/default/webhdfs/data/v1/webhdfs/v1/tmp/out1.txt?_=AAAACAAAABAAAADQ6aROxC...
Content-Type: application/octet-stream
Server: Jetty(6.1.26.hwx)
Content-Length: 0
I then run the second curl command:
curl -i -k -T /tmp/out1.txt -u guest:guest-password -X PUT 'https://server:8443/gateway/default/webhdfs/v1/tmp/out1.txt?op=CREATE&overwrite=false'
I get back the following response:
HTTP/1.1 307 Temporary Redirect
Date: Fri, 20 Apr 2018 03:43:18 GMT
Set-Cookie: JSESSIONID=kzbc8io0byk65k8j9tsatl2h;Path=/gateway/default;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/default; Max-Age=0; Expires=Thu, 19-Apr-2018 03:43:18 GMT
Cache-Control: no-cache
Expires: Fri, 20 Apr 2018 03:43:18 GMT
Date: Fri, 20 Apr 2018 03:43:18 GMT
Pragma: no-cache
Expires: Fri, 20 Apr 2018 03:43:18 GMT
Date: Fri, 20 Apr 2018 03:43:18 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Location: https://server:8443/gateway/default/webhdfs/data/v1/webhdfs/v1/tmp/out1.txt?_=AAAACAAAABAAAADQ6aROxC...
Content-Type: application/octet-stream
Server: Jetty(6.1.26.hwx)
Content-Length: 0
Connection: close
The file is not appearing in the /tmp/ HDFS directory. Is there something I'm doing wrong in the curl command in regard to the URL?
Thanks.
Created 04-23-2018 10:26 PM
Figured out my issue. For the second curl call, the one that uploads the file, the URL after -X PUT should be the Location URL returned in the response to the first curl call, not the original webhdfs/v1 URL. In hindsight it makes sense now. Thanks for the help!
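In case it helps someone else, the two steps can be sketched as a small script. This is only a sketch under assumptions: the function names `extract_location` and `knox_put_file` are made up, the target URL and credentials are whatever you pass in, and `-k` (skip certificate verification) is kept only because the commands above use it.

```shell
#!/usr/bin/env bash
# Read `curl -i` output on stdin and print the value of the first Location header.
extract_location() {
  # Strip carriage returns from the HTTP headers, then match the header name
  # case-insensitively and print the URL that follows it.
  tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Hypothetical helper for the two-step WebHDFS CREATE through Knox.
# $1 = local file, $2 = Knox WebHDFS URL of the target path (no query string),
# $3 = user:password
knox_put_file() {
  local file="$1" url="$2" creds="$3" location
  # Step 1: request the upload location; no file body is sent yet.
  location="$(curl -s -i -k -u "$creds" -X PUT "${url}?op=CREATE" | extract_location)"
  if [ -z "$location" ]; then
    echo "no Location header in the first response" >&2
    return 1
  fi
  # Step 2: upload the file to the URL from the Location header.
  curl -s -i -k -u "$creds" -T "$file" -X PUT "$location"
}
```

With the values from this thread, the call would look like `knox_put_file /tmp/out1.txt https://server:8443/gateway/default/webhdfs/v1/tmp/out1.txt guest:guest-password`.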
Created 04-23-2018 10:27 PM
Post that helped me: http://www-01.ibm.com/support/docview.wss?uid=swg21976974