Can NiFi PutHDFS support calls to HDFS via Knox?

Explorer

Can NiFi PutHDFS support calls to HDFS via Knox? I can get the processor to use webhdfs:// and swebhdfs://, but the Knox gateway requires that it use https://. When I try to use https I get the following error:

Caused by: java.io.IOException: No FileSystem for scheme: https
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
        at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$1.run(AbstractHadoopProcessor.java:304)

When I try to use swebhdfs:// instead, I get 404 errors:

2018-04-13 15:33:01,886 ERROR [Timer-Driven Process Thread-5] o.apache.nifi.processors.hadoop.PutHDFS PutHDFS[id=9fe036d7-0161-1000-6e71-1c8d864a510b] Failed to write to HDFS due to java.io.IOException: server:8443: Unexpected HTTP response: code=404 != 200, op=GETFILESTATUS, message=Not Found: {}
java.io.IOException: server:8443: Unexpected HTTP response: code=404 != 200, op=GETFILESTATUS, message=Not Found

If I run the same call from the same computer using curl, it returns fine:

curl -i -k -u guest:guest-password -X GET 'https://server:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS'

Any help would be greatly appreciated, thanks.

Super Collaborator

Hi @Damon Jones,

PutHDFS does the job when you need to connect to HDFS through the native Hadoop client libraries, whereas Knox exposes HDFS operations through REST calls, so it is more convenient to use the PostHTTP or InvokeHTTP processor.

Hope this helps !!

Explorer

Thanks for the response @bkosaraju. If I call Knox using PostHTTP or invokeHTTP processors, would I need to implement an instance of webhdfs protocol behind them? Is that a simple task?

Super Collaborator

Hi @Damon Jones,

No need to implement anything, as Knox does it for you. You just need to send the request the same way you would with curl, which is what InvokeHTTP or PostHTTP does.

Here is a reference with some examples that may help:

https://cwiki.apache.org/confluence/display/KNOX/Examples+WebHDFS
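
As a rough sketch of what such a REST call looks like, the helper below builds a Knox WebHDFS URL in the same shape as the curl command from the question. The function name knox_webhdfs_url and its argument order are hypothetical, chosen only for illustration; the host, port, and "default" topology are the ones from the question.

```shell
# Build a Knox WebHDFS URL of the form:
#   https://<gateway>/gateway/<topology>/webhdfs/v1<path>?op=<OP>
# knox_webhdfs_url is a hypothetical helper, not part of Knox or NiFi.
# Arguments: $1 = gateway host:port, $2 = topology, $3 = HDFS path, $4 = operation
knox_webhdfs_url() {
  printf 'https://%s/gateway/%s/webhdfs/v1%s?op=%s\n' "$1" "$2" "$3" "$4"
}

# Example call (needs a reachable gateway; credentials are the ones from the question):
# curl -i -k -u guest:guest-password -X GET \
#   "$(knox_webhdfs_url server:8443 default / GETFILESTATUS)"
```

The same URL is what would go into the remote URL of an InvokeHTTP or PostHTTP processor.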

Contributor

InvokeHTTP should be preferred, @Damon Jones.

Explorer

Thanks @bkosaraju and @Otto Fowler, that looks like it will work. I'll give it a try.

Explorer

So in trying to write a file using curl, I cannot seem to get it to work. I run the first command:

curl -i -k -u guest:guest-password -X PUT 'https://server:8443/gateway/default/webhdfs/v1/tmp/out1.txt?user.name=client1&op=CREATE'

I get back the following response:

HTTP/1.1 307 Temporary Redirect
Date: Fri, 20 Apr 2018 03:42:05 GMT
Set-Cookie: JSESSIONID=11clfd3vvp60x18w677o1nh79m;Path=/gateway/default;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/default; Max-Age=0; Expires=Thu, 19-Apr-2018 03:42:06 GMT
Cache-Control: no-cache
Expires: Fri, 20 Apr 2018 03:42:06 GMT
Date: Fri, 20 Apr 2018 03:42:06 GMT
Pragma: no-cache
Expires: Fri, 20 Apr 2018 03:42:06 GMT
Date: Fri, 20 Apr 2018 03:42:06 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Location: https://server:8443/gateway/default/webhdfs/data/v1/webhdfs/v1/tmp/out1.txt?_=AAAACAAAABAAAADQ6aROxC...
Content-Type: application/octet-stream
Server: Jetty(6.1.26.hwx)
Content-Length: 0

I then run the second curl command:

curl -i -k -T /tmp/out1.txt -u guest:guest-password -X PUT 'https://server:8443/gateway/default/webhdfs/v1/tmp/out1.txt?op=CREATE&overwrite=false'

I get back the following response:

HTTP/1.1 307 Temporary Redirect
Date: Fri, 20 Apr 2018 03:43:18 GMT
Set-Cookie: JSESSIONID=kzbc8io0byk65k8j9tsatl2h;Path=/gateway/default;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/default; Max-Age=0; Expires=Thu, 19-Apr-2018 03:43:18 GMT
Cache-Control: no-cache
Expires: Fri, 20 Apr 2018 03:43:18 GMT
Date: Fri, 20 Apr 2018 03:43:18 GMT
Pragma: no-cache
Expires: Fri, 20 Apr 2018 03:43:18 GMT
Date: Fri, 20 Apr 2018 03:43:18 GMT
Pragma: no-cache
X-FRAME-OPTIONS: SAMEORIGIN
Location: https://server:8443/gateway/default/webhdfs/data/v1/webhdfs/v1/tmp/out1.txt?_=AAAACAAAABAAAADQ6aROxC...
Content-Type: application/octet-stream
Server: Jetty(6.1.26.hwx)
Content-Length: 0
Connection: close

The file is not appearing in the /tmp/ HDFS directory. Is there something I'm doing wrong in the curl command in regard to the URL?

Thanks.

Explorer

Figured out my issue. For the second curl call, the one that uploads the file, the URL after -X PUT should be the Location URL returned by the previous curl call. In hindsight it makes sense. Thanks for the help!
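
For anyone scripting this, here is a minimal sketch of the two-step flow, assuming the gateway answers the first PUT with a 307 and a Location header as in the responses above. The names extract_location and knox_put_file are hypothetical helpers, and the guest credentials are the ones used earlier in this thread.

```shell
# Pull the Location header value out of a `curl -i` response read from stdin.
# extract_location is a hypothetical helper name.
extract_location() {
  # HTTP header lines end in CRLF, so drop the carriage returns first.
  tr -d '\r' | sed -n 's/^[Ll]ocation: *//p' | head -n 1
}

# Two-step upload (hypothetical wrapper; needs a live gateway to run).
# Arguments: $1 = local file, $2 = Knox WebHDFS URL with op=CREATE
knox_put_file() {
  local_file="$1"
  create_url="$2"
  # Step 1: ask Knox where the data should be sent; no file is attached yet.
  location=$(curl -s -i -k -u guest:guest-password -X PUT "$create_url" | extract_location)
  if [ -z "$location" ]; then
    echo "no Location header returned" >&2
    return 1
  fi
  # Step 2: PUT the file body to the Location URL from step 1.
  curl -i -k -u guest:guest-password -T "$local_file" -X PUT "$location"
}
```

It could be called as: knox_put_file /tmp/out1.txt 'https://server:8443/gateway/default/webhdfs/v1/tmp/out1.txt?op=CREATE&overwrite=false'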
