AKB
Contributor
Posts: 55
Registered: ‎04-11-2018

Unable to access HDFS Namenode from Python library - Max retries exceeded with url


CDH 5.15 single-node cluster installed using CM on CentOS 7.x on an AWS EC2 instance (8 CPU, 64 GB RAM).

 

Verified WebHDFS is running. I am connecting from a remote machine (a non-Hadoop client) after connecting to the environment using an SSH key.

 

I am using the PyWebHdfsClient library to list, read, and write files on HDFS.

 

The following code works -

from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='IP_ADDR', port='50070', user_name='hdfs', timeout=1)  # your NameNode IP & username here
my_dir = 'ds-datalake/misc'
pprint(hdfs.list_dir(my_dir))

{u'FileStatuses': {u'FileStatus': [{u'accessTime': 1534856157369L,
u'blockSize': 134217728,
u'childrenNum': 0,
u'fileId': 25173,
u'group': u'supergroup',
u'length': 28,
u'modificationTime': 1534856157544L,
u'owner': u'centos',
u'pathSuffix': u'sample.txt',
u'permission': u'644',
u'replication': 3,
u'storagePolicy': 0,
u'type': u'FILE'}]}}

 

But when I try to read/write at the same location, using something like this:

 

my_file = 'ds-datalake/misc/sample.txt'
print(hdfs.read_file(my_file))

 

 I get the following error:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='HOST_NAME', port=50075): Max retries exceeded with url: /webhdfs/v1/ds-datalake/misc/sample.txt?op=OPEN&user.name=hdfs&namenoderpcaddress=HOST_NAME:8020&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000068F4828>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

 

This is what the HDFS folder looks like:

hadoop fs -ls /ds-datalake/misc
Found 1 items
-rwxrwxrwx 3 centos supergroup 28 2018-08-21 12:55 /ds-datalake/misc/sample.txt

 

Can you please help me? I have two single-node test clusters and this happens on both. The HDFS NameNode UI comes up fine from the CM site and all services look healthy.

 

Thanks.

Posts: 1,827
Kudos: 406
Solutions: 292
Registered: ‎07-31-2013

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url

It appears as though your remote (client) machine has network access and/or DNS resolution only for the NameNode host, but not for the DataNode hosts.

When using the WebHDFS protocol against the NameNode, a CREATE or READ call will typically result in the NameNode sending back a 30x (typically 307) redirect pointing your client at a chosen target DataNode that will handle the rest of the data-oriented work. The NameNode only handles metadata requests and does not want to be burdened with the actual data-streaming overhead, so it redirects clients to one of the 'worker' WebHDFS servlet hosts (i.e. the DataNodes).
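
You can see this redirect from your client with a few lines of Python (a sketch only; the NameNode address and path are placeholders for your own values):

import requests

# Ask the NameNode for the file but do not follow the redirect,
# so we can inspect where it wants to send the client.
r = requests.get(
    'http://NAMENODE_IP:50070/webhdfs/v1/ds-datalake/misc/sample.txt',
    params={'op': 'OPEN', 'user.name': 'hdfs'},
    allow_redirects=False,
)
print(r.status_code)               # typically 307
print(r.headers.get('Location'))   # DataNode host:50075 that your client must also reach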

This is documented at http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and you should be able to verify this in your error - the HOST_NAME that you've masked away for port 50075 is a DataNode service host/port.

Ensure your client can connect to and name-resolve all DataNode hostnames/ports, not just the NameNode's, for the WebHDFS client to work.

If you need a more one-stop gateway solution, run an HttpFS service and point your client code at just that web host:port instead of the NameNode web address. The HttpFS service's WebHDFS API will not require redirection, as it acts as a 'proxy' and handles all calls for you from one location.
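
Since HttpFS exposes the same WebHDFS REST API, the same client library should work against it. As a sketch only (HTTPFS_HOST is a placeholder, and 14000 is the default HttpFS web port in CDH):

from pywebhdfs.webhdfs import PyWebHdfsClient

# All calls go through the single HttpFS endpoint; no DataNode redirects.
hdfs = PyWebHdfsClient(host='HTTPFS_HOST', port='14000', user_name='hdfs')
print(hdfs.read_file('ds-datalake/misc/sample.txt'))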
AKB
Contributor
Posts: 55
Registered: ‎04-11-2018

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url


It is a single-node cluster; the NN is also the DN.


Also, why is it able to list the directory contents but cannot seem to read/write from it?

Posts: 1,827
Kudos: 406
Solutions: 292
Registered: ‎07-31-2013

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url

Yes, but (a) is your client able to resolve the hostname of the DN/NN (you seem to be using an IP in your code), and (b) does it have permission (firewall, etc.) to connect to the DN web port?
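
For example, a quick check from the client machine (a sketch only; DATANODE_HOSTNAME is a placeholder for the host shown in your error message):

import socket

# (a) Does the DataNode hostname resolve on the client?
print(socket.gethostbyname('DATANODE_HOSTNAME'))

# (b) Can the client open a TCP connection to the DataNode web port?
# Raises an exception if DNS or the firewall blocks it.
socket.create_connection(('DATANODE_HOSTNAME', 50075), timeout=5)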
AKB
Contributor
Posts: 55
Registered: ‎04-11-2018

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url

Another thing came to mind. I am using an Elastic IP for the public IP address, which is what I put in the code. It does resolve to the private hostname, as I can see in the error.

 

requests.exceptions.ConnectionError: HTTPConnectionPool(host='ip-172-31-26-58.ec2.internal', port=50075): Max retries exceeded with url: /webhdfs/v1/tmp/sample.txt?op=OPEN&user.name=hdfs&namenoderpcaddress=ip-172-31-26-58.ec2.internal:8020&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000000007693828>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

 

The security group is also configured to allow access to these ports from my work IP address range.

AKB
Contributor
Posts: 55
Registered: ‎04-11-2018

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url

I found this which is somewhat relevant, I think - https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresse...

 

But my problem is that I am trying to connect from a remote, non-Hadoop edge node machine, so there are no Hadoop config files here.

AKB
Contributor
Posts: 55
Registered: ‎04-11-2018

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url

Solution found.

 

In the hosts file of the Python client machine, add an entry mapping the cluster's public IP to its private hostname.

This is appropriate for a cloud service like AWS.
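
For example, an /etc/hosts entry like this (the public IP below is a placeholder; the private hostname is the one from the error message above):

203.0.113.25    ip-172-31-26-58.ec2.internal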

 

Python lib works fine now.

 

Thanks to @bgooley for help on another thread that also resolved this.

New Contributor
Posts: 1
Registered: ‎12-12-2018

Re: Unable to access HDFS Namenode from Python library - Max retries exceeded with url

I am having the same problem. Can you please explain the 'hosts file' and how I can add the IP and hostname? Are we still using the IP and hostname of the NameNode?
