
Unable to access HDFS Namenode from Python library - Max retries exceeded with url

Contributor

CDH 5.15 single-node cluster installed using CM on CentOS 7.x on an AWS EC2 instance: 8 CPUs, 64 GB RAM.

 

Verified WebHDFS is running, and I am connecting from a remote machine (a non-Hadoop client) after connecting to the environment with an SSH key.

 

I am using the PyWebHdfsClient library to list, read, and write files on HDFS.

 

The following code works:

from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='IP_ADDR', port='50070', user_name='hdfs', timeout=1)  # your NameNode IP & username here
my_dir = 'ds-datalake/misc'
pprint(hdfs.list_dir(my_dir))

{u'FileStatuses': {u'FileStatus': [{u'accessTime': 1534856157369L,
u'blockSize': 134217728,
u'childrenNum': 0,
u'fileId': 25173,
u'group': u'supergroup',
u'length': 28,
u'modificationTime': 1534856157544L,
u'owner': u'centos',
u'pathSuffix': u'sample.txt',
u'permission': u'644',
u'replication': 3,
u'storagePolicy': 0,
u'type': u'FILE'}]}}

 

But when I try to read or write at the same location, using something like this:

 

my_file = 'ds-datalake/misc/sample.txt'
print(hdfs.read_file(my_file))

 

 I get the following error:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='HOST_NAME', port=50075): Max retries exceeded with url: /webhdfs/v1/ds-datalake/misc/sample.txt?op=OPEN&user.name=hdfs&namenoderpcaddress=HOST_NAME:8020&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000068F4828>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

 

This is what the HDFS folder looks like:

hadoop fs -ls /ds-datalake/misc
Found 1 items
-rwxrwxrwx 3 centos supergroup 28 2018-08-21 12:55 /ds-datalake/misc/sample.txt

 

Can you please help me? I have two single-node test clusters and this happens on both. The HDFS NameNode UI comes up fine from the CM site and all services look healthy.

 

Thanks.


Mentor
It appears as though your remote (client) machine has network access and/or
DNS resolution only for the NameNode host, but not to the DataNode hosts.

When using the WebHDFS protocol at the NameNode, a CREATE call or a READ
call will typically result in the NameNode sending back a 30x (307
typically) code to redirect your client to a chosen target DataNode service
that will handle the rest of the data-oriented work. The NameNode only
handles metadata requests, and does not desire to be burdened with actual
data streaming overheads so it redirects the clients to one of the 'worker'
WebHDFS servlet hosts (i.e. DataNodes).

This is documented at
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
and
you should be able to verify this in your error - the HOST_NAME that you've
masked away for port 50075 is a DataNode service host/port.
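
For example (a rough sketch, assuming the requests library and the same NameNode
address used in your code), you can issue the OPEN call yourself without following
redirects and inspect the Location header to see which DataNode the NameNode is
sending you to:

import requests

# Ask the NameNode for the file but do not follow the redirect, so we can see
# which DataNode host/port the client is expected to reach next.
url = 'http://IP_ADDR:50070/webhdfs/v1/ds-datalake/misc/sample.txt'
resp = requests.get(url, params={'op': 'OPEN', 'user.name': 'hdfs'}, allow_redirects=False)

print(resp.status_code)              # typically 307 Temporary Redirect
print(resp.headers.get('Location'))  # DataNode URL your client must be able to resolve and reach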

Ensure your client can name-resolve and connect to all DataNode hostnames/ports,
not just the NameNode, for the WebHDFS client to work.

If you need a more one-stop-gateway solution, run an HttpFS service and point
your client code to just that web host:port instead of the NameNode web address.
The HttpFS service's WebHDFS API will not require redirection, as it acts as a
'proxy' and handles all calls for you from one location.
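
As a rough sketch (assuming an HttpFS role running on its default port 14000, with
HTTPFS_HOST as a placeholder for whichever host you deploy it on), the only change
to your original code would be the host and port:

from pywebhdfs.webhdfs import PyWebHdfsClient

# Point the client at the HttpFS service instead of the NameNode web port;
# HttpFS proxies the data itself, so no DataNode redirect is involved.
hdfs = PyWebHdfsClient(host='HTTPFS_HOST', port='14000', user_name='hdfs', timeout=10)
print(hdfs.read_file('ds-datalake/misc/sample.txt'))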

Contributor

It is a single-node cluster; the NN is the DN.


Also, why is it able to list the directory contents but cannot seem to read/write from it?

Mentor
Yes, but is your client able to (a) resolve the hostname of the DN/NN (you
seem to be using an IP in your code) and (b) does it have permission
(firewall, etc.) to connect to the DN web port?
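
A quick way to check both from the remote client (a hedged sketch; substitute the
DataNode hostname and web port shown in your ConnectionError) is:

import socket

# (a) Can the client resolve the DataNode hostname the NameNode redirects to?
datanode_host = 'HOST_NAME'   # hostname from the error message
datanode_port = 50075         # DataNode WebHDFS port from the error message

print(socket.gethostbyname(datanode_host))

# (b) Can the client open a TCP connection to that host/port (firewall / security group)?
sock = socket.create_connection((datanode_host, datanode_port), timeout=5)
print('connected')
sock.close()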

Contributor

Another thing came to mind: I am using an Elastic IP for the public IP address, which is what I put in the code. It does resolve to the private address, as I can see in the error.

 

requests.exceptions.ConnectionError: HTTPConnectionPool(host='ip-172-31-26-58.ec2.internal', port=50075): Max retries exceeded with url: /webhdfs/v1/tmp/sample.txt?op=OPEN&user.name=hdfs&namenoderpcaddress=ip-172-31-26-58.ec2.internal:8020&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000000007693828>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

 

The security group is also configured to allow access on these ports from my work IP address range.

Contributor

I found this, which I think is somewhat relevant: https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresse...

 

But my problem is that I am trying to connect from a remote, non-Hadoop edge node machine, so there are no Hadoop config files here.

Contributor

Solution found.

In the hosts file of the Python client machine, add the public IP and the private host name.

This is appropriate for a cloud service like AWS.
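
For example (a hedged sketch: the private hostname below is the one from my error
message, and EC2_PUBLIC_IP is a placeholder for the instance's Elastic/public IP),
the hosts file entry and a quick resolution check look like this:

# Line added to the client's hosts file (/etc/hosts on Linux/macOS,
# C:\Windows\System32\drivers\etc\hosts on Windows):
#
#   EC2_PUBLIC_IP   ip-172-31-26-58.ec2.internal

import socket

# The private hostname returned by the NameNode's redirect should now resolve
# from the client to the public IP, so the DataNode call can reach the instance.
print(socket.gethostbyname('ip-172-31-26-58.ec2.internal'))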

 

Python lib works fine now.

 

Thanks to @bgooley for help on another thread that also resolved this.

New Contributor

I am having the same problem. Can you please explain the 'hosts file' and how I can add the IP and hostname? Are we still using the IP and hostname of the NameNode?