CDH 5.15 single-node cluster installed using Cloudera Manager on CentOS 7.x on an AWS EC2 instance (8 CPUs, 64 GB RAM).
Verified that WebHDFS is running, and I am connecting from a remote machine (a non-Hadoop client) after connecting to the environment with an SSH key.
I am using the PyWebHdfsClient library (from the pywebhdfs package) to list, read, and write files on HDFS.
The following code works -
from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='IP_ADDR', port='50070', user_name='hdfs', timeout=1)  # your NameNode IP & username here
my_dir = 'ds-datalake/misc'
pprint(hdfs.list_dir(my_dir))
{u'FileStatuses': {u'FileStatus': [{u'accessTime': 1534856157369L,
u'blockSize': 134217728,
u'childrenNum': 0,
u'fileId': 25173,
u'group': u'supergroup',
u'length': 28,
u'modificationTime': 1534856157544L,
u'owner': u'centos',
u'pathSuffix': u'sample.txt',
u'permission': u'644',
u'replication': 3,
u'storagePolicy': 0,
u'type': u'FILE'}]}}
But when I try to read or write at the same location, using something like this:
my_file = 'ds-datalake/misc/sample.txt'
print(hdfs.read_file(my_file))
I get the following error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='HOST_NAME', port=50075): Max retries exceeded with url: /webhdfs/v1/ds-datalake/misc/sample.txt?op=OPEN&user.name=hdfs&namenoderpcaddress=HOST_NAME:8020&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000000068F4828>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
This is what the HDFS folder looks like:
hadoop fs -ls /ds-datalake/misc
Found 1 items
-rwxrwxrwx 3 centos supergroup 28 2018-08-21 12:55 /ds-datalake/misc/sample.txt
Can you please help me? I have two single-node test clusters and this happens on both. The HDFS NameNode UI comes up fine from the Cloudera Manager site and all services look healthy.
Thanks.
Created on 08-22-2018 06:31 AM - edited 08-22-2018 06:36 AM
It is a single-node cluster; the NameNode is also the DataNode.
Also, why is it able to list the directory contents but not read or write from them?
Created 08-22-2018 06:41 AM
Another thing came to mind: I am using an Elastic IP for the public IP address, which is what I put in the code. It resolves to the private (EC2-internal) hostname, as I can see in the error.
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ip-172-31-26-58.ec2.internal', port=50075): Max retries exceeded with url: /webhdfs/v1/tmp/sample.txt?op=OPEN&user.name=hdfs&namenoderpcaddress=ip-172-31-26-58.ec2.internal:8020&offset=0 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000000007693828>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))
The security group is also configured to allow these ports from my work IP address range.
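For reference, here is a minimal sketch (using requests directly rather than PyWebHdfsClient; the NameNode address below is a placeholder) that shows why listing works but OPEN fails: the NameNode answers LISTSTATUS itself, while OPEN returns a 307 redirect to a DataNode on port 50075, and that redirect carries the EC2-internal hostname that the remote client cannot resolve.

import requests

NAMENODE = 'IP_ADDR'  # placeholder: public/Elastic IP of the single-node cluster

# LISTSTATUS is served by the NameNode itself, so only port 50070 needs to be reachable.
r = requests.get('http://%s:50070/webhdfs/v1/ds-datalake/misc?op=LISTSTATUS&user.name=hdfs' % NAMENODE)
print(r.json())

# OPEN is redirected (HTTP 307) to a DataNode on port 50075; the Location header shows
# the hostname the DataNode registered with (here the ip-172-31-...ec2.internal name),
# which is what the client then fails to resolve ("getaddrinfo failed").
r = requests.get('http://%s:50070/webhdfs/v1/ds-datalake/misc/sample.txt?op=OPEN&user.name=hdfs' % NAMENODE,
                 allow_redirects=False)
print(r.status_code, r.headers.get('Location'))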
Created 08-22-2018 07:03 AM
I found this, which I think is somewhat relevant - https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresse...
But my problem is that I am connecting from a remote, non-Hadoop edge machine, so there are no Hadoop config files here.
Created 08-23-2018 06:27 AM
Solution found.
In the hosts file of the Python client machine, add an entry mapping the cluster's public IP to its private host name.
This is appropriate for a cloud service like AWS.
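For example (the public IP below is a placeholder; the private hostname is the EC2-internal name that appears in the error above):

# /etc/hosts on Linux/macOS, or C:\Windows\System32\drivers\etc\hosts on Windows
203.0.113.10   ip-172-31-26-58.ec2.internal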
Python lib works fine now.
Thanks to @bgooley for the help on another thread that also resolved this.
Created 12-12-2018 10:20 PM
I am having the same problem. Can you please explain the 'hosts file' and how to add the IP and hostname? Are we still using the IP and hostname of the NameNode?