02-02-2018 12:14 AM
I am trying to move data from HDFS to an S3 bucket. I am using the latest version of CM/CDH (5.14.0). I have been able to copy data using the aws CLI tool:
aws s3api put-object
And also with the Python SDK, but I cannot copy data with hadoop distcp. I have added the following extra properties to core-site.xml in the HDFS service:
<property>
  <name>fs.s3a.access.key</name>
  <value>X</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>X</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.us-east-2.amazonaws.com</value>
</property>
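(For reference, the same properties can also be supplied per command with Hadoop's generic -D options instead of editing core-site.xml; the values below are placeholders, not my real keys:)

hadoop fs \
  -D fs.s3a.access.key=MY_ACCESS_KEY \
  -D fs.s3a.secret.key=MY_SECRET_KEY \
  -D fs.s3a.endpoint=s3.us-east-2.amazonaws.com \
  -ls s3a://bucket-name/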
Nothing happens when I execute a command like
hadoop distcp /blablabla s3a://bucket-name/
but it just hangs for a while (I guess it is retrying several times). The same thing happens when I try to list the files in the bucket with
hadoop fs -ls s3a://bucket-name
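(In case it helps with debugging, client-side logging can be turned up for a single run; this only raises the log level, the command itself is unchanged:)

HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls s3a://bucket-name/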
I am sure it is not a credentials problem, since I can connect with the same access key and secret key using the Python SDK and the aws CLI tool.
Anyone facing a similar issue? Thanks!
02-07-2018 03:59 PM
Distcp can take some time to complete depending on your source data.
One thing to try is listing a public bucket. I believe that with no credentials set you'll see an error, but with any valid credentials you should be able to list it:
hadoop fs -ls s3a://landsat-pds/
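If even that hangs, you can try forcing anonymous access to take your credentials out of the picture entirely; this sketch assumes your CDH build ships the S3A anonymous credentials provider class:

hadoop fs -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider -ls s3a://landsat-pds/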
Also make sure you've deployed your client configs in Cloudera Manager (CM).
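One quick way to confirm the deployed config actually reached the host you run distcp from is to query a key from the client configuration, e.g.:

hdfs getconf -confKey fs.s3a.endpoint

If that doesn't return the endpoint you set, the client configs likely haven't been deployed to that host yet.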