Reply
Explorer
Posts: 6
Registered: ‎03-09-2017
Accepted Solution

Unable to move data to a S3 bucket using last CDH (5.14.0)

Hi!

 

I am trying to move data from HDFS to a S3 bucket. I am using last version of CM/CDH (5.14.0). I have been able to copy data using the tool aws:

 

aws s3api put-object

And also with the python SDK but I cannot copy data with hadoop distcp. I have added the following extra properties to core-site.xml in HDFS service.

 

s3a.png

<property>
    <name>fs.s3a.access.key</name>
    <value>X</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>X</value>
</property>
<property>
    <name>fs.s3a.endpoint</name>
    <value>s3.us-east-2.amazonaws.com</value>
</property>

Nothing happens when I execute a command like

hadoop distcp /blablabla s3a://bucket-name/

but it hangs for a while (I guess is trying several times). Same thing when I try to just list files in the bucket with

 

hadoop fs -ls s3a://bucket-name

I am sure it is not a credentials problem since I can connect using the same access and secret key with the python SKD and aws tool.

 

Anyone facing a similar issue? Thanks!

Highlighted
Cloudera Employee
Posts: 5
Registered: ‎09-09-2015

Re: Unable to move data to a S3 bucket using last CDH (5.14.0)

Distcp can take some time to complete depending on your source data.

 

One thing to try would be to list a public bucket.  I believe if you have no credentials set you'll see an error, but if you have any valid credentials you should be able to list it:

 

hadoop fs -ls s3a://landsat-pds/

 

Also make sure you've deployed your client configs in Cloudera Manager (CM).

Explorer
Posts: 6
Registered: ‎03-09-2017

Re: Unable to move data to a S3 bucket using last CDH (5.14.0)

Hi Aaron! Thanks for answering.

 

At the end it wasn't a problem with Hadoop or the configuration (credentials were correct and config files deploy in all nodes). It was just that IT was blocking all the traffic to the private bucket. Even after asking them to allow those IPs it didn't work so I install CNLM in all nodes and specified the proxy using:

 

-Dfs.s3a.proxy.host="localhost" -Dfs.s3a.proxy.port="3128"

 

After that I was able to move 3 TB in less than a day.

Announcements