
Unable to move data to an S3 bucket using the latest CDH (5.14.0)

Explorer

Hi!

I am trying to move data from HDFS to an S3 bucket. I am using the latest version of CM/CDH (5.14.0). I have been able to copy data using the aws CLI:

aws s3api put-object

and also with the Python SDK, but I cannot copy data with hadoop distcp. I have added the following extra properties to core-site.xml in the HDFS service:

<property>
    <name>fs.s3a.access.key</name>
    <value>X</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>X</value>
</property>
<property>
    <name>fs.s3a.endpoint</name>
    <value>s3.us-east-2.amazonaws.com</value>
</property>

Nothing happens when I execute a command like

hadoop distcp /blablabla s3a://bucket-name/

but it hangs for a while (I guess it keeps retrying). The same thing happens when I try to just list the files in the bucket with

hadoop fs -ls s3a://bucket-name

I am sure it is not a credentials problem, since I can connect with the same access key and secret key using the Python SDK and the aws CLI.
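
To rule out basic network connectivity from a cluster node to the endpoint (the kind of issue that credential tests run from another machine would not catch), a quick reachability check, assuming curl is available on the node:

# run on the node where the hadoop command is executed
curl -sI https://s3.us-east-2.amazonaws.com/ | head -n 1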

 

Anyone facing a similar issue? Thanks!


2 REPLIES

Cloudera Employee

Distcp can take some time to complete depending on your source data.

 

One thing to try would be to list a public bucket. I believe that if you have no credentials set you'll see an error, but with any valid credentials you should be able to list it:

 

hadoop fs -ls s3a://landsat-pds/

 

Also make sure you've deployed your client configs in Cloudera Manager (CM).
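
A quick way to confirm that the deployed client configuration actually contains the S3A properties on the node you are running from (the path below is the usual CDH client config location; adjust if yours differs):

grep -A1 'fs\.s3a' /etc/hadoop/conf/core-site.xml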

ACCEPTED SOLUTION
Explorer

Hi Aaron! Thanks for answering.

In the end it wasn't a problem with Hadoop or the configuration (the credentials were correct and the config files were deployed on all nodes). It was just that IT was blocking all traffic to the private bucket. Even after asking them to allow those IPs it still didn't work, so I installed Cntlm on all nodes and pointed the S3A client at the proxy with:

 

-Dfs.s3a.proxy.host="localhost" -Dfs.s3a.proxy.port="3128"

 

After that I was able to move 3 TB in less than a day.
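
For anyone hitting the same thing, the full command ended up looking roughly like this (the source path and bucket name are placeholders):

hadoop distcp \
  -Dfs.s3a.proxy.host=localhost \
  -Dfs.s3a.proxy.port=3128 \
  /path/in/hdfs s3a://bucket-name/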