New Contributor
Posts: 1
Registered: ‎06-15-2015

Distcp from HDFS to AWS S3 using S3A taking longer for multiple small files.

Hi, we are testing copies of HDFS data to AWS S3 over S3A. Copying 136 small files totalling 7 GB takes about 3 hours, and we are seeing multiple connection timeouts in all of our mappers.

 

2016-05-20 10:40:43,148 INFO [s3a-transfer-shared--pool1-t3] com.cloudera.com.amazonaws.http.AmazonHttpClient: Unable to execute HTTP request: Connection timed out
java.net.SocketException: Connection timed out

Here is the command I'm using. Note that I set fs.s3a.connection.timeout, but we still seem to be hitting timeouts.

 

hadoop distcp -D mapred.task.timeout=1800000 -Dfs.s3a.awsAccessKeyId=xxxx -Dfs.s3a.awsSecretAccessKey=xxx -Dfs.s3a.connection.timeout=1800000 -log /grp/cai_dba/dev/core/pawsdistcplogs hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/ s3a://ah-distcp-poc-task/weblogs

 

Here are other tests that worked fine:

 

1) I was able to copy a single 4 GB file successfully within 4 minutes using the same distcp/S3A method from the same Hadoop cluster.

2) I copied all 136 files (7 GB total) from the test case above onto the local filesystem of one of the hosts in the same network as our Hadoop cluster and performed a direct copy to S3 using the AWS client copy command. The entire 7 GB copied over successfully in about 8 minutes.

 

So it seems to me that there is some bottleneck in how distcp/S3A handles multiple files.

Has anyone experienced this issue, and does anyone have ideas for tuning the S3A parameters?

 

 

 

New Contributor
Posts: 1
Registered: ‎07-08-2017

Re: Distcp from HDFS to AWS S3 using S3A taking longer for multiple small files.

Try enabling S3A fast upload:

 

-D fs.s3a.fast.upload=true

 

OR

 

<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
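
For reference, here is a rough sketch of how the flag could be combined with the command from the original post. It only builds and prints the command (a dry run), so nothing is copied. The paths, bucket, placeholder credentials and timeout values are taken from the question as posted; the SRC, DST, LOG_DIR and DISTCP_CMD variable names are just for illustration. Whether fs.s3a.fast.upload is available, and how much it helps, depends on your Hadoop/CDH version, so treat this as something to test rather than a guaranteed fix.

```shell
#!/bin/sh
# Dry run: build the distcp command as a string and print it for review.
# Paths, bucket and placeholder credentials come from the original post.
SRC="hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/"
DST="s3a://ah-distcp-poc-task/weblogs"
LOG_DIR="/grp/cai_dba/dev/core/pawsdistcplogs"

# Same options as the original command, plus -Dfs.s3a.fast.upload=true.
DISTCP_CMD="hadoop distcp \
-D mapred.task.timeout=1800000 \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.awsAccessKeyId=xxxx \
-Dfs.s3a.awsSecretAccessKey=xxx \
-Dfs.s3a.connection.timeout=1800000 \
-log $LOG_DIR \
$SRC $DST"

# Print the assembled command instead of running it.
echo "$DISTCP_CMD"
```

Once the printed command looks right, paste it into a shell on a cluster node to run it, and compare the elapsed time against the 3-hour baseline.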
