
Distcp from HDFS to AWS S3 using S3A taking longer for multiple small files.


New Contributor

Hi, we are running some tests copying HDFS data to AWS S3 using S3A. Copying 136 small files totalling 7 GB takes about 3 hours, and we are seeing multiple connection timeouts in all of our mappers.

 

2016-05-20 10:40:43,148 INFO [s3a-transfer-shared--pool1-t3] com.cloudera.com.amazonaws.http.AmazonHttpClient: Unable to execute HTTP request: Connection timed out
java.net.SocketException: Connection timed out

Here is the command I'm using. Note that I set fs.s3a.connection.timeout, but the job still seems to be hitting timeouts.

 

hadoop distcp -D mapred.task.timeout=1800000 -Dfs.s3a.awsAccessKeyId=xxxx -Dfs.s3a.awsSecretAccessKey=xxx -Dfs.s3a.connection.timeout=1800000 -log /grp/cai_dba/dev/core/pawsdistcplogs hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/ s3a://ah-distcp-poc-task/weblogs

 

Here are other tests that worked fine:

 

1) I was able to copy a single 4 GB file successfully within 4 minutes using the same distcp/S3A method from the same Hadoop cluster.

2) I copied all 136 files (7 GB total) from the test case above to the local filesystem of a host on the same network as our Hadoop cluster, then copied them directly to S3 using the AWS client copy command. The entire 7 GB copied successfully in about 8 minutes.

 

So it seems to me that there is some bottleneck in how distcp/S3A handles multiple files.

Has anyone experienced this issue, or does anyone have ideas for tuning the S3A parameters?

 

 

 


Re: Distcp from HDFS to AWS S3 using S3A taking longer for multiple small files.

New Contributor

Try enabling S3A fast upload. You can pass it on the distcp command line:

 

-D fs.s3a.fast.upload=true

 

or set it as a property in the Hadoop configuration (for example, core-site.xml):

 

<property>
  <name>fs.s3a.fast.upload</name>
  <value>true</value>
</property>
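For reference, here is a rough sketch of how the command from the question might look with the option added. It assumes the cluster's Hadoop/CDH version actually supports fs.s3a.fast.upload, and it is untested against the original environment. The variable names (SRC, DST, LOG_DIR, DISTCP_OPTS, CMD, RUN) are only for illustration; the paths, bucket, and placeholder credentials are copied from the question. By default the script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Sketch only: the distcp command from the question plus fs.s3a.fast.upload.
# Assumes the installed Hadoop version recognises fs.s3a.fast.upload.
SRC="hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/"
DST="s3a://ah-distcp-poc-task/weblogs"
LOG_DIR="/grp/cai_dba/dev/core/pawsdistcplogs"

# Same -D options as in the question (credentials are placeholders),
# with the fast upload setting appended.
DISTCP_OPTS="-Dmapred.task.timeout=1800000 \
-Dfs.s3a.awsAccessKeyId=xxxx \
-Dfs.s3a.awsSecretAccessKey=xxx \
-Dfs.s3a.connection.timeout=1800000 \
-Dfs.s3a.fast.upload=true"

CMD="hadoop distcp $DISTCP_OPTS -log $LOG_DIR $SRC $DST"

# Dry run by default: print the command for review.
echo "$CMD"

# Set RUN=1 in the environment to actually launch the copy.
if [ "${RUN:-0}" = "1" ]; then
  # DISTCP_OPTS is intentionally unquoted so each -D option is a separate argument.
  hadoop distcp $DISTCP_OPTS -log "$LOG_DIR" "$SRC" "$DST"
fi
```

Running it with RUN=1 launches the copy; comparing the elapsed time against the 3-hour baseline from the question should show whether the setting helps in this case.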
