Hi, we are testing copying HDFS data to AWS S3 using S3A. Copying 136 small files totalling 7 GB takes about 3 hours, and we are seeing multiple connection timeouts in all of our mapper tasks.
2016-05-20 10:40:43,148 INFO [s3a-transfer-shared--pool1-t3] com.cloudera.com.amazonaws.http.AmazonHttpClient: Unable to execute HTTP request: Connection timed out java.net.SocketException: Connection timed out
Here is the command I'm using. Please note that I set fs.s3a.connection.timeout here, but the job still seems to be hitting timeouts.
hadoop distcp -D mapred.task.timeout=1800000 -Dfs.s3a.awsAccessKeyId=xxxx -Dfs.s3a.awsSecretAccessKey=xxx -Dfs.s3a.connection.timeout=1800000 -log /grp/cai_dba/dev/core/pawsdistcplogs hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/ s3a://ah-distcp-poc-task/weblogs
Here are other tests that worked fine:
1) I was able to copy a single 4 GB file successfully within 4 minutes using the same distcp/s3a method from the same Hadoop cluster.
2) I copied all 136 files (7 GB total) from the test case above onto the local filesystem of one of the hosts in the same network as our Hadoop cluster and performed a direct copy to S3 using the AWS client copy command. The entire 7 GB copied over successfully in about 8 minutes.
So it seems to me that there is some bottleneck in how distcp/s3a handles multiple files.
Has anyone experienced this issue, or does anyone have ideas about tuning the s3a parameters?
try this for this solution
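One thing that may be worth trying, as a sketch rather than a confirmed fix: rerun the same distcp with a larger S3A connection pool, more SDK retries, and an explicit cap on the number of mappers. The source and destination paths are the ones from the question. The property names are standard S3A and DistCp options, but every numeric value below is an assumption picked for illustration, not a measured optimum, and the credential options are left out on purpose. The script only prints the command unless RUN=1 is set (RUN is a hypothetical switch added for this sketch).

```shell
#!/bin/sh
# Sketch only: all numeric values are illustrative assumptions to tune from.
set -eu

SRC="hdfs://nameservice1/grp/cai_dba/dev/core/pawsdistcptests/"
DST="s3a://ah-distcp-poc-task/weblogs"

# Build the command as positional parameters so it can be printed or executed.
#   fs.s3a.connection.maximum            size of the HTTP connection pool to S3
#   fs.s3a.threads.max                   threads used for uploads
#   fs.s3a.attempts.maximum              retries on transient errors
#   fs.s3a.connection.establish.timeout  connect timeout, in milliseconds
#   fs.s3a.connection.timeout            socket timeout, in milliseconds
#   -m                                   maximum number of simultaneous map tasks
set -- hadoop distcp \
  -Dfs.s3a.connection.maximum=100 \
  -Dfs.s3a.threads.max=20 \
  -Dfs.s3a.attempts.maximum=10 \
  -Dfs.s3a.connection.establish.timeout=5000 \
  -Dfs.s3a.connection.timeout=200000 \
  -m 20 \
  "$SRC" "$DST"

# Dry run by default: print the command. Set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
else
  echo "$@"
fi
```

Changing one setting at a time and comparing the run time against the 8-minute direct copy should show which of these, if any, is the limiting factor.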