
Does Amazon S3 have any limitations on the number of parallel connections for data transfer to S3?

Hello Team,

I am trying to upload 1 TB+ of data from HDFS to Amazon S3 using distcp; however, I am getting ConnectionRefused errors for the initial set of mappers. Here is the stack trace for the exception:

15/04/20 00:12:30 INFO mapreduce.Job: Task Id : attempt_1427365508461_14803_m_000148_0, Status : FAILED
Error: com.cloudera.org.apache.http.conn.HttpHostConnectException: Connection to https://BUCKET_NAME.s3.amazonaws.com:443 refused
at com.cloudera.org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
at com.cloudera.org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:151)
at com.cloudera.org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:125)
at com.cloudera.org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at com.cloudera.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at com.cloudera.org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at com.cloudera.org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
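
For context, the copy is launched roughly like this; the paths and the mapper count below are placeholders rather than my exact values, and the S3 credentials are assumed to be set via fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in core-site.xml:

# minimal sketch of the distcp invocation (placeholders, not the exact job)
hadoop distcp \
    -m 100 \
    hdfs://dc1-had03:8020/tmp/workingDirs/workdir-reports/JOB_NAME/2015-04-16/output \
    s3n://BUCKET_NAME/DIRECTORY_NAME/2015-04-16/output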


If I reduce the number of mappers to about 20 or fewer, I don't hit this very often; however, reducing the mappers kills the data-transfer performance.

In addition, I keep getting IOExceptions with the running mappers, as shown below. They are retried and eventually resolve, though this takes a long time:

org.apache.hadoop.mapred.MapTask: Processing split: /user/insights/.staging/_distcp1556681122/fileList.seq:38708+283
2015-04-20 03:50:49,443 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://dc1-had03:8020/tmp/workingDirs/workdir-reports/JOB_NAME-0013119-150206122614146-oozie-oozi-W/2015-04-16/output/part-m-00134.avro to s3n://BUCKET_NAME/DIRECTORY_NAME/2015-04-16/output/part-m-00134.avro
2015-04-20 03:50:50,072 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3n://BUCKET_NAME/DIRECTORY_NAME/.distcp.tmp.attempt_1427365508461_14803_m_000315_1
2015-04-20 03:50:50,737 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem: OutputStream for key 'DIRECTORY_NAME/.distcp.tmp.attempt_1427365508461_14803_m_000315_1' writing to tempfile '/sdb/yarn/nodemanager/local/s3/output-2432717870025242816.tmp'
2015-04-20 03:51:02,480 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem: OutputStream for key 'DIRECTORY_NAME/.distcp.tmp.attempt_1427365508461_14803_m_000315_1' closed. Now beginning upload
2015-04-20 03:55:31,274 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: I/O exception (java.net.SocketException) caught when processing request: Connection reset
2015-04-20 03:55:31,274 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: Retrying request
2015-04-20 03:55:31,694 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
2015-04-20 03:55:31,694 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: Retrying request


Any help on this would be greatly appreciated!

Thanks in advance.

-Jagdish


Re: Does Amazon S3 have any limitations on the number of parallel connections for data transfer to S3?

The error appears to be coming back from the S3 service itself. Do you know whether S3 throttles your connections when too many are active in a given time frame? Given your description, that seems the most likely cause to me.
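
One simple way to test that theory is to cap the parallelism and the per-map bandwidth and see whether the ConnectionRefused errors go away. Something along these lines, where the paths are placeholders and the values are just starting points to tune (-m and -bandwidth are standard DistCp options):

# rerun the copy with reduced concurrency and a per-map bandwidth cap
hadoop distcp \
    -m 10 \
    -bandwidth 50 \
    hdfs://dc1-had03:8020/SOURCE_DIR \
    s3n://BUCKET_NAME/TARGET_DIR

-m caps the number of simultaneous map tasks, and therefore the number of concurrent S3 connections, while -bandwidth limits each map to roughly that many MB per second. If the errors disappear at low concurrency and reappear as you raise -m, that points to throttling on the S3 side rather than a problem inside your cluster. The s3n connector is backed by JetS3t, so its retry behaviour can also be tuned through a jets3t.properties file on the classpath if lowering concurrency alone is not enough.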