
Does Amazon S3 have any limitations on the number of parallel connections for data transfer to S3?

New Contributor

Hello Team,

I am trying to upload 1 TB+ of data from HDFS to Amazon S3 using distcp; however, I am getting ConnectionRefused errors for the initial set of mappers. Here's the stack trace for the exception.

 

15/04/20 00:12:30 INFO mapreduce.Job: Task Id : attempt_1427365508461_14803_m_000148_0, Status : FAILED
Error: com.cloudera.org.apache.http.conn.HttpHostConnectException: Connection to https://BUCKET_NAME.s3.amazonaws.com:443 refused
at com.cloudera.org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
at com.cloudera.org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:151)
at com.cloudera.org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:125)
at com.cloudera.org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at com.cloudera.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at com.cloudera.org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at com.cloudera.org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)

 

If I reduce the number of mappers to about 20 or fewer, I rarely see this error; however, reducing the mappers kills the data-transfer performance.
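For context, the transfer is a plain distcp invocation, roughly like the sketch below; the credentials and the destination path are placeholders, and the map count is what I vary with the -m option.

hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=ACCESS_KEY \
  -Dfs.s3n.awsSecretAccessKey=SECRET_KEY \
  -m 20 \
  hdfs://dc1-had03:8020/tmp/workingDirs/workdir-reports/JOB_NAME-0013119-150206122614146-oozie-oozi-W/2015-04-16/output \
  s3n://BUCKET_NAME/DIRECTORY_NAME/2015-04-16/output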

 

In addition to this, the running mappers keep hitting IOExceptions like the ones shown below; the requests are retried and eventually succeed, but they take a long time.

 

.hadoop.mapred.MapTask: Processing split: /user/insights/.staging/_distcp1556681122/fileList.seq:38708+283
2015-04-20 03:50:49,443 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying hdfs://dc1-had03:8020/tmp/workingDirs/workdir-reports/JOB_NAME-0013119-150206122614146-oozie-oozi-W/2015-04-16/output/part-m-00134.avro to s3n://BUCKET_NAME/DIRECTORY_NAME/2015-04-16/output/part-m-00134.avro
2015-04-20 03:50:50,072 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3n://BUCKET_NAME/DIRECTORY_NAME/.distcp.tmp.attempt_1427365508461_14803_m_000315_1
2015-04-20 03:50:50,737 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem: OutputStream for key 'DIRECTORY_NAME/.distcp.tmp.attempt_1427365508461_14803_m_000315_1' writing to tempfile '/sdb/yarn/nodemanager/local/s3/output-2432717870025242816.tmp'
2015-04-20 03:51:02,480 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem: OutputStream for key 'DIRECTORY_NAME/.distcp.tmp.attempt_1427365508461_14803_m_000315_1' closed. Now beginning upload
2015-04-20 03:55:31,274 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: I/O exception (java.net.SocketException) caught when processing request: Connection reset
2015-04-20 03:55:31,274 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: Retrying request
2015-04-20 03:55:31,694 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
2015-04-20 03:55:31,694 INFO [main] com.cloudera.org.apache.http.impl.client.DefaultHttpClient: Retrying request

 

Any help on this would be greatly appreciated!

 

Thanks in advance.

 

-Jagdish

1 REPLY

Re: Does Amazon S3 have any limitations on the number of parallel connections for data transfer to S3?

Master Guru
The error appears to come from the S3 service; do you know if S3 may throttle your connections if there are too many active in a given time frame? Given your description, that seems the most likely cause to me.
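
If that is what is happening, one workaround you could experiment with, purely as a sketch with illustrative numbers and placeholder paths, is to keep the map count moderate and additionally cap the per-map throughput with distcp's -bandwidth option (MB per second per map), so the aggregate request rate against the bucket stays lower:

hadoop distcp \
  -m 20 \
  -bandwidth 50 \
  hdfs://dc1-had03:8020/SOURCE_DIR \
  s3n://BUCKET_NAME/TARGET_DIR

The exact values would need tuning against whatever rate the bucket will actually sustain.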