New Contributor
Posts: 1
Registered: ‎04-29-2016

distcp with s3 timed out - cdh 5.7

Hello everyone,

Is there any known issue with distcp and S3? We are trying to distcp some data from HDFS to S3 and we are getting the following error:

 

Error: org.apache.http.conn.ConnectTimeoutException: Connect to testabcd.s3.amazonaws.com:443 timed out
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:416)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:151)
at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:125)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:942)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2148)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2075)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1093)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:548)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at org.apache.hadoop.fs.s3native.$Proxy14.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:472)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1412)
at org.apache.hadoop.tools.mapred.CopyMapper.setup(CopyMapper.java:114)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

 

It's weird that we do get a "file already exists" error if the folder exists, which suggests a connection can be made, but when it comes to the actual copy, it fails. Not only distcp, we also tried to dump data on S3 through distcp, same error.

 

Is there any particular setting we need to enable for this? We tried with HDP in the same AWS VPC and it worked, so maybe we are missing some config here.

 

Any help would be highly appreciated.

 

Thanks.

Expert Contributor
Posts: 61
Registered: ‎02-03-2016

Re: distcp with s3 timed out - cdh 5.7

I am using Spark 1.6.0 in CDH 5.7.0, and I am having all kinds of issues using the AWS libraries that come with it. I would like to know the answer you get too.

 

Cheers,

Ben

Posts: 1,886
Kudos: 425
Solutions: 300
Registered: ‎07-31-2013

Re: distcp with s3 timed out - cdh 5.7

Please try using S3A (s3a://) instead of S3/S3N (s3:// or s3n://) going forward in CDH. It is powered by Amazon's own S3 Java SDK, supports more of S3's current features than the older implementations, and is designed to replace them.

Usage is straightforward, with similar properties. Documentation is at http://www.cloudera.com/documentation/enterprise/latest/topics/spark_s3.html
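
As a rough illustration of the switch, here is a minimal Python sketch that rewrites an s3:// or s3n:// URI to s3a:// and assembles the matching `hadoop distcp` command line. The bucket name `testabcd` comes from the stack trace above. The source and destination paths and the helper names `to_s3a` and `build_distcp_command` are made up for the example. The sketch only prints the command and does not run it. Credentials would typically be supplied through the S3A properties `fs.s3a.access.key` and `fs.s3a.secret.key` in the cluster configuration rather than on the command line.

```python
import shlex


def to_s3a(uri):
    """Rewrite s3:// or s3n:// URIs to s3a://; leave any other URI untouched."""
    for old_prefix in ("s3n://", "s3://"):
        if uri.startswith(old_prefix):
            return "s3a://" + uri[len(old_prefix):]
    return uri


def build_distcp_command(src, dst, props=None):
    """Return the argv list for `hadoop distcp`, with S3 URIs switched to s3a://.

    props is an optional dict of Hadoop properties passed as -Dkey=value.
    """
    cmd = ["hadoop", "distcp"]
    for key, value in sorted((props or {}).items()):
        cmd.append("-D{}={}".format(key, value))
    cmd.extend([to_s3a(src), to_s3a(dst)])
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; only the bucket name appears in the original post.
    argv = build_distcp_command("hdfs:///user/example/data", "s3n://testabcd/backup")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The printed command can be reviewed and then run from an edge node as usual.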