Archives of Support Questions (Read Only)

This is an archived, read-only board kept for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

How to install the hadoop-aws module to copy from on-premises HDFS to AWS S3

New Member

How do I install the hadoop-aws module to copy from on-premises HDFS to AWS S3? I need the s3DistCp command.

1 ACCEPTED SOLUTION


DistCp recognizes the s3 and s3a protocols via the default libraries already available in Hadoop, so there is nothing extra to install.

For example, to move data from HDFS to S3:

hadoop distcp <current_cluster_folder> s3a://<bucket_info>
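A concrete invocation might look like this (the path and bucket name are made up purely for illustration):

hadoop distcp hdfs:///data/logs s3a://my-bucket/logs/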

If you're looking for a secure way to manage access to S3 buckets in Hadoop via AWS keys, this article describes one:

https://community.hortonworks.com/articles/59161/using-hadoop-credential-api-to-store-aws-secrets.ht...
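Roughly, the pattern from that article looks like this (the jceks path here is illustrative; later replies in this thread use the same pattern):

hadoop credential create fs.s3a.access.key -provider jceks://hdfs/aws/aws.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/aws/aws.jceks
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks <current_cluster_folder> s3a://<bucket_info>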


8 REPLIES


New Member

When I run:

  1. hadoop credential create fs.s3a.access.key -provider localjceks://file/path/to/aws.jceks
  2. <enter AccessKey value at prompt>
  3. hadoop credential create fs.s3a.secret.key -provider localjceks://file/path/to/aws.jceks
  4. <enter SecretKey value at prompt>

It prompts me for a password:

[root@test232 conf]# hadoop credential create fs.s3a.access.key -provider localjceks://file/var/tmp/aws.jceks

Enter password:

Enter password again:
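Presumably that prompt is asking for the value to store under the alias rather than for a separate keystore password, so the AWS key itself is what gets entered there, e.g.:

hadoop credential create fs.s3a.access.key -provider localjceks://file/var/tmp/aws.jceks
# at "Enter password", paste the AWS access key ID; it is stored as the alias's value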

New Member

When I enter the access key and the secret at the password prompt, I get this:

[hdfs@test232 ~]$ hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks -ls s3a://s3-us-west-2.amazonaws.com/kartik-test
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Caught an AmazonServiceException, which means your request made it to Amazon S3, but was rejected with an error response for some reason.
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: C3EFA25EC200D255, AWS Error Code: null, AWS Error Message: Forbidden
17/01/14 07:51:00 INFO s3a.S3AFileSystem: HTTP Status Code: 403
17/01/14 07:51:00 INFO s3a.S3AFileSystem: AWS Error Code: null
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Error Type: Client
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Request ID: C3EFA25EC200D255
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Class Name: com.amazonaws.services.s3.model.AmazonS3Exception
-ls: Fatal internal error
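When a 403 persists, it can help to confirm that both aliases were actually stored under the provider being passed in, e.g.:

hadoop credential list -provider jceks://hdfs/aws/aws.jceks
# should list fs.s3a.access.key and fs.s3a.secret.key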

New Member

[hdfs@test232 ~]$ curl http://kartik-test.s3-us-west-2.amazonaws.com
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>kartik-test</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsTruncated>false</IsTruncated>
<Contents><Key>hosts</Key><LastModified>2017-01-12T19:48:14.000Z</LastModified><ETag>"881dc3861c3c8a28e213790785a940b7"</ETag><Size>44</Size><StorageClass>STANDARD</StorageClass></Contents>
<Contents><Key>logs/</Key><LastModified>2017-01-14T17:01:56.000Z</LastModified><ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag><Size>0</Size><StorageClass>STANDARD</StorageClass></Contents>
</ListBucketResult>

New Member

I tried:

hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws.jceks /nsswitch.conf s3a//kartik-test.s3-us-west-2.amazonaws.com

and it created an s3a folder in my HDFS:

[hdfs@test232 ~]$ hdfs dfs -ls

Found 3 items

drwx------ - hdfs hdfs 0 2017-01-14 07:47 .Trash

drwx------ - hdfs hdfs 0 2017-01-14 12:07 .staging

drwx------ - hdfs hdfs 0 2017-01-14 12:07 s3a

[hdfs@test232 ~]$
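Presumably, without a colon the destination s3a//kartik-test.s3-us-west-2.amazonaws.com was parsed as a relative path instead of an S3 URI, so distcp copied into a plain HDFS directory named s3a.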

New Member

Getting there... I missed a colon in my previous attempt:

[hdfs@test232 ~]$ hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws.jceks /nsswitch.conf s3a://kartik-test.s3-us-west-2.amazonaws.com

17/01/14 15:12:31 INFO s3a.S3AFileSystem: Caught an AmazonServiceException, which means your request made it to Amazon S3, but was rejected with an error response for some reason.

17/01/14 15:12:31 INFO s3a.S3AFileSystem: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 3094C5772AA3B4C0, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.

17/01/14 15:12:31 INFO s3a.S3AFileSystem: HTTP Status Code: 403
17/01/14 15:12:31 INFO s3a.S3AFileSystem: AWS Error Code: SignatureDoesNotMatch

17/01/14 15:12:31 INFO s3a.S3AFileSystem: Error Type: Client
17/01/14 15:12:31 INFO s3a.S3AFileSystem: Request ID: 3094C5772AA3B4C0
17/01/14 15:12:31 INFO s3a.S3AFileSystem: Class Name: com.amazonaws.services.s3.model.AmazonS3Exception
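SignatureDoesNotMatch generally indicates the stored secret key does not match the access key, for example because the secret was mistyped at the hidden prompt; regenerating the key pair and re-storing it in the jceks file is a common fix.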

New Member

I regenerated the keys and updated the aws.jceks entry:

[hdfs@test232 ~]$ hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks /nsswitch.conf s3a://kartik-test.s3-us-west-2.amazonaws.com
17/01/14 20:14:59 ERROR tools.DistCp: Invalid arguments: java.io.IOException: Bucket kartik-test.s3-us-west-2.amazonaws.com does not exist

But I am able to browse the bucket over HTTP.
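Presumably the problem is that in an s3a:// URI the authority is just the bucket name, so the endpoint hostname is being treated as a (nonexistent) bucket. Something along these lines should work, with the regional endpoint supplied via fs.s3a.endpoint if needed (shown here as an illustration):

hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks -Dfs.s3a.endpoint=s3-us-west-2.amazonaws.com /nsswitch.conf s3a://kartik-test/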

New Member

This worked:

[hdfs@test232 ~]$ hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks /test s3a://kartik-test/

Thanks for all your help!!
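For completeness, listing the bucket with the same credential provider should now show the copied data, e.g.:

hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks -ls s3a://kartik-test/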