
How to install the hadoop-aws module to copy from on-premises HDFS to AWS S3

Contributor

How do I install the hadoop-aws module to copy from on-premises HDFS to AWS S3? Do I need the s3DistCp command?

1 ACCEPTED SOLUTION


DistCp recognizes the s3[a] protocols via the default libraries already available in Hadoop, so there is nothing extra to install.

For example, to move data from HDFS to S3:

hadoop distcp <current_cluster_folder> s3[a]://<bucket_info>

If you're looking for a secure way to manage access to S3 buckets in Hadoop (via AWS keys), this article describes how to do that with the Hadoop Credential API:

https://community.hortonworks.com/articles/59161/using-hadoop-credential-api-to-store-aws-secrets.ht...
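
For reference, here is a minimal sketch of the end-to-end flow that article describes, combining a JCEKS credential store with DistCp. The /aws/aws.jceks path, /data/source path, and my-bucket name are placeholders, not values from the article:

# store the AWS keys in a credential store kept on HDFS (each command prompts for the value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/aws/aws.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/aws/aws.jceks

# point DistCp at that credential store when copying to the bucket
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks /data/source s3a://my-bucket/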


8 REPLIES


Contributor

When I run:

  1. hadoop credential create fs.s3a.access.key -provider localjceks://file/path/to/aws.jceks
  2. <enter AccessKey value at prompt>
  3. hadoop credential create fs.s3a.secret.key -provider localjceks://file/path/to/aws.jceks
  4. <enter SecretKey value at prompt>

It prompts me for a password:

[root@test232 conf]# hadoop credential create fs.s3a.access.key -provider localjceks://file/var/tmp/aws.jceks

Enter password:

Enter password again:
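
A note on that prompt (my reading of the credential shell, not something stated in the article): "Enter password" is where the value being stored goes, so the AWS access key or secret key is what you paste there. If you prefer not to be prompted, the value can also be passed inline with the -value option, for example:

# -value leaves the secret in your shell history, so the interactive prompt is usually preferable
hadoop credential create fs.s3a.access.key -value <YOUR_ACCESS_KEY> -provider localjceks://file/var/tmp/aws.jceks
hadoop credential create fs.s3a.secret.key -value <YOUR_SECRET_KEY> -provider localjceks://file/var/tmp/aws.jceks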

Contributor

When I enter the access key and the secret key at those password prompts, I get this:

[hdfs@test232 ~]$ hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks -ls s3a://s3-us-west-2.amazonaws.com/kartik-test
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Caught an AmazonServiceException, which means your request made it to Amazon S3, but was rejected with an error response for some reason.
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: C3EFA25EC200D255, AWS Error Code: null, AWS Error Message: Forbidden
17/01/14 07:51:00 INFO s3a.S3AFileSystem: HTTP Status Code: 403
17/01/14 07:51:00 INFO s3a.S3AFileSystem: AWS Error Code: null
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Error Type: Client
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Request ID: C3EFA25EC200D255
17/01/14 07:51:00 INFO s3a.S3AFileSystem: Class Name: com.amazonaws.services.s3.model.AmazonS3Exception
-ls: Fatal internal error
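
Two things stand out here (my observations, not from the replies above): the 403 Forbidden means S3 rejected the credentials themselves, and the URI puts the regional endpoint hostname where s3a expects the bucket name. A quick sanity check, as a sketch, is to pass the keys directly and list the bucket by name alone, which separates a wrong-key problem from a credential-store problem:

# fs.s3a.access.key / fs.s3a.secret.key are the standard s3a properties; avoid leaving real keys in shell history
hdfs dfs -Dfs.s3a.access.key=<ACCESS_KEY> -Dfs.s3a.secret.key=<SECRET_KEY> -ls s3a://kartik-test/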

Contributor

[hdfs@test232 ~]$ curl http://kartik-test.s3-us-west-2.amazonaws.com
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>kartik-test</Name><Prefix></Prefix><Marker></Marker><MaxKeys>1000</MaxKeys><IsTruncated>false</IsTruncated>
  <Contents><Key>hosts</Key><LastModified>2017-01-12T19:48:14.000Z</LastModified><ETag>"881dc3861c3c8a28e213790785a940b7"</ETag><Size>44</Size><StorageClass>STANDARD</StorageClass></Contents>
  <Contents><Key>logs/</Key><LastModified>2017-01-14T17:01:56.000Z</LastModified><ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag><Size>0</Size><StorageClass>STANDARD</StorageClass></Contents>
</ListBucketResult>

Contributor

I tried:

hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws.jceks /nsswitch.conf s3a//kartik-test.s3-us-west-2.amazonaws.com

and it created an s3a folder in my HDFS home directory:

[hdfs@test232 ~]$ hdfs dfs -ls

Found 3 items

drwx------ - hdfs hdfs 0 2017-01-14 07:47 .Trash

drwx------ - hdfs hdfs 0 2017-01-14 12:07 .staging

drwx------ - hdfs hdfs 0 2017-01-14 12:07 s3a

[hdfs@test232 ~]$

Contributor

Getting there... I missed a colon in my previous attempt:

[hdfs@test232 ~]$ hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws.jceks /nsswitch.conf s3a://kartik-test.s3-us-west-2.amazonaws.com

17/01/14 15:12:31 INFO s3a.S3AFileSystem: Caught an AmazonServiceException, which means your request made it to Amazon S3, but was rejected with an error response for some reason.

17/01/14 15:12:31 INFO s3a.S3AFileSystem: Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 3094C5772AA3B4C0, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.

17/01/14 15:12:31 INFO s3a.S3AFileSystem: HTTP Status Code: 403
17/01/14 15:12:31 INFO s3a.S3AFileSystem: AWS Error Code: SignatureDoesNotMatch

17/01/14 15:12:31 INFO s3a.S3AFileSystem: Error Type: Client
17/01/14 15:12:31 INFO s3a.S3AFileSystem: Request ID: 3094C5772AA3B4C0
17/01/14 15:12:31 INFO s3a.S3AFileSystem: Class Name: com.amazonaws.services.s3.model.AmazonS3Exception
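
SignatureDoesNotMatch usually means the stored secret key doesn't match the access key, for example a truncated or mangled paste into the prompt. One way to fix it, sketched here with /aws/aws.jceks as a placeholder provider path, is to delete and re-create the two entries with the regenerated keys:

hadoop credential delete fs.s3a.access.key -provider jceks://hdfs/aws/aws.jceks
hadoop credential delete fs.s3a.secret.key -provider jceks://hdfs/aws/aws.jceks
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/aws/aws.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/aws/aws.jceks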

Contributor

I regenerated the keys and updated the aws.jceks entries:

[hdfs@test232 ~]$ hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks /nsswitch.conf s3a://kartik-test.s3-us-west-2.amazonaws.com

17/01/14 20:14:59 ERROR tools.DistCp: Invalid arguments: java.io.IOException: Bucket kartik-test.s3-us-west-2.amazonaws.com does not exist

But I am able to browse the bucket over HTTP.
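
That error is the clue (my explanation, not from the thread itself): s3a treats everything between s3a:// and the first / as the bucket name, so it is literally looking for a bucket called kartik-test.s3-us-west-2.amazonaws.com. The fix, as the next reply shows, is to use the bucket name on its own; if the region needs to be stated explicitly, fs.s3a.endpoint can be set separately, roughly like this:

hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks -Dfs.s3a.endpoint=s3-us-west-2.amazonaws.com /nsswitch.conf s3a://kartik-test/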

Contributor

This worked:

[hdfs@test232 ~]$ hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks /test s3a://kartik-test/

Thanks for all your help!!
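
For larger on-premises-to-S3 copies, the same working command extends with DistCp's standard options; a sketch, with /data/warehouse as a placeholder source and the map count something to tune for your cluster and link:

hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/aws/aws.jceks -update -m 20 /data/warehouse s3a://kartik-test/warehouse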