[SOLVED] Distcp Cannot find GovCloud Endpoint

New Contributor

Hello community. I was hoping someone might know of a way to point distcp at GovCloud.

Problem background: I am having trouble doing a distributed file transfer on GovCloud. Using a distcp process that I know works on my 50-node cluster on standard AWS, I've been attempting to do the same on GovCloud, but it fails. I have verified that my keys are current and have the appropriate permissions. I am able to access the S3 files via "aws s3 cp source dest" on my GovCloud systems. The failure occurs when I use distcp on those same files: it attempts to pull information on the GovCloud bucket from standard AWS. See the screenshot below. The circled part shows the reference to standard AWS in the error, and the error descriptions are underlined. For reference, I have been successful using the natively installed s3-dist-cp on my EMR cluster in GovCloud, where I'm able to access any file I need and transfer it to HDFS.

Possible solutions: Can anyone tell me a way to give distcp the GovCloud endpoint? Alternatively, does anyone know of a method to install s3-dist-cp on a CDH cluster?

 

[Screenshot: distcp_S3GovCloudFailure.JPG]

1 ACCEPTED SOLUTION

Re: [SOLVED] Distcp Cannot find GovCloud Endpoint

New Contributor

[ SOLVED ]

 

After posting this request for help, I reworded my Google search, which led me to a Hortonworks page on copying data between a Hortonworks cluster and S3 buckets. It didn't have an exact answer, but it did show the property for specifying an endpoint, which was exactly what I needed to add to my CLI command.

 

Here is the complete command I used to get distcp working from S3 on GovCloud to HDFS. The opposite direction, HDFS to S3 on GovCloud, also works.

 

AWS_SECRET=xxxxxxxxx
AWS_KEY_ID=xxxxxxxxx
AWS_BUCKET=xxxxxxxx     # name of your GovCloud bucket

hadoop distcp \
    -D fs.s3a.bucket.${AWS_BUCKET}.endpoint=s3-us-gov-west-1.amazonaws.com \
    -D fs.s3a.awsAccessKeyId=$AWS_KEY_ID \
    -D fs.s3a.awsSecretAccessKey=$AWS_SECRET \
    s3a://$AWS_BUCKET/path/to/files/ \
    /path/to/hdfs/files/
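Since the same options apply in both directions, they can be wrapped in a small shell function. This is only a rough sketch and not something I ran as-is: the name govcloud_distcp and the DRY_RUN switch are made up for illustration, and it assumes AWS_KEY_ID and AWS_SECRET are already set in the shell. The -D properties and the endpoint are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: govcloud_distcp is a hypothetical wrapper, not a Hadoop tool.
# It reuses the -D options from the distcp command in this post and takes the
# source and destination as arguments, so one function covers both directions.

GOV_ENDPOINT="s3-us-gov-west-1.amazonaws.com"   # endpoint from the command in this post

govcloud_distcp() {
    # Arguments: bucket name, source path, destination path.
    # (Plain variables are used for portability; they are global to the shell.)
    bucket="$1"; src="$2"; dest="$3"

    # Rebuild the positional parameters as the full distcp command line.
    set -- hadoop distcp \
        -D "fs.s3a.bucket.${bucket}.endpoint=${GOV_ENDPOINT}" \
        -D "fs.s3a.awsAccessKeyId=${AWS_KEY_ID}" \
        -D "fs.s3a.awsSecretAccessKey=${AWS_SECRET}" \
        "$src" "$dest"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Assumed convenience switch: print the command instead of running it.
        # Note that this prints the keys as well.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example calls (commented out so that sourcing this file runs nothing):
#   govcloud_distcp "$AWS_BUCKET" "s3a://$AWS_BUCKET/path/to/files/" /path/to/hdfs/files/
#   govcloud_distcp "$AWS_BUCKET" /path/to/hdfs/files/ "s3a://$AWS_BUCKET/path/to/files/"
```

Setting DRY_RUN=1 before calling the function is a quick way to check the generated command line before launching a real copy.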

 

 

Links I used for reference:

 

AWS GovCloud Endpoints

http://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html

 

Horton Amazon S3 Bucket Configuration

https://hortonworks.github.io/hdp-aws/s3-copy-data/index.html

 
