09-18-2017 03:08 PM - edited 09-18-2017 04:00 PM
Hello community. I was hoping someone might know of a way to point distcp at GovCloud.
Problem background. I am having problems doing a distributed file transfer on GovCloud. Using a distcp process that I know works on my 50-node cluster on standard AWS, I've been attempting to do the same on GovCloud, but it doesn't work. I have verified that my keys are current and have the appropriate permissions. I am able to access the S3 files via "aws s3 cp source dest" on my GovCloud systems. The failure occurs when I use distcp on those same files: it attempts to pull information on the GovCloud bucket from standard AWS. See screenshot below. In the circled part, you'll see the reference to standard AWS in the error. Underlined are the descriptions of the errors. For reference, I have been successful using the natively installed s3-dist-cp on my EMR cluster in GovCloud. I'm able to access any file I need and transfer it to HDFS.
Possible solutions. Can anyone tell me of a way to give distcp the GovCloud endpoint? Alternatively, does anyone know of a method to install s3-dist-cp on a CDH cluster?
09-18-2017 04:11 PM
[ SOLVED ]
After posting this request for help, I got to thinking about rewording my Google search, which led me to a Hortonworks page about copying data between a Hortonworks cluster and S3 buckets. It didn't have an exact answer, but it did show the property for specifying an endpoint. That was exactly what I needed to add to my CLI command.
Here is the complete command I used to get distcp working from S3 GovCloud to HDFS. The opposite direction, HDFS to S3 GovCloud, also works.
AWS_BUCKET=xxxxxxxx   # name of your GovCloud bucket
hadoop distcp -D fs.s3a.bucket.$AWS_BUCKET.endpoint=s3-us-gov-west-1.amazonaws.com -D fs.s3a.awsAccessKeyId=$AWS_KEY_ID -D fs.s3a.awsSecretAccessKey=$AWS_SECRET s3a://$AWS_BUCKET/path/to/files/ /path/to/hdfs/files/
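For anyone who wants to review the command before running it, here is a minimal sketch that only assembles and prints the distcp command line. The function name build_distcp_cmd and the S3_ENDPOINT override variable are hypothetical names made up for this example. The property names and the default endpoint are copied from the command above. The credential variables are printed unexpanded so the keys do not show up in terminal output.

```shell
# Sketch only: print the distcp command for review instead of running it.
# build_distcp_cmd and S3_ENDPOINT are hypothetical names for this example.
build_distcp_cmd() {
  bucket="$1"      # GovCloud bucket name
  src_path="$2"    # path inside the bucket
  dest_path="$3"   # destination path on HDFS
  # Default endpoint taken from the command above; override via S3_ENDPOINT.
  endpoint="${S3_ENDPOINT:-s3-us-gov-west-1.amazonaws.com}"

  # Single-quoted pieces keep $AWS_KEY_ID and $AWS_SECRET literal in the output.
  echo "hadoop distcp" \
    "-D fs.s3a.bucket.${bucket}.endpoint=${endpoint}" \
    '-D fs.s3a.awsAccessKeyId=$AWS_KEY_ID' \
    '-D fs.s3a.awsSecretAccessKey=$AWS_SECRET' \
    "s3a://${bucket}/${src_path}" \
    "${dest_path}"
}

build_distcp_cmd "xxxxxxxx" "path/to/files/" "/path/to/hdfs/files/"
```

The printed line can be pasted into a shell where AWS_KEY_ID and AWS_SECRET are already set. For the HDFS to S3 direction, the source and destination arguments of the real command would be swapped by hand; this sketch does not handle that case.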
Links I used for reference:
AWS GovCloud Endpoints
Horton Amazon S3 Bucket Configuration