Created on 09-30-2016 12:03 AM
When running a distcp process from HDFS to AWS S3, credentials are required to authenticate to the S3 bucket. Passing these in the S3A URI would leak the secret values into application logs. Storing the secrets in core-site.xml is also not ideal, because it means any user with HDFS CLI access can reach the S3 bucket to which the AWS credentials are tied.
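For example, the insecure pattern to avoid embeds the keys directly in the URI (the key and secret below are made up for illustration):

hadoop distcp /user/admin/file s3a://AKIAEXAMPLEKEY:exampleSecretKey@my-bucket/

Any log line or shell history entry that echoes this path exposes the secret.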
The Hadoop Credential API can be used to manage access to S3 in a more fine-grained way.
The first step is to create a local JCEKS file in which to store the AWS Access Key and AWS Secret Key values:
hadoop credential create fs.s3a.access.key -provider localjceks://file/path/to/aws.jceks
<enter Access Key value at prompt>
hadoop credential create fs.s3a.secret.key -provider localjceks://file/path/to/aws.jceks
<enter Secret Key value at prompt>
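To confirm that both aliases were stored, you can list the entries in the provider:

hadoop credential list -provider localjceks://file/path/to/aws.jceks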
We'll then copy this JCEKS file to HDFS with the appropriate permissions.
hdfs dfs -put /path/to/aws.jceks /user/admin/
hdfs dfs -chown admin:admin /user/admin/aws.jceks
hdfs dfs -chmod 400 /user/admin/aws.jceks
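To verify the result, list the file; it should show mode -r-------- with owner admin:

hdfs dfs -ls /user/admin/aws.jceks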
We can then use the credential provider when calling hadoop distcp, as follows:
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks /user/admin/file s3a://my-bucket/
Notice that only the admin user can read this credential file. If other users attempt to run the command above, they will receive a permissions error because they cannot read aws.jceks.
This also works with hdfs commands, as in the example below.
hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks -ls s3a://my-bucket
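The same pattern applies to other hdfs dfs subcommands; for example, to copy an object out of the bucket (the file name is illustrative):

hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks -cp s3a://my-bucket/file /user/admin/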
Created on 10-13-2016 04:17 AM
Very nice information. We had the same scenario: the AWS keys were exposed to the ambari user through which we run the backup (HDFS to AWS S3) using AWS credentials. We have now changed to role-based authentication, which means we don't need to use any credentials at all; we just need to set the appropriate permissions on the AWS end. Just thought of sharing the info.
Before
"hadoop distcp -Dfs.s3a.server-side-encryption-algorithm=AES256 -Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID} -Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} -update hdfs://$dir/ s3a://${BUCKET_NAME}/CCS/$table_name/$year/$month/ "
After
" hadoop distcp -Dfs.s3a.server-side-encryption-algorithm=AES256 -update hdfs://$dir/ s3a://${BUCKET_NAME}/CCVR/$table_name/$year/$month/ "
OPTIONS:
<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
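As a side note (not from the original post): on recent Hadoop versions you can also pin S3A explicitly to the EC2 instance profile, rather than relying on the default credential chain, with something like:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
  <description>Authenticate via the EC2 instance profile (IAM role) only.</description>
</property>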
Created on 10-13-2016 04:24 AM
Thanks @Muthukumar S, can you please provide further details? How does role-based authentication work with an on-premise source outside of AWS?
Created on 10-13-2016 06:41 AM
The above applies to AWS instances, since we had been passing credentials with the command. For an on-prem setup I would need to check. One thing I do know: when we set up on-prem servers with the AWS CLI installed, we can run the aws configure command once to provide the credentials, and from then on we can run aws s3 commands from the command line to access AWS S3 (provided things are set up on the AWS end, such as IAM user creation, bucket policy, etc.). But with hadoop distcp, the approach you provided is the solution. Perhaps we can check with the AWS folks whether there is a role-based option from on-prem.
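For reference, the AWS CLI flow described above looks roughly like this (the bucket name is illustrative):

aws configure
# prompts once for access key, secret key, default region, and output format
aws s3 ls s3://my-bucket/
# subsequent aws s3 commands reuse the stored credentials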
Created on 05-04-2017 03:23 AM
@slachterman We have created an instance profile for the node and have not added credentials to core-site.xml. hadoop fs -ls s3a:// works, and even selecting a few rows from the external table (whose data is in S3) works, but when I try an aggregation function like:
select max(updated_at) from s3_table;
This query fails with the error below. Could you please help?
Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Cannot find password option fs.s3a.access.key
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
    at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:694)
    at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:653)
    at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
    at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:184)
    ... 15 more
Caused by: java.io.IOException: java.io.IOException: Cannot find password option fs.s3a.access.key
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:382)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
    ... 26 more
Caused by: java.io.IOException: Cannot find password option fs.s3a.access.key
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:489)
    at org.apache.hadoop.fs.s3a.S3AUtils.getPassword(S3AUtils.java:468)
    at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:451)
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:341)
    at org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createS3Client(S3ClientFactory.java:73)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:185)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2795)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:244)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:94)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:80)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:380)
    ... 27 more
Caused by: java.io.IOException: Configuration problem with provider path.
    at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:1999)
    at org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:1959)
    at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:484)
    ... 45 more
Caused by: java.io.IOException: No CredentialProviderFactory for jceks://file/usr/hdp/current/hive-server2-hive2/conf/conf.server/hive-site.jceks in hadoop.security.credential.provider.path
    at org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:66)
    at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:1979)
Created on 05-04-2017 02:08 PM
Hi @Manmeet Kaur, please post this on HCC as a separate question.