
When running a distcp process from HDFS to AWS S3, credentials are required to authenticate to the S3 bucket. Embedding these in the S3A URI would leak secret values into application logs. Storing them in core-site.xml is not ideal either, because any user with hdfs CLI access can then access the S3 bucket to which these AWS credentials are tied.

The Hadoop Credential API can be used to manage access to S3 in a more fine-grained way.

The first step is to create a local JCEKS file in which to store the AWS Access Key and AWS Secret Key values:

hadoop credential create fs.s3a.access.key -provider localjceks://file/path/to/aws.jceks
<enter Access Key value at prompt>
hadoop credential create fs.s3a.secret.key -provider localjceks://file/path/to/aws.jceks
<enter Secret Key value at prompt>
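
To confirm that both aliases were stored, the keystore can be listed with hadoop credential list. This is a minimal sketch that assumes the same placeholder path as above; local_provider_uri is a hypothetical helper introduced here only to build the localjceks URI, and the listing is skipped when hadoop is not on the PATH.

```shell
# Hypothetical helper: builds a localjceks provider URI from an absolute local path.
local_provider_uri() {
  # Strip the leading slash so the result has the form localjceks://file/<path>
  printf 'localjceks://file/%s\n' "${1#/}"
}

PROVIDER="$(local_provider_uri /path/to/aws.jceks)"

# List the aliases stored in the keystore (only if hadoop is installed).
# Both fs.s3a.access.key and fs.s3a.secret.key should appear in the listing.
if command -v hadoop >/dev/null 2>&1; then
  hadoop credential list -provider "$PROVIDER"
fi
```

The listing shows alias names only, not the secret values.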

We'll then copy this JCEKS file to HDFS with the appropriate permissions.

hdfs dfs -put /path/to/aws.jceks /user/admin/
hdfs dfs -chown admin:admin /user/admin/aws.jceks
hdfs dfs -chmod 400 /user/admin/aws.jceks
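
It may be worth checking that the permissions were applied before relying on them. The sketch below assumes the path used above and that the file has no extended ACLs (which would append a + to the permission string); is_owner_read_only is a hypothetical helper, not part of the hdfs CLI.

```shell
# Hypothetical helper: succeeds only if the permission string (first column
# of `hdfs dfs -ls`) grants read access to the owner and nothing to anyone else.
is_owner_read_only() {
  [ "$1" = "-r--------" ]
}

JCEKS_PATH=/user/admin/aws.jceks   # keystore path used in the steps above

if command -v hdfs >/dev/null 2>&1; then
  # Pick the listing line whose last field is the keystore path.
  perms="$(hdfs dfs -ls "$JCEKS_PATH" | awk -v p="$JCEKS_PATH" '$NF == p {print $1}')"
  if is_owner_read_only "$perms"; then
    echo "OK: $JCEKS_PATH is readable by its owner only"
  else
    echo "WARNING: unexpected permissions on $JCEKS_PATH: $perms" >&2
  fi
fi
```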

We can then use the credential provider when calling hadoop distcp, as follows:

hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks /user/admin/file s3a://my-bucket/
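
If the command is run often, the provider option can be assembled by a small wrapper so the keystore path is written in one place. This is only a sketch: provider_opt and s3_distcp_cmd are hypothetical names, and the wrapper prints the command for review rather than executing it.

```shell
# Hypothetical helper: builds the -D option that points Hadoop at a JCEKS
# keystore stored in HDFS (argument: absolute HDFS path to the keystore).
provider_opt() {
  printf '%s%s\n' '-Dhadoop.security.credential.provider.path=jceks://hdfs' "$1"
}

# Hypothetical wrapper: prints the distcp command instead of running it.
# Arguments: keystore path in HDFS, source path, destination S3A URI.
s3_distcp_cmd() {
  echo hadoop distcp "$(provider_opt "$1")" "$2" "$3"
}

s3_distcp_cmd /user/admin/aws.jceks /user/admin/file s3a://my-bucket/
```

Removing the echo inside s3_distcp_cmd would run the copy directly.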

Notice that only the admin user can read this credentials file. If other users attempt to run the command above they will receive a permissions error because they can't read aws.jceks.

This also works with hdfs commands, as in the example below:

hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/user/admin/aws.jceks -ls s3a://my-bucket
Comments

@slachterman

Very nice information. We had the same scenario: the AWS keys were exposed to the ambari user, under which we run the backup (HDFS to AWS S3) using AWS credentials. We have now switched to role-based authentication, which means we don't need to pass any credentials; we only need to set up the appropriate permissions on the AWS end. Just thought I'd share the info.

Before

"hadoop distcp -Dfs.s3a.server-side-encryption-algorithm=AES256 -Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID} -Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY} -update hdfs://$dir/ s3a://${BUCKET_NAME}/CCS/$table_name/$year/$month/"

After

"hadoop distcp -Dfs.s3a.server-side-encryption-algorithm=AES256 -update hdfs://$dir/ s3a://${BUCKET_NAME}/CCVR/$table_name/$year/$month/"

OPTIONS:

<property>
  <name>fs.s3a.access.key</name>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>


Thanks @Muthukumar S, can you please provide further details? How does role-based authentication work with an on-premise source outside of AWS?


@slachterman

The above applies to AWS instances, where we had been passing credentials with the command. For an on-prem setup I would need to check. One thing I know is that when on-prem servers have the AWS CLI installed, we can run the aws configure command to provide the credentials once, and from then on run aws s3 commands from the command line to access S3 (provided things are set up on the AWS end, such as IAM user creation and the bucket policy). But for hadoop distcp, the approach you provided is the solution. Maybe we can check with the AWS team whether there is a role-based option for on-prem.


@slachterman We have created an instance profile for the node and have not added credentials to core-site.xml. hadoop fs -ls s3a:// works, and even selecting a few rows from the external table (whose data is in S3) works, but when I try an aggregation function such as:

select max(updated_at) from s3_table;

This query fails with the error below. Could you please help?

Caused by: java.lang.RuntimeException: java.io.IOException: java.io.IOException: Cannot find password option fs.s3a.access.key
  at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
  at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.<init>(TezGroupedSplitsInputFormat.java:145)
  at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat.getRecordReader(TezGroupedSplitsInputFormat.java:111)
  at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:157)
  at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
  at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:694)
  at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:653)
  at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:145)
  at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:109)
  at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
  at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
  at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:184)
  ... 15 more
Caused by: java.io.IOException: java.io.IOException: Cannot find password option fs.s3a.access.key
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
  at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
  at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:382)
  at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
  ... 26 more
Caused by: java.io.IOException: Cannot find password option fs.s3a.access.key
  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:489)
  at org.apache.hadoop.fs.s3a.S3AUtils.getPassword(S3AUtils.java:468)
  at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:451)
  at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:341)
  at org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createS3Client(S3ClientFactory.java:73)
  at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:185)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2795)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:244)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:94)
  at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:80)
  at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
  at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:380)
  ... 27 more
Caused by: java.io.IOException: Configuration problem with provider path.
  at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:1999)
  at org.apache.hadoop.conf.Configuration.getPassword(Configuration.java:1959)
  at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:484)
  ... 45 more
Caused by: java.io.IOException: No CredentialProviderFactory for jceks://file/usr/hdp/current/hive-server2-hive2/conf/conf.server/hive-site.jceks in hadoop.security.credential.provider.path
  at org.apache.hadoop.security.alias.CredentialProviderFactory.getProviders(CredentialProviderFactory.java:66)
  at org.apache.hadoop.conf.Configuration.getPasswordFromCredentialProviders(Configuration.java:1979)

Hi @Manmeet Kaur, please post this on HCC as a separate question.