
HDP Use IAM role Writing to S3 with SSE-KMS encryption

New Contributor

I set up an HDP cluster on AWS EC2 instances with an IAM role configured for S3 access.

I tested this setup on two S3 buckets, one with AWS-KMS encryption and the other without encryption.

I can write to both of these buckets successfully using the "aws s3" command:

aws s3 cp testfile.txt s3://no_encryption_bucket/testfile.txt

aws s3 cp testfile.txt s3://encryption_bucket/testfile.txt

These work just as expected: the first file is not encrypted and the second file is encrypted with AWS-KMS.

But I failed to write to the encrypted bucket with the "hadoop fs" command:

hadoop fs -copyFromLocal testfile.txt s3a://no_encryption_bucket/hadoop_testfile_hdpA.txt

This command ran successfully and the resulting file was not encrypted, just as expected.

hadoop fs -copyFromLocal testfile.txt s3a://encryption_bucket/hadoop_testfile_hdpA.txt

This command failed with the following error:

copyFromLocal: saving output on encryption_bucket/test/hadoop_testfile_hdpA.txt._COPYING_: com.amazonaws.AmazonClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: 41d5f50238eefddc6de740d997ddc23e in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: pkYbRKamqFtjgaN8gTsaCw==, md5DigestStream: null, bucketName: encryption_bucket, key: test/hadoop_testfile_hdpA.txt._COPYING_): Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: 41d5f50238eefddc6de740d997ddc23e in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: pkYbRKamqFtjgaN8gTsaCw==, md5DigestStream: null, bucketName: encryption_bucket, key: test/hadoop_testfile_hdpA.txt._COPYING_)

It seems HDP picks up fs.s3a.access.key and fs.s3a.secret.key from the IAM role, but it does not figure out fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key from the IAM role.
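For reference, setting those two properties explicitly per command would look something like the sketch below. This is only a sketch: it assumes an S3A build that supports SSE-KMS (Hadoop 2.8+ or an HDP build that backports it), and <kms-key-arn> is a placeholder for the bucket's KMS key:

# Sketch: pass the SSE-KMS settings explicitly on the command line
# (assumes S3A with SSE-KMS support; <kms-key-arn> is a placeholder)
hadoop fs \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  -Dfs.s3a.server-side-encryption.key=<kms-key-arn> \
  -copyFromLocal testfile.txt s3a://encryption_bucket/hadoop_testfile_hdpA.txt

This is not what I want long-term (see below), but it would confirm whether the encryption settings are the missing piece.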

Thanks for any help or any clue.

6 REPLIES

Re: HDP Use IAM role Writing to S3 with SSE-KMS encryption

New Contributor

I tried the "hdfs dfs" command and got a similar error when writing to the encrypted bucket:

hdfs dfs -put testfile.txt s3a://encryption_bucket/hdfs_testfile_hdpA.txt

put: saving output on hdfs_testfile_hdpA.txt._COPYING_: com.amazonaws.AmazonClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: a54112d682d10f89c1f9c1e49968bb0f in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: pkYbRKamqFtjgaN8gTsaCw==, md5DigestStream: null, bucketName: clients, key: hdfs_testfile_hdpA.txt._COPYING_): Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: a54112d682d10f89c1f9c1e49968bb0f in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: pkYbRKamqFtjgaN8gTsaCw==, md5DigestStream: null, bucketName: clients, key: hdfs_testfile_hdpA.txt._COPYING_)

Re: HDP Use IAM role Writing to S3 with SSE-KMS encryption

Mentor

@Bright Lee

Have you tried DistCp?

hadoop distcp /source-folder s3a://destination-bucket
	
To access the DistCp utility, SSH to any node in your cluster. By default, DistCp is invoked against the cluster's default file system, which is defined by the configuration property fs.defaultFS in core-site.xml.
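If the target bucket enforces SSE-KMS, you can also try passing the S3A encryption settings with -D on the DistCp command line. A rough sketch, assuming your S3A build supports SSE-KMS, with <kms-key-arn> as a placeholder for your KMS key:

# Sketch: DistCp with explicit S3A SSE-KMS settings (placeholders, not verified on HDP 2.6.1)
hadoop distcp \
  -Dfs.s3a.server-side-encryption-algorithm=SSE-KMS \
  -Dfs.s3a.server-side-encryption.key=<kms-key-arn> \
  /source-folder s3a://destination-bucket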

Re: HDP Use IAM role Writing to S3 with SSE-KMS encryption

New Contributor

Thanks for your suggestion.

I ran hadoop distcp and got a similar error while writing to the encrypted bucket:

hadoop distcp temp/testfile.txt s3a://encryption-bucket/data-sources-reports/media_mix_model/test/

18/03/21 14:24:26 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, append=false, useDiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[temp/testfile.txt], targetPath=s3a://encryption-bucket/data-sources-reports/media_mix_model/test, targetPathExists=true, filtersFile='null', verboseLog=false}
18/03/21 14:24:26 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-97.us-east-2.compute.internal/10.0.0.97:8050
18/03/21 14:24:26 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-0-97.us-east-2.compute.internal/10.0.0.97:10200
18/03/21 14:24:27 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
18/03/21 14:24:27 INFO tools.SimpleCopyListing: Build file listing completed.
18/03/21 14:24:27 INFO tools.DistCp: Number of paths in the copy list: 1
18/03/21 14:24:27 INFO tools.DistCp: Number of paths in the copy list: 1
18/03/21 14:24:27 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-97.us-east-2.compute.internal/10.0.0.97:8050
18/03/21 14:24:27 INFO client.AHSProxy: Connecting to Application History server at ip-10-0-0-97.us-east-2.compute.internal/10.0.0.97:10200
18/03/21 14:24:27 INFO mapreduce.JobSubmitter: number of splits:1
18/03/21 14:24:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1521586586857_0006
18/03/21 14:24:27 INFO impl.YarnClientImpl: Submitted application application_1521586586857_0006
18/03/21 14:24:27 INFO mapreduce.Job: The url to track the job: http://ip-10-0-0-97.us-east-2.compute.internal:8088/proxy/application_1521586586857_0006/
18/03/21 14:24:27 INFO tools.DistCp: DistCp job-id: job_1521586586857_0006
18/03/21 14:24:27 INFO mapreduce.Job: Running job: job_1521586586857_0006
18/03/21 14:24:33 INFO mapreduce.Job: Job job_1521586586857_0006 running in uber mode : false
18/03/21 14:24:33 INFO mapreduce.Job: map 0% reduce 0%
18/03/21 14:24:43 INFO mapreduce.Job: map 100% reduce 0%
18/03/21 14:24:45 INFO mapreduce.Job: Task Id : attempt_1521586586857_0006_m_000000_0, Status : FAILED
Error: java.io.IOException: File copy failed: hdfs://ip-10-0-0-75.us-east-2.compute.internal:8020/user/mli/temp/testfile.txt --> s3a://encryption-bucket/test/testfile.txt
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:299)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:266)
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://ip-10-0-0-75.us-east-2.compute.internal:8020/user/mli/temp/testfile.txt to s3a://encryption-bucket/data-sources-reports/media_mix_model/test/testfile.txt
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:296)
... 10 more
Caused by: org.apache.hadoop.fs.s3a.AWSClientIOException: saving output on data-sources-reports/media_mix_model/test/.distcp.tmp.attempt_1521586586857_0006_m_000000_0: com.amazonaws.AmazonClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: dd9f32656c595bd40176556bd2f65a68 in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: pkYbRKamqFtjgaN8gTsaCw==, md5DigestStream: null, bucketName: encryption-bucket, key: data-sources-reports/media_mix_model/test/.distcp.tmp.attempt_1521586586857_0006_m_000000_0): Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: dd9f32656c595bd40176556bd2f65a68 in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: pkYbRKamqFtjgaN8gTsaCw==, md5DigestStream: null, bucketName: encryption-bucket, key: test/.distcp.tmp.attempt_1521586586857_0006_m_000000_0)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:144)
at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:121)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:260)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:183)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:123)
at

........

Re: HDP Use IAM role Writing to S3 with SSE-KMS encryption

New Contributor

I added this property to the custom core-site using Ambari for HDFS:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SharedInstanceProfileCredentialsProvider</value>
</property>
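As a sanity check that the nodes can see the instance profile, the EC2 metadata service can be queried directly from a cluster node (a diagnostic sketch only; the role name is a placeholder):

# List the instance profile role attached to this node, then dump its temporary credentials
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>

The "aws s3" tests above already suggest the role itself is working, so the problem looks specific to the S3A encryption handling.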

I followed these links:

https://community.hortonworks.com/questions/138691/how-to-configure-hdp26-to-use-s3.html

https://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#S3A

But the issue remains.

Here are my versions:

HDP: 2.6.1

Hadoop (HDFS/YARN): 2.7.3

Two questions:

Will Hadoop 2.8.0 solve this issue with the above setup?

When will HDP support Hadoop 2.8 or above?

Thanks.

Re: HDP Use IAM role Writing to S3 with SSE-KMS encryption

Mentor

@Bright Lee

It seems the job runs fine but fails with this issue: "Caused by: org.apache.hadoop.fs.s3a.AWSClientIOException: saving output on data-sources-reports/media_mix_model/test/.distcp.tmp.attempt_1521586586857_0006_m_000000_0: com.amazonaws.AmazonClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: pkYbRKamqFtjgaN8gTsaCw== in base 64) didn't match hash (etag: dd9f32656c595bd40176556bd2f65a68 in hex)"

18/03/21 14:24:27 INFO tools.DistCp: DistCp job-id: job_1521586586857_0006 
18/03/21 14:24:27 INFO mapreduce.Job: Running job: job_1521586586857_0006 
18/03/21 14:24:33 INFO mapreduce.Job: Job job_1521586586857_0006 running in uber mode : false 
18/03/21 14:24:33 INFO mapreduce.Job: map 0% reduce 0% 
18/03/21 14:24:43 INFO mapreduce.Job: map 100% reduce 0% 
18/03/21 14:24:45 INFO mapreduce.Job: Task Id : attempt_1521586586857_0006_m_000000_0, Status : 
FAILED Error: java.io.IOException: 
File copy failed: hdfs://ip-10-0-0-75.us-east-2.compute.internal:8020/user/mli/temp/testfile.txt --> s3a://encryption-bucket/test/testfile.txt at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:299) 
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:266) 
at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) 
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) 

This looks like S3 server-side (data-at-rest) encryption. Please try enabling it on the Hadoop side by adding the following property in core-site.xml:

<property>
  <name>fs.s3n.server-side-encryption-algorithm</name>
  <value>AES256</value>
  <description>
    Specify a server-side encryption algorithm for S3.
    The default is NULL, and the only other currently allowable value is AES256.
  </description>
</property>
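Since the commands in the question use s3a:// URLs, the S3A-flavored property is the one that would apply. A sketch of the equivalent setting (the value is an assumption: AES256 matches SSE-S3, while a KMS-encrypted bucket would need SSE-KMS on builds that support it):

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <!-- AES256 for SSE-S3; use SSE-KMS instead on S3A builds that support it -->
  <value>AES256</value>
</property>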

Please revert

Re: HDP Use IAM role Writing to S3 with SSE-KMS encryption

New Contributor

I am testing on two buckets, one has default encryption as None, and the other has default encryption as AWS-KMS.

The IAM role should figure this out automatically, just as it does when I use the "aws s3" command.

With your setup, I would have to set both fs.s3a.server-side-encryption-algorithm (with value "SSE-KMS") and fs.s3a.server-side-encryption.key. In that setup, both buckets would use the same encryption, which is not what I want. Also, I would need to put the server-side encryption key in the configuration, which is not safe.

I want HDP/Hadoop to use the IAM role and pick the right server-side encryption algorithm and key for each bucket automatically, based on the IAM role setup.
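For reference, newer S3A releases (Hadoop 2.9+/3.x, or HDP builds that backport per-bucket configuration) would at least let me scope these settings to one bucket so the unencrypted bucket is not affected. A sketch only, with a placeholder bucket name and KMS key ARN:

<property>
  <name>fs.s3a.bucket.encryption-bucket.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.bucket.encryption-bucket.server-side-encryption.key</name>
  <!-- placeholder: the KMS key ARN used by this bucket's default encryption -->
  <value>arn:aws:kms:us-east-2:111122223333:key/EXAMPLE-KEY-ID</value>
</property>

That still means putting the key ARN in configuration, though, which is what I would like to avoid.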

Thanks.