
How to use s3a with HDP

Super Collaborator

I'm trying to use distcp to copy data to an S3 bucket, and experiencing nothing but pain.

I've tried something like this:

sudo -u hdfs hadoop distcp -Dhadoop.root.logger="DEBUG,console" -Dmapreduce.job.maxtaskfailures.per.tracker=1 -bandwidth 10 -i -log /user/hdfs/s3_staging/logging/distcp.log hdfs:///apps/hive/warehouse/my_db/my_table s3n://my_bucket/my_path

But I encounter this error:

http://stackoverflow.com/questions/37868404/distcp-from-hadoop-to-s3-fails-with-no-space-available-i...

From what I've read, I might have more luck trying s3a instead of s3n, but when I try the same command above using "s3a" in the URL, I get this error:

"No FileSystem for scheme: S3a"

Can someone please give me some insight into getting this working with either filesystem?
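
For reference, the s3a attempt is literally the same command as above with only the scheme changed:

sudo -u hdfs hadoop distcp -Dhadoop.root.logger="DEBUG,console" -Dmapreduce.job.maxtaskfailures.per.tracker=1 -bandwidth 10 -i -log /user/hdfs/s3_staging/logging/distcp.log hdfs:///apps/hive/warehouse/my_db/my_table s3a://my_bucket/my_path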


Re: How to use s3a with HDP

Contributor

s3n is pretty much deprecated; please use "s3a". Which version of HDP are you using? Check that you have the relevant s3a libraries (aws-java-sdk-s3*.jar) in your hadoop install, and add "-Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" to the distcp command.
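
For example, a minimal sketch reusing the source and destination from the original post (bucket and paths are placeholders; adjust to your environment):

sudo -u hdfs hadoop distcp -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem hdfs:///apps/hive/warehouse/my_db/my_table s3a://my_bucket/my_path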

Re: How to use s3a with HDP

Super Collaborator

Thanks @Rajesh Balamohan

I see that I only had aws-java-sdk-s3*.jar under /usr/hdp/current/zeppelin/lib/lib, so I copied it to /usr/hdp/current/hadoop/lib and /usr/hdp/current/hadoop-mapreduce/lib, but when I try to run with the -Dfs.s3a.impl argument, I get the error below.

I have the proper AWS credentials in my config and I don't have credential-related issues if I try an s3n: URL, so I think this is really an issue with finding the right jars.

Do I need to add that jar to a path somewhere?

Any ideas?

16/11/11 06:25:41 ERROR tools.DistCp: Invalid arguments:
com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
        at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
        at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
        at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:228)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:216)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Invalid arguments: Unable to load AWS credentials from any provider in the chain
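
(For anyone hitting the same error: s3n and s3a read their credentials from different configuration properties, so keys that work for s3n:// URLs are not automatically picked up by s3a://. Roughly, the per-scheme properties are the following; the values are placeholders:

fs.s3n.awsAccessKeyId=<access key used by s3n://>
fs.s3n.awsSecretAccessKey=<secret key used by s3n://>
fs.s3a.access.key=<access key used by s3a://>
fs.s3a.secret.key=<secret key used by s3a://>

That turned out to be the root cause - see the accepted solution below.)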

Re: How to use s3a with HDP (Accepted Solution)

Super Collaborator

I figured it out - I needed to add fs.s3a.access.key and fs.s3a.secret.key values to my HDFS config in Ambari.

I already had fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, but apparently those only apply to s3:// URLs.

So I had to do the following to get distcp to work on HDP 2.4.2:

1. Add aws-java-sdk-s3-1.10.62.jar to hadoop/lib on the node running the command.

2. Add hadoop/lib* to the classpath for MapReduce and YARN.

3. Add the fs.s3a.access.key and fs.s3a.secret.key properties to the HDFS config in Ambari (sketched below).
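
A rough sketch of step 3 (values are placeholders; in my case the properties went into the HDFS configs in Ambari, i.e. custom core-site):

fs.s3a.access.key=<your AWS access key id>
fs.s3a.secret.key=<your AWS secret access key>

And for step 2, assuming the same /usr/hdp/current/hadoop/lib path mentioned earlier, the classpath change is roughly a matter of appending that directory to the existing MapReduce and YARN classpath properties:

mapreduce.application.classpath=<existing value>:/usr/hdp/current/hadoop/lib/*
yarn.application.classpath=<existing value>,/usr/hdp/current/hadoop/lib/*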

Re: How to use s3a with HDP

Super Collaborator

Oh - I also needed this in the HDFS configs:

fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
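
With the keys and fs.s3a.impl set in core-site, the distcp itself shouldn't need any extra -D flags; something like this (same paths as above) should be enough:

sudo -u hdfs hadoop distcp hdfs:///apps/hive/warehouse/my_db/my_table s3a://my_bucket/my_path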

Re: How to use s3a with HDP