Created 11-10-2016 09:07 PM
I'm trying to use distcp to copy data to an S3 bucket, and experiencing nothing but pain.
I've tried something like this:
sudo -u hdfs hadoop distcp -Dhadoop.root.logger="DEBUG,console" -Dmapreduce.job.maxtaskfailures.per.tracker=1 -bandwidth 10 -i -log /user/hdfs/s3_staging/logging/distcp.log hdfs:///apps/hive/warehouse/my_db/my_table s3n://my_bucket/my_path
But that command fails with an error.
From what I've read, I might have more luck trying s3a instead of s3n, but when I try the same command above using "s3a" in the URL, I get this error:
"No FileSystem for scheme: S3a"
Can someone please give me some insight into getting this working with either filesystem?
Created 11-11-2016 03:26 AM
s3n is pretty much deprecated; please use s3a. Which version of HDP are you using? Check that you have the relevant s3a libraries (aws-java-sdk-s3*.jar) on the Hadoop classpath, and add "-Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" to the command.
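For example, something like this once the jars are in place (a sketch built from the command in the question, with the bucket and paths left unchanged):

# Sketch: the original distcp command retried against s3a, with the
# filesystem implementation pinned explicitly as suggested above.
sudo -u hdfs hadoop distcp \
  -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  -bandwidth 10 -i \
  -log /user/hdfs/s3_staging/logging/distcp.log \
  hdfs:///apps/hive/warehouse/my_db/my_table \
  s3a://my_bucket/my_path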
Created 11-11-2016 11:29 AM
Thanks @Rajesh Balamohan
I see that I only had aws-java-sdk-s3*.jar under /usr/hdp/current/zeppelin/lib/lib, so I copied it to /usr/hdp/current/hadoop/lib and /usr/hdp/current/hadoop-mapreduce/lib, but when I try to run with the -Dfs.s3a.impl argument, I get the error below.
I have the proper AWS credentials in my config and I don't have credential-related issues if I try an s3n: URL, so I think this is really an issue with finding the right jars.
Do I need to add that jar to a path somewhere?
Any ideas?
16/11/11 06:25:41 ERROR tools.DistCp: Invalid arguments: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
    at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:228)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:216)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Invalid arguments: Unable to load AWS credentials from any provider in the chain
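(One way to tell a credentials problem apart from a classpath problem is to pass the s3a keys inline as generic options. This is a sketch with placeholder values, not from the original thread - don't paste real keys into shell history on a shared box.)

# Sketch: supply s3a credentials on the command line to rule out the classpath.
sudo -u hdfs hadoop distcp \
  -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  hdfs:///apps/hive/warehouse/my_db/my_table \
  s3a://my_bucket/my_path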
Created 11-11-2016 01:11 PM
I figured it out - I needed to add fs.s3a.access.key and fs.s3a.secret.key values to my HDFS config in Ambari.
I already had fs.s3.awsAccessKeyId and fs.s3.awsSecretKeyId, but apparently those only apply to s3:// URLs.
So I had to do the following to get distcp to work on HDP 2.4.2 (a rough sketch follows the list):
Add aws-java-sdk-s3-1.10.62.jar to hadoop/lib on the node running the command
Add hadoop/lib* to the classpath for MapReduce and YARN
Add fs.s3a.access.key and fs.s3a.secret.key properties to the HDFS config in Ambari.
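In shell terms the steps look roughly like this. The jar path comes from earlier in the thread, and the classpath property names (mapreduce.application.classpath, yarn.application.classpath) are my best guess at what "classpath for MapReduce and YARN" maps to in Ambari - verify them against your cluster before applying:

# 1. Copy the AWS SDK jar where the Hadoop client and MR framework can see it.
cp /usr/hdp/current/zeppelin/lib/lib/aws-java-sdk-s3-1.10.62.jar /usr/hdp/current/hadoop/lib/
cp /usr/hdp/current/zeppelin/lib/lib/aws-java-sdk-s3-1.10.62.jar /usr/hdp/current/hadoop-mapreduce/lib/
# 2. In Ambari, append the hadoop/lib/* entry to mapreduce.application.classpath
#    and yarn.application.classpath so MR/YARN containers can load the jar too.
# 3. In the HDFS config in Ambari, add:
#      fs.s3a.access.key = <your AWS access key>
#      fs.s3a.secret.key = <your AWS secret key>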
Created 11-11-2016 01:13 PM
Oh, and you also need this in the HDFS configs:
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
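Once that's in place, a quick sanity check before re-running the full distcp is just to list the bucket (a sketch, using the bucket name from the question):

# Sketch: confirm s3a is wired up; this fails fast if the impl class or
# credentials are still missing.
sudo -u hdfs hadoop fs -ls s3a://my_bucket/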
Created 11-29-2016 12:30 PM
You need to set the s3a properties to authenticate; they are separate from the s3n ones.
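Concretely, the two schemes read different credential properties (these are the standard property names; the placeholder values are mine, not from the thread):

# s3a credentials:
fs.s3a.access.key=<your AWS access key>
fs.s3a.secret.key=<your AWS secret key>
# legacy s3n credentials, if you still use that scheme:
fs.s3n.awsAccessKeyId=<your AWS access key>
fs.s3n.awsSecretAccessKey=<your AWS secret key>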