How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

Solved


New Contributor

I used hadoop distcp as given below:

hadoop distcp hdfs://hdfs_host:hdfs_port/hdfs_path/hdfs_file.txt s3n://s3_aws_access_key_id:s3_aws_access_key_secret@my_bucketname/

My Hadoop cluster is behind the company's HTTP proxy server, and I can't figure out how to specify the proxy when connecting to S3. The error I get is: ERROR tools.DistCp: Invalid arguments: org.apache.http.conn.ConnectTimeoutException: Connect to my_bucketname.s3.amazonaws.com:443 timed out.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

If you use the s3a:// client, you can set the fs.s3a.proxy settings (host, port, username, password, domain, workstation) to get through the proxy.

See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
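For example, the proxy settings can go into core-site.xml. The property names below are the ones described in the hadoop-aws documentation linked above; the host, port, and credential values are placeholders you would replace with your own proxy details:

```xml
<!-- Placeholder values; substitute your company's proxy host and port. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>
<!-- Optional: only needed if your proxy requires authentication. -->
<property>
  <name>fs.s3a.proxy.username</name>
  <value>proxy_user</value>
</property>
<property>
  <name>fs.s3a.proxy.password</name>
  <value>proxy_password</value>
</property>
```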

8 REPLIES

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

@Venu Shanmukappa

The "443 timed out" error means the cluster has no connectivity to S3. See if this helps.

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

It won't; Java doesn't look at the OS proxy settings. (There are a couple of exceptions, but they don't usually surface in a world where applets are disabled.)

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

Mentor
@Venu Shanmukappa

I'm glad you're using HCC. Let us know if Neeraj's link helps, and mark it as the best answer if it does. @azeltov


Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

New Contributor

Thanks all for your replies...

After adding fs.s3a.proxy.port and fs.s3a.proxy.host to core-site.xml as suggested by stevel, I am able to move HDFS files directly to AWS S3 using the s3a:// URI scheme with the distcp tool.
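For anyone following along, the working invocation looks something like this. The proxy host/port and the bucket name are placeholders for your own environment; the -D flags can be dropped if the same properties are already set in core-site.xml:

```shell
# Proxy and credential properties can live in core-site.xml
# or be passed per-job on the command line as shown here.
hadoop distcp \
  -D fs.s3a.proxy.host=proxy.example.com \
  -D fs.s3a.proxy.port=8080 \
  -D fs.s3a.access.key=YOUR_ACCESS_KEY \
  -D fs.s3a.secret.key=YOUR_SECRET_KEY \
  hdfs://hdfs_host:hdfs_port/hdfs_path/hdfs_file.txt \
  s3a://my_bucketname/
```

Passing the keys with -D keeps them out of the cluster-wide configuration, though on a shared machine they will still be visible in the process list; role-based authentication avoids that entirely.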

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

New Contributor

@Venu Shanmukappa How did you add the proxy? Can you please explain?

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

New Contributor

Hi @Venu Shanmukappa

You can also use Hadoop 'cp' command after following the below steps :

1) Configure core-site.xml with the following AWS properties:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS access key ID. Omit for role-based authentication.</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS secret key. Omit for role-based authentication.</value>
</property>

2) Add the JAR file provided by AWS (aws-java-sdk-1.7.4.jar) to the HADOOP_CLASSPATH environment variable using the command below.

$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

3) Use the hadoop "cp" command to copy the source data (local HDFS) to the destination (AWS S3 bucket):

$ hadoop fs -cp /user/ubuntu/filename.txt s3n://S3-Bucket-Name/filename.txt

Re: How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

New Contributor

Could you please explain this in detail?
