
How to copy an HDFS file to an AWS S3 bucket? hadoop distcp is not working.

Explorer

I used hadoop distcp as given below:

hadoop distcp hdfs://hdfs_host:hdfs_port/hdfs_path/hdfs_file.txt s3n://s3_aws_access_key_id:s3_aws_access_key_secret@my_bucketname/

My Hadoop cluster is behind the company's HTTP proxy server, and I can't figure out how to specify this when connecting to S3. The error I get is: ERROR tools.DistCp: Invalid arguments: org.apache.http.conn.ConnectTimeoutException: Connect to my_bucketname.s3.amazonaws.com:443 timed out.

1 ACCEPTED SOLUTION


If you use the s3a:// client, then you can set the fs.s3a.proxy settings (host, port, username, password, domain, workstation) to get through.

See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
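For illustration, a minimal core-site.xml sketch along these lines; the host and port values below are placeholders, and the username/password entries are only needed if the proxy requires authentication:

<!-- Sketch: route S3A traffic through the corporate HTTP proxy.
     proxy.example.com and 8080 are placeholder values. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>
<!-- Optional: only if the proxy requires authentication -->
<property>
  <name>fs.s3a.proxy.username</name>
  <value>proxy_user</value>
</property>
<property>
  <name>fs.s3a.proxy.password</name>
  <value>proxy_password</value>
</property>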


8 REPLIES

Master Mentor
@Venu Shanmukappa

443 timed out

We have to have connectivity to S3 from the cluster. See if this helps.


It won't; Java doesn't look at the OS proxy settings. (There are a couple of exceptions, but they don't usually surface in a world where applets are disabled.)

Master Mentor
@Venu Shanmukappa

I'm glad you're utilizing HCC. Let us know if Neeraj's link helps and mark as best answer if it does. @azeltov


Explorer

Thanks all for your replies...

After adding fs.s3a.proxy.port and fs.s3a.proxy.host to core-site.xml as suggested by stevel, I am able to move HDFS files directly to AWS S3 using the s3a:// URI scheme from the distcp tool.
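For reference, a sketch of what the resulting command looks like once fs.s3a.proxy.host and fs.s3a.proxy.port are set in core-site.xml; the host, port, path, and bucket names below are just the placeholders from the original question, and the credentials can be supplied via the fs.s3a.access.key / fs.s3a.secret.key properties instead of being embedded in the URI:

hadoop distcp hdfs://hdfs_host:hdfs_port/hdfs_path/hdfs_file.txt s3a://my_bucketname/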

Explorer

@Venu Shanmukappa, how did you add the proxy? Can you please explain?

New Contributor

Hi @Venu Shanmukappa

You can also use the Hadoop 'cp' command after following the steps below:

1) Configure the core-site.xml file with the following AWS properties:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS access key ID. Omit for role-based authentication.</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS secret key. Omit for role-based authentication.</value>
</property>

2) Export the AWS SDK JAR (aws-java-sdk-1.7.4.jar) provided by AWS into the HADOOP_CLASSPATH environment variable using the command below.

$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

3) The hadoop "cp" command will then copy the source data (local HDFS) to the destination (AWS S3 bucket).

$ hadoop fs -cp /user/ubuntu/filename.txt s3n://S3-Bucket-Name/filename.txt

Explorer

Could you please explain this in detail?