
How to copy HDFS file to AWS S3 Bucket? hadoop distcp is not working.

New Contributor

I used hadoop distcp as given below:

hadoop distcp hdfs://hdfs_host:hdfs_port/hdfs_path/hdfs_file.txt s3n://s3_aws_access_key_id:s3_aws_access_key_secret@my_bucketname/

My Hadoop cluster is behind a company HTTP proxy server, and I can't figure out how to specify the proxy when connecting to S3. The error I get is: ERROR tools.DistCp: Invalid arguments: org.apache.http.conn.ConnectTimeoutException: Connect to my_bucketname.s3.amazonaws.com:443 timed out.

1 ACCEPTED SOLUTION


If you use the s3a:// client, you can set the fs.s3a.proxy settings (host, port, username, password, domain, workstation) to get through.

See https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
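As a minimal sketch of the above, the proxy settings can go into core-site.xml; the hostname and port here are placeholders for your company's proxy:

```xml
<!-- core-site.xml: route s3a traffic through the corporate HTTP proxy.
     proxy.mycompany.com and 8080 are placeholder values. -->
<property>
  <name>fs.s3a.proxy.host</name>
  <value>proxy.mycompany.com</value>
</property>
<property>
  <name>fs.s3a.proxy.port</name>
  <value>8080</value>
</property>
<!-- Only needed if the proxy requires authentication -->
<property>
  <name>fs.s3a.proxy.username</name>
  <value>proxy_user</value>
</property>
<property>
  <name>fs.s3a.proxy.password</name>
  <value>proxy_password</value>
</property>
```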


8 Replies

@Venu Shanmukappa

443 timed out

You need network connectivity to S3 from the cluster. See if this helps.

It won't; Java doesn't look at the OS proxy settings. (There are a couple of exceptions, but they don't usually surface in a world where applets are disabled.)

Mentor
@Venu Shanmukappa

I'm glad you're using HCC. Let us know if Neeraj's link helps, and mark it as the best answer if it does. @azeltov


New Contributor

Thanks all for your replies...

After adding fs.s3a.proxy.port and fs.s3a.proxy.host to core-site.xml as suggested by stevel, I am able to move HDFS files directly to AWS S3 using the s3a:// URI scheme from the distcp tool.
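For reference, the working copy then looks something like the sketch below (the bucket name and paths are placeholders); since distcp accepts generic -D options, the proxy settings can also be passed per-invocation instead of via core-site.xml:

```shell
# Copy a single HDFS file to S3 through the corporate proxy,
# using the s3a connector. Hostnames, ports, and paths are placeholders.
hadoop distcp \
  -Dfs.s3a.proxy.host=proxy.mycompany.com \
  -Dfs.s3a.proxy.port=8080 \
  hdfs://hdfs_host:hdfs_port/hdfs_path/hdfs_file.txt \
  s3a://my_bucketname/
```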

Explorer

@Venu Shanmukappa, how did you add the proxy? Can you please explain?

New Contributor

Hi @Venu Shanmukappa

You can also use the Hadoop 'cp' command after following the steps below:

1) Configure the core-site.xml file with the following AWS properties:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>AWS access key ID. Omit for Role-based authentication.</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>AWS secret key. Omit for Role-based authentication.</value>
</property>

2) Add the JAR file provided by AWS (aws-java-sdk-1.7.4.jar) to the HADOOP_CLASSPATH environment variable using the command below.

$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

3) Use the hadoop "cp" command to copy the source data (local HDFS) to the destination (AWS S3 bucket):

$ hadoop fs -cp /user/ubuntu/filename.txt s3n://S3-Bucket-Name/filename.txt
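The same copy can be sketched with the s3a connector instead, which is what the accepted solution above recommends; fs.s3a.access.key and fs.s3a.secret.key are the equivalent s3a credential properties, and the bucket name is a placeholder:

```xml
<!-- core-site.xml: s3a credentials. Omit for Role-based authentication. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>AWS access key ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>AWS secret key</value>
</property>
```

$ hadoop fs -cp /user/ubuntu/filename.txt s3a://S3-Bucket-Name/filename.txt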

Explorer

Could you please explain this in detail?