Support Questions


Options for copying Hive data to S3

Super Collaborator

I want to copy some data from Hive tables on our (bare metal) cluster to S3.

I know that I can export the data out of HDFS to a CSV file and upload that to S3, but I'm guessing that there are better ways to accomplish this.

Any ideas?

1 ACCEPTED SOLUTION

Super Guru

@Zack Riesland You can copy it directly with "hdfs dfs -put /tablepath s3://bucket/hivetable". If the Hive table is partitioned, you can run this command for each partition directory concurrently through a small shell script to increase the data ingestion speed.
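
A minimal bash sketch of that per-partition, concurrent copy might look like this (the warehouse path, bucket name, and parallelism are placeholders rather than values from this thread, and hdfs dfs -cp is used here because the source data already sits in HDFS):

    #!/usr/bin/env bash
    # Copy each partition directory of a Hive table from HDFS to S3, a few at a time.
    SRC=/apps/hive/warehouse/mydb.db/mytable    # hypothetical HDFS table location
    DEST=s3n://my_bucket/hivetable              # hypothetical S3 destination
    MAX_PARALLEL=4                              # concurrent copies to allow

    # hdfs dfs -ls prints a "Found N items" header, then one line per entry;
    # the last field of each entry line is the partition directory path.
    for part in $(hdfs dfs -ls "$SRC" | awk 'NR>1 {print $NF}'); do
      hdfs dfs -cp "$part" "$DEST/" &           # one background copy per partition
      while [ "$(jobs -rp | wc -l)" -ge "$MAX_PARALLEL" ]; do
        sleep 5                                 # crude throttle on running copies
      done
    done
    wait                                        # let the remaining copies finish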

The same S3 data can then be used again as a Hive external table:

CREATE EXTERNAL TABLE mydata (key STRING, value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3n://mysbucket/';


8 REPLIES


Super Collaborator

Thanks @Jitendra Yadav

Is this baked into HDP, or are there Amazon-related binaries that I need in order for this to work?

Super Guru

Yes, it is baked into HDP; we only need to make sure the S3 access and secret keys are in place.

See this doc: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...
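
As a quick check that the keys are being picked up, you can also pass them on the command line with Hadoop's generic -D options and try listing the bucket (the key values and bucket name below are placeholders):

    # Supply the s3n credentials per command instead of via Ambari; values are placeholders.
    hdfs dfs -D fs.s3n.awsAccessKeyId=MY_ACCESS_KEY \
             -D fs.s3n.awsSecretAccessKey=MY_SECRET_KEY \
             -ls s3n://my_bucket/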

Super Guru

Hi @Zack Riesland, please let me know if you require further info, or accept this answer to close this thread.

Super Collaborator

I added the appropriate entries to the Hive and HDFS configs in Ambari (as specified here: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...) and gave this a try:

hdfs dfs -put /user/my_user/my_hdfs_file s3://my_bucket/my_folder

I got the error:

-put: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I noticed that the instructions mention these settings:

fs.s3n.awsAccessKeyId

fs.s3n.awsSecretAccessKey

but the error message mentions these:

fs.s3.awsAccessKeyId

fs.s3.awsSecretAccessKey

Once I made that change, I was able to make some progress.

However, I think I still need a little help.

In your example, you show s3://bucket/hivetable as the destination.

But our S3 instance doesn't have tables, just folders. When I try to point at a folder, I get an error:

put: /<folder name> doesn't exist

Do I need to use the other syntax "create external table... LOCATION 's3n://mysbucket/'" to create a TABLE in S3 and then access it in this way?

Is there a similar way to simply transfer a file from hdfs to a FOLDER in an s3 bucket?

cc @Jitendra Yadav

Thanks!

Super Guru

Yes, you can also use s3n instead of s3, as mentioned in the article, and make sure the secret key is defined in the s3n properties.
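
For the folder question specifically, a copy into a folder-style prefix of the bucket can be sketched like this (same placeholder names as in the question; hadoop distcp is an alternative not mentioned above, commonly used for larger copies):

    # Copy one HDFS file into a "folder" (key prefix) in the bucket over s3n;
    # the trailing slash makes my_folder the target directory. Names are placeholders.
    hdfs dfs -cp /user/my_user/my_hdfs_file s3n://my_bucket/my_folder/

    # For whole tables or many partitions, distcp runs the copy as a MapReduce job.
    hadoop distcp /user/my_user/my_hdfs_file s3n://my_bucket/my_folder/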

Super Collaborator

I didn't understand the difference between s3 and s3n.

This link helped: http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop

Thanks again.


In addition to the above, you might also want to install and configure the AWS CLI:

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
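
For example, once the CLI is configured, one possible route for a single file is to pull it out of HDFS and push it with aws s3 cp (paths reuse the placeholder names from this thread):

    aws configure                                    # prompts for access key, secret key, and region
    hdfs dfs -get /user/my_user/my_hdfs_file /tmp/   # copy the file out of HDFS to local disk
    aws s3 cp /tmp/my_hdfs_file s3://my_bucket/my_folder/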