Options for copying Hive data to S3

Super Collaborator

I want to copy some data from Hive tables on our (bare metal) cluster to an S3 bucket.

I know that I can export the data out of HDFS to a CSV file and upload that to S3, but I'm guessing that there are better ways to accomplish this.

Any ideas?

1 ACCEPTED SOLUTION

Super Guru

@Zack Riesland You can put it directly with "hdfs dfs -put /tablepath s3://bucket/hivetable". If the Hive table is partitioned, you can run this command for each partition directory concurrently through a small shell script to increase the data ingestion speed.

The same S3 data can then be used again in a Hive external table:

CREATE EXTERNAL TABLE mydata (key STRING, value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3n://mysbucket/';
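
A minimal sketch of that kind of loop, assuming the table's partitions sit as first-level directories under one warehouse path (the paths, bucket, scheme, and parallelism below are all placeholders to adjust to your setup):

# list the table's partition directories and copy them to S3, four at a time
# -cp copies from HDFS; use -put instead if the source files are on the local disk
TABLE_DIR=/apps/hive/warehouse/mydb.db/mytable
DEST=s3n://mybucket/hivetable

hdfs dfs -ls "$TABLE_DIR" | awk '{print $NF}' | grep "^$TABLE_DIR/" | \
  xargs -P 4 -I {} hdfs dfs -cp {} "$DEST/"

For very large tables, hadoop distcp is another option, since it runs the copy as a distributed MapReduce job.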


8 REPLIES

Super Collaborator

Thanks @Jitendra Yadav

Is this baked into HDP, or are there Amazon-related binaries that I need in order for this to work?

Super Guru

Yes, it is baked into HDP; we only need to make sure that the S3 secret keys are in place.

See this doc: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...
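
For a quick test, those keys can also be passed inline as generic options rather than saved in the configs (the key values below are placeholders; the article's core-site.xml route is the persistent one):

# one-off listing against the bucket with the s3n keys supplied on the command line
hdfs dfs -D fs.s3n.awsAccessKeyId=MY_ACCESS_KEY \
         -D fs.s3n.awsSecretAccessKey=MY_SECRET_KEY \
         -ls s3n://mybucket/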

Super Guru

Hi @Zack Riesland, please let me know if you require further info, or accept this answer to close this thread.

Super Collaborator

I added the appropriate entries to the Hive and HDFS configs in Ambari (as specified here: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...) and gave this a try:

hdfs dfs -put /user/my_user/my_hdfs_file s3://my_bucket/my_folder

I got the error:

-put: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I noticed that the instructions mention these settings:

fs.s3n.awsAccessKeyId

fs.s3n.awsSecretAccessKey

but the error message mentions these:

fs.s3.awsAccessKeyId

fs.s3.awsSecretAccessKey

Once I made that change, I was able to make some progress.

However, I think I still need a little help.

In your example, you show s3://bucket/hivetable as the destination.

But our S3 instance doesn't have tables, just folders. When I try to point at a folder, I get an error:

put: /<folder name> doesn't exist

Do I need to use the other syntax "create external table... LOCATION 's3n://mysbucket/" to create a TABLE in S3 and then access it that way?

Is there a similar way to simply transfer a file from hdfs to a FOLDER in an s3 bucket?

cc @Jitendra Yadav

Thanks!

Super Guru

Yes, you can also use s3n instead of s3, as mentioned in the article; just make sure the secret key is defined in the s3n properties.
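
For example, something like this should drop the file under a folder-style prefix, assuming the fs.s3n.* key properties are already in place (the bucket and paths are just the placeholders from your earlier command):

# trailing slash makes my_folder a directory-style target;
# -cp is used because the source is in HDFS (-put reads from the local disk)
hdfs dfs -cp /user/my_user/my_hdfs_file s3n://my_bucket/my_folder/

There is no need to create anything in S3 first: the "folder" is just a key prefix, so it appears as soon as an object is written under it.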

Super Collaborator

I didn't understand the difference between s3 and s3n.

This link helped: http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop

Thanks again.

In addition to the above, you might also want to install and configure the AWS CLI:

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
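
A minimal sketch of that route, assuming the CLI is installed on a node that also has an HDFS client (bucket and file paths are the same placeholders used earlier in the thread):

# one-time setup: prompts for the access key, secret key, default region, and output format
aws configure

# stream the file out of HDFS straight into the bucket, without staging it on local disk
hdfs dfs -cat /user/my_user/my_hdfs_file | aws s3 cp - s3://my_bucket/my_folder/my_hdfs_file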