Options for copying Hive data to S3

Solved

Super Collaborator

I want to copy some data from Hive tables on our (bare metal) cluster to a S3.

I know that I can export the data out of HDFS to a CSV file and upload that to S3, but I'm guessing that there are better ways to accomplish this.

Any ideas?

1 ACCEPTED SOLUTION


Re: Options for copying Hive data to S3

@Zack Riesland You can put it directly with "hdfs dfs -put /tablepath s3://bucket/hivetable". If the Hive table is partitioned, you can run this command for each partition directory concurrently through a small shell script to speed up the data ingestion.

The same S3 data can then be used again in a Hive external table:

CREATE EXTERNAL TABLE mydata (key STRING, value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3n://mysbucket/';
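The "small shell script" mentioned above could be sketched as follows: it lists the partition directories under the table's HDFS path and launches one -put per partition in the background. The paths, bucket name, and the HDFS_CMD override are illustrative assumptions, not something from this thread:

```shell
# Hypothetical helper: upload every partition directory under a Hive
# table's HDFS path to S3 concurrently. Set HDFS_CMD=echo for a dry run.
upload_partitions() {
  table_dir=$1
  dest=$2
  # List the partition directories, keeping only the path column.
  for part in $(${HDFS_CMD:-hdfs} dfs -ls "$table_dir" 2>/dev/null | awk '/\// {print $NF}'); do
    # Background each upload so partitions copy in parallel.
    ${HDFS_CMD:-hdfs} dfs -put "$part" "$dest/$(basename "$part")" &
  done
  wait  # block until every background upload has finished
}

# On a real cluster (assumed paths):
# upload_partitions /apps/hive/warehouse/mydata s3n://my_bucket/hivetable
```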


8 REPLIES



Re: Options for copying Hive data to S3

Super Collaborator

Thanks @Jitendra Yadav

Is this baked into HDP, or are there Amazon-related binaries that I need in order for this to work?


Re: Options for copying Hive data to S3

Yes, it is baked into HDP; we only need to make sure the S3 secret keys are in place.

See this doc: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...


Re: Options for copying Hive data to S3

Hi @Zack Riesland, please let me know if you require further info, or accept this answer to close this thread.


Re: Options for copying Hive data to S3

Super Collaborator

I added the appropriate entries to the Hive and HDFS configs in Ambari (as specified here: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...) and gave this a try:

hdfs dfs -put /user/my_user/my_hdfs_file s3://my_bucket/my_folder

I got the error:

-put: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

I noticed that the instructions mention these settings:

fs.s3n.awsAccessKeyId

fs.s3n.awsSecretAccessKey

but the error message mentions these:

fs.s3.awsAccessKeyId

fs.s3.awsSecretAccessKey

Once I made that change, I was able to make some progress.

However, I think I still need a little help.

In your example, you show s3://bucket/hivetable as the destination.

But our S3 instance doesn't have tables, just folders. When I try to point at a folder, I get an error:

put: /<folder name> doesn't exist

Do I need to use the other syntax ("create external table... LOCATION 's3n://mysbucket/'") to create a TABLE in S3 and then access it that way?

Is there a similar way to simply transfer a file from hdfs to a FOLDER in an s3 bucket?

cc @Jitendra Yadav

Thanks!


Re: Options for copying Hive data to S3

Yes, you can also use s3n instead of s3, as mentioned in the article; just make sure the secret key is defined in the s3n properties.
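Concretely, the command that failed earlier should work once rewritten against the s3n scheme, whose keys are now configured. The bucket and folder names below are the placeholders from earlier in the thread, and the trailing slash targets the folder:

```shell
# Same file as before, but via the s3n filesystem whose keys are set
hdfs dfs -put /user/my_user/my_hdfs_file s3n://my_bucket/my_folder/
```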


Re: Options for copying Hive data to S3

Super Collaborator

I didn't understand the difference between s3 and s3n.

This link helped: http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop

Thanks again.


Re: Options for copying Hive data to S3

In addition to the above, you might also want to install and configure the AWS CLI:

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
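As a sketch (assuming the CLI has been set up with aws configure, and reusing the placeholder paths from earlier in the thread), one alternative is to stage the file locally and push it with the AWS CLI instead of the Hadoop S3 connectors:

```shell
# Hypothetical two-step copy: HDFS -> local disk -> S3 via the AWS CLI
hdfs dfs -get /user/my_user/my_hdfs_file /tmp/my_hdfs_file
aws s3 cp /tmp/my_hdfs_file s3://my_bucket/my_folder/
```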
