Created on 06-13-2016 03:27 PM - edited 09-16-2022 03:24 AM
I want to copy some data from Hive tables on our (bare metal) cluster to a S3.
I know that I can export the data out of HDFS to a CSV file and upload that to S3, but I'm guessing that there are better ways to accomplish this.
Any ideas?
Created 06-13-2016 03:50 PM
@Zack Riesland You can copy it directly with "hdfs dfs -put /tablepath s3://bucket/hivetable". If the Hive table is partitioned, you can run this command for each partition directory concurrently through a small shell script to increase the data ingestion speed.
The same S3 data can then be used again in a Hive external table:
CREATE EXTERNAL TABLE mydata (key STRING, value INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LOCATION 's3n://mysbucket/';
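For illustration, here is a minimal sketch of such a per-partition copy loop. The warehouse path and bucket name are placeholders, and I'm using "hdfs dfs -cp" rather than -put here, since -cp copies directly from HDFS to S3 once the keys are configured:

# copy each partition directory of the table to S3 in parallel, then wait for all copies
for part in $(hdfs dfs -ls /apps/hive/warehouse/mydb.db/mytable | grep '^d' | awk '{print $NF}'); do
  hdfs dfs -cp "$part" s3n://bucket/hivetable/ &
done
wait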
Created 06-13-2016 04:30 PM
Thanks @Jitendra Yadav
Is this baked into HDP, or are there Amazon-related binaries that I need in order for this to work?
Created 06-13-2016 04:46 PM
Yes, it is baked into HDP; you only need to make sure the S3 secret keys are in place.
See this doc: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...
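Roughly, the article has you add the S3 keys as custom core-site properties, something like the following (the key values are placeholders):

fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID
fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY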
Created 06-14-2016 10:08 PM
Hi @Zack Riesland, please let me know if you require further info, or accept this answer to close this thread.
Created 06-14-2016 10:44 PM
I added the appropriate entries to the Hive and HDFS configs in Ambari (as specified here: https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...), and gave this a try:
hdfs dfs -put /user/my_user/my_hdfs_file s3://my_bucket/my_folder
I got the error:
-put: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
I noticed that the instructions mention these settings:
fs.s3n.awsAccessKeyId
fs.s3n.awsSecretAccessKey
but the error message mentions these:
fs.s3.awsAccessKeyId
fs.s3.awsSecretAccessKey
Once I made that change, I was able to make some progress.
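(Concretely, that change amounts to also having the fs.s3.* key properties set, e.g. as custom core-site entries with placeholder values:

fs.s3.awsAccessKeyId=YOUR_ACCESS_KEY_ID
fs.s3.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY)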
However, I think I still need a little help.
In your example, you show s3://bucket/hivetable as the destination.
But our S3 bucket doesn't have tables, just folders. When I try to point at a folder, I get an error:
put: /<folder name> doesn't exist
Do I need to use the other syntax ("create external table... LOCATION 's3n://mysbucket/'") to create a TABLE in S3 and then access it that way?
Is there a similar way to simply transfer a file from hdfs to a FOLDER in an s3 bucket?
Thanks!
Created 06-15-2016 08:46 AM
Yes. You can also use s3n instead of s3, as mentioned in the article, and make sure the secret key is defined in the s3n properties.
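For example, something like this should drop an HDFS file into a folder in the bucket (the bucket and paths are placeholders; -cp copies directly from HDFS to S3):

hdfs dfs -cp /user/my_user/my_hdfs_file s3n://my_bucket/my_folder/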
Created 06-15-2016 09:39 AM
I didn't understand the difference between s3 and s3n.
This link helped: http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop
Thanks again.
Created 06-13-2016 09:48 PM
In addition to the above, you might also want to install and configure the AWS CLI:
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
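For example, after running "aws configure" with your keys, you could stage a file locally and copy it up (the bucket, folder, and local staging path are placeholders):

# pull the file out of HDFS to a local staging directory, then copy it to S3
hdfs dfs -get /user/my_user/my_hdfs_file /tmp/
aws s3 cp /tmp/my_hdfs_file s3://my_bucket/my_folder/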