Step 1 : Log into AWS with your credentials

Step 2 : From the AWS console, navigate to the following options and create a user for the demo in AWS

Security & Identity --> Identity and Access Management --> Users --> Create New Users
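
If you prefer the command line, the same user can be created with the AWS CLI. This is only a sketch: it assumes the AWS CLI is already installed and configured with credentials that are allowed to manage IAM, and the user name admin simply matches the one used later in this article.

# Create the IAM user used for the demo
aws iam create-user --user-name admin

# Generate an access key / secret key pair for that user (note these down, as in Step 3)
aws iam create-access-key --user-name admin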


Step 3 : Make note of the credentials

awsAccessKeyId = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx';

awsSecretAccessKey = 'yyyyyyyyyyyyyyyyyyyyyyyyyyy';

Step 4 : Add the user to the Admin group: click the “User Actions” button, select the option “Add Users to Group”, and select your user (admin)

Step 5 : Assign the AdministratorAccess policy to the user (admin)
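
Steps 4 and 5 can also be done from the AWS CLI. A sketch, assuming the group is named admin as in the console steps above; the ARN below is the standard AWS-managed AdministratorAccess policy.

# Add the user to the admin group (the group must already exist)
aws iam add-user-to-group --user-name admin --group-name admin

# Attach the AWS-managed AdministratorAccess policy to the user
aws iam attach-user-policy --user-name admin --policy-arn arn:aws:iam::aws:policy/AdministratorAccess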

Step 6 : In the AWS console, go to S3, create a bucket “s3hdptest”, and pick your region


Step 7 : Upload the file manually by using the upload button. In our example we are uploading the file S3HDPTEST.csv
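
Steps 6 and 7 can also be scripted with the AWS CLI. A sketch, assuming the CLI is configured with the credentials from Step 3; the region shown below is only an example, use the one you picked.

# Create the bucket in your chosen region
aws s3 mb s3://s3hdptest --region us-east-1

# Upload the sample file into the bucket
aws s3 cp S3HDPTEST.csv s3://s3hdptest/S3HDPTEST.csv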


Step 8 : In the Hadoop environment, create a user with the same name as the user created in AWS
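
A minimal sketch of what this can look like on the sandbox, run as root; the user name admin follows the rest of this article, and the group hadoop is an assumption, so adjust it to your environment.

# Create the OS user on the node you will run jobs from
useradd admin

# Create an HDFS home directory for the user and hand over ownership
su - hdfs -c "hdfs dfs -mkdir -p /user/admin"
su - hdfs -c "hdfs dfs -chown admin:hadoop /user/admin"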

Step 9 : In Ambari, add the below properties to both hdfs-site.xml and hive-site.xml

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
  <description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
  <description>AWS secret key. Omit for Role-based authentication.</description>
</property>
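
If you would rather not store the keys in the cluster configuration, the same properties can also be passed per command on the command line. A sketch; replace the placeholders with the keys noted in Step 3.

hdfs dfs -Dfs.s3a.access.key=YOUR_ACCESS_KEY -Dfs.s3a.secret.key=YOUR_SECRET_KEY -ls s3a://s3hdptest/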

Step 10 : Restart the Hadoop services such as HDFS, Hive, and any dependent services

Step 11 : Ensure NTP is configured properly so the cluster time matches the AWS timestamp; follow the steps in the link below

http://www.emind.co/how-to/how-to-fix-amazon-s3-requesttimetooskewed
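
On a CentOS 6 based sandbox the time sync boils down to roughly the following; this is a sketch, and package/service names may differ on your OS.

# Install NTP, sync the clock once, then keep it in sync so S3 does not reject requests as too skewed
yum install -y ntp
service ntpd stop
ntpdate pool.ntp.org
service ntpd start
chkconfig ntpd on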

Step 12 : Run the below statement from the command line to test whether we are able to view the file from S3

[root@sandbox ~]# su admin
bash-4.1$ hdfs dfs -ls s3a://s3hdptest/S3HDPTEST.csv
-rw-rw-rw- 1 188 2016-03-29 22:12 s3a://s3hdptest/S3HDPTEST.csv
bash-4.1$

Step 13: To verify the data you can use the below command

bash-4.1$ hdfs dfs -cat s3a://s3hdptest/S3HDPTEST.csv

Step 14 : Copy a file from S3 to HDFS

bash-4.1$ hadoop fs -cp s3a://s3hdptest/S3HDPTEST.csv /user/admin/S3HDPTEST.csv
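
For larger data sets you may prefer DistCp, which copies in parallel as a MapReduce job. A sketch using the same paths as above.

hadoop distcp s3a://s3hdptest/S3HDPTEST.csv /user/admin/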

Step 15 : Copy a file from HDFS to S3

bash-4.1$ hadoop fs -cp /user/admin/S3HDPTEST.csv s3a://s3hdptest/S3HDPTEST_1.csv

Step 15a : Verify whether the file has been stored in the AWS S3 Bucket
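
Besides the AWS console, the bucket contents can be checked from the command line too, assuming the AWS CLI is configured with the Step 3 credentials.

aws s3 ls s3://s3hdptest/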


Step 16 : To access the data using Hive from S3:

Connect to Hive from Ambari using the Hive Views or Hive CLI

A) Create a table for the datafile in S3

hive> CREATE EXTERNAL TABLE mydata
(FirstName STRING, LastName STRING, StreetAddress STRING, City STRING, State STRING, ZipCode INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://s3hdptest/';

B) Select the file data from Hive

hive> SELECT * FROM mydata;
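
The same query can also be run non-interactively from the shell, which is handy for scripting; a sketch using the Hive CLI already used above.

hive -e "SELECT * FROM mydata LIMIT 10;"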

Step 17 : To access the data using Pig from S3:

[root@sandbox ~]# pig -x tez

grunt> a = load 's3a://s3hdptest/S3HDPTEST.csv' using PigStorage();
grunt> dump a;

Step 18 : To store the data using Pig to S3:

grunt> store a into 's3a://s3hdptest/OUTPUT' using PigStorage();

Check the created data file in the AWS S3 bucket
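
The output directory can also be listed directly over s3a from the cluster:

hdfs dfs -ls s3a://s3hdptest/OUTPUT/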


Note: For the article on accessing an AWS S3 bucket using Spark, please refer to the link below:

https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets....

Comments

great article!!

Hi, thanks very much for the step-by-step information on accessing AWS S3 files from Hadoop. I followed the same steps; I think I am connecting to the S3 file, but I still get an error accessing the S3 bucket. We are using HDP 2.2 and here is the message I am getting. I get the same issue when I try from Pig as well as Hive. Need help to resolve this issue.

java.io.IOException: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: S3 Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchBucket</Code><Message>The specified bucket does not exist</Message><BucketName>bucketname</BucketName><RequestId>requestid</RequestId><HostId>hostid</HostId></Error>

Thank you! The above posts are very helpful.

I have a question on S3 access. Please help me on this.

I want to access 2 different S3 buckets with different permissions from HDFS. What is the best way to access the data and copy it to HDFS? Is there any generic approach using IAM roles, or do we have to use only the AWS access keys and override one after the other?

I have a question about accessing multiple AWS S3 buckets of different accounts in Hive.

I have several S3 buckets which belong to different AWS accounts.

Following your info, I can access one of the buckets in Hive. However, I have to write fs.s3a.access.key and fs.s3a.secret.key into hive-site.xml, which means that for one instance of Hive, I can only access one AWS S3 account. Is that right?

And I want to use different buckets of different AWS S3 accounts in one Hive instance; is that possible?

S3N is deprecated in newer versions of Hadoop, so it's better to use s3a. To use s3a, specify s3a:// in front of the path.

The following properties need to be configured first:

<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS-KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET-KEY</value>
</property>

Newbie question: I don't know how to add a user in the Hadoop environment (Step 8 above). Can someone please enlighten me? Thanks

How do we take care of S3 user-level permissions if the user does the following? Can we leverage Ranger HDFS policies to restrict S3 permissions if the user is going through the HDFS client?
hdfs dfs -cat s3a://s3hdptest/S3HDPTEST.csv