
Making Hive default to S3

New Contributor

Does anyone know how to make hive default to S3 so each table does not need to be external? Is this possible?

There are articles such as http://blog.sequenceiq.com/blog/2014/11/17/datalake-cloudbreak-2/ which indicate this is possible, but when one does this with HDP 2.3 it appears HiveServer2 fails when trying to access a webhdfs location that includes the s3 scheme.

I set the hive.metastore.warehouse.dir=s3://<bucket>/warehouse and restarted

From the call below I'm willing to bet webhdfs is barfing on the syntax. Any ideas?

2016-02-28 14:09:36,339 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -sS -L -w '"'"'%{http_code}'"'"' -X PUT --negotiate -u : '"'"'http://<server>:50070/webhdfs/v1s3:/<bucket>/warehouse?op=MKDIRS&user.name=hdfs'"'"' 1>/tmp/tmp_QkaO7 2>/tmp/tmpFSumMx''] {'logoutput': None, 'quiet': False}
2016-02-28 14:09:36,360 - call returned (0, '')
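For comparison, a documented WebHDFS MKDIRS call takes a plain HDFS path after /webhdfs/v1, roughly like the sketch below (the /apps/hive/warehouse path is just an example), so the v1s3:/<bucket>/warehouse form that got generated doesn't look like a valid WebHDFS path at all:

curl -i -X PUT "http://<server>:50070/webhdfs/v1/apps/hive/warehouse?op=MKDIRS&user.name=hdfs"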
3 Replies

Master Mentor
@Matt Davies

Your question: Does anyone know how to make hive default to S3 so each table does not need to be external? Is this possible?

A locally managed (non-external) table can point to S3; see the details below.

Using S3 as the default FS

In theory, HDP can be set up to use S3 as the default filesystem (instead of HDFS).

Detailed instructions on how to replace HDFS with S3 are given here: http://wiki.apache.org/hadoop/AmazonS3

At a high level, we have to set the "fs.defaultFS" property in core-site.xml to point to S3.

The default setting for this property looks like this:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoopNamenode:8020</value>
</property>

Change it to the setting below:

<property>
  <name>fs.defaultFS</name>
  <value>s3://BUCKET</value>
</property>

In addition to setting the default filesystem to S3, we also have to provide the AWS access key ID and the AWS secret access key. Both settings are shown below:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
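Once those properties are in place, a quick sanity check (just a sketch; BUCKET is a placeholder for your bucket) is to list the bucket with the Hadoop CLI before involving Hive:

hadoop fs -ls s3://BUCKET/

If the credentials or the fs.defaultFS value are wrong, this command fails right away, which is easier to debug than a Hive error.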

Hive Tables in S3

A Hive table that uses S3 as storage can be created as shown below:

CREATE TABLE SRC_TABLE (
  COL1 string,
  COL2 string,
  COL3 string
)
ROW FORMAT DELIMITED
STORED AS TEXTFILE
LOCATION 's3://BUCKET_NAME/user/root/src_table';

The only difference here is that we specify the location of the table to be a sub-folder under "s3://BUCKET_NAME".

Data can be loaded into this table using the Hive command:

hive> LOAD DATA LOCAL INPATH 'local_table.csv' INTO TABLE SRC_TABLE;

The path "s3://BUCKET_NAME/user/root/src_table" can be treated like any other path in HDFS and can be used with Hive/Pig/MapReduce, etc.
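For example (a rough sketch; SRC_TABLE and COL1 are just the names used above), the table can then be queried like any other Hive table, with the data read straight out of S3:

hive> SELECT COL1, COUNT(*) FROM SRC_TABLE GROUP BY COL1;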

Master Mentor

Hi @Matt Davies, see this blog: http://blog.sequenceiq.com/blog/2014/11/17/datalake-cloudbreak-2/

You can set "hive.metastore.warehouse.dir": "s3://siq-hadoop/apps/hive/warehouse",
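In hive-site.xml form that would look roughly like this (the bucket path is the example from that blog; substitute your own):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3://siq-hadoop/apps/hive/warehouse</value>
</property>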

Can you share more details from HS2 logs?