Created 04-07-2016 09:38 AM
Hey guys,
On CDH 5.6.0 (I have been happily and successfully using the free "starving developers" version of CDH since 2012), this is how you can create and use tables whose data location points to S3 on AWS. I am sure there are better and more elegant ways to do this (and if so, please educate me), but this is one way that works successfully, so here goes...
[1] In the HDFS Configuration in Cloudera Manager:
=====================================
SECTION = "HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml"
Add the following properties (note: on Hadoop 2.7 and later, the s3a credential properties were renamed to fs.s3a.access.key and fs.s3a.secret.key):
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
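Since the same pair of credential properties is repeated for the s3, s3a, and s3n filesystem schemes, the snippet can be generated rather than typed by hand. This is just a convenience sketch (the credential strings are placeholders, same as above):

```python
# Generate the six <property> entries for the hdfs-site.xml safety valve.
# The credential values are placeholders; substitute your real AWS keys.
from xml.sax.saxutils import escape

def s3_credential_properties(access_key, secret_key):
    lines = []
    for scheme in ("s3", "s3a", "s3n"):
        for name, value in ((f"fs.{scheme}.awsAccessKeyId", access_key),
                            (f"fs.{scheme}.awsSecretAccessKey", secret_key)):
            lines.append("<property>")
            lines.append(f"  <name>{name}</name>")
            lines.append(f"  <value>{escape(value)}</value>")
            lines.append("</property>")
    return "\n".join(lines)

print(s3_credential_properties("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_ACCESSKEY"))
```

Paste the printed output into the safety-valve field shown above.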
[2] Create Table in Hive
==================
set hive.execution.engine=mr;

use openpv;

CREATE EXTERNAL TABLE IF NOT EXISTS solar_installs (
  zipcode STRING,
  state STRING,
  sizekw DOUBLE,
  cost DOUBLE,
  date_ STRING,
  lat DOUBLE,
  lon DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
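Before uploading the CSV to S3, it can help to sanity-check that rows split into the seven comma-delimited columns the table declares. A minimal sketch using Python's csv module, with a made-up sample row purely for illustration (the real OpenPV export may include a header line and extra columns, so adjust accordingly):

```python
import csv
import io

# Columns declared in the Hive table, in order.
COLUMNS = ["zipcode", "state", "sizekw", "cost", "date_", "lat", "lon"]

# Made-up sample row for illustration only; not real OpenPV data.
sample = "87109,NM,5.2,21000.0,2015-06-01,35.15,-106.57\n"

reader = csv.reader(io.StringIO(sample))
row = next(reader)
assert len(row) == len(COLUMNS), "column count mismatch"
record = dict(zip(COLUMNS, row))
print(record["zipcode"], record["sizekw"])
```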
[3] Set the data location (note that I am pointing to the AWS S3 bucket and subfolder, not to a specific file)
==================================================================================
NOTE: If you want the data, go to https://openpv.nrel.gov/search, hit Search with no criteria defined, and then download the CSV
set hive.execution.engine=mr;

use openpv;

ALTER TABLE solar_installs
SET LOCATION 's3a://some-aws-bucket-name/openpv';
[4] Run a query in Hive using MR as execution engine
=========================================
set hive.execution.engine=mr;

use openpv;

select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc;
[5] Run a query in Hive using Spark as execution engine
===========================================
set hive.execution.engine=spark;

use openpv;

select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc;
[6] Run a query in Impala
===================
impala-shell -q "invalidate metadata"
impala-shell -q "use openpv;select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc"
[7] Results Comparison
==================
These were run on a 3-node cluster running under my cube: 1 NN + 1 DN on node 1, DN2 on node 2, and DN3 on node 3. Each node is an 8-core HP 8300 with 32 GB RAM.
Impala = 3.81 seconds
Hive-on-Spark = 27.582 seconds
Hive-on-MR = 46.774 seconds
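For a rough sense of scale, here is the same comparison expressed as speedup ratios (simple arithmetic on the timings above; these are single runs on a tiny cluster, so take the ratios loosely):

```python
# Reported wall-clock times in seconds, from the runs above.
times = {"Impala": 3.81, "Hive-on-Spark": 27.582, "Hive-on-MR": 46.774}

# Speedup of each engine relative to the slowest (Hive-on-MR).
baseline = times["Hive-on-MR"]
for engine, t in times.items():
    print(f"{engine}: {baseline / t:.1f}x relative to Hive-on-MR")
# Impala comes out roughly 12.3x faster than Hive-on-MR,
# and Hive-on-Spark roughly 1.7x faster.
```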