Created 04-07-2016 09:38 AM
Hey guys,
On CDH 5.6.0 (I have been happily and successfully using the free "starving developers" version of CDH since 2012), this is how you can create and use tables whose data location points to S3 on AWS. I am sure there are better and more elegant ways to do this (and if so, please educate me), but this is one way that works successfully, so here goes...
[1] In the HDFS Configuration in Cloudera Manager:
=====================================
SECTION = "HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml"
Add the following properties (note: on Hadoop 2.7 and later, the s3a credential properties were renamed to fs.s3a.access.key and fs.s3a.secret.key):
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESSKEY</value>
</property>
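Since the same pair of credential properties is repeated for the s3, s3a, and s3n filesystem schemes, the snippet can be generated rather than typed by hand. This is just a convenience sketch (the credential strings are placeholders, same as above):

```python
# Generate the six <property> entries for the hdfs-site.xml safety valve.
# The credential values are placeholders; substitute your real AWS keys.
from xml.sax.saxutils import escape

def s3_credential_properties(access_key, secret_key):
    lines = []
    for scheme in ("s3", "s3a", "s3n"):
        for name, value in ((f"fs.{scheme}.awsAccessKeyId", access_key),
                            (f"fs.{scheme}.awsSecretAccessKey", secret_key)):
            lines.append("<property>")
            lines.append(f"  <name>{name}</name>")
            lines.append(f"  <value>{escape(value)}</value>")
            lines.append("</property>")
    return "\n".join(lines)

print(s3_credential_properties("YOUR_AWS_ACCESS_KEY", "YOUR_AWS_SECRET_ACCESSKEY"))
```

Paste the printed output into the safety-valve field shown above.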
[2] Create Table in Hive
==================
set hive.execution.engine=mr;

use openpv;

CREATE EXTERNAL TABLE IF NOT EXISTS solar_installs (
  zipcode STRING,
  state STRING,
  sizekw DOUBLE,
  cost DOUBLE,
  date_ STRING,
  lat DOUBLE,
  lon DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
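Before uploading the CSV to S3, it can help to sanity-check that rows split into the seven comma-delimited columns the table declares. A minimal sketch using Python's csv module, with a made-up sample row purely for illustration (the real OpenPV export may include a header line and extra columns, so adjust accordingly):

```python
import csv
import io

# Columns declared in the Hive table, in order.
COLUMNS = ["zipcode", "state", "sizekw", "cost", "date_", "lat", "lon"]

# Made-up sample row for illustration only; not real OpenPV data.
sample = "87109,NM,5.2,21000.0,2015-06-01,35.15,-106.57\n"

reader = csv.reader(io.StringIO(sample))
row = next(reader)
assert len(row) == len(COLUMNS), "column count mismatch"
record = dict(zip(COLUMNS, row))
print(record["zipcode"], record["sizekw"])
```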
[3] Set the data location (note that I am pointing to the AWS S3 bucket and subfolder, not to a specific file)
==================================================================================
NOTE: If you want the data, go to https://openpv.nrel.gov/search, hit Search with no criteria defined, and then download the CSV
set hive.execution.engine=mr;

use openpv;

ALTER TABLE solar_installs
SET LOCATION 's3a://some-aws-bucket-name/openpv';
[4] Run a query in Hive using MR as execution engine
=========================================
set hive.execution.engine=mr;

use openpv;

select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc;
[5] Run a query in Hive using Spark as execution engine
===========================================
set hive.execution.engine=spark;

use openpv;

select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc;
[6] Run a query in Impala
===================
impala-shell -q "invalidate metadata"
impala-shell -q "use openpv;select zipcode, count(*) from solar_installs group by zipcode order by zipcode asc"
[7] Results Comparison
==================
These were run on a 3-node cluster running under my cube: 1 NN + 1 DN on node 1, DN2 on node 2, and DN3 on node 3. Each node is an 8-core HP 8300 with 32 GB RAM.
Impala = 3.81 seconds
Hive-on-Spark = 27.582 seconds
Hive-on-MR = 46.774 seconds
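For a rough sense of scale, here is the same comparison expressed as speedup ratios (simple arithmetic on the timings above; these are single runs on a tiny cluster, so take the ratios loosely):

```python
# Reported wall-clock times in seconds, from the runs above.
times = {"Impala": 3.81, "Hive-on-Spark": 27.582, "Hive-on-MR": 46.774}

# Speedup of each engine relative to the slowest (Hive-on-MR).
baseline = times["Hive-on-MR"]
for engine, t in times.items():
    print(f"{engine}: {baseline / t:.1f}x relative to Hive-on-MR")
# Impala comes out roughly 12.3x faster than Hive-on-MR,
# and Hive-on-Spark roughly 1.7x faster.
```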