Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

Converting Lab3 PIG script to HIVE

New Contributor

In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:

a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'normal';
c = foreach b generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();


I tried doing the same using HIVE. Below is HIVE script:

CREATE TABLE riskfactor_hive STORED AS ORC AS 
select 
	t2.driverid as driverid, 
	t2.events as events, 
	dm.totmiles as totmiles, 
	dm.totmiles/t2.events as riskfactor
from
(
  select t1.driverid, count(t1.occurance) as events
	from 
	(
	  select driverid, event, 1 as occurance from geolocation
	  where event != 'normal'
	) t1
  group by t1.driverid
) t2
join drivermileage dm
on t2.driverid = dm.driverid
;

Can we further optimise the HIVE script to make it run even faster?

Much Thanks!

1 ACCEPTED SOLUTION

Rising Star

@Vaibhav YadavI think you can try collapsing the inner query as follows:

 select driverid, count(event)  from geolocation where event!='normal' group by driverid; 

View solution in original post

4 REPLIES 4

Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.

I'll leave the thread open for others to comment further, but this will definitely get you started!

New Contributor

Thanks @Ana Gillan.. I'll try that..

Rising Star

@Vaibhav YadavI think you can try collapsing the inner query as follows:

 select driverid, count(event)  from geolocation where event!='normal' group by driverid; 

New Contributor

Thanks @grajagopal. That just worked

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.