Support Questions

Find answers, ask questions, and share your expertise

Converting Lab3 PIG script to HIVE

avatar
New Contributor

In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:

a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'normal';
c = foreach b generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();


I tried doing the same using HIVE. Below is HIVE script:

CREATE TABLE riskfactor_hive STORED AS ORC AS 
select 
	t2.driverid as driverid, 
	t2.events as events, 
	dm.totmiles as totmiles, 
	dm.totmiles/t2.events as riskfactor
from
(
  select t1.driverid, count(t1.occurance) as events
	from 
	(
	  select driverid, event, 1 as occurance from geolocation
	  where event != 'normal'
	) t1
  group by t1.driverid
) t2
join drivermileage dm
on t2.driverid = dm.driverid
;

Can we further optimise the HIVE script to make it run even faster?

Much Thanks!

1 ACCEPTED SOLUTION

avatar
Expert Contributor

@Vaibhav YadavI think you can try collapsing the inner query as follows:

 select driverid, count(event)  from geolocation where event!='normal' group by driverid; 

View solution in original post

4 REPLIES 4

avatar

Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.

I'll leave the thread open for others to comment further, but this will definitely get you started!

avatar
New Contributor

Thanks @Ana Gillan.. I'll try that..

avatar
Expert Contributor

@Vaibhav YadavI think you can try collapsing the inner query as follows:

 select driverid, count(event)  from geolocation where event!='normal' group by driverid; 

avatar
New Contributor

Thanks @grajagopal. That just worked