Support Questions

vebs0205 · ‎05-11-2016

In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:

a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'normal';
c = foreach b generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();

I tried doing the same using HIVE. Below is HIVE script:

CREATE TABLE riskfactor_hive STORED AS ORC AS 
select 
	t2.driverid as driverid, 
	t2.events as events, 
	dm.totmiles as totmiles, 
	dm.totmiles/t2.events as riskfactor
from
(
  select t1.driverid, count(t1.occurance) as events
	from 
	(
	  select driverid, event, 1 as occurance from geolocation
	  where event != 'normal'
	) t1
  group by t1.driverid
) t2
join drivermileage dm
on t2.driverid = dm.driverid
;

Can we further optimise the HIVE script to make it run even faster?

Much Thanks!

grajagopal · ‎05-11-2016

@Vaibhav YadavI think you can try collapsing the inner query as follows:

 select driverid, count(event)  from geolocation where event!='normal' group by driverid;

View solution in original post

agillan · ‎05-11-2016

Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.

I'll leave the thread open for others to comment further, but this will definitely get you started!

vebs0205 · ‎05-12-2016

Thanks @Ana Gillan.. I'll try that..

grajagopal · ‎05-11-2016

@Vaibhav YadavI think you can try collapsing the inner query as follows:

 select driverid, count(event)  from geolocation where event!='normal' group by driverid;

vebs0205 · ‎05-12-2016

Thanks @grajagopal. That just worked

Cloudera Community

Support Questions

Converting Lab3 PIG script to HIVE