Member since
05-11-2016
3
Posts
1
Kudos Received
0
Solutions
05-11-2016
04:57 AM
1 Kudo
In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is: a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader();
b = filter a by event != 'normal';
c = foreach b generate driverid, event, (int) '1' as occurance;
d = group c by driverid;
e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;
g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader();
h = join e by driverid, g by driverid;
final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor;
store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
I tried doing the same using HIVE. Below is HIVE script: CREATE TABLE riskfactor_hive STORED AS ORC AS
select
t2.driverid as driverid,
t2.events as events,
dm.totmiles as totmiles,
dm.totmiles/t2.events as riskfactor
from
(
select t1.driverid, count(t1.occurance) as events
from
(
select driverid, event, 1 as occurance from geolocation
where event != 'normal'
) t1
group by t1.driverid
) t2
join drivermileage dm
on t2.driverid = dm.driverid
;
Can we further optimise the HIVE script to make it run even faster?
Much Thanks!
... View more
Labels:
- Labels:
-
Apache Hive
-
Apache Pig