Apache PIG - Ranking with group

Hi experts,

I want to rank my dataset, and I also need to group it, either before or after the ranking. My dataset is:

EMPLOYEE  STOCK  FURNISHER  DATE        VALUE
A         2      AA         27-01-2016  3
A         1      AB         28-01-2016  3
B         4      AA         27-01-2016  5
C         5      AC         27-01-2016  1
C         2      AC         27-01-2016  4

Now I want to rank my data by Employee and Date, and then group the rows to obtain the sum of Value for each rank. I know I could get the sum without ranking, but generating the rank by Employee and Date is a requirement. Basically, I want to produce the following output:

ID  EMPLOYEE  STOCK  FURNISHER  DATE        VALUE
1   A         2      AA         27-01-2016  3
2   A         1      AB         28-01-2016  3
3   B         4      AA         27-01-2016  5
4   C         5      AC         27-01-2016  5
4   C         2      AC         27-01-2016  5

To obtain this with Apache Pig, I'm using this script:

RAW = LOAD 'FILE_PATH' USING PigStorage(';') AS
  (Employee:chararray, Stock:int, Furnisher:chararray, Date:chararray, Value:double);
RANKING = RANK RAW BY Employee, Date;
GRP = GROUP RANKING BY Furnisher;
DATA = FOREACH GRP GENERATE FLATTEN(RANKING);
STORE DATA INTO 'DESTINATION_PATH' USING PigStorage(',');

But it does not return the desired output 😞

Does anyone know how I can do this?

Many thanks!

1 ACCEPTED SOLUTION

This produces the results you want:

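-- Load the semicolon-delimited input with an explicit schema.
RAW = LOAD 'filepath' USING PigStorage(';') AS
  (Employee:chararray, Stock:int, Furnisher:chararray, Date:chararray, Value:double);
-- DENSE gives tied (Employee, Date) rows the same rank and leaves no gaps.
RANKING = RANK RAW BY Employee, Date DENSE;
-- $0 is the rank column that RANK prepends to each row.
GRP = GROUP RANKING BY $0;
-- Sum Value once per rank.
SUMMED = FOREACH GRP {
     summed = SUM(RANKING.Value);
     GENERATE $0, summed AS Ranksum;
}
-- Join the per-rank sum back onto every ranked row.
JOINED = JOIN RANKING BY $0, SUMMED BY $0;
FINAL = FOREACH JOINED GENERATE $0, Employee, Stock, Furnisher, Date, Ranksum;
STORE FINAL INTO 'destinationpath' USING PigStorage(',');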
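
If you want to sanity-check the expected output outside Pig, here is a minimal Python sketch of the same logic (dense rank on Employee and Date, then a per-rank sum of Value) run on the sample rows from the question. The function name dense_rank_with_sum and the tuple layout are my own, chosen only for illustration; this is not how Pig implements RANK, just a way to see what the result should look like.

```python
from collections import defaultdict

# Sample rows from the question: (Employee, Stock, Furnisher, Date, Value)
rows = [
    ("A", 2, "AA", "27-01-2016", 3.0),
    ("A", 1, "AB", "28-01-2016", 3.0),
    ("B", 4, "AA", "27-01-2016", 5.0),
    ("C", 5, "AC", "27-01-2016", 1.0),
    ("C", 2, "AC", "27-01-2016", 4.0),
]


def dense_rank_with_sum(rows):
    """Dense-rank rows by (Employee, Date) and attach the per-rank sum of Value."""
    # Distinct keys in sorted order. Dates are compared as plain strings here
    # (an assumption that mirrors sorting a text column, not a chronological sort).
    keys = sorted({(r[0], r[3]) for r in rows})
    rank_of = {key: i + 1 for i, key in enumerate(keys)}

    # Accumulate Value per rank.
    totals = defaultdict(float)
    for r in rows:
        totals[rank_of[(r[0], r[3])]] += r[4]

    # Emit (rank, Employee, Stock, Furnisher, Date, rank sum) for every input row.
    return [
        (rank_of[(e, d)], e, s, f, d, totals[rank_of[(e, d)]])
        for (e, s, f, d, _value) in rows
    ]


result = dense_rank_with_sum(rows)
for row in result:
    print(row)
```

The two C rows share rank 4 and both carry the summed value 5.0, which matches the output asked for in the question. Note that a dd-MM-yyyy date held as text sorts as a string, so across months or years the order may not be chronological.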

Let me know if this is what you are looking for by accepting the answer. If I did not get the requirements right, please clarify.
