Apache PIG - Ranking with group

Hi experts,

I want to rank my dataset, but I also need to group the data, either before or after ranking. My dataset is:

EMPLOYEE  STOCK  FURNISHER  DATE        VALUE
A         2      AA         27-01-2016  3
A         1      AB         28-01-2016  3
B         4      AA         27-01-2016  5
C         5      AC         27-01-2016  1
C         2      AC         27-01-2016  4

Now I want to rank the data by Employee and Date, then group it to obtain the sum of Value for each rank. I know I can compute the sum without ranking, but generating the rank by Employee and Date is a requirement. Basically, I want to produce the following output:

ID  EMPLOYEE  STOCK  FURNISHER  DATE        VALUE
1   A         2      AA         27-01-2016  3
2   A         1      AB         28-01-2016  3
3   B         4      AA         27-01-2016  5
4   C         5      AC         27-01-2016  5
4   C         2      AC         27-01-2016  5

To obtain this using Apache PIG I'm using this script:

INPUT = LOAD 'FILE_PATH' USING PigStorage(';') as 
  (Employee:Chararray, STOCK:Int, FURNICHER:Chararray, Date:Chararray, Value:Double);
RANKING = rank DATA BY Employee,DATE;
GRP = GROUP RANKING BY FURNISHER;
DATA = FOREACH GRP_by_DATA GENERATE FLATTEN(RANKING);
STORE DATA INTO 'DESTINATION_PATH' USING PigStorage(','); 

But it does not return the desired output 😞

Does anyone know how I can do this?

Many thanks!

1 ACCEPTED SOLUTION

This produces the results you want:

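-- Load the semicolon-delimited input with an explicit schema.
RAW = LOAD 'filepath' USING PigStorage(';') as 
  (Employee:Chararray, Stock:Int, Furnisher:Chararray, Date:Chararray, Value:Double);
-- Dense rank by Employee and Date; the rank is prepended as field $0.
-- Date is a chararray here, so it is ordered as text, not as a calendar date.
RANKING = rank RAW BY Employee, Date DENSE;
-- Group the rows that share a rank and sum Value within each group.
GRP = GROUP RANKING BY $0;
SUMMED = foreach GRP {
     summed = SUM(RANKING.Value);
     generate $0, summed as Ranksum;
}
-- Join the per-rank sums back onto the ranked rows.
JOINED = join RANKING by $0, SUMMED by $0;
FINAL = foreach JOINED generate $0, Employee, Stock, Furnisher, Date, Ranksum;
STORE FINAL INTO 'destinationpath' USING PigStorage(',');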

Let me know if this is what you are looking for by accepting the answer. If I did not get the requirements right, please clarify.
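
A possible variation on the script above, offered only as a sketch: because each group in GRP already holds the ranked rows in a bag, the sum can be emitted alongside the flattened rows in a single FOREACH, which avoids the JOIN. It assumes the same input file and schema as above, and that Pig names the generated rank field rank_RAW (rank_ followed by the ranked relation's alias). The alias SUMMED_FLAT and the output path 'destinationpath_nojoin' are hypothetical names chosen for illustration.

```pig
-- Same load and dense rank as in the script above.
RAW = LOAD 'filepath' USING PigStorage(';') AS
  (Employee:chararray, Stock:int, Furnisher:chararray, Date:chararray, Value:double);
RANKING = RANK RAW BY Employee, Date DENSE;
GRP = GROUP RANKING BY rank_RAW;

-- Flatten the ranked rows of each group and attach the group's sum of Value
-- to every row; the original Value column is dropped, as in the desired output.
SUMMED_FLAT = FOREACH GRP GENERATE
    FLATTEN(RANKING.(rank_RAW, Employee, Stock, Furnisher, Date)),
    SUM(RANKING.Value) AS Ranksum;

-- 'destinationpath_nojoin' is a hypothetical output path.
STORE SUMMED_FLAT INTO 'destinationpath_nojoin' USING PigStorage(',');
```

On the sample data this should give the two rows for employee C the same rank and a Ranksum of 5, matching the requested output. Whether it is preferable to the join version depends on the data volume and on how large each rank group can get, since the whole bag for a rank is processed within one group.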
