Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Pig latin / Order after group by

avatar
New Member

I have trouble keeping rows in order. I have data like this:

cata,productx, sales,total_sales_of_category

food bread 112USD,1890USD

food breadX 98USD, 1890USD

Oil MotorOil 786USD,7899USD

Oil MotorOilY 678USD,11331USD

Schema is: chararray,chararray,long,long ..

Sorry for the lame example, but there are four colums. I can group them by category (cata) and order them by sales inside bag, BUT if I would to also order them by total_sales_of_category how am I supposed to do it .. ? Ordering inside bag works fine:

grp = group ordered by $0;
top20 = foreach grp { 
sorted = order ordered by $2 desc;
top = limit sorted 20;
generate group,FLATTEN(top);
};
 

But after this Total_sales_of_category is not in order (Of course its not ..) but if I would like to get that also in order (total_sales_of_category) How can it be done ? Simple using x = order top 20 by $4 desc, will order rows but i will loose order of sales ..

Any advice would be great ..

1 ACCEPTED SOLUTION

avatar

What I think you are looking for is a list ordered by the first N categories with the highest total sales and for each of those, you want to have the first N items sub-sorted by their totals.

For that, I added the following testing data for myself that features 6 categories, each with 3 products within. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.

[root@sandbox hcc]# cat raw_sales.txt 
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140

That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ.

Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N ones of them first. After that, it was basically what you had already done.

[root@sandbox hcc]# cat salesHCC.pig
rawSales = LOAD 'raw_sales.txt' USING PigStorage(',')
  AS (category: chararray, product: chararray,
      sales: long, total_sales_category: long);

-- group them by the total sales / category combos
grpByCatTotals = GROUP rawSales BY
  (total_sales_category, category);

-- put these groups in order from highest to lowest
sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;

-- just keep the top N
topSalesCats = LIMIT sortGrpByCatTotals 2;

-- do your original logic to get the top sales within the categories
topProdsByTopCats = FOREACH topSalesCats {
  sorted = ORDER rawSales BY sales DESC;
  top = LIMIT sorted 2;
  GENERATE group, FLATTEN(top);
}
DUMP topProdsByTopCats;

The output is as initially expected.

[root@sandbox hcc]# pig -x tez salesHCC.pig 
((140,CatC),CatC,Prod09-cC,80,140)
((140,CatC),CatC,Prod73-cC,50,140)
((60,CatZ),CatZ,Prod22-cZ,30,60)
((60,CatZ),CatZ,Prod83-cZ,20,60)

I hope this was what you were looking for. Either way, good luck!

View solution in original post

2 REPLIES 2

avatar

What I think you are looking for is a list ordered by the first N categories with the highest total sales and for each of those, you want to have the first N items sub-sorted by their totals.

For that, I added the following testing data for myself that features 6 categories, each with 3 products within. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.

[root@sandbox hcc]# cat raw_sales.txt 
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140

That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ.

Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N ones of them first. After that, it was basically what you had already done.

[root@sandbox hcc]# cat salesHCC.pig
rawSales = LOAD 'raw_sales.txt' USING PigStorage(',')
  AS (category: chararray, product: chararray,
      sales: long, total_sales_category: long);

-- group them by the total sales / category combos
grpByCatTotals = GROUP rawSales BY
  (total_sales_category, category);

-- put these groups in order from highest to lowest
sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;

-- just keep the top N
topSalesCats = LIMIT sortGrpByCatTotals 2;

-- do your original logic to get the top sales within the categories
topProdsByTopCats = FOREACH topSalesCats {
  sorted = ORDER rawSales BY sales DESC;
  top = LIMIT sorted 2;
  GENERATE group, FLATTEN(top);
}
DUMP topProdsByTopCats;

The output is as initially expected.

[root@sandbox hcc]# pig -x tez salesHCC.pig 
((140,CatC),CatC,Prod09-cC,80,140)
((140,CatC),CatC,Prod73-cC,50,140)
((60,CatZ),CatZ,Prod22-cZ,30,60)
((60,CatZ),CatZ,Prod83-cZ,20,60)

I hope this was what you were looking for. Either way, good luck!

avatar
New Member

Thank you! This was exactly what I Wanted. What was new to me was this

  1. sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;

So Pig can order groups in that way.