Support Questions

Find answers, ask questions, and share your expertise

Pig latin / Order after group by

avatar
Contributor

I have trouble keeping rows in order. I have data like this:

cata,productx, sales,total_sales_of_category

food bread 112USD,1890USD

food breadX 98USD, 1890USD

Oil MotorOil 786USD,7899USD

Oil MotorOilY 678USD,11331USD

Schema is: chararray,chararray,long,long ..

Sorry for the lame example, but there are four colums. I can group them by category (cata) and order them by sales inside bag, BUT if I would to also order them by total_sales_of_category how am I supposed to do it .. ? Ordering inside bag works fine:

grp = group ordered by $0;
top20 = foreach grp { 
sorted = order ordered by $2 desc;
top = limit sorted 20;
generate group,FLATTEN(top);
};
 

But after this Total_sales_of_category is not in order (Of course its not ..) but if I would like to get that also in order (total_sales_of_category) How can it be done ? Simple using x = order top 20 by $4 desc, will order rows but i will loose order of sales ..

Any advice would be great ..

1 ACCEPTED SOLUTION

avatar

What I think you are looking for is a list ordered by the first N categories with the highest total sales and for each of those, you want to have the first N items sub-sorted by their totals.

For that, I added the following testing data for myself that features 6 categories, each with 3 products within. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.

[root@sandbox hcc]# cat raw_sales.txt 
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140

That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ.

Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N ones of them first. After that, it was basically what you had already done.

[root@sandbox hcc]# cat salesHCC.pig
rawSales = LOAD 'raw_sales.txt' USING PigStorage(',')
  AS (category: chararray, product: chararray,
      sales: long, total_sales_category: long);

-- group them by the total sales / category combos
grpByCatTotals = GROUP rawSales BY
  (total_sales_category, category);

-- put these groups in order from highest to lowest
sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;

-- just keep the top N
topSalesCats = LIMIT sortGrpByCatTotals 2;

-- do your original logic to get the top sales within the categories
topProdsByTopCats = FOREACH topSalesCats {
  sorted = ORDER rawSales BY sales DESC;
  top = LIMIT sorted 2;
  GENERATE group, FLATTEN(top);
}
DUMP topProdsByTopCats;

The output is as initially expected.

[root@sandbox hcc]# pig -x tez salesHCC.pig 
((140,CatC),CatC,Prod09-cC,80,140)
((140,CatC),CatC,Prod73-cC,50,140)
((60,CatZ),CatZ,Prod22-cZ,30,60)
((60,CatZ),CatZ,Prod83-cZ,20,60)

I hope this was what you were looking for. Either way, good luck!

View solution in original post

2 REPLIES 2

avatar

What I think you are looking for is a list ordered by the first N categories with the highest total sales and for each of those, you want to have the first N items sub-sorted by their totals.

For that, I added the following testing data for myself that features 6 categories, each with 3 products within. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.

[root@sandbox hcc]# cat raw_sales.txt 
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140

That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ.

Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N ones of them first. After that, it was basically what you had already done.

[root@sandbox hcc]# cat salesHCC.pig
rawSales = LOAD 'raw_sales.txt' USING PigStorage(',')
  AS (category: chararray, product: chararray,
      sales: long, total_sales_category: long);

-- group them by the total sales / category combos
grpByCatTotals = GROUP rawSales BY
  (total_sales_category, category);

-- put these groups in order from highest to lowest
sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;

-- just keep the top N
topSalesCats = LIMIT sortGrpByCatTotals 2;

-- do your original logic to get the top sales within the categories
topProdsByTopCats = FOREACH topSalesCats {
  sorted = ORDER rawSales BY sales DESC;
  top = LIMIT sorted 2;
  GENERATE group, FLATTEN(top);
}
DUMP topProdsByTopCats;

The output is as initially expected.

[root@sandbox hcc]# pig -x tez salesHCC.pig 
((140,CatC),CatC,Prod09-cC,80,140)
((140,CatC),CatC,Prod73-cC,50,140)
((60,CatZ),CatZ,Prod22-cZ,30,60)
((60,CatZ),CatZ,Prod83-cZ,20,60)

I hope this was what you were looking for. Either way, good luck!

avatar
Contributor

Thank you! This was exactly what I Wanted. What was new to me was this

  1. sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;

So Pig can order groups in that way.