Created 06-16-2016 02:33 PM
I have trouble keeping rows in order. I have data like this:
cata,productx, sales,total_sales_of_category
food bread 112USD,1890USD
food breadX 98USD, 1890USD
Oil MotorOil 786USD,7899USD
Oil MotorOilY 678USD,11331USD
Schema is: chararray,chararray,long,long ..
Sorry for the lame example, but there are four colums. I can group them by category (cata) and order them by sales inside bag, BUT if I would to also order them by total_sales_of_category how am I supposed to do it .. ? Ordering inside bag works fine:
grp = group ordered by $0; top20 = foreach grp { sorted = order ordered by $2 desc; top = limit sorted 20; generate group,FLATTEN(top); };
But after this Total_sales_of_category is not in order (Of course its not ..) but if I would like to get that also in order (total_sales_of_category) How can it be done ? Simple using x = order top 20 by $4 desc, will order rows but i will loose order of sales ..
Any advice would be great ..
Created 06-17-2016 03:05 PM
What I think you are looking for is a list ordered by the first N categories with the highest total sales and for each of those, you want to have the first N items sub-sorted by their totals.
For that, I added the following testing data for myself that features 6 categories, each with 3 products within. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.
[root@sandbox hcc]# cat raw_sales.txt CatZ,Prod22-cZ,30,60 CatA,Prod88-cA,15,50 CatY,Prod07-cY,20,40 CatB,Prod18-cB,10,50 CatX,Prod29-cZ,40,60 CatC,Prod09-cC,80,140 CatZ,Prod83-cZ,20,60 CatA,Prod17-cA,25,50 CatY,Prod98-cY,10,40 CatB,Prod99-cB,30,50 CatX,Prod19-cZ,10,60 CatC,Prod73-cC,50,140 CatZ,Prod52-cZ,10,60 CatA,Prod58-cA,15,50 CatY,Prod57-cY,10,40 CatB,Prod58-cB,10,50 CatX,Prod59-cZ,10,60 CatC,Prod59-cC,10,140
That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ.
Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N ones of them first. After that, it was basically what you had already done.
[root@sandbox hcc]# cat salesHCC.pig rawSales = LOAD 'raw_sales.txt' USING PigStorage(',') AS (category: chararray, product: chararray, sales: long, total_sales_category: long); -- group them by the total sales / category combos grpByCatTotals = GROUP rawSales BY (total_sales_category, category); -- put these groups in order from highest to lowest sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC; -- just keep the top N topSalesCats = LIMIT sortGrpByCatTotals 2; -- do your original logic to get the top sales within the categories topProdsByTopCats = FOREACH topSalesCats { sorted = ORDER rawSales BY sales DESC; top = LIMIT sorted 2; GENERATE group, FLATTEN(top); } DUMP topProdsByTopCats;
The output is as initially expected.
[root@sandbox hcc]# pig -x tez salesHCC.pig ((140,CatC),CatC,Prod09-cC,80,140) ((140,CatC),CatC,Prod73-cC,50,140) ((60,CatZ),CatZ,Prod22-cZ,30,60) ((60,CatZ),CatZ,Prod83-cZ,20,60)
I hope this was what you were looking for. Either way, good luck!
Created 06-17-2016 03:05 PM
What I think you are looking for is a list ordered by the first N categories with the highest total sales and for each of those, you want to have the first N items sub-sorted by their totals.
For that, I added the following testing data for myself that features 6 categories, each with 3 products within. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.
[root@sandbox hcc]# cat raw_sales.txt CatZ,Prod22-cZ,30,60 CatA,Prod88-cA,15,50 CatY,Prod07-cY,20,40 CatB,Prod18-cB,10,50 CatX,Prod29-cZ,40,60 CatC,Prod09-cC,80,140 CatZ,Prod83-cZ,20,60 CatA,Prod17-cA,25,50 CatY,Prod98-cY,10,40 CatB,Prod99-cB,30,50 CatX,Prod19-cZ,10,60 CatC,Prod73-cC,50,140 CatZ,Prod52-cZ,10,60 CatA,Prod58-cA,15,50 CatY,Prod57-cY,10,40 CatB,Prod58-cB,10,50 CatX,Prod59-cZ,10,60 CatC,Prod59-cC,10,140
That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ.
Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N ones of them first. After that, it was basically what you had already done.
[root@sandbox hcc]# cat salesHCC.pig rawSales = LOAD 'raw_sales.txt' USING PigStorage(',') AS (category: chararray, product: chararray, sales: long, total_sales_category: long); -- group them by the total sales / category combos grpByCatTotals = GROUP rawSales BY (total_sales_category, category); -- put these groups in order from highest to lowest sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC; -- just keep the top N topSalesCats = LIMIT sortGrpByCatTotals 2; -- do your original logic to get the top sales within the categories topProdsByTopCats = FOREACH topSalesCats { sorted = ORDER rawSales BY sales DESC; top = LIMIT sorted 2; GENERATE group, FLATTEN(top); } DUMP topProdsByTopCats;
The output is as initially expected.
[root@sandbox hcc]# pig -x tez salesHCC.pig ((140,CatC),CatC,Prod09-cC,80,140) ((140,CatC),CatC,Prod73-cC,50,140) ((60,CatZ),CatZ,Prod22-cZ,30,60) ((60,CatZ),CatZ,Prod83-cZ,20,60)
I hope this was what you were looking for. Either way, good luck!
Created 06-17-2016 07:21 PM
Thank you! This was exactly what I Wanted. What was new to me was this
So Pig can order groups in that way.