Pig - Nested foreach

Expert Contributor

Hi,

I was going through the Pig book and found this example, along with the explanation that follows it.

Ex:

--distinct_symbols.pig
daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields
grpd = group daily by exchange;
uniqcnt = foreach grpd {
    sym = daily.symbol;
    uniq_sym = distinct sym;
    generate group, COUNT(uniq_sym);
};

Explanation:

In this nested code, each record passed to foreach is handled one at a time. In the first line we see a syntax that we have not seen outside of foreach. In fact, sym = daily.symbol would not be legal outside of foreach. It is roughly equivalent to the top-level statement sym = foreach grpd generate daily.symbol, but it is not stated that way inside the foreach because it is not really another foreach. There is no relation for it to be associated with (that is, grpd is not defined here).

--

My question: the explanation says that each record passed to foreach is handled one at a time. In that case, how does it do the count? To count, we need more than one record, but the explanation says only one record is passed to the foreach at a time.

Is my understanding correct?

1 ACCEPTED SOLUTION


Yep, know this "classic" script pretty well. You can find it on @Alan Gates's github account at https://github.com/alanfgates/programmingpig/blob/master/examples/ch6/distinct_symbols.pig. The absolute best way to understand things (and very often answer questions) is to simply run something and observe the behavior.

To help with this, I loaded up some simple data that contains just three distinct trading symbols: SYM1, SYM2, and SYM3.

[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NYSE_daily.csv
NYSE,SYM1,100,100
NYSE,SYM1,200,200
NYSE,SYM2,100,100
NYSE,SYM2,200,200
NYSE,SYM3,100,100
NYSE,SYM3,200,200
[maria_dev@sandbox 104217]$ 

I got the script ready to run.

[maria_dev@sandbox 104217]$ cat distinct_symbols.pig 
daily    = load '/user/maria_dev/hcc/104217/NYSE_daily.csv' 
                   USING PigStorage(',')
                   as (exchange, symbol); -- skip other fields

grpd     = group daily by exchange;

describe grpd;  -- to show where "daily" below comes from 
dump grpd;

uniqcnt  = foreach grpd {
   sym      = daily.symbol;
   uniq_sym = distinct sym;
   generate group, COUNT(uniq_sym);
};

describe uniqcnt;
dump uniqcnt;
[maria_dev@sandbox 104217]$ 

The first thing you seem to have trouble with is where "daily" comes from. As the describe and dump output for grpd shows, each grpd tuple is made up of two attributes: group and daily. Here daily is a bag holding all of the records for the NYSE, which is every record, since that's all we have in this file.

grpd: {group: bytearray,daily: {(exchange: bytearray,symbol: bytearray)}}
(NYSE,{(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)})

So, there is only one tuple in the grpd alias (there is only one distinct exchange) for the foreach to loop through. Inside the loop, for that single tuple, sym is a bag of all six symbol values projected from the current tuple's daily bag, and uniq_sym is the bag of the three distinct symbols, which COUNT then uses to produce the second (unnamed) attribute in uniqcnt. That is how the count works even though foreach handles one record at a time: the one record it receives carries a whole bag of rows.

From there, we can describe and dump uniqcnt.

uniqcnt: {group: bytearray,long}
(NYSE,3) 
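
If it helps to see what the nested foreach is actually working with, the intermediate bags can be generated alongside the count. This is only a minimal sketch that reuses the grpd alias from the script above; the alias name peek is made up for illustration, and the order of tuples inside each bag is not guaranteed.

```pig
-- Assumes grpd from the script above (group daily by exchange).
-- "peek" is a hypothetical alias name used only for this illustration.
peek = foreach grpd {
   sym      = daily.symbol;   -- bag of every symbol in this exchange's group
   uniq_sym = distinct sym;   -- the same bag with duplicates removed
   generate group, sym, uniq_sym, COUNT(uniq_sym);
};

describe peek;
dump peek;
```

Each output tuple then shows the group key, the full bag of symbols, the de-duplicated bag, and the count taken over that de-duplicated bag.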

To help illustrate it more, add the following file to the same input directory in HDFS.

[maria_dev@sandbox 104217]$ hdfs dfs -cat /user/maria_dev/hcc/104217/NASDAQ_daily.csv
NASDAQ,HDP,10,1000
NASDAQ,ABC,1,1
NASDAQ,XYZ,1,1
NASDAQ,HDP,10,2000
[maria_dev@sandbox 104217]$ 

Then change the Pig script so that it loads the directory, which now holds two files, instead of the single file, and you'll get this updated output.

grpd: {group: bytearray,daily: {(exchange: bytearray,symbol: bytearray)}}
(NYSE,{(NYSE,SYM3),(NYSE,SYM3),(NYSE,SYM2),(NYSE,SYM2),(NYSE,SYM1),(NYSE,SYM1)})
(NASDAQ,{(NASDAQ,HDP),(NASDAQ,XYZ),(NASDAQ,ABC),(NASDAQ,HDP)})
uniqcnt: {group: bytearray,long}
(NYSE,3)
(NASDAQ,3) 
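
The change to the load statement mentioned above might look like the following. This is a sketch that assumes the HDFS directory /user/maria_dev/hcc/104217 holds only the two CSV files shown; when the load path is a directory, Pig reads every file in it.

```pig
-- Point load at the directory rather than at a single file.
-- Assumes the directory contains only NYSE_daily.csv and NASDAQ_daily.csv.
daily = load '/user/maria_dev/hcc/104217'
             USING PigStorage(',')
             as (exchange, symbol); -- skip other fields
```

The rest of the script stays the same; grpd simply ends up with one tuple per exchange instead of a single tuple.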

Hope this helps. Good luck and happy Hadooping!


4 REPLIES

Expert Contributor

Thanks @Lester Martin for the explanation.

But we can achieve the same result without a nested foreach, right? (Please see the sample code below, which does not use one.) When exactly should we use a nested foreach?

daily = load '/user/satu/data.csv' USING PigStorage(',') as (exchange, symbol); -- skip other fields
--dump daily;
dist = distinct daily;
--dump dist;
grpd = group dist by exchange;
uniqcnt = foreach grpd generate group, COUNT(dist.symbol);
dump uniqcnt;


Agreed that the example use case could be solved more simply (the real world demands the KISS principle, and the nested form is overkill for an example this simple). The point was to make sure you understood how it can be used. For a slightly more meaty example, check out the one in https://shrikantbang.wordpress.com/2014/01/14/apache-pig-group-by-nested-foreach-join-example/
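
One case where the nested form earns its keep is when the same group needs both a raw count and a distinct count. Running distinct before the group, as in the script above, throws away the duplicate rows, so the raw count is no longer available afterwards. A minimal sketch, reusing the daily relation from the accepted solution (the alias counts and the field names total_rows and uniq_symbols are made up for illustration):

```pig
-- Assumes daily was loaded as (exchange, symbol), as in the accepted solution.
grpd   = group daily by exchange;

-- "counts", "total_rows" and "uniq_symbols" are hypothetical names.
counts = foreach grpd {
   sym      = daily.symbol;
   uniq_sym = distinct sym;
   generate group,
            COUNT(daily)    as total_rows,    -- every row in the group
            COUNT(uniq_sym) as uniq_symbols;  -- distinct symbols only
};

dump counts;
```

With the two sample files from the accepted solution, this should give 6 rows and 3 symbols for NYSE, and 4 rows and 3 symbols for NASDAQ.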

Expert Contributor

Thanks Lester.