Can't we filter the data as we did in step 3.7 by simply applying the constraints to the batting table, without using a JOIN?

Re: Can't we filter the data as we did in step 3.7 by simply applying the constraints to the batting table, without using a JOIN?

I think you are talking about the tutorial at http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/ and I'd love to review your suggestion. With a little jumpstart from an old response by @Alan Gates, I think I got it working without the JOIN, as you were wondering. Check out (and run!) my Pig script below.

--MODIFY STEP 3.3 CODE TO LOOK LIKE THE FOLLOWING
--   so that we are dealing with integers
batting = load 'baseball/Batting.csv' using PigStorage(',')
  AS (playerID:chararray, year:int,
      dollar2:chararray, dollar3:chararray, dollar4:chararray, 
      dollar5:chararray, dollar6:chararray, dollar7:chararray,
      runs:int);

--MODIFY STEP 3.4 CODE TO LOOK LIKE THE FOLLOWING
--   so that rows with zero runs are also dropped
raw_runs = FILTER batting BY (year > 0) AND (runs > 0);

--MODIFY STEP 3.5 CODE TO LOOK LIKE THE FOLLOWING
--   since fields already named (and typed) 
runs = FOREACH raw_runs GENERATE playerID, year, runs;

--STEP 3.6 CODE LOOKS GOOD
grp_data = GROUP runs by (year);

--REPLACE STEP 3.7 CODE TO LOOK LIKE THE FOLLOWING
--   perform nested foreach so that each year's grouping
--   can be sorted and then trimmed to just the first tuple
max_runs = FOREACH grp_data {
    inner_sorted = ORDER runs BY runs DESC;
    first_row = LIMIT inner_sorted 1;
    GENERATE first_row AS most_runs;
}
dump max_runs;

--DON'T INCLUDE STEP 3.8 CODE...

The first few rows of the output look like the following and line up with the totals I see in the tutorial.

({(barnero01,1871,66)})
({(eggleda01,1872,94)})
({(barnero01,1873,125)})
({(mcveyca01,1874,91)})
({(barnero01,1875,115)})
({(barnero01,1876,126)})

That said, I think the tutorial is good as it is, since it introduces the JOIN in a simple example. The JOIN may or may not be exactly what you would do in production (that needs some testing at scale), but for a first look at Pig it is easier to understand than my nested FOREACH. See more at http://pig.apache.org/docs/r0.15.0/basic.html#foreach
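
If the nested FOREACH is hard to picture, here is a rough plain-Python sketch of the same filter, group, sort, and trim steps. It is only an illustration of the logic, not how Pig executes it. The function name and the "hypothetical" player rows are made up for the example; only the two winning rows come from the output above.

```python
from collections import defaultdict

# (playerID, year, runs) rows. The two winners match the output in this post;
# the "hypothetical" rows are invented purely for illustration.
rows = [
    ("barnero01", 1871, 66),
    ("hypothetical01", 1871, 40),
    ("eggleda01", 1872, 94),
    ("hypothetical02", 1872, 12),
    ("hypothetical03", 1872, 0),   # dropped by the runs > 0 filter
]


def top_run_scorer_per_year(rows):
    """Mimic FILTER, GROUP BY year, then ORDER ... DESC plus LIMIT 1 per group."""
    # FILTER batting BY (year > 0) AND (runs > 0)
    kept = [r for r in rows if r[1] > 0 and r[2] > 0]

    # GROUP runs BY (year)
    groups = defaultdict(list)
    for r in kept:
        groups[r[1]].append(r)

    # Nested FOREACH: sort each year's bag by runs descending, keep the first tuple
    result = {}
    for year, members in groups.items():
        inner_sorted = sorted(members, key=lambda r: r[2], reverse=True)
        result[year] = inner_sorted[0]
    return result


max_runs = top_run_scorer_per_year(rows)
for year in sorted(max_runs):
    print(max_runs[year])
```

One thing the sketch makes visible: keeping only the first tuple means each year yields exactly one player. If two players tie for the most runs in a year, only one of them survives, whereas joining back on the maximum value would keep both. I have not checked whether the baseball data has such ties, so treat that as something to verify before swapping approaches.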
