I think you are talking about the tutorial at http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/, and I'd love to review your suggestion.
As for me, with a little jumpstart from this old response from @Alan Gates, I think I got it working without the JOIN, as you were wondering. Check out (and run!) my Pig script below.
--MODIFY STEP 3.3 CODE TO LOOK LIKE THE FOLLOWING
-- so that we are dealing with integers
batting = LOAD 'baseball/Batting.csv' USING PigStorage(',')
    AS (playerID:chararray, year:int,
        dollar2:chararray, dollar3:chararray, dollar4:chararray,
        dollar5:chararray, dollar6:chararray, dollar7:chararray,
        runs:int);
--MODIFY STEP 3.4 CODE TO LOOK LIKE THE FOLLOWING
-- so that rows with zero (or null) runs are dropped too
raw_runs = FILTER batting BY (year > 0) AND (runs > 0);
--MODIFY STEP 3.5 CODE TO LOOK LIKE THE FOLLOWING
-- since the fields are already named (and typed)
runs = FOREACH raw_runs GENERATE playerID, year, runs;
--STEP 3.6 CODE LOOKS GOOD
grp_data = GROUP runs by (year);
--REPLACE STEP 3.7 CODE TO LOOK LIKE THE FOLLOWING
-- perform nested foreach so that each year's grouping
-- can be sorted and then trimmed to just the first tuple
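max_runs = FOREACH grp_data {
    -- sort this year's bag of rows by the runs field, highest first
    inner_sorted = ORDER runs BY runs DESC;
    -- keep only the top row for the year
    first_row = LIMIT inner_sorted 1;
    GENERATE first_row AS most_runs;
}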
dump max_runs;
--DON'T INCLUDE STEP 3.8 CODE...
The first few rows of this output look like the following and line up with the totals I see in the tutorial.
({(barnero01,1871,66)})
({(eggleda01,1872,94)})
({(barnero01,1873,125)})
({(mcveyca01,1874,91)})
({(barnero01,1875,115)})
({(barnero01,1876,126)})
That said, I think the tutorial is good as it is, since it introduces JOIN in a simple example. It may not be exactly what you would do in production (or maybe it is; that needs some testing at scale), but for a first look at Pig it is easier to understand than my nested FOREACH. For more on nested FOREACH, see http://pig.apache.org/docs/r0.15.0/basic.html#foreach
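One small variation, in case the single-element bags in the output above are awkward to work with downstream: wrapping the limited bag in FLATTEN should turn each row into a plain tuple. This is only a sketch that I have not run against the tutorial data. It assumes grp_data and the runs relation from my script above, and max_runs_flat is just a name I made up for illustration.

```pig
-- ASSUMES grp_data (GROUP runs BY year) from the script above is already defined
-- max_runs_flat is a hypothetical alias, not something from the tutorial
max_runs_flat = FOREACH grp_data {
    -- same sort-and-trim as before: highest runs first, keep one row
    inner_sorted = ORDER runs BY runs DESC;
    first_row = LIMIT inner_sorted 1;
    -- FLATTEN un-nests the one-tuple bag into top-level fields
    GENERATE FLATTEN(first_row);
}
dump max_runs_flat;
```

With that change I would expect rows shaped like (barnero01,1871,66) rather than ({(barnero01,1871,66)}). Note that either version keeps just one player per year, so ties for the yearly maximum are dropped.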