We process records in batches, one batch for each 30-minute interval. We need to calculate the max or the sum of the values across all records in a single iteration. Could you help me calculate this, and also help me get a count of the number of records in each batch, in Pig?
The following will give you the total number of records in a file and the sum of the values of one field.
A = LOAD 'data.csv' USING PigStorage(',') AS (name:chararray, salary:int);
B = GROUP A ALL;
X = FOREACH B GENERATE COUNT(A.name), SUM(A.salary);
DUMP X;
MAX works the same way as SUM: MAX(A.salary) will give you the maximum value of that field across all records.
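As a rough sketch, the count, sum and maximum can all be generated in the same FOREACH over the grouped relation, so the data is only scanned once. The aliases record_count, salary_sum and salary_max are names I made up for illustration. I have used COUNT_STAR here on the assumption that you want to count every record; COUNT skips records where the first field is null.

```pig
-- Load the batch file; same schema as the example above
A = LOAD 'data.csv' USING PigStorage(',') AS (name:chararray, salary:int);

-- Put every record into a single group so the aggregates cover the whole batch
B = GROUP A ALL;

-- Count, sum and max computed together in one pass
X = FOREACH B GENERATE
        COUNT_STAR(A)  AS record_count,
        SUM(A.salary)  AS salary_sum,
        MAX(A.salary)  AS salary_max;

DUMP X;
```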
Note that DUMP X writes the results to the screen. If you want to persist the results you could STORE them to a file, but this is not a good fit in your case: with a batch every 30 minutes you would be writing many small files, which is not best practice in Hadoop. There are ways to get around this though. Alternatively, you could insert the results into a Hive table with columns something like: date, filename, count, sum, max.
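A minimal sketch of the Hive option, assuming HCatalog is available on your cluster and Pig is started with -useHCatalog. The table name batch_stats, its column names, and the parameters batch_date and input_file are all hypothetical; the table must already exist in Hive with matching column names and compatible types. The HCatStorer class name below is the one used in newer Hive releases and may differ on older installations.

```pig
-- Run with something like:
--   pig -useHCatalog -param batch_date=<date> -param input_file=<file> script.pig
-- (batch_date and input_file are hypothetical parameter names)
A = LOAD '$input_file' USING PigStorage(',') AS (name:chararray, salary:int);
B = GROUP A ALL;

-- One output row per batch: date, filename, count, sum, max
X = FOREACH B GENERATE
        '$batch_date'  AS batch_date,
        '$input_file'  AS filename,
        COUNT_STAR(A)  AS record_count,
        SUM(A.salary)  AS salary_sum,
        MAX(A.salary)  AS salary_max;

-- Write the row to an existing Hive table (batch_stats is a hypothetical name)
STORE X INTO 'batch_stats' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Each run then adds one row per batch, so you can query the counts and sums for all batches from Hive.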