Apache PIG - If Statement based on a count value

Hi experts, I have this statement in Apache Pig:

... Count = FOREACH data GENERATE SUM(Field); ...

How can I write an IF statement like this:

IF (SUM(Field) > 10) STORE INTO X; ELSE STORE INTO Y;

Is it possible to do this? Many thanks!

1 ACCEPTED SOLUTION

@João Souza

Pig has no IF/ELSE statement for this, but you can get the same effect with FILTER, which keeps only the records that satisfy one or more conditions.

There are two ways to do this.

The first uses FILTER, as below:

X = FILTER Count by Field >10; 
Y = FILTER Count by Field <=10; 

The second way achieves the same result with different syntax, using SPLIT:

SPLIT Count into X if Field >10, Y if Field <=10;
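
As a side note, SPLIT also accepts an OTHERWISE branch (added in Pig 0.10, as far as I recall), which reads closer to the IF/ELSE you asked about. The sketch below reuses the Count relation and the Field column from the lines above, and assumes Field is numeric. I have not checked how rows with a null Field are routed by OTHERWISE, so test that on your version if nulls matter.

```pig
-- Sketch only: assumes Count has a numeric column named Field.
-- Rows with Field > 10 go to X; rows that fail that test go to Y.
SPLIT Count INTO X IF Field > 10, Y OTHERWISE;
```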

Please note that SUM is an aggregate function that operates on a bag, so it requires a GROUP operation beforehand. In your case, the first line of your code would need a GROUP on data before the SUM.

It would have to look something like the following.

data = LOAD ... as (amt:int, name:chararray);
grouped_data = GROUP data by name;
summed_data = FOREACH grouped_data GENERATE group AS name, SUM(data.amt) AS amtSum;
X = FILTER summed_data by amtSum >10; 
Y = FILTER summed_data by amtSum <=10; 
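
To cover the STORE part of the question, here is a minimal end-to-end sketch of the same idea. The input path, the comma delimiter, and the output directories are placeholders I made up for illustration, so swap in your own. Note that after a GROUP, the grouping key is exposed as `group`, which is why it is renamed to name here.

```pig
-- Sketch only: the input path, delimiter, and output directories are hypothetical.
data = LOAD 'input/data.csv' USING PigStorage(',') AS (amt:int, name:chararray);

-- Collect the rows into one bag per name.
grouped_data = GROUP data BY name;

-- The grouping key is available as `group`; SUM runs over each bag.
summed_data = FOREACH grouped_data GENERATE group AS name, SUM(data.amt) AS amtSum;

-- Route each per-name total to at most one of the two relations.
SPLIT summed_data INTO X IF amtSum > 10, Y IF amtSum <= 10;

-- Write each relation to its own output directory.
STORE X INTO 'output/x' USING PigStorage(',');
STORE Y INTO 'output/y' USING PigStorage(',');
```

If you want one overall total rather than one total per name, GROUP data ALL in place of GROUP data BY name should give you a single row to test, though I have not run that variant against your data.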

(Let me know if this is what you are looking for by accepting the answer).
