distinct operation with bags

Expert Contributor
x = LOAD '/pigdata/source.txt' using PigStorage(',') As (exchange:chararray, symbol:chararray, date:chararray, open:double, high:double, low:double, close:double, volume:long, adj_close:double);


y = GROUP x by symbol;

z2 = foreach y generate x.exchange as exchange1;
dump z2;
({(NASDAQ),(NASDAQ),(NASDAQ),(ICICI),(ICICI),(ICICI),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ)})
({(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ)})

z4 = distinct z2; 
dump z4; 
({(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ)})
({(NASDAQ),(NASDAQ),(NASDAQ),(ICICI),(ICICI),(ICICI),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ),(NASDAQ)})

Clarification: How does DISTINCT work with bags? For tuples it is clear, but what happens when I use DISTINCT on a relation whose only field is a bag? The output of dump z4 is not clear to me.

1 ACCEPTED SOLUTION

Guru

First you need to convert your bags into tuples, then flatten and distinct.

This is done using Pig's built-in function BagToTuple().

See this post for explanation and example:

https://community.hortonworks.com/questions/58271/using-pig-latin-to-replace-multiple-strings-from-s...

3 REPLIES

Expert Contributor

Hi @Greg Keys

Thanks for the input. Maybe my question was not clear. What happens when we run z4 = distinct z2; ?

How z4 is calculated from z2 is not clear.

Guru

Same answer: since z2 is a bag, you need to flatten it to a tuple to do a distinct on it.

For the data you are showing:

z3 = foreach z2 generate FLATTEN(BagToTuple($0));

z4 = distinct z3;

The link gives the detailed explanation of why this is required.
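
As a side note on the original question: DISTINCT on a relation compares whole rows, and each row of z2 holds a single field that is itself a bag. Two rows therefore count as duplicates only when their entire bags are equal, which is likely why z4 still contains both rows of z2 (the two bags differ). If the goal is the unique exchange values, two common patterns are sketched below. This is a minimal sketch, assuming x and y are defined as in the question; the aliases per_symbol, ex, uniq, flat and all_exchanges are made up for illustration.

```pig
-- Assumes x and y exist as in the question:
--   x = LOAD ... AS (exchange:chararray, symbol:chararray, ...);
--   y = GROUP x BY symbol;

-- Option 1: de-duplicate inside each group with a nested DISTINCT.
per_symbol = FOREACH y {
    ex   = x.exchange;     -- bag of exchange values for this symbol
    uniq = DISTINCT ex;    -- DISTINCT applied to the tuples inside the bag
    GENERATE group AS symbol, uniq AS exchanges;
};

-- Option 2: flatten the bag into one row per exchange value,
-- then apply DISTINCT across those rows.
flat          = FOREACH y GENERATE FLATTEN(x.exchange) AS exchange;
all_exchanges = DISTINCT flat;
```

Option 1 keeps one row per symbol with a de-duplicated bag; Option 2 discards the grouping and returns the unique exchange values across the whole data set.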