Created 09-25-2016 02:54 PM
Basically, I have 7 fields. How can I obtain this:
1;7287026502032012,18;706;101200010;17286;oz;2.5
Many thanks!
Created 09-26-2016 12:59 PM
You need to FLATTEN your nested data
Your grouped data set has (is a bag of) fields, tuples, and bags. You need to extract the fields from the bags and tuples using the FLATTEN operator.
Each of your grouped records can be seen as follows:
1;                       -- field
(7287026502032012,18);   -- tuple
{(706)};                 -- bag
{(101200010)};           -- bag
{(17286)};               -- bag
{(oz)};                  -- bag
2.5                      -- field
Using FLATTEN with the tuple is simple but using it with a bag is more complicated.
Flattening tuples
To look at only tuples, let's assume your data looked like this:
1;                       -- field
(7287026502032012,18);   -- tuple
Then you would use:
data_flattened = FOREACH data GENERATE $0, FLATTEN($1);
which for the data above would produce 1; 7287026502032012; 18
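To make the tuple case concrete, here is a small Python model of what FLATTEN does to a tuple field. It only illustrates the semantics and is not Pig code. The helper name flatten_tuple_fields is hypothetical, and for brevity it un-nests every tuple field in the record rather than only $1.

```python
def flatten_tuple_fields(record):
    """Model of FLATTEN on tuple fields: splice each tuple's items into the row."""
    out = []
    for field in record:
        if isinstance(field, tuple):
            out.extend(field)   # a tuple is un-nested in place; no new rows appear
        else:
            out.append(field)   # plain fields pass through unchanged
    return tuple(out)


row = (1, (7287026502032012, 18))
flat = flatten_tuple_fields(row)
print(flat)  # (1, 7287026502032012, 18)
```

The point to notice is that one input row still gives exactly one output row; only the number of columns changes.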
Flattening bags
Flattening bags is more complicated, because each tuple in the bag becomes its own output row, and those rows are cross joined with the other data in your GENERATE statement. From the Apache Pig docs:
For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).
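The cross product described in the docs can be sketched with a small Python model. This is an illustration of the semantics only, not Pig code: bags are represented as Python lists of tuples, and the function name generate_with_flatten is hypothetical. It assumes every bag field in the record is wrapped in FLATTEN.

```python
from itertools import product


def generate_with_flatten(record):
    """Model of GENERATE where every bag field (a list of tuples) is flattened.

    Scalars pass through unchanged. Each bag contributes one of its tuples per
    output row, and several bags combine as a cross product.
    """
    choices = []
    for field in record:
        if isinstance(field, list):      # a bag: one choice per tuple it holds
            choices.append(field)
        else:                            # a scalar: a single one-item choice
            choices.append([(field,)])
    rows = []
    for combo in product(*choices):
        rows.append(tuple(item for part in combo for item in part))
    return rows


one_bag = generate_with_flatten(("a", [("b", "c"), ("d", "e")]))
print(one_bag)        # [('a', 'b', 'c'), ('a', 'd', 'e')]

two_bags = generate_with_flatten((1, [(706,), (707,)], [("oz",), ("oz1",)]))
print(len(two_bags))  # 4
```

With a single bag the result matches the docs example. With two bags of two tuples each, the model yields four rows rather than two, which is why flattening several multi-element bags in one GENERATE multiplies the row count. The row order here comes from Python; Pig does not guarantee any order.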
Using Pig's builtin function BagToTuple() to help you out
Pig has a builtin function BagToTuple() which, as its name says, converts a bag to a tuple. By converting your bags to tuples, you can then flatten them as easily as the tuple above, without generating extra rows.
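As a rough picture of what BagToTuple() does, the Python sketch below concatenates the fields of every tuple in a bag into one tuple. It is a model under stated assumptions, not the Pig implementation: the bag is a Python list of tuples, and the name bag_to_tuple is hypothetical.

```python
def bag_to_tuple(bag):
    """Model of BagToTuple: concatenate the fields of every tuple in the bag."""
    return tuple(item for tup in bag for item in tup)


print(bag_to_tuple([(706,)]))          # (706,)
print(bag_to_tuple([(706,), (707,)]))  # (706, 707)
```

Because the result is a single tuple, flattening it adds columns to the same row instead of creating new rows. Keep in mind that Pig bags are unordered, so for a bag with several tuples the order of the resulting fields should not be relied on.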
Final code
Your final code will look like this:
data_flattened = FOREACH data GENERATE
    $0,
    FLATTEN($1),
    FLATTEN(BagToTuple($2)),
    FLATTEN(BagToTuple($3)),
    FLATTEN(BagToTuple($4)),
    FLATTEN(BagToTuple($5)),
    $6;
to produce your desired data.
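To check the shape of the result, here is a Python model of the whole statement applied to the sample record. It is a sketch of the semantics only; the function names bag_to_tuple and final_generate are hypothetical, and bags are represented as lists of tuples.

```python
def bag_to_tuple(bag):
    """Model of BagToTuple: concatenate the fields of every tuple in the bag."""
    return tuple(item for tup in bag for item in tup)


def final_generate(record):
    """Model of: $0, FLATTEN($1), FLATTEN(BagToTuple($2..$5)), $6."""
    f0, tup, b2, b3, b4, b5, f6 = record
    row = [f0, *tup]                   # FLATTEN($1) splices the tuple's fields in
    for bag in (b2, b3, b4, b5):
        row.extend(bag_to_tuple(bag))  # each bag becomes columns, not rows
    row.append(f6)
    return tuple(row)


record = (1, (7287026502032012, 18), [(706,)], [(101200010,)],
          [(17286,)], [("oz",)], 2.5)
line = ";".join(str(v) for v in final_generate(record))
print(line)  # 1;7287026502032012;18;706;101200010;17286;oz;2.5
```

Note that the two values of the tuple end up as separate fields, so the flattened row has eight columns rather than seven.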
Useful links:
https://pig.apache.org/docs/r0.10.0/basic.html#flatten
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToTuple.html
If this answers your question, let me know by accepting the answer. Else, let me know the gaps or issues that are remaining.
Created 01-01-2017 04:58 AM
Hi @Greg Keys
Happy New Year. Could you please provide the two clarifications below?
Clarification 1: Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5
The expression for data_flattened is the same as in your answer. In that case, is my understanding correct that the output is as below?
Output:
1;7287026502032012,18;706,707;101200010,101200011;17286,17287;oz,oz1;2.5
Clarification 2: Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5
and the expression is:
data_flattened_1 = FOREACH data GENERATE $0, FLATTEN($1), FLATTEN($2), FLATTEN($3), FLATTEN($4), FLATTEN($5), $6;
In that case, is my understanding correct that the output is as below?
Output:
1;7287026502032012,18;706;101200010;17286;oz;2.5
1;7287026502032012,18;707;101200011;17287;oz1;2.5
Created 01-03-2017 12:19 PM
Hi @Greg Keys
Could you please provide input on my clarifications?