Using PIG Latin to replace multiple strings from same field

Rising Star
Hi experts, I have this line from a .txt file that was produced by a GROUP operator: 1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5

It has 7 fields. How can I turn it into this: 1;7287026502032012,18;706;101200010;17286;oz;2.5 Many thanks!

1 ACCEPTED SOLUTION

Guru

You need to FLATTEN your nested data

Each record in your grouped data set is a mix of plain fields, a tuple, and bags. You need to extract the values from the tuple and the bags using the FLATTEN operator.

Each of your grouped records can be seen as follows:

1;					-- field
(7287026502032012,18);			-- tuple
{(706)};				-- bag
{(101200010)};				-- bag
{(17286)};				-- bag
{(oz)};					-- bag
2.5 					-- field

Using FLATTEN with the tuple is simple but using it with a bag is more complicated.

Flattening tuples

To look at only tuples, let's assume your data looked like this:

1;					-- field
(7287026502032012,18);			-- tuple

Then you would use:

data_flattened = FOREACH data GENERATE
   $0,
   FLATTEN($1);   -- FLATTEN takes its argument in parentheses

which for the data above would produce 1; 7287026502032012; 18

Flattening bags

Flattening bags is more complicated, because every tuple in the bag becomes its own output row, cross joined with the other expressions in your GENERATE statement. From the Apache Pig docs:

For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).
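To make the quoted behaviour concrete, here is a small sketch you could try in local mode. The file name nested.txt, the relation names, and the field names are all made up for illustration; the input record simply extends the docs' example with a second bag.

```pig
-- Assumed input file 'nested.txt' (hypothetical), tab-separated, one record:
-- a	{(b,c),(d,e)}	{(f),(g)}
nested = LOAD 'nested.txt' AS (
    key:chararray,
    pairs:bag{p:tuple(x:chararray, y:chararray)},
    singles:bag{s:tuple(z:chararray)}
);

-- One FLATTEN: one output row per tuple in the bag, with key repeated.
-- Per the docs quoted above this gives (a,b,c) and (a,d,e).
one_bag = FOREACH nested GENERATE key, FLATTEN(pairs);
DUMP one_bag;

-- Two FLATTENs in the same GENERATE: the rows are the cross product
-- of the two bags, so 2 x 2 = 4 rows here rather than 2.
two_bags = FOREACH nested GENERATE key, FLATTEN(pairs), FLATTEN(singles);
DUMP two_bags;
```

The second statement is the case to watch out for: flattening several multi-element bags side by side does not pair their elements up by position, it multiplies the row count.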

Using Pig's builtin function BagToTuple() to help you out

Pig has a builtin function, BagToTuple(), which, as the name says, converts a bag to a tuple. Once your bags are converted to tuples, you can flatten them just like the tuple above.
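As a rough illustration of what BagToTuple() does before FLATTEN is applied, consider the sketch below. The file name bags.txt and the field names are assumptions for the example, not something from your script.

```pig
-- Assumed input file 'bags.txt' (hypothetical), tab-separated, one record:
-- 1	{(706),(707)}
bags = LOAD 'bags.txt' AS (id:int, codes:bag{t:tuple(code:int)});

-- BagToTuple collapses the whole bag into one tuple, e.g. (706,707).
as_tuple = FOREACH bags GENERATE id, BagToTuple(codes);
DUMP as_tuple;

-- Flattening that tuple widens the row (more fields) instead of
-- multiplying rows the way FLATTEN on the bag itself would.
widened = FOREACH bags GENERATE id, FLATTEN(BagToTuple(codes));
DUMP widened;
```

With single-element bags like yours this simply unwraps the one value. With larger bags, keep in mind that bags are unordered and that the number of resulting fields depends on the bag size, so the row width can vary from record to record.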

Final code

Your final code will look like this:

data_flattened = FOREACH data GENERATE
	$0,
	FLATTEN($1),
	FLATTEN(BagToTuple($2)),
	FLATTEN(BagToTuple($3)),
	FLATTEN(BagToTuple($4)),
	FLATTEN(BagToTuple($5)),
	$6;

to produce your desired data.
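For completeness, here is a minimal end-to-end sketch of how the load and store around that statement might look. The paths grouped.txt and flattened_out, the field names, and the field types are my assumptions; only the ';' delimiter and the shape of each field come from your sample line.

```pig
-- Assumed path, names and types (hypothetical); adjust to your data.
data = LOAD 'grouped.txt' USING PigStorage(';') AS (
    id:int,
    pair:tuple(num:long, cnt:int),
    f2:bag{t2:tuple(v2:int)},
    f3:bag{t3:tuple(v3:long)},
    f4:bag{t4:tuple(v4:int)},
    f5:bag{t5:tuple(v5:chararray)},
    amount:double
);

-- Unwrap the tuple and the single-element bags into plain fields.
data_flattened = FOREACH data GENERATE
    id,
    FLATTEN(pair),
    FLATTEN(BagToTuple(f2)),
    FLATTEN(BagToTuple(f3)),
    FLATTEN(BagToTuple(f4)),
    FLATTEN(BagToTuple(f5)),
    amount;

-- Write the result with the same ';' delimiter (output dir must not exist).
STORE data_flattened INTO 'flattened_out' USING PigStorage(';');
```

Note that after FLATTEN(pair) the two values of the tuple are separate fields, so they should be written with ';' between them rather than the ',' shown in your target line.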

Useful links:

https://pig.apache.org/docs/r0.10.0/basic.html#flatten
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToTuple.html

If this answers your question, let me know by accepting the answer. Otherwise, let me know what gaps or issues remain.

3 REPLIES

Expert Contributor

Hi @Greg Keys

Happy New Year. Could you please clarify the two points below?

Clarification 1:
Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

With the same data_flattened expression as in your answer, is my understanding correct that the output would be the following?
Output:
1;7287026502032012,18;706,707;101200010,101200011;17286,17287;oz,oz1;2.5

Clarification 2:
Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

data_flattened_1 = FOREACH data GENERATE
	$0,
	FLATTEN($1),
	FLATTEN($2),
	FLATTEN($3),
	FLATTEN($4),
	FLATTEN($5),
	$6;
With the data_flattened_1 expression above, is my understanding correct that the output would be the following?
Output:
1;7287026502032012,18;706;101200010;17286;oz;2.5
1;7287026502032012,18;707;101200011;17287;oz1;2.5

Expert Contributor

Hi @Greg Keys

Could you please respond to my clarification questions above?