Using PIG Latin to replace multiple strings from same field

Hi experts, I have this line from a .txt file, which is the output of a GROUP operator:

1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5

It has 7 fields. How can I obtain this?

1;7287026502032012,18;706;101200010;17286;oz;2.5

Many thanks!

Accepted Solutions

Re: Using PIG Latin to replace multiple strings from same field

Guru

You need to FLATTEN your nested data

Your grouped data set has (is a bag of) fields, tuples, and bags. You need to extract the fields from the bags and tuples using the FLATTEN operator.

Each of your grouped records can be seen as follows:

1;					-- field
(7287026502032012,18);			-- tuple
{(706)};				-- bag
{(101200010)};				-- bag
{(17286)};				-- bag
{(oz)};					-- bag
2.5 					-- field

Using FLATTEN with the tuple is simple but using it with a bag is more complicated.

Flattening tuples

To look at only tuples, let's assume your data looked like this:

1;					-- field
(7287026502032012,18);			-- tuple

Then you would use:

data_flattened = FOREACH data GENERATE
   $0,
   FLATTEN($1);

which, for the data above, would produce three fields: 1; 7287026502032012; 18

Flattening bags

Flattening bags is more complicated because each tuple in the bag becomes its own output row, cross joined with the other fields in your GENERATE statement. From the Apache Pig docs:

For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).
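The cross product matters most when more than one bag is flattened in the same GENERATE. The following is a minimal sketch rather than part of the original answer: the file name two_bags.txt, the sample record, and the field names (id, codes, units) are assumptions made for illustration.

```pig
-- Assumed input file 'two_bags.txt' (hypothetical), one record per line:
-- 1;{(706),(707)};{(oz),(oz1)}
data = LOAD 'two_bags.txt' USING PigStorage(';')
       AS (id:int,
           codes:bag{t:tuple(code:int)},
           units:bag{t:tuple(unit:chararray)});

-- Flattening two bags in one GENERATE pairs every tuple of the first bag
-- with every tuple of the second: 2 x 2 = 4 output rows for this record.
crossed = FOREACH data GENERATE id, FLATTEN(codes), FLATTEN(units);

DUMP crossed;
-- Rows produced (order is not guaranteed):
-- (1,706,oz)
-- (1,706,oz1)
-- (1,707,oz)
-- (1,707,oz1)
```

With four two-element bags flattened this way, the same record would expand to 2 x 2 x 2 x 2 = 16 rows, not 2.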

Using Pig's built-in function BagToTuple() to help you out

Pig has a built-in function, BagToTuple(), which, as its name says, converts a bag to a tuple. Once your bags are tuples, you can flatten them as above without producing extra rows.
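To make the difference concrete, here is a small sketch that compares the two approaches on a bag holding two tuples. The file name bags.txt, the sample record, and the field names are hypothetical and only serve the illustration.

```pig
-- Assumed input file 'bags.txt' (hypothetical), one record per line:
-- 1;{(706),(707)}
data = LOAD 'bags.txt' USING PigStorage(';')
       AS (id:int, codes:bag{t:tuple(code:int)});

-- FLATTEN on the bag itself: one output row per tuple in the bag.
many_rows = FOREACH data GENERATE id, FLATTEN(codes);
DUMP many_rows;
-- (1,706)
-- (1,707)

-- BagToTuple first turns {(706),(707)} into the tuple (706,707);
-- FLATTEN on that tuple adds fields to the same row instead of adding rows.
one_row = FOREACH data GENERATE id, FLATTEN(BagToTuple(codes));
DUMP one_row;
-- (1,706,707)
```

When every bag holds exactly one tuple, as in the question, both forms give a single row. They only diverge once a bag contains several tuples.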

Final code

Your final code will look like this:

data_flattened = FOREACH data GENERATE 
	$0, 
	FLATTEN($1),
	FLATTEN(BagToTuple($2)),
	FLATTEN(BagToTuple($3)),
	FLATTEN(BagToTuple($4)),
	FLATTEN(BagToTuple($5)),
	$6; 

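to produce flat records. Note that FLATTEN turns the tuple's two values into separate fields, so with ; as the delimiter the record is written as 1;7287026502032012;18;706;101200010;17286;oz;2.5.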
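For completeness, the statement above can be placed in an end-to-end script along the following lines. This is a sketch under stated assumptions: the input path grouped.txt, the output path flattened_out, and all field names and types in the schema are guesses based on the sample record, not details given in the question.

```pig
-- Assumed input 'grouped.txt' (hypothetical path), one record per line:
-- 1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5
data = LOAD 'grouped.txt' USING PigStorage(';') AS (
    id:int,
    pair:tuple(k1:long, k2:int),          -- field names are illustrative
    codes:bag{t:tuple(code:int)},
    refs:bag{t:tuple(ref:long)},
    items:bag{t:tuple(item:int)},
    units:bag{t:tuple(unit:chararray)},
    amount:double
);

-- Same shape as the statement above, using field names instead of positions.
data_flattened = FOREACH data GENERATE
    id,
    FLATTEN(pair),
    FLATTEN(BagToTuple(codes)),
    FLATTEN(BagToTuple(refs)),
    FLATTEN(BagToTuple(items)),
    FLATTEN(BagToTuple(units)),
    amount;

-- Write the flat records back out with ';' between fields.
STORE data_flattened INTO 'flattened_out' USING PigStorage(';');
```

Declaring a schema on LOAD is a design choice here: it lets PigStorage parse the tuple and bag text into real complex types, which FLATTEN and BagToTuple need in order to operate on them.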

Useful links:

https://pig.apache.org/docs/r0.10.0/basic.html#flatten
http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach
https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToTuple.html

If this answers your question, let me know by accepting the answer. Otherwise, let me know what gaps or issues remain.

Re: Using PIG Latin to replace multiple strings from same field

Contributor

Hi @Greg Keys

Happy New Year. Could you please clarify the following two points?

Clarification 1:
Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

If the expression for data_flattened stays the same, is the output below correct?
Output:
1;7287026502032012,18;706,707;101200010,101200011;17286,17287;oz,oz1;2.5

Clarification 2:
Let us say my input is the same:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

and the expression is:
data_flattened_1 = FOREACH data GENERATE 
	$0, 
	FLATTEN($1),
	FLATTEN($2),
	FLATTEN($3),
	FLATTEN($4),
	FLATTEN($5),
	$6; 
With this expression for data_flattened_1, is the output below correct?
Output:
1;7287026502032012,18;706;101200010;17286;oz;2.5
1;7287026502032012,18;707;101200011;17287;oz1;2.5

Re: Using PIG Latin to replace multiple strings from same field

Contributor

Hi @Greg Keys

Could you please provide input on my clarifications?
