## Using PIG Latin to replace multiple strings from same field

Explorer
Hi experts, I have this line from a .txt file, which results from a GROUP operator:

1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5

Basically I have 7 fields. How can I obtain this?

1;7287026502032012,18;706;101200010;17286;oz;2.5

Many thanks!

Accepted Solutions

## Re: Using PIG Latin to replace multiple strings from same field

Guru

You need to FLATTEN your nested data

Each record in your grouped data set is a mix of plain fields, a tuple, and bags. You need to extract the values from the tuple and the bags using the FLATTEN operator.

Each of your grouped records can be seen as follows:

1;					-- field
(7287026502032012,18);			-- tuple
{(706)};				-- bag
{(101200010)};				-- bag
{(17286)};				-- bag
{(oz)};					-- bag
2.5 					-- field

Using FLATTEN on the tuple is simple, but using it on a bag is more complicated.

Flattening tuples

To look at only tuples, let's assume your data looked like this:

1;					-- field
(7287026502032012,18);			-- tuple

Then you would use:

data_flattened = FOREACH data GENERATE
    $0,
    FLATTEN($1);

For the data above, this would produce three fields: 1; 7287026502032012; 18

Flattening bags

Flattening bags is more complicated, because FLATTEN turns each tuple in the bag into a separate output record and crosses those records with the other data in your GENERATE statement. From the Apache Pig docs:

For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).
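
To make the cross product concrete, here is a minimal sketch. The file name `nested.txt`, the aliases, and the declared schema are assumptions made for illustration and are not part of the original question.

```pig
-- Hypothetical input line in 'nested.txt' (fields separated by ';'):
--   1;{(706),(707)};{(oz),(oz1)}
data = LOAD 'nested.txt' USING PigStorage(';')
       AS (id:int,
           codes:bag{t:tuple(code:int)},
           units:bag{t:tuple(unit:chararray)});

-- Each FLATTEN on a bag emits one row per tuple in that bag.
-- Two flattened bags in the same GENERATE are crossed with each other.
crossed = FOREACH data GENERATE id, FLATTEN(codes), FLATTEN(units);

DUMP crossed;
-- For the line above this should print 2 x 2 = 4 rows (order not guaranteed):
-- (1,706,oz)
-- (1,706,oz1)
-- (1,707,oz)
-- (1,707,oz1)
```

With single-element bags, as in the question, the cross product has exactly one row, so the row count only becomes an issue once a bag holds more than one tuple.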

Pig has a built-in function, BagToTuple(), which converts a bag to a single tuple. Once your bags are converted to tuples, you can flatten them the same way as the tuple above.
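
A short sketch of what this conversion does when a bag holds more than one tuple: BagToTuple() widens the row instead of multiplying rows. As before, `nested.txt`, the aliases, and the schema are hypothetical.

```pig
-- Hypothetical input line in 'nested.txt' (fields separated by ';'):
--   1;{(706),(707)};{(oz),(oz1)}
data = LOAD 'nested.txt' USING PigStorage(';')
       AS (id:int,
           codes:bag{t:tuple(code:int)},
           units:bag{t:tuple(unit:chararray)});

-- BagToTuple packs all tuples of a bag into one tuple, so the FLATTEN
-- that follows adds columns to the row rather than creating new rows.
packed = FOREACH data GENERATE
    id,
    FLATTEN(BagToTuple(codes)),
    FLATTEN(BagToTuple(units));

DUMP packed;
-- For the line above this should print a single row:
-- (1,706,707,oz,oz1)
```

The number of columns then depends on how many tuples each bag holds, so rows from different groups can differ in width.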

Final code

Your final code will look like this:

data_flattened = FOREACH data GENERATE
    $0,
    FLATTEN($1),
    FLATTEN(BagToTuple($2)),
    FLATTEN(BagToTuple($3)),
    FLATTEN(BagToTuple($4)),
    FLATTEN(BagToTuple($5)),
    $6;
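
For completeness, the statement can be embedded in a full script roughly as follows. This is only a sketch: the paths `grouped.txt` and `flattened_out`, the field aliases, and the declared types are assumptions, so adjust them to your actual data.

```pig
-- Hypothetical input line in 'grouped.txt':
--   1;(7287026502032012,18);{(706)};{(101200010)};{(17286)};{(oz)};2.5
data = LOAD 'grouped.txt' USING PigStorage(';') AS (
    id:int,
    pair:tuple(big:long, small:int),
    codes:bag{t:tuple(code:int)},
    refs:bag{t:tuple(ref:long)},
    nums:bag{t:tuple(num:int)},
    units:bag{t:tuple(unit:chararray)},
    amount:double
);

-- Same logic as the positional version, written with the aliases above.
data_flattened = FOREACH data GENERATE
    id,
    FLATTEN(pair),
    FLATTEN(BagToTuple(codes)),
    FLATTEN(BagToTuple(refs)),
    FLATTEN(BagToTuple(nums)),
    FLATTEN(BagToTuple(units)),
    amount;

-- Write the flat records back out with ';' as the field separator.
STORE data_flattened INTO 'flattened_out' USING PigStorage(';');
```

Because the tuple is flattened as well, the stored line should separate its two values with `;` rather than `,`, giving 1;7287026502032012;18;706;101200010;17286;oz;2.5.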

If this answers your question, let me know by accepting the answer. Otherwise, let me know what gaps or issues remain.

## Re: Using PIG Latin to replace multiple strings from same field

Contributor

Happy New Year. Could you please clarify the two points below?

Clarification 1:
Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

The expression for data_flattened is the same as in the answer above. Is my understanding correct that it produces the output below?
Output:
1;7287026502032012,18;706,707;101200010,101200011;17286,17287;oz,oz1;2.5

Clarification 2:
Let us say my input is:
1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

data_flattened_1 = FOREACH data GENERATE
    $0,
    FLATTEN($1),
    FLATTEN($2),
    FLATTEN($3),
    FLATTEN($4),
    FLATTEN($5),
    $6;
With data_flattened_1 defined as above, is my understanding correct that it produces the output below?
Output:
1;7287026502032012,18;706;101200010;17286;oz;2.5
1;7287026502032012,18;707;101200011;17287;oz1;2.5

## Re: Using PIG Latin to replace multiple strings from same field

Contributor

Could you please provide input on my clarifications?
