- Using PIG Latin to replace multiple strings from s...

Explorer

Created 09-25-2016 02:54 PM

Basically, I have 7 fields. How can I obtain this:

1;7287026502032012,18;706;101200010;17286;oz;2.5

Many thanks!

1 ACCEPTED SOLUTION

Guru

Created 09-26-2016 12:59 PM

*You need to FLATTEN your nested data*

Your grouped data set has (is a bag of) fields, tuples, and bags. You need to extract the fields from the bags and tuples using the FLATTEN operator.

Each of your grouped records can be seen as follows:

1; -- field
(7287026502032012,18); -- tuple
{(706)}; -- bag
{(101200010)}; -- bag
{(17286)}; -- bag
{(oz)}; -- bag
2.5 -- field

Using FLATTEN with the tuple is simple but using it with a bag is more complicated.

*Flattening tuples*

To look at only tuples, let's assume your data looked like this:

1; -- field
(7287026502032012,18); -- tuple

Then you would use:

data_flattened = FOREACH data GENERATE $0, FLATTEN($1);

which for the data above would produce 1; 7287026502032012; 18
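If it helps to see what FLATTEN does to a tuple field, here is a rough model in plain Python rather than Pig. The function name `flatten_tuple_field` is made up for illustration, and a Python tuple stands in for a Pig tuple; this is only a sketch of the behaviour, not how Pig implements it.

```python
# Plain-Python model of Pig's FLATTEN applied to a tuple field (not Pig code).
# flatten_tuple_field is a hypothetical helper name used only for this sketch.
def flatten_tuple_field(record, index):
    """Return a new record with the nested tuple at `index` spliced in place."""
    nested = record[index]
    return record[:index] + tuple(nested) + record[index + 1:]


# Same shape as the two-field example: a plain field followed by a tuple.
record = (1, (7287026502032012, 18))
print(flatten_tuple_field(record, 1))  # (1, 7287026502032012, 18)
```

The number of records does not change here: one record in, one wider record out.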

*Flattening bags*

Flattening bags is more complicated, because FLATTEN turns each tuple in the bag into its own record and cross joins it with the other data in your GENERATE statement. From the Apache Pig docs:

*For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).*
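The cross product described in that quote can be modelled in a few lines of Python. This is a sketch under assumptions, not Pig code: a bag is represented as a list of tuples, and `flatten_bags` is a hypothetical helper name chosen for this illustration.

```python
from itertools import product


# Plain-Python model of GENERATE with FLATTEN applied to some bag fields.
# A bag is modelled as a list of tuples; flatten_bags is a hypothetical name.
def flatten_bags(record, bag_indexes):
    """Return the output rows produced by flattening the bags at bag_indexes."""
    choices = []
    for i, field in enumerate(record):
        if i in bag_indexes:
            # A flattened bag contributes one alternative per tuple it holds.
            choices.append([tuple(t) for t in field])
        else:
            # Any other field contributes a single alternative: itself.
            choices.append([(field,)])
    # Every combination of alternatives becomes one output row.
    return [sum(combo, ()) for combo in product(*choices)]


# The example from the quoted docs: (a, {(b,c), (d,e)})
rows = flatten_bags(("a", [("b", "c"), ("d", "e")]), {1})
print(rows)  # [('a', 'b', 'c'), ('a', 'd', 'e')]

# Two flattened bags with two tuples each give 2 x 2 = 4 rows.
print(len(flatten_bags(("a", [("b",), ("c",)], [("x",), ("y",)]), {1, 2})))  # 4
```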

**Using Pig's builtin function BagToTuple() to help you out**

Pig has a builtin function BagToTuple() which, as its name says, converts a bag to a tuple. By converting your bags to tuples, you can then flatten them the same way as the tuple above.
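As a rough picture of what BagToTuple() does to one bag, here is a small Python model. It is not Pig code: `bag_to_tuple` is a made-up name, a bag is modelled as a list of tuples, and the list order is an assumption of the sketch (Pig bags are unordered, so the field order in a real result may differ).

```python
# Plain-Python model of Pig's BagToTuple(): the tuples inside a bag are
# concatenated, one level deep, into a single tuple.
# bag_to_tuple is a hypothetical name used only for this sketch.
def bag_to_tuple(bag):
    """Concatenate every tuple in the bag into one flat tuple."""
    return tuple(value for tup in bag for value in tup)


print(bag_to_tuple([(706,)]))          # (706,)
print(bag_to_tuple([(706,), (707,)]))  # (706, 707)
```

Because the result is a single tuple, flattening it widens the record instead of multiplying it.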

*Final code*

Your final code will look like this:

data_flattened = FOREACH data GENERATE $0, FLATTEN($1), FLATTEN(BagToTuple($2)), FLATTEN(BagToTuple($3)), FLATTEN(BagToTuple($4)), FLATTEN(BagToTuple($5)), $6;

to produce your desired data.
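One thing worth checking before relying on this: the two ways of flattening a bag only agree while each bag holds a single tuple. The Python sketch below models both routes on a hypothetical record that has two tuples per bag. It is a model of the semantics described above, not Pig code, and the record values beyond the ones in the question are invented for illustration.

```python
from itertools import product

# Hypothetical record shaped like the grouped data, but with two tuples per bag.
# Bags are modelled as lists of tuples.
record = (1, (7287026502032012, 18),
          [(706,), (707,)], [(101200010,), (101200011,)],
          [(17286,), (17287,)], [("oz",), ("oz1",)], 2.5)

head, pair, *bags, tail = record

# Route 1, FLATTEN(BagToTuple(bag)): each bag collapses into one tuple,
# so the result is a single, wider row.
collapsed = tuple(v for bag in bags for tup in bag for v in tup)
one_row = (head, *pair, *collapsed, tail)

# Route 2, FLATTEN(bag) on every bag: one row per combination of bag members.
cross_rows = [(head, *pair, *(v for tup in combo for v in tup), tail)
              for combo in product(*bags)]

print(one_row)
print(len(cross_rows))  # 16
print(cross_rows[0])    # (1, 7287026502032012, 18, 706, 101200010, 17286, 'oz', 2.5)
```

With four bags of two tuples each, the plain FLATTEN route yields 2 x 2 x 2 x 2 = 16 rows, while the BagToTuple() route keeps one row with twelve fields.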

*Useful links:*

https://pig.apache.org/docs/r0.10.0/basic.html#flatten

http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach

https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToTuple.html

If this answers your question, let me know by accepting the answer. Else, let me know the gaps or issues that are remaining.

3 REPLIES

Re: Using PIG Latin to replace multiple strings from same field

Contributor

Created 01-01-2017 04:58 AM

Hi @Greg Keys

Happy New Year. Could you please provide the two clarifications below?

Clarification 1: Let us say my input is:

1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

The expression for data_flattened is the same as in your answer. Is my understanding correct that the output would be as follows?

Output:

1;7287026502032012,18;706,707;101200010,101200011;17286,17287;oz,oz1;2.5

Clarification 2: Let us say my input is:

1;(7287026502032012,18);{(706),(707)};{(101200010),(101200011)};{(17286),(17287)};{(oz),(oz1)};2.5

and the expression is:

data_flattened_1 = FOREACH data GENERATE $0, FLATTEN($1), FLATTEN($2), FLATTEN($3), FLATTEN($4), FLATTEN($5), $6;

Is my understanding correct that the output would be as follows?

Output:

1;7287026502032012,18;706;101200010;17286;oz;2.5
1;7287026502032012,18;707;101200011;17287;oz1;2.5

Re: Using PIG Latin to replace multiple strings from same field

Contributor

Created 01-03-2017 12:19 PM

Hi @Greg Keys

Could you please provide input on my clarifications?
