I need to know how to use regex for new line in pig latin

New Contributor

I am working with a catalina log. My input looks like the lines below. I have no problem with the date and the other fields, but I need to read the next line, which begins with INFO. I have tried a lot but I do not know how to bring in the next line. I have used \\n and \\r but they did not work.

My regex is like this:

A= LOAD 'catalina.log' USING TextLoader AS (line:chararray);

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([a-zA-Z]{3}\\s[0-9]{1,2},\\s[0-9]{4}\\s[0-9]{1,2}:[0-9]{2}:[0-9]{2}\\s[A-Z]{2})\\n+(.*)INFO:(.*)'));

DUMP B;

input : Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor

INFO: Deploying configuration descriptor host-manager.xmlF

output: Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor

Re: I need to know how to use regex for new line in pig latin

Guru

Unfortunately, Pig loads data line by line. It also processes data one record at a time and does not hold earlier records in memory, so there is no way to operate across multiple lines.

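Similarly, a regex is applied to one record at a time, and a record never contains the newline that separated it from the next line, so a newline pattern such as \\n has nothing to match. Once records are loaded you are forced to operate record by record (though of course you can aggregate into sum, average, etc.).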
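To illustrate the point outside of Pig, here is a minimal Python sketch (not Pig code). It assumes a pattern similar in spirit to the one in the question, rearranged so that the newline sits between the first line and INFO, and it uses the sample input from the question. The names PATTERN, whole_match and per_line_matches are only illustrative.

```python
import re

# Timestamp, rest of the first line, one or more newlines, then "INFO:" and its message.
PATTERN = re.compile(
    r"^([A-Za-z]{3}\s[0-9]{1,2},\s[0-9]{4}\s[0-9]{1,2}:[0-9]{2}:[0-9]{2}\s[A-Z]{2})"
    r"(.*)\n+INFO:(.*)"
)

text = (
    "Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor\n"
    "INFO: Deploying configuration descriptor host-manager.xml\n"
)

# Whole text as a single string: the newline is present, so the pattern can match.
whole_match = PATTERN.search(text)

# One record per line, which is what a line-oriented loader hands to the regex:
# no record contains "\n", so the same pattern cannot match any of them.
per_line_matches = [PATTERN.search(line) for line in text.splitlines()]

print(whole_match is not None)                   # match on the whole text
print([m is not None for m in per_line_matches]) # no match on any single line
```

The pattern itself is fine when it sees both lines at once. It fails only because each record handed to it is a single line.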

There is one possibility for processing multiple lines, but it will not work in your case: if you have fields in double quotes that contain a newline inside the field, then you can use piggybank's CSVExcelStorage to remove them. Since you are using log data, this will not work for you.

https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/piggybank/storage/class-use/CSVExcelStorage.M...

You will have to preprocess the data using another programming paradigm to group your lines (the INFO line and the next n lines) together.

Suggestions are:

  • Spark
  • MapReduce program where you implement your own InputFormat or RecordReader
  • NiFi (using ExtractText processor and regex, where Enable Multiline Mode = false), typically outside of Hadoop
  • awk or sed (outside of Hadoop)
  • Java or Groovy (outside of Hadoop)
  • Python, R, etc. (outside of Hadoop)
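As a rough illustration of the Python option, the sketch below merges each timestamp line with the lines that follow it into a single record. It assumes that every record starts with a timestamp like the one in your sample, and that a tab is an acceptable separator. The names RECORD_START and merge_records, and the second sample record, are made up for the example.

```python
import re

# Assumption: a new record starts with a timestamp such as "Nov 3, 2016 11:00:06 AM".
RECORD_START = re.compile(
    r"^[A-Za-z]{3}\s[0-9]{1,2},\s[0-9]{4}\s[0-9]{1,2}:[0-9]{2}:[0-9]{2}\s[AP]M\b"
)

def merge_records(lines, sep="\t"):
    """Yield one string per log record, joining continuation lines with sep."""
    current = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines between entries
        if RECORD_START.match(line) and current:
            # A new timestamp closes the record collected so far.
            yield sep.join(current)
            current = []
        current.append(line)
    if current:
        yield sep.join(current)

# First record is the sample from the question; the second is hypothetical.
sample = [
    "Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor\n",
    "INFO: Deploying configuration descriptor host-manager.xml\n",
    "Nov 3, 2016 11:00:07 AM org.apache.catalina.startup.HostConfig deployDescriptor\n",
    "INFO: Deploying configuration descriptor manager.xml\n",
]
merged = list(merge_records(sample))
for record in merged:
    print(record)
```

If you write the merged records to a new file, each record is a single line again, so Pig's line-by-line loading is no longer a problem and your regex can match the separator (here a tab) where it currently expects a newline.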

These look like good solutions for you (using Spark):

If this is what you are looking for let me know by accepting the answer; else, let me know of any gaps or follow up questions.

Re: I need to know how to use regex for new line in pig latin

New Contributor

OK, thanks a lot for the help. I will try it.
