I need to know how to use a regex to match a newline in Pig Latin

Rising Star

I am using a Catalina log. My input looks like the lines below. I have no problem with the date and the other fields, but I need to read the next line, the part that starts after INFO. I have tried a lot, but I do not know how to bring in the next line. I have used \\n and \\r, but they did not work.

My regex looks like this:

A= LOAD 'catalina.log' USING TextLoader AS (line:chararray);

B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([a-zA-Z]{3}\\s[0-9]{1,2},\\s[0-9]{4}\\s[0-9]{1,2}:[0-9]{2}:[0-9]{2}\\s[A-Z]{2})\\n+(.*)INFO:(.*)'));

DUMP B;

input : Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor

INFO: Deploying configuration descriptor host-manager.xmlF

output: Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor

1 ACCEPTED SOLUTION

Guru

Unfortunately, when Pig loads data it does so line by line. It also processes the data line by line and does not hold earlier lines in memory, so there is no way to operate over multiple lines.

Similarly, a newline in a regex has nothing to match: once records are loaded, you are forced to operate record by record (though of course you can aggregate into sum, average, etc.).
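
To make that concrete, here is a rough Python analogy. It is not Pig itself, only a sketch that assumes the loader hands the regex one line at a time. The pattern is a simplified, hypothetical variant of the one in the question, and the sample text is adapted from the question's input.

```python
import re

# Two physical lines adapted from the question's catalina.log sample.
log_text = (
    "Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor\n"
    "INFO: Deploying configuration descriptor host-manager.xml\n"
)

# Hypothetical pattern: timestamp, rest of the first line,
# one or more newlines, then the INFO message on the following line.
pattern = re.compile(
    r"^([A-Za-z]{3}\s\d{1,2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s[AP]M)\s(.*)\n+INFO:\s(.*)$"
)

# Line-by-line loading: each record is a single line with the newline
# stripped, so a pattern that requires "\n" cannot match any record.
records = log_text.splitlines()
per_record = [pattern.match(record) for record in records]

# The same pattern matches only when both lines are visible at once.
whole_text = pattern.match(log_text.strip())

print(per_record)
print(whole_text.groups())
```

The per-record attempts all fail, while the match against the full text succeeds. That is the situation Pig is in after a line-oriented load.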

There is one way to process multiple lines, but it will not work in your case: if you have fields in double quotes that contain a newline, you can use piggybank's CSVExcelStorage to remove them. Since you are using log data, this will not work for you.

https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/piggybank/storage/class-use/CSVExcelStorage.M...

You will have to preprocess the data using another programming paradigm to group your lines (INFO and the next n lines) together.

Suggestions are:

  • Spark
  • MapReduce program where you implement your own InputFormat or RecordReader
  • NiFi (using ExtractText processor and regex, where Enable Multiline Mode = false), typically outside of Hadoop
  • awk or sed (outside of Hadoop)
  • Java or Groovy (outside of Hadoop)
  • Python, R, etc. (outside of Hadoop)
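
As one illustration of the preprocessing idea, here is a minimal Python sketch that folds each continuation line into the timestamped line before it, so every event ends up on a single line. It assumes a new event always starts with a date such as `Nov 3, 2016`. The function name `merge_events` and the tab separator are hypothetical choices, and the second event in the sample is invented for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumption: every new log event begins with a date like "Nov 3, 2016".
EVENT_START = re.compile(r"^[A-Za-z]{3}\s\d{1,2},\s\d{4}\s")


def merge_events(lines: Iterable[str], sep: str = "\t") -> Iterator[str]:
    """Yield one string per event, joining continuation lines with `sep`."""
    current = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines between entries
        if EVENT_START.match(line) and current:
            # A new event starts: emit the one collected so far.
            yield sep.join(current)
            current = []
        current.append(line)
    if current:
        yield sep.join(current)


sample = [
    "Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor\n",
    "INFO: Deploying configuration descriptor host-manager.xml\n",
    # Invented second event, only to show where one event ends and the next begins.
    "Nov 3, 2016 11:00:07 AM org.apache.catalina.startup.HostConfig deployDescriptor\n",
    "INFO: Deploying configuration descriptor manager.xml\n",
]

merged = list(merge_events(sample))
for event in merged:
    print(event)
```

If the merged lines are written to a file, Pig can load each event as a single record, and the regex can then look for the separator (a tab in this sketch) where it previously expected a newline.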

These look like good solutions for you (using Spark):

If this is what you are looking for, let me know by accepting the answer; otherwise, let me know of any gaps or follow-up questions.

2 REPLIES

Rising Star

OK, thanks a lot for the help. I will try it.