Created 11-24-2016 09:15 AM
I am parsing a Catalina log. My input looks like the lines below. I have no problem with the date and the other fields, but I also need to read the next line, the one that starts with INFO. I have tried a lot but do not know how to bring in the next line. I have used \\n and \\r, but they did not work.
My regex is like this:
A = LOAD 'catalina.log' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^([a-zA-Z]{3}\\s[0-9]{1,2},\\s[0-9]{4}\\s[0-9]{1,2}:[0-9]{2}:[0-9]{2}\\s[A-Z]{2})\\n+(.*)INFO:(.*)'));
DUMP B;
input : Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor
INFO: Deploying configuration descriptor host-manager.xmlF
output: Nov 3, 2016 11:00:06 AM org.apache.catalina.startup.HostConfig deployDescriptor
Created 11-24-2016 12:47 PM
Unfortunately, Pig loads data line by line. It also processes data line by line and does not hold neighboring lines in memory, so there is no way to operate over multiple lines.
Similarly, a regex is applied to one record at a time, so a newline pattern such as \\n has nothing to match. Once records are loaded you are forced to operate on a record-by-record basis (though of course you can aggregate into sum, average, etc.).
There is one possibility for processing multiple lines, but it will not work in your case: if you have fields in double quotes that contain a newline inside the field, you can use Piggybank's CSVExcelStorage to remove them. Since you are using log data, this will not work for you.
You will have to preprocess the data using another programming paradigm to group your lines (INFO and the next n lines) together.
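As a rough illustration of that preprocessing step, here is a minimal sketch in plain Python (not Spark). It assumes every record starts with a timestamp shaped like the one in your sample ("Nov 3, 2016 11:00:06 AM") and that any other line belongs to the record above it. The names join_records and join_file and the output file name are made up for this example.

```python
import re

# Assumption: a new log record always begins with a timestamp such as
# "Nov 3, 2016 11:00:06 AM"; every other line continues the previous record.
HEADER = re.compile(r"^[A-Za-z]{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M\b")


def join_records(lines, sep=" "):
    """Yield one string per log record, merging continuation lines into it."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if HEADER.match(line):
            # A timestamp starts a new record, so emit the finished one first.
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # Continuation line (for example "INFO: ..."): append to the record.
            current += sep + line
        else:
            # Lines before the first timestamp have no record to join.
            yield line
    if current is not None:
        yield current


def join_file(src_path, dst_path):
    """Write the joined records of src_path to dst_path, one per line."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in join_records(src):
            dst.write(record + "\n")
```

Calling join_file('catalina.log', 'catalina_joined.log') (the second name is hypothetical) would give a file where each timestamp line and its INFO line sit on one row. Pig could then load that file with TextLoader, and the regex would most likely need \\s+ in place of \\n+ and in front of INFO:, since the newline has become a space.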
Suggestions are:
These look like good solutions for you (using Spark):
If this is what you are looking for, let me know by accepting the answer; otherwise, let me know of any gaps or follow-up questions.
Created 12-02-2016 08:46 PM
OK, thanks a lot for the help. I will try it.