
Unstructured data into structured data using Pig processing

Explorer

Hello all,

 

I am trying to turn unstructured data into structured data using Pig. I am able to process single-line syslog sample log data like

 

dev_id=03   user_id=000 int_ip=198.0.13.24  ext_ip=68.67.0.14   src_port=99 dest_port=213   response_code=5

 

using REGEX_EXTRACT.

 

I am looking for suggestions on how to process multi-line log data. Here is a multi-line Active Directory sample log:

 

Aug  7 13:17:56 198.50.167.202 Aug 07 13:12:28 lab pool-43-thread-10 com.broadhop.radius.impl.actions.ProxyAccountingRequest Error proxying radius accounting message: com.broadhop.radius.messages.impl.RadiusAccountingMessage request:	
	NAS-PORT = 0
	"CISCO-AVPAIR = connect-progress=Call Up,portbundle=enable"
	SERVICE-TYPE = 2
	ACCT-INPUT-PACKETS = 79
	ACCT-OUTPUT-PACKETS = 87
	ACCT-INPUT-OCTETS = 11090
	NAS-IDENTIFIER = XYZRS001.abc.com
	DEvice_ID = S198.50.122.89:632
	NAS-PORT-ID = 0/0/0/523
	ACCT-STATUS-TYPE = 3
	PORT-TYPE = 5
	Source_ip = 100.87.197.132
	PROTOCOL = 1
	ACCT-AUTHENTIC = 2
	ACCT-DELAY-TIME = 0
	ext_IP_ADDRESS = 198.50.122.89
	gateway= xyz4.3rde.3235
	ACCT-INPUT-OCTETS = 416817
	NAS-IDENTIFIER = XYZRS001.abc.com
Aug  7 13:17:55 198.50.167.202 Aug 07 13:12:27 lab pool-43-thread-4 com.broadhop.radius.impl.actions.ProxyAccountingRequest Error proxying radius accounting message: com.broadhop.radius.messages.impl.RadiusAccountingMessage request:	ACCT-DELAY-TIME = 0
	ext_IP_ADDRESS = 198.50.122.89
	gateway= 754843.zscjsahc.45f33
	USER-NAME = 754843.zscjsahc.45f33
	ACCT-SESSION-TIME = 763
	"CISCO-CONTROL-INFO = I0;416817,O0;1681229"
	ACCT-OUTPUT-OCTETS = 1681229
	ACCT-SESSION-ID = 000AF829 to server Digital Route
	
	NAS-PORT = 0
	"CISCO-AVPAIR = parent-session-id=000AF828,portbundle=enable"
	SERVICE-TYPE = 2
	ACCT-INPUT-PACKETS = 3263
	ACCT-OUTPUT-PACKETS = 5089
	ACCT-INPUT-OCTETS = 416817
	NAS-IDENTIFIER = XYZRS001.abc.com
	DEvice_ID = askcjas.3543i5:09878
	NAS-PORT-ID = 0/0/0/523
	ACCT-STATUS-TYPE = 3
	PORT-TYPE = 5
	PROTOCOL = 1
	Source_ip = 100.87.197.142
	SERVICE-INFO = N10M-UP-DOWN
	ACCT-DELAY-TIME = 0
	ext_IP_ADDRESS = 198.50.122.89
	gateway= 754843.zscjsahc.45f33
	USER-NAME = 754843.zscjsahc.45f33
	ACCT-SESSION-TIME = 763
	"CISCO-CONTROL-INFO = I0;416817,O0;1681229"
	ACCT-OUTPUT-OCTETS = 1681229
	ACCT-SESSION-ID = 000AF829 to server Digital Route

 Expected output:

 

Aug 7 13:17:56, 198.50.167.202, Aug 07 13:12:28, 0, "CISCO-AVPAIR = connect-progress=Call Up,portbundle=enable", 2, 79, 87, 11090, XYZRS001.abc.com, S198.50.122.89:632, 0/0/0/523, 3, 5, 100.87.197.132, 1, 2, 0, 198.50.122.89, xyz4.3rde.3235, 416817, XYZRS001.abc.com

Aug  7 13:17:55, 198.50.167.202, Aug 07 13:12:27, 198.50.122.89, 754843.zscjsahc.45f33, 754843.zscjsahc.45f33, 763, "CISCO-CONTROL-INFO = I0;416817,O0;1681229", 1681229, 000AF829 to server Digital Route, 0, "CISCO-AVPAIR = parent-session-id=000AF828,portbundle=enable", 2, 3263, 5089, 416817, XYZRS001.abc.com, askcjas.3543i5:09878, 0/0/0/523, 3, 5, 1, 100.87.197.142, N10M-UP-DOWN, 0, 198.50.122.89, 754843.zscjsahc.45f33, 754843.zscjsahc.45f33, 763, "CISCO-CONTROL-INFO = I0;416817,O0;1681229", 1681229, 000AF829 to server Digital Route

My objective is to remove the unnecessary information and store the rest in CSV format, so that it can be exported directly into an HBase table for report generation etc.

 

Here are my questions:

 

1. As the multi-line sample log shows, the second record has extra fields (columns) compared to the first record. If the log contains records with different fields, so that the number of field values keeps changing, is it still possible to process it?

 

2. Is it possible to process a multi-line log using a script or a user-defined function written in Java or Python?

 

We can perform near-real-time search using Cloudera Search. To build a model, however, I think we need to parse the data, extract the useful fields and then build the model. Correct me if I am wrong. I look forward to your reply.

 

Your help is highly appreciated.


Thank you.

 

 

3 REPLIES

Re: Unstructured data into structured data using Pig processing

Cloudera Employee

I'm afraid that Pig currently does not support multi-line inputs. You might need to write either a custom storage function or a UDF that will handle your input. If you are able to control the format at the source, it will be significantly simpler to generate one line containing everything.
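
As a rough illustration of the record-grouping logic that such a custom loader or UDF would need, here is a minimal Python sketch. It is not a Pig loader and is not wired into Pig in any way. It assumes that every record starts with a header line shaped like the ones in the sample (receive timestamp, host, log timestamp, free text ending in "request:") and that all following lines up to the next header belong to the same record. The names HEADER, parse_attribute and parse_records are made up for this example.

```python
import re

# Assumption: a header looks like the sample lines, i.e.
# "<received timestamp> <host> <logged timestamp> ... request:<optional first attribute>"
HEADER = re.compile(
    r"^(?P<received>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<logged>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r".*?request:\s*(?P<rest>.*)$"
)


def parse_attribute(text):
    """Split 'KEY = value' on the first '='; return None for blank or odd lines."""
    text = text.strip().strip('"')
    if "=" not in text:
        return None
    key, _, value = text.partition("=")
    return key.strip(), value.strip()


def parse_records(lines):
    """Group header and continuation lines into one dict per record.

    Attributes are kept as an ordered list of (key, value) pairs, so records
    may have different fields and a key may repeat within one record.
    """
    record = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = HEADER.match(line)
        if match:
            if record is not None:
                yield record
            record = {
                "received": match.group("received"),
                "host": match.group("host"),
                "logged": match.group("logged"),
                "attributes": [],
            }
            line = match.group("rest")  # an attribute may share the header line
        elif record is None:
            continue  # ignore continuation lines that precede the first header
        attribute = parse_attribute(line)
        if attribute is not None:
            record["attributes"].append(attribute)
    if record is not None:
        yield record
```

Calling `list(parse_records(open("sample.log")))` (the file name is hypothetical) would give one dict per log entry. Because the attributes are stored as a list rather than as fixed columns, the varying number of fields per record is not a problem at this stage; the column layout only has to be decided when the records are written out.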

Re: Unstructured data into structured data using Pig processing

Explorer

Thanks for the reply, jarcec.

 

I read that "Morphline" is capable of handling multi-line inputs, but I am not sure about that. I am not familiar with single-line data generation; can you please suggest a link or some information on generating one line per entry at the source?

 

Thanks

Re: Unstructured data into structured data using Pig processing

Cloudera Employee

Hi sir,

By "generation on source" I meant the system that is generating the multi-line data in the first place. If you can alter it, e.g. so that it generates only single-line entries, then your problem with Pig will disappear.
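
If the generating system itself cannot be changed, a similar effect could perhaps be had with a small preprocessing pass that runs before Pig and joins each multi-line entry into one line. The Python sketch below is only an illustration of that idea, under the assumption that every entry starts with a syslog-style timestamp such as "Aug  7 13:17:56" and that every other line continues the previous entry. The names RECORD_START, KEY_VALUE_GAP and flatten are invented for this example.

```python
import re

# Assumption: a new entry begins with a timestamp like "Aug  7 13:17:56 ".
RECORD_START = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} ")
# Whitespace around the first "=" of a piece, e.g. "NAS-PORT = 0".
KEY_VALUE_GAP = re.compile(r"\s*=\s*")


def flatten(lines, sep="\t"):
    """Yield one single-line string per multi-line entry.

    Each "KEY = value" piece is rewritten as "KEY=value", and the pieces are
    joined with `sep`, so the result resembles a key=value syslog line.
    """
    parts = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if RECORD_START.match(line):
            if parts:
                yield sep.join(parts)
            parts = []
        elif parts is None:
            continue  # skip continuation lines that precede the first entry
        # The header line may carry a tab-separated first attribute.
        for piece in line.split("\t"):
            piece = piece.strip().strip('"')
            if piece:
                parts.append(KEY_VALUE_GAP.sub("=", piece, count=1))
    if parts:
        yield sep.join(parts)
```

Each yielded line could then be written to a new file and loaded by Pig as ordinary single-line input. The tab separator is an arbitrary choice here; any delimiter that does not occur in the values would do.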

 

Jarcec
