
Problems using 'readMultiLine'



New Contributor

I'm trying to ingest ND JSON records like this:

        {
            "latitude": "44.041803",
            "occurrence_date": "2015-02-24 13:40:48",
            "longitude": "-123.082105",
            "external_device_id": "ER00E925"
        }

        {
            "latitude": "44.044547",
            "occurrence_date": "2015-02-24 13:41:19",
            "longitude": "-123.082082",
            "external_device_id": "ER00E925"
        }


I'm using a spool source to read the input file, and I tried to use 'readMultiLine' to turn the records into this:

{ "latitude": "44.041803", "occurrence_date": "2015-02-24 13:40:48", "longitude": "-123.082105", "external_device_id": "ER00E925"},

{ "latitude": "44.044547", "occurrence_date": "2015-02-24 13:41:19", "longitude": "-123.082082", "external_device_id": "ER00E925"}
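For context, the kind of morphline being attempted might look like the sketch below. The readMultiLine command and its regex, what, and charset options are real Kite Morphlines features, but the particular pattern here is only an illustration of the attempt, not a working configuration (the reply below explains why it cannot work with a line-oriented source):

```
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        # Append any line that starts with an indented quoted field
        # to the previous line; the regex is illustrative only.
        readMultiLine {
          regex : """^\s+".*"""
          what : previous
          charset : UTF-8
        }
      }
    ]
  }
]
```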


But instead I still get each input line output as an individual message.

I've tried various combinations of 'regex', 'what', etc.


I searched, found the log4j example, and cut and pasted both the data and the morphline, with the same results.


Am I misunderstanding the fundamental operation of 'readMultiLine', doing something trivially stupid, or both?






Re: Problems using 'readMultiLine'

Expert Contributor
The Flume spool source emits one Flume event per input line, so the morphline never receives an event that contains multiple lines, and readMultiLine can therefore never emit more than a single line per Flume event. You may be able to work around this by configuring the Flume spool source to use the BlobDeserializer, which emits the entire input file as a single event (not applicable to large files due to RAM pressure).
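As a sketch, that change might look like this in the agent configuration. The agent and source names (a1, src) and the spool directory are placeholders; the deserializer class is the one shipped with the Flume morphline Solr sink:

```
a1.sources.src.type = spooldir
a1.sources.src.spoolDir = /var/spool/ndjson
a1.sources.src.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
# Optional: the BlobDeserializer's size cap in bytes; raise it if input files are large
a1.sources.src.deserializer.maxBlobLength = 100000000
```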



Re: Problems using 'readMultiLine'

New Contributor

Ahh, OK I see.

I had previously tried changing the Spool deserializer from LINE to Blob.

My source is limited to max of 5000 'records', so RAM demands are low.

That enabled me to use a regex to eliminate newlines and aggregate lines into 'records',

but I got errors complaining that I could not create multiple output events from one (blob) input event.


Perhaps I can serialize the single aggregated output blob into individual 'records' at the Sink.

Or I could write a custom ND JSON deserializer for the Spool source that emits JSON 'records' the way I want them.
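Another option is to normalize the file before it reaches the spool directory. As a minimal sketch (the function name to_ndjson is mine, and it assumes the input is a stream of pretty-printed JSON objects like the records above, optionally separated by commas or blank lines), a small preprocessing script could collapse each object onto one line so the default LINE deserializer emits one event per record:

```python
import json

def to_ndjson(text):
    """Collapse a stream of pretty-printed JSON objects into ND JSON,
    one object per line. Assumes the input contains only JSON objects,
    optionally separated by commas and/or whitespace."""
    decoder = json.JSONDecoder()
    text = text.strip()
    lines = []
    idx = 0
    while idx < len(text):
        # raw_decode parses one JSON value starting exactly at idx
        obj, end = decoder.raw_decode(text, idx)
        lines.append(json.dumps(obj, separators=(", ", ": ")))
        idx = end
        # skip any separators (commas, whitespace) between objects
        while idx < len(text) and text[idx] in ", \t\r\n":
            idx += 1
    return "\n".join(lines)
```

Running the spooled files through this before Flume picks them up would sidestep both the readMultiLine limitation and the one-event-per-blob restriction.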


Thank you very much for straightening me out on readMultiLine.



