New Contributor
Posts: 8
Registered: 11-15-2013

Morphline: IOException Not a data file



I'm having some problems passing an avro_event through to Morphlines.


When I skip the SolrSink in my Flume config and just write to a file (file-roll sink) using an avro_event serializer, I get a file with the complete event in it.


java -jar ~/avro-tools-1.7.4.jar tojson ../flume/1386248426733-1
{"headers":{"timestamp":"1386248331991","id":"e96dc77f-3b07-4b5d-9e2e-7b641936c0f1","hostname":"","log_type":"com_job"},"body":"[2013-11-04 05:51:34,155][Thread-27][ERROR][..."}


When I enable the SolrSink with the most basic morphline configuration:


morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readAvroContainer {} }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

I get the following stack trace:


TRACE com.cloudera.cdk.morphline.avro.ReadAvroContainerBuilder$ReadAvroContainer: beforeProcess: {_attachment_body=[[B@4ea20232], hostname=[], id=[77ae7588-b64a-41af-98e6-006730a28734], log_type=[com_job], timestamp=[1386248421968]}
2013-12-05 05:50:21,176 ERROR org.apache.flume.sink.solr.morphline.MorphlineSink: Morphline Sink SolrOut: Unable to process event from channel mc1. Exception follows.
com.cloudera.cdk.morphline.api.MorphlineRuntimeException: com.cloudera.cdk.morphline.api.MorphlineRuntimeException: Not a data file.
	at com.cloudera.cdk.morphline.base.FaultTolerance.handleException(
	at org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl.process(
	at org.apache.flume.sink.solr.morphline.MorphlineSink.process(
	at org.apache.flume.sink.DefaultSinkProcessor.process(
	at org.apache.flume.SinkRunner$
Caused by: com.cloudera.cdk.morphline.api.MorphlineRuntimeException: Not a data file.
	at com.cloudera.cdk.morphline.stdio.AbstractParser.doProcess(
	at com.cloudera.cdk.morphline.base.AbstractCommand.process(
	at com.cloudera.cdk.morphline.base.AbstractCommand.doProcess(
	at com.cloudera.cdk.morphline.base.AbstractCommand.process(
	at org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl.process(
	... 4 more
Caused by: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(
	at org.apache.avro.file.DataFileReader.<init>(
	at com.cloudera.cdk.morphline.avro.ReadAvroContainerBuilder$ReadAvroContainer.doProcess(
	at com.cloudera.cdk.morphline.stdio.AbstractParser.doProcess(
	... 8 more

Can somebody explain where this is coming from?


Thank you!




Cloudera Employee
Posts: 146
Registered: 08-21-2013

Re: Morphline: IOException Not a data file

This means your morphline sink is not receiving valid Avro data.
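
As a rough illustration of where "Not a data file" comes from: the Avro object container file format begins with the four bytes `Obj` followed by 0x01, and a container reader rejects input that does not start with them. If the sink hands the morphline only the event body (which the `_attachment_body` field in the trace above suggests, though that is an assumption here), the reader sees log text rather than that header. The Python below only mimics the header check; the function name and sample payloads are hypothetical, with the log body borrowed from the example event in the first post.

```python
# Avro object container files begin with these four magic bytes.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(payload: bytes) -> bool:
    """Return True if payload starts with the Avro container-file magic."""
    return payload[:4] == AVRO_MAGIC


# Hypothetical payloads: a raw log body like the one in the example event,
# and a byte string that merely starts with the container magic.
raw_log_body = b"[2013-11-04 05:51:34,155][Thread-27][ERROR][..."
container_like = AVRO_MAGIC + b"...rest of header and data blocks..."

print(looks_like_avro_container(raw_log_body))    # False
print(looks_like_avro_container(container_like))  # True
```

Running the same kind of check on the first few bytes of the file written by the file-roll sink versus the bytes the morphline receives would show whether the two really are the same data.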

New Contributor
Posts: 8
Registered: 11-15-2013

Re: Morphline: IOException Not a data file


That's indeed what the exception says.

The real question then is: why is that event not valid?


When I just skip Morphline and write the avro event to file then it looks OK.


I can extract the schema with the avro-tools jar, and I can extract JSON as shown in my example.


When I read it with readAvroContainer, then I get the exception.


Btw, the event is pulled in from an AvroSource. If the Avro data were invalid, I would expect the source to complain and throw an exception, no?



New Contributor
Posts: 8
Registered: 11-15-2013

Re: Morphline: IOException Not a data file

To be clear, what I want to do is chain an AvroSource to a SolrSink.


collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1

collector.channels = mc1
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.sinks = SolrOut

collector.sinks.SolrOut.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
collector.sinks.SolrOut.channel = mc1
collector.sinks.SolrOut.batchSize = 1
collector.sinks.SolrOut.batchDurationMillis = 1000
collector.sinks.SolrOut.morphlineFile = morphlines.conf
collector.sinks.SolrOut.morphlineId = morphline1

The first thing the morphline does is readAvroContainer, and it fails with the "Not a data file" error.


These events can have multiline bodies that are already aggregated into single events at the source. 

readLine works on the event, but it splits each event into separate records at the newlines in the body.

readMultiline is pretty hard to do, because the regexp can be different from event to event (based on the source of the event).

I'm using a tiered approach just because I don't want to bother the backend tier when there are new sources that need new regexps.


I need to be able to process the body as a whole, regardless of whether it is single-line or multiline, and I thought readAvroContainer would do the job.
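
To make the difference concrete, here is a small plain-Python sketch (not morphline code) of per-line splitting versus whole-body handling. Both function names and the sample body are hypothetical and only illustrate the behaviour described above.

```python
from typing import List


def split_per_line(body: str) -> List[str]:
    """One record per line, the behaviour described for readLine."""
    return body.splitlines()


def keep_whole(body: str) -> List[str]:
    """One record for the entire body, however many lines it has."""
    return [body]


# Hypothetical multiline body: an error line followed by two stack frames.
body = (
    "[2013-11-04 05:51:34,155][Thread-27][ERROR][...\n"
    "\tat com.example.Job.run(...)\n"
    "\tat java.lang.Thread.run(...)"
)

print(len(split_per_line(body)))  # 3 records, the event is torn apart
print(len(keep_whole(body)))      # 1 record, the event stays intact
```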



Expert Contributor
Posts: 139
Registered: 07-21-2014

Re: Morphline: IOException Not a data file

I'm running into a similar error and wanted to know how you got it resolved. Please let me know, thanks!

New Contributor
Posts: 3
Registered: 11-04-2016

Re: Morphline: IOException Not a data file

Do we have any answer for this?