Morphline: IOException Not a data file

New Contributor



I'm having some problems passing an avro_event through to Morphlines.


When I skip the SolrSink in my Flume config and just write to a file (file-roll sink) using the avro_event serializer, I get a file with the complete event in it.


java -jar ~/avro-tools-1.7.4.jar tojson ../flume/1386248426733-1
{"headers":{"timestamp":"1386248331991","id":"e96dc77f-3b07-4b5d-9e2e-7b641936c0f1","hostname":"","log_type":"com_job"},"body":"[2013-11-04 05:51:34,155][Thread-27][ERROR][..."}


When I enable the SolrSink with the most basic morphline configuration:


morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readAvroContainer {} }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

I get the following stack trace:


TRACE com.cloudera.cdk.morphline.avro.ReadAvroContainerBuilder$ReadAvroContainer: beforeProcess: {_attachment_body=[[B@4ea20232], hostname=[], id=[77ae7588-b64a-41af-98e6-006730a28734], log_type=[com_job], timestamp=[1386248421968]}
2013-12-05 05:50:21,176 ERROR org.apache.flume.sink.solr.morphline.MorphlineSink: Morphline Sink SolrOut: Unable to process event from channel mc1. Exception follows.
com.cloudera.cdk.morphline.api.MorphlineRuntimeException: com.cloudera.cdk.morphline.api.MorphlineRuntimeException: Not a data file.
	at com.cloudera.cdk.morphline.base.FaultTolerance.handleException(
	at org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl.process(
	at org.apache.flume.sink.solr.morphline.MorphlineSink.process(
	at org.apache.flume.sink.DefaultSinkProcessor.process(
	at org.apache.flume.SinkRunner$
Caused by: com.cloudera.cdk.morphline.api.MorphlineRuntimeException: Not a data file.
	at com.cloudera.cdk.morphline.stdio.AbstractParser.doProcess(
	at com.cloudera.cdk.morphline.base.AbstractCommand.process(
	at com.cloudera.cdk.morphline.base.AbstractCommand.doProcess(
	at com.cloudera.cdk.morphline.base.AbstractCommand.process(
	at org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl.process(
	... 4 more
Caused by: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(
	at org.apache.avro.file.DataFileReader.<init>(
	at com.cloudera.cdk.morphline.avro.ReadAvroContainerBuilder$ReadAvroContainer.doProcess(
	at com.cloudera.cdk.morphline.stdio.AbstractParser.doProcess(
	... 8 more

 Can somebody explain where this is coming from?


Thank you!





Re: Morphline: IOException Not a data file

Expert Contributor
This means the morphline sink is not receiving valid Avro data.
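
To illustrate where the message likely comes from: an Avro container file starts with a four-byte magic header (ASCII "Obj" followed by the byte 0x01), and the trace shows the failure in DataFileStream.initialize, which is where a container header would be read. The Python sketch below is not Avro's or the morphline's code. `looks_like_avro_container` is a hypothetical helper, and the sample body is copied from the JSON dump in the question (already truncated there). The assumption is that the event body reaching the sink is the raw log text rather than a container file.

```python
# Avro object container files begin with a 4-byte magic:
# ASCII "Obj" followed by the byte 0x01.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(payload: bytes) -> bool:
    """Return True if the payload starts with the Avro container-file magic."""
    return payload.startswith(AVRO_MAGIC)


# Event body as shown in the question (truncated there with "...").
event_body = b"[2013-11-04 05:51:34,155][Thread-27][ERROR][..."

# Hypothetical container prefix; only the first four bytes matter for this check.
container_prefix = AVRO_MAGIC + b"\x00"

print(looks_like_avro_container(event_body))
print(looks_like_avro_container(container_prefix))
```

If that assumption holds, the file written by the file-roll sink is readable by avro-tools because the avro_event serializer wraps each event in a container file, while the body handed to the morphline is only the log line itself.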

Re: Morphline: IOException Not a data file

New Contributor

That's indeed what the exception says.

The real question then is: why is that event not valid?


When I just skip Morphline and write the avro event to file then it looks OK.


I can extract the schema with the avro-tools jar, and I can extract JSON as shown in my example.


When I read it with readAvroContainer, then I get the exception.


Btw, the event is pulled in from an AvroSource. If the Avro data were invalid, I would expect the source to complain and throw an exception, no?




Re: Morphline: IOException Not a data file

New Contributor

To be clear, what I want to do is chain an AvroSource to a SolrSink.


collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1

collector.channels = mc1
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.sinks = SolrOut

collector.sinks.SolrOut.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
collector.sinks.SolrOut.channel = mc1
collector.sinks.SolrOut.batchSize = 1
collector.sinks.SolrOut.batchDurationMillis = 1000
collector.sinks.SolrOut.morphlineFile = morphlines.conf
collector.sinks.SolrOut.morphlineId = morphline1

The first thing the morphline does is run readAvroContainer, and it fails with the "Not a data file" error.


These events can have multiline bodies that are already aggregated into single events at the source. 

readLine works on the event, but it splits each event into separate records at the newlines in the body.

readMultiline is pretty hard to do, because the regexp can be different from event to event (based on the source of the event).

I'm using a tiered approach just because I don't want to bother the backend tier when there are new sources that need new regexps.


I need to be able to just process the body as a whole independent of the fact that it is single- or multiline and I thought readAvroContainer would do the job.
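
To make the difference concrete, here is a small Python illustration of the two behaviours described above. It is not morphline code: the sample body is made up, and `split_into_lines` and `read_whole` are hypothetical helpers that only mimic line-oriented reading versus treating the body as one value.

```python
from typing import List

# Made-up multiline event body: an error line followed by a stack trace.
body = (
    b"[2013-11-04 05:51:34,155][Thread-27][ERROR] something failed\n"
    b"java.lang.IllegalStateException: example\n"
    b"\tat com.example.Job.run(Job.java:1)\n"
)


def split_into_lines(payload: bytes) -> List[str]:
    """Line-oriented reading: one output record per line of the body."""
    return payload.decode("utf-8").splitlines()


def read_whole(payload: bytes) -> List[str]:
    """Whole-body reading: a single output record, newlines preserved."""
    return [payload.decode("utf-8")]


print(len(split_into_lines(body)))
print(len(read_whole(body)))
```

The second form is what I'm after: one record per event, regardless of how many lines the body contains.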



Re: Morphline: IOException Not a data file

Expert Contributor

I'm running into a similar error and wanted to know how you got it resolved. Please let me know, thanks!

Re: Morphline: IOException Not a data file

New Contributor

Do we have any answer for this?