Dr
Explorer
Posts: 17
Registered: ‎08-14-2013

Flume morphline sink to HDFS

Guys,

I see there's already a Flume Morphline Solr Sink, but how can I use a morphline with Flume to write to HDFS without Solr? I'm currently using Flume to partition Avro data (AvroFlumeEventSerializer), but I'd like to use the ExtractAvroPath morphline to flatten the complex types so Impala can query them. Is this possible?

 

Thanks

Andrew

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Flume morphline sink to HDFS

You can use a Flume MorphlineInterceptor. Alternatively, you can continue to use Flume Morphline Solr Sink and write a custom command, say writeHdfs, that writes the data into HDFS rather than Solr.

Dr
Explorer
Posts: 17
Registered: ‎08-14-2013

Re: Flume morphline sink to HDFS

I thought the Flume MorphlineInterceptor couldn't generate more records than events. In my case I have complex types, and the array fields in particular can be large: one event can generate thousands of flattened records. So I don't think I could use it. I would be interested in writing a custom command, but for those of us who are Java-challenged, how would I do this? I would have thought a writeHDFS command would have been a stock component.

Thanks

Andrew

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Flume morphline sink to HDFS

Flume does not permit more than one output event per input event in an interceptor. This is a general limitation of any Flume interceptor. A Flume sink does not have this limitation, so the Flume Morphline Sink can emit more than one output event per input event.
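
To make the one-to-many case concrete, here is a minimal sketch of the kind of flattening being discussed: one input event with an array field becomes one flat record per array element. It uses only the JDK, not the Flume or morphline APIs; the class name FlattenSketch, the method flatten, and the field names orderId and items are hypothetical and chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Flume or the morphline library.
final class FlattenSketch {

    // Emits one flat record per element of the array field, copying the
    // remaining parent fields into every output record.
    static List<Map<String, Object>> flatten(Map<String, Object> event, String arrayField) {
        List<Map<String, Object>> out = new ArrayList<>();
        Object value = event.get(arrayField);
        if (!(value instanceof List)) {
            // No array field present: the event passes through as a single record.
            out.add(new LinkedHashMap<>(event));
            return out;
        }
        // An empty array yields zero output records.
        for (Object element : (List<?>) value) {
            Map<String, Object> flat = new LinkedHashMap<>(event);
            flat.put(arrayField, element); // replace the list with a single element
            out.add(flat);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("orderId", 42);
        event.put("items", Arrays.asList("apple", "pear", "plum"));
        // One input event produces three output records here.
        for (Map<String, Object> record : flatten(event, "items")) {
            System.out.println(record);
        }
    }
}
```

An interceptor could only return one of these records per input event, which is why this kind of fan-out has to happen in a sink.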

Custom commands can be implemented in Java.

Wolfgang

Dr
Explorer
Posts: 17
Registered: ‎08-14-2013

Re: Flume morphline sink to HDFS

OK, I see, I can embed Java in the config file to open a file and write the records. But won't I be opening lots of files? How can I batch the writes and roll the file like the Flume HDFS sink does?

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Flume morphline sink to HDFS

For this you'd need to write a custom morphline command in Java. Regarding batching, you could buffer up records and flush to the file after every N-th record, and also on receiving a commit via the doNotify() method. This is somewhat similar to what the existing loadSolr command does.
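
The buffering idea can be sketched independently of the morphline API. The following is a minimal, JDK-only sketch, assuming a custom command would call add() for each record it processes and commit() from its doNotify() handler; the class BatchingRecordWriter and its methods are hypothetical names, and a plain java.io.Writer stands in for whatever HDFS output stream a real command would open.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Flume or the morphline library.
final class BatchingRecordWriter {
    private final Writer out;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushCount = 0;

    BatchingRecordWriter(Writer out, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.out = out;
        this.batchSize = batchSize;
    }

    // Intended to be called once per record; flushes after every N-th record.
    void add(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Intended to be called when a commit is received; writes any leftovers.
    void commit() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return; // nothing to write, so this does not count as a flush
        }
        try {
            for (String record : buffer) {
                out.write(record);
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.clear();
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int buffered() { return buffer.size(); }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        BatchingRecordWriter writer = new BatchingRecordWriter(sink, 2);
        writer.add("r1");
        writer.add("r2"); // batch is full, triggers a flush
        writer.add("r3"); // stays buffered until commit
        writer.commit();
        System.out.print(sink);
    }
}
```

Keeping a single open writer and flushing in batches avoids opening a file per record. Rolling to a new file by size or time, as the Flume HDFS sink does, would need additional logic that this sketch does not cover.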

Dr
Explorer
Posts: 17
Registered: ‎08-14-2013

Re: Flume morphline sink to HDFS

Thanks, I'll have a look.

I do, however, think you missed a trick by not having this as a stock command. Some of my colleagues dismissed morphlines as only applicable to Solr. They wanted to do the same as me (have Flume flatten Avro when ingesting the data) but bypassed morphlines and have Pig scripts running instead.

HDFS is the backbone of Hadoop, so a standard way for morphlines to write to it, outside of Solr, would have been great!

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Flume morphline sink to HDFS

The reason something like writeToHDFS isn't a stock command yet is that it might be even better to enhance Flume a little, so that a Flume sink (here: the Morphline Sink) can send output to another Flume sink (via a Flume channel). This way you could plug the Morphline Sink between the Flume Source and the existing Flume HDFS Sink. At least that's one line of thought.

Wolfgang.

Dr
Explorer
Posts: 17
Registered: ‎08-14-2013

Re: Flume morphline sink to HDFS

In the meantime, could I write a custom morphline command to post the flattened Avro to another Flume agent?

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Flume morphline sink to HDFS

Sure. For example, a corresponding custom command could use an EmbeddedAgent to do so (http://flume.apache.org/FlumeDeveloperGuide.html#embedded-agent).