Flume morphline sink to HDFS

Explorer

Guys,

 

I see there's already a Flume Morphline Solr Sink, but how can I use a morphline with Flume to write to HDFS without Solr? I'm currently using Flume to partition Avro data (AvroFlumeEventSerializer), but I'd like to use the extractAvroPaths morphline command to flatten the complex types so Impala can query them. Is this possible?

 

Thanks

 

Andrew

11 REPLIES

Re: Flume morphline sink to HDFS

Expert Contributor
You can use a Flume MorphlineInterceptor. Alternatively, you can continue to use the Flume Morphline Solr Sink and write a custom command, say writeHdfs, that writes the data into HDFS rather than Solr.

Re: Flume morphline sink to HDFS

Explorer
I thought the Flume MorphlineInterceptor couldn't generate more records than events. In my case I have complex types, and the array fields in particular can be large; one event can generate thousands of flattened records. So I don't think I could use this?

I would be interested in writing a custom command, but for those of us who are Java challenged, how would I do this? I would have thought a writeHDFS would have been a stock component?

Thanks

Andrew

Re: Flume morphline sink to HDFS

Expert Contributor
Flume does not permit more than one output event per input event in an interceptor; it's a general limitation of any Flume interceptor. A Flume sink does not have this limitation, so the Flume Morphline Sink can emit more than one output event per input event.

Custom commands can be implemented in Java (see the sketch after this reply).

Wolfgang
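
For readers wondering what such a custom command looks like, here is a minimal sketch, assuming the Kite Morphlines SDK used by Cloudera Search. The writeHdfs name, the com.example package and the path parameter are made up for illustration, and the exact package names and AbstractCommand constructor signature may vary between SDK versions.

// Hypothetical "writeHdfs" morphline command skeleton (illustrative, not production code).
package com.example.morphline;

import java.util.Collection;
import java.util.Collections;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.CommandBuilder;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.AbstractCommand;

import com.typesafe.config.Config;

public final class WriteHdfsBuilder implements CommandBuilder {

  @Override
  public Collection<String> getNames() {
    // The name under which the command is referenced from a morphline config file.
    return Collections.singletonList("writeHdfs");
  }

  @Override
  public Command build(Config config, Command parent, Command child, MorphlineContext context) {
    return new WriteHdfs(this, config, parent, child, context);
  }

  private static final class WriteHdfs extends AbstractCommand {

    private final String path;

    WriteHdfs(CommandBuilder builder, Config config, Command parent, Command child,
              MorphlineContext context) {
      super(builder, config, parent, child, context);
      // Read a command parameter, e.g. writeHdfs { path : "/user/flume/out" }
      this.path = getConfigs().getString(config, "path");
      validateArguments();
    }

    @Override
    protected boolean doProcess(Record record) {
      // TODO: write the record (or its _attachment_body bytes) under the configured path,
      // then hand the record on to the next command in the morphline chain.
      return super.doProcess(record);
    }
  }
}

If I read the SDK docs correctly, the builder is then picked up from the morphline config via importCommands (e.g. importCommands : ["com.example.**"]) and the command can be used like any stock command.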

Re: Flume morphline sink to HDFS

Explorer

OK, I see: I can embed Java in the config file to open a file and write the records. But won't I be opening lots of files? How can I batch it and roll the file like the Flume HDFS Sink does?

Re: Flume morphline sink to HDFS

Expert Contributor
For this you'd need to write a custom morphline command in Java. Regarding batching, you could buffer up records and flush them to the file after every N-th record, and on receiving a commit via the doNotify() method, somewhat similar to what the existing loadSolr command does.
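
To make that batching idea concrete, a rough sketch of such a command's body is below, assuming the Kite Morphlines SDK plus the Hadoop FileSystem API; it would be created by a CommandBuilder like the one sketched earlier in this thread. The batch size, the hard-coded output path and the cast of _attachment_body to a byte array are illustrative assumptions, and a real version would also need rollback handling, file rolling and configurable paths.

// Sketch: a buffering "writeHdfs" command body (illustrative, not production code).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.CommandBuilder;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.MorphlineRuntimeException;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.AbstractCommand;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

import com.typesafe.config.Config;

final class BufferingWriteHdfs extends AbstractCommand {

  private static final int BATCH_SIZE = 1000;        // flush after every N-th record
  private final List<byte[]> buffer = new ArrayList<byte[]>();
  private FSDataOutputStream out;                    // lazily opened HDFS output file

  BufferingWriteHdfs(CommandBuilder builder, Config config, Command parent, Command child,
                     MorphlineContext context) {
    super(builder, config, parent, child, context);
  }

  @Override
  protected boolean doProcess(Record record) {
    // Buffer the serialized record body (assumed to be a byte array) and
    // flush once the batch is full.
    byte[] body = (byte[]) record.getFirstValue(Fields.ATTACHMENT_BODY);
    if (body != null) {
      buffer.add(body);
    }
    if (buffer.size() >= BATCH_SIZE) {
      flush();
    }
    return super.doProcess(record);                  // hand the record to the next command
  }

  @Override
  protected void doNotify(Record notification) {
    // The Morphline Solr Sink sends lifecycle events as notifications;
    // flush any remaining buffered records when the transaction commits.
    for (Object event : Notifications.getLifecycleEvents(notification)) {
      if (event == Notifications.LifecycleEvent.COMMIT_TRANSACTION) {
        flush();
      }
    }
    super.doNotify(notification);
  }

  private void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    try {
      if (out == null) {
        FileSystem fs = FileSystem.get(new Configuration());
        out = fs.create(new Path("/user/flume/flattened/part-00000"));  // placeholder path
      }
      for (byte[] body : buffer) {
        out.write(body);
      }
      out.hsync();                                   // make the flushed data durable in HDFS
      buffer.clear();
    } catch (IOException e) {
      throw new MorphlineRuntimeException("flush to HDFS failed", e);
    }
  }
}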

Re: Flume morphline sink to HDFS

Explorer

Thanks, I'll have a look.

 

I do, however, think you missed a trick by not having this as a stock command. Some of my colleagues dismissed morphlines as only applicable to Solr. They wanted to do the same as me (have Flume flatten Avro when ingesting the data) but bypassed them and have Pig scripts running instead.

 

HDFS is the backbone of Hadoop, so a standard way for morphlines to write to it outside of Solr would have been great!

Re: Flume morphline sink to HDFS

Expert Contributor
The reason something like writeToHDFS isn't a stock command yet is that it might be even better to enhance Flume a little, such that a Flume sink (here: the Morphline Sink) can send output to another Flume sink (via a Flume channel). This way you could plug the Morphline Sink in between the Flume Source and the existing Flume HDFS Sink. At least that's one line of thought.

Wolfgang.

Re: Flume morphline sink to HDFS

Explorer

Could I, in the meantime, write a custom morphline command to post the flattened Avro onto another Flume agent?

Re: Flume morphline sink to HDFS

Expert Contributor
Sure. For example, such a custom command could use an EmbeddedAgent to do so (http://flume.apache.org/FlumeDeveloperGuide.html#embedded-agent).
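
For reference, a rough sketch of what that might look like is below, following the EmbeddedAgent example in the developer guide linked above; the class name, host, port and channel capacity are placeholders.

// Sketch: forward flattened records to another Flume agent via an EmbeddedAgent.
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class FlattenedRecordForwarder {

  private final EmbeddedAgent agent;

  public FlattenedRecordForwarder(String hostname, int port) {
    Map<String, String> properties = new HashMap<String, String>();
    properties.put("channel.type", "memory");
    properties.put("channel.capacity", "10000");
    properties.put("sinks", "sink1");
    properties.put("sink1.type", "avro");
    properties.put("sink1.hostname", hostname);        // the downstream agent's Avro source
    properties.put("sink1.port", String.valueOf(port));
    properties.put("processor.type", "default");

    agent = new EmbeddedAgent("flattenedForwarder");
    agent.configure(properties);
    agent.start();
  }

  /** Send one flattened record (already serialized to bytes) to the downstream agent. */
  public void send(byte[] body) throws EventDeliveryException {
    Event event = EventBuilder.withBody(body);
    agent.put(event);
  }

  public void close() {
    agent.stop();
  }
}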
