Using Kite as a standalone event extraction library

Explorer

Hi...

 

We have a use case for a standalone ETL application, with the following characteristics:

  1. Text or binary sources (including files stored in a local disk, or data arriving over the network), accessible via a Java InputStream
  2. Custom event format (possibly multiline), requiring some sort of (non-regex) detection to identify the format and launch the correct reader
  3. Pull model for reading the data, e.g. a custom event reader, extracting events one-by-one and only after an explicit call to its read() method
  4. Transformation of each event into a custom Java object, which is pushed downstream towards a custom sink
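
To make step 2 concrete: one non-regex way to pick a reader is to peek at the first bytes of the InputStream and reset it, so the chosen reader still sees the whole stream. This is only a sketch under assumptions. The class name, the two formats and the magic prefix are all made up for illustration, and none of it is part of Kite or morphlines.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical names and formats; the magic prefix is invented for illustration.
final class FormatSniffer {

    enum Format { BINARY_V1, TEXT }

    private static final byte[] BINARY_MAGIC = {(byte) 0xCA, (byte) 0xFE};

    /** Peeks at the leading bytes without consuming them. */
    static Format detect(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        try {
            in.mark(BINARY_MAGIC.length);
            byte[] head = new byte[BINARY_MAGIC.length];
            int n = 0;
            while (n < head.length) {          // read() may return fewer bytes than requested
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;                     // stream is shorter than the prefix
                }
                n += r;
            }
            in.reset();                        // rewind so the chosen reader starts at byte 0
            boolean binary = n == head.length && Arrays.equals(head, BINARY_MAGIC);
            return binary ? Format.BINARY_V1 : Format.TEXT;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {(byte) 0xCA, (byte) 0xFE, 0x01}));
        System.out.println(detect(in));
    }
}
```

The detected format would then decide which reader to construct around the same stream.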

Hadoop and Solr may also be used in the above scenario, but for initial implementation purposes they are not included.

 

Is it possible and would it make sense from a performance perspective, to use morphlines as per the above use case? Any example code would be much appreciated.

 

Thanks.

 

PNS

 

 

5 REPLIES

Re: Using Kite as a standalone event extraction library

Expert Contributor
It's possible, except that morphlines implements a push model rather than a pull model. That is, each output record (aka event) of the custom morphline command (here: the parser) is sent one-by-one into a callback method. There is no pull-style iterator logic.
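
To illustrate what "push model" means in general, here is a minimal plain-Java sketch. It deliberately does not use the morphlines API: `PushModelSketch`, `EventHandler` and `parse` are hypothetical names, and treating each non-blank line as one event is an assumption made only to keep the example short.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the morphlines API.
final class PushModelSketch {

    /** Callback that receives each parsed event (the "push" target). */
    interface EventHandler {
        void onEvent(String event);
    }

    /** Reads the stream to the end, pushing every non-blank line downstream. */
    static int parse(InputStream in, EventHandler handler) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    handler.onEvent(line); // the parser, not the caller, drives the flow
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "first event\n\nsecond event\n".getBytes(StandardCharsets.UTF_8));
        List<String> sink = new ArrayList<>();
        int n = parse(in, sink::add);
        System.out.println(n + " events: " + sink);
    }
}
```

The caller hands over the stream and a callback once. It has no way to ask for "the next event" on demand, which is the difference from a pull-style `read()`.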

Wolfgang.

Re: Using Kite as a standalone event extraction library

Explorer

Thanks Wolfgang. It's nice to hear from the lead developer of the product. :-)

 

From the little I have seen of morphlines, I also got the impression that it is a push model. This can be worked around with a slight hack: a thread gathers the data from the callback method into a queue and presents itself as a pull reader on the other side (e.g., as Tika does with its Reader class).
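
A rough sketch of that queue hack, using only `java.util.concurrent`. The names (`PullAdapter`, `read`) are hypothetical, and the producer lambda stands in for whatever push-style source would feed the callback; nothing here is taken from morphlines or Tika.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical adapter: a background thread runs the push-style producer,
// and a single consumer thread pulls events one at a time via read().
final class PullAdapter<T> {

    private static final Object END = new Object(); // sentinel marking end of stream

    private final BlockingQueue<Object> queue;
    private volatile Throwable failure;
    private boolean finished; // touched only by the reading thread

    /**
     * @param producer calls the supplied callback once per event, then returns
     * @param capacity bound on buffered events, so a slow reader throttles the producer
     */
    PullAdapter(Consumer<Consumer<T>> producer, int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                producer.accept(this::put);
            } catch (Throwable t) {
                failure = t;   // surfaced to the reader when it reaches the end marker
            } finally {
                put(END);
            }
        }, "push-to-pull-producer");
        worker.setDaemon(true);
        worker.start();
    }

    private void put(Object item) {
        try {
            queue.put(item); // blocks while the queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queueing event", e);
        }
    }

    /** Blocks until the next event is available; returns null at end of stream. */
    @SuppressWarnings("unchecked")
    T read() {
        if (finished) {
            return null;
        }
        Object item;
        try {
            item = queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for event", e);
        }
        if (item == END) {
            finished = true;
            if (failure != null) {
                throw new IllegalStateException("producer failed", failure);
            }
            return null;
        }
        return (T) item;
    }

    public static void main(String[] args) {
        PullAdapter<String> reader = new PullAdapter<String>(callback -> {
            callback.accept("event-1");
            callback.accept("event-2");
        }, 16);
        String event;
        while ((event = reader.read()) != null) {
            System.out.println(event);
        }
    }
}
```

The bounded queue is what keeps the producer from racing ahead of the reader. One known gap in this sketch: if the reader stops before the end of the stream, the daemon producer thread stays blocked on a full queue, so a real version would want some form of close().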

 

I am mostly concerned about the ability of morphlines to allow custom code hookups, to implement custom detection or transformations of event streams (as per steps 2 and 4 in my original post).

 

Also, is there any demo or example of how one would use the morphlines API in a "lightweight" manner, i.e. without needing to run VMs, Hadoop or any service at all, and just use the API for the data extraction?

 

Re: Using Kite as a standalone event extraction library

Expert Contributor
You can write custom morphline commands and plug them in per http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html#Implementing_your_own_...

A good way to study how to use the framework in standalone mode, without Hadoop, Solr, etc., is to look at the unit tests, the little unit test framework and especially https://github.com/kite-sdk/kite-examples/tree/master/kite-examples-morphlines

Wolfgang.

Re: Using Kite as a standalone event extraction library

Explorer

I had a look at the Kite SDK source code yesterday and spent some time compiling it, as apparently there are missing classes in the latest snapshot from Git. For example, unless there is another dependency to be downloaded, the package org.kitesdk.data.hbase.manager from the data-hbase module (currently on Git) is missing an entire sub-package ("generated").

 

Another example, in addition to the links you pointed to, is the org.kitesdk.examples.data.HelloKite class, among numerous others.

 

The important part, however, is in my original question, which, after the above discussion, could be rephrased as follows: Is it advisable to use morphlines in the "lightweight" mode discussed, or would that be something of a hack, with limited versatility and performance?

 

In other words, did you guys develop morphlines as a standalone tool, in addition to the Hadoop use case?

Re: Using Kite as a standalone event extraction library

Expert Contributor
Morphlines can definitely be used as a standalone lightweight tool. It has been explicitly designed for exactly that purpose as well, with the same performance and versatility.

You can use and build morphlines from scratch as described here: https://github.com/kite-sdk/kite/tree/master/kite-morphlines

You can use as many or as few of the kite-morphline-* dependencies as you like. The minimum requirement is kite-morphlines-core, which has a deliberately minimal set of dependencies. Its dependency tree is here: http://kitesdk.org/docs/current/kite-morphlines/kite-morphlines-core/dependencies.html and for the other (optional) maven modules it's here: http://kitesdk.org/docs/current/dependencies.html

You can pull in the other optional kite-morphline-* dependencies if you want to - it's up to you.

Wolfgang.