PNS
Explorer
Posts: 38
Registered: ‎05-20-2014

Using Kite as a standalone event extraction library

Hi...

 

We have a use case for a standalone ETL application, with the following characteristics:

  1. Text or binary sources (including files stored on a local disk, or data arriving over the network), accessible via a Java InputStream
  2. Custom event format (possibly multiline), requiring some sort of (non-regex) detection to identify the format and launch the correct reader
  3. Pull model for reading the data, e.g. a custom event reader, extracting events one-by-one and only after an explicit call to its read() method
  4. Transformation of each event into a custom Java object, which is pushed downstream towards a custom sink
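
To make points 2 and 3 concrete, here is a minimal sketch in plain Java with no Kite dependency. Everything in it is hypothetical and only for illustration: the EventReader interface, the two reader classes, the two-byte magic header (0xCA 0xFE) and the length-prefixed binary layout are assumptions, not part of any existing library. It peeks at the start of the stream using mark/reset, picks a reader, and then hands out events one by one only when read() is called.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical pull-style reader: one event per read() call, null at end of input. */
interface EventReader {
    String read() throws IOException;
}

/** Hypothetical reader for newline-delimited UTF-8 text events. */
final class LineEventReader implements EventReader {
    private final InputStream in;
    LineEventReader(InputStream in) { this.in = in; }

    @Override
    public String read() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
            buf.write(b);
        }
        // Last line without a trailing newline, or end of input.
        return buf.size() > 0 ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }
}

/** Hypothetical reader for binary events: a 1-byte length followed by the payload. */
final class LengthPrefixedEventReader implements EventReader {
    private final InputStream in;
    LengthPrefixedEventReader(InputStream in) { this.in = in; }

    @Override
    public String read() throws IOException {
        int len = in.read();
        if (len == -1) return null; // clean end of input
        byte[] payload = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(payload, off, len - off);
            if (n == -1) throw new EOFException("truncated event");
            off += n;
        }
        return new String(payload, StandardCharsets.UTF_8);
    }
}

/** Non-regex detection: look at the first two bytes, then rewind if they are not the magic header. */
final class ReaderDetector {
    static final int MAGIC_0 = 0xCA; // assumed marker bytes for the binary format
    static final int MAGIC_1 = 0xFE;

    static EventReader detect(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);
        int b0 = in.read();
        int b1 = in.read();
        if (b0 == MAGIC_0 && b1 == MAGIC_1) {
            return new LengthPrefixedEventReader(in); // header consumed, payload follows
        }
        in.reset(); // not binary: give the text reader the stream from the start
        return new LineEventReader(in);
    }
}

final class DetectDemo {
    public static void main(String[] args) throws IOException {
        byte[] text = "first event\nsecond event\n".getBytes(StandardCharsets.UTF_8);
        EventReader reader = ReaderDetector.detect(new ByteArrayInputStream(text));
        for (String event; (event = reader.read()) != null; ) {
            System.out.println(event);
        }
    }
}
```

Multiline events would need a different boundary rule in the text reader; the detection step stays the same.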

Hadoop and Solr may also be used in the above scenario, but for initial implementation purposes they are not included.

 

Is it possible, and would it make sense from a performance perspective, to use morphlines for the above use case? Any example code would be much appreciated.

 

Thanks.

 

PNS

 

 

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Using Kite as a standalone event extraction library

It's possible, except that morphlines implements a push model rather than a pull model. That is, each output record (aka event) of the custom morphline command (here: the parser) is sent one-by-one into a callback method. There is no pull-style iterator logic.
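
To illustrate what push style means here, the following is a rough sketch in plain Java. It deliberately does not use the morphlines API: PushParser and its blank-line event boundary are hypothetical stand-ins for a custom parser command, and the Consumer plays the role of the callback method. The point is only that the parser drives the loop and the caller never asks for the next event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical push-style parser: it decides when each event is delivered. */
final class PushParser {
    /** Splits the input on blank lines and pushes every (possibly multiline) event into the callback. */
    static void parse(String input, Consumer<String> callback) {
        for (String chunk : input.split("\n\\s*\n")) {
            String event = chunk.trim();
            if (!event.isEmpty()) {
                callback.accept(event); // one call per event, in input order
            }
        }
    }
}

final class PushDemo {
    public static void main(String[] args) {
        List<String> received = new ArrayList<>();
        // The caller only supplies a callback; there is no read() or next() to call.
        PushParser.parse("event one\nline two\n\nevent two\n", received::add);
        System.out.println(received.size() + " events received");
    }
}
```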

Wolfgang.

PNS
Explorer
Posts: 38
Registered: ‎05-20-2014

Re: Using Kite as a standalone event extraction library

Thanks Wolfgang. It's nice to hear from the lead developer of the product. :-)

 

From the little I have seen of morphlines, I got the impression that it is a push model, too. This can be changed with a slight hack: a thread gathers data from the callback method into a queue and pretends to be a pull reader on the other side (e.g., as Tika does with its Reader class).
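
A minimal sketch of that queue hack, in plain Java and under stated assumptions: PullAdapter is a hypothetical name, the push source is modelled as anything that accepts a callback (Consumer<Consumer<String>>), and error propagation from the producer thread is left out for brevity. A bounded queue keeps the producer from running far ahead of the reader.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical adapter: runs a push-style source on its own thread and exposes read(). */
final class PullAdapter {
    private static final Object END = new Object(); // sentinel marking end of stream
    private final BlockingQueue<Object> queue;
    private boolean finished; // touched only by the reading thread

    PullAdapter(Consumer<Consumer<String>> pushSource, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                pushSource.accept(this::put); // the callback just enqueues each event
            } finally {
                put(END); // always signal the end, even if the source throws
            }
        }, "push-producer");
        producer.setDaemon(true);
        producer.start();
    }

    private void put(Object item) {
        try {
            queue.put(item); // blocks while the queue is full (back-pressure)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while queuing event", e);
        }
    }

    /** Returns the next event, blocking until one is available, or null once the source is done. */
    String read() {
        if (finished) return null;
        try {
            Object item = queue.take();
            if (item == END) {
                finished = true;
                return null;
            }
            return (String) item;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for event", e);
        }
    }
}

final class PullDemo {
    public static void main(String[] args) {
        PullAdapter reader = new PullAdapter(cb -> { cb.accept("a"); cb.accept("b"); }, 16);
        for (String event; (event = reader.read()) != null; ) {
            System.out.println(event); // prints a, then b
        }
    }
}
```

The cost of this approach is one extra thread and a queue hand-off per event, which is worth measuring if throughput matters.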

 

I am mostly concerned about the ability of morphlines to allow custom code hookups, to implement custom detection or transformations of event streams (as per steps 2 and 4 in my original post).

 

Also, is there any demo or example of how one would use the morphlines API in a "lightweight" manner, i.e. without needing to run VMs or Hadoop or any service at all, just using the API for the data extraction?

 

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Using Kite as a standalone event extraction library

You can write custom morphline commands and plug them in per http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html#Implementing_your_own_...

A good way to study how to use the framework in standalone mode, without Hadoop, Solr, etc., is to look at the unit tests, the little unit test framework, and especially https://github.com/kite-sdk/kite-examples/tree/master/kite-examples-morphlines

Wolfgang.

PNS
Explorer
Posts: 38
Registered: ‎05-20-2014

Re: Using Kite as a standalone event extraction library

I had a look at the Kite SDK source code yesterday and spent some time compiling it, as apparently there are missing classes in the latest snapshot from Git. For example, unless there is another dependency to be downloaded, the package org.kitesdk.data.hbase.manager from the data-hbase module (currently on Git) is missing an entire subpackage ("generated").

 

Another example, in addition to the links you pointed to, is the org.kitesdk.examples.data.HelloKite class, among numerous others.

 

The important part, however, is in my original question, which, after the above discussion, could be rephrased as follows: Is it advisable to use morphlines in the "lightweight" mode discussed, or would that be something of a hack, with limited versatility and performance?

 

In other words, did you guys develop morphlines as a standalone tool, in addition to the Hadoop use case?

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Using Kite as a standalone event extraction library

Morphlines can definitely be used as a standalone lightweight tool and it has explicitly been designed to be used for exactly that purpose as well, with the same performance and versatility.

You can use and build morphlines from scratch as described here: https://github.com/kite-sdk/kite/tree/master/kite-morphlines

You can use as many or as few of the kite-morphlines-* dependencies as you like. The minimum requirement is kite-morphlines-core, which has a deliberately minimal set of dependencies. Its dependency tree is here: http://kitesdk.org/docs/current/kite-morphlines/kite-morphlines-core/dependencies.html and the one for the other (optional) Maven modules is here: http://kitesdk.org/docs/current/dependencies.html

You can pull in the other optional kite-morphlines-* dependencies if you want to; it's up to you.

Wolfgang.
