We have a use case for a standalone ETL application, with the following characteristics:
Hadoop and Solr may also be used in the above scenario, but for initial implementation purposes they are not included.
Is it possible, and would it make sense from a performance perspective, to use morphlines for the above use case? Any example code would be much appreciated.
Thanks Wolfgang. It's nice to hear from the lead developer of the product. :-)
From the little I have seen of morphlines, I got the impression that it is a push model, too. This can be worked around with a slight hack: a thread gathers the data delivered to the callback method into a queue and acts as a pull reader on the other side (e.g., as Tika does with its Reader class).
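To make the queue hack concrete, here is a minimal sketch using only the JDK. It is not morphlines code: the class name PushToPullAdapter and the Consumer-based producer signature are my own assumptions for illustration. With morphlines, the callback side would presumably be the process method of a final child command, but I have not tried that.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/**
 * Hypothetical adapter (not part of morphlines or Tika): runs a push-style
 * producer on a background thread and exposes its output as a pull-style Iterator.
 * Null items are not supported, because ArrayBlockingQueue rejects nulls.
 */
final class PushToPullAdapter<T> implements Iterator<T> {
    private static final Object END = new Object(); // end-of-stream marker
    private final BlockingQueue<Object> queue;
    private Object lookahead; // next element if already fetched, else null

    /** The producer receives a callback and pushes each item into it. */
    PushToPullAdapter(Consumer<Consumer<T>> producer, int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        Thread worker = new Thread(() -> {
            try {
                producer.accept(item -> enqueue(item)); // blocks when the queue is full
            } finally {
                enqueue(END); // always signal completion, even if the producer fails
            }
        }, "push-to-pull-producer");
        worker.setDaemon(true); // do not keep the JVM alive for an abandoned reader
        worker.start();
    }

    private void enqueue(Object item) {
        try {
            queue.put(item);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while queueing an item", e);
        }
    }

    @Override
    public boolean hasNext() {
        if (lookahead == null) {
            try {
                lookahead = queue.take(); // blocks until the producer pushes something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for an item", e);
            }
        }
        return lookahead != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) lookahead;
        lookahead = null;
        return item;
    }

    public static void main(String[] args) {
        // The lambda stands in for any push-style API that invokes a callback per record.
        Iterator<String> reader = new PushToPullAdapter<String>(sink -> {
            sink.accept("a");
            sink.accept("b");
            sink.accept("c");
        }, 2);
        while (reader.hasNext()) {
            System.out.println(reader.next()); // prints a, b, c on separate lines
        }
    }
}
```

The bounded queue gives back-pressure, so the producer cannot run far ahead of the reader. One limitation of this sketch: an exception thrown by the producer is not passed on to the reader, which simply sees the end of the stream.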
I am mostly concerned about the ability of morphlines to allow custom code hookups, to implement custom detection or transformations of event streams (as per steps 2 and 4 in my original post).
Also, is there any demo or example of how one would use the morphlines API in a "lightweight" manner, i.e. without needing to run VMs, Hadoop, or any service at all, and just use the API for the data extraction?
I had a look at the Kite SDK source code yesterday and spent some time trying to compile it, as there are apparently missing classes in the latest snapshot from Git. For example, unless there is another dependency to be downloaded, the package org.kitesdk.data.hbase.manager from the data-hbase module (currently on Git) is missing an entire subpackage ("generated").
Another example, in addition to the links you pointed to, is the org.kitesdk.examples.data.HelloKite class, among numerous others.
The important part, however, is my original question, which, after the above discussion, could be rephrased as follows: Is it advisable to use morphlines in the "lightweight" mode discussed, or would that be something of a hack, with limited versatility and performance?
In other words, did you guys develop morphlines as a standalone tool, in addition to the Hadoop use case?