Is it possible to pass a field value across successive commands in the same morphline, using the morphlines framework?
For example, I use a tryRules structure in which there is a Grok command, followed by a custom command. I need to get the exact value (regex string) of the Grok expression in the custom command. Grok reads that value from the Config object at instantiation, but apparently every command gets a new Config object, so the expression is not carried over from Grok to the custom command that follows.
Can this be done in any other way than copy-pasting or string substitution?
This seems like a limitation imposed by the morphlines framework, which seems to treat every command in a morphline as a separate config file.
Resorting to string substitution probably contradicts the very nature of the morphline framework. I have started doing something similar and it quickly ends up being a way of dynamically generating morphlines from Java. To be sure, it is doable, but there is no reason to run Java in order to compile to an intermediate format (the morphline produced after the text substitution), which would then run the corresponding Java code (the morphline implementation). Every time a modification is needed, both the morphline conf file and the Java code would need to be adapted.
It is surprising that you have not had to add such a feature, given the versatility, the maturity and the richness of the morphlines API.
(Not being grumpy, just constructive criticism :-)
Each command sees its own config parameters, not the config params of other commands or of the whole morphline. This keeps things modular. Multiple commands can share the same parameters via string substitution per HOCON, as already mentioned; e.g., the SOLR_LOCATOR is a good example where this feature is frequently used. I can’t see a valid need for one command to directly see config params of another command.
FYI, you can also pass external variables into the morphline framework via the optional "overrides" parameter of the Compiler.compile() API. These variables are statically resolved at compile time - https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/main/java/org/...
For example, this feature is used in MapReduceIndexerTool to pass morphline variables from CLI to the morphline via the -D option, e.g.: hadoop ... -D morphlineVariable.myGrokExpression=foo
(CrunchIndexerTool and Flume Morphline Solr Sink and the Lily HBase Indexer also expose this feature similarly)
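For illustration, here is a minimal sketch of how a morphline file might reference the variable passed with the -D option above. It assumes the "morphlineVariable." prefix is stripped so that the variable is visible as myGrokExpression; the morphline id and the dictionary path are placeholders, not taken from a real setup.

```hocon
morphlines : [
  {
    id : morphline1                          # placeholder id
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        grok {
          # placeholder path to the grok dictionaries
          dictionaryFiles : ["/path/to/grok-dictionaries"]
          expressions : {
            # resolved at compile time from the override, e.g.
            # -D morphlineVariable.myGrokExpression=foo
            message : ${myGrokExpression}
          }
        }
      }
    ]
  }
]
```

Note that the substitution must stay unquoted; inside a quoted string, ${myGrokExpression} would be taken literally.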
Understandable, but either I am missing something (which I hope :-), or this is not the best way of doing things, as already explained.
String substitution would be a good choice if it were embedded in the framework and transparent to the user, i.e. if one could put placeholder variables in the morphline and have their values substituted by Kite (or HOCON).
The main use case of morphlines is to craft an ETL pipeline just by combining existing commands in a HOCON configuration file. It is not very efficient to mix Java and HOCON in order to do string substitution or other things. Java should ideally be used only for writing the command implementation once, and thereafter everything should be kept in HOCON. The way things are done currently seems an unnecessary limitation.
Another unnecessary limitation, reinforcing the need for the parameter passing I mentioned, is that there is no way to discriminate control data from user data in the Record class, which is final and thus cannot be extended.
More specifically, if the application needs to carry state beyond the values extracted from the data, this state (object references) is saved in the same map as the extracted values. So, at the end, when the values extracted from the data are to be used, one has to omit all the other (i.e., the control) values in the Record: either by using some namespace and doing string comparisons on the field names, or by keeping a list of the field names extracted by the morphline, or via some other hack (e.g., using a single map as a record field for holding the control values). Either way, it is a hack with unnecessary extra processing.
In the case of morphlines with a Grok expression, one would need to keep a list of the field names that the expression will extract, which means another configuration parameter that has to remain in sync with the Grok expression so that, for example, name changes in the expression are reflected in that configuration parameter. If the expression could be read by the Java code, one could automatically get those fields.
As I said at the start, I wish there was a better way of doing all this. Is there? :-)
String substitution is a first-class feature of HOCON - it works well and is the recommended way to handle things. Most morphlines that load data into Solr use HOCON string substitution to pass the same SolrLocator to multiple commands. You could use the same mechanism to pass your grok expression to multiple commands, including your custom morphline command.
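As a rough sketch of that mechanism applied to the tryRules case from the question: the expression is defined once at the top of the file and substituted into both commands. The names GROK_EXPRESSION, myCustomCommand, grokExpression and com.example are hypothetical, and the dictionary path and the pattern are placeholders.

```hocon
# Defined once, outside the morphline, and reused below.
GROK_EXPRESSION : "%{SYSLOGTIMESTAMP:timestamp} %{GREEDYDATA:msg}"

morphlines : [
  {
    id : morphline1
    # "com.example.**" stands in for the package of the custom command
    importCommands : ["org.kitesdk.**", "com.example.**"]
    commands : [
      {
        tryRules {
          rules : [
            {
              commands : [
                {
                  grok {
                    dictionaryFiles : ["/path/to/grok-dictionaries"]
                    expressions : {
                      message : ${GROK_EXPRESSION}
                    }
                  }
                }
                {
                  # hypothetical custom command; it receives the same string
                  # in its own config under a parameter it defines itself
                  myCustomCommand {
                    grokExpression : ${GROK_EXPRESSION}
                  }
                }
              ]
            }
          ]
        }
      }
    ]
  }
]
```

The substitution is resolved by HOCON when the file is loaded, so the custom command would read grokExpression from its own Config just like any other parameter, with no Java-side templating involved.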
You can use commands such as "removeFields" to remove any unwanted state from a record.
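A minimal sketch, assuming the control fields all share a hypothetical "_ctl_" name prefix so that one predicate covers them:

```hocon
{
  removeFields {
    # drop every field whose name starts with the hypothetical _ctl_ prefix
    blacklist : ["regex:_ctl_.*"]
  }
}
```

Placed as the last command of the morphline, this would leave only the fields extracted from the data in the record.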
Yes, that is what I have ended up doing.
I am of course aware of the removeFields command, but that still requires knowledge of which fields should be excluded (or which should be included, depending on the approach followed).
One more comment: the compile() method for morphlines that you mentioned in your previous response only accepts a File (not a Reader or an InputStream). This is too restrictive; for any other source, one has to manually invoke methods from ConfigFactory and the Compiler class.
This of course is not a morphlines issue, but in case you guys provide feedback to the HOCON team, it would be useful to enrich the API.
Thanks for all the helpful suggestions. :-)