Matching an input field with multiple regular expressions

Explorer

Hi...

 

Is there a "compact" way to define multiple regular expressions and try them on the same field until one of them matches? The "grok" command requires a different field name for each expression (e.g., "message1", "message2" and "message3" for 3 different regular expressions), rather than letting me name "message" once and try each of the 3 expressions against it until one matches or all have been tried.

 

Of course, one can use tryRules with a different grok command per expression, e.g.:

 

          {
            tryRules {
              rules: [
                {
                  commands: [
                    {
                      grok {
                        dictionaryResources: [...]
                        expressions: {
                          message: """<expression1>"""
                        }
                      }
                    }
                  ]
                }
                {
                  commands: [
                    {
                      grok {
                        dictionaryResources: [conf/etl/grok-dictionaries/patterns]
                        expressions: {
                          message: """<expression2>"""
                        }
                      }
                    }
                  ]
                }

...
                {
                  commands: [
                    {
                      grok {
                        dictionaryResources: [...]
                        expressions: {
                          message: """<expressionN>"""
                        }
                      }
                    }
                  ]
                }
                {
                  commands: [
                    # No expression matched
                    { dropRecord {} }
                  ]
                }
              ]
            }
          }

 

The above works, but it has around 5 times more lines than necessary.

 

Any ideas for a more compact notation, with existing morphline commands?

 

Thanks.

 

 


Re: Matching an input field with multiple regular expressions

Explorer

Happy New Year!

 

Any ideas about this question?

Re: Matching an input field with multiple regular expressions

Cloudera Employee

I have a similar request.

 

I'm trying to pull many fields from a single "record". The fields can be in any order.

 

The format is like this (Chess PGN)

 

[Event:"Some Event"]

[White:"Player White"]

[Black:"Player Black"]

[Date:"12-12-2012"]

 

These "tags" can appear in (really) any order, and a single record spans multiple lines.

 

I would like to define a "grok" morphline like this:

 

        grok {
          dictionaryFiles : [src/test/resources/grok-dictionaries]
          expressions : {
            # Desired goal: the PGN format does not specify the *order* of the fields, so I want
            # to have as many matches as possible. How do I do this?
            message : """\[White %{QUOTEDSTRING:white}\]"""
            message : """\[Black %{QUOTEDSTRING:black}\]"""
            message : """\[Event %{QUOTEDSTRING:event}\]"""
          }
        }

 

However, it seems that the morphline *only* hits one of the message: entries, and if that one succeeds, it leaves grok. I would like to avoid writing one huge regex that covers all the possible tags. Instead, I would like to tell the grok command to match any/all of the above "message" expressions.

 

I have a unit test coded [1] that shows this exact "problem": if we want to pull several values from the same field, we need to either

1) write a regex that handles a bunch of unordered extractions (I don't know how to do this yet), or

2) have an overwhelming bunch of "grok" commands, like the original poster has.

 

Thanks,

--Nate

 

[1] It's a clone of kite-examples here:  https://github.com/NathanNeff/kite-examples/commit/0bc281d2209b45125cb8006a8933236cf0e4534f

 


Re: Matching an input field with multiple regular expressions

Expert Contributor
I'd recommend using four separate grok commands, one that extracts the data for the "Event" irrespective of position and ignores everything else, one for "White", one for "Black", one for "Date".
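
A rough sketch of what that could look like, not a tested configuration. It reuses the dictionary path and the expression style from the post above (tag name, a space, then a quoted value), and assumes that grok's findSubstrings option is available in your morphlines version so that each expression may match anywhere inside the "message" field instead of having to match the whole value. Because "expressions" is a map keyed by field name, repeating the "message" key inside one grok leaves only one expression in effect, which is why each tag gets its own command here. The "date" output field name is made up for illustration.

```
commands : [
  # One grok per tag. Each command looks for its own tag anywhere in
  # "message" and ignores the rest of the record.
  {
    grok {
      dictionaryFiles : [src/test/resources/grok-dictionaries]
      expressions : { message : """\[Event %{QUOTEDSTRING:event}\]""" }
      findSubstrings : true   # assumption: substring match instead of full match
    }
  }
  {
    grok {
      dictionaryFiles : [src/test/resources/grok-dictionaries]
      expressions : { message : """\[White %{QUOTEDSTRING:white}\]""" }
      findSubstrings : true
    }
  }
  {
    grok {
      dictionaryFiles : [src/test/resources/grok-dictionaries]
      expressions : { message : """\[Black %{QUOTEDSTRING:black}\]""" }
      findSubstrings : true
    }
  }
  {
    grok {
      dictionaryFiles : [src/test/resources/grok-dictionaries]
      expressions : { message : """\[Date %{QUOTEDSTRING:date}\]""" }
      findSubstrings : true
    }
  }
]
```

Since the commands run one after another on the same record, the order of the tags in the input should not matter. One caveat: a grok command that finds no match fails for that record, so if a tag is optional, that command would probably need to sit inside a tryRules block with a fallback rule, as in the original question.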

