I am getting 3 attributes instead of one, using ExtractText Processor.

Contributor

Hi! I am very confused about how regular expressions and capture groups work in NiFi.

I read the documentation and saw that the ExtractText processor always seems to extract more attributes than needed.

I have a file with a line like this:

9999, text

 

I wrote a regular expression to extract the value 9999 into an attribute called number: (\d{4})

But instead of a single attribute number, I am getting the attributes number0, number and number1.

 

Can someone please explain to me why this is happening? The explanation in the documentation is quite complex.

 

Thank you in advance!

1 REPLY

Master Mentor

@Brenigan 

The ExtractText processor supports 1 to 40 capture groups in a Java regular expression.
The name of the user-added property defines the attribute into which the value from capture group 1 is placed.

The processor also creates one additional attribute per capture group, suffixed with the capture group number.
In your case, you added a new property with:

[Screenshot: the "number" property set to (\d{4})]

 

This is a single capture group that matches 4 digits.
So in your example (9999, text) this would result in creating the attributes:
number = 9999 <-- always contains the value from capture group 1.
number.1 = 9999  <-- the ".1" signifies the capture group the value came from.

number.0 contains the entire text matched by the Java regular expression. Whether this attribute is created is controlled by this property:

[Screenshot: the "Include Capture Group 0" property]

Setting it to "false" will stop number.0 from being added to your FlowFiles.
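
To see why number.0 and number.1 both hold 9999 here, it can help to run the same expression through plain java.util.regex outside of NiFi. This is only a minimal sketch: the class name SingleGroupDemo and the helper firstMatchGroups are made up for illustration, and it assumes the expression contains at least one capture group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleGroupDemo {
    // Hypothetical helper: returns {group 0, group 1} for the first match,
    // or null when the expression does not match at all.
    // Assumes the regex has at least one capture group.
    static String[] firstMatchGroups(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        if (!m.find()) {
            return null;
        }
        // group(0) is the whole matched text, group(1) is the first capture group.
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatchGroups("(\\d{4})", "9999, text");
        System.out.println("group 0 (whole match): " + groups[0]);
        System.out.println("group 1 (first capture group): " + groups[1]);
    }
}
```

Because the whole expression is a single capture group, the whole match and capture group 1 are the same text, so both values come out as 9999.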

To help understand this better, let's look at another example:
Suppose your Java regular expression had 2 capture groups instead, like this:

[Screenshot: the "number" property set to a regular expression with two capture groups]

Also assume we had "Include Capture Group 0" set to "true".

Now, with the same source text of "9999, text", we would expect to see these attributes added:
number = 9999 <-- always contains the value from capture group 1.
number.0 = 9999, text  <-- the complete match of the Java regular expression.
number.1 = 9999 <-- the ".1" signifies the capture group the value came from.
number.2 = text  <-- the ".2" signifies the capture group the value came from.

Setting "false" for "Include Capture Group 0" would have resulted in "number.0" not being created; however, number, number.1, and number.2 would have still been created.
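
The naming scheme described above can be sketched in plain Java. This is not NiFi's actual code, just an approximation of the behavior for illustration: the class AttributeNamingSketch and the method extract are hypothetical, and the two-group expression (\d{4}), (.*) is an assumed stand-in chosen to produce the values listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeNamingSketch {
    // Hypothetical helper that mimics the attribute naming described in this post.
    // It looks only at the first match in the content.
    static Map<String, String> extract(String property, String regex,
                                       String content, boolean includeGroupZero) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        if (!m.find()) {
            return attrs; // no match, no attributes
        }
        // Start at group 0 only when the whole match is wanted.
        for (int i = includeGroupZero ? 0 : 1; i <= m.groupCount(); i++) {
            String value = m.group(i);
            if (value == null) {
                continue; // optional group that did not take part in the match
            }
            attrs.put(property + "." + i, value);
            if (i == 1) {
                // The bare property name always mirrors capture group 1.
                attrs.put(property, value);
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Assumed example expression with two capture groups.
        String regex = "(\\d{4}), (.*)";
        System.out.println(extract("number", regex, "9999, text", true));
        System.out.println(extract("number", regex, "9999, text", false));
    }
}
```

With the flag set to true the map holds number.0, number.1, number and number.2. With it set to false only number.0 is missing.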

This functionality allows this processor component to handle multiple use cases.

If you found that this response helped with your query, please take a moment to log in and click on "Accept as Solution" below this post.

Thank you,

Matt