Support Questions

Brenigan · 05-06-2022

Hi! I am very confused about how regular expressions and capture groups work in NiFi.

I read the documentation and saw that the ExtractText processor always seems to extract more attributes than needed.

I have a file with a line like this:

9999, text

I wrote a regular expression to extract the value 9999 into an attribute called number: (\d{4})

But instead of the one attribute, number, I am getting three: number.0, number and number.1.

Can someone please explain to me why this is happening? The explanation in the documentation is quite complex.

Thank you in advance!

MattWho · 05-06-2022

@Brenigan

The ExtractText processor supports 1 to 40 capture groups in a Java regular expression.
The name of the user-added property defines the attribute into which the value from capture group 1 is placed.

The processor also creates an additional attribute for each capture group, suffixed with the capture group number.
So in your case you added a new property named "number" with the value (\d{4}).

This is a single capture group that matches 4 digits.
So in your example (9999, text) this would result in creating these attributes:
number = 9999 <-- always contains the value from capture group 1.
number.1 = 9999 <-- the ".1" signifies the capture group the value came from.

number.0 contains the entire text matched by the Java regular expression. This attribute is controlled by the "Include Capture Group 0" property.

Setting it to false will stop number.0 from being added to your FlowFiles.
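The group numbering itself comes from standard java.util.regex behavior, so it can be checked outside NiFi. Below is a minimal sketch, not NiFi code; the class name CaptureGroupDemo and the helper groups() are made up for illustration. It lists group 0 (the whole match) followed by each numbered capture group for the first match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only uses java.util.regex, nothing from NiFi.
class CaptureGroupDemo {

    // Returns group 0 (the whole match) followed by every numbered capture
    // group of the first match, or an empty list when nothing matches.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> g = groups("(\\d{4})", "9999, text");
        for (int i = 0; i < g.size(); i++) {
            System.out.println("group " + i + " = " + g.get(i));
        }
    }
}
```

With the single-group pattern, group 0 and group 1 hold the same text, which is why number.0 and number.1 carry the same value in the example above.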

To help understand this better, let's look at another example:
Suppose your Java regular expression had 2 capture groups instead: one for the four digits and one for the word after the comma.

Also assume we had "Include Capture Group 0" set to "true".

Now with the same source text of "9999, text", we would expect to see these attributes added:
number = 9999 <-- always contains the value from capture group 1.
number.0 = 9999, text <-- the complete match from the Java regular expression.
number.1 = 9999 <-- the ".1" signifies the capture group the value came from.
number.2 = text <-- the ".2" signifies the capture group the value came from.

Setting "false" for "Include Capture Group 0" would have resulted in "number.0" not being created; however, number, number.1, and number.2 would have still been created.
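The naming rule described above can be imitated in a few lines of plain Java. This is only a sketch of the behavior as explained in this thread, not the processor's actual implementation. The class ExtractTextSketch, the method extract() and its includeGroupZero flag are hypothetical names, and the two-group pattern (\d{4}), (.*) is an assumed example, since the exact pattern is not shown in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch that mimics the attribute naming described in this thread.
class ExtractTextSketch {

    // name: the user-added property name, e.g. "number"
    // includeGroupZero: stands in for the "Include Capture Group 0" property
    static Map<String, String> extract(String name, String regex,
                                       String input, boolean includeGroupZero) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return attrs; // no match, no attributes
        }
        // The bare property name receives the value of capture group 1.
        if (m.groupCount() >= 1 && m.group(1) != null) {
            attrs.put(name, m.group(1));
        }
        // Indexed attributes: name.0 (optional), then name.1, name.2, ...
        int first = includeGroupZero ? 0 : 1;
        for (int i = first; i <= m.groupCount(); i++) {
            if (m.group(i) != null) { // skip groups that did not participate
                attrs.put(name + "." + i, m.group(i));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        String regex = "(\\d{4}), (.*)"; // assumed two-group pattern
        System.out.println(extract("number", regex, "9999, text", true));
        System.out.println(extract("number", regex, "9999, text", false));
    }
}
```

Running it with the flag set to true and then false shows number.0 appearing and disappearing while number, number.1 and number.2 stay the same.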

This functionality allows this processor component to handle multiple use cases.

If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post.

Thank you,

Matt
