
Extracting Multiple Lines from FlowFile

Given a flowfile of text, I'm trying to extract certain lines from it using regex and pass ONLY those lines onto the next processors.

For example, if I have the text:

cat
dog
fish
dog
bird

I only want to extract the lines with "dog", then put those lines into their own flowfile that would look like:

dog
dog

I understand that an ExtractText processor is probably my best bet in this case, but I'm not sure how to use it correctly. I've read multiple questions where the answer describes how to grab specific text and store it in attributes, but the answers always stop there. I need to know how I can then pass those attributes to another processor, and how to make that processor use only the attributes that ExtractText captured.

TL;DR: How do I use ExtractText to grab specific lines, then pass those lines to another processor, and then how do I use that processor to spit out the previously grabbed lines?

2 REPLIES


@Nick Stantzos

After the ExtractText processor, use a ReplaceText processor to write out the previously captured lines.

Flow:

ExtractText //extract the matching lines and keep them as attributes on the flowfile
ReplaceText //replace the flowfile content with the attribute values, with Replacement Strategy set to Always Replace
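To make the two steps concrete, here is a minimal sketch that imitates them in plain Python. It is not NiFi code: the function names, the `match.N` attribute naming, and the pattern are assumptions chosen for illustration, and NiFi itself uses Java regular expressions, which may behave slightly differently.

```python
import re

def extract_lines(content, pattern):
    """Imitate the ExtractText step: store each matching line as an attribute."""
    matches = re.findall(pattern, content, flags=re.MULTILINE)
    # Hypothetical attribute names match.1, match.2, ... (illustrative only)
    return {f"match.{i}": line for i, line in enumerate(matches, start=1)}

def replace_content(attributes):
    """Imitate ReplaceText with Always Replace: drop the old content
    and emit only the attribute values, in capture order."""
    keys = sorted(attributes, key=lambda k: int(k.split(".")[1]))
    return "\n".join(attributes[k] for k in keys)

content = "cat\ndog\nfish\ndog\nbird\n"
attributes = extract_lines(content, r"^.*dog.*$")
new_content = replace_content(attributes)
print(new_content)
```

The point of the sketch is that the second step ignores the original content entirely and rebuilds it from the attributes, which is why the unwanted lines disappear.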

(or)

Alternatively, use a ReplaceText processor on its own: set Search Value to a regex that matches the required lines in the flowfile content, and set Replacement Value to the capture group ($1, etc.) so that the processor outputs only the content that matched.

Flow:

ReplaceText //Search Value: regex matching the required lines; Replacement Value: the capture group; Replacement Strategy: Regex Replace

Some references for these approaches: link1, link2, link3

@Shu

Regarding your second option, using a ReplaceText processor:
I can successfully match the lines I want with a regex, but I can't seem to extract only the matched text. When using a Replacement Value of $1, I still end up with the lines that I did not want to save. For example, if my text is:

cat
dog
fish
dog
bird

I can successfully capture the "dog" lines, but when my Replacement Value is set to $1, the result is still:

cat
dog
fish
dog
bird

I'm not sure how I can use ReplaceText to extract only the lines that I match with.
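A likely explanation, sketched here in plain Python rather than NiFi: a regex replace rewrites only the spans the pattern matches, so replacing each "dog" line with its own capture group changes nothing, and the unmatched lines pass through untouched. One possible workaround is to invert the pattern so that it matches the unwanted lines and replaces them with an empty string. The patterns below are assumptions for illustration; in NiFi's Java regex the multiline flag would probably have to be written inline as `(?m)`, and the result would depend on the processor's evaluation mode.

```python
import re

text = "cat\ndog\nfish\ndog\nbird\n"

# Replacing each matched line with its own capture group is a no-op:
# lines that never match are simply left in place.
unchanged = re.sub(r"^(.*dog.*)$", r"\1", text, flags=re.MULTILINE)

# Inverted approach: match every line that does NOT contain "dog"
# (negative lookahead), including its trailing newline, and delete it.
only_dogs = re.sub(r"^(?!.*dog).*\n?", "", text, flags=re.MULTILINE)

print(unchanged == text)
print(only_dogs)
```

The second substitution leaves only the two "dog" lines, which is the content the question is after.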
