Extracting Multiple Lines from FlowFile

New Contributor

Given a flowfile of text, I'm trying to extract certain lines from it using regex and pass ONLY those lines onto the next processors.

For example, if I have the text:

cat
dog
fish
dog
bird

I only want to extract the lines with "dog", then put those lines into their own flowfile that would look like:

dog
dog

I understand that an ExtractText processor is probably my best bet in this case, but I'm not sure how to use it correctly. I've read multiple questions where the answer describes how to grab specific text and store it in attributes, but the answers always stop there. I need to know how to pass those attributes to another processor, and how to make that processor use only the attributes that ExtractText captured.

TL;DR: How do I use ExtractText to grab specific lines, then pass those lines to another processor, and then how do I use that processor to spit out the previously grabbed lines?

Re: Extracting Multiple Lines from FlowFile

Super Guru

@Nick Stantzos

After the ExtractText processor, use a ReplaceText processor to write out the previously extracted lines.

Flow:

ExtractText //extract the matching lines and keep them as attributes on the flowfile
ReplaceText //replace the flowfile content with the attribute values; set Replacement Strategy to Always Replace
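
To make the two-step idea concrete, here is a minimal plain-Python simulation of it. This is not NiFi configuration: the function names and the `matched.N` attribute names are hypothetical, and Python's `re` module stands in for the Java regex engine that NiFi uses. It only illustrates "collect the matching lines as attributes, then overwrite the content with those attributes".

```python
import re


def extract_matching_lines(content, pattern):
    """Simulate the extract step: store every matching line as an 'attribute'."""
    line_re = re.compile(pattern)
    attributes = {}
    for line in content.splitlines():
        if line_re.search(line):
            # Hypothetical attribute naming, chosen only for this sketch.
            attributes[f"matched.{len(attributes) + 1}"] = line
    return attributes


def replace_content_with_attributes(attributes):
    """Simulate the 'always replace' step: the new content is built only from attributes."""
    return "\n".join(attributes.values())


content = "cat\ndog\nfish\ndog\nbird"
attrs = extract_matching_lines(content, r"dog")
new_content = replace_content_with_attributes(attrs)
print(new_content)
```

The original content is discarded entirely in the second step, which is why only the captured lines survive.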

(or)

Alternatively, use a ReplaceText processor on its own: set Search Value to a regex that matches the required lines in the flowfile content, and set Replacement Value to the capture group(s) ($1, etc.), so that the processor outputs only the captured part of the content.

Flow:

ReplaceText //Search Value: regex matching the required lines; Replacement Value: the capture group; Replacement Strategy: Regex Replace
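
As a rough illustration of how a capture-group replacement can reduce content to just the captured text, the sketch below uses Python's `re` module as a stand-in for NiFi's Java regex (an assumption; the two engines agree on the constructs used here). The key point is that the pattern has to match the entire content, so that everything outside the group is consumed and replaced. The sample text and the pattern are made up for this example.

```python
import re

content = "cat\ndog one\nfish"

# (?s) lets '.' cross newlines; \A and \Z force the match to span the whole content.
# Everything outside the capture group is matched too, so it is dropped on replacement.
pattern = r"(?s)\A.*?(dog[^\n]*).*\Z"

result = re.sub(pattern, r"\1", content)
print(result)
```

Note that a single group like this keeps only the first matching line; it does not by itself collect several separate lines.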

Some references for these approaches: link1, link2, link3

Re: Extracting Multiple Lines from FlowFile

New Contributor

@Shu

In regards to your second option, using a ReplaceText processor:
I can successfully use regex to match with existing lines, but I can't seem to extract only the text that I matched with. When using a replacement value of $1, I still end up with the lines that I did not want to save. For example, if my text is:

cat
dog
fish
dog
bird

I can successfully capture the "dog" lines, but when my Replacement Value is set to $1, the result is still:

cat
dog
fish
dog
bird

I'm not sure how I can use ReplaceText to extract only the lines that I match with.
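
A likely explanation for this behavior, sketched in plain Python (the `re` module standing in for NiFi's Java regex, which is an assumption): a regex replace only rewrites the spans that the pattern matches, and all unmatched text passes through untouched. Replacing each "dog" with its own capture group therefore changes nothing. One possible workaround is to invert the pattern so that it matches the lines that should be removed, and replace those with an empty string. Whether ReplaceText accepts this exact Search Value with an empty Replacement Value has not been verified here.

```python
import re

content = "cat\ndog\nfish\ndog\nbird"

# Replacing every match with itself: unmatched lines are simply left in place.
unchanged = re.sub(r"(dog)", r"\1", content)

# Inverted approach: match each whole line that does NOT contain "dog"
# (plus its trailing newline, if any) and delete it.
only_dog = re.sub(r"(?m)^(?!.*dog).*\n?", "", content)

print(repr(unchanged))
print(repr(only_dog))
```

With this input the inverted pattern leaves a trailing newline after the last kept line, which may or may not matter for downstream processors.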