Support Questions

mark_hadoop · ‎01-22-2018

Hi,

I have 1M files flowing per second.

flowfile:

This is a standard [flowfile name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000"][end of file]

ExtractText property:

.*(\[flowfile(.*?")\]).*

ReplaceText:

$2

Every file contains the constant "[flowfile", and I need only the data between "[flowfile" and "][end of file]", so I have configured ExtractText as above.
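
For readers who want to see what this pattern captures, here is a minimal sketch in plain Python rather than NiFi. It assumes Python's `re` module behaves like NiFi's Java regex engine for this particular pattern, and it uses the sample content from the post. Note that the leading `.*` is greedy, so the engine first runs to the end of the content and then backtracks to find "[flowfile", which is one plausible source of the cost.

```python
import re

# Sample content copied from the post.
content = ('This is a standard [flowfile name="11" size="1KB"........ '
           'timestamp="2018-01-21 10:00:00.00000000"][end of file]')

# The ExtractText pattern from the post, unchanged.
pattern = re.compile(r'.*(\[flowfile(.*?")\]).*')

match = pattern.match(content)
# Group 2 is what "$2" refers to; it is None here only if nothing matched.
captured = match.group(2) if match else None
print(repr(captured))
```

Group 2 starts with a space, because the pattern has no space after `flowfile`.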

I can see around 1500-2000 files being processed per second (concurrent processors=1; if set to 4, 6000-8000 files/sec are processed).

Incoming files arrive at 1M/s, but only about 2000/s go out, so there is a huge gap between incoming and outgoing.

Could you please help me improve ExtractText performance?

mburgess · ‎01-22-2018

Perhaps try ReplaceText first, to match your beginning and end text, and replace them with an empty string. Then if you need the content as an attribute, you can use ExtractText with (.*). Do you definitely need the value in an attribute? If not, you can keep it in the content after the ReplaceText processor.
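
A rough sketch of this idea in plain Python, not NiFi configuration. The `wrapper` pattern is a hypothetical way to express "beginning and end text" and is not taken from the thread. It removes everything up to and including "[flowfile " and the trailing "][end of file]", leaving only the payload.

```python
import re

# Sample content copied from the original question.
content = ('This is a standard [flowfile name="11" size="1KB"........ '
           'timestamp="2018-01-21 10:00:00.00000000"][end of file]')

# Hypothetical pattern: either the text up to and including "[flowfile "
# at the start, or the literal "][end of file]" at the very end.
wrapper = re.compile(r'^.*?\[flowfile |\]\[end of file\]$')

# Replace both pieces with an empty string.
stripped = wrapper.sub('', content)
print(stripped)
```

Content that contains neither piece passes through unchanged, so this does not validate anything by itself.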

mark_hadoop · ‎01-23-2018

@Matt Burgess

Thanks for the suggestion.

I can use ReplaceText, but any unmatched files that are present (there could be some) will also be passed through as they are.

In that case, how can I get rid of them? Could you please suggest something?

Also, I need the attribute that is being extracted by the ExtractText processor.

sivaprasanna246 · ‎01-23-2018

As @Matt Burgess said, when you use ReplaceText processor the way he said, your flowfile content would be changed from [flowfile name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000"][end of file] to name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000".

You can then connect the Success relationship from ReplaceText to ExtractText and use (.*) as the regular expression to get the content and assign it to an attribute. You can skip this step if you are only extracting the content to an attribute so that a later ReplaceText can write that attribute back to the flowfile content, since ReplaceText itself already does that in the approach above.

mark_hadoop · ‎01-24-2018

@Sivaprasanna

I am not sure whether you are explaining Matt's answer in another way or did not understand my question.

Let me put my question this way:

ReplaceText processor

search value: (\[flowfile )(.*?\")(\].*)

replace value: $2

case 1: file contains

[flowfile name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000"][end of file]

result will be: name="11" size="1KB"........ timestamp="2018-01-21 10:00:00.00000000"

case 2: the file does not match as in case 1, and some other error text comes in.

"a new file with out the flowfile but something else" -> Let's assume this is the content of the file

result will be: "a new file with out the flowfile but something else" -> which is not expected, and these files should not proceed.
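
The two cases can be reproduced outside NiFi with a small Python sketch. It assumes Python's `re.subn` is a fair stand-in for a regex replace here, with `\2` as the Python spelling of `$2`. The returned replacement count is a Python convenience used only to make the difference visible.

```python
import re

# Search value from the post.
search = re.compile(r'(\[flowfile )(.*?\")(\].*)')

case_1 = ('[flowfile name="11" size="1KB"........ '
          'timestamp="2018-01-21 10:00:00.00000000"][end of file]')
case_2 = 'a new file with out the flowfile but something else'

# subn returns the new text and the number of replacements made.
result_1, count_1 = search.subn(r'\2', case_1)
result_2, count_2 = search.subn(r'\2', case_2)

print(count_1, repr(result_1))
print(count_2, repr(result_2))
```

In case 2 nothing is replaced and the text comes back untouched, which is why a separate validation step is needed before or after the replace.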

To avoid the above situation I was using ExtractText.

So, how can I avoid it using ReplaceText?

Could you please suggest something? Thanks in advance.

mburgess · ‎01-24-2018

So right now it appears you are trying to do validation and extraction at the same time, since you don't want "case 2" to move down the stream. If your new ReplaceText from this comment is more performant than the one from the original question, you can use RouteOnContent first to exclude the files that do not have the required header and footer. Since there will now be two pattern matching processors, you may find that it is less performant, but it's probably worth a try. Another option is ExecuteScript with a fast scripting language like Groovy or Javascript/Nashorn, but the overhead of the interpreted script might be worse than the improvement of looking only for headers/footers rather than a whole regex.
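
The "headers/footers only" idea can be sketched with plain string operations instead of a regex. This is illustrative Python for the logic alone, not a working ExecuteScript body. The names `HEADER`, `FOOTER`, and `extract_payload` are hypothetical, and unlike the regex, the check does not require the payload to end with a double quote.

```python
HEADER = '[flowfile '
FOOTER = '][end of file]'


def extract_payload(content):
    """Return the text between HEADER and FOOTER, or None if either is missing."""
    start = content.find(HEADER)
    if start == -1 or not content.endswith(FOOTER):
        return None  # validation failed: do not pass this content on
    start += len(HEADER)
    end = len(content) - len(FOOTER)
    if end < start:
        return None  # header and footer overlap, so there is no payload
    return content[start:end]


print(extract_payload('[flowfile name="11"][end of file]'))
print(extract_payload('a new file with out the flowfile but something else'))
```

Returning `None` for unmatched content covers the validation half (the role RouteOnContent would play), and the slice covers the extraction half. Whether this beats the regex inside NiFi would need to be measured.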

Cloudera Community

How to improve performance of the ExtractText processor