Using ExecuteScript to regex flow file and write result to flowfile


I'm trying to use the ExecuteScript processor to take in a flow file, read its content as a string, apply a regex to that string, write the result back to the flow file, and then pass the flow file along the flow. I want to do this in Python.

As an example, an incoming string might have text like "The 8 dogs jumped over the 2 cats !". I want the result to look like "The_8_dogs_jumped_over_the_2_cats_!".

I already have a regex that works for this case; my trouble is converting the incoming flow file's content into a string that I can apply the regex to. The regex has to be applied to the string several times to eliminate all of the whitespace, so I would like the ExecuteScript processor to be able to route a flow file back into itself multiple times.
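To make the multiple-pass requirement concrete, here is a rough sketch of how the repeated passes could happen inside a single script run rather than by looping the flow file back through the processor. My actual regex isn't shown here, so `PAIR` below is a hypothetical stand-in that behaves the same way (each match consumes the characters on both sides of the whitespace, so neighbouring gaps are only caught on a later pass). `replace_until_stable`, `replace_single_pass` and `max_passes` are illustrative names, not anything from NiFi.

```python
import re

# Hypothetical pattern: one whitespace character between two non-whitespace
# characters. A match consumes both neighbours, so adjacent gaps are missed
# on the first pass and need a second one.
PAIR = re.compile(r"(\S)\s(\S)")


def replace_until_stable(text, pattern=PAIR, repl=r"\1_\2", max_passes=100):
    """Apply the substitution repeatedly until the text stops changing."""
    for _ in range(max_passes):
        new_text = pattern.sub(repl, text)
        if new_text == text:
            break  # nothing left to replace
        text = new_text
    return text


def replace_single_pass(text):
    """Alternative: a pattern that consumes only the whitespace itself."""
    return re.sub(r"\s", "_", text)


sample = "The 8 dogs jumped over the 2 cats !"
print(replace_until_stable(sample))
print(replace_single_pass(sample))
```

If the only goal is whitespace to underscores, the single-pass form may be enough, since `re.sub` replaces every non-overlapping match in one call; the loop is only needed for patterns like the hypothetical one above.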

Another note: I've tried a ReplaceText processor coupled with a RouteOnContent processor, but that solution is much too slow for the quantity of data I'm working with (even after adjusting the concurrent tasks property).

In short, I want to convert a flowfile to a string, do some regex substitutions, then convert back to a flowfile.
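For reference, this is roughly the shape I think the script needs, as a minimal sketch and not a tested solution. It assumes the Jython engine in ExecuteScript, UTF-8 content, and flow files small enough to hold in memory. `session` and `REL_SUCCESS` are the variables ExecuteScript binds for the script; `underscore_whitespace`, `UnderscoreCallback` and `IN_NIFI` are names made up for this example. The import guard is only there so the regex function can be tried outside NiFi.

```python
import re


def underscore_whitespace(text):
    """Replace every whitespace character with an underscore."""
    return re.sub(r"\s", "_", text)


try:
    # These Java classes are only importable under Jython inside NiFi.
    from java.nio.charset import StandardCharsets
    from org.apache.commons.io import IOUtils
    from org.apache.nifi.processor.io import StreamCallback
    IN_NIFI = True
except ImportError:
    IN_NIFI = False

if IN_NIFI:
    class UnderscoreCallback(StreamCallback):
        def process(self, inputStream, outputStream):
            # Flow file content -> string (assumes UTF-8 text).
            text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            result = underscore_whitespace(text)
            # String -> flow file content.
            outputStream.write(bytearray(result.encode("utf-8")))

    # `session` and `REL_SUCCESS` are provided by ExecuteScript.
    flowFile = session.get()
    if flowFile is not None:
        flowFile = session.write(flowFile, UnderscoreCallback())
        session.transfer(flowFile, REL_SUCCESS)
```

Note that `\s` also matches newlines, so a trailing newline in the content would become an underscore as well; the pattern would need adjusting if that is not wanted.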
