ExtractText multiple Groups found

New Contributor

I am scraping a page for multiple links. I run ExtractText on the HTML with Enable repeating capture group set to true.

93023-extracttext.png

I get the following attributes on the flow file:

URL.0, URL.1, ..., URL.n

93022-links.png

I want to process each of these individually.

How do I either turn these attributes into new flow files, or convert them into something that allows each one to be accessed?

The number of links found is not fixed.

I did try ExecuteScript, but it killed NiFi. This was my first attempt at using ExecuteScript.

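// ExecuteScript (ECMAScript): create one child flow file per URL.
// Assumes the flow file carries an attribute named 'urls' holding the links separated by '|'.
var OutputStreamCallback = Java.type("org.apache.nifi.processor.io.OutputStreamCallback");
var StandardCharsets = Java.type("java.nio.charset.StandardCharsets");

var flowFile = session.get();
if (flowFile != null) {
    var urls = flowFile.getAttribute('urls');
    if (urls == null) {
        // Attribute missing: route the original to failure rather than leaving it unhandled.
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // String(...) forces a JavaScript string so split() treats "|" literally.
        var parts = String(urls).split("|");
        for (var i = 0; i < parts.length; i++) {
            var text = parts[i];
            // Create a child of the incoming flow file and write one URL as its content.
            var child = session.create(flowFile);
            child = session.write(child, new OutputStreamCallback(function (outputStream) {
                outputStream.write(text.getBytes(StandardCharsets.UTF_8));
            }));
            session.transfer(child, REL_SUCCESS);
        }
        // The original is no longer needed once its children have been transferred.
        session.remove(flowFile);
        // No session.commit() here: ExecuteScript commits the session after the script returns.
    }
}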
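
Since the matches arrive as separate attributes (URL.0, URL.1, ...) rather than as a single 'urls' attribute, a script would first have to gather them. Below is a minimal sketch of that gathering step in plain JavaScript. It assumes the attributes are available as a simple key/value object; the helper name collectIndexedValues and the sample values are hypothetical and not part of NiFi.

```javascript
// Hypothetical helper: gather the values of attributes named "<prefix>.<n>",
// ordered by their numeric suffix.
function collectIndexedValues(attributes, prefix) {
    var marker = prefix + ".";
    return Object.keys(attributes)
        .filter(function (key) {
            // Keep only keys such as "URL.0" or "URL.12"; skip "URL" and "URL.x".
            return key.indexOf(marker) === 0 && /^\d+$/.test(key.slice(marker.length));
        })
        .sort(function (a, b) {
            // Numeric sort so that "URL.10" comes after "URL.2".
            return Number(a.slice(marker.length)) - Number(b.slice(marker.length));
        })
        .map(function (key) {
            return attributes[key];
        });
}

// Example attribute map (made-up values for illustration).
var example = {
    "filename": "page.html",
    "URL.0": "something.html",
    "URL.1": "anotherthing.html",
    "URL.10": "late.html",
    "URL.2": "yetmore.html"
};
var links = collectIndexedValues(example, "URL");
```

Inside ExecuteScript, flowFile.getAttributes() returns a Java Map, so its entries would need to be copied into such an object (or iterated directly) before each value is written to its own child flow file. Which index holds which capture depends on how ExtractText is configured.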


Re: ExtractText multiple Groups found

New Contributor

Found the solution, thanks to a template on the Apache wiki.

Fetch the HTML and feed it to RouteText.

93030-routetext.png

This produces one flow file containing all the matches, one match per line.

Then use SplitText to break it into individual flow files, one per line.
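
As a rough illustration of what this two-step flow does to the content, the sketch below mimics it in plain JavaScript: keep only the lines that match a pattern, then treat each kept line as its own item. The pattern, the sample HTML, and the function name keepMatchingLines are assumptions for illustration; the actual RouteText settings are in the screenshot and are not reproduced here.

```javascript
// Step 1 (RouteText-like): keep only the lines that contain a match.
function keepMatchingLines(text, pattern) {
    return text.split(/\r?\n/).filter(function (line) {
        return pattern.test(line);
    });
}

// Made-up input with one link per line.
var html = [
    "<html><body>",
    '<a href="something.html">something</a>',
    "<p>no link here</p>",
    '<a href="anotherthing.html">another</a>',
    "</body></html>"
].join("\n");

// Assumed pattern: any line containing an anchor with an href attribute.
var anchorLine = /<a\s+href="[^"]*"/;

// Step 2 (SplitText-like): each kept line becomes its own item.
var items = keepMatchingLines(html, anchorLine);
```

Because the filtering is line-based, several links that share one line stay together as a single item.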

Re: ExtractText multiple Groups found

New Contributor

The problem with this is that it returns one line for any match. If the HTML has many a href elements with no line breaks between them, and I want to produce a line for each a href match, how do I do it?

For example:

<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>


Regular expression for an a href block.


RouteText returns just one line:
<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>


What I want is a new line for each match:
<a href="something.html">something</a>
<a href="anotherthing.html">another</a>
<a href="yetmore.html">Yet More</a>
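
One possible approach is to extract every match rather than every matching line. The following is a minimal sketch in plain JavaScript. The regular expression is an assumption (the one used in the post is not shown) and only suits simple, non-nested anchors like those in the example; extractAnchors is a hypothetical helper name.

```javascript
// Return every <a href="...">...</a> block found in the input, in order.
function extractAnchors(html) {
    // The global flag makes match() return all occurrences;
    // [\s\S]*? is lazy, so each match stops at the first closing </a>.
    var anchorPattern = /<a\s+href="[^"]*"[^>]*>[\s\S]*?<\/a>/gi;
    return html.match(anchorPattern) || [];
}

var line = '<a href="something.html">something</a>' +
           '<a href="anotherthing.html">another</a>' +
           '<a href="yetmore.html">Yet More</a>';

// Join with newlines so each match sits on its own line.
var onePerLine = extractAnchors(line).join("\n");
```

The resulting text has one anchor per line, so a line-based splitter such as SplitText could then produce one flow file per link.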
