ExtractText multiple Groups found

I am scraping a page for multiple links. I use ExtractText on the HTML, with Enable Repeating Capture Group set to true.

93023-extracttext.png

I get the following attributes on the flow file:

URL.0, URL.1 ... URL.n

93022-links.png

I want to process each of these individually.

How do I either turn these attributes into new flow files, or convert them into something that allows accessing each one?

The number of links found is not fixed.

I did try ExecuteScript, but it killed NiFi. This was my first attempt at using ExecuteScript.

var OutputStreamCallback = Java.type("org.apache.nifi.processor.io.OutputStreamCallback");
var StandardCharsets = Java.type("java.nio.charset.StandardCharsets");
var JavaString = Java.type("java.lang.String");

var flowFile = session.get();
if (flowFile != null) {
    // Expects a single 'urls' attribute holding pipe-separated links
    var urls = flowFile.getAttribute('urls');
    if (urls == null) {
        // Nothing to split: route the original so it is not left unhandled
        session.transfer(flowFile, REL_FAILURE);
    } else {
        var myAttr = urls.split("|");
        for (var i = 0; i < myAttr.length; i++) {
            // Each child flow file inherits the parent's attributes
            var newFlowFile = session.create(flowFile);
            newFlowFile = session.write(newFlowFile,
                new OutputStreamCallback(function (outputStream) {
                    outputStream.write(new JavaString(myAttr[i]).getBytes(StandardCharsets.UTF_8));
                }));
            session.transfer(newFlowFile, REL_SUCCESS);
        }
        // The parent is no longer needed once the children exist
        session.remove(flowFile);
        // No session.commit() here: ExecuteScript handles the commit after the script finishes
    }
}
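
Since ExtractText writes the matches as numbered attributes (URL.0, URL.1 ... URL.n) rather than as one pipe-separated `urls` attribute, one possible approach is to gather every attribute whose name looks like `URL.<n>` before creating the child flow files. The sketch below is plain JavaScript and shows only that collection step. `collectNumbered` and the sample attribute map are hypothetical; inside ExecuteScript the names and values would come from the flow file's attributes, and the iteration would need adapting to a Java map.

```javascript
// Hypothetical helper: gather the values of attributes named "<prefix>.<n>",
// ordered by the numeric suffix n. The prefix is assumed to contain no
// regex metacharacters.
function collectNumbered(attributes, prefix) {
    var pattern = new RegExp("^" + prefix + "\\.(\\d+)$");
    return Object.keys(attributes)
        .map(function (name) {
            var m = pattern.exec(name);
            return m ? { index: parseInt(m[1], 10), value: attributes[name] } : null;
        })
        .filter(function (entry) { return entry !== null; })
        .sort(function (a, b) { return a.index - b.index; })
        .map(function (entry) { return entry.value; });
}

// Made-up attribute map standing in for a flow file's attributes.
var sample = {
    "filename": "page.html",
    "URL.1": "anotherthing.html",
    "URL.0": "something.html",
    "URL.10": "last.html",
    "URL.2": "yetmore.html"
};
var urls = collectNumbered(sample, "URL");
```

Sorting on the parsed number rather than on the attribute name keeps URL.10 after URL.2, which a plain string sort would not.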


Re: ExtractText multiple Groups found

Found the solution, thanks to a template found on the Apache wiki.

Fetch the HTML and feed it to RouteText.

93030-routetext.png

This produces one file containing all matches, with each match on its own line.

Then use SplitText to break it into individual flow files, one per line.
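
To make the two steps concrete, here is a rough plain-JavaScript sketch of the logic, not of the processors themselves: keep only the lines that contain a match, then treat each kept line as a separate item. The `matchingLines` helper, the sample HTML, and the `hrefLine` pattern are all assumptions for illustration; the actual RouteText settings are in the screenshot above.

```javascript
// Keep only the lines that contain a match (the line-by-line routing step);
// each element of the returned array then stands for one split-off item.
function matchingLines(content, regex) {
    return content.split(/\r?\n/).filter(function (line) {
        return regex.test(line);
    });
}

// Made-up page with one link per line.
var html = [
    "<html><body>",
    '<a href="something.html">something</a>',
    "<p>no link here</p>",
    '<a href="anotherthing.html">another</a>',
    "</body></html>"
].join("\n");

// Hypothetical pattern; no "g" flag, so test() keeps no state between lines.
var hrefLine = /<a\s+href="[^"]*"/i;
var lines = matchingLines(html, hrefLine);
```

Because the filtering works on whole lines, this only yields one item per link when each link sits on its own line.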

Re: ExtractText multiple Groups found

The problem with this is that it returns one line per matching line, not one line per match. If the HTML has many a href elements injected with no line breaks between them, and I want to produce a line for each a href match, how do I do it?

E.g.:

<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>


Regular expression for an a href block.


RouteText returns just one line:
<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>


What I want is (basically a new line for each match):
<a href="something.html">something</a>
<a href="anotherthing.html">another</a>
<a href="yetmore.html">Yet More</a>
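
One way to think about this is to match every anchor globally instead of filtering whole lines. The sketch below is plain JavaScript under that assumption; `extractAnchors` and its pattern are hypothetical, and regex-based HTML matching like this is fragile for nested or unusual markup. In NiFi the same idea would have to be adapted, for example in a script, or possibly by inserting a line break after each closing `</a>` before the line-based steps run.

```javascript
// Return every <a ...>...</a> block in the input, one array element per match.
function extractAnchors(content) {
    // Lazy body ([\s\S]*?) so adjacent anchors on one line are matched separately;
    // the "g" flag makes match() return all matches rather than only the first.
    var anchor = /<a\s[^>]*href="[^"]*"[^>]*>[\s\S]*?<\/a>/gi;
    return content.match(anchor) || [];
}

// The three links from the example above, all on a single line.
var oneLine = '<a href="something.html">something</a>' +
              '<a href="anotherthing.html">another</a>' +
              '<a href="yetmore.html">Yet More</a>';

var anchors = extractAnchors(oneLine);
// Joining with "\n" gives one match per line, ready for a per-line split.
var perLine = anchors.join("\n");
```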
