ExtractText multiple Groups found

Explorer

I am scraping a page for multiple links. I run ExtractText on the HTML with "Enable repeating capture group" set to true.

93023-extracttext.png

I get the following attributes on the flow file:

URL.0, URL.1 ... URL.n

93022-links.png

I want to process these individually.

How do I either turn these attributes into new flow files, or convert them into something that allows access to each one?

The number of links found is not a fixed number.
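
To make the "variable number of attributes" problem concrete, here is a minimal sketch in plain JavaScript. `collectIndexed` is a hypothetical helper, and the `attributes` object is only an assumed stand-in for the flow file's attribute map; the sample values are made up. It gathers every key of the form `URL.<n>` and orders the values by index, however many there are.

```javascript
// Hypothetical helper: gather the values of attributes named "<prefix>.<n>",
// ordered by their numeric index. `attributes` is a plain object used here
// as a stand-in for a flow file's attribute map (an assumption).
function collectIndexed(attributes, prefix) {
    var marker = prefix + ".";
    return Object.keys(attributes)
        .filter(function (key) {
            // Keep only keys like "URL.0" or "URL.12"; skip "URL" and "URL.x".
            return key.indexOf(marker) === 0 && /^\d+$/.test(key.slice(marker.length));
        })
        .sort(function (a, b) {
            // Numeric sort so that "URL.10" comes after "URL.2".
            return Number(a.slice(marker.length)) - Number(b.slice(marker.length));
        })
        .map(function (key) {
            return attributes[key];
        });
}

// Example with made-up attribute values.
var sample = {
    "filename": "page.html",
    "URL.0": "something.html",
    "URL.1": "anotherthing.html",
    "URL.10": "last.html",
    "URL.2": "yetmore.html"
};
var urls = collectIndexed(sample, "URL");
```

Each entry of `urls` could then become the content of its own flow file, so the count never has to be known in advance.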

I did try ExecuteScript, but it killed NiFi. It was my first attempt at using ExecuteScript.

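// ExecuteScript (ECMAScript): split the pipe-delimited 'urls' attribute
// into one new flow file per URL.
var OutputStreamCallback = Java.type("org.apache.nifi.processor.io.OutputStreamCallback");
var StandardCharsets = Java.type("java.nio.charset.StandardCharsets");

var flowFile = session.get();
if (flowFile != null) {
    var urls = flowFile.getAttribute('urls');
    if (urls == null) {
        // Attribute is missing: route the original instead of leaving it untransferred.
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // String(...) forces a JavaScript string, so split("|") treats the pipe literally.
        var parts = String(urls).split("|");
        for (var i = 0; i < parts.length; i++) {
            var url = parts[i];
            // The child flow file inherits the parent's attributes.
            var child = session.create(flowFile);
            child = session.write(child, new OutputStreamCallback(function (outputStream) {
                outputStream.write(url.getBytes(StandardCharsets.UTF_8));
            }));
            session.transfer(child, REL_SUCCESS);
        }
        // The original is no longer needed once the children are transferred.
        session.remove(flowFile);
        // No session.commit() here: ExecuteScript commits the session itself.
    }
}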


2 REPLIES

Explorer

Found the solution, thanks to a template on the Apache wiki.

Fetch the HTML and feed it to RouteText.

93030-routetext.png

This produces one flow file containing all matches, with each match on its own line.

Then use SplitText to break that into an individual flow file for each line.

Explorer

The problem with this is that it returns one line per matching line, not one line per match. If the HTML has many a href tags injected with no line breaks between them, and I want to produce a line for each a href match, how do I do it?

eg:

<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>


Regular expression for an a href block.


RouteText returns just one line:
<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>


What I want is a new line for each match:
<a href="something.html">something</a>
<a href="anotherthing.html">another</a>
<a href="yetmore.html">Yet More</a>
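
The desired per-match output can be sketched in plain JavaScript with a global, non-greedy regular expression. This only illustrates the matching logic and is not a NiFi configuration; `anchorsOnePerLine` is a hypothetical helper and the pattern is an assumption, not the expression used in the RouteText setup above. Regex matching on HTML is fragile, so treat it as a sketch for simple, well-formed anchors only.

```javascript
// Hypothetical helper: return every <a ...>...</a> block in `html` on its own line.
// The pattern is an assumption, not the expression from the original flow.
function anchorsOnePerLine(html) {
    // [^>]* stays inside the opening tag; [\s\S]*? is non-greedy, so each match
    // stops at the first closing </a>; "g" finds all matches, "i" ignores case.
    var pattern = /<a\b[^>]*>[\s\S]*?<\/a>/gi;
    var matches = html.match(pattern) || [];
    return matches.join("\n");
}

var input = '<a href="something.html">something</a>' +
            '<a href="anotherthing.html">another</a>' +
            '<a href="yetmore.html">Yet More</a>';
var output = anchorsOnePerLine(input);
```

The non-greedy quantifier is the important part: a greedy `.*` would run from the first `<a` to the last `</a>` and produce a single match for the whole line.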
