Created on 10-26-2018 06:43 AM - edited 09-16-2022 06:50 AM
I am scraping a page for multiple links. I run ExtractText on the HTML with Enable Repeating Capture set to true.
I get the following attributes on the flow file:
URL.0, URL.1...URL.n
I want to process these individually.
How do I either turn each of these attributes into its own flow file,
or convert them into something that allows access to each one?
The number of links found is not a fixed number.
I did try ExecuteScript, but it killed NiFi. It was my first attempt at using ExecuteScript.
var OutputStreamCallback = Java.type("org.apache.nifi.processor.io.OutputStreamCallback");
var IOUtils = Java.type("org.apache.commons.io.IOUtils");
var StandardCharsets = Java.type("java.nio.charset.StandardCharsets");
var flowFiles = [];
var flowFile = session.get();
if (flowFile != null) {
    var myAttr = flowFile.getAttribute('urls').split("|");
    if (myAttr.length > 0) {
        for (var i = 0; i < myAttr.length; i++) {
            var newflowFile = session.create(flowFile);
            newflowFile = session.write(newflowFile, new OutputStreamCallback(function(outputStream) {
                outputStream.write(myAttr[i].getBytes(StandardCharsets.UTF_8))
            }));
            flowFiles.push(newflowFile);
        }
        session.transfer(flowFiles, REL_SUCCESS);
        session.remove(flowFile);
        session.commit();
    }
}
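For the indexed attributes described above (URL.0, URL.1...URL.n), one possible first step is to gather them into an ordered list before creating anything per link. The following is only a minimal sketch in plain JavaScript, not tested inside NiFi: the function name collectIndexedUrls and the sample attribute values are made up for illustration, and the attributes are modeled as a plain object (inside ExecuteScript they would presumably come from flowFile.getAttributes(), which is an assumption here).

```javascript
// Hypothetical helper: collect attributes named <prefix>.0, <prefix>.1, ...
// into an array ordered by the numeric index.
// `attributes` is a plain object in this sketch (assumption: in ExecuteScript
// the equivalent data would come from the flow file's attribute map).
function collectIndexedUrls(attributes, prefix) {
    // Escape regex metacharacters in the prefix, then require ".<digits>" at the end.
    var escaped = prefix.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    var pattern = new RegExp('^' + escaped + '\\.(\\d+)$');
    var found = [];
    Object.keys(attributes).forEach(function (name) {
        var m = pattern.exec(name);
        if (m) {
            found.push({ index: parseInt(m[1], 10), value: attributes[name] });
        }
    });
    // Sort numerically so that URL.10 comes after URL.2, not before it.
    found.sort(function (a, b) { return a.index - b.index; });
    return found.map(function (entry) { return entry.value; });
}

// Made-up sample data standing in for a flow file's attributes.
var example = {
    'filename': 'page.html',
    'URL.1': 'b.html',
    'URL.0': 'a.html',
    'URL.10': 'k.html',
    'URL.2': 'c.html'
};
var urls = collectIndexedUrls(example, 'URL');
```

Because the number of links is not fixed, the sketch discovers the indices from the attribute names instead of counting up to a known n. Each entry of the resulting array could then be handled on its own.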
Created on 10-29-2018 12:55 AM - edited 08-17-2019 04:58 PM
Found the solution, thanks to a template found on the Apache wiki.
Fetch the HTML and feed it to RouteText.
This produces one file containing all matches, with each match on its own line.
Then use SplitText to break that file into individual flow files, one per line.
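The data flow of those two steps can be illustrated outside NiFi. This is a rough sketch in plain JavaScript, not NiFi configuration: keepMatchingLines, the sample HTML, and the href pattern are all assumptions made for the example, and the real processors have options this does not model.

```javascript
// Step 1 (RouteText-like, simplified): keep only the lines that match the pattern.
function keepMatchingLines(text, pattern) {
    return text.split(/\r?\n/).filter(function (line) {
        return pattern.test(line);
    });
}

// Made-up input: five lines, two of which contain links.
var html = [
    '<html><body>',
    '<a href="one.html">One</a>',
    '<p>no link here</p>',
    '<a href="two.html">Two</a><a href="three.html">Three</a>',
    '</body></html>'
].join('\n');

// Assumed pattern; no "g" flag, so test() keeps no state between lines.
var hrefPattern = /<a\s+href="[^"]*"/i;

// Step 2 (SplitText-like, simplified): each kept line stands in for one flow file.
var lines = keepMatchingLines(html, hrefPattern);
```

In this sketch the matching is done per line, so two links that share a line stay together in one item.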
Created 12-17-2018 03:16 PM
The problem with this is that it returns one line for any match, so if the HTML has many a href elements placed together with no line break, they all come back as a single line. I want to produce one line for each a href match. How do I do it?
e.g.:
<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>
RegExpression for a href block.
RouteText returns just one line:
<a href="something.html">something</a><a href="anotherthing.html">another</a><a href="yetmore.html">Yet More</a>
What I want is (basically a new line for each match):
<a href="something.html">something</a>
<a href="anotherthing.html">another</a>
<a href="yetmore.html">Yet More</a>
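The desired result, one line per match instead of one line per matching input line, can be sketched in plain JavaScript with a global regular expression. This is only an illustration of the idea, not a NiFi processor configuration: the function name onePerMatch and the pattern are assumptions (the regular expression actually used in the flow is not shown above), and a regex is a fragile way to parse HTML in general.

```javascript
// Hypothetical helper: return every <a href="...">...</a> block found in the text,
// even when several of them sit on the same line.
function onePerMatch(text) {
    // Assumed pattern: an opening <a href="..."> tag, then lazily anything
    // up to the first closing </a>. The "g" flag makes match() return all matches.
    var anchorPattern = /<a\s+href="[^"]*"[^>]*>[\s\S]*?<\/a>/gi;
    return text.match(anchorPattern) || [];
}

// The single-line example from the post.
var line = '<a href="something.html">something</a>' +
           '<a href="anotherthing.html">another</a>' +
           '<a href="yetmore.html">Yet More</a>';

var matches = onePerMatch(line);

// Join with newlines so that every match ends up on its own line.
var result = matches.join('\n');
```

The lazy quantifier is what keeps each match from running on to the last closing tag on the line. Once the matches are separated by line breaks, a line-based split would see one link per line.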