Member since 05-02-2018
27 Posts
2 Kudos Received
1 Solution
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1159 | 08-09-2018 05:09 PM |
10-28-2020 01:21 PM
This issue appears to be related to https://issues.apache.org/jira/browse/NIFI-4417. I also tried using UpdateAttribute to build my regex in an attribute and then using that attribute as the Search Value in ReplaceText, but that has the same problem: NiFi attributes are not evaluated properly in the Search Value.
10-28-2020 08:44 AM
I'm trying to use ReplaceText to remove a number of lines from the top of a flowfile, where the number comes from a flowfile attribute. I'm using the following regex, but ReplaceText says it's invalid: `^(.*?\n){${skip_lines}}`. It seems like I should be able to reference a flowfile attribute from a regex according to this question, but I just get an error. Any idea how I should be doing this? My full config is below:
Labels: Apache NiFi
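For reference, the intent of the pattern in the question can be sketched outside NiFi. This is a minimal plain-Python illustration, assuming the attribute value has already been substituted into the quantifier as an integer. `drop_leading_lines` is a hypothetical helper, not a NiFi API, and the sketch says nothing about how ReplaceText evaluates Expression Language in the Search Value.

```python
import re


def drop_leading_lines(text, skip_lines):
    """Remove the first `skip_lines` newline-terminated lines from `text`."""
    if skip_lines <= 0:
        return text
    # Same shape as ^(.*?\n){N}, with N filled in before the regex is compiled
    pattern = re.compile(r"^(?:.*?\n){%d}" % skip_lines)
    # Without re.MULTILINE, ^ only anchors at the very start of the text
    return pattern.sub("", text, count=1)


if __name__ == "__main__":
    sample = "header 1\nheader 2\nrow 1\nrow 2\n"
    print(drop_leading_lines(sample, 2))
```

If the text has fewer newline-terminated lines than `skip_lines`, the pattern does not match and the text comes back unchanged.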
07-11-2020 09:11 AM
1 Kudo
I now see that 'Infer Schema' is an option in Record readers, so this processor is no longer needed. Leaving this up so others might find it.
07-11-2020 07:05 AM
I feel a little silly asking because I can't find anything about this on the internet, but was InferAvroSchema removed from NiFi 1.11? My organization recently upgraded our NiFi version and I noticed it was missing, but figured it was something they had been messing with. However, I upgraded my home server's NiFi and I notice it's missing from there too. I'm hoping that it was replaced by another processor or something? I really use this a lot.
Labels: Apache NiFi
11-29-2018 06:57 PM
I figured there had to be a better way to do that, thanks @Matt Burgess! Is there documentation on the different programming language APIs that I'm missing? I've been working off of your excellent ExecuteScript cookbooks posted here, but beyond that I couldn't find where in the documentation I could have looked up something like session.remove().
11-28-2018 05:55 PM
I'm trying to read a JSON from `flowFile` and add the contents as attribute keys in the empty `updated_flowFile`, but I get `transfer relationship not specified` even though I'm specifying it.

```python
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback
from org.apache.nifi.processor.io import OutputStreamCallback
import json

data = {}

# Read contents of flowFile and write contents to data{}
class PyInputStreamCallback(InputStreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        global data
        data = json.loads(text)

# Get incoming flowFile and call PyInputStreamCallback
flowFile = session.get()
if (flowFile != None):
    try:
        session.read(flowFile, PyInputStreamCallback())
        global data
        # Create a blank flowfile, update the attributes with contents of data{} and write it to session
        updated_flowFile = session.create()
        updated_flowFile = session.putAttribute(updated_flowFile, 'left', data['left'])
        updated_flowFile = session.putAttribute(updated_flowFile, 'top', data['top'])
        session.close(flowFile)
        session.transfer(updated_flowFile, REL_SUCCESS)
    except:
        session.close(updated_flowFile)
        session.transfer(flowFile, REL_FAILURE)
else:
    session.transfer(flowFile, REL_FAILURE)
```

Alternatively, if there's a way to use the same flowFile object and wipe the JSON contents, that would work too. I am doing a MergeContent later in my pipeline, so I need the contents to be totally empty except for the attributes I'm adding.
Labels: Apache NiFi
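The JSON-to-attributes step in the question above can be sketched on its own in plain Python 3, outside NiFi. `json_to_attributes` is a hypothetical helper name, and the sketch assumes the flowfile content is a single JSON object. Since `putAttribute` takes string values, non-string JSON values are serialized first. Inside the script, each resulting pair would presumably be passed to `session.putAttribute`, and the original flowfile would still need to be transferred or removed (the follow-up reply in this thread mentions `session.remove()`).

```python
import json


def json_to_attributes(text, keys):
    """Parse a JSON object and return {key: string value} for the requested keys."""
    parsed = json.loads(text)
    attributes = {}
    for key in keys:
        if key in parsed:
            value = parsed[key]
            # Attribute values must be strings; keep strings as-is and
            # serialize numbers, booleans, and nested values as JSON text
            attributes[key] = value if isinstance(value, str) else json.dumps(value)
    return attributes


if __name__ == "__main__":
    content = '{"left": 10, "top": 20, "label": "box"}'
    for name, value in json_to_attributes(content, ["left", "top"]).items():
        print(name, value)
```

Keys that are absent from the JSON are skipped rather than raising, which is a design choice of this sketch and not something the original script does.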
08-09-2018 05:09 PM
This issue was caused by me not using try/catch properly. Since the variables declared inside the try block weren't visible to the rest of my code outside the try/catch, it was returning the PDF.
08-08-2018 08:36 PM
I found Matt's cookbooks and I'm following the recipe for overwriting a FlowFile. It seems very simple and straightforward and I'm not sure what I'm missing. My code is supposed to read the PDF in from the FlowFile, use PDFBox to extract the first and last name from the form (it's an I9), and then output the results into a JSON which gets sent out in REL_SUCCESS. Instead it just outputs the PDF file to REL_SUCCESS. Not sure if it's never being read, which is causing blank output, or I'm writing it out wrong, or what.

```groovy
import java.nio.charset.StandardCharsets
import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson

def flowFile = session.get()
flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        //Load Flowfile contents
        PDDocument document = PDDocument.load(inputStream)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
        //Get the first page
        List<PDPage> allPages = document.getDocumentCatalog().getAllPages()
        PDPage page = allPages.get(0)
    } catch (Exception e){
        System.out.println(e.getMessage())
        session.transfer(flowFile, REL_FAILURE)
    }
    //Define the areas to search and add them as search regions
    stripper = new PDFTextStripperByArea()
    Rectangle lname = new Rectangle(25, 226, 240, 15)
    stripper.addRegion("lname", lname)
    Rectangle fname = new Rectangle(276, 226, 240, 15)
    stripper.addRegion("fname", fname)
    //Load the results into a JSON
    def boxMap = [:]
    stripper.setSortByPosition(true)
    stripper.extractRegions(page)
    regions = stripper.getRegions()
    for (String region : regions) {
        String box = stripper.getTextForRegion(region)
        boxMap.put(region, box)
    }
    Gson gson = new Gson()
    //Remove random noise from the output
    json = gson.toJson(boxMap, LinkedHashMap.class)
    json = json.replace('\\n', '')
    json = json.replace('\\r', '')
    json = json.replace(',"', ',\n"')
    //Overwrite flowfile contents with JSON
    outputStream.write(json.getBytes(StandardCharsets.UTF_8))
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
```

Help appreciated!
Labels: Apache NiFi
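The "remove random noise" step in the script above edits the serialized JSON string. One alternative, sketched here in plain Python 3 as an assumption and not as the approach the thread settled on, is to strip line breaks from each extracted value before serializing, so the JSON text never needs string surgery. `regions_to_json` is a hypothetical helper, and the input is assumed to be a mapping of region name to extracted text.

```python
import json


def regions_to_json(box_map):
    """Strip line breaks from each region's text, then serialize one key per line."""
    cleaned = {}
    for region, text in box_map.items():
        # Drop carriage returns and newlines from the extracted value itself
        cleaned[region] = text.replace("\r", "").replace("\n", "")
    # indent=2 puts each key on its own line, similar to the original output
    return json.dumps(cleaned, indent=2)


if __name__ == "__main__":
    extracted = {"lname": "Doe\r\n", "fname": "Jane\n"}
    print(regions_to_json(extracted))
```

Cleaning the values first avoids accidentally altering a comma-quote sequence or an escaped character that legitimately appears inside a field.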
08-08-2018 08:15 PM
Thanks for everything, @Matt Burgess. I was able to get this going by making my code more Groovy and cutting the classes and the main() method out of my implementation.
08-08-2018 05:13 PM
Hey @Matt Burgess, that worked, thanks! I'm trying to scale up now, and when I try adding that code to a class and calling it from main() I get errors about the static keyword and context. I've tried running it from a run() method and then calling that from main, and moving the flowFile declaration outside of main, but I'm just not understanding. Sorry to be such a bother, I just can't find this in the documentation or any examples of doing this from a class.

```groovy
import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson

class nocr {
    static void main(String args) {
        def flowFile = session.get()
        if (!flowFile)
            return
        try {
            def inputStream = session.read(flowFile)
            PDDocument document = PDDocument.load(inputStream)
            PDFTextStripperByArea stripper = new PDFTextStripperByArea()
            // Do your other stuff here, probably writing something out to flow file(s)?
            inputStream.close()
            // If you changed the original flow file, transfer it here
            session.transfer(flowFile, REL_SUCCESS)
        } catch (Exception whatever) {
            print(whatever)
            // Something went wrong, send the original flow file to failure
            session.transfer(flowFile, REL_FAILURE)
        }
        println('it worked')
    }
}
```