Created 08-07-2018 05:52 PM
I'm trying to read a PDF from my flowfile into a Groovy/Java File object. I'm not sure what I'm doing wrong, but when I try to use session.get() I get the error `transfer relationship not specified`.
    import org.apache.pdfbox.io.IOUtils
    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.util.PDFTextStripperByArea
    import java.awt.Rectangle
    import org.apache.pdfbox.pdmodel.PDPage
    import com.google.gson.Gson

    try {
        //Get flowfile into File() object
        File file = session.get()
        PDDocument document = PDDocument.load(file)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
    } catch (Exception whatever) {
        print(whatever)
    }
    println('it worked')

I've also tried using stdin instead of relying on the session object, but that just hangs. I let it spin for about 5 minutes trying to read an 85 kB PDF file before I gave up.
    import org.apache.pdfbox.io.IOUtils
    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.util.PDFTextStripperByArea
    import java.awt.Rectangle
    import org.apache.pdfbox.pdmodel.PDPage
    import com.google.gson.Gson

    try {
        //Get flowfile into File() object
        File file = new File()
        OutputStream os = new FileOutputStream(file)
        IOUtils.copy(System.in, os)
        os.close()
        PDDocument document = PDDocument.load(file)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
    } catch (Exception whatever) {
        print(whatever)
    }
    println('it worked')

It doesn't really matter which of these works, just as long as I can get the FlowFile into a Java File object so I can continue processing. Any help is appreciated; I'm extremely frustrated.
Created 08-07-2018 06:01 PM
A FlowFile doesn't ever really exist as a Java File object; instead, you access its contents as an InputStream. I believe PDDocument has a load(InputStream) method, so you could do something like:
    import org.apache.pdfbox.io.IOUtils
    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.util.PDFTextStripperByArea
    import java.awt.Rectangle
    import org.apache.pdfbox.pdmodel.PDPage
    import com.google.gson.Gson

    def flowFile = session.get()
    if(!flowFile) return
    try {
        def inputStream = session.read(flowFile)
        PDDocument document = PDDocument.load(inputStream)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
        // Do your other stuff here, probably writing something out to flow file(s)?
        inputStream.close()
        // If you changed the original flow file, transfer it here
        session.transfer(flowFile, REL_SUCCESS)
    } catch(Exception whatever) {
        print(whatever)
        // Something went wrong, send the original flow file to failure
        session.transfer(flowFile, REL_FAILURE)
    }
    println('it worked')
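If some other library really does insist on a java.io.File, one possible fallback is to spool the stream to a temporary file first. This is only a sketch: the helper name spoolToTempFile is hypothetical (it is not part of NiFi or PDFBox), and a ByteArrayInputStream stands in for the stream that session.read(flowFile) would return inside ExecuteScript.

```groovy
import java.nio.file.Files
import java.nio.file.StandardCopyOption

// Hypothetical helper: copy any InputStream into a temporary File.
File spoolToTempFile(InputStream inputStream) {
    File tmp = File.createTempFile('flowfile-', '.pdf')
    tmp.deleteOnExit()
    // createTempFile already created the file, so the copy must be allowed to replace it.
    // Files.copy reads to the end of the stream but does not close it.
    Files.copy(inputStream, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING)
    return tmp
}

// Stand-in for session.read(flowFile); in NiFi you would pass that stream instead.
def content = '%PDF-1.4 sample bytes'.getBytes('UTF-8')
File file = spoolToTempFile(new ByteArrayInputStream(content))
println "Spooled ${file.length()} bytes to ${file.name}"
```

Reading straight from the InputStream, as in the script above, avoids the extra disk copy, so the temporary file is probably only worth it when an API accepts nothing but a File.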
If you're going to be replacing the contents of the incoming flow file with some extraction from the PDF, then you can do both the read and the write in a "StreamCallback". Check out Part 2 of my ExecuteScript Cookbook for ways to read/write flow files.
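As a rough sketch of that read-and-write pattern, the logic can be written as a closure over plain streams and then coerced to a StreamCallback inside ExecuteScript. The closure name replaceContent is hypothetical, and upper-casing the text is only a stand-in for a real PDF extraction. The NiFi-specific lines are left as comments so the block runs outside NiFi.

```groovy
// Read everything from the incoming stream, then write the replacement content.
// Upper-casing is a placeholder for whatever you extract from the PDF.
def replaceContent = { InputStream inputStream, OutputStream outputStream ->
    def buffer = new ByteArrayOutputStream()
    byte[] chunk = new byte[8192]
    int n
    while ((n = inputStream.read(chunk)) != -1) {
        buffer.write(chunk, 0, n)
    }
    String text = buffer.toString('UTF-8')
    outputStream.write(text.toUpperCase().getBytes('UTF-8'))
}

// Inside ExecuteScript the same closure would be handed to the session, roughly:
//   import org.apache.nifi.processor.io.StreamCallback
//   flowFile = session.write(flowFile, replaceContent as StreamCallback)
//   session.transfer(flowFile, REL_SUCCESS)

// Exercising the closure with in-memory streams:
def source = new ByteArrayInputStream('extracted text'.getBytes('UTF-8'))
def sink = new ByteArrayOutputStream()
replaceContent(source, sink)
println sink.toString('UTF-8')
```

Note that session.write returns the updated flow file reference, which is the one that should be transferred.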
Created 08-07-2018 06:35 PM
Hey @Matt Burgess, thanks for such a quick response. I've been messing with your code for a little bit but can't get past the error `MissingPropertyException: No such property: flowFile for class: Script58238`. It still doesn't seem to be getting the flowFile correctly.
Created 08-07-2018 06:48 PM
Oops, I put the def flowFile inside the try. I have since edited the answer to (hopefully!) be correct.
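For anyone hitting the same error, here is a small sketch of why the placement matters. A variable declared with def inside the try block is local to that block, so the catch block cannot see it; Groovy then looks for a script property with that name and throws MissingPropertyException. The string stands in for a real flow file, and the method name fixedShape is hypothetical.

```groovy
// Broken shape: flowFile is declared inside the try block.
String brokenMessage = null
try {
    try {
        def flowFile = 'stand-in flow file'
        throw new IOException('simulated PDF parse failure')
    } catch (IOException e) {
        // flowFile is out of scope here, so Groovy looks it up as a script property and fails
        println flowFile
    }
} catch (MissingPropertyException e) {
    brokenMessage = e.message
}
println brokenMessage

// Fixed shape: declare before the try so both the try and the catch can use it.
String fixedShape() {
    def flowFile = 'stand-in flow file'
    try {
        throw new IOException('simulated PDF parse failure')
    } catch (IOException e) {
        return "routed to failure: ${flowFile}"
    }
}
println fixedShape()
```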
Created 08-08-2018 05:13 PM
Hey @Matt Burgess that worked, thanks!
I'm trying to scale up now, and when I add that code to a class and call it from main() I get errors about the static keyword and context. I've tried running it from a run() method and then calling that from main, and moving the flowFile declaration outside of main, but I'm just not understanding.
Sorry to be such a bother, I just can't find documentation or examples of doing this from a class.
    import org.apache.pdfbox.io.IOUtils
    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.util.PDFTextStripperByArea
    import java.awt.Rectangle
    import org.apache.pdfbox.pdmodel.PDPage
    import com.google.gson.Gson

    class nocr {
        static void main(String args) {
            def flowFile = session.get()
            if (!flowFile) return
            try {
                def inputStream = session.read(flowFile)
                PDDocument document = PDDocument.load(inputStream)
                PDFTextStripperByArea stripper = new PDFTextStripperByArea()
                // Do your other stuff here, probably writing something out to flow file(s)?
                inputStream.close()
                // If you changed the original flow file, transfer it here
                session.transfer(flowFile, REL_SUCCESS)
            } catch ( Exception whatever ) {
                print(whatever)
                // Something went wrong, send the original flow file to failure
                session.transfer(flowFile, REL_FAILURE)
            }
            println('it worked')
        }
    }
Created 08-08-2018 06:13 PM
The script code runs basically as the body of an onTrigger() method, which is a Processor's method that gets called when ExecuteScript is triggered to execute. You don't need that code in a class, but if you want to call main() then you have to do it outside the class but in the same script. If you're trying to be able to pass in arguments, then instead of using a class with a main() method, the user would specify them in the ExecuteScript configuration dialog as user-defined properties, and they are available to the script by their name. They get bound to the script as variables with PropertyValue objects, so to get their values you'll need to call getValue() on them (there are examples in the cookbook).
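To illustrate the class-versus-script-body point, here is a minimal sketch. Code inside a class cannot see the variables NiFi binds to the script (session, REL_SUCCESS and so on), so one option is to pass them in as parameters and call the class from the script body. FakeSession, Nocr and extract are hypothetical names; FakeSession only imitates the read() call so the example runs outside NiFi.

```groovy
// Stand-in for NiFi's ProcessSession (hypothetical); only read() is imitated.
class FakeSession {
    InputStream read(String flowFile) {
        new ByteArrayInputStream(flowFile.getBytes('UTF-8'))
    }
}

class Nocr {
    // The class has no access to script-bound variables, so the session and
    // flow file are handed over explicitly.
    static String extract(session, flowFile) {
        def inputStream = session.read(flowFile)
        // getText reads to the end of the stream and closes it
        return inputStream.getText('UTF-8').trim()
    }
}

// Script body: this is the part that runs when the processor is triggered.
// In ExecuteScript, session is already bound and flowFile would come from session.get().
def session = new FakeSession()
def flowFile = ' text pulled from a PDF '
def extracted = Nocr.extract(session, flowFile)
println extracted
```

Inside ExecuteScript you would drop the FakeSession lines and follow the call with session.transfer(...) in the script body, as in the earlier answer.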
If you want a full-fledged implementation of a Processor, you can use InvokeScriptedProcessor, which expects an implementation of the Processor interface. It still requires a line outside the class to store away a variable containing an instance of your Processor; InvokeScriptedProcessor's methods then get delegated to your Processor implementation. One of the advantages there is that you can add concrete properties to the InvokeScriptedProcessor dialog (via the getSupportedPropertyDescriptors() method) rather than passing in user-defined properties as variables to the script. I have some examples on my blog on how to use InvokeScriptedProcessor.
Created 08-08-2018 08:15 PM
Thanks for everything, @Matt Burgess. I was able to get this going by learning to make my code more Groovy and cutting the need for classes and the main() method out of my implementation.