Created on 04-03-2019 04:42 PM - edited 08-17-2019 04:17 PM
My goal is to take in a large PDF (34MB), split it into smaller, 2-page PDFs, and send each of the new 2-page PDFs out as a new flowfile for further processing. I've provided the code I have so far below, most of which came from the ExecuteScript Cookbook series. It looks as though the upstream flowfile is not even entering the processor (photo attached). Any help would be greatly appreciated!
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.multipdf.Splitter
import org.apache.commons.StandardCharsets

flowFile = session.get()
if(!flowFile) return

def pdf = ""
session.read(flowFile, {inputStream ->
    pdf = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
} as InputStreamCallback)

def document = PDDocument.load(pdf)
def splitter = new Splitter()
splitter.setSplitAtPage(2)
def forms = splitter.split(document)
//Iterator<PDDocument> iterator = forms.listIterator();
for(form in forms) {
    form.close()
    newFlowFile = session.create()
    newFlowFile = session.write(newFlowFile, {outputStream ->
        outputStream.write(form.getBytes(StandardCharsets.UTF_8)) as OutputStreamCallback
    })
    session.transfer(newFlowFile, REL_SUCCESS)
}
Created 04-03-2019 06:27 PM
I believe the flow file is entering the processor and just taking a very long time to process. In the meantime it will show up in the connection on the UI (although if you try to remove it while it's being processed, you will get a message that zero flow files were removed). The indicator that the flow file is being processed is the grid of light/dark dots on the right side of the processor. While that is shown, the processor is executing, ostensibly on one or more flow files from the incoming queue.
For your script, I think the reason for the long processing (which I would expect to be followed by errors on the processor and in the log) is that you're reading the entire file into a String and then calling PDDocument.load() on that String, when there is no such method (you need a byte[] or an InputStream). The very unfortunate part is that Groovy will try to print out the value of your String, and for some unknown reason when you call toString() on a PDDocument, it gives the entire content, which for large PDFs you can imagine is quite cumbersome.
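As a side note on why a String is a poor container for PDF content in general, here is a minimal sketch in plain Groovy (no NiFi or PDFBox needed). The byte values are a hypothetical stand-in for real PDF content, chosen only because they include sequences that are not valid UTF-8. It suggests that decoding binary data as UTF-8 and encoding it again does not give back the original bytes:

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for binary PDF content: the ASCII text "%PDF-"
// followed by four bytes that do not form valid UTF-8 sequences.
byte[] original = '255044462de2e3cfd3'.decodeHex()

// Decode as text and encode again, as a String-based approach would.
String asText = new String(original, StandardCharsets.UTF_8)
byte[] roundTripped = asText.getBytes(StandardCharsets.UTF_8)

// Invalid sequences are swapped for the replacement character during
// decoding, so the re-encoded bytes no longer match the input.
boolean intact = Arrays.equals(original, roundTripped)
println "original: ${original.length} bytes, round-tripped: ${roundTripped.length} bytes, intact: ${intact}"
```

So even if a load(String) method existed, the content would likely be damaged before PDFBox ever saw it.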
Luckily you can skip the representation as a String altogether, since the ProcessSession API gives you an InputStream and/or OutputStream, which you can use for load() and save() methods on a PDDocument.
I took the liberty of refactoring your script above. Mine's not super sophisticated (especially in terms of error handling) but should give you the gist of the approach:
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.multipdf.Splitter

flowFile = session.get()
if(!flowFile) return

def flowFiles = [] as List<FlowFile>
try {
    // Load the PDF straight from the flow file's content stream
    def document
    session.read(flowFile, {inputStream ->
        document = PDDocument.load(inputStream)
    } as InputStreamCallback)

    def splitter = new Splitter()
    splitter.setSplitAtPage(2)
    try {
        def forms = splitter.split(document)
        forms.each { form ->
            // Save each split into a new child flow file of the original
            newFlowFile = session.write(session.create(flowFile), {outputStream ->
                form.save(outputStream)
            } as OutputStreamCallback)
            flowFiles << newFlowFile
            form.close()
        }
    } catch(e) {
        log.error('Error writing splits', e)
        throw e
    } finally {
        document?.close()
    }
    session.transfer(flowFiles, REL_SUCCESS)
} catch(Exception e) {
    log.error('Error processing incoming PDF', e)
    session.remove(flowFiles)
}
// The original flow file is removed whether or not the split succeeded
session.remove(flowFile)
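To get a feel for how many flow files to expect downstream, here is a rough sketch of the page grouping that splitting every 2 pages produces. The pageRanges helper is hypothetical and written in plain Groovy for illustration; it is not part of PDFBox or NiFi, and it only models the grouping, assuming the last split simply holds whatever pages are left over:

```groovy
// Hypothetical helper (not a PDFBox API): lists the 1-based [first, last]
// page ranges produced when a document is split every `pagesPerSplit` pages.
List<List<Integer>> pageRanges(int totalPages, int pagesPerSplit) {
    def ranges = []
    for (int start = 1; start <= totalPages; start += pagesPerSplit) {
        // The final range is cut short when the page count is not a multiple
        int end = Math.min(start + pagesPerSplit - 1, totalPages)
        ranges << [start, end]
    }
    return ranges
}

// A 7-page document split every 2 pages gives four pieces
println pageRanges(7, 2)   // prints [[1, 2], [3, 4], [5, 6], [7, 7]]
```

Under that assumption, an odd page count leaves a single-page PDF as the last flow file, which downstream processors may need to tolerate.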
Created 04-09-2019 11:32 PM
Hi Matt. Works like a charm, thanks for your help!