Support Questions

rstewart · 04-03-2019

My goal is to take in a large PDF (34MB), split it into smaller, 2-page PDFs, and send each of the new 2-page PDFs out as a new flowfile for further processing. I've provided the code I have so far below, most of which came from the ExecuteScript Cookbook series. It looks as though the upstream flowfile is not even entering the processor (photo attached). Any help would be greatly appreciated!

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.multipdf.Splitter
import org.apache.commons.StandardCharsets

flowFile = session.get()
if(!flowFile) return

def pdf = ""
session.read(flowFile, {inputStream ->
    pdf = IOUtils.toString(inputStream, StandardCharsets.UTF_8)} as InputStreamCallback)

def document = PDDocument.load(pdf)

def splitter = new Splitter()
splitter.setSplitAtPage(2)

def forms = splitter.split(document)

//Iterator<PDDocument> iterator = forms.listIterator();

for(form in forms)
{
    form.close()
    newFlowFile = session.create()
    newFlowFile = session.write(newFlowFile, {outputStream ->
        outputStream.write(form.getBytes(StandardCharsets.UTF_8)) as OutputStreamCallback
        })
    session.transfer(newFlowFile, REL_SUCCESS)
}

mburgess · 04-03-2019

I believe the flow file is entering the processor and just taking a very long time to process. In the meantime it will show up in the connection on the UI (although if you try to remove it while it's being processed, you will get a message that zero flow files were removed). The indicator that the flow file is being processed is the grid of light/dark dots on the right side of the processor. While that is shown, the processor is executing, ostensibly on one or more flow files from the incoming queue.

For your script, I think the reason for the long processing (which I would expect to be followed by errors on the processor and in the log) is that you're reading the entire file into a String and then calling PDDocument.load() on that String, when there is no method for that (you need a byte[] or an InputStream). Decoding the binary PDF bytes as UTF-8 text would also corrupt them. The very unfortunate part here is that Groovy will try to print out the value of your String, and for some unknown reason when you call toString() on a PDDocument, it gives the entire content, which for large PDFs you can imagine is quite cumbersome.
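As a small illustration of why a String is the wrong container for PDF bytes, here is a sketch in plain Groovy (no NiFi or PDFBox needed). The byte values are just an example: the ASCII text %PDF followed by four high-bit bytes of the kind often found near the start of a PDF. The variable names are made up for this example.

```groovy
import java.nio.charset.StandardCharsets

// "%PDF" followed by four bytes that do not form valid UTF-8 sequences
byte[] original = [0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3] as byte[]

// Decode to a String and encode back, as the original script effectively does
String asText = new String(original, StandardCharsets.UTF_8)
byte[] roundTripped = asText.getBytes(StandardCharsets.UTF_8)

// Invalid sequences are replaced during decoding, so the bytes no longer match
println "Bytes survive the round trip: ${Arrays.equals(original, roundTripped)}"
println "original: ${original.length} bytes, after round trip: ${roundTripped.length} bytes"
```

The ASCII prefix survives, but the binary part does not, so anything that parses the result as a PDF is working from damaged input.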

Luckily you can skip the representation as a String altogether, since the ProcessSession API gives you an InputStream and/or OutputStream, which you can use for load() and save() methods on a PDDocument.
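To try the stream-based load() and save() calls outside NiFi, something like the following sketch should work. It assumes PDFBox 2.x (the @Grab version is my assumption, not something from the question) and uses in-memory streams as stand-ins for the ones the ProcessSession would hand you.

```groovy
@Grab('org.apache.pdfbox:pdfbox:2.0.27') // assumed version; PDDocument.load() is the 2.x API
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage

// Build a one-page PDF in memory and write it through an OutputStream
def buffer = new ByteArrayOutputStream()
def source = new PDDocument()
try {
    source.addPage(new PDPage())
    source.save(buffer)
} finally {
    source.close()
}
byte[] pdfBytes = buffer.toByteArray()

// Load it back from an InputStream, never going through a String
def reloaded = PDDocument.load(new ByteArrayInputStream(pdfBytes))
int pageCount
try {
    pageCount = reloaded.numberOfPages
} finally {
    reloaded.close()
}
println "Reloaded ${pageCount} page(s) from ${pdfBytes.length} bytes"
```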

I took the liberty of refactoring your script above. Mine's not super sophisticated (especially in terms of error handling), but it should give you the gist of the approach:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.multipdf.Splitter

flowFile = session.get()
if(!flowFile) return

// Child flow files created so far, so they can be removed if anything fails
def flowFiles = [] as List<FlowFile>
try {
   // Parse the PDF straight from the flow file's InputStream
   def document
   session.read(flowFile, {inputStream ->
      document = PDDocument.load(inputStream)
   } as InputStreamCallback)

   def splitter = new Splitter()
   splitter.setSplitAtPage(2)

   try {
      def forms = splitter.split(document)
      forms.each { form ->
         // Save each 2-page split into a new child of the incoming flow file
         def newFlowFile = session.write(session.create(flowFile), {outputStream ->
            form.save(outputStream)
         } as OutputStreamCallback)

         flowFiles << newFlowFile
         form.close()
      }
   } catch(e) {
      log.error('Error writing splits', e)
      throw e
   } finally {
      // Close the source only after every split has been saved
      document?.close()
   }
   session.transfer(flowFiles, REL_SUCCESS)
} catch(Exception e) {
   log.error('Error processing incoming PDF', e)
   session.remove(flowFiles)
}
// The original PDF is dropped whether or not the split succeeded
session.remove(flowFile)
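
If you want to sanity-check the split sizes without running a flow, a standalone sketch along these lines may help. It assumes PDFBox 2.x pulled in via @Grab (the version is my assumption), and a blank 5-page document built in memory stands in for the real PDF. The pageCounts and sizes lists are only there for the example.

```groovy
@Grab('org.apache.pdfbox:pdfbox:2.0.27') // assumed version; any 2.x release should do
import org.apache.pdfbox.multipdf.Splitter
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage

def pageCounts = []   // pages in each split
def sizes = []        // serialized size of each split, in bytes
def document = new PDDocument()
try {
    // Stand-in for the incoming PDF: five blank pages
    5.times { document.addPage(new PDPage()) }

    def splitter = new Splitter()
    splitter.setSplitAtPage(2)

    splitter.split(document).each { part ->
        try {
            // Save each split while the source document is still open
            def out = new ByteArrayOutputStream()
            part.save(out)
            pageCounts << part.numberOfPages
            sizes << out.size()
        } finally {
            part.close()
        }
    }
} finally {
    document.close()
}
println "Pages per split: ${pageCounts}"
```

With an odd page count, the last split simply holds the leftover page, so downstream processors should not assume every output has exactly two pages.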

rstewart · 04-09-2019

Hi Matt. Works like a charm, thanks for your help!

Cloudera Community

How can I use a Groovy ExecuteScript to split a large PDF into smaller, 2-page PDFs?