groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

New Contributor

I'm trying to read a PDF from my flowfile into a Groovy/Java File object. I'm not sure what I'm doing wrong, but when I try to use session.get() I get the error `transfer relationship not specified`.

import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson
try {
        //Get flowfile into File() object
        File file = session.get()
        PDDocument document = PDDocument.load(file)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
} catch (Exception whatever) {
        print(whatever)
}
println('it worked')

I've also tried using stdin instead of relying on the session object but that just hangs. I let it spin for about 5 minutes trying to read an 85 kB PDF file before I gave up.

import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson
try {
        //Get flowfile into File() object
        File file = new File()
        OutputStream os = new FileOutputStream(file)
        IOUtils.copy(System.in, os)
        os.close()
        PDDocument document = PDDocument.load(file)
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
} catch (Exception whatever) {
        print(whatever)
}
println('it worked')

It doesn't really matter which of these works, as long as I can get the FlowFile into a Java File object so I can continue processing. Any help is appreciated; I'm extremely frustrated.

Accepted Solutions

Re: groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

A FlowFile doesn't ever really exist as a Java File object; instead, you access its contents as an InputStream. I believe PDDocument has a load(InputStream) method, so you could do something like:

import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson
def flowFile = session.get()
if(!flowFile) return
try {
   def inputStream = session.read(flowFile)
   PDDocument document = PDDocument.load(inputStream)
   PDFTextStripperByArea stripper = new PDFTextStripperByArea()
   // Do your other stuff here, probably writing something out to flow file(s)?
   inputStream.close()
   // If you changed the original flow file, transfer it here
   session.transfer(flowFile, REL_SUCCESS)
} catch(Exception whatever) {
   print(whatever)
   // Something went wrong, send the original flow file to failure
   session.transfer(flowFile, REL_FAILURE)
}
println('it worked')

If you're going to be replacing the contents of the incoming flow file with some extraction from the PDF, then you can do both the read and the write in a "StreamCallback". Check out Part 2 of my ExecuteScript Cookbook for ways to read/write flow files.
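
As a rough sketch of the StreamCallback approach mentioned above, the script below reads the PDF and overwrites the flow file content with the extracted text in a single session.write() call. It assumes PDFBox 1.8.x (matching the org.apache.pdfbox.util imports used in this thread) is on the processor's module path and that plain-text UTF-8 output is what you want; both are assumptions, not something confirmed by the original answer.

```groovy
import org.apache.nifi.processor.io.StreamCallback
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripper
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

try {
    // session.write() returns the updated flow file reference, so reassign it
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        PDDocument document = PDDocument.load(inputStream)
        try {
            // Assumption: whole-document text is wanted; swap in PDFTextStripperByArea for regions
            String text = new PDFTextStripper().getText(document)
            outputStream.write(text.getBytes(StandardCharsets.UTF_8))
        } finally {
            document.close()
        }
    } as StreamCallback)
    session.transfer(flowFile, REL_SUCCESS)
} catch (Exception e) {
    log.error('PDF text extraction failed', e)
    // The write did not complete, so flowFile still refers to the original
    session.transfer(flowFile, REL_FAILURE)
}
```

NiFi opens and closes both streams around the callback, so the script only has to close the PDDocument itself.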

Re: groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

New Contributor

Hey @Matt Burgess thanks for such a quick response. I've been messing with your code for a little bit but can't get past the error `MissingPropertyException: No such property: flowFile for class: Script58238`. It still doesn't seem to be getting the flowFile correctly.

Re: groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

Oops, I put the def flowFile inside the try. I have since edited the answer to (hopefully!) be correct.

Re: groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

New Contributor

Hey @Matt Burgess that worked, thanks!
I'm trying to scale up now, and when I add that code to a class and call it from main() I get errors about the static keyword and context. I've tried running it from a run() method and calling that from main(), and moving the flowFile declaration outside of main(), but I'm just not understanding.

Sorry to be such a bother; I just can't find documentation or examples of doing this from a class.

import org.apache.pdfbox.io.IOUtils
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.util.PDFTextStripperByArea
import java.awt.Rectangle
import org.apache.pdfbox.pdmodel.PDPage
import com.google.gson.Gson
class nocr {
    static void main(String args) {
        def flowFile = session.get()
        if (!flowFile)
            return
        try {
            def inputStream = session.read(flowFile) 
            PDDocument document = PDDocument.load(inputStream)
            PDFTextStripperByArea stripper = new PDFTextStripperByArea()
            // Do your other stuff here, probably writing something out to flow file(s)?
            inputStream.close()
            // If you changed the original flow file, transfer it here
            session.transfer(flowFile, REL_SUCCESS)
        } catch (Exception whatever) {
            print(whatever)
            // Something went wrong, send the original flow file to failure
            session.transfer(flowFile, REL_FAILURE)
        }
        println('it worked')
    }
}

Re: groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

The script code runs basically as the body of an onTrigger() method, which is the Processor method that gets called when ExecuteScript is triggered to execute. You don't need that code in a class, but if you want to call main() then you have to do it outside the class but in the same script. If you're trying to pass in arguments, then instead of using a class with a main() method, specify them in the ExecuteScript configuration dialog as user-defined properties; they are then available to the script by name. They are bound to the script as variables holding PropertyValue objects, so to get their values you'll need to call getValue() on them (there are examples in the cookbook).
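
To illustrate both points, here is a minimal sketch of a class that lives in the same script and is called from the script body. Variables such as session and REL_SUCCESS are bound to the script, not to classes declared inside it, which is the likely source of the static-context errors, so the sketch passes them in as arguments. PdfHelper, countPages, and the user-defined property named outputAttribute are hypothetical names chosen for illustration.

```groovy
import org.apache.pdfbox.pdmodel.PDDocument

// Hypothetical helper class; script bindings are not visible in here,
// so everything it needs is passed in as a parameter.
class PdfHelper {
    static int countPages(session, flowFile) {
        def inputStream = session.read(flowFile)
        try {
            PDDocument document = PDDocument.load(inputStream)
            try {
                return document.numberOfPages
            } finally {
                document.close()
            }
        } finally {
            inputStream.close()
        }
    }
}

// Script body: this part runs with session, log, REL_SUCCESS and REL_FAILURE bound
def flowFile = session.get()
if (!flowFile) return

try {
    // Assumption: a user-defined property called outputAttribute was added in the
    // ExecuteScript dialog; it arrives as a PropertyValue, hence getValue()
    String attributeName = outputAttribute.getValue()
    int pages = PdfHelper.countPages(session, flowFile)
    flowFile = session.putAttribute(flowFile, attributeName, pages as String)
    session.transfer(flowFile, REL_SUCCESS)
} catch (Exception e) {
    log.error('Could not read PDF', e)
    session.transfer(flowFile, REL_FAILURE)
}
```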

If you want a full-fledged implementation of a Processor, you can use InvokeScriptedProcessor, which expects an implementation of the Processor interface. It still requires a line outside the class that stores an instance of your Processor in a variable; InvokeScriptedProcessor's methods then get delegated to your Processor implementation. One of the advantages there is that you can add concrete properties to the InvokeScriptedProcessor dialog (via the getSupportedPropertyDescriptors() method) rather than passing in user-defined properties as variables to the script. I have some examples on my blog on how to use InvokeScriptedProcessor.
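
For orientation only, a skeleton along these lines might look like the following. It extends AbstractProcessor (which implements the Processor interface) instead of implementing the interface directly, purely to keep the sketch short. The class name, the 'Tag Value' property, and the pdf.tag attribute are hypothetical, and the exact set of methods you need may differ between NiFi versions.

```groovy
import org.apache.nifi.components.PropertyDescriptor
import org.apache.nifi.components.Validator
import org.apache.nifi.processor.AbstractProcessor
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSession
import org.apache.nifi.processor.Relationship

// Hypothetical processor: copies a configured value into a flow file attribute
class PdfTaggingProcessor extends AbstractProcessor {
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name('success')
            .description('FlowFiles that were tagged')
            .build()

    // Shows up as a concrete property in the InvokeScriptedProcessor dialog
    static final PropertyDescriptor TAG_VALUE = new PropertyDescriptor.Builder()
            .name('Tag Value')
            .description('Value written to the pdf.tag attribute')
            .required(true)
            .defaultValue('unprocessed')
            .addValidator(Validator.VALID)
            .build()

    @Override
    Set<Relationship> getRelationships() { [REL_SUCCESS] as Set }

    @Override
    protected List<PropertyDescriptor> getSupportedPropertyDescriptors() { [TAG_VALUE] }

    @Override
    void onTrigger(ProcessContext context, ProcessSession session) {
        def flowFile = session.get()
        if (!flowFile) return
        String tag = context.getProperty(TAG_VALUE).getValue()
        flowFile = session.putAttribute(flowFile, 'pdf.tag', tag)
        session.transfer(flowFile, REL_SUCCESS)
    }
}

// The line outside the class: InvokeScriptedProcessor looks for this variable
processor = new PdfTaggingProcessor()
```

Here session and the relationships belong to the class itself, so nothing depends on script-level bindings.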

Re: groovy read flowfile: errorless hang when using stdin or 'transfer relationship not specified' when using session.get

New Contributor

Thanks for everything @Matt Burgess. I was able to get this going by making my code more Groovy and cutting the need for classes and the main() method out of my implementation.