You can use Apache Tika and Tess4J. Tika works well for PDFs that were exported from Word or other document formats, because those files carry an embedded text layer. For scanned images, Tess4J, a Java wrapper around the Tesseract OCR engine, gives better text extraction output.
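One way to combine the two is to try the embedded-text path first and fall back to OCR only when almost nothing comes back. The sketch below is a rough illustration of that idea, not a finished implementation: `PdfTextRouter`, `TextExtractor` and the `MIN_CHARS` threshold are hypothetical names and values chosen for the example. It uses only the JDK so it runs on its own; in a real project the two extractors would wrap `new Tika().parseToString(file)` and Tess4J's `new Tesseract().doOCR(file)`.

```java
import java.nio.file.Path;

/** Hypothetical helper that chooses between an embedded-text extractor and an OCR extractor. */
final class PdfTextRouter {

    /** Hypothetical abstraction over "something that turns a PDF into text". */
    interface TextExtractor {
        String extract(Path pdf) throws Exception;
    }

    // Assumed threshold: fewer non-blank characters than this suggests a scanned PDF.
    static final int MIN_CHARS = 20;

    /**
     * Runs the embedded-text extractor first (e.g. a wrapper around Tika) and
     * falls back to the OCR extractor (e.g. a wrapper around Tess4J) when the
     * first result is missing or too short.
     */
    static String extract(Path pdf, TextExtractor embedded, TextExtractor ocr) {
        try {
            String text = embedded.extract(pdf);
            if (text != null && text.trim().length() >= MIN_CHARS) {
                return text;
            }
            return ocr.extract(pdf);
        } catch (Exception e) {
            // Wrap checked library exceptions so callers get one failure type.
            throw new IllegalStateException("Text extraction failed for " + pdf, e);
        }
    }
}
```

The threshold is a guess and should be tuned on a sample of your own documents; some scanned PDFs contain a few stray characters of embedded text.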
Once you know how to extract the text, the choice of pipeline depends on how many PDFs land and at what ingest rate: you can use either NiFi (you will need a custom processor) or a MapReduce job that calls your code in parallel.
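Before committing to either, it can help to see what "calling your code in parallel" looks like on a single machine. The sketch below is only a local stand-in using a JDK thread pool; it is not NiFi or MapReduce code, and `ParallelExtraction`, `extractAll` and the `extractor` function are hypothetical names for illustration. The per-file function is the same one you would later call from a custom processor or a mapper.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical single-JVM stand-in for a parallel extraction job. */
final class ParallelExtraction {

    /** Applies the extractor to every path on a fixed-size pool, keeping input order. */
    static Map<String, String> extractAll(List<String> pdfPaths,
                                          Function<String, String> extractor,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per PDF; each task is independent of the others.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String path : pdfPaths) {
                Callable<String> task = () -> extractor.apply(path);
                pending.put(path, pool.submit(task));
            }
            // Collect results in submission order.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while extracting", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(
                            "Extraction failed for " + entry.getKey(), e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

If one machine with a thread pool keeps up with your ingest rate, you may not need a cluster at all; if it does not, the same per-file function moves into the NiFi processor or the MapReduce job.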