Created on 09-26-201704:55 AM - edited 08-17-201910:55 AM
In this article we will go over how we can use nifi to ingest PDFs and while we ingest we will use a custom groovy script with executescript processor to extract images from this PDF. The images will be tagged with PDF filename, page num and imagenum , so that it can be indexed in hbase/solr for quick searching and doing some machine learning or other analytics.
The Groovy code below depends on the following jars. Download and copy them to a folder on your computer. I my case they were under /var/pdimagelib/.
Copy the code below to a file on your computer, in my case the file was under /var/scripts/pdimage.groovy
Below is a screenshot of executescript after it has been setup correctly.
To ingest the PDF, i used a simple GetFile, though this approach should work for pdfs ingested with any other nifi processor. Below is a snapshot of the nifi flow. When a PDF is ingested, executescript will leverage groovy and pdfbox to extract images. The images will be tagged with the pdf filename,pagenum and imagenum. This can now be sent to hbase or any indexing solution for search or analytics.
Hope you find the article useful. Please comment withy your thoughts/questions.