Created on 02-06-2017 08:26 PM - edited 08-17-2019 05:02 AM
ExtractText NiFi Custom Processor Powered by Apache Tika
Apache Tika is amazing: it makes it very easy to analyze a file and extract its text. Tika builds on other powerful Apache projects such as Apache PDFBox and Apache POI.
Example Usage
- Feed in documents. I use my LinkProcessor, which grabs links from a website and returns a JSON list.
- Split the resulting JSON list into individual JSON rows with SplitJson.
- Use EvaluateJsonPath to extract just the URLs.
- Use InvokeHTTP to do a GET on each parsed URL.
- Use RouteOnAttribute to process only the file types I am interested in, such as Microsoft Word.
- Use the new ExtractTextProcessor to extract the text of the document.
- Save the text as a file in some data store, perhaps HDFS.
If you have a directory of files, you can just use GetFile to ingest them en masse.
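For readers who want to see what the split-and-extract steps of the flow do to the data, here is a minimal Python sketch. It is not how NiFi implements SplitJson or EvaluateJsonPath. The shape of the JSON (a list of objects with a `link` field) is an assumption made for illustration; check the actual LinkProcessor output before relying on it.

```python
import json

def extract_urls(link_list_json, url_key="link"):
    """Split a JSON list into rows and pull one URL field out of each row.

    url_key is a hypothetical field name; the real LinkProcessor output
    may use a different one.
    """
    rows = json.loads(link_list_json)   # the whole JSON list
    urls = []
    for row in rows:                    # one row at a time, like SplitJson
        url = row.get(url_key)          # one field per row, like EvaluateJsonPath
        if url:                         # skip rows that have no URL
            urls.append(url)
    return urls

# Made-up sample data, only to show the call.
sample = (
    '[{"link": "http://example.com/a.pdf"},'
    ' {"link": "http://example.com/b.docx"},'
    ' {"title": "no link here"}]'
)
print(extract_urls(sample))
```

Each URL returned here corresponds to one flow file that would then be handed to InvokeHTTP.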
LinkProcessor (https://github.com/tspannhw/linkextractorprocessor)
URL: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/index.html
This is an example of a URL that I want to grab all the documents from. You can point it at any URL that has links to documents (HTML, Word, Excel, PowerPoint, etc.).
RouteOnAttribute
I only want to process a few types of files, so I limit them here.
${filename:endsWith('.doc'):or(${filename:endsWith('.pdf')}):or(${filename:endsWith('.rtf')}):or(${filename:endsWith('.ppt')}):or( ${filename:endsWith('.docx')}):or(${filename:endsWith('.pptx')}):or(${filename:endsWith('.html')}):or(${filename:endsWith('.htm')}):or(${filename:endsWith('.xls')}):or( ${filename:endsWith('.xlsx')}):or(${filename:endsWith('.xml')}):or(${Content-Type:contains('text/html')}):or(${Content-Type:contains('application/pdf')}):or( ${Content-Type:contains('application/msword')}):or(${Content-Type:contains('application/vnd')}):or(${Content-Type:contains('text/xml')})}
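The expression above is long, so it may help to read it as plain logic: keep the flow file if the filename ends with one of the listed extensions, or if the Content-Type contains one of the listed fragments. The following Python sketch restates that logic. The function and constant names are made up for illustration, and treating a missing attribute as an empty string is an assumption.

```python
# Extensions and Content-Type fragments copied from the expression above.
WANTED_EXTENSIONS = (
    ".doc", ".pdf", ".rtf", ".ppt", ".docx", ".pptx",
    ".html", ".htm", ".xls", ".xlsx", ".xml",
)
WANTED_CONTENT_TYPES = (
    "text/html", "application/pdf", "application/msword",
    "application/vnd", "text/xml",
)

def should_process(filename, content_type=""):
    """Return True if either the filename or the Content-Type matches."""
    filename = filename or ""           # assumption: missing attribute acts like ""
    content_type = content_type or ""
    if filename.endswith(WANTED_EXTENSIONS):   # case-sensitive suffix match
        return True
    # substring match, so "application/pdf; charset=UTF-8" still matches
    return any(fragment in content_type for fragment in WANTED_CONTENT_TYPES)
```

Note that the match is case-sensitive, so a file named `REPORT.PDF` with no matching Content-Type would not pass this check.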
Release:
https://github.com/tspannhw/nifi-extracttext-processor/releases/tag/1.0
Reference:
- https://tika.apache.org/
- https://tika.apache.org/1.14/formats.html
- http://pdfbox.apache.org/
- https://pdfbox.apache.org/1.8/cookbook/documentcreation.html
- http://poi.apache.org/
- https://community.hortonworks.com/repos/81693/nifi-custom-processor-for-extracting-text-from-doc.htm...
- https://dzone.com/articles/cool-projects-big-data-machine-learning-apache-nifi
Created on 03-03-2018 05:01 PM
Hi,
I found your really interesting article about extracting text from any kind of file. I tried to test your processor, but every time I get an error message: Error: ExtractTextProcessor[id=...] Apache Tika failed to parse input Unable to extract PDF content
Apache NiFi is installed on a separate Ubuntu 16.04 server (only NiFi), not on the Hortonworks sandbox. Do I also need to install Tika on the same server? I only imported your processor into NiFi and ran it. I also tested on the sandbox (HDFS with Apache NiFi) and got a similar error message.
Sorry, I'm new to this field, and any help would be appreciated.
Thank you
Gojko
Created on 03-12-2018 02:19 AM
Try the new one: https://community.hortonworks.com/articles/177370/extracting-html-from-pdf-excel-and-word-documents....
Make sure you download the NAR from GitHub, put it in the NiFi lib directory, and then restart NiFi. Make sure you are running NiFi with JDK 8.
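As a rough sketch of that install step, the helper below copies a downloaded NAR into the `lib` directory under the NiFi home directory. The function name and the paths you pass in are hypothetical; adjust them to your own install. It only copies the file, so NiFi still has to be restarted afterwards.

```python
import shutil
from pathlib import Path

def install_nar(nar_path, nifi_home):
    """Copy a .nar file into <nifi_home>/lib and return the new path.

    Both arguments are placeholders for your own locations.
    """
    nar = Path(nar_path)
    if nar.suffix != ".nar":
        # Catches the common mistake of pointing at the source zip or a folder.
        raise ValueError(f"{nar} is not a .nar file")
    lib_dir = Path(nifi_home) / "lib"
    if not lib_dir.is_dir():
        raise FileNotFoundError(f"NiFi lib directory not found: {lib_dir}")
    target = lib_dir / nar.name
    shutil.copy2(nar, target)           # copy the file, keeping its timestamps
    return target

# After copying, restart NiFi (for example with bin/nifi.sh restart)
# so the new processor is loaded.
```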
Created on 03-16-2018 05:43 PM
Thanks, version 1.7 (I already saw there is a newer one, 1.17) works fantastically.
Great job.
Created on 03-17-2018 09:14 PM
Copy it to the NiFi lib directory and restart.
Created on 05-04-2020 10:32 AM
01. I go to the GitHub link https://github.com/tspannhw/nifi-extracttext-processor/releases
02. I download 'Source code.zip' to 'nifi-extracttext-processor-html' on my laptop
03. I extract 'nifi-extracttext-processor-html'
04. In NiFi, I upload the template '56409-tika.xml'
05. When I try to add the 'tika' template to the NiFi canvas, I still get the error 'org.apache.nifi.processors.kite.InferAvroSchema is not known to this NiFi instance.'
Created on 05-04-2020 10:51 AM
That's an old example; don't use it. InferAvroSchema is not needed.
Just download the extract NAR, put it in your NiFi lib directory, and restart. Then add the extract processor to your NiFi flow.
I don't have this example app or this version anymore, so I can't re-save it without the InferAvroSchema processor, which is not needed or used here.
Created on 05-04-2020 12:44 PM
Hello,
I see the 'nifi-extracttext-nar' folder, but I don't see a '.nar' file in it. Am I missing something?
Thanks
Created on 05-04-2020 08:04 PM
The link is right in the release folder.
Created on 05-08-2020 11:41 AM
Thanks for everything. That worked great.
Created on 05-08-2020 12:03 PM
Awesome. Good luck with NiFi.