
ExtractText NiFi Custom Processor Powered by Apache Tika

Apache Tika makes it very easy to analyze a file and extract its text. Tika builds on other powerful Apache projects such as Apache PDFBox and Apache POI.

12171-tikaflow.png

Example Usage

  1. Feed in documents. I use my LinkProcessor, which grabs links from a website and returns a JSON list.
  2. Split the resulting JSON list into individual JSON rows with SplitJson.
  3. Use EvaluateJsonPath to extract just the URLs.
  4. Use InvokeHTTP to do a GET on each parsed URL.
  5. Use RouteOnAttribute to process only the file types I am interested in, such as Microsoft Word.
  6. Use the new ExtractTextProcessor to extract the text of the document.

     12173-tikaextractadd.png

  7. Save the text as a file in some data store, perhaps HDFS.

12174-tikaresultsprops.png

12172-tikaoutput.png

If you have a directory of files, you can just use GetFile to ingest them en masse.

LinkProcessor (https://github.com/tspannhw/linkextractorprocessor)

URL:  http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/index.html

This is an example of a URL from which I want to grab all the documents. You can point the processor at any URL that has links to documents (HTML, Word, Excel, PowerPoint, etc.).
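
To make the "grab links and return a JSON list" idea concrete, here is a minimal sketch in plain Java. It is not the LinkProcessor source: the class name `LinkSketch` and both method names are hypothetical, and the regular expression is a rough heuristic rather than a real HTML parser, so treat it as an illustration of the input and output shape only.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; this is NOT the LinkProcessor implementation.
final class LinkSketch {
    // Rough heuristic: finds href="..." or href='...' inside <a ...> tags.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns every link found in the HTML, resolved against the page URL.
    static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Relative links such as "guide.pdf" become absolute URLs.
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip links that are not valid URIs (for example, unencoded spaces).
            }
        }
        return links;
    }

    // Renders the links as a JSON array of strings, escaping backslashes and quotes.
    static String toJsonList(List<String> links) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < links.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            String escaped = links.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append(']').toString();
    }
}
```

A JSON list in this shape is what the SplitJson and EvaluateJsonPath steps would then break into one URL per flow file.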

RouteOnAttribute

I only want to process a few types of files, so I limit them here.

${filename:endsWith('.doc'):or(${filename:endsWith('.pdf')}):or(${filename:endsWith('.rtf')}):or(${filename:endsWith('.ppt')}):or( 
${filename:endsWith('.docx')}):or(${filename:endsWith('.pptx')}):or(${filename:endsWith('.html')}):or(${filename:endsWith('.htm')}):or(${filename:endsWith('.xls')}):or( 
${filename:endsWith('.xlsx')}):or(${filename:endsWith('.xml')}):or(${Content-Type:contains('text/html')}):or(${Content-Type:contains('application/pdf')}):or( 
${Content-Type:contains('application/msword')}):or(${Content-Type:contains('application/vnd')}):or(${Content-Type:contains('text/xml')})}
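
The expression above is long, so it can help to read it as a simple rule: route the flow file if its `filename` ends with one of the listed extensions, or if its `Content-Type` attribute contains one of the listed fragments. The sketch below restates that rule in plain Java purely as a reading aid. The class and method names are hypothetical, and it assumes the same case-sensitive matching as `endsWith` and `contains` in the expression.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative mirror of the RouteOnAttribute expression; not part of the processor.
final class DocumentFilterSketch {
    // Extensions checked against the filename attribute.
    private static final List<String> EXTENSIONS = Arrays.asList(
            ".doc", ".pdf", ".rtf", ".ppt", ".docx", ".pptx",
            ".html", ".htm", ".xls", ".xlsx", ".xml");

    // Fragments checked against the Content-Type attribute.
    private static final List<String> CONTENT_TYPES = Arrays.asList(
            "text/html", "application/pdf", "application/msword",
            "application/vnd", "text/xml");

    // True when either attribute matches; a missing (null) attribute never matches.
    static boolean shouldProcess(String filename, String contentType) {
        if (filename != null) {
            for (String ext : EXTENSIONS) {
                if (filename.endsWith(ext)) {
                    return true;
                }
            }
        }
        if (contentType != null) {
            for (String type : CONTENT_TYPES) {
                if (contentType.contains(type)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Because the matching is case-sensitive, a file named `REPORT.PDF` would pass only if its `Content-Type` matched; lowercasing the filename first would be one way to loosen that.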

Release:

https://github.com/tspannhw/nifi-extracttext-processor/releases/tag/1.0

Reference:

Comments
Not applicable

Hi,

I found your really interesting article about extracting text from any kind of file. I tried to test your processor, but every time I get this error message: Error: ExtractTextProcessor[id=...] Apache Tika failed to parse input Unable to extract PDF content

Apache NiFi is installed on a separate Ubuntu 16.04 server (only NiFi), not on the Hortonworks sandbox. Do I also need to install Tika on the same server? I only imported your processor into NiFi and ran it. I also tested on the sandbox HDFS with Apache NiFi and got a similar error message.

Sorry, I'm new to this field, and any help would be appreciated.

Thank you

Gojko

Super Guru

Try the new one: https://community.hortonworks.com/articles/177370/extracting-html-from-pdf-excel-and-word-documents....

Make sure you download the NAR from GitHub, put it in the NiFi lib directory, and then restart NiFi. Make sure you are running NiFi with JDK 8.

Not applicable

Thanks, version 1.7 works fantastically (I see there is already a newer one, 1.17).

Great job.

Super Guru

Copy it to the NiFi lib directory and restart.

Last update: 08-17-2019 05:02 AM (revision 2 of 2)