Created on 05-23-2018 08:15 PM - edited 08-17-2019 07:21 AM
Updating The Apache OpenNLP Community Apache NiFi Processor to Support Flow Files
This release adds the ability to read content from the FlowFile and analyze it for locations, dates, organizations and names. We use the pre-trained Apache OpenNLP 1.5 models that are available for download. These do a decent job, and you can build new models as needed. I also changed the processor to output one attribute per entity type, each holding a String list of the locations, organizations, dates or names found.
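To make the "one attribute per type" output concrete, here is a minimal Python sketch of the idea. It is not the processor's actual Java code. The function name, the nlp_ prefix for types other than names, the comma delimiter without a trailing space, and the sample entities are all assumptions for illustration.

```python
from collections import defaultdict


def entities_to_attributes(entities, prefix="nlp_"):
    """Collapse (entity_type, text) pairs into one comma-delimited string per type.

    Hypothetical helper: the attribute names and delimiter are assumed,
    not taken from the processor source.
    """
    grouped = defaultdict(list)
    for entity_type, text in entities:
        # Keep first-seen order and skip exact duplicates within a type.
        if text not in grouped[entity_type]:
            grouped[entity_type].append(text)
    return {prefix + entity_type: ",".join(values)
            for entity_type, values in grouped.items()}


# Made-up entities standing in for what the name finders might return.
sample = [
    ("names", "Jane Doe"),
    ("locations", "Paris"),
    ("names", "Jane Doe"),
    ("organizations", "Example Corp"),
    ("locations", "New Jersey"),
]
attributes = entities_to_attributes(sample)
```

Each key in the resulting dictionary stands for one FlowFile attribute, so a downstream step only has to look up a single attribute per entity type.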
I put out a new release, built around Apache NiFi 1.6.0.
Source and NAR Download
https://github.com/tspannhw/nifi-nlp-processor/releases/tag/1.6
Download the Pre-trained Models for Your Language Here:
http://opennlp.sourceforge.net/models-1.5/
I chose English (en).
In a future release I may add Organization, Money, Time and Percentage to the lists we extract if there is interest.
A Final JSON File Produced
Example Output
The Main Flow For Trying Out The NLP Processor
Set Your Models
New NLP Processor Documentation
Here is the schema to use to process this data. Note that nlp_names is a String of comma-delimited values. You may want to parse it or do additional processing on these fields.
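Parsing the comma-delimited field downstream could look like the sketch below. Only nlp_names comes from the article. The nlp_locations field, the sample record and the helper name are assumptions. An entity that itself contains a comma would be split incorrectly, so treat this as a starting point.

```python
import json


def split_entities(value):
    """Split a comma-delimited attribute value into trimmed, non-empty strings."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


# Hypothetical JSON record shaped like the processor output described above.
record = json.loads('{"nlp_names": "Jane Doe, John Smith", "nlp_locations": ""}')

names = split_entities(record.get("nlp_names"))
locations = split_entities(record.get("nlp_locations"))
```

Missing or empty fields come back as an empty list, which keeps later processing free of None checks.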
High Level Flow
Example NiFi Flow
References:
- https://community.hortonworks.com/articles/76240/using-opennlp-for-identifying-names-from-text.html
- https://community.hortonworks.com/articles/163776/parsing-any-document-with-apache-nifi-15-with-apac...
- https://community.hortonworks.com/articles/178510/integration-apache-opennlp-184-into-apache-nifi-15...
- https://community.hortonworks.com/articles/76924/data-processing-pipeline-parsing-pdfs-and-identify....
- https://community.hortonworks.com/articles/80418/open-nlp-example-apache-nifi-processor.html
- https://community.hortonworks.com/articles/76935/using-sentiment-analysis-and-nlp-tools-with-hdp-25....
Created on 05-25-2018 07:48 PM
One thing we are missing is language detection. We may try Apache Tika or Apache OpenNLP for that.
Also, we should probably add attributes that let you specify exactly which models to use for Organization, Location, Name and Dates.