Updating The Apache OpenNLP Community Apache NiFi Processor to Support Flow Files

This new release adds the ability to read content from the FlowFile and analyze it for locations, dates, organizations and names. It uses the Apache OpenNLP 1.5 pre-trained models that are available for download. These do a decent job, and you can build new models as needed. The processor now also outputs one attribute per entity type, each holding a String list of the locations, organizations, dates or names found.

I put out a new release, built around Apache NiFi 1.6.0.

Source and NAR Download

https://github.com/tspannhw/nifi-nlp-processor/releases/tag/1.6

Download the Pre-trained Models for Your Language Here:

http://opennlp.sourceforge.net/models-1.5/

I chose English (en).

In a future release I may add Organization, Money, Time and Percentage to the lists we extract, if there is interest.

A Final JSON File Produced

76399-nlpparseddata.png

Example Output

76400-nlpfieldsextracted.png

The Main Flow For Trying Out The NLP Processor

76401-nlpflowoverview.png

Set Your Models

76402-nlpmodels.png

New NLP Processor Documentation

76403-nlpdocs.png

Here is the schema to use to process this data. Note that nlp_names is a String of comma-delimited values. You may want to parse this or do additional processing on these fields.

76404-nlpschema.png
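
If you want to parse the comma-delimited fields downstream, a minimal sketch in Python might look like the following. Only nlp_names is named in this article; the other field names (nlp_locations, nlp_organizations, nlp_dates) and both helper functions are assumptions for illustration, so adjust them to match the schema you actually use.

```python
from typing import Dict, List, Optional

# Hypothetical field names: only nlp_names appears in the article. The other
# three are assumed to follow the same naming pattern.
NLP_FIELDS = ("nlp_names", "nlp_locations", "nlp_organizations", "nlp_dates")


def split_entities(value: Optional[str]) -> List[str]:
    """Split a comma-delimited string into trimmed, de-duplicated entries.

    Order of first appearance is preserved and empty entries are dropped.
    """
    seen = set()
    entities = []
    for part in (value or "").split(","):
        item = part.strip()
        if item and item not in seen:
            seen.add(item)
            entities.append(item)
    return entities


def parse_nlp_fields(record: Dict[str, str]) -> Dict[str, List[str]]:
    """Return a list of entities for each NLP field; missing fields give []."""
    return {field: split_entities(record.get(field, "")) for field in NLP_FIELDS}


if __name__ == "__main__":
    # Made-up sample record, not real processor output.
    sample = {"nlp_names": "Tim Spann, Jane Doe, Tim Spann", "nlp_locations": "Princeton"}
    print(parse_nlp_fields(sample))
```

One caveat with a plain split: any entity that itself contains a comma (some date formats, for example) will be broken into pieces, so you may need a different delimiter or extra handling for those fields.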

High Level Flow

76405-nlpserverflowhighlevel.png

Example NiFi Flow

nlpupdates2018.xml

Comments

One thing we are missing is language detection; we may try Apache Tika or Apache OpenNLP for that.

Also, we should probably add attributes that let you specify exactly which models to use for organizations, locations, names and dates.

Last update: 08-17-2019 07:21 AM