My first caveat is that, in my tests, the pre-trained model is missing a lot of names. If this is for a production workload, I would recommend training your own model on your own data, perhaps drawing on your corporate directory, client lists, Salesforce data, LinkedIn, and other social media. Include full names, first names, and any commonly used nicknames.
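If you do train your own model, the sketch below shows roughly what that looks like with the OpenNLP 1.7 training API. It is a minimal, illustrative example, not part of the original code: the training file name (person-names.train), its name-sample contents, the output file name, and the TrainPersonModel class name are all assumptions.

Code (train a custom model, illustrative sketch)

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainPersonModel {
    public static void main(String[] args) throws Exception {
        // Hypothetical training file in OpenNLP's name-sample format, one sentence per line,
        // e.g. "<START:person> Tim Spann <END> is going to the store ."
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("person-names.train")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Train a person-name model with the default parameters; tune iterations/cutoff for real data.
        TokenNameFinderModel model = NameFinderME.train("en", "person", samples,
                TrainingParameters.defaultParams(), new TokenNameFinderFactory());

        // Write the model out so it can be loaded the same way as the pre-trained one.
        OutputStream out = new BufferedOutputStream(new FileOutputStream("en-ner-person-custom.bin"));
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}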
The current version of Apache OpenNLP is 1.7.0, and the pre-trained 1.5.0 models still work with it. Pre-trained models are available for a number of human languages; I chose the English person-name model (http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin).
Walk Through:
- Create a TokenNameFinderModel from the pre-built person model.
- Tokenize the input sentence.
- Find the identified people.
- Convert the results to a JSON array.
You can easily plug this into a custom NiFi processor, a microservice, a command-line tool, or a routine in a larger Apache Storm or Apache Spark pipeline; a minimal command-line wrapper is sketched after the getPeople code below.
Code (JavaBean)
public class PersonName {

    private String name = "";

    // Constructor used by getPeople when converting detected spans into beans.
    public PersonName(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}
Code (getPeople)
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.google.gson.Gson;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

public String getPeople(String sentence) {
    String outputJSON = "";

    // Load the pre-trained person-name model from the working directory.
    TokenNameFinderModel model = null;
    try {
        model = new TokenNameFinderModel(new File("en-ner-person.bin"));
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    NameFinderME finder = new NameFinderME(model);

    // Tokenize the sentence, then find the spans the model tags as people.
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = finder.find(tokens);

    // Convert each span back to its text and wrap it in a PersonName bean.
    List<PersonName> people = new ArrayList<PersonName>();
    String[] spanns = Span.spansToStrings(nameSpans, tokens);
    for (int i = 0; i < spanns.length; i++) {
        people.add(new PersonName(spanns[i]));
    }

    outputJSON = new Gson().toJson(people);

    // Clear adaptive data so earlier documents do not influence the next call.
    finder.clearAdaptiveData();

    return "{\"names\":" + outputJSON + "}";
}
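As an example of the command-line use mentioned above, here is a minimal wrapper around getPeople. PersonFinder is an assumed name for whatever class hosts the getPeople method; the wrapper itself is an illustrative sketch, not part of the original code.

Code (command-line wrapper, illustrative sketch)

// Assumes getPeople(String) above lives in a class named PersonFinder and that
// en-ner-person.bin is in the working directory.
public class PersonFinderCli {
    public static void main(String[] args) {
        // Treat all command-line arguments as one sentence to scan for names.
        String sentence = String.join(" ", args);
        System.out.println(new PersonFinder().getPeople(sentence));
    }
}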
I used Eclipse for building and testing; you can also build it from the command line with mvn package.
Maven
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.dataflowdeveloper</groupId>
  <artifactId>categorizer</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>categorizer</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-simple</artifactId>
      <version>1.7.7</version>
    </dependency>
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.7.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.8.0</version>
    </dependency>
  </dependencies>
</project>
Run
Input: Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.
Output: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}
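To check that end to end, a small JUnit 3 test (matching the junit 3.8.1 test dependency in the pom) could assert that both names come back. As in the command-line sketch, PersonFinder is an assumed host class for getPeople, and the test expects en-ner-person.bin in the working directory.

Code (JUnit test, illustrative sketch)

import junit.framework.TestCase;

public class PersonFinderTest extends TestCase {

    public void testFindsBothNames() {
        String json = new PersonFinder().getPeople(
                "Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.");
        // With the pre-trained English person model, both full names should be detected.
        assertTrue(json.contains("Tim Spann"));
        assertTrue(json.contains("Peter Smith"));
    }
}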
Reference:
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.namefind
https://www.packtpub.com/books/content/finding-people-and-things