Created on 01-04-2017 05:01 PM - edited 09-16-2022 01:38 AM
My first caveat: in my tests, the pre-trained model misses a lot of names. If this is for a production workload, I recommend training your own model on your own data. Consider using your corporate directory, client lists, Salesforce data, LinkedIn, and social media. Include full names, first names, and any commonly used nicknames.
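For reference, OpenNLP's name finder trains on plain text with one sentence per line and each name wrapped in `<START:person>` / `<END>` tags, for example:

```
<START:person> Tim Spann <END> is going to the store .
<START:person> Peter Smith <END> is using Hortonworks Hive .
```

The documentation suggests on the order of 15,000 annotated sentences for a usable model, which is why harvesting names from your own systems matters.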
The current version is 1.7.0, and the pre-trained 1.5.0 models still work with it. Pre-trained models are available for several human languages; I chose English (http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin).
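If you want the application to pull the model file down itself rather than shipping it, a minimal sketch using only the JDK (the URL is the one above; the local filename matches what the code below expects; `ModelFetcher` and `fetchModel` are hypothetical names for this example):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModelFetcher {
    // Pre-trained 1.5.0 person-name model for English
    public static final String MODEL_URL =
            "http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin";

    // Download the model if it is not already on disk; no-op otherwise
    public static Path fetchModel(Path target) throws IOException {
        if (Files.notExists(target)) {
            try (InputStream in = URI.create(MODEL_URL).toURL().openStream()) {
                Files.copy(in, target);
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchModel(Paths.get("en-ner-person.bin")));
    }
}
```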
Walk Through:
You can easily plug this into a custom NiFi processor, a microservice, a command-line tool, or a routine in a larger Apache Storm or Apache Spark pipeline.
Code (JavaBean)
public class PersonName {
	private String name = "";

	public PersonName() {
	}

	public PersonName(String name) {
		this.name = name;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}
}
Code (getPeople)
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
public String getPeople(String sentence) {
	TokenNameFinderModel model = null;
	try {
		// In production, load the model once and reuse it; construction is expensive
		model = new TokenNameFinderModel(new File("en-ner-person.bin"));
	} catch (IOException e) {
		// Covers InvalidFormatException too; return an empty result
		// instead of hitting a NullPointerException below
		e.printStackTrace();
		return "{\"names\":[]}";
	}
	NameFinderME finder = new NameFinderME(model);
	Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
	String[] tokens = tokenizer.tokenize(sentence);
	Span[] nameSpans = finder.find(tokens);
	List<PersonName> people = new ArrayList<>();
	for (String span : Span.spansToStrings(nameSpans, tokens)) {
		people.add(new PersonName(span));
	}
	finder.clearAdaptiveData();
	String outputJSON = new Gson().toJson(people);
	return "{\"names\":" + outputJSON + "}";
}
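Since Gson serializes the list to a bare array, the method wraps it in a top-level object by hand. The envelope consumers receive can be sketched in plain Java (`NamesJson` and `toNamesJson` are hypothetical names for illustration, and this sketch assumes names contain no characters that need JSON escaping):

```java
import java.util.Arrays;
import java.util.List;

public class NamesJson {
    // Reproduce the {"names":[{"name":"..."}]} envelope that getPeople returns
    public static String toNamesJson(List<String> names) {
        StringBuilder sb = new StringBuilder("{\"names\":[");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"name\":\"").append(names.get(i)).append("\"}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toNamesJson(Arrays.asList("Tim Spann", "Peter Smith")));
    }
}
```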
I used Eclipse for building and testing; you can build the project with mvn package.
Maven
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dataflowdeveloper</groupId>
<artifactId>categorizer</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<name>categorizer</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
</dependencies>
</project>
Run
Input: Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.
Output: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}
Reference:
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.namefind
https://www.packtpub.com/books/content/finding-people-and-things