Created on 01-04-2017 05:01 PM - edited 09-16-2022 01:38 AM
My first caveat: in my tests, the pre-trained model misses a lot of names. If this is for a production workload, I recommend training your own model on your own data. Consider using your corporate directory, client lists, Salesforce data, LinkedIn, and social media. Include full names, first names, and any commonly used nicknames.
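For reference, OpenNLP's name finder trains on plain text with one sentence per line and each name wrapped in `<START:person>` / `<END>` tags, for example:

```
<START:person> Tim Spann <END> is going to the store .
<START:person> Peter Smith <END> is using Hortonworks Hive .
```

The documentation suggests on the order of 15,000 annotated sentences for a usable model, which is why harvesting names from your own systems matters.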
The current version is 1.7.0, and the pre-trained 1.5.0 models still work with it. Pre-trained models are available for several human languages; I chose English (http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin).
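If you want the application to pull the model file down itself rather than shipping it, a minimal sketch using only the JDK (the URL is the one above; the local filename matches what the code below expects; `ModelFetcher` and `fetchModel` are hypothetical names for this example):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ModelFetcher {
    // Pre-trained 1.5.0 person-name model for English
    public static final String MODEL_URL =
            "http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin";

    // Download the model if it is not already on disk; no-op otherwise
    public static Path fetchModel(Path target) throws IOException {
        if (Files.notExists(target)) {
            try (InputStream in = URI.create(MODEL_URL).toURL().openStream()) {
                Files.copy(in, target);
            }
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchModel(Paths.get("en-ner-person.bin")));
    }
}
```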
Walk Through:
You can easily plug this into a custom NiFi processor, a microservice, a command-line tool, or a routine in a larger Apache Storm or Apache Spark pipeline.
Code (JavaBean)
public class PersonName {
	private String name = "";

	public PersonName() {
	}

	public PersonName(String name) {
		this.name = name;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}
}
Code (getPeople)
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
public String getPeople(String sentence) {
	TokenNameFinderModel model = null;
	try {
		// In production, load the model once and reuse it; construction is expensive
		model = new TokenNameFinderModel(new File("en-ner-person.bin"));
	} catch (IOException e) {
		// Covers InvalidFormatException too; return an empty result
		// instead of hitting a NullPointerException below
		e.printStackTrace();
		return "{\"names\":[]}";
	}
	NameFinderME finder = new NameFinderME(model);
	Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
	String[] tokens = tokenizer.tokenize(sentence);
	Span[] nameSpans = finder.find(tokens);
	List<PersonName> people = new ArrayList<>();
	for (String span : Span.spansToStrings(nameSpans, tokens)) {
		people.add(new PersonName(span));
	}
	finder.clearAdaptiveData();
	String outputJSON = new Gson().toJson(people);
	return "{\"names\":" + outputJSON + "}";
}
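Since Gson serializes the list to a bare array, the method wraps it in a top-level object by hand. The envelope consumers receive can be sketched in plain Java (`NamesJson` and `toNamesJson` are hypothetical names for illustration, and this sketch assumes names contain no characters that need JSON escaping):

```java
import java.util.Arrays;
import java.util.List;

public class NamesJson {
    // Reproduce the {"names":[{"name":"..."}]} envelope that getPeople returns
    public static String toNamesJson(List<String> names) {
        StringBuilder sb = new StringBuilder("{\"names\":[");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"name\":\"").append(names.get(i)).append("\"}");
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toNamesJson(Arrays.asList("Tim Spann", "Peter Smith")));
    }
}
```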
I used Eclipse for building and testing; you can build the project with mvn package.
Maven
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dataflowdeveloper</groupId>
<artifactId>categorizer</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<name>categorizer</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
</dependencies>
</project>
Run
Input: Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.
Output: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}
Reference:
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.namefind
https://www.packtpub.com/books/content/finding-people-and-things