My first caveat is that, in my tests, the pre-trained model is missing a lot of names. If this is for a production workload, I would recommend training your own model on your own data, perhaps drawing on your corporate directory, client lists, Salesforce data, LinkedIn, and other social media. Include full names, first names, and any commonly used nicknames.
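If you do train your own model, the sketch below shows roughly what that looks like with the OpenNLP 1.7 training API. It is a minimal, illustrative example, not part of the original code: the training file name (person-names.train), its name-sample contents, the output file name, and the TrainPersonModel class name are all assumptions.

Code (train a custom model, illustrative sketch)

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainPersonModel {
    public static void main(String[] args) throws Exception {
        // Hypothetical training file in OpenNLP's name-sample format, one sentence per line,
        // e.g. "<START:person> Tim Spann <END> is going to the store ."
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("person-names.train")),
                StandardCharsets.UTF_8);
        ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

        // Train a person-name model with the default parameters; tune iterations/cutoff for real data.
        TokenNameFinderModel model = NameFinderME.train("en", "person", samples,
                TrainingParameters.defaultParams(), new TokenNameFinderFactory());

        // Write the model out so it can be loaded the same way as the pre-trained one.
        OutputStream out = new BufferedOutputStream(new FileOutputStream("en-ner-person-custom.bin"));
        try {
            model.serialize(out);
        } finally {
            out.close();
        }
    }
}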
The current version of Apache OpenNLP is 1.7.0, and the pre-trained 1.5.0 models still work with it. Pre-trained models are available for a number of human languages; I chose the English person-name model (http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin).
Walk Through:
- Create a TokenNameFinderModel from the pre-built person model.
- Tokenize the input sentence.
- Find the identified people.
- Convert the results to a JSON array.
You can easily plug this into a custom NiFi processor, a microservice, a command-line tool, or a routine in a larger Apache Storm or Apache Spark pipeline; a minimal command-line wrapper is sketched after the getPeople code below.
Code (JavaBean)
public class PersonName {

    private String name = "";

    // Constructor used by getPeople when converting detected spans into beans.
    public PersonName(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}
Code (getPeople)
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.google.gson.Gson;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;

public String getPeople(String sentence) {
    String outputJSON = "";

    // Load the pre-trained person-name model from the working directory.
    TokenNameFinderModel model = null;
    try {
        model = new TokenNameFinderModel(new File("en-ner-person.bin"));
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    NameFinderME finder = new NameFinderME(model);

    // Tokenize the sentence, then find the spans the model tags as people.
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = finder.find(tokens);

    // Convert each span back to its text and wrap it in a PersonName bean.
    List<PersonName> people = new ArrayList<PersonName>();
    String[] spanns = Span.spansToStrings(nameSpans, tokens);
    for (int i = 0; i < spanns.length; i++) {
        people.add(new PersonName(spanns[i]));
    }

    outputJSON = new Gson().toJson(people);

    // Clear adaptive data so earlier documents do not influence the next call.
    finder.clearAdaptiveData();

    return "{\"names\":" + outputJSON + "}";
}
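As an example of the command-line use mentioned above, here is a minimal wrapper around getPeople. PersonFinder is an assumed name for whatever class hosts the getPeople method; the wrapper itself is an illustrative sketch, not part of the original code.

Code (command-line wrapper, illustrative sketch)

// Assumes getPeople(String) above lives in a class named PersonFinder and that
// en-ner-person.bin is in the working directory.
public class PersonFinderCli {
    public static void main(String[] args) {
        // Treat all command-line arguments as one sentence to scan for names.
        String sentence = String.join(" ", args);
        System.out.println(new PersonFinder().getPeople(sentence));
    }
}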
I used Eclipse for building and testing; you can also build it from the command line with mvn package.
Maven
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.dataflowdeveloper</groupId>
  <artifactId>categorizer</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>categorizer</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-simple</artifactId>
      <version>1.7.7</version>
    </dependency>
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.7.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.8.0</version>
    </dependency>
  </dependencies>
</project>
Run
Input: Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.
Output: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}
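To check that end to end, a small JUnit 3 test (matching the junit 3.8.1 test dependency in the pom) could assert that both names come back. As in the command-line sketch, PersonFinder is an assumed host class for getPeople, and the test expects en-ner-person.bin in the working directory.

Code (JUnit test, illustrative sketch)

import junit.framework.TestCase;

public class PersonFinderTest extends TestCase {

    public void testFindsBothNames() {
        String json = new PersonFinder().getPeople(
                "Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.");
        // With the pre-trained English person model, both full names should be detected.
        assertTrue(json.contains("Tim Spann"));
        assertTrue(json.contains("Peter Smith"));
    }
}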
Reference:
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.namefind
https://www.packtpub.com/books/content/finding-people-and-things