
My first caveat is that, in my tests, the pre-trained models miss a lot of names. For a production workload, I recommend training your own models on your own data, for example your corporate directory, client list, Salesforce data, LinkedIn and social media. Include full names, first names and any commonly used nicknames.
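If you do train your own model, the OpenNLP name finder trainer reads one tokenized sentence per line, with each name wrapped in <START:person> ... <END> markers. The following is a minimal sketch of how sentences could be annotated from a list of known names. The class TrainingLineBuilder and its annotate method are hypothetical helpers, not part of OpenNLP, and the sketch assumes the input sentence is already whitespace-tokenized (punctuation separated by spaces) and that names match exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of OpenNLP): marks known names in a
// whitespace-tokenized sentence using the name finder training markup.
public class TrainingLineBuilder {

    public static String annotate(String sentence, List<String> knownNames) {
        String[] tokens = sentence.trim().split("\\s+");

        // Split each name into tokens and try longer names first,
        // so that "Tim Spann" wins over "Tim".
        List<String[]> names = new ArrayList<>();
        for (String name : knownNames) {
            names.add(name.trim().split("\\s+"));
        }
        names.sort((a, b) -> b.length - a.length);

        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.length) {
            String[] match = null;
            for (String[] name : names) {
                if (i + name.length <= tokens.length
                        && Arrays.equals(Arrays.copyOfRange(tokens, i, i + name.length), name)) {
                    match = name;
                    break;
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (match != null) {
                // Wrap the matched name tokens in the training markers.
                out.append("<START:person> ").append(String.join(" ", match)).append(" <END>");
                i += match.length;
            } else {
                out.append(tokens[i]);
                i++;
            }
        }
        return out.toString();
    }
}
```

For example, annotate("Tim Spann is going to the store .", Arrays.asList("Tim", "Tim Spann")) returns "<START:person> Tim Spann <END> is going to the store .". Exact matching like this will not catch nicknames or misspellings unless they are in the list, so treat it as a starting point for building training data rather than a complete pipeline.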

The current version of Apache OpenNLP is 1.7.0, and the pre-trained 1.5.0 models still work with it. Pre-trained models are available for several human languages. I chose English (http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin).

Walk Through:

  1. Create TokenNameFinderModel from pre-built person model.
  2. Tokenize the input sentence.
  3. Find the identified people.
  4. Convert to JSON array.

You can easily plug this into a custom NiFi processor, microservice, command line tool or routine in a larger Apache Storm or Apache Spark pipeline.

Code (JavaBean)

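public class PersonName {
	private String name = "";
	// No-argument constructor keeps the class a valid JavaBean.
	public PersonName() {
	}
	// Convenience constructor used by getPeople.
	public PersonName(String name) {
		this.name = name;
	}
	public String getName() {
		return name;
	}
	public void setName(String name) {
		this.name = name;
	}
}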

Code (getPeople)

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

public String getPeople(String sentence) {
	TokenNameFinderModel model;
	try {
		// 1. Load the pre-built person model from the working directory.
		model = new TokenNameFinderModel(new File("en-ner-person.bin"));
	} catch (IOException e) {
		// InvalidFormatException extends IOException, so this covers both.
		// Without a model there is nothing to find; return an empty result.
		e.printStackTrace();
		return "{\"names\":[]}";
	}
	NameFinderME finder = new NameFinderME(model);

	// 2. Tokenize the input sentence.
	Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
	String[] tokens = tokenizer.tokenize(sentence);

	// 3. Find the identified people and map the token spans back to text.
	Span[] nameSpans = finder.find(tokens);
	String[] names = Span.spansToStrings(nameSpans, tokens);
	List<PersonName> people = new ArrayList<PersonName>();
	for (String name : names) {
		people.add(new PersonName(name));
	}
	// Reset document-level adaptive data before the next call.
	finder.clearAdaptiveData();

	// 4. Convert to a JSON array wrapped in a "names" object.
	return "{\"names\":" + new Gson().toJson(people) + "}";
}
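The name finder does not return strings directly. It returns Span objects that hold token indexes, with the start index inclusive and the end index exclusive, and Span.spansToStrings joins the covered tokens with spaces. The following sketch imitates that mapping with plain Java so the behavior is easy to see without the OpenNLP dependency. The class SpanText and the int[][] representation of spans are stand-ins chosen for illustration, not OpenNLP types.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in (not OpenNLP code) for mapping token spans to text.
public class SpanText {

    // Each span is {start, end} in token indexes: start inclusive, end exclusive.
    public static List<String> spansToStrings(int[][] spans, String[] tokens) {
        List<String> result = new ArrayList<>();
        for (int[] span : spans) {
            StringBuilder sb = new StringBuilder();
            for (int i = span[0]; i < span[1]; i++) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(tokens[i]);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

With the tokens "Tim", "Spann", "is", "going", "to", "the", "store", "." and a single span {0, 2}, the method returns a list containing "Tim Spann". This is also why a multi-word name comes back as one entry rather than one entry per token.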

I used Eclipse for building and testing. You can also build it from the command line with mvn package.

Maven

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.dataflowdeveloper</groupId>
  <artifactId>categorizer</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>categorizer</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-simple</artifactId>
      <version>1.7.7</version>
    </dependency>
    <dependency>
      <groupId>org.apache.opennlp</groupId>
      <artifactId>opennlp-tools</artifactId>
      <version>1.7.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.8.0</version>
    </dependency>
  </dependencies>
</project>

Run

Input:  Tim Spann is going to the store.   Peter Smith is using Hortonworks Hive.

Output: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}

Reference:

http://opennlp.apache.org/

http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.namefind

https://www.packtpub.com/books/content/finding-people-and-things

http://opennlp.sourceforge.net/models-1.5/

Last update: 01-04-2017 05:01 PM