Introduction

For NLP, I mostly want to do two things:

  1. Entity Recognition (people, facility, organizations, locations, products, events, art, language, groups, dates, time, percent, money, quantity, ordinal and cardinal)
  2. Sentiment Analysis

These two features are very useful as part of real-time stream processing of social, email, log, and semistructured document data. I can use both of them in a Twitter ingest flow via Apache NiFi or Apache Spark. Don't confuse text entity recognition with the image recognition we looked at previously with TensorFlow. You can certainly add that to your flow as well, but it works with images, not text.

My debate with sentiment analysis is how granular to be: do you give raw numbers, or general terms like Neutral, Negative, and Positive? Or do you get more detailed, like Stanford CoreNLP, which has several gradations of each?
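
As a rough sketch of that trade-off, here is what the two granularities might look like for a single score in [-1.0, 1.0]. The thresholds below are illustrative assumptions of mine, not taken from any particular library:

# Illustrative sketch: mapping one sentiment score in [-1.0, 1.0]
# to coarse vs. fine-grained labels. Thresholds are made up for the example.

def coarse_label(score):
    # Three general buckets, like the NLTK example further below
    if score > 0.05:
        return 'Positive'
    elif score < -0.05:
        return 'Negative'
    return 'Neutral'

def fine_label(score):
    # Five buckets, closer to Stanford CoreNLP's gradations
    if score <= -0.6:
        return 'Very Negative'
    elif score <= -0.05:
        return 'Negative'
    elif score < 0.05:
        return 'Neutral'
    elif score < 0.6:
        return 'Positive'
    return 'Very Positive'

print(coarse_label(0.3), fine_label(0.3))   # Positive Positive
print(coarse_label(0.9), fine_label(0.9))   # Positive Very Positive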

There are a lot of libraries available for NLP and sentiment analysis. The first decision is whether you want to run JVM programs (good for Hadoop MapReduce, Apache Spark, Apache Storm, enterprise applications, Spring applications, microservices, NiFi processors, Hive UDFs, and Pig UDFs, with support for multiple programming languages such as Java and Scala), or run on Python, which is already well known by many data scientists and engineers, is simple to prototype with (no compiling), is very easy to call from NiFi and scripts, and has a ton of great deep learning libraries and interfaces.

Python Libraries

Like most things in Python, you can use pip to install them. You will need a Python 2.7 or 3.x environment with pip set up to install and use the libraries I have looked at. spaCy requires NumPy, as do many of the others.

spaCy

pip install -U spacy
python -m spacy.en.download all

Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 9.59MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 19.38MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data

After you install the package, you need to download the text and models used by the tool, as shown above.

import spacy

# Load the English model and data downloaded above
nlp = spacy.load('en')
doc5 = nlp(u"Timothy Spann is studying at Princeton University in New Jersey.")
# Named Entity Recognizer (NER): print each entity with its label
for ent in doc5.ents:
    print ent, ent.label, ent.label_
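
Since I mostly want to call this from a stream (NiFi or Spark), it helps to wrap the lookup in a small function that returns the entities as JSON. This is just a sketch on top of the same spaCy calls shown above; the extract_entities name and the output shape are my own choices:

import json
import spacy

nlp = spacy.load('en')  # load the model once and reuse it for every message

def extract_entities(text):
    # Run NER and return the entities as a JSON string for the next processor
    doc = nlp(text)
    return json.dumps([{'text': ent.text, 'label': ent.label_} for ent in doc.ents])

print(extract_entities(u"Timothy Spann is studying at Princeton University in New Jersey."))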

NLTK

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

# Requires the VADER lexicon: python -m nltk.downloader vader_lexicon
sid = SentimentIntensityAnalyzer()
# Score the text passed as the first command-line argument
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')



Another NLTK Option

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
# Print all four VADER scores for the input text
print('Compound {0} Negative {1} Neutral {2} Positive {3}'.format(ss['compound'], ss['neg'], ss['neu'], ss['pos']))

NLTK does sentiment analysis very easily, as shown above. It runs fairly quickly, so you can call it in a stream without too much overhead.
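
One note for streaming use: most of the cost is constructing the analyzer (it loads the VADER lexicon), so build it once and reuse it for every message. A minimal sketch, using the same three buckets as the first script:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Build the analyzer once; construction loads the VADER lexicon
sid = SentimentIntensityAnalyzer()

def label(message):
    # Same Neutral/Negative/Positive buckets as the first script
    compound = sid.polarity_scores(message)['compound']
    if compound == 0.00:
        return 'Neutral'
    elif compound < 0.00:
        return 'Negative'
    return 'Positive'

for msg in ['I love this.', 'I hate this.', 'This is a pen.']:
    print('{0} -> {1}'.format(msg, label(msg)))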

TextBlob

from textblob import TextBlob

b = TextBlob("Spellin iz vaerry haerd to do. I do not like this spelling product at all it is terrible and I am very mad.")
print(b.correct())           # attempt spelling correction
print(b.sentiment)           # (polarity, subjectivity) pair
print(b.sentiment.polarity)  # polarity alone, in [-1.0, 1.0]

python tb.py
Spelling in very heard to do. I do not like this spelling product at all it is terrible and I am very mad.
Sentiment(polarity=-0.90625, subjectivity=1.0)
-0.90625

TextBlob is a nice library that does Sentiment Analysis as well as spell checking and other useful text processing.

The install will look familiar.

sudo pip install -U textblob
sudo python -m textblob.download_corpora
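
Beyond sentiment and spelling, the same TextBlob object exposes tokens, part-of-speech tags, and noun phrases, which can stand in for lightweight entity-style extraction. A quick sketch (it needs the corpora downloaded above):

from textblob import TextBlob

b = TextBlob("Timothy Spann is studying at Princeton University in New Jersey.")
print(b.words)         # tokenized words
print(b.tags)          # (word, part-of-speech tag) pairs
print(b.noun_phrases)  # lower-cased noun phrases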

JVM

Natural Language Processing for JVM languages (NLP4J) is one option; I have not tried it yet.

Apache OpenNLP

This one is very widely used, and it is an Apache project, which makes the licensing ideal for most users. I have a long example in this article on Apache OpenNLP:

Training Models Pre-built for Entity Recognition in Apache OpenNLP

StanfordNLP

I love Stanford CoreNLP: it works very well, integrates into a Twitter processing flow, and is very accurate. The only issue for many is that it is GPL-licensed, which for many use cases will require purchasing a license. It is very easy to use Stanford CoreNLP from Java, Scala, and Spark.

import java.util.Properties

import org.apache.phoenix.spark._
import com.vader.SentimentAnalyzer
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql._
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

case class Tweet(coordinates: String, geo: String, handle: String, hashtags: String, language: String,
                 location: String, msg: String, time: String, tweet_id: String, unixtime: String, user_name: String, tag: String, profile_image_url: String,
                 source: String, place: String, friends_count: String, followers_count: String, retweet_count: String,
                 time_zone: String, sentiment: String, stanfordSentiment: String)

// convert() and nlpProps are defined elsewhere in the full flow: convert()
// cleans the raw message, and nlpProps configures the CoreNLP annotators
// (tokenize, ssplit, parse, sentiment).
def detectSentiment(anyMessage: String): SENTIMENT_TYPE = {
  val message = convert(anyMessage)
  val pipeline = new StanfordCoreNLP(nlpProps)
  val annotation = pipeline.process(message)
  var sentiments: ListBuffer[Double] = ListBuffer()
  var sizes: ListBuffer[Int] = ListBuffer()

  var longest = 0
  var mainSentiment = 0

  // Score each sentence; track the sentiment of the longest sentence
  for (sentence <- annotation.get(classOf[CoreAnnotations.SentencesAnnotation])) {
    val tree = sentence.get(classOf[SentimentCoreAnnotations.AnnotatedTree])
    val sentiment = RNNCoreAnnotations.getPredictedClass(tree)
    val partText = sentence.toString

    if (partText.length() > longest) {
      mainSentiment = sentiment
      longest = partText.length()
    }

    sentiments += sentiment.toDouble
    sizes += partText.length
  }

  val averageSentiment: Double = {
    if (sentiments.nonEmpty) sentiments.sum / sentiments.size
    else -1
  }

  // Weight each sentence's sentiment by its length
  val weightedSentiments = (sentiments, sizes).zipped.map((sentiment, size) => sentiment * size)
  var weightedSentiment = weightedSentiments.sum / (sizes.fold(0)(_ + _))

  if (sentiments.isEmpty) {
    mainSentiment = -1
    weightedSentiment = -1
  }

  // Map the weighted score onto labeled buckets (CoreNLP classes run 0-4)
  weightedSentiment match {
    case s if s <= 0.0 => NOT_UNDERSTOOD
    case s if s < 1.0 => VERY_NEGATIVE
    case s if s < 2.0 => NEGATIVE
    case s if s < 3.0 => NEUTRAL
    case s if s < 4.0 => POSITIVE
    case s if s < 5.0 => VERY_POSITIVE
    case s => NOT_UNDERSTOOD
  }
}

trait SENTIMENT_TYPE
case object VERY_NEGATIVE extends SENTIMENT_TYPE
case object NEGATIVE extends SENTIMENT_TYPE
case object NEUTRAL extends SENTIMENT_TYPE
case object POSITIVE extends SENTIMENT_TYPE
case object VERY_POSITIVE extends SENTIMENT_TYPE
case object NOT_UNDERSTOOD extends SENTIMENT_TYPE

Summary

Do you have to use just one of these libraries? Of course not; I use different ones depending on my needs. Licensing, performance, accuracy on your dataset, programming language choice, enterprise environment, volume of data, your corpus, the human language involved, and many other factors come into play. One size does not fit all. If you have sophisticated data scientists and strong machine learning pipelines, you may want to pick one and build up your own custom models and corpus.

This will work with Hortonworks HDP 2.3 - HDP 2.6 and HDF 1.0 - 3.x.

Comments

For Sentiment Analysis with NiFi processors, download and build these:

https://github.com/tspannhw/nifi-corenlp-processor

https://github.com/tspannhw/nifi-nlp-processor