Introduction
For NLP, I mostly want to do two things: Named Entity Recognition (NER) and Sentiment Analysis. These two features are very useful as part of real-time streaming processing of social, email, log and semistructured document data. I can use both of these in Twitter ingest via Apache NiFi or Apache Spark. Don't confuse text entity recognition with the image recognition that we looked at previously with TensorFlow. You can certainly add that to your flow as well, but it works with images, not text.
My debate with sentiment analysis is whether to return raw numbers, very general labels like Negative, Neutral or Positive, or something more detailed like Stanford CoreNLP, which has multiple grades of each.
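As a minimal sketch of that choice (the thresholds below are my own illustrative cutoffs, not from any particular library), the same numeric score can be bucketed either way:

def coarse_label(score):
    # Three general buckets: Negative / Neutral / Positive
    if score < 0.0:
        return 'Negative'
    elif score > 0.0:
        return 'Positive'
    return 'Neutral'

def fine_label(score):
    # Five buckets in the style of Stanford CoreNLP's sentiment classes
    if score <= -0.5:
        return 'Very Negative'
    elif score < 0.0:
        return 'Negative'
    elif score == 0.0:
        return 'Neutral'
    elif score < 0.5:
        return 'Positive'
    return 'Very Positive'

print(coarse_label(-0.9), fine_label(-0.9))  # Negative Very Negative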
There are a lot of libraries available for NLP and Sentiment Analysis. The first decision is whether you want to run JVM programs (good for Hadoop MapReduce, Apache Spark, Apache Storm, enterprise applications, Spring applications, microservices, NiFi processors, Hive UDFs and Pig UDFs, with support for multiple programming languages such as Java and Scala), or run on Python, which is already well known by many data scientists and engineers, is simple to prototype with (no compiling), is very easy to call from NiFi and scripts, and has a ton of great Deep Learning libraries and interfaces.
Python Libraries
Like most things in Python, you can use pip to install them. You will need a Python 2.7 or 3.x environment set up with pip to install and use the libraries I have looked at. spaCy requires NumPy, and so do many of the others.
spaCy
pip install -U spacy
python -m spacy.en.download all

Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 9.59MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 19.38MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
After you install the library, you need to download the text and models used by the tool; that is what the second command above does.
import spacy

nlp = spacy.load('en')
doc5 = nlp(u"Timothy Spann is studying at Princeton University in New Jersey.")

# Named Entity Recognizer (NER)
for ent in doc5.ents:
    print(ent, ent.label, ent.label_)
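On my test sentence, the recognizer should tag Timothy Spann as a PERSON, Princeton University as an ORG and New Jersey as a GPE (geopolitical entity); ent.label is the integer ID behind the string label in ent.label_.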
NLTK
import sys
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires the VADER lexicon: run nltk.download('vader_lexicon') once first
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])

if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')
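Assuming you save the script above as nltk_sentiment.py (my name for it, not anything standard), you can call it from the command line, or from a NiFi ExecuteStreamCommand processor, like this:

python nltk_sentiment.py "This is a great day for fishing."

which should print Positive.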
Another NLTK Option
import sys
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])

print('Compound {0} Negative {1} Neutral {2} Positive {3}'.format(
    ss['compound'], ss['neg'], ss['neu'], ss['pos']))
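This version prints the raw VADER scores instead of a label: compound is the normalized aggregate score in [-1, 1], while neg, neu and pos are the proportions of the text that fall into each category.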
NLTK does sentiment analysis very easily, as shown above. It runs fairly quickly, so you can call it in a stream without too much overhead.
TextBlob
from textblob import TextBlob

b = TextBlob("Spellin iz vaerry haerd to do. I do not like this spelling product at all it is terrible and I am very mad.")
print(b.correct())
print(b.sentiment)
print(b.sentiment.polarity)

python tb.py
Spelling in very heard to do. I do not like this spelling product at all it is terrible and I am very mad.
Sentiment(polarity=-0.90625, subjectivity=1.0)
-0.90625
TextBlob is a nice library that does Sentiment Analysis as well as spell checking and other useful text processing.
The install will look familiar.
sudo pip install -U textblob
sudo python -m textblob.download_corpora
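TextBlob's default sentiment comes from its pattern-based analyzer; you can also swap in an NLTK-backed Naive Bayes model trained on movie reviews. A minimal sketch, assuming the corpora above are already downloaded:

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# NaiveBayesAnalyzer returns a classification plus p_pos/p_neg
# probabilities instead of polarity/subjectivity
b = TextBlob("I do not like this spelling product at all.",
             analyzer=NaiveBayesAnalyzer())
print(b.sentiment)  # e.g. Sentiment(classification='neg', p_pos=..., p_neg=...)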
JVM
Natural Language Processing for JVM languages (NLP4J) is one option; I have not tried this one yet.
Apache OpenNLP
This one is very widely used and is an Apache project, which makes the licensing ideal for most users. I have a long example of this in my article on Apache OpenNLP:
Training Models Pre-built for Entity Recognition in Apache OpenNLP
StanfordNLP
I love Stanford CoreNLP; it works very well, integrates into a Twitter processing flow and is very accurate. The only issue for many is that it is GPL-licensed, which for many use cases will require purchasing a commercial license. It is very easy to use Stanford CoreNLP with Java, Scala and Spark.
import java.util.Properties

import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import org.apache.log4j.{Level, Logger}
import org.apache.phoenix.spark._
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import com.vader.SentimentAnalyzer

import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer

case class Tweet(coordinates: String, geo: String, handle: String, hashtags: String, language: String,
                 location: String, msg: String, time: String, tweet_id: String, unixtime: String,
                 user_name: String, tag: String, profile_image_url: String, source: String, place: String,
                 friends_count: String, followers_count: String, retweet_count: String, time_zone: String,
                 sentiment: String, stanfordSentiment: String)

val message = convert(anyMessage)
val pipeline = new StanfordCoreNLP(nlpProps)
val annotation = pipeline.process(message)

var sentiments: ListBuffer[Double] = ListBuffer()
var sizes: ListBuffer[Int] = ListBuffer()
var longest = 0
var mainSentiment = 0

// Score each sentence and remember the sentiment of the longest one
for (sentence <- annotation.get(classOf[CoreAnnotations.SentencesAnnotation])) {
  val tree = sentence.get(classOf[SentimentCoreAnnotations.AnnotatedTree])
  val sentiment = RNNCoreAnnotations.getPredictedClass(tree)
  val partText = sentence.toString
  if (partText.length() > longest) {
    mainSentiment = sentiment
    longest = partText.length()
  }
  sentiments += sentiment.toDouble
  sizes += partText.length
}

val averageSentiment: Double = {
  if (sentiments.nonEmpty) sentiments.sum / sentiments.size else -1
}

// Weight each sentence's sentiment by its length
val weightedSentiments = (sentiments, sizes).zipped.map((sentiment, size) => sentiment * size)
var weightedSentiment = weightedSentiments.sum / (sizes.fold(0)(_ + _))

if (sentiments.isEmpty) {
  mainSentiment = -1
  weightedSentiment = -1
}

weightedSentiment match {
  case s if s <= 0.0 => NOT_UNDERSTOOD
  case s if s < 1.0 => VERY_NEGATIVE
  case s if s < 2.0 => NEGATIVE
  case s if s < 3.0 => NEUTRAL
  case s if s < 4.0 => POSITIVE
  case s if s < 5.0 => VERY_POSITIVE
  case _ => NOT_UNDERSTOOD // catches s >= 5.0
}

trait SENTIMENT_TYPE
case object VERY_NEGATIVE extends SENTIMENT_TYPE
case object NEGATIVE extends SENTIMENT_TYPE
case object NEUTRAL extends SENTIMENT_TYPE
case object POSITIVE extends SENTIMENT_TYPE
case object VERY_POSITIVE extends SENTIMENT_TYPE
case object NOT_UNDERSTOOD extends SENTIMENT_TYPE
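Note the design choice in the snippet: weightedSentiment multiplies each sentence's predicted class (0 through 4, from very negative to very positive) by the sentence length, so longer sentences dominate the overall score, while mainSentiment simply takes the class of the single longest sentence.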
Summary
Do you have to use just one of these libraries? Of course not; I use different ones depending on my needs. Licensing, performance, accuracy on your dataset, programming language choice, enterprise environment, volume of data, your corpus, the human language involved and many other factors come into play. One size does not fit all. If you have sophisticated data scientists and strong machine learning pipelines, you may want to pick one and build up your own custom models and corpus.
This will work with Hortonworks HDP 2.3 - HDP 2.6 and HDF 1.0 - 3.x.
For Sentiment Analysis with NiFi processors, download and build these: