
This article extends my earlier articles, "Twitter Sentiment using Spark Core NLP in Apache Zeppelin" and "Connecting Solr to Spark - Apache Zeppelin Notebook".

The complete notebook is available on my GitHub site.

Step 1 - Follow the tutorials in the articles listed above, and establish an Apache Solr collection called "tweets".

Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark to a compatible version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4.

Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.
//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter
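The two comments above are the standard header of a Zeppelin dependency paragraph. A minimal sketch of such a paragraph follows, assuming the connector is published under the Lucidworks Maven coordinates `com.lucidworks.spark:spark-solr`; the version is the 3.4.4 mentioned in Step 2, and the coordinates should be checked against the connector site before use.

```scala
%dep
// Clear any previously loaded dependencies, then load the Solr-Spark connector.
// Assumption: Maven coordinates com.lucidworks.spark:spark-solr, version 3.4.4 from Step 2.
z.reset()
z.load("com.lucidworks.spark:spark-solr:3.4.4")
```

If the Spark interpreter has already started, restart it before running this paragraph; otherwise the dependency is not picked up.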

Step 4 - Download the Stanford CoreNLP libraries. This article uses the stanford-corenlp-full-2018-02-27 release (version 3.9.1).

Unzip the download and move it to the /tmp directory. Note: This can be done on the command line, or with the following Zeppelin paragraph.

wget /tmp/
unzip /tmp/

Step 5 - In Zeppelin's Interpreters configurations for Spark, include the following artifact: /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar


Step 6 - Include the following Spark dependencies for Stanford CoreNLP and Spark CoreNLP. Important note: This needs to be run before the Spark Context has been initialized.

//In the Spark Interpreter settings, add the following artifact
// /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar

Step 7 - Include the following Spark dependencies for JPMML-SparkML and JPMML-Model. Important note: This needs to be run before the Spark Context has been initialized.



Step 8 - Run the Solr query and return the results into a Spark DataFrame. Note: The ZooKeeper host might need to use fully qualified host names: "zkhost" -> ",,"

val options = Map(
  "collection" -> "Tweets", // must match the name of the Solr collection created in Step 1
  "zkhost" -> "localhost:2181/solr"
  // , "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
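When ZooKeeper runs as an ensemble, the "zkhost" value is a comma-separated list of host:port pairs followed by the chroot path. The sketch below builds such a string in plain Scala. The host names are hypothetical placeholders, and the "/solr" chroot and port 2181 are taken from the single-host example above.

```scala
// Sketch: build a ZooKeeper connection string for several hosts.
// The host names below are hypothetical; replace them with your own.
object ZkHostSketch {
  // Joins host:port pairs with commas and appends the chroot path once at the end.
  def zkHost(hosts: Seq[String], port: Int, chroot: String): String =
    hosts.map(h => s"$h:$port").mkString(",") + chroot

  val zk: String = zkHost(
    Seq("zk1.example.com", "zk2.example.com", "zk3.example.com"), 2181, "/solr")

  // Same option keys as the query above, with the multi-host connection string.
  val options: Map[String, String] = Map(
    "collection" -> "Tweets",
    "zkhost" -> zk
  )
}
```

The resulting `ZkHostSketch.options` map can be passed to `.options(...)` in the same way as the single-host map.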
Step 9 - Review results of the Solr query
%spark2
Step 10 - Filter the Tweets in the Spark DataFrame to ensure the Tweet text isn't null. Once the filter has been applied, add the sentiment value to the tweets.
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types._
import com.databricks.spark.corenlp.functions._ 
val df_TweetSentiment = df.filter("text_t is not null").select($"text_t", sentiment($"text_t").as('sentimentScore))

Step 11 - Validate results


Step 12 - Build the stages that produce the features fed into a Logistic Regression model for classification

Stage 1 - Regex Tokenizer separates the text into individual word "tokens"

Stage 2 - Count Vectorizer counts how many times each token occurs in the text corpus

Stage 3 - Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.

Stage 4 - Logistic Regression classifies each tweet to predict its sentiment score
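To make Stages 1 to 3 concrete, here is a plain-Scala sketch that runs the same ideas on a toy corpus without Spark. It is an illustration, not the Spark implementation: the object and function names are made up for this example, tokenization is assumed to be lowercase whitespace splitting, and the IDF uses the smoothed form log((m + 1) / (df + 1)) that the Spark ML documentation describes.

```scala
// Plain-Scala illustration of tokenizing, counting, and TF-IDF weighting.
// All names here are hypothetical and exist only for this sketch.
object TfIdfSketch {
  // Stage 1: lowercase the text and split on whitespace, dropping empty tokens.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Stage 2: raw term counts for one tokenized document.
  def termCounts(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

  // Stage 3a: smoothed inverse document frequency, log((m + 1) / (df + 1)),
  // where m is the number of documents and df the number of documents containing the term.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val m = docs.size.toDouble
    val docFreq = docs.flatMap(_.distinct).groupBy(identity)
      .map { case (term, hits) => term -> hits.size }
    docFreq.map { case (term, df) => term -> math.log((m + 1.0) / (df + 1.0)) }
  }

  // Stage 3b: TF-IDF weight = term count in the document * IDF of the term.
  def tfIdf(tokens: Seq[String], idfByTerm: Map[String, Double]): Map[String, Double] =
    termCounts(tokens).map { case (term, tf) => term -> tf * idfByTerm.getOrElse(term, 0.0) }
}
```

With this weighting, a term that appears in every document gets an IDF of log(1) = 0, so very common words contribute nothing to the feature vector, while rarer terms are weighted up.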

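import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, RegexTokenizer, CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.classification.LogisticRegression
// The intermediate column names "words" and "rawFeatures" are assumed here
val tokenizer = new RegexTokenizer().setInputCol("text_t").setOutputCol("words")
val wordsData = tokenizer.transform(df_TweetSentiment)
// CountVectorizer is an estimator, so it must be fit before it can transform
val cvModel = new CountVectorizer().setInputCol("words").setOutputCol("rawFeatures").fit(wordsData)
val featurizedData = cvModel.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("sentimentScore", "features").show()
val lr = new LogisticRegression().setLabelCol("sentimentScore").setFeaturesCol("features")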

Step 13 - Build Spark Pipeline from Stages

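import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, cvModel, idfModel, lr))
val PipeLineModel = pipeline.fit(df_TweetSentiment)
//Save Pipeline to Disk (Optional)
// e.g. PipeLineModel.write.overwrite().save("/tmp/tweet-pipeline-model") // path is illustrative only
val schema = df_TweetSentiment.schema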

Step 14 - Export Spark Pipeline to PMML using JPMML-SparkML

import java.io.File
import org.jpmml.sparkml.PMMLBuilder
val pmml = new PMMLBuilder(schema, PipeLineModel)
val file = pmml.buildFile(new File("/tmp/TweetPipeline.pmml"))

Hi Ian! Thanks for posting this article.

At the "pmml.buildFile" I'm getting the following error: java.lang.IllegalArgumentException: iperbole_

Any ideas? Thank you very much!

Last update: 08-17-2019 06:52 AM