
This article extends my earlier articles, Twitter Sentiment using Spark Core NLP in Apache Zeppelin and Connecting Solr to Spark - Apache Zeppelin Notebook.

The complete notebook is available on my GitHub site.

Step 1 - Follow the tutorials in the articles above and establish an Apache Solr collection called "tweets".

Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark to the version of the Solr-Spark connector. In the example below, the Spark version is 2.2.0 and the connector version is 3.4.4.

Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.
//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter
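As a sketch, a %dep paragraph loading the Lucidworks spark-solr connector might look like the following (3.4.4 is the connector version noted in Step 2; adjust it to match your Spark version):

```scala
%dep
// Reset and load the Lucidworks Solr-Spark connector before the %spark2 interpreter starts
z.reset()
z.load("com.lucidworks.spark:spark-solr:3.4.4")
```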

Step 4 - Download the Stanford CoreNLP libraries from the Stanford CoreNLP download page.

Unzip the download and move it to the /tmp directory. Note: This can be accomplished on the command line, or the following Zeppelin paragraph will work as well (substitute the zip URL from the Stanford CoreNLP download page):

%sh
wget <Stanford CoreNLP zip URL> -P /tmp/
unzip /tmp/stanford-corenlp-full-2018-02-27.zip -d /tmp/

Step 5 - In Zeppelin's Interpreter configuration for Spark, include the following artifact: /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar


Step 6 - Include the following Spark dependencies for Stanford CoreNLP and Spark CoreNLP. Important note: This needs to be run before the Spark Context has been initialized

//In the Spark Interpreter settings, add the following artifact:
// /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar
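A sketch of the corresponding %dep paragraph, assuming the Stanford CoreNLP artifact and the Databricks spark-corenlp wrapper (the wrapper version shown is illustrative and must match your Spark and Scala versions):

```scala
%dep
// Stanford CoreNLP core jar (the models jar is added as an interpreter artifact above)
z.load("edu.stanford.nlp:stanford-corenlp:3.9.1")
// Databricks spark-corenlp wrapper providing the sentiment() column function; version is illustrative
z.load("databricks:spark-corenlp:0.2.0-s_2.11")
```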

Step 7 - Include the following Spark dependencies for JPMML-SparkML and JPMML-Model. Important note: This needs to be run before the Spark Context has been initialized.



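As a sketch, assuming JPMML-SparkML from the 1.3.x line (which targets Spark 2.2; JPMML-Model is pulled in as a transitive dependency), the %dep paragraph could be:

```scala
%dep
// JPMML-SparkML converts fitted Spark ML pipelines to PMML; the 1.3.x line targets Spark 2.2
// (version shown is illustrative; JPMML-Model comes in transitively)
z.load("org.jpmml:jpmml-sparkml:1.3.9")
```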
Step 8 - Run a Solr query and return the results into a Spark DataFrame. Note: The Zookeeper host might need to use fully qualified host names, supplied as a comma-separated list in "zkhost".

val options = Map(
  "collection" -> "Tweets",
  "zkhost" -> "localhost:2181/solr"
  // "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
Step 9 - Review the results of the Solr query.

%spark2
df.show()
Step 10 - Filter the Tweets in the Spark DataFrame to ensure the Tweet text isn't null. Once filtering is complete, add the sentiment value to the tweets.
import org.apache.spark.sql.functions._ 
import org.apache.spark.sql.types._
import com.databricks.spark.corenlp.functions._ 
val df_TweetSentiment = df.filter("text_t is not null").select($"text_t", sentiment($"text_t").as('sentimentScore))

Step 11 - Validate the results
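One illustrative way to validate the scored tweets is to sample them and look at the distribution of sentiment scores:

```scala
%spark2
// Sample the scored tweets and show how many fall into each sentiment score
df_TweetSentiment.show(10)
df_TweetSentiment.groupBy("sentimentScore").count().orderBy("sentimentScore").show()
```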


Step 12 - Build the stages that will produce the features to be fed into a Logistic Regression model for classification

Stage 1 - Regex Tokenizer will be used to separate each word into individual "tokens"

Stage 2 - Count Vectorizer will count the number of times each token occurs in the text corpus

Stage 3 - Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus

Stage 4 - Logistic Regression for classification to predict the sentiment score

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer, RegexTokenizer, CountVectorizer, CountVectorizerModel}
import org.apache.spark.ml.classification.LogisticRegression
val tokenizer = new RegexTokenizer().setInputCol("text_t").setOutputCol("words")
val wordsData = tokenizer.transform(df_TweetSentiment)
val cvModel = new CountVectorizer().setInputCol("words").setOutputCol("rawFeatures").fit(wordsData)
val featurizedData = cvModel.transform(wordsData)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("sentimentScore", "features").show()
val lr = new LogisticRegression().setLabelCol("sentimentScore").setFeaturesCol("features")

Step 13 - Build Spark Pipeline from Stages

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
val pipeline = new Pipeline()
.setStages(Array(tokenizer, cvModel, idfModel, lr))
val PipeLineModel = pipeline.fit(df_TweetSentiment)
//Save Pipeline to Disk (Optional)
//PipeLineModel.write.overwrite().save("/tmp/TweetPipelineModel") // path is illustrative
val schema = df_TweetSentiment.schema

Step 14 - Export Spark Pipeline to PMML using JPMML-SparkML

import java.io.File
import org.jpmml.sparkml.PMMLBuilder
val pmml = new PMMLBuilder(schema, PipeLineModel)
val file = pmml.buildFile(new File("/tmp/TweetPipeline.pmml"))
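A quick, illustrative sanity check on the exported file is to parse it with scala.xml and confirm the root element and its children:

```scala
// Illustrative check: parse the exported PMML and list its top-level elements
import scala.xml.XML
val pmmlXml = XML.loadFile("/tmp/TweetPipeline.pmml")
println(pmmlXml.label) // root element should be "PMML"
println((pmmlXml \ "_").map(_.label).distinct.mkString(", "))
```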


Version history
Last update: 09-16-2022 01:43 AM