Community Articles
Find and share helpful community-sourced technical articles
Cloudera Employee

This article is designed to extend the great work by @Ali Bajwa: Sample HDF/NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive and my article Connecting Solr to Spark - Apache Zeppelin Notebook

I have included the complete notebook on my Github site, which can be found on my Github site.

Step 1 - Follow Ali's tutorial to establish an Apache Solr collection called "tweets"

Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark the version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4

res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@617d134a
res1: String =

Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.

//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter

Step 4 - Download the Stanford CoreNLP libraries found on here: . Upzip the download and move it to the /tmp directory. Note: This can be accomplished on the command line or the following Zeppelin paragraph will work as well.

wget -O /tmp/
unzip /tmp/

Step 5 - In Zeppelin's Interpreters configurations for Spark, include the following artifact: /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar


Step 6 - Include the following Spark dependencies for Stanford CoreNLP and Spark CoreNLP. Important note: This needs to be run before the Spark Context has been initialized.


//In Spark Interper Settings Add the following artifact 
//  /tmp/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1-models.jar

Step 7 - Run Solr query and return results into Spark DataFrame. Note: Zookeeper host might need to use full names:

"zkhost" -> ",,",

val options = Map(
  "collection" -> "Tweets",
  "zkhost" -> "localhost:2181/solr",
//   "query" -> "Keyword, 'More Keywords'"
val df ="solr").options(options).load

Step 8 - Review results of the Solr query


Step 9 - Filter the Tweets in the Spark DataFrame to ensure the timestamp and language aren't null. Once filter has been completed, add the sentiment value to the tweets.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import com.databricks.spark.corenlp.functions._

val df_TweetSentiment = df.filter("text_t is not null and language_s = 'en' and timestamp_s is not null ").select($"timestamp_s", $"text_t", $"location", sentiment($"text_t").as('sentimentScore))

Step 10 - Correctly cast the timestamp value


val df_TweetSentiment1 = df_TweetSentiment.withColumn("timestampZ",  unix_timestamp($"timestamp_s", "EEE MMM dd HH:mm:ss ZZZZZ yyyy").cast(TimestampType)).drop($"timestamp_s")

Step 11 - Valid results and create temporary table TweetSentiment


Step 12 - Query the table TweetSentiment

select sentimentScore, count(sentimentScore) from TweetSentiment group by sentimentScore



Ian, thanks so much for such a great article.
I'm getting the following error when trying to run step 9 -"error: not found: value sentiment" when trying to set "df_TweetSentiment".

Could you help me with that?

Many thanks!

Don't have an account?
Version history
Revision #:
2 of 2
Last update:
‎08-17-2019 07:20 AM
Updated by:
Top Kudoed Authors