Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.
New Contributor

This article is designed to extend the great work by @Ali Bajwa: Sample HDF/NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive

I have included the complete notebook on my Github site, which can be found here

Step 1 - Follow Ali's tutorial to establish an Apache Solr collection called "tweets"

Step 2 - Verify the version of Apache Spark being used, and visit the Solr-Spark connector site. The key is to match the version of Spark the version of the Solr-Spark connector. In the example below, the version of Spark is 2.2.0, and the connector version is 3.4.4

%spark2
sc
sc.version
 
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@617d134a
res1: String = 2.2.0.2.6.4.0-91

Step 3 - Include the Solr-Spark dependency in Zeppelin. Important note: This needs to be run before the Spark Context has been initialized.

%dep
z.load("com.lucidworks.spark:spark-solr:jar:3.4.4")
//Must be used before SparkInterpreter (%spark2) initialized
//Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter

Step 4 - Run Solr query and return results into Spark DataFrame. Note: Zookeeper host might need to use full names:

"zkhost" -> "host-1.domain.com:2181,host-2.domain.com:2181,host-3.domain.com:2181/solr",

%spark2
val options = Map(
  "collection" -> "Tweets",
  "zkhost" -> "localhost:2181/solr",
//   "query" -> "Keyword, 'More Keywords'"
)

val df = spark.read.format("solr").options(options).load
df.cache()

Step 5 - Review results of the Solr query

%spark2 
df.count()
df.printSchema()
df.take(1)
1,134 Views
Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
1 of 1
Last update:
‎05-23-2018 03:52 PM
Updated by:
 
Contributors
Top Kudoed Authors