Created on 05-23-2018 03:52 PM
This article extends the great work by @Ali Bajwa: Sample HDF/NiFi flow to Push Tweets into Solr/Banana, HDFS/Hive.
I have included the complete Zeppelin notebook on my GitHub site, which can be found here.
Step 1 - Follow Ali's tutorial to establish an Apache Solr collection called "tweets"
Step 2 - Verify the version of Apache Spark being used, and visit the spark-solr connector site. The key is to match the version of Spark with the version of the spark-solr connector. In the example below, the version of Spark is 2.2.0 and the connector version is 3.4.4.
%spark2
sc
sc.version

res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@617d134a
res1: String = 2.2.0.2.6.4.0-91
Step 3 - Include the spark-solr dependency in Zeppelin. Important note: this paragraph must be run before the Spark context has been initialized.
%dep
z.load("com.lucidworks.spark:spark-solr:jar:3.4.4")
// Must be run before the Spark interpreter (%spark2) is initialized
// Hint: put this paragraph before any Spark code and restart the Zeppelin interpreter
Step 4 - Run a Solr query and return the results into a Spark DataFrame. Note: the ZooKeeper host string might need to use fully qualified host names, for example:
"zkhost" -> "host-1.domain.com:2181,host-2.domain.com:2181,host-3.domain.com:2181/solr",
%spark2
val options = Map(
  "collection" -> "tweets",
  "zkhost" -> "localhost:2181/solr"
  // "query" -> "Keyword, 'More Keywords'"
)
val df = spark.read.format("solr").options(options).load
df.cache()
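If you want to narrow what comes back from Solr at read time, the connector also accepts additional read options such as "query" and "fields". The paragraph below is a minimal sketch of that idea; the field names (text_t, created_at, etc.) are assumptions about the tweet schema from Ali's flow, so adjust them to match your collection.

%spark2
// Hypothetical example: restrict the Solr query and the returned fields at read time
val filteredOptions = Map(
  "collection" -> "tweets",
  "zkhost" -> "localhost:2181/solr",
  "query" -> "text_t:hadoop",          // assumes a text field named text_t exists in the schema
  "fields" -> "id,text_t,created_at"   // assumed field names; replace with fields from your collection
)
val filteredDf = spark.read.format("solr").options(filteredOptions).load
filteredDf.show(5, truncate = false)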
Step 5 - Review the results of the Solr query.
%spark2
df.count()
df.printSchema()
df.take(1)
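Once the DataFrame is cached, it can also be explored with Spark SQL. A minimal sketch, assuming the same df from Step 4 (the view name tweets_df is arbitrary):

%spark2
// Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("tweets_df")
spark.sql("SELECT COUNT(*) AS tweet_count FROM tweets_df").show()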