Using Solr's Extracting Request Handler with Apach...
Created on 06-28-2016 10:21 PM - edited 08-17-2019 11:52 AM
The PutSolrContentStream processor in Apache NiFi makes use of Solr's ContentStreamUpdateRequest, which means it can stream arbitrary data to Solr. Typically this processor is used to insert JSON documents, but it can be used to stream any kind of data. The following tutorial shows how to use NiFi to stream data to Solr's Extracting Request Handler.
Download the latest version of Solr (6.0.0 at the time of writing) and extract the distribution
Start Solr with the cloud example: ./bin/solr start -e cloud -noprompt
We can see that a parameter called "literal.id" is normally passed on the URL. Any user-defined properties on PutSolrContentStream will be passed as URL parameters to Solr, so by clicking the + icon in the top-right we can add a "literal.id" property and set it to the UUID of the flow file, for example with the NiFi Expression Language value ${uuid}.
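To make the role of "literal.id" concrete, here is a minimal Python sketch of the kind of request that ends up at Solr's Extracting Request Handler. It is not what NiFi runs internally; it only illustrates how the property becomes a URL parameter. The host and port (localhost:8983), the "gettingstarted" collection, the commit=true parameter, and the helper names build_extract_url and post_document are assumptions for illustration.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed defaults for the Solr cloud example; adjust to your setup.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "gettingstarted"


def build_extract_url(doc_id, extra_params=None, base=SOLR_BASE, collection=COLLECTION):
    """Build an /update/extract URL with literal.id set to doc_id.

    extra_params is an optional dict of further URL parameters,
    mirroring user-defined properties on PutSolrContentStream.
    """
    # commit=true is an assumption so the document is searchable right away.
    query = {"literal.id": doc_id, "commit": "true"}
    if extra_params:
        query.update(extra_params)
    return f"{base}/{collection}/update/extract?{urlencode(query)}"


def post_document(path, content_type, doc_id=None):
    """POST a file's raw bytes to the extracting handler (needs a running Solr)."""
    doc_id = doc_id or str(uuid.uuid4())  # stand-in for the FlowFile UUID
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(build_extract_url(doc_id), data=body,
                  headers={"Content-Type": content_type})
    with urlopen(req) as resp:
        return doc_id, resp.status
```

With Solr running, a call such as post_document("quickstart.html", "text/html") would send the file with a random UUID as its id, much as the NiFi flow does with the FlowFile UUID.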
Ingest & Query
At this point we can copy any document into <nifi_home>/data/input and see whether Solr can identify it. For this example I copied the quickstart.html file from the Solr docs directory. After going to the Solr Admin UI and querying the "gettingstarted" collection for all documents, you should see the ingested document in the results.
We can see that Solr identified the document as "text/html", extracted the title as "Solr Quick Start", and has the id as the UUID of the FlowFile from NiFi. We can also see the extraction was done using Tika behind the scenes.
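The same check can be done outside the Admin UI. The sketch below builds a match-all query against the "gettingstarted" collection and pulls out the fields discussed above. The host and port, the field names "content_type" and "title", and the helper names build_select_url, first_value, summarize, and fetch_all are assumptions; the actual field names depend on the collection's schema and extraction settings.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(query="*:*", rows=10,
                     base="http://localhost:8983/solr",
                     collection="gettingstarted"):
    """Build a /select URL; the default query matches all documents."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/{collection}/select?{urlencode(params)}"


def first_value(value):
    """Return the first element of a multi-valued field, or the value itself."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def summarize(response):
    """Return (id, content_type, title) tuples from a parsed select response."""
    docs = response.get("response", {}).get("docs", [])
    return [(d.get("id"),
             first_value(d.get("content_type")),  # assumed field name
             first_value(d.get("title")))         # assumed field name
            for d in docs]


def fetch_all():
    """Run the match-all query against a running Solr and summarize the docs."""
    with urlopen(build_select_url()) as resp:
        return summarize(json.load(resp))
```

Calling fetch_all() after the flow has run should list one tuple per ingested document, with the FlowFile UUID in the first position.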
From here you can send in any type of document (PDF, Word, Excel, etc.) and have Solr extract the text using Tika.