Community Articles

Find and share helpful community-sourced technical articles.
Labels (2)
avatar
Master Guru

The PutSolrContentStream processor in Apache NiFi makes use of Solr's ContentStreamUpdateRequest which means it can stream arbitrary data to Solr. Typically this processor is used to insert JSON documents, but it can be used to stream any kind of data. The following tutorial shows how to use NiFi to stream data to Solr's Extracting Request Handler.

Setup Solr

  • Download the latest version of Solr (6.0.0 for writing this) and extract the distribution
  • Start Solr with the cloud example: ./bin/solr start -e cloud -noprompt
  • Verify Solr is running by going to http://localhost:8983/solr in your browser

Setup NiFi

  • Download the latest version of NiFi (0.6.1) and extract the distribution
  • Start NiFi: ./bin/nifi.sh start
  • Verify NiFi is running by going to http://localhost:8080/nifi in your browser
  • Create a directory under the NiFi home for listening for new files:
cd nifi-0.6.1
mkdir data
mkdir data/input

Create the NiFi Flow

Create a simple flow of GetFile -> PutSolrContentStream -> LogAttribute:

5320-flow.png

The GetFile Input Directory should be ./data/input corresponding the directory created earlier.

The configuration for PutSolrContentStream should be the following:

5332-putsolr-1.png

  • The Solr Type is set to Cloud since we started the cloud example
  • The Solr Location is the ZooKeeper connection string for the embedded ZK started by Solr
  • The Collection is the example gettingstarted collection created by Solr
  • The Content Stream Path is the path of the update handler in Solr used for extracting text, this corresponds to a path in solrconfix.ml
  • The Content-Type is application/octet-stream so we can stream over any arbitrary data

The extracting request handler is described in detail here: https://wiki.apache.org/solr/ExtractingRequestHandler

We can see that a parameter called "literal.id" is normally passed on the URL. Any user defined properties on PutSolrContentStream will be passed as URL parameters to Solr, so by clicking the + icon in the top-right we can add this property and set it to the UUID of the flow file:

5333-putsolr-2.png

Ingest & Query

At this point we can copy any document into <nifi_home>/data/input and see if Solr can identify it. For this example I copied quickstart.html file from the Solr docs directory. After going to the Solr Admin UI and querying the "gettingstarted" collection for all documents, you should see the following results:

5331-solr-results.png

We can see that Solr identified the document as "text/html", extracted the title as "Solr Quick Start", and has the id as the UUID of the FlowFile from NiFi. We can also see the extraction was done using Tika behind the scenes.

From here you can send in any type of documents, PDF, Word, Excel, etc., and have Solr extract the text using Tika.

7,401 Views
Comments
avatar
New Contributor

I have passed a text file having data in below input format and created respective fields for the text input in solr admin for collection "gettingstarted".

I opted indexed and stored option while creating fields. but still i am not able to see them while querying.

ex:

"response":{"numFound":3,"start":0,"maxScore":1.0,"docs":[
      {
        "stream_size":["null"],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.html.HtmlParser"],
        "stream_content_type":["text/html"],
        "content_encoding":["ISO-8859-1"],
        "content_type":["text/html; charset=ISO-8859-1"],
        "id":"df359cd5-ce8f-44c6-9eb9-681e44eba102",
        "_version_":1576353324702105600},

Input data format : sample.txt

johnd00,pc-1234,john doe,john.doe@dtaa.com,engineer,08/21/2018 00:00:00,deviceplugged,1000.00

fields created in managed-schema : userid,pcname,employeename,email,role,date,activity,score

all data is available under _text_ field ... but not under above fields

what could be the right approach? do i need to add fields to text/extract updater ..?

please let me know

avatar
Master Guru

Hi @sukesh nagaraja

I think the results you got are expected behavior behavior... The extracting request handler has no way to know the field names for the data you sent in. It is generally used to extract text from files like PDFs, or Word documents, where you basically have a title and content, and everything just goes into the content mostly.

For your scenario, you basically have a CSV where you know the field names. Take a look at Solr's CSV update handler:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawit...

You can use this from NiFi by setting the path to /update and setting the Content-Type to application/csv and then add a property fieldnames with your list of fields.

I'd recommend playing around with the update handler outside of NiFi first, just by using curl or a browser tool like Postman, and then once you have the request working the way you want, then get it working in NiFi.

avatar
New Contributor

Thanks for the guidance. I will look into it.

avatar
New Contributor

It Worked. 🙂