Issue indexing HTML files using NiFi and PutSolrContentStream

Explorer

I'm having trouble streaming HTML files into Solr. I have a GetFile processor that picks up HTML files from local disk and connects to PutSolrContentStream, but I get a JSON parse error in the PutSolrContentStream processor. I have tried changing the Content Type value to "text/html" and to "text", but I still get the same error.

How can I resolve this issue?

Thanks!

1 ACCEPTED SOLUTION

Master Guru

For this question you have to first take NiFi out of the picture and think about how you would index HTML with Solr.

HTML is not one of Solr's standard input formats such as JSON, XML, and CSV, but Solr has an "extracting request handler" that can handle HTML. See this page:

https://wiki.apache.org/solr/ExtractingRequestHandler

To use that from NiFi, set the "Content Stream Path" to "/update/extract", set the "Content Type" to "text/html", and add a user-defined property named "literal.id" set to some ID (you can use the FlowFile UUID by setting it to ${uuid}).

2911-nifi-solr-extract.png
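
To see what those settings amount to with NiFi out of the picture, here is a rough Python sketch of the equivalent HTTP request. The core URL, the `build_extract_request` helper name, and the `commit=true` parameter are my assumptions for illustration, not something taken from the NiFi processor. Only "/update/extract", "text/html", and "literal.id" come from the settings above.

```python
import urllib.parse
import urllib.request
import uuid


def build_extract_request(solr_core_url, html_bytes, doc_id=None):
    """Build (but do not send) a POST to Solr's extracting request handler.

    solr_core_url is a hypothetical base URL such as
    http://localhost:8983/solr/mycore; adjust it to your own setup.
    """
    # Fall back to a random UUID, much like using ${uuid} in NiFi.
    doc_id = doc_id or str(uuid.uuid4())
    # literal.id supplies the document id; commit=true is an assumption
    # added here so the document becomes searchable right away.
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    url = solr_core_url.rstrip("/") + "/update/extract?" + params
    return urllib.request.Request(
        url,
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_extract_request(
        "http://localhost:8983/solr/mycore",
        b"<html><body><h1>Hello</h1></body></html>",
        doc_id="doc-1",
    )
    print(req.get_method(), req.full_url)
    # To actually send it to a running Solr: urllib.request.urlopen(req)
```

If a request like this works against your Solr instance, the same path, content type, and "literal.id" value should carry over to the PutSolrContentStream properties.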

2 REPLIES

Explorer

Thank you! @bbende