Created 03-18-2016 09:59 PM
I'm having trouble streaming HTML files into Solr. I have a GetFile processor that picks up HTML files from local disk and connects to PutSolrContentStream, but I am getting a JSON parse error in the PutSolrContentStream processor. I have tried changing the Content Type value to "text/html" or "text" and am still getting the same error.
How can I resolve this issue?
Thanks!
Created on 03-19-2016 02:51 PM - edited 08-19-2019 01:59 AM
For this question you have to first take NiFi out of the picture and think about how you would index HTML with Solr.
HTML is not one of Solr's standard input formats such as JSON, XML, and CSV, but Solr has an "extracting request handler" that is capable of handling HTML. See this page:
https://wiki.apache.org/solr/ExtractingRequestHandler
To use that from NiFi, set the "Content Stream Path" to "/update/extract", set the "Content Type" to "text/html", and add a user-defined property named "literal.id" whose value is the document id (you can use the FlowFile uuid by setting it to ${uuid}).
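To check the Solr side without NiFi in the picture, here is a rough Python sketch of the equivalent request: a POST of the raw HTML to "/update/extract" with Content-Type "text/html" and a "literal.id" parameter. The base URL, the collection name "mycollection", and the id "doc-1" are placeholders I made up, so substitute your own. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base_url, collection, doc_id, html_bytes):
    """Build (but do not send) a POST to Solr's extracting request handler."""
    # literal.id supplies the document id, mirroring the user-defined
    # "literal.id" property on the NiFi processor.
    query = urlencode({"literal.id": doc_id})
    url = f"{solr_base_url}/{collection}/update/extract?{query}"
    return Request(
        url,
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical values: adjust the host, collection, and id for your setup.
    req = build_extract_request(
        "http://localhost:8983/solr",
        "mycollection",
        "doc-1",
        b"<html><body>Hello</body></html>",
    )
    print(req.get_method(), req.full_url)
```

Passing the returned request to urllib.request.urlopen would actually send it. If that works against your Solr instance, the same path, content type, and literal.id should work from PutSolrContentStream.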
Created 03-21-2016 01:50 PM
Thank you! @bbende