Issue indexing html files using nifi and PutSolrContentStream
- Labels: Apache NiFi, Apache Solr
Created 03-18-2016 09:59 PM
I'm having trouble streaming HTML files into Solr. I have a GetFile processor that picks up HTML files from local disk and connects to PutSolrContentStream, but I am getting a JSON parse error in the PutSolrContentStream processor. I have tried changing the Content-Type value to "text/html" or "text" and am still getting the same error.
How can I resolve this issue?
Thanks!
Created on 03-19-2016 02:51 PM - edited 08-19-2019 01:59 AM
For this question you have to first take NiFi out of the picture and think about how you would index HTML with Solr.
HTML is not one of the standard Solr input formats like JSON, XML, and CSV, but Solr has an "extracting request handler" that is capable of handling HTML; see this page:
https://wiki.apache.org/solr/ExtractingRequestHandler
To use that from NiFi, set the "Content Stream Path" to "/update/extract", set the "Content Type" to "text/html", and add a user-defined property named "literal.id" set to some id (you can use the FlowFile uuid by setting it to ${uuid}).
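To check the Solr side with NiFi out of the picture, it can help to look at the request those three settings roughly correspond to. The sketch below only builds the request and does not send it. The base URL, the core name `mycore`, the helper name `build_extract_request`, and the `commit` parameter are assumptions for illustration, and it presumes the extracting request handler is enabled on your core.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base_url, html_bytes, doc_id, commit=True):
    """Build (without sending) a POST to Solr's extracting request handler.

    solr_base_url is a hypothetical core URL such as
    http://localhost:8983/solr/mycore; adjust it to your installation.
    """
    # "literal.id" supplies the document id, mirroring the user-defined
    # property added to PutSolrContentStream.
    params = {"literal.id": doc_id}
    if commit:
        # Assumption: commit right away so the document is searchable.
        params["commit"] = "true"

    # "/update/extract" mirrors the "Content Stream Path" setting.
    url = f"{solr_base_url.rstrip('/')}/update/extract?{urlencode(params)}"

    # The Content-Type header mirrors the "Content Type" setting.
    return Request(
        url,
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


request = build_extract_request(
    "http://localhost:8983/solr/mycore",  # hypothetical Solr core URL
    b"<html><body><p>Hello</p></body></html>",
    "doc-1",
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it to a running Solr instance. If that call fails with the same kind of error, the problem is in the Solr configuration rather than in the NiFi flow.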
Created 03-21-2016 01:50 PM
Thank you! @bbende
