<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Issue indexing html files using nifi and PutSolrContentStream in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135029#M23290</link>
    <description>&lt;P&gt;To answer this, first take NiFi out of the picture and think about how you would index HTML with Solr on its own.&lt;/P&gt;&lt;P&gt;HTML is not one of the standard input formats like JSON, XML, and CSV, but Solr has an "extracting request handler" that can handle HTML. See this page:&lt;/P&gt;&lt;P&gt;&lt;A href="https://wiki.apache.org/solr/ExtractingRequestHandler" target="_blank" rel="nofollow noopener noreferrer"&gt;https://wiki.apache.org/solr/ExtractingRequestHandler&lt;/A&gt;&lt;/P&gt;&lt;P&gt;To use that handler from NiFi, set the "Content Stream Path" to "/update/extract", set the "Content Type" to "text/html", and add a user-defined property named "literal.id" whose value is the document id (to use the FlowFile uuid, set it to ${uuid}).&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="2911-nifi-solr-extract.png" style="width: 1646px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/22096i6BE1D9EAC479BF03/image-size/medium?v=v2&amp;amp;px=400" role="button" title="2911-nifi-solr-extract.png" alt="2911-nifi-solr-extract.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 19 Aug 2019 08:59:31 GMT</pubDate>
    <dc:creator>bbende</dc:creator>
    <dc:date>2019-08-19T08:59:31Z</dc:date>
    <item>
      <title>Issue indexing html files using nifi and PutSolrContentStream</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135028#M23289</link>
      <description>&lt;P&gt;I'm having trouble streaming HTML files into Solr. I have a GetFile processor that reads HTML files from local disk and connects to PutSolrContentStream, but I am getting a JSON parse error in the PutSolrContentStream processor. I have tried changing the Content Type value to "text/html" or "text", and I am still getting the same error.&lt;/P&gt;&lt;P&gt;How can I resolve this issue?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Sat, 19 Mar 2016 04:59:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135028#M23289</guid>
      <dc:creator>dlam</dc:creator>
      <dc:date>2016-03-19T04:59:30Z</dc:date>
    </item>
    <item>
      <title>Re: Issue indexing html files using nifi and PutSolrContentStream</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135029#M23290</link>
      <description>&lt;P&gt;To answer this, first take NiFi out of the picture and think about how you would index HTML with Solr on its own.&lt;/P&gt;&lt;P&gt;HTML is not one of the standard input formats like JSON, XML, and CSV, but Solr has an "extracting request handler" that can handle HTML. See this page:&lt;/P&gt;&lt;P&gt;&lt;A href="https://wiki.apache.org/solr/ExtractingRequestHandler" target="_blank" rel="nofollow noopener noreferrer"&gt;https://wiki.apache.org/solr/ExtractingRequestHandler&lt;/A&gt;&lt;/P&gt;&lt;P&gt;To use that handler from NiFi, set the "Content Stream Path" to "/update/extract", set the "Content Type" to "text/html", and add a user-defined property named "literal.id" whose value is the document id (to use the FlowFile uuid, set it to ${uuid}).&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="2911-nifi-solr-extract.png" style="width: 1646px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/22096i6BE1D9EAC479BF03/image-size/medium?v=v2&amp;amp;px=400" role="button" title="2911-nifi-solr-extract.png" alt="2911-nifi-solr-extract.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 19 Aug 2019 08:59:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135029#M23290</guid>
      <dc:creator>bbende</dc:creator>
      <dc:date>2019-08-19T08:59:31Z</dc:date>
    </item>
    <item>
      <title>Re: Issue indexing html files using nifi and PutSolrContentStream</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135030#M23291</link>
      <description>&lt;P&gt;Thank you! @bbende&lt;/P&gt;</description>
      <pubDate>Mon, 21 Mar 2016 20:50:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Issue-indexing-html-files-using-nifi-and/m-p/135030#M23291</guid>
      <dc:creator>dlam</dc:creator>
      <dc:date>2016-03-21T20:50:42Z</dc:date>
    </item>
  </channel>
</rss>

