<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Solr indexing in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121029#M34277</link>
    <description>&lt;P&gt;If you use NiFi, you can use the ListHDFS + FetchHDFS processors to monitor an HDFS directory for new files.&lt;/P&gt;&lt;P&gt;From there you have two options for indexing the documents:&lt;/P&gt;&lt;P&gt;1) As Sunile mentioned, you could write a processor that extracts the content using Tika and then sends it to the PutSolrContentStream processor. There will be a new ExtractMediaMetadata processor in the next release, but it doesn't extract the body content, so you would likely need to implement your own processor.&lt;/P&gt;&lt;P&gt;2) You could send the documents (PDFs, emails, Word files) straight from FetchHDFS to PutSolrContentStream, and configure PutSolrContentStream to use Solr's extracting request handler, which uses Tika behind the scenes:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache.html"&gt;https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 11 Jul 2016 21:52:09 GMT</pubDate>
    <dc:creator>bbende</dc:creator>
    <dc:date>2016-07-11T21:52:09Z</dc:date>
    <item>
      <title>Solr indexing</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121027#M34275</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;What is the best approach to index a folder in HDFS containing documents (PDFs, emails, Word, Excel, etc.)? This folder gets updated on a daily basis, and its size is two terabytes.&lt;/P&gt;&lt;P&gt;Should I write code to loop over the files, extract their content with the Tika parser, and push it to a Solr index, maybe using SolrJ? And what about new documents?&lt;/P&gt;&lt;P&gt;Or is there a better approach to bulk-insert all the content of this folder into my Solr index and then update the index every day?&lt;/P&gt;&lt;P&gt;What about Apache NiFi? Which approach should I follow?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Sat, 09 Jul 2016 16:37:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121027#M34275</guid>
      <dc:creator>afdebbas</dc:creator>
      <dc:date>2016-07-09T16:37:51Z</dc:date>
    </item>
    <item>
      <title>Re: Solr indexing</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121028#M34276</link>
      <description>&lt;P style="margin-left: 40px;"&gt;  &lt;A rel="user" href="https://community.cloudera.com/users/2262/ahmaddebbas.html" nodeid="2262"&gt;@Ahmad Debbas&lt;/A&gt;  I have done this using Storm, parsing emails/PDFs with Tika as the documents land on HDFS. You can use the Storm HDFS spout (info &lt;A href="https://github.com/apache/storm/tree/master/external/storm-hdfs"&gt;here&lt;/A&gt;). Once the data is parsed, use another bolt to sink it into Solr. It's a pretty straightforward solution. NiFi is definitely worth considering: you would need to build a NiFi Tika processor, so that each event runs through the processor, is parsed to text, and is then sent into Solr. That could work as well.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2016 11:17:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121028#M34276</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-07-11T11:17:53Z</dc:date>
    </item>
    <item>
      <title>Re: Solr indexing</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121029#M34277</link>
      <description>&lt;P&gt;If you use NiFi, you can use the ListHDFS + FetchHDFS processors to monitor an HDFS directory for new files.&lt;/P&gt;&lt;P&gt;From there you have two options for indexing the documents:&lt;/P&gt;&lt;P&gt;1) As Sunile mentioned, you could write a processor that extracts the content using Tika and then sends it to the PutSolrContentStream processor. There will be a new ExtractMediaMetadata processor in the next release, but it doesn't extract the body content, so you would likely need to implement your own processor.&lt;/P&gt;&lt;P&gt;2) You could send the documents (PDFs, emails, Word files) straight from FetchHDFS to PutSolrContentStream, and configure PutSolrContentStream to use Solr's extracting request handler, which uses Tika behind the scenes:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache.html"&gt;https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2016 21:52:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Solr-indexing/m-p/121029#M34277</guid>
      <dc:creator>bbende</dc:creator>
      <dc:date>2016-07-11T21:52:09Z</dc:date>
    </item>
  </channel>
</rss>
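<!--
Option 2 in bbende's reply (sending raw documents to Solr's extracting request handler,
which runs Tika server-side) can also be exercised directly with curl, outside NiFi.
A minimal sketch, assuming a local Solr instance with a core named "docs"; the core
name, document id, and file path are placeholders:

```shell
# Post a PDF to Solr's extracting request handler; Tika parses it server-side.
# literal.id sets the Solr document id, commit=true makes it searchable immediately.
curl "http://localhost:8983/solr/docs/update/extract?literal.id=report1&commit=true" \
  -F "myfile=@/data/report1.pdf"
```

This is the same endpoint PutSolrContentStream targets when configured with the
/update/extract content-stream path, so it is a quick way to verify the Solr side
of the flow before wiring up ListHDFS + FetchHDFS.
-->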

