Created 07-09-2016 09:37 AM
Hello
What is the best approach to index a folder in HDFS containing documents (PDFs, emails, Word, Excel, etc.)? This folder gets updated on a daily basis, and its size is two terabytes.
Should I write code to loop over the files, extract their content with the Tika parser, and push it to the Solr index using SolrJ, maybe? And what about new documents?
Or is there a better approach to bulk-insert all the content of this folder into my Solr index and then update the index every day?
What about Apache NiFi? Which approach should I follow?
Thanks
Created 07-11-2016 02:52 PM
If you use NiFi, you can use the ListHDFS and FetchHDFS processors to monitor an HDFS directory for new files.
From there you have two options to index the documents:
1) As Sunile mentioned, you could write a processor that extracts the content with Tika and then sends it to the PutSolrContentStream processor. There will be a new ExtractMediaMetadata processor in the next release, but it doesn't extract the body content, so you would likely need to implement your own processor.
2) You could send the documents (PDFs, emails, Word) straight from FetchHDFS to PutSolrContentStream, and configure PutSolrContentStream to use Solr's extracting request handler (/update/extract), which uses Tika behind the scenes.
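For option 2, this is roughly what happens under the hood: the raw file bytes are posted to Solr's /update/extract handler and Tika runs server-side. A minimal SolrJ sketch of the same call (the core name "docs", file name, and document id are placeholders for your setup):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import java.io.File;

public class ExtractExample {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr core named "docs" running at the default port; adjust to your setup.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

        // /update/extract is Solr's extracting request handler (Tika behind the scenes).
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("sample.pdf"), "application/pdf");
        req.setParam("literal.id", "sample-1"); // unique key stored alongside extracted text
        req.setParam("commit", "true");         // commit so the document is searchable right away

        solr.request(req);
        solr.close();
    }
}
```

This requires the solr-solrj dependency on the classpath; in the NiFi flow, PutSolrContentStream issues an equivalent request for you.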
Created 07-11-2016 04:17 AM
@Ahmad Debbas I have done this using Storm to parse emails/PDFs with Tika as the documents land on HDFS. You can use the Storm HDFS spout (info here). Once the data is parsed, use another bolt to sink it into Solr. Pretty straightforward solution. NiFi is definitely a consideration, but you would need to build a NiFi Tika processor: each event is run through the processor, parsed to text, and then sent into Solr. This could work as well.
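The parsing bolt in that topology could look something like the sketch below: it takes a file stream emitted by the HDFS spout, extracts plain text with the Tika facade, and emits it for a downstream Solr bolt. The tuple field names ("path", "stream") are assumptions about what your spout emits; it needs storm-core and tika-parsers on the classpath.

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.tika.Tika;
import java.io.InputStream;
import java.util.Map;

// Hypothetical bolt: parses incoming document streams with Tika and
// emits (path, body) pairs for a downstream Solr indexing bolt.
public class TikaParseBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient Tika tika;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.tika = new Tika(); // auto-detects the format (PDF, Word, email, ...)
    }

    @Override
    public void execute(Tuple input) {
        try (InputStream in = (InputStream) input.getValueByField("stream")) {
            String body = tika.parseToString(in); // extracted plain-text content
            collector.emit(input, new Values(input.getStringByField("path"), body));
            collector.ack(input);
        } catch (Exception e) {
            collector.fail(input); // let Storm replay the tuple on failure
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("path", "body"));
    }
}
```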