Support Questions

Find answers, ask questions, and share your expertise

Solr indexing

Expert Contributor

Hello

What is the best approach to index a folder in HDFS containing documents (PDFs, emails, Word, Excel, etc.)? The folder is updated daily and is about two terabytes in size.

Should I write code that loops over the files, extracts their content with the Tika parser, and pushes it to the Solr index, using SolrJ maybe? And how would I handle new documents?
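For illustration, the loop I have in mind would look roughly like this; the HDFS path, Solr URL, collection name, and field names are placeholders, not real values, and a last-run timestamp is one assumed way to pick up only the new documents each day:

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class HdfsTikaIndexer {

    // Pure helper: only index files modified since the last run,
    // which is one simple way to handle the daily additions.
    static boolean isNewSinceLastRun(long fileModTimeMillis, long lastRunMillis) {
        return fileModTimeMillis > lastRunMillis;
    }

    public static void main(String[] args) throws Exception {
        long lastRun = args.length > 0 ? Long.parseLong(args[0]) : 0L;

        FileSystem fs = FileSystem.get(new Configuration());
        Tika tika = new Tika();
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {

            // Recursively walk the HDFS folder (path is a placeholder).
            RemoteIterator<LocatedFileStatus> it =
                fs.listFiles(new Path("/data/documents"), true);
            while (it.hasNext()) {
                LocatedFileStatus status = it.next();
                if (!isNewSinceLastRun(status.getModificationTime(), lastRun)) {
                    continue; // skip files indexed on a previous run
                }
                try (InputStream in = fs.open(status.getPath())) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", status.getPath().toString());
                    doc.addField("content_txt", tika.parseToString(in));
                    solr.add(doc);
                }
            }
            solr.commit();
        }
    }
}
```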

Or is there a better approach to bulk-insert the entire contents of this folder into my Solr index and then update the index every day?

What about Apache NiFi? Which approach should I follow?

Thanks

1 ACCEPTED SOLUTION

Master Guru

If you use NiFi, you can use the ListHDFS + FetchHDFS processors to monitor an HDFS directory for new files.

From there you have two options to index the documents:

1) As Sunile mentioned, you could write a processor that extracts the content using Tika and then sends it to the PutSolrContentStream processor. There is going to be a new ExtractMediaMetadata processor in the next release, but it doesn't extract the body content, so you would likely need to implement your own processor.

2) You could send the documents (PDFs, emails, Word files) straight from FetchHDFS to PutSolrContentStream, and configure PutSolrContentStream to use Solr's extracting request handler, which uses Tika behind the scenes:

https://community.hortonworks.com/articles/42210/using-solrs-extracting-request-handler-with-apache....
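For reference, the same extracting-handler call that PutSolrContentStream would make can be issued directly from SolrJ. This is a rough sketch; the Solr URL, core name, file name, and id value are assumptions, while /update/extract is Solr's default path for the extracting request handler:

```java
import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractingHandlerExample {

    // Pure helper: Solr's "literal.<field>" parameters attach fixed
    // field values to the document Tika extracts on the server side.
    static String literal(String field) {
        return "literal." + field;
    }

    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            // Post the raw file to the extracting request handler and
            // let Solr run Tika behind the scenes.
            ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("report.pdf"), "application/pdf");
            req.setParam(literal("id"), "report.pdf");
            req.setParam("commit", "true");
            solr.request(req);
        }
    }
}
```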


2 REPLIES

Master Guru

@Ahmad Debbas I have done this using Storm to parse emails/PDFs with Tika as the documents land in HDFS. You can use the Storm HDFS spout (info here). Once the data is parsed, use another bolt to sink it into Solr. It's a pretty straightforward solution. NiFi is definitely worth considering, but you would need to build a NiFi Tika processor: each event is run through the processor, parsed into text, and pushed into Solr. That could work as well.
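A minimal sketch of the Tika bolt described above, assuming an upstream HDFS spout that emits each file's path and raw bytes; the stream field names ("path", "bytes", "text") are illustrative assumptions, and the Solr sink bolt is omitted:

```java
import java.io.ByteArrayInputStream;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.tika.Tika;

public class TikaParseBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient Tika tika;

    // Pure helper: normalize extracted text before emitting it downstream.
    static String clean(String text) {
        return text == null ? "" : text.trim();
    }

    @Override
    public void prepare(Map conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
        this.tika = new Tika();
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            byte[] raw = tuple.getBinaryByField("bytes");
            String text = clean(tika.parseToString(new ByteArrayInputStream(raw)));
            // Anchor the emit to the input tuple so Storm can track it.
            collector.emit(tuple, new Values(tuple.getStringByField("path"), text));
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple); // let Storm replay the document
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("path", "text"));
    }
}
```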
