
Indexing docs from HDFS in Solr

Super Collaborator

Hi:

Hi. When you index files from HDFS into Solr, are those files stored on the local FS (like /use/local/de???), or does Solr index directly from HDFS? I mean, will we end up with a double copy of the files, on HDFS and on the local filesystem?

4 REPLIES

Super Collaborator

If you configure Solr to use HDFS, it will not write to the local FS. Since the index is not on the local FS, you lose the advantage of OS file caching, so you need to configure Solr's off-heap block cache to take the place of the OS cache. If you are doing frequent updates, HDFS may not be the best home for your Solr files, because index files change frequently and that means a lot of file IO. You can find details here:

https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
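As a sketch, the HDFS setup from that page comes down to switching the directory factory in solrconfig.xml and enabling the off-heap block cache (the HDFS path and slab count below are placeholders; size them for your cluster):

```xml
<!-- solrconfig.xml: store the index on HDFS instead of the local FS -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- placeholder: point solr.hdfs.home at your namenode and path -->
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <!-- off-heap block cache stands in for the OS page cache you lose on HDFS -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
</directoryFactory>

<!-- HDFS has no native file locking, so use the HDFS lock type -->
<lockType>hdfs</lockType>
```

Because the block cache uses direct memory, you also need to raise `-XX:MaxDirectMemorySize` in the JVM options when starting Solr, or it will fail to allocate the cache.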

Super Collaborator

@Roberto Sancho did this answer your question?

Super Collaborator

OK, thanks. The idea is to not use local files, just to index directly from HDFS.

At the moment I will only update once a day, in a batch.

Super Collaborator

That should work OK then. You'll get the best results if you issue a commit with openSearcher=true only after adding all of your documents. Normally you want Solr to handle commits (via solrconfig.xml) rather than your client, and you want them as infrequently as your use case can tolerate, since there is overhead in creating new searchers and warming the caches. And, since you're on HDFS, you'll also pay to pull data across the network to your Solr nodes. Good luck. Please update me on how it's working out on HDFS.
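For the daily-batch case above, a commit setup along these lines (the 60s interval is illustrative, not a recommendation) keeps Solr in charge of durability while avoiding searcher churn during the load:

```xml
<!-- solrconfig.xml: let Solr hard-commit for durability during the batch,
     but don't open new searchers on every commit -->
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit every 60s; flushes the tlog -->
  <openSearcher>false</openSearcher>  <!-- skip searcher creation and cache warming -->
</autoCommit>
```

Then, once the batch load finishes, the client makes the new documents visible with a single explicit commit, e.g. `curl 'http://host:8983/solr/yourcollection/update?commit=true&openSearcher=true'` (host and collection name are placeholders).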