Created 11-26-2015 09:12 AM
According to the docs, Solr relies heavily on fast bulk reads and writes during index updates. Let's say I want to index thousands of documents (Word, PDF, HTML, ...) or I want to store my Ranger audit logs in my SolrCloud. Is it a good idea to use HDFS as the index and data store, or should I go with a local non-HDFS data directory?
The Ranger audit logs documentation mentions "1 TB free space in the volume where Solr will store the index data.", which sounds like a local, non-HDFS volume?!
Created 11-26-2015 12:32 PM
It was shared by someone from the field: Link
Hadoop provides a system for processing large amounts of data, and this can be a great way to actually build your index. You can store the data to be indexed on HDFS and then run a MapReduce job that processes it and feeds it to your Solr instances, which then build up the index. With an index a terabyte in size, you will see great performance gains when you both process the data in parallel on your Hadoop cluster and index it in parallel with your Solr cluster.
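To make the "build the index in parallel" idea concrete, here is a minimal sketch of a Hadoop mapper that feeds documents to SolrCloud. It assumes the SolrJ 5.x API that was current when this thread was written; the ZooKeeper hosts, collection name, and one-document-per-line input are placeholders, not anything prescribed by the article:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        private CloudSolrClient solr;
        private final List<SolrInputDocument> buffer = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            // Connect to SolrCloud through ZooKeeper; the hosts and collection are placeholders.
            solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("my_collection");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException {
            // Assume one raw document per input line; a real job would extract
            // text from Word/PDF/HTML with something like Tika first.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("text", value.toString());
            buffer.add(doc);
            if (buffer.size() >= 1000) {
                flush();
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            flush();
            try {
                solr.commit();
            } catch (SolrServerException e) {
                throw new IOException(e);
            }
            solr.close();
        }

        private void flush() throws IOException {
            if (buffer.isEmpty()) return;
            try {
                solr.add(buffer); // each mapper indexes its own slice in parallel
            } catch (SolrServerException e) {
                throw new IOException(e);
            }
            buffer.clear();
        }
    }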
Created 11-27-2015 08:18 AM
thanks @Neeraj Sabharwal, great article. This helped a lot with my decision.
Created 11-26-2015 05:11 PM
While I won't provide exact numbers here, I would still like to draw your attention to a few major considerations for HDFS-backed indexes. The goals here are:
Created 11-26-2015 06:36 PM
All of @Andrew Grande's points are valid. You should also consider the performance impact when you store indexes in HDFS, because Solr pulls index blocks from HDFS and caches them in memory. So you will have to plan your hardware capacity carefully.
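For reference, the in-memory caching described above is Solr's HDFS block cache, configured on the HdfsDirectoryFactory in solrconfig.xml. A minimal sketch with a placeholder NameNode address; each cache slab pins 128 MB, off-heap by default, so the slab count has to fit your -XX:MaxDirectMemorySize budget:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <!-- placeholder NameNode address and index path -->
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <!-- each slab is 128 MB, allocated off-heap when direct allocation is on -->
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    </directoryFactory>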
Created 11-27-2015 08:07 AM
thanks @Andrew Grande, very good points, I totally forgot about these. I mean, for Ranger audit logs these points don't matter that much (the logs are also saved to HDFS, Solr log retention is < 30 days, etc.), but for other projects they do! I really like the point about scalability and reliability: I don't have to plan the storage for Solr separately or reserve space on my nodes, I can scale with HDFS 🙂
Created 11-26-2015 06:33 PM
For your question specific to storing Ranger audits: if you envision that a lot of audit logs will be generated, then you should create multiple shards with a sufficient replication factor for high availability and performance.
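If it helps, this is roughly what creating such a collection looks like with SolrJ's setter-style Collections API from the 5.x timeframe (the same thing can be done with an HTTP Collections API CREATE call); the ZooKeeper hosts, collection name, config name, and shard/replica counts are illustrative only, not values mandated by Ranger:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateAuditCollection {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
                CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
                create.setCollectionName("ranger_audits");
                create.setConfigName("ranger_audits"); // configset already uploaded to ZooKeeper
                create.setNumShards(2);                // spread the index across nodes
                create.setReplicationFactor(2);        // tolerate the loss of one replica
                create.process(solr);
            }
        }
    }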
Another recommendation is to store Ranger audits in both HDFS and Solr. The HDFS copy serves archival and compliance purposes. On the Solr side, you can set up a maximum retention period to delete the audit logs after a certain number of days.
Created 11-27-2015 07:51 AM
@bdurai thanks. I have already set up Solr and HDFS Ranger audit logs. The Solr logs will automatically be deleted after 30 days (document expiration). Currently I am using a replication factor of 2 as well as 2 shards, but I might be able to increase this even more.
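For anyone finding this thread later: the 30-day document expiration mentioned above is what Solr's DocExpirationUpdateProcessorFactory provides. A minimal solrconfig.xml sketch follows; the chain name, field names, and sweep period are illustrative, and Ranger's shipped configuration wires this up with its own TTL settings:

    <updateRequestProcessorChain name="add-expiration" default="true">
      <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
        <!-- sweep for and delete expired documents once per day -->
        <int name="autoDeletePeriodSeconds">86400</int>
        <!-- documents carry a date-math TTL such as "+30DAYS" in this field -->
        <str name="ttlFieldName">_ttl_</str>
        <!-- the computed absolute expiration timestamp is stored here -->
        <str name="expirationFieldName">_expire_at_</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>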
Created 02-02-2016 02:52 PM
@Jonas Straub has this been resolved? Can you post your solution or accept best answer :)?