
SolrCloud Performance - HDFS index/data


According to the docs, Solr relies heavily on fast bulk reads and writes during index updates. Let's say I want to index thousands of documents (Word, PDF, HTML, ...) or I want to store my Ranger audit logs in my SolrCloud. Is it a good idea to use HDFS as the index and data store, or should I go with a local non-HDFS data directory?

The Ranger Audit Logs documentation mentions "1 TB free space in the volume where Solr will store the index data", which sounds like a non-HDFS setup?!


8 REPLIES

Master Mentor
@Jonas Straub

It was shared by someone from the field. Link

Hadoop provides a system for processing large amounts of data, so it can be a great way to actually build your index. You can store the data to be indexed on HDFS and then run a map/reduce job that processes this data and feeds it to your Solr instances, which then build up the index. With an index a terabyte in size, you will see great performance gains when you both process this data in parallel on your Hadoop cluster and then index it in parallel with your Solr cluster.
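For illustration only, here is a minimal Python sketch of the parallel-indexing idea (not the map/reduce tooling the article describes): several worker processes each push a batch of documents to the collection's /update handler, with a single commit at the end. The Solr URL, collection name, and field names are assumptions.

```python
import json
from multiprocessing import Pool

import requests

# Hypothetical collection; adjust to your SolrCloud setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update"

def index_batch(batch):
    """Send one batch of documents to Solr's JSON update handler."""
    resp = requests.post(
        SOLR_UPDATE_URL,
        data=json.dumps(batch),
        headers={"Content-Type": "application/json"},
        params={"commit": "false"},  # commit once at the end, not per batch
    )
    resp.raise_for_status()
    return len(batch)

if __name__ == "__main__":
    # Toy corpus standing in for the files you would actually read from HDFS.
    docs = [{"id": str(i), "text_txt": f"document body {i}"} for i in range(10_000)]
    batches = [docs[i:i + 1000] for i in range(0, len(docs), 1000)]

    with Pool(processes=4) as pool:  # index the batches in parallel
        total = sum(pool.map(index_batch, batches))

    requests.get(SOLR_UPDATE_URL, params={"commit": "true"})  # single final commit
    print(f"indexed {total} documents")
```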


Thanks @Neeraj Sabharwal, great article. This helped a lot with my decision.


While I won't provide exact numbers here, I would still like to draw your attention to a few major considerations for HDFS-backed indexes. The goals here are:

  • Scalability of storage. Local disk performance doesn't matter when your disk is full.
  • Portability of nodes. As long as a node can access the index, it doesn't matter where it actually runs (see the sketch below).
  • Reliability of storage. HDFS replication, auto-healing, etc.
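As a concrete illustration of the portability point: because the index lives in HDFS, a replica of a shard can be brought up on any node that can reach it, using the standard Collections API. A hedged sketch in Python; the collection, shard, and node names are made up.

```python
# Sketch: add a replica of an HDFS-backed shard on another node via the
# Collections API (ADDREPLICA). Names below are hypothetical placeholders.
import requests

SOLR = "http://localhost:8983/solr"

resp = requests.get(
    f"{SOLR}/admin/collections",
    params={
        "action": "ADDREPLICA",
        "collection": "ranger_audits",
        "shard": "shard1",
        "node": "solr-node-03:8983_solr",  # any node that can reach the HDFS index
        "wt": "json",
    },
)
resp.raise_for_status()
print(resp.json())
```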

Rising Star

All of @Andrew Grande's points are valid. You should also consider the performance impact when you store the index in HDFS, because Solr pulls index data from HDFS and keeps it in memory. So you will have to plan your hardware capacity carefully.
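For the memory planning, a rough back-of-the-envelope sketch: Solr's HdfsDirectoryFactory uses a block cache (off-heap by default) whose size is driven by the slab count, with each slab being 128 MB, so that memory has to be budgeted per node on top of the JVM heap. The numbers below are placeholder assumptions, not recommendations.

```python
# Rough capacity estimate for the HDFS block cache per Solr node.
# Assumption: default slab size of 128 MB; solr.hdfs.blockcache.slab.count
# controls how many slabs are allocated. Values are placeholders.

SLAB_SIZE_MB = 128          # size of one block-cache slab
slab_count = 16             # e.g. solr.hdfs.blockcache.slab.count=16
jvm_heap_gb = 8             # heap planned for the Solr JVM itself

blockcache_gb = slab_count * SLAB_SIZE_MB / 1024
print(f"block cache: {blockcache_gb:.1f} GB")
print(f"memory to budget per node: ~{blockcache_gb + jvm_heap_gb:.1f} GB plus OS page cache")
```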


Thanks @Andrew Grande, very good points, I totally forgot about these. I mean, for Ranger audit logs these points don't matter that much (the logs are also saved to HDFS, Solr log retention is < 30 days, etc.), but for other projects they do! I really like the points about scalability and reliability: I don't have to plan the storage for Solr separately or reserve space on my nodes, I can scale with HDFS 🙂

Rising Star

For your question specific to storing Ranger audits: if you envision that a lot of audit logs will be generated, then you should create multiple shards with a sufficient replication factor for high availability and performance.
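A hedged sketch of what creating such a collection by hand might look like with the Collections API; the collection name, configset name, and counts are assumptions (the Ranger/Ambari setup normally creates the real audit collection for you).

```python
# Sketch: create an audit collection with 2 shards and replicationFactor 2.
# Names and counts below are placeholder assumptions.
import requests

SOLR = "http://localhost:8983/solr"

resp = requests.get(
    f"{SOLR}/admin/collections",
    params={
        "action": "CREATE",
        "name": "ranger_audits",
        "numShards": 2,
        "replicationFactor": 2,
        "collection.configName": "ranger_audits_config",
        "wt": "json",
    },
)
resp.raise_for_status()
print(resp.json())
```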

Another recommendation is to store Ranger audits in both HDFS and Solr. The HDFS copy is for archival and compliance reasons. On the Solr side, you can set a maximum retention so that the audit logs are deleted after a certain number of days.
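If you are not relying on Solr's built-in document-expiration feature (which is configured in solrconfig.xml), one simple way to enforce a retention window is a periodic delete-by-query on the audit timestamp. A hedged sketch; the field name evtTime is assumed to be the event-time field in the audit schema, so adjust it if yours differs.

```python
# Sketch: delete audit documents older than 30 days via delete-by-query.
# Assumes the audit timestamp field is named evtTime (adjust if different).
import json
import requests

UPDATE_URL = "http://localhost:8983/solr/ranger_audits/update"

payload = {"delete": {"query": "evtTime:[* TO NOW-30DAYS]"}}
resp = requests.post(
    UPDATE_URL,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    params={"commit": "true"},
)
resp.raise_for_status()
print(resp.json())
```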


@bdurai thanks. I have already set up both Solr and HDFS Ranger audit logs. The Solr logs will automatically be deleted after 30 days (document expiration). Currently I am using a replication factor of 2 as well as 2 shards, but I might be able to increase this even more.

Master Mentor

@Jonas Straub has this been resolved? Can you post your solution or accept the best answer? :)