
SolrCloud Performance - HDFS index/data


According to the docs, Solr relies heavily on fast bulk reads and writes during index updates. Let's say I want to index thousands of documents (Word, PDF, HTML, ...) or I want to store my Ranger audit logs in my SolrCloud. Is it a good idea to use HDFS as the index and data store, or should I go with a local non-HDFS data directory?

The Ranger Audit Logs documentation mentions "1 TB free space in the volume where Solr will store the index data", which sounds like a non-HDFS setup?!


8 REPLIES

Master Mentor
@Jonas Straub

It was shared by someone from the field. Link

Hadoop provides a system for processing large amounts of data, so it can be a great way to actually build your index. You can store the data to be indexed on HDFS and then run a map/reduce job that processes this data and feeds it to your Solr instances, which then build up the index. With an index a terabyte in size, you will see great performance gains when you both process this data in parallel on your Hadoop cluster and then index it in parallel with your Solr cluster.
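For illustration only, here is a minimal Python sketch of the parallel-indexing idea (not the map/reduce tooling the article describes): several worker processes each push a batch of documents to the collection's /update handler, with a single commit at the end. The Solr URL, collection name, and field names are assumptions.

```python
import json
from multiprocessing import Pool

import requests

# Hypothetical collection; adjust to your SolrCloud setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update"

def index_batch(batch):
    """Send one batch of documents to Solr's JSON update handler."""
    resp = requests.post(
        SOLR_UPDATE_URL,
        data=json.dumps(batch),
        headers={"Content-Type": "application/json"},
        params={"commit": "false"},  # commit once at the end, not per batch
    )
    resp.raise_for_status()
    return len(batch)

if __name__ == "__main__":
    # Toy corpus standing in for the files you would actually read from HDFS.
    docs = [{"id": str(i), "text_txt": f"document body {i}"} for i in range(10_000)]
    batches = [docs[i:i + 1000] for i in range(0, len(docs), 1000)]

    with Pool(processes=4) as pool:  # index the batches in parallel
        total = sum(pool.map(index_batch, batches))

    requests.get(SOLR_UPDATE_URL, params={"commit": "true"})  # single final commit
    print(f"indexed {total} documents")
```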


Thanks @Neeraj Sabharwal, great article. This helped a lot with my decision.


While I won't provide exact numbers here, I would still like to draw your attention to a few major considerations for HDFS-backed indexes. The goals here are:

  • Scalability of storage. Local disk performance doesn't matter when your disk is full.
  • Portability of nodes. As long as a node can access the index, it doesn't matter where it actually runs (see the sketch below).
  • Reliability of storage. HDFS replication, auto-healing, etc.
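As a concrete illustration of the portability point: because the index lives in HDFS, a replica of a shard can be brought up on any node that can reach it, using the standard Collections API. A hedged sketch in Python; the collection, shard, and node names are made up.

```python
# Sketch: add a replica of an HDFS-backed shard on another node via the
# Collections API (ADDREPLICA). Names below are hypothetical placeholders.
import requests

SOLR = "http://localhost:8983/solr"

resp = requests.get(
    f"{SOLR}/admin/collections",
    params={
        "action": "ADDREPLICA",
        "collection": "ranger_audits",
        "shard": "shard1",
        "node": "solr-node-03:8983_solr",  # any node that can reach the HDFS index
        "wt": "json",
    },
)
resp.raise_for_status()
print(resp.json())
```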

Rising Star

All of @Andrew Grande's points are valid. You should also consider the performance impact when you store the index in HDFS, because Solr pulls index data from HDFS and keeps it in memory. So you will have to plan your hardware capacity carefully.
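For the memory planning, a rough back-of-the-envelope sketch: Solr's HdfsDirectoryFactory uses a block cache (off-heap by default) whose size is driven by the slab count, with each slab being 128 MB, so that memory has to be budgeted per node on top of the JVM heap. The numbers below are placeholder assumptions, not recommendations.

```python
# Rough capacity estimate for the HDFS block cache per Solr node.
# Assumption: default slab size of 128 MB; solr.hdfs.blockcache.slab.count
# controls how many slabs are allocated. Values are placeholders.

SLAB_SIZE_MB = 128          # size of one block-cache slab
slab_count = 16             # e.g. solr.hdfs.blockcache.slab.count=16
jvm_heap_gb = 8             # heap planned for the Solr JVM itself

blockcache_gb = slab_count * SLAB_SIZE_MB / 1024
print(f"block cache: {blockcache_gb:.1f} GB")
print(f"memory to budget per node: ~{blockcache_gb + jvm_heap_gb:.1f} GB plus OS page cache")
```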


Thanks @Andrew Grande, very good points, I totally forgot about these. I mean, for Ranger audit logs these points don't matter that much (the logs are also saved to HDFS, Solr log retention is < 30 days, etc.), but for other projects they do! I really like the points about scalability and reliability: I don't have to plan the storage for Solr separately or reserve space on my nodes, I can scale with HDFS 🙂

Rising Star

For your question specific to storing Ranger audits: if you envision that a lot of audit logs will be generated, then you should create multiple shards with a sufficient replication factor for high availability and performance.
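A hedged sketch of what creating such a collection by hand might look like with the Collections API; the collection name, configset name, and counts are assumptions (the Ranger/Ambari setup normally creates the real audit collection for you).

```python
# Sketch: create an audit collection with 2 shards and replicationFactor 2.
# Names and counts below are placeholder assumptions.
import requests

SOLR = "http://localhost:8983/solr"

resp = requests.get(
    f"{SOLR}/admin/collections",
    params={
        "action": "CREATE",
        "name": "ranger_audits",
        "numShards": 2,
        "replicationFactor": 2,
        "collection.configName": "ranger_audits_config",
        "wt": "json",
    },
)
resp.raise_for_status()
print(resp.json())
```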

Another recommendation is to store Ranger audits in both HDFS and Solr. The HDFS copy is for archival and compliance reasons. On the Solr side, you can set a maximum retention so that the audit logs are deleted after a certain number of days.
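If you are not relying on Solr's built-in document-expiration feature (which is configured in solrconfig.xml), one simple way to enforce a retention window is a periodic delete-by-query on the audit timestamp. A hedged sketch; the field name evtTime is assumed to be the event-time field in the audit schema, so adjust it if yours differs.

```python
# Sketch: delete audit documents older than 30 days via delete-by-query.
# Assumes the audit timestamp field is named evtTime (adjust if different).
import json
import requests

UPDATE_URL = "http://localhost:8983/solr/ranger_audits/update"

payload = {"delete": {"query": "evtTime:[* TO NOW-30DAYS]"}}
resp = requests.post(
    UPDATE_URL,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    params={"commit": "true"},
)
resp.raise_for_status()
print(resp.json())
```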


@bdurai thanks. I have already set up both Solr and HDFS Ranger audit logs. The Solr logs will automatically be deleted after 30 days (document expiration). Currently I am using a replication factor of 2 as well as 2 shards, but I might be able to increase this even more.

Master Mentor

@Jonas Straub has this been resolved? Can you post your solution or accept the best answer? :)