
Solr Sizing and Query response time expectations

Expert Contributor

Two questions on Solr:

  1. Sizing for Solr cores (collections): If I need to index 1 TB of data via Solr, do we have any knowledge of how large the Solr data footprint would be? At what point does it make sense to store the cores in HDFS vs. on disk local to the Solr server?
  2. Response latency expectations: Will Solr always return indexed fields quickly (< 1 s) regardless of data size? Or should we think of it more like HBase, where fast results depend on the memory cache strategy and overall data size?
1 ACCEPTED SOLUTION


Hey Wes - A few things to consider when sizing. Data volume is obviously one factor, but the characteristics of the data are even more important when thinking about ingest performance and index sizing. For instance, the amount of free-form text, the number of attributes, the number of rows, etc. all weigh in on the indexing process and the index size. There are also features in Solr, such as faceting, that can increase the index size. So definitely look at the shape of the data to get an idea of the index size, as well as the Solr features you may be using that can affect it (e.g., faceting). If you have a sample data set, you can try indexing it to see what the index size is and extrapolate from there. Also, however big your index ends up, make sure you have three times that on disk for commits and snapshots.
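To make that extrapolation concrete, here is a minimal sketch (assuming a standalone Solr at localhost:8983 and a test core named sizing_test; the core name, field name, and sample sizes are placeholders, while the update and core-admin endpoints are standard Solr APIs): index a representative sample, read the on-disk index size from the STATUS response, and scale up.

```python
import requests

SOLR = "http://localhost:8983/solr"   # assumed local Solr instance
CORE = "sizing_test"                  # hypothetical test core

# 1. Index a representative sample and hard-commit so segments
#    are flushed to disk before measuring.
sample_docs = [{"id": str(i), "body_txt": "free-form text ..."} for i in range(10000)]
requests.post(f"{SOLR}/{CORE}/update?commit=true", json=sample_docs).raise_for_status()

# 2. Read the on-disk index size from the core admin STATUS API.
status = requests.get(f"{SOLR}/admin/cores",
                      params={"action": "STATUS", "core": CORE, "wt": "json"}).json()
sample_index_bytes = status["status"][CORE]["index"]["sizeInBytes"]

# 3. Extrapolate linearly to the full corpus, then reserve 3x headroom
#    on disk for commits and snapshots, per the advice above.
sample_raw_bytes = 50 * 1024**2   # raw size of the sample; measure yours
full_raw_bytes = 1 * 1024**4      # the 1 TB target corpus
est_index_bytes = sample_index_bytes * full_raw_bytes / sample_raw_bytes
print(f"Estimated index: {est_index_bytes / 1024**3:.0f} GiB;"
      f" provision ~{3 * est_index_bytes / 1024**3:.0f} GiB of disk")
```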

The other item to look at (which is also the second part of your question) is the amount of concurrency / query requests. Solr is built to return data very quickly, but heavy concurrent request load against an under-replicated index can certainly create latency, and it has more impact on the heap than indexing does. Also, badly written queries are probably more often at fault for latency than Solr itself. Indexed fields will always be returned quickly, especially if you're using a filter query (fq=) as opposed to a general query (q=), though both are pretty fast. If you can figure out the number of requests in a 10-second window, that can help you work out the number of replicas you need to answer queries without latency.
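As a quick way to see the q= vs. fq= difference and to measure per-query latency, Solr reports its internal processing time as QTime in every response header. A sketch against the same hypothetical core (the field name is an assumption):

```python
import requests

SOLR = "http://localhost:8983/solr"
CORE = "sizing_test"   # hypothetical core from the sizing sketch above

def timed_query(params):
    """Run a /select query and return Solr's reported QTime (ms) and hit count."""
    resp = requests.get(f"{SOLR}/{CORE}/select", params={**params, "wt": "json"}).json()
    return resp["responseHeader"]["QTime"], resp["response"]["numFound"]

# General query: a scored full-text match.
print(timed_query({"q": "body_txt:solr"}))

# Filter query: unscored and cached in the filterCache, so repeats are cheap.
print(timed_query({"q": "*:*", "fq": "body_txt:solr"}))
```

Multiplying the requests you observe in a 10-second window by a per-query cost measured like this is a rough way to decide how many replicas you need.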

As far as caching goes, OS caching (fitting the index in memory) will do more for you than tuning the Java heap. In your case, since the index will probably be rather large, you'll want to use SolrCloud with shards and replicas to spread the index out across machines and try to keep it in memory.
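In SolrCloud, that spreading happens when the collection is created via the Collections API. A sketch, assuming a pre-Solr-9 cluster (where maxShardsPerNode is still accepted); the collection name and counts are placeholders to size against your own cluster:

```python
import requests

SOLR = "http://localhost:8983/solr"   # assumed SolrCloud node

# Partition the index into shards and add replicas so each machine's
# share of the index can stay in the OS page cache.
requests.get(f"{SOLR}/admin/collections", params={
    "action": "CREATE",
    "name": "events",          # hypothetical collection name
    "numShards": 6,            # partitions of the index across nodes
    "replicationFactor": 2,    # copies of each shard, for query concurrency
    "maxShardsPerNode": 2,
}).raise_for_status()
```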

As far as HDFS vs. local disk: there's a good post here on why to use one over the other. Also, HDFS and SolrCloud both provide data replication, and you don't want both doing it. So if you're using SolrCloud, you definitely want to make sure the indexes in HDFS have a replication factor of 1.
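One hedged sketch of pinning that replication factor down, using the WebHDFS REST API (the NameNode host/port and the /solr home path are assumptions; match them to your cluster and your solr.hdfs.home setting). SETREPLICATION applies per file, so this walks the index directories; note that files Solr writes later still pick up the HDFS default unless that is lowered too:

```python
import requests

NAMENODE = "http://namenode.example.com:9870"   # 50070 on older Hadoop releases

def set_rep(path, factor=1):
    """Recursively set the HDFS replication factor on every file under path."""
    listing = requests.get(f"{NAMENODE}/webhdfs/v1{path}",
                           params={"op": "LISTSTATUS"}).json()
    for st in listing["FileStatuses"]["FileStatus"]:
        child = f"{path}/{st['pathSuffix']}"
        if st["type"] == "DIRECTORY":
            set_rep(child, factor)   # recurse into collection/shard dirs
        else:
            requests.put(f"{NAMENODE}/webhdfs/v1{child}",
                         params={"op": "SETREPLICATION", "replication": str(factor)})

set_rep("/solr")   # assumed Solr home directory in HDFS
```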

HTH


2 REPLIES 2


Expert Contributor

Super helpful - thanks!