Created 01-06-2016 04:04 PM
Two questions on Solr:
Created 01-06-2016 06:21 PM
Hey Wes - A few things to consider when sizing. Data volume is obviously the first factor, but the characteristics of the data are even more important when thinking about ingest performance and index sizing. For instance, the amount of free-form text, the number of attributes, the number of rows, etc. all weigh in on the indexing process and the index size. There are also Solr features, such as faceting, that can increase the index size. So definitely look at the shape of the data to get an idea of the index size, as well as the Solr features you may be using that can affect it (e.g. faceting). If you have a sample data set, you can try indexing it to see what the index size is and extrapolate from there. Finally, however big your index is, make sure you have three times that amount on disk for commits and snapshots.
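The "index a sample and extrapolate" suggestion plus the 3x disk headroom is simple arithmetic; here's a minimal sketch (all the numbers are hypothetical placeholders, not measurements from any real index):

```python
# Back-of-the-envelope sizing sketch. All numbers are hypothetical --
# index your own sample data set to get real figures.

def estimate_disk_needed(sample_docs, sample_index_bytes, total_docs,
                         headroom_factor=3):
    """Extrapolate full index size from a sample index, then apply the
    3x disk headroom for commits and snapshots."""
    bytes_per_doc = sample_index_bytes / sample_docs
    full_index_bytes = bytes_per_doc * total_docs
    return full_index_bytes * headroom_factor

# e.g. a 1M-doc sample produced a 2 GiB index; we expect 500M docs total
disk_bytes = estimate_disk_needed(
    sample_docs=1_000_000,
    sample_index_bytes=2 * 1024**3,
    total_docs=500_000_000,
)
print(f"Provision roughly {disk_bytes / 1024**4:.1f} TiB of disk")
```

The per-document cost is only a rough average; heavily faceted or text-rich documents can skew it, so re-check against a second, larger sample if you can.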
The other item to look at (which is also the second part of your question) is the amount of concurrency / query requests. Solr is built to return data very quickly, but heavy concurrency against an under-replicated index can certainly create latency, and it has more impact on the heap than indexing does. Also, badly written queries are probably more often at fault for latency than Solr itself. Indexed fields will always be returned quickly, especially if you're using a filter query (fq=) as opposed to a general query (q=), though both are pretty fast. If you can figure out the number of requests in a 10-second window, that can help you estimate the number of replicas you need to respond to queries without latency.
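To make the fq= vs q= distinction concrete, here's a sketch of the two request shapes at the HTTP level (the host and collection name are hypothetical, and this only builds the URLs rather than calling a live Solr):

```python
# Sketch of the HTTP-level difference between q= and fq=.
# Hypothetical host/collection; no live Solr is contacted.
from urllib.parse import urlencode

base = "http://localhost:8983/solr/mycollection/select"

# General query: scored, contributes to relevance ranking.
general = f"{base}?{urlencode({'q': 'body:hadoop'})}"

# Filter query: unscored and cached independently in Solr's filterCache,
# so repeating the same fq across many requests is cheap.
filtered = f"{base}?{urlencode({'q': '*:*', 'fq': 'body:hadoop'})}"

print(general)
print(filtered)
```

Under heavy concurrency, moving stable, repeated constraints from q= into fq= lets the filter cache absorb much of the load.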
As far as caching goes, OS caching (fitting the index in memory) will do more for you than working with the Java heap. In your case, since the index will probably be rather large, you'll want to use SolrCloud and utilize shards and replicas to spread the index across machines, so that each shard's slice of the index can stay in memory.
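The "spread the index so it stays in memory" advice reduces to a quick division; here's a rough sketch with hypothetical node sizes (remember the OS cache only gets the RAM left over after the JVM heap and the OS itself):

```python
# Rough sketch: how many shards so each shard's slice of the index fits
# in the OS page cache of one node. All sizes are hypothetical examples.
import math

def shards_to_fit_in_memory(index_bytes, node_ram_bytes, heap_bytes,
                            os_reserve_bytes=4 * 1024**3):
    """Number of shards needed so one shard's index fits in the memory
    left over for the OS page cache on each node."""
    cache_bytes = node_ram_bytes - heap_bytes - os_reserve_bytes
    return math.ceil(index_bytes / cache_bytes)

# e.g. a 400 GiB index on 64 GiB nodes running a 16 GiB Solr heap
n = shards_to_fit_in_memory(
    index_bytes=400 * 1024**3,
    node_ram_bytes=64 * 1024**3,
    heap_bytes=16 * 1024**3,
)
print(f"Use at least {n} shards")
```

This ignores replicas (each replica needs the same cache headroom on its own node) and assumes shards end up roughly equal in size, so treat the result as a floor, not a plan.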
As far as HDFS vs. local disk, there's a good post here on why to use one over the other. Also, HDFS and SolrCloud both provide data replication, and you don't want to pay for both at once. So if you're relying on SolrCloud replication, you definitely want to make sure the indexes in HDFS have a replication factor of 1.
HTH
Created 01-06-2016 08:59 PM
Super helpful - thanks!