Understanding Solr Architecture and Best Practices
Created on 08-18-2018 06:08 PM - edited 08-17-2019 06:41 AM
Cluster: In Solr, a cluster is a set of Solr nodes operating in coordination with each other via ZooKeeper and managed as a unit. A cluster may contain many collections. See also SolrCloud.
Collection: In Solr, one or more documents grouped together in a single logical index using a single configuration and schema. A collection may be divided into multiple logical shards, which may in turn be distributed across many nodes; in a single-node Solr installation, a collection may be a single core.
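As a sketch, a sharded collection like the one described above can be created through Solr's Collections API (the hostname, collection name, and config set name below are placeholders):

```shell
# Create a collection split into 3 shards with 2 replicas per shard.
# "my_config" must already be uploaded to ZooKeeper as a config set.
curl "http://solr-host:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=2&collection.configName=my_config"
```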
Commit: To make document changes permanent in the index. Added documents become searchable only after a commit.
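For example, a hard commit can be issued through the update handler (hostname and collection name are placeholders):

```shell
# Force a commit so that recently added documents become searchable.
curl "http://solr-host:8983/solr/my_collection/update?commit=true"
```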
Core: An individual Solr instance (representing a logical index). Multiple cores can run on a single node. See also SolrCloud.
Key Takeaways
1. Solr uses a non-master/slave architecture; every Solr node is a master of its own. Solr nodes use ZooKeeper to learn about the state of the cluster.
2. A Solr node (JVM) can host multiple cores.
3. A core is where the Lucene (index) engine runs; every core has its own Lucene engine.
4. A collection is divided into shards.
5. A shard is represented as a core (a part of the JVM) on a Solr node (JVM).
6. Every Solr node keeps sending heartbeats to ZooKeeper to report its availability.
7. Using the local FS provides the most stable and best I/O for Solr.
8. A replication factor of 2 should be maintained in local mode to avoid any data loss.
9. Remember that every replica has a core attached to it and also consumes disk space.
10. If a collection is divided into 3 shards with a replication factor of 3, a total of 9 cores will be hosted across the Solr nodes, and the data saved on the local FS will be 3x.
11. A Solr node does not publish data to Ambari Metrics by default. A Solr metrics process (a separate process from the Solr node) needs to run on every host where a Solr node is hosted; it fetches metrics from the Solr node and pushes them to Ambari Metrics.
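The arithmetic behind takeaways 9 and 10 is simple enough to capture in a few lines. This is just an illustrative helper (the function name is ours, not part of Solr):

```python
def collection_footprint(num_shards: int, replication_factor: int):
    """Return (total cores across the cluster, data multiplier on local FS).

    Every replica of every shard is hosted as a core, and every replica
    stores a full copy of its shard's data.
    """
    total_cores = num_shards * replication_factor
    data_multiplier = replication_factor
    return total_cores, data_multiplier

# 3 shards x replication factor 3 -> 9 cores, 3x the raw index size on disk
print(collection_footprint(num_shards=3, replication_factor=3))  # (9, 3)
```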
Solr on HDFS
1. Solr nodes should be colocated with DataNodes for best performance.
2. Because the DataNodes are also used by Spark and HBase, this setup can easily result in an unstable SolrCloud.
3. Under heavy CPU consumption on the DataNodes, Solr nodes can fail to maintain their heartbeat connection to ZooKeeper, resulting in the Solr node being removed from the Solr cluster.
4. Watch the Solr logs to make sure short-circuit writes are being used.
5. At the collection level you are compelled to use a replication factor of 2; otherwise a restart of one node will make the collection unavailable.
6. A replication factor of 2 at the collection level combined with a replication factor of 3 at the HDFS level can significantly impact write performance.
7. Ensure the replication factor of the Solr HDFS directory is set to 1.
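The two HDFS-related settings above can be sketched as follows; the NameNode address and the /solr path are placeholders. Storing the index on HDFS is configured through Solr's HdfsDirectoryFactory in solrconfig.xml:

```xml
<!-- solrconfig.xml fragment: keep the index on HDFS instead of local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>
```

With collection-level replication at 2, the HDFS replication of the Solr index directory can be lowered to 1 so that the data is not stored 6 times over:

```shell
# Set HDFS replication factor to 1 for the Solr index directory (recursive)
hdfs dfs -setrep -R 1 /solr
```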