Understanding Solr Architecture and Best Practices
Created on 08-18-2018 06:08 PM - edited 08-17-2019 06:41 AM
Cluster: In Solr, a cluster is a set of Solr nodes operating in coordination with each other via ZooKeeper and managed as a unit. A cluster may contain many collections. See also SolrCloud.
Collection: In Solr, one or more documents grouped together in a single logical index using a single configuration and schema. A collection may be divided into multiple logical shards, which may in turn be distributed across many nodes; in a single-node Solr installation, a collection may be a single core.
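As a sketch, a sharded collection like the one described above can be created through Solr's Collections API (the hostname, collection name, and config set name below are placeholders):

```shell
# Create a collection split into 3 shards with 2 replicas per shard.
# "my_config" must already be uploaded to ZooKeeper as a config set.
curl "http://solr-host:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=3&replicationFactor=2&collection.configName=my_config"
```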
Commit: To make document changes permanent in the index. Added documents become searchable only after a commit.
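For example, a hard commit can be issued through the update handler (hostname and collection name are placeholders):

```shell
# Force a commit so that recently added documents become searchable.
curl "http://solr-host:8983/solr/my_collection/update?commit=true"
```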
Core: An individual Solr instance (representing a logical index). Multiple cores can run on a single node. See also SolrCloud.
Key Takeaways
1. Solr uses a non-master/slave architecture; every Solr node is a master of its own. Solr nodes use ZooKeeper to learn about the state of the cluster.
2. A Solr node (JVM) can host multiple cores.
3. A core is where the Lucene (index) engine runs; every core has its own Lucene engine.
4. A collection is divided into shards.
5. A shard is represented as a core (a part of the JVM) on a Solr node (JVM).
6. Every Solr node keeps sending heartbeats to ZooKeeper to report its availability.
7. Using the local FS provides the most stable and best I/O for Solr.
8. A replication factor of 2 should be maintained in local mode to avoid any data loss.
9. Remember that every replica has a core attached to it and also consumes disk space.
10. If a collection is divided into 3 shards with a replication factor of 3, a total of 9 cores will be hosted across the Solr nodes, and the data saved on the local FS will be 3x.
11. A Solr node does not publish data to Ambari Metrics by default. A Solr metrics process (a separate process from the Solr node) needs to run on every host where a Solr node is hosted; it fetches metrics from the Solr node and pushes them to Ambari Metrics.
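The arithmetic behind takeaways 9 and 10 is simple enough to capture in a few lines. This is just an illustrative helper (the function name is ours, not part of Solr):

```python
def collection_footprint(num_shards: int, replication_factor: int):
    """Return (total cores across the cluster, data multiplier on local FS).

    Every replica of every shard is hosted as a core, and every replica
    stores a full copy of its shard's data.
    """
    total_cores = num_shards * replication_factor
    data_multiplier = replication_factor
    return total_cores, data_multiplier

# 3 shards x replication factor 3 -> 9 cores, 3x the raw index size on disk
print(collection_footprint(num_shards=3, replication_factor=3))  # (9, 3)
```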
Solr on HDFS
1. Solr nodes should be colocated with DataNodes for best performance.
2. Because the DataNodes are also used by Spark and HBase, this setup can easily result in an unstable SolrCloud.
3. Under heavy CPU consumption on the DataNodes, Solr nodes can fail to maintain their heartbeat connection to ZooKeeper, resulting in the Solr node being removed from the Solr cluster.
4. Watch the Solr logs to make sure short-circuit writes are being used.
5. At the collection level you are compelled to use a replication factor of 2; otherwise a restart of one node will make the collection unavailable.
6. A replication factor of 2 at the collection level combined with a replication factor of 3 at the HDFS level can significantly impact write performance.
7. Ensure the replication factor of the Solr HDFS directory is set to 1.
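The two HDFS-related settings above can be sketched as follows; the NameNode address and the /solr path are placeholders. Storing the index on HDFS is configured through Solr's HdfsDirectoryFactory in solrconfig.xml:

```xml
<!-- solrconfig.xml fragment: keep the index on HDFS instead of local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>
```

With collection-level replication at 2, the HDFS replication of the Solr index directory can be lowered to 1 so that the data is not stored 6 times over:

```shell
# Set HDFS replication factor to 1 for the Solr index directory (recursive)
hdfs dfs -setrep -R 1 /solr
```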