HDFS Replication for SOLR 4.10.3

@Michael Young

One last question. We have a Prod SOLR cluster using HDFS as its file system. Assume the following two scenarios:

1. SOLR is also running on DR. When we replicate data to DR using a Snapshot/DistCp combo, how does DR SOLR know which data belongs to which index? I am guessing it doesn't. So in that case, how do we manage that?

2. SOLR is not running on DR. We replicate the data to DR. Some issue occurs in production and now we need to restore data back to Prod. Can we restore only some indexes? If yes, how is that possible, since DR doesn't have any SOLR and for DR it's simply some HDFS data?

1 ACCEPTED SOLUTION

Re: HDFS Replication for SOLR 4.10.3

@mqureshi

@james.jones

I recommend you read up on SolrCloud. The reference guide provides a good overview of how it works, starting on page 419: http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf

A SolrCloud cluster uses ZooKeeper for cluster coordination. This means keeping track of which nodes are up, how many shards a collection has, which hosts are currently serving those shards, etc. ZooKeeper is also used to store configuration sets. These are the index and schema configuration files that are used for your indexes. When you create a collection using the Solr scripts, the configuration files for the collection are uploaded to ZooKeeper.

A collection is made up of 1 or more shard indexes and 0 or more replica indexes. When you use HDFS to store the indexes, it is much easier to add/remove SolrCloud nodes to your cluster. You don't have to copy the indexes, which are normally stored locally. The new SolrCloud node is configured to coordinate with ZooKeeper. Upon startup, the new SolrCloud node will be told by ZooKeeper which shards it is responsible for, and it will then use the respective indexes stored on HDFS. All of the index data itself is stored within the index directories on HDFS. These directories are self-contained.

Solr stores collections within index directories where each index has its own directory within the top level Solr index directory. This is true for local storage and HDFS. When you replicate your HDFS index directories to another HDFS cluster, all of the data is maintained within the respective index directories.

HDFS: /solr/collectionname_shard1_replica1/<index files>

HDFS: /solr/collectionname_shard2_replica1/<index files>
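To illustrate how the directory names alone tie the data to a collection, here is a minimal Python sketch that groups a listing of index directories by collection. It assumes the directory names follow the `collectionname_shardN_replicaN` pattern from the two example paths above; the function name `group_by_collection` and the sample names are hypothetical, and your actual layout under the Solr HDFS home may differ.

```python
import re
from collections import defaultdict

# Naming pattern assumed from the example paths: <collection>_shard<N>_replica<M>
CORE_DIR = re.compile(
    r"^(?P<collection>.+)_shard(?P<shard>\d+)_replica(?P<replica>\d+)$"
)


def group_by_collection(dir_names):
    """Map each collection name to a sorted list of (shard, replica, dir_name)."""
    grouped = defaultdict(list)
    for name in dir_names:
        match = CORE_DIR.match(name)
        if match is None:
            continue  # not an index directory under the assumed naming scheme
        grouped[match["collection"]].append(
            (int(match["shard"]), int(match["replica"]), name)
        )
    return {coll: sorted(entries) for coll, entries in grouped.items()}


if __name__ == "__main__":
    # Hypothetical directory names, e.g. taken from a listing of /solr on HDFS.
    listing = ["logs_shard2_replica1", "logs_shard1_replica1", "users_shard1_replica1"]
    for collection, cores in group_by_collection(listing).items():
        print(collection, [name for _, _, name in cores])
```

Nothing here talks to Solr or HDFS; it only shows that a plain directory listing on the DR side is enough to tell which directories belong to which collection.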

1. If Solr is running on the DR cluster, you need to ensure the index configuration (schemas, configuration sets, etc.) is kept up to date in the DR Solr ZooKeeper. If you create collections on your primary cluster, then you need to create the same collections on the DR cluster. This is primarily to ensure the collection metadata exists in both clusters. As long as these settings are in sync, copying the index directories from one HDFS cluster to the other is all you need to do to keep the DR cluster in sync with the production cluster. As I mentioned above, both clusters will be configured to store indexes in an HDFS location. As long as the index directories exist, the SolrCloud nodes will read the indexes from those HDFS directories. Solr creates those index directories based on the name of the collection/index. That is how it knows which data goes with which index.

2. Yes, you should be able to do this. If you need to "restore" a collection from backup, you have to copy each of that collection's shard index directories. If you created the collection with 5 shards, then you will have 5 index directories to restore from DR.
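The two answers above can be sketched roughly in Python. The first function builds a Collections API `CREATE` URL for recreating a collection on the DR side; the second builds one `hadoop distcp` command per shard directory for restoring a single collection. Host names, ports, the `/solr` root, the config set name, and both function names are assumptions made up for this illustration, and the `-update -delete` flags are simply my choice for making the target directory mirror the DR copy. Treat it as a sketch to adapt, not a tested procedure.

```python
from urllib.parse import urlencode


def create_collection_url(solr_base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE URL so DR gets the same collection metadata."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,  # config set already uploaded to ZooKeeper
    }
    return f"{solr_base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


def restore_commands(collection, num_shards, dr_root, prod_root, replica=1):
    """Return one distcp command (as an argument list) per shard index directory.

    Assumes directories are named <collection>_shard<N>_replica<M>, as in the
    example paths earlier in this answer.
    """
    commands = []
    for shard in range(1, num_shards + 1):
        core_dir = f"{collection}_shard{shard}_replica{replica}"
        commands.append([
            "hadoop", "distcp", "-update", "-delete",
            f"{dr_root.rstrip('/')}/{core_dir}",
            f"{prod_root.rstrip('/')}/{core_dir}",
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical hosts and names, for illustration only.
    print(create_collection_url("http://dr-solr:8983/solr", "logs", 5, 1, "logs_conf"))
    for cmd in restore_commands("logs", 5, "hdfs://dr-nn:8020/solr", "hdfs://prod-nn:8020/solr"):
        print(" ".join(cmd))
```

The commands are only printed here; you could pass each argument list to `subprocess.run` once you have checked the paths against your own clusters. Restoring only some collections then just means generating commands for those collection names and leaving the other directories alone.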

Using something like Cross Data Center Replication in SolrCloud 6 is the easiest way to get Solr DR in place. Failing that, the native Backup/Restore functionality in SolrCloud 5 is a viable alternative. Unfortunately, SolrCloud 4 has neither of these more user-friendly approaches. I highly recommend upgrading to at least Solr 5 to get a better handle on backups and disaster recovery.
