
Solrcloud Replica Names

SOLVED

New Contributor

I'm running CDH6.1 with HDFS underneath Solrcloud with a number of collections.  I'd like to create indexes on one cluster and move them using hadoop distcp to another cluster once the data ingest is complete for the collection.

 

An issue I've run into is that when creating collections via either 'solrctl collection --create' or via the API 'admin/collections?action=CREATE', the replicas aren't always named predictably. For example, on a 6-shard collection with a replicationFactor of 1, I've seen anything from....

   core_node3, core_node5, core_node7, core_node9, core_node11, core_node12

...to....

   core_node362, core_node363, core_node364, core_node365, core_node366, core_node367

 

Since these names end up being used in the HDFS-based dataDir/ulogDir values for each replica, I have to do a bunch of HDFS renaming to get things to line up with the replica names of the matching collection on the target cluster.
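
For reference, the kind of create/inspect sequence I mean looks roughly like this (host, config, and collection names are placeholders, and the HDFS path depends on solr.hdfs.home; the core_nodeN values come back different every time):

   # Create a 6-shard collection with replicationFactor=1 (placeholder names/host)
   curl "http://solrhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=6&replicationFactor=1&collection.configName=myconfig"

   # See which core_nodeN names Solr assigned to the replicas
   curl "http://solrhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection"

   # With HDFS-backed indexes, the same names show up under the collection's directories
   hdfs dfs -ls /solr/mycollection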

 

I've recently started using createNodeSet=EMPTY in the collections CREATE API, and then calling the collections ADDREPLICA API to create my own replicas with predictable dataDir and ulogDir values. That mostly solves it. But the replica names are still unpredictable values as shown in CLUSTERSTATUS, and now they're no longer related to the HDFS dataDir/ulogDir.
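
Concretely, the workaround looks something like this (collection name, node value, and HDFS paths are placeholders; the exact dataDir/ulogDir form depends on how solr.hdfs.home is set up):

   # Create the collection shell with no replicas
   curl "http://solrhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=6&createNodeSet=EMPTY&collection.configName=myconfig"

   # Add one replica per shard with predictable dataDir/ulogDir values (repeat for shard2..shard6)
   curl "http://solrhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=solrhost:8983_solr&dataDir=hdfs://nameservice1/solr/mycollection/shard1/data&ulogDir=hdfs://nameservice1/solr/mycollection/shard1/ulog"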

 

Is there some parameter I'm missing during ADDREPLICA that allows me to assign the replica name?

3 REPLIES

Re: Solrcloud Replica Names

Super Collaborator
You are correct that there isn't a predictable or guaranteed order for the core_nodeN names. The recommendation would be to use the Solr backup and restore functionality (which uses distcp to transfer the index files and metadata) between your source cluster and your target cluster:

https://www.cloudera.com/documentation/enterprise/latest/topics/search_backup_restore.html
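
For example, the backup side starts with a named, point-in-time snapshot on the source (ingest) cluster, roughly like this (snapshot and collection names are placeholders; confirm the exact syntax with solrctl collection --help on your release):

   # On the ingest cluster: take a named snapshot of the collection
   solrctl collection --create-snapshot mysnapshot -c mycollection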

-pd

Re: Solrcloud Replica Names

New Contributor

If I understand correctly, I have 2 choices with the backup portion of the suggested approach:

  1. local export of snapshot on the ingest cluster followed by a hadoop distcp to move the backup data to the search cluster
  2. remote export of the snapshot to the search cluster from the start.

So far, so good.
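
Concretely, I'm picturing the two options something like this (paths and hosts are placeholders, and the solrctl flags are my reading of the docs, so the exact syntax may differ):

   # Either way, first prepare the export (metadata + copy listing) on the ingest cluster
   solrctl collection --prepare-snapshot-export mysnapshot -c mycollection -d /backups/mycollection

   # Option 1: export locally, then push the backup to the search cluster myself with distcp
   solrctl collection --export-snapshot mysnapshot -s /backups/mycollection -d /backups/export
   hadoop distcp hdfs://ingest-nn:8020/backups/export hdfs://search-nn:8020/backups/export

   # Option 2: export straight to the search cluster (the export itself runs distcp)
   solrctl collection --export-snapshot mysnapshot -s /backups/mycollection -d hdfs://search-nn:8020/backups/export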

 

What I'm not understanding is how the named snapshot (made on ingest cluster) becomes known by the search cluster so that the restore, which needs the snapshot name as the -b option, can work.  Is it possible to restore on a different cluster that has no knowledge of the original snapshot?

 

Additionally, the backup/restore approach seems like it might require 2x the amount of writes on the search cluster compared to just distcp'ing the data from one cluster to the other and pointing new replicas at that data. 

What I'm assuming happens on the search cluster is:

  • writing all data during the backup portion (i.e. either local export + distcp, or remote export)
  • reading and rewriting all data during the restore portion

Is this an incorrect understanding of how the backup/restore would work?

Re: Solrcloud Replica Names (Accepted Solution)

Super Collaborator
The snapshots are part of the indexes, representing a point-in-time list of the segments in the index. When you perform the backup, the metadata (information about the cluster) and the specified snapshot indicate which set of index files is to be backed up/copied to the destination HDFS directory (as specified in the <backup> section of the source cluster's solr.xml).

This blog walks through the process:
https://blog.cloudera.com/blog/2017/05/how-to-backup-and-disaster-recovery-for-apache-solr-part-i/

When you run --prepare-snapshot-export, it creates a copy of the metadata and a copy listing of all the files that the distcp command will copy to the remote cluster. Then, when you execute the snapshot export, the distcp command copies those files to the remote cluster.

The -b on the restore command is just the name of the directory (represented by the snapshot name) that was created and copied by distcp.
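
So on the search cluster the restore ends up looking roughly like this (collection name, backup location, snapshot name, and request id are placeholders; the -l and -i flags are from the documentation, so double check with solrctl collection --help on your release):

   # On the search cluster: restore from the backup directory that distcp copied over
   # -b is the snapshot/backup name, -l is the backup location on this cluster's HDFS
   solrctl collection --restore mycollection -l /backups/export -b mysnapshot -i restore-request-1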

-pd