11-04-2015 03:05 AM - edited 11-04-2015 03:08 AM
Some context information :
We have in production a cluster using CDH 5.3.0.
This cluster is composed of 4 data nodes (each data node hosts a Solr server too).
Every "collection" created in SolrCloud has the following parameters: 2 shards / 2 replicas
Sometimes (after a Solr restart or a cluster restart) we observe that 1 replica of a shard is DOWN.
Of course, the other replica becomes the leader and the collection stays "UP". But that also means that we can't lose the last replica of that shard now.
We searched for a way to "restore" the DOWN replica but surprisingly enough we couldn't find any solution (nor real documentation about that issue).
I mean, there is nearly no documentation or discussion about "restoring" a down replica. Why? This should be a common requirement.
I can see that the latest "collections API" has some new actions like "ADDREPLICA" and "DELETEREPLICA".
I believe that these two new actions can fulfill our requirement of restoring a replica by creating a new replica and then deleting the one that is down.
The problem is that in Solr 4.4 (the one bundled with CDH 5.3.0) these actions do not exist yet. And I can't find any other means to "restore" or "recreate" a replica.
Maybe someone could enlighten us with a solution?
Today, our only solution is to recreate the whole collection and reindex (which appears to be an ungraceful workaround).
11-09-2015 06:42 AM
If a replica is in the DOWN state, it means it cannot move to the RECOVERING state. It has either failed somehow, hit a bug, is replaying transaction logs, or something else is going on. You usually have to look at the logs to determine why a replica won't come out of the DOWN state.
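To see which replicas are affected before digging through the logs, something like the following might help. It is only a minimal sketch: it assumes you have already fetched the cluster state JSON that SolrCloud keeps in ZooKeeper (clusterstate.json), and the collection, core and host names in the sample are made up for illustration. Note that the state recorded there can be stale if the node itself is gone, so live nodes would need to be checked separately.

```python
import json


def find_inactive_replicas(cluster_state):
    """Return (collection, shard, replica, node, state) for every replica
    whose recorded state is anything other than "active"."""
    inactive = []
    for coll_name, coll in cluster_state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    inactive.append(
                        (coll_name, shard_name, replica_name,
                         replica.get("node_name"), state)
                    )
    return inactive


# Hypothetical sample shaped like clusterstate.json (names are invented).
SAMPLE_STATE = json.loads("""
{
  "collection1": {
    "shards": {
      "shard1": {
        "state": "active",
        "replicas": {
          "core_node1": {"state": "active", "node_name": "host1:8983_solr",
                         "core": "collection1_shard1_replica1", "leader": "true"},
          "core_node2": {"state": "down", "node_name": "host2:8983_solr",
                         "core": "collection1_shard1_replica2"}
        }
      },
      "shard2": {
        "state": "active",
        "replicas": {
          "core_node3": {"state": "active", "node_name": "host3:8983_solr",
                         "core": "collection1_shard2_replica1", "leader": "true"},
          "core_node4": {"state": "active", "node_name": "host4:8983_solr",
                         "core": "collection1_shard2_replica2"}
        }
      }
    }
  }
}
""")

if __name__ == "__main__":
    for entry in find_inactive_replicas(SAMPLE_STATE):
        print(entry)
```

The replica name it reports (core_node2 in the sample) is the one to search for in the Solr log of the node that hosts it.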
11-09-2015 08:09 AM - edited 11-09-2015 08:15 AM
Thank you for your answer.
I can guess that in this particular case the replica is DOWN because of the incident we encountered (HDFS not available for a short period + restart of the whole cluster).
But the question is: how do we fix it after the problem has occurred?
Checking the Solr log files did not really help with "why the replica is down". We can only observe that this particular shard does not "log" that it became active after the restart (and thus stays down).
Of course, we might have missed something.
I have opened a support ticket to help us. They might find the problem from the logs.
Isn't there any "manual" way to restore it? Or replace it?
11-23-2015 09:24 PM
Can you get the trace from the logs of that particular Solr replica? If your tlogs are large, it could be replaying them, but then it would normally be in the RECOVERING state. Are your tlogs corrupt?
12-22-2017 08:54 AM
Did you find a solution to this issue? We are facing the same problem and are planning to restart the cluster. Will that solve the problem, or could the restart cause problems with the active shard?
We are afraid of restarting the cluster.
Please help me.
01-16-2018 01:33 AM - edited 01-16-2018 01:34 AM
It's been a while ! If I remember correctly, we did not find any solution back then (with CDH5.3.0) - at least other than recreating the collection and re-indexing the data.
But after upgrading CDH to a version whose Solr supports the "ADDREPLICA" and "DELETEREPLICA" actions in the Collections API, you can add another replica and then delete the one that is down.
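For anyone landing here later, a rough sketch of those two calls is below. It only builds the Collections API URLs (it does not send them), and the base URL, collection, shard, replica and node names are placeholders I made up, so substitute your own. The "node" parameter is optional; if I recall correctly, Solr picks a node itself when it is left out.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, "wt": "json", **params})
    return f"{base_url}/admin/collections?{query}"


def replace_replica_urls(base_url, collection, shard, down_replica,
                         target_node=None):
    """Return the two URLs to call, in order: ADDREPLICA, then DELETEREPLICA.

    down_replica is the replica name from the cluster state (e.g. core_node2).
    """
    add_params = {"collection": collection, "shard": shard}
    if target_node:
        add_params["node"] = target_node  # optional placement hint
    add_url = collections_api_url(base_url, "ADDREPLICA", **add_params)
    delete_url = collections_api_url(
        base_url, "DELETEREPLICA",
        collection=collection, shard=shard, replica=down_replica,
    )
    return [add_url, delete_url]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    for url in replace_replica_urls(
        "http://host1:8983/solr", "collection1", "shard1", "core_node2",
        target_node="host3:8983_solr",
    ):
        print(url)
```

I would wait until the new replica shows up as active in the cluster state before issuing the DELETEREPLICA call, so the shard never drops below one healthy copy.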