
SolrException: ClusterState says we are the leader, but locally we don't think so. Request came from null

Explorer

Dear team,

 

We are facing the below issue on one of the Solr nodes.

2021-10-17 04:05:57.006 ERROR (qtp1916575798-2477) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
 Cannot talk to ZooKeeper - Updates are disabled.
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.zkCheck(DistributedZkUpdateProcessor.java:1245)
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.setupRequest(DistributedZkUpdateProcessor.java:582)
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:239)
        at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:477)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
        at org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:118)

However, after some time the Solr server is able to reconnect to ZooKeeper:

 

2021-10-17 04:05:57.028 WARN  (Thread-2414) [   ] o.a.z.Login TGT renewal thread has been interrupted and will exit.
2021-10-17 04:05:57.043 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.c.ConnectionManager zkClient has connected
2021-10-17 04:05:57.043 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper
2021-10-17 04:05:57.043 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.ZkController ZooKeeper session re-connected ... refreshing core states after session expiration.
2021-10-17 04:05:57.047 WARN  (qtp1916575798-2461) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.h.s.a.u.KerberosName auth_to_local rule mechanism not set.Using default of hadoop
2021-10-17 04:05:57.072 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (3) -> (2)
2021-10-17 04:05:57.085 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.Overseer Overseer (id=72334547140450792-192.168.0.17:8985_solr-n_0000000153) closing
2021-10-17 04:05:57.085 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.Overseer Overseer (id=72334547140450792-192.168.0.17:8985_solr-n_0000000153) closing
2021-10-17 04:05:57.085 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.Overseer Overseer (id=72334547140450792-192.168.0.17:8985_solr-n_0000000153) closing
2021-10-17 04:05:57.087 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.Overseer Overseer (id=72334547140450792-192.168.0.17:8985_solr-n_0000000153) closing
2021-10-17 04:05:57.089 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.ZkController Publish node=192.168.0.17:8985_solr as DOWN
2021-10-17 04:05:57.093 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.ZkController Register node as live in ZooKeeper:/live_nodes/192.168.0.17:8985_solr
2021-10-17 04:05:57.097 INFO  (zkCallback-10-thread-28) [   ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/ranger_audits/state.json] for collection [ranger_audits] has occurred - updating... (live nodes size: [2])
2021-10-17 04:05:57.098 INFO  (coreZkRegister-1-thread-5) [   ] o.a.s.c.ZkController Registering core ranger_audits_shard1_replica_n1 afterExpiration? true
2021-10-17 04:05:57.099 INFO  (coreZkRegister-1-thread-6) [   ] o.a.s.s.ZkIndexSchemaReader Creating ZooKeeper watch for the managed schema at /configs/ranger_audits/managed-schema
2021-10-17 04:05:57.099 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.c.DefaultConnectionStrategy Reconnected to ZooKeeper
2021-10-17 04:05:57.099 INFO  (zkConnectionManagerCallback-11-thread-1-EventThread) [   ] o.a.s.c.c.ConnectionManager zkClient Connected:true
2021-10-17 04:05:57.102 INFO  (zkCallback-10-thread-24) [   ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (2) -> (3)
2021-10-17 04:05:57.102 INFO  (Thread-2418) [   ] o.a.s.c.SolrCore config update listener called for core ranger_audits_shard1_replica_n1
2021-10-17 04:05:57.103 INFO  (coreZkRegister-1-thread-6) [   ] o.a.s.s.ZkIndexSchemaReader Current schema version 0 is already the latest
2021-10-17 04:05:57.109 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.ShardLeaderElectionContextBase make sure parent is created /collections/ranger_audits/leaders/shard1
2021-10-17 04:05:57.114 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.ShardLeaderElectionContext Enough replicas found to continue.
2021-10-17 04:05:57.114 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.ShardLeaderElectionContext I may be the new leader - try and sync
2021-10-17 04:05:57.114 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.SyncStrategy Sync replicas to https://192.168.0.17:8985/solr/ranger_audits_shard1_replica_n1/
2021-10-17 04:05:57.114 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.SyncStrategy Sync Success - now sync replicas to me
2021-10-17 04:05:57.114 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.SyncStrategy https://192.168.0.17:8985/solr/ranger_audits_shard1_replica_n1/ has no replicas
2021-10-17 04:05:57.114 INFO  (coreZkRegister-1-thread-5) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.c.ShardLeaderElectionContextBase Creating leader registration node /collections/ranger_audits/leaders/shard1/leader after winning as /collections/ranger_audits/leader_elect/shard1/election/216449719079380911-core_node2-n_0000000061

But these reconnect cycles keep repeating, and after around 10 minutes we see the below error and the Solr server finally gives up.

 

2021-10-17 04:14:25.112 ERROR (qtp1916575798-2487) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.u.p.DistributedZkUpdateProcessor ClusterState says we are the leader, but locally we don't think so
2021-10-17 04:14:25.112 ERROR (qtp1916575798-2325) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.u.p.DistributedZkUpdateProcessor ClusterState says we are the leader, but locally we don't think so
2021-10-17 04:14:25.114 INFO  (qtp1916575798-2487) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.u.p.LogUpdateProcessorFactory [ranger_audits_shard1_replica_n1]  webapp=/solr path=/update params={wt=javabin&version=2}{} 0 36703
2021-10-17 04:14:25.114 WARN  (qtp1916575798-2492) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.h.s.a.u.KerberosName auth_to_local rule mechanism not set.Using default of hadoop
2021-10-17 04:14:25.116 INFO  (qtp1916575798-2325) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.u.p.LogUpdateProcessorFactory [ranger_audits_shard1_replica_n1]  webapp=/solr path=/update params={wt=javabin&version=2}{} 0 36707
2021-10-17 04:14:37.503 WARN  (Thread-2474) [   ] o.a.z.Login TGT renewal thread has been interrupted and will exit.
2021-10-17 04:14:37.504 ERROR (qtp1916575798-2325) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ClusterState says we are the leader (https://192.168.0.17:8985/solr/ranger_audits_shard1_replica_n1), but locally we don't think so. Request came from null
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDefensiveChecks(DistributedZkUpdateProcessor.java:1017)
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.setupRequest(DistributedZkUpdateProcessor.java:655)
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.setupRequest(DistributedZkUpdateProcessor.java:593)
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.setupRequest(DistributedZkUpdateProcessor.java:585)

Please help us resolve this issue.

 

Thanks

1 ACCEPTED SOLUTION


Super Collaborator

Hello @Sayed016 

 

Thanks for the response. In short, your team identified GC pauses during the concerned period, wherein the Solr service was losing connectivity with ZooKeeper. Your team increased zkClientTimeout to 30 seconds, and the leader is now elected for the collection. So the Solr service was impacted by GC pauses of the Solr JVM, and that has been addressed by your team.
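
As a toy illustration of why the fix works (the numbers below are made up, not from this thread): a ZooKeeper session expires when the client cannot heartbeat within the session timeout, e.g. during a stop-the-world GC pause, so raising zkClientTimeout widens the pause the session can survive:

```python
# Toy model (illustrative values only): a ZooKeeper session survives a JVM
# stop-the-world pause only if the pause ends before the session times out.
def session_survives_pause(gc_pause_ms: int, zk_client_timeout_ms: int) -> bool:
    """Return True if the client can heartbeat again before the ZK session expires."""
    return gc_pause_ms < zk_client_timeout_ms

# A hypothetical 20 s GC pause against a 15 s vs a 30 s zkClientTimeout:
print(session_survives_pause(20_000, 15_000))  # False: session expires
print(session_survives_pause(20_000, 30_000))  # True: session survives
```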

 

Thank you for sharing your experience on the Community; it will help fellow Community users as well. Let us know if we are good to close the post as solved.

 

Regards, Smarak


3 Replies

Super Collaborator

Hello @Sayed016 

 

Thanks for using the Cloudera Community. Based on the post, your team is experiencing exception [1] for the ranger_audits shard1 replica1, followed by a successful reconnect to ZooKeeper, eventually failing with [2]. Once that happens, what does your team observe, i.e. does the ranger_audits shard1 replica1 enter the "Down" state as opposed to the "Active" state?

 

Since the logs show ranger_audits shard1 has no replicas, it's feasible the issue arises from consistency concerns between Solr and ZooKeeper. There are a few things we wish to verify with your assistance:

  • When Solr reports [1], is the ZooKeeper quorum healthy, or are there any issues with the ZooKeeper server to which the Solr ZooKeeper client is connected?
  • What happens after [2] for ranger_audits shard1 replica1? (This is requested above as well.)
  • Does restarting the Solr service on the host where the "ranger_audits_shard1_replica_n1" core is hosted help mitigate the "ClusterState says we are the leader, but locally we don't think so" error?
  • Which HDP, CDH, or CDP version is being discussed in the post?
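
For the replica-state check, the Collections API CLUSTERSTATUS response can be inspected. A minimal sketch that extracts replica states, run here against a synthetic payload for illustration (a real check would fetch /solr/admin/collections?action=CLUSTERSTATUS&collection=ranger_audits and parse the same structure):

```python
import json

# Synthetic CLUSTERSTATUS-style payload for illustration only, mirroring the
# standard Solr JSON layout: cluster -> collections -> shards -> replicas.
sample = json.loads("""
{"cluster": {"collections": {"ranger_audits": {"shards": {"shard1": {
  "replicas": {"core_node2": {"core": "ranger_audits_shard1_replica_n1",
                              "state": "down", "leader": "true"}}}}}}}}
""")

def replica_states(payload: dict, collection: str) -> dict:
    """Map replica name -> state ('active', 'down', 'recovering', ...)."""
    shards = payload["cluster"]["collections"][collection]["shards"]
    return {name: replica["state"]
            for shard in shards.values()
            for name, replica in shard["replicas"].items()}

print(replica_states(sample, "ranger_audits"))  # {'core_node2': 'down'}
```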

Regards, Smarak

 

[1] 2021-10-17 04:05:57.006 ERROR (qtp1916575798-2477) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
Cannot talk to ZooKeeper - Updates are disabled.

 

[2] 2021-10-17 04:14:37.504 ERROR (qtp1916575798-2325) [c:ranger_audits s:shard1 r:core_node2 x:ranger_audits_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ClusterState says we are the leader (https://192.168.0.17:8985/solr/ranger_audits_shard1_replica_n1), but locally we don't think so. Request came from null

Explorer

Hello @smdas

 

Thank you for your detailed reply.

 

  1. We looked into the ZooKeeper logs and couldn't find any issue there.
  2. After [2], ranger_audits shard1 replica1 kept showing the same error a couple of times and then the Solr server stopped.
  3. We investigated further and found that there was a long GC pause during that time, due to which the application (the Solr server) lost its connection with ZooKeeper and started throwing the error. We have increased zkClientTimeout to 30 seconds and restarted the Solr service, and we can see that a leader is now elected for the collection.
  4. Version: CDP 7.1.6
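
For anyone landing here later: in a stock (upstream) Solr install, zkClientTimeout lives in solr.xml; on CDP the same value is exposed as a Solr configuration property in Cloudera Manager, so treat this fragment as an illustration rather than the CDP procedure:

```xml
<!-- solr.xml (upstream Solr layout): raise the ZooKeeper client timeout to
     30 s; the value is in milliseconds and can also be supplied via the
     zkClientTimeout system property. -->
<solr>
  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>
```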

 

Thanks

 
