Support Questions

cjervis · ‎03-23-2022

Hi Cloudera gurus,

This is my CDP cluster: 3 master nodes + 3 worker nodes, with HA enabled, which I am currently testing.

Here is the issue: when I shut down Master 2, some queries randomly fail with this error:

# impala-shell -i haproxy-server.com -q "use dbschema; select * from table_foo limit 10;"
Starting Impala Shell without Kerberos authentication
Warning: live_progress only applies to interactive shell sessions, and is being skipped for now.
Opened TCP connection to haproxy-server.com:21000
Connected to haproxy-server.com:21000
Server version: impalad version 3.4.0-SNAPSHOT RELEASE (build 0cadcf7ac76ecec87d9786048db3672c37d41c6f)
Query: use dbschema
Query: select * from table_foo limit 10
Query submitted at: 2022-03-23 11:28:37 (Coordinator: http://worker1:25000)
ERROR: ImpalaRuntimeException: Unable to initialize the Kudu scan node
CAUSED BY: AnalysisException: Unable to open the Kudu table: dbschema.table_foo
CAUSED BY: NonRecoverableException: cannot complete before timeout: KuduRpc(method=GetTableSchema, tablet=Kudu Master, attempt=1, TimeoutTracker(timeout=180000, elapsed=180004), Trace Summary(0 ms): Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (master-192.168.1.10:7051, [ GetTableSchema, 1 ]))

Could not execute command: select * from table_foo limit 10
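To read the timeout trace above more easily, here is a minimal Python sketch that pulls the key fields out of the Kudu RPC error text. The function name `summarize_kudu_timeout` and the regular expressions are illustrative assumptions based only on the message shown in this post, not part of any Kudu or Impala API.

```python
import re

# Error text copied from the impala-shell output above.
ERROR_TEXT = (
    "NonRecoverableException: cannot complete before timeout: "
    "KuduRpc(method=GetTableSchema, tablet=Kudu Master, attempt=1, "
    "TimeoutTracker(timeout=180000, elapsed=180004), Trace Summary(0 ms): "
    "Sent(1), Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), "
    "Truncated: false\n"
    "Sent: (master-192.168.1.10:7051, [ GetTableSchema, 1 ]))"
)


def _first_group(pattern, text, group=1):
    """Return the requested capture group, or None if the pattern is absent."""
    match = re.search(pattern, text)
    return match.group(group) if match else None


def summarize_kudu_timeout(text):
    """Extract method, timing, RPC counters and target master from the message."""
    timeout = _first_group(r"timeout=(\d+), elapsed=(\d+)", text, 1)
    elapsed = _first_group(r"timeout=(\d+), elapsed=(\d+)", text, 2)
    sent = _first_group(r"Sent\((\d+)\)", text)
    received = _first_group(r"Received\((\d+)\)", text)
    return {
        "method": _first_group(r"method=(\w+)", text),
        "timeout_ms": int(timeout) if timeout else None,
        "elapsed_ms": int(elapsed) if elapsed else None,
        "sent": int(sent) if sent else None,
        "received": int(received) if received else None,
        # Host and port of the master the RPC was sent to.
        "target_host": _first_group(r"Sent: \(master-([\w.\-]+):(\d+)", text, 1),
        "target_port": _first_group(r"Sent: \(master-([\w.\-]+):(\d+)", text, 2),
    }


if __name__ == "__main__":
    for key, value in summarize_kudu_timeout(ERROR_TEXT).items():
        print(f"{key}: {value}")
```

For this message the summary shows one RPC sent, none received, and the full 180000 ms timeout spent on a single attempt against 192.168.1.10:7051. Whether that address belongs to the master that was shut down is not stated in the post, so it is worth checking.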

All leaders are correctly re-balanced to other nodes, and failover is at least partly working, because most queries succeed.

Does anyone have a clue? I was thinking about the Hive server, but I'm not sure how to trace it.

Note: CM runs on Master2, so it is unavailable during the test. This does not seem to be the cause: other tests were run with CM out of service and queries worked fine.

Note 2: could it matter that a Kudu master was running on Master2?

Many thanks in advance for your help.

Best Regards

shehbazk · ‎03-24-2022

Hello @Juanes,

Could you please check the ksck report from Kudu? If you have any unhealthy tables, please verify the replicas as well.

Please refer to doc [1].

[1] https://kudu.apache.org/docs/administration.html#tablet_majority_down_recovery
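For reference, ksck is run through the `kudu cluster ksck` command with a comma-separated list of master addresses. The Python sketch below only assembles that command line. The helper name `ksck_command` is hypothetical, the host names are taken from the ksck output in this thread, and the port 7051 is the one shown in the error messages.

```python
import shlex


def ksck_command(masters, tables=None):
    """Build the argv list for `kudu cluster ksck` (the command is not executed here)."""
    cmd = ["kudu", "cluster", "ksck", ",".join(masters)]
    if tables:
        # Optional: restrict the check to specific tables.
        cmd.append("-tables=" + ",".join(tables))
    return cmd


if __name__ == "__main__":
    masters = [
        "master1.server.com:7051",
        "master2.server.com:7051",
        "master3.server.com:7051",
    ]
    # Print a copy-pasteable command line for the shell.
    print(shlex.join(ksck_command(masters)))
```

Listing all three masters, including the one that is down, lets ksck report each master's status separately.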

Thanks,

Juanes · ‎03-24-2022

Hello,

ksck shows that the tables are OK (Recovering, Under-replicated and Unavailable are all 0):

W0324 12:15:41.325619 18080 negotiation.cc:313] Failed RPC negotiation. Trace:

0324 12:15:40.627405 (+ 0us) reactor.cc:609] Submitting negotiation task for client connection to master2:7051

0324 12:15:40.627616 (+ 211us) negotiation.cc:98] Waiting for socket to connect

0324 12:15:41.325243 (+697627us) negotiation.cc:304] Negotiation complete: Network error: Client connection negotiation failed: client connection to master2:7051: connect: No route to host (error 113)

Metrics: {"client-negotiator.queue_time_us":187,"thread_start_us":157,"threads_started":1}

W0324 12:15:44.329090 18080 negotiation.cc:313] Failed RPC negotiation. Trace:

0324 12:15:41.325954 (+ 0us) reactor.cc:609] Submitting negotiation task for client connection to master2:7051

0324 12:15:41.326027 (+ 73us) negotiation.cc:98] Waiting for socket to connect

0324 12:15:44.329031 (+3003004us) negotiation.cc:304] Negotiation complete: Timed out: Client connection negotiation failed: client connection to master2:7051: Timeout exceeded waiting to connect

Metrics: {"client-negotiator.queue_time_us":38}

W0324 12:15:44.331089 18080 negotiation.cc:313] Failed RPC negotiation. Trace:

0324 12:15:44.329518 (+ 0us) reactor.cc:609] Submitting negotiation task for client connection to master2:7051

0324 12:15:44.329580 (+ 62us) negotiation.cc:98] Waiting for socket to connect

0324 12:15:44.331065 (+ 1485us) negotiation.cc:304] Negotiation complete: Network error: Client connection negotiation failed: client connection to master2:7051: connect: No route to host (error 113)

Metrics: {"client-negotiator.queue_time_us":36}

Master Summary

UUID | Address | Status

----------------------------------------+----------------------------+-------------

8a168b68a6dd487c952419672ba32088 | master3.server.com | HEALTHY

daa9129e78244be2aaa7e5e649cc1dc8 | master1.server.com | HEALTHY

<unknown> (master2.server.com) | master2.server.com | UNAVAILABLE

Error from master2.server.com: Network error: Client connection negotiation failed: client connection to master2:7051: connect: No route to host (error 113) (UNAVAILABLE)

All reported replicas are:

A = daa9129e78244be2aaa7e5e649cc1dc8

B = <unknown> (master2.server.com)

C = 8a168b68a6dd487c952419672ba32088

D = 1f02c618009c44d381c55841dcb5a498

The consensus matrix is:

Config source | Replicas | Current term | Config index | Committed?

---------------+------------------------+--------------+--------------+------------

A | A* C D | 88 | -1 | Yes

B | [config not available] | | |

C | A* C D | 88 | -1 | Yes

Flags of checked categories for Master:

Flag | Value | Master

---------------------+-------------------------------------------------------------+--------------------------------------------------------

builtin_ntp_servers | 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org | master1.server.com, master3.server.com

time_source | system | master1.server.com, master3.server.com

Tablet Server Summary

----------------------------------+---------------------------------+---------+----------+----------------+-----------------

Tablet Server Location Summary

Location | Count

----------+---------

/default | 3

Flags of checked categories for Tablet Server:

Flag | Value | Tablet Server

---------------------+-------------------------------------------------------------+-------------------------

builtin_ntp_servers | 0.pool.ntp.org,1.pool.ntp.org,2.pool.ntp.org,3.pool.ntp.org | all 3 server(s) checked

time_source | system | all 3 server(s) checked

Version Summary

Version | Servers

--------------------+--------------------------------------------------------------------------------------------------------------------------------------

1.13.0.7.1.6.0-297 | master@master3.server.com, master@master1.server.com, tserver@worker3.server.com:7050, and 2 other server(s)

Tablet Summary

Summary by table

-------------------------------------------------------------------------+----+---------+---------------+---------+------------+------------------+-------------

==================
Warnings:
==================
master unusual flags check error: 1 of 3 masters were not available to retrieve unusual flags
master diverged flags check error: 1 of 3 masters were not available to retrieve time_source category flags

==================
Errors:
==================
Network error: error fetching info from masters: failed to gather info from all masters: 1 of 3 had errors
Corruption: master consensus error: there are master consensus conflicts
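The "No route to host" lines above can be cross-checked from any node, for example the Impala coordinator, with a plain TCP probe of each master's RPC port. This is a minimal sketch: `can_connect` and `probe_masters` are hypothetical helper names, and the host names and port 7051 are simply the ones that appear in the ksck output.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures and "No route to host".
        return False


def probe_masters(addresses, timeout=3.0, connect=can_connect):
    """Map each 'host:port' address to True (reachable) or False (unreachable)."""
    results = {}
    for address in addresses:
        host, _, port = address.rpartition(":")
        results[address] = connect(host, int(port), timeout)
    return results


if __name__ == "__main__":
    masters = [
        "master1.server.com:7051",
        "master2.server.com:7051",
        "master3.server.com:7051",
    ]
    for address, ok in probe_masters(masters).items():
        print(f"{address}: {'reachable' if ok else 'UNREACHABLE'}")
```

With Master 2 shut down, the expectation is two reachable masters and one unreachable, matching the Master Summary table.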

What is not clear to me is why I'm getting a consensus error when 2 of 3 masters are up and all 3 tablet servers are up.
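On the quorum side of that question: Kudu masters use Raft consensus, which needs a strict majority of voters to elect a leader and commit changes. The arithmetic is sketched below with hypothetical helper names. The consensus matrix above shows masters A and C agreeing on the same config at term 88 with A as leader, so the ksck "consensus conflicts" error may simply reflect that master2's config could not be fetched rather than a loss of quorum. That reading is an assumption, not something confirmed in this thread.

```python
def majority(voters):
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def has_quorum(voters, reachable):
    """True if the reachable voters are enough to elect a leader."""
    return reachable >= majority(voters)


if __name__ == "__main__":
    # 3 masters configured, 2 still up after shutting down Master 2.
    print(majority(3))       # 2
    print(has_quorum(3, 2))  # True
    print(has_quorum(3, 1))  # False
```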

Many thanks for your help.

Juanes · ‎03-30-2022

Hello,

Does anyone know if there is a reference table for these errors?

I just want to know what this means: Unable to initialize the Kudu scan node

No relevant traces were found in the following logs:

Impala Daemon

Impala Catalog Server

Impala State Store

Kudu Master Leader

Kudu tablet

Hive Metastore

Hive Server2

I'm running out of options 😞

Juanes · ‎08-24-2022

Hello,

It seems the main error is related to Impala. Kudu is balancing and responding well during the tests. The issue is that Impala loses the connection to whichever component tells it where the new Kudu master leader is. I suspect the Cloudera Management services, which are already down.

I will post the solution as soon as I have it.
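One possibility that may be worth ruling out, although nothing in this thread confirms it: Impala takes the Kudu master list for a table from its `kudu.master_addresses` table property (visible with `DESCRIBE FORMATTED`), or from the impalad `--kudu_master_hosts` startup flag when the property is not set. If only some tables list a single master, only queries on those tables would fail while that master is down. The sketch below compares a property value against the expected masters. The helper names are hypothetical and the default port 7051 is assumed.

```python
DEFAULT_MASTER_PORT = 7051  # assumed Kudu master RPC port, as seen in the logs above


def normalize(address):
    """Lower-case an address and append the default port when none is given."""
    address = address.strip().lower()
    if ":" not in address:
        address = f"{address}:{DEFAULT_MASTER_PORT}"
    return address


def missing_masters(table_master_addresses, expected_masters):
    """Return the expected masters that the table property does not list."""
    listed = {
        normalize(part)
        for part in table_master_addresses.split(",")
        if part.strip()
    }
    return sorted(m for m in map(normalize, expected_masters) if m not in listed)


if __name__ == "__main__":
    expected = ["master1.server.com", "master2.server.com", "master3.server.com"]
    # Hypothetical property value for a table that lists only one master.
    print(missing_masters("master2.server.com:7051", expected))
```

An empty result means the table lists every master. A non-empty result names the masters the table would not fall back to.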