Impala Metadata Sync Issue Disk I/O error datanode.fqdn:22000: Failed to open HDFS file

SohamR — Fri, 03 Mar 2023 06:20:57 GMT

Hi All,

We are facing an issue where, no matter what we try, Impala queries randomly throw a "Failed to open HDFS file" error. It started without any obvious trigger, and we are not sure what else to try.

Below are some of the things we have tried.

1. Enforce SYNC_DDL

2. We used to have 87 Impala daemons (each acting as both executor and coordinator). We set up dedicated coordinators for Impala (4 coordinators + 83 executors), load balanced with haproxy.

3. Tried adding INVALIDATE METADATA, and then removing it.

Below is the sequence of queries.

1. INSERT OVERWRITE into a table (approximately every hour).

2. Refresh

3. Compute stats.

4. Select.
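
For reference, the hourly sequence above can be sketched roughly as follows. This is only an illustration of the statement order, not our actual script: the table names `db.hourly_table` and `db.staging_table` are hypothetical placeholders, and `cursor` is assumed to be any DB-API style cursor with an `execute` method.

```python
# Sketch of the hourly statement sequence described in the post.
# The table names are hypothetical placeholders, not the real ones.

def build_hourly_statements(target="db.hourly_table", source="db.staging_table"):
    """Return the statements in the order the job runs them."""
    return [
        "SET SYNC_DDL=1",                                       # enforce SYNC_DDL
        f"INSERT OVERWRITE TABLE {target} SELECT * FROM {source}",  # step 1
        f"REFRESH {target}",                                    # step 2
        f"COMPUTE STATS {target}",                              # step 3
        f"SELECT COUNT(*) FROM {target}",                       # step 4
    ]


def run_statements(cursor, statements):
    """Execute each statement in order on an assumed DB-API style cursor."""
    for stmt in statements:
        cursor.execute(stmt)


if __name__ == "__main__":
    for stmt in build_hourly_statements():
        print(stmt + ";")
```

All four steps go through the same session here, which matches the observation that the SELECT on the inserting coordinator never fails.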

The SELECT never fails on the coordinator that ran the INSERT, but fails randomly on other coordinators, and it keeps failing there until a REFRESH is run. As soon as a REFRESH is run on the failing coordinator, the query succeeds.

This leads me to believe it is a metadata sync issue across coordinators. The problem is that multiple applications/dashboards are using Impala and we cannot ask them to do a refresh every time.
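
One way to confirm that the failure is per coordinator is to run the same SELECT against each coordinator directly, bypassing haproxy, and record which ones fail. The sketch below assumes a dict mapping a coordinator name to an already-open DB-API style cursor; how those cursors are created, and the helper names `probe_coordinators` and `stale_coordinators`, are assumptions for illustration only.

```python
# Diagnostic sketch: run one query on every coordinator and report the result.
# `cursors` maps a coordinator name to an assumed DB-API style cursor.

def probe_coordinators(cursors, query):
    """Return {coordinator: "ok" or "error: <message>"} for the given query."""
    results = {}
    for host, cursor in cursors.items():
        try:
            cursor.execute(query)
            cursor.fetchall()
            results[host] = "ok"
        except Exception as exc:  # broad on purpose: driver error types vary
            results[host] = f"error: {exc}"
    return results


def stale_coordinators(results, marker="Failed to open HDFS file"):
    """List the coordinators whose error text contains the marker string."""
    return sorted(host for host, outcome in results.items() if marker in outcome)
```

If only the coordinators that did not run the INSERT show up in `stale_coordinators`, that is consistent with stale file metadata on those coordinators rather than a problem on the datanodes.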

impalad version 3.2.0-cdh6.3.3

Any help is appreciated.

Regards

SohamR

Re: Impala Metadata Sync Issue Disk I/O error datanode.fqdn:22000: Failed to open HDFS file

SohamR — Sat, 04 Mar 2023 17:55:02 GMT

Hi All,

Just something I have noticed. Whenever we enable SYNC_DDL and try to run a REFRESH, sometimes the query does not even register or produce a query ID, and at the same time I see the error below in the coordinator logs:

I0304 18:53:17.278281 174197 thrift-util.cc:124] TAcceptQueueServer: Caught TException: SSL_read: Connection reset by peer

Does this point to any actual Network/SSL error? Any insights would be helpful.
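
As a side note on reading the line above: Impala daemon logs use the glog layout, where the first character is the severity (I, W, E or F), followed by month and day, the time, a thread ID and the source file and line. A small parsing sketch is below; the regular expression and the function name `parse_glog_line` are my own and only cover single-line entries like this one.

```python
import re

# Matches single-line glog entries: Lmmdd hh:mm:ss.uuuuuu thread file:line] message
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)$"
)

LEVELS = {"I": "INFO", "W": "WARNING", "E": "ERROR", "F": "FATAL"}


def parse_glog_line(line):
    """Split a glog line into its fields, or return None if it does not match."""
    match = GLOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["level"] = LEVELS[fields["level"]]  # expand the one-letter severity
    return fields


if __name__ == "__main__":
    sample = ("I0304 18:53:17.278281 174197 thrift-util.cc:124] "
              "TAcceptQueueServer: Caught TException: SSL_read: Connection reset by peer")
    print(parse_glog_line(sample))
```

Parsed this way, the entry is at INFO severity and comes from thrift-util.cc line 124, so the daemon itself is not logging it as a warning or error.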

Regards
SohamR

Re: Impala Metadata Sync Issue Disk I/O error datanode.fqdn:22000: Failed to open HDFS file

SohamR — Thu, 09 Mar 2023 06:12:54 GMT

Hi All,

Here is an example of an even worse scenario.

1. INSERT OVERWRITE (with SYNC_DDL) took approx. 114 minutes: 03/09/2023 3:00 AM - 03/09/2023 4:53 AM.

2. From 03/09/2023 4:53 AM to 03/09/2023 6:02 AM, all SELECTs failed, for over 1 hour.

In the script, there is a COMPUTE STATS just after the INSERT OVERWRITE. Even though the INSERT OVERWRITE completed by 03/09/2023 4:53 AM, the COMPUTE STATS did not start until 03/09/2023 6:02 AM.
Was it waiting for SYNC_DDL to complete before starting the next DDL query? If yes, then why did INSERT OVERWRITE complete before SYNC_DDL was complete?
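
To make the timeline concrete, the durations can be recomputed from the timestamps quoted above. This is plain arithmetic on the minute-rounded times in this post; the format string assumes the timestamps are MM/DD/YYYY with a 12-hour clock.

```python
from datetime import datetime

# Assumed timestamp layout: MM/DD/YYYY with a 12-hour clock, as written in the post.
FMT = "%m/%d/%Y %I:%M %p"


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps in the FMT layout."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# INSERT OVERWRITE window
insert_minutes = minutes_between("03/09/2023 3:00 AM", "03/09/2023 4:53 AM")
# Window in which all SELECTs failed, which is also the gap before COMPUTE STATS started
failing_minutes = minutes_between("03/09/2023 4:53 AM", "03/09/2023 6:02 AM")

if __name__ == "__main__":
    print(insert_minutes, failing_minutes)
```

From the rounded timestamps the INSERT OVERWRITE window is about 113 minutes, in line with the roughly 114 minutes reported, and the failing window is 69 minutes, which is exactly the gap before COMPUTE STATS started.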

Can anyone please help with any ideas?

Regards

SohamR