Impala Metadata Sync Issue: Disk I/O error datanode.fqdn:22000: Failed to open HDFS file
- Labels: Apache Impala
Created on 03-02-2023 10:18 PM - edited 03-02-2023 10:20 PM
Hi All,
We are facing an issue where, no matter what we try, Impala queries randomly fail with a "Failed to open HDFS file" error. This seemingly started out of nowhere, and we are not sure what else to try.
Below are some of the things we have tried.
1. Enforced SYNC_DDL.
2. We used to have 87 Impala daemons, each acting as both executor and coordinator. We set up dedicated coordinators (4 coordinators + 83 executors), load balanced with HAProxy.
3. Tried adding INVALIDATE METADATA, and then removing it.
Below is the sequence of queries.
1. INSERT OVERWRITE a table (approx. every hour).
2. REFRESH.
3. COMPUTE STATS.
4. SELECT.
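For reference, the hourly sequence above could be scripted roughly as follows. This is only a sketch under assumptions: the table name, the source query, and the helper names `build_hourly_job` and `run_job` are made up for illustration, the cursor is assumed to be a DB-API style cursor, and it assumes the SET statement is accepted over the client connection. The real job's SQL is not shown in this post.

```python
def build_hourly_job(table, source_query, sync_ddl=True):
    """Return the load statements in the order listed above.

    `table` and `source_query` are placeholders (assumptions), not the
    actual objects used by the job described in this thread.
    """
    statements = []
    if sync_ddl:
        # SYNC_DDL is an Impala query option; here it is set for the session.
        statements.append("SET SYNC_DDL=1")
    statements.append(f"INSERT OVERWRITE TABLE {table} {source_query}")
    statements.append(f"REFRESH {table}")
    statements.append(f"COMPUTE STATS {table}")
    return statements


def run_job(cursor, statements):
    """Execute each statement in order on one session (one coordinator)."""
    for sql in statements:
        cursor.execute(sql)
```

Step 4 (the SELECT) is left out on purpose, since in this setup it is issued by other applications and dashboards, usually through the load balancer.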
The SELECT never fails on the coordinator that ran the INSERT, but it fails randomly on other coordinators, and it keeps failing there until a REFRESH is run. As soon as a REFRESH is run on the failing coordinator, the query succeeds.
This leads me to believe it is a metadata sync issue across coordinators. The problem is that multiple applications and dashboards use Impala, and we cannot ask them to run a REFRESH every time.
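One way to check the per-coordinator behaviour described above is to bypass the load balancer and run the same SELECT against each coordinator directly. The sketch below is an assumption-laden illustration, not a tested tool: `probe_coordinators` is a hypothetical helper, and `connect` stands for any callable that takes a host name and returns a DB-API style connection from whatever client library is in use.

```python
def probe_coordinators(connect, hosts, sql):
    """Run `sql` once on every coordinator and report which ones fail.

    `connect` is a caller-supplied function (hypothetical here) that maps a
    host name to a DB-API style connection to that specific coordinator.
    """
    results = {}
    for host in hosts:
        try:
            conn = connect(host)
            try:
                cur = conn.cursor()
                cur.execute(sql)
                cur.fetchall()  # force the scan so file-open errors surface
                results[host] = "ok"
            finally:
                conn.close()
        except Exception as exc:  # broad on purpose: driver error classes vary
            results[host] = f"error: {exc}"
    return results
```

If only the coordinators that did not run the INSERT report "Failed to open HDFS file" right after a load, that would be consistent with stale metadata on those coordinators rather than a problem on the data nodes.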
impalad version 3.2.0-cdh6.3.3
Any help is appreciated.
Regards
SohamR
Created 03-04-2023 09:55 AM
Hi All,
Just something I have noticed: whenever we enable SYNC_DDL and run a REFRESH, the query sometimes does not even register or produce a query ID, and at the same time I see the error below in the coordinator logs:
I0304 18:53:17.278281 174197 thrift-util.cc:124] TAcceptQueueServer: Caught TException: SSL_read: Connection reset by peer
Does this point to an actual network/SSL error? Any insights would be helpful.
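To see whether these resets line up in time with the REFRESH statements that never register, the coordinator log can be filtered for them. The following is a minimal sketch that assumes the log lines use the standard glog prefix seen above (severity letter, month and day, time, thread ID, source file and line); the function names are illustrative only.

```python
import re

# Matches a glog-style line such as:
# I0304 18:53:17.278281 174197 thrift-util.cc:124] message text
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\]]+)\] (?P<message>.*)$"
)


def parse_glog_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = GLOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def ssl_reset_times(lines):
    """Return the timestamps of all 'SSL_read: Connection reset by peer' entries."""
    times = []
    for line in lines:
        record = parse_glog_line(line)
        if record and "SSL_read: Connection reset by peer" in record["message"]:
            times.append(f'{record["month"]}/{record["day"]} {record["time"]}')
    return times
```

Comparing these timestamps with the times at which the REFRESH was submitted would show whether the two events actually coincide; on its own, the log line does not say which peer reset the connection.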
Regards
SohamR
Created 03-08-2023 10:12 PM
Hi All,
Here is an example of an even worse scenario.
1. INSERT OVERWRITE (with SYNC_DDL) took approx. 114 minutes: 03/09/2023 3:00 AM to 03/09/2023 4:53 AM.
2. From 03/09/2023 4:53 AM to 03/09/2023 6:02 AM, all SELECTs failed, for over an hour.
3. In the script, there is a COMPUTE STATS just after the INSERT OVERWRITE. Even though the INSERT OVERWRITE completed by 03/09/2023 4:53 AM, the COMPUTE STATS did not start until 03/09/2023 6:02 AM.
4. Was it waiting for SYNC_DDL to complete before starting the next DDL query? If yes, then why did INSERT OVERWRITE complete before SYNC_DDL was complete?
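As a quick check of the timeline above, the two intervals can be computed from the quoted timestamps. This is plain date arithmetic on the minute-rounded times given in the post; the helper name `minutes_between` is just for illustration.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M %p"  # e.g. "03/09/2023 4:53 AM"


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps in the format above."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Duration of the INSERT OVERWRITE, from the rounded timestamps.
insert_minutes = minutes_between("03/09/2023 3:00 AM", "03/09/2023 4:53 AM")
# Window in which SELECTs failed and COMPUTE STATS had not yet started.
gap_minutes = minutes_between("03/09/2023 4:53 AM", "03/09/2023 6:02 AM")

print(insert_minutes, gap_minutes)  # 113 69
```

The rounded timestamps give 113 minutes for the INSERT OVERWRITE, in line with the roughly 114 minutes reported, and a 69-minute window before COMPUTE STATS started.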
Can anyone please help with any ideas?
Regards
SohamR