
Kudu read fails: Tablet is lagging too much to be able to serve snapshot scan

Explorer

Hi,

 

We have a Kudu cluster with replication factor 3. All Kudu machines have time synchronization via NTP.

Our application performs a SELECT from a Kudu table via Impala JDBC driver. The Impala Daemon is configured with the command-line option: --kudu_read_mode=READ_AT_SNAPSHOT.

This operation fails:

 

[Simba][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: Unable to open scanner: Timed out: unable to retry before timeout: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Tablet is lagging too much to be able to serve snapshot scan. Lagging by: 61148 ms, (max is 30000 ms):: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Tablet is lagging too much to be able to serve snapshot scan. Lagging by: 30821 ms, (max is 30000 ms):
, Query: SELECT COUNT(*) as `expr_0` FROM `ci_pdo_PDO_Func_Tests`.`alarm_relationship_rules` WHERE (1 = 1).

What is the cause of this error? Is it due to misbehaving NTP or something else?

Which Kudu parameter should I tune? Is it --safe_time_max_lag_ms=30000 on the Kudu master, or something else?

I tried to find the necessary information in the Kudu Configuration Reference, but the parameters are quite poorly documented.

 

Thanks

1 ACCEPTED SOLUTION

Rising Star

The error message you posted indicates that there is some lag in replicating operations between tablet replicas (nothing NTP-related here). Here is a post that goes into a bit of detail about what these options are doing. Behind the scenes, you can imagine roughly the following sequence of events:

 

1. We have a tablet with a healthy quorum and the leader gets a write request.

2. The leader will assign the write a timestamp T1, and once this timestamp has been persisted to a majority of replicas, the write is accepted and the timestamp T1 gets sent back to clients.

3a. For the replicas that successfully replicate T1, the scannable range for READ_AT_SNAPSHOT scans will be bumped to include T1.

3b. Note that it's possible that at this point there is a replica that has not replicated/seen the write at T1; the only guarantee at this point is that a majority of replicas have seen it. This replica is "lagging".

4. A READ_AT_SNAPSHOT scan at T1 tries to ensure that it sees every row written at or before T1. If this scan goes to a lagging replica, it may wait up to `--safe_time_max_lag_ms` for that replica to receive the write for T1, which should eventually happen given Kudu's use of Raft replication, but it seems in this case it was too slow (sketched below).
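To make the read side concrete, here is roughly what this looks like at the raw Kudu Java client level (Impala drives the same mechanics through `--kudu_read_mode`). This is only a minimal sketch; the master address, table name, and column are hypothetical:

```java
import org.apache.kudu.client.*;

public class SnapshotScanSketch {
  public static void main(String[] args) throws KuduException {
    // Hypothetical master address and table/column names, for illustration only.
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master-1:7051").build();
    KuduTable table = client.openTable("alarm_relationship_rules");

    // Steps 1-3: the write goes to the tablet leader, gets timestamp T1, and is
    // acknowledged once a majority of replicas have persisted it.
    KuduSession session = client.newSession();
    Insert insert = table.newInsert();
    insert.getRow().addLong("id", 42L);
    session.apply(insert);
    session.flush();

    // Step 4: a READ_AT_SNAPSHOT scan is not served by a replica until that
    // replica's safe time has caught up to the snapshot timestamp. On a lagging
    // replica the wait is bounded by --safe_time_max_lag_ms; past that, the scan
    // fails with "Tablet is lagging too much to be able to serve snapshot scan".
    KuduScanner scanner = client.newScannerBuilder(table)
        .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
        .build();
    long rows = 0;
    while (scanner.hasMoreRows()) {
      RowResultIterator batch = scanner.nextRows();
      rows += batch.getNumRows();
    }
    System.out.println("rows = " + rows);
    client.shutdown();
  }
}
```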

 

Why was the replica lagging? One possibility is that there happened to be a lot of consensus traffic caused by a slow network, which led to slower replication. A network partition could also result in this. Without knowing more about the cluster and the load on it, it's hard to say. There might be information in the tablet server logs if you search for one of the tablet IDs that you noticed being slow. It's also worth running `ksck` to see if there is anything wrong with the state of the tablets.

 

Beyond addressing the underlying lag, if you are choosing READ_AT_SNAPSHOT for its read-your-writes properties, I don't think there is an easy path forward. If you don't need these guarantees, using the READ_LATEST read mode should get you around this. It's also worth noting that Kudu exposes an additional READ_YOUR_WRITES mode, but I don't think it's integrated with Impala at the moment.
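If you can live without those guarantees, the switch is a one-line difference on the scanner (or, through Impala, the `--kudu_read_mode` startup flag). A minimal sketch, reusing the hypothetical client and table from the snippet above:

```java
// READ_LATEST returns whatever the contacted replica currently has, so it never
// waits on safe time (and never hits --safe_time_max_lag_ms), at the cost of
// weaker consistency guarantees. Assumes the client/table from the sketch above.
KuduScanner latestScanner = client.newScannerBuilder(table)
    .readMode(AsyncKuduScanner.ReadMode.READ_LATEST)
    .build();
```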




Explorer

Thank you for your detailed reply.

 

Indeed, this lagging replica is located on a machine belonging to a different subnetwork. This might cause an additional network delay.

 

You mention the '--safe_time_max_lag_ms' parameter that controls the acceptable replica lag. I see in the Kudu Configuration Reference that this parameter is available both for kudu-master and kudu-tserver. Which of the two should I tune? The kudu-tserver one? What is the purpose of the other one, for kudu-master? The descriptions for both are identical and do not bring much clarity:

'The maximum amount of time we allow safe time to lag behind the requested timestamp before forcing the client to retry, in milliseconds.'

Is it valid to have different '--safe_time_max_lag_ms' values for different kudu-tservers, so that the distant tserver has a higher max lag value?

 

Regarding the READ_YOUR_WRITES mode, I have checked the Impala SQL Reference for CDH 6.1, and it does not offer this mode. In any case, we are stuck on CDH 5.14, so we cannot even use the 'SET KUDU_READ_MODE ...' Impala statement. With CDH 5.14, the only option is to configure the Impala Daemon with '--kudu_read_mode=READ_AT_SNAPSHOT'.

 

Could you also explain how this new READ_YOUR_WRITES mode works? The API documentation is not clear on this. Does it mean that the client automatically takes the read timestamp from the preceding write? If so, how does it differ from the READ_AT_SNAPSHOT mode when the client does not specify a read timestamp at all?


Rising Star

The flag is on both the tservers and the master because both of them need to maintain consensus, and thus track replica lag -- the masters maintain a consistent, replicated catalog in HA deployments; the tservers maintain consistent, replicated table partitions. In this case, probably only the flag on the tserver is important.

 

Ah, that's unfortunate, so raising that configuration is probably worth a shot, though note that it's an experimental flag and thus not well-tested. Also note that you'll need `--unlock_experimental_flags` to change it. I think it's probably fine to have different values on different tservers, but again, it's experimental and might lead to odd behavior (experimenting is certainly not discouraged, though, if you can accept that!).
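For example, the kudu-tserver flag file could end up with something like the following; the 60000 value is purely illustrative, not a recommendation:

```
# kudu-tserver flag file (illustrative values only)
--unlock_experimental_flags
--safe_time_max_lag_ms=60000
```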

 

I believe RYW is already guaranteed by Impala based on what it's doing (described above; also see IMPALA-3788 for the details). Kudu's RYW scan mode works similarly by internally passing around written timestamps and sending those timestamps with scan requests. Kudu can make an optimization, though: if Kudu knows that there exists a timestamp T2 > T1 that can be scanned without waiting (it internally tracks which operations have been applied and are visible locally), it will scan at T2 instead of T1, which shows a more up-to-date picture to scanners. There's an open ticket to support this mode in Impala (see IMPALA-7184), but I don't think anyone is working on it. I also don't think this mode would help in this case.
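At the Kudu client level (not through Impala, per the above), READ_YOUR_WRITES is just another value of the same read-mode enum; a minimal sketch, again reusing the hypothetical client and table from the first snippet:

```java
// READ_YOUR_WRITES: the client propagates the highest write timestamp it has
// observed, and the server guarantees the scan reflects at least that timestamp,
// while it is free to serve an even newer locally-applied snapshot (T2 > T1).
KuduScanner rywScanner = client.newScannerBuilder(table)
    .readMode(AsyncKuduScanner.ReadMode.READ_YOUR_WRITES)
    .build();
```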

Explorer

You have mentioned that NTP is not related to the problem. Let's consider this scenario:

1. The Impala Daemon is running with the READ_AT_SNAPSHOT setting enabled. It performs a read operation in Kudu and sets the read timestamp T1 immediately after the preceding write operation.

2. Kudu dispatches the read request to some replica R1. This replica R1 is running on a machine with poorly configured NTP, so the local time on this machine is 1 minute behind.

3. The replica R1 waits for the timeout specified by '--safe_time_max_lag_ms' (30 seconds). After the timeout, the machine's local time is still, at best, 30 seconds behind T1.

Does this lead to the problem under discussion: 'Tablet is lagging too much to be able to serve snapshot scan'?