
Impala inc_stats_size_limit_bytes - what does it mean?


We are still using CDH 5.10 and Impala 2.7, and there is a startup option `inc_stats_size_limit_bytes` that is described in https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_perf_stats.html under "Maximum Serialized Stats Size" (which is identical to the 5.10 docs), but the documentation does not really explain what this setting means or how to predict when it will become a problem.

 

> The inc_stats_size_limit_bytes limit is set as a safety check, to prevent Impala from hitting the maximum limit for the table metadata. Note that this limit is only one part of the entire table's metadata all of which together must be below 2 GB.

 

With fairly big tables and a lot of live partitions we had to increase catalogd memory to cope with the amount of metadata. Now, on top of that, some individual large tables were not loaded during COMPUTE INCREMENTAL STATS because they hit inc_stats_size_limit_bytes, which is far smaller than the catalogd memory. But what does that mean? How can we calculate or predict this limit? We didn't find any relevant metrics in https://docs.cloudera.com/documentation/enterprise/5-10-x/topics/cm_metrics_impala_catalog_server.ht... nor can we calculate the limit / restriction imposed by this config from anything in https://docs.huihoo.com/cloudera/The-Impala-Cookbook.pdf (which works just fine for estimating expected catalogd memory).

 

In short:

- What does inc_stats_size_limit_bytes mean?

- Can we predict / calculate a suitable value for inc_stats_size_limit_bytes for our tables?

 

1 ACCEPTED SOLUTION


The estimated stats size is calculated as 400 bytes * # columns * # partitions. The option prevents you from computing incremental stats on tables with too many columns and partitions (it guards against the scenario where memory usage from incremental stats creeps up and up as tables get larger, eventually causing an outage).

 

So you probably want to set it based on the expected size of the largest table that you will be using incremental stats on (that would help prevent someone accidentally computing incremental stats on an even larger table).
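Based on that formula, a quick back-of-the-envelope check per table is easy to script. This is only a sketch: the 400 bytes per column per partition is the figure quoted above, while the 200 MB default used for `inc_stats_size_limit_bytes` is my assumption and should be verified against your Impala version.

```python
# Back-of-the-envelope estimate of the serialized incremental-stats size for a
# table, using the 400 bytes * columns * partitions figure quoted above.
# ASSUMPTION: 200 MB is used here as the default for inc_stats_size_limit_bytes;
# check the actual flag value on your cluster before relying on it.

BYTES_PER_COLUMN_PER_PARTITION = 400       # ~200 from CDH 5.16 onward (see below)
ASSUMED_DEFAULT_LIMIT = 200 * 1024 * 1024  # bytes

def estimated_inc_stats_bytes(num_columns, num_partitions,
                              bytes_per_cell=BYTES_PER_COLUMN_PER_PARTITION):
    """Estimated serialized incremental-stats size for one table."""
    return bytes_per_cell * num_columns * num_partitions

# Hypothetical wide, heavily partitioned table:
cols, parts = 300, 5000
est = estimated_inc_stats_bytes(cols, parts)
print(f"{cols} columns x {parts} partitions ~= {est / 2**20:.0f} MiB")
print("over the assumed limit" if est > ASSUMED_DEFAULT_LIMIT
      else "within the assumed limit")
```

If the estimate for the largest table you compute incremental stats on comes out near or above the limit, that is the table to size the flag for (or the one to switch to full compute stats).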


A few other comments.

Generally, non-incremental stats will be more robust, but we understand that it's sometimes challenging or less practical to do a full compute stats on all tables. So if the calculation above spits out a huge number, you might want to reconsider relying on incremental stats for that table.

 

You need to be careful with bumping *only* the catalogd heap size. On versions prior to CDH 5.16, all coordinator Impala daemons need a heap as large as the catalogd's, since the catalog cache is replicated to them. That was addressed for incremental stats specifically in CDH 5.16 by *not* replicating the incremental stats (all other state is still replicated).

 

In CDH 5.16 the memory consumption was also improved substantially (incremental stats use roughly 5x less memory), and the estimated stats size is reduced to 200 bytes * # columns * # partitions.
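For example, a hypothetical table with 300 columns and 5,000 partitions would be estimated at 400 * 300 * 5,000 ≈ 600 MB before CDH 5.16, and at 200 * 300 * 5,000 ≈ 300 MB from CDH 5.16 onward.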





Thank you for that explanation