Created 08-21-2019 08:17 AM
Hi,
I want to gather stats on a big partition table,
but want to do it only on some of the partitions and not on all the columns because it can take lots of data.
I don't see in the documentation of "compute incremental stats" option to do it,
How can I run stats only on some of the partitions and some/none of the columns?
Thanks
Created on 08-26-2019 02:04 AM - edited 08-26-2019 02:08 AM
How can I run stats only on some of the partitions and some/none of the columns?
@hores in order to take stats for a certain partition as you mentioned above you have to run the following command
COMPUTE INCREMENTAL STATS [db_name.]table_name [PARTITION (partition_spec)]
partition_spec ::= partition_col=constant_value
For further info you can read the documentation : impala_compute_stats.html
Created 08-27-2019 07:27 AM
Thanks @eMazarakis
but I mean stats on specific partitions AND specific columns (BOTH).
If I'll run as you suggest It will collect statistics on all the columns which we don't want,
So how can I collect stats on specific partitions AND specific columns?
Created on 08-28-2019 12:34 AM - edited 08-28-2019 12:35 AM
@hores As you can see the doc there is nothing about computing stats for specific table-columns. Stats are for the whole table.
How can you know which info for statistics impalad daemons might want to use during the join queries ?
Created 08-28-2019 12:38 AM
Created 08-29-2019 12:37 AM
@EricL @eMazarakis We have tables with lots of non-filtered columns, so I know we don't want to collect statistics on them. Impala docs say that:
"For a table with a huge number of partitions and many columns, the approximately 400 bytes of metadata per column per partition can add up to significant memory overhead, as it must be cached on the CatalogD host and on every ImpalaD host that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB, you might experience service downtime."
so for me, it's strange user don't have options to minimize the statistics on tables.
Hive has this option but if I use it it won't sync to Impala:
"If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table."
Created 08-29-2019 01:20 AM
Created 08-29-2019 02:11 AM
It only supports on table stats but not on per partitions stats (incremental stats),
it says in your link:
"For non-incremental COMPUTE STATS statement, the columns for which statistics are computed can be specified with an optional comma-separated list of columns."
So it looks like column specific is only on a table without partitions (non-incremental)
It really strange that it works only in this way
Created 09-20-2019 10:11 AM
So it looks like column specific is only on a table without partitions (non-incremental)
@hores that's incorrect, non-incremental compute stats works on partitioned tables and is generally the preferred method for collecting stats on partitioned tables.
We've generally tried to steer people away from incremental stats because of the size issues on large tables,
It would also be error-prone to use correctly and complex to implement - what happens if you compute incremental stats with different subsets of the columns? You can end up with different subsets of the columns on different partitions and then you have to somehow reconcile it all each time.