Support Questions

Find answers, ask questions, and share your expertise

Impala compute incremental stats on specific columns

avatar
Contributor

Hi,

 

I want to gather stats on a big partition table,

but want to do it only on some of the partitions and not on all the columns because it can take lots of data.

I don't see in the documentation of "compute incremental stats" option to do it,

How can I run stats only on some of the partitions and some/none of the columns?

 

Thanks

8 REPLIES 8

avatar
Contributor

How can I run stats only on some of the partitions and some/none of the columns?

 

@hores  in order to take stats for a certain partition  as you mentioned above you have to run the following command 

COMPUTE INCREMENTAL STATS [db_name.]table_name [PARTITION (partition_spec)]
partition_spec ::= partition_col=constant_value

For further info you can read the documentation : impala_compute_stats.html

 

avatar
Contributor

Thanks @eMazarakis 

but I mean stats on specific partitions AND specific columns (BOTH).

If I'll run as you suggest It will collect statistics on all the columns which we don't want,

So how can I collect stats on specific partitions AND specific columns? 

avatar
Contributor

@hores  As you can see the doc there is nothing about computing stats for specific table-columns. Stats are for the whole table.

How can you know which info for statistics impalad daemons might want to use during the join queries ?

 

avatar
Super Guru
You can't specify columns as Impala will collect all, but you can do it at partition level.

avatar
Contributor

@EricL @eMazarakis We have tables with lots of non-filtered columns, so I know we don't want to collect statistics on them. Impala docs say that:

"For a table with a huge number of partitions and many columns, the approximately 400 bytes of metadata per column per partition can add up to significant memory overhead, as it must be cached on the CatalogD host and on every ImpalaD host that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB, you might experience service downtime."

 

so for me, it's strange user don't have options to minimize the statistics on tables. 

Hive has this option but if I use it it won't sync to Impala:

"If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table."

avatar
Super Guru
@hores,

You are right! Looks like latest Impala in CDH6.x supports column level stats:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_compute_stats.html

COMPUTE STATS [db_name.]table_name [ ( column_list ) ] [TABLESAMPLE SYSTEM(percentage) [REPEATABLE(seed)]]

column_list ::= column_name [ , column_name, ... ]

What version are you using?

Cheers
Eric

avatar
Contributor

It only supports on table stats but not on per partitions stats (incremental stats),

it says in your link:

"For non-incremental COMPUTE STATS statement, the columns for which statistics are computed can be specified with an optional comma-separated list of columns."

So it looks like column specific is only on a table without partitions (non-incremental)

 

It really strange that it works only in this way

avatar

So it looks like column specific is only on a table without partitions (non-incremental)


@hores that's incorrect, non-incremental compute stats works on partitioned tables and is generally the preferred method for collecting stats on partitioned tables.

 

We've generally tried to steer people away from incremental stats because of the size issues on large tables,

 

It would also be error-prone to use correctly and complex to implement - what happens if you compute incremental stats with different subsets of the columns? You can end up with different subsets of the columns on different partitions and then you have to somehow reconcile it all each time.