- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Impala compute incremental stats on specific columns
- Labels:
-
Apache Impala
Created ‎08-21-2019 08:17 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I want to gather stats on a big partition table,
but want to do it only on some of the partitions and not on all the columns because it can take lots of data.
I don't see in the documentation of "compute incremental stats" option to do it,
How can I run stats only on some of the partitions and some/none of the columns?
Thanks
Created on ‎08-26-2019 02:04 AM - edited ‎08-26-2019 02:08 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
How can I run stats only on some of the partitions and some/none of the columns?
@hores in order to take stats for a certain partition as you mentioned above you have to run the following command
COMPUTE INCREMENTAL STATS [db_name.]table_name [PARTITION (partition_spec)]
partition_spec ::= partition_col=constant_value
For further info you can read the documentation : impala_compute_stats.html
Created ‎08-27-2019 07:27 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @eMazarakis
but I mean stats on specific partitions AND specific columns (BOTH).
If I'll run as you suggest It will collect statistics on all the columns which we don't want,
So how can I collect stats on specific partitions AND specific columns?
Created on ‎08-28-2019 12:34 AM - edited ‎08-28-2019 12:35 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@hores As you can see the doc there is nothing about computing stats for specific table-columns. Stats are for the whole table.
How can you know which info for statistics impalad daemons might want to use during the join queries ?
Created ‎08-28-2019 12:38 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Created ‎08-29-2019 12:37 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@EricL @eMazarakis We have tables with lots of non-filtered columns, so I know we don't want to collect statistics on them. Impala docs say that:
"For a table with a huge number of partitions and many columns, the approximately 400 bytes of metadata per column per partition can add up to significant memory overhead, as it must be cached on the CatalogD host and on every ImpalaD host that is eligible to be a coordinator. If this metadata for all tables combined exceeds 2 GB, you might experience service downtime."
so for me, it's strange user don't have options to minimize the statistics on tables.
Hive has this option but if I use it it won't sync to Impala:
"If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table."
Created ‎08-29-2019 01:20 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You are right! Looks like latest Impala in CDH6.x supports column level stats:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_compute_stats.html
COMPUTE STATS [db_name.]table_name [ ( column_list ) ] [TABLESAMPLE SYSTEM(percentage) [REPEATABLE(seed)]]
column_list ::= column_name [ , column_name, ... ]
What version are you using?
Cheers
Eric
Created ‎08-29-2019 02:11 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It only supports on table stats but not on per partitions stats (incremental stats),
it says in your link:
"For non-incremental COMPUTE STATS statement, the columns for which statistics are computed can be specified with an optional comma-separated list of columns."
So it looks like column specific is only on a table without partitions (non-incremental)
It really strange that it works only in this way
Created ‎09-20-2019 10:11 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
So it looks like column specific is only on a table without partitions (non-incremental)
@hores that's incorrect, non-incremental compute stats works on partitioned tables and is generally the preferred method for collecting stats on partitioned tables.
We've generally tried to steer people away from incremental stats because of the size issues on large tables,
It would also be error-prone to use correctly and complex to implement - what happens if you compute incremental stats with different subsets of the columns? You can end up with different subsets of the columns on different partitions and then you have to somehow reconcile it all each time.
