What will be the impact on the cluster if we turn on auto stats? Or how can we calculate the impact?
There shouldn't be any impact on the cluster, as the stats are collected and stored in the Hive metastore database when a new partition is created or data is inserted.
The only effect is an additional step that calculates the stats and updates the metastore database, which is hidden from the user.
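For illustration, a minimal sketch of what this looks like in a session. The table names (sales, staging_sales), the columns, and the partition value are hypothetical, and the exact parameters shown by DESCRIBE FORMATTED can vary by Hive version:

```sql
-- Basic stats are gathered as part of the write when this is enabled.
SET hive.stats.autogather=true;

-- Hypothetical tables and partition value, for illustration only.
INSERT INTO TABLE sales PARTITION (ds='2024-01-01')
SELECT customer_id, amount
FROM staging_sales
WHERE ds='2024-01-01';

-- The partition parameters should now include basic stats such as
-- numRows and totalSize.
DESCRIBE FORMATTED sales PARTITION (ds='2024-01-01');
```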
Thanks for the reply.
Running auto stats on Hive tables: table stats are calculated by default on create or insert (hive.stats.autogather=true).
Compute stats for a table: this calculates the number of rows by scanning the table. There won't be a significant impact on the cluster, and the analyze job won't run for long.
Compute stats for columns: this has to calculate the number of distinct values, the number of nulls, the average and min/max length of each column, etc., so the analyze jobs run longer and use more mappers and reducers (this depends on the size of the table and the number of columns). In such situations the impact on the cluster, in terms of resource utilisation, is high.
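To make the two cases concrete, here is a rough sketch of the statements being compared. The table name (sales), column names, and partition value are hypothetical. Naming a single partition and an explicit column list is shown only as one way to keep each job small, not as a confirmed recommendation:

```sql
-- Table-level (basic) stats for one partition: row count, size, file count.
ANALYZE TABLE sales PARTITION (ds='2024-01-01') COMPUTE STATISTICS;

-- Column-level stats for the same partition, restricted to two columns.
-- This is the heavier job: distinct counts, null counts, min/max, lengths.
ANALYZE TABLE sales PARTITION (ds='2024-01-01')
COMPUTE STATISTICS FOR COLUMNS customer_id, amount;
```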
Are there any best practices to follow before running stats for table columns? Even though the stats task runs as a batch job, we want it to execute as efficiently as possible. Basically, we expect to compute statistics on terabytes of data, or on a large number of columns, at a given time.
Also, as part of the stats calculation, which are the important metastore tables that are read or updated?