Created 07-24-2020 08:22 AM
Hi,
I'm using the Impala function "show table stats":
show table stats table1;
I get following results
business_date | tec_execution_date | #Rows | #Files | Size | Bytes Cached | Cache Replication | Format |
13/05/2020 | 13/05/2020 20:08 | 0 | 0 | 0B | NOT CACHED | NOT CACHED | PARQUET |
14/07/2020 | 15/07/2020 16:39 | 6 | 1 | 16.08KB | NOT CACHED | NOT CACHED | PARQUET |
15/07/2020 | 16/07/2020 16:07 | 770 | 2 | 114.68KB | NOT CACHED | NOT CACHED | PARQUET |
16/07/2020 | 20/07/2020 16:50 | 272 | 1 | 41.88KB | NOT CACHED | NOT CACHED | PARQUET |
17/07/2020 | 21/07/2020 12:38 | 290 | 1 | 45.33KB | NOT CACHED | NOT CACHED | PARQUET |
20/07/2020 | 20/07/2020 12:16 | 379 | 1 | 51.10KB | NOT CACHED | NOT CACHED | PARQUET |
21/07/2020 | 22/07/2020 12:45 | 156 | 1 | 31.50KB | NOT CACHED | NOT CACHED | PARQUET |
24/07/2020 | 24/07/2020 15:15 | 278 | 1 | 40.26KB | NOT CACHED | NOT CACHED | PARQUET |
Total | 2431 | 8 | 340.84KB | 0B |
However if I sum up the rows in column "#Rows" I get 2151...!
Am I'm missing something? How is that possible?
Thanks
Thanks
Created on 07-24-2020 01:21 PM - edited 07-24-2020 01:22 PM
The row counts reflect the status of the partition or table the last time its stats were updated by "compute stats" in Impala (or analyze in Hive). Or that the stats were updated manually via an alter table. (There are also other cases where stats are updated, e.g. they can be automatically gathered by hive, but those are a few examples).
One scenario where this could happen is if a partition was dropped since the last compute stats was run.
The stats generally can be out of sync with the # of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate.
If you want to know the accurate counts, you can run queries like
select count(*) from table;
select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08;
Created on 07-24-2020 01:21 PM - edited 07-24-2020 01:22 PM
The row counts reflect the status of the partition or table the last time its stats were updated by "compute stats" in Impala (or analyze in Hive). Or that the stats were updated manually via an alter table. (There are also other cases where stats are updated, e.g. they can be automatically gathered by hive, but those are a few examples).
One scenario where this could happen is if a partition was dropped since the last compute stats was run.
The stats generally can be out of sync with the # of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate.
If you want to know the accurate counts, you can run queries like
select count(*) from table;
select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08;