Support Questions

Find answers, ask questions, and share your expertise

Impala show table stats - total of rows doesn't sum up

avatar
Explorer

Hi,

 

I'm using the Impala function "show table stats":

 

show table stats table1;

 

 
I get following results

business_datetec_execution_date#Rows#FilesSizeBytes CachedCache ReplicationFormat
13/05/202013/05/2020 20:08000BNOT CACHEDNOT CACHEDPARQUET
14/07/202015/07/2020 16:396116.08KBNOT CACHEDNOT CACHEDPARQUET
15/07/202016/07/2020 16:077702114.68KBNOT CACHEDNOT CACHEDPARQUET
16/07/202020/07/2020 16:50272141.88KBNOT CACHEDNOT CACHEDPARQUET
17/07/202021/07/2020 12:38290145.33KBNOT CACHEDNOT CACHEDPARQUET
20/07/202020/07/2020 12:16379151.10KBNOT CACHEDNOT CACHEDPARQUET
21/07/202022/07/2020 12:45156131.50KBNOT CACHEDNOT CACHEDPARQUET
24/07/202024/07/2020 15:15278140.26KBNOT CACHEDNOT CACHEDPARQUET
Total 24318340.84KB0B  


However if I sum up the rows in column "#Rows" I get 2151...!

Am I'm missing something? How is that possible?

Thanks


Thanks

1 ACCEPTED SOLUTION

avatar

The row counts reflect the status of the partition or table the last time its stats were updated by "compute stats" in Impala (or analyze in Hive). Or that the stats were updated manually via an alter table. (There are also other cases where stats are updated, e.g. they can be automatically gathered by hive, but those are a few examples).

 

One scenario where this could happen is if a partition was dropped since the last compute stats was run.

 

The stats generally can be out of sync with the # of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate.

 

If you want to know the accurate counts, you can run queries like

  select count(*) from table;
  select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08;

 

 

View solution in original post

1 REPLY 1

avatar

The row counts reflect the status of the partition or table the last time its stats were updated by "compute stats" in Impala (or analyze in Hive). Or that the stats were updated manually via an alter table. (There are also other cases where stats are updated, e.g. they can be automatically gathered by hive, but those are a few examples).

 

One scenario where this could happen is if a partition was dropped since the last compute stats was run.

 

The stats generally can be out of sync with the # of rows in the underlying table - we don't use them for answering queries, just for query optimization, so it's fine if they're a little inaccurate.

 

If you want to know the accurate counts, you can run queries like

  select count(*) from table;
  select count(*) from table where business_date = "13/05/2020" and tec_execution_date = "13/05/2020 20:08;