Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Table ...

avatar
Rising Star

 

Hello,

I am researching how to statistics to optimize queries.
From I read from here, http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_perf_stats.h...
Although Impala recommends to use `COMPUTE STATS` or `COMPUTE INCREMENTAL STATS` depending on use cases, 
I can still generate statistics from Hive by using `ANALYZE TABLE ... COMPUTE STATISTICS` and/or `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS;`

After I generate statistics by `ANALYZE TABLE ... COMPUTE STATISTICS` on Hive,
I could see statistics for table and partition on Impala by running `show table stats Table1`.

However, I couldn't see any updated column statistics even after `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS;`.
Of course, I used `SHOW COLUMN STATS Tables1`.

Is this a bug or am I missing anything?

Thank you for your help in advance.

2 ACCEPTED SOLUTIONS

avatar

Ok, looks like I was wrong in assuming that Hive would compute the column stats on a table level.

For a partitioned table, Hive's ANALYZE TABLE command will compute the column stats on a per-partition basis.

It's not clear that this approach even makes sense because how will one then aggregate the different distinct-value stats across partitions? Seems like those stats would be wildly inaccurate, so maybe this is not a good flow anyway, even if we could make it work.

 

That's why the stats do not show up in Impala. The flow of computing column stats in Hive and then using them in Impala will currently not work for partitioned tables.

 

View solution in original post

avatar

Thanks for the detailed explanation!

 

See my responses below:

 

The term "partition" usually means a complete partition specification, i.e., you can construct a path into HDFS that contains data files (and not folders for the next level partition values).

 

RE: Question 1

That's correct. It will do a full recomputation. However, it might be worth questioning your partitioning strategy. Typically we do not recommend more than 10k partitions per table. In extreme cases you might go up to 100k, but that is really stretching it, and you need to fully understand the consequences.

The 10k recommendation stems from limitations in Impala and the Hive Metastore in scaling up to an extreme number of partitions.

 

RE: Question 2

Probably yes. If you have an extreme number of partitions, then compute incremental stats is not recommended due to the additional memory requirements.

 

RE: Question 3

Doing concurrent reads should be fine. Doing concurrent writes with INSER INTO should also be fine as long as you do not OVERWRITE. However, there is no guarantee as to which partitions/files will be seen by COMPUTE STATS. It may see some but not other results from your INSERT.

 

View solution in original post

15 REPLIES 15

avatar
Rising Star
Alex,

Thank you so much for your time and explanation.
All your comments are really valuable to me 🙂

Thank you again.

avatar
Rising Star

Alex,

 

I have a question.

Like I said, one of the tables I have 31k partitions. ( Table1 )

 

It seems the table performs ok.

However, since you said that more than 10k is not recommended per table, I have thought reducing the partitions.

Recently, I have read about bucketing in Hive and generated a testing table with partition for the first level and bucket for the secend level.

 

BTW, I got this warning from Impala.

WARNINGS: For better performance, snappy, gzip and bzip-compressed files should not be split into multiple hdfs-blocks. file=/user/hive/external_warehouse/test_table/yrmonth=201512/000151_0.snappy offset 134217728

 

 

Here are my questions.

 

Question 1:

In general, how is Impala query performace with Hive bucketing?

 

Question 2:

How to avoid to use cached Impala query?

 

I'd like to do some performance testing with this new table.

However, if I run the same query two times, it seems the second query use the cached data by the prev. one.

 

Thank you

avatar

1. Impala does not support bucketed tables.

 

2. Your first run is likely slow because Impala first needs to load the table metadata from the catalogd (as explain in another thread). After the first reference to the table, the metadata is cached. So when perf testing you should discard the first run, unless you specifically want to test performance on a cold cache (which seems weird).

avatar
Rising Star

Thank you,

 

1. Any plan to support bucketed tables?

 

2. Yeap. I remember what you said about catalogd loading metadata.

However, my question was that how to avoid to hit Impala cache if Impala caches query which ran before.

The reason why I'm asking this question is that I saw some difference in performace.

For example, let's say catalogd already has metadata for `Table1` because I already wamp up  the table by running `SELECT query`. After this, I run two different query like this.

 

I run QUERY1 - this takes 15sec

I run QUERY2 - this takes 20sec

I run QUERY1 again - this takes 6sec

I run QUERY2 again - this takes 10sec

 

 

How should I interpret this difference?

avatar

1. It's on our radar, but we don't have concrete plans yet.

 

2. Impala does not do query plan or result caching, so the differences you see are not due to Impala doing caching. There could be several reasons why you see a difference among runs, e.g., the OS buffer cache warming up The first step would be to study the runtime profiles of those queries. The exec summary is often enough to determine where time was saved the second time around,

 

avatar
Rising Star

Ah, I got it.

 

Tmr, I will try to compare the difference between first and second run by checking exec summary.

 

MANY Thank you.