Reply
Contributor
Posts: 66
Registered: ‎12-30-2015

Re: why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Ta

Alex,

Thank you so much for your time and explanation.
All your comments are really valuable to me :)

Thank you again.

Contributor
Posts: 66
Registered: ‎12-30-2015

Re: why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Ta

[ Edited ]

Alex,

 

I have a question.

Like I said, one of the tables I have 31k partitions. ( Table1 )

 

It seems the table performs ok.

However, since you said that more than 10k is not recommended per table, I have thought reducing the partitions.

Recently, I have read about bucketing in Hive and generated a testing table with partition for the first level and bucket for the secend level.

 

BTW, I got this warning from Impala.

WARNINGS: For better performance, snappy, gzip and bzip-compressed files should not be split into multiple hdfs-blocks. file=/user/hive/external_warehouse/test_table/yrmonth=201512/000151_0.snappy offset 134217728

 

 

Here are my questions.

 

Question 1:

In general, how is Impala query performace with Hive bucketing?

 

Question 2:

How to avoid to use cached Impala query?

 

I'd like to do some performance testing with this new table.

However, if I run the same query two times, it seems the second query use the cached data by the prev. one.

 

Thank you

Cloudera Employee
Posts: 307
Registered: ‎10-16-2013

Re: why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Ta

1. Impala does not support bucketed tables.

 

2. Your first run is likely slow because Impala first needs to load the table metadata from the catalogd (as explain in another thread). After the first reference to the table, the metadata is cached. So when perf testing you should discard the first run, unless you specifically want to test performance on a cold cache (which seems weird).

Contributor
Posts: 66
Registered: ‎12-30-2015

Re: why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Ta

Thank you,

 

1. Any plan to support bucketed tables?

 

2. Yeap. I remember what you said about catalogd loading metadata.

However, my question was that how to avoid to hit Impala cache if Impala caches query which ran before.

The reason why I'm asking this question is that I saw some difference in performace.

For example, let's say catalogd already has metadata for `Table1` because I already wamp up  the table by running `SELECT query`. After this, I run two different query like this.

 

I run QUERY1 - this takes 15sec

I run QUERY2 - this takes 20sec

I run QUERY1 again - this takes 6sec

I run QUERY2 again - this takes 10sec

 

 

How should I interpret this difference?

Cloudera Employee
Posts: 307
Registered: ‎10-16-2013

Re: why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Ta

1. It's on our radar, but we don't have concrete plans yet.

 

2. Impala does not do query plan or result caching, so the differences you see are not due to Impala doing caching. There could be several reasons why you see a difference among runs, e.g., the OS buffer cache warming up The first step would be to study the runtime profiles of those queries. The exec summary is often enough to determine where time was saved the second time around,

 

Contributor
Posts: 66
Registered: ‎12-30-2015

Re: why 'show column stats <table_name>` doesn't show statistics generated by Hive 'Analyze Ta

Ah, I got it.

 

Tmr, I will try to compare the difference between first and second run by checking exec summary.

 

MANY Thank you.