
Impala 2.8 vs 2.7 on CDH 5.10 upgrade

Explorer

Hi,

 

I had a cluster [CDH 5.8.2] in which I was using Impala and Kudu.

The Impala parcel was downloaded from http://archive.cloudera.com/beta/impala-kudu/parcels/latest/

I have upgraded this cluster to CDH 5.10 with Cloudera Manager 5.10.

 

Now, running a SELECT version() query in Impala on this upgraded cluster gives me the details below:

 

impalad version 2.7.0-cdh5.10.0 RELEASE

 

However, the CDH 5.10 release notes mention support for Impala 2.8, and I cannot find a parcel for Impala 2.8.

Also, running a DELETE statement on an Impala table gives me the error below.

 

"ERROR: AnalysisException: Impala does not support modifying a non-Kudu table: default.impala_testtable"

 

Questions:

 

1. Can anybody suggest how I can upgrade to Impala 2.8? Is there a parcel for it, or is the one I'm currently using the latest?

 

2. Since running DELETE on an Impala table gives me the above error, what is the alternative for deleting data from an existing Impala table? The DELETE command works fine on Kudu tables.

 

Can anybody please help me with this?

 

Thanks,

Amit

 

 

1 ACCEPTED SOLUTION

Super Collaborator

Hi Amit,

 

Your first question has already been discussed in this thread.

There's a bit of a story there. When we started preparing the CDH 5.10 release, the Apache Impala 2.8 release was not ready, so we had to call it "Impala 2.7" in the version number. Impala 2.8 was officially released after we finished putting together the CDH 5.10 release - too late to bump the version everywhere.

 

CDH5.10 Impala is almost exactly the same as 2.8, plus or minus a few patches, so in most of the announcements we've just called it 2.8.

 

You can find a full list of commits in CDH5.10.0 here: https://github.com/cloudera/Impala/commits/cdh5-2.7.0_5.10.0

The full list of commits in Impala 2.8 is here: https://github.com/apache/incubator-impala/commits/branch-2.8.0

 

To your second question: Impala indeed does not support the DELETE command for non-Kudu tables. You can use the TRUNCATE command to completely delete all data in a table.
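
As a minimal sketch against the table from the error message above (assuming you really do want to remove every row, since TRUNCATE takes no WHERE clause):

TRUNCATE TABLE default.impala_testtable;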

 

Cheers, Lars

 


12 REPLIES

Explorer

Thanks Lars for your help.

 

On Impala DELETE, is there any specific reason DELETE is not allowed on non-Kudu Impala tables?

 

Maybe I'm missing something, but I have the Kudu service installed on my cluster. In the Cloudera Manager Impala configuration, what is the difference between setting the Kudu service and selecting none, as Kudu queries work fine in both cases?

 

(screenshot attached: kuduservice.png)

 

Thank you for your help.

 

Thanks,

Amit

 

Contributor

Hi Lars,

I also installed CDH 5.10.1 hoping to find Impala 2.8 with the fix for COMPUTE INCREMENTAL STATS failing on large partitioned tables by exceeding the 200 MB limit.

 

"The new configuration setting inc_stats_size_limit_bytes lets you reduce the load on the catalog server when running the COMPUTE INCREMENTAL STATS statement for very large tables"

 

Do you have a way to resolve it?

 

It's a big issue: when our big tables do not have statistics, queries run for a long time, and Impala loses points here.

 

https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Incremental-stats-size-estimate-exceed...

 

Also, you need to update your documentation:

 

https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html#new_...

 

as Impala 2.8 is not in CDH 5.10 and it's confusing.

 

Thanks

 

Alon


Hi Lars,

 

I also installed CDH 5.10.1 hoping to find Impala 2.8 with the new hint SORTBY(cols):

 

"

A new hint, SORTBY(cols), allows Impala INSERT operations on a Parquet table to produce optimized output files with better compressibility and a more compact range of min/max values within each data file.the fix for the compute stats on large partition table failing on exceeding the limit of 200M .

"

 

Do you know exactly in which version of CDH we will have all the Impala 2.8 new features?

 

Thanks,

Gustavo

 


CDH5.10 has essentially all of the Impala 2.8 improvements in it, as mentioned earlier in the thread.

 

Lars can confirm, but I don't believe that the "SORT BY" fix made it into either Impala 2.8 or CDH5.10; I think it got pushed out to the next release. I think the docs are incorrect: https://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html#new_...


AlonEdi: the incremental stats change should be in CDH5.10. Did you have trouble using it?

Contributor

Hi Tim,

Same problem.

We cannot go to production with this problem.

Here is an example for a table we have with 4560 partitions and 382 columns.

The incremental statistics fail, but full statistics succeed (why?).

BTW, these are empty tables.

It does not happen in CDH 5.4.3, which means we will need to downgrade our CDH version to support huge tables.

 

CDH 5.9

 

[gc-dp-pdpprd-data-04.c.bi-environment-1271.internal:21000] > COMPUTE INCREMENTAL STATS test_partitions.dwh_events;
Query: compute INCREMENTAL STATS test_partitions.dwh_events
ERROR: AnalysisException: Incremental stats size estimate exceeds 200.00MB. Please try COMPUTE STATS instead.

[gc-dp-pdpprd-data-04.c.bi-environment-1271.internal:21000] > COMPUTE STATS test_partitions.dwh_events;
Query: compute STATS  test_partitions.dwh_events
+----------------------------------------------+
| summary                                      |
+----------------------------------------------+
| Updated 4560 partition(s) and 382 column(s). |
+----------------------------------------------+
Fetched 1 row(s) in 263.45s
[gc-dp-pdpprd-data-04.c.bi-environment-1271.internal:21000] > select version();
Query: select version()
Query submitted at: 2017-04-04 10:43:49 (Coordinator: http://gc-dp-pdpprd-data-04:25000)
Query progress can be monitored at: http://gc-dp-pdpprd-data-04:25000/query_plan?query_id=eb4e9a2e3c6eca6c:242f978800000000
+-----------------------------------------------------------------------------------------+
| version()                                                                               |
+-----------------------------------------------------------------------------------------+
| impalad version 2.7.0-cdh5.9.0 RELEASE (build 4b4cf1936bd6cdf34fda5e2f32827e7d60c07a9c) |
| Built on Fri Oct 21 01:07:22 PDT 2016                                                   |
+-----------------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.02s

 

CDH 5.10

 

compute INCREMENTAL STATS  dwh.dwh_events
ERROR: AnalysisException: Incremental stats size estimate exceeds 200.00MB. Please try COMPUTE STATS instead.

[gc-test-impala28-02.c.bi-environment-1271.internal:21000] > COMPUTE STATS  dwh.dwh_events;
Query: compute STATS  dwh.dwh_events
+----------------------------------------------+
| summary                                      |
+----------------------------------------------+
| Updated 4560 partition(s) and 382 column(s). |
+----------------------------------------------+
Fetched 1 row(s) in 219.37s
[gc-test-impala28-02.c.bi-environment-1271.internal:21000] > select version();
Query: select version()
Query submitted at: 2017-04-04 10:48:20 (Coordinator: http://gc-test-impala28-02:25000)
Query progress can be monitored at: http://gc-test-impala28-02:25000/query_plan?query_id=2f49dded87976155:3610426b00000000
+------------------------------------------------------------------------------------------+
| version()                                                                                |
+------------------------------------------------------------------------------------------+
| impalad version 2.7.0-cdh5.10.1 RELEASE (build 876895d2a90346e69f2aea02d5528c2125ae7a32) |
| Built on Mon Mar 20 02:28:53 PDT 2017                                                    |
+------------------------------------------------------------------------------------------+
Fetched 1 row(s) in 0.01s

 

Recommendation: it seems that Impala reads all the data to compute its statistics. It would be good to have estimated statistics on a sample between 0.xx% and 100%, so the process would run faster, be less heavy, and produce statistics that are close to the real ones.

 

Thanks

 

Alon

Super Collaborator

Hi Alon,

 

Have you tried the inc_stats_size_limit_bytes command line flag as suggested by Tim? It is supported on CDH5.10.0. Here's the full help text from impalad:

 

-inc_stats_size_limit_bytes (Maximum size of incremental stats the catalog
is allowed to serialize per table. This limit is set as a safety check,
to prevent the JVM from hitting a maximum array limit of 1GB (or OOM)
while building the thrift objects to send to impalads. By default, it's
set to 200MB) type: int64 default: 209715200

 

 

This should allow you to increase the limit you are hitting.
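
For example, to raise the limit to 400 MB you could pass something like the following to impalad (the exact value is illustrative; in Cloudera Manager it can be added to the Impala Daemon command-line argument safety valve):

-inc_stats_size_limit_bytes=419430400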

 

Cheers, Lars

Contributor

Thanks Lars,

I looked for this parameter as part of the Impala configuration.

I added it to the impalad parameters and it's working without error 🙂

 

What does this message mean?

WARNINGS: Too many partitions selected, doing full recomputation of incremental stats

 

Did it compute all table partitions or just the ones without statistics?

 

I am using a table without data, so COMPUTE STATS (without INCREMENTAL) completed in about the same time (previous post: 219.37 seconds).

 

[gc-test-impala28-02.c.bi-environment-1271.internal:21000] > COMPUTE INCREMENTAL STATS  dwh.dwh_events;
Query: compute INCREMENTAL STATS  dwh.dwh_events
+----------------------------------------------+
| summary                                      |
+----------------------------------------------+
| Updated 4560 partition(s) and 382 column(s). |
+----------------------------------------------+
WARNINGS: Too many partitions selected, doing full recomputation of incremental stats
Fetched 1 row(s) in 262.02s

 

Thanks

 

Alon