Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 103
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6067 | 12-18-2020 01:46 PM |
| | 3941 | 12-16-2020 12:11 PM |
| | 2792 | 12-07-2020 01:47 PM |
| | 1992 | 12-07-2020 09:21 AM |
| | 1279 | 10-14-2020 11:15 AM |
04-12-2017
09:00 PM
Column a.t_date is a string field, not a timestamp field. Both tables are in Parquet file format. By adding more nodes to the cluster, more Impala daemons will be running; can we expect the performance of such a query to improve?
03-29-2017
01:19 AM
Unfortunately, the content of this file is under NDA, so I can't provide you the file. Some information that I can give is summarized here:

- Output from "hdfs dfs -ls": -rwxrwx--x+ 3 hive hive 1093251527 2016-09-30 21:15 /path/to/file/month=12/part-r-00000-be7725db-da77-4a34-a3c6-2e5a9276228c.snappy.parquet
- We have a _metadata and a _common_metadata file in the same directory (I tried removing them, but this did not resolve the issue)
- Compression: snappy
- It was created using: parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber}) (output from parquet-tools, version 1.9.0)
- Software used for creation: the bundled Spark 1.6.0 from CDH 5.7.1 (in the meantime we are using CDH 5.9.0)
- The file contains 713 row groups
- The file contains 867 columns (of types int64, double and binary)

One further thing that I tried is copying the problematic file to a separate directory (without the two metadata files), creating a new table from that file with Impala, and running the test there. Unfortunately this produces exactly the same behaviour: when the file is cached I get the error message; when it is not cached, everything works fine. Let me know if this helps you in understanding this problem or if you need further information (except for the contents of the file). Thanks a lot already! Kind Regards
03-23-2017
09:10 AM
Hi Tim, I'd like to respond and confirm that we were running into the issues you brought up. I will also note that changing our double values from 1200 to 1200.0 does seem to remedy that particular problem. Thank you for your response.
03-21-2017
11:29 AM
I think it was probably unable to get enough memory because of other concurrently executing queries. This is somewhat counterintuitive, but if you set the mem_limit query option to an amount of memory that the query can reliably obtain, e.g. 2GB, then when it hits that limit spill-to-disk will kick in and the query should be able to complete (albeit slower than running fully in memory). We generally recommend that all queries run with a mem_limit set. You can configure a default mem_limit via the "default query options" config or by setting up memory-based admission control. We have some good docs about how to set up memory-based admission control here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_admission.html#admission_memory We're actively working on improving this so that it's more hands-off.
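A minimal sketch of the workaround described above, as you would type it in an impala-shell session (the 2GB figure is the example value from the post, and the table and column names are hypothetical; tune the limit for your workload):

```sql
-- Cap this session's queries at 2GB of memory per node. Queries that hit
-- the limit will spill to disk instead of failing with an out-of-memory
-- error, at the cost of running slower than fully in-memory.
SET MEM_LIMIT=2g;

-- Then run the memory-hungry query as usual (names are illustrative only):
SELECT customer_id, COUNT(*)
FROM big_table
GROUP BY customer_id;
```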
03-17-2017
08:53 AM
1 Kudo
This is a bug in the impala-udf-dev package, versions 5.9.x to 5.10.x. It was always intended to be compilable with older versions of gcc. It will be fixed in 5.11+ once that is released. If you downgrade the package to version 5.8.x or earlier, it should also work.
03-14-2017
11:21 AM
One possible explanation is a crash if there is some problem with the data file. Are there any hs_err_pid*.log files in /var/log/impalad? Or any *.dmp files?
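A quick sketch of how you might check for those crash artifacts from a shell (/var/log/impalad is the usual CDH default log location; pass a different path if your install differs):

```shell
# find_crash_artifacts: list JVM fatal-error logs (hs_err_pid*.log) and
# minidump files (*.dmp) left behind by a crashed impalad process.
# Defaults to /var/log/impalad; accepts an alternate directory as $1.
find_crash_artifacts() {
  dir="${1:-/var/log/impalad}"
  [ -d "$dir" ] || return 0   # nothing to scan if the directory is absent
  find "$dir" -maxdepth 1 \( -name 'hs_err_pid*.log' -o -name '*.dmp' \)
}

find_crash_artifacts
```

Any file this prints is a strong hint that impalad crashed while reading the data file.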
02-08-2017
05:26 AM
It looks like the problem really is in the timestamp field. Running a similar query on a table without the timestamp column shows much better results in the new environment. Thanks for the help.
02-02-2017
10:42 AM
1 Kudo
I think this is related to https://issues.cloudera.org/browse/IMPALA-4610 . I think you already discovered the workaround of using full subqueries.
01-31-2017
05:45 PM
@Tim Armstrong Thank you very much for your explanation. 🙂 Gatsby
01-31-2017
05:25 PM
We don't support UDFs manipulating Impala's runtime data structures. We don't expose those to UDFs, since UDFs aren't really meant to do things like I/O.