About Tim Armstrong

bigdata-beast · ‎12-14-2018

The same can be performed in Hive using concat_ws('.',from_unixtime(cast(epochmillis/1000 as BIGINT),'yyyy-MM-dd HH:mm:ss'),cast(floor(epochmillis % 1000) as STRING)) to get the timestamp with milliseconds. Is this an efficient way of doing it?
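One thing worth checking in an expression like this is that the millisecond remainder is cast to a string without zero padding, so a remainder of 5 would come out as ".5" rather than ".005". The sketch below mirrors the same split in Python to make the difference visible. It is only an illustration: the function names are hypothetical, and it formats in UTC, whereas from_unixtime() uses the session time zone.

```python
from datetime import datetime, timezone


def _format_seconds(seconds: int) -> str:
    # Format whole seconds since the epoch as 'yyyy-MM-dd HH:mm:ss' (UTC assumed).
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def format_unpadded(epoch_millis: int) -> str:
    # Mirrors the concat_ws approach: remainder appended as-is, no padding.
    seconds, millis = divmod(epoch_millis, 1000)
    return f"{_format_seconds(seconds)}.{millis}"


def format_epoch_millis(epoch_millis: int) -> str:
    # Same split, but the remainder is zero-padded to three digits.
    seconds, millis = divmod(epoch_millis, 1000)
    return f"{_format_seconds(seconds)}.{millis:03d}"


if __name__ == "__main__":
    sample = 1544745600005  # 5 ms past a whole second
    print(format_unpadded(sample))      # fraction reads as .5
    print(format_epoch_millis(sample))  # fraction reads as .005
```

In SQL the equivalent fix would be to left-pad the remainder to three characters before concatenating.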

Tim Armstrong · ‎12-08-2018

Actually, scratch what I just said - that advice applies if the query is stuck in the FINISHED state. If it's stuck in the RUNNING state, it means the query is just taking a long time to produce any results. So you're probably getting a bad query plan on one cluster that is extremely slow to execute. E.g. the order of the joins chosen by the planner is inefficient. Usually computing stats on all the tables will improve the query plan.

Tim Armstrong · ‎11-26-2018

CDH5.10.2 should have the fix for that specific issue.

Tim Armstrong · ‎11-19-2018

Hi @scuffster

There are some interesting issues here with the different numeric data types - INT, DOUBLE, DECIMAL, etc. The behaviour you're seeing is because the first input to round() is a DOUBLE expression, which cannot exactly represent all decimal values. Generally the output type of the round() function is the same as the input type.

Impala does support precise decimal arithmetic with the DECIMAL type. If you are operating on DECIMAL columns or you cast the input to a decimal type with the right precision and scale, you may get the behaviour you're hoping for.

Here's a query showing the type of your expressions and an alternative version with a cast to DECIMAL:

> select typeof(269586/334026 * 100), typeof(round(269586/334026 * 100, 2)), round(269586/334026 * 100, 2), round(cast(269586/334026 * 100 as DECIMAL(20, 8)), 2);
+-------------------------------+-----------------------------------------+---------------------------------+--------------------------------------------------------+
| typeof(269586 / 334026 * 100) | typeof(round(269586 / 334026 * 100, 2)) | round(269586 / 334026 * 100, 2) | round(cast(269586 / 334026 * 100 as decimal(20,8)), 2) |
+-------------------------------+-----------------------------------------+---------------------------------+--------------------------------------------------------+
| DOUBLE                        | DOUBLE                                  | 80.70999999999999               | 80.71                                                  |
+-------------------------------+-----------------------------------------+---------------------------------+--------------------------------------------------------+
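The same DOUBLE versus DECIMAL distinction can be reproduced outside Impala. The sketch below uses Python's float (a binary double) and its decimal module as stand-ins; it is an analogy under that assumption, not a description of how Impala implements round(), and the variable names are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point: the rounded result is the nearest double to 80.71,
# which is not exactly 80.71.
ratio_double = 269586 / 334026 * 100
rounded_double = round(ratio_double, 2)
exact_value_of_double = Decimal(rounded_double)  # exact value actually stored

# Decimal arithmetic: the value is rounded to a scale of 2 and held exactly.
ratio_decimal = Decimal(269586) / Decimal(334026) * 100
rounded_decimal = ratio_decimal.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

if __name__ == "__main__":
    print(exact_value_of_double)  # long expansion close to, but not equal to, 80.71
    print(rounded_decimal)        # 80.71
```

Whether the inexact double shows up as 80.71 or 80.70999999999999 depends on how the client prints it; the decimal result does not have that ambiguity.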

Selu · ‎11-19-2018

Is there a workaround for this? We are on Impala version 2.8.0 and are always stuck with compute incremental stats queries that need to be manually cancelled.

CsabaR · ‎10-26-2018

I have checked the writer in the file's metadata, and it is Parquet.Net version 2.1.4.298. So it seems that this is not an Impala reader issue, but a Parquet.Net writer issue. The definition levels of NULLs in collections are wrong (according to the Parquet spec). The issue this causes is that if the first column read is the collection with NULL in the row, then the 0 def level is interpreted as "the whole row is NULL". If there is another (non-NULL) column read first, then its def level will be used to determine the parent's NULLness, so it will not be NULL. This is why adding 'id' leads to returning the expected results. I would not consider this a bug, rather an optimisation (checking every column's def level could affect performance). Parquet.Net is not part of CDH and is not an Apache project at the moment. I am not familiar with the project, so I do not know whether this is a known issue or not. My advice is to contact the maintainer mentioned at https://github.com/elastacloud/parquet-dotnet

BikramjeetVig · ‎10-10-2018

The MEM_LIMIT is a hard limit on the amount of memory that can be used by the query and cannot be re-negotiated during execution. If the default mem_limit that you set does not suffice, you can either increase it OR you can set the mem_limit query option to a higher value only for that query.

Tomas79 · ‎10-04-2018

I have figured out that this is coming from the third party tool, so it has nothing to do with the Simba driver. Thanks

Tim Armstrong · ‎10-03-2018

I think the Kudu min-max filter pushdown optimisation in C5.14+ would achieve this: https://issues.apache.org/jira/browse/IMPALA-4252

recurse · ‎09-21-2018

I would like this too. My use case is new data files written to existing partitions, so I'm not concerned with partition discovery. Even having REFRESH tabA PARTITION (...) PARTITION (...) PARTITION (...) would be useful.

Last Visited	‎02-11-2021 06:07 PM

Member Since	‎07-29-2015 04:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: Why not from_unixtime() function handles an un...

Re: Queries having joins stuck in running state in...

Re: Jar files created when invalidate metadata is ...

Re: Impala round function does not return expected...

Re: impala-shell operations getting stuck, spinnin...

Re: Impala bug with nested arrays of structures wh...

Re: Error Impala admission control

Re: Impala unsupported set commands

Re: Impala hash join optimization

Re: Refreshing multiple partitions in single quer...