About Tim Armstrong

Tim Armstrong · ‎01-23-2017

Hi Akhil, The only way I can think of to achieve this is to refer to the UDF by its fully-qualified name. E.g. if you create a function "my_fn" in a database "my_db" you can call it as my_db.my_fn() from any database.

Tim Armstrong · ‎01-19-2017

It looks like the query was only able to get 223MB of memory - perhaps there are other queries running at the same time?

Tim Armstrong · ‎01-12-2017

What kind of performance difference are we talking about? 5%? 100%? It's helpful to look at execution summaries or profiles to drill down on where the difference is (if you're using impala-shell, you can get them with the summary; and profile; commands after running a query). If the whole data set you're querying fits in memory, HDFS caching may not be that beneficial, since the OS buffer cache can be pretty effective at keeping the data in memory, especially if you're re-running the same query on the same data back-to-back. Also if the query is somewhat complex, it can get CPU-bound pretty quickly.

Tim Armstrong · ‎01-12-2017

Hi efumas, What version of Impala are you running? For more recent versions of Impala the query error log will include a more detailed dump of which query operators are using memory. It will also likely show up in the impalad* logs. Generally this error means that you don't have enough memory to execute the query. The memory limits that can apply are the total process memory limit (set for an entire Impala daemon when it is started) or the query memory limit (set via the mem_limit query option). - Tim

Tim Armstrong · ‎01-11-2017

Unfortunately there are some known issues with rand(). This is essentially the same issue as https://issues.cloudera.org/browse/IMPALA-397 (Order by rand() does not work). Impala's planner doesn't currently fully understand the concept of a non-deterministic or random function, so it will often produce plans that either evaluate rand() repeatedly when logically they shouldn't or cache the value of rand(). In this particular case, it essentially substitutes rand() for random and re-evaluates it multiple times.

[localhost:21000] > explain select
    case when random < 0.005 then 1
         when random < 0.0175 and random >= 0.005 then 2
         when random < 0.0175 and random >= 0.0175 then 3
         when random < 0.2500 and random >= 0.0800 then 4
         else 0 end segment,
    min(random),max(random), count(id)
  from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j
  group by segment;

Explain String

Estimated Per-Host Requirements: Memory=80.00MB VCores=1

PLAN-ROOT SINK
|
01:AGGREGATE [FINALIZE]
|  output: min(rand(1484133976)), max(rand(1484133976)), count(l_orderkey)
|  group by: CASE WHEN rand(1484133976) < 0.005 THEN 1 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.005 THEN 2 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.0175 THEN 3 WHEN rand(1484133976) < 0.2500 AND rand(1484133976) >= 0.0800 THEN 4 ELSE 0 END
|
02:EXCHANGE [UNPARTITIONED]
|  limit: 1000000
|
00:SCAN HDFS [tpch_parquet.lineitem]
   partitions=1/1 files=3 size=193.61MB
   limit: 1000000

Generally rand() will work as expected if it's in the select list of the outer query. E.g. "create table tmp_rand as select rand(unix_timestamp()) from table" would do what you expect. So you could maybe work around it by creating a temporary table instead of using a subquery (I know that's not ideal).

Tim Armstrong · ‎01-10-2017

Hi Petter, This was on our radar - we usually triage anything with a "correctness" label (which you added) periodically - it's obviously a serious issue. I updated the JIRA. - Tim

Tim Armstrong · ‎12-23-2016

Hi RPAT, The values of .ptr and .len are invalid if .is_null is true. For a null string value, in some cases Impala just sets the is_null field and doesn't overwrite the ptr and len fields. You should rewrite the condition as: if (sInput.is_null) { ... } else { ... } This isn't explicitly documented so we should improve that: https://issues.cloudera.org/browse/IMPALA-4711
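To make the null check concrete, here is a minimal sketch of the pattern. It does not use the real Impala UDF headers: StringValSketch is a hypothetical stand-in that only mirrors the is_null, ptr and len fields mentioned above, and ToStdStringOr is an illustrative helper name, not part of any Impala API. The point is only the ordering: is_null is tested first, and ptr/len are read solely in the non-null branch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the UDF string value type, reduced to the three
// fields discussed above so the sketch compiles without the Impala headers.
struct StringValSketch {
  bool is_null;
  std::uint8_t* ptr;
  int len;
};

// Illustrative helper (assumed name): returns the string contents, or a
// caller-supplied fallback when the input is NULL.
std::string ToStdStringOr(const StringValSketch& input,
                          const std::string& fallback) {
  if (input.is_null) {
    // ptr and len may hold stale or uninitialized data here, so they are
    // deliberately not read in this branch.
    return fallback;
  } else {
    // Only a non-null value has a valid ptr/len pair.
    return std::string(reinterpret_cast<const char*>(input.ptr), input.len);
  }
}
```

In a real UDF the same shape applies to the actual input argument (sInput in the question): branch on is_null before touching the other two fields, rather than inferring nullness from ptr or len.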

Tim Armstrong · ‎12-08-2016

Good point - we should handle this more gracefully. I filed https://issues.cloudera.org/browse/IMPALA-4629 to track the issue.

Tim Armstrong · ‎12-05-2016

Yeah, Cloudera Manager's agent will restart it automatically (at least in the default config, I believe).

Tim Armstrong · ‎12-05-2016

It looks like maybe your catalog service is having problems. It would be worth looking in the catalogd logs for clues.

Member Since	‎07-29-2015 04:07 PM
Last Visited	‎02-11-2021 06:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: How to create impala UDF common to all databas...

Re: impala memory limit exceed

Re: Impala performance with HDFS caching enabled

Re: impala memory limit exceed

Re: Impala RANDOM, cases

Re: Impala has problems reading complex types from...

Re: In Impala UDF - While fetching NULL record St...

Re: impala-shell operations getting stuck, spinnin...

Re: impala-shell operations getting stuck, spinnin...

Re: impala-shell operations getting stuck, spinnin...