Created on 01-11-2017 07:12 AM - edited 09-16-2022 03:53 AM
Hi Everyone,
I've come across a very strange behavior of impala.
My aim is to generate random numbers for each rows, and then classified based on the probability to each segment. For some reason, the below query fails, and mixing up segments.
select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(ANYID) from ( select ANYID,RAND(unix_timestamp()) random from ANYTABLE limit 1000000) j group by segment
Can anyone help me with any explanation? I simply can't get my head around.
Thank you so much,
Andras
Created 01-11-2017 11:29 AM
Unfortunately there are some known issues with rand(). This is essentially the same issue as https://issues.cloudera.org/browse/IMPALA-397 (Order by rand() does not work).
Impala's planner doesn't currently fully understand the concept of a non-deterministic or random function, so it will often produce plans that either evaluate rand() repeatedly when logically it shouldn't or caches the value of rand().
In this particular case, it evaluates essentially substitutes random for rand() and re-evaluates it multiple times.
[localhost:21000] > explain select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(id) from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j group by segment; Query: explain select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(id) from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j group by segment +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Explain String | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=80.00MB VCores=1 | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: min(rand(1484133976)), max(rand(1484133976)), count(l_orderkey) | | | group by: CASE WHEN rand(1484133976) < 0.005 THEN 1 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.005 THEN 2 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.0175 THEN 3 WHEN rand(1484133976) < 0.2500 AND rand(1484133976) >= 0.0800 THEN 4 ELSE 0 END | | | | | 02:EXCHANGE [UNPARTITIONED] | | | limit: 1000000 | | | | | 00:SCAN HDFS [tpch_parquet.lineitem] | | partitions=1/1 files=3 size=193.61MB | | limit: 1000000 | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Generally rand() will work as expected if it's in the select list of the outer query. E.g. "create table tmp_rand as select rand(unix_timestamp()) from table" would do what you expect. So you could maybe work around it by creating a temporary table instead of using a subquery (I know that's not ideal).
Created 01-11-2017 12:51 PM
Thank you very much for your very quick and useful answer! Glad to know that it is a known bug.
Anyway, I have been playing around, and made it work!
For some reason it does work. Have no idea why though..
select a+b+c+d segment ,min(r) ,max(r) ,count(ANYID) COUNTER from ( select ANYID ,r ,if(r >= 0.005,1,0) a ,if(r >= 0.0175,1,0) b ,if(r >= 0.08,1,0) c ,if(r >= 0.025,1,0) d from ( select ANYID,RAND(unix_timestamp()) r from ANYTABLE ) foo ) bar group by segment order by segment
Do you have any idea why It does work? 🙂
Thanks
Andras