Support Questions

AndrasK · ‎01-11-2017

Hi Everyone,

I've come across a very strange behavior of impala.
My aim is to generate random numbers for each rows, and then classified based on the probability to each segment. For some reason, the below query fails, and mixing up segments.

select 
case when random < 0.005 then 1
when random < 0.0175 and random >= 0.005 then 2
when random < 0.0175 and random >= 0.0175 then 3
when random < 0.2500 and random >= 0.0800 then 4
else 0 end segment,
min(random),max(random),
count(ANYID)
from (
select ANYID,RAND(unix_timestamp()) random from ANYTABLE limit 1000000) j
group by segment

Can anyone help me with any explanation? I simply can't get my head around.

Thank you so much,
Andras

Tim Armstrong · ‎01-11-2017

Unfortunately there are some known issues with rand(). This is essentially the same issue as https://issues.cloudera.org/browse/IMPALA-397 (Order by rand() does not work).

Impala's planner doesn't currently fully understand the concept of a non-deterministic or random function, so it will often produce plans that either evaluate rand() repeatedly when logically it shouldn't or caches the value of rand().

In this particular case, it evaluates essentially substitutes random for rand() and re-evaluates it multiple times.

[localhost:21000] > explain select     
case when random < 0.005 then 1
when random < 0.0175 and random >= 0.005 then 2
when random < 0.0175 and random >= 0.0175 then 3
when random < 0.2500 and random >= 0.0800 then 4
else 0 end segment,
min(random),max(random),
count(id)
from (
select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j
group by segment;
Query: explain select
case when random < 0.005 then 1
when random < 0.0175 and random >= 0.005 then 2
when random < 0.0175 and random >= 0.0175 then 3
when random < 0.2500 and random >= 0.0800 then 4
else 0 end segment,
min(random),max(random),
count(id)
from (
select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j
group by segment
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain String                                                                                                                                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=80.00MB VCores=1                                                                                                                                                                                                                        |
|                                                                                                                                                                                                                                                                                 |
| PLAN-ROOT SINK                                                                                                                                                                                                                                                                  |
| |                                                                                                                                                                                                                                                                               |
| 01:AGGREGATE [FINALIZE]                                                                                                                                                                                                                                                         |
| |  output: min(rand(1484133976)), max(rand(1484133976)), count(l_orderkey)                                                                                                                                                                                                      |
| |  group by: CASE WHEN rand(1484133976) < 0.005 THEN 1 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.005 THEN 2 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.0175 THEN 3 WHEN rand(1484133976) < 0.2500 AND rand(1484133976) >= 0.0800 THEN 4 ELSE 0 END |
| |                                                                                                                                                                                                                                                                               |
| 02:EXCHANGE [UNPARTITIONED]                                                                                                                                                                                                                                                     |
| |  limit: 1000000                                                                                                                                                                                                                                                               |
| |                                                                                                                                                                                                                                                                               |
| 00:SCAN HDFS [tpch_parquet.lineitem]                                                                                                                                                                                                                                            |
|    partitions=1/1 files=3 size=193.61MB                                                                                                                                                                                                                                         |
|    limit: 1000000                                                                                                                                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Generally rand() will work as expected if it's in the select list of the outer query. E.g. "create table tmp_rand as select rand(unix_timestamp()) from table" would do what you expect. So you could maybe work around it by creating a temporary table instead of using a subquery (I know that's not ideal).

AndrasK · ‎01-11-2017

Thank you very much for your very quick and useful answer! Glad to know that it is a known bug.

Anyway, I have been playing around, and made it work!
For some reason it does work. Have no idea why though..

select
  a+b+c+d segment
  ,min(r)
  ,max(r)
  ,count(ANYID) COUNTER
from (
  select
    ANYID
    ,r
    ,if(r >= 0.005,1,0) a
    ,if(r >= 0.0175,1,0) b
    ,if(r >= 0.08,1,0) c
    ,if(r >= 0.025,1,0) d
  from (
    select ANYID,RAND(unix_timestamp()) r from ANYTABLE
  ) foo
) bar
group by segment
order by segment

Do you have any idea why It does work? 🙂
Thanks
Andras

Cloudera Community

Support Questions

Impala RANDOM, cases