Support Questions

Find answers, ask questions, and share your expertise

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Cloudera Community
- :
- Support
- :
- Support Questions
- :
- Impala RANDOM, cases

Announcements

Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

Impala RANDOM, cases

New Contributor

Created 01-11-2017 07:12 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Everyone,

I've come across a very strange behavior of impala.

My aim is to generate random numbers for each rows, and then classified based on the probability to each segment. For some reason, the below query fails, and mixing up segments.

select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(ANYID) from ( select ANYID,RAND(unix_timestamp()) random from ANYTABLE limit 1000000) j group by segment

Can anyone help me with any explanation? I simply can't get my head around.

Thank you so much,

Andras

2 REPLIES 2

Re: Impala RANDOM, cases

Master Collaborator

Created 01-11-2017 11:29 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Unfortunately there are some known issues with rand(). This is essentially the same issue as https://issues.cloudera.org/browse/IMPALA-397 (Order by rand() does not work).

Impala's planner doesn't currently fully understand the concept of a non-deterministic or random function, so it will often produce plans that either evaluate rand() repeatedly when logically it shouldn't or caches the value of rand().

In this particular case, it evaluates essentially substitutes random for rand() and re-evaluates it multiple times.

[localhost:21000] > explain select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(id) from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j group by segment; Query: explain select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(id) from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j group by segment +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Explain String | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=80.00MB VCores=1 | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: min(rand(1484133976)), max(rand(1484133976)), count(l_orderkey) | | | group by: CASE WHEN rand(1484133976) < 0.005 THEN 1 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.005 THEN 2 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.0175 THEN 3 WHEN rand(1484133976) < 0.2500 AND rand(1484133976) >= 0.0800 THEN 4 ELSE 0 END | | | | | 02:EXCHANGE [UNPARTITIONED] | | | limit: 1000000 | | | | | 00:SCAN HDFS [tpch_parquet.lineitem] | | partitions=1/1 files=3 size=193.61MB | | limit: 1000000 | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Generally rand() will work as expected if it's in the select list of the outer query. E.g. "create table tmp_rand as select rand(unix_timestamp()) from table" would do what you expect. So you could maybe work around it by creating a temporary table instead of using a subquery (I know that's not ideal).

Re: Impala RANDOM, cases

New Contributor

Created 01-11-2017 12:51 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Thank you very much for your very quick and useful answer! Glad to know that it is a known bug.

Anyway, I have been playing around, and made it work!

For some reason it does work. Have no idea why though..

select a+b+c+d segment ,min(r) ,max(r) ,count(ANYID) COUNTER from ( select ANYID ,r ,if(r >= 0.005,1,0) a ,if(r >= 0.0175,1,0) b ,if(r >= 0.08,1,0) c ,if(r >= 0.025,1,0) d from ( select ANYID,RAND(unix_timestamp()) r from ANYTABLE ) foo ) bar group by segment order by segment

Do you have any idea why It does work? :)

Thanks

Andras

Coming from Hortonworks? Activate your account here