- |
- Sign Out

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Cloudera Community
- News
- News & Announcements
- Getting Started
- Hadoop 101
- Configuring and Managing
- Cloudera Manager
- CDH Topics (w/o CM)
- Using the Platform
- Batch (MR, YARN, Oozie)
- Data Ingest (Sqoop, Flume...
- Storage (HDFS, HBase...
- Hue
- Hive
- Impala
- Cloudera Data Science Work...
- Data Science
- Search (SolrCloud)
- Spark
- Cloudera Labs
- Data Management
- Data Discovery, Optimization
- Security/Sentry
- Building on the Platform
- Kite SDK
- Cloudera Altus
- Cloudera Altus Director
- Cloudera Altus Cloud Services Q&A
- Cloudera Altus Cloud Services Knowledge Base
- Suggestions
- Off Topic and Suggestions
- Cloudera AMA

- Cloudera Community
- :
- Using the Platform
- :
- Impala
- :
- Impala RANDOM, cases

Topic Options

- Start Article
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-11-2017 07:12 AM

Hi Everyone,

I've come across a very strange behavior of impala.

My aim is to generate random numbers for each rows, and then classified based on the probability to each segment. For some reason, the below query fails, and mixing up segments.

select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(ANYID) from ( select ANYID,RAND(unix_timestamp()) random from ANYTABLE limit 1000000) j group by segment

Can anyone help me with any explanation? I simply can't get my head around.

Thank you so much,

Andras

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-11-2017 11:29 AM

Unfortunately there are some known issues with rand(). This is essentially the same issue as https://issues.cloudera.org/browse/IMPALA-397 (Order by rand() does not work).

Impala's planner doesn't currently fully understand the concept of a non-deterministic or random function, so it will often produce plans that either evaluate rand() repeatedly when logically it shouldn't or caches the value of rand().

In this particular case, it evaluates essentially substitutes random for rand() and re-evaluates it multiple times.

[localhost:21000] > explain select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(id) from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j group by segment; Query: explain select case when random < 0.005 then 1 when random < 0.0175 and random >= 0.005 then 2 when random < 0.0175 and random >= 0.0175 then 3 when random < 0.2500 and random >= 0.0800 then 4 else 0 end segment, min(random),max(random), count(id) from ( select l_orderkey id,RAND(unix_timestamp()) random from tpch_parquet.lineitem limit 1000000) j group by segment +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Explain String | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Estimated Per-Host Requirements: Memory=80.00MB VCores=1 | | | | PLAN-ROOT SINK | | | | | 01:AGGREGATE [FINALIZE] | | | output: min(rand(1484133976)), max(rand(1484133976)), count(l_orderkey) | | | group by: CASE WHEN rand(1484133976) < 0.005 THEN 1 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.005 THEN 2 WHEN rand(1484133976) < 0.0175 AND rand(1484133976) >= 0.0175 THEN 3 WHEN rand(1484133976) < 0.2500 AND rand(1484133976) >= 0.0800 THEN 4 ELSE 0 END | | | | | 02:EXCHANGE [UNPARTITIONED] | | | limit: 1000000 | | | | | 00:SCAN HDFS [tpch_parquet.lineitem] | | partitions=1/1 files=3 size=193.61MB | | limit: 1000000 | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Generally rand() will work as expected if it's in the select list of the outer query. E.g. "create table tmp_rand as select rand(unix_timestamp()) from table" would do what you expect. So you could maybe work around it by creating a temporary table instead of using a subquery (I know that's not ideal).

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-11-2017 12:51 PM

Thank you very much for your very quick and useful answer! Glad to know that it is a known bug.

Anyway, I have been playing around, and made it work!

For some reason it does work. Have no idea why though..

select a+b+c+d segment ,min(r) ,max(r) ,count(ANYID) COUNTER from ( select ANYID ,r ,if(r >= 0.005,1,0) a ,if(r >= 0.0175,1,0) b ,if(r >= 0.08,1,0) c ,if(r >= 0.025,1,0) d from ( select ANYID,RAND(unix_timestamp()) r from ANYTABLE ) foo ) bar group by segment order by segment

Do you have any idea why It does work? :)

Thanks

Andras

New solutions