New Contributor
Posts: 2
Registered: ‎01-17-2019

CDH 6.0.1 hive count distinct performance is too bad

Hive in CDH 5.14.2 is much faster than Hive in CDH 6.0.1. The same Hive SQL runs fast in 5.14.2 but is very slow in CDH 6.0.1. Only queries with count(distinct) have this issue. This question is similar to:

http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/be-careful-CDH-5-5-1-hive-performance-is-too-...

Cloudera Employee
Posts: 832
Registered: ‎03-23-2015

Re: CDH 6.0.1 hive count distinct performance is too bad

Hi,

Is this specific to HiveOnSpark? Can you please provide test examples to replicate the issue? How big is the data?

Thanks
New Contributor
Posts: 2
Registered: ‎01-17-2019

Re: CDH 6.0.1 hive count distinct performance is too bad

hi

In the production environment I use HiveOnSpark. Below is a simple SQL query, but it is too slow.

  select count(1) as pv,
         count(distinct user_id) as uv
    from ods_action_d
   where dt >= '2018-11-09' and dt <= '2018-11-10'
   group by biz
 
This SQL runs in 2 stages. The first stage takes 1.2 minutes; its input data size is 243.2 GB and its shuffle write size is 167.7 MB.
The second stage reads only 167.7 MB of shuffle data, but it takes 3.3 minutes.
In total the query takes 273 seconds.
 
If I run it on HiveOnMR, it takes 478 seconds.
Please help me. Thanks.
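
For reference, a rewrite that is often suggested for slow count(distinct) queries is to group by (biz, user_id) in a subquery first and then aggregate per biz, so the de-duplication is spread over many tasks instead of being done by the few reducers that handle each biz. The sketch below only checks that the two forms return the same numbers. It uses Python's built-in sqlite3 as a stand-in for Hive (an assumption made purely so it can run anywhere), the sample rows are made up, and biz is added to the select list so the results can be compared. Whether the rewrite is actually faster on HiveOnSpark in CDH 6.0.1 has not been measured here.

```python
import sqlite3

# Toy rows (biz, user_id, dt). Made up for illustration only;
# the real ods_action_d table is not available here.
ROWS = [
    ("a", "u1", "2018-11-09"),
    ("a", "u1", "2018-11-10"),
    ("a", "u2", "2018-11-09"),
    ("a", None, "2018-11-09"),   # NULL user_id: counted in pv, not in uv
    ("b", "u1", "2018-11-10"),
    ("b", "u3", "2018-11-08"),   # outside the dt range, filtered out
]

# The query from the post, with biz selected and ordered for comparison.
ORIGINAL = """
select biz, count(1) as pv, count(distinct user_id) as uv
  from ods_action_d
 where dt >= '2018-11-09' and dt <= '2018-11-10'
 group by biz
 order by biz
"""

# Two-level aggregation: de-duplicate (biz, user_id) first, then count.
REWRITTEN = """
select biz, sum(cnt) as pv, count(user_id) as uv
  from (select biz, user_id, count(1) as cnt
          from ods_action_d
         where dt >= '2018-11-09' and dt <= '2018-11-10'
         group by biz, user_id) t
 group by biz
 order by biz
"""


def run(query):
    """Load the toy rows into an in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table ods_action_d (biz text, user_id text, dt text)")
    conn.executemany("insert into ods_action_d values (?, ?, ?)", ROWS)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run(ORIGINAL))
    print(run(REWRITTEN))
```

Only the SQL text inside REWRITTEN is meant to be tried in Hive; the Python wrapper exists just to compare the two result sets on sample data.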