
Hive count(distinct) performance is very poor in CDH 6.0.1

New Contributor

Hive in CDH 5.14.2 is much faster than Hive in CDH 6.0.1. The same Hive SQL runs fast in 5.14.2 but is very slow in CDH 6.0.1. Only queries that use count(distinct) have this issue. This question is similar to



Is this specific to Hive on Spark? Can you please provide a test example that reproduces the issue? How large is the data?


New Contributor


In our production environment I use Hive on Spark. Below is a simple SQL query, but it is too slow.

  select count(1) as pv,
         count(distinct user_id) as uv
    from ods_action_d
   where dt >= '2018-11-09' and dt <= '2018-11-10'
   group by biz
This query has 2 stages. The first stage takes 1.2 minutes; its input data size is 243.2 GB and its shuffle write size is 167.7 MB.
In the second stage, the shuffle read size is 167.7 MB, but it takes 3.3 minutes.
In total it took 273 seconds.
When I ran it on Hive on MR, it took 478 seconds.
Please help me. Thanks.
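
For anyone trying to reproduce this: one common way to check whether the count(distinct) step is the slow part is to compare the query against a two-level rewrite that first groups by (biz, user_id) and then counts per biz. The sketch below only checks that the two forms return the same numbers on a tiny made-up table. It uses Python's sqlite3 module as a stand-in for Hive, the sample rows are invented, and it says nothing about whether the rewrite is actually faster on CDH 6.0.1; that would have to be measured on the real cluster.

```python
import sqlite3

# Query shape from the post (table and column names as posted).
ORIGINAL = """
select count(1) as pv,
       count(distinct user_id) as uv
  from ods_action_d
 where dt >= '2018-11-09' and dt <= '2018-11-10'
 group by biz
"""

# Two-level rewrite (an assumption, not taken from the thread):
# the inner query collapses rows to one per (biz, user_id),
# the outer query counts those rows per biz.
REWRITTEN = """
select sum(cnt) as pv,
       count(user_id) as uv
  from (select biz, user_id, count(1) as cnt
          from ods_action_d
         where dt >= '2018-11-09' and dt <= '2018-11-10'
         group by biz, user_id) t
 group by biz
"""


def run(conn, sql):
    """Run a query and return its rows in a stable order for comparison."""
    return sorted(conn.execute(sql).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute("create table ods_action_d (biz text, user_id text, dt text)")

# Invented sample rows: repeated users, a NULL user_id, and one row
# outside the date range that both queries must ignore.
rows = [
    ("a", "u1", "2018-11-09"),
    ("a", "u1", "2018-11-10"),
    ("a", "u2", "2018-11-10"),
    ("a", None, "2018-11-10"),
    ("b", "u1", "2018-11-09"),
    ("b", "u3", "2018-11-11"),
]
conn.executemany("insert into ods_action_d values (?, ?, ?)", rows)

original_result = run(conn, ORIGINAL)
rewritten_result = run(conn, REWRITTEN)
print("original :", original_result)
print("rewritten:", rewritten_result)
```

Both forms skip NULL user_id values when counting uv while still counting those rows in pv, so the results should match. If the rewritten form runs much faster on the cluster, that would point at how the count(distinct) stage is planned rather than at the scan of the input data.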

