CDH 6.0.1 hive count distinct performance is too bad

New Contributor

Hive in CDH 5.14.2 is much faster than Hive in CDH 6.0.1. The same Hive SQL runs fast in 5.14.2 but is very slow in CDH 6.0.1. Only queries that use count(distinct) have this issue. This question is similar to:

http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/be-careful-CDH-5-5-1-hive-performance-is-too-...

2 REPLIES

Re: CDH 6.0.1 hive count distinct performance is too bad

Guru
Hi,

Is this specific to Hive on Spark? Can you please provide a test example that reproduces the issue? How large is the data?

Thanks

Re: CDH 6.0.1 hive count distinct performance is too bad

New Contributor

Hi,

In the production environment I use Hive on Spark. Below is a simple SQL query, but it is too slow.

  select count(1) as pv,
         count(distinct user_id) as uv
    from ods_action_d
   where dt >= '2018-11-09' and dt <= '2018-11-10'
   group by biz
 
This query runs in 2 stages. The first stage took 1.2 minutes; its input data size was 243.2 GB and its shuffle write size was 167.7 MB.
In the second stage the shuffle read size was 167.7 MB, but it took 3.3 minutes.
In total the query took 273 seconds.
 
When I ran it with Hive on MR, it took 478 seconds.
Please help me. Thanks.
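
For reference, a rewrite that is commonly tried for slow count(distinct) queries is to deduplicate with an inner group by and then count the groups in an outer query. The sketch below applies that idea to the query above. It is only an illustration, not a confirmed fix for CDH 6.0.1: the table, columns, and date range come from the original query, while the aliases t and cnt are made up here, and whether it actually runs faster on this data has not been measured.

```sql
-- Sketch only: two-level aggregation in place of count(distinct user_id).
-- Aliases t and cnt are hypothetical names introduced for this example.
select sum(t.cnt)       as pv,   -- total rows per biz, same as count(1)
       count(t.user_id) as uv    -- one row per (biz, user_id); NULL user_id is skipped,
                                 -- matching count(distinct user_id)
  from (
        -- inner step: one row per (biz, user_id) with its row count
        select biz,
               user_id,
               count(1) as cnt
          from ods_action_d
         where dt >= '2018-11-09' and dt <= '2018-11-10'
         group by biz, user_id
       ) t
 group by t.biz
```

The inner group by spreads the deduplication over (biz, user_id) keys instead of sending every row for one biz value to a single reducer, which may help when biz has few distinct values. Comparing the results and the stage timings of both forms on the same partitions would show whether it makes a difference here.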