I know that one should avoid of COUNT DISTINCT usign Group by statment but I've got to many variables in my data set (more than 2000) to do it this separately.
To give you ilustration the codes Looks like:
count(distinct ann_navg_meantemp) as ann_navg_meantemp,
count(distinct ann_navg_mintemp) as ann_navg_mintemp,
count(distinct ann_nheatdays) as ann_nheatdays,
count(distinct ann_wind_mph) as ann_wind_mph,
count(distinct cann_nheatdays) as cann_nheatdays,
count(distinct mean_high_temp) as mean_high_temp,
count(distinct mean_low_temp) as mean_low_temp,
count(distinct hhfore) as hhfore,
count(distinct popfore) as popfore
Due to many mappers it is processing very slowly or it fails.
I've tried to look in to the programing hive book but there is nothing about optimisation of this process.
Does anybody has similar experience and can help me to solve the issue?
@Pavol Namer which version of hive do you have? If optimization mentioned in HIVE-10568 is not there, you might want to rewrite your query.
select count(distinct ann_navg_meantemp) from tbl;
can be rewritten as:
select count(1) from (select distinct ann_navg_meantemp from tbl) ann_navg_meantemp;
I'm using hive 2.0.
Thank you I know that best practice is to rewrite query like you have shown because there is too many mappers.
The issues here is that I have more than 2000 variables and that would mean that I need either submit more than 2000 queries or make 2000 nested select statetment with 2000 nested join statetments.
Is there any other way how to over this?