Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Partitons don't work well

Partitons don't work well

New Contributor

I'm using Hive on MRv2, and I'm trying to optimize hive queries.

The database assumes purchase history of convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.


SELECT c.gender,,, 
FROM   purchase_histories ph 
       JOIN customers c 
         ON ( = ph.cus_id 
              AND ph.dt < $var1
AND ph.dt > $var2 )         JOIN items i           ON ( = ph.item_id )         JOIN genres g           ON ( = i.gen_id )  GROUP  BY c.gender,  ,  ; 

I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).


I compared this database and no partition database(contains same tables) by above query. I input some kinds of valuesto $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.


I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but not partitioned database is about 150. I don't think the many mapper is definitely good but 10~30 mappers is too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.


The result of EXPLAIN are no partition and partition. And execution logs are exe_log_no_partition and exe_log_partitioned.



Don't have an account?
Coming from Hortonworks? Activate your account here