Reply
Highlighted
New Contributor
Posts: 4
Registered: ‎08-07-2018

Partitons don't work well

[ Edited ]

I'm using Hive on MRv2, and I'm trying to optimize hive queries.

The database assumes purchase history of convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.

 

SELECT c.gender, 
       g.name, 
       i.name, 
       Sum(ph.num) 
FROM   purchase_histories ph 
       JOIN customers c 
         ON ( c.id = ph.cus_id 
              AND ph.dt < $var1
AND ph.dt > $var2 )         JOIN items i           ON ( i.id = ph.item_id )         JOIN genres g           ON ( g.id = i.gen_id )  GROUP  BY c.gender,            g.name,            i.name; 

I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).

 

I compared this database and no partition database(contains same tables) by above query. I input some kinds of valuesto $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.

 

I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but not partitioned database is about 150. I don't think the many mapper is definitely good but 10~30 mappers is too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.

 

The result of EXPLAIN are no partition and partition. And execution logs are exe_log_no_partition and exe_log_partitioned.

 

Thanks.

Announcements