08-28-2018 11:06 PM - edited 08-28-2018 11:28 PM
I'm using Hive on MRv2, and I'm trying to optimize hive queries.
The database assumes purchase history of convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.
SELECT c.gender, g.name, i.name, Sum(ph.num) FROM purchase_histories ph JOIN customers c ON ( c.id = ph.cus_id AND ph.dt < $var1
AND ph.dt > $var2 ) JOIN items i ON ( i.id = ph.item_id ) JOIN genres g ON ( g.id = i.gen_id ) GROUP BY c.gender, g.name, i.name;
I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).
I compared this database and no partition database(contains same tables) by above query. I input some kinds of valuesto $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.
I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but not partitioned database is about 150. I don't think the many mapper is definitely good but 10~30 mappers is too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.