I'm using Hive on MRv2, and I'm trying to optimize hive queries.
The database assumes purchase history of convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.
FROM purchase_histories ph
JOIN customers c
ON ( c.id = ph.cus_id
AND ph.dt < $var1 AND ph.dt > $var2 )
JOIN items i
ON ( i.id = ph.item_id )
JOIN genres g
ON ( g.id = i.gen_id )
GROUP BY c.gender,
I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).
I compared this database and no partition database(contains same tables) by above query. I input some kinds of valuesto $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.
I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but not partitioned database is about 150. I don't think the many mapper is definitely good but 10~30 mappers is too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.