Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here. Want to know more about what has changed? Check out the Community News blog.
I'm using Hive on MRv2, and I'm trying to optimize hive queries.
The database assumes purchase history of convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.
FROM purchase_histories ph
JOIN customers c
ON ( c.id = ph.cus_id
AND ph.dt < $var1 AND ph.dt > $var2 )
JOIN items i
ON ( i.id = ph.item_id )
JOIN genres g
ON ( g.id = i.gen_id )
GROUP BY c.gender,
I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).
I compared this database and no partition database(contains same tables) by above query. I input some kinds of valuesto $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.
I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but not partitioned database is about 150. I don't think the many mapper is definitely good but 10~30 mappers is too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.