I'm using Hive on MRv2, and I'm trying to optimize hive queries.
The database assumes the purchase history of a convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.
SELECT c.gender, g.name, i.name, Sum(ph.num) FROM purchase_histories ph JOIN customers c ON ( c.id = ph.cus_id AND ph.dt < $var1 AND ph.dt > $var2 ) JOIN items i ON ( i.id = ph.item_id ) JOIN genres g ON ( g.id = i.gen_id ) GROUP BY c.gender, g.name, i.name;
I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).
I compared this database and no partition database(contains same tables) by above query. I input some kinds of values to $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.
I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but a not partitioned database is about 150. I don't think the many mappers is definitely good but 10~30 mappers are too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.
1, I saw EXPLAIN result of partitioned and thought that the number of mapper is calculated from below formula:
(the table size 2619958583)/(mapreduce.input.fileinputformat.split.maxsize=256000000)
Is it wrong?
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
SELECT c.gender, g.name, i.name, Sum(ph.num) FROM purchase_histories ph JOIN customers c ON ( c.id = ph.cus_id AND ph.dt < $var1 AND ph.dt > $var2 ) JOIN items i ON ( i.id = ph.item_id ) JOIN genres g ON ( g.id = i.gen_id ) GROUP BY c.gender, g.name, i.name ORDER BY NULL