I'm using Hive on MRv2, and I'm trying to optimize hive queries.
The database assumes the purchase history of a convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.
SELECT c.gender,
g.name,
i.name,
Sum(ph.num)
FROM purchase_histories ph
JOIN customers c
ON ( c.id = ph.cus_id
AND ph.dt < $var1
AND ph.dt > $var2 )
JOIN items i
ON ( i.id = ph.item_id )
JOIN genres g
ON ( g.id = i.gen_id )
GROUP BY c.gender,
g.name,
i.name;
I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).
I compared this database and no partition database(contains same tables) by above query. I input some kinds of values to $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.
I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but a not partitioned database is about 150. I don't think the many mappers is definitely good but 10~30 mappers are too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.
The result of EXPLAIN are no_partitions and partitioned. And execution logs are exe_log_no_partition and exe_log_partitioned.
Thanks.
Addition
1, I saw EXPLAIN result of partitioned and thought that the number of mapper is calculated from below formula:
(the table size 2619958583)/(mapreduce.input.fileinputformat.split.maxsize=256000000)
Is it wrong?
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
SELECT
c.gender,
g.name,
i.name,
Sum(ph.num)
FROM
purchase_histories ph
JOIN
customers c
ON (
c.id = ph.cus_id
AND ph.dt < $var1
AND ph.dt > $var2
)
JOIN
items i
ON (
i.id = ph.item_id
)
JOIN
genres g
ON (
g.id = i.gen_id
)
GROUP BY
c.gender,
g.name,
i.name
ORDER BY
NULL