[Solved] How to make an effect by partitions in Hive
Looking to automatically optimize YOUR SQL query? Start for free.

EverSQL Database Performance Knowledge Base

How to make an effect by partitions in Hive

I'm using Hive on MRv2, and I'm trying to optimize hive queries.

The database assumes the purchase history of a convenience store. This database contains 6 tables(customers(1M rows), shops(1K rows), employees(5K rows), genres(30 rows), items(3.5K rows), purchase_histories(1G rows)), and I made query that retrieves sum of number purchased for each item, genre and customers' gender.

SELECT c.gender, 
       g.name, 
       i.name, 
       Sum(ph.num) 
FROM   purchase_histories ph 
       JOIN customers c 
         ON ( c.id = ph.cus_id 
              AND ph.dt < $var1
              AND ph.dt > $var2 ) 
       JOIN items i 
         ON ( i.id = ph.item_id ) 
       JOIN genres g 
         ON ( g.id = i.gen_id ) 
GROUP  BY c.gender, 
          g.name, 
          i.name; 

I made partition purchase_histories(dt), items(gen_id) and customers(gender, byear).

I compared this database and no partition database(contains same tables) by above query. I input some kinds of values to $var1 and $var2 to make reference numbers of rows of purchase_histories become 10,000,000.

I measured the process time, and I found the no partition database is faster(or equal) than the other. I checked execution logs and I found that the mapper number of the partitioned database is about 10~30 but a not partitioned database is about 150. I don't think the many mappers is definitely good but 10~30 mappers are too small. So I thought that I have to check some configuration about map numbers or memory size. But I don't know which configure to change and my thought is correct.

The result of EXPLAIN are no_partitions and partitioned. And execution logs are exe_log_no_partition and exe_log_partitioned.

Thanks.

Addition

1, I saw EXPLAIN result of partitioned and thought that the number of mapper is calculated from below formula:

(the table size 2619958583)/(mapreduce.input.fileinputformat.split.maxsize=256000000)

Is it wrong?

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Explicitly ORDER BY After GROUP BY (modified query below): By default, the database sorts all 'GROUP BY col1, col2, ...' queries as if you specified 'ORDER BY col1, col2, ...' in the query as well. If a query includes a GROUP BY clause but you want to avoid the overhead of sorting the result, you can suppress sorting by specifying 'ORDER BY NULL'.
The optimized query:
SELECT
        c.gender,
        g.name,
        i.name,
        Sum(ph.num) 
    FROM
        purchase_histories ph 
    JOIN
        customers c 
            ON (
                c.id = ph.cus_id 
                AND ph.dt < $var1 
                AND ph.dt > $var2
            ) 
    JOIN
        items i 
            ON (
                i.id = ph.item_id
            ) 
    JOIN
        genres g 
            ON (
                g.id = i.gen_id
            ) 
    GROUP BY
        c.gender,
        g.name,
        i.name 
    ORDER BY
        NULL

Related Articles



* original question posted on StackOverflow here.