[Solved] Improving SQL query performance for multi-column grouped aggregation

EverSQL Database Performance Knowledge Base

Improving SQL query performance for multi-column grouped aggregation

Database type: PostgreSQL

I'd like to speed up a fairly large aggregation query. The overall goal is to compute a set of rankings for particular categories. The table simply holds purchase data, one purchase per row, e.g.:

transactions:
| id | category | netvalue | store_id | suburb | city | country |
|  1 | clothes  |       20 | 12       |    A   | AUCK |    NZ   |
|  2 | food     |       10 | 11       |    B   | WELL |    NZ   |
|  3 | gear     |      120 | 15       |    A   | CHCH |    NZ   |
|  4 | clothes  |       15 |  9       |    C   | SYDN |    AU   |

I'd like to get a list of all categories for a particular store, together with its ranking against every other store in its own suburb, city and country. For example, for store 12, I'm trying to generate a result set that looks like:

| category | suburb_rank | city_rank | country_rank |
| clothes  | 23          | 20        | 250          |
| food     | 27          | 10        | 109          |
...

I thought I'd start off with a set of aggregations, grouped by every ranking field that I'm interested in. This results in a query like:

WITH aggregations AS (
    SELECT
        category,
        SUM(netvalue) AS sum_netvalue,
        store_id,
        suburb,
        city,
        country
    FROM
        transactions
    GROUP BY
        store_id,
        suburb,
        city,
        country,
        category
)
SELECT * FROM aggregations
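To make the aggregation step concrete, here is a minimal, self-contained sketch that runs the same GROUP BY over the question's sample rows. It uses Python's sqlite3 module purely for illustration (the question targets Postgres 9.2; the grouping syntax is the same in both):

```python
import sqlite3

# In-memory database with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER, category TEXT, netvalue INTEGER,
        store_id INTEGER, suburb TEXT, city TEXT, country TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "clothes", 20, 12, "A", "AUCK", "NZ"),
        (2, "food",    10, 11, "B", "WELL", "NZ"),
        (3, "gear",   120, 15, "A", "CHCH", "NZ"),
        (4, "clothes", 15,  9, "C", "SYDN", "AU"),
    ],
)

# The aggregation CTE body: one output row per
# (store, suburb, city, country, category) combination.
rows = conn.execute("""
    SELECT category, SUM(netvalue) AS sum_netvalue,
           store_id, suburb, city, country
    FROM transactions
    GROUP BY store_id, suburb, city, country, category
""").fetchall()
print(rows)
```

With the four sample rows every group is distinct, so the aggregation returns four rows unchanged apart from the summed netvalue.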

I'm then using this aggregations CTE to create a ranking for each ranking column:

WITH aggregations AS (...),
suburb_rankings AS (
    SELECT
        row_number() OVER (
            PARTITION BY category
            ORDER BY sum(sum_netvalue) DESC
        ) AS rank,
        category,
        store_id
    FROM
        aggregations
    WHERE
        suburb = @MY_SUBURB
    GROUP BY
        category,
        store_id
),
city_rankings AS (...),
country_rankings AS (...)
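Here is a hedged sketch of the suburb-ranking step, again using SQLite via Python for illustration. Because the aggregations CTE already yields one row per (store, category) within a suburb, the extra GROUP BY from the original query is dropped and the window simply orders by sum_netvalue; the `:suburb` parameter stands in for `@MY_SUBURB`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER, category TEXT, netvalue INTEGER,
        store_id INTEGER, suburb TEXT, city TEXT, country TEXT
    )
""")
# Three 'clothes' stores; two share suburb 'A'.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "clothes", 20, 12, "A", "AUCK", "NZ"),
        (2, "clothes", 50, 15, "A", "CHCH", "NZ"),
        (3, "clothes", 15,  9, "C", "SYDN", "AU"),
    ],
)

# Rank stores within one suburb, per category, by aggregated netvalue.
# ('rnk' instead of 'rank' just to sidestep the keyword.)
rows = conn.execute("""
    WITH aggregations AS (
        SELECT category, SUM(netvalue) AS sum_netvalue,
               store_id, suburb, city, country
        FROM transactions
        GROUP BY store_id, suburb, city, country, category
    )
    SELECT row_number() OVER (
               PARTITION BY category
               ORDER BY sum_netvalue DESC
           ) AS rnk,
           category,
           store_id
    FROM aggregations
    WHERE suburb = :suburb
""", {"suburb": "A"}).fetchall()
print(rows)
```

In suburb A, store 15 (sum 50) ranks first for clothes and store 12 (sum 20) second; store 9 is filtered out.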

Finally, I'm joining each of these rankings tables to a list of categories to get the category, then ranking for (suburb, city, country):

...
SELECT
    categories.category,
    suburb_rankings.rank AS suburb_rank,
    city_rankings.rank AS city_rank,
    country_rankings.rank AS country_rank
FROM
    (SELECT DISTINCT category FROM transactions) AS categories
LEFT JOIN
    suburb_rankings
ON
    suburb_rankings.category = categories.category AND suburb_rankings.store_id = @MY_STORE_ID
LEFT JOIN
    city_rankings
ON
    city_rankings.category = categories.category AND city_rankings.store_id = @MY_STORE_ID
LEFT JOIN
    country_rankings
ON
    country_rankings.category = categories.category AND country_rankings.store_id = @MY_STORE_ID
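Putting the pieces together, here is a runnable end-to-end sketch of the whole query shape (aggregation CTE, three ranking CTEs, final three-way LEFT JOIN), with made-up sample data chosen so that store 12 ranks 2nd in its suburb, 3rd in its city and 4th in its country. SQLite via Python stands in for Postgres, and named parameters stand in for the `@...` placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER, category TEXT, netvalue INTEGER,
        store_id INTEGER, suburb TEXT, city TEXT, country TEXT
    )
""")
# Four 'clothes' stores: two in suburb A, three in city AUCK, four in NZ.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "clothes", 20, 12, "A", "AUCK", "NZ"),
        (2, "clothes", 40, 13, "A", "AUCK", "NZ"),
        (3, "clothes", 60, 14, "B", "AUCK", "NZ"),
        (4, "clothes", 80, 15, "C", "CHCH", "NZ"),
    ],
)

rows = conn.execute("""
    WITH aggregations AS (
        SELECT category, SUM(netvalue) AS sum_netvalue,
               store_id, suburb, city, country
        FROM transactions
        GROUP BY store_id, suburb, city, country, category
    ),
    suburb_rankings AS (
        SELECT row_number() OVER (PARTITION BY category
                                  ORDER BY sum_netvalue DESC) AS rnk,
               category, store_id
        FROM aggregations WHERE suburb = :suburb
    ),
    city_rankings AS (
        SELECT row_number() OVER (PARTITION BY category
                                  ORDER BY sum_netvalue DESC) AS rnk,
               category, store_id
        FROM aggregations WHERE city = :city
    ),
    country_rankings AS (
        SELECT row_number() OVER (PARTITION BY category
                                  ORDER BY sum_netvalue DESC) AS rnk,
               category, store_id
        FROM aggregations WHERE country = :country
    )
    SELECT categories.category,
           suburb_rankings.rnk  AS suburb_rank,
           city_rankings.rnk    AS city_rank,
           country_rankings.rnk AS country_rank
    FROM (SELECT DISTINCT category FROM transactions) AS categories
    LEFT JOIN suburb_rankings
      ON suburb_rankings.category = categories.category
     AND suburb_rankings.store_id = :store_id
    LEFT JOIN city_rankings
      ON city_rankings.category = categories.category
     AND city_rankings.store_id = :store_id
    LEFT JOIN country_rankings
      ON country_rankings.category = categories.category
     AND country_rankings.store_id = :store_id
""", {"suburb": "A", "city": "AUCK", "country": "NZ", "store_id": 12}).fetchall()
print(rows)
```

The LEFT JOINs keep a category row even when the target store has no sales in it, which is why the rank columns are nullable.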

I'm not sure if this is the best approach performance-wise, and EXPLAIN ANALYZE shows that most of the time is spent doing the aggregations.

What I'd like to know is, is there a better approach to this type of query? I can't create a table of pre-computed aggregations because there may be more columns in the original table that we may want to filter on later.

I'm using Postgres 9.2 and SQLAlchemy, and had a brief look at dogpile caching but wasn't too sure if that would be a good solution.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Selecting Unnecessary Columns (query line: 16): Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
  2. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
CREATE INDEX transactions_idx_store_suburb_city_countr_catego ON "transactions" ("store_id","suburb","city","country","category");
The optimized query:
WITH aggregations AS (SELECT
        transactions.category,
        SUM(transactions.netvalue) AS sum_netvalue,
        transactions.store_id,
        transactions.suburb,
        transactions.city,
        transactions.country 
    FROM
        transactions 
    GROUP BY
        transactions.store_id,
        transactions.suburb,
        transactions.city,
        transactions.country,
        transactions.category) SELECT
        * 
    FROM
        aggregations
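Beyond the index recommendation above, one reformulation worth benchmarking (not part of EverSQL's output, and assuming each store belongs to exactly one suburb, city and country, as in the sample data) is to compute all three ranks in a single pass with three window functions over the aggregation, instead of scanning it three times. Postgres 9.2 supports this window syntax; the sketch below uses SQLite via Python for illustration, with the same hypothetical sample data as before:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER, category TEXT, netvalue INTEGER,
        store_id INTEGER, suburb TEXT, city TEXT, country TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "clothes", 20, 12, "A", "AUCK", "NZ"),
        (2, "clothes", 40, 13, "A", "AUCK", "NZ"),
        (3, "clothes", 60, 14, "B", "AUCK", "NZ"),
        (4, "clothes", 80, 15, "C", "CHCH", "NZ"),
    ],
)

# One aggregation scan, three windows: each partitions by a location
# column plus category, so every store gets all three ranks at once.
rows = conn.execute("""
    WITH aggregations AS (
        SELECT category, SUM(netvalue) AS sum_netvalue,
               store_id, suburb, city, country
        FROM transactions
        GROUP BY store_id, suburb, city, country, category
    ),
    ranked AS (
        SELECT category, store_id,
               row_number() OVER (PARTITION BY suburb, category
                                  ORDER BY sum_netvalue DESC) AS suburb_rank,
               row_number() OVER (PARTITION BY city, category
                                  ORDER BY sum_netvalue DESC) AS city_rank,
               row_number() OVER (PARTITION BY country, category
                                  ORDER BY sum_netvalue DESC) AS country_rank
        FROM aggregations
    )
    SELECT category, suburb_rank, city_rank, country_rank
    FROM ranked
    WHERE store_id = :store_id
""", {"store_id": 12}).fetchall()
print(rows)
```

This replaces three filtered scans of the aggregation with one, at the cost of ranking every store rather than only those sharing a location with the target store; whether that trade-off wins depends on the data distribution, so measure both.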


* Original question posted on StackOverflow.