I'd like to speed up a query that performs quite a large aggregation. The goal is to compute a set of rankings for particular categories. The table is simply purchase data, one purchase per row, e.g.:
transactions:
| id | category | netvalue | store_id | suburb | city | country |
|----|----------|----------|----------|--------|------|---------|
| 1  | clothes  | 20       | 12       | A      | AUCK | NZ      |
| 2  | food     | 10       | 11       | B      | WELL | NZ      |
| 3  | gear     | 120      | 15       | A      | CHCH | NZ      |
| 4  | clothes  | 15       | 9        | C      | SYDN | AU      |
I'd like to get a list of all categories for a particular store, plus its rankings compared to every other store in its own suburb, city, and country. For store 12, for example, I'm trying to generate a result set that looks like:
| category | suburb_rank | city_rank | country_rank |
|----------|-------------|-----------|--------------|
| clothes  | 23          | 20        | 250          |
| food     | 27          | 10        | 109          |
...
I thought I'd start off with a set of aggregations, grouped by every ranking field that I'm interested in. This results in a query like:
WITH aggregations AS (
SELECT
category,
SUM(netvalue) AS sum_netvalue,
store_id,
suburb,
city,
country
FROM
transactions
GROUP BY
store_id,
suburb,
city,
country,
category
)
SELECT * FROM aggregations
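To illustrate what this aggregation step produces, here is a minimal runnable sketch against the four sample rows. It uses SQLite's in-memory database for self-containedness (the question targets Postgres 9.2, but this particular query runs identically on both):

```python
import sqlite3

# Build the sample transactions table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    id INTEGER, category TEXT, netvalue INTEGER,
    store_id INTEGER, suburb TEXT, city TEXT, country TEXT)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, "clothes", 20, 12, "A", "AUCK", "NZ"),
     (2, "food",    10, 11, "B", "WELL", "NZ"),
     (3, "gear",   120, 15, "A", "CHCH", "NZ"),
     (4, "clothes", 15,  9, "C", "SYDN", "AU")])

# The aggregation CTE: one row per (store_id, suburb, city, country, category).
rows = conn.execute("""
    WITH aggregations AS (
        SELECT category, SUM(netvalue) AS sum_netvalue,
               store_id, suburb, city, country
        FROM transactions
        GROUP BY store_id, suburb, city, country, category
    )
    SELECT * FROM aggregations ORDER BY store_id
""").fetchall()
for row in rows:
    print(row)
```

With only one transaction per (store, category) in the sample, each output row's `sum_netvalue` equals the single `netvalue`; with real data the sums accumulate per group.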
I'm then using this aggregations table to create rankings for each ranking column:
WITH aggregations AS (...),
suburb_rankings AS (
    SELECT
        row_number() OVER (
            PARTITION BY category
            ORDER BY sum(sum_netvalue) DESC
        ) AS rank,
        category,
        store_id
    FROM
        aggregations
    WHERE
        suburb = @MY_SUBURB
    GROUP BY
        category,
        store_id
),
city_rankings AS (...),
country_rankings AS (...)
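The suburb ranking step can be checked in isolation. A small sketch, with the `aggregations` CTE replaced by a hand-built table of made-up sums and `@MY_SUBURB` hard-coded to `'A'` (SQLite in-memory again; window functions need SQLite ≥ 3.25, and Postgres has had them since 8.4). Because each store sits in exactly one suburb/city/country, `aggregations` already has one row per (store, category), so the re-grouping with `sum(sum_netvalue)` is dropped here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the aggregations CTE: pre-summed rows, one per (store, category).
conn.execute("""CREATE TABLE aggregations (
    category TEXT, sum_netvalue INTEGER,
    store_id INTEGER, suburb TEXT)""")
conn.executemany(
    "INSERT INTO aggregations VALUES (?, ?, ?, ?)",
    [("clothes", 20, 12, "A"),
     ("clothes", 35, 15, "A"),
     ("clothes", 15,  9, "C"),
     ("gear",   120, 15, "A")])

# Rank stores within suburb 'A', per category, by descending total sales.
ranks = conn.execute("""
    SELECT category, store_id,
           row_number() OVER (
               PARTITION BY category
               ORDER BY sum_netvalue DESC
           ) AS rank
    FROM aggregations
    WHERE suburb = 'A'
    ORDER BY category, rank
""").fetchall()
print(ranks)
```

In suburb A, store 15 (35) outranks store 12 (20) for clothes, and store 9 in suburb C is excluded by the `WHERE` filter before ranking.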
Finally, I'm joining each of these rankings tables to a list of categories, to get each category plus its rankings for (suburb, city, country):
...
SELECT
    cats.category,
    suburb_rankings.rank AS suburb_rank,
    city_rankings.rank AS city_rank,
    country_rankings.rank AS country_rank
FROM
    (SELECT DISTINCT category FROM transactions) AS cats
LEFT JOIN
    suburb_rankings
ON
    suburb_rankings.category = cats.category AND suburb_rankings.store_id = @MY_STORE_ID
LEFT JOIN
    city_rankings
ON
    city_rankings.category = cats.category AND city_rankings.store_id = @MY_STORE_ID
LEFT JOIN
    country_rankings
ON
    country_rankings.category = cats.category AND country_rankings.store_id = @MY_STORE_ID
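Putting the three steps together, here is the full pipeline as a runnable sketch over the four sample rows (SQLite in-memory; `@MY_SUBURB` etc. become bound parameters, set here to store 12's location). Since each store sits in one suburb/city/country, the ranking CTEs skip the redundant re-grouping:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    id INTEGER, category TEXT, netvalue INTEGER,
    store_id INTEGER, suburb TEXT, city TEXT, country TEXT)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, "clothes", 20, 12, "A", "AUCK", "NZ"),
     (2, "food",    10, 11, "B", "WELL", "NZ"),
     (3, "gear",   120, 15, "A", "CHCH", "NZ"),
     (4, "clothes", 15,  9, "C", "SYDN", "AU")])

QUERY = """
WITH aggregations AS (
    SELECT category, SUM(netvalue) AS sum_netvalue,
           store_id, suburb, city, country
    FROM transactions
    GROUP BY store_id, suburb, city, country, category
),
suburb_rankings AS (
    SELECT category, store_id,
           row_number() OVER (PARTITION BY category
                              ORDER BY sum_netvalue DESC) AS rank
    FROM aggregations WHERE suburb = :suburb
),
city_rankings AS (
    SELECT category, store_id,
           row_number() OVER (PARTITION BY category
                              ORDER BY sum_netvalue DESC) AS rank
    FROM aggregations WHERE city = :city
),
country_rankings AS (
    SELECT category, store_id,
           row_number() OVER (PARTITION BY category
                              ORDER BY sum_netvalue DESC) AS rank
    FROM aggregations WHERE country = :country
)
SELECT cats.category,
       suburb_rankings.rank  AS suburb_rank,
       city_rankings.rank    AS city_rank,
       country_rankings.rank AS country_rank
FROM (SELECT DISTINCT category FROM transactions) AS cats
LEFT JOIN suburb_rankings
       ON suburb_rankings.category = cats.category
      AND suburb_rankings.store_id = :store_id
LEFT JOIN city_rankings
       ON city_rankings.category = cats.category
      AND city_rankings.store_id = :store_id
LEFT JOIN country_rankings
       ON country_rankings.category = cats.category
      AND country_rankings.store_id = :store_id
ORDER BY cats.category
"""
result = conn.execute(QUERY, {"suburb": "A", "city": "AUCK",
                              "country": "NZ", "store_id": 12}).fetchall()
print(result)
```

On the sample data, store 12 only sells clothes, so `food` and `gear` come back with NULL ranks via the LEFT JOINs, which is the behavior the question's query is after.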
I'm not sure this is the best approach performance-wise, and EXPLAIN ANALYZE shows that most of the time is spent doing the aggregations.
What I'd like to know is: is there a better approach to this type of query? I can't create a table of pre-computed aggregations, because there may be more columns in the original table that we'll want to filter on later.
I'm using Postgres 9.2 and SQLAlchemy, and had a brief look at dogpile caching but wasn't too sure if that would be a good solution.
The following recommendations should help you tune this query. There are two suggestions: an index matching the aggregation's GROUP BY columns, and the query rewritten with fully qualified column names:
CREATE INDEX transactions_idx_store_suburb_city_countr_catego ON "transactions" ("store_id","suburb","city","country","category");
WITH aggregations AS (
    SELECT
        transactions.category,
        SUM(transactions.netvalue) AS sum_netvalue,
        transactions.store_id,
        transactions.suburb,
        transactions.city,
        transactions.country
    FROM
        transactions
    GROUP BY
        transactions.store_id,
        transactions.suburb,
        transactions.city,
        transactions.country,
        transactions.category
)
SELECT
    *
FROM
    aggregations
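Beyond the index, one restructuring worth benchmarking (not part of the tool output above): compute all three ranks in a single pass over `aggregations` by giving each `row_number()` its own `PARTITION BY`, instead of scanning `aggregations` in three separate CTEs. Partitioning by `(suburb, category)` ranks every store within its own suburb, so the `WHERE suburb = ...` pre-filters disappear. A sketch on the sample data (SQLite in-memory; the same windowing works on Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    id INTEGER, category TEXT, netvalue INTEGER,
    store_id INTEGER, suburb TEXT, city TEXT, country TEXT)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, "clothes", 20, 12, "A", "AUCK", "NZ"),
     (2, "food",    10, 11, "B", "WELL", "NZ"),
     (3, "gear",   120, 15, "A", "CHCH", "NZ"),
     (4, "clothes", 15,  9, "C", "SYDN", "AU")])

QUERY = """
WITH aggregations AS (
    SELECT category, SUM(netvalue) AS sum_netvalue,
           store_id, suburb, city, country
    FROM transactions
    GROUP BY store_id, suburb, city, country, category
),
ranked AS (
    -- One scan of aggregations produces all three rank columns;
    -- each window partitions by a different locality plus category.
    SELECT category, store_id,
           row_number() OVER (PARTITION BY suburb, category
                              ORDER BY sum_netvalue DESC) AS suburb_rank,
           row_number() OVER (PARTITION BY city, category
                              ORDER BY sum_netvalue DESC) AS city_rank,
           row_number() OVER (PARTITION BY country, category
                              ORDER BY sum_netvalue DESC) AS country_rank
    FROM aggregations
)
SELECT cats.category, r.suburb_rank, r.city_rank, r.country_rank
FROM (SELECT DISTINCT category FROM transactions) AS cats
LEFT JOIN ranked AS r
       ON r.category = cats.category AND r.store_id = :store_id
ORDER BY cats.category
"""
result = conn.execute(QUERY, {"store_id": 12}).fetchall()
print(result)
```

This ranks every store in every locality (not just the target store's suburb/city/country), so the sort work per window may be higher, but it avoids three separate passes and the triple join; whether it wins depends on the data distribution, so it should be measured with EXPLAIN ANALYZE against the original.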