[Solved] Postgres - limit window function evaluation to once per partition
Looking to automatically optimize YOUR SQL query? Start for free.

EverSQL Database Performance Knowledge Base

Postgres - limit window function evaluation to once per partition

Database type:

I have a table of daily stock quotes with ~40M rows and I'm trying to run a query that spits out the most recent date and the number of stocks that made a new one year high on that date. For example:

trade_date | one_year_highs
---------------------------
2021-06-16 | 912

The problem is that my query is very slow (49 seconds) and I haven't been able to figure out a way to drastically improve the performance.

One major problem I'm seeing is that the window function is being evaluated for every row in each partition when I only need it to evaluate on the row with the latest date (the only date I care about). I thought the where clause of the outer select would push down to the subquery, limiting the execution of the window function on the row with max(trade_date), but clearly that isn't happening.

I was able to improve performance somewhat by adding the where clause to the subquery, but it's still not enough. The window function only needs to evaluate once per symbol (partition) on the latest date, so roughly 9000 times, but it's being evaluated over 2.1M times.

What am I missing here?

What I have so far is:

select
    trade_date,
    count(*) filter (where new_high=true) as one_year_highs
from (
    select
        symbol,
        trade_date,
        adj_close > max(adj_close) over (partition by symbol order by trade_date desc rows between 1 following and 261 following) as new_high
    from v_quotes
    where trade_date >= (select max(trade_date) from v_quotes) - interval '1 year'
) as tmp
where trade_date=(select max(trade_date) from v_quotes)
group by trade_date;

The output of explain analyze can be found here: https://explain.depesz.com/s/pOaC

I'm running Postgres version 13 as a managed Digital Ocean instance with 2GB RAM, 1vCPU, and 25GB disk (SSD).

This is how I have my tables set up:

create table if not exists quotes (
    symbol varchar(16) not null,
    trade_date date not null,
    unadj_open numeric not null,
    unadj_high numeric not null,
    unadj_low numeric not null,
    unadj_close numeric not null,
    unadj_volume numeric(12) not null,
    dividend numeric not null,
    split numeric not null,
    adj_open numeric not null,
    adj_high numeric not null,
    adj_low numeric not null,
    adj_close numeric not null,
    adj_volume numeric(12) not null,
    last_updated timestamp (0) without time zone not null,
    primary key (symbol, trade_date)
);

create index if not exists idx_symbol on quotes (symbol);
create index if not exists idx_trade_date on quotes (trade_date desc);

create table if not exists excluded_symbols
    ( symbol varchar(16) not null
    , primary key (symbol)
    );

create or replace view v_quotes as select * from quotes where symbol not in (select symbol from excluded_symbols);

UPDATE 18 JUN 2021

After some more experimenting, I was able to get the query time down to ~5s which is about an order of magnitude better.

My first change was to create a new index with all of the columns I need (more on this later):

create index if not exists idx_quotes_symbol_adjclose_tradedate on quotes (symbol, adj_close desc, trade_date desc);

Then I changed my query to:

select trade_date, count(*) from (
    select distinct on (symbol) symbol, trade_date, adj_close
    from v_quotes
    where trade_date >= (select max(trade_date) from v_quotes) - interval '1 year'
    order by symbol, adj_close desc, trade_date desc
) as tmp
where tmp.trade_date=(select max(trade_date) from v_quotes)
group by trade_date;

Notice in the order by clause of the subquery that the three columns match exactly with the three columns in the index. That, combined with the fact that I only need to look at the values of those three columns to get the answer I'm looking for, allows the query planner to do an index-only scan. This is fast and avoids the IO overhead of reading rows from the quotes table.

New explain analyze output: https://explain.depesz.com/s/sj7t

Maybe window functions aren't what I needed in the first place. Is this the best that can be done?

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Subqueries (query line: 5): We advise against using subqueries as they are not optimized well by the optimizer. Therefore, it's recommended to join a newly created temporary table that holds the data, which also includes the relevant search index.
  2. Avoid Subqueries (query line: 15): We advise against using subqueries as they are not optimized well by the optimizer. Therefore, it's recommended to join a newly created temporary table that holds the data, which also includes the relevant search index.
  3. Avoid Subqueries (query line: 23): We advise against using subqueries as they are not optimized well by the optimizer. Therefore, it's recommended to join a newly created temporary table that holds the data, which also includes the relevant search index.
  4. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
  5. Push Filtering Conditions Into Subqueries (modified query below): Parts of the WHERE clause can pushed from the outer query to a subquery / union clause. Applying those conditions as early as possible will allow the database to scan less data and run the query more efficiently.
Optimal indexes for this query:
CREATE INDEX v_quotes_idx_symbol_adj_close_trade_date ON "v_quotes" ("symbol","adj_close" desc,"trade_date" desc);
CREATE INDEX v_quotes_idx_trade_date ON "v_quotes" ("trade_date");
The optimized query:
SELECT
        tmp.trade_date,
        count(*) 
    FROM
        (SELECT
            DISTINCT 
                ON (symbol) v_quotes.symbol,
                v_quotes.trade_date,
                v_quotes.adj_close 
        FROM
            v_quotes 
        WHERE
            (
                v_quotes.trade_date >= (
                    SELECT
                        max(v_quotes.trade_date) 
                    FROM
                        v_quotes
                ) - INTERVAL '1 year'
            ) 
            AND (
                v_quotes.trade_date = (
                    SELECT
                        max(v_quotes.trade_date) 
                    FROM
                        v_quotes
                )
            ) 
        ORDER BY
            v_quotes.symbol,
            v_quotes.adj_close DESC,
            v_quotes.trade_date DESC) AS tmp 
        WHERE
            1 = 1 
        GROUP BY
            tmp.trade_date

Related Articles



* original question posted on StackOverflow here.