I have a table of daily stock quotes with ~40M rows and I'm trying to run a query that spits out the most recent date and the number of stocks that made a new one year high on that date. For example:
trade_date | one_year_highs
---------------------------
2021-06-16 | 912
The problem is that my query is very slow (49 seconds) and I haven't been able to figure out a way to drastically improve the performance.
One major problem I'm seeing is that the window function is being evaluated for every row in each partition when I only need it to evaluate on the row with the latest date (the only date I care about). I thought the where
clause of the outer select
would push down to the subquery, limiting the execution of the window function on the row with max(trade_date)
, but clearly that isn't happening.
I was able to improve performance somewhat by adding the where
clause to the subquery, but it's still not enough. The window function only needs to evaluate once per symbol (partition) on the latest date, so roughly 9000 times, but it's being evaluated over 2.1M times.
What am I missing here?
What I have so far is:
select
trade_date,
count(*) filter (where new_high=true) as one_year_highs
from (
select
symbol,
trade_date,
adj_close > max(adj_close) over (partition by symbol order by trade_date desc rows between 1 following and 261 following) as new_high
from v_quotes
where trade_date >= (select max(trade_date) from v_quotes) - interval '1 year'
) as tmp
where trade_date=(select max(trade_date) from v_quotes)
group by trade_date;
The output of explain analyze
can be found here: https://explain.depesz.com/s/pOaC
I'm running Postgres version 13 as a managed Digital Ocean instance with 2GB RAM, 1vCPU, and 25GB disk (SSD).
This is how I have my tables set up:
create table if not exists quotes (
symbol varchar(16) not null,
trade_date date not null,
unadj_open numeric not null,
unadj_high numeric not null,
unadj_low numeric not null,
unadj_close numeric not null,
unadj_volume numeric(12) not null,
dividend numeric not null,
split numeric not null,
adj_open numeric not null,
adj_high numeric not null,
adj_low numeric not null,
adj_close numeric not null,
adj_volume numeric(12) not null,
last_updated timestamp (0) without time zone not null,
primary key (symbol, trade_date)
);
create index if not exists idx_symbol on quotes (symbol);
create index if not exists idx_trade_date on quotes (trade_date desc);
create table if not exists excluded_symbols
( symbol varchar(16) not null
, primary key (symbol)
);
create or replace view v_quotes as select * from quotes where symbol not in (select symbol from excluded_symbols);
UPDATE 18 JUN 2021
After some more experimenting, I was able to get the query time down to ~5s which is about an order of magnitude better.
My first change was to create a new index with all of the columns I need (more on this later):
create index if not exists idx_quotes_symbol_adjclose_tradedate on quotes (symbol, adj_close desc, trade_date desc);
Then I changed my query to:
select trade_date, count(*) from (
select distinct on (symbol) symbol, trade_date, adj_close
from v_quotes
where trade_date >= (select max(trade_date) from v_quotes) - interval '1 year'
order by symbol, adj_close desc, trade_date desc
) as tmp
where tmp.trade_date=(select max(trade_date) from v_quotes)
group by trade_date;
Notice in the order by
clause of the subquery that the three columns match exactly with the three columns in the index. That, combined with the fact that I only need to look at the values of those three columns to get the answer I'm looking for, allows the query planner to do an index-only scan. This is fast and avoids the IO overhead of reading rows from the quotes table.
New explain analyze
output: https://explain.depesz.com/s/sj7t
Maybe window functions aren't what I needed in the first place. Is this the best that can be done?
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
CREATE INDEX v_quotes_idx_symbol_adj_close_trade_date ON "v_quotes" ("symbol","adj_close" desc,"trade_date" desc);
CREATE INDEX v_quotes_idx_trade_date ON "v_quotes" ("trade_date");
SELECT
tmp.trade_date,
count(*)
FROM
(SELECT
DISTINCT
ON (symbol) v_quotes.symbol,
v_quotes.trade_date,
v_quotes.adj_close
FROM
v_quotes
WHERE
(
v_quotes.trade_date >= (
SELECT
max(v_quotes.trade_date)
FROM
v_quotes
) - INTERVAL '1 year'
)
AND (
v_quotes.trade_date = (
SELECT
max(v_quotes.trade_date)
FROM
v_quotes
)
)
ORDER BY
v_quotes.symbol,
v_quotes.adj_close DESC,
v_quotes.trade_date DESC) AS tmp
WHERE
1 = 1
GROUP BY
tmp.trade_date