I'm trying to optimize a fairly simple query that just gets a list of the top N users ordered by the number of followers they have. I am running into performance issues when I run the query a certain way, and I'm not sure I understand why the other way is better.
I have two tables, user_account
, and following
, with about 30,000 and 9,000 records, respectively.
I also created this materialized view follower_count_mv
to hold cached follower counts for each user:
CREATE MATERIALIZED VIEW follower_count_mv AS
SELECT followed_user_id AS user_id, COUNT(*)::int AS follower_count FROM following
GROUP BY followed_user_id;
There are indexes on user_account.user_id
, follower_count_mv.user_id
, and (follower_count_mv.user_id, follower_count_mv.follower_count)
.
Here is the first, slower query:
SELECT ua.user_id, f.follower_count
FROM user_account ua
LEFT OUTER JOIN follower_count_mv f ON (ua.user_id = f.user_id)
ORDER BY f.follower_count DESC NULLS LAST
LIMIT 3
and its explain analyze:
Limit (cost=42756.30..42756.31 rows=3 width=54) (actual time=70.219..70.220 rows=3 loops=1)
-> Sort (cost=42756.30..42831.88 rows=30234 width=54) (actual time=70.217..70.217 rows=3 loops=1)
Sort Key: f.follower_count
Sort Method: top-N heapsort Memory: 25kB
-> Merge Left Join (cost=0.69..42365.53 rows=30234 width=54) (actual time=0.010..65.185 rows=30234 loops=1)
Merge Cond: ((ua.user_id)::text = (f.followed_user_id)::text)
-> Index Scan using user_account_pkey on user_account ua (cost=0.41..41893.22 rows=30234 width=50) (actual time=0.003..17.876 rows=30234 loops=1)
-> Index Only Scan using follower_count_mv_user_id_follower_count_idx on follower_count_mv fr (cost=0.28..316.52 rows=6416 width=41) (actual time=0.004..0.864 rows=6416 loops=1)
Heap Fetches: 0
Planning time: 0.233 ms
Execution time: 70.240 ms
Here is the second, faster query:
SELECT ua.user_id, f.follower_count
FROM user_account ua
LEFT OUTER JOIN follower_count_mv f ON (ua.user_id = f.user_id)
WHERE ua.user_id IN (SELECT user_id FROM follower_count_mv ORDER BY follower_count DESC limit 3)
and its explain analyze:
Nested Loop Left Join (cost=400.19..425.80 rows=3 width=54) (actual time=2.689..2.785 rows=3 loops=1)
-> Nested Loop (cost=399.91..424.85 rows=3 width=50) (actual time=2.666..2.722 rows=3 loops=1)
-> HashAggregate (cost=399.49..399.52 rows=3 width=37) (actual time=2.627..2.627 rows=3 loops=1)
Group Key: ("ANY_subquery".followed_user_id)::text
-> Subquery Scan on "ANY_subquery" (cost=399.45..399.49 rows=3 width=37) (actual time=2.622..2.624 rows=3 loops=1)
-> Limit (cost=399.45..399.46 rows=3 width=41) (actual time=2.620..2.620 rows=3 loops=1)
-> Sort (cost=399.45..415.49 rows=6416 width=41) (actual time=2.620..2.620 rows=3 loops=1)
Sort Key: follower_count_mv.follower_count
Sort Method: top-N heapsort Memory: 25kB
-> Index Only Scan using follower_count_mv_user_id_follower_count_idx on follower_count_mv (cost=0.28..316.52 rows=6416 width=41) (actual time=0.008..1.312 rows=6416 loops=1)
Heap Fetches: 0
-> Index Scan using user_account_pkey on user_account ua (cost=0.41..8.43 rows=1 width=50) (actual time=0.030..0.030 rows=1 loops=3)
Index Cond: ((user_id)::text = ("ANY_subquery".followed_user_id)::text)
-> Index Only Scan using follower_count_mv_user_id_follower_count_idx on follower_count_mv fr (cost=0.28..0.31 rows=1 width=41) (actual time=0.019..0.019 rows=1 loops=3)
Index Cond: (followed_user_id = (ua.user_id)::text)
Heap Fetches: 0
Planning time: 0.603 ms
Execution time: 2.834 ms
Am I missing something? Is the query planner not smart enough to figure out it should query follower_count_mv
for the top 3 results first, and then join on that? Is this correct Postgres behaviour? Is there a better way to optimize this query?
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
CREATE INDEX follower_mv_idx_user_id_follower_count ON "follower_count_mv" ("user_id","follower_count" desc);
CREATE INDEX follower_mv_idx_follower_count ON "follower_count_mv" ("follower_count" desc);
SELECT
ua.user_id,
ua.follower_count
FROM
user_account ua
LEFT OUTER JOIN
follower_count_mv f
ON (
ua.user_id = f.user_id
)
ORDER BY
f.follower_count DESC NULLS LAST LIMIT 3