I have an Events table with 30 million rows. The following query returns in 25 seconds:
SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" is null
AND tstzrange('2016-04-21T12:12:36-07:00', '2016-04-21T12:22:36-07:00') @> lower(time_range)
AND ("status" is null or (status->>'pre_processed') IS NULL)
status is a jsonb column with an index on status->>'pre_processed' (sketched below, after the other indexes), and time_range is of type TSTZRANGE. Here are the other indexes that were created on the events table:
CREATE INDEX events_time_range_idx ON events USING gist (time_range);
CREATE INDEX events_lower_time_range_index ON events (lower(time_range));
CREATE INDEX events_upper_time_range_index ON events (upper(time_range));
CREATE INDEX events_calendar_id_index ON events (calendar_id);
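For reference, the status index was created roughly like this (the name and exact definition shown here are just illustrative):

-- Illustrative DDL: an expression index on the jsonb field described above;
-- the actual index name and definition may differ.
CREATE INDEX events_status_pre_processed_index ON events ((status->>'pre_processed'));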
I'm definitely out of my comfort zone on this and am trying to reduce the query time. Here's the output of explain analyze:
HashAggregate (cost=7486635.89..7486650.53 rows=1464 width=48) (actual time=26989.272..26989.306 rows=98 loops=1)
Group Key: events.id, calendars.user_id
-> Nested Loop Left Join (cost=0.42..7486628.57 rows=1464 width=48) (actual time=316.110..26988.941 rows=98 loops=1)
-> Seq Scan on events (cost=0.00..7475629.43 rows=1464 width=50) (actual time=316.049..26985.344 rows=98 loops=1)
Filter: ((deleted_at IS NULL) AND ((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange @> lower(time_range)))
Rows Removed by Filter: 31592898
-> Index Scan using calendars_pkey on calendars (cost=0.42..7.50 rows=1 width=48) (actual time=0.030..0.031 rows=1 loops=98)
Index Cond: (events.calendar_id = (id)::text)
Planning time: 1.468 ms
Execution time: 26989.370 ms
And here is the explain analyze with the events.deleted_at part of the query removed:
HashAggregate (cost=7487382.57..7487398.33 rows=1576 width=48) (actual time=23880.466..23880.503 rows=115 loops=1)
Group Key: events.id, calendars.user_id
-> Nested Loop Left Join (cost=0.42..7487374.69 rows=1576 width=48) (actual time=16.612..23880.114 rows=115 loops=1)
-> Seq Scan on events (cost=0.00..7475629.43 rows=1576 width=50) (actual time=16.576..23876.844 rows=115 loops=1)
Filter: (((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange @> lower(time_range)))
Rows Removed by Filter: 31592881
-> Index Scan using calendars_pkey on calendars (cost=0.42..7.44 rows=1 width=48) (actual time=0.022..0.023 rows=1 loops=115)
Index Cond: (events.calendar_id = (id)::text)
Planning time: 0.372 ms
Execution time: 23880.571 ms
I added the index on the status column. Everything else was already there, and I'm unsure how to proceed. Any suggestions on how to get the query time down to a more manageable number?
The following recommendations should help with tuning this query. You'll find 3 sections below:
SELECT
DISTINCT "events"."id",
"calendars"."user_id"
FROM
"events"
LEFT JOIN
"calendars"
ON "events"."calendar_id" = "calendars"."id"
WHERE
"events"."deleted_at" IS NULL
AND tstzrange('2016-04-21T12:12:36-07:00', '2016-04-21T12:22:36-07:00') @> lower(time_range)
AND (
"status" IS NULL
OR (
status->>'pre_processed'
) IS NULL
)
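One direction worth trying: the predicate tstzrange(...) @> lower(time_range) cannot use the btree index events_lower_time_range_index, because @> is not an operator a btree index supports. Since the two-argument tstzrange constructor builds an inclusive-start, exclusive-end range, the containment check can be rewritten as plain bound comparisons with the same semantics, which the planner can match against that index. A sketch of the rewritten query:

SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" IS NULL
  -- Same window as tstzrange(start, end) @> lower(time_range):
  -- inclusive at the start, exclusive at the end.
  AND lower(time_range) >= '2016-04-21T12:12:36-07:00'
  AND lower(time_range) <  '2016-04-21T12:22:36-07:00'
  AND ("status" IS NULL OR (status->>'pre_processed') IS NULL);

If EXPLAIN ANALYZE then shows an index scan on events_lower_time_range_index instead of the sequential scan, a partial index (for example on lower(time_range) WHERE deleted_at IS NULL) might reduce the remaining filter cost further, but verify that against your data before adding it.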