[Solved] Speeding up query with `order by`, `limit` and `union`

EverSQL Database Performance Knowledge Base

Speeding up query with `order by`, `limit` and `union`

Database type:

I have a problem with a view that works as a “directory” of information from other tables. I cannot get PostgreSQL (11.4) to use an index when querying such a view with unions and left join.

First, here's a simplified version without the left join:

create table t1 (
  id uuid,
  n int,
  updated timestamp
);
create index i1 on t1 (updated);

create table t2 (
  id uuid,
  n int,
  updated timestamp
);
create index i2 on t2 (updated);

create or replace view directory as (
    select * from t1
  union all
    select * from t2
);

In the following tests, 10,000 rows were added to each table.

When querying this view with an order by and a limit, everything works well, and the i1 and i2 indices are used:

select *
from directory
order by updated desc
limit 10;

Plan visualized by PEV

Plan: https://lpaste.com/KWcGXlAoEy

However, if a left join is used in the view, the i2 index will no longer be used:

create table aux2 (
  id uuid,
  x text
);
create index auxi2 on aux2 (id);

drop view directory;
create view directory as (
    select t1.id, t1.n, t1.updated, null from t1
  union all
    select t2.id, t2.n, t2.updated, a.x from t2
    left join aux2 a on a.id = t2.id              -- ✱
);

In this case, a sequential scan is done on t2:

select *
from directory
order by updated desc
limit 10;

Plan visualized by PEV

Plan:

Limit  (cost=916.89..917.19 rows=10 width=44) (actual time=6.128..6.135 rows=10 loops=1)
  Buffers: shared hit=141
  ->  Merge Append  (cost=916.89..1516.89 rows=20000 width=44) (actual time=6.127..6.132 rows=10 loops=1)
        Sort Key: t1.updated DESC
        Buffers: shared hit=141
        ->  Index Scan Backward using i1 on t1  (cost=0.29..375.29 rows=10000 width=60) (actual time=0.003..0.007 rows=10 loops=1)
              Buffers: shared hit=3
        ->  Sort  (cost=916.60..941.60 rows=10000 width=29) (actual time=6.122..6.122 rows=1 loops=1)
              Sort Key: t2.updated DESC
              Sort Method: top-N heapsort  Memory: 25kB
              Buffers: shared hit=138
              ->  Hash Left Join  (cost=289.00..600.50 rows=10000 width=29) (actual time=2.240..4.956 rows=10000 loops=1)
                    Hash Cond: (t2.id = a.id)
                    Buffers: shared hit=138
                    ->  Seq Scan on t2  (cost=0.00..174.00 rows=10000 width=28) (actual time=0.005..0.807 rows=10000 loops=1)
                          Buffers: shared hit=74
                    ->  Hash  (cost=164.00..164.00 rows=10000 width=17) (actual time=2.227..2.227 rows=10000 loops=1)
                          Buckets: 16384  Batches: 1  Memory Usage: 607kB
                          Buffers: shared hit=64
                          ->  Seq Scan on aux2 a  (cost=0.00..164.00 rows=10000 width=17) (actual time=0.004..1.092 rows=10000 loops=1)
                                Buffers: shared hit=64
Planning Time: 0.295 ms
Execution Time: 6.161 ms

It's not the left join alone that precludes PostgreSQL from using i2 here. If the view does not contain a union (i.e. we only do the select + left join from t2 and aux2), the i2 index is used.

What's a better way to do this?

I'm both surprised that the planner doesn't use the i2 index, but also that the limit + order by isn't passed down to the node that ends up doing the sequential scan on t2 when the index isn't used. (On the real system, all rows from “t2” are returned, only to have most of them thrown away by the limit at the very end).

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Selecting Unnecessary Columns (query line: 2): Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
  2. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
CREATE INDEX directory_idx_updated ON "directory" ("updated" desc);
The optimized query:
SELECT
        * 
    FROM
        directory 
    ORDER BY
        directory.updated DESC LIMIT 10

Related Articles



* original question posted on StackOverflow here.