[Solved] Row estimate more accurate for primary key column than for others?

EverSQL Database Performance Knowledge Base

Row estimate more accurate for primary key column than for others?

Database type:

Let's create two test tables in a PostgreSQL 13 database:

CREATE TABLE foo (
  id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  value int NOT NULL
);
CREATE TABLE bar (
  id bigint PRIMARY KEY,
  category_id bigint NOT NULL,
  foo_id bigint REFERENCES foo (id),
  value int
);
CREATE INDEX bar_category_id_ix ON bar (category_id);

and disable autovacuum for these tables:

ALTER TABLE foo SET (autovacuum_enabled = false);
ALTER TABLE bar SET (autovacuum_enabled = false);

Insert 500000 (half of a million) records into foo, transfer them to bar and analyze the tables:

INSERT INTO foo (value) SELECT * FROM generate_series(1, 500000);
ANALYZE foo;
INSERT INTO bar (id, category_id, foo_id, value) SELECT id, 1, id, value FROM foo WHERE value <= 500000;
ANALYZE bar;

Optionally ensure that only ANALYZE (no autovacuum) was performed on these tables:

SELECT relname, last_autovacuum, last_vacuum, last_autoanalyze, last_analyze FROM pg_stat_user_tables WHERE relname IN ('foo', 'bar');

Insert another chunk of 500000 records (but don't run ANALYZE):

INSERT INTO foo (value) SELECT * FROM generate_series(500001, 1000000);
INSERT INTO bar (id, category_id, foo_id, value) SELECT id, 2, id, value FROM foo WHERE value > 500000;

Since we did not run ANALYZE table statistics is outdated, its related to the stage when foo and bar contained half of million records. Now let's check the query plans:

EXPLAIN SELECT * FROM bar
JOIN foo ON bar.foo_id = foo.id
WHERE category_id = 2;
----
Nested Loop  (cost=0.85..12.89 rows=1 width=40)
  ->  Index Scan using bar_category_id_ix on bar  (cost=0.42..4.44 rows=1 width=28)
        Index Cond: (category_id = 2)
  ->  Index Scan using foo_pkey on foo  (cost=0.42..8.44 rows=1 width=12)
        Index Cond: (id = bar.foo_id)

and

EXPLAIN SELECT * FROM bar
JOIN foo ON bar.foo_id = foo.id;
---
Hash Join  (cost=32789.00..71320.29 rows=999864 width=40)
  Hash Cond: (bar.foo_id = foo.id)
  ->  Seq Scan on bar  (cost=0.00..17351.64 rows=999864 width=28)
  ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=12)
        ->  Seq Scan on foo  (cost=0.00..15406.00 rows=1000000 width=12)

I understand that 1st query plan has wrongly estimated only 1 row (rows=1) for condition category_id = 2 because the statistic is outdated (the ANALYZE was performed before inserting records with category_id = 2). (1) But then, how did the 2nd query plan arrive at a good estimation (rows=999864) for condition bar.foo_id = foo.id?

Also if we run:

EXPLAIN SELECT * FROM bar
JOIN foo ON bar.foo_id = foo.id
WHERE category_id = 1;
----
Hash Join  (cost=32789.00..73819.95 rows=999864 width=40)
  Hash Cond: (bar.foo_id = foo.id)
  ->  Seq Scan on bar  (cost=0.00..19851.30 rows=999864 width=28)
        Filter: (category_id = 1)
  ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=12)
        ->  Seq Scan on foo  (cost=0.00..15406.00 rows=1000000 width=12)

(2) Why does the planner estimate 999864 rows for condition category_id = 1? The statistics should show about 500000 of rows satisfying it?

NOTE: I came to these questions because empirically I observed that conditions containing only primary key columns will produce a better query plan even if the table was not analyzed, but I did not find anything about this behavior in PostgreSQL official documentation.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
CREATE INDEX pg_user_idx_relname ON "pg_stat_user_tables" ("relname");
The optimized query:
SELECT
        pg_stat_user_tables.relname,
        pg_stat_user_tables.last_autovacuum,
        pg_stat_user_tables.last_vacuum,
        pg_stat_user_tables.last_autoanalyze,
        pg_stat_user_tables.last_analyze 
    FROM
        pg_stat_user_tables 
    WHERE
        pg_stat_user_tables.relname IN (
            'foo', 'bar'
        )

Related Articles



* original question posted on StackOverflow here.