[Solved] Optimizing a user retention query

EverSQL Database Performance Knowledge Base

Optimizing a user retention query

I'm trying to automatically calculate a table for user retention, so that the end result is something like:

Week | %
0    | 100
1    | 50
2    | 35

For now I'm not bothering with the % and just trying to get a count of users, i.e. users that came back on week 1, week 2, etc.

I have an accounts table and an action table. Currently I don't have indexes in either of these. Both have a user_email that stores the associated user, and both have a timestamp field that's a TIMESTAMP WITH TIME ZONE - for accounts this is the account creation date and for action it's the timestamp of the action.

The two tables are currently super simple:

accounts Table:
timestamp  | timestamp with time zone
user_email |character varying
action Table:
timestamp  | timestamp with time zone
user_email |character varying

That's it, no keys or anything else.

I managed to piece together a query that seems to do this calculation:

WITH weeks AS (
    SELECT (n || 'week')::interval as week from generate_series(0, 3) AS n
)
SELECT 

    week,

    (SELECT count(*) FROM accounts WHERE user_email IN 
        (SELECT user_email FROM action
            WHERE age(action.timestamp, accounts.timestamp) >= week 
            AND age(action.timestamp, accounts.timestamp) < week + '1 week'::interval)) 
            AS "Returning Users"

FROM weeks;

The output here is something like:

week    | Returning Users
0 days  |  100
7 days  |  50
14 days |  35
21 days |  25

The thing is, this takes super long to calculate (with 1300 users and about 100000 actions it takes a few minutes). I'm also convinced there's a more efficient way to calculate this, or that I can at least add indexes somewhere.

Unfortunately my SQL knowledge is quite limited and after looking around for a while I'm still not sure where to start. I think I can add indexes to the timestamp columns? And I'm not sure my way of checking if timestamps fall in certain ranges is correct.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Calling Functions With Indexed Columns (query line: 20): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `timestamp` is indexed, the index won’t be used as it’s wrapped with the function `age`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  2. Avoid Calling Functions With Indexed Columns (query line: 21): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `timestamp` is indexed, the index won’t be used as it’s wrapped with the function `age`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  3. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
  4. Replace In Subquery With Correlated Exists (modified query below): In many cases, an EXISTS subquery with a correlated condition will perform better than a non correlated IN subquery.
Optimal indexes for this query:
ALTER TABLE `action` ADD INDEX `action_idx_user_email` (`user_email`);
The optimized query:
WITH weeks AS (SELECT
        (n 
        OR 'week')::interval AS week 
    FROM
        generate_series(0,
        3) AS n) SELECT
        week,
        (SELECT
            count(*) 
        FROM
            accounts 
        WHERE
            EXISTS (
                SELECT
                    1 
                FROM
                    action 
                WHERE
                    (
                        age(action.timestamp, accounts.timestamp) >= week 
                        AND age(action.timestamp, accounts.timestamp) < week + '1 week'::interval
                    ) 
                    AND (
                        accounts.user_email = action.user_email
                    )
            )
        ) AS 'Returning Users' 
    FROM
        weeks

Related Articles



* original question posted on StackOverflow here.