[Solved] Getting the Top 5 rows by score for each group
Looking to automatically optimize YOUR SQL query? Start for free.

EverSQL Database Performance Knowledge Base

Getting the Top 5 rows by score for each group

I'm trying to get the top 5 comments by score for each Reddit post. I only want to retrieve the top N comments by score for each post title.

Example: I only would want comment 1 and 2 for each post.

Post 1 | Comment 1 | Comment Score 10
Post 1 | Comment 2 | Comment Score 9
Post 1 | Comment 3 | Comment Score 8
Post 2 | Comment 1 | Comment Score 10
Post 2 | Comment 2 | Comment Score 9
Post 2 | Comment 3 | Comment Score 8

StandardSQL

SELECT 
    posts.title, 
    posts.url, 
    posts.score AS postsscore, 
    DATE_TRUNC(DATE(TIMESTAMP_SECONDS(posts.created_utc)), MONTH), 
    SUBSTR(comments.body, 0, 80), 
    comments.score AS commentsscore, 
    comments.id
FROM 
    `fh-bigquery.reddit_posts.2015*` AS posts
    JOIN `fh-bigquery.reddit_comments.2015*` AS comments
        ON posts.id = SUBSTR(comments.link_id, 4)
WHERE 
    posts.subreddit = 'Showerthoughts' 
    AND posts.score >100 
    AND comments.score >100
ORDER BY 
    posts.score DESC, 
    posts.title DESC, 
    comments.score DESC

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Calling Functions With Indexed Columns (query line: 16): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `link_id` is indexed, the index won’t be used as it’s wrapped with the function `SUBSTR`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  2. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
ALTER TABLE `fh-bigquery.reddit_comments.2015*` ADD INDEX `fhbigquery_reddit_idx_score` (`score`);
ALTER TABLE `fh-bigquery.reddit_posts.2015*` ADD INDEX `fhbigquery_reddit_idx_subreddit_score` (`subreddit`,`score`);
The optimized query:
SELECT
        posts.title,
        posts.url,
        posts.score AS postsscore,
        DATE_TRUNC(DATE(TIMESTAMP_SECONDS(posts.created_utc)),
        MONTH),
        SUBSTR(comments.body,
        0,
        80),
        comments.score AS commentsscore,
        comments.id 
    FROM
        `fh-bigquery.reddit_posts.2015*` AS posts 
    JOIN
        `fh-bigquery.reddit_comments.2015*` AS comments 
            ON posts.id = SUBSTR(comments.link_id,
        4) 
    WHERE
        posts.subreddit = 'Showerthoughts' 
        AND posts.score > 100 
        AND comments.score > 100 
    ORDER BY
        posts.score DESC,
        posts.title DESC,
        comments.score DESC

Related Articles



* original question posted on StackOverflow here.