[Solved] BigQuery SQL Query Optimization

EverSQL Database Performance Knowledge Base

BigQuery SQL Query Optimization

I managed to get a query that works, but I'm curious if there is a more succinct way to construct it (still learning!).

The BigQuery dataset that I'm working with comes from Hubspot. It's being kept in sync by Stitch. (For those unfamiliar with BigQuery, most integrations are append-only so I have to filter out old copies via the ROW_NUMBER() OVER line you'll see below, so that's why it exists. Seems like the standard way to deal with this quirk.)

The wrinkle with the companies table is every single field, except for two ID ones, is of type RECORD. (See the screenshot at the bottom for an example). It serves to keep a history of field value changes. Unfortunately they don't seem to be in any order so wrapping up the fields - properties.first_conversion_event_name for example - in a MIN() or MAX() and grouping by companyid formula doesn't work.

This is what I ended up with (the final query is much longer; I didn't include all of the fields in the sample below):

WITH companies AS (
SELECT
    o.companyid as companyid,
    ARRAY_AGG(STRUCT(o.properties.name.value, o.properties.name.timestamp) ORDER BY o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] as name,
    ARRAY_AGG(STRUCT(o.properties.industry.value, o.properties.industry.timestamp) ORDER BY o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] as industry,
    ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value, o.properties.lifecyclestage.timestamp) ORDER BY o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] as lifecyclestage
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY o.companyid ORDER BY o._sdc_batched_at DESC) as seqnum
  FROM `project.hubspot.companies` o) o
WHERE seqnum = 1
GROUP BY companyid)

SELECT
  companyid,
  name.value as name,
  industry.value as industry,
  lifecyclestage.value as lifecyclestage
FROM companies

The WITH clause at the top is to get rid of the extra fields that the ARRAY_AGG(STRUCT()) includes. For each field I would have two columns - [field].value and [field].timestamp - and I only want the [field].value one.

Thanks in advance!

Schema Screenshot

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Selecting Unnecessary Columns (query line: 17): Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
  2. Avoid Subqueries (query line: 16): We advise against using subqueries as they are not optimized well by the optimizer. Therefore, it's recommended to join a newly created temporary table that holds the data, which also includes the relevant search index.
  3. Explicitly ORDER BY After GROUP BY (modified query below): By default, the database sorts all 'GROUP BY col1, col2, ...' queries as if you specified 'ORDER BY col1, col2, ...' in the query as well. If a query includes a GROUP BY clause but you want to avoid the overhead of sorting the result, you can suppress sorting by specifying 'ORDER BY NULL'.
The optimized query:
WITH companies AS (SELECT
        o.companyid AS companyid,
        ARRAY_AGG(STRUCT(o.properties.name.value,
        o.properties.name.timestamp) 
    ORDER BY
        o.properties.name.timestamp DESC)[SAFE_OFFSET(0)] AS name,
        ARRAY_AGG(STRUCT(o.properties.industry.value,
        o.properties.industry.timestamp) 
    ORDER BY
        o.properties.industry.timestamp DESC)[SAFE_OFFSET(0)] AS industry,
        ARRAY_AGG(STRUCT(o.properties.lifecyclestage.value,
        o.properties.lifecyclestage.timestamp) 
    ORDER BY
        o.properties.lifecyclestage.timestamp DESC)[SAFE_OFFSET(0)] AS lifecyclestage 
    FROM
        (SELECT
            *,
            ROW_NUMBER() OVER (PARTITION 
        BY
            o.companyid 
        ORDER BY
            o._sdc_batched_at DESC) AS seqnum 
        FROM
            `project.hubspot.companies` o) o 
    WHERE
        seqnum = 1 
    GROUP BY
        o.companyid 
    ORDER BY
        NULL) SELECT
            companies.companyid,
            name.value AS name,
            industry.value AS industry,
            lifecyclestage.value AS lifecyclestage 
        FROM
            companies

Related Articles



* original question posted on StackOverflow here.