One of the annoying features of BigQuery is its unpredictable behavior.
The latest such annoyance can be illustrated with this query:
SELECT
  COUNT(DISTINCT large_table.event_id) AS large_table_count,
  COUNT(DISTINCT small_table.pk) AS small_table_count
FROM large_table
LEFT JOIN small_table
  ON large_table.ga_user_id = small_table.ga_user_id
where large_table is ~2.5GB and small_table is ~250MB.
When I join on ga_user_id, the query takes ~60 seconds. More importantly, it finishes, and the size of the tables can be increased without a significant increase in runtime (see screenshot #1).
When I join on a date string (snapshot_date), however, the runtime shoots up to ~140 seconds (see screenshot #2). More importantly, the performance does not scale linearly with data size: on larger tables the query does not finish even after 3,000 seconds.
A couple of observations: 1) snapshot_date has far less variability (fewer distinct values) than ga_user_id; 2) both ga_user_id and snapshot_date are strings.
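That first observation may be the cause rather than a side note: a join on a low-cardinality key fans out, because every large_table row for a given snapshot_date matches every small_table row with that date. A rough way to estimate the row count the date join would produce (a sketch, assuming snapshot_date exists on both tables; column names taken from the question):

```
-- Estimated join output: sum over keys of (left rows * right rows).
SELECT SUM(l.n * s.n) AS estimated_joined_rows
FROM (SELECT snapshot_date, COUNT(*) AS n FROM large_table GROUP BY snapshot_date) AS l
JOIN (SELECT snapshot_date, COUNT(*) AS n FROM small_table GROUP BY snapshot_date) AS s
USING (snapshot_date);
```

If that estimate is orders of magnitude larger than either input table, the slowdown on the date join is expected regardless of how the tables are tuned.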
What am I missing?
The following recommendations should help with your SQL tuning. You'll find two sections below: a suggested optimization for the join key, and a reformatted query.
-- BigQuery does not support ALTER TABLE ... ADD INDEX (that is MySQL syntax); the closest equivalent is clustering on the join key:
CREATE OR REPLACE TABLE `small_table` CLUSTER BY ga_user_id AS SELECT * FROM `small_table`;
SELECT
  COUNT(DISTINCT large_table.event_id) AS large_table_count,
  COUNT(DISTINCT small_table.pk) AS small_table_count
FROM large_table
LEFT JOIN small_table
  ON large_table.ga_user_id = small_table.ga_user_id
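If exact counts are not required, BigQuery's APPROX_COUNT_DISTINCT (a HyperLogLog++-based aggregate) can cut the cost of the distinct aggregation substantially. A sketch of the same query with approximate counts:

```
-- Approximate distinct counts: trades exactness for speed and memory.
SELECT
  APPROX_COUNT_DISTINCT(large_table.event_id) AS large_table_count,
  APPROX_COUNT_DISTINCT(small_table.pk) AS small_table_count
FROM large_table
LEFT JOIN small_table
  ON large_table.ga_user_id = small_table.ga_user_id
```

This does not change the join itself, so it helps the aggregation cost but not a fan-out caused by a low-cardinality join key.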