[Solved] Select no more than 1 row every N seconds
Looking to automatically optimize YOUR SQL query? Start for free.

EverSQL Database Performance Knowledge Base

Select no more than 1 row every N seconds

Database type:

I have a couple of MySQL tables storing temperature data from sensors. The sensors report about once every minute, and there are dozens of sensors (and growing). The tables have quickly grown to millions of rows, and will keep growing. The two pertinent tables are data and data_temperature.

The data table is structured like this:

data_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
sensor_id int(10) unsigned NOT NULL

The data_temperature table is structured like this:

temperature_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
data_id bigint(20) unsigned NOT NULL
x_id varchar(32) DEFAULT NULL
x_sn varchar(16) DEFAULT NULL
x_unit char(1) DEFAULT NULL
x_value` decimal(6,2) DEFAULT NULL

Since each sensor reports about once per minute, there should be about 1440 rows per day for each sensor. But there are occasionally gaps in the data, sometimes lasting minutes, and sometimes lasting much longer.

I need to select a sampling of the data to display on a graph. The graphs are 600 pixels wide. While the time-frames of the graphs are variable (sometimes a daily graph, sometimes weekly, sometimes annually, etc), the pixel widths of the graph are fixed.

Originally I would select a count of the rows within the timeframe, then divide that by 600 to get X, then select the rows where data_id MOD X = 0. But this doesn't work well unless only one sensor is reporting to the table. With many sensors, it creates lots of gaps. To compensate, I'm pulling much more data than needed and overpopulating the graphs to be sure there are no holes due to this.

The overpopulating causes slow render times in the browser. But even the SELECT COUNT() is now the major cause of the server-side slowness, which takes about 5-6 seconds to run on the data table.

Ideally, what I'd like to do is to select the data from the table such that I have no more than one data point (but zero is okay, in case there is no data) in a given window. The window is the total time frame being viewed in the graph divided by the width of the graph in pixels. So viewing a daily graph that's 600px wide would be calculated like this:

86400 seconds per day / 600 pixels = 144-second window

So I would want no more than one data point every 144 seconds. Here's the query that I've come up with so far:

SELECT data_temperature.data_id, data_temperature.created,
       ROUND( data_temperature.x_value, 1 ) AS temperature
  FROM data_temperature
         INNER JOIN data
                 ON data_temperature.data_id = data.data_id
 WHERE data.sensor_id = :sensor_id
   AND data.created BETWEEN :dt_start AND :dt_end
 GROUP BY ROUND( UNIX_TIMESTAMP( data_temperature.created ) / 144 )
 ORDER BY data.created, data.data_id

This query is an improvement both in that it returns the correct data, but also in that it runs in about 3.6 seconds. That's still much slower than what I really want. So I'm wondering if there are any other thoughts on accomplishing this with a more efficient query.

Note: Even though it doesn't look right, there's a good reason for having the data and data_temperature tables separated even though their relationship is 1-to-1. When I modify my queries and structure so that everything is in a single table, it doesn't improve the query time. So I don't believe having two tables is negatively impacting performance.

Update to clarify based on @Kevin Nelson's response

It's not the GROUP BY that's slow, it's the BETWEEN in the WHERE clause that's slow. If I remove that, it runs much faster, but of course returns the wrong results. If I execute a simple query like this:

SELECT data.data_id, data.created
  FROM data
 WHERE data.created BETWEEN :dt_start AND :dt_end

It's also very slow. My created column is indexed, so I'm not particularly sure why. I do know that the greater the range between dt_start and dt_end, the slower it takes. A one-day range takes about half a second. A one-week range takes about 10 seconds.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Calling Functions With Indexed Columns (query line: 15): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `created` is indexed, the index won’t be used as it’s wrapped with the function `ROUND`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  2. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
ALTER TABLE `data` ADD INDEX `data_idx_sensor_id_created` (`sensor_id`,`created`);
ALTER TABLE `data_temperature` ADD INDEX `data_temperature_idx_data_id` (`data_id`);
The optimized query:
SELECT
        data_temperature.data_id,
        data_temperature.created,
        ROUND(data_temperature.x_value,
        1) AS temperature 
    FROM
        data_temperature 
    INNER JOIN
        data 
            ON data_temperature.data_id = data.data_id 
    WHERE
        data.sensor_id = :sensor_id 
        AND data.created BETWEEN :dt_start AND :dt_end 
    GROUP BY
        ROUND(UNIX_TIMESTAMP(data_temperature.created) / 144) 
    ORDER BY
        data.created,
        data.data_id

Related Articles



* original question posted on StackOverflow here.