I have a couple of MySQL tables storing temperature data from sensors. The sensors report about once every minute, and there are dozens of sensors (and growing). The tables have quickly grown to millions of rows, and will keep growing. The two pertinent tables are data
and data_temperature
.
The data
table is structured like this:
data_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
sensor_id int(10) unsigned NOT NULL
The data_temperature
table is structured like this:
temperature_id bigint(20) unsigned NOT NULL AUTO_INCREMENT
created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
data_id bigint(20) unsigned NOT NULL
x_id varchar(32) DEFAULT NULL
x_sn varchar(16) DEFAULT NULL
x_unit char(1) DEFAULT NULL
x_value` decimal(6,2) DEFAULT NULL
Since each sensor reports about once per minute, there should be about 1440 rows per day for each sensor. But there are occasionally gaps in the data, sometimes lasting minutes, and sometimes lasting much longer.
I need to select a sampling of the data to display on a graph. The graphs are 600 pixels wide. While the time-frames of the graphs are variable (sometimes a daily graph, sometimes weekly, sometimes annually, etc), the pixel widths of the graph are fixed.
Originally I would select a count of the rows within the timeframe, then divide that by 600 to get X
, then select the rows where data_id MOD X = 0
. But this doesn't work well unless only one sensor is reporting to the table. With many sensors, it creates lots of gaps. To compensate, I'm pulling much more data than needed and overpopulating the graphs to be sure there are no holes due to this.
The overpopulating causes slow render times in the browser. But even the SELECT COUNT()
is now the major cause of the server-side slowness, which takes about 5-6 seconds to run on the data
table.
Ideally, what I'd like to do is to select the data from the table such that I have no more than one data point (but zero is okay, in case there is no data) in a given window. The window is the total time frame being viewed in the graph divided by the width of the graph in pixels. So viewing a daily graph that's 600px wide would be calculated like this:
86400 seconds per day / 600 pixels = 144-second window
So I would want no more than one data point every 144 seconds. Here's the query that I've come up with so far:
SELECT data_temperature.data_id, data_temperature.created,
ROUND( data_temperature.x_value, 1 ) AS temperature
FROM data_temperature
INNER JOIN data
ON data_temperature.data_id = data.data_id
WHERE data.sensor_id = :sensor_id
AND data.created BETWEEN :dt_start AND :dt_end
GROUP BY ROUND( UNIX_TIMESTAMP( data_temperature.created ) / 144 )
ORDER BY data.created, data.data_id
This query is an improvement both in that it returns the correct data, but also in that it runs in about 3.6 seconds. That's still much slower than what I really want. So I'm wondering if there are any other thoughts on accomplishing this with a more efficient query.
Note: Even though it doesn't look right, there's a good reason for having the data
and data_temperature
tables separated even though their relationship is 1-to-1. When I modify my queries and structure so that everything is in a single table, it doesn't improve the query time. So I don't believe having two tables is negatively impacting performance.
Update to clarify based on @Kevin Nelson's response
It's not the GROUP BY that's slow, it's the BETWEEN
in the WHERE
clause that's slow. If I remove that, it runs much faster, but of course returns the wrong results. If I execute a simple query like this:
SELECT data.data_id, data.created
FROM data
WHERE data.created BETWEEN :dt_start AND :dt_end
It's also very slow. My created
column is indexed, so I'm not particularly sure why. I do know that the greater the range between dt_start
and dt_end
, the slower it takes. A one-day range takes about half a second. A one-week range takes about 10 seconds.
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
ALTER TABLE `data` ADD INDEX `data_idx_sensor_id_created` (`sensor_id`,`created`);
ALTER TABLE `data_temperature` ADD INDEX `data_temperature_idx_data_id` (`data_id`);
SELECT
data_temperature.data_id,
data_temperature.created,
ROUND(data_temperature.x_value,
1) AS temperature
FROM
data_temperature
INNER JOIN
data
ON data_temperature.data_id = data.data_id
WHERE
data.sensor_id = :sensor_id
AND data.created BETWEEN :dt_start AND :dt_end
GROUP BY
ROUND(UNIX_TIMESTAMP(data_temperature.created) / 144)
ORDER BY
data.created,
data.data_id