[Solved] Relational/time series databases and very large SELECT queries

EverSQL Database Performance Knowledge Base

I need to store a large number of structured records (potentially hundreds of billions) in a database. Data would be written continuously by many sensors, with a high insert rate (up to 100k rows/second).

Since the data is well structured, it seems to be a good match for a relational database such as Postgres. However, I am afraid the performance wouldn't be sufficient for the amount of data that needs to be ingested.
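For the ingest side, the usual way to approach rates like 100k rows/second in any of these systems is micro-batching: buffer incoming rows and issue one bulk insert per batch instead of one statement per row. A minimal sketch in Python (the `BatchBuffer` class and `flush_fn` callback are hypothetical names; the real flush would be a COPY or a multi-row INSERT through whichever driver is used):

```python
class BatchBuffer:
    """Accumulate sensor rows and flush them in fixed-size batches.

    flush_fn is a stand-in for the actual bulk-insert call; it receives
    the full batch as a list of rows. Only one batch is held in memory.
    """

    def __init__(self, flush_fn, batch_size=10_000):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        # Flush whatever is buffered, including a final partial batch.
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []
```

A periodic timer-driven `flush()` would also be needed in practice, so that a slow sensor does not leave rows sitting in the buffer indefinitely.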

Furthermore, I do not need all of the features of a relational database (no need for full SQL support). Data would be written once, and read a few times as large chunks using basic queries such as:

SELECT time, value FROM data WHERE time>1000 AND time<2500 AND sensor_location='home' ORDER BY time

That is, select all records between two timestamps for a given sensor (or set of sensors). I do not need any ability to make complex queries such as joins or updates. The ORDER BY clause is important, since I need to be able to process these messages in the order they were written (using a Python script). These queries typically return many rows, and are often too large to fit in RAM. Furthermore, returning that many rows is very slow with most RDBMSs due to their text-based wire protocol, even if I split the query.
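Splitting such a query can be sketched in Python with keyset pagination: each page is bounded by the last `time` value already seen, so only one page is ever held in RAM, and the overall `ORDER BY time` ordering is preserved across pages. The `fetch_page` callback here is a hypothetical stand-in for whatever driver call runs the bounded SELECT; if several rows can share the same `time`, a tiebreaker column would be needed in the key as well.

```python
def keyset_paginate(fetch_page, start_time, end_time, page_size=10_000):
    """Yield (time, value) rows ordered by time, one bounded page at a time.

    fetch_page(after_time, end_time, limit) must return rows with
    after_time < time < end_time, ordered by time ascending, at most
    `limit` of them. Assumes `time` values are unique (otherwise a
    tiebreaker column must be added to the key).
    """
    last_time = start_time
    while True:
        page = fetch_page(last_time, end_time, page_size)
        if not page:
            return
        yield from page
        last_time = page[-1][0]  # resume strictly after the last time seen
```

Each page translates to a query of the form `... WHERE time > ? AND time < ? ORDER BY time LIMIT ?`, which an index on `time` can serve without scanning the whole range.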

This seems to be a good use case for a time series database such as InfluxDB. However, its open source version cannot easily be distributed (which is a requirement in my case, both for resilience and scalability), and my tests showed that it is not performant enough when it comes to large queries (in particular, its wire protocol is too slow to efficiently transfer this many rows - and it sometimes even crashes when the query returns too many rows).

I recently learned about Clickhouse, which is horizontally scalable and highly performant. It has a binary/compressed wire protocol and one of the Python drivers (clickhouse_driver) has an execute_iter function which avoids blowing up the client's RAM when making these large queries. However, I am quite worried about its resilience (data corruption is not tolerable in my use case) since it is fairly recent and has a limited user base.
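For illustration, a minimal row-by-row consumer for such a stream might look like the following; `process_stream` and `my_callback` are hypothetical names, and the commented driver section assumes clickhouse_driver's `execute_iter`, which yields rows lazily instead of materializing the whole result set.

```python
def process_stream(rows, process):
    """Consume an iterator of (time, value) rows one at a time,
    checking that rows arrive in non-decreasing time order.

    Memory use is bounded by a single row, so the iterator can come
    from a streaming driver call over an arbitrarily large result.
    """
    count = 0
    prev_time = None
    for time, value in rows:
        if prev_time is not None and time < prev_time:
            raise ValueError("rows arrived out of order")
        process(time, value)
        prev_time = time
        count += 1
    return count


# Driving it from ClickHouse (not executed here; assumes a local server
# and a `data` table matching the schema in the question):
#
# from clickhouse_driver import Client
# client = Client('localhost')
# rows = client.execute_iter(
#     "SELECT time, value FROM data "
#     "WHERE time > 1000 AND time < 2500 AND sensor_location = 'home' "
#     "ORDER BY time"
# )
# process_stream(rows, my_callback)
```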

I am aware that my use case is quite specific. Are there other free/open source options I should be aware of?

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
CREATE INDEX data_idx_sensor_loc_time ON "data" ("sensor_location","time");
The optimized query:
SELECT
        data.time,
        data.value 
    FROM
        data 
    WHERE
        data.time > 1000 
        AND data.time < 2500 
        AND data.sensor_location = 'home' 
    ORDER BY
        data.time

* The original question was posted on StackOverflow.