I need to store a large number of structured records (potentially hundreds of billions) in a database. Data would be written continuously by many sensors, with a high insert rate (up to 100k rows/second).
The data being well structured, it seems to be a good match for a structured database such as Postgres. However, I am afraid the performance wouldn't be sufficient for the amount of data that needs to be ingested.
Furthermore, I do not need all of the features of a relational database (no need for full SQL support). Data would be written once, and read a few times as large chunks using basic queries such as:
SELECT time, value FROM data WHERE time>1000 AND time<2500 AND sensor_location="home" ORDER BY time
That is, select all records between two timestamps for a given sensor (or set of sensors). I do not need any ability to make complex queries such as joins or updates. The ORDER BY clause is important, since I need to be able to process these messages in the order they were written (using a Python script). These queries typically return many rows, and are often too large to fit in RAM. Furthermore, returning that many rows is very slow with most RDBMSs due to their text-based wire protocol, even if I split the query.
This seems to be a good use case for a time series database such as InfluxDB. However, its open source version cannot easily be distributed (which is a requirement in my case, both for resilience and scalability), and my tests showed that it is not performant enough when it comes to large queries (in particular, its wire protocol is too slow to efficiently transfer this many rows - and it sometimes even crashes when the query returns too many rows).
I recently learned about Clickhouse, which is horizontally scalable and highly performant. It has a binary/compressed wire protocol and one of the Python drivers (clickhouse_driver) has an
execute_iter function which avoids blowing up the client's RAM when making these large queries. However, I am quite worried about its resilience (data corruption is not tolerable in my use case) since it is fairly recent and has a limited user base.
I am aware that my use case if quite specific. Are there other free/open source options I should be aware of?
The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:
CREATE INDEX data_idx_sensor_loc_home_time ON "data" ("sensor_location","home","time");
SELECT data.time, data.value FROM data WHERE data.time > 1000 AND data.time < 2500 AND data.sensor_location = data."home" ORDER BY data.time