[Solved] Cassandra READ Where In performance

I have a Cassandra cluster of 6 nodes, each one has 96 CPU/800 RAM.

My table for performance tests is:

create table if not exists space.table
(
    id          bigint primary key,
    data        frozen<list<float>>,
    updated_at  timestamp
);

The table contains 150,000,000 rows.

When I tested it with the query:

SELECT * FROM space.table WHERE id = X

I wasn't even able to overload the cluster; the client itself became the bottleneck, at 350,000 RPS to the cluster.

Now I'm testing a second test case:

SELECT * FROM space.table WHERE id IN (X1, X2, ..., X3000)

I want to get 3000 random rows from Cassandra per request.

The maximum throughput in this case is 15 RPS; beyond that, a lot of pending tasks build up in the Cassandra thread pool of type Native-Transport-Requests. Is it a bad idea to fetch large result sets from Cassandra? What is the best practice? I could of course split the 3000 rows across separate requests, for example 30 requests with 100 ids each. Where can I find information about this? Is the WHERE IN operation perhaps bad from a performance perspective?
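A minimal sketch of the splitting idea described above, assuming a Python client. The `execute` callable is hypothetical and stands in for whatever the driver provides to run a query; `chunked`, `fetch_in_chunks`, the `%s` placeholder style and the worker count are likewise assumptions for illustration, not part of the original setup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence, Tuple


def chunked(ids: Sequence[int], size: int) -> List[Sequence[int]]:
    """Split ids into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_in_chunks(
    ids: Sequence[int],
    size: int,
    execute: Callable[[str, Tuple[int, ...]], Iterable],
    max_workers: int = 32,
) -> list:
    """Issue one small IN query per chunk concurrently and merge the rows.

    `execute(query, params)` is a hypothetical stand-in for the driver call.
    """
    def run(chunk: Sequence[int]) -> list:
        # One placeholder per id in this chunk (placeholder style is assumed).
        placeholders = ", ".join(["%s"] * len(chunk))
        query = f"SELECT * FROM space.table WHERE id IN ({placeholders})"
        return list(execute(query, tuple(chunk)))

    rows: list = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in chunk order, so rows follow the id order.
        for result in pool.map(run, chunked(ids, size)):
            rows.extend(result)
    return rows
```

With `size=100` this turns one request for 3000 ids into 30 smaller requests; with `size=1` it degenerates into 3000 single-key lookups.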

Update:

Here are my measurements for fetching 3000 rows from Cassandra with different chunk sizes:

Test with 3000 ids per request (1 request with 3000 ids)
Latency: 5 seconds
Max RPS to Cassandra: 20


Test with 100 ids per request (30 requests in total, 100 ids each)
Latency at 350 RPS to service (350 * 30 = 10500 requests to Cassandra): 170 ms (q99), 95 ms (q90), 75 ms (q50)
Max RPS to Cassandra: 350 * 30 = 10500

Test with 20 ids per request (150 requests in total, 20 ids each)
Latency at 250 RPS to service (250 * 150 = 37500 requests to Cassandra): 49 ms (q99), 46 ms (q90), 32 ms (q50)
Latency at 600 RPS to service (600 * 150 = 90000 requests to Cassandra): 190 ms (q99), 180 ms (q90), 148 ms (q50)
Max RPS to Cassandra: 650 * 150 = 97500


Test with 10 ids per request (300 requests in total, 10 ids each)
Latency at 250 RPS to service (250 * 300 = 75000 requests to Cassandra): 48 ms (q99), 31 ms (q90), 11 ms (q50)
Latency at 600 RPS to service (600 * 300 = 180000 requests to Cassandra): 159 ms (q99), 95 ms (q90), 75 ms (q50)
Max RPS to Cassandra: 650 * 300 = 195000


Test with 5 ids per request (600 requests in total, 5 ids each)
Latency at 550 RPS to service (550 * 600 = 330000 requests to Cassandra): 97 ms (q99), 92 ms (q90), 60 ms (q50)
Max RPS to Cassandra: 550 * 600 = 330000


Test with 1 id per request (3000 requests in total, 1 id each)
Latency at 190 RPS to service (190 * 3000 = 570000 requests to Cassandra): 49 ms (q99), 43 ms (q90), 30 ms (q50)
Max RPS to Cassandra: 190 * 3000 = 570000
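The arithmetic behind these figures can be sketched as follows: each service request for 3000 ids fans out into ceil(3000 / chunk size) Cassandra requests, so the request rate seen by Cassandra is the service RPS times that fan-out. The function names are illustrative only; the inputs are the maximum service RPS values reported above.

```python
import math

TOTAL_IDS = 3000  # rows fetched per service request, as in the tests above


def requests_per_call(chunk_size: int, total_ids: int = TOTAL_IDS) -> int:
    """Number of Cassandra requests needed for one service request."""
    return math.ceil(total_ids / chunk_size)


def cassandra_rps(service_rps: int, chunk_size: int, total_ids: int = TOTAL_IDS) -> int:
    """Request rate seen by Cassandra for a given service request rate."""
    return service_rps * requests_per_call(chunk_size, total_ids)


def rows_per_second(service_rps: int, total_ids: int = TOTAL_IDS) -> int:
    """Rows delivered per second; independent of the chunk size."""
    return service_rps * total_ids


if __name__ == "__main__":
    # (chunk size, max service RPS) pairs taken from the measurements.
    for chunk_size, service_rps in [(100, 350), (20, 650), (10, 650), (1, 190)]:
        print(
            f"chunk={chunk_size:>3}  fan-out={requests_per_call(chunk_size):>4}  "
            f"cassandra_rps={cassandra_rps(service_rps, chunk_size):>6}  "
            f"rows/s={rows_per_second(service_rps)}"
        )
```

Comparing rows per second rather than raw Cassandra RPS makes the runs comparable: the single-id run issues the most requests to Cassandra, but chunks of 10 to 20 ids sustain the highest service RPS and therefore deliver the most rows per second.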

* original question posted on StackOverflow here.