[Solved] Improve performance of Recursive CTE

EverSQL Database Performance Knowledge Base

Improve performance of Recursive CTE

I have a pretty simple recursive CTE running on a single sourcetable (REP.INVENTMOVEMENTS) containing around 4mln records. The table is pretty heavily indexed.

with 
dataset as (
    select imv.sourceBatch, 
           imv.targetBatch,
           imv.sourceDataArea, 
           imv.targetDataArea,
           sum(Weight) as Weight
    from REP.INVENTMOVEMENTS imv 
    where imv.sourceBatch <> ''
    Group By imv.sourceBatch, 
             imv.targetBatch,
             imv.sourceDataArea, 
             imv.targetDataArea
    ),
result as (
    select  targetBatch as Batch,
            targetDataArea as DataArea, 
            sourceBatch, 
            targetBatch,   
            sourceDataArea,
            targetDataArea, 
            1 as level,
            Weight
    from dataset
    where sourceBatch <> targetBatch

    union all 

    select result.Batch,
           result.DataArea, 
           dataset.sourceBatch, 
           dataset.targetBatch, 
           dataset.sourceDataArea,
           dataset.targetDataArea, 
           result.level + 1 as level,
           dataset.Weight
    from dataset inner join result on dataset.targetBatch       = result.sourceBatch 
                                  and dataset.targetDataArea    = result.sourceDataArea
                                  and dataset.targetBatch       <> dataset.sourceBatch
    )

select * from result
union all
select      targetBatch as Batch,
            targetDataArea as DataArea, 
            sourceBatch, 
            targetBatch,   
            sourceDataArea,
            targetDataArea, 
            0 as level,
            Weight
    from dataset
    where sourceBatch = targetBatch
;

running the initial query without selections takes the database 122 seconds returning 517.947 records.

Running that same query returning a single batch takes the database less then a second returning 5 records.

But if I run the CTE with a selection on 1 batch it takes the database 28 seconds to complete 2 recursions and return 7 records.

I need to fill a table with the result from this view for 150k batches. so if all of them take half a minute to complete it would take 52 days to finish that task.

This is the execution plan

Executionplan

Just to clarify my objective. Batches can merge into new batches so 2 or more source batches can create an new batch. Two batches created in such a merge can be used to create a new batch... and so forth.

I want to be able to select a batch and find all the batches used to create this new batch.

Please take into account that one batch can be used in multiple other batches.

I hope you can help me out here.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Selecting Unnecessary Columns (query line: 45): Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
  2. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
ALTER TABLE `INVENTMOVEMENTS` ADD INDEX `inventmovements_idx_sourceb_targetb_sourced_targetd` (`sourceBatch`,`targetBatch`,`sourceDataArea`,`targetDataArea`);
The optimized query:
WITH dataset AS (SELECT
        imv.sourceBatch,
        imv.targetBatch,
        imv.sourceDataArea,
        imv.targetDataArea,
        sum(Weight) AS Weight 
    FROM
        REP.INVENTMOVEMENTS imv 
    WHERE
        imv.sourceBatch <> '' 
    GROUP BY
        imv.sourceBatch,
        imv.targetBatch,
        imv.sourceDataArea,
        imv.targetDataArea), result AS (SELECT
        dataset.targetBatch AS Batch,
        dataset.targetDataArea AS DataArea,
        dataset.sourceBatch,
        dataset.targetBatch,
        dataset.sourceDataArea,
        dataset.targetDataArea,
        1 AS level,
        dataset.Weight 
    FROM
        dataset 
    WHERE
        dataset.sourceBatch <> dataset.targetBatch 
    UNION
    ALL SELECT
        result.Batch,
        result.DataArea,
        dataset.sourceBatch,
        dataset.targetBatch,
        dataset.sourceDataArea,
        dataset.targetDataArea,
        result.level + 1 AS level,
        dataset.Weight 
    FROM
        dataset 
    INNER JOIN
        result 
            ON dataset.targetBatch = result.sourceBatch 
            AND dataset.targetDataArea = result.sourceDataArea 
            AND dataset.targetBatch <> dataset.sourceBatch) SELECT
            * 
    FROM
        result 
    UNION
    ALL SELECT
        dataset.targetBatch AS Batch,
        dataset.targetDataArea AS DataArea,
        dataset.sourceBatch,
        dataset.targetBatch,
        dataset.sourceDataArea,
        dataset.targetDataArea,
        0 AS level,
        dataset.Weight 
    FROM
        dataset 
    WHERE
        dataset.sourceBatch = dataset.targetBatch

Related Articles



* original question posted on StackOverflow here.