[Solved] Delete big amount of data and primary index

EverSQL Database Performance Knowledge Base

Delete big amount of data and primary index

Database type:

I'm trying to delete big amount of rows (>10 million, it about 1/3 of all records in a table) from InnoDB MySQL table with primary/clustered index. Field id is primary/clustered index and it's continuous without gaps. At least it should be, I do not delete records in a middle. But it's possible that some insert queries fails and innodb allocates some ids that became unused (I'm not sure if it is true). I delete only old records that no longer needed. Table contains varchar columns, so rows does not have fixed size.

First my attempt:

DELETE FROM `table` WHERE id<=10000000

It fails with big io operations. It seems to be that mysql killed this query and rolled back all changes. Approximate time for query execution was 6 hours and approximately same for rolling back. My big mistake there is that transaction log size was standard 5mb, be aware of it. It have to be enlarged.

Second attempt to delete in chunks by 10 000 records, like:

DELETE FROM `table` WHERE id<=10000;
COMMIT;
DELETE FROM `table` WHERE id<=20000;
COMMIT;

And so on. Query execution time from beginning was about 10 seconds (on laptop). Execution time grow gradually and it was about 300 seconds per query after 6 hours of execution.

Third attempt to make queries that in average executes less than 1 second. php code:

protected function deleteById($table, $id) {
    $MinId          = $this->getMinFromTable($table, 'id');
    $PackDeleteCount= $this->PackDeleteCount;
    $timerTotal     = new Timer();
    $delCountTotal  = 0;
    $delCountReport = 0;
    $delInfo        = array();
    $PackMinTime    = round($this->PackDeleteTime - $this->PackDeleteTime*$this->PackDeleteDiv, 3);
    $PackMaxTime    = round($this->PackDeleteTime + $this->PackDeleteTime*$this->PackDeleteDiv, 3);
    $this->LogString(sprintf('Del `%s`, PackMinTime: %s; PackMaxTime: %s', $table, $PackMinTime, $PackMaxTime));
    for (; $MinId < $id;) {
        $MinId          += $PackDeleteCount;
        $delCountReport += $PackDeleteCount;
        if ($MinId > $id) {
            $MinId = $id;
        }
        $timer          = new Timer();
        $sql            = sprintf('DELETE FROM `%s` WHERE id<=%s', $table, $MinId);
        $this->s->Query($sql, __FILE__, __LINE__);
        $delCount       = $this->s->AffectedRows();
        $this->s->CommitT();
        $RoundTime      = round($timer->end(), 3);
        $delInfo[]      = array(
            'time'  => $RoundTime,
            'rows'  => $PackDeleteCount,
        );
        $delCountTotal  += $delCount;
        if ($delCountReport >= $this->PackDeleteReport) {
            $delCountReport = 0;
            $delSqlCount    = count($delInfo);
            $EvTime         = 0;
            $PackTime       = 0;
            $EvCount        = 0;
            $PackCount      = 0;
            foreach ($delInfo as $v) {
                $PackTime   += $v['time'];
                $PackCount  += $v['rows'];
            }
            $EvTime         = round($PackTime/$delSqlCount, 2);
            $PackTime       = round($PackTime, 2);
            $EvCount        = round($PackCount/$delSqlCount);
            $TotalTime      = $this->readableTime(intval($timerTotal->end()));
            $this->LogString(sprintf('Del `%s`, Sql query count: %d; Time: %s; Count: %d; Evarage Time %s; Evarage count per delete: %d; Del total: %s; Del Total Time: %s; id <= %s', $table, $delSqlCount, $PackTime, $PackCount, $EvTime, $EvCount, $delCountTotal, $TotalTime, $MinId));
            $delInfo        = array();
        }

        $PackDeleteCountOld = $PackDeleteCount;
        if ($RoundTime < $PackMinTime) {
            $PackDeleteCount    = intval($PackDeleteCount + $PackDeleteCount*(1 - $RoundTime/$this->PackDeleteTime));
        } elseif ($RoundTime > $PackMaxTime) {
            $PackDeleteCount    = intval($PackDeleteCount - $PackDeleteCount*(1 - $this->PackDeleteTime/$RoundTime));
        }
        //$this->LogString(sprintf('Del `%s`, round time: %s; row count old: %d; row count new: %d', $table, $RoundTime, $PackDeleteCountOld, $PackDeleteCount));
    }
    $this->LogString(sprintf('Finished del `%s`: time: %s', $table, round($timerTotal->end(), 2)));
}

It has some dependencies, but they are self explanatory and can be easily changed with standard. I'll explain only input variables that are used here:

$table - target table, where rows needs to be deleted
$id - all records up to this id should be deleted
$MinId - Minimal id in the target table
$this->PackDeleteCount - Initial count of records, to start from. Then it recalculates row count to be deleted each new query.
$this->PackDeleteTime - desirable query execution time in average. I used 0.5
$this->PackDeleteDiv - acceptable deviation from $this->PackDeleteTime. In percentage. I used 0.3
$this->PackDeleteReport - Each N records should print statistic information about deleting

This variant has stable performance.

The reason of weak performance is that database engine have to physically resort all record data in leaves that was affected. It is my understanding and if your knowledge is deeper you are welcome to add description of what actually is going on in details. Perhaps it will give some new ideas.

Question: Is it possible to calculate row distribution on leaves and delete whole leave or even branch, so database engine does not have to resort data? Perhaps you have some other ideas on performance optimization for this case.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
ALTER TABLE `table` ADD INDEX `table_idx_id` (`id`);
The optimized query:
SELECT
        1 
    FROM
        `table` 
    WHERE
        `table`.id <= 10000000

Related Articles



* original question posted on StackOverflow here.