[Solved] Mysql Matching \"Same\" Emails
Looking to automatically optimize YOUR SQL query? Start for free.

EverSQL Database Performance Knowledge Base

Mysql Matching \"Same\" Emails

Database type:

I have a table with 2 columns email and id. I need to find emails that are closely related. For example:

[email protected]

and

[email protected]

These should be considered the same because the username (john.smith12) and the most top level domain (example.com) are the same. They are currently 2 different rows in my table. I've written the below expression which should do that comparison but it takes hours to execute (possibly/probably because of regex). Is there a better way to write this:

  select c1.email, c2.email 
  from table as c1
  join table as c2
   on (
             c1.leadid <> c2.leadid 
        and 
             c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))

The explain of this query comes back as:

id, select_type, table, type, possible_keys, key, key_len, ref,  rows,   Extra
1,  SIMPLE,      c1,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, NULL
1,  SIMPLE,      c2,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, Using where; Using join buffer (Block Nested Loop)

The create table is:

CREATE TABLE `table` (
 `ID` int(11) NOT NULL AUTO_INCREMENT,
 `Email` varchar(100) DEFAULT NULL,
 KEY `Table_Email` (`Email`),
 KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

I guess the indices aren't being used because of the regexp.

The regex comes out as:

john[.][email protected][^@]*example[.]com

which should match both addresses.

Update:

I've modified the on to be:

on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1. email, 1, (locate('@', c1.email) -1)) = substr(c2. email, 1, (locate('@', c2.email) -1))
and    
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))

and the explain with this approach is at least using the indices.

id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)

So far this has executed for 5 minutes, will update if there is a vast improvement.
Update 2:

I've split the email so the username is a column and domain is a column. I've stored the domain in reverse order so the index of it can be used with a trailing wildcard.

CREATE TABLE `table` (
     `ID` int(11) NOT NULL AUTO_INCREMENT,
     `Email` varchar(100) DEFAULT NULL,
     `domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
     `username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
     KEY `Table_Email` (`Email`),
     KEY `Email` (`Email`),
     KEY `domain` (`domain`)
    ) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

Query to populate new columns:

update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)), 
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));

New query:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

New Explain Results:

1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

From that explain it looks like the domain index is not being used. I also tried to force the usage with USE but that also didn't work, that resulted in no indices being used:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

Explain with use:

1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Calling Functions With Indexed Columns (query line: 10): When a function is used directly on an indexed column, the database's optimizer won’t be able to use the index. For example, if the column `email` is indexed, the index won’t be used as it’s wrapped with the function `replace`. If you can’t find an alternative condition that won’t use a function call, a possible solution is to store the required value in a new indexed column.
  2. Avoid Regular Expression Filtering (query line: 10): When a regular expression filtering is applied on an indexed column, the database's optimizer won’t be able to use the index. Consider replacing this call with an alternative condition that won’t use a regular expression.
  3. Create Optimal Indexes (modified query below): The recommended indexes are an integral part of this optimization effort and should be created before testing the execution duration of the optimized query.
Optimal indexes for this query:
ALTER TABLE `table` ADD INDEX `table_idx_leadid` (`leadid`);
The optimized query:
SELECT
        c1.email,
        c2.email 
    FROM
        table AS c1 
    JOIN
        table AS c2 
            ON (
                c1.leadid <> c2.leadid 
                AND c1.email REGEXP replace(replace(c2.email,
            '.',
            '[.]'),
            '@',
            '@[^@]*'))

Related Articles



* original question posted on StackOverflow here.