[Solved] mySQL - matching latin (english) form input to utf8 (non-English) data

EverSQL Database Performance Knowledge Base

mySQL - matching latin (english) form input to utf8 (non-English) data

Database type:

I maintain a music database in mySQL, how do I return results stored under e.g. 'Tiësto' when people search for 'Tiesto'?

All the data is stored under full text indexing, if that makes any difference.

I'm already employing a combination of Levenshtein in PHP and REGEXP in SQL - not in trying to solve this problem, but just for increased searchability in general.

PHP:

function Levenshtein($word) {

$words = array();
for ($i = 0; $i < strlen($word); $i++) {
    $words[] = substr($word, 0, $i) . '_' . substr($word, $i);
    $words[] = substr($word, 0, $i) . substr($word, $i + 1);
    $words[] = substr($word, 0, $i) . '_' . substr($word, $i + 1);
    }
$words[] = $word . '_';
return $words;
}

$fuzzyartist = Levenshtein($_POST['searchartist']);
$searchimplode = "'".implode("', '", $fuzzyartist)."'";

mySQL:

SELECT *
FROM new_track_database
WHERE artist REGEXP concat_ws('|', $searchimplode);

To add, I frequently perform character set conversions and string sanitation in PHP, but these have always been the OTHER way - standardising non latin characters. I can't get my head around performing the oppsite process, but only in certain circumstances based on the data I've got stored.

How to optimize this SQL query?

The following recommendations will help you in your SQL tuning process.
You'll find 3 sections below:

  1. Description of the steps you can take to speed up the query.
  2. The optimal indexes for this query, which you can copy and create in your database.
  3. An automatically re-written query you can copy and execute in your database.
The optimization process and recommendations:
  1. Avoid Regular Expression Filtering (query line: 6): When a regular expression filtering is applied on an indexed column, the database's optimizer won’t be able to use the index. Consider replacing this call with an alternative condition that won’t use a regular expression.
  2. Avoid Selecting Unnecessary Columns (query line: 2): Avoid selecting all columns with the '*' wildcard, unless you intend to use them all. Selecting redundant columns may result in unnecessary performance degradation.
The optimized query:
SELECT
        * 
    FROM
        new_track_database 
    WHERE
        new_track_database.artist REGEXP concat_ws('|', $searchimplode)

Related Articles



* original question posted on StackOverflow here.