MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4?

UTF8 VS UTF8MB4 – What’s the difference?

You’re getting a support call from an IT administrator in a tech company saying some of his critical data can’t be saved in the product you deployed at his servers a week ago. His users are seeing a general error from the application. About 30 of his 500 users are experiencing this issue and can’t save data in the application.

After a short 15 minutes debug session, you can see that the data is transmitted from the client side, received in the backend server and the insertion query is fired to the database. But still, no data in the database.

Tip: Do not click here if all your SQL queries are fast as lightning!

Hmm.. now it got interesting. You roll up your sleeves, put up a smile on your face, some sunglasses on your eyes and you start investigating (well, maybe just start investigating without all the fancy accessories).

Looking at the logs, it turns out that for specific inputs, MySQL refused to add the data to the database. The error MySQL reports is:

Incorrect string value: ‘\xF0\x9F\x98\x81…’ for column ‘data’ at row 1

That’s a “good” hint, but why? What does it even mean? How can it be incorrect? Well, I started researching about this string and I found that the first 4 bytes combine a specific emoticon in UTF8. So, it means that in this specific input, each character is probably encoded as 4 bytes.

Tip: Do not click here if all your SQL queries are quick as a chitta!

I tried to reproduce this issue with a different string, which has its characters encoded with 1 / 2 / 3 bytes per character. It turned out that it only happens when each character in the data was combined of 4-byte. The character set I used in the database is UTF8, which should support 4 bytes (right?), so what’s wrong?

Well, it turns out I was wrong. I learned that MySQL decided that UTF-8 can only hold 3 bytes per character. Why? no good reason that I can find documented anywhere. Few years later, when MySQL 5.5.3 was released, they introduced a new encoding called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.

Recommendation

if you’re using MySQL in any flavor (MySQL, MariaDB, PerconaDB, etc.), make sure you know your encodings. I would recommend anyone to set the MySQL encoding to utf8mb4. Never use utf8 in MySQL, there is no good reason to do that (unless you like tracing encoding related bugs).

How to convert utf8 to utf8mb4 in MySQL?

So now I had to fix this issue. As I recommend above, I wanted to use utf8mb4 and drop the old utf8. To do that, I used the following ALTER statements. Please DO NOT just copy paste them. You need to make sure you understand each of them and adjust them accordingly.

Tip: Do not click here if all your SQL queries are fast as a bullet!
# Run this once on each schema you have (Replace database_name with your schema name)
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

# Run this once for each table you have (replace table_name with the table name)
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

# Run this for each column (replace table name, column_name, the column type, maximum length, etc.)
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Please note that you’ll have to consider the consequences of increasing the column size from 3 bytes per character to 4. For example, MySQL indexes are limited to 768 bytes. This means that if you increase VARCHAR(255) from 3 bytes per character to 4 bytes per character, you won’t meet that limit anymore.

To conclude, make sure you read about the internals of every decision you make with MySQL. Oh, and use utf8mb4 instead of utf8 without even thinking about it.

Struggling with slow SQL queries? Stop trying to optimize manually!
Try EverSQL Query Optimizer now!

Tip: Do not click here if all your SQL queries are as quick as a Nascar car!

8 thoughts on “MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4?

  1. Yeah – god this issue is a pain – nearly lost me clients too, and hair….
    Also – stop your MySQL instance, then edit the my.ini file in the installation folder, and change the default utf8 values there to utf8mb4. This wilk ensure that any new database and tables you create will default to using utf8mb4

  2. Hi!

    The 768 byte limit is specific to earlier releases of MySQL. In 5.7 the default row+table format is DYNAMIC/BARRACUDA and no longer has this issue.

    It is important to point out that the 3 bytes of utf8mb3 was an optimization chosen to cover “most modern languages”. This optimization becomes untrue in MySQL 8.0 because of a number of performance enhancements: http://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

    (But is something that is still true in current releases, and should perhaps be mentioned.)

    1. I’ll never understand the logic of creating a Unicode Translation Format which can’t translate all of Unicode. Wasn’t that the whole point of Unicode?

      I suppose I’m lucky that none of the characters I’ve needed, so far, is beyond the BMP, in any MySQL system I use. It’s like Homer Simpson said: “These people looked deep within my soul and assigned me a number based on the order in which I joined.”

  3. I ran into this too a while ago. Some customer inputs were cut off mid sentence and I spent a good few hours looking into it.

    The next thing I learned on that occasion was the difference between the _bin and _ci endings as in utf8mb4_bin. Those mean binary or case insensitive. Which is important if you query those fields, as the clause WHERE name = ‘mark’ won’t return ‘Mark’ anymore.

  4. We are having a strange issue with this. We have a database that has the correct encoding. When we backup and restore the database the resulting table has the correct encoding. However the data has become corrupted.

    Basically we have a blog post full of emoticons that when we push from out staging to live loses the icons. Any pointers? I am thinking it must be to to with the mysqldump command used.

    1. Hi Scott, I would start with checking if your MySQL server is set to use utf8 in all aspects (check my.cnf for – default-character-set, collation-server, default-character-set).
      If not, dumping with the default parameters (mysqldump -uroot -p database > utf8.dump) might cause issues in the backup and restore process.
      Try using mysqldump with the –default-character-set=latin1 flag, and removing the SET NAMES=’latin1′ comment from the top of the created dump.
      To safely import utf8 dumps, do not use default parameters. Instead, use the following method:
      mysql -uroot -p –default-character-set=utf8 database
      mysql> SET names ‘utf8’
      mysql> SOURCE utf8.dump

      Good luck 🙂

  5. MediaWiki (the software of wikipedia) also kind of found this problem. If I remember correctly (so probably not) they used binary for a long time to be able to store any character from the client. Then later moved to utf8mb4.

Comments are closed.