MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4?

The Story of UTF8 VS UTF8MB4

I once got a call from the support team, saying that one of our customers reported that the application fails to save data in one of our business-critical features. The customer is seeing a general error from the application. About 30 of his 500 users are experiencing this issue and can't save data in the application.

After a short 15 minutes debugging session, we saw that the data is transmitted from the client side, successful received in the server side, and the insertion query is fired to the database. But still, no data in the database. Hmm.. now it got interesting.

Looking at the logs, it turns out that for specific inputs, MySQL refused to insert the data to the database. The error MySQL logged was:

Incorrect string value: '\xF0\x9F\x98\x81...' for column 'data' at row 1

Looking at those first 4 bytes, I got to no conclusion as to what was the issue. When googling them, I found that they represent an emoticonΒ in UTF8. So, it means that for this specific input, each character is probably encoded as 4 bytes.

I tried to reproduce this issue with a different string, which has its characters encoded with 1-3 bytes per character. It turned out that it only happens when each character in the data was combined of 4-byte.

We scratched our heads and thought - the character set we used in the database is UTF8, which should support 4 bytes (right?), so what's wrong?

Well, it turns out we were wrong. We quickly realized that MySQL decided that UTF-8 can only hold 3 bytes per character (as it's defined as an alias of utf8mb3). Why? no good reason that I can find documented anywhere. Few years later, when MySQL 5.5.3 was released, they introduced a new encoding called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.

Recommendation

if you're using MySQL (or MariaDB or Percona Server), make sure you know your encodings. I would recommend anyone to set the MySQL encoding to utf8mb4. Never use utf8 in MySQL, there is no good reason to do that (unless you like tracing encoding related bugs).

How to convert utf8 to utf8mb4 in MySQL?

So now I had to fix this issue. As I recommend above, I wanted to use utf8mb4 and drop the old utf8. To do that, I used the following ALTER statements. Please DO NOT just copy paste them. You need to make sure you understand each of them and adjust them accordingly.

# Run this once on each schema you have (Replace database_name with your schema name)
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

# Run this once for each table you have (replace table_name with the table name)
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

# Run this for each column (replace table name, column_name, the column type, maximum length, etc.)
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Please note that you'll have to consider the consequences of increasing the column size from 3 bytes per character to 4. For example, MySQL indexes are limited to 768 bytes. This means that if you increase VARCHAR(255) from 3 bytes per character to 4 bytes per character, you won't meet that limit anymore.

To conclude, make sure you read about the internals of every decision you make with MySQL. Oh, and use utf8mb4 instead of utf8 without even thinking about it.

8 thoughts on “MySQL utf8 vs utf8mb4 – What’s the difference between utf8 and utf8mb4?

  1. Yeah - god this issue is a pain - nearly lost me clients too, and hair....
    Also - stop your MySQL instance, then edit the my.ini file in the installation folder, and change the default utf8 values there to utf8mb4. This wilk ensure that any new database and tables you create will default to using utf8mb4

  2. Hi!

    The 768 byte limit is specific to earlier releases of MySQL. In 5.7 the default row+table format is DYNAMIC/BARRACUDA and no longer has this issue.

    It is important to point out that the 3 bytes of utf8mb3 was an optimization chosen to cover "most modern languages". This optimization becomes untrue in MySQL 8.0 because of a number of performance enhancements: http://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

    (But is something that is still true in current releases, and should perhaps be mentioned.)

    1. I'll never understand the logic of creating a Unicode Translation Format which can't translate all of Unicode. Wasn't that the whole point of Unicode?

      I suppose I'm lucky that none of the characters I've needed, so far, is beyond the BMP, in any MySQL system I use. It's like Homer Simpson said: "These people looked deep within my soul and assigned me a number based on the order in which I joined."

  3. I ran into this too a while ago. Some customer inputs were cut off mid sentence and I spent a good few hours looking into it.

    The next thing I learned on that occasion was the difference between the _bin and _ci endings as in utf8mb4_bin. Those mean binary or case insensitive. Which is important if you query those fields, as the clause WHERE name = 'mark' won't return 'Mark' anymore.

  4. We are having a strange issue with this. We have a database that has the correct encoding. When we backup and restore the database the resulting table has the correct encoding. However the data has become corrupted.

    Basically we have a blog post full of emoticons that when we push from out staging to live loses the icons. Any pointers? I am thinking it must be to to with the mysqldump command used.

    1. Hi Scott, I would start with checking if your MySQL server is set to use utf8 in all aspects (check my.cnf for - default-character-set, collation-server, default-character-set).
      If not, dumping with the default parameters (mysqldump -uroot -p database > utf8.dump) might cause issues in the backup and restore process.
      Try using mysqldump with the --default-character-set=latin1 flag, and removing the SET NAMES='latin1' comment from the top of the created dump.
      To safely import utf8 dumps, do not use default parameters. Instead, use the following method:
      mysql -uroot -p --default-character-set=utf8 database
      mysql> SET names 'utf8'
      mysql> SOURCE utf8.dump

      Good luck πŸ™‚

  5. MediaWiki (the software of wikipedia) also kind of found this problem. If I remember correctly (so probably not) they used binary for a long time to be able to store any character from the client. Then later moved to utf8mb4.

Comments are closed.