UTF8 VS UTF8MB4 – What’s the difference?
You’re getting a support call from an IT administrator in a tech company saying some of his critical data can’t be saved in the product you deployed at his servers a week ago. His users are seeing a general error from the application. About 30 of his 500 users are experiencing this issue and can’t save data in the application.
After a short 15 minutes debug session, you can see that the data is transmitted from the client side, received in the backend server and the insertion query is fired to the database. But still, no data in the database.
Hmm.. now it got interesting. You roll up your sleeves, put up a smile on your face, some sunglasses on your eyes and you start investigating (well, maybe just start investigating without all the fancy accessories).
Looking at the logs, it turns out that for specific inputs, MySQL refused to add the data to the database. The error MySQL reports is:
Incorrect string value: ‘\xF0\x9F\x98\x81…’ for column ‘data’ at row 1
That’s a “good” hint, but why? What does it even mean? How can it be incorrect? Well, I started researching about this string and I found that the first 4 bytes combine a specific emoticon in UTF8. So, it means that in this specific input, each character is probably encoded as 4 bytes.
I tried to reproduce this issue with a different string, which has its characters encoded with 1 / 2 / 3 bytes per character. It turned out that it only happens when each character in the data was combined of 4-byte. The character set I used in the database is UTF8, which should support 4 bytes (right?), so what’s wrong?
Well, it turns out I was wrong. I learned that MySQL decided that UTF-8 can only hold 3 bytes per character. Why? no good reason that I can find documented anywhere. Few years later, when MySQL 5.5.3 was released, they introduced a new encoding called utf8mb4, which is actually the real 4-byte utf8 encoding that you know and love.
if you’re using MySQL in any flavor (MySQL, MariaDB, PerconaDB, etc.), make sure you know your encodings. I would recommend anyone to set the MySQL encoding to utf8mb4. Never use utf8 in MySQL, there is no good reason to do that (unless you like tracing encoding related bugs).
How to convert utf8 to utf8mb4 in MySQL?
So now I had to fix this issue. As I recommend above, I wanted to use utf8mb4 and drop the old utf8. To do that, I used the following ALTER statements. Please DO NOT just copy paste them. You need to make sure you understand each of them and adjust them accordingly.
# Run this once on each schema you have (Replace database_name with your schema name) ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci; # Run this once for each table you have (replace table_name with the table name) ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; # Run this for each column (replace table name, column_name, the column type, maximum length, etc.) ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Please note that you’ll have to consider the consequences of increasing the column size from 3 bytes per character to 4. For example, MySQL indexes are limited to 768 bytes. This means that if you increase VARCHAR(255) from 3 bytes per character to 4 bytes per character, you won’t meet that limit anymore.
To conclude, make sure you read about the internals of every decision you make with MySQL. Oh, and use utf8mb4 instead of utf8 without even thinking about it.