Fixing a SQL database to have only one type of collation, a Craft CMS story
I’m in the process of updating a site from Craft CMS 4 to Craft CMS 5. As I read through the detailed upgrade guide, I noticed some database related items that deserved close attention. The character set and collation of the database will change during the upgrade.
Before I did the upgrade, I discovered an issue with the collation type issue in my existing database. I had a mixture of two different collations and I wanted to correct that before I went through with the Craft update. Find yourself in the same boat? Let’s dive in.
Post publication note: While the contents of this post will allow you to fix the problem of having a mixed collation type, I only realized later that Craft CMS comes with a built-in CLI command that will make this simple. Check out the
craft db/convert-charset
CLI command. See this page in the docs: https://craftcms.com/docs/5.x/reference/cli.html#db-convert-charset. If you’re not using Craft though, read on for more.
Beware and be prepared
Before we go any farther, if you’re thinking of making these changes your database as you read this post, I encourage you to work locally and have a database backup. There be squalls ahead. 🏴☠️
Tools
Here are the tools I’m using for this process:
- DDEV — the recommended local development tool for Craft CMS
- TablePlus — a handy GUI for working with databases which has a very close integration with DDEV
- A code editor, like PHPStorm, VSCode, or Sublime Text, that understands SQL and can color code it for easy visual inspection
What is your current collation?
In the Performing the Upgrade section of the guide, step 6 jumped out at me. During the upgrade process, we need to specify the database charset and collation. If you’ve always followed Craft’s defaults, you simple add the following to your .env
file.
CRAFT_DB_CHARSET="utf8mb3"
CRAFT_DB_COLLATION="utf8mb3_general_ci"
Perhaps you’re nodding in agreement as your eyes glaze over. That’s ok! 😵💫
I wasn’t sure that the database I was working with had always followed Craft’s recommendation. The database was from a site I had inherited a while ago, so I needed to confirm what the state of my data was. TablePlus can help with that.
Checking your database
With your site running locally in DDEV, it’s easy to open up your database in TablePlus. On the command line, simply enter:
ddev tableplus
You should see the table used in your DDEV site appear TablePlus. There are two columns that we need to pay attention to, charset and collation.
I was happy to see the that every table’s charset in the database from my Craft 4 site said utf8mb3
.
Unfortunately, the collation values for the tables were a mix of ut8mb3_general_ci
and utfmb3_unicode_ci
.
I’m in the process of updating a site from Craft CMS 4 to Craft CMS 5. As I read through the detailed upgrade guide, I noticed some database related items that deserved close attention. The character set and collation of the database will change during the upgrade.
Before I did the upgrade, I discovered an issue with the collation type issue in my existing database. I had a mixture of two different collations and I wanted to correct that before I went through with the Craft update. Find yourself in the same boat? Let’s dive in.
Post publication note: While the contents of this post will allow you to fix the problem of having a mixed collation type, I only realized later that Craft CMS comes with a built-in CLI command that will make this simple. Check out the
craft db/convert-charset
CLI command. See this page in the docs: https://craftcms.com/docs/5.x/reference/cli.html#db— convert— charset. If you’re not using Craft though, read on for more.
Beware and be prepared #
Before we go any farther, if you’re thinking of making these changes your database as you read this post, I encourage you to work locally and have a database backup. There be squalls ahead. 🏴☠️
Tools #
Here are the tools I’m using for this process:
- DDEV — the recommended local development tool for Craft CMS
- TablePlus — a handy GUI for working with databases which has a very close integration with DDEV
- A code editor, like PHPStorm, VSCode, or Sublime Text, that understands SQL and can color code it for easy visual inspection
What is your current collation? #
In the Performing the Upgrade section of the guide, step 6 jumped out at me. During the upgrade process, we need to specify the database charset and collation. If you’ve always followed Craft’s defaults, you simple add the following to your .env
file.
CRAFT_DB_CHARSET="utf8mb3"
CRAFT_DB_COLLATION="utf8mb3_general_ci"
Perhaps you’re nodding in agreement as your eyes glaze over. That’s ok! 😵💫
I wasn’t sure that the database I was working with had always followed Craft’s recommendation. The database was from a site I had inherited a while ago, so I needed to confirm what the state of my data was. TablePlus can help with that.
Checking your database #
With your site running locally in DDEV, it’s easy to open up your database in TablePlus. On the command line, simply enter:
ddev tableplus
You should see the table used in your DDEV site appear TablePlus. There are two columns that we need to pay attention to, charset and collation.
I was happy to see the that every table’s charset in the database from my Craft 4 site said utf8mb3
.
Unfortunately, the collation values for the tables were a mix of ut8mb3_general_ci
and utfmb3_unicode_ci
.
Why would the database have two types of collations? I don’t know, but my goal is to get them in sync. The site I’m working on is about ten years old and has has multiple developers over time. It also doesn’t matter how the database ended up in this state, I just wanted to fix it.
Since the underlying charset for my tables is the same, utf8mb3
, the data itself should be in good shape. Problems I’ve seen in the past with the wrong charset for a table would be with extended characters, like letter with ascents. Luckily, I’m not dealing with that issue here.
The issue I had was that there were two different collations, which can cause problems. For example, MySQL can throw an Illegal mix of collations error. 😵💫😵💫
What are we talking about here?
But first, what’s a collation? Think of the word collate. When you collate data, you’re sorting it. Here’s how Wikipedia defines collation generally.
A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.
That makes some sense. Collation has to do with the sorting of data. So what’s the difference between ut8mb3_general_ci
and utfmb3_unicode_ci
? The best answer I found was from StackOverflow, which basically says:
utf8mb3_unicode_ci
provides more accurate sorting and comparison for a wide range of characters according to the Unicode standard.utf8mb3_general_ci
is faster but less accurate in terms of sorting and comparison, especially for characters with diacritics, i.e., accents.
Faster seems better, right? Also, since utf8mb3_general_ci
has been the Craft CMS recommendation all along, I’m sold. I want to get everything converted to utf8mb3_general_ci
. The key here is that we do sorting consistently.
Based on my reading (see reference links at the end of this post), changing the collation should not be an issue. (You did make a database backup, right?)
Why would the database have two types of collations? I don’t know, but my goal is to get them in sync. The site I’m working on is about ten years old and has has multiple developers over time. It also doesn’t matter how the database ended up in this state, I just wanted to fix it.
Since the underlying charset for my tables is the same, utf8mb3
, the data itself should be in good shape. Problems I’ve seen in the past with the wrong charset for a table would be with extended characters, like letter with ascents. Luckily, I’m not dealing with that issue here.
The issue I had was that there were two different collations, which can cause problems. For example, MySQL can throw an Illegal mix of collations error. 😵💫😵💫
What are we talking about here? #
But first, what’s a collation? Think of the word collate. When you collate data, you’re sorting it. Here’s how Wikipedia defines collation generally.
A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.
That makes some sense. Collation has to do with the sorting of data. So what’s the difference between ut8mb3_general_ci
and utfmb3_unicode_ci
? The best answer I found was from StackOverflow, which basically says:
utf8mb3_unicode_ci
provides more accurate sorting and comparison for a wide range of characters according to the Unicode standard.utf8mb3_general_ci
is faster but less accurate in terms of sorting and comparison, especially for characters with diacritics, i.e., accents.
Faster seems better, right? Also, since utf8mb3_general_ci
has been the Craft CMS recommendation all along, I’m sold. I want to get everything converted to utf8mb3_general_ci
. The key here is that we do sorting consistently.
Based on my reading (see reference links at the end of this post), changing the collation should not be an issue. (You did make a database backup, right?)
Using TablePlus to build your SQL
In my case, I had tables and columns both that needed a collation change.
I used two sets of SQL which each generates a set of SQL statements (see the AS stmt
in the code) that I will run in a final step to do the actual migration. In the code below, be sure to change database_name
to whatever you DDEV database name is. In my case, that’s db
. You can look at your .env
file to find the database name in your project.
Step 1: SQL to generate the migration statements for tables:
SELECT
CONCAT('ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM
INFORMATION_SCHEMA.TABLES
WHERE
TABLE_SCHEMA = 'database_name'
AND TABLE_COLLATION = 'utf8mb3_unicode_ci';
Step 2: SQL to generation the migration statements for columns:
SELECT
CONCAT('ALTER TABLE `', TABLE_NAME, '` CHANGE `', COLUMN_NAME, '` `', COLUMN_NAME, '` ', COLUMN_TYPE, ' CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_SCHEMA = 'database_name'
AND COLLATION_NAME = 'utf8mb3_unicode_ci';
Step 3: The migration
I ran Step 1 and Step 2 separately, but you should be able to do it in one go if you prefer.
You should see many lines of code generated when you run these commands. They will contain rows of “ALTER TABLE” statements.
Copy the generated ALTER TABLE statements and paste the code into a new blank document in your code editor somewhere in your project. Name the file whatever you like, but make sure the extension is .sql
, i.e. my-database-migration-code.sql
. The extension is important because it will enable color formatting of the code and help you spot any potential problems.
On my first attempt, I did not save the generated statement into a file and closely inspect it. I simply ran the ALTER TABLE statements in TablePlus. The migration was mostly successful, but a few of the SQL command generated errors. Seeing the formatted code in a code editor revealed the problem. Here’s a screenshot of one of the problematic lines in my code.
Using TablePlus to build your SQL #
In my case, I had tables and columns both that needed a collation change.
I used two sets of SQL which each generates a set of SQL statements (see the AS stmt
in the code) that I will run in a final step to do the actual migration. In the code below, be sure to change database_name
to whatever you DDEV database name is. In my case, that’s db
. You can look at your .env
file to find the database name in your project.
Step 1: SQL to generate the migration statements for tables: #
SELECT
CONCAT('ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM
INFORMATION_SCHEMA.TABLES
WHERE
TABLE_SCHEMA = 'database_name'
AND TABLE_COLLATION = 'utf8mb3_unicode_ci';
Step 2: SQL to generation the migration statements for columns: #
SELECT
CONCAT('ALTER TABLE `', TABLE_NAME, '` CHANGE `', COLUMN_NAME, '` `', COLUMN_NAME, '` ', COLUMN_TYPE, ' CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_SCHEMA = 'database_name'
AND COLLATION_NAME = 'utf8mb3_unicode_ci';
Step 3: The migration #
I ran Step 1 and Step 2 separately, but you should be able to do it in one go if you prefer.
You should see many lines of code generated when you run these commands. They will contain rows of “ALTER TABLE” statements.
Copy the generated ALTER TABLE statements and paste the code into a new blank document in your code editor somewhere in your project. Name the file whatever you like, but make sure the extension is .sql
, i.e. my-database-migration-code.sql
. The extension is important because it will enable color formatting of the code and help you spot any potential problems.
On my first attempt, I did not save the generated statement into a file and closely inspect it. I simply ran the ALTER TABLE statements in TablePlus. The migration was mostly successful, but a few of the SQL command generated errors. Seeing the formatted code in a code editor revealed the problem. Here’s a screenshot of one of the problematic lines in my code.
Each error I saw corresponded to a line in my migration that was wrapped in quotes. I removed the quotes and inspected each line for proper syntax. Each line looked good when I removed the quotes. I saved the result in a new sql file in my project. Next, I restored my database from backup and then pasted my newly updated migration code into a new SQL query tab and pressed “Run all” again.
A success migration
The fix worked and the migration went through without an error. I wanted to check results. Opening the migrated table in TablePlus showed that all my collation values were now the same. Next, I went through my site to confirm things looked good and didn’t find any problems. I opened at least one entry for each section in my site and saw everything rendered properly.
This is the copy of the database which will replace the production database.
The rest of the migration from Craft 4 to Craft 5 went without a hitch. Once you get a site migrated from 4 to 5 though, be sure to finish your data migration as mentioned in the Database Character Set and Collation section of the upgrade guide.
You’ll need to update your .env
file with a new charset and collation and run php craft db/convert-charset
to finish your database migration.
Good luck!
Each error I saw corresponded to a line in my migration that was wrapped in quotes. I removed the quotes and inspected each line for proper syntax. Each line looked good when I removed the quotes. I saved the result in a new sql file in my project. Next, I restored my database from backup and then pasted my newly updated migration code into a new SQL query tab and pressed “Run all” again.
A success migration #
The fix worked and the migration went through without an error. I wanted to check results. Opening the migrated table in TablePlus showed that all my collation values were now the same. Next, I went through my site to confirm things looked good and didn’t find any problems. I opened at least one entry for each section in my site and saw everything rendered properly.
This is the copy of the database which will replace the production database.
The rest of the migration from Craft 4 to Craft 5 went without a hitch. Once you get a site migrated from 4 to 5 though, be sure to finish your data migration as mentioned in the Database Character Set and Collation section of the upgrade guide.
You’ll need to update your .env
file with a new charset and collation and run php craft db/convert-charset
to finish your database migration.
Good luck!
Reference links:
- https://craftcms.com/docs/5.x/upgrade.html
- https://en.wikipedia.org/wiki/Collation
- https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci
- https://dba.stackexchange.com/questions/255548/mysql-performance-implications-of-changing-a-column-collation-from-latin1-swedi
- https://sqlserverfast.com/blog/hugo/2019/01/collation-change-script/
- https://dev.mysql.com/doc/refman/8.0/en/charset.html
- https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html
- https://tableplus.com/blog/2019/08/illegal-mix-of-collations-mysql.html
- https://medium.com/@nilesh.patil.d/utf8-general-ci-vs-utf8-unicode-ci-what-should-we-use-cc01b58c00cc
Reference links: #
- https://craftcms.com/docs/5.x/upgrade.html
- https://en.wikipedia.org/wiki/Collation
- https://stackoverflow.com/questions/766809/whats— the— difference— between— utf8— general— ci— and— utf8— unicode-ci
- https://dba.stackexchange.com/questions/255548/mysql— performance— implications— of— changing— a— column— collation— from— latin1— swedi
- https://sqlserverfast.com/blog/hugo/2019/01/collation— change— script/
- https://dev.mysql.com/doc/refman/8.0/en/charset.html
- https://dev.mysql.com/doc/refman/8.0/en/charset— unicode— utf8mb3.html
- https://tableplus.com/blog/2019/08/illegal— mix— of— collations— mysql.html
- https://medium.com/@nilesh.patil.d/utf8— general— ci— vs— utf8— unicode— ci— what— should— we— use— cc01b58c00cc