SuperGeekery: A blog probably of interest only to nerds by John F Morton.

A blog prob­a­bly of inter­est only to nerds by John F Mor­ton.

Fixing a SQL database to have only one type of collation, a Craft CMS story

Spiderman database collation

I’m in the process of updat­ing a site from Craft CMS 4 to Craft CMS 5. As I read through the detailed upgrade guide, I noticed some data­base relat­ed items that deserved close atten­tion. The char­ac­ter set and col­la­tion of the data­base will change dur­ing the upgrade.

Before I did the upgrade, I dis­cov­ered an issue with the col­la­tion type issue in my exist­ing data­base. I had a mix­ture of two dif­fer­ent col­la­tions and I want­ed to cor­rect that before I went through with the Craft update. Find your­self in the same boat? Let’s dive in.

Post pub­li­ca­tion note: While the con­tents of this post will allow you to fix the prob­lem of hav­ing a mixed col­la­tion type, I only real­ized lat­er that Craft CMS comes with a built-in CLI com­mand that will make this sim­ple. Check out the craft db/convert-charset CLI com­mand. See this page in the docs: https://​craftcms​.com/​d​o​c​s​/​5​.​x​/​r​e​f​e​r​e​n​c​e​/​c​l​i​.​h​t​m​l​#​d​b​-​c​o​n​v​e​r​t​-​c​h​arset. If you’re not using Craft though, read on for more.

Beware and be prepared

Before we go any far­ther, if you’re think­ing of mak­ing these changes your data­base as you read this post, I encour­age you to work local­ly and have a data­base back­up. There be squalls ahead. 🏴‍☠️

Tools

Here are the tools I’m using for this process:

  1. DDEV — the rec­om­mend­ed local devel­op­ment tool for Craft CMS
  2. Table­Plus — a handy GUI for work­ing with data­bas­es which has a very close inte­gra­tion with DDEV
  3. A code edi­tor, like PHP­Storm, VSCode, or Sub­lime Text, that under­stands SQL and can col­or code it for easy visu­al inspec­tion

What is your current collation?

In the Per­form­ing the Upgrade sec­tion of the guide, step 6 jumped out at me. Dur­ing the upgrade process, we need to spec­i­fy the data­base charset and col­la­tion. If you’ve always fol­lowed Craft’s defaults, you sim­ple add the fol­low­ing to your .env file.

CRAFT_DB_CHARSET="utf8mb3"
CRAFT_DB_COLLATION="utf8mb3_general_ci"

Per­haps you’re nod­ding in agree­ment as your eyes glaze over. That’s ok! 😵‍💫

I was­n’t sure that the data­base I was work­ing with had always fol­lowed Craft’s rec­om­men­da­tion. The data­base was from a site I had inher­it­ed a while ago, so I need­ed to con­firm what the state of my data was. Table­Plus can help with that.

Checking your database

With your site run­ning local­ly in DDEV, it’s easy to open up your data­base in Table­Plus. On the com­mand line, sim­ply enter:

ddev tableplus

You should see the table used in your DDEV site appear Table­Plus. There are two columns that we need to pay atten­tion to, charset and col­la­tion.

I was hap­py to see the that every table’s charset in the data­base from my Craft 4 site said utf8mb3.

Unfor­tu­nate­ly, the col­la­tion val­ues for the tables were a mix of ut8mb3_general_ci and utfmb3_unicode_ci.

I’m in the process of updat­ing a site from Craft CMS 4 to Craft CMS 5. As I read through the detailed upgrade guide, I noticed some data­base relat­ed items that deserved close atten­tion. The char­ac­ter set and col­la­tion of the data­base will change dur­ing the upgrade. 

Before I did the upgrade, I dis­cov­ered an issue with the col­la­tion type issue in my exist­ing data­base. I had a mix­ture of two dif­fer­ent col­la­tions and I want­ed to cor­rect that before I went through with the Craft update. Find your­self in the same boat? Let’s dive in.

Post pub­li­ca­tion note: While the con­tents of this post will allow you to fix the prob­lem of hav­ing a mixed col­la­tion type, I only real­ized lat­er that Craft CMS comes with a built-in CLI com­mand that will make this sim­ple. Check out the craft db/convert-charset CLI com­mand. See this page in the docs: https://​craftcms​.com/​d​o​c​s​/​5​.​x​/​r​e​f​e​r​e​n​c​e​/​c​l​i​.​h​t​m​l​#​d​b​— c​o​n​v​e​r​t​— c​h​arset. If you’re not using Craft though, read on for more.

Beware and be pre­pared #

Before we go any far­ther, if you’re think­ing of mak­ing these changes your data­base as you read this post, I encour­age you to work local­ly and have a data­base back­up. There be squalls ahead. 🏴‍☠️

Tools #

Here are the tools I’m using for this process:

  1. DDEV — the rec­om­mend­ed local devel­op­ment tool for Craft CMS
  2. Table­Plus — a handy GUI for work­ing with data­bas­es which has a very close inte­gra­tion with DDEV
  3. A code edi­tor, like PHP­Storm, VSCode, or Sub­lime Text, that under­stands SQL and can col­or code it for easy visu­al inspec­tion

What is your cur­rent col­la­tion? #

In the Per­form­ing the Upgrade sec­tion of the guide, step 6 jumped out at me. Dur­ing the upgrade process, we need to spec­i­fy the data­base charset and col­la­tion. If you’ve always fol­lowed Craft’s defaults, you sim­ple add the fol­low­ing to your .env file.

CRAFT_DB_CHARSET="utf8mb3"
CRAFT_DB_COLLATION="utf8mb3_general_ci"

Per­haps you’re nod­ding in agree­ment as your eyes glaze over. That’s ok! 😵‍💫

I wasn’t sure that the data­base I was work­ing with had always fol­lowed Craft’s rec­om­men­da­tion. The data­base was from a site I had inher­it­ed a while ago, so I need­ed to con­firm what the state of my data was. Table­Plus can help with that.

Check­ing your data­base #

With your site run­ning local­ly in DDEV, it’s easy to open up your data­base in Table­Plus. On the com­mand line, sim­ply enter:

ddev tableplus

You should see the table used in your DDEV site appear Table­Plus. There are two columns that we need to pay atten­tion to, charset and col­la­tion.

I was hap­py to see the that every table’s charset in the data­base from my Craft 4 site said utf8mb3.

Unfor­tu­nate­ly, the col­la­tion val­ues for the tables were a mix of ut8mb3_general_ci and utfmb3_unicode_ci.


2024 05 14 09 25 04

TablePlus showing different collations for tables in the same database


Why would the data­base have two types of col­la­tions? I don’t know, but my goal is to get them in sync. The site I’m work­ing on is about ten years old and has has mul­ti­ple devel­op­ers over time. It also does­n’t mat­ter how the data­base end­ed up in this state, I just want­ed to fix it.

Since the under­ly­ing charset for my tables is the same, utf8mb3, the data itself should be in good shape. Prob­lems I’ve seen in the past with the wrong charset for a table would be with extend­ed char­ac­ters, like let­ter with ascents. Luck­i­ly, I’m not deal­ing with that issue here.

The issue I had was that there were two dif­fer­ent col­la­tions, which can cause prob­lems. For exam­ple, MySQL can throw an Ille­gal mix of col­la­tions error. 😵‍💫😵‍💫

What are we talking about here?

But first, what’s a col­la­tion? Think of the word col­late. When you col­late data, you’re sort­ing it. Here’s how Wikipedia defines col­la­tion gen­er­al­ly.

A col­la­tion algo­rithm such as the Uni­code col­la­tion algo­rithm defines an order through the process of com­par­ing two giv­en char­ac­ter strings and decid­ing which should come before the oth­er. When an order has been defined in this way, a sort­ing algo­rithm can be used to put a list of any num­ber of items into that order.

That makes some sense. Col­la­tion has to do with the sort­ing of data. So what’s the dif­fer­ence between ut8mb3_general_ci and utfmb3_unicode_ci? The best answer I found was from Stack­Over­flow, which basi­cal­ly says:

  • utf8mb3_unicode_ci pro­vides more accu­rate sort­ing and com­par­i­son for a wide range of char­ac­ters accord­ing to the Uni­code stan­dard.
  • utf8mb3_general_ci is faster but less accu­rate in terms of sort­ing and com­par­i­son, espe­cial­ly for char­ac­ters with dia­crit­ics, i.e., accents.

Faster seems bet­ter, right? Also, since utf8mb3_general_ci has been the Craft CMS rec­om­men­da­tion all along, I’m sold. I want to get every­thing con­vert­ed to utf8mb3_general_ci. The key here is that we do sort­ing con­sis­tent­ly.

Based on my read­ing (see ref­er­ence links at the end of this post), chang­ing the col­la­tion should not be an issue. (You did make a data­base back­up, right?)

Why would the data­base have two types of col­la­tions? I don’t know, but my goal is to get them in sync. The site I’m work­ing on is about ten years old and has has mul­ti­ple devel­op­ers over time. It also doesn’t mat­ter how the data­base end­ed up in this state, I just want­ed to fix it.

Since the under­ly­ing charset for my tables is the same, utf8mb3, the data itself should be in good shape. Prob­lems I’ve seen in the past with the wrong charset for a table would be with extend­ed char­ac­ters, like let­ter with ascents. Luck­i­ly, I’m not deal­ing with that issue here.

The issue I had was that there were two dif­fer­ent col­la­tions, which can cause prob­lems. For exam­ple, MySQL can throw an Ille­gal mix of col­la­tions error. 😵‍💫😵‍💫

What are we talk­ing about here? #

But first, what’s a col­la­tion? Think of the word col­late. When you col­late data, you’re sort­ing it. Here’s how Wikipedia defines col­la­tion gen­er­al­ly.

A col­la­tion algo­rithm such as the Uni­code col­la­tion algo­rithm defines an order through the process of com­par­ing two giv­en char­ac­ter strings and decid­ing which should come before the oth­er. When an order has been defined in this way, a sort­ing algo­rithm can be used to put a list of any num­ber of items into that order.

That makes some sense. Col­la­tion has to do with the sort­ing of data. So what’s the dif­fer­ence between ut8mb3_general_ci and utfmb3_unicode_ci? The best answer I found was from Stack­Over­flow, which basi­cal­ly says:

  • utf8mb3_unicode_ci pro­vides more accu­rate sort­ing and com­par­i­son for a wide range of char­ac­ters accord­ing to the Uni­code stan­dard.
  • utf8mb3_general_ci is faster but less accu­rate in terms of sort­ing and com­par­i­son, espe­cial­ly for char­ac­ters with dia­crit­ics, i.e., accents.

Faster seems bet­ter, right? Also, since utf8mb3_general_ci has been the Craft CMS rec­om­men­da­tion all along, I’m sold. I want to get every­thing con­vert­ed to utf8mb3_general_ci. The key here is that we do sort­ing con­sis­tent­ly.

Based on my read­ing (see ref­er­ence links at the end of this post), chang­ing the col­la­tion should not be an issue. (You did make a data­base back­up, right?)


2024 05 14 15 31 30

How to use the TablePlus interface to run SQL


Using TablePlus to build your SQL

In my case, I had tables and columns both that need­ed a col­la­tion change.

I used two sets of SQL which each gen­er­ates a set of SQL state­ments (see the AS stmt in the code) that I will run in a final step to do the actu­al migra­tion. In the code below, be sure to change database_name to what­ev­er you DDEV data­base name is. In my case, that’s db. You can look at your .env file to find the data­base name in your project.

Step 1: SQL to generate the migration statements for tables:

SELECT 
    CONCAT('ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM 
    INFORMATION_SCHEMA.TABLES 
WHERE 
    TABLE_SCHEMA = 'database_name' 
    AND TABLE_COLLATION = 'utf8mb3_unicode_ci';

Step 2: SQL to generation the migration statements for columns:

SELECT 
    CONCAT('ALTER TABLE `', TABLE_NAME, '` CHANGE `', COLUMN_NAME, '` `', COLUMN_NAME, '` ', COLUMN_TYPE, ' CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM 
    INFORMATION_SCHEMA.COLUMNS 
WHERE 
    TABLE_SCHEMA = 'database_name' 
    AND COLLATION_NAME = 'utf8mb3_unicode_ci';

Step 3: The migration

I ran Step 1 and Step 2 sep­a­rate­ly, but you should be able to do it in one go if you pre­fer.

You should see many lines of code gen­er­at­ed when you run these com­mands. They will con­tain rows of ALTER TABLE” state­ments.

Copy the gen­er­at­ed ALTER TABLE state­ments and paste the code into a new blank doc­u­ment in your code edi­tor some­where in your project. Name the file what­ev­er you like, but make sure the exten­sion is .sql, i.e. my-database-migration-code.sql. The exten­sion is impor­tant because it will enable col­or for­mat­ting of the code and help you spot any poten­tial prob­lems.

On my first attempt, I did not save the gen­er­at­ed state­ment into a file and close­ly inspect it. I sim­ply ran the ALTER TABLE state­ments in Table­Plus. The migra­tion was most­ly suc­cess­ful, but a few of the SQL com­mand gen­er­at­ed errors. See­ing the for­mat­ted code in a code edi­tor revealed the prob­lem. Here’s a screen­shot of one of the prob­lem­at­ic lines in my code.

Using Table­Plus to build your SQL #

In my case, I had tables and columns both that need­ed a col­la­tion change.

I used two sets of SQL which each gen­er­ates a set of SQL state­ments (see the AS stmt in the code) that I will run in a final step to do the actu­al migra­tion. In the code below, be sure to change database_name to what­ev­er you DDEV data­base name is. In my case, that’s db. You can look at your .env file to find the data­base name in your project.

Step 1: SQL to gen­er­ate the migra­tion state­ments for tables: #

SELECT 
    CONCAT('ALTER TABLE `', TABLE_NAME, '` CONVERT TO CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM 
    INFORMATION_SCHEMA.TABLES 
WHERE 
    TABLE_SCHEMA = 'database_name' 
    AND TABLE_COLLATION = 'utf8mb3_unicode_ci';

Step 2: SQL to gen­er­a­tion the migra­tion state­ments for columns: #

SELECT 
    CONCAT('ALTER TABLE `', TABLE_NAME, '` CHANGE `', COLUMN_NAME, '` `', COLUMN_NAME, '` ', COLUMN_TYPE, ' CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;') AS stmt
FROM 
    INFORMATION_SCHEMA.COLUMNS 
WHERE 
    TABLE_SCHEMA = 'database_name' 
    AND COLLATION_NAME = 'utf8mb3_unicode_ci';

Step 3: The migra­tion #

I ran Step 1 and Step 2 sep­a­rate­ly, but you should be able to do it in one go if you pre­fer. 

You should see many lines of code gen­er­at­ed when you run these com­mands. They will con­tain rows of ALTER TABLE” state­ments.

Copy the gen­er­at­ed ALTER TABLE state­ments and paste the code into a new blank doc­u­ment in your code edi­tor some­where in your project. Name the file what­ev­er you like, but make sure the exten­sion is .sql, i.e. my-database-migration-code.sql. The exten­sion is impor­tant because it will enable col­or for­mat­ting of the code and help you spot any poten­tial prob­lems.

On my first attempt, I did not save the gen­er­at­ed state­ment into a file and close­ly inspect it. I sim­ply ran the ALTER TABLE state­ments in Table­Plus. The migra­tion was most­ly suc­cess­ful, but a few of the SQL com­mand gen­er­at­ed errors. See­ing the for­mat­ted code in a code edi­tor revealed the prob­lem. Here’s a screen­shot of one of the prob­lem­at­ic lines in my code. 


2024 05 14 15 46 40

The generated migration code had some lines wrapped in open and closing quotes.


Each error I saw cor­re­spond­ed to a line in my migra­tion that was wrapped in quotes. I removed the quotes and inspect­ed each line for prop­er syn­tax. Each line looked good when I removed the quotes. I saved the result in a new sql file in my project. Next, I restored my data­base from back­up and then past­ed my new­ly updat­ed migra­tion code into a new SQL query tab and pressed Run all” again.

A success migration

The fix worked and the migra­tion went through with­out an error. I want­ed to check results. Open­ing the migrat­ed table in Table­Plus showed that all my col­la­tion val­ues were now the same. Next, I went through my site to con­firm things looked good and did­n’t find any prob­lems. I opened at least one entry for each sec­tion in my site and saw every­thing ren­dered prop­er­ly.

This is the copy of the data­base which will replace the pro­duc­tion data­base.

The rest of the migra­tion from Craft 4 to Craft 5 went with­out a hitch. Once you get a site migrat­ed from 4 to 5 though, be sure to fin­ish your data migra­tion as men­tioned in the Data­base Char­ac­ter Set and Col­la­tion sec­tion of the upgrade guide.

You’ll need to update your .env file with a new charset and col­la­tion and run php craft db/convert-charset to fin­ish your data­base migra­tion.

Good luck!

Each error I saw cor­re­spond­ed to a line in my migra­tion that was wrapped in quotes. I removed the quotes and inspect­ed each line for prop­er syn­tax. Each line looked good when I removed the quotes. I saved the result in a new sql file in my project. Next, I restored my data­base from back­up and then past­ed my new­ly updat­ed migra­tion code into a new SQL query tab and pressed Run all” again. 

A suc­cess migra­tion #

The fix worked and the migra­tion went through with­out an error. I want­ed to check results. Open­ing the migrat­ed table in Table­Plus showed that all my col­la­tion val­ues were now the same. Next, I went through my site to con­firm things looked good and didn’t find any prob­lems. I opened at least one entry for each sec­tion in my site and saw every­thing ren­dered prop­er­ly. 

This is the copy of the data­base which will replace the pro­duc­tion data­base. 

The rest of the migra­tion from Craft 4 to Craft 5 went with­out a hitch. Once you get a site migrat­ed from 4 to 5 though, be sure to fin­ish your data migra­tion as men­tioned in the Data­base Char­ac­ter Set and Col­la­tion sec­tion of the upgrade guide.

You’ll need to update your .env file with a new charset and col­la­tion and run php craft db/convert-charset to fin­ish your data­base migra­tion. 

Good luck!


Reference links: