Encoding, Collation and Storage

Last updated on
17 September 2025

Drupal 7 will no longer be supported after January 5, 2025. Learn more and find resources for Drupal 7 sites

This documentation needs work. See "Help improve this page" in the sidebar.

Storage Engines or Table Types:

A Storage Engine is a software module that a database management system uses to create, read, update data from a database. In other words, Storage Engines are underlying software components that handle operations and manage information. 

You can see Storage and Collation settings of a table by selecting a MySQL's table and clicking on the Operations tab in phpMyAdmin.

Drupal Character Encoding, collation and storage settings

MySQL includes character set support that enables you to store data using a variety of character sets and perform comparisons according to a variety of collations. You can specify character sets at the server, database, table, and column level. MySQL supports the use of character sets for the MyISAM, MEMORY, and InnoDB storage engines.

InnoDB and MyISAM are two most common storage engines for MySQL. The following table compares both storage engines in detail.

InnoDB MyISAM

Preview

This is the default storage engine for MySQL 5.5 and higher. InnoDB should be your default storage engine of choice. This storage engine manages non-transactional tables, provides high-speed storage and retrieval, supports full-text searching.

Storage limits

64TB

256TB

Multiversion concurrency control

Yes No

B-Tree indexes

Yes Yes

Full-Text Search Indexes

Yes Yes

Index caches

Yes Yes

Cluster Database Support

No No

Backup or point in time recovery

Yes Yes

Transactions

Yes No

Geospatial Datatype Support

Yes Yes

T-tree indexes

No No

Clustered indexes

Yes No

Compressed data

Yes Yes

Replication support

Yes Yes

Query cache support

Yes Yes

Locking granularity

Row Table

Geospatial indexing support

No Yes

Hash indexes

No No

Data caches

Yes No

Encrypted data

Yes Yes

Foreign key support

Yes No

Update Statistics for data dictionary

Yes Yes

Conclusion

InnoDB is most widely used default storage engine of choice with transaction/rollback/commit support. Oracle recommends using InnoDB. MyISAM is used only for some specialized use cases.

Character Sets and Collations:

You must have come across following terms several times while dealing with MySQL databases:

utf8_general_ci utf8mb4_general_ci latin1_swedish_ci
utf8_unicode_ci utf8mb4_unicode_ci

The first part of the string represents Character Encoding and the second and last part represents Collation. 

Character Encoding latin1,utf8, uft8mb4 
Collation swedish_ci,general_ci, unicode_ci

First of all, let's have a look at character encodings.

A Characterset determines encoding and range of characters.

If you need to handle special characters, like letters with accents you use utf8 encoding. However, this encoding won’t handle emojis. For that, you’ll need utf8mb4, available since MySQL 5.5.3. If you don’t care about any of that, simply go with the default, latin1.

A note of warning when migrating from utf8 to utf8mb4. In utf8, a character can be encoded in a maximum of 3 bytes. In utf8mb4, we add an extra byte to store special characters like smileys. That changes the maximum length a column or index can hold. So if a column was a varchar(256) in utf8, it should now be a varchar(191) in utf8mb4.

Charset Description Default Collation MaxLen
latin1 cp1252 West European latin1_swedish_ci 1
utf8 UTF-8 Unicode utf8_general_ci 3
utf8mb4 UTF-8 Unicode utf8mb4_general_ci 4

latin1 contains the basic ASCII set of unaccented alphanumeric characters and punctuations. Its sort order is swedish and its case insensitive. So, why Swedish and not English? Its because MySQL was originally developed in Sweden and Swedish uses the same sort order as English. So, this default is fine for English.

Each character in the latin1 character set is stored in a single byte.

But, the characters in utf8 and utf8mb4 are variable in length.

Why is Varchar often considered more storage-friendly than Char?

Have a look at the following chart:

+--------------------------------------------------+
| Character Set      |  CHAR(40)  | VARCHAR(40)    |
+--------------------+------------+----------------+
| latin1             |  40 bytes  |                |
+--------------------+------------| Actual length  |
| utf8               |  120 bytes |       +        |
+--------------------+------------| 1 or 2 bytes   |         
| utf8mb4            |  160 bytes |                |
+--------------------------------------------------+

With a fixed length CHAR column defined to store 40 characters, latin1 just uses 40 bytes, but with multibyte character sets like utf8 and utf8mb4 MySQL needs to reserve enough storage space to accommodate the widest character in the set.

A variable length varchar column, on the other hand, uses only the actual length plus one or two bytes. 

Consequently, it is much more efficient to use Varchar with utf8 and utf8mb4 even for fixed length strings.

How do utf8 and utf8mb4 store characters?

Charset

Byte

Use

utf8 / utf8mb4 1st Same as latin
utf8 / utf8mb4 2nd Accented characters, Greek, Cyrillic, Armenian, Hebrew, and Arabic.
utf8 / utf8mb4 3rd Asian languages (Devnagri Script Languages), mathematical operators and dingbats.
utf8 / utf8mb4 4th Ancient languages, smilies, symbols and rare Chinese, Japanese and Korean Characters.
Understanding Collation:

One one side where Character set defines how text is stored, Collation defines how they are sorted and compared.

utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparison, which sorts accurately in a very wide range of languages.

utf8mb4_unicode_ci, which uses the Unicode rules for sorting and comparison, employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters. These rules need to take into account language-specific conventions; not everybody sorts their characters in what we would call 'alphabetical order'.

utf8mb4_general_ci fails to implement all of the Unicode sorting rules, which will result in undesirable sorting in some situations, such as when using particular languages or characters.

utf8mb4_general_ci is faster at comparisons and sorting because it takes a bunch of performance-related shortcuts.

As far as Latin (ie "European") languages go, there is not much difference between the Unicode sorting and the simplified utf8mb4_general_ci sorting in MySQL.

In non-latin languages, such as Asian languages or languages with different alphabets, there may be a lot more differences between Unicode sorting and the simplified utf8mb4_general_ci sorting. The suitability of utf8mb4_general_ci will depend heavily on the language used. For some languages, it'll be quite inadequate.

Help improve this page

Page status: Needs work

You can: