MySQL 8.0 Reference Manual(读书笔记36节-- 字符编码(3))
1.Configuring Application Character Set and Collation
Regardless of how you configure the MySQL character set for application use, you must also consider the environment within which those applications execute. For example, if you intend to send statements using UTF-8 text taken from a file that you create in an editor, you should edit the file with the locale of your environment set to UTF-8 so that the file encoding is correct and so that the operating system handles it correctly. If you use the mysql client from within a terminal window, the window must be configured to use UTF-8 or characters may not display properly. For a script that executes in a Web environment, the script must handle character encoding properly for its interaction with the MySQL server, and it must generate pages that correctly indicate the encoding so that browsers know how to display the content of the pages. For example, you can include this tag within your element:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
2.Column Character Set Conversion
To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For successful conversion【kənˈvɜːrʒn 转换;转化;转变;(宗教或信仰的)改变;皈依;(持球触地或持球越过对方球门线后的)附加得分;归附;(尤指为居住而)改建的房屋;】 to occur, one of the following conditions must apply:
• If the column has a binary data type (BINARY, VARBINARY, BLOB), all the values that it contains must be encoded using a single character set (the character set you're converting the column to). If you use a binary column to store information in multiple character sets, MySQL has no way to know which values use which character set and cannot convert the data properly.
• If the column has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.
Suppose that a table t has a binary column named col1 defined as VARBINARY(50). Assuming that the information in the column is encoded using a single character set, you can convert it to a nonbinary column that has that character set. For example, if col1 contains binary data representing characters in the greek character set, you can convert it as follows:
ALTER TABLE t MODIFY col1 VARCHAR(50) CHARACTER SET greek;
If your original column has a type of BINARY(50), you could convert it to CHAR(50), but the resulting values are padded with 0x00 bytes at the end, which may be undesirable. To remove these bytes, use the TRIM() function:
UPDATE t SET col1 = TRIM(TRAILING 0x00 FROM col1);
Suppose that table t has a nonbinary column named col1 defined as CHAR(50) CHARACTER SET latin1 but you want to convert it to use utf8mb4 so that you can store values from many languages. The following statement accomplishes this:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8mb4;
Conversion may be lossy【ˈlɔːsi (对数据或电能)有损的,有损耗的;】 if the column contains characters that are not in both character sets.
A special case occurs if you have old tables from before MySQL 4.1 where a nonbinary column contains values that actually are encoded in a character set different from the server's default character set. For example, an application might have stored sjis values in a column, even though MySQL's default character set was different. It is possible to convert the column to use the proper character set but an additional step is required. Suppose that the server's default character set was latin1 and col1 is defined as CHAR(50) but its contents are sjis values. The first step is to convert the column to a binary data type, which removes the existing character set information without performing any character conversion:
ALTER TABLE t MODIFY col1 BLOB;
The next step is to convert the column to a nonbinary data type with the proper character set:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET sjis;
This procedure requires that the table not have been modified already with statements such as INSERT or UPDATE after an upgrade to MySQL 4.1 or higher. In that case, MySQL would store new values in the column using latin1, and the column would contain a mix of sjis and latin1 values and cannot be converted properly.
If you specified attributes when creating a column initially, you should also specify them when altering the table with ALTER TABLE. For example, if you specified NOT NULL and an explicit DEFAULT value, you should also provide them in the ALTER TABLE statement. Otherwise, the resulting column definition does not include those attributes.
To convert all character columns in a table, the ALTER TABLE ... CONVERT TO CHARACTER SET charset statement may be useful.
3. Collation Issues
The following sections discuss various aspects of character set collations.
3.1 Using COLLATE in SQL Statements
With the COLLATE clause, you can override whatever the default collation is for a comparison. COLLATE may be used in various parts of SQL statements. Here are some examples:
• With ORDER BY:
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
• With AS:
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
• With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
• With WHERE:
SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k; 或 SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
3.2 COLLATE Clause Precedence
The COLLATE clause has high precedence (higher than ||), so the following two expressions are equivalent:
x || y COLLATE z x || (y COLLATE z)
3.3 Collation Coercibility【可压缩性,可压凝性;】 in Expressions
In the great majority of statements, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases, it should be clear that the collation is the collation of column x:
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, with multiple operands, there can be ambiguity. For example, this statement performs a comparison between the column x and the string literal 'Y':
SELECT x FROM T WHERE x = 'Y';
If x and 'Y' have the same collation, there is no ambiguity【ˌæmbɪˈɡjuːəti 模棱两可;不明确;歧义;一语多义;含混不清的语句;】 about the collation to use for the comparison. But if they have different collations, should the comparison use the collation of x, or of 'Y'? Both x and 'Y' have collations, so which collation takes precedence?
A mix of collations may also occur in contexts other than comparison. For example, a multiple-argument concatenation operation such as CONCAT(x,'Y') combines its arguments to produce a single string. What collation should the result have?
To resolve questions like these, MySQL checks whether the collation of one item can be coerced【koʊˈɜːrst 胁迫;强迫;迫使;】 to the collation of the other. MySQL assigns coercibility【可压缩性,可压凝性;】 values as follows:
• An explicit COLLATE clause has a coercibility of 0 (not coercible at all).
• The concatenation【kənˌkætəˈneɪʃn 一系列相关联的事物(或事件);】 of two strings with different collations has a coercibility of 1.
• The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
• A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.
• The collation of a literal has a coercibility of 4.
• The collation of a numeric or temporal value has a coercibility of 5.
• NULL or an expression that is derived from NULL has a coercibility of 6.
MySQL uses coercibility values with the following rules to resolve ambiguities:
• Use the collation with the lowest coercibility value.
• If both sides have the same coercibility, then:
- If both sides are Unicode, or both sides are not Unicode, it is an error.
- If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement does not return an error:
SELECT CONCAT(utf8mb4_column, latin1_column) FROM t1;
It returns a result that has a character set of utf8mb4 and the same collation as utf8mb4_column. Values of latin1_column are automatically converted to utf8mb4 before concatenating.
- For an operation with operands from the same character set but that mix a _bin collation and a _ci or _cs collation, the _bin collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, applied to collations rather than data types.
Although automatic conversion is not in the SQL standard, the standard does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings. More generally, MySQL uses the concept of character set repertoire【ˈrepərtwɑːr (总称某人的)可表演项目;(某人的)全部才能,全部本领;】, which can sometimes be used to determine subset relationships among character sets and enable conversion of operands in operations that would otherwise produce an error.
The following table illustrates some applications of the preceding rules.
Comparison | Collation Used |
column1 = 'A' | Use collation of column1 |
column1 = 'A' COLLATE x | Use collation of 'A' COLLATE x |
column1 COLLATE x = 'A' COLLATE y | Error |
To determine the coercibility of a string expression, use the COERCIBILITY() function.
mysql> SELECT COERCIBILITY(_utf8mb4'A' COLLATE utf8mb4_bin); -> 0 mysql> SELECT COERCIBILITY(VERSION()); -> 3 mysql> SELECT COERCIBILITY('A'); -> 4 mysql> SELECT COERCIBILITY(1000); -> 5 mysql> SELECT COERCIBILITY(NULL); -> 6
For implicit conversion of a numeric or temporal value to a string, such as occurs for the argument 1 in the expression CONCAT(1, 'abc'), the result is a character (nonbinary) string that has a character set and collation determined by the character_set_connection and collation_connection system variables.
4. The binary Collation Compared to _bin Collations
This section describes how the binary collation for binary strings compares to _bin collations for nonbinary strings.
Binary strings (as stored using the BINARY, VARBINARY, and BLOB data types) have a character set and collation named binary. Binary strings are sequences【ˈsiːkwənsɪz 顺序;次序;一系列;一连串;(电影中表现同一主题或场面的)一组镜头;】 of bytes and the numeric values of those bytes determine comparison and sort order.
Nonbinary strings (as stored using the CHAR, VARCHAR, and TEXT data types) have a character set and collation other than binary. A given nonbinary character set can have several collations, each of which defines a particular comparison and sort order for the characters in the set. For most character sets, one of these is the binary collation, indicated by a _bin suffix in the collation name. For example, the binary collations for latin1 and big5 are named latin1_bin and big5_bin, respectively. utf8mb4 is an exception that has two binary collations, utf8mb4_bin and utf8mb4_0900_bin.
4.1 The Unit for Comparison and Sorting
Binary strings are sequences of bytes. For the binary collation, comparison and sorting are based on numeric【nuˈmɛrɪk 数字的】 byte values. Nonbinary strings are sequences of characters, which might be multibyte. Collations for nonbinary strings define an ordering of the character values for comparison and sorting. For _bin collations, this ordering is based on numeric character code values, which is similar to ordering for binary strings except that character code values might be multibyte.
4.2 Character Set Conversion
A nonbinary string has a character set and is automatically converted to another character set in many cases, even when the string has a _bin collation:
• When assigning column values to another column that has a different character set:
UPDATE t1 SET utf8mb4_bin_column=latin1_column; INSERT INTO t1 (latin1_column) SELECT utf8mb4_bin_column FROM t2;
• When assigning column values for INSERT or UPDATE using a string literal:
SET NAMES latin1; INSERT INTO t1 (utf8mb4_bin_column) VALUES ('string-in-latin1');
• When sending results from the server to a client:
SET NAMES latin1; SELECT utf8mb4_bin_column FROM t2;
For binary string columns, no conversion occurs. For cases similar to those preceding, the string value is copied byte-wise.
4.3 Lettercase Conversion
Collations for nonbinary character sets provide information about lettercase of characters, so characters in a nonbinary string can be converted from one lettercase to another, even for _bin collations that ignore lettercase for ordering:
mysql> SET NAMES utf8mb4 COLLATE utf8mb4_bin; mysql> SELECT LOWER('aA'), UPPER('zZ'); +-------------+-------------+ | LOWER('aA') | UPPER('zZ') | +-------------+-------------+ | aa | ZZ | +-------------+-------------+
The concept【ˈkɑːnsept 概念;观念,思想;发明,创造;(体现设计思想的)样品;】of lettercase does not apply to bytes in a binary string. To perform lettercase conversion, the string must first be converted to a nonbinary string using a character set appropriate for the data stored in the string:
mysql> SET NAMES binary; mysql> SELECT LOWER('aA'), LOWER(CONVERT('aA' USING utf8mb4)); +-------------+------------------------------------+ | LOWER('aA') | LOWER(CONVERT('aA' USING utf8mb4)) | +-------------+------------------------------------+ | aA | aa | +-------------+------------------------------------+
4.4 Trailing Space Handling in Comparisons
MySQL collations have a pad【pæd 衬垫;(吸收液体、保洁或保护用的)软垫,护垫,垫状物;(火箭的)发射坪,发射台;<口> 床,住处,房间,公寓;<古>(英方) 路,小径; <电> 衰减器,衰耗片;百洁布,菜瓜布;】 attribute, which has a value of PAD SPACE or NO PAD:
• Most MySQL collations have a pad attribute of PAD SPACE.
• The Unicode collations based on UCA 9.0.0 and higher have a pad attribute of NO PAD.
For nonbinary strings (CHAR, VARCHAR, and TEXT values), the string collation pad attribute determines treatment【ˈtriːtmənt 治疗;处理;(净化或防治)处理,加工;待遇;疗法;对待;诊治;讨论;论述;】 in comparisons of trailing spaces at the end of strings:
• For PAD SPACE collations, trailing spaces are insignificant【ɪnsɪɡˈnɪfɪkənt 微不足道的;无足轻重的;】 in comparisons; strings are compared without regard to trailing spaces.
• NO PAD collations treat trailing spaces as significant【sɪɡˈnɪfɪkənt 重要的, 有重大意义的;显著的, 值得注意的;<统>显著的, 有效的;(词缀等)有意义的;不可忽略的, 值得注意的;相当数量的;别有含义的, 意味深长的;(语言上)区别性的;】 in comparisons, like any other character.
The differing behaviors can be demonstrated using the two utf8mb4 binary collations, one of which is PAD SPACE, the other of which is NO PAD. The example also shows how to use the INFORMATION_SCHEMA COLLATIONS table to determine the pad attribute for collations.
mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE FROM INFORMATION_SCHEMA.COLLATIONS WHERE COLLATION_NAME LIKE 'utf8mb4%bin'; +------------------+---------------+ | COLLATION_NAME | PAD_ATTRIBUTE | +------------------+---------------+ | utf8mb4_bin | PAD SPACE | | utf8mb4_0900_bin | NO PAD | +------------------+---------------+ mysql> SET NAMES utf8mb4 COLLATE utf8mb4_bin; mysql> SELECT 'a ' = 'a'; +------------+ | 'a ' = 'a' | +------------+ | 1 | +------------+ mysql> SET NAMES utf8mb4 COLLATE utf8mb4_0900_bin; mysql> SELECT 'a ' = 'a'; +------------+ | 'a ' = 'a' | +------------+ | 0 | +------------+
说明:“Comparison” in this context does not include the LIKE pattern-matching operator, for which trailing spaces are significant, regardless of collation.
For binary strings (BINARY, VARBINARY, and BLOB values), all bytes are significant in comparisons, including trailing spaces:
mysql> SET NAMES binary; mysql> SELECT 'a ' = 'a'; +------------+ | 'a ' = 'a' | +------------+ | 0 | +------------+
4.5 Trailing Space Handling for Inserts and Retrievals
CHAR(N) columns store nonbinary strings N characters long. For inserts, values shorter than N characters are extended with spaces. For retrievals【检索;】, trailing spaces are removed.
BINARY(N) columns store binary strings N bytes long. For inserts, values shorter than N bytes are extended with 0x00 bytes. For retrievals, nothing is removed; a value of the declared length is always returned.
mysql> CREATE TABLE t1 ( a CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin, b BINARY(10) ); mysql> INSERT INTO t1 VALUES ('x','x'); mysql> INSERT INTO t1 VALUES ('x ','x '); mysql> SELECT a, b, HEX(a), HEX(b) FROM t1; +------+------------------------+--------+----------------------+ | a | b | HEX(a) | HEX(b) | +------+------------------------+--------+----------------------+ | x | 0x78000000000000000000 | 78 | 78000000000000000000 | | x | 0x78200000000000000000 | 78 | 78200000000000000000 | +------+------------------------+--------+----------------------+
5.Using Collation in INFORMATION_SCHEMA Searches
String columns in INFORMATION_SCHEMA tables have a collation of utf8mb3_general_ci, which is case-insensitive. However, for values that correspond【ˌkɔːrəˈspɑːnd 相一致;通信;符合;相当于;类似于;】 to objects that are represented in the file system, such as databases and tables, searches in INFORMATION_SCHEMA string columns can be case-sensitive or case-insensitive, depending on the characteristics of the underlying【ˌʌndərˈlaɪɪŋ 根本的;潜在的;下层的;隐含的;表面下的;】 file system and the lower_case_table_names system variable setting. For example, searches may be case-sensitive if the file system is case-sensitive. This section describes this behavior and how to modify it if necessary.
Suppose that a query searches the SCHEMATA.SCHEMA_NAME column for the test database. On Linux, file systems are case-sensitive, so comparisons of SCHEMATA.SCHEMA_NAME with 'test' match, but comparisons with 'TEST' do not.
mysql> SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'test'; +-------------+ | SCHEMA_NAME | +-------------+ | test | +-------------+ mysql> SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'TEST'; Empty set (0.00 sec)
These results occur with the lower_case_table_names system variable set to 0. A lower_case_table_names setting of 1 or 2 causes the second query to return the same (nonempty) result as the first query.
说明:It is prohibited【prəˈhɪbɪtɪd (尤指以法令)禁止;阻止;使不可能;】 to start the server with a lower_case_table_names setting that is different from the setting used when the server was initialized.
On Windows or macOS, file systems are not case-sensitive, so comparisons match both 'test' and 'TEST'.
The value of lower_case_table_names makes no difference in this context.
The preceding behavior occurs because the utf8mb3_general_ci collation is not used for INFORMATION_SCHEMA queries when searching for values that correspond to objects represented in the file system.
If the result of a string operation on an INFORMATION_SCHEMA column differs from expectations【ˌɛkspɛkˈteɪʃənz 预期;期望;希望;期待;指望;预料;盼望;】, a workaround is to use an explicit COLLATE clause to force a suitable collation. For example, to perform a case-insensitive search, use COLLATE with the INFORMATION_SCHEMA column name:
mysql> SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME COLLATE utf8mb3_general_ci = 'test'; +-------------+ | SCHEMA_NAME | +-------------+ | test | +-------------+ mysql> SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME COLLATE utf8mb3_general_ci = 'TEST'; +-------------+ | SCHEMA_NAME | +-------------+ | test | +-------------+
You can also use the UPPER() or LOWER() function.
Although a case-insensitive comparison【kəmˈpærɪsn 比较;对比;相比;】 can be performed even on platforms with case-sensitive file systems, as just shown, it is not necessarily always the right thing to do. On such platforms, it is possible to have multiple objects with names that differ only in lettercase. For example, tables named city, CITY, and City can all exist simultaneously【ˌsaɪməlˈteɪniəsli 同时;联立;急切地;】. Consider whether a search should match all such names or just one and write queries accordingly【əˈkɔːrdɪŋli 因此;相应地;所以;照着;】. The first of the following comparisons (with utf8mb3_bin) is case-sensitive; the others are not:
WHERE TABLE_NAME COLLATE utf8mb3_bin = 'City' WHERE TABLE_NAME COLLATE utf8mb3_general_ci = 'city' WHERE UPPER(TABLE_NAME) = 'CITY' WHERE LOWER(TABLE_NAME) = 'city'
Searches in INFORMATION_SCHEMA string columns for values that refer to INFORMATION_SCHEMA itself do use the utf8mb3_general_ci collation because INFORMATION_SCHEMA is a “virtual” database not represented in the file system. For example, comparisons with SCHEMATA.SCHEMA_NAME match 'information_schema' or 'INFORMATION_SCHEMA' regardless of platform:
mysql> SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'information_schema'; +--------------------+ | SCHEMA_NAME | +--------------------+ | information_schema | +--------------------+ mysql> SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'INFORMATION_SCHEMA'; +--------------------+ | SCHEMA_NAME | +--------------------+ | information_schema | +--------------------+