Comparison of Unicode encodings
This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments and environments, like Simple Mail Transfer Protocol, that forbid use of byte values with the high bit set. Originally such prohibitions allowed for links that used only seven data bits, but they remain in the standards, so software must generate messages that comply with the restrictions. The Standard Compression Scheme for Unicode and the Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.
UTF-16 and UTF-32 are incompatible with ASCII files, and thus require Unicode-aware programs to display, print and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many nul bytes, the strings cannot be manipulated by normal C string handling for even simple operations such as copy.
Therefore, even most UTF-16 systems such as Windows and Java represent text objects such as program code with 8-bit encodings (ASCII, ISO-8859-1, or UTF-8), not UTF-16. Indeed, it is very rare to find a UTF-16-encoded text file on any system unless it is part of some more complex structure. This introduces a serious complication in programming that is often overlooked by system designers: many 8-bit encodings (in particular UTF-8) can contain invalid sequences that cannot be translated to UTF-16, so a file can contain a superset of the valid data. For instance, a UTF-8 URL can name a location that cannot correspond to a file on the system, two different files may compare identical, or reading and writing a file can change it.
One of the few counterexamples of a UTF-16 file is the "strings" file used by Mac OS X (10.3 and later) applications for lookup of internationalized versions of messages. These default to UTF-16: "files encoded using UTF-8 are not guaranteed to work. When in doubt, encode the file using UTF-16". This is because the default string class in Mac OS X (NSString) stores characters in UTF-16.
UTF-32/UCS-4 requires four bytes to encode any character. Since characters outside the basic multilingual plane (BMP) are typically rare, a document encoded in UTF-32 will often be nearly twice as large as its UTF-16/UCS-2–encoded equivalent because UTF-16 uses two bytes for the characters inside the BMP, or four bytes otherwise.
UTF-8 uses between one and four bytes to encode a code point. It requires one byte for ASCII characters, half the space of UTF-16 for texts consisting only of ASCII. For other Latin characters and many non-Latin scripts it requires two bytes, the same as UTF-16. Characters in the range U+0800 to U+FFFF require three bytes in UTF-8 but only two in UTF-16; in Western text only a few frequently used characters fall in this range, such as the € sign (U+20AC). Characters above U+FFFF, outside the BMP, need four bytes in both UTF-8 and UTF-16.
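These per-code-point sizes can be checked directly. A minimal Python sketch (any Unicode-aware language would do) encoding a few representative characters:

```python
# Bytes per code point in each encoding; the "-le" codec variants
# are used so no byte-order mark is prepended to the output.
def widths(ch):
    return (len(ch.encode("utf-8")),
            len(ch.encode("utf-16-le")),
            len(ch.encode("utf-32-le")))

assert widths("A") == (1, 2, 4)           # ASCII, U+0041
assert widths("é") == (2, 2, 4)           # Latin-1 range, U+00E9
assert widths("€") == (3, 2, 4)           # U+20AC: three bytes in UTF-8
assert widths("\U00010348") == (4, 4, 4)  # outside the BMP
```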
The conservation of bytes when encoding a file in a Unicode transformation format (UTF) depends on the code points used, and therefore on the scripts and blocks from which those code points are drawn. For example, using mostly characters from the BMP makes UTF-16 more space-conserving than UTF-32. In the same way, using characters predominantly from the "UTF-8 scripts" makes UTF-8 more space-efficient than UTF-16. The UTF-8 scripts are those for which UTF-8 requires fewer than three bytes per character (only one byte for the ASCII-equivalent Basic Latin block); they include Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, N'Ko, and the IPA and other Latin-based phonetic alphabets.
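The script dependence shows up immediately when measuring whole strings. A small illustration (the sample words are arbitrary): Cyrillic text is at parity with UTF-16, while CJK text is larger in UTF-8.

```python
ru = "мир"   # three Cyrillic code points, two UTF-8 bytes each
zh = "中文"  # two CJK code points, three UTF-8 bytes each
assert len(ru.encode("utf-8")) == len(ru.encode("utf-16-le")) == 6
assert len(zh.encode("utf-8")) == 6
assert len(zh.encode("utf-16-le")) == 4
```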
All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes.
For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but UTF-7 and GB 18030 do not.
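In UTF-8 this self-synchronization works because continuation bytes are always of the form 10xxxxxx, distinct from lead bytes. A sketch of a boundary-finding helper (the function name is illustrative, not from any particular API):

```python
def char_start(buf: bytes, i: int) -> int:
    """Back up from byte offset i to the start of the enclosing
    UTF-8 character by skipping continuation bytes (10xxxxxx)."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a€b".encode("utf-8")     # b'a\xe2\x82\xacb'
assert char_start(data, 2) == 1  # offset 2 is mid-'€'; lead byte is at 1
assert char_start(data, 4) == 4  # 'b' is already a character start
```

No such local scan is possible in GB 18030, where a byte's role depends on arbitrarily distant context.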
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to combining characters. If you are working with a particular API heavily and that API has standardised on a particular Unicode encoding, it is generally a good idea to use the encoding that the API does to avoid the need to convert before every call to the API. Similarly if you are writing server-side software, it may simplify matters to use the same format for processing that you are communicating in.
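The combining-character point can be seen directly: one displayed character may be two code points, and normalization only sometimes collapses them.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one displayed character
assert len(decomposed) == 2
# NFC normalization merges this pair into the single code point U+00E9...
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9" and len(composed) == 1
# ...but many combining sequences have no precomposed form, so code-point
# counts still do not equal displayed characters in general.
```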
UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. However, using UTF-16 makes characters outside the Basic Multilingual Plane a special case which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
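The special case is the surrogate-pair arithmetic itself. A sketch of the mapping from a supplementary code point to its two UTF-16 code units (the helper name is illustrative):

```python
def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into the
    UTF-16 high and low surrogate code units."""
    v = cp - 0x10000                       # 20-bit offset
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

# U+10348 (GOTHIC LETTER HWAIR) becomes two 16-bit code units:
assert surrogate_pair(0x10348) == (0xD800, 0xDF48)
assert len("\U00010348".encode("utf-16-le")) == 4  # 2 units x 2 bytes
```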
A byte array declared to contain UTF-8 can contain invalid sequences of bytes, which cannot be converted losslessly to UTF-16 (or UTF-32). UTF-8 may also encode unpaired surrogate halves as individual characters, which cannot be losslessly translated to valid UTF-16. Invalid UTF-16, however, can be translated losslessly to such an extended UTF-8. If information must not be lost, this makes processing in UTF-8 a requirement whenever any input may be UTF-8. For example, an API that controls multiple file systems using both UTF-8 and UTF-16 can be written in UTF-8 but not UTF-16; without this, operations such as "rename this file with invalid UTF-8 in its name to be correct" are impossible.
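Python's surrogateescape error handler is one concrete illustration of this asymmetry: the only way to carry an arbitrary invalid UTF-8 byte through a Unicode string losslessly is to map it to an (invalid) unpaired surrogate.

```python
raw = b"abc\xff"  # 0xFF can never appear in valid UTF-8
try:
    raw.decode("utf-8")
    assert False, "strict decoding should have failed"
except UnicodeDecodeError:
    pass

# surrogateescape smuggles the bad byte through as the unpaired
# surrogate U+DCFF, so the original bytes can be recovered exactly:
s = raw.decode("utf-8", "surrogateescape")
assert s == "abc\udcff"
assert s.encode("utf-8", "surrogateescape") == raw
# The intermediate string, however, is not valid UTF-16 text.
```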
For communication and storage
UTF-16 and UTF-32 are not byte oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file. This may be achieved by standardising on a single byte order, by specifying the endianness as part of external metadata (for example the MIME charset registry has distinct UTF-16BE and UTF-16LE registrations) or by using a byte-order mark at the start of the text. UTF-8 is byte-oriented and does not have this problem.
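The two byte orders and the BOM mechanism can be seen in a few lines of Python:

```python
import codecs

# The same code point serializes to different bytes depending on byte order:
assert "A".encode("utf-16-le") == b"A\x00"
assert "A".encode("utf-16-be") == b"\x00A"

# A byte-order mark lets a reader detect the order; Python's generic
# "utf-16" codec consumes the BOM on decode:
data = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
assert data.decode("utf-16") == "A"

# UTF-8 has a single serialization regardless of machine byte order:
assert "A".encode("utf-8") == b"A"
```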
If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronise at the start of the next good character, GB 18030 is unable to recover after a corrupt or missing byte until the next ASCII non-number. UTF-16 and UTF-32 will handle corrupt bytes (again recovering on the next good character) but a lost byte will garble all following text.
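A rough sketch of the difference, using Python's replacement-character error handling (the exact replacement behavior shown is CPython's):

```python
# Corrupt one byte of UTF-8: the damage stays local to that character.
good = bytearray("héllo".encode("utf-8"))  # b'h\xc3\xa9llo'
good[1] = 0xFF                             # clobber the lead byte of 'é'
# The bad byte and the orphaned continuation byte each become U+FFFD,
# and decoding resynchronizes at the next good character:
assert bytes(good).decode("utf-8", "replace") == "h\ufffd\ufffdllo"

# Lose one byte of UTF-16: every following code unit is mis-framed.
dropped = "hello".encode("utf-16-le")[1:]
assert "ello" not in dropped.decode("utf-16-le", "replace")
```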
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.
N.B. The tables below list numbers of bytes per code point, not per user visible "character" (or "grapheme cluster"). It can take multiple code points to describe a single grapheme cluster, so even in UTF-32, care must be taken when splitting or concatenating strings.
| Code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB 18030 |
|---|---|---|---|---|---|
| 000000 – 00007F | 1 | 2 | 4 | 1 | 1 |
| 000080 – 00009F | 2 | 2 | 4 | 1 | 2 or 4 (see note) |
| 0000A0 – 0003FF | 2 | 2 | 4 | 2 | 2 or 4 (see note) |
| 000400 – 0007FF | 2 | 2 | 4 | 3 | 2 or 4 (see note) |
| 000800 – 003FFF | 3 | 2 | 4 | 3 | 2 or 4 (see note) |
| 004000 – 00FFFF | 3 | 2 | 4 | 4 | 2 or 4 (see note) |
| 010000 – 03FFFF | 4 | 4 | 4 | 4 | 4 |
| 040000 – 10FFFF | 4 | 4 | 4 | 5 | 4 |

Note: GB 18030 uses 2 bytes for characters inherited from GB 2312/GBK (e.g. most Chinese characters) and 4 bytes for everything else.
This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.
| Code range (hexadecimal) | UTF-7 | UTF-8 q.-p. | UTF-8 base64 | UTF-16 q.-p. | UTF-16 base64 | UTF-32 q.-p. | UTF-32 base64 | GB 18030 q.-p. | GB 18030 base64 |
|---|---|---|---|---|---|---|---|---|---|
| 000000 – 000032 | same as 000080 – 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000033 – 00003C | 1 or 2 (note 1) | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00003D (equals sign) | 1 or 2 (note 1) | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 00003E – 00007E | 1 or 2 (note 1) | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00007F | see note 2 | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000080 – 0007FF | see note 2 | 6 | 2⅔ | 2–6 (note 3) | 2⅔ | 8–12 (note 4) | 5⅓ | 4–6 or 8 (note 5) | 2⅔ or 5⅓ (note 5) |
| 000800 – 00FFFF | see note 2 | 9 | 4 | 2–6 (note 3) | 2⅔ | 8–12 (note 4) | 5⅓ | 4–6 or 8 (note 5) | 2⅔ or 5⅓ (note 5) |
| 010000 – 10FFFF | see note 6 | 12 | 5⅓ | 8–12 (note 7) | 5⅓ | 8–12 (note 4) | 5⅓ | 8 | 5⅓ |

(q.-p. = quoted-printable.)

Notes:
1. 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting); 2 for +; otherwise same as 000080 – 00FFFF.
2. 5 for an isolated case inside a run of single-byte characters. For runs, 2⅔ per character, plus padding to make a whole number of bytes, plus two to start and finish the run.
3. Depending on whether the byte values need to be escaped.
4. Depending on whether the final two byte values need to be escaped.
5. The smaller figure for characters inherited from GB 2312/GBK (e.g. most Chinese characters); the larger figure for everything else.
6. 8 for an isolated case; for a run, 5⅓ per character, plus padding to a whole number of bytes, plus 2 to start and finish the run.
7. Depending on whether the low bytes of the surrogates need to be escaped.
BOCU-1 and SCSU are two ways to compress Unicode data. Their encoding exploits the fact that most runs of text use characters from a single script, for example Latin, Cyrillic or Greek; this allows many runs of text to compress to about one byte per code point. However, these stateful encodings make it more difficult to randomly access text at an arbitrary position in a string.
These two compression schemes are not as efficient as general-purpose schemes such as zip or bzip2, which can compress longer runs of bytes to just a few bytes. SCSU and BOCU-1 cannot compress text below roughly 25% of its size as encoded in UTF-8, UTF-16 or UTF-32, whereas general-purpose compression schemes can easily reach 10% of the original text size. The general-purpose schemes, however, require more complicated algorithms and longer chunks of text to achieve a good compression ratio.
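A rough sketch of the general-purpose case, using zlib (the DEFLATE algorithm used by zip) on a deliberately repetitive ASCII sample; the text and the 10% figure reached here are illustrative, not a benchmark:

```python
import zlib

text = "The quick brown fox jumps over the lazy dog. " * 50
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
assert len(utf16) == 2 * len(utf8)  # ASCII-only text doubles in UTF-16

packed = zlib.compress(utf8, 9)
# DEFLATE's back-references exploit the long repeats; on this sample it
# compresses to well under 10% of the UTF-8 size:
assert len(packed) < len(utf8) // 10
```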
Unicode Technical Note #14 contains a more detailed comparison of compression schemes.
Historical: UTF-5 and UTF-6
Proposals have been made for a UTF-5 and a UTF-6 for the internationalization of domain names (IDN). The UTF-5 proposal used a base-32 encoding, where Punycode is (among other things, and not exactly) a base-36 encoding. The name UTF-5 comes from its 5-bit code unit, since 2^5 = 32. The UTF-6 proposal added a run-length encoding to UTF-5; here 6 simply stands for UTF-5 plus 1. The IETF IDN WG later adopted the more efficient Punycode for this purpose.
Not being seriously pursued: UTF-1
UTF-1 never gained serious acceptance. UTF-8 is much more frequently used.