UTF-8 (Unicode Transformation Format-8) is a Unicode character encoding. It is a variable-length encoding: each character is encoded in one to four bytes, and the encoding remains backwards compatible with ASCII. UTF-8 is a prefix code.
Description
The bits of a Unicode character are distributed into the lower bit positions inside the UTF-8 bytes, with the lowest bit going into the last bit of the last byte:
Unicode | Byte1 | Byte2 | Byte3 | Byte4 | example |
---|---|---|---|---|---|
U+000000-U+00007F | 0xxxxxxx | | | | '$' U+0024 → 00100100 → 0x24 |
U+000080-U+0007FF | 110xxxxx | 10xxxxxx | | | '¢' U+00A2 → 11000010,10100010 → 0xC2,0xA2 |
U+000800-U+00FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | '€' U+20AC → 11100010,10000010,10101100 → 0xE2,0x82,0xAC |
U+010000-U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | U+10ABCD → 11110100,10001010,10101111,10001101 → 0xF4,0x8A,0xAF,0x8D |
So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.
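The bit distribution described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is hypothetical, and the sketch only checks the U+0000 to U+10FFFF range (it does not reject surrogate code points).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the 1- to 4-byte layout (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if cp <= 0x7F:
        # 0xxxxxxx: 7 payload bits, identical to US-ASCII
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 payload bits
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For the table's examples, `utf8_encode(0x20AC)` yields the bytes 0xE2, 0x82, 0xAC and `utf8_encode(0x10ABCD)` yields 0xF4, 0x8A, 0xAF, 0x8D. In real code, Python's built-in `str.encode("utf-8")` should be preferred.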
By continuing the pattern given above it is possible to encode much larger numbers. The original specification allowed sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). However, in November 2003 RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF.
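The 31-bit figure follows from counting payload bits: in an n-byte sequence (n ≥ 2) the lead byte carries 7 − n payload bits and each of the n − 1 continuation bytes carries 6. A small sketch of that arithmetic, using a hypothetical helper named `payload_bits`:

```python
def payload_bits(n: int) -> int:
    """Payload bits in an n-byte sequence of the original 1- to 6-byte scheme."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if n == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits, a zero bit, then 7 - n payload bits.
    # Each continuation byte (10xxxxxx) adds 6 payload bits.
    return (7 - n) + 6 * (n - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        bits = payload_bits(n)
        print(n, bits, hex((1 << bits) - 1))
```

The lengths 1 through 6 give 7, 11, 16, 21, 26 and 31 bits. A four-byte sequence could therefore hold values up to 0x1FFFFF, which is why RFC 3629 needs an explicit cap at U+10FFFF.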
With these restrictions, bytes in a UTF-8 sequence have the following meanings. Some byte values can never appear in a legal UTF-8 sequence, some represent a complete single-byte character, some may only appear as the first byte of a multi-byte sequence, and the rest may only appear as the second or later byte of a multi-byte sequence:
binary | hex | dec | notes |
---|---|---|---|
00000000-01111111 | 00-7F | 0-127 | US-ASCII (single byte) |
10000000-10111111 | 80-BF | 128-191 | Second, third, or fourth byte of a multi-byte sequence |
11000000-11000001 | C0-C1 | 192-193 | Overlong encoding: start of a 2-byte sequence, but code point <= 127 |
11000010-11011111 | C2-DF | 194-223 | Start of 2-byte sequence |
11100000-11101111 | E0-EF | 224-239 | Start of 3-byte sequence |
11110000-11110100 | F0-F4 | 240-244 | Start of 4-byte sequence |
11110101-11110111 | F5-F7 | 245-247 | Restricted by RFC 3629: start of 4-byte sequence for code point above U+10FFFF |
11111000-11111011 | F8-FB | 248-251 | Restricted by RFC 3629: start of 5-byte sequence |
11111100-11111101 | FC-FD | 252-253 | Restricted by RFC 3629: start of 6-byte sequence |
11111110-11111111 | FE-FF | 254-255 | Invalid: not defined by original UTF-8 specification |
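The table above translates directly into a lookup by range. The following sketch assumes a hypothetical function `classify_byte` and made-up category labels; it looks at a single byte in isolation and does not validate whole sequences.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value according to the ranges in the table above."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"         # 00-7F: single-byte character
    if b <= 0xBF:
        return "continuation"  # 80-BF: second or later byte
    if b <= 0xC1:
        return "invalid"       # C0-C1: would only start overlong encodings
    if b <= 0xDF:
        return "lead2"         # C2-DF: start of 2-byte sequence
    if b <= 0xEF:
        return "lead3"         # E0-EF: start of 3-byte sequence
    if b <= 0xF4:
        return "lead4"         # F0-F4: start of 4-byte sequence
    return "invalid"           # F5-FF: restricted by RFC 3629 or never defined
```

For example, the three bytes of '€' (0xE2, 0x82, 0xAC) classify as `lead3`, `continuation`, `continuation`.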
Unicode also excludes the 2048 surrogate code points U+D800..U+DFFF (used in pairs by UTF-16) from the set of scalar values, so their UTF-8 encodings are ill-formed; see Table 3-7 in the Unicode 5.0 standard. In addition, the 32 code points U+FDD0..U+FDEF and all 34 code points of the form U+xxFFFE and U+xxFFFF are noncharacters. The UTF-8 bit layout can mechanically represent all of these values, but surrogates are not valid scalar values, and noncharacters, although they are scalar values, are reserved for internal use, so an application may treat their encodings as invalid as well.
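The code point ranges named above are easy to test for. The helper names `is_surrogate` and `is_noncharacter` below are hypothetical, and the sketch assumes its argument is already within U+0000 to U+10FFFF.

```python
def is_surrogate(cp: int) -> bool:
    """True for the 2048 surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The low 16 bits are FFFE or FFFF exactly when masking with FFFE gives FFFE.
    return (cp & 0xFFFE) == 0xFFFE
```

Counting over the whole code space reproduces the figures in the text: 2048 surrogates and 32 + 34 = 66 noncharacters.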