File Formats Wiki
Register
Advertisement
Smallwikipedialogo.png Wikipedia has an article related to:

UTF-8 (Unicode Tranformation Format-8) is a Unicode character encoding. It is a variable-length encoding; characters may be assigned to one to four bytes, being still backwards compatible with ASCII. UTF-8 is a prefix code.

Description[]

The bits of a Unicode character are distributed into the lower bit positions inside the UTF-8 bytes, with the lowest bit going into the last bit of the last byte:

Unicode Byte1 Byte2 Byte3 Byte4 example
U+000000-U+00007F 0xxxxxxx '$' U+0024001001000x24
U+000080-U+0007FF 110xxxxx 10xxxxxx '¢' U+00A211000010,101000100xC2,0xA2
U+000800-U+00FFFF 1110xxxx 10xxxxxx 10xxxxxx '€' U+20AC11100010,10000010,101011000xE2,0x82,0xAC
U+010000-U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  U+10ABCD11110100,10001010,10101111,100011010xf4,0x8a,0xaf,0x8d

So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the Universal Character Set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003.

With these restrictions, bytes in a UTF-8 sequence have the following meanings. The ones marked in red can never appear in a legal UTF-8 sequence. The ones in green are represented in a single byte. The ones in white must only appear as the first byte in a multi-byte sequence, and the ones in orange can only appear as the second or later byte in a multi-byte sequence:

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the Universal Character Set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003.

With these restrictions, bytes in a UTF-8 sequence have the following meanings. The ones marked in red can never appear in a legal UTF-8 sequence. The ones in green are represented in a single byte. The ones in white must only appear as the first byte in a multi-byte sequence, and the ones in orange can only appear as the second or later byte in a multi-byte sequence:

binary hex dec notes
00000000-01111111 00-7F 0-127 US-ASCII (single byte)
10000000-10111111 80-BF 128-191 Second, third, or fourth byte of a multi-byte sequence
11000000-11000001 C0-C1 192-193 Overlong encoding: start of a 2-byte sequence, but code point <= 127
11000010-11011111 C2-DF 194-223 Start of 2-byte sequence
11100000-11101111 E0-EF 224-239 Start of 3-byte sequence
11110000-11110100 F0-F4 240-244 Start of 4-byte sequence
11110101-11110111 F5-F7 245-247 Restricted by RFC 3629: start of 4-byte sequence for codepoint above 10FFFF
11111000-11111011 F8-FB 248-251 Restricted by RFC 3629: start of 5-byte sequence
11111100-11111101 FC-FD 252-253 Restricted by RFC 3629: start of 6-byte sequence
11111110-11111111 FE-FF 254-255 Invalid: not defined by original UTF-8 specification

Unicode also disallows the 2048 code points U+D800..U+DFFF (the UTF-16/UCS-2 surrogate pairs) and also the 32 code points U+FDD0..U+FDEF (noncharacters) and all 34 code points of the form U+xxFFFE and U+xxFFFF (more noncharacters). See Table 3-7 in the Unicode 5.0 standard. UTF-8 reliably transforms these values, but they are not valid scalar values in Unicode, and thus the UTF-8 encodings of them may be considered invalid sequences.

This page uses CC-BY-SA content from Wikipedia (authors). Smallwikipedialogo.png
Advertisement