utf(6), rune(2): document 21-bit runes
parent bba6d26ca2
commit 8003c8b1e2
@@ -54,7 +54,7 @@ bytes starting at
 and returns the number of bytes copied.
 .BR UTFmax ,
 defined as
-.B 3
+.B 4
 in
 .BR <libc.h> ,
 is the maximum number of bytes required to represent a rune.

@@ -7,7 +7,7 @@ based on the Unicode Standard and on the ISO multibyte
 .SM UTF-8
 encoding (Universal Character
 Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
 bits;
 .SM UTF-8
 represents such
@@ -19,7 +19,7 @@ is shortened to
 .PP
 In Plan 9, a
 .I rune
-is a 16-bit quantity representing a Unicode character.
+is a 32-bit quantity representing a Unicode character.
 Internally, programs may store characters as runes.
 However, any external manifestation of textual information,
 in files or at the interface between programs, uses a
@@ -65,19 +65,21 @@ a rune x is converted to a multibyte
 sequence
 as follows:
 .PP
-01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+001. x in [00000000.00000000.0bbbbbbb] → 0bbbbbbb
 .br
-10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+010. x in [00000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
 .br
-11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+011. x in [00000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [000bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
 .br
 .PP
-Conversion 01 provides a one-byte sequence that spans the
+Conversion 001 provides a one-byte sequence that spans the
 .SM ASCII
 character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+Conversions 010, 011 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Plan 9 does not support the 5 and 6 byte sequences proposed by X-Open.
 When there are multiple ways to encode a value, for example rune 0,
 the shortest encoding is used.
 .PP
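
For readers checking the updated conversion table, the four conversions (001, 010, 011, 100) can be sketched in portable C roughly as below. This is an illustration only, not the Plan 9 runetochar source: the names encode_rune and UTFMAX_SKETCH are hypothetical, and returning 0 for values wider than 21 bits is an assumption of this sketch rather than documented library behavior. The sketch follows the table literally, so it accepts any 21-bit value; it does not apply the Unicode limit of 0x10FFFF or reject surrogates.

```c
#include <stdint.h>

/* Hypothetical constant mirroring the new UTFmax value (4) from the diff. */
enum { UTFMAX_SKETCH = 4 };

/*
 * Encode x into buf, which must hold at least UTFMAX_SKETCH bytes.
 * Returns the number of bytes written, or 0 if x does not fit in
 * 21 bits (an assumption of this sketch). Testing the ranges in
 * ascending order always picks the shortest encoding.
 */
static int
encode_rune(uint32_t x, unsigned char *buf)
{
	if(x <= 0x7F){			/* conversion 001: 7 bits, one byte */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if(x <= 0x7FF){			/* conversion 010: 11 bits, two bytes */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	if(x <= 0xFFFF){		/* conversion 011: 16 bits, three bytes */
		buf[0] = (unsigned char)(0xE0 | (x >> 12));
		buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (x & 0x3F));
		return 3;
	}
	if(x <= 0x1FFFFF){		/* conversion 100: 21 bits, four bytes */
		buf[0] = (unsigned char)(0xF0 | (x >> 18));
		buf[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (x & 0x3F));
		return 4;
	}
	return 0;			/* wider than 21 bits: not representable here */
}
```

A caller would declare a buffer of UTFMAX_SKETCH bytes, which is why raising UTFmax from 3 to 4 matters: code that sized its buffers for three-byte sequences would be too small for conversion 100.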