utf(6), rune(2): document 21-bit runes

cinap_lenrek 2015-09-24 12:14:08 +02:00
parent bba6d26ca2
commit 8003c8b1e2
2 changed files with 12 additions and 10 deletions

@@ -54,7 +54,7 @@ bytes starting at
and returns the number of bytes copied.
.BR UTFmax ,
defined as
-.B 3
+.B 4
in
.BR <libc.h> ,
is the maximum number of bytes required to represent a rune.

@@ -7,7 +7,7 @@ based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
bits;
.SM UTF-8
represents such
@@ -19,7 +19,7 @@ is shortened to
.PP
In Plan 9, a
.I rune
-is a 16-bit quantity representing a Unicode character.
+is a 32-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
@@ -65,19 +65,21 @@ a rune x is converted to a multibyte
sequence
as follows:
.PP
-01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+001. x in [00000000.00000000.0bbbbbbb] → 0bbbbbbb
.br
-10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+010. x in [00000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
-11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+011. x in [00000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
.br
+100. x in [000bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
+.br
.PP
-Conversion 01 provides a one-byte sequence that spans the
+Conversion 001 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+Conversions 010, 011 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Plan 9 does not support the 5 and 6 byte sequences proposed by X-Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
.PP
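
The four conversions documented in the new text can be sketched in portable C as follows. This is only an illustration of the table, not the library's code: the names `rune_to_utf8` and `UTFMAX` are hypothetical, and returning 0 for a value wider than 21 bits is an assumption made for this example (it also does not special-case surrogates or other invalid code points).

```c
#include <stdint.h>

enum { UTFMAX = 4 };	/* hypothetical name; mirrors the documented maximum of 4 bytes */

/*
 * Encode x into buf, which must hold at least UTFMAX bytes.
 * Returns the number of bytes written, or 0 if x does not fit
 * in 21 bits (an assumption of this sketch).
 * Picking the first range that fits yields the shortest encoding.
 */
int
rune_to_utf8(uint32_t x, unsigned char *buf)
{
	if(x <= 0x7F){		/* conversion 001: 0bbbbbbb */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if(x <= 0x7FF){		/* conversion 010: 110bbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	if(x <= 0xFFFF){	/* conversion 011: 1110bbbb, 10bbbbbb, 10bbbbbb */
		buf[0] = (unsigned char)(0xE0 | (x >> 12));
		buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (x & 0x3F));
		return 3;
	}
	if(x <= 0x1FFFFF){	/* conversion 100: 11110bbb, then three 10bbbbbb bytes */
		buf[0] = (unsigned char)(0xF0 | (x >> 18));
		buf[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (x & 0x3F));
		return 4;
	}
	return 0;		/* wider than 21 bits: not representable here */
}
```

Under this sketch, U+0041 encodes to the single byte 41, U+20AC to the three bytes E2 82 AC, and U+1F600 to the four bytes F0 9F 98 80. The last case is the one that conversion 100 and the change of the maximum from 3 to 4 bytes make possible.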