utf(6), rune(2): document 21-bit runes

cinap_lenrek 2015-09-24 12:14:08 +02:00
parent bba6d26ca2
commit 8003c8b1e2
2 changed files with 12 additions and 10 deletions


@@ -54,7 +54,7 @@ bytes starting at
 and returns the number of bytes copied.
 .BR UTFmax ,
 defined as
-.B 3
+.B 4
 in
 .BR <libc.h> ,
 is the maximum number of bytes required to represent a rune.


@@ -7,7 +7,7 @@ based on the Unicode Standard and on the ISO multibyte
 .SM UTF-8
 encoding (Universal Character
 Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
 bits;
 .SM UTF-8
 represents such
@@ -19,7 +19,7 @@ is shortened to
 .PP
 In Plan 9, a
 .I rune
-is a 16-bit quantity representing a Unicode character.
+is a 32-bit quantity representing a Unicode character.
 Internally, programs may store characters as runes.
 However, any external manifestation of textual information,
 in files or at the interface between programs, uses a
@@ -65,19 +65,21 @@ a rune x is converted to a multibyte
 sequence
 as follows:
 .PP
-01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+001. x in [00000000.00000000.0bbbbbbb] → 0bbbbbbb
 .br
-10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+010. x in [00000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
 .br
-11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+011. x in [00000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [000bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
 .br
 .PP
-Conversion 01 provides a one-byte sequence that spans the
+Conversion 001 provides a one-byte sequence that spans the
 .SM ASCII
 character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+Conversions 010, 011 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Plan 9 does not support the 5 and 6 byte sequences proposed by X-Open.
 When there are multiple ways to encode a value, for example rune 0,
 the shortest encoding is used.
 .PP
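
Note on the new conversion table: the four conversions (001 through 100) can be sketched in C roughly as below. This is an illustration written for this note, not the libc implementation. The names rune_to_utf8 and UTFMAX_SKETCH are hypothetical, and the sketch assumes a rune is held in a uint32_t. It follows the bit patterns literally, so it does not reject surrogates or values that fit in 21 bits but lie above the Unicode range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the documented UTFmax of 4. */
enum { UTFMAX_SKETCH = 4 };

/*
 * Encode rune x into buf, which must hold at least UTFMAX_SKETCH bytes.
 * The ranges are tested from smallest to largest, so the shortest
 * encoding is always chosen. Returns the number of bytes written,
 * or 0 if x does not fit in 21 bits.
 */
size_t
rune_to_utf8(uint32_t x, unsigned char *buf)
{
	assert(buf != NULL);

	if(x <= 0x7F){			/* 001: 7 bits, one byte */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if(x <= 0x7FF){			/* 010: 11 bits, two bytes */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	if(x <= 0xFFFF){		/* 011: 16 bits, three bytes */
		buf[0] = (unsigned char)(0xE0 | (x >> 12));
		buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (x & 0x3F));
		return 3;
	}
	if(x <= 0x1FFFFF){		/* 100: 21 bits, four bytes */
		buf[0] = (unsigned char)(0xF0 | (x >> 18));
		buf[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (x & 0x3F));
		return 4;
	}
	return 0;			/* more than 21 bits: no conversion applies */
}
```

For example, rune 0x20AC falls under conversion 011 and yields the three bytes E2 82 AC, while a rune above 0xFFFF takes the four-byte conversion 100 that this commit documents.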