From 8003c8b1e2d5d6e2a22ca7e552b53e631db86df4 Mon Sep 17 00:00:00 2001
From: cinap_lenrek
Date: Thu, 24 Sep 2015 12:14:08 +0200
Subject: [PATCH] utf(6), rune(2): document 21-bit runes

---
 sys/man/2/rune |  2 +-
 sys/man/6/utf  | 20 +++++++++++---------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/sys/man/2/rune b/sys/man/2/rune
index ca290115d..124692797 100644
--- a/sys/man/2/rune
+++ b/sys/man/2/rune
@@ -54,7 +54,7 @@ bytes starting at
 and returns the number of bytes copied.
 .BR UTFmax ,
 defined as
-.B 3
+.B 4
 in
 .BR ,
 is the maximum number of bytes required to represent a rune.
diff --git a/sys/man/6/utf b/sys/man/6/utf
index 92f7c9534..7d15b8185 100644
--- a/sys/man/6/utf
+++ b/sys/man/6/utf
@@ -7,7 +7,7 @@ based on the Unicode Standard and on the ISO multibyte
 .SM UTF-8
 encoding (Universal Character Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
 bits;
 .SM UTF-8
 represents such
@@ -19,7 +19,7 @@ is shortened to
 .PP
 In Plan 9, a
 .I rune
-is a 16-bit quantity representing a Unicode character.
+is a 32-bit quantity representing a Unicode character.
 Internally, programs may store characters as runes.
 However, any external manifestation of textual information,
 in files or at the interface between programs, uses a
@@ -65,19 +65,21 @@ a rune x
 is converted to a multibyte sequence
 as follows:
 .PP
-01. x in [00000000.0bbbbbbb] → 0bbbbbbb
+001. x in [00000000.00000000.0bbbbbbb] → 0bbbbbbb
 .br
-10. x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+010. x in [00000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
 .br
-11. x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+011. x in [00000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [000bbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
 .br
 .PP
-Conversion 01 provides a one-byte sequence that spans the
+Conversion 001 provides a one-byte sequence that spans the
 .SM ASCII
 character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
+Conversions 010, 011 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Plan 9 does not support the 5 and 6 byte sequences proposed by X-Open.
 When there are multiple ways to encode a value, for example rune 0,
 the shortest encoding is used.
 .PP
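
The four conversions documented by this patch can be sketched in portable C as below. This is a minimal illustration, not the libc implementation: the names encode_rune and UTFMAX_SKETCH are hypothetical, and the sketch assumes only the 21-bit limit shown in conversion 100, with no further validity checks on the value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; mirrors the documented UTFmax value of 4. */
enum { UTFMAX_SKETCH = 4 };

/*
 * Hypothetical helper (not runetochar): encode x into buf, which must
 * hold at least UTFMAX_SKETCH bytes. Returns the number of bytes
 * written, or 0 if x does not fit in 21 bits.
 * Ranges are tested in ascending order, so the shortest encoding
 * is always the one chosen.
 */
size_t
encode_rune(uint32_t x, unsigned char *buf)
{
	if(x <= 0x7F){			/* 001: 7 bits, one ASCII-compatible byte */
		buf[0] = (unsigned char)x;
		return 1;
	}
	if(x <= 0x7FF){			/* 010: 11 bits, 110bbbbb 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (x >> 6));
		buf[1] = (unsigned char)(0x80 | (x & 0x3F));
		return 2;
	}
	if(x <= 0xFFFF){		/* 011: 16 bits, 1110bbbb + two 10bbbbbb */
		buf[0] = (unsigned char)(0xE0 | (x >> 12));
		buf[1] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (x & 0x3F));
		return 3;
	}
	if(x <= 0x1FFFFF){		/* 100: 21 bits, 11110bbb + three 10bbbbbb */
		buf[0] = (unsigned char)(0xF0 | (x >> 18));
		buf[1] = (unsigned char)(0x80 | ((x >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((x >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (x & 0x3F));
		return 4;
	}
	return 0;			/* more than 21 bits: not representable here */
}
```

With this sketch, U+20AC falls under conversion 011 and yields the bytes E2 82 AC, while U+1F600 needs conversion 100 and yields F0 9F 98 80, which is why the maximum byte count has to grow from 3 to 4.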