.HTML "Hello World or Καλημέρα κόσμε or こんにちは 世界
|
||
.TL
|
||
Hello World
|
||
.br
|
||
or
|
||
.br
|
||
.ft R
|
||
Καλημέρα κόσμε
|
||
.ft
|
||
.br
|
||
or
|
||
.br
|
||
\f(Jpこんにちは 世界\fP
|
||
.AU
|
||
Rob Pike
|
||
Ken Thompson
|
||
.sp
|
||
rob,ken@plan9.bell-labs.com
|
||
.AB
|
||
.FS
|
||
Originally appeared, in a slightly different form, in
|
||
.I
|
||
Proc. of the Winter 1993 USENIX Conf.,
|
||
.R
|
||
pp. 43-50,
|
||
San Diego
|
||
.FE
|
||
Plan 9 from Bell Labs has recently been converted from ASCII
|
||
to an ASCII-compatible variant of the Unicode Standard, a 16-bit character set.
|
||
In this paper we explain the reasons for the change,
|
||
describe the character set and representation we chose,
|
||
and present the programming models and software changes
|
||
that support the new text format.
|
||
Although we stopped short of full internationalization\(emfor
|
||
example, system error messages are in Unixese, not Japanese\(emwe
|
||
believe Plan 9 is the first system to treat the representation
|
||
of all major languages on a uniform, equal footing throughout all its
|
||
software.
|
||
.AE
|
||
.SH
|
||
Introduction
|
||
.PP
|
||
The world is multilingual but most computer systems
|
||
are based on English and ASCII.
|
||
The first release of Plan 9 [Pike90], a new distributed operating
|
||
system from Bell Laboratories, seemed a good occasion
|
||
to correct this chauvinism.
|
||
It is easier to make such deep changes when building new systems than
|
||
by refitting old ones.
|
||
.PP
|
||
The ANSI C standard [ANSIC] contains some guidance on the matter of
|
||
`wide' and `multi-byte' characters but falls far short of
|
||
solving the myriad associated problems.
|
||
We could find no literature on how to convert a
|
||
.I system
|
||
to larger character sets, although some individual
|
||
.I programs
|
||
had been converted.
|
||
This paper reports what we discovered as we
|
||
explored the problem of representing multilingual
|
||
text at all levels of an operating system,
|
||
from the file system and kernel through
|
||
the applications and up to the window system
|
||
and display.
|
||
.PP
|
||
Plan 9 has not been `internationalized':
|
||
its manuals are in English,
|
||
its error messages are in English,
|
||
and it can display text that goes from left to right only.
|
||
But before we can address these other problems,
|
||
we need to handle, uniformly and comfortably,
|
||
the textual representation of all the major written languages.
|
||
That subproblem is richer than we had anticipated.
|
||
.SH
|
||
Standards
|
||
.PP
|
||
Our first step was to select a standard.
|
||
At the time (January 1992),
|
||
there were only two viable options:
|
||
ISO 10646 [ISO10646] and Unicode [Unicode].
|
||
The documents describing both proposals were still in the draft stage.
|
||
.PP
|
||
The draft of ISO 10646 was not
|
||
very attractive to us.
|
||
It defined a sparse set of 32-bit characters,
|
||
which would be
|
||
hard to implement
|
||
and have punitive storage requirements.
|
||
Also, the draft attempted to
|
||
mollify national interests by allocating
|
||
16-bit subspaces to national committees
|
||
to partition individually.
|
||
The suggested mode of use was to
|
||
``flip'' between separate national
|
||
standards to implement the international standard.
|
||
This did not strike us as a sound basis for a character set.
|
||
In addition, transmitting 32-bit values in a byte stream,
|
||
such as in pipes, would be expensive and hard to implement.
|
||
Since the standard does not define a byte order for such
|
||
transmission, the byte stream would also have to carry
|
||
state to enable the values to be recovered.
|
||
.PP
|
||
The Unicode Standard is a proposal by a consortium of mostly American
|
||
computer companies formed
|
||
to protest the technical
|
||
failings of ISO 10646.
|
||
It defines a uniform 16-bit code based on the
|
||
principle of unification:
|
||
two characters are the same if they look the
|
||
same even though they are from different
|
||
languages.
|
||
This principle, called Han unification,
|
||
allows the large Japanese, Chinese, and Korean
|
||
character sets to be packed comfortably into a 16-bit representation.
|
||
.PP
|
||
We chose the Unicode Standard for its technical merits and because its
|
||
code space was better defined.
|
||
Moreover,
|
||
the Unicode Consortium was derailing the
|
||
ISO 10646 standard.
|
||
(Now, in 1995,
|
||
ISO 10646 is a standard
|
||
with one 16-bit group defined,
|
||
which is almost exactly the Unicode Standard.
|
||
As most people expected, the two standards bodies
|
||
reached a détente and
|
||
ISO 10646 and Unicode represent the same character set.)
|
||
.PP
|
||
The Unicode Standard defines an adequate character set
|
||
but an unreasonable representation.
|
||
It states that all characters
|
||
are 16 bits wide and are communicated and stored in
|
||
16-bit units.
|
||
It also reserves a pair of characters
|
||
(hexadecimal FFFE and FEFF) to detect byte order
|
||
in transmitted text, requiring state in the byte stream.
|
||
(The Unicode Consortium was thinking of files, not pipes.)
|
||
To adopt this encoding,
|
||
we would have had to convert all text going
|
||
into and out of Plan 9 between ASCII and Unicode, which cannot be done.
|
||
Within a single program, in command of all its input and output,
|
||
it is possible to define characters as 16-bit quantities;
|
||
in the context of a networked system with
|
||
hundreds of applications on diverse machines
|
||
by different manufacturers,
|
||
it is impossible.
|
||
.PP
|
||
We needed a way to adapt the Unicode Standard to the tools-and-pipes
|
||
model of text processing embodied by the Unix system.
|
||
To do that, we
|
||
needed an ASCII-compatible textual
|
||
representation of Unicode characters for transmission
|
||
and storage.
|
||
In the draft ISO standard there was an informative
|
||
(non-required)
|
||
Annex
|
||
called UTF
|
||
that provided a byte stream encoding
|
||
of the 32-bit ISO code.
|
||
The encoding uses multibyte sequences composed
|
||
from the 190 printable characters of Latin-1
|
||
to represent character values larger
|
||
than 159.
|
||
.PP
|
||
The UTF encoding has several good properties.
|
||
By far the most important is that
|
||
a byte in the ASCII range 0-127 represents
|
||
itself in UTF.
|
||
Thus UTF is backward compatible with ASCII.
|
||
.PP
|
||
UTF has other advantages.
|
||
It is a byte encoding and is
|
||
therefore byte-order independent.
|
||
ASCII control characters appear in the byte stream
|
||
only as themselves, never as an element of a sequence
|
||
encoding another character,
|
||
so newline bytes separate lines of UTF text.
|
||
Finally, ANSI C's
|
||
.CW strcmp
|
||
function applied to UTF strings preserves the ordering of Unicode characters.
|
||
.PP
|
||
To encode and decode UTF is expensive (involving multiplication,
|
||
division, and modulo operations) but workable.
|
||
UTF's major disadvantage is that the encoding
|
||
is not self-synchronizing.
|
||
It is in general impossible to find the character
|
||
boundaries in a UTF string without reading from
|
||
the beginning of the string, although in practice
|
||
control characters such as newlines,
|
||
tabs, and blanks provide synchronization points.
|
||
.PP
|
||
In August 1992,
|
||
X/Open circulated a proposal for another UTF-like
|
||
byte encoding of Unicode characters.
|
||
Their major concern was that an embedded character
|
||
in a file name
|
||
(in particular a slash)
|
||
could be part of an escape sequence in UTF and
|
||
therefore confuse a traditional file system.
|
||
Their proposal would allow all 7-bit ASCII characters
|
||
to represent themselves
|
||
.I "and only themselves"
|
||
in text.
|
||
Multibyte sequences would contain only characters
|
||
with the high bit set.
|
||
We proposed a modification to the new UTF that
|
||
would address our synchronization problem.
|
||
Our proposal, which was originally known informally as UTF-2 and FSS-UTF,
|
||
is now referred to as UTF-8 and has been approved by ISO to become
|
||
Annex P to ISO 10646.
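.PP
The synchronizing property can be illustrated with a short sketch in portable C.
It assumes the UTF-8 layout as later standardized, in which every byte after the
first in a multibyte sequence has the form 10xxxxxx.
The helper
.CW runestart
is hypothetical and is not part of the Plan 9 library.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given a buffer of valid UTF-8 and an offset i
 * that may point into the middle of a character, return the offset
 * of the first byte of that character.
 */
size_t
runestart(const char *s, size_t i)
{
	/* Continuation bytes look like 10xxxxxx; back up past them. */
	while(i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80)
		i--;
	return i;
}
```

Because no character needs more than three bytes here, the loop backs up at most two bytes,
so a reader dropped at an arbitrary offset can resynchronize without scanning from the start.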
|
||
.PP
|
||
The model for text in Plan 9 is chosen from these
|
||
three standards*:
|
||
.FS
|
||
* ``That's the nice thing about standards\(emthere's so many to choose from.'' \- Andy Tannenbaum (no, the other one)
|
||
.FE
|
||
the Unicode character set encoded as a byte stream by
|
||
UTF-8, from
|
||
(soon to be) Annex P of ISO 10646.
|
||
Although this mixture may seem like a precarious position for us to adopt,
|
||
it is not as bad as it sounds.
|
||
ISO 10646 and the Unicode Standard have converged,
|
||
other systems such as Linux have adopted the same character set and encoding,
|
||
and the general feeling seems to be that Unicode and UTF-8 will be accepted as the way
|
||
to exchange text between systems.
|
||
The prognosis for wide acceptance is good.
|
||
.PP
|
||
There are a couple of aspects of the Unicode Standard we have not faced.
|
||
One is the issue of right-to-left text such as Hebrew or Arabic.
|
||
Since that is an issue of display, not representation, we believe
|
||
we can defer that problem for the moment without affecting our
|
||
ability to solve it later.
|
||
Another issue is diacriticals and `combining characters',
|
||
which cause overstriking of multiple Unicode characters.
|
||
Although necessary for some scripts, such as Thai, Arabic, and Hebrew,
|
||
such characters confuse the issues for Latin languages because they
|
||
generate multiple representations for accented characters.
|
||
ISO 10646 describes three levels of implementation;
|
||
in Plan 9 we decided not to address the issue.
|
||
Again, this can be labeled as a display issue and its finer points are still being debated,
|
||
so we felt comfortable deferring. Mañana.
|
||
.PP
|
||
Although we converted Plan 9 in the altruistic interests of
|
||
serving foreign languages, we have found the large character
|
||
set attractive for other reasons. The Unicode Standard includes many
|
||
characters\(emmathematical symbols, scientific notation,
|
||
more general punctuation, and more\(emthat we now use
|
||
daily in our work. We no longer test our imaginations
|
||
to find ways to include non-ASCII symbols in our text;
|
||
why type
|
||
.CW :-)
|
||
when you can use the character ☺?
|
||
Most compelling is the ability to absorb documents
|
||
and data that contain non-ASCII characters; our browser for the
|
||
Oxford English Dictionary
|
||
lets us see the dictionary as it really is, with pronunciation
|
||
in the IPA font, foreign phrases properly rendered, and so on,
|
||
.I "in plain text.
|
||
.PP
|
||
In the rest of this paper, except when
|
||
stated otherwise, the term `UTF' refers to the UTF-8 encoding
|
||
of Unicode characters as adopted by Plan 9.
|
||
.SH
|
||
C Compiler
|
||
.PP
|
||
The first program to be converted to UTF
|
||
was the C Compiler.
|
||
There are two levels of conversion.
|
||
On the syntactic level,
|
||
input to the C compiler
|
||
is UTF; on the semantic level,
|
||
the C language needs to define
|
||
how compiled programs manipulate
|
||
the UTF set.
|
||
.PP
|
||
The syntactic part is simple.
|
||
The ANSI C language standard defines the
|
||
source character set to be ASCII.
|
||
Since UTF is backward compatible with ASCII,
|
||
the compiler needs little change.
|
||
The only places where a larger character set
|
||
is allowed are in character constants, strings, and comments.
|
||
Since 7-bit ASCII characters can represent only
|
||
themselves in UTF,
|
||
the compiler does not have to be careful while looking
|
||
for the termination of a string or comment.
|
||
.PP
|
||
The Plan 9 compiler extends ANSI C to treat any Unicode
|
||
character with a value outside of the ASCII range as
|
||
an alphabetic.
|
||
To a Greek programmer or an English mathematician,
|
||
α is a sensible and now valid variable name.
|
||
.PP
|
||
On the semantic level, ANSI C allows,
|
||
but does not tie down,
|
||
the notion of a
|
||
.I "wide character
|
||
and admits string and character constants
|
||
of this type.
|
||
We chose the wide character type to be
|
||
.CW unsigned
|
||
.CW short .
|
||
In the libraries, the word
|
||
.CW Rune
|
||
is defined by a
|
||
.CW typedef
|
||
to be equivalent to
|
||
.CW unsigned
|
||
.CW short
|
||
and is
|
||
used to signify a Unicode character.
|
||
.PP
|
||
There are surprises; for example:
|
||
.P1
|
||
L'x' \f1is 120\fP
|
||
\&'x' \f1is 120\fP
|
||
L'ÿ' \f1is 255\fP
|
||
\&'ÿ' \f1is -1, stdio \fPEOF\f1 (if \fPchar\f1 is signed)\fP
|
||
L'\f1α\fP' \f1is 945\fP
|
||
\&'\f1α\fP' \f1is illegal\fP
|
||
.P2
|
||
In the string constants,
|
||
.P1
|
||
"\f(Jpこんにちは 世界\fP"
|
||
L"\f(Jpこんにちは 世界\fP",
|
||
.P2
|
||
the former is an array of
|
||
.CW chars
|
||
with 22 elements
|
||
and a null byte,
|
||
while the latter is an array of
|
||
.CW unsigned
|
||
.CW shorts
|
||
.CW Runes ) (
|
||
with 8 elements and a null
|
||
.CW Rune .
|
||
.PP
|
||
The Plan 9 library provides an output conversion function,
|
||
.CW print
|
||
(analogous to
|
||
.CW printf ),
|
||
with formats
|
||
.CW %c ,
|
||
.CW %C ,
|
||
.CW %s ,
|
||
and
|
||
.CW %S .
|
||
Since
|
||
.CW print
|
||
produces text, its output is always UTF.
|
||
The character conversion
|
||
.CW %c
|
||
(lower case) masks its argument
|
||
to 8 bits before converting to UTF.
|
||
Thus
|
||
.CW L'ÿ'
|
||
and
|
||
.CW 'ÿ'
|
||
printed under
|
||
.CW %c
|
||
will be identical,
|
||
but
|
||
.CW L'\f1α\fP'
|
||
will print as the Unicode
|
||
character with decimal value 177.
|
||
The character conversion
|
||
.CW %C
|
||
(upper case) masks its argument
|
||
to 16 bits before converting to UTF.
|
||
Thus
|
||
.CW L'ÿ'
|
||
and
|
||
.CW L'\f1α\fP'
|
||
will print correctly under
|
||
.CW %C ,
|
||
but
|
||
.CW 'ÿ'
|
||
will not.
|
||
The conversion
|
||
.CW %s
|
||
(lower case)
|
||
expects a pointer to
|
||
.CW char
|
||
and copies UTF sequences up to a null byte.
|
||
The conversion
|
||
.CW %S
|
||
(upper case) expects a pointer to
|
||
.CW Rune
|
||
and
|
||
performs sequential
|
||
.CW %C
|
||
conversions until a null
|
||
.CW Rune
|
||
is encountered.
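.PP
The masking rules can be checked with plain arithmetic.
The sketch below is portable C, not the Plan 9 implementation;
.CW mask_c
and
.CW mask_C
are hypothetical names that stand for the first step of each conversion.

```c
/* Hypothetical stand-in for the first step of %c: keep the low 8 bits. */
unsigned
mask_c(long v)
{
	return (unsigned)(v & 0xFF);
}

/* Hypothetical stand-in for the first step of %C: keep the low 16 bits. */
unsigned
mask_C(long v)
{
	return (unsigned)(v & 0xFFFF);
}
```

With a signed
.CW char ,
.CW 'ÿ'
has the value -1, which masks to 255 under
.CW %c
but to FFFF under
.CW %C ;
the value 945 masks to 177 under
.CW %c .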
|
||
.PP
|
||
Another problem in format conversion
|
||
is the definition of
|
||
.CW %10s :
|
||
does the number refer to bytes or characters?
|
||
We decided that such formats were most
|
||
often used to align output columns and
|
||
so made the number count characters.
|
||
Some programs, however, use the count
|
||
to place blank-padded strings
|
||
in fixed-sized arrays.
|
||
These programs must be found and corrected.
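.PP
The difference between the two readings of the count can be sketched in portable C.
The code assumes valid UTF-8 input and standard
.CW snprintf ;
the helpers
.CW nchars
and
.CW padchars
are hypothetical and are not how
.CW print
is implemented.

```c
#include <stdio.h>
#include <string.h>

/* Count characters in valid UTF-8 by counting the bytes that start one. */
size_t
nchars(const char *s)
{
	size_t n = 0;

	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}

/*
 * Hypothetical helper: right-justify s in a field of width characters,
 * writing at most len bytes (including the null) to out.
 * Returns the number of bytes the full result needs, as snprintf does.
 */
int
padchars(char *out, size_t len, const char *s, size_t width)
{
	size_t n = nchars(s);
	int pad = n < width ? (int)(width - n) : 0;

	return snprintf(out, len, "%*s%s", pad, "", s);
}
```

A two-character Greek string padded to five characters occupies seven bytes,
which is why code that sizes a buffer from the field width needs attention.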
|
||
.PP
|
||
Here is a complete example:
|
||
.P1
|
||
#include <u.h>
#include <libc.h>

char c[] = "\f(Jpこんにちは 世界\fP";
Rune s[] = L"\f(Jpこんにちは 世界\fP";

void
main(void)
{
	print("%d, %d\en", sizeof(c), sizeof(s));
	print("%s\en", c);
	print("%S\en", s);
	exits(0);
}
|
||
.P2
|
||
.PP
|
||
This program prints
|
||
.CW 23,
|
||
.CW 18
|
||
and then two identical lines of
|
||
UTF text.
|
||
In practice,
|
||
.CW %S
|
||
and
|
||
.CW L"..."
|
||
are rare in programs; one reason is
|
||
that most formatted I/O is done in unconverted UTF.
|
||
.SH
|
||
Ramifications
|
||
.PP
|
||
All programs in Plan 9 now read and write text as UTF, not ASCII.
|
||
This change breaks two deep-rooted symmetries implicit in most C programs:
|
||
.IP 1.
|
||
A character is no longer a
|
||
.CW char .
|
||
.IP 2.
|
||
The internal representation (Rune) of a character now differs from its
|
||
external representation (UTF).
|
||
.PP
|
||
In the sections that follow,
|
||
we show how these issues were faced in the layers of
|
||
system software from the operating system up to the applications.
|
||
The effects are wide-reaching and often surprising.
|
||
.SH
|
||
Operating system
|
||
.PP
|
||
Since UTF is the only format for text in Plan 9,
|
||
the interface to the operating system had to be converted to UTF.
|
||
Text strings cross the interface in several places:
|
||
command arguments,
|
||
file names,
|
||
user names (people can log in using their native name),
|
||
error messages,
|
||
and miscellaneous minor places such as commands to the I/O system.
|
||
Little change was required: null-terminated UTF strings
|
||
are equivalent to null-terminated ASCII strings for most purposes
|
||
of the operating system.
|
||
The library routines described in the next section made that
|
||
change straightforward.
|
||
.PP
|
||
The window system, once called
|
||
.CW 8.5 ,
|
||
is now rightfully called
|
||
.CW 8½ .
|
||
.SH
|
||
Libraries
|
||
.PP
|
||
A header file included by all programs (see [Pike92]) declares
|
||
the
|
||
.CW Rune
|
||
type to hold 16-bit character values:
|
||
.P1
|
||
typedef unsigned short Rune;
|
||
.P2
|
||
Also defined are several constants relevant to UTF:
|
||
.P1
|
||
enum
{
	UTFmax    = 3,    /* maximum bytes per rune */
	Runesync  = 0x80, /* can't appear in UTF sequence (<) */
	Runeself  = 0x80, /* rune==UTF sequence (<) */
	Runeerror = 0x80, /* decoding error in UTF */
};
|
||
.P2
|
||
(With the original UTF,
|
||
.CW Runesync
|
||
was hexadecimal 21 and
|
||
.CW Runeself
|
||
was A0.)
|
||
.CW UTFmax
|
||
bytes are sufficient
|
||
to hold the UTF encoding of any Unicode character.
|
||
Characters of value less than
|
||
.CW Runesync
|
||
only appear in a UTF string as
|
||
themselves, never as part of a sequence encoding another character.
|
||
Characters of value less than
|
||
.CW Runeself
|
||
encode into single bytes
|
||
of the same value.
|
||
Finally, when the library detects errors in UTF input\(embyte sequences
|
||
that are not valid UTF sequences\(emit converts the first byte of the
|
||
error sequence to the character
|
||
.CW Runeerror .
|
||
There is little a rune-oriented program can do when given bad data
|
||
except exit, which is unreasonable, or carry on.
|
||
Originally the conversion routines, described below,
|
||
returned errors when given invalid UTF,
|
||
but we found ourselves repeatedly checking for errors and ignoring them.
|
||
We therefore decided to convert a bad sequence to a valid rune
|
||
and continue processing.
|
||
(The ANSI C routines, on the other hand, return errors.)
|
||
.PP
|
||
This technique does have the unfortunate property that converting
|
||
invalid UTF byte strings in and out of runes does not preserve the input,
|
||
but this circumstance only occurs when non-textual input is
|
||
given to a textual program.
|
||
The Unicode Standard defines an error character, value FFFD, to stand for
|
||
characters from other sets that it does not represent.
|
||
The
|
||
.CW Runeerror
|
||
character is a different concept, related to the encoding rather than the character set, so we
|
||
chose a different character for it.
|
||
.PP
|
||
The Plan 9 C library contains a number of routines for
|
||
manipulating runes.
|
||
The first set converts between runes and UTF strings:
|
||
.P1
|
||
extern int runetochar(char*, Rune*);
|
||
extern int chartorune(Rune*, char*);
|
||
extern int runelen(long);
|
||
extern int fullrune(char*, int);
|
||
.P2
|
||
.CW Runetochar
|
||
translates a single
|
||
.CW Rune
|
||
to a UTF sequence and returns the number of bytes produced.
|
||
.CW Chartorune
|
||
goes the other way, reporting how many bytes were consumed.
|
||
.CW Runelen
|
||
returns the number of bytes in the UTF encoding of a rune.
|
||
.CW Fullrune
|
||
examines a UTF string up to a specified number of bytes
|
||
and reports whether the string begins with a complete UTF encoding.
|
||
All these routines use the
|
||
.CW Runeerror
|
||
character to work around encoding problems.
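.PP
A minimal sketch of such a conversion pair follows, in portable C.
It is not the Plan 9 source: the names
.CW Rune16 ,
.CW runeenc ,
and
.CW runedec
are hypothetical, the signatures differ from those above,
and the bit layout is the standard UTF-8 one for values up to 16 bits.
The error handling follows the policy described here:
a bad sequence consumes one byte and yields the error value 0x80.

```c
typedef unsigned short Rune16;	/* hypothetical stand-in for Rune */

enum {
	Runeerr = 0x80,	/* value produced for a bad sequence */
};

/* Encode r into s (room for 3 bytes); return the number of bytes written. */
int
runeenc(char *s, Rune16 r)
{
	if(r < 0x80){
		s[0] = (char)r;
		return 1;
	}
	if(r < 0x800){
		s[0] = (char)(0xC0 | (r >> 6));
		s[1] = (char)(0x80 | (r & 0x3F));
		return 2;
	}
	s[0] = (char)(0xE0 | (r >> 12));
	s[1] = (char)(0x80 | ((r >> 6) & 0x3F));
	s[2] = (char)(0x80 | (r & 0x3F));
	return 3;
}

/*
 * Decode one character from the null-terminated string str into *r;
 * return the number of bytes consumed.  Never fails: invalid input
 * consumes one byte and produces Runeerr.
 */
int
runedec(Rune16 *r, const char *str)
{
	const unsigned char *s = (const unsigned char *)str;
	unsigned c = s[0], v;

	if(c < 0x80){
		*r = (Rune16)c;
		return 1;
	}
	if((c & 0xE0) == 0xC0){		/* 110xxxxx 10xxxxxx */
		if((s[1] & 0xC0) == 0x80){
			v = (c & 0x1F) << 6 | (s[1] & 0x3F);
			if(v >= 0x80){	/* reject overlong forms */
				*r = (Rune16)v;
				return 2;
			}
		}
	}else if((c & 0xF0) == 0xE0){	/* 1110xxxx 10xxxxxx 10xxxxxx */
		if((s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80){
			v = (c & 0x0F) << 12 | (s[1] & 0x3F) << 6 | (s[2] & 0x3F);
			if(v >= 0x800){	/* reject overlong forms */
				*r = (Rune16)v;
				return 3;
			}
		}
	}
	*r = Runeerr;
	return 1;
}
```

Because the decoder always returns a rune and a positive byte count,
a caller can loop over its input without an error path.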
|
||
.PP
|
||
There is also a set of routines for examining null-terminated UTF strings,
|
||
based on the model of the ANSI standard
|
||
.CW str
|
||
routines, but with
|
||
.CW utf
|
||
substituted for
|
||
.CW str
|
||
and
|
||
.CW rune
|
||
for
|
||
.CW chr :
|
||
.P1
|
||
extern int utflen(char*);
|
||
extern char* utfrune(char*, long);
|
||
extern char* utfrrune(char*, long);
|
||
extern char* utfutf(char*, char*);
|
||
.P2
|
||
.CW Utflen
|
||
returns the number of runes in a UTF string;
|
||
.CW utfrune
|
||
returns a pointer to the first occurrence of a rune in a UTF string;
|
||
and
|
||
.CW utfrrune
|
||
a pointer to the last.
|
||
.CW Utfutf
|
||
searches for the first occurrence of a UTF string in another UTF string.
|
||
Given the synchronizing property of UTF-8,
|
||
.CW utfutf
|
||
is the same as
|
||
.CW strstr
|
||
if the arguments point to valid UTF strings.
|
||
.PP
|
||
It is a mistake to use
|
||
.CW strchr
|
||
or
|
||
.CW strrchr
|
||
unless searching for a 7-bit ASCII character, that is, a character
|
||
less than
|
||
.CW Runeself .
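.PP
The hazard is easy to reproduce in portable C.
In the sketch below,
.CW findrune
is a hypothetical helper, not the library's
.CW utfrune ;
it assumes valid UTF-8 text and 16-bit character values.

```c
#include <string.h>

/*
 * Hypothetical helper: find the first occurrence of the character
 * with value r in valid UTF-8 text.  Returns a pointer to its first
 * byte, or NULL if it does not occur.
 */
const char*
findrune(const char *s, unsigned short r)
{
	char enc[4];
	int n = 0;

	if(r < 0x80)	/* ASCII: a byte search is safe */
		return strchr(s, r);
	if(r < 0x800)
		enc[n++] = (char)(0xC0 | (r >> 6));
	else{
		enc[n++] = (char)(0xE0 | (r >> 12));
		enc[n++] = (char)(0x80 | ((r >> 6) & 0x3F));
	}
	enc[n++] = (char)(0x80 | (r & 0x3F));
	enc[n] = '\0';
	return strstr(s, enc);	/* search for the whole sequence */
}
```

For example, in the UTF-8 text for ``café'' a call to
.CW strchr
with the value A9 (the copyright sign) reports a match on the second byte of é,
whereas searching for the full encoded sequence correctly finds nothing.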
|
||
.PP
|
||
We have no routines for manipulating null-terminated arrays of
|
||
.CW Runes .
|
||
Although they should probably exist for completeness, we have
|
||
found no need for them, for the same reason that
|
||
.CW %S
|
||
and
|
||
.CW L"..."
|
||
are rarely used.
|
||
.PP
|
||
Most Plan 9 programs use a new buffered I/O library, BIO, in place of
|
||
Standard I/O.
|
||
BIO contains routines to read and write UTF streams, converting to and from
|
||
runes.
|
||
.CW Bgetrune
|
||
returns, as a
|
||
.CW Rune
|
||
within a
|
||
.CW long ,
|
||
the next character in the UTF input stream;
|
||
.CW Bputrune
|
||
takes a rune and writes its UTF representation.
|
||
.CW Bungetrune
|
||
puts a rune back into the input stream for rereading.
|
||
.PP
|
||
Plan 9 programs use a simple set of macros to process command line arguments.
|
||
Converting these macros to UTF automatically updated the
|
||
argument processing of most programs.
|
||
In general,
|
||
argument flag names can no longer be held in bytes and
|
||
arrays of 256 bytes cannot be used to hold a set of flags.
|
||
.PP
|
||
We have done nothing analogous to ANSI C's locales, partly because
|
||
we do not feel qualified to define locales and partly because we remain
|
||
unconvinced of that model for dealing with the problems.
|
||
That is really more an issue of internationalization than conversion
|
||
to a larger character set; on the other hand,
|
||
because we have chosen a single character set that encompasses
|
||
most languages, some of the need for
|
||
locales is eliminated.
|
||
(We have a utility,
|
||
.CW tcs ,
|
||
that translates between UTF and other character sets.)
|
||
.PP
|
||
There are several reasons why our library does not follow the ANSI design
|
||
for wide and multi-byte characters.
|
||
The ANSI model was designed by a committee, untried, almost
|
||
as an afterthought, whereas
|
||
we wanted to design as we built.
|
||
(We made several major changes to the interface
|
||
as we became familiar with the problems involved.)
|
||
We disagree with ANSI C's handling of invalid multi-byte sequences.
|
||
Also, the ANSI C library is incomplete:
|
||
although it contains some crucial routines for handling
|
||
wide and multi-byte characters, there are some serious omissions.
|
||
For example, our software can exploit
|
||
the fact that UTF preserves ASCII characters in the byte stream.
|
||
We could remove that assumption by replacing all
|
||
calls to
|
||
.CW strchr
|
||
with
|
||
.CW utfrune
|
||
and so on.
|
||
(Because of the weaker properties of the original UTF,
|
||
we have actually done so.)
|
||
ANSI C cannot:
|
||
the standard says nothing about the representation, so portable code should
|
||
.I never
|
||
call
|
||
.CW strchr ,
|
||
yet there is no ANSI equivalent to
|
||
.CW utfrune .
|
||
ANSI C simultaneously invalidates
|
||
.CW strchr
|
||
and offers no replacement.
|
||
.PP
|
||
Finally, ANSI did nothing to integrate wide characters
|
||
into the I/O system: it gives no method for printing
|
||
wide characters.
|
||
We therefore needed to invent some things and decided to invent
|
||
everything.
|
||
In the end, some of our entry points do correspond closely to
|
||
ANSI routines\(emfor example
|
||
.CW chartorune
|
||
and
|
||
.CW runetochar
|
||
are similar to
|
||
.CW mbtowc
|
||
and
|
||
.CW wctomb \(embut
|
||
Plan 9's library defines more functionality, enough
|
||
to write real applications comfortably.
|
||
.SH
|
||
Converting the tools
|
||
.PP
|
||
The source for our tools and applications had already been converted to
|
||
work with Latin-1, so it was `8-bit safe', but the conversion to the Unicode
|
||
Standard and UTF is more involved.
|
||
Some programs needed no change at all:
|
||
.CW cat ,
|
||
for instance,
|
||
interprets its argument strings, delivered in UTF,
|
||
as file names that it passes uninterpreted to the
|
||
.CW open
|
||
system call,
|
||
and then just copies bytes from its input to its output;
|
||
it never makes decisions based on the values of the bytes.
|
||
(Plan 9
|
||
.CW cat
|
||
has no options such as
|
||
.CW -v
|
||
to complicate matters.)
|
||
Most programs, however, needed modest change.
|
||
.PP
|
||
It is difficult to
|
||
find automatically the places that need attention,
|
||
but
|
||
.CW grep
|
||
helps.
|
||
Software that uses the libraries conscientiously can be searched
|
||
for calls to library routines that examine bytes as characters:
|
||
.CW strchr ,
|
||
.CW strrchr ,
|
||
.CW strstr ,
|
||
etc.
|
||
Replacing these by calls to
|
||
.CW utfrune ,
|
||
.CW utfrrune ,
|
||
and
|
||
.CW utfutf
|
||
is enough to fix many programs.
|
||
Few tools actually need to operate on runes internally;
|
||
more typically they need only to look for the final slash in a file
|
||
name and similar trivial tasks.
|
||
Of the 170 C source programs in the top levels of
|
||
.CW /sys/src/cmd ,
|
||
only 23 now contain the word
|
||
.CW Rune .
|
||
.PP
|
||
The programs that
|
||
.I do
|
||
store runes internally
|
||
are mostly those whose
|
||
.I raison
|
||
.I d'être
|
||
is character manipulation:
|
||
.CW sam
|
||
(the text editor),
|
||
.CW sed ,
|
||
.CW sort ,
|
||
.CW tr ,
|
||
.CW troff ,
|
||
.CW 8½
|
||
(the window system and terminal emulator),
|
||
and so on.
|
||
To decide whether to compute using runes
|
||
or UTF-encoded byte strings requires balancing the cost of converting
|
||
the data when read and written
|
||
against the cost of converting relevant text on demand.
|
||
For programs such as editors that run a long time with a relatively
|
||
constant dataset, runes are the better choice.
|
||
There are space considerations too, but they are more complicated:
|
||
plain ASCII text grows when converted to runes; UTF-encoded Japanese
|
||
shrinks.
|
||
.PP
|
||
Again, it is hard to automate the conversion of a program from
|
||
.CW chars
|
||
to
|
||
.CW Runes .
|
||
It is not enough just to change the type of variables; the assumption
|
||
that bytes and characters are equivalent can be insidious.
|
||
For instance, to clear a character array by
|
||
.P1
|
||
memset(buf, 0, BUFSIZE)
|
||
.P2
|
||
becomes wrong if
|
||
.CW buf
|
||
is changed from an array of
|
||
.CW chars
|
||
to an array of
|
||
.CW Runes .
|
||
Any program that indexes tables based on character values needs
|
||
rethinking.
|
||
Consider
|
||
.CW tr ,
|
||
which originally used multiple 256-byte arrays for the mapping.
|
||
The naïve conversion would yield multiple 65536-rune arrays.
|
||
Instead Plan 9
|
||
.CW tr
|
||
saves space by building in effect
|
||
a run-encoded version of the map.
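.PP
One way to picture a run-encoded map is as a short list of ranges.
The sketch below is portable C under that assumption;
the layout of
.CW Run
and the function
.CW maprune
are hypothetical and are not taken from the source of
.CW tr .

```c
typedef unsigned short Rune16;	/* hypothetical stand-in for Rune */

/* One run of the map: runes lo..hi translate to to, to+1, ... */
typedef struct Run Run;
struct Run {
	Rune16 lo;
	Rune16 hi;
	Rune16 to;
};

/* Translate r through n runs; runes outside every run pass through. */
Rune16
maprune(const Run *map, int n, Rune16 r)
{
	int i;

	for(i = 0; i < n; i++)
		if(map[i].lo <= r && r <= map[i].hi)
			return (Rune16)(map[i].to + (r - map[i].lo));
	return r;
}
```

A map from lower case to upper case for both Latin and Greek then needs two entries
rather than a 65536-element table.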
|
||
.PP
|
||
.CW Sort
|
||
has related problems.
|
||
The cooperation of UTF and
|
||
.CW strcmp
|
||
means that a simple sort\(emone with no options\(emcan be done
|
||
on the original UTF strings using
|
||
.CW strcmp .
|
||
With sorting options enabled, however,
|
||
.CW sort
|
||
may need to convert its input to runes: for example,
|
||
option
|
||
.CW -t\f1α\fP
|
||
requires searching for alphas in the input text to
|
||
crack the input into fields.
|
||
The field specifier
|
||
.CW +3.2
|
||
refers to 2 runes beyond the third field.
|
||
Some of the other options are hopelessly provincial:
|
||
consider the case-folding and dictionary order options
|
||
(Japanese doesn't even have an official dictionary order) or
|
||
.CW -M
|
||
which compares by case-insensitive English month name.
|
||
Handling these options involves the
|
||
larger issues of internationalization and is beyond the scope
|
||
of this paper and our expertise.
|
||
Plan 9
|
||
.CW sort
|
||
works sensibly with options that make sense relative to the input.
|
||
The simple and most important options are, however, usually meaningful.
|
||
In particular,
|
||
.CW sort
|
||
sorts UTF into the same order that
|
||
.CW look
|
||
expects.
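.PP
The cooperation of UTF and
.CW strcmp
can be demonstrated in portable C with
.CW qsort .
The sketch assumes valid UTF-8 input;
.CW sortutf
is a hypothetical helper, not part of
.CW sort .

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparison function: order two strings by their bytes. */
static int
cmpstr(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort n UTF-8 strings in place, as an option-free sort would. */
void
sortutf(const char **lines, size_t n)
{
	qsort(lines, n, sizeof lines[0], cmpstr);
}
```

Since
.CW strcmp
compares bytes as unsigned values, the result is in order of character value
without any conversion to runes.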
|
||
.PP
|
||
Regular expression-matching algorithms need rethinking to
|
||
be applied to UTF text.
|
||
Deterministic automata are usually applied to bytes;
|
||
converting them to operate on variable-sized byte sequences is awkward.
|
||
On the other hand, converting the input stream to runes adds measurable
|
||
expense
|
||
and the state tables expand
|
||
from size 256 to 65536; it can be expensive just to generate them.
|
||
For simple string searching,
|
||
the Boyer-Moore algorithm works with UTF provided the input is
|
||
guaranteed to be only valid UTF strings; however, it does not work
|
||
with the old UTF encoding.
|
||
At a more mundane level, even character classes are harder:
|
||
the usual bit-vector representation within a non-deterministic automaton
|
||
is unwieldy with 65536 characters in the alphabet.
|
||
.PP
|
||
We compromised.
|
||
An existing library for compiling and executing regular expressions
|
||
was adapted to work on runes, with two entry points for searching
|
||
in arrays of runes and arrays of chars (the pattern is always UTF text).
|
||
Character classes are represented internally as runs of runes;
|
||
the reserved value
|
||
.CW FFFF
|
||
marks the end of the class.
|
||
Then
|
||
.I all
|
||
utilities that use regular expressions\(emeditors,
|
||
.CW grep ,
|
||
.CW awk ,
|
||
etc.\(emexcept the shell, whose notation
|
||
was grandfathered, were converted to use the library.
|
||
For some programs, there was a concomitant loss of performance,
|
||
but there was also a strong advantage.
|
||
To our knowledge, Plan 9 is the only Unix-like system
|
||
that has a single definition and implementation of
|
||
regular expressions; patterns are written and interpreted
|
||
identically by all the programs in the system.
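.PP
A character class stored as runs might be tested as in the following portable C sketch.
The layout, pairs of first and last rune followed by the end marker, is an assumption,
and
.CW inclass
is a hypothetical name, not an entry point of the regular expression library.

```c
typedef unsigned short Rune16;	/* hypothetical stand-in for Rune */

enum {
	Classend = 0xFFFF,	/* reserved value marking the end of a class */
};

/*
 * Return 1 if r falls in any run of the class, 0 otherwise.
 * cls holds (first, last) pairs followed by Classend.
 */
int
inclass(const Rune16 *cls, Rune16 r)
{
	for(; cls[0] != Classend; cls += 2)
		if(cls[0] <= r && r <= cls[1])
			return 1;
	return 0;
}
```

A class such as the digits, the letters a to f, and the lower-case Greek letters
then takes seven runes instead of a 65536-bit vector.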
|
||
.PP
|
||
A handful of programs have the notion of character built into them
|
||
so strongly as to confuse the issue of what they should do with UTF input.
|
||
Such programs were treated as individual special cases.
|
||
For example,
|
||
.CW wc
|
||
is, by default, unchanged in behavior and output; a new option,
|
||
.CW -r ,
|
||
counts the number of correctly encoded runes\(emvalid UTF sequences\(emin
|
||
its input;
|
||
.CW -b
|
||
the number of invalid sequences.
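.PP
The two counts can be sketched in portable C.
The code assumes standard UTF-8 for values up to 16 bits and, as a further assumption,
counts each byte that does not begin a valid sequence as one invalid sequence;
.CW seqlen
and
.CW countrunes
are hypothetical names and this is not the source of
.CW wc .

```c
#include <stddef.h>

/* Length of the valid sequence (1 to 3 bytes) at s[0..n), n >= 1, or 0 if none. */
static size_t
seqlen(const unsigned char *s, size_t n)
{
	if(s[0] < 0x80)
		return 1;
	if((s[0] & 0xE0) == 0xC0 && s[0] >= 0xC2
	&& n >= 2 && (s[1] & 0xC0) == 0x80)
		return 2;
	if((s[0] & 0xF0) == 0xE0 && n >= 3
	&& (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80
	&& (s[0] != 0xE0 || s[1] >= 0xA0))
		return 3;
	return 0;
}

/* Count well-formed characters and bad bytes in the n bytes at buf. */
void
countrunes(const char *buf, size_t n, size_t *good, size_t *bad)
{
	const unsigned char *s = (const unsigned char *)buf;
	size_t i, k;

	*good = *bad = 0;
	for(i = 0; i < n; i += k){
		k = seqlen(s + i, n - i);
		if(k == 0){	/* skip one bad byte and resynchronize */
			(*bad)++;
			k = 1;
		}else
			(*good)++;
	}
}
```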
|
||
.PP
|
||
It took us several months to convert all the software in the system
|
||
to the Unicode Standard and the old UTF.
|
||
When we decided to convert from that to the new UTF,
|
||
only three things needed to be done.
|
||
First, we rewrote the library routines to encode and decode the
|
||
new UTF. This took an evening.
|
||
Next, we converted all the files containing UTF
|
||
to the new encoding.
|
||
We wrote a trivial program to look for non-ASCII bytes in
|
||
text files and used a Plan 9 program called
|
||
.CW tcs
|
||
(translate character set) to change encodings.
|
||
Finally, we recompiled all the system software;
|
||
the library interface was unchanged, so recompilation was sufficient
|
||
to effect the transformation.
|
||
The last two steps were done concurrently and took an afternoon.
|
||
We concluded that the actual encoding is relatively unimportant to the
|
||
software; the adoption of large characters and a byte-stream encoding
|
||
.I per
|
||
.I se
|
||
are much deeper issues.
|
.SH
Graphics and fonts
.PP
Plan 9 provides only minimal support for plain text terminals.
It is instead designed to be used with all character input and
output mediated by a window system such as
.CW 8½ .
The window system and related software are responsible for the
display of UTF text as Unicode character images.
For plain text, the window system must provide a user-settable
.I font
that provides a (possibly empty) picture for each Unicode character.
Fancier applications that use bold and Italic characters
need multiple fonts storing multiple pictures for each
Unicode value.
All the issues are apparent, though,
in just the problem of
displaying a single image for each character, that is, the
Unicode equivalent of a plain text terminal.
With 128 or even 256 characters, a font can be just
an array of bitmaps. With 65536 characters,
a more sophisticated design is necessary. To store the ideographs
for just Japanese as 16×16×1 bit images,
the smallest they can reasonably be, takes over a quarter of a
megabyte. Make the images a little larger, store more bits per
pixel, and hold a copy in every running application, and the
memory cost becomes unreasonable.
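.PP
The arithmetic behind these figures is easy to reproduce.
The C sketch below uses a hypothetical function,
.CW fontbytes ,
and assumes images are packed with no per-row padding or header.
A 16×16×1 image is then 32 bytes, a quarter of a megabyte holds
8192 such images, and a full set of 65536 would need 2 megabytes.

```c
/*
 * Bytes needed to store n character images of w by h pixels at
 * depth bits per pixel, assuming each image is packed with no
 * row padding and rounded up to a whole byte.
 */
unsigned long
fontbytes(unsigned long n, unsigned w, unsigned h, unsigned depth)
{
    unsigned long bits = (unsigned long)w * h * depth;

    return n * ((bits + 7) / 8);
}
```

For example,
.CW "fontbytes(65536, 16, 16, 1)"
is 2097152, and doubling either the depth or one dimension doubles the cost.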
.PP
The structure of the bitmap graphics services is described at length elsewhere
[Pike91].
In summary, the memory holding the bitmaps resides in the same machine that has
the display, mouse, and keyboard: the terminal in Plan 9 terminology,
the workstation in others'.
Access to that memory and associated services is provided
by device files served by system
software on the terminal. One of those files,
.CW /dev/bitblt ,
interprets messages written to it as requests for actions
corresponding to entry points in the graphics library:
allocate a bitmap, execute a raster operation, draw a text string, etc.
The window system
acts as a multiplexer that mediates access to the services
and resources of the terminal by simulating in each client window
a set of files mirroring those provided by the system.
That is, each window has a distinct
.CW /dev/mouse ,
.CW /dev/bitblt ,
and so on through which applications drive graphical
input and output.
.PP
One of the resources managed by
.CW 8½
and the terminal is the set of active
.I subfonts .
Each subfont holds the
bitmaps and associated data structures for a sequential set of Unicode
characters.
Subfonts are stored in files and loaded into the terminal by
.CW 8½
or an application.
For example, one subfont
might hold the images of the first 256 characters of the Unicode space,
corresponding to the Latin-1 character set;
another might hold the standard phonetic character set, Unicode characters
with value 0250 to 02E9.
These files are collected in directories corresponding to typefaces:
.CW /lib/font/bit/pelm
contains the Pellucida Monospace character set, with subfonts holding
the Latin-1, Greek, Cyrillic and other components of the typeface.
A suffix on subfont files encodes (in a subfont-specific
way) the size of the images:
.CW /lib/font/bit/pelm/latin1.9
contains the Latin-1 Pellucida Monospace characters with lower
case letters 9 pixels high;
.CW /lib/font/bit/jis/jis5400.16
contains 16-pixel high
ideographs starting at Unicode value 5400.
.PP
The subfonts do not identify which portion of the Unicode space
they cover. Instead, a
font file, in plain text,
describes how to assemble subfonts into a complete
character set.
The font file is presented as an argument to the window system
to determine how plain text is displayed in text windows and
applications.
Here is the beginning of the font file
.CW /lib/font/bit/pelm/jis.9.font ,
which describes the layout of a font covering that portion of
the Unicode Standard for which we have characters of typical
display size, using Japanese characters
to cover the Han space:
.P1
18 14
0x0000 0x00FF latin1.9
0x0100 0x017E latineur.9
0x0250 0x02E9 ipa.9
0x0386 0x03F5 greek.9
0x0400 0x0475 cyrillic.9
0x2000 0x2044 ../misc/genpunc.9
0x2070 0x208E supsub.9
0x20A0 0x20AA currency.9
0x2100 0x2138 ../misc/letterlike.9
0x2190 0x21EA ../misc/arrows
0x2200 0x227F ../misc/math1
0x2280 0x22F1 ../misc/math2
0x2300 0x232C ../misc/tech
0x2500 0x257F ../misc/chart
0x2600 0x266F ../misc/ding
.P2
.P1
0x3000 0x303f ../jis/jis3000.16
0x30a1 0x30fe ../jis/katakana.16
0x3041 0x309e ../jis/hiragana.16
0x4e00 0x4fff ../jis/jis4e00.16
0x5000 0x51ff ../jis/jis5000.16
\&...
.P2
The first two numbers set the interline spacing of the font (18
pixels) and the distance from the baseline to the top of the
line (14 pixels).
When characters are displayed, they are placed so as best
to fit within those constraints; characters
too large to fit will be truncated.
The rest of the file associates subfont files
with portions of Unicode space.
The first five such files are in the Pellucida Monospace typeface
and directory; others reside in other directories. The file names
are relative to the font file's own location.
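.PP
To make the format concrete, here is a minimal C sketch that parses
the text of a font file and finds the subfont covering a given
Unicode value.
It is not the graphics library's code: the types
.CW Range
and
.CW Fontmap ,
the functions
.CW parsefont
and
.CW subfontfor ,
and the fixed table sizes are all hypothetical, and parsing simply
stops at the first line that does not match the three-field form.

```c
#include <stdio.h>

enum { MAXSUB = 64, MAXNAME = 128 };    /* arbitrary limits for the sketch */

typedef struct {
    int min, max;           /* inclusive range of Unicode values */
    char name[MAXNAME];     /* subfont file, relative to the font file */
} Range;

typedef struct {
    int height;             /* interline spacing in pixels */
    int ascent;             /* baseline to top of line in pixels */
    int nrange;
    Range range[MAXSUB];
} Fontmap;

/*
 * Parse the text of a font file into f.
 * Returns 0 on success, -1 if the header or a range is malformed.
 */
int
parsefont(const char *text, Fontmap *f)
{
    int n;
    Range *r;

    f->nrange = 0;
    if (sscanf(text, "%d %d%n", &f->height, &f->ascent, &n) != 2)
        return -1;
    text += n;
    while (f->nrange < MAXSUB) {
        r = &f->range[f->nrange];
        /* %i accepts the 0x prefix used for the range bounds */
        if (sscanf(text, "%i %i %127s%n", &r->min, &r->max, r->name, &n) != 3)
            break;
        if (r->min > r->max)
            return -1;
        f->nrange++;
        text += n;
    }
    return 0;
}

/* Return the subfont file covering value c, or NULL if no range includes it. */
const char *
subfontfor(const Fontmap *f, int c)
{
    int i;

    for (i = 0; i < f->nrange; i++)
        if (f->range[i].min <= c && c <= f->range[i].max)
            return f->range[i].name;
    return NULL;
}
```

With the file shown above, looking up 0x3042 would yield
.CW ../jis/hiragana.16 ,
while a value in a gap between ranges, such as 0x0200, yields no subfont
and so would presumably be drawn as an empty picture.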
.PP
There are several advantages to this two-level structure.
First, it simultaneously breaks the huge Unicode space into manageable
components and provides a unifying architecture for
assembling fonts from disjoint pieces.
Second, the structure promotes sharing.
For example, we have only one set of Japanese
characters but dozens of typefaces for the Latin-1 characters,
and this structure permits us to store only one copy of the
Japanese set but use it with any Roman typeface.
Also, customization is easy.
English-speaking users who don't need Japanese characters
but may want to read an on-line Oxford English Dictionary can
assemble a custom font with the
Latin-1 (or even just ASCII) characters and the International
Phonetic Alphabet (IPA).
Moreover, to do so requires just editing a plain text file,
not using a special font editing tool.
Finally, the structure guides the design of
caching protocols to improve performance and memory usage.
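.PP
As an illustration of such customization, a font file for the
dictionary reader might contain only the following lines.
This is a sketch built from entries in the listing shown earlier;
reusing the same spacing line (18 14) and placing the file in the
.CW pelm
directory, so that the relative names resolve, are assumptions.

```
18 14
0x0000 0x00FF latin1.9
0x0250 0x02E9 ipa.9
```

Characters outside the two ranges simply have no picture in this font.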
.PP
To load a complete Unicode character set into each application
would consume too
much memory and, particularly on slow terminal lines, would take
unreasonably long.
Instead, Plan 9 assembles a multi-level cache structure for
each font.
An application opens a font file, reads and parses it,
and allocates a data structure.
A message written to
.CW /dev/bitblt
allocates an associated structure held in the terminal, in particular,
a bitmap to act as a cache
for recently used character images.
Other messages copy these images to bitmaps such as the screen
by loading characters from subfonts into the cache on demand and
from there to the destination bitmap.
The protocol to draw characters is in terms of cache indices,
not Unicode character numbers or UTF sequences.
These details are hidden from the application, which instead
sees only a subroutine to draw a string in a bitmap from a
given font, functions to discover character size information,
and routines to allocate and to free fonts.
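.PP
The idea of drawing in terms of cache indices can be sketched in a few
lines of C.
What follows is an illustration only, not the library's data structures:
the names
.CW Slot ,
.CW Cache ,
.CW cacheinit ,
and
.CW cacheindex
are hypothetical, the cache is fixed at four slots, and
least-recently-used replacement is an assumption about how
`recently used' images are retained.

```c
enum { NCACHE = 4 };        /* tiny, for illustration only */

typedef struct {
    int rune;               /* Unicode value held in this slot, -1 if empty */
    unsigned long age;      /* clock value when the slot was last used */
} Slot;

typedef struct {
    Slot slot[NCACHE];
    unsigned long clock;    /* incremented on every lookup */
    long hits, misses;
} Cache;

void
cacheinit(Cache *c)
{
    int i;

    for (i = 0; i < NCACHE; i++) {
        c->slot[i].rune = -1;
        c->slot[i].age = 0;
    }
    c->clock = 0;
    c->hits = c->misses = 0;
}

/*
 * Return the cache index for value r. On a miss, r replaces the
 * least recently used slot; a real implementation would also copy
 * the character image from its subfont into the cache bitmap here.
 */
int
cacheindex(Cache *c, int r)
{
    int i, victim = 0;

    c->clock++;
    for (i = 0; i < NCACHE; i++) {
        if (c->slot[i].rune == r) {
            c->slot[i].age = c->clock;
            c->hits++;
            return i;
        }
        if (c->slot[i].age < c->slot[victim].age)
            victim = i;     /* oldest slot seen so far */
    }
    c->misses++;
    c->slot[victim].rune = r;
    c->slot[victim].age = c->clock;
    return victim;
}
```

A string-drawing routine built on this would translate each character
to an index with
.CW cacheindex
and send only the indices to the terminal.
The hit and miss counters are the kind of information an adaptive
sizing policy could consult.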
.PP
As needed, whole
subfonts are opened by the graphics library, read, and then downloaded
to the terminal.
They are held open by the library in an LRU-replacement list.
Even when the program closes a subfont, it is retained
in the terminal for later use.
When the application opens the subfont, it asks the terminal
if it already has a copy to avoid reading it from the file
server if possible.
This level of cache has the property that the bitmaps for, say,
all the Japanese characters are stored only once, in the terminal;
the applications read only size and width information from the terminal
and share the images.
.PP
The sizes of the character and subfont caches held by the
application are adaptive.
A simple algorithm monitors the cache miss rate to enlarge and
shrink the caches as required.
The size of the character cache is limited to 2048 images,
which in practice seems enough even for Japanese text.
For plain ASCII-like text it naturally stays around 128 images.
.PP
This mechanism sounds complicated but is implemented by only about
500 lines in the library and considerably less in each of the
terminal's graphics driver and
.CW 8½ .
It has the advantage that only characters that are
being used are loaded into memory.
It is also efficient: if the characters being drawn
are in the cache the extra overhead is negligible.
It works particularly well for alphabetic character sets,
but also adapts on demand for ideographic sets.
When a user first looks at Japanese text, it takes a few
seconds to read all the font data, but thereafter the
text is drawn almost as fast as regular text (the images
are larger, so they draw a little slower).
Also, because the bitmaps are remembered by the terminal,
if a second application then looks at Japanese text
it starts faster than the first.
.PP
We considered
building a `font server'
to cache character images and associated data
for the applications, the window system, and the terminal.
We rejected this design because, although it isolated
many of the problems of font management into a separate program,
it didn't simplify the applications.
Moreover, in a distributed system such as Plan 9 it is easy
to have too many special purpose servers.
Making the management of the fonts the concern of only
the essential components simplifies the system and makes
bootstrapping less intricate.
.SH
Input
.PP
A completely different problem is how to type Unicode characters
as input to the system.
We selected an unused key on our ASCII keyboards
to serve as a prefix for multi-keystroke
sequences that generate Unicode characters.
For example, the character
.CW ü
is generated by the prefix key
(typically
.CW ALT
or
.CW Compose )
followed by a double quote and a lower-case
.CW u .
When that character is read by the application, from the file
.CW /dev/cons ,
it is of course presented as its UTF encoding.
Such sequences generate characters from an arbitrary set that
includes all of Latin-1 plus a selection of mathematical
and technical characters.
An arbitrary Unicode character may be generated by typing the prefix,
an upper-case X, and four hexadecimal digits that identify
the Unicode value.
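.PP
The hexadecimal form is simple enough to sketch in C.
The code below is an illustration, not the keyboard driver's code:
.CW hexval ,
.CW hexrune ,
and
.CW runetoutf
are hypothetical names, and the encoder assumes the one-to-three-byte
form of UTF that suffices for 16-bit values.

```c
/* Value of a hexadecimal digit, or -1 if c is not one. */
static int
hexval(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Convert exactly four hex digits to a 16-bit value; -1 on a bad digit. */
long
hexrune(const char digits[4])
{
    long r = 0;
    int i, v;

    for (i = 0; i < 4; i++) {
        v = hexval(digits[i]);
        if (v < 0)
            return -1;
        r = (r << 4) | v;
    }
    return r;
}

/*
 * Encode a value in the range 0 to 0xFFFF as 1 to 3 bytes of UTF.
 * Returns the number of bytes written to out.
 */
int
runetoutf(unsigned char *out, long r)
{
    if (r < 0x80) {
        out[0] = (unsigned char)r;
        return 1;
    }
    if (r < 0x800) {
        out[0] = (unsigned char)(0xC0 | (r >> 6));
        out[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (r >> 12));
    out[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (r & 0x3F));
    return 3;
}
```

Typing the prefix, X, and the digits 00FC would thus deliver the two
bytes C3 BC, the same UTF encoding of ü that the compose sequence produces.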
.PP
These simple mechanisms are adequate for most of our day-to-day needs:
it's easy to remember to type `ALT 1 2' for ½\^ or `ALT accent letter'
for accented Latin letters.
For the occasional unusual character, the cut and paste features of
.CW 8½
serve well. A program called (perhaps misleadingly)
.CW unicode
takes as argument a hexadecimal value, and prints the UTF representation of that character,
which may then be picked up with the mouse and used as input.
.PP
These methods
are clearly unsatisfactory when working in a non-English language.
In the native country of such a language
the appropriate keyboard is likely to be at hand.
But it's also reasonable\(emespecially now that the system handles Unicode characters\(emto
work in a language foreign to the keyboard.
.PP
For alphabetic languages such as Greek or Russian, it is
straightforward to construct a program that does phonetic substitution,
so that, for example, typing a Latin `a' yields the Greek `α'.
Within Plan 9, such a program can be inserted transparently
between the real keyboard and a program such as the window system,
providing a manageable input device for such languages.
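.PP
The core of such a substitution program is a table lookup.
In the C sketch below, only the pairing of `a' with `α' comes from the
text above; the other table entries follow a common transliteration
and, like the names
.CW phon
and
.CW phonetic ,
are assumptions made for illustration.

```c
/* Hypothetical Latin-to-Greek table; values are Unicode code points. */
static const struct {
    char latin;
    int greek;
} phon[] = {
    { 'a', 0x03B1 },    /* alpha */
    { 'b', 0x03B2 },    /* beta */
    { 'g', 0x03B3 },    /* gamma */
    { 'd', 0x03B4 },    /* delta */
    { 'e', 0x03B5 },    /* epsilon */
};

/* Map a typed Latin letter to its Greek counterpart; others pass through. */
int
phonetic(int c)
{
    unsigned i;

    for (i = 0; i < sizeof phon / sizeof phon[0]; i++)
        if (phon[i].latin == c)
            return phon[i].greek;
    return c;
}
```

A filter placed between the keyboard and the window system would apply
.CW phonetic
to each character read and write the result in its UTF encoding.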
.PP
For ideographic languages such as Chinese or Japanese the problem is harder.
Native users of such languages have adopted methods for dealing with
Latin keyboards that involve a hybrid technique based on phonetics
to generate a list of possible symbols followed by menu selection to
choose the desired one.
Such methods can be
effective, but their design must be rooted in information about
the language unknown to non-native speakers.
.CW Cxterm , (
a Chinese terminal emulator built by and for
Chinese programmers,
employs such a technique
[Pong and Zhang].)
Although the technical problem of implementing such a device
is easy in Plan 9\(emit is just an elaboration of the technique for
alphabetic languages\(emour lack of familiarity with such languages
has restrained our enthusiasm for building one.
.PP
The input problem is technically the least interesting but perhaps
emotionally the most important of the problems of converting a system
to an international character set.
Beyond that remain the deeper problems of internationalization
such as multilingual error messages and command names,
problems we are not qualified to solve.
With the ability to treat text of most languages on an equal
footing, though, we can begin down that path.
Perhaps people in non-English speaking countries will
consider adopting Plan 9, solving the input problem locally\(emperhaps
just by plugging in their local terminals\(emand begin to use
a system with at least the capacity to be international.
.SH
Acknowledgements
.PP
Dennis Ritchie provided consultation and encouragement.
Bob Flandrena converted most of the standard tools to UTF.
Brian Kernighan suffered cheerfully with several
inadequate implementations and converted
.CW troff
to UTF.
Rich Drechsler converted his PostScript driver to UTF.
John Hobby built the PostScript ☺.
We thank them all.
.SH
References
.LP
[ANSIC] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[ISO10646]
ISO/IEC DIS 10646-1:1993
\f2Information technology \-
Universal Multiple-Octet Coded Character Set (UCS) \(em
Part 1: Architecture and Basic Multilingual Plane\fP.
.LP
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990.
.LP
[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
Conf. Proc., Nashville, 1991, reprinted in this volume.
.LP
[Pike92] R. Pike, ``How to Use the Plan 9 C Compiler'', this volume.
.LP
[Pong and Zhang] Man-Chi Pong and Yongguang Zhang, ``cxterm:
A Chinese Terminal Emulator for the X Window System'',
.I
Software\(emPractice and Experience,
.R
Vol 22(10), 809-826, October 1992.
.LP
[Unicode]
\f2The Unicode Standard,
Worldwide Character Encoding,
Version 1.0, Volume 1\f1,
The Unicode Consortium,
Addison Wesley,
New York,
1991.