2011-09-24 15:06:45 +00:00
.TH UHTML 1
.SH NAME
uhtml \- convert foreign character set HTML file to Unicode
.SH SYNOPSIS
.B uhtml
[
.B -p
] [
.B -c
.I charset
] [
.I file
]
.SH DESCRIPTION
HTML comes in various character-set encodings
and has special forms to encode characters.
To make HTML easier to process, uhtml
normalizes it to a Unicode-only form.
.LP
Uhtml detects the character set of the HTML input
.I file
and calls
.IR tcs (1)
to convert it to UTF, replacing HTML-entity forms
by their Unicode character representations except for
.BR lt ,
.BR gt ,
.BR amp ,
.BR quot ,
and
.BR apos .
The converted HTML is written to
standard output.
If no
.I file
is given, the input is read from standard input.
If the
.B -p
option is given, the detected character set is printed and
the program exits without conversion.
If character-set detection fails, the default (UTF)
is assumed.
A different default
.I charset
can be given with the
.B -c
option.
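.SH EXAMPLES
The following command lines are illustrative sketches.
The file names are hypothetical, and
.B 8859-1
is assumed to be a character set name that
.IR tcs (1)
accepts.
.LP
Print the detected character set of a file without converting it:
.EX
# page.html is a hypothetical input file
uhtml -p page.html
.EE
.LP
Convert a file to UTF, assuming Latin-1 if detection fails:
.EX
# 8859-1 is an assumed charset name; page.utf is a hypothetical output file
uhtml -c 8859-1 page.html >page.utf
.EE
.LP
Convert HTML read from standard input:
.EX
uhtml <page.html >page.utf
.EE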
.SH SOURCE
.B /sys/src/cmd/uhtml.c
.SH SEE ALSO
.IR tcs (1)