1420 lines
29 KiB
Text
1420 lines
29 KiB
Text
.TH HTML 2
|
|
.SH NAME
|
|
parsehtml,
|
|
printitems,
|
|
validitems,
|
|
freeitems,
|
|
freedocinfo,
|
|
dimenkind,
|
|
dimenspec,
|
|
targetid,
|
|
targetname,
|
|
fromStr,
|
|
toStr
|
|
\- HTML parser
|
|
.SH SYNOPSIS
|
|
.nf
|
|
.PP
|
|
.ft L
|
|
#include <u.h>
|
|
#include <libc.h>
|
|
#include <html.h>
|
|
.ft P
|
|
.PP
|
|
.ta \w'\fLToken* 'u
|
|
.B
|
|
Item* parsehtml(uchar* data, int datalen, Rune* src, int mtype,
|
|
.B
|
|
int chset, Docinfo** pdi)
|
|
.PP
|
|
.B
|
|
void printitems(Item* items, char* msg)
|
|
.PP
|
|
.B
|
|
int validitems(Item* items)
|
|
.PP
|
|
.B
|
|
void freeitems(Item* items)
|
|
.PP
|
|
.B
|
|
void freedocinfo(Docinfo* d)
|
|
.PP
|
|
.B
|
|
int dimenkind(Dimen d)
|
|
.PP
|
|
.B
|
|
int dimenspec(Dimen d)
|
|
.PP
|
|
.B
|
|
int targetid(Rune* s)
|
|
.PP
|
|
.B
|
|
Rune* targetname(int targid)
|
|
.PP
|
|
.B
|
|
uchar* fromStr(Rune* buf, int n, int chset)
|
|
.PP
|
|
.B
|
|
Rune* toStr(uchar* buf, int n, int chset)
|
|
.SH DESCRIPTION
|
|
.PP
|
|
This library implements a parser for HTML 4.0 documents.
|
|
The parsed HTML is converted into an intermediate representation that
|
|
describes how the formatted HTML should be laid out.
|
|
.PP
|
|
.I Parsehtml
|
|
parses an entire HTML document contained in the buffer
|
|
.I data
|
|
and having length
|
|
.IR datalen .
|
|
The URL of the document should be passed in as
|
|
.IR src .
|
|
.I Mtype
|
|
is the media type of the document, which should be either
|
|
.B TextHtml
|
|
or
|
|
.BR TextPlain .
|
|
The character set of the document is described in
|
|
.IR chset ,
|
|
which can be one of
|
|
.BR US_Ascii ,
|
|
.BR ISO_8859_1 ,
|
|
.B UTF_8
|
|
or
|
|
.BR Unicode .
|
|
The return value is a linked list of
|
|
.B Item
|
|
structures, described in detail below.
|
|
As a side effect,
|
|
.BI * pdi
|
|
is set to point to a newly created
|
|
.B Docinfo
|
|
structure, containing information pertaining to the entire document.
|
|
.PP
|
|
The library expects two allocation routines to be provided by the
|
|
caller,
|
|
.B emalloc
|
|
and
|
|
.BR erealloc .
|
|
These routines are analogous to the standard malloc and realloc routines,
|
|
except that they should not return if the memory allocation fails.
|
|
In addition,
|
|
.B emalloc
|
|
is required to zero the memory.
|
|
.PP
|
|
For debugging purposes,
|
|
.I printitems
|
|
may be called to display the contents of an item list; individual items may
|
|
be printed using the
|
|
.B %I
|
|
print verb, installed on the first call to
|
|
.IR parsehtml .
|
|
.I validitems
|
|
traverses the item list, checking that all of the pointers are valid.
|
|
It returns
|
|
.B 1
|
|
is everything is ok, and
|
|
.B 0
|
|
if an error was found.
|
|
Normally, one would not call these routines directly.
|
|
Instead, one sets the global variable
|
|
.I dbgbuild
|
|
and the library calls them automatically.
|
|
One can also set
|
|
.IR warn ,
|
|
to cause the library to print a warning whenever it finds a problem with the
|
|
input document, and
|
|
.IR dbglex ,
|
|
to print debugging information in the lexer.
|
|
.PP
|
|
When an item list is finished with, it should be freed with
|
|
.IR freeitems .
|
|
Then,
|
|
.I freedocinfo
|
|
should be called on the pointer returned in
|
|
.BI * pdi\f1.
|
|
.PP
|
|
.I Dimenkind
|
|
and
|
|
.I dimenspec
|
|
are provided to interpret the
|
|
.B Dimen
|
|
type, as described in the section
|
|
.IR "Dimension Specifications" .
|
|
.PP
|
|
Frame target names are mapped to integer ids via a global, permanent mapping.
|
|
To find the value for a given name, call
|
|
.IR targetid ,
|
|
which allocates a new id if the name hasn't been seen before.
|
|
The name of a given, known id may be retrieved using
|
|
.IR targetname .
|
|
The library predefines
|
|
.BR FTtop ,
|
|
.BR FTself ,
|
|
.B FTparent
|
|
and
|
|
.BR FTblank .
|
|
.PP
|
|
The library handles all text as Unicode strings (type
|
|
.BR Rune* ).
|
|
Character set conversion is provided by
|
|
.I fromStr
|
|
and
|
|
.IR toStr .
|
|
.I FromStr
|
|
takes
|
|
.I n
|
|
Unicode characters from
|
|
.I buf
|
|
and converts them to the character set described by
|
|
.IR chset .
|
|
.I ToStr
|
|
takes
|
|
.I n
|
|
bytes from
|
|
.IR buf ,
|
|
interpretted as belonging to character set
|
|
.IR chset ,
|
|
and converts them to a Unicode string.
|
|
Both routines null-terminate the result, and use
|
|
.B emalloc
|
|
to allocate space for it.
|
|
.SS Items
|
|
The return value of
|
|
.I parsehtml
|
|
is a linked list of variant structures,
|
|
with the generic portion described by the following definition:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Genattr* 'u
|
|
typedef struct Item Item;
|
|
struct Item
|
|
{
|
|
Item* next;
|
|
int width;
|
|
int height;
|
|
int ascent;
|
|
int anchorid;
|
|
int state;
|
|
Genattr* genattr;
|
|
int tag;
|
|
};
|
|
.EE
|
|
.PP
|
|
The field
|
|
.B next
|
|
points to the successor in the linked list of items, while
|
|
.BR width ,
|
|
.BR height ,
|
|
and
|
|
.B ascent
|
|
are intended for use by the caller as part of the layout process.
|
|
.BR Anchorid ,
|
|
if non-zero, gives the integer id assigned by the parser to the anchor that
|
|
this item is in (see section
|
|
.IR Anchors ).
|
|
.B State
|
|
is a collection of flags and values described as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'IFindentshift = 'u
|
|
enum
|
|
{
|
|
IFbrk = 0x80000000,
|
|
IFbrksp = 0x40000000,
|
|
IFnobrk = 0x20000000,
|
|
IFcleft = 0x10000000,
|
|
IFcright = 0x08000000,
|
|
IFwrap = 0x04000000,
|
|
IFhang = 0x02000000,
|
|
IFrjust = 0x01000000,
|
|
IFcjust = 0x00800000,
|
|
IFsmap = 0x00400000,
|
|
IFindentshift = 8,
|
|
IFindentmask = (255<<IFindentshift),
|
|
IFhangmask = 255
|
|
};
|
|
.EE
|
|
.PP
|
|
.B IFbrk
|
|
is set if a break is to be forced before placing this item.
|
|
.B IFbrksp
|
|
is set if a 1 line space should be added to the break (in which case
|
|
.B IFbrk
|
|
is also set).
|
|
.B IFnobrk
|
|
is set if a break is not permitted before the item.
|
|
.B IFcleft
|
|
is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
|
|
before this item is placed, and
|
|
.B IFcright
|
|
is set for right floats.
|
|
In both cases, IFbrk is also set.
|
|
.B IFwrap
|
|
is set if the line containing this item is allowed to wrap.
|
|
.B IFhang
|
|
is set if this item hangs into the left indent.
|
|
.B IFrjust
|
|
is set if the line containing this item should be right justified,
|
|
and
|
|
.B IFcjust
|
|
is set for center justified lines.
|
|
.B IFsmap
|
|
is used to indicate that an image is a server-side map.
|
|
The low 8 bits, represented by
|
|
.BR IFhangmask ,
|
|
indicate the current hang into left indent, in tenths of a tabstop.
|
|
The next 8 bits, represented by
|
|
.B IFindentmask
|
|
and
|
|
.BR IFindentshift ,
|
|
indicate the current indent in tab stops.
|
|
.PP
|
|
The field
|
|
.B genattr
|
|
is an optional pointer to an auxiliary structure, described in the section
|
|
.IR "Generic Attributes" .
|
|
.PP
|
|
Finally,
|
|
.B tag
|
|
describes which variant type this item has.
|
|
It can have one of the values
|
|
.BR Itexttag ,
|
|
.BR Iruletag ,
|
|
.BR Iimagetag ,
|
|
.BR Iformfieldtag ,
|
|
.BR Itabletag ,
|
|
.B Ifloattag
|
|
or
|
|
.BR Ispacertag .
|
|
For each of these values, there is an additional structure defined, which
|
|
includes Item as an unnamed initial substructure, and then defines additional
|
|
fields.
|
|
.PP
|
|
Items of type
|
|
.B Itexttag
|
|
represent a piece of text, using the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Rune* 'u
|
|
struct Itext
|
|
{
|
|
Item;
|
|
Rune* s;
|
|
int fnt;
|
|
int fg;
|
|
uchar voff;
|
|
uchar ul;
|
|
};
|
|
.EE
|
|
.PP
|
|
Here
|
|
.B s
|
|
is a null-terminated Unicode string of the actual characters making up this text item,
|
|
.B fnt
|
|
is the font number (described in the section
|
|
.IR "Font Numbers" ),
|
|
and
|
|
.B fg
|
|
is the RGB encoded color for the text.
|
|
.B Voff
|
|
measures the vertical offset from the baseline; subtract
|
|
.B Voffbias
|
|
to get the actual value (negative values represent a displacement down the page).
|
|
The field
|
|
.B ul
|
|
is the underline style:
|
|
.B ULnone
|
|
if no underline,
|
|
.B ULunder
|
|
for conventional underline, and
|
|
.B ULmid
|
|
for strike-through.
|
|
.PP
|
|
Items of type
|
|
.B Iruletag
|
|
represent a horizontal rule, as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Dimen 'u
|
|
struct Irule
|
|
{
|
|
Item;
|
|
uchar align;
|
|
uchar noshade;
|
|
int size;
|
|
Dimen wspec;
|
|
};
|
|
.EE
|
|
.PP
|
|
Here
|
|
.B align
|
|
is the alignment specification (described in the corresponding section),
|
|
.B noshade
|
|
is set if the rule should not be shaded,
|
|
.B size
|
|
is the height of the rule (as set by the size attribute),
|
|
and
|
|
.B wspec
|
|
is the desired width (see section
|
|
.IR "Dimension Specifications" ).
|
|
.PP
|
|
Items of type
|
|
.B Iimagetag
|
|
describe embedded images, for which the following structure is defined:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Iimage* 'u
|
|
struct Iimage
|
|
{
|
|
Item;
|
|
Rune* imsrc;
|
|
int imwidth;
|
|
int imheight;
|
|
Rune* altrep;
|
|
Map* map;
|
|
int ctlid;
|
|
uchar align;
|
|
uchar hspace;
|
|
uchar vspace;
|
|
uchar border;
|
|
Iimage* nextimage;
|
|
};
|
|
.EE
|
|
.PP
|
|
Here
|
|
.B imsrc
|
|
is the URL of the image source,
|
|
.B imwidth
|
|
and
|
|
.BR imheight ,
|
|
if non-zero, contain the specified width and height for the image,
|
|
and
|
|
.B altrep
|
|
is the text to use as an alternative to the image, if the image is not displayed.
|
|
.BR Map ,
|
|
if set, points to a structure describing an associated client-side image map.
|
|
.B Ctlid
|
|
is reserved for use by the application, for handling animated images.
|
|
.B Align
|
|
encodes the alignment specification of the image.
|
|
.B Hspace
|
|
contains the number of pixels to pad the image with on either side, and
|
|
.B Vspace
|
|
the padding above and below.
|
|
.B Border
|
|
is the width of the border to draw around the image.
|
|
.B Nextimage
|
|
points to the next image in the document (the head of this list is
|
|
.BR Docinfo.images ).
|
|
.PP
|
|
For items of type
|
|
.BR Iformfieldtag ,
|
|
the following structure is defined:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Formfield* 'u
|
|
struct Iformfield
|
|
{
|
|
Item;
|
|
Formfield* formfield;
|
|
};
|
|
.EE
|
|
.PP
|
|
This adds a single field,
|
|
.BR formfield ,
|
|
which points to a structure describing a field in a form, described in section
|
|
.IR Forms .
|
|
.PP
|
|
For items of type
|
|
.BR Itabletag ,
|
|
the following structure is defined:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Table* 'u
|
|
struct Itable
|
|
{
|
|
Item;
|
|
Table* table;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Table
|
|
points to a structure describing the table, described in the section
|
|
.IR Tables .
|
|
.PP
|
|
For items of type
|
|
.BR Ifloattag ,
|
|
the following structure is defined:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Ifloat* 'u
|
|
struct Ifloat
|
|
{
|
|
Item;
|
|
Item* item;
|
|
int x;
|
|
int y;
|
|
uchar side;
|
|
uchar infloats;
|
|
Ifloat* nextfloat;
|
|
};
|
|
.EE
|
|
.PP
|
|
The
|
|
.B item
|
|
points to a single item (either a table or an image) that floats (the text of the
|
|
document flows around it), and
|
|
.B side
|
|
indicates the margin that this float sticks to; it is either
|
|
.B ALleft
|
|
or
|
|
.BR ALright .
|
|
.B X
|
|
and
|
|
.B y
|
|
are reserved for use by the caller; these are typically used for the coordinates
|
|
of the top of the float.
|
|
.B Infloats
|
|
is used by the caller to keep track of whether it has placed the float.
|
|
.B Nextfloat
|
|
is used by the caller to link together all of the floats that it has placed.
|
|
.PP
|
|
For items of type
|
|
.BR Ispacertag ,
|
|
the following structure is defined:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Item; 'u
|
|
struct Ispacer
|
|
{
|
|
Item;
|
|
int spkind;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Spkind
|
|
encodes the kind of spacer, and may be one of
|
|
.B ISPnull
|
|
(zero height and width),
|
|
.B ISPvline
|
|
(takes on height and ascent of the current font),
|
|
.B ISPhspace
|
|
(has the width of a space in the current font) and
|
|
.B ISPgeneral
|
|
(for all other purposes, such as between markers and lists).
|
|
.SS Generic Attributes
|
|
.PP
|
|
The genattr field of an item, if non-nil, points to a structure that holds
|
|
the values of attributes not specific to any particular
|
|
item type, as they occur on a wide variety of underlying HTML tags.
|
|
The structure is as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'SEvent* 'u
|
|
typedef struct Genattr Genattr;
|
|
struct Genattr
|
|
{
|
|
Rune* id;
|
|
Rune* class;
|
|
Rune* style;
|
|
Rune* title;
|
|
SEvent* events;
|
|
};
|
|
.EE
|
|
.PP
|
|
Fields
|
|
.BR id ,
|
|
.BR class ,
|
|
.B style
|
|
and
|
|
.BR title ,
|
|
when non-nil, contain values of correspondingly named attributes of the HTML tag
|
|
associated with this item.
|
|
.B Events
|
|
is a linked list of events (with corresponding scripted actions) associated with the item:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'SEvent* 'u
|
|
typedef struct SEvent SEvent;
|
|
struct SEvent
|
|
{
|
|
SEvent* next;
|
|
int type;
|
|
Rune* script;
|
|
};
|
|
.EE
|
|
.PP
|
|
Here,
|
|
.B next
|
|
points to the next event in the list,
|
|
.B type
|
|
is one of
|
|
.BR SEonblur ,
|
|
.BR SEonchange ,
|
|
.BR SEonclick ,
|
|
.BR SEondblclick ,
|
|
.BR SEonfocus ,
|
|
.BR SEonkeypress ,
|
|
.BR SEonkeyup ,
|
|
.BR SEonload ,
|
|
.BR SEonmousedown ,
|
|
.BR SEonmousemove ,
|
|
.BR SEonmouseout ,
|
|
.BR SEonmouseover ,
|
|
.BR SEonmouseup ,
|
|
.BR SEonreset ,
|
|
.BR SEonselect ,
|
|
.B SEonsubmit
|
|
or
|
|
.BR SEonunload ,
|
|
and
|
|
.B script
|
|
is the text of the associated script.
|
|
.SS Dimension Specifications
|
|
.PP
|
|
Some structures include a dimension specification, used where
|
|
a number can be followed by a
|
|
.B %
|
|
or a
|
|
.B *
|
|
to indicate
|
|
percentage of total or relative weight.
|
|
This is encoded using the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'int 'u
|
|
typedef struct Dimen Dimen;
|
|
struct Dimen
|
|
{
|
|
int kindspec;
|
|
};
|
|
.EE
|
|
.PP
|
|
Separate kind and spec values are extracted using
|
|
.I dimenkind
|
|
and
|
|
.IR dimenspec .
|
|
.I Dimenkind
|
|
returns one of
|
|
.BR Dnone ,
|
|
.BR Dpixels ,
|
|
.B Dpercent
|
|
or
|
|
.BR Drelative .
|
|
.B Dnone
|
|
means that no dimension was specified.
|
|
In all other cases,
|
|
.I dimenspec
|
|
should be called to find the absolute number of pixels, the percentage of total,
|
|
or the relative weight.
|
|
.SS Background Specifications
|
|
.PP
|
|
It is possible to set the background of the entire document, and also
|
|
for some parts of the document (such as tables).
|
|
This is encoded as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Rune* 'u
|
|
typedef struct Background Background;
|
|
struct Background
|
|
{
|
|
Rune* image;
|
|
int color;
|
|
};
|
|
.EE
|
|
.PP
|
|
.BR Image ,
|
|
if non-nil, is the URL of an image to use as the background.
|
|
If this is nil,
|
|
.B color
|
|
is used instead, as the RGB value for a solid fill color.
|
|
.SS Alignment Specifications
|
|
.PP
|
|
Certain items have alignment specifiers taken from the following
|
|
enumerated type:
|
|
.PP
|
|
.EX
|
|
.ta 6n
|
|
enum
|
|
{
|
|
ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
|
|
ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
|
|
};
|
|
.EE
|
|
.PP
|
|
These values correspond to the various alignment types named in the HTML 4.0
|
|
standard.
|
|
If an item has an alignment of
|
|
.B ALleft
|
|
or
|
|
.BR ALright ,
|
|
the library automatically encapsulates it inside a float item.
|
|
.PP
|
|
Tables, and the various rows, columns and cells within them, have a more
|
|
complex alignment specification, composed of separate vertical and
|
|
horizontal alignments:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'uchar 'u
|
|
typedef struct Align Align;
|
|
struct Align
|
|
{
|
|
uchar halign;
|
|
uchar valign;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Halign
|
|
can be one of
|
|
.BR ALnone ,
|
|
.BR ALleft ,
|
|
.BR ALcenter ,
|
|
.BR ALright ,
|
|
.B ALjustify
|
|
or
|
|
.BR ALchar .
|
|
.B Valign
|
|
can be one of
|
|
.BR ALnone ,
|
|
.BR ALmiddle ,
|
|
.BR ALbottom ,
|
|
.BR ALtop
|
|
or
|
|
.BR ALbaseline .
|
|
.SS Font Numbers
|
|
.PP
|
|
Text items have an associated font number (the
|
|
.B fnt
|
|
field), which is encoded as
|
|
.BR style*NumSize+size .
|
|
Here,
|
|
.B style
|
|
is one of
|
|
.BR FntR ,
|
|
.BR FntI ,
|
|
.B FntB
|
|
or
|
|
.BR FntT ,
|
|
for roman, italic, bold and typewriter font styles, respectively, and size is
|
|
.BR Tiny ,
|
|
.BR Small ,
|
|
.BR Normal ,
|
|
.B Large
|
|
or
|
|
.BR Verylarge .
|
|
The total number of possible font numbers is
|
|
.BR NumFnt ,
|
|
and the default font number is
|
|
.B DefFnt
|
|
(which is roman style, normal size).
|
|
.SS Document Info
|
|
.PP
|
|
Global information about an HTML page is stored in the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'DestAnchor* 'u
|
|
typedef struct Docinfo Docinfo;
|
|
struct Docinfo
|
|
{
|
|
// stuff from HTTP headers, doc head, and body tag
|
|
Rune* src;
|
|
Rune* base;
|
|
Rune* doctitle;
|
|
Background background;
|
|
Iimage* backgrounditem;
|
|
int text;
|
|
int link;
|
|
int vlink;
|
|
int alink;
|
|
int target;
|
|
int chset;
|
|
int mediatype;
|
|
int scripttype;
|
|
int hasscripts;
|
|
Rune* refresh;
|
|
Kidinfo* kidinfo;
|
|
int frameid;
|
|
|
|
// info needed to respond to user actions
|
|
Anchor* anchors;
|
|
DestAnchor* dests;
|
|
Form* forms;
|
|
Table* tables;
|
|
Map* maps;
|
|
Iimage* images;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Src
|
|
gives the URL of the original source of the document,
|
|
and
|
|
.B base
|
|
is the base URL.
|
|
.B Doctitle
|
|
is the document's title, as set by a
|
|
.B <title>
|
|
element.
|
|
.B Background
|
|
is as described in the section
|
|
.IR "Background Specifications" ,
|
|
and
|
|
.B backgrounditem
|
|
is set to be an image item for the document's background image (if given as a URL),
|
|
or else nil.
|
|
.B Text
|
|
gives the default foregound text color of the document,
|
|
.B link
|
|
the unvisited hyperlink color,
|
|
.B vlink
|
|
the visited hyperlink color, and
|
|
.B alink
|
|
the color for highlighting hyperlinks (all in 24-bit RGB format).
|
|
.B Target
|
|
is the default target frame id.
|
|
.B Chset
|
|
and
|
|
.B mediatype
|
|
are as for the
|
|
.I chset
|
|
and
|
|
.I mtype
|
|
parameters to
|
|
.IR parsehtml .
|
|
.B Scripttype
|
|
is the type of any scripts contained in the document, and is always
|
|
.BR TextJavascript .
|
|
.B Hasscripts
|
|
is set if the document contains any scripts.
|
|
Scripting is currently unsupported.
|
|
.B Refresh
|
|
is the contents of a
|
|
.B "<meta http-equiv=Refresh ...>"
|
|
tag, if any.
|
|
.B Kidinfo
|
|
is set if this document is a frameset (see section
|
|
.IR Frames ).
|
|
.B Frameid
|
|
is this document's frame id.
|
|
.PP
|
|
.B Anchors
|
|
is a list of hyperlinks contained in the document,
|
|
and
|
|
.B dests
|
|
is a list of hyperlink destinations within the page (see the following section for details).
|
|
.BR Forms ,
|
|
.B tables
|
|
and
|
|
.B maps
|
|
are lists of the various forms, tables and client-side maps contained
|
|
in the document, as described in subsequent sections.
|
|
.B Images
|
|
is a list of all the image items in the document.
|
|
.SS Anchors
|
|
.PP
|
|
The library builds two lists for all of the
|
|
.B <a>
|
|
elements (anchors) in a document.
|
|
Each anchor is assigned a unique anchor id within the document.
|
|
For anchors which are hyperlinks (the
|
|
.B href
|
|
attribute was supplied), the following structure is defined:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Anchor* 'u
|
|
typedef struct Anchor Anchor;
|
|
struct Anchor
|
|
{
|
|
Anchor* next;
|
|
int index;
|
|
Rune* name;
|
|
Rune* href;
|
|
int target;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
points to the next anchor in the list (the head of this list is
|
|
.BR Docinfo.anchors ).
|
|
.B Index
|
|
is the anchor id; each item within this hyperlink is tagged with this value
|
|
in its
|
|
.B anchorid
|
|
field.
|
|
.B Name
|
|
and
|
|
.B href
|
|
are the values of the correspondingly named attributes of the anchor
|
|
(in particular, href is the URL to go to).
|
|
.B Target
|
|
is the value of the target attribute (if provided) converted to a frame id.
|
|
.PP
|
|
Destinations within the document (anchors with the name attribute set)
|
|
are held in the
|
|
.B Docinfo.dests
|
|
list, using the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'DestAnchor* 'u
|
|
typedef struct DestAnchor DestAnchor;
|
|
struct DestAnchor
|
|
{
|
|
DestAnchor* next;
|
|
int index;
|
|
Rune* name;
|
|
Item* item;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
is the next element of the list,
|
|
.B index
|
|
is the anchor id,
|
|
.B name
|
|
is the value of the name attribute, and
|
|
.B item
|
|
is points to the item within the parsed document that should be considered
|
|
to be the destination.
|
|
.SS Forms
|
|
.PP
|
|
Any forms within a document are kept in a list, headed by
|
|
.BR Docinfo.forms .
|
|
The elements of this list are as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Formfield* 'u
|
|
typedef struct Form Form;
|
|
struct Form
|
|
{
|
|
Form* next;
|
|
int formid;
|
|
Rune* name;
|
|
Rune* action;
|
|
int target;
|
|
int method;
|
|
int nfields;
|
|
Formfield* fields;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
points to the next form in the list.
|
|
.B Formid
|
|
is a serial number for the form within the document.
|
|
.B Name
|
|
is the value of the form's name or id attribute.
|
|
.B Action
|
|
is the value of any action attribute.
|
|
.B Target
|
|
is the value of the target attribute (if any) converted to a frame target id.
|
|
.B Method
|
|
is one of
|
|
.B HGet
|
|
or
|
|
.BR HPost .
|
|
.B Nfields
|
|
is the number of fields in the form, and
|
|
.B fields
|
|
is a linked list of the actual fields.
|
|
.PP
|
|
The individual fields in a form are described by the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Formfield* 'u
|
|
typedef struct Formfield Formfield;
|
|
struct Formfield
|
|
{
|
|
Formfield* next;
|
|
int ftype;
|
|
int fieldid;
|
|
Form* form;
|
|
Rune* name;
|
|
Rune* value;
|
|
int size;
|
|
int maxlength;
|
|
int rows;
|
|
int cols;
|
|
uchar flags;
|
|
Option* options;
|
|
Item* image;
|
|
int ctlid;
|
|
SEvent* events;
|
|
};
|
|
.EE
|
|
.PP
|
|
Here,
|
|
.B next
|
|
points to the next field in the list.
|
|
.B Ftype
|
|
is the type of the field, which can be one of
|
|
.BR Ftext ,
|
|
.BR Fpassword ,
|
|
.BR Fcheckbox ,
|
|
.BR Fradio ,
|
|
.BR Fsubmit ,
|
|
.BR Fhidden ,
|
|
.BR Fimage ,
|
|
.BR Freset ,
|
|
.BR Ffile ,
|
|
.BR Fbutton ,
|
|
.B Fselect
|
|
or
|
|
.BR Ftextarea .
|
|
.B Fieldid
|
|
is a serial number for the field within the form.
|
|
.B Form
|
|
points back to the form containing this field.
|
|
.BR Name ,
|
|
.BR value ,
|
|
.BR size ,
|
|
.BR maxlength ,
|
|
.B rows
|
|
and
|
|
.B cols
|
|
each contain the values of corresponding attributes of the field, if present.
|
|
.B Flags
|
|
contains per-field flags, of which
|
|
.B FFchecked
|
|
and
|
|
.B FFmultiple
|
|
are defined.
|
|
.B Image
|
|
is only used for fields of type
|
|
.BR Fimage ;
|
|
it points to an image item containing the image to be displayed.
|
|
.B Ctlid
|
|
is reserved for use by the caller, typically to store a unique id
|
|
of an associated control used to implement the field.
|
|
.B Events
|
|
is the same as the corresponding field of the generic attributes
|
|
associated with the item containing this field.
|
|
.B Options
|
|
is only used by fields of type
|
|
.BR Fselect ;
|
|
it consists of a list of possible options that may be selected for that
|
|
field, using the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Option* 'u
|
|
typedef struct Option Option;
|
|
struct Option
|
|
{
|
|
Option* next;
|
|
int selected;
|
|
Rune* value;
|
|
Rune* display;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
points to the next element of the list.
|
|
.B Selected
|
|
is set if this option is to be displayed initially.
|
|
.B Value
|
|
is the value to send when the form is submitted if this option is selected.
|
|
.B Display
|
|
is the string to display on the screen for this option.
|
|
.SS Tables
|
|
.PP
|
|
The library builds a list of all the tables in the document,
|
|
headed by
|
|
.BR Docinfo.tables .
|
|
Each element of this list has the following format:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Tablecell*** 'u
|
|
typedef struct Table Table;
|
|
struct Table
|
|
{
|
|
Table* next;
|
|
int tableid;
|
|
Tablerow* rows;
|
|
int nrow;
|
|
Tablecol* cols;
|
|
int ncol;
|
|
Tablecell* cells;
|
|
int ncell;
|
|
Tablecell*** grid;
|
|
Align align;
|
|
Dimen width;
|
|
int border;
|
|
int cellspacing;
|
|
int cellpadding;
|
|
Background background;
|
|
Item* caption;
|
|
uchar caption_place;
|
|
Lay* caption_lay;
|
|
int totw;
|
|
int toth;
|
|
int caph;
|
|
int availw;
|
|
Token* tabletok;
|
|
uchar flags;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
points to the next element in the list of tables.
|
|
.B Tableid
|
|
is a serial number for the table within the document.
|
|
.B Rows
|
|
is an array of row specifications (described below) and
|
|
.B nrow
|
|
is the number of elements in this array.
|
|
Similarly,
|
|
.B cols
|
|
is an array of column specifications, and
|
|
.B ncol
|
|
the size of this array.
|
|
.B Cells
|
|
is a list of all cells within the table (structure described below)
|
|
and
|
|
.B ncell
|
|
is the number of elements in this list.
|
|
Note that a cell may span multiple rows and/or columns, thus
|
|
.B ncell
|
|
may be smaller than
|
|
.BR nrow*ncol .
|
|
.B Grid
|
|
is a two-dimensional array of cells within the table; the cell
|
|
at row
|
|
.B i
|
|
and column
|
|
.B j
|
|
is
|
|
.BR Table.grid[i][j] .
|
|
A cell that spans multiple rows and/or columns will
|
|
be referenced by
|
|
.B grid
|
|
multiple times, however it will only occur once in
|
|
.BR cells .
|
|
.B Align
|
|
gives the alignment specification for the entire table,
|
|
and
|
|
.B width
|
|
gives the requested width as a dimension specification.
|
|
.BR Border ,
|
|
.B cellspacing
|
|
and
|
|
.B cellpadding
|
|
give the values of the corresponding attributes for the table,
|
|
and
|
|
.B background
|
|
gives the requested background for the table.
|
|
.B Caption
|
|
is a linked list of items to be displayed as the caption of the
|
|
table, either above or below depending on whether
|
|
.B caption_place
|
|
is
|
|
.B ALtop
|
|
or
|
|
.BR ALbottom .
|
|
Most of the remaining fields are reserved for use by the caller,
|
|
except
|
|
.BR tabletok ,
|
|
which is reserved for internal use.
|
|
The type
|
|
.B Lay
|
|
is not defined by the library; the caller can provide its
|
|
own definition.
|
|
.PP
|
|
The
|
|
.B Tablecol
|
|
structure is defined for use by the caller.
|
|
The library ensures that the correct number of these
|
|
is allocated, but leaves them blank.
|
|
The fields are as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Point 'u
|
|
typedef struct Tablecol Tablecol;
|
|
struct Tablecol
|
|
{
|
|
int width;
|
|
Align align;
|
|
Point pos;
|
|
};
|
|
.EE
|
|
.PP
|
|
The rows in the table are specified as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Background 'u
|
|
typedef struct Tablerow Tablerow;
|
|
struct Tablerow
|
|
{
|
|
Tablerow* next;
|
|
Tablecell* cells;
|
|
int height;
|
|
int ascent;
|
|
Align align;
|
|
Background background;
|
|
Point pos;
|
|
uchar flags;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
is only used during parsing; it should be ignored by the caller.
|
|
.B Cells
|
|
provides a list of all the cells in a row, linked through their
|
|
.B nextinrow
|
|
fields (see below).
|
|
.BR Height ,
|
|
.B ascent
|
|
and
|
|
.B pos
|
|
are reserved for use by the caller.
|
|
.B Align
|
|
is the alignment specification for the row, and
|
|
.B background
|
|
is the background to use, if specified.
|
|
.B Flags
|
|
is used by the parser; ignore this field.
|
|
.PP
|
|
The individual cells of the table are described as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Background 'u
|
|
typedef struct Tablecell Tablecell;
|
|
struct Tablecell
|
|
{
|
|
Tablecell* next;
|
|
Tablecell* nextinrow;
|
|
int cellid;
|
|
Item* content;
|
|
Lay* lay;
|
|
int rowspan;
|
|
int colspan;
|
|
Align align;
|
|
uchar flags;
|
|
Dimen wspec;
|
|
int hspec;
|
|
Background background;
|
|
int minw;
|
|
int maxw;
|
|
int ascent;
|
|
int row;
|
|
int col;
|
|
Point pos;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
is used to link together the list of all cells within a table
|
|
.RB ( Table.cells ),
|
|
whereas
|
|
.B nextinrow
|
|
is used to link together all the cells within a single row
|
|
.RB ( Tablerow.cells ).
|
|
.B Cellid
|
|
provides a serial number for the cell within the table.
|
|
.B Content
|
|
is a linked list of the items to be laid out within the cell.
|
|
.B Lay
|
|
is reserved for the user to describe how these items have
|
|
been laid out.
|
|
.B Rowspan
|
|
and
|
|
.B colspan
|
|
are the number of rows and columns spanned by this cell,
|
|
respectively.
|
|
.B Align
|
|
is the alignment specification for the cell.
|
|
.B Flags
|
|
is some combination of
|
|
.BR TFparsing ,
|
|
.B TFnowrap
|
|
and
|
|
.B TFisth
|
|
or'd together.
|
|
Here
|
|
.B TFparsing
|
|
is used internally by the parser, and should be ignored.
|
|
.B TFnowrap
|
|
means that the contents of the cell should not be
|
|
wrapped if they don't fit the available width,
|
|
rather, the table should be expanded if need be
|
|
(this is set when the nowrap attribute is supplied).
|
|
.B TFisth
|
|
means that the cell was created by the
|
|
.B <th>
|
|
element (rather than the
|
|
.B <td>
|
|
element),
|
|
indicating that it is a header cell rather than a data cell.
|
|
.B Wspec
|
|
provides a suggested width as a dimension specification,
|
|
and
|
|
.B hspec
|
|
provides a suggested height in pixels.
|
|
.B Background
|
|
gives a background specification for the individual cell.
|
|
.BR Minw ,
|
|
.BR maxw ,
|
|
.B ascent
|
|
and
|
|
.B pos
|
|
are reserved for use by the caller during layout.
|
|
.B Row
|
|
and
|
|
.B col
|
|
give the indices of the row and column of the top left-hand
|
|
corner of the cell within the table grid.
|
|
.SS Client-side Maps
|
|
.PP
|
|
The library builds a list of client-side maps, headed by
|
|
.BR Docinfo.maps ,
|
|
and having the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Rune* 'u
|
|
typedef struct Map Map;
|
|
struct Map
|
|
{
|
|
Map* next;
|
|
Rune* name;
|
|
Area* areas;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
points to the next element in the list,
|
|
.B name
|
|
is the name of the map (use to bind it to an image), and
|
|
.B areas
|
|
is a list of the areas within the image that comprise the map,
|
|
using the following structure:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Dimen* 'u
|
|
typedef struct Area Area;
|
|
struct Area
|
|
{
|
|
Area* next;
|
|
int shape;
|
|
Rune* href;
|
|
int target;
|
|
Dimen* coords;
|
|
int ncoords;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
points to the next element in the map's list of areas.
|
|
.B Shape
|
|
describes the shape of the area, and is one of
|
|
.BR SHrect ,
|
|
.B SHcircle
|
|
or
|
|
.BR SHpoly .
|
|
.B Href
|
|
is the URL associated with this area in its role as
|
|
a hypertext link, and
|
|
.B target
|
|
is the target frame it should be loaded in.
|
|
.B Coords
|
|
is an array of coordinates for the shape, and
|
|
.B ncoords
|
|
is the size of this array (number of elements).
|
|
.SS Frames
|
|
.PP
|
|
If the
|
|
.B Docinfo.kidinfo
|
|
field is set, the document is a frameset.
|
|
In this case, it is typical for
|
|
.I parsehtml
|
|
to return nil, as a document which is a frameset should have no actual
|
|
items that need to be laid out (such will appear only in subsidiary documents).
|
|
It is possible that items will be returned by a malformed document; the caller
|
|
should check for this and free any such items.
|
|
.PP
|
|
The
|
|
.B Kidinfo
|
|
structure itself reflects the fact that framesets can be nested within a document.
|
|
If is defined as follows:
|
|
.PP
|
|
.EX
|
|
.ta 6n +\w'Kidinfo* 'u
|
|
typedef struct Kidinfo Kidinfo;
|
|
struct Kidinfo
|
|
{
|
|
Kidinfo* next;
|
|
int isframeset;
|
|
|
|
// fields for "frame"
|
|
Rune* src;
|
|
Rune* name;
|
|
int marginw;
|
|
int marginh;
|
|
int framebd;
|
|
int flags;
|
|
|
|
// fields for "frameset"
|
|
Dimen* rows;
|
|
int nrows;
|
|
Dimen* cols;
|
|
int ncols;
|
|
Kidinfo* kidinfos;
|
|
Kidinfo* nextframeset;
|
|
};
|
|
.EE
|
|
.PP
|
|
.B Next
|
|
is only used if this structure is part of a containing frameset; it points to the next
|
|
element in the list of children of that frameset.
|
|
.B Isframeset
|
|
is set when this structure represents a frameset; if clear, it is an individual frame.
|
|
.PP
|
|
Some fields are used only for framesets.
|
|
.B Rows
|
|
is an array of dimension specifications for rows in the frameset, and
|
|
.B nrows
|
|
is the length of this array.
|
|
.B Cols
|
|
is the corresponding array for columns, of length
|
|
.BR ncols .
|
|
.B Kidinfos
|
|
points to a list of components contained within this frameset, each
|
|
of which may be a frameset or a frame.
|
|
.B Nextframeset
|
|
is only used during parsing, and should be ignored.
|
|
.PP
|
|
The remaining fields are used if the structure describes a frame, not a frameset.
|
|
.B Src
|
|
provides the URL for the document that should be initially loaded into this frame.
|
|
Note that this may be a relative URL, in which case it should be interpretted
|
|
using the containing document's URL as the base.
|
|
.B Name
|
|
gives the name of the frame, typically supplied via a name attribute in the HTML.
|
|
If no name was given, the library allocates one.
|
|
.BR Marginw ,
|
|
.B marginh
|
|
and
|
|
.B framebd
|
|
are the values of the marginwidth, marginheight and frameborder attributes, respectively.
|
|
.B Flags
|
|
can contain some combination of the following:
|
|
.B FRnoresize
|
|
(the frame had the noresize attribute set, and the user should not be allowed to resize it),
|
|
.B FRnoscroll
|
|
(the frame should not have any scroll bars),
|
|
.B FRhscroll
|
|
(the frame should have a horizontal scroll bar),
|
|
.B FRvscroll
|
|
(the frame should have a vertical scroll bar),
|
|
.B FRhscrollauto
|
|
(the frame should be automatically given a horizontal scroll bar if its contents
|
|
would not otherwise fit), and
|
|
.B FRvscrollauto
|
|
(the frame gets a vertical scrollbar only if required).
|
|
.SH SOURCE
|
|
.B /sys/src/libhtml
|
|
.SH SEE ALSO
|
|
.IR fmt (1)
|
|
.PP
|
|
W3C World Wide Web Consortium,
|
|
``HTML 4.01 Specification''.
|
|
.SH BUGS
|
|
The entire HTML document must be loaded into memory before
|
|
any of it can be parsed.
|