.EQ
delim $$
.EN
.TL
Scaling Upas
.AU
Erik Quanstrom
quanstro@coraid.com
.AB
The Plan 9 email system, Upas, uses traditional methods of delivery to
.UX
mail boxes while using a user-level file system, Upas/fs, to
translate mail boxes of various formats into a single, convenient format for access.
Unfortunately, it does not do so efficiently. Upas/fs
reads entire folders into core. When email is deleted from a mail box,
the entire mail box is rewritten. I describe how Upas/fs has been
adapted to use caching, indexing and a new mail box format (mdir) to
limit I/O, reduce core size and eliminate the need to rewrite mail
boxes.
.AE
.NH
Introduction
.LP
.DS I
Chained at his root two scion demons dwell
.br
– Erasmus Darwin, The Botanic Garden
.DE
.LP
At Coraid, email is the largest resource user in the system by orders
of magnitude. As of July 2007, rewriting mail boxes was using
300MB/day on the WORM and several users required more than 400MB of
core. As of July 2008, rewriting mail boxes was using 800MB/day on
the WORM and several users required more than 1.2GB of core to read
email. Clearly, these levels of growth are difficult to sustain, even
without growth of the company. We needed to limit the amount of disk
space used and, more urgently, reduce the core size of Upas/fs.
.LP
The techniques employed are simple. Mail is now stored in a directory
with one message per file. This eliminates the need to rewrite mail
boxes. Upas/fs now maintains an index which allows it to present
complete message summaries without reading indexed messages.
Combining the two techniques allows Upas/fs to read only new or
referenced messages. Finally, caching limits both the total number of
in-core messages and their total size.
.NH
Mdir Format
.LP
Beyond the urgent operational requirement of reducing memory and disk
footprint, our users expect a solution that handles folders of up to
ten thousand messages, opens folders quickly, lists the contents of
folders quickly and supports the current set of mail readers.
.LP
There are several potential styles of mail boxes. The maildir[1] format
has some attractive properties. Mail can be delivered to or deleted
from a mail box without locking. New mail or deleted mail may be
detected with a directory scan. When used with WORM storage, the
amount of storage required is no more than the size of new mail
received. By contrast, mbox format can require that a new copy of the
inbox be stored every day. Even with storage that coalesces duplicate
blocks, such as Venti, deleting a message will generally require new
storage since messages are not disk-block aligned. However, maildir
does not reduce the cost of the common task of producing a summary
listing of mail, such as that generated by acme Mail.
.LP
The mails[2] format proposes a directory per mail. A copy of
the mail as delivered is stored and each MIME part is decoded
in such a way that a mail reader could display the file directly.
Command line tools in the style of MH[3] are used to display and
process mail. Upas/fs is not necessary for reading local mail.
Mails has the potential to reduce memory footprint below that
offered by mdirs for native email reading. However, all of the
largest mail boxes at our site are served exclusively through IMAP.
The preformatting by mails would be unnecessary for such accounts.
.LP
Other mail servers such as Lotus Notes[4] store email in a custom
database format which allows for fielded and full-text searching
of mail folders. Such a format provides very quick mail
listings and good search capabilities. However, such a solution would
not lend itself well to a tool-based environment, nor would it be simple.
.LP
Maildir format seemed the best basic format but its particulars are
tied to the
.UX
environment; mdir is a descendant. An mdir folder
is a directory with the name of the folder. Each message in the mdir
folder is stored in a file named
.I "utime.seq" .
.I Utime
is defined as the decimal
.UX
seconds when the message was added to
the folder. For the inbox, this time will correspond to the
.UX
“From ” line.
.I Seq
is a two-digit sequence number starting with
.CW "00" .
The lowest available sequence number is used to store the message.
Thus, the first email possible would be named
.CW "0.00" .
To prevent accidents, message files are stored with
the append-only and exclusive-access bits turned on.
The message is stored in the same format it would have in an mbox;
each message file is a valid mbox folder containing a single message.
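.LP
The naming rule above can be sketched in a few lines of Python.
This is only an illustration, not the Upas code: the function name
.CW next_mdir_name
and the choice to return
.CW None
once all one hundred sequence numbers for a second are taken are assumptions made here.
.DS I
```python
def next_mdir_name(existing, utime, max_seq=100):
    """Return the lowest free 'utime.seq' name, or None if none is free.

    existing: set of file names already present in the mdir folder.
    utime:    decimal seconds at which the message is added.
    max_seq:  two decimal digits allow sequence numbers 00 through 99.
    """
    for seq in range(max_seq):
        name = f"{utime}.{seq:02d}"
        if name not in existing:
            return name
    # Assumption: the paper does not say what happens when a second fills up.
    return None
```
.DE
.LP
A real delivery agent would also have to create the file atomically so
that two deliveries in the same second cannot choose the same name; the
sketch leaves that out.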
.NH
Indexing
.LP
When upas/fs finds an unindexed message, it is added to the index.
The index is a file named
.I "foldername" .idx
and consists of a signature and one line per MIME part. Each line
contains the SHA1 checksum of the message (or a place holder for
subparts), one field per entry in the
.I "messageid/info"
file, flags and the number of subparts. The flags are currently a
superset of the standard IMAP flags. They provide functionality
similar to maildir's modified file names. Thus the `S'
(answered) flag remains set between invocations of mail readers.
Other mutable information about a message may be stored in a similar
way.
.LP
Since the
.I info
file is read by all the mail readers to produce mail listings,
mail boxes may be listed without opening any mail files when no new
mail has arrived. Similarly, opening a mail box requires only reading
the index and checking for new mail. Index files are typically between
0.5% and 5% the size of the full mail box. Each time the index is
generated, it is fully rewritten.
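.LP
The following Python sketch shows how an index lets a directory scan
decide which message files actually need to be read.
The line layout used here (checksum, file name, flags, subpart count,
separated by spaces) is invented for illustration and is not the real
index format, which carries one field per
.I info
entry; the function names are likewise hypothetical.
.DS I
```python
import hashlib

def index_entry(name, raw, flags="", nsubparts=0):
    """Build one hypothetical index line for a top-level message."""
    digest = hashlib.sha1(raw).hexdigest()
    return f"{digest} {name} {flags or '-'} {nsubparts}"

def plan_scan(dir_names, index_lines):
    """Compare a directory listing with the index.

    Returns (to_read, kept): the message files that are not yet indexed
    and so must be read, and the index lines whose files still exist.
    """
    indexed = {line.split()[1] for line in index_lines}
    present = set(dir_names)
    to_read = sorted(present - indexed)
    kept = [line for line in index_lines if line.split()[1] in present]
    return to_read, kept
```
.DE
.LP
With no new mail,
.CW to_read
is empty and a listing can be produced from the index alone.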
.NH
Caching
.LP
Upas/fs stores each message in a
.CW "Message"
structure. To enable caching, this structure was split
into four parts: the
.CW "Idx"
(or index), message subparts, information on the cache state of the
message and a set of pointers into the processed header and body.
Only the pointers to the processed header and body are subject to
caching. The available cache states are
.CW "Cidx" ,
.CW "Cheader"
and
.CW "Cbody" .
.LP
When the header and body are not present, the average message with
subparts takes roughly 3KB of memory. Thus a 10,000 message mail box
would require roughly 30MB of core in addition to any cached
messages. Reads of the
.CW "info"
or
.CW "subject"
files can be satisfied from the information in the
.CW "Idx"
structure.
.LP
Since there are a fair number of very large messages, requests that
can be satisfied by reading the message headers do not result in the
full message being read. Reads of the
.CW "header"
or
.CW "rawheader"
files of top-level messages are satisfied in this way. Reading the
same files for subparts, however, results in the entire message being
read. Caching the header results in the
.CW "Cheader"
cache state.
.LP
Similarly, reading the
.CW "body"
requires the body to be read and processed, and results in
the
.CW "Cbody"
cache state. Reading from MIME subparts also results
in the
.CW "Cbody"
cache state.
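.LP
The rules in this section amount to a small decision table, sketched
below in Python.
The function
.CW required_state
is hypothetical, and treating
.CW info
and
.CW subject
reads on subparts as index-only is an assumption based on the index
holding one line per MIME part.
.DS I
```python
CIDX, CHEADER, CBODY = "Cidx", "Cheader", "Cbody"

def required_state(filename, is_subpart=False):
    """Return the cache state needed to serve a read of filename."""
    if filename in ("info", "subject"):
        # Served from the Idx structure; no message file is opened.
        # Assumption: this also holds for subparts.
        return CIDX
    if filename in ("header", "rawheader") and not is_subpart:
        # Only the headers of a top-level message are read.
        return CHEADER
    # The body, any other file, and all remaining subpart reads
    # need the whole message.
    return CBODY
```
.DE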
.LP
The cache has a simple LRU replacement policy. Each time a cached
member of a message is accessed, it is moved to the head of the list.
The cache is limited to a maximum number of messages and a maximum
total size.
While the maximum number of messages may not be exceeded, the maximum
cache size may be exceeded if the sum of all the currently referenced
messages is greater than the size of the cache. In this case all
unreferenced messages will be freed. When a message is removed
from the cache, all of its cacheable information is freed.
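.LP
A minimal Python sketch of such a cache follows.
It is not the Upas/fs implementation: the class and method names are
invented, reference counts stand in for whatever Upas/fs uses to mark
a message as referenced, and the sketch simply skips referenced
entries for both limits rather than enforcing the hard limit on the
number of messages.
.DS I
```python
from collections import OrderedDict

class MessageCache:
    def __init__(self, max_msgs, max_bytes):
        self.max_msgs = max_msgs
        self.max_bytes = max_bytes
        # name -> [size, refs]; the most recently used entry is last.
        self.entries = OrderedDict()
        self.total = 0

    def touch(self, name, size):
        """Record an access to name, caching it if needed, then evict."""
        if name in self.entries:
            self.entries.move_to_end(name)
        else:
            self.entries[name] = [size, 0]
            self.total += size
        self._evict(keep=name)

    def ref(self, name, delta):
        """Adjust the reference count; referenced entries are never freed."""
        self.entries[name][1] += delta

    def _evict(self, keep):
        # Walk from least to most recently used, freeing unreferenced
        # entries until both limits are met or nothing else can go.
        for victim in list(self.entries):
            within_count = len(self.entries) <= self.max_msgs
            within_size = self.total <= self.max_bytes
            if within_count and within_size:
                break
            size, refs = self.entries[victim]
            if refs > 0 or victim == keep:
                continue
            del self.entries[victim]
            self.total -= size
```
.DE
.LP
If the referenced messages alone exceed
.CW max_bytes ,
the loop frees every unreferenced entry other than the one just
touched and leaves the cache over its size limit, as described above.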
.NH
Collateral Damage
.LP
.DS I
Each new user of a new system uncovers a new class of bugs.
.br
– Brian Kernighan
.DE
.LP
In addition to upas/fs, programs that make assumptions about how
mail boxes are structured needed to be modified. Programs which
deliver mail to mail boxes (deliver, marshal, ml, smtp) and append messages to
folders were given a common (nedmail) function to call. Since this
was done by modifying functions in the Upas common library, this
presented a problem for programs not traditionally part of Upas
such as acme Mail and imap4d. Rather than fold these programs
into Upas, a new program, mbappend, was added to Upas.
.LP
Imap4d also requires the ability to rename and remove folders.
While an external program would work for this as well, that
approach has some drawbacks. Most importantly, IMAP folders
cannot be moved or renamed in the same way without reimplementing
functionality that is already in upas/fs. It also emphasises the
asymmetry between reading and deleting email and other folder
actions. Folder renaming and removal were therefore added to upas/fs.
It is intended that mbappend will be removed soon
and replaced with equivalent upas/fs functionality,
at least for non-delivery programs.
.LP
Mdirs also expose an oddity about file permissions. An append-only
file that is mode
.CW 0622
may be appended to by anyone, but is readable only by the owner.
With a directory, such a setup is not directly possible as write permission
to a directory implies permission to remove. There are a number of
solutions to this problem. Delivery could be made asymmetrical: incoming
files could be written to an mbox. Or, following the example of the outbound
mail queue, each user could deliver to a directory owned by that user.
In many BSD-derived
.UX
systems, the “sticky bit” on directories is used to modify
the meaning of the
.CW w
bit for users matching only the “other” bits. For them, the
.CW w
bit gives permission to create but not to remove.
.LP
While this is somewhat of a one-off situation, I chose to implement
a version of the “sticky bit” using the existing append-only bit on our
file server. This was implemented as an extra permission check when
removing files. Fewer than 10 lines of code were required.
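.LP
The paper does not give the exact rule added to the file server, so
the Python sketch below is a guess at its shape rather than a
transcription: in a directory with the append-only bit set, a user who
has write permission only through the “other” bits may create files but
not remove them.
The
.CW Dir
class and
.CW may_remove
function are hypothetical;
.CW DMAPPEND
is the Plan 9 append-only mode bit.
.DS I
```python
from dataclasses import dataclass

DMAPPEND = 0x40000000  # Plan 9 append-only bit in a file's mode

@dataclass
class Dir:
    owner: str
    group: str
    mode: int  # permission bits plus DM* flag bits

def may_remove(parent, user, groups=()):
    """May user remove a file from the directory parent?"""
    if user == parent.owner and parent.mode & 0o200:
        return True
    if parent.group in groups and parent.mode & 0o020:
        return True
    if parent.mode & 0o002:
        # Write permission comes only from the "other" bits:
        # an append-only directory allows creation but not removal.
        return not (parent.mode & DMAPPEND)
    return False
```
.DE
.LP
Under this rule anyone can deliver into an append-only inbox directory
of mode
.CW 0722 ,
but only the owner can delete from it.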
.NH
Performance
.LP
A representative local mail box was used to generate some rough
performance numbers. The mail box is 110MB and contains 868 messages.
These figures are shown in table 1. In the worst case, an unindexed
mail box, the new upas/fs uses 18% of the memory of the original while
taking 13% more real time. In the best case, it uses only 5% of the memory
while taking only 13% of the real time. Clearly, a larger mail box will make
these ratios more attractive. In the two months since the snapshot was
taken, that same mail box has grown to 220MB and contains 1814
messages.
.ps -2
.DS C
.TS
box, tab(:);
c s s s s
c | c | c | c | c
a | n | n | n | n.
Table 1 – Performance
_
action:user:system:real:core size
:s:s:s:MB
_
old fs read:1.69:0.84:6.07:135
_
initial read:1.65:0.90:6.90:25
_
indexed read:0.64:0.03:0.77:6.5
.TE
.DE
.NL
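.LP
The ratios quoted above can be recomputed from table 1 with the short
Python sketch below; the names
.CW TABLE1 ,
.CW ratio
and
.CW cpu_ratio
are introduced here for illustration.
Recomputing them suggests that the two 13% figures correspond to the
real-time column rather than to user plus system time.
.DS I
```python
# Rows of table 1: (user s, system s, real s, core size MB).
TABLE1 = {
    "old fs read": (1.69, 0.84, 6.07, 135.0),
    "initial read": (1.65, 0.90, 6.90, 25.0),
    "indexed read": (0.64, 0.03, 0.77, 6.5),
}
COLUMNS = {"user": 0, "system": 1, "real": 2, "core": 3}

def ratio(action, column, baseline="old fs read"):
    """Value of one column for action, relative to the baseline row."""
    i = COLUMNS[column]
    return TABLE1[action][i] / TABLE1[baseline][i]

def cpu_ratio(action, baseline="old fs read"):
    """User plus system time for action, relative to the baseline row."""
    def cpu(name):
        return TABLE1[name][0] + TABLE1[name][1]
    return cpu(action) / cpu(baseline)
```
.DE
.LP
For example,
.CW "ratio(""initial read"", ""core"")"
is 25/135 and
.CW "ratio(""indexed read"", ""real"")"
is 0.77/6.07.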
.NH
Future Work
.LP
While Upas' memory usage has been drastically reduced,
it is still a work in progress. Caching and indexing are
adequate but primitive. Upas/fs is still inconsistently
bypassed for appending messages to mail boxes. There
are also some features which remain incomplete. Finally,
the small increase in scale brings some new questions about
the organization of email.
.LP
It may be useful for mail boxes with very large numbers
of messages to divide the index into fixed-size chunks.
Then messages could be read into a fixed-size pool of
structures as needed. However, it is currently hard to
see how clients could easily interface with a mail box large
enough for this technique to be useful. Currently, all
clients assume that it is reasonable to allocate an
in-core data structure for each message in a mail box.
To take advantage of a chunked index, clients (or the
server) would need a way of limiting the number of
messages considered at a time. Also, for such large
mail boxes, it would be important to separate the
incoming messages from older messages to limit the work
required to scan for new messages.
.LP
Caching is particularly unsatisfactory. Files should
be read in fixed-size buffers so maximum memory usage
does not depend on the size of the largest file in the
mail box. Unfortunately, current data structures do not readily
support this. In practice, this limitation has not yet
been noticeable.
.LP
There are also a few features that need to be completed.
Tracking of references has been added to marshal and
upas/fs. In addition, the index provides a place to store
mutable information about a message. These capabilities
should be built upon to provide general threading and
tagging capabilities.
.NH
Speculation
.LP
Freed from the limitation that all messages in a
mail box must be read and stored in memory before a
single message may be accessed, it is interesting to
speculate on a few further possibilities.
.LP
For example, it may be
useful to replace separate mail boxes with a single
collection of messages assigned to one or more virtual
mail boxes. The association between a message and a
mail box would be a “tag.” A message could be added to
or removed from one or more mail boxes without modifying
the mdir file. If threads were implemented by tagging
each message with its references, it would be possible
to follow threads across mail boxes, even to messages
removed from all mail boxes, provided the underlying
file were not also removed. If a facility for adding
arbitrary, automatic tags were enabled, it would be
possible to tag messages with the email address in
the SMTP From line.
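.LP
As a purely speculative illustration of this idea, the Python sketch
below keeps tags in a pair of in-memory maps beside the message files.
Nothing like it exists in Upas; the
.CW TagStore
class and its methods are invented here.
.DS I
```python
from collections import defaultdict

class TagStore:
    """Virtual mail boxes as tags on messages in a single collection."""

    def __init__(self):
        self.by_tag = defaultdict(set)  # tag -> message file names
        self.by_msg = defaultdict(set)  # message file name -> tags

    def tag(self, msg, tag):
        self.by_tag[tag].add(msg)
        self.by_msg[msg].add(tag)

    def untag(self, msg, tag):
        # Only the association is dropped; the mdir file is untouched.
        self.by_tag[tag].discard(msg)
        self.by_msg[msg].discard(tag)

    def mailbox(self, tag):
        """List the messages in the virtual mail box named tag."""
        return sorted(self.by_tag.get(tag, ()))

    def tags_of(self, msg):
        return set(self.by_msg.get(msg, ()))
```
.DE
.LP
A message removed from every virtual mail box simply has no tags left,
so a thread that refers to it can still be followed while its file
exists.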
.NH
References
.IP [1]
D. Bernstein, “Using maildir format”,
published online at
.br
http://cr.yp.to/proto/maildir.html
.IP [2]
F. Ballesteros,
.IR mails (1),
published online at
.br
http://lsub.org/magic/man2html/1/mails
.IP [3]
MH Wikipedia entry,
.br
http://en.wikipedia.org/wiki/MH_Message_Handling_System
.IP [4]
Lotus Notes Wikipedia entry,
.br
http://en.wikipedia.org/wiki/Lotus_Notes
.IP [5]
D. Presotto, “Upas\(ema Simpler Approach to Network Mail”,
Proceedings of the 10th Usenix conference, 1985.