355 lines
13 KiB
Text
355 lines
13 KiB
Text
|
.EQ
|
|||
|
delim $$
|
|||
|
.EN
|
|||
|
.TL
|
|||
|
Scaling Upas
|
|||
|
.AU
|
|||
|
Erik Quanstrom
|
|||
|
quanstro@coraid.com
|
|||
|
.AB
|
|||
|
The Plan 9 email system, Upas, uses traditional methods of delivery to
|
|||
|
.UX
|
|||
|
mail boxes while using a user-level file system, Upas/fs, to
|
|||
|
translate mail boxes of various formats into a single, convenient format for access.
|
|||
|
Unfortunately, it does not do so efficiently. Upas/fs
|
|||
|
reads entire folders into core. When deleting email from mail boxes,
|
|||
|
the entire mail box is rewritten. I describe how Upas/fs has been
|
|||
|
adapted to use caching, indexing and a new mail box format (mdir) to
|
|||
|
limit I/O, reduce core size and eliminate the need to rewrite mail
|
|||
|
boxes.
|
|||
|
.AE
|
|||
|
.NH
|
|||
|
Introduction
|
|||
|
.LP
|
|||
|
.DS I
|
|||
|
Chained at his root two scion demons dwell
|
|||
|
.br
|
|||
|
– Erasmus Darwin, The Botanic Garden
|
|||
|
.DE
|
|||
|
.LP
|
|||
|
At Coraid, email is the largest resource user in the system by orders
|
|||
|
of magnitude. As of July, 2007, rewriting mail boxes was using
|
|||
|
300MB/day on the WORM and several users required more than 400MB of
|
|||
|
core. As of July, 2008, rewriting mail boxes was using 800MB/day on
|
|||
|
the WORM and several users required more than 1.2GB of core to read
|
|||
|
email. Clearly these are difficult to sustain levels of growth, even
|
|||
|
without growth of the company. We needed to limit the amount of disk
|
|||
|
space used and, more urgently, reduce Upas/fs' core size.
|
|||
|
.LP
|
|||
|
The techniques employed are simple. Mail is now stored in a directory
|
|||
|
with one message per file. This eliminates the need to rewrite mail
|
|||
|
boxes. Upas/fs now maintains an index which allows it to present
|
|||
|
complete message summaries without reading indexed messages.
|
|||
|
Combining the two techniques allows Upas/fs to read only new or
|
|||
|
referenced messages. Finally, caching limits both the total number of
|
|||
|
in-core messages and their total size.
|
|||
|
.NH
|
|||
|
Mdir Format
|
|||
|
.LP
|
|||
|
In addition to meeting our urgent operational requirements of reducing
|
|||
|
memory and disk footprint, to meet the expectations of our users we
|
|||
|
require a solution that is able to handle folders up to ten thousand
|
|||
|
messages, open folders quickly, list the contents of folders quickly
|
|||
|
and support the current set of mail readers.
|
|||
|
.LP
|
|||
|
There are several potential styles of mail boxes. The maildir[1] format
|
|||
|
has some attractive properties. Mail can be delivered to or deleted
|
|||
|
from a mail box without locking. New mail or deleted mail may be
|
|||
|
detected with a directory scan. When used with WORM storage, the
|
|||
|
amount of storage required is no more than the size of new mail
|
|||
|
received. Mbox format can require that a new copy of the inbox be
|
|||
|
stored every day. Even with storage that coalesces duplicate blocks
|
|||
|
such as Venti, deleting a message will generally require new storage
|
|||
|
since messages are not disk-block aligned. Maildir does not reduce
|
|||
|
the cost of the common task of a summary listing of mail such as
|
|||
|
generated by acme Mail.
|
|||
|
.LP
|
|||
|
The mails[2] format proposes a directory per mail. A copy of
|
|||
|
the mail as delivered is stored and each mime part is decoded
|
|||
|
in such a way that a mail reader could display the file directly.
|
|||
|
Command line tools in the style of MH[3] are used to display and
|
|||
|
process mail. Upas/fs is not necessary for reading local mail.
|
|||
|
Mails has the potential to reduce memory footprint below that
|
|||
|
offered by mdirs for native email reading. However all of the
|
|||
|
largest mail boxes at our site are served exclusively through IMAP.
|
|||
|
The preformatting by mails would be unnecessary for such accounts.
|
|||
|
.LP
|
|||
|
Other mail servers such as Lotus Notes[4] store email in a custom
|
|||
|
database format which allows for fielded and full-text searching
|
|||
|
of mail folders. Such a format provides very quick mail
|
|||
|
listings and good search capabilities. Such a solution would not
|
|||
|
lend itself well to a tool-based environment, nor would it be simple.
|
|||
|
.LP
|
|||
|
Maildir format seemed the best basic format but its particulars are
|
|||
|
tied to the
|
|||
|
.UX
|
|||
|
environment; mdir is a descendant. A mdir folder
|
|||
|
is a directory with the name of the folder. Messages in the mdir
|
|||
|
folder are stored in a file named
|
|||
|
.I "utime.seq" .
|
|||
|
.I Utime
|
|||
|
is defined as the decimal
|
|||
|
.UX
|
|||
|
seconds when the message was added to
|
|||
|
the folder. For the inbox, this time will correspond to the
|
|||
|
.UX
|
|||
|
“From ” line.
|
|||
|
.I Seq
|
|||
|
is a two-digit sequence number starting with
|
|||
|
.CW "00."
|
|||
|
The lowest available sequence number is used to store the message.
|
|||
|
Thus, the first email possible would be named
|
|||
|
.CW "0.00."
|
|||
|
To prevent accidents, message files are stored with
|
|||
|
the append-only and exclusive-access bits turned on.
|
|||
|
The message is stored in the same format it would be in mbox
|
|||
|
format; each message is a valid mbox folder with a single message.
|
|||
|
.NH
|
|||
|
Indexing
|
|||
|
.LP
|
|||
|
When upas/fs finds an unindexed message, it is added to the index.
|
|||
|
The index is a file named
|
|||
|
.I "foldername" .idx
|
|||
|
and consists a signature and one line per MIME part. Each line
|
|||
|
contains the SHA1 checksum of the message (or a place holder for
|
|||
|
subparts), one field per entry in the
|
|||
|
.I "messageid/info"
|
|||
|
file, flags and the number of subparts. The flags are currently a
|
|||
|
superset of the standard IMAP flags. They provide the similar
|
|||
|
functionality to maildir's modified file names. Thus the `S'
|
|||
|
(answered) flag remains set between invocations of mail readers.
|
|||
|
Other mutable information about a message may be stored in a similar
|
|||
|
way.
|
|||
|
.LP
|
|||
|
Since the
|
|||
|
.I info
|
|||
|
file is read by all the mail readers to produce mail listings,
|
|||
|
mail boxes may be listed without opening any mail files when no new
|
|||
|
mail has arrived. Similarly, opening a new mail box requires reading
|
|||
|
the index and checking new mail. Index files are typically between
|
|||
|
0.5% and 5% the size of the full mail box. Each time the index is
|
|||
|
generated, it is fully rewritten.
|
|||
|
.NH
|
|||
|
Caching
|
|||
|
.LP
|
|||
|
Upas/fs stores each message in a
|
|||
|
.CW "Message"
|
|||
|
structure. To enable caching, this structure was split
|
|||
|
into four parts: The
|
|||
|
.CW "Idx"
|
|||
|
(or index), message subparts, information on the cache state of the
|
|||
|
message and a set of pointers into the processed header and body.
|
|||
|
Only the pointers to the processed header and body are subject to
|
|||
|
caching. The available cache states are
|
|||
|
.CW "Cidx" ,
|
|||
|
.CW "Cheader"
|
|||
|
and
|
|||
|
.CW "Cbody" .
|
|||
|
.LP
|
|||
|
When the header and body are not present, the average message with
|
|||
|
subparts takes roughly 3KB of memory. Thus a 10,000 message mail box
|
|||
|
would require roughly 30MB of core in addition to any cached
|
|||
|
messages. Reads of the
|
|||
|
.CW "info"
|
|||
|
or
|
|||
|
.CW "subject"
|
|||
|
files can be satisfied from the information in the
|
|||
|
.CW "Idx"
|
|||
|
structure.
|
|||
|
.LP
|
|||
|
Since there are a fair number of very large messages, requests that
|
|||
|
can be satisfied by reading the message headers do not result in the
|
|||
|
full message being read. Reads of the
|
|||
|
.CW "header"
|
|||
|
or
|
|||
|
.CW "rawheader"
|
|||
|
files of top-level messages are satisfied in this way. Reading the
|
|||
|
same files for subparts, however, results in the entire message being
|
|||
|
read. Caching the header results in the
|
|||
|
.CW "Cheader"
|
|||
|
cache state.
|
|||
|
.LP
|
|||
|
Similarly, reading the
|
|||
|
.CW "body"
|
|||
|
requires the body to be read, processed and results in
|
|||
|
the
|
|||
|
.CW "Cbody"
|
|||
|
cache state. Reading from MIME subparts also results
|
|||
|
in the
|
|||
|
.CW "Cbody"
|
|||
|
cache state.
|
|||
|
.LP
|
|||
|
The cache has a simple LRU replacement policy. Each time a cached
|
|||
|
member of a message is accessed, it is moved to the head of the list.
|
|||
|
The cache contains a maximum number of messages and a maximum size.
|
|||
|
While the maximum number of messages may not be exceeded, the maximum
|
|||
|
cache size may be exceeded if the sum of all the currently referenced
|
|||
|
messages is greater than the size of the cache. In this case all
|
|||
|
unreferenced messages will be freed. When removing a message
|
|||
|
from the cache all of the cacheable information is freed.
|
|||
|
.NH
|
|||
|
Collateral damage
|
|||
|
.LP
|
|||
|
.DS I
|
|||
|
Each new user of a new system uncovers a new class of bugs.
|
|||
|
.br
|
|||
|
— Brian Kernighan
|
|||
|
.DE
|
|||
|
.LP
|
|||
|
In addition to upas/fs, programs that have assumptions about how
|
|||
|
mail boxes are structured needed to be modified. Programs which
|
|||
|
deliver mail to mail boxes (deliver, marshal, ml, smtp) and append messages to
|
|||
|
folders were given a common (nedmail) function to call. Since this
|
|||
|
was done by modifying functions in the Upas common library, this
|
|||
|
presented a problem for programs not traditionally part of Upas
|
|||
|
such as acme Mail and imap4d. Rather than fold these programs
|
|||
|
into Upas, a new program, mbappend, was added to Upas.
|
|||
|
.LP
|
|||
|
Imap4d also requires the ability to rename and remove folders.
|
|||
|
While an external program would work for this as well, that
|
|||
|
approach has some drawbacks. Most importantly, IMAP folders
|
|||
|
can't be moved or renamed in the same way without reimplementing
|
|||
|
functionality that is already in upas/fs. It also emphasises the
|
|||
|
asymmetry between reading and deleting email and other folder
|
|||
|
actions. Folder renaming and removal were added to upas/fs.
|
|||
|
It is intended that mbappend will be removed soon
|
|||
|
and replaced with equivalent upas/fs functionality —
|
|||
|
at least for non-delivery programs.
|
|||
|
.LP
|
|||
|
Mdirs also expose an oddity about file permissions. An append-only
|
|||
|
file that is mode
|
|||
|
.CW 0622
|
|||
|
may be appended to by anyone, but is readable only by the owner.
|
|||
|
With a directory, such a setup is not directly possible as write permission
|
|||
|
to a directory implies permission to remove. There are a number of
|
|||
|
solutions to this problem. Delivery could be made asymmetrical—incoming
|
|||
|
files could be written to a mbox. Or, following the example of the outbound
|
|||
|
mail queue, each user could deliver to a directory owned by that user.
|
|||
|
In many BSD-derived
|
|||
|
.UX
|
|||
|
systems, the “sticky bit” on directories is used to modify
|
|||
|
the meaning of the
|
|||
|
.CW w
|
|||
|
bit for users matching only the other bits. For them, the
|
|||
|
.CW w
|
|||
|
bit gives permission to create but not to remove.
|
|||
|
.LP
|
|||
|
While this is somewhat of a one-off situation, I chose to implement
|
|||
|
a version of the “sticky bit” using the existing append-only bit on our
|
|||
|
file server. This was implemented as an extra permission check when
|
|||
|
removing files. Fewer than 10 lines of code were required.
|
|||
|
.NH
|
|||
|
Performance
|
|||
|
.LP
|
|||
|
A representative local mail box was used to generate some rough
|
|||
|
performance numbers. The mail box is 110MB and contains 868 messages.
|
|||
|
These figures are shown in table 1. In the worse case—an unindexed
|
|||
|
mail box—the new upas/fs uses 18% of the memory of the original while
|
|||
|
using 13% more cpu. In the best case, it uses only 5% of the memory
|
|||
|
while using only 13% of the cpu. Clearly, a larger mail box will make
|
|||
|
these ratios more attractive. In the two months since the snapshot was
|
|||
|
taken, that same mail box has grown to 220MB and contains 1814
|
|||
|
messages.
|
|||
|
.ps -2
|
|||
|
.DS C
|
|||
|
.TS
|
|||
|
box, tab(:);
|
|||
|
c s s s s
|
|||
|
c | c | c | c | c
|
|||
|
a | n | n | n | n.
|
|||
|
Table 1 – Performance
|
|||
|
_
|
|||
|
action:user:system:real:core size:
|
|||
|
:s:s:s:MB:
|
|||
|
_
|
|||
|
old fs read:1.69:0.84:6.07:135
|
|||
|
_
|
|||
|
initial read:1.65:0.90:6.90:25
|
|||
|
_
|
|||
|
indexed read:0.64:0.03:0.77:6.5
|
|||
|
.TE
|
|||
|
.DE
|
|||
|
.NL
|
|||
|
.NH
|
|||
|
Future Work
|
|||
|
.LP
|
|||
|
While Upas' memory usage has been drastically reduced,
|
|||
|
it is still a work-in-progress. Caching and indexing are
|
|||
|
adequate but primitive. Upas/fs is still inconsistently
|
|||
|
bypassed for appending messages to mail boxes. There
|
|||
|
are also some features which remain incomplete. Finally,
|
|||
|
the small increase in scale brings some new questions about
|
|||
|
the organization of email.
|
|||
|
.LP
|
|||
|
It may be useful for mail boxes with very large numbers
|
|||
|
of messages to divide the index into fixed-size chunks.
|
|||
|
Then messages could be read into a fixed-sized pool of
|
|||
|
structures as needed. However it is currently hard to
|
|||
|
see how clients could easily interface a mail box large
|
|||
|
enough for this technique to be useful. Currently, all
|
|||
|
clients assume that it is reasonable to allocate an
|
|||
|
in-core data structure for each message in a mail box.
|
|||
|
To take advantage of a chunked index, clients (or the
|
|||
|
server) would need a way of limiting the number of
|
|||
|
messages considered at a time. Also, for such large
|
|||
|
mail boxes, it would be important to separate the
|
|||
|
incoming messages from older messages to limit the work
|
|||
|
required to scan for new messages.
|
|||
|
.LP
|
|||
|
Caching is particularly unsatisfactory. Files should
|
|||
|
be read in fixed-sized buffers so maximum memory usage
|
|||
|
does not depend on the size of the largest file in the
|
|||
|
mail box. Unfortunately, current data structures do not readily
|
|||
|
support this. In practice, this limitation has not yet
|
|||
|
been noticeable.
|
|||
|
.LP
|
|||
|
There are also a few features that need to be completed.
|
|||
|
Tracking of references has been added to marshal and
|
|||
|
upas/fs. In addition, the index provides a place to store
|
|||
|
mutable information about a message. These capabilities
|
|||
|
should be built upon to provide general threading and
|
|||
|
tagging capabilities.
|
|||
|
.NH
|
|||
|
Speculation
|
|||
|
.LP
|
|||
|
Freed from the limitation that all messages in a
|
|||
|
mail box must be read and stored in memory before a
|
|||
|
single message may be accessed, it is interesting to
|
|||
|
speculate on a few further possibilites.
|
|||
|
.LP
|
|||
|
For example, it may be
|
|||
|
useful to replace separate mail boxes with a single
|
|||
|
collection of messages assigned to one or more virtual
|
|||
|
mail boxes. The association between a message and a
|
|||
|
mail box would be a “tag.” A message could be added to
|
|||
|
or removed from one or more mail boxes without modifying
|
|||
|
the mdir file. If threads were implemented by tagging
|
|||
|
each message with its references, it would be possible
|
|||
|
to follow threads across mail boxes, even to messages
|
|||
|
removed from all mail boxes, provided the underlying
|
|||
|
file were not also removed. If a facility for adding
|
|||
|
arbitrary, automatic tags were enabled, it would be
|
|||
|
possible to tag messages with the email address in
|
|||
|
the SMTP From line.
|
|||
|
.NH
|
|||
|
References
|
|||
|
.IP [1]
|
|||
|
D. Bernstein, “Using maildir format”,
|
|||
|
published online at
|
|||
|
.br
|
|||
|
http://cr.yp.to/proto/maildir.html
|
|||
|
.IP [2]
|
|||
|
F. Ballesteros
|
|||
|
.IR mails (1),
|
|||
|
published online at
|
|||
|
http://lsub.org/magic/man2html/1/mails
|
|||
|
.IP [3]
|
|||
|
MH Wikipedia entry,
|
|||
|
http://en.wikipedia.org/wiki/MH_Message_Handling_System
|
|||
|
.IP [4]
|
|||
|
Lotus Notes Wikipedia entry,
|
|||
|
http://en.wikipedia.org/wiki/Lotus_Notes
|
|||
|
.IP [5]
|
|||
|
D. Presotto, “Upas—a Simpler Approach to Network Mail”,
|
|||
|
Proceedings of the 10th Usenix conference, 1985.
|