the intend of posting a note to the faulting process is to
interrupt the syscall to give the note handler a chance
to handle it. kernel processes however, have no note handlers
and all the postnote() does is set up->notepending which will
make the next attempt to sleep raise an Eintr[] error. this
is harmless, but usually not what we want.
there's no need to waste space for a error buffer in the Segio
structure, as the segmentio kproc will be waiting for the next
command after an error and will not overwite it until we issue
another command.
devproc's procctlmemio() did not handle physical segment
types correctly, as it assumed it can just kmap() the page
in question and write to it. physical segments however
need to be mapped uncached but kmap() will always map
cached as it assumes normal memory. on some machines with
aliasing memory with different cache attributes
leads to undefined behaviour!
we borrow the code from devsegment and provide a generic
segio() function to read and write user segments which
handles all the cases without using kmap by just spawning
a kproc that attaches the segment that needs to be read
from or written to. fault() will setup the right mmu
attributes for us. it will also properly flush pages for
segments that maintain instruction cache when written.
however, tlb's have to be flushed separately.
segio() is used for devsegment and devproc now, which
also allows for simplification of fixfault() as there is no
special error handling case anymore as fixfault() is now
called from faulting process *only*.
reads from /proc/$pid/mem can now span multiple pages.
code like "return g->dlen;" is wrong as we do not hold the
qlock of the global segment. another process could come in
and override g->dlen making us return the wrong byte count.
avoid copying when we already got a kernel address (kernel memory
is the same on processes) which is the case with bread()/bwrite().
this is the same optimization that devsd does.
also avoid allocating/freeing and copying while holding the qlock.
when we copy to/from user memory, we might fault preventing
others from accessing the segment while fault handling is in
progress.
walking the freelist for every page is too slow. as we
are freeing a range, we can do a single pass unlinking all
pages in our range and at the end, check if all pages
where freed, if not put the pages that we did free back
and retry, otherwise we'r done.
fixed segments are continuous in physical memory but
allocated in user pages. unlike shared segments, they
are not allocated on demand but the pages are allocated
on creation time (devsegment). fixed segments are
never swapped out, segfreed or resized and can only be
destroyed as a whole.
the physical base address can be discovered by userspace
reading the ctl file in devsegment.
the mount cache uses Page.va to store cached range offset and
limit, but mips kernel uses cache index bits from Page.va to
maintain page coloring. Page.va was not initialized by auxpage().
this change removes auxpage() which was primarily used only
by the mount cache and use newpage() with cache file offset
page as va so we will get a page of the right color.
mount cache keeps the index bits intact by only using the top
and buttom PGSHIFT bits of Page.va for the range offset/limit.
when we are skipping a process because we could not acquire
its segment lock, dont call reclaim() again (which is pointless
as we didnt pageout any pages), instead try the next process.
the Pte.last pointer is inclusive, so don't miss the last page
in pageout().
mcountseg(), mfreeseg():
use Pte.first/last pointers when possible and avoid constructs
like s->map[i]->pages[j].
freepte():
do not zero entries in freepte(), the segment is going away and
here is no point in zeroing page pointers. hoist common code at
the top avoiding duplication.
segpage(), fixfault():
avoid load after store for Pte** pointer.
fixfault():
return -1 in default case to avoid the "used but not set" warning
for mmuphys and get rid of the useless initialization.
syssegflush():
due to len being unsigned, the pe = PGROUND(pe) can make "chunk"
bigger than len causing a overflow. rewrite the function and deal
with page alignment and errors at the beginning.
syssegflush(), segpage(), fixfault(), putseg(), relocateseg(),
mcountseg(), mfreeseg():
keep naming consistent.
the "to" address can overflow in syssegfree() causing wrong
number of pages to be passed to mfreeseg(). with the current
implementation of mfreeseg() however, this doesnt cause any
data corruption but was just freeing an unexpected number of
pages.
this change checks for this condition in syssegfree() and
errors out instead. also mfreeseg() was changed to take
ulong argument for number of pages instead of int to keep
it consistent with other routines that work with page counts.
sdbio() tests if it can pass the buffer pointer directly to
the driver when it is already in kernel memory. we also need
to check if the buffer is properly aligned but alignment
requirement is handled in system specific sdmalloc() and
was not known to devsd.
to solve this, we *always* page align sd buffers and get rid
of the system specific sdmalloc() macro (was only used in bcm
kernel).
ignore physical segments in mcountseg() and mfreeseg(). physical
segments are not backed by user pages, and doing putpage() on
physical segment pages in mfreeseg() is an error.
do now allow physical segemnts to be resized. the segment size
is only checked in segattach() to be within the physical segment!
ignore physical segments in portcountpagerefs() as pagenumber()
does not work on the malloced page structures of a physical segment.
get rid of Physseg.pgalloc() and Physseg.pgfree() indirection as
this was never used and if theres a need to do more efficient
allocation, it should be done in a portable way.
the following hooks have been added to the ehci Ctlr
structore to handle cache coherency (on arm):
void* (*tdalloc)(ulong,int,ulong);
void* (*dmaalloc)(ulong);
void (*dmafree)(void*);
void (*dmaflush)(int,void*,ulong);
tdalloc() is used to allocate descriptors and the periodic
frame schedule array. on arm, this needs to return uncached
memory. tdalloc()ed memory is never freed.
dmaalloc()/dmafree() is used for io buffers. this can return
cached memory when when hardware maintains cache coherency (pc)
or dmaflush() is provided to flush/invalidate the cache (zynq),
otherwise needs to return uncached memory.
dmaflush() is used to flush/invalidate the cache. the first
argument tells us if we need to flush (non zero) or
invalidate (zero).
uncached.h is gone now. this change makes the handling explicit.
there are no kernels currently that do page coloring,
so the only use of cachectl[] is flushing the icache
(on arm and ppc).
on pc64, cachectl consumes 32 bytes in each page resulting
in over 200 megabytes of overhead for 32gb of ram with 4K
pages.
this change removes cachectl[] and adds txtflush ulong
that is set to ~0 by pio() to instruct putmmu() to flush
the icache.
this bug happens when the kernel runs out of mount rpc
buffers when allocating a flush rpc. in this case, mntflushalloc()
will errorjump out of mountio() leaving the currently in
flight rpc in the mount. the caller of mountrpc()/mountio()
frees the rpc thats still queued in the mount leaving
to interesting results.
for the fix, we add a waserror() arround mntflushalloc() and
handle the error case like a mount rpc failure which will
properly dequeue the rpc's in flight.
to allow bytewise access to /proc/#/fd, the contents of the file where
recreated on each call. if fd's had been closed or reassigned between
the reads, the offset would be inconsistent and a read could start off
in the middle of a line. this happens when you cat /proc/#/fd file of
a busy process that mutates its filedescriptor table.
to fix this, we now return one line record at a time. if the line
fits in the read size, then this means the next read will always start
at the beginning of the next line record. we remember the consumed
byte count in Chan.mrock and the current record in Chan.nrock. (these
fields are free to usefor non-directory files)
if a read comes in and the offset is the same as c->mrock, we do not
need to regenerate the file and just render the next c->nrock's record.
for reads smaller than the line count, we have to regenerate the content
up to the offset and the race is still possible, but this should not
be the common case.
the same algorithm is now used for /proc/#/ns file, allowing a simpler
reimplementation and getting rid of Mntwalk state strcture.
the purpose of checkpages() is to verify consitency of the hardware mmu state,
not to notify on the console that a program faulted. a program could also
continue after handling the note. (this seems to be the case in go programs)
this is a new more simple version of the mount cache
that does not require dynamic allocations for extends.
the Mntcache structure now contains a page bitmap
that is used for quick page invalidation. the size
of the bitmap is proportional to MAXCACHE.
instead of keeping track of cached range in the
Extend data structure, we keep all the information
in the Page itself. the offset from the page where
the cache range starts is in the low PGSHIT bits and
the end in the top bits of Page.va.
we choose Page.daddr to map 1:1 the Mountcache number
and page number (pn) in the Mountcache. to find a page,
we first check the bitmap if the page is there and then
do a pagelookup() with the daddr key.
change page cache ids (bid) to uintptr so we use the full
address space of Page.daddr.
make maxcache offset check consistent in cread().
use consistent types in cupdate() and simplify with goto.
make internal functions static.
use nil instead of 0 for pointers.
there is no use for "bootdisk" variable parametrization
of /boot/boot and no point for the boot section with its
boot methods in the kernel configuration anymore. so
mkboot and boot$CONF.out are gone.
move the rules for bootfs.paq creation in 9/boot/bootmkfile.
location of bootfs.proto is now in 9/boot/bootfs.proto.
our /boot/boot target is now just "boot".
expand the list of files specified in bootfs.proto and use them
as dependencies to bootfs.paq rule. this way, bootfs.paq is
regenerated when the to be included files have been modified.
we can avoid some flickering when removing the software cursor
from the shadow framebuffer by avoiding the flushscreenimage()
call.
once the cursor is redrawn, we flush the combined rect of its
old and new position in one go.
theres a race where procstopwait() is interrupted by a note,
setting p->pdbg to nil *before* acquiering the lock and
and pexit() and procctl() accessing it assuming it doesnt
change under them while they are holding the lock.
alexchandel got the kernel to crash with divide error
on qemu 2.1.2/macosx at this location. probably
caused by perfticks()/tsc being wrong or accounttime()
not having been called yet from timer interrupt yet for
some reason.
WHAT WHERE THEY *THINKING*??!?!
unlike seek, the (new) nsec syscall (not used in 9front libc)
returns the time value in register (from nix), so do the same
for compatibility.
from segattach(2):
Va and len specify the position of the segment in the
process's address space. Va is rounded down to the nearest
page boundary and va+len is rounded up. The system does not
permit segments to overlap. If va is zero, the system will
choose a suitable address.
just rounding up len isnt enougth. we have to round up va+len
instead of just len so that the span [va, va+len) is covered
even if va is not page aligned.
kenjis example:
print("%p\n",ap); // 206cb0
ap = segattach(0, "shared", ap, 1024);
print("%p\n",ap); // 206000
term% cat /proc/612768/segment
Stack defff000 dffff000 1
Text R 1000 6000 1
Data 6000 7000 1
Bss 7000 7000 1
Shared 206000 207000 1
term%
note that 0x206cb0 + 0x400 > 0x20700.
we have to recheck the condition under tod lock, otherwise
another process can come in and updated tod.last and
tod.off and once we have the lock, we would make time
jump backwards.
when mountmux() completes a request for another process, enforce odering
of the loads and stores to the request prior to writing q->done = 1
so mntflushfree() sees q->done != 0 only when the request has actually
completed. otherwise, the q->done = 1 store could have been reordered
before the load from q->z, reading from already freed request and causing
spurious wakeups.
removing unused mntstats callback.
use nil for pointers instead of 0.
dont kill the calling process when demand load fails if fixfault()
is called from devproc. this happens when you delete the binary
of a running process and try to debug the process accessing uncached
pages thru /proc/$pid/mem file.
fixes to procctlmemio():
- fix missed unlock as txt2data() can error
- make sure the segment isnt freed by taking a reference (under p->seglock)
- access the page with segment locked (see comment)
- get rid of the segment stealer lock
other stuff:
- move txt2data() and data2txt() to segment.c
- add procpagecount() function
- make return type mcounseg() to ulong
make the Page stucture less than half its original size by getting rid of
the Lock and the lru.
The Lock was required to coordinate the unchaining of pages that where
both cached and on the lru freelist.
now pages have a single next pointer that is used for palloc.head
freelist xor for page cache hash chains in Image.pghash[].
cached pages are not on the freelist anymore, but will be reclaimed
from images by the pager when the freelist runs out of pages.
each Image has its own 512 hash chains for cached page lookup. That is
2MB worth of pages and there should be no collisions for most text images.
page reclaiming can be done without holding palloc.lock as the Image is
the owner of the page hash chains protected by the Image's lock.
reclaiming Image structures can be done quickly by only reclaiming pages from
inactive images, that is images which are not currently in use by segments.
the Ref structure has no Lock anymore. Only a single long that is atomically
incremented or decremnted using cmpswap().
there are various other changes as a consequence code. and lots of pikeshedding,
sorry.
we have to make sure the *swap address* doesnt go away,
after putting the swap address in the segment pte.
after we unlock the segment, the process could be
killed or fault which would cause the swap address to
be freed *before* we write the page to disk when it
pulls the page from the cache and putswap() swap pte.
keeping a reference to the page is no good. we have
to hold on the swap address. this also has the advantage
that we can now test if the swap address is still
referenced and can avoid writing to disk.
as with the Block refcount changes, _xinc() and _xdec() arent
used anymore, so remove them.
architecure can still define ainc()/adec() when it needs them.
change Proc.nlocks from Ref to int and just use normal increment and decrelemt
as done in erik quanstros 9atom.
It is not clear why we used atomic increment in the fist place as even if we
get preempted by interrupt and scheduled before we write back the incremented
value, it shouldnt be a problem and we'll just continue where we left off as
our process is the only one that can write to it.
Yoann Padioleau found that the Mach pointer Lock.m wasnt maintained
consistently for lock() vs canlock() and ilock(). Fixed.
Use uintptr instead of ulong for maxlockpc, maxilockpc and ilockpc debug variables.
the palloc.pages array takes arround 5% of the upages which
gives us:
16GB = ~0.8GB
32GB = ~1.6GB
64GB = ~3.2GB
we only have 2GB of address space above KZERO so this will not
work for long.
instead, pageinit() was altered to accept a preallocated memory
in palloc.pages. and preallocpages() in pc64/main.c allocates the
in upages memory, mapping it in the VMAP area (which has 512GB).
the drawback is that we cannot poke at Page structures now from
/proc/n/mem as the VMAP area is not accessible from it.
procwrite() did truncate the offset to 32bit ulong.
introduce off2addr() function that does the sign
extension hack and use it conststently for Qmem
reads and writes.
for the CMclose procctl, the fd number was not
bounds checked before indexing in the Fgrp.fd
array.
for the CMclosefiles, we looped fd from 0..maxfd-1,
but need to loop from 0..maxfd as maxfd is inclusive.
on amd64, the text segment is aligned and padded to
2MB, but segment granularity is 4K which can give
us page faults that are beyond the highest file
offset. this is perfectly valid, but was not handled
correctly in pio().
make mntflushfree() return the original rpc and do the
botched clunk check on the original instead of the
current rpc.
so if we get a botched flush of a clunk, we abandon the
fid of the channel as well.
if theres an error transmitting a Tclunk or Tremove request,
we cannot assume the fid to be clunked. in case this was
a transient error, reusing the fid on further requests
will fail.
as a work arround, we zero the channels fid and allocate
a new fid before the chan is reused.
this is not correct as we essentially leak the fid
on the fileserver, but we will still be able to use
the mount.
ftrvxmtrx repots devices that use the endpoint number for
input and output of different types like:
nusb/ether: parsedesc endpoint 5[7] 07 05 81 03 08 00 09 # ep1 in intr
nusb/ether: parsedesc endpoint 5[7] 07 05 82 02 00 02 00
nusb/ether: parsedesc endpoint 5[7] 07 05 01 02 00 02 00 # ep1 out bulk
the previous change tried to work arround this but had the
concequence that only the lastly defined endpoint was
usable.
this change addresses the issue by allowing up to 32 endpoints
per device (16 output + 16 input endpoints) in devusb. the
hci driver will ignore the 4th bit and will only use the
lower 4 bits as endpoint address when talking to the usb
device.
when we encounter a conflict, we map the input endpoint
to the upper id range 16..31 and the output endpoint
to id 0..15 so two distinct endpoints are created.
the original format for addresses was %8lux which was changed
to %p for amd64. this broke linuxemu which assumes fixed format
in the segment file. as a compromize we change it to %8p and
amd64 port of linuxemu will hopefully use a more robust parser :)
the comment about Physseg.size being in pages is wrong,
change type to uintptr and correct the comment.
change the length parameter of segattach() and isoverlap()
to uintptr as well. segments can grow over 4GB in pc64 now
and globalsegattach() in devsegment calculates len argument
of isoverlap() by s->top - s->bot. note that the syscall
still takes 32bit ulong argument for the length!
check for integer overflow in segattach(), make sure segment
goes not beyond USTKTOP.
change PTEMAPMEM constant to uvlong as it is used to calculate
SEGMAXSIZE.
simplifying paging code by getting rid of duppage(). instead,
fixfault() now always makes a copy of the shared/cached page
and leaves the cache alone. newpage() uncaches pages as
neccesary.
thanks charles forsyth for the suggestion.
from http://9fans.net/archive/2014/03/26:
> It isn't needed at all. When a cached page is written, it's trying hard to
> replace the page in the cache by a new copy,
> to return the previously cached page. Instead, I copy the cached page and
> return the copy, which is what it already
> does in another instance. ...
imagereclaim() sabotaged itself by breaking the invariant
that cached pages are kept at the end of the page list.
once we made a hole of uncached pages, we would stop
reclaiming cached pages before it as the loop breaks
once it hits a uncached page. (we iterate backwards from
the tail to the head of the pagelist until pages have been
reclaimed or we hit a uncached page).
the solution is to move pages to the head of the pagelist
after removing them from the image cache.
file offset is 64 bit signed integer, negative offsets
are invalid and rejected by the kernel. to still access
kernel memory on amd64, we unconditionally clear the sign
bit of the 64 bit offset in libmach and devproc sign
extends the offset back to a 64 bit address.
this doubling affects all segment types, not just bss.
(tho text/data are usually small...)
and theres no telling if the segment will actually
grow in the future justifying the reduction of memmove
overhead in ibrk().
some ape programs are approaching the 16mb ssegmap size
so that code might trigger.
removing the smarts...
as erik quanstro suggests, theres not much of a point in
storing the full 64bit pc as one cannot get a code segment
bigger than 4G and amd64 makes it hard to use a pc that
isnt 64bit sign extension of 32bit.
instead, we only store ulong (as originally), but sign
extend back when returning in getmalloctag() and
getrealloctag().
getrealloctag() used to be broken. its now fixed.
this change is in preparation for amd64. the systab calling
convention was also changed to return uintptr (as segattach
returns a pointer) and the arguments are now passed as
va_list which handles amd64 arguments properly (all arguments
are passed in 64bit quantities on the stack, tho the upper
part will not be initialized when the element is smaller
than 8 bytes).
this is partial. xalloc needs to be converted in the future.
when user does read of exactly 12*12 bytes on draw
ctl file, the snprint() adds one more \0 byte writing
beyond the user buffer and corrupting memory.
fix this by not snprint()ing the final space and add
it manually.
the stats and ifstats files in the 3rd level of a netif
are not per connection, but for the interface.
this made fstat fail for /net/ether0/N/*stats where N > 0
as the NETID() bits in the qid didnt compare.
from erik quanstros 9fans post:
i think the list insertion code needs a single-read
test that f->alarm != 0. to prevent the 0 from
acting like a fencepost. e.g. trying to insert -10 into
list -40 -30 0 -20.
if(alarms.head) {
l = &alarms.head;
for(f = *l; f; f = f->palarm) {
>> fw = f->alarm;
>> if(fw != 0 && (long)(fw - when) >= 0) {
up->palarm = f;
*l = up;
goto done;
}
l = &f->palarm;
}
*l = up;
}
make sure not to dereference Proc* nil pointer. this can potentially
happen from devip which has code like:
if(er->read4p)
postnote(er->read4p, 1, "unbind", 0);
the process it is about to kill can zero er->read4p at any time,
so there is the possibility of the condition to be true and then
er->read4p becoming nil.
check if the process has already exited (p->pid == 0) in postnote()
under p->debug qlock.
when alarmkproc is commited to send the alarm note to the process,
the process might have exited already, or worse, being reused for
another process. pexit() zeros p->alarm at the beginning, but the
kalarmproc() might read p->alarm before pexit() zeroed it, decide
to send the note, then get preempted and pexit() releases the proc.
once kalarmproc() is resumed, the proc might be already something
different and we send the note to the wrong thing.
we now check p->alarm under the debug qlock. that way, pexit()
cannot make progress while we test the condition.
remove the error label arround postnote(). postnote does not raise
error.
make sure noteid is valid (>0).
prohibit changing note group of kernel processes. this is also
checked for in pgrpnote().
prevent "none" user from changing its note group to another "none"
sessions. this would allow him to send notes other none processes
other than its own.
(11:02:29 PM) me: why is buf in /sys/src/9/port/devssl.c:/^sslwrite only 128 bytes?
(11:02:58 PM) me: it makes it so you can't use a 128 bytes secret as negotiated by infauth in a secretin or secretout ctl message
(11:03:30 PM) me: which in turn means you can't use such a secret with pushssl(2)
(11:06:15 PM) me: inferno's sslwrite is limited to 32 bytes, but its ssl library writes to the secret files instead of to the ctl file
(11:08:50 PM) mischief: what should it be instead of 128 bytes
(11:08:58 PM) me: larger
(11:09:16 PM) mischief: how about 129 bytes?
(11:09:59 PM) me: also broken in 9front, by the way
(11:15:14 PM) me: i guess it should be replaced with parsecmd
replaced the p->pid != 0 check with up->parentpid != 0 so
p->pid == up->parentpid is never true for p->pid == 0.
avoid allocating the wait records when up->parentpid == 0.
when a process got forked with RFNOWAIT, its p->parent will still
point to the parent process, but its p->parentpid == 0.
this causes the "parent still alive" check in pexit to get confused
as it only checked p->pid == up->parentpid. this condition is *TRUE*
in the case of RFNOWAIT when the parent process is actually dead
(p->pid == 0) so we attached the wait structure to the dead parent
leaking the memory.
catch the error() that can be thrown by sleep() and tsleep()
in kprocs.
add missing pexit() calls.
always set the freemem argument to pexit() from kproc otherwise
the process gets added to the broken list.
the image cache should not hold onto the text file channel
when not neccesary. now, the image keeps track of the number
of page cache references in Image.pgref. if the number of
page cache references and Image.ref are equal, this means
all the references to this image are from the page cache.
so no segments are using this image. in that case, we can
close the channel, but keep the Image in the hash table.
when attachimage() finds our image, it will check if Image.c
is nil and reattach the channel to the image before it is
used.
the Image.nocache flag isnt needed anymore.
the shr mount is linked into the Mhead with m->to initially nil. only
after the the server has been attached is m->to set. just check for
it in createdir().
the image cache has the property of keeping a channel
for the executable binary arround which prevents the
mountpoint from going away.
this can easily be reproduced by running:
@{rfork n; ramfs; cp /bin/echo /tmp; /tmp/echo}
observe how ramfs stays arround until the image is
reclaimed. the echo binary is also cached but is
unreachable from any namespace.
we now restrict the caching to mounts that use the client
cache (-C flag) only. this should always be the case
for /bin. places where this isnt the case might observe
a performance regression.
closechanq() is unable to fork a new closeproc when palloc
is locked. so we spawn a closeproc early in chandevinit()
and make sure theres always one process arround to handle
the queue.
the automatic routing from jack to dac/adc sometimes gets us
a path thats not audible. manually specifying a route path
gets us arround these. the syntax is just a comma separated
list of node ids in the "pin" and "inpin" audioctl commands
instead of a single pin node id.
to find alternative paths, audiostat now lists all the widgets;
not just the pins; and ther input connections.
initially mute all pins and amps of all function groups.
connectpath() and disconnectpath() will mute and unmute
the widgets as required later.
validaddr looks up the segments for an address range
and checks the flags and if the address range lies
within bounds on the segments.
as we'r going to lookup the segment in the syssem*
syscalls anyway, we can do the checks ourselfs avoiding
the double segment array lookups.
the implication of this tho is that now a semaphore cannot
span multiple segments. but this would be highly unusual
given that segments are page aligned.
Port Reset R/W. 1=Port is in Reset. 0=Port is not in Reset. Default = 0. When
software writes a one to this bit (from a zero), the bus reset sequence as defined in the
USB Specification Revision 2.0 is started. Software writes a zero to this bit to terminate
the bus reset sequence. Software must keep this bit at a one long enough to ensure the
reset sequence, as specified in the USB Specification Revision 2.0, completes. Note:
when software writes this bit to a one, it must also write a zero to the Port Enable bit.
Note that when software writes a zero to this bit there may be a delay before the bit
status changes to a zero. The bit status will not read as a zero until after the reset
has completed.
openmode() can raise error with invalid mode passed, but we already
incremented the exclusive mouse refcount at that point! call openmode()
early to avoid this.
double the td abort delay and make sure the tsleep() isnt
shortened by a pending note. in that case, tsleep() would
raise error(Eintr); immidiately and would not sleep the
requested amount potentially cauing us to release active
dma memory too early! so we wrap the tsleep() call in a
while(waserror()) so we will at least wait the Abortdelay
amount if error is raised.
also, only try to idle the still active td's.
do not copy data in epio() when there was an error, theres
no reason to touch user buffer in that case.
for uhci, we also check that theres not more data in the
buffers than requested to avoid overflowing user buffer
in epio(). this should not happen but we'r paranoid.
for ehci, we also halt the queue head first in aborttds().
mark the queue heads as Qfree after unlinking and remove
some silly nil checks that are impossible.
the driver should work for standard sdhc
(see http://www.sdcard.org/) controllers,
but matches for the ricoh controller only
as it was the only one i have for testing.
- same problem with wstat, if we error then owner has been already updated...
- avoid smalloc while holding qlock in wstat, do it before
- pikeshedd style...
- wstat would half update the Srv data structure if name was too long
- srvname() walked the linked srv list without holding srv qlock
- dont access sp->chan while not holding srv qlock in srvopen()
- dont modify sp->owner while not holding srv qlock in srvcreate()
- avoid smalloc() allocations while holding srv qlock
- style pikeshedding, sorry
when we fail to fork resources for the child due to resource
exhaustion, make the half forked child process call pexit()
to free the resources that where allocated and error out.
imagereclaim would happily uncache pages from the mountcache (port/cache.c)
without ever getting a Image* released from it. simple fix, just check for
p->image->notext but make sure todo it under the page lock :)
the compiler optimizes setting unused variables out, which is
problematic if they are used in waserror() handler which the
compiler isnt aware of. rearrange the code to avoid this problem.
catch potential interrupt error from kproc(). this can happen when
we run out of processes, then newproc() will call rsrcwait()
which does tsleep(). if the process gets a note, this might
raise a interrupt error.
get a bit more verbose about process image exhaustion and make
imagreclaim() try to get at least one image on the freelist.
use rsrcwait() to notify the state, and call freebroken() in
case imagereclaim() couldnt free any images.
the variables elem and file0 and commited are explicitely
set to avoid that they get freed in ther waserror() handlers.
but it turns out the compiler optimizes this out as he
thinks the variables arent used any further. (the compiler
is not aware of the waserror() / longjmp() semantics).
rearrange the code to account for this. instead of using
a local variable to check for point of no return (commited),
we use up->seg[SSEG] to figure it out.
for file0 and elem, we just rearrange the code. elem can be
checked in the error handler if it was already assigned to
up->text, and file0 is just free()'d after the poperror().
remove silly busy loop in sysrendez. it is not needed.
dequeueproc() will make sure that the process has come to
rest.
we now always use the new FXSAVE format in FPsave structure and fpregs
file, converting back and forth in fpx87save() and fpx87restore().
document that fprestore() is a destructive operation now.
change fp register definition in libmach and adapt fpr() acid funciton.
avoid unneccesary copy of fpstate and fpsave in sysfork(). functions
including syscalls do not preserve the fp registers and copying fpstate
from the current process would mean we had to fpsave(&up->fpsave); first.
simply not doing it, new process starts in FPinit state.
use resrcwait() when waiting for memory to become available. randomize
the sleep time and properly restore old process status in case tsleep()
gets interrupted.
filesystems do not handle i/o errors well (cwfs will abandon the blocks),
and temporary exhaustion of kernel memory (because of too many i/o's in
parallel) causes read and write on the partition to fail.
i think it is better to wait for the memory to become available in
this case. the single allocation is at max SDmaxio bytes, which makes
it likely to become available. if we havnt even enought fo that, then
rebooting the machine would be the best option. (aux/reboot)
the fist problem is that qopen() might return nil and that kstrdup() will
sleep, so we should try to avoid holding the mntalloc lock. so we move
the kstrdup() and qopen() calls before the Mnt allocation, and properly
recover the memory if we fail later.
the second problem was that we error(Eshort) after we already created the Mnt
when returnlen < sizeof(f.version). this check has to happen *before* we
even attempt to allocate the Mnt structures. note that we only copy the
version string once everything is in the clear, so the semantics of the
user buffer not being modified in case of error is not changed.
a little cleanup in muxclose(), getting rid of mntptfree()...
1 the config string was grabbed Aoehsz too far into the packet due to using the wrong pointer to start.
2 never accept a response with tag Tmgmt or Tfree.
3 defend against "malicious" responses; ones with a response Aoehdr.type != request Aoehdr.type. this previously could
cause the initiator to crash.
4 vendor commands were improperly filtered out.
the software cursor starts flickering and reacts bumby if a process
spends most of its time with drawlock acquired because the timer interrupt
thats supposed to redraw the cursor fails to acquire the lock at the time
the timer fires.
instead of trying to draw the cursor on the screen from a timer interrupt
30 times per second, devmouse now creates a process calling cursoron() and
cursoroff() when the cursor needs to be redrawn. this allows the swcursor
to schedule a redraw while holding the drawlock in swcursoravoid() and
cursoron()/cursoroff() are now able to wait for a qlock (drawlock) because
they get called from process context.
the overall responsiveness is also improved with this change as the cursor
redraw rate isnt limited to 30 times a second anymore.
from ehci spec:
The buffer pointer list in the qTD is long enough to support a maximum
transfer size of 20K bytes. This case occurs when all five buffer pointers
are used and the first offset is zero. A qTD handles a 16Kbyte buffer
with any starting buffer alignment.
the kernel uses fixed area (TSTKTOP, TSTKSIZ) of the address
space to temporarily map the new stack segment for exec. for
386 and arm, this area was right below the stack segment which
has the problem that the program can map arbitrary segments
there (even readonly).
alpha and ppc dont have this problem as they map the temporary
exec stack *above* the user reachable stack segement and segattach
prevents one from mapping anything above or overlaping the stack.
lots of arch code assumes USTKTOP being the end of userspace
address space and changing this to TSTKTOP would work, but results
in lots of hard to test changes.
instead, we'r going to map the temporary stack programmatically
finding a hole in the address space where to map it. we also lift
the size limitation for arguments and allow arguments to fill
the whole new stack segement.
the TSTKTOP and TSTKSIZ are not used anymore so they where removed.
references:
http://9fans.net/archive/2013/03/203http://9fans.net/archive/2013/03/202http://9fans.net/archive/2013/03/197http://9fans.net/archive/2013/03/195http://9fans.net/archive/2013/03/181
the kernel would go into endless loop when stating "stats" and "ifstats"
files and the network interface having no connections, or otherwise return
wrong stat info.
some controls are inverted. we reflect this by specifying
negative range in the volume table now and let genaudiovolread()
and genaudiovolwrite() do the conversion.
access to non standard serial port COM3 at i/o port 0x200 causes
kernel panic on some machines (Toshiba Sattelite 1415-S115). also,
some machines have gameport at 0x200.
i readded uartisa to the pcf and pccpuf kernel configurations so
one can use plan9.ini to add non standard uarts like:
uart2=type=isa port=0x200 irq=5
some devices freeze up with inqiry allocation length
other than 36 bytes. as we do not really care about
the vendor specific part of the inquiry, lets only do
36 byte inquiry for now.
the unit inquiry data might change in case the drive got pulled
with ahci. so keep track if we locked the ctl in a local stack
variable instead of relying on that the inquiry data stays the
same.
the syscallno check in syscallfmt() was wrong. the unsigned
syscall number was cast to an signed integer. so negative
values would pass the check provoking bad memory access from
kernel. the check also has an off by one. one has to check
syscallno >= nsyscalls instead of syscallno > nsyscalls.
access to the p->syscalltrace string was not protected
from modification in devproc. you could awake the process
and cause it to free the string giving an opportunity for
the kernel to access bad memory. or someone could kill the
process (pexit would just free it).
now the string is protected by the usual p->debug qlock. we
also keep the string arround until it is overwritten again
or the process exists. this has the nice side effect that
one can inspect it after the process crashed.
another problem was that our validaddr() would error() instead
of pexiting the current process. the code was changed to only
access up->s.args after it was validated and copied instead of
accessing the user stack directly. this also prevents a sneaky
multithreaded process from chaning the arguments under us.
in case our validaddr() errors, we cannot assume valid user
stack after the waserror() if block. use up->s.arg[0] for the
noted() call to avoid bad access.
newproc() didnt zero parentpid and kproc() didnt set it, so
kprocs ended up with random parent pid. this is harmless as
kprocs have no up->parent but it gives confusing results in
pstree(1).
now we zero parentpid in newproc(), and set it in sysrfork()
unless RFNOWAIT has been set.
assuming that this check tried to prevent the hostowner
from killing init, it is silly because init would just
handle the note.
with kbdfs, we actually want to send interrupt note to
the initial process group so instead of working arround
this with rfork(RFNOTEG|RFNAMEG), we remove the check.
avoid double entries in the cache for copen() and properly handle
locking so we wont just give up if we cant lock the Mntcache entry,
but drop the cache lock, qlock the Mntcache entry, and then recheck
the cache.
general cleanup (cdev -> ccache, use eqchantdqid())
the lock order of page.Lock -> palloc.hashlock was
violated in cachedel() which is called from the
pager. change the code to do it in the right oder
to prevent deadlock.
change lookpage to retry on false hit. i assume that
a false hit means:
a) we'r low on memory -> cached page got uncached/reused
b) duppage() got called on the page, meaning theres another
cached copy in the image now.
paging in is expensive compared to the hashtable lookup, so
i think retrying is better.
cleanup fixfault, adding comments.
swaped pages use a 8bit refcount where as the Page uses a 16bit one.
this might be exploited with having a process having a single page
swaped out and then forking 255 times to make the swap map refcount
overflow and panic the kernel.
this condition is probably very rare. so instead of doubling the
size of the swap map, we add a single 32bit refcount swapalloc.xref
which will keep the combined refcount of all swap map entries who
exceeded 255 references.
zero swapimage.c in setswapchan() after closing it as the stat() call
below might error leaving a dangeling pointer.
attachimage()'s approach to handling newseg() error is flawed:
a) the the image is on the hash table, but ref is still 0, and
there is no segment/pages attached to it so nobody is going to
reclaim / putimage() it -> leak
b) calling pexit() would deadlock us because exec has acquired
up->seglock when calling attachimage(), so this would just deadlock.
the fix does the following:
attachimage() will putimage() and nexterror() if newseg() fails
instead of pexit(). this is less surprising.
exec now keeps the condition variable commit which is set once
we are commited / reached the point of no return and check this
variable in the highest waserror() handler and pexit() us there.
this way we have released up all the locks and pexit() will
cleanup.
note: this bug shouldnt us hit in with the current newseg()
implementation as it uses smalloc() which would wait to
satisfy the allocation instead of erroring.
kstrcpy() did not null terminate for < 4 byte buffers. fixed,
but i dont think there is any case where this can happen in
practice.
always set malloctag in kstrdup(), cleanup.
always use ERRMAX bounded kstrcpy() to set up->errstr, q->err
and note[]->msg. paranoia.
instead of silently truncating interface name in netifinit(),
panic the kernel if interface name is too long as this case
is clearly a mistake.
panic kernel when filename is too long for addbootfile() in
devroot. this might happen if your kernel configuration is
messed up.
in devproc status read handler the p->status, p->text and p->user
could overflow the local statbuf buffer as they where copied into
it with code like: memmove(statbuf+someoff, p->text, strlen(p->text)).
now using readstr() which will truncate if the string is too long.
make strncpy() usage consistent, make sure results are always null
terminated.
we have to acquire p->seglock before we lock the individual
segments of the process and lock them. if we dont then pexit()
might free the segments before we can lock them causing the
"qunlock called with qlock not held, from ..." prints.
always make sure that there are child processes we can wait for
before sleeping.
put pwait() sleep into a loop and recheck. this is not strictly
neccesary but prevents accidents if there are spurious wakeups
or a bug.
once we set q->done = 1 in mountmux, the sleeper might return freeing q
so the wakeup might access invalid memory. we change the embedded Rendez
structure in the Mntrpc into a pointer to the sleeping procs up->sleep
rendez so the rendez is always going to be valid even if the rpc has been
freed.
the call to mntstats was moved before we set q->done also to prevent
accessing invalid memory.
the limit for overqueueing was too small for stuff like fcp
on a fileserver connected with a standard 32K limit pipe like
ramfs.
fcp usesd 8K*16procs > 32K*2
the biggest queue limit used in the kernel is 256K making
the maximum queue bloat 2.5MB or 320K for standard pipes.
that should be big enougth to never happen in practice
unless there is a bug which we like to catch before we
exhaust all kernel memory.