Commit graph

52 commits

Author SHA1 Message Date
cinap_lenrek
d25ca13ed8 kernel: make user stack segment non-executable 2019-08-27 04:04:46 +02:00
cinap_lenrek
0b5e782882 kernel: exec support for arm64 binaries 2019-05-03 23:15:42 +02:00
cinap_lenrek
fe594760eb kernel: get rid of checkpagerefs() debugging
was only implemented by the pc kernel. does not
account pages used by the mount cache.
2019-05-01 12:40:27 +02:00
cinap_lenrek
eed90aa0ad kernel: don't cap the minimum sleep time to TK2MS(1) for syssleep()
on HZ 100 systems like pc and pc64, the minium sleep time
was 10ms, which is quite high. the cap isnt really needed
as arch specific timerset() enforces its own limit, but on
a higher resolution.

background:

from Charles Forsyth:

I haven't really got an opinion on it. The 10ms interval was first used on
machines that were much slower.
I thought someone did set HZ to a bigger value, partly to support better
in-kernel timing. I haven't done it because I never had a need for it.
If I were doing (say) protocol implementation in user mode, I'd certainly
reconsider. Sleep itself forces at best ms granularity,
and for some applications that's too big.

initial mail from qwx raising the issue:

> Hello,
>
> I found out recently that sleep(2)'s resolution on 386 and 9front's amd64
> kernel is 10 ms rather than 1 ms.  The reason is that on those kernels,
> HZ is set to 100 rather than say 1000.  In syssleep, we get 1 tich every
> 10 ms.
>
> What is unclear is why.
>
> To paraphrase cinap_lenrek's answer to my question:
>
> In syssleep:
>                 if(ms < TK2MS(1))
>                         ms = TK2MS(1);
>                 tsleep(&up->sleep, return0, 0, ms);
>
> "TK2MS(1)" can be replaced with just "1", and the arch specific
> timerset() routine would do its own capping of the period if it's too
> small for the timer resolution, and make better decisions based on what
> the minimum timer period should be given the latency overhead of the
> given arch's interrupt handling and performance characteristics.
>
> Alternatively, HZ could be raised to 500 or 1000.
>
> It seems it's just trying to prevent excessive context switches and
> interrupts, but it seems somewhat arbitrary.  A ton of syscalls can be
> done in 1 ms, and it's the lowest we can go without changing the unit.
>
>
> What do you think?
>
> Thanks in advance,
>
> qwx
2018-06-10 19:47:21 +02:00
cinap_lenrek
2723c9fc77 kernel: add support for sticky segments (cached, preallocated, never paged) 2017-06-20 21:53:45 +02:00
cinap_lenrek
1848f4e946 kernel: tsemacquire() use MACHP(0)->ticks for time delta
we might wake up on a different cpu after the sleep so
delta from machX->ticks - machY->ticks can become negative
giving spurious timeouts. to avoid this always use the
same mach 0 tick counter for the delta.
2016-09-07 23:36:04 +02:00
cinap_lenrek
1057a859b8 devsegment: cleanups
- return distinct error message when attempting to create Globalseg with physseg name
- copy directory name to up->genbuf so it stays valid after we unlock(&glogalseglock)
- cleanup wstat() handling, allow changing uid
- make sure global segment size is below SEGMAXSIZE
- move isoverlap() check from globalsegattach() into segattach()
- remove Proc* argument from globalsegattach(), segattach() and isoverlap()
- make Physseg.attr and segattach attr parameter an int for consistency
2016-03-30 22:49:13 +02:00
cinap_lenrek
d19144155e kernel: missing changes for ibrk() prototype 2015-12-21 04:49:29 +01:00
cinap_lenrek
87d7a3c875 kernel: have to validate argv[] again when copying to the new stack
we have to validaddr() and vmemchr() all argv[] elements a second
time when we copy to the new stack to deal with the fact that another
process can come in and modify the memory of the process doing the
exec. so the argv[] strings could have changed and increased in
length. we just make sure the data being copied will fit into the
new stack and error when we would overflow.

also make sure to free the ESEG in case the copy pass errors.
2015-08-06 13:20:41 +02:00
cinap_lenrek
281729551f kernel: limit argv[] strings to the USTKSIZE to avoid overflow
argv[] strings get copied to the new processes stack segment, which
has a maximum size of USTKSIZE, so limit the size of the strings to
that and check early for overflow.
2015-08-06 11:51:23 +02:00
cinap_lenrek
b09cd67860 kernel: validnamedup() the name argument for segattach()
this moves the name validation out of segattach() to syssegattach()
to make sure the segment name cannot be changed by the user while
segattach looks at it.
2015-08-06 11:48:51 +02:00
cinap_lenrek
86eb8ea6bb kernel: change vmemchr() length argument to ulong and simplify 2015-08-06 10:15:07 +02:00
cinap_lenrek
9110ae6eae kernel: make shargs() function static in sysproc.c 2015-08-06 09:09:57 +02:00
cinap_lenrek
2acb02f29b kernel: reject empty argv (argv[0] == nil) in sysexec()
when executing a script, we did advance argp0 unconditionally
to replace argv[0] with the script name. this fails when
argv[] is empty, then we'd advance argp0 past the nil terminator.

the alternative would be to *not* advance if *argp0 == nil, but that
would require another validaddr() check for a case that is unlikely
to have been anticipated in most programs being invoked as
libc's ARGBEGIN macro assumes argv[0] being non-nil as it also
unconditionally advances the argv pointer.

to keep us sane, we now reject an empty argv[]. on entry, we
verify that argv[] is valid for at least two elements:
- the program name argv[0], has to be non-nil
- the first potential nil terminator in argv[1]

when argv[0] == nil, we throw Ebadarg "bad arg in system call"
2015-08-06 08:47:38 +02:00
cinap_lenrek
4ec93f94c9 kernel: use HDR_MAGIC constant to handle Exec header extension, make rebootcmd() handle AOUT_MAGIC macro 2015-07-10 23:56:39 +02:00
cinap_lenrek
3ca9ac70c4 sysexec(): need () arround AOUT_MAGIC comparsion to handle #define hack on mips 2015-07-09 08:51:38 +02:00
cinap_lenrek
e3217c6f6a sysexec(): make the mips compiler happy 2015-07-09 08:34:20 +02:00
cinap_lenrek
9ab096a707 kernel: reject bogus two byte "#!" shell scripts in sysexec()
- reject files smaller or equal to two bytes, they are bogus
- fix out of bounds access in shargs() when n <= 2
- only copy the bytes read into line buffer
- use nil for pointers instead of 0
2015-07-09 08:03:18 +02:00
cinap_lenrek
461c2b68a1 kernel: fixed segment support (for fpga experiments)
fixed segments are continuous in physical memory but
allocated in user pages. unlike shared segments, they
are not allocated on demand but the pages are allocated
on creation time (devsegment). fixed segments are
never swapped out, segfreed or resized and can only be
destroyed as a whole.

the physical base address can be discovered by userspace
reading the ctl file in devsegment.
2015-04-12 22:30:30 +02:00
cinap_lenrek
8caec8564d vl, libmach, kernel: mips has 16K alignment for segments (for bigpages) 2015-03-22 17:49:28 +01:00
cinap_lenrek
fcc336b902 kernel: catch address overflow in syssegfree()
the "to" address can overflow in syssegfree() causing wrong
number of pages to be passed to mfreeseg(). with the current
implementation of mfreeseg() however, this doesnt cause any
data corruption but was just freeing an unexpected number of
pages.

this change checks for this condition in syssegfree() and
errors out instead. also mfreeseg() was changed to take
ulong argument for number of pages instead of int to keep
it consistent with other routines that work with page counts.
2015-03-07 18:59:06 +01:00
cinap_lenrek
eaf91d0f8e kernel: fix physical segment handling
ignore physical segments in mcountseg() and mfreeseg(). physical
segments are not backed by user pages, and doing putpage() on
physical segment pages in mfreeseg() is an error.

do now allow physical segemnts to be resized. the segment size
is only checked in segattach() to be within the physical segment!

ignore physical segments in portcountpagerefs() as pagenumber()
does not work on the malloced page structures of a physical segment.

get rid of Physseg.pgalloc() and Physseg.pgfree() indirection as
this was never used and if theres a need to do more efficient
allocation, it should be done in a portable way.
2015-03-03 13:08:29 +01:00
cinap_lenrek
acd15f13c4 pc64: put return value of nsec syscall in register on amd64
WHAT WHERE THEY *THINKING*??!?!

unlike seek, the (new) nsec syscall (not used in 9front libc)
returns the time value in register (from nix), so do the same
for compatibility.
2014-09-20 01:07:46 +02:00
cinap_lenrek
3b661a96ef kernel: make noswap flag exclude processes from killbig() if not eve, reset noswap flag on exec 2014-08-17 00:50:20 +02:00
cinap_lenrek
d4d86df2ab kernel: new pagecache, remove Lock from page, use cmpswap for Ref instead of Lock
make the Page stucture less than half its original size by getting rid of
the Lock and the lru.

The Lock was required to coordinate the unchaining of pages that where
both cached and on the lru freelist.

now pages have a single next pointer that is used for palloc.head
freelist xor for page cache hash chains in Image.pghash[].

cached pages are not on the freelist anymore, but will be reclaimed
from images by the pager when the freelist runs out of pages.

each Image has its own 512 hash chains for cached page lookup. That is
2MB worth of pages and there should be no collisions for most text images.

page reclaiming can be done without holding palloc.lock as the Image is
the owner of the page hash chains protected by the Image's lock.

reclaiming Image structures can be done quickly by only reclaiming pages from
inactive images, that is images which are not currently in use by segments.

the Ref structure has no Lock anymore. Only a single long that is atomically
incremented or decremnted using cmpswap().

there are various other changes as a consequence code. and lots of pikeshedding,
sorry.
2014-06-22 15:12:45 +02:00
cinap_lenrek
3207e8b6a4 add _nsec() syscall 53 for binary compatibility with labs distribution
the new syscall is added under the symbol _nsec() for
binary compatibility.

nsec() is still a library function reading /dev/bintime.
2014-05-20 05:06:31 +02:00
cinap_lenrek
868a262bb8 pc64: dont 4 byte align stack pointer for amd64 in sysexec() 2014-02-05 19:48:36 +01:00
cinap_lenrek
0cdb32cc18 kernel: fix bogus free in sysexec.
we free the wrong pointer in the waserror() block.
2014-02-02 15:11:19 +01:00
cinap_lenrek
7613608b23 kernel: handle amd64 40 byte headers in exec() 2014-02-01 10:16:55 +01:00
cinap_lenrek
ad1eefb355 kernel: various cleanups 2014-01-20 02:16:42 +01:00
cinap_lenrek
6c2e983d32 kernel: apply uintptr for ulong when a pointer is stored
this change is in preparation for amd64. the systab calling
convention was also changed to return uintptr (as segattach
returns a pointer) and the arguments are now passed as
va_list which handles amd64 arguments properly (all arguments
are passed in 64bit quantities on the stack, tho the upper
part will not be initialized when the element is smaller
than 8 bytes).

this is partial. xalloc needs to be converted in the future.
2014-01-20 00:47:55 +01:00
cinap_lenrek
afc2d547e1 kernel: make sure user text, data and bss wont overlap the stack segment in sysexec() 2013-12-29 06:11:18 +01:00
cinap_lenrek
62b3eea271 syssem*: eleminate redundant validaddr() checks
validaddr looks up the segments for an address range
and checks the flags and if the address range lies
within bounds on the segments.

as we'r going to lookup the segment in the syssem*
syscalls anyway, we can do the checks ourselfs avoiding
the double segment array lookups.

the implication of this tho is that now a semaphore cannot
span multiple segments. but this would be highly unusual
given that segments are page aligned.
2013-09-24 01:52:20 +02:00
cinap_lenrek
34cd9dc4c4 kernel: reset up->setargs on sysexec(), fix race with devproc
up->setargs wasnt reset in sysexec(). also, up->args should only
be exchanged/freed under up->debug qlock. otherwise double free
could happen.
2013-09-18 01:07:06 +02:00
cinap_lenrek
177c175fda kernel: allow sysr1 debugging only for hostowner 2013-06-10 01:09:52 +02:00
cinap_lenrek
8cce104fcb kernel: sysrfork abortion
when we fail to fork resources for the child due to resource
exhaustion, make the half forked child process call pexit()
to free the resources that where allocated and error out.
2013-05-28 23:41:54 +02:00
cinap_lenrek
3e567afed5 kernel: fix sysexec() error handling compiler problem, sysrendez() busyloop
the variables elem and file0 and commited are explicitely
set to avoid that they get freed in ther waserror() handlers.

but it turns out the compiler optimizes this out as he
thinks the variables arent used any further. (the compiler
is not aware of the waserror() / longjmp() semantics).

rearrange the code to account for this. instead of using
a local variable to check for point of no return (commited),
we use up->seg[SSEG] to figure it out.

for file0 and elem, we just rearrange the code. elem can be
checked in the error handler if it was already assigned to
up->text, and file0 is just free()'d after the poperror().

remove silly busy loop in sysrendez. it is not needed.
dequeueproc() will make sure that the process has come to
rest.
2013-05-27 00:59:43 +02:00
cinap_lenrek
257c7e958e keep fpregs always in sse (FXSAVE) format, adapt libmach and acid files for new format
we now always use the new FXSAVE format in FPsave structure and fpregs
file, converting back and forth in fpx87save() and fpx87restore().

document that fprestore() is a destructive operation now.

change fp register definition in libmach and adapt fpr() acid funciton.

avoid unneccesary copy of fpstate and fpsave in sysfork(). functions
including syscalls do not preserve the fp registers and copying fpstate
from the current process would mean we had to fpsave(&up->fpsave); first.
simply not doing it, new process starts in FPinit state.
2013-05-26 22:41:40 +02:00
cinap_lenrek
f37465fd7f sysexec: fix possible segment overlap with temporary stack
the kernel uses fixed area (TSTKTOP, TSTKSIZ) of the address
space to temporarily map the new stack segment for exec. for
386 and arm, this area was right below the stack segment which
has the problem that the program can map arbitrary segments
there (even readonly).

alpha and ppc dont have this problem as they map the temporary
exec stack *above* the user reachable stack segement and segattach
prevents one from mapping anything above or overlaping the stack.

lots of arch code assumes USTKTOP being the end of userspace
address space and changing this to TSTKTOP would work, but results
in lots of hard to test changes.

instead, we'r going to map the temporary stack programmatically
finding a hole in the address space where to map it. we also lift
the size limitation for arguments and allow arguments to fill
the whole new stack segement.

the TSTKTOP and TSTKSIZ are not used anymore so they where removed.

references:

http://9fans.net/archive/2013/03/203
http://9fans.net/archive/2013/03/202
http://9fans.net/archive/2013/03/197
http://9fans.net/archive/2013/03/195
http://9fans.net/archive/2013/03/181
2013-03-16 02:37:07 +01:00
cinap_lenrek
4b4070a8b9 ratrace: fix race conditions and range check
the syscallno check in syscallfmt() was wrong. the unsigned
syscall number was cast to an signed integer. so negative
values would pass the check provoking bad memory access from
kernel. the check also has an off by one. one has to check
syscallno >= nsyscalls instead of syscallno > nsyscalls.

access to the p->syscalltrace string was not protected
from modification in devproc. you could awake the process
and cause it to free the string giving an opportunity for
the kernel to access bad memory. or someone could kill the
process (pexit would just free it).

now the string is protected by the usual p->debug qlock. we
also keep the string arround until it is overwritten again
or the process exists. this has the nice side effect that
one can inspect it after the process crashed.

another problem was that our validaddr() would error() instead
of pexiting the current process. the code was changed to only
access up->s.args after it was validated and copied instead of
accessing the user stack directly. this also prevents a sneaky
multithreaded process from chaning the arguments under us.

in case our validaddr() errors, we cannot assume valid user
stack after the waserror() if block. use up->s.arg[0] for the
noted() call to avoid bad access.
2012-11-23 20:27:09 +01:00
cinap_lenrek
6c8097a84d fix spurious kproc ppid
newproc() didnt zero parentpid and kproc() didnt set it, so
kprocs ended up with random parent pid. this is harmless as
kprocs have no up->parent but it gives confusing results in
pstree(1).

now we zero parentpid in newproc(), and set it in sysrfork()
unless RFNOWAIT has been set.
2012-11-07 20:46:30 +01:00
cinap_lenrek
2f732e9a85 kernel: attachimage / exec error handling
attachimage()'s approach to handling newseg() error is flawed:

a) the the image is on the hash table, but ref is still 0, and
there is no segment/pages attached to it so nobody is going to
reclaim / putimage() it -> leak

b) calling pexit() would deadlock us because exec has acquired
up->seglock when calling attachimage(), so this would just deadlock.

the fix does the following:

attachimage() will putimage() and nexterror() if newseg() fails
instead of pexit(). this is less surprising.

exec now keeps the condition variable commit which is set once
we are commited / reached the point of no return and check this
variable in the highest waserror() handler and pexit() us there.

this way we have released up all the locks and pexit() will
cleanup.

note: this bug shouldnt us hit in with the current newseg()
implementation as it uses smalloc() which would wait to
satisfy the allocation instead of erroring.
2012-10-14 19:48:46 +02:00
cinap_lenrek
9e7ecc41d5 devproc buffer overflow, strncpy
in devproc status read handler the p->status, p->text and p->user
could overflow the local statbuf buffer as they where copied into
it with code like: memmove(statbuf+someoff, p->text, strlen(p->text)).
now using readstr() which will truncate if the string is too long.

make strncpy() usage consistent, make sure results are always null
terminated.
2012-10-01 02:52:05 +02:00
aiju
5ba4ccd30e fixed RFNOMNT 2012-08-27 17:50:48 +02:00
cinap_lenrek
49ac0b93d3 add tsemacquire syscall for go 2012-07-29 20:26:49 +02:00
cinap_lenrek
9d60d8262e fix potential double ready in postnote() for rendezvous 2012-02-06 00:23:38 +01:00
cinap_lenrek
8ef32ed38c fix double free in exec 2012-01-23 05:12:05 +01:00
cinap_lenrek
2450b55c7b kernel: add pidalloc() and reuse pid once the counter wraps arround 2011-12-20 22:22:08 +01:00
cinap_lenrek
3fce94e785 fix _tos->pcycles, make _tos->kcycles actually count cycles executing kernel code on behalf of the process 2011-10-25 20:17:39 +02:00
cinap_lenrek
c6c2e04d4a segdesc: add /dev/^(ldt gdt) support 2011-07-12 15:46:22 +02:00