Commit graph

71 commits

Author SHA1 Message Date
cinap_lenrek
29f60cace1 kernel: avoid palloc lock during mmurelease()
Previously, mmurelease() was always called with
palloc spinlock held.

This is unneccesary for some mmurelease()
implementations as they wont release pages
to the palloc pool.

This change removes pagechainhead() and
pagechaindone() and replaces them with just
freepages() call, which aquires the palloc
lock internally as needed.

freepages() avoids holding the palloc lock
while walking the linked list of pages,
avoding some lock contention.
2020-12-22 16:29:55 +01:00
cinap_lenrek
e4ce6aadac kernel: handle tos and per process pcycle counters in port/
we might as well handle the per process cycle
counter in the portable part instead of duplicating the code
in every arch and have inconsistent implementations.

we now have a portable kenter() and kexit() function,
that is ment to be used in trap/syscall from user,
which updates the counters.

some kernels missed initializing Mach.cyclefreq.
2020-12-20 22:34:41 +01:00
cinap_lenrek
58e6750401 kernel: remove Proc* argument from procsetuser() function 2020-12-19 18:07:12 +01:00
cinap_lenrek
0b33b3b8ad kernel: implement per file descriptor OCEXEC flag, reject ORCLOSE when opening /fd, /srv and /shr
The OCEXEC flag used to be maintained per channel,
making it shared between all the file desciptors.

This has a unexpected side effects with regard to
channel passing drivers such as devdup (/fd),
devsrv (/srv) and devshr (/shr).

For example, opening a /srv file with OCEXEC
makes it impossible to be remounted by exportfs
as it internally does a exec() to mount and
re-export it. There is no way to reset the flag.

This change makes the OCEXEC flag per file descriptor,
so a open with the OCEXEC flag only affects the fd
group of the calling process, and not the channel
itself.

On rfork(RFFDG), the per file descriptor flags get
copied.

On dup(), the per file descriptor flags are reset.

The second modification is that /fd, /srv and /shr
should reject the ORCLOSE flag, as the files that
are returned have already been opend.
2020-12-13 16:04:09 +01:00
cinap_lenrek
0ba91ae22a pc, pc64: allocate i/o port space for unassigned pci bars, move ioalloc() to port/iomap.c
With some newer UEFI firmware, not all pci bars get
programmed and we have to assign them ourselfs.

This was already done for memory bars. This change
adds the same for i/o port space, by providing a
ioreservewin() function which can be used to allocate
port space within the parent pci-pci bridge window.

Also, the pci code now allocates the pci config
space i/o ports 0xCF8/0xCFC so userspace needs to
use devpnp to access pci config space now. (see
latest realemu change).

Also, this moves the ioalloc()/iofree() code out
of devarch into port/iomap.c as it can be shared
with the ppc mtx kernel.
2020-11-03 20:46:09 +01:00
cinap_lenrek
61a062ee9f kernel: improve page reclaimation strategy and locking
when reclaiming pages from an image, always reclaim all
the hash chains equally. that way, we avoid being biased
towards the chains at the start of the Image.pghash[] array.

images can be in two states: active or inactive. inactive
images are the ones which are not used by program while
active ones aare.

when reclaiming pages, we should try to reclaim pages
from inactive images first and only if that set becomes
exhausted attempt to release text pages and attempt to
reclaim pages from active images.

when we run out of Image structures, it makes only sense
to reclaim pages from inactive images, as reclaiming pages
from active ones will never free any Image structures.

change putimage() to require a image already locked and
make it unlock the image. this avoids many pointless
unlock()/lock() sequences as all callers of putimage()
already had the image locked.
2020-04-26 19:54:46 +02:00
cinap_lenrek
1aa80c1d10 kernel: remove unused mem2bl() prototype 2020-04-12 16:11:41 +02:00
cinap_lenrek
8debb0736e kernel: add portable memory map code (port/memmap.c)
This is a generic memory map for physical addresses. Entries
can be added with memmapadd() giving a range and a type.
Ranges can be allocated and freed from the map. The code
automatically resolves overlapping ranges by type priority.
2020-04-04 16:04:27 +02:00
cinap_lenrek
4a80d9d029 kernel: fix multiple devproc bugs and pid reuse issues
devproc assumes that when we hold the Proc.debug qlock,
the process will be prevented from exiting. but there is
another race where the process has already exited and
the Proc* slot gets reused. to solve this, on process
creation we also have to acquire the debug qlock while
initializing the fields of the process. this also means
newproc() should only initialize fields *not* protected
by the debug qlock.

always acquire the Proc.debug qlock when changing strings
in the proc structure to avoid doublefree on concurrent
update. for changing the user string, we add a procsetuser()
function that does this for auth.c and devcap.

remove pgrpnote() from pgrp.c and replace by static
postnotepg() in devproc.

avoid the assumption that the Proc* entries returned by
proctab() are continuous.

fixed devproc permission issues:
	- make sure only eve can access /proc/trace
	- none should only be allowed to read its own /proc/n/text
	- move Proc.kp checks into procopen()

pid reuse was not handled correctly, as we where only
checking if a pid had a living process, but there still
could be processes expecting a particular parentpid or
noteid.

this is now addressed with reference counted Pid
structures which are organized in a hash table.
read access to the hash table does not require locks
which will be usefull for dtracy later.
2020-02-23 18:00:21 +01:00
cinap_lenrek
8d51e7fa1a kernel: implement portable userinit() and simplify process creation
replace machine specific userinit() by a portable
implemntation that uses kproc() to create the first
process. the initcode text is mapped using kmap(),
so there is no need for machine specific tmpmap()
functions.

initcode stack preparation should be done in init0()
where the stack is mapped and can be accessed directly.

replacing the machine specific userinit() allows some
big simplifications as sysrfork() and kproc() are now
the only callers of newproc() and we can avoid initializing
fields that we know are being initialized by these
callers.

rename autogenerated init.h and reboot.h headers.
the initcode[] and rebootcode[] blobs are now in *.i
files and hex generation was moved to portmkfile. the
machine specific mkfile only needs to specify how to
build rebootcode.out and initcode.out.
2020-01-26 19:01:36 +01:00
cinap_lenrek
13785bbbef pc: replace duplicated and broken mmu flush code in vunmap()
comparing m with MACHP() is wrong as m is a constant on 386.

add procflushothers(), which flushes all processes except up
using common procflushmmu() routine.
2019-12-07 02:19:14 +01:00
cinap_lenrek
24d1fbde27 kernel: simplify pgrpnote(); moving the note string copying to procwrite()
keeps handling of devproc's note and notepg files similar and in the
same place and reduces stack usage.
2019-09-19 02:07:46 +02:00
cinap_lenrek
2149600d12 kernel: catch execution read fault on SG_NOEXEC segment
fault() now has an additional pc argument that is
used to detect fault on a non-executable segment.
that is, we check on read fault if the segment
has the SG_NOEXEC attribute and the program counter
is within faulting page.
2019-08-27 03:47:18 +02:00
cinap_lenrek
fe594760eb kernel: get rid of checkpagerefs() debugging
was only implemented by the pc kernel. does not
account pages used by the mount cache.
2019-05-01 12:40:27 +02:00
cinap_lenrek
b452f8857f kernel: export freepages() function so it can be used in mmurelease() 2019-05-01 10:07:39 +02:00
cinap_lenrek
26d36c3ae2 devswap: simplify, don't panic when writing swapfile fails
always start the pager kproc in swapinit(), simplifying kickpager().

allow zero conf.nswap and conf.nswppo. avoid allocating the reference
map and iolist arrays in that case.

use ulong for ioptr and iolist indices.

don't panic when writing pages out to the swapfile fails. just
requeue the page in the io transaction list so we will try
again next time executeio() is run or just free the page when
the swap reference was dropped.

remove unused pagersummary() function.
2019-01-22 22:06:42 +01:00
cinap_lenrek
5da4f0fc0f sdram: experimental ramdisk driver
this driver makes regions of physical memory accessible as a disk.

to use it, ramdiskinit() has to be called before confinit(), so
that conf.mem[] banks can be reserved. currently, only pc and pc64
kernel use it, but otherwise the implementation is portable.

ramdisks are not zeroed when allocated, so that the contents are
preserved across warm reboots.

to not waste memory, physical segments do not allocate Page structures
or populate the segment pte's anymore. theres also a new SG_CHACHED
attribute.
2018-05-27 22:59:19 +02:00
cinap_lenrek
b437065950 stats: show amount of reclaimable pages (add -r flag)
reclaimable pages are user pages that are used for
caches like the image cache, mount cache and swap cache.
2018-01-05 00:52:14 +01:00
cinap_lenrek
f3f9392517 kernel: introduce devswap #¶ to serve /dev/swap and handle swapfile encryption 2017-10-29 23:09:54 +01:00
aiju
773be02aa1 kernel: add support for hardware watchpoints 2017-06-12 19:03:07 +00:00
cinap_lenrek
0c1110ace2 kernel: fix twakeup()/timerdel() race condition
timerdel() did not make sure that the timer function
is not active (on another cpu). just acquiering the
Timer lock in the timer function only blocks the caller
of timerdel()/timeradd() but not the other way arround
(on a multiprocessor).

this changes the timer code to track activity of
the timer function, having timerdel() wait until
the timer has finished executing.
2017-03-29 00:30:53 +02:00
cinap_lenrek
47f07b2669 kernel: make the mntcache robust against fileserver like fossil that do not change the qid.vers on wstat
introducing new ctrunc() function that invalidates any caches
for the passed in chan, invoked when handling wstat with a
specified file length or on file creation/truncation.

test program to reproduce the problem:

#include <u.h>
#include <libc.h>
#include <libsec.h>

void
main(int argc, char *argv[])
{
	int fd;
	Dir *d, nd;

	fd = create("xxx", ORDWR, 0666);
	write(fd, "1234", 4);
	d = dirstat("xxx");
	assert(d->length == 4);
	nulldir(&nd);
	nd.length = 0;
	dirwstat("xxx", &nd);
	d = dirstat("xxx");
	assert(d->length == 0);
	fd = open("xxx", OREAD);
	assert(read(fd, (void*)&d, 4) == 0);
}
2017-01-12 20:13:20 +01:00
cinap_lenrek
c86b5ddaa6 kernel/qio: make readblist() offset of type ulong as the rest 2016-11-12 17:41:58 +01:00
cinap_lenrek
a54d1cd95e kernel/qio: big cleanup of qio functions
remove bl2mem(), it is broken. a fault while copying to memory
yields a partially freed block list. it can be simply replaced
by readblist() and freeblist(), which we also use for qcopy()
now.

remove mem2bl(), and handle putting back remainer from a short
read internally (splitblock()) avoiding the releasing and re-
acquiering of the ilock.

always attempt to free blocks outside of the ilock.

have qaddlist() return the number of bytes enqueued, which
avoids walking the block list twice.
2016-11-07 22:20:10 +01:00
cinap_lenrek
963497f06b kernel: avoid padblock copying for devtls/devssl/esp, cleanup debugging
to avoid copying in padblock() when adding cryptographics macs to a block
in devtls/devssl/esp we reserve 16 extra bytes to the allocation.

remove qio ixsummary() function and add acid function qiostats() to
/sys/lib/acid/kernel

simplify iallocb(), remove iallocsummary() statitics.
2016-11-05 20:05:40 +01:00
cinap_lenrek
10275ad6dd kernel: xoroshiro128+ generator for rand()/nrand()
the kernels custom rand() and nrand() functions where not working
as specified in rand(2). now we just use libc's rand() and nrand()
functions but provide a custom lrand() impelmenting the xoroshiro128+
algorithm as proposed by aiju.
2016-09-11 02:10:25 +02:00
cinap_lenrek
0a5f81a442 kernel: switch to fast portable chacha based seed-once random number generator 2016-08-27 20:42:31 +02:00
cinap_lenrek
0f97eb3a60 kernel: add secalloc() and secfree() functions for secret memory allocation
The kernel needs to keep cryptographic keys and cipher states
confidential. secalloc() allocates memory from the secret pool
which is protected from debuggers reading the memory thru devproc.
secfree() releases the memory, overriding the data with garbage.
2016-08-27 20:33:03 +02:00
cinap_lenrek
1057a859b8 devsegment: cleanups
- return distinct error message when attempting to create Globalseg with physseg name
- copy directory name to up->genbuf so it stays valid after we unlock(&glogalseglock)
- cleanup wstat() handling, allow changing uid
- make sure global segment size is below SEGMAXSIZE
- move isoverlap() check from globalsegattach() into segattach()
- remove Proc* argument from globalsegattach(), segattach() and isoverlap()
- make Physseg.attr and segattach attr parameter an int for consistency
2016-03-30 22:49:13 +02:00
cinap_lenrek
04c3a6f66e zynq: introduce SG_FAULT to prevent access to AXI segment while PL is not ready
access to the axi segment hangs the machine when the fpga
is not programmed yet. to prevent access, we introduce a
new SG_FAULT flag, that when set on the Segment.type or
Physseg.attr, causes the fault handler to immidiately
return with an error (as if the segment would not be mapped).

during programming, we temporarily set the SG_FAULT flag
on the axi physseg, flush all processes tlb's that have
the segment mapped and when programming is done, we clear
the flag again.
2016-03-27 20:57:01 +02:00
cinap_lenrek
595501b005 kernel: make fversion()/mntversion() types consistent 2016-03-10 03:02:28 +01:00
cinap_lenrek
d19144155e kernel: missing changes for ibrk() prototype 2015-12-21 04:49:29 +01:00
cinap_lenrek
7f3659e78f kernel: cleanup exit()/shutdown()/reboot() code
introduce cpushutdown() function that does the common
operation of initiating shutdown, returning once all
cpu's got the message and are about to shutdown. this
avoids duplicated code which isnt really machine specific.

automatic reboot on panic only when *debug= is not set
and the machine is a cpu server or has no display,
otherwise just hang.
2015-11-30 14:56:00 +01:00
cinap_lenrek
9f4eac5292 kernel: pgrpcpy(), simplify Mount structure
instead of ordering the source mount list, order the new destination
list which has the advantage that we do not need to wlock the source
namespace, so copying can be done in parallel and we do not need the
copy forward pointer in the Mount structure.

the Mhead back pointer in the Mount strcture was unused, removed.
2015-08-09 21:16:10 +02:00
cinap_lenrek
86eb8ea6bb kernel: change vmemchr() length argument to ulong and simplify 2015-08-06 10:15:07 +02:00
cinap_lenrek
4bd9ed80c3 kernel: export mntattach() from devmnt.c avoiding bogus struct passing and special case in namec()
we already export mntauth() and mntversion(), so why not stop
being sneaky and just export mntattach() so bindmount() and
devshr can just call it directly with proper arguments being
checked.

we can also avoid handling #M attach specially in namec()
by having the devmnt's attach function do error(Enoattach).
2015-07-28 09:52:21 +02:00
cinap_lenrek
6617c63a37 kernel: pipelined read ahead for the mount cache
this changes devmnt adding mntrahread() function and some helpers
for it to do pipelined sequential read ahead for the mount cache.

basically, cread() calls mntrahread() with Mntrah structure and it
figures out if we where reading sequentially and if thats the case
issues reads of c->iounit size in advance.

the read ahead state (Mntrah) is kept in the mount cache so we can
handle (read ahead) cache invalidation in the presence of writes.
2015-07-26 05:43:26 +02:00
cinap_lenrek
8ed25f24b7 kernel: various cleanups of imagereclaim(), pagereclaim(), freepages(), putimage()
imagereclaim(), pagereclaim():
- move imagereclaim() and pagereclaim() declarations to portfns.h
- consistently use ulong type for page counts
- name number of pages to free "pages" instead of "min"
- check for pages == 0 on entry

freepages():
- move pagechaindone() call to wakeup newpage() consumers inside
  palloc critical section.

putimage():
- use long type for refcount
2015-07-09 00:01:50 +02:00
cinap_lenrek
64ed3658d2 kernel: add pagechaindone() to wakeup processes waiting for memory
we keep the details about palloc in page.c, providing pagechaindone()
for mmu code to be called after a series of pagechainhead() calls.
2015-06-15 17:40:47 +02:00
cinap_lenrek
8a3b388ffe kernel: implement separate wait queues for page allocation
give kernel processes and local disk file servers (procs
having noswap flag set) a clear advantage for page allocation
under starved condition by giving them ther own wait queue so
they get readied as soon as pages become available.
2015-06-15 16:05:00 +02:00
cinap_lenrek
46070c3122 kernel: add segio() function for reading/writing segments
devproc's procctlmemio() did not handle physical segment
types correctly, as it assumed it can just kmap() the page
in question and write to it. physical segments however
need to be mapped uncached but kmap() will always map
cached as it assumes normal memory. on some machines with
aliasing memory with different cache attributes
leads to undefined behaviour!

we borrow the code from devsegment and provide a generic
segio() function to read and write user segments which
handles all the cases without using kmap by just spawning
a kproc that attaches the segment that needs to be read
from or written to. fault() will setup the right mmu
attributes for us. it will also properly flush pages for
segments that maintain instruction cache when written.
however, tlb's have to be flushed separately.

segio() is used for devsegment and devproc now, which
also allows for simplification of fixfault() as there is no
special error handling case anymore as fixfault() is now
called from faulting process *only*.

reads from /proc/$pid/mem can now span multiple pages.
2015-04-16 00:45:25 +02:00
cinap_lenrek
972cd5e3fc kernel: get rid of auxpage() and preserve cache index bits in Page.va in mount cache
the mount cache uses Page.va to store cached range offset and
limit, but mips kernel uses cache index bits from Page.va to
maintain page coloring. Page.va was not initialized by auxpage().

this change removes auxpage() which was primarily used only
by the mount cache and use newpage() with cache file offset
page as va so we will get a page of the right color.

mount cache keeps the index bits intact by only using the top
and buttom PGSHIFT bits of Page.va for the range offset/limit.
2015-03-16 05:46:08 +01:00
cinap_lenrek
fcc336b902 kernel: catch address overflow in syssegfree()
the "to" address can overflow in syssegfree() causing wrong
number of pages to be passed to mfreeseg(). with the current
implementation of mfreeseg() however, this doesnt cause any
data corruption but was just freeing an unexpected number of
pages.

this change checks for this condition in syssegfree() and
errors out instead. also mfreeseg() was changed to take
ulong argument for number of pages instead of int to keep
it consistent with other routines that work with page counts.
2015-03-07 18:59:06 +01:00
cinap_lenrek
cb35d1a132 kernel: avoid inconsistent reads in /proc/#/fd and /proc/#/ns
to allow bytewise access to /proc/#/fd, the contents of the file where
recreated on each call. if fd's had been closed or reassigned between
the reads, the offset would be inconsistent and a read could start off
in the middle of a line. this happens when you cat /proc/#/fd file of
a busy process that mutates its filedescriptor table.

to fix this, we now return one line record at a time. if the line
fits in the read size, then this means the next read will always start
at the beginning of the next line record. we remember the consumed
byte count in Chan.mrock and the current record in Chan.nrock. (these
fields are free to usefor non-directory files)

if a read comes in and the offset is the same as c->mrock, we do not
need to regenerate the file and just render the next c->nrock's record.

for reads smaller than the line count, we have to regenerate the content
up to the offset and the race is still possible, but this should not
be the common case.

the same algorithm is now used for /proc/#/ns file, allowing a simpler
reimplementation and getting rid of Mntwalk state strcture.
2014-12-21 04:46:22 +01:00
cinap_lenrek
b18a641397 kernel: remove implicit Proc* argument from procctl()
procctl() is always called with up and it would not
work correctly if passed a different process, so
remove the Proc* argument and use up directly.
2014-11-09 08:19:28 +01:00
cinap_lenrek
3b661a96ef kernel: make noswap flag exclude processes from killbig() if not eve, reset noswap flag on exec 2014-08-17 00:50:20 +02:00
cinap_lenrek
655ec332a7 devproc: fix proccrlmemio bugs
dont kill the calling process when demand load fails if fixfault()
is called from devproc. this happens when you delete the binary
of a running process and try to debug the process accessing uncached
pages thru /proc/$pid/mem file.

fixes to procctlmemio():

- fix missed unlock as txt2data() can error
- make sure the segment isnt freed by taking a reference (under p->seglock)
- access the page with segment locked (see comment)
- get rid of the segment stealer lock

other stuff:

- move txt2data() and data2txt() to segment.c
- add procpagecount() function
- make return type mcounseg() to ulong
2014-07-14 06:02:21 +02:00
cinap_lenrek
d4d86df2ab kernel: new pagecache, remove Lock from page, use cmpswap for Ref instead of Lock
make the Page stucture less than half its original size by getting rid of
the Lock and the lru.

The Lock was required to coordinate the unchaining of pages that where
both cached and on the lru freelist.

now pages have a single next pointer that is used for palloc.head
freelist xor for page cache hash chains in Image.pghash[].

cached pages are not on the freelist anymore, but will be reclaimed
from images by the pager when the freelist runs out of pages.

each Image has its own 512 hash chains for cached page lookup. That is
2MB worth of pages and there should be no collisions for most text images.

page reclaiming can be done without holding palloc.lock as the Image is
the owner of the page hash chains protected by the Image's lock.

reclaiming Image structures can be done quickly by only reclaiming pages from
inactive images, that is images which are not currently in use by segments.

the Ref structure has no Lock anymore. Only a single long that is atomically
incremented or decremnted using cmpswap().

there are various other changes as a consequence code. and lots of pikeshedding,
sorry.
2014-06-22 15:12:45 +02:00
cinap_lenrek
72ba3571a3 kernel: remove _xinc()/_xdec()
as with the Block refcount changes, _xinc() and _xdec() arent
used anymore, so remove them.

architecure can still define ainc()/adec() when it needs them.
2014-06-08 01:35:22 +02:00
cinap_lenrek
a2d96d47c9 kernel: always reset notepending in eqlock, handle forceclosefgrp in eqlocks 2014-04-29 21:17:07 +02:00