This used to be an internal function, but virtio
uses multiple structures with the same cap type
to indicate the location of various register
blocks in the pci bars so export it.
For 64-bit architectures, the a.out header has the HDR_MAGIC flag set
in the magic and is expanded by 8 bytes containing the 64-bit virtual
address of the program's entry point, while Exec.entry contains the
physical address for kernel images.
Our sysexec() would always use Exec.entry, even for 64-bit a.out binaries,
which worked because PADDR(entry) == entry for userspace pointers.
This change fixes it, having the kernel use the 64-bit entry point
and documenting the behaviour in the manpage.
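A minimal sketch of picking the virtual entry point from a raw header
follows. It assumes a big-endian header of eight 32-bit fields with
Exec.entry at byte offset 20, the 8-byte expansion directly after it,
and HDR_MAGIC being 0x00008000. The names aoutentry(), be32(), be64(),
EXECSZ and ENTRYOFF are made up for illustration and are not the
kernel's.

```c
#include <assert.h>
#include <stdint.h>

enum {
	HDR_MAGIC = 0x00008000,	/* header expansion flag (value assumed) */
	EXECSZ = 32,		/* assumed size of the base header */
	ENTRYOFF = 20,		/* assumed offset of Exec.entry */
};

/* decode a big-endian 32-bit value */
static uint32_t
be32(const unsigned char *p)
{
	return (uint32_t)p[0]<<24 | (uint32_t)p[1]<<16 | (uint32_t)p[2]<<8 | p[3];
}

/* decode a big-endian 64-bit value */
static uint64_t
be64(const unsigned char *p)
{
	return (uint64_t)be32(p)<<32 | be32(p+4);
}

/* virtual entry point: the 8-byte expansion if present, else Exec.entry */
uint64_t
aoutentry(const unsigned char *hdr, unsigned long n)
{
	assert(n >= EXECSZ);
	if((be32(hdr) & HDR_MAGIC) != 0){
		assert(n >= EXECSZ+8);
		return be64(hdr+EXECSZ);
	}
	return be32(hdr+ENTRYOFF);
}
```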
Remove unused fields and factor common fields into a
new PMach struct in port/portdat.h.
The fields machno, splpc and proc are not moved to
PMach as they are part of the known offsets from
assembly (l.s).
We can take advantage of the fact that xinit() allocates
kernel memory from conf.mem[] banks always at the beginning
of a bank, so the separate palloc.mem[] array can be eliminated
as we can calculate the amount of non-kernel memory like:
upages = cm->npage - (PGROUND(cm->klimit - cm->kbase)/BY2PG)
for the number of reserved kernel pages,
we provide the new function: ulong nkpages(Confmem*)
This eliminates the error case of running out of slots in
the array and avoids wasting memory in ports that have simple
memory configurations (compared to pc/pc64).
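The calculation above can be sketched as follows. This is only an
illustration: the Confmem here is cut down to the three fields the
formula uses, the page size of 4096 is an assumption, and nupages()
is a hypothetical helper that does not exist in the kernel.

```c
#include <assert.h>

typedef unsigned long ulong;

enum { BY2PG = 4096 };	/* assumed page size */

/* round a byte count up to the next page boundary */
#define PGROUND(s)	(((s) + (BY2PG-1)) & ~(ulong)(BY2PG-1))

/* reduced, hypothetical Confmem with only the fields used here */
typedef struct Confmem Confmem;
struct Confmem {
	ulong	npage;	/* total pages in the bank */
	ulong	kbase;	/* start of kernel allocations in the bank */
	ulong	klimit;	/* end of kernel allocations in the bank */
};

/* number of pages reserved for the kernel at the start of the bank */
ulong
nkpages(Confmem *cm)
{
	assert(cm->klimit >= cm->kbase);
	return PGROUND(cm->klimit - cm->kbase) / BY2PG;
}

/* hypothetical helper: pages left over for user memory */
ulong
nupages(Confmem *cm)
{
	return cm->npage - nkpages(cm);
}
```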
The confstr was shared between readers so seprintconf() could
write concurrently to that buffer which is not safe.
This replaces the shared static confstr[Maxconf] buffer with a
pointer that is initially nil and a buffer that is allocated on
demand.
The new confstr pointer (and buffer) is now only updated while
wlock()ed from the new setconfstr() function.
This is now done by mconfig() / mdelctl() just before releasing
the wlock.
Now, rdconf() will check if confstr has been initialized, and
test for it again while wlock()ed; making sure the configuration
is read only once.
Also, rdconf() used to check for an undocumented "fsdev:\n" string
at the beginning of the config data.
This changes mconfig() to ignore that particular signature so
the example from the manpage will work as documented.
let pci.c deal with the special cardbus controller bar0 and
expansion roms.
handle apic interrupt routing for devices behind a cardbus slot.
do not free the pcidev on card removal, as the drivers
most certainly are not prepared to handle this yet.
instead, we provide a pcidevfree() function that just unlinks
the device from pcilist and the parent bridge.
Previously, mmurelease() was always called with
palloc spinlock held.
This is unnecessary for some mmurelease()
implementations as they won't release pages
to the palloc pool.
This change removes pagechainhead() and
pagechaindone() and replaces them with a single
freepages() call, which acquires the palloc
lock internally as needed.
freepages() avoids holding the palloc lock
while walking the linked list of pages,
avoiding some lock contention.
we might as well handle the per process cycle
counter in the portable part instead of duplicating the code
in every arch and have inconsistent implementations.
we now have portable kenter() and kexit() functions
that are meant to be used in trap/syscall from user
and that update the counters.
some kernels missed initializing Mach.cyclefreq.
The OCEXEC flag used to be maintained per channel,
making it shared between all the file descriptors.
This has unexpected side effects with regard to
channel passing drivers such as devdup (/fd),
devsrv (/srv) and devshr (/shr).
For example, opening a /srv file with OCEXEC
makes it impossible to be remounted by exportfs
as it internally does an exec() to mount and
re-export it. There is no way to reset the flag.
This change makes the OCEXEC flag per file descriptor,
so an open with the OCEXEC flag only affects the fd
group of the calling process, and not the channel
itself.
On rfork(RFFDG), the per file descriptor flags get
copied.
On dup(), the per file descriptor flags are reset.
The second modification is that /fd, /srv and /shr
should reject the ORCLOSE flag, as the files that
are returned have already been opened.
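A rough sketch of the per file descriptor flag semantics described
above, using a tiny fixed-size table. Everything here is invented for
illustration (Fgrp layout, Nfd, Fcexec, newfd(), dupfd(), copyfgrp(),
execfgrp()) and does not match the kernel's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
	Nfd = 8,	/* tiny fixed table, for illustration only */
	Fcexec = 1,	/* hypothetical per-descriptor close-on-exec bit */
};

typedef struct Fgrp Fgrp;
struct Fgrp {
	void	*chan[Nfd];		/* stands in for Chan* */
	unsigned char	flag[Nfd];	/* per-descriptor flags */
};

/* install c in the first free slot; returns the fd or -1 */
int
newfd(Fgrp *f, void *c, int cexec)
{
	int fd;

	assert(c != NULL);
	for(fd = 0; fd < Nfd; fd++)
		if(f->chan[fd] == NULL){
			f->chan[fd] = c;
			f->flag[fd] = cexec ? Fcexec : 0;
			return fd;
		}
	return -1;
}

/* dup(): the new descriptor shares the channel but starts with clear flags */
int
dupfd(Fgrp *f, int fd)
{
	return newfd(f, f->chan[fd], 0);
}

/* rfork(RFFDG): channels and flags are both copied */
void
copyfgrp(Fgrp *dst, Fgrp *src)
{
	memcpy(dst, src, sizeof *dst);
}

/* exec(): drop only the descriptors marked close-on-exec */
void
execfgrp(Fgrp *f)
{
	int fd;

	for(fd = 0; fd < Nfd; fd++)
		if(f->flag[fd] & Fcexec){
			f->chan[fd] = NULL;
			f->flag[fd] = 0;
		}
}
```

Because the flag lives next to the descriptor rather than in the
channel, a second descriptor for the same channel survives exec()
even when the first one was opened with OCEXEC.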
With some newer UEFI firmware, not all pci bars get
programmed and we have to assign them ourselves.
This was already done for memory bars. This change
adds the same for i/o port space, by providing an
ioreservewin() function which can be used to allocate
port space within the parent pci-pci bridge window.
Also, the pci code now allocates the pci config
space i/o ports 0xCF8/0xCFC so userspace needs to
use devpnp to access pci config space now. (see
latest realemu change).
Also, this moves the ioalloc()/iofree() code out
of devarch into port/iomap.c as it can be shared
with the ppc mtx kernel.
the whole idea of a ucallocb() is bad, as even access to the
metadata header would be in uncached memory. also, it turns out
that it was never used by anyone.
The new pci code is moved to port/pci.[hc] and shared by
all ports.
Each port has its own PCI controller implementation,
providing the pcicfgrw*() functions for low level pci
config space access. The locking for pcicfgrw*() is now
done by the caller (only port/pci.c).
Device drivers now need to include "../port/pci.h" in
addition to "io.h".
The new code now checks bridge windows and membars,
while enumerating the bus, giving the pc driver a chance
to re-assign them. This is needed because some UEFI
implementations fail to assign the bars for some devices,
so we need to do it ourselves. (See pcireservemem()).
While working on this, it was discovered that the pci
code assumed the smallest I/O bar size is 16 (pcibarsize()),
which is wrong. I/O bars can be as small as 4 bytes.
Bit 1 in an I/O bar is also reserved and should be masked off,
making the port mask: port = bar & ~3;
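The masking rule can be shown with a small standalone example. The
helper names barisio() and barioport() are hypothetical and only
illustrate the bit layout: bit 0 marks an I/O space bar and bit 1 is
reserved, so both are cleared to get the port base.

```c
#include <assert.h>

typedef unsigned int u32int;

/* hypothetical helper: bit 0 set marks an I/O space bar */
int
barisio(u32int bar)
{
	return (bar & 1) != 0;
}

/* hypothetical helper: port base of an I/O bar, bits 0 and 1 masked off */
u32int
barioport(u32int bar)
{
	assert(barisio(bar));
	return bar & ~3u;
}
```

Masking with ~0xF instead would turn a 4-byte bar at port 0xC004
into 0xC000, which is the kind of error the 16-byte assumption causes.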
The Abind case in namec() needs to cunique() the chan
before attaching the umh mount head pointer onto it.
This is because we cannot give a reference to the mount
head to any of the mh->mount...->to channels, as they
will never go away until the mount head goes away.
This is a cyclic reference.
This could be reproduced with:
@{rfork n; mount -a '#s/boot' /mnt/root; bind /mnt/root /}
Also, fix memory leaks around cunique(), which can
error, leaking the mount head we got from domount().
Move the umh != nil check inside cunique().
This change makes it mandatory for programs to call segflush() on
code that is not in the text segment if they want to execute it.
As a side effect, this means that everything but the text segment
will be non-executable by default, even without the SG_NOEXEC
attribute. Segments with the SG_NOEXEC attribute never become
executable, even when segflush() is called on them.
instruction cache maintenance is done on tlb miss,
when a page gets faulted in, with putmmu() checking
the page->txtflush cpu bitmap.
syssegflush() used to only call flushmmu() after
segflush() for the calling process, but when a segment
is shared with other processes, we have to flush the
other processes' tlb as well.
this adds the missing procflushseg() call into segflush().
note that procflushseg() leaves the calling process alone,
so the flushmmu() call in syssegflush() is still required.
segmentioproc() does not need to call flushmmu() after
segflush() as it is never going to jump to the modified
page, hence the stale icache does not matter.
The sample frequency is an artificial parameter used for
isochronous out transfers to better match the target
frequency (usually, a sound card).
when hz is set, devusb adjusts the endpoint's maxpkt to get
the requested frequency and a multiple of the samplesize per
packet.
however, when hz is not set, then we should calculate the
frequency from maxpkt, ntds and pollival, so all parameters
will be consistent with each other.
do not touch s->map on SG_PHYSICAL type segments as they do
not have a pte map (s->mapsize == 0 && s->map == nil).
also remove the SG_PHYSICAL switch in freepte(), this is never
reached.
the calculation for the control endpoint0 output device context
missed the context size scaling shift, resulting in botched
stall handling as we would not read the correct endpoint status
value.
note, this calculation only affected control endpoint 0, which
was handled separately from all other endpoints.
when reclaiming pages from an image, always reclaim all
the hash chains equally. that way, we avoid being biased
towards the chains at the start of the Image.pghash[] array.
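one way to picture "equally" is a round-robin pass that takes at most
one page per chain per pass. the sketch below is an assumption about
the idea, not the kernel code: chains are reduced to page counts and
reclaimeven() is a made-up name.

```c
#include <assert.h>

/*
 * chainlen[i] is the number of reclaimable pages on hash chain i.
 * takes up to want pages, at most one per chain per pass, so no
 * chain is drained before the others have been visited.
 * returns the number of pages taken.
 */
int
reclaimeven(int *chainlen, int nchain, int want)
{
	int i, got, progress;

	assert(nchain > 0);
	got = 0;
	do {
		progress = 0;
		for(i = 0; i < nchain && got < want; i++){
			if(chainlen[i] == 0)
				continue;
			chainlen[i]--;
			got++;
			progress = 1;
		}
	} while(progress && got < want);
	return got;
}
```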
images can be in two states: active or inactive. inactive
images are the ones which are not used by a program while
active ones are.
when reclaiming pages, we should try to reclaim pages
from inactive images first and only if that set becomes
exhausted attempt to release text pages and attempt to
reclaim pages from active images.
when we run out of Image structures, it only makes sense
to reclaim pages from inactive images, as reclaiming pages
from active ones will never free any Image structures.
change putimage() to require an image already locked and
make it unlock the image. this avoids many pointless
unlock()/lock() sequences as all callers of putimage()
already had the image locked.
The swcursor used a 32x32 image for saving/restoring
screen contents for no reason.
Add a doflush argument to swcursorhide(), so that
the cursor is properly hidden when disabling the software
cursor with a double buffered softscreen. The doflush parameter
should be set to 0 in all other cases as swcursordraw()
flushes both (current and previous) locations.
Make sure swcursorinit() and swcursorhide() clear the
visibility flag, even when gscreen is nil.
Remove the cursor locking and just do everything within
the drawlock. All cursor functions such as curson(),
cursoff() and setcursor() are now called with the drawlock
held. This also means &cursor can be read.
Fix devmouse cursor reads and writes. We now have the
global cursor variable that is only modified under
the drawlock. So copy under drawlock.
Move the pc software cursor implementation into vgasoft
driver, so screen.c does not need to handle it as
a special case.
Remove unused functions such as drawhasclients().
This is a generic memory map for physical addresses. Entries
can be added with memmapadd() giving a range and a type.
Ranges can be allocated and freed from the map. The code
automatically resolves overlapping ranges by type priority.
when the control mountpoint side gets removed, close
mount channel immediately. this is useful for implementing
automatic cleanup with ORCLOSE create mode.
allow reading the control file of a process and return
its pid number. if the process has exited, return an error.
this can be useful as a way to test if a process is
still alive, and also makes it behave similarly to
network protocol directories.
another side effect is that processes who erroneously
open the ctl file ORDWR would be allowed to do so as
long as they have write permission and the process is
not a kernel process.
progarg[0] can be assigned to elem directly as it is a
copy in kernel memory, so the char proelem[64] buffer
is not necessary.
do the close-on-exit outside of the segment lock. there
is no reason to keep the segment table locked.
the user buffer could be changed while we parse it resulting
in a different number of watchpoints than initially calculated.
so add a check to the parse loop so we won't overflow the
watchpoint array.
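the shape of such a check is sketched below. the text format (one
"addr len" pair per line) and the names Watchpt and parsewatch() are
simplifications made up for this example; only the bound check inside
the loop is the point.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Watchpt Watchpt;
struct Watchpt {
	unsigned long	addr;
	unsigned long	len;
};

/*
 * parse "addr len" pairs from s into wp[0..nwp).
 * returns the number parsed, or -1 on bad input or when the text
 * holds more entries than were counted when wp was allocated.
 */
int
parsewatch(const char *s, Watchpt *wp, int nwp)
{
	char *e;
	int n;

	assert(wp != NULL && nwp >= 0);
	n = 0;
	for(;;){
		while(*s == ' ' || *s == '\t' || *s == '\n')
			s++;
		if(*s == '\0')
			break;
		if(n >= nwp)
			return -1;	/* buffer changed since it was counted */
		wp[n].addr = strtoul(s, &e, 0);
		if(e == s)
			return -1;
		s = e;
		wp[n].len = strtoul(s, &e, 0);
		if(e == s)
			return -1;
		s = e;
		n++;
	}
	return n;
}
```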
in case the calling process changes its arguments under us, it could
happen that the final argument string lengths become bigger than
initially calculated. this is fine as we still make sure we won't
overflow the stack segment, but we could overrun into the tos
structure at the end of the stack. so change the limit to the
base of the tos, not the end of the stack segment.
writes to /proc/n/notepg and /proc/n/note should be able to write
up to ERRMAX-1 bytes, not ERRMAX-2.
simplify write to /proc/n/args by just copying to local buf first
and then doing a kstrdup(). the value of Proc.nargs does not matter
when Proc.setargs is 1.
devproc assumes that when we hold the Proc.debug qlock,
the process will be prevented from exiting. but there is
another race where the process has already exited and
the Proc* slot gets reused. to solve this, on process
creation we also have to acquire the debug qlock while
initializing the fields of the process. this also means
newproc() should only initialize fields *not* protected
by the debug qlock.
always acquire the Proc.debug qlock when changing strings
in the proc structure to avoid a double free on concurrent
update. for changing the user string, we add a procsetuser()
function that does this for auth.c and devcap.
remove pgrpnote() from pgrp.c and replace by static
postnotepg() in devproc.
avoid the assumption that the Proc* entries returned by
proctab() are continuous.
fixed devproc permission issues:
- make sure only eve can access /proc/trace
- none should only be allowed to read its own /proc/n/text
- move Proc.kp checks into procopen()
pid reuse was not handled correctly, as we were only
checking if a pid had a living process, but there still
could be processes expecting a particular parentpid or
noteid.
this is now addressed with reference counted Pid
structures which are organized in a hash table.
read access to the hash table does not require locks
which will be useful for dtracy later.
replace machine specific userinit() by a portable
implementation that uses kproc() to create the first
process. the initcode text is mapped using kmap(),
so there is no need for machine specific tmpmap()
functions.
initcode stack preparation should be done in init0()
where the stack is mapped and can be accessed directly.
replacing the machine specific userinit() allows some
big simplifications as sysrfork() and kproc() are now
the only callers of newproc() and we can avoid initializing
fields that we know are being initialized by these
callers.
rename autogenerated init.h and reboot.h headers.
the initcode[] and rebootcode[] blobs are now in *.i
files and hex generation was moved to portmkfile. the
machine specific mkfile only needs to specify how to
build rebootcode.out and initcode.out.
comparing m with MACHP() is wrong as m is a constant on 386.
add procflushothers(), which flushes all processes except up
using common procflushmmu() routine.
procflushmmu() returns once all *OTHER* processors that had
matching processes running on them flushed their tlb/mmu state.
the caller of procflush...() takes care of flushing "up" by
calling flushmmu() later.
if the current process matched, then that means m->flushmmu
would be set, and hzclock() would call flushmmu() again.
to avoid this, we now check up->newtlb in addition to m->flushmmu
in hzclock() before calling flushmmu().
we also maintain information on which process on what processor
to wait for locally, which helps making progress when multiple
procflushmmu()'s are running concurrently.
in addition, this makes the wait condition for procflushmmu()
more sophisticated, by validating that the processor still runs
the selected process and, only if it matches, considering
the MACHP(nm)->flushmmu flag.
for better system diagnostics, we *ALWAYS* want to record the parent
pid of a user process, regardless of whether the child will post a wait
record on exit or not.
for that, we reverse the roles of Proc.parent and Proc.parentpid so
Proc.parentpid will always be set on rfork() and the Proc.parent
pointer will point to the parent's Proc structure or is set to nil
when no wait record should be posted on exit (RFNOWAIT flag).
this means that we can get the pid of the original parent process
from /proc, regardless of the child having rforked with the
RFNOWAIT flag. this improves the output of pstree(1) somewhat if
the parent is still alive. note that there's no guarantee that the
parent pid is still valid.
the conditions are unchanged:
a user process that will post wait record has:
up->kp == 0 && up->parent != nil && up->parent->pid == up->parentpid
the boot process is:
up->kp == 0 && up->parent == nil && up->parentpid == 0
and kproc's have:
up->kp != 0 && up->parent == nil && up->parentpid == 0
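the three conditions can be written as predicates over a reduced Proc.
this is only a sketch: the struct carries just the four fields named
above, and postswait(), isbootproc() and iskproc() are hypothetical
names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* reduced, hypothetical Proc with only the fields used here */
typedef struct Proc Proc;
struct Proc {
	int	kp;		/* non-zero for kernel processes */
	int	pid;
	int	parentpid;	/* always set on rfork() */
	Proc	*parent;	/* nil when no wait record is posted */
};

/* user process that will post a wait record on exit */
int
postswait(Proc *p)
{
	assert(p != NULL);
	return p->kp == 0 && p->parent != NULL
		&& p->parent->pid == p->parentpid;
}

/* the boot process */
int
isbootproc(Proc *p)
{
	assert(p != NULL);
	return p->kp == 0 && p->parent == NULL && p->parentpid == 0;
}

/* a kproc */
int
iskproc(Proc *p)
{
	assert(p != NULL);
	return p->kp != 0 && p->parent == NULL && p->parentpid == 0;
}
```

a child created with RFNOWAIT matches none of the three: it has
kp == 0 and parent == nil, but keeps a non-zero parentpid.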
after issuing CR_RESETEP command, we have to invalidate
the endpoint's output context buffer so that the halted/error
status reflects the new state. not doing so resulted in
the halted state being stuck and we continued issuing
endpoint reset commands when we were already recovered.
handle the Ep.clrhalt flag from devusb that userspace
uses to force an endpoint reset on the next transaction.
the locking in proctext() is wrong. we have to acquire Proc.seglock
when reading segments from Proc.seg[] as segments do not
have a private freelist and can therefore be reused for other
data structures.
once we have Proc.seglock acquired, check that the process pid
is still valid so we won't accidentally read some other process's
segments. (for both proctext() and procctlmemio()). this also
should give a better error message to distinguish the case when
the process did segdetach() the segment in question before we
could acquire Proc.seglock.
declare private functions as static.
pexit() and pprint() can get called outside of a syscall
(from procctl()) with a process that is in active note
handling and require floating point in the kernel on amd64
for aesni (devtls).
make exec() clear the per process error string
to avoid spurious errors and confusion.
the errstr() syscall used to always swap the
maximum buffer size with memmove(), which is
problematic as this gives access to the garbage
beyond the NUL byte. worse, newproc(), werrstr()
and rerrstr() only clear the first byte of the
input buffer. so random stack rubble could be
leaked across processes.
we change the errstr() syscall to not copy
beyond the NUL byte.
the manpage also documents that errstr() should
truncate on a utf8 boundary so we use utfecpy()
to ensure proper NUL termination.
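the copy rule (stop at the NUL byte, never split a utf8 sequence,
always terminate) can be sketched in portable C. utfcopy() below is a
stand-in written for this example and is not the libc utfecpy(); it
takes a buffer size instead of an end pointer.

```c
#include <assert.h>
#include <string.h>

/*
 * copy src into dst[0..n), always NUL-terminating and never
 * cutting a utf8 sequence in half. nothing past the NUL byte
 * of src is read or copied. returns the number of bytes copied,
 * not counting the NUL.
 */
size_t
utfcopy(char *dst, size_t n, const char *src)
{
	size_t len;

	assert(n > 0);
	len = strlen(src);
	if(len >= n){
		len = n - 1;
		/* back up while the first dropped byte is a continuation byte */
		while(len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
			len--;
	}
	memcpy(dst, src, len);
	dst[len] = '\0';
	return len;
}
```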
the user should not be able to change the cache
attributes for a segment in segattach() as this
can cause the same memory to be mapped with
conflicting attributes in the cache.
SG_TEXT should always be mapped with SG_RONLY
attribute. so fix data2txt() to follow the rules.
fault() now has an additional pc argument that is
used to detect fault on a non-executable segment.
that is, we check on read fault if the segment
has the SG_NOEXEC attribute and the program counter
is within the faulting page.
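the check reads roughly like the predicate below. this is a sketch
under assumptions: the page size, the numeric value given to SG_NOEXEC
and the function isexecfault() are placeholders for illustration, and
the real fault() does much more than this.

```c
#include <assert.h>
#include <stdint.h>

enum {
	BY2PG = 4096,		/* assumed page size */
	SG_NOEXEC = 1<<0,	/* placeholder value; only the name is from the text */
};

/*
 * non-zero if a fault at addr looks like an instruction fetch
 * from a non-executable segment: it is a read fault, the segment
 * has SG_NOEXEC set, and pc lies within the faulting page.
 */
int
isexecfault(uintptr_t addr, uintptr_t pc, int isread, int segattr)
{
	uintptr_t page;

	page = addr & ~(uintptr_t)(BY2PG-1);
	return isread && (segattr & SG_NOEXEC) != 0
		&& pc >= page && pc - page < BY2PG;
}
```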
a portable SG_NOEXEC segment attribute was added to allow
non-executable (physical) segments, which will set the
PTENOEXEC bits for putmmu().
in the future, this can be used to make non-executable
stack / bss segments.
the SG_DEVICE attribute was added to distinguish between
mmio regions and uncached memory. only matters on arm64.
on arm, there's the issue that PTEUNCACHED would have
no bits set when using the hardware bit definitions.
this is the reason bcm, kw, teg2 and omap kernels use
artificial PTE constants. on zynq, the XN bit was used
as a hack to give PTEUNCACHED a non-zero value and when
the bit is clear then cache attributes were added to
the pte.
to fix this, PTECACHED constant was added.
the portable mmu code in fault.c will now explicitly set
PTECACHED bits for cached memory and PTEUNCACHED for
uncached memory. that way the hardware bit definitions
can be used everywhere.
we have to ensure that all stores saving the process state
have completed before setting up->mach = nil in the scheduler.
otherwise, another cpu could observe up->mach == nil while
the stores such as the process's p->sched label have not finished.
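the ordering requirement can be illustrated with C11 atomics, used
here as a stand-in for the kernel's own barrier primitives. the Proc
and Mach structs are reduced to what the example needs, and
procsave() and procready() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Mach Mach;
typedef struct Proc Proc;

struct Mach {
	int	machno;
};

struct Proc {
	unsigned long	schedpc;	/* stands in for the sched label */
	unsigned long	schedsp;
	_Atomic(Mach*)	mach;		/* cpu currently running the process */
};

/*
 * save the process state, then publish mach = nil.
 * the release store keeps the state stores from being
 * observed after the mach update.
 */
void
procsave(Proc *p, unsigned long pc, unsigned long sp)
{
	p->schedpc = pc;
	p->schedsp = sp;
	atomic_store_explicit(&p->mach, NULL, memory_order_release);
}

/*
 * on another cpu: once mach == nil is seen with acquire
 * ordering, the saved state is guaranteed to be visible.
 */
int
procready(Proc *p)
{
	return atomic_load_explicit(&p->mach, memory_order_acquire) == NULL;
}
```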