the page attribute table was initialized in mmuinit(), which is
too late for bootscreen(). So now we check for PAT support and
insert the write-combine entry early in cpuidentify().
this might have been the cause of some slow EFI framebuffers on
machines with overlapping or insufficient MTRR entries.
ipiput4() and ipiput6() are called with the incoming interface rlocked
while ipoput4() and ipoput6() also rlock() the outgoing interface once
a route has been found. it is common that the incoming and outgoing
interfaces are the same recusive rlocking().
the deadlock happens when a reader holds the rlock for the incoming interface,
then ip/ipconfig tries to add a new address, trying to wlock the interface.
as there are still active readers on the ifc, ip/ipconfig process gets queued
on the inteface RWlock.
now the reader finds the outgoing route which has the same interface as the
incoming packet and tries to rlock the ifc again. but now theres a writer
queued, so we also go to sleep waiting four outselfs to release the lock.
the solution is to never wait for the outgoing interface rlock, but instead
use non-queueing canrlock() and if it cannot be acquired, discard the packet.
do not touch s->map on SG_PHYSICAL type segments as they do
not have a pte map (s->mapsize == 0 && s->map == nil).
also remove the SG_PHYSICAL switch in freepte(), this is never
reached.
the calculation for the control endpoint0 output device context
missed the context size scaling shift, resulting in botched
stall handling as we would not read the correct endpoint status
value.
note, this calculation only affected control endpoint 0, which
was handled separately from all other endpoints.
spectacular bug. cmpswap() had a sign extension bug
using sign extending MOV to load the old compare
value and LDXRW using zero extension while the CMP
instruction compared 64 bit registers.
this caused cmpswap with negative old value always
to fail.
interestingly, libc's version of this function was
fine.
when reclaiming pages from an image, always reclaim all
the hash chains equally. that way, we avoid being biased
towards the chains at the start of the Image.pghash[] array.
images can be in two states: active or inactive. inactive
images are the ones which are not used by program while
active ones aare.
when reclaiming pages, we should try to reclaim pages
from inactive images first and only if that set becomes
exhausted attempt to release text pages and attempt to
reclaim pages from active images.
when we run out of Image structures, it makes only sense
to reclaim pages from inactive images, as reclaiming pages
from active ones will never free any Image structures.
change putimage() to require a image already locked and
make it unlock the image. this avoids many pointless
unlock()/lock() sequences as all callers of putimage()
already had the image locked.
looks like linux changed the device tree names for
the memory node:
4b17654f51 (diff-ac03c9402b807c11d42edc9e8d03dfc7)
this fixes the memory size detection with latest firmware
on raspberry pi4-b (4GB) for kenji.
The swcursor used a 32x32 image for saving/restoring
screen contents for no reason.
Add a doflush argument to swcursorhide(), so that
disabling software cursor with a double buffered
softscreen is properly hidden. The doflush parameter
should be set to 0 in all other cases as swcursordraw()
will flushes both (current and previours) locations.
Make sure swcursorinit() and swcursorhide() clear the
visibility flag, even when gscreen is nil.
Remove the cursor locking and just do everything within
the drawlock. All cursor functions such as curson(),
cursoff() and setcursor() will be called drawlock
locked. This also means &cursor can be read.
Fix devmouse cursor reads and writes. We now have the
global cursor variable that is only modified under
the drawlock. So copy under drawlock.
Move the pc software cursor implementation into vgasoft
driver, so screen.c does not need to handle it as
a special case.
Remove unused functions such as drawhasclients().
most pc's are multiprocessors these days, that use apic or
msi interrupts, then the irq does not matter anymore. and
uefi does not even assign irq to pci devices anymore. if
we have a problem enabling an interrupt, we will print.
memory returned by rampage() is not zeroed, so we have to
zero it ourselfs. apparently, this bug didnt show up as we
where zeroing conventional low memory before the new memory
map code. also rampage() never returns nil.
This replaces the memory map code for both pc and pc64
kernels with a unified implementation using the new
portable memory map code.
The main motivation is to be robust against broken
e820 memory maps by the bios and delay the Conf.mem[]
allocation after archinit(), so mp and acpi tables
can be reserved and excluded from user memory.
There are a few changes:
new memreserve() function has been added for archinit()
to reserve bios and acpi tables.
upareserve() has been replaced by upaalloc(), which now
has an address argument.
umbrwmalloc() and umbmalloc() have been replaced by
umballoc().
both upaalloc() and umballoc() return physical addresses
or -1 on error. the physical address -1 is now used as
a sentinel value instead of 0 when dealing with physical
addresses.
archmp and archacpi now always use vmap() to access
the bios tables and reserve the ranges. more overflow
checks have been added.
ramscan() has been rewritten using vmap().
to handle the population of kernel memory, pc and pc64
now have pmap() and punmap() functions to do permanent
mappings.
This is a generic memory map for physical addresses. Entries
can be added with memmapadd() giving a range and a type.
Ranges can be allocated and freed from the map. The code
automatically resolves overlapping ranges by type priority.
Fix the inconsistent use of ether->mem. Always use physical
addresses. Let ether8390 convert to virtual addresses using
KADDR() when we have to copy data in/out.
the previous mkfile had a sneaky hack that would use
sed to delete the first 2 lines of hex output to strip
the 32 byte long a.out header for apbootstrap and rebootcode.
use 8l -H3 flag to strip the header from the output file.
the rc & operator changes stdin to /dev/null, so we
have to do the <[0=1] inside the {}
this never showed up as an issue because many
fileservers just read 9p messages from standard
output.
when the control mountpoint side gets removed, close
mount channel immediately. this is usefull for implementing
automatic cleanup with ORCLOSE create mode.
allow reading the control file of a process and return
its pid number. if the process has exited, return an error.
this can be usefull as a way to test if a process is
still alive. and also makes it behave similar to
network protocol directories.
another side effect is that processes who erroneously
open the ctl file ORDWR would be allowed todo so as
along as they have write permission and the process is
not a kernel process.
progarg[0] can be assigned to elem directly as it is a
copy in kernel memory, so the char proelem[64] buffer
is not neccesary.
do the close-on-exit outside of the segment lock. there
is no reason to keep the segment table locked.
the user buffer could be changed while we parse it resulting
in a different number of watchpoints than initially calculated.
so add a check to the parse loop so we wont overflow the
watchpoint array.
in case the calling process changes its arguments under us, it could
happen that the final argument string lengths become bigger than
initially calculated. this is fine as we still make sure we wont
overflow the stack segment, but we could overrun into the tos
structure at the end of the stack. so change the limit to the
base of the tos, not the end of the stack segment.
writes to /proc/n/notepg and /proc/n/note should be able to write
at ERRMAX-1 bytes, not ERRMAX-2.
simplify write to /proc/n/args by just copying to local buf first
and then doing a kstrdup(). the value of Proc.nargs does not matter
when Proc.setargs is 1.
devproc assumes that when we hold the Proc.debug qlock,
the process will be prevented from exiting. but there is
another race where the process has already exited and
the Proc* slot gets reused. to solve this, on process
creation we also have to acquire the debug qlock while
initializing the fields of the process. this also means
newproc() should only initialize fields *not* protected
by the debug qlock.
always acquire the Proc.debug qlock when changing strings
in the proc structure to avoid doublefree on concurrent
update. for changing the user string, we add a procsetuser()
function that does this for auth.c and devcap.
remove pgrpnote() from pgrp.c and replace by static
postnotepg() in devproc.
avoid the assumption that the Proc* entries returned by
proctab() are continuous.
fixed devproc permission issues:
- make sure only eve can access /proc/trace
- none should only be allowed to read its own /proc/n/text
- move Proc.kp checks into procopen()
pid reuse was not handled correctly, as we where only
checking if a pid had a living process, but there still
could be processes expecting a particular parentpid or
noteid.
this is now addressed with reference counted Pid
structures which are organized in a hash table.
read access to the hash table does not require locks
which will be usefull for dtracy later.
replace machine specific userinit() by a portable
implemntation that uses kproc() to create the first
process. the initcode text is mapped using kmap(),
so there is no need for machine specific tmpmap()
functions.
initcode stack preparation should be done in init0()
where the stack is mapped and can be accessed directly.
replacing the machine specific userinit() allows some
big simplifications as sysrfork() and kproc() are now
the only callers of newproc() and we can avoid initializing
fields that we know are being initialized by these
callers.
rename autogenerated init.h and reboot.h headers.
the initcode[] and rebootcode[] blobs are now in *.i
files and hex generation was moved to portmkfile. the
machine specific mkfile only needs to specify how to
build rebootcode.out and initcode.out.
to prevent deadlock on media unbind (which is called with
the interface wlock()'ed), the medias reader processes
that unbind was waiting for used to discard packets when
the interface could not be rlocked.
this has the unfortunate side effect that when we change
addresses on a interface that packets are getting lost.
this is problematic for the processing of ipv6 router
advertisements when multiple RA's are getting received
in quick succession.
this change removes that packet dropping behaviour and
instead changes the unbind process to avoid the deadlock
by wunlock()ing the interface temporarily while waiting
for the reader processes to finish. the interface media
is also changed to the mullmedium before unlocking (see
the comment).
comparing m with MACHP() is wrong as m is a constant on 386.
add procflushothers(), which flushes all processes except up
using common procflushmmu() routine.
procflushmmu() returns once all *OTHER* processors that had
matching processes running on them flushed ther tlb/mmu state.
the caller of procflush...() takes care of flushing "up" by
calling flushmmu() later.
if the current process matched, then that means m->flushmmu
would be set, and hzclock() would call flushmmu() again.
to avoid this, we now check up->newtlb in addition to m->flushmmu
in hzclock() before calling flushmmu().
we also maintain information on which process on what processor
to wait for locally, which helps making progress when multiple
procflushmmu()'s are running concurrently.
in addition, this makes the wait condition for procflushmmu()
more sophisticated, by validating if the processor still runs
the selected process and only if it matchatches, considers
the MACHP(nm)->flushmmu flag.
the change to support no-execute bits broke the original
raspberry pi1, as it uses backwards compatible page table
format.
to use the XN bit, subpage AP bits have to be disabled
using the XP bit in CP15 Control Register c1 Bit 23.
when a process does an exec syscall, procsetup() is called and
we have to disable the debug watchpoint registers. just clearing
p->dr is not enougth as we are not going thru a procsave() and
procrestore() cycle which would disable and reload the saved
debug registers.
instead of clearing debug registers in procfork(), we should
clear the saved debug registers before a process goes to die
(pexit() calls sched() with up->state = Moribund) as the Proc
structure can get reused for kernel processes (kproc) which
never call procfork() and would therefore have debug registers
loaded.
for better system diagnostics, we *ALWAYS* want to record the parent
pid of a user process, regardless of if the child will post a wait
record on exit or not.
for that, we reverse the roles of Proc.parent and Proc.parentpid so
Proc.parentpid will always be set on rfork() and the Proc.parent
pointer will point to the parent's Proc structure or is set to nil
when no wait record should be posted on exit (RFNOWAIT flag).
this means that we can get the pid of the original parent process
from /proc, regardless of the the child having rforked with the
RFNOWAIT flag. this improves the output of pstree(1) somewhat if
the parent is still alive. note that theres no guarantee that the
parent pid is still valid.
the conditions are unchanged:
a user process that will post wait record has:
up->kp == 0 && up->parent != nil && up->parent->pid == up->parentpid
the boot process is:
up->kp == 0 && up->parent == nil && up->parentpid == 0
and kproc's have:
up->kp != 0 && up->parent == nil && up->parentpid == 0
Ori_B reports that his controller gets stuck in the ahciencreset()
wait loop. so we implement a 1000 ms timeout. we replace delay()
with esleep() as we are always called from the ledkproc().
blink() could enter infinite delay loop as well, so instead of
waiting for the message to get transmitted we exit making it
non-blocking (we will get called again anyway).
the access to the controller map[] was wrong in ledkproc(). the
index is the controller number, not the drive number!
when making outgoing connections, the source ip was selected
by just iterating from the first to the last interface and
trying each local address until a route was found. the result
was kind of hard to predict as it depends on the interface
order.
this change replaces the algorithm with the route lookup algorithm
that we already have which takes more specific desination and
source prefixes into account. so the order of interfaces does
not matter anymore.
on 386, the kernel memory region is mapped using huge 4MB pages
(when supported by the cpu). so the uncached pte manipulation
does not work to map the cursor data with uncached attribute.
instead, we allocate a memory page using newpage() and map
it globally using vmap(), which maps it uncached.
after issuing CR_RESETEP command, we have to invalidate
the endpoints output context buffer so that the halted/error
status reflects the new state. not doing so resulted in
the halted state to be stuck and we continued issuing
endpoint reset commands when we where already recovered.
handle the devusb Ep.clrhalt flag from devusb that userspace
uses to force a endpoint reset on the next transaction.
permission checking had the "other" and "owner" bits swapped plus incoming
connections where always owned by "network" instead of the owner of
the listening connection. also, ipwstat() was not effective as the uid
strings where not parsed.
this fixes the permission checks for data/ctl/err file and makes incoming
connections inherit the owner from the listening connection.
we also allow ipwstat() to change ownership to the commonuser() or anyone
if we are eve.
we might have to add additional restrictions for none at a later point...
we want devip to get reattached after hostowner has been written. factotum
already handles this with a private authdial() routine that mounts devip
when it is not present. so we detach devmnt before starting factotum,
and attach once factotum finishes.
the locking in proctext() is wrong. we have to acquire Proc.seglock
when reading segments from Proc.seg[] as segments do not
have a private freelist and can therefore be reused for other
data structures.
once we have Proc.seglock acquired, check that the process pid
is still valid so we wont accidentally read some other processes
segments. (for both proctext() and procctlmemio()). this also
should give better error message to distinguish the case when
the process did segdetach() the segment in question before we
could acquire Proc.seglock.
declare private functions as static.
there was a small window between modifying mmutop and switching the
asid where the core could bring in the new entries under the old asid
into the tlb due to speculation / prefetching.
this change moves the entering of the page tables into mmutop after
setttbr() to prevent this scenario.
due to us switching to the resereved asid 0 on procsave()->putasid(),
the only asid that could have potentially been poisoned would be asid 0
which does not have any user mappings. so this did not show any noticable
effect.
pexit() and pprint() can get called outside of a syscall
(from procctl()) with a process that is in active note
handling and require floating point in the kernel on amd64
for aesni (devtls).
make exec() clear the per process error string
to avoid spurious errors and confusion.
the errstr() syscall used to always swap the
maximum buffer size with memmove(), which is
problematic as this gives access to the garbage
beyond the NUL byte. worse, newproc(), werrstr()
and rerrstr() only clear the first byte of the
input buffer. so random stack rubble could be
leaked across processes.
we change the errstr() syscall to not copy
beyond the NUL byte.
the manpage also documents that errstr() should
truncate on a utf8 boundary so we use utfecpy()
to ensure proper NUL termination.
the idea is to catch bugs and make kernel exploitation
harder by mapping the kernel text section readonly
and everything else no-execute.
l.s maps the KZERO address space using 2MB pages so
to get the 4K granularity for the text section we use
the new ptesplit() function to split that mapping up.
we need to set EFER no-execute enable bit early
in apbootstrap so secondary application processors
will understand the NX bit in our shared kernel page
tables. also APBOOTSTRAP needs to be kept executable.
rebootjump() needs to mark REBOOTADDR page executable.
the user should not be able to change the cache
attributes for a segment in segattach() as this
can cause the same memory to be mapped with
conflicting attributes in the cache.
SG_TEXT should always be mapped with SG_RONLY
attribute. so fix data2txt() to follow the rules.
fault() now has an additional pc argument that is
used to detect fault on a non-executable segment.
that is, we check on read fault if the segment
has the SG_NOEXEC attribute and the program counter
is within faulting page.
a portable SG_NOEXEC segment attribute was added to allow
non-executable (physical) segments. which will set the
PTENOEXEC bits for putmmu().
in the future, this can be used to make non-executable
stack / bss segments.
the SG_DEVICE attribute was added to distinguish between
mmio regions and uncached memory. only matterns on arm64.
on arm, theres the issue that PTEUNCACHED would have
no bits set when using the hardware bit definitions.
this is the reason bcm, kw, teg2 and omap kernels use
arteficial PTE constants. on zynq, the XN bit was used
as a hack to give PTEUNCACHED a non-zero value and when
the bit is clear then cache attributes where added to
the pte.
to fix this, PTECACHED constant was added.
the portable mmu code in fault.c will now explicitely set
PTECACHED bits for cached memory and PTEUNCACHED for
uncached memory. that way the hardware bit definitions
can be used everywhere.
on the 2GB and 4GB raspberry pi 4 variants, there are two
memory regions for ram:
[0x00000000..0x3e600000)
[0x40000000..0xfc000000)
the framebuffer is somewhere at the end of the first
GB of memory.
to handle these, we append the region base and limit
of the second region to *maxmem= like:
*maxmem=0x3e600000 0x40000000 0xfc000000
the mmu code has been changed to have non-existing
ram unmapped and mmukmap() now uses small 64K pages
instead of 512GB pages to avoid aliasing (framebuffer).
the VIRTPCI mapping has been removed as we now have
a proper vmap() implementation which assigns vritual
addresses automatically.
this adds a 4GB KMAP window into the kernel address space
so we can access all physical ram on raspberry pi 4 for
user pages.
note that kernel memory above KZERO is still limited
to 1GB because of DMA restrictions.
according to the following linux change, BCM2711 uses a different
method for changing pullup/down mode:
abcfd09286 (diff-cf078559c38543ac72c5db99323e236d)
gpiomeminit() was broken, using virtual address for the gpio physseg
instead of the physical one.
cleanup the code, avoid repetition by declaring static u32int *regs
variable. make local variable names consistent.
the raspberry pi 4 has a new interrupt controller and
pci support, so get rid of intrenable() macro and
properly make intrenable function with tbdf argument.
the raspberry pi4 firmware refuses to enable the GIC interrup controller
for arm64 when the .img file is not a multiple of 4 bytes. yes, this
is insane and nowhere documented.
the new raspberry pi 4 firmware for arm64 seems to have
broken atag support. so we now parse the device tree
structure to get the bootargs and memory configuration.
Ori Bernstein had Sunrise Point-H USB 3.0 xHCI Controller that would mysteriously
crash on the 5th ENABLESLOT command. This was reproducable by even just allocating
slots in a loop right after init.
It turns out, the 1.2 spec extended the Max Scratchpad Buffers in HCSPARAMS2 so our
driver would not allocate enougth scratchpad buffers and controller firmware would
crash once it went beyond our allocated scratchpad buffer array.
This change also fixes:
- ignore bits 16:31 in PAGESIZE register
- preserve bits 10:31 in the CONFIG register
- handle ADDESSDEV command failure (so it can be retried)
preallocate 2% of user pages for page tables and MMU structures
and keep them mapped in the VMAP range. this leaves more space
in the KZERO window and avoids running out of kernel memory on
machines with large amounts of memory.
always clean AND invalidate caches before dma read,
never just invalidate as the buffer might not be
aligned to cache lines...
we have to invalidate caches again *AFTER* the dma
read has completed. the processor can bring in data
speculatively into the cache while the dma in in
flight.
we override atag memory on reboot, so preserve
the memsize learned from atag as *maxmem plan9
variable. the global memsize variable is not
needed anymore.
avoid trashing the following atag when zero
terminating the cmdline string.
zero memory after plan9.ini variables.
the Ipselftab is designed to not require locking on read
operation. locking the selftab in ipselftabread() risks
deadlock when accessing the user buffer creates a fault.
remove unused fields from the Ipself struct.
initialize the rate limits when the device gets
bound, not when it is created. so that the
rate limtis get reset to default when the ifc
is reused.
adjust the burst delay when the mtu is changed.
this is to make sure that we allow at least one
full sized packet burst.
make a local copy of ifc->m before doing nil
check as it can change under us when we do
not have the ifc locked.
specify Ebound[] and Eunbound[] error strings
and use them consistently.
remove references to the unused Conv.car qlock.
ipifcregisterproxy() is called with the proxy
ifc wlock'd, which means we cannot acquire the
rwlock of the interfaces that will proxy for us
because it is allowed to rlock() multiple ifc's
in any order. to get arround this, we use canrlock()
and skip the interface when we cannot acquire the
lock.
the ifc should get wlock'd only when we are about
to modify the ifc or its lifc chain. that is when
adding or removing addresses. wlock is not required
when we addresses to the selfcache, which has its
own qlock.
mark reader process pointers with (void*)-1 to mean
not started yet. this avoids the race condition when
media unbind happens before the kproc has set its
Proc* pointer. then we would not post the note and
the reader would continue running after unbind.
etherbind can be simplified by reading the #lX/addr
file to get the mac address, avoiding the temporary
buffer.
when the exclusive monitor is cleared, a event is generated
which we can use to wake up idlehands. that way we do not
need to wait for the next timer interrupt until a cpu takes
work from the run queue.
we did not interpret the $rootdir and $rootspec environment
variables right. $rootdir is what gets bound to / (usually /root)
and $rootspec is the mountspec of /root.
i'v been seeing the error condition described above in the
Slowbulkin comment. so i'm enabling the work arround which
seems to fix the lockup.
in the split transaction case where we want to start the
transaction at frame start, acquire the ctlr lock *before*
checking if we are in the right frame number. so the start
will happen atomically. checking the software ctlr->sofchan
instead of checking the interrupt mask register seems to
be quicker.
setting the haint mask bit for the chan under ctlr lock
in chanio() instead of chanwait() avoids needing to acquire
the ctlr lock twice.
mask wakechan bits with busychan bitmap in interrupt handlers
so we will not try to wake up released chans by accident.
sleep() and tsleep() might get interrupted so we have to
release the split qlock in the split transaction case and
in all cases, make sure to halt the channel before release.
add some common debug functions to dump channel and controller
registers.
we have to ensure that all stores saving the process state
have completed before setting up->mach = nil in the scheduler.
otherwise, another cpu could observe up->mach == nil while
the stores such as the processes p->sched label have not finnished.
attached is a patch to fix receive in the 8169 chip on my thinkpad
A485. i'm not sure why, but the same thing was done in 3d56a0fc4645
for Macv45.
nick
some control transactions can confuse the xhci controller so
much that it even fails to respond to command abort or STOPEP
control command. with no way for us to abort the transaction
but a full controller reset.
we give the controller 5 seconds to abort our initial
transaction and if that fails we wake the recover process
to reset the controller.
thanks mischief for testing.