replace the machine specific userinit() with a portable
implementation that uses kproc() to create the first
process. the initcode text is mapped using kmap(),
so there is no need for machine specific tmpmap()
functions.
initcode stack preparation should be done in init0()
where the stack is mapped and can be accessed directly.
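roughly, the portable userinit() boils down to this sketch (assuming init0() is the
kproc entry point that maps the initcode text via kmap() and prepares the user stack
before entering user mode; not the literal code):

	void
	userinit(void)
	{
		/* first user process is created like any other kernel process */
		kproc("*init*", init0, nil);
	}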
replacing the machine specific userinit() allows some
big simplifications: sysrfork() and kproc() are now
the only callers of newproc(), so we can avoid initializing
fields that we know are initialized by these
callers.
rename autogenerated init.h and reboot.h headers.
the initcode[] and rebootcode[] blobs are now in *.i
files and hex generation was moved to portmkfile. the
machine specific mkfile only needs to specify how to
build rebootcode.out and initcode.out.
comparing m with MACHP() is wrong as m is a constant on 386.
add procflushothers(), which flushes all processes except up
using the common procflushmmu() routine.
procflushmmu() returns once all *OTHER* processors that had
matching processes running on them have flushed their tlb/mmu state.
the caller of procflush...() takes care of flushing "up" by
calling flushmmu() later.
if the current process matched, m->flushmmu would be set
and hzclock() would call flushmmu() again.
to avoid this, we now check up->newtlb in addition to m->flushmmu
in hzclock() before calling flushmmu().
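a minimal sketch of the resulting hzclock() check (this only illustrates the idea,
the surrounding code differs):

	if(m->flushmmu && up->newtlb)
		flushmmu();	/* only flush when this process was actually targeted */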
we also maintain information locally on which process on what processor
to wait for, which helps make progress when multiple
procflushmmu() calls are running concurrently.
in addition, this makes the wait condition for procflushmmu()
more sophisticated: we check whether the processor still runs
the selected process, and only if it matches do we consider
the MACHP(nm)->flushmmu flag.
for better system diagnostics, we *ALWAYS* want to record the parent
pid of a user process, regardless of whether the child will post a wait
record on exit or not.
for that, we reverse the roles of Proc.parent and Proc.parentpid so
Proc.parentpid will always be set on rfork() and the Proc.parent
pointer will point to the parent's Proc structure or is set to nil
when no wait record should be posted on exit (RFNOWAIT flag).
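in sysrfork() this amounts to roughly the following (a sketch, not the literal code):

	p->parentpid = up->pid;		/* always recorded now */
	if(flag & RFNOWAIT)
		p->parent = nil;	/* no wait record will be posted on exit */
	else
		p->parent = up;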
this means that we can get the pid of the original parent process
from /proc, regardless of the child having rforked with the
RFNOWAIT flag. this improves the output of pstree(1) somewhat if
the parent is still alive. note that there's no guarantee that the
parent pid is still valid.
the conditions are unchanged:
a user process that will post wait record has:
up->kp == 0 && up->parent != nil && up->parent->pid == up->parentpid
the boot process is:
up->kp == 0 && up->parent == nil && up->parentpid == 0
and kproc's have:
up->kp != 0 && up->parent == nil && up->parentpid == 0
after issuing the CR_RESETEP command, we have to invalidate
the endpoint's output context buffer so that the halted/error
status reflects the new state. not doing so left the halted
state stuck and we continued issuing endpoint reset commands
when we were already recovered.
handle the Ep.clrhalt flag from devusb that userspace
uses to force an endpoint reset on the next transaction.
the locking in proctext() is wrong. we have to acquire Proc.seglock
when reading segments from Proc.seg[] as segments do not
have a private freelist and can therefore be reused for other
data structures.
once we have Proc.seglock acquired, check that the process pid
is still valid so we won't accidentally read some other process's
segments (for both proctext() and procctlmemio()). this also
gives a better error message, distinguishing the case where
the process did segdetach() the segment in question before we
could acquire Proc.seglock.
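a sketch of the guarded lookup (assumed simplified form; p is the target Proc and
pid the pid captured before taking the lock):

	eqlock(&p->seglock);
	if(waserror()){
		qunlock(&p->seglock);
		nexterror();
	}
	if(p->pid != pid)
		error(Eprocdied);	/* the Proc slot was reused for another process */
	s = p->seg[TSEG];
	if(s == nil)
		error(Enonexist);	/* the segment was detached in the meantime */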
declare private functions as static.
pexit() and pprint() can get called outside of a syscall
(from procctl()) with a process that is in active note
handling and require floating point in the kernel on amd64
for aesni (devtls).
make exec() clear the per process error string
to avoid spurious errors and confusion.
the errstr() syscall used to always swap the
full maximum-size buffer with memmove(), which is
problematic as this gives access to the garbage
beyond the NUL byte. worse, newproc(), werrstr()
and rerrstr() only clear the first byte of the
input buffer, so random stack rubble could be
leaked across processes.
we change the errstr() syscall to not copy
beyond the NUL byte.
the manpage also documents that errstr() should
truncate on a utf8 boundary so we use utfecpy()
to ensure proper NUL termination.
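roughly, the swap now works like this sketch (buf and nbuf stand in for the caller's
already validated buffer and its length; not the literal syscall code):

	char tmp[ERRMAX];

	utfecpy(tmp, tmp+sizeof(tmp), up->errstr);	/* save the old error string, stop at the NUL */
	utfecpy(up->errstr, up->errstr+ERRMAX, buf);	/* take the user's new string */
	utfecpy(buf, buf+nbuf, tmp);			/* return the old one, truncated on a rune boundary */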
the user should not be able to change the cache
attributes for a segment in segattach() as this
can cause the same memory to be mapped with
conflicting attributes in the cache.
SG_TEXT should always be mapped with the SG_RONLY
attribute, so fix data2txt() to follow the rules.
fault() now has an additional pc argument that is
used to detect faults on a non-executable segment.
that is, on a read fault we check whether the segment
has the SG_NOEXEC attribute and the program counter
is within the faulting page.
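a sketch of that check (assumed form; where the attribute bit lives on the segment
is an assumption here):

	/* instruction fetch from a non-executable segment? */
	if(read && (s->type & SG_NOEXEC) != 0
	&& (pc & ~(BY2PG-1)) == (addr & ~(BY2PG-1)))
		return -1;	/* fail the fault so the process gets faulted */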
a portable SG_NOEXEC segment attribute was added to allow
non-executable (physical) segments, which sets the
PTENOEXEC bits for putmmu().
in the future, this can be used to make non-executable
stack / bss segments.
the SG_DEVICE attribute was added to distinguish between
mmio regions and uncached memory. this only matters on arm64.
on arm, there's the issue that PTEUNCACHED would have
no bits set when using the hardware bit definitions.
this is the reason the bcm, kw, teg2 and omap kernels use
artificial PTE constants. on zynq, the XN bit was used
as a hack to give PTEUNCACHED a non-zero value, and when
the bit was clear the cache attributes were added to
the pte.
to fix this, a PTECACHED constant was added.
the portable mmu code in fault.c will now explicitly set
the PTECACHED bits for cached memory and PTEUNCACHED for
uncached memory. that way the hardware bit definitions
can be used everywhere.
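a sketch of the idea in the portable fault path (variable names here are illustrative,
not the actual code):

	mmuflags = PTEVALID;
	if(uncached)
		mmuflags |= PTEUNCACHED;	/* explicit, even if the hardware bits happen to be zero */
	else
		mmuflags |= PTECACHED;
	if(noexec)
		mmuflags |= PTENOEXEC;
	putmmu(addr, pa|mmuflags, pg);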
we have to ensure that all stores saving the process state
have completed before setting up->mach = nil in the scheduler.
otherwise, another cpu could observe up->mach == nil while
stores such as the process's p->sched label have not finished.
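a minimal sketch of the ordering (assuming coherence() as the usual store barrier):

	/* in the scheduler, after the outgoing process state has been saved */
	coherence();		/* make stores like up->sched visible to other cpus first */
	up->mach = nil;		/* only now may another cpu pick this process up */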
between being committed to a machno and having acquired the lock, the
scheduler could come in and schedule us on a different processor. the
solution is to have dtmachlock() take a special -1 argument to mean
"current mach" and return the actual mach number after the lock has
been acquired and interrupts have been disabled.
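usage then looks roughly like this (a sketch of the calling convention, not actual
dtracy code):

	int machno;

	machno = dtmachlock(-1);	/* lock whatever mach we end up on; interrupts are now off */
	/* ... record per-mach trace state for MACHP(machno) ... */
	dtmachunlock(machno);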
all screen implementations use a Memimage* internally
for the framebuffer, so we can return a shared reference
to its Memdata structure in attachscreen() instead of
a framebuffer data pointer.
this eliminates the softscreen == 0xa110c hack as we
always use a shared Memdata* now.
always start the pager kproc in swapinit(), simplifying kickpager().
allow zero conf.nswap and conf.nswppo. avoid allocating the reference
map and iolist arrays in that case.
use ulong for ioptr and iolist indices.
don't panic when writing pages out to the swapfile fails. just
requeue the page in the io transaction list so we will try
again the next time executeio() is run, or free the page if
the swap reference was dropped.
remove unused pagersummary() function.
segclock() has to be called from hzclock(), otherwise
only processes running on cpu0 would catch the interrupt
and the time delta would be wrong.
lock the segment when allocating Seg->profile as
profile ctl might be issued from multiple processes.
Proc->debug qlock is not sufficient.
Seg->profile can never be freed or reallocated once
set as the timer interrupt accesses it without any
locking.
linux will send small, unpadded arp packets which may arrive over
wifi, so allow small packets into the bridge and pad any packets that
are too small when going out.
Once a second rebalance() is called on cpu0 to adjust priorities,
so cpu-bound processes won't lock others out. However it was only
adjusting processes which were running on cpu0. This was observed
to lead to livelock, eg when a higher-priority process spin-waits
for a lock held by a lower priority one.
now handle the supported rates element properly, only
providing the intersection of the rates that the bss
advertises and the rates that the driver supports, putting the
basic rates first.
also avoid using unsupported rates.
on HZ 100 systems like pc and pc64, the minimum sleep time
was 10ms, which is quite high. the cap isn't really needed
as the arch specific timerset() enforces its own limit, but at
a higher resolution.
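with the floor removed, syssleep() reduces to roughly this sketch (assumed form):

	if(ms <= 0)
		yield();
	else
		tsleep(&up->sleep, return0, 0, ms);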
background:
from Charles Forsyth:
I haven't really got an opinion on it. The 10ms interval was first used on
machines that were much slower.
I thought someone did set HZ to a bigger value, partly to support better
in-kernel timing. I haven't done it because I never had a need for it.
If I were doing (say) protocol implementation in user mode, I'd certainly
reconsider. Sleep itself forces at best ms granularity,
and for some applications that's too big.
initial mail from qwx raising the issue:
> Hello,
>
> I found out recently that sleep(2)'s resolution on 386 and 9front's amd64
> kernel is 10 ms rather than 1 ms. The reason is that on those kernels,
> HZ is set to 100 rather than say 1000. In syssleep, we get 1 tick every
> 10 ms.
>
> What is unclear is why.
>
> To paraphrase cinap_lenrek's answer to my question:
>
> In syssleep:
> if(ms < TK2MS(1))
> ms = TK2MS(1);
> tsleep(&up->sleep, return0, 0, ms);
>
> "TK2MS(1)" can be replaced with just "1", and the arch specific
> timerset() routine would do its own capping of the period if it's too
> small for the timer resolution, and make better decisions based on what
> the minimum timer period should be given the latency overhead of the
> given arch's interrupt handling and performance characteristics.
>
> Alternatively, HZ could be raised to 500 or 1000.
>
> It seems it's just trying to prevent excessive context switches and
> interrupts, but it seems somewhat arbitrary. A ton of syscalls can be
> done in 1 ms, and it's the lowest we can go without changing the unit.
>
>
> What do you think?
>
> Thanks in advance,
>
> qwx
devdir internally replicates the qid in the perm stat field
already, and the practice of explicitly passing it just causes
confusion when done inconsistently.
this driver makes regions of physical memory accessible as a disk.
to use it, ramdiskinit() has to be called before confinit(), so
that conf.mem[] banks can be reserved. currently, only the pc and pc64
kernels use it, but otherwise the implementation is portable.
ramdisks are not zeroed when allocated, so that the contents are
preserved across warm reboots.
to not waste memory, physical segments do not allocate Page structures
or populate the segment pte's anymore. there's also a new SG_CACHED
attribute.
everything was broken, starting with hsinit not even chaining
the itd's into a ring, followed by broken buffer pointer pages.
finally, the interrupt handler's read transaction length
calculation was completely bugged, using the *FRAME* index
to access the descriptor's csw[] fields and not resetting tdi->ndata
through the loop.
minor stuff:
iso->data needs to be freed with ctlr->dmafree().
put ival in iso->ival so a ctl message cannot override the endpoint's
pollival and screw up deallocation.
we allow devether to create ethernet cards on attach. this is useful
for virtual cards like the sink driver, so we can create a sink
by simply: bind -a '#l2:sink ea=112233445566' /net
the detach routine was never called, so remove it from the few drivers
that attempted to implement it.
the only architecture dependence of devether was enabling interrupts,
which is now done at the end of the driver's reset() function.
the wifi stack and dummy ethersink also move to port/.
do the IRQ2->IRQ9 hack for pc kernels in intrenable(), so not
every caller of intrenable() has to be aware of it.