sending multicast was broken when ipconfig assigned the 0
address for dhcp as they would wrongly classified as Runi.
this could happen when we do slaac and dhcp in parallel,
breaking the sending of router solicitations.
now handle the supported rates element properly, only
providing the intersecting set of rates that the bss
advertises and what the driver supports, putting the
basic rates first.
also avoid using usupported rates.
nobody passes us the "RSD PTR " address when doing multiboot/kexec
on UEFI systems. so we search for it manually in the ACPI reserved
area as indicated in the e820 memory map.
we did not apply the special case to store 0xFFFF (-0)
in the checksum field when the checksum calculation
returned zero. we survived this for v4 as RFC768 states:
> If the computed checksum is zero, it is transmitted as
> all ones (the equivalent in one's complement arithmetic).
>
> An all zero transmitted checksum value means that the
> transmitter generated no checksum (for debuging or for
> higher level protocols that don't care).
for ipv6 however, the checksum is not optional and receivers
would drop packets with a zero checksum.
during dhcp, ipconfig assigns the null address :: which makes
ipforme() return Runi for any destination, which can trigger
arp resolution when we attempt to reply. so have v4local()
skip the null address and have sendarp() check the return
status of v4local(), avoing the spurious arp requests.
closeconv() calls ipifcremmulti() like:
while((mp = cv->multi) != nil)
ipifcremmulti(cv, mp->ma, mp->ia);
so we have to defer freeing the entry after doing:
if((lifc = iplocalonifc(ifc, ia)) != nil)
remselfcache(f, ifc, lifc, ma);
which accesses the otherwise free'd ia and ma arguments.
on HZ 100 systems like pc and pc64, the minium sleep time
was 10ms, which is quite high. the cap isnt really needed
as arch specific timerset() enforces its own limit, but on
a higher resolution.
background:
from Charles Forsyth:
I haven't really got an opinion on it. The 10ms interval was first used on
machines that were much slower.
I thought someone did set HZ to a bigger value, partly to support better
in-kernel timing. I haven't done it because I never had a need for it.
If I were doing (say) protocol implementation in user mode, I'd certainly
reconsider. Sleep itself forces at best ms granularity,
and for some applications that's too big.
initial mail from qwx raising the issue:
> Hello,
>
> I found out recently that sleep(2)'s resolution on 386 and 9front's amd64
> kernel is 10 ms rather than 1 ms. The reason is that on those kernels,
> HZ is set to 100 rather than say 1000. In syssleep, we get 1 tich every
> 10 ms.
>
> What is unclear is why.
>
> To paraphrase cinap_lenrek's answer to my question:
>
> In syssleep:
> if(ms < TK2MS(1))
> ms = TK2MS(1);
> tsleep(&up->sleep, return0, 0, ms);
>
> "TK2MS(1)" can be replaced with just "1", and the arch specific
> timerset() routine would do its own capping of the period if it's too
> small for the timer resolution, and make better decisions based on what
> the minimum timer period should be given the latency overhead of the
> given arch's interrupt handling and performance characteristics.
>
> Alternatively, HZ could be raised to 500 or 1000.
>
> It seems it's just trying to prevent excessive context switches and
> interrupts, but it seems somewhat arbitrary. A ton of syscalls can be
> done in 1 ms, and it's the lowest we can go without changing the unit.
>
>
> What do you think?
>
> Thanks in advance,
>
> qwx
devdir internally replicates the qid in ther perm stat field
already and the practice of explicitely passing just causing
confusion when done inconsistently.
this driver makes regions of physical memory accessible as a disk.
to use it, ramdiskinit() has to be called before confinit(), so
that conf.mem[] banks can be reserved. currently, only pc and pc64
kernel use it, but otherwise the implementation is portable.
ramdisks are not zeroed when allocated, so that the contents are
preserved across warm reboots.
to not waste memory, physical segments do not allocate Page structures
or populate the segment pte's anymore. theres also a new SG_CHACHED
attribute.
fpurestore() unconditionally changed fpstate to FPinactive when
the kernel used the FPU. but in the FPinit case, the registers are
not saved by mathemu(), resulting in all zero initialized registers
being loaded once userspace uses the FPU so the process would have
wrong MXCR value.
the index overflow check was wrong with using shifted value.
there appears to be confusion about the refresh flag of arpenter().
when we get an arp reply, it makes more sense to just refresh
waiting/existing entries instead creating a new one as we do not
know if we are going to communicate with the remote host in the future.
when we see an arp request for ourselfs however, we want to always
enter the senders address into the arp cache as it is likely the sender
attempts to communicate with us and with the arp entry, we can reply
immidiately.
reject senders from multicast/broadcast mac addresses. thats just silly.
we can get rid of the multicast/broadcast ip checks in ethermedium and
do it in arpenter() instead, checking the route type for the target to
see if its a non unicast target.
enforce strict separation of interface's arp entries by passing a
rlock'd ifc explicitely to arpenter, which we compare against the route
target interface. this makes sure arp/ndp replies only affect entries for
the receiving interface.
handle neighbor solicitation retransmission in nbsendsol() only. that is,
both ethermedium and the rxmitproc just call nbsendsol() which maintains
the timers and counters and handles the rotation on the re-transmission
chain.
no need to rlock ifc in targetttype() as we are called from icmpiput6(),
which the ifc rlocked.
for icmpadvise, the lport, destination *AND* source have to match.
a connection gets a packet when the packets destination matches the source
*OR* the packets source matches the destination.
v4lookup() and v6lookup() do not acquire the routelock, so it is
possible to hit routes that are on the freelist. to detect these,
we set ref to 0 and check for this case, avoiding overriding the ifc.
re-evaluate routes when the ifcid on the route hint doesnt match.
rfc4861 7.2.2:
If the source address of the packet prompting the solicitation is the
same as one of the addresses assigned to the outgoing interface, that
address SHOULD be placed in the IP Source Address of the outgoing
solicitation.
this change adds ndbsendsol() which handles the source address selection
and also handles the arp table locking; avoiding access to the arp entry
after the arp table is unlocked.
cleanups:
- use ipmove() instead of memmove().
- useless extern qualifiers
ipv4local() and ipv6local() now take remote address argument,
returning the closest local address to the source. this
implements the standartized source address selection rules
instead of just returning the first local v4 or v6 address.
the source address selection was broken for esp, rudp an udp,
blindly assuming ifc->lifc->local being a valid v4 address.
use ipv6local() instead.
the v6 routing code used to lookup source address route to
decide to drop the packet instead of checking the interface
on the destination route.
factor out the route hint from Conv and put it in Routehint
structure. avoiding stack bloat in v4 routing. implement the
same trick for v6 avoiding second route lookup in ipoput6.
fix memory leak in icmpv6 router solicitation handling.
remove old unfinished handling of multiple v6 routers. should
implement source specific routes instead.
avoid duplication, use common convipvers() function.
use isv4() instead of memcmp v4prefix.
everything was broken. strting with hsinit not even chaining
the itd's into a ring. followed by broken buffer pointer pages.
finally, the interrupt handler's read transaction length
calculation was completely bugged, using the *FRAME* index
to access descriptors csw[] fields and not reseting tdi->ndata
thru the loop.
minor stuff:
iso->data needs to be freed with ctlr->dmafree()
put ival in iso->ival so ctl message cannot override the endpoints
pollival and screw up deallocation.
we allow devether to create ethernet cards on attach. this is useull
for virtual cards like the sink driver, so we can create a sink
by simply: bind -a '#l2:sink ea=112233445566' /net
the detach routine was never called, so remove it from the few drivers
that attempted to implement it.
the only architecture dependence of devether was enabling interrupts,
which is now done at the end of the driver's reset() function now.
the wifi stack and dummy ethersink also go to port/.
do the IRQ2->IRQ9 hack for pc kernels in intrenabale(), so not
every caller of intrenable() has to be aware of it.
the td index "x" was incremented twice, once in for loop
and in the body expression. so r->rp only got updated
every second completion. this is wrong, but harmless.
flushing tlb once the index wraps arround is not enougth
as in use pte's can be speculatively loaded. so instead
use invlpg() and explicitely invalidate the tlb of the
page mapped.
this fixes wired mount cache corruption for reads approaching
2MB which is the size of the KMAP window.
invlpg() was broken, using wrong operand type.
windows 7 just drops the default router when it tries to
probe for router reachability but gets a neighbor avertisement
from the router with the router bit clear.
so set the R-flag when sendra is active, which implies that
we are a router.
the driver doesnt implement multicast filter, but just turns
on promiscuous mode when a multicast address is added. but this
breaks when one actually enables and then disables promiscuous
mode with say, running snoopy.
we have to keep promisc mode active as long as multicast table
is not empty.
broadcast traffic was received back on the wire causing
duplicate address detection to break with dmat proy as
the rewritten broadcasts where observable.
the fix is to just ignore packets from ourselfs received
from the air. devether already handles loopback.
when kernel memory is exhausted, rtl8169replenish() can fail
to plant more receive descriptors and rtl8169receive() would
run over the receive tail and crash on the nil ctlr->rb[x].
rtl8169receive() is called on "Receive Descriptor Unavailable"
and "Packet Underrun" so we will try to replenish descriptors
in the beginning first in case memory was exhausted and memory
is available again and make sure not to run over the tail.
in case we continue to send traffic (like ping) with the ap gone,
the sending would keep updating bss->lastseen which prevents the
timeout to happen to switch bss.
- remove arbitrary limits on screen size, just check with badrect()
- post resize when physgscreenr is changed (actualsize ctl command)
- preserve physgscreenr across softscreen flag toggle
- honor panning flag on resize
- fix nil dereference in panning ctl command when scr->gscreen == nil
- use clipr when drawing vga plan 9 console (vgascreenwin())
to capture bios and bootloader messages, convert the contents
on the screen to kmesg.
on machines without legacy cga, the cga registers read out as
0xFF, resuting in out of bounds cgapos. so set cgapos to 0 in
that case.
when a process does an exec, it calls procsetup() which
unconditionally sets the sets the TS flag and fpstate=FPinit
and fpurestore() should not revert the fpstate.
cannot just reenable the fpu in FPactive case as we might have
been procsaved() an rescheduled on another cpu. what was i thinking...
thanks qu7uux for reproducing the problem.
new approach to graphics memory management:
the kernel driver never really cared about the size of stolen memory
directly. that was only to figure out the maximum allocation
to place the hardware cursor image somewhere at the end of the
allocation done by bios.
qu7uux's gm965 bios however wont steal enougth memory for his
native resolution so we have todo it manually.
the userspace igfx driver will figure out how much the bios
allocated by looking at the gtt only. then extend the memory by
creating a "fixed" physical segment.
the kernel driver allocates the memory for the cursor image
from normal kernel memory, and just maps it into the gtt at the
end of the virtual kernel framebuffer aperture.
thanks to qu7uux for the patch.
The aim is to take advantage of SSE instructions such as AES-NI
in the kernel by lazily saving and restoring FPU state across
system calls and pagefaults. (everything can can do I/O)
This is accomplished by the functions fpusave() and fpurestore().
fpusave() remembers the current state and disables the FPU if it
was active by setting the TS flag. In case the FPU gets used,
the current state gets saved and a new PFPU.fpslot is allocated
by mathemu().
fpurestore() restores the previous FPU state, reenabling the FPU
if fpusave() disabled it.
In the most common case, when userspace is not using the FPU,
then fpusave()/fpurestore() just toggle the FPpush bit in
up->fpstate.
When the FPU was active, but we do not use the FPU, then nothing
needs to be saved or restored. We just switched the TS flag on
and off agaian.
Note, this is done for the amd64 kernel only.
introducing the PFPU structue which allows the machine specific
code some flexibility on how to handle the FPU process state.
for example, in the pc and pc64 kernel, the FPsave structure is
arround 512 bytes. with avx512, it could grow up to 2K. instead
of embedding that into the Proc strucutre, it is more effective
to allocate it on first use of the fpu, as most processes do not
use simd or floating point in the first place. also, the FPsave
structure has special 16 byte alignment constraint, which further
favours dynamic allocation.
this gets rid of the memmoves in pc/pc64 kernels for the aligment.
there is also devproc, which is now checking if the fpsave area
is actually valid before reading it, avoiding debuggers to see
garbage data.
the Notsave structure is gone now, as it was not used on any
machine.
adjust to new aes_xts routines.
allow optional offset in the 4th argument where the encrypted
sectors start instead of hardcoding the 64K header area for
cryptsetup.
avoid allocating temporary buffer for cryptio() reads, we can
just decrypt in place there.
use sdmalloc() to allocate the temporary buffer for cryptio()
writes so that devsd wont need to allocate and copy in case
it didnt like our alignment.
do not duplicate the error reporting code, just use io()
that is what it is for.
allow 2*256 bit keys in addition to 2*128 bit keys.
there isnt much of a point in keep maintaining separate
kernel configurations for terminal and cpu kernels as
the role can be switched with service=cpu boot parameter.
to make stuff cosistent, we will just have one "pc" kernel
and one "pc64" kernel configuration now.
we used to tweak arround in the ICH registers for all intel controllers,
which is wrong, as the 200 series has different magic registes. see
the datasheet:
https://www.intel.com/content/www/us/en/chipsets/200-series-chipset-pch-datasheet-vol-2.html
this caused the clocks to be disabled for the 6th port causing a full
machine lockup touching the 6th port registers.
the next problem was that aiju's bios disabled the unused ports somehow
but didnt clear ther PI bits, so that they would stay in Sbist status even
after a port reset. so the port would get stuck in the Dportreset state
forever. the fix for this was to use a one second timeout for the
port reset procedure.
with xhci, bandwidth allocations are handled by the controller
and there are various speed settings possible that currently
not exposed in the Udev. so just keep usbload() as it is for
usb2 and keep ep->load as zero for superspeed.
more conservative approach: only one transaction in flight
per endpoint (except iso). also serialize controller commands.
no driver currently uses this and i doubt it is usefull.
create constants for common TRB flags and remove bogus 1<<16
flag on TR_NORMAL.
while endpoints != 0 are opend after the device descriptor has been
parsed and the endpoint properties like maxpkt have been set, the
control endpoint is opend with a guessed maxpkt value. once the first
8 bytes of the descriptor have been read by usbd, maxpkt gets set and
we need to reevaluate the control endpoint 0 context to update the value.
instead of guessing where the controllers dequeue pointer went,
stop the endpoint and then explicitely set te dequeue pointer to
the next write td position. that way we do not need to fix the cycle
bit in the td's and dont need to rely on if the controller
advanced the dequeue pointer after a stall or not.
add ctx and slot back pointers to ring.
add support for edp, dp and hdmi on haswell and haswell ult.
vga, dvi and specific configurations like ulx are unimplemented.
remaining issue: edp link training always fails (time out).
the problem is that segio doesnt check segment attributes
and it can't really in case of SG_FAULT which can be
inherited from pseg and toggle at any time.
so instead of returning -1 from fault into the fault$cputype
handler which then panics when fault happend kernel mode,
we jump into segio's waserror() block just like in the
demand load i/o error case (faulterror()).
make sure the loop terminates and doesnt get stuck at
name == aname. avoid memrchr() as it conflicts with
libc on unix (drawterm). declare namelenerror() as
static.
calculate alloc flag before waserror(), as compilers like
gcc will not notice the value changing later because
setjump() restores the old value due to callee-saves.
change is applies here to make it easier to merge with
drawterm.
thanks to aiju for debugging this; used to cause drawterm
memory leak until compiled with gcc -O0.
turns out on real hardware, the front falls off if we write
the completion queue doorbell registers without consuming
an entry. so only write the register when we have processed
something.
timerdel() did not make sure that the timer function
is not active (on another cpu). just acquiering the
Timer lock in the timer function only blocks the caller
of timerdel()/timeradd() but not the other way arround
(on a multiprocessor).
this changes the timer code to track activity of
the timer function, having timerdel() wait until
the timer has finished executing.
basic NVMe controller driver, reads and writes work.
"namespaces" show up as logical units.
uses pin/msi interrupts (no msi-x support yet).
one submission queue per cpu, shared completion queue.
no recovery from fatal controller errors.
only tested in qemu (no hardware available).
commiting this so it can be found by someone who has
hardware.
on thinkpad x1v4, the PCMP structure resides in upper reserved memory
pa=0xd7f49000 - while system memory ends at 0x0ffff000; so we have to
vmap() it instead of KADDR().
the RSD structure for ACPI might reside in low memory, so we sould
KADDR() in that case.
devmouse controls the screen blanking timeout, so move the
code there avoiding cross calls between modules. the only
function that needs to be provided is blankscreen(), which
gets called with drawlock locked.
the blank timeout is set thru /dev/mousectl now, so kernels
without devvga can set it.
blanking now only happens while /dev/mouse is read. so this
avoids accidentally blanking the screen on cpu servers that
do not have a mouse to unblank it.
- add some milisecond timestamps to the status change debug printing
- flush the packets in the queue on deassoc to avoid processing old pae
packets on next association.
- make roaming timeout shorter (60 -> 20 seconds)
- automatically timeout and restart wpa/pae blocked state
- fix printing race when essid gets changed underneath seprint
from openbsd driver, it seems the Centrino Advanced-N 6030 and 6235
cards share the same device revision as the 6205 (Type6005). Also
changing the device revision field from 4 to 5 bits.
- drivers enable short preamble and sort timeslot depending
on the ap beacon capinfo field (bss->cap)
- wifi sets short preamble bit in capinfo on association request
- wifi sets short timeslot bit when ap advertized it in beacon
- no need to splhi() in timerset, always called with
interrupts off.
- make timerset always update the period (next == 0)
- remove period update in fastticks(), simplify
delta calculation.
introducing new ctrunc() function that invalidates any caches
for the passed in chan, invoked when handling wstat with a
specified file length or on file creation/truncation.
test program to reproduce the problem:
#include <u.h>
#include <libc.h>
#include <libsec.h>
void
main(int argc, char *argv[])
{
int fd;
Dir *d, nd;
fd = create("xxx", ORDWR, 0666);
write(fd, "1234", 4);
d = dirstat("xxx");
assert(d->length == 4);
nulldir(&nd);
nd.length = 0;
dirwstat("xxx", &nd);
d = dirstat("xxx");
assert(d->length == 0);
fd = open("xxx", OREAD);
assert(read(fd, (void*)&d, 4) == 0);
}
given that the igfx driver doesnt provide any acceleration functions
and drawing is usually faster with double buffering as it eleminates
reads over the pci bus, enable softscreen by default.
on 386 kernel, each processor has its own pdb where the primary
pdb for kernel mappings is on cpu0 and other cpu's lazily pull
pdb entries from cpu0 when they fault in vmapsync().
so we have to edit the table tables in the pdb of cpu0 and not
the current processor.
on some modern machines like the x250, the bios arranges the mtrr's
and the framebuffer membar in a way that doesnt allow us to mark
the framebuffer pages as write combining, leading to slow graphics.
since the pentium III, the processor interprets the page table bit
combinations of the WT, CD and bit7 bits as an index into the
page attribute table (PAT).
to not change the semantics of the WT and CD bits, we preserve
the bit patterns 0-3 and use the last entry 7 for write combining.
(done in mmuinit() for each core).
the new patwc() function takes virtual address range and changes
the page table marking the range as write combining. no attempt
is made on invalidating tlb's. doesnt matter in our case as the
following mtrr() call in screen.c does it for us.