This avoids ipconfig having to explicitly specify the tag
when we want to set the route type, as the tag can be provided
implicitly through the "tag" command.
This adds a new route "t"-flag that enables network address translation,
replacing the source address (and local port) of a forwarded packet with
one of the outgoing interface.
The state for a translation is kept in a new Translation structure,
which contains two Iphash entries, so it can be inserted into the
per protocol 4-tuple hash table, requiring no extra lookups.
Translations have a low overhead (~200 bytes on amd64),
so we can have many of them. They get reused after 5 minutes
of inactivity or when the per protocol limit of 1000 entries
is reached (then the one with longest inactivity is reused).
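
The reuse policy described above can be sketched roughly as follows. This is
only an illustration: the names Trans, transalloc, Maxtrans and Idletime are
hypothetical, and the flat array stands in for whatever list the kernel keeps.

```c
#include <assert.h>
#include <stddef.h>

enum {
	Maxtrans = 1000,	/* per protocol limit quoted in the text */
	Idletime = 5*60,	/* seconds of inactivity before reuse */
};

/* hypothetical stand-in for the Translation structure */
typedef struct Trans Trans;
struct Trans {
	int	inuse;
	long	lastuse;	/* time of last packet, in seconds */
};

/*
 * Pick a slot for a new translation: the first free or expired
 * entry if there is one, otherwise the entry idle the longest.
 */
Trans*
transalloc(Trans *tab, int n, long now)
{
	Trans *t, *oldest;
	int i;

	assert(n > 0 && n <= Maxtrans);
	oldest = &tab[0];
	for(i = 0; i < n; i++){
		t = &tab[i];
		if(!t->inuse || now - t->lastuse >= Idletime){
			oldest = t;
			break;
		}
		if(t->lastuse < oldest->lastuse)
			oldest = t;
	}
	oldest->inuse = 1;
	oldest->lastuse = now;
	return oldest;
}
```

With a full table and nothing expired, the call returns the entry with the
smallest lastuse, which matches "the one with longest inactivity is reused".
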
The protocol needs to export a "forward" function that is responsible
for modifying the forwarded packet, and then handle translations in
its input function for iphash hits with Iphash.trans != 0.
This patch also fixes a few minor things found during development:
- Include the Iphash in the Conv structure, avoiding an extra malloc
- Fix ttl exceeded check (ttl < 1 -> ttl <= 1)
- Router should not reply with ttl exceeded for multicast flows
- Extra checks for icmp advice to avoid protocol confusions.
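
The two ttl items above can be illustrated with a small sketch. A router
decrements the ttl before forwarding, so a packet arriving with ttl 1 must
not be forwarded either. The function names here are hypothetical and not
taken from the kernel source.

```c
#include <assert.h>

/* true when a packet with this ttl must not be forwarded */
int
ttlexceeded(int ttl)
{
	assert(ttl >= 0 && ttl <= 255);
	return ttl <= 1;	/* the old check, ttl < 1, let ttl == 1 through */
}

/* true when the router should answer with icmp ttl exceeded */
int
replyttlexceeded(int ttl, int ismulticast)
{
	return ttlexceeded(ttl) && !ismulticast;
}
```
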
Wlock()'ing the ifc causes a deadlock with Medium
bind/unbind as the routine can walk /net, while
ndb/dns or ndb/cs are currently blocked enumerating
/net/ipifc/*.
The fix is to have a fake medium, called "unbound",
that is set temporarily during the call of Medium
bind and unbind.
That way, the interface rwlock can be released while
bind/unbind is in progress.
The ipifcunbind() routine will refuse to unbind an
ifc that is currently assigned to the "unbound"
medium, preventing any accidents.
The ipoput4() and ipoput6() functions can raise an error(),
which means before calling sndrst() or limbo() (from tcpiput()),
we have to get rid of our blist by calling freeblist(bp).
Make sure to set the Block pointer to nil after freeing in
ipiput() to avoid accidents.
Fix wrong panic string in sndsynack, and make any sending
functions like sndrst(), sndsynack() and tcpsendka()
return the value of ipoput*(), so we can distinguish
"no route" error.
Add an Enoroute[] string constant.
Both htontcp4() and htontcp6() can never return nil,
as they will allocate new or resize the existing block.
Remove the misleading error handling code that assumes
that it can fail.
Unlock proto on error in limborexmit() which can
be raised from sndsynack() -> ipoput*() -> error().
Make sndsynack() pass a Routehint pointer to ipoput*()
as it already did the route lookup, so we don't have to do
it twice.
i'm not confident about mutating the route tree
pointers and have concurrent readers walking the
pointer chains.
given that most route lookups are bypassed now
for non-routing case and we are not building a
high performance router here, let's play it safe.
there's no structure in the lower 32 bits of an ipv6 address.
use the top bit to distinguish special stuff like multicast
and link-local addresses, and use the 16-bit subnet-id bits
for the rest.
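
A hash along these lines might look like the sketch below. The name v6hash,
the nbuckets parameter and the exact bit layout are assumptions made for
illustration; the kernel's real hash may combine the bits differently.

```c
#include <assert.h>

typedef unsigned char uchar;

/*
 * Hypothetical hash over a 16 byte ipv6 address:
 * the top bit of the first byte separates multicast (ff00::/8)
 * and link-local (fe80::/10) from global unicast, and bytes 6-7
 * hold the 16-bit subnet-id.
 */
unsigned
v6hash(const uchar *a, unsigned nbuckets)
{
	unsigned h;

	assert(nbuckets > 0);
	h = (unsigned)(a[0] >> 7) << 16;
	h |= (unsigned)a[6] << 8 | a[7];
	return h % nbuckets;
}
```

Addresses in the same subnet land in the same bucket no matter what their
interface identifier is, while different subnets spread across buckets.
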
Instead of having to do an arp hash table lookup for each
outgoing ip packet, forward the Routehint pointer to the
medium's bwrite() function and let it cache the arp entry
pointer.
This avoids route and arp hash table lookups for tcp, il
and connection oriented udp.
It also allows us to avoid multiple route and arp table
lookups for the retransmits once an arp/neighbour solicitation
response arrives.
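
The caching idea can be sketched as below. The structures are cut down to
the bare minimum and arpget, arplookup and the nlookups counter are
hypothetical names used only to show that the second packet skips the
table search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char uchar;
typedef struct Arpent Arpent;
typedef struct Routehint Routehint;

struct Arpent {
	uchar	ip[4];
	uchar	mac[6];
};
struct Routehint {
	Arpent	*a;	/* cached arp entry, NULL if none yet */
};

int nlookups;	/* counts full table searches, for illustration only */

static Arpent*
arplookup(Arpent *tab, int n, const uchar *ip)
{
	int i;

	nlookups++;
	for(i = 0; i < n; i++)
		if(memcmp(tab[i].ip, ip, 4) == 0)
			return &tab[i];
	return NULL;
}

/* use the entry cached in the hint if it still matches, else search */
Arpent*
arpget(Arpent *tab, int n, const uchar *ip, Routehint *rh)
{
	Arpent *a;

	assert(ip != NULL);
	if(rh != NULL && rh->a != NULL && memcmp(rh->a->ip, ip, 4) == 0)
		return rh->a;
	a = arplookup(tab, n, ip);
	if(rh != NULL)
		rh->a = a;
	return a;
}
```

A connection that keeps its Routehint around pays for one search and then
reuses the pointer; callers without a hint pass NULL and search every time.
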
The IPv4 ARP cache used to indefinitely buffer packets in the Arpent hold list.
This is bad in case of a router, because it opens a 1 second
(retransmit time) window to leak all the to be forwarded packets.
This change makes the ipv4 arp code path similar to the IPv6 neighbour
solicitation path, using the retransmit process to time out old entries
(after 3 arp retransmits => 3 seconds).
A new function arpcontinue() has been added that unifies the point when
we schedule the (ipv6 sol retransmit) / (ipv4 arp timeout) and reduce
the hold queue to the last packet and unlock the cache.
As a bonus, we also now send an icmp host unreachable notification
for the dropped packets.
Added a ver= field to the filter to distinguish the ip version.
By default, a filter is parsed as ipv6, and after parsing
proto, src and dst fields are converted to ipv4. When no
ver= field is specified, an ip version filter is implicitly
added and both protocols are parsed.
This change also gets rid of the fast compare types as the
field might not be aligned correctly in the packet.
This also fixes the ifc= filter, as we have to check any
local address.
We used to just return the first address of the incoming
interface regardless of if the address matches the source
ip type and scope.
This change tries to find the best interface address that
will match the source ip so it can be used as a source
address when replying to the packet.
ipiput4() and ipiput6() are called with the incoming interface rlocked
while ipoput4() and ipoput6() also rlock() the outgoing interface once
a route has been found. it is common that the incoming and outgoing
interfaces are the same, which results in recursive rlock()ing.
the deadlock happens when a reader holds the rlock for the incoming interface,
then ip/ipconfig tries to add a new address, trying to wlock the interface.
as there are still active readers on the ifc, the ip/ipconfig process gets
queued on the interface RWlock.
now the reader finds the outgoing route which has the same interface as the
incoming packet and tries to rlock the ifc again. but now there's a writer
queued, so we also go to sleep waiting for ourselves to release the lock.
the solution is to never wait for the outgoing interface rlock, but instead
use non-queueing canrlock() and if it cannot be acquired, discard the packet.
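
The difference can be shown with a toy model of the lock state. This is not
the kernel's RWlock: RWmodel is a hypothetical two-field stand-in, and the
canrlock() and runlock() below only imitate the behaviour the text relies on.

```c
#include <assert.h>

/* toy model of a reader/writer lock, for illustration only */
typedef struct RWmodel RWmodel;
struct RWmodel {
	int	readers;	/* active read locks */
	int	wwaiting;	/* a writer is queued */
};

/* non-queueing read lock: fails instead of sleeping behind a writer */
int
canrlock(RWmodel *l)
{
	if(l->wwaiting)
		return 0;
	l->readers++;
	return 1;
}

void
runlock(RWmodel *l)
{
	assert(l->readers > 0);
	l->readers--;
}

/* output path: returns 1 if the packet was sent, 0 if discarded */
int
output(RWmodel *ifc)
{
	if(!canrlock(ifc))
		return 0;	/* waiting here is what used to deadlock */
	/* ... transmit on ifc ... */
	runlock(ifc);
	return 1;
}
```

With the input path already holding one read lock, output() succeeds as long
as no writer is queued, and drops the packet once ipconfig is waiting.
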
to prevent deadlock on media unbind (which is called with
the interface wlock()'ed), the media's reader processes
that unbind was waiting for used to discard packets when
the interface could not be rlocked.
this has the unfortunate side effect that packets get lost
when we change addresses on an interface.
this is problematic for the processing of ipv6 router
advertisements when multiple RA's are getting received
in quick succession.
this change removes that packet dropping behaviour and
instead changes the unbind process to avoid the deadlock
by wunlock()ing the interface temporarily while waiting
for the reader processes to finish. the interface media
is also changed to the nullmedium before unlocking (see
the comment).
when making outgoing connections, the source ip was selected
by just iterating from the first to the last interface and
trying each local address until a route was found. the result
was kind of hard to predict as it depends on the interface
order.
this change replaces the algorithm with the route lookup algorithm
that we already have which takes more specific destination and
source prefixes into account. so the order of interfaces does
not matter anymore.
permission checking had the "other" and "owner" bits swapped plus incoming
connections were always owned by "network" instead of the owner of
the listening connection. also, ipwstat() was not effective as the uid
strings were not parsed.
this fixes the permission checks for data/ctl/err file and makes incoming
connections inherit the owner from the listening connection.
we also allow ipwstat() to change ownership to the commonuser() or anyone
if we are eve.
we might have to add additional restrictions for none at a later point...
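
The corrected bit selection can be sketched as follows. The function name
ipaccess, its signature and the Pread/Pwrite constants are assumptions for
illustration; group bits are left out to keep the sketch short.

```c
#include <assert.h>
#include <string.h>

enum { Pread = 4, Pwrite = 2 };	/* hypothetical names */

/*
 * Return 1 if user may access a file with mode perm in the way
 * given by want. The owner is checked against the high (owner)
 * bits, everybody else against the low (other) bits. The bug
 * described above had these two cases the wrong way around.
 */
int
ipaccess(const char *owner, const char *user, int perm, int want)
{
	int bits;

	assert(owner != NULL && user != NULL);
	if(strcmp(user, owner) == 0)
		bits = (perm >> 6) & 7;
	else
		bits = perm & 7;
	return (bits & want) == want;
}
```
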
the Ipselftab is designed to not require locking on read
operation. locking the selftab in ipselftabread() risks
deadlock when accessing the user buffer creates a fault.
remove unused fields from the Ipself struct.
initialize the rate limits when the device gets
bound, not when it is created. so that the
rate limits get reset to default when the ifc
is reused.
adjust the burst delay when the mtu is changed.
this is to make sure that we allow at least one
full sized packet burst.
make a local copy of ifc->m before doing nil
check as it can change under us when we do
not have the ifc locked.
specify Ebound[] and Eunbound[] error strings
and use them consistently.
remove references to the unused Conv.car qlock.
ipifcregisterproxy() is called with the proxy
ifc wlock'd, which means we cannot acquire the
rwlock of the interfaces that will proxy for us
because it is allowed to rlock() multiple ifc's
in any order. to get around this, we use canrlock()
and skip the interface when we cannot acquire the
lock.
the ifc should get wlock'd only when we are about
to modify the ifc or its lifc chain. that is when
adding or removing addresses. wlock is not required
when we add addresses to the selfcache, which has its
own qlock.
mark reader process pointers with (void*)-1 to mean
not started yet. this avoids the race condition when
media unbind happens before the kproc has set its
Proc* pointer. then we would not post the note and
the reader would continue running after unbind.
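
A rough sketch of the three pointer states follows. Notstarted,
readerstate and readerstart are hypothetical names, and the real unbind
code may be organised differently.

```c
#include <assert.h>
#include <stddef.h>

#define Notstarted	((void*)-1)	/* set at bind, before the kproc runs */

enum { Exited, Starting, Running };

/* classify a reader process pointer as seen by unbind */
int
readerstate(void *p)
{
	if(p == NULL)
		return Exited;		/* nothing to post a note to */
	if(p == Notstarted)
		return Starting;	/* unbind has to wait, not skip */
	return Running;			/* safe to post the note */
}

/* called by the reader kproc once it knows its own Proc pointer */
void
readerstart(void **pp, void *self)
{
	assert(self != NULL && self != Notstarted);
	*pp = self;
}
```

Without the sentinel, a pointer that is still nil because the kproc has not
run yet looks the same as a reader that has already exited.
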
etherbind can be simplified by reading the #lX/addr
file to get the mac address, avoiding the temporary
buffer.
using ~IP_DF mask to select offset and "more fragments" bits
includes the evil bit 15. so instead define a constant IP_FO
for the fragment offset bits and use (IP_MF|IP_FO). that way
the evil bit gets ignored and doesn't cause any useless calls
to ipreassemble().
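
The effect of the two masks can be checked with a few lines of C. The flag
values follow the ipv4 header layout; IP_EVIL and the two helper function
names are made up for this sketch.

```c
#include <assert.h>

enum {
	IP_EVIL	= 0x8000,	/* reserved bit 15, hypothetical name */
	IP_DF	= 0x4000,	/* don't fragment */
	IP_MF	= 0x2000,	/* more fragments */
	IP_FO	= 0x1fff,	/* fragment offset */
};

/* old test: everything except DF, which wrongly includes bit 15 */
int
isfragold(unsigned frag)
{
	assert(frag <= 0xffff);
	return (frag & ~IP_DF & 0xffff) != 0;
}

/* new test: only "more fragments" and the fragment offset count */
int
isfrag(unsigned frag)
{
	assert(frag <= 0xffff);
	return (frag & (IP_MF|IP_FO)) != 0;
}
```

A packet with only bit 15 set is treated as a fragment by the old test but
not by the new one, so it no longer reaches ipreassemble().
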
unfraglen() had the side effect that it would always copy the
nexthdr field from the fragment header to the previous nexthdr
field. this is fine when we reassemble packets but breaks
fragments that we want to just forward unchanged.
given that we now keep the block size consistent with the
ip packet size, the variable header part of the ip packet
is just: BLEN(bp) - fp->flen == fp->hlen.
fix bug in ip6reassemble() in the non-fragmented case:
reload ih after ip header was moved before writing ih->ploadlen.
use concatblock() instead of pullupblock().
some protocols assume that Ip4hdr.length[] and Ip6hdr.ploadlen[]
are valid and not out of range within the block but this has
not been verified. also, the ipv4 and ipv6 headers can have variable
length options, which was not considered in the fragmentation and
reassembly code.
to make this sane, ipiput4() and ipiput6() now verify that everything
is in range and trim the block to the expected size before doing
any further processing. now blocklen() and Ip4hdr.length[] are consistent.
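
For ipv4, the kind of range check meant here might look like the sketch
below. It works on a flat byte buffer instead of a Block, and ip4checklen is
a hypothetical name, so treat it as an outline of the checks rather than
the kernel code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;

/*
 * Validate an ipv4 header against the number of bytes received.
 * Returns the length the block should be trimmed to, or -1 if
 * the header is inconsistent and the packet must be dropped.
 */
long
ip4checklen(const uchar *p, long blen)
{
	long hlen, len;

	assert(p != NULL);
	if(blen < 20)
		return -1;		/* shorter than a minimal header */
	if((p[0] >> 4) != 4)
		return -1;		/* not ipv4 */
	hlen = (p[0] & 0xf) << 2;	/* header length including options */
	if(hlen < 20 || hlen > blen)
		return -1;
	len = p[2]<<8 | p[3];		/* total length field */
	if(len < hlen || len > blen)
		return -1;
	return len;
}
```

After trimming to the returned value, the remaining code can rely on the
buffer length alone instead of re-reading the length field.
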
ipoput4() and ipoput6() are simpler now, as they can rely on
blocklen() only, not having a special routing case.
ip fragmentation reassembly has to consider that fragments could
arrive with different ip header options, so we store the header+option
size in new Ipfrag.hlen field.
unfraglen() has to make sure not to run past the buffer, and handle
the case when it encounters multiple fragment headers.
Under the normal close sequence, when we receive a FIN|ACK, we enter
TIME-WAIT and respond to that LAST-ACK with an ACK. Our TCP stack would
send an ACK in response to *any* ACK, which included FIN|ACK but also
included regular ACKs. (Or PSH|ACKs, which is what we were actually
getting/sending).
That was more ACKs than necessary and resulted in an endless ACK storm
if we were under the simultaneous close sequence. In that scenario,
both sides of a connection are in TIME-WAIT. Both sides receive
FIN|ACK, and both respond with an ACK. Then both sides receive *those*
ACKs, and respond again. This continues until the TIME-WAIT wait period
elapses and each side's TCP timers (in the Plan 9 / Akaros case) shut
down.
The fix for this is to only respond to a FIN|ACK when we are in TIME-WAIT.
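
Reduced to the decision itself, the change looks roughly like this. The flag
values are the standard TCP header bits; the two function names are
hypothetical and the sketch ignores everything else the TIME-WAIT code does.

```c
#include <assert.h>

enum {
	FIN = 0x01,
	SYN = 0x02,
	RST = 0x04,
	PSH = 0x08,
	ACK = 0x10,
};

/* old behaviour in TIME-WAIT: answer any segment that carries ACK */
int
timewaitackold(int flags)
{
	assert(flags >= 0);
	return (flags & ACK) != 0;
}

/* new behaviour in TIME-WAIT: answer only FIN|ACK */
int
timewaitack(int flags)
{
	assert(flags >= 0);
	return (flags & (FIN|ACK)) == (FIN|ACK);
}
```

In a simultaneous close both sides now stay quiet when they see the other
side's plain ACK, which breaks the loop.
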
when a prefix is added with the onlink flag clear, packets
towards that prefix need to be sent to the default gateway
so we omit adding the interface route.
when the on-link flag gets changed to 1 later, we add the
interface route.
the on-link flag is sticky, so there's no way to clear it back
to zero except removing and re-adding the prefix.
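
The sticky behaviour can be summarised in a few lines. Prefix and
prefixupdate are hypothetical names, and the ifcroute field merely records
whether the interface route has been added.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical per-prefix state */
typedef struct Prefix Prefix;
struct Prefix {
	int	onlink;		/* sticky on-link flag */
	int	ifcroute;	/* interface route has been added */
};

/* apply the on-link flag of a newly received prefix announcement */
void
prefixupdate(Prefix *p, int onlink)
{
	assert(p != NULL);
	if(onlink && !p->onlink){
		p->onlink = 1;
		p->ifcroute = 1;	/* add the interface route now */
	}
	/* onlink == 0 never clears a flag that is already set */
}
```
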
sending multicast was broken when ipconfig assigned the 0
address for dhcp as they would be wrongly classified as Runi.
this could happen when we do slaac and dhcp in parallel,
breaking the sending of router solicitations.