Git9 was sloppy about telling git what commits we have.
We would list the commits at the tip of the branch, but not
walk down it, which means we would request too much data if
our local branches were ahead of the remote.
This patch changes that, sending the tips *and* the first
256 commits after them, so that git can produce a better
pack for us, with fewer redundant commits.
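As a rough illustration of the walk described above, here is a minimal sketch in portable C. The Commit table, listhaves, and Maxpar are invented for this example and are not git9's actual data structures; only the limit of 256 comes from the text.

```c
#include <stddef.h>

enum {
	Nhave	= 256,	/* ancestors sent after the tips (from the text) */
	Maxpar	= 2,	/* parents per commit; an assumption for this sketch */
};

typedef struct Commit Commit;
struct Commit {
	int	parent[Maxpar];	/* indices into the commit table, -1 if absent */
	int	seen;		/* set once the commit has been listed */
};

/*
 * Breadth-first walk from the tips. Every tip is listed first;
 * at most `extra` ancestors follow. The output array doubles as
 * the queue. Returns the number of entries written to out.
 */
size_t
listhaves(Commit *tab, int *tips, size_t ntips, size_t extra, int *out, size_t nout)
{
	size_t head, tail, limit, i;
	int c, p, j;

	tail = 0;
	for(i = 0; i < ntips && tail < nout; i++){
		if(tab[tips[i]].seen)
			continue;
		tab[tips[i]].seen = 1;
		out[tail++] = tips[i];
	}
	limit = tail + extra;
	if(limit > nout)
		limit = nout;
	for(head = 0; head < tail && tail < limit; head++){
		c = out[head];
		for(j = 0; j < Maxpar && tail < limit; j++){
			p = tab[c].parent[j];
			if(p < 0 || tab[p].seen)
				continue;
			tab[p].seen = 1;
			out[tail++] = p;
		}
	}
	return tail;
}
```

A caller would pass Nhave as `extra`, so the remote sees the tips plus a bounded slice of history below them.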
Inspired by some changes made in Game of Trees, I've
implemented a number of speedups in git9.
First, hashing the chunks during deltification with
murmurhash instead of sha1 speeds up the delta search
significantly.
The stretch function was micro-optimized a bit as well,
since that was taking a large portion of the time when
chunking.
Finally, the full path is not stored. We only care about
grouping files with the same name and path. We don't care
about the ordering. Therefore, only the hash of the path
xored with the hash of the directory is kept, which saves
a bunch of mallocs and string munging.
This cuts the wall-clock time spent repacking some test
repos by roughly 20% to 30%.
9front:
	% time git/repack
	deltifying 97473 objects: 100%
	writing 97473 objects: 100%
	indexing 97473 objects: 100%
	58.85u 1.39s 61.82r git/repack
	% time /sys/src/cmd/git/6.repack
	deltifying 97473 objects: 100%
	writing 97473 objects: 100%
	indexing 97473 objects: 100%
	43.86u 1.29s 47.51r /sys/src/cmd/git/6.repack

openbsd:
	% time git/repack
	deltifying 2092325 objects: 100%
	writing 2092325 objects: 100%
	indexing 2092325 objects: 100%
	1589.48u 45.03s 1729.18r git/repack
	% time /sys/src/cmd/git/6.repack
	deltifying 2092325 objects: 100%
	writing 2092325 objects: 100%
	indexing 2092325 objects: 100%
	1238.68u 41.49s 1373.15r /sys/src/cmd/git/6.repack

go:
	% time git/repack
	deltifying 529507 objects: 100%
	writing 529507 objects: 100%
	indexing 529507 objects: 100%
	345.32u 7.71s 369.25r git/repack
	% time /sys/src/cmd/git/6.repack
	deltifying 529507 objects: 100%
	writing 529507 objects: 100%
	indexing 529507 objects: 100%
	248.07u 4.47s 257.59r /sys/src/cmd/git/6.repack
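The path key used for grouping in the repack change above can be sketched as follows. This is only an illustration: FNV-1a stands in for the real hash function, and strhash and pathkey are hypothetical names, not the functions in git9.

```c
#include <stdint.h>
#include <string.h>

/* FNV-1a, used here only as a stand-in for the real hash function */
static uint32_t
strhash(const char *s, size_t n)
{
	uint32_t h;
	size_t i;

	h = 2166136261u;
	for(i = 0; i < n; i++){
		h ^= (unsigned char)s[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Grouping key for a path: the hash of the whole path xored with
 * the hash of its directory part. No copy of the path is kept,
 * so there is nothing to allocate or free.
 */
uint32_t
pathkey(const char *path)
{
	const char *slash;
	size_t dirlen;

	slash = strrchr(path, '/');
	dirlen = slash != NULL ? (size_t)(slash - path) : 0;
	return strhash(path, strlen(path)) ^ strhash(path, dirlen);
}
```

Since ordering does not matter, equal keys are all the delta search needs in order to put files with the same name and path next to each other.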
When pushing, git/send would sometimes decide we had all the
objects that we'd need to update the remote, and would try
to pack and send the entire history of the repository. This
is because we only set the 'theirs' ref when we had the object.
If we didn't have the object, we would set a zero hash,
then when deciding if we needed to force, we would think
that we were updating a new branch and send everything,
which would fail to update the remote.
A while ago, qwx noticed that we clobbered the exec
bit when merging files. This is not what we want, so
we changed the operator precedence to avoid merging
dirty files implicitly.
But we do want to merge, because it's convenient for
maintaining permissions. So, instead, we should do a
3 way merge of the exec bit.
This patch does that, as well as reverting the rollback
of that change.
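A 3 way merge of a single bit is small enough to sketch in full. The function name and the choice of the owner execute bit (0100) are assumptions made for this example; the real code may look at the mode differently.

```c
enum { Xbit = 0100 };	/* owner execute bit; looking only at this bit is an assumption */

/*
 * Three-way merge of the exec bit. If our side left the bit as it
 * was in the base, take their value; otherwise keep ours. With a
 * single bit, two sides that both changed it must agree, so there
 * is never a conflict.
 */
int
mergeexec(int base, int ours, int theirs)
{
	int b, o, t;

	b = base & Xbit;
	o = ours & Xbit;
	t = theirs & Xbit;
	return o == b ? t : o;
}
```

For example, mergeexec(0644, 0644, 0755) yields 0100: only their side made the file executable, so the merged file is executable too.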
While we're here, we adjust the timestamps correctly
in git/branch.
This requires changes to git/fs, because without an open
handler, lib9p allows opening any file with any mode,
which confuses 'test -x'.
there was a diff that went in a while ago to improve
this, but it got backed out because it encounters a
bug in upstream git -- the spec says that a single
ACK should be sent when not using multi-ack modes,
but they send back multiple ones.
This commit brings back the functionality, and works
around the upstream git bug in two different ways.
First, it skips the packets up until it finds the
start of a pack header.
Second, it deduplicates the want messages, which
is what seems to trigger the duplicate ACKs that
cause us trouble.
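The first workaround relies on the fact that a packfile starts with the 4 byte signature "PACK". The sketch below shows the idea as a plain byte scan over a buffer; the actual change skips whole protocol packets, and findpack is a hypothetical name used only here.

```c
#include <string.h>

/*
 * Return the offset of the first "PACK" signature in buf,
 * or -1 if the buffer does not contain one.
 */
long
findpack(const char *buf, size_t n)
{
	size_t i;

	for(i = 0; i + 4 <= n; i++)
		if(memcmp(buf + i, "PACK", 4) == 0)
			return (long)i;
	return -1;
}
```

Anything before the returned offset, such as stray ACK or NAK packets, is discarded before the pack is read.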
Although git9 always uses the same commit date and author date, other
implementations do make a distinction. Since commit date is more
representative of the commit graph order, use this as a traversal hint
instead of author date.
If the server only supports the dumb protocol, the first 4 bytes of
response will be the initial part of the hash of the first ref.
The http-protocol documentation says that we should fall back to the
dumb protocol when we don't see a content-type of
application/x-$servicename-advertisement. Check this before
attempting to read a smart git packet.
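The content-type check amounts to a string comparison. A minimal sketch, assuming the header value has already been extracted and carries no parameters; issmart is a hypothetical name, not the function in git9.

```c
#include <stdio.h>
#include <string.h>

/*
 * Report whether ctype is the smart-protocol advertisement type
 * for the given service, e.g. "git-upload-pack".
 */
int
issmart(const char *ctype, const char *service)
{
	char want[128];
	int n;

	n = snprintf(want, sizeof want, "application/x-%s-advertisement", service);
	if(n < 0 || (size_t)n >= sizeof want)
		return 0;	/* service name too long to be valid */
	return strcmp(ctype, want) == 0;
}
```

When this returns 0, the response body should be treated as a dumb-protocol ref listing rather than parsed as packets.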
We now keep track of 3 sets during traversal:
- keep: commits we've reached from head commits
- drop: commits we've reached from tail commits
- skip: ancestors of commits in both 'keep' and 'drop'
Commits in 'keep' and/or 'drop' may be added later to the 'skip' set
if we discover later that they are part of a common subgraph of the
head and tail commits.
From these sets we can calculate the commits we are interested in:
lca commits are those in 'keep' and 'drop', but not in 'skip'.
findtwixt commits are those in 'keep', but not in 'drop' or 'skip'.
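Treating the three sets as bit flags on each commit makes the two rules above one-liners. This is a sketch of the set arithmetic only; the flag names mirror the text, but the encoding as bits is an assumption.

```c
enum {
	Keep	= 1<<0,	/* reached from a head commit */
	Drop	= 1<<1,	/* reached from a tail commit */
	Skip	= 1<<2,	/* ancestor of a commit in both keep and drop */
	Colors	= Keep|Drop|Skip,
};

/* lca candidates: in keep and drop, but not in skip */
int
islca(int color)
{
	return (color & Colors) == (Keep|Drop);
}

/* findtwixt results: in keep, but not in drop or skip */
int
istwixt(int color)
{
	return (color & Colors) == Keep;
}
```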
The "LCA" commit returned is a common ancestor such that there are no
other common ancestors that can reach that commit. Although there can
be multiple commits that meet these criteria, where one is technically
lower on the commit-graph than the other, these cases only happen in
complex merge arrangements and any choice is likely a decent merge
base.
Repainting is now done in paint() directly. When we find a boundary
commit, we switch our paint color to 'skip'. 'skip' painting does
not stop when it hits another color; we continue until we are left
with only 'skip' commits on the queue.
This fixes several mishandled cases in the current algorithm:
1. If we hit the common subgraph from tail commits first (if the tail
commit was newer than the head commit), we ended up traversing the
entire commit graph. This is because we couldn't distinguish
between 'drop' commits that were part of the common subgraph, and
those that were still looking for it.
2. If we traversed through an initial part of the common subgraph from
head commits before reaching it from tail commits, these commits
were returned from findtwixt even though they were also reachable
from tail commits.
3. In the same case as 2, we might end up choosing an incorrect
commit as the LCA, which is an ancestor of the real LCA.
when reverting files, 'cp -x' updates the mtime
to the time the file was committed. this prevents
'mk' from rebuilding the file, leading to stale
builds.
this change touches the file on revert, so that
we rebuild the file.
when running outside of a repository, we would try to
remove '$msgfile.tmp', but we had never actually set
'$msgfile'.
the error is harmless, but annoying.
due to the way the size of buf was calculated, the parent
file had one trailing null byte for each parent. also, while
we're here, replace the sprint with seprint, and compute the
size from how much we printed in.
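The same idea in portable C, with snprintf standing in for Plan 9's seprint: the length reported is whatever was actually printed, so no padding bytes leak into the file. The function name and the one-hash-per-line format are assumptions for this example.

```c
#include <stdio.h>

/*
 * Print one hash per line into buf and return the number of bytes
 * written, not counting the terminating NUL. Stops early rather
 * than emitting a truncated line if buf is too small.
 */
size_t
fmtparents(char *buf, size_t nbuf, const char **hash, int nhash)
{
	size_t off;
	int i, n;

	off = 0;
	for(i = 0; i < nhash; i++){
		n = snprintf(buf + off, nbuf - off, "%s\n", hash[i]);
		if(n < 0 || (size_t)n >= nbuf - off)
			break;
		off += n;
	}
	if(off < nbuf)
		buf[off] = '\0';	/* erase any partial line left by a failed print */
	return off;
}
```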
git used to track cache size in object
count, rather than bytes. This had the
unfortunate effect of making memory use
depend on the size of objects -- repos
with lots of large objects could cause
out of memory deaths.
now, we track sizes in bytes, which should
keep our memory usage flatter.
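A toy version of byte-based accounting, to show why it bounds memory where an object count does not. The Cache layout, the oldest-first eviction, and the tiny fixed table are all assumptions made for the sketch, not the real object cache.

```c
#include <stddef.h>

enum { Nobj = 8 };	/* table capacity; tiny on purpose */

typedef struct Cache Cache;
struct Cache {
	size_t	size[Nobj];	/* sizes of cached objects, oldest first */
	int	nobj;
	size_t	used;		/* total bytes currently cached */
	size_t	limit;		/* byte budget */
};

/* Drop the oldest entry and give its bytes back to the budget. */
static void
evict(Cache *c)
{
	int i;

	c->used -= c->size[0];
	for(i = 1; i < c->nobj; i++)
		c->size[i-1] = c->size[i];
	c->nobj--;
}

/* Record an object of sz bytes, evicting old ones to stay within the budget. */
void
cacheadd(Cache *c, size_t sz)
{
	while(c->nobj > 0 && (c->used + sz > c->limit || c->nobj == Nobj))
		evict(c);
	c->size[c->nobj++] = sz;
	c->used += sz;
}
```

In this sketch a single object larger than the whole budget is still admitted, after everything else has been evicted.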
We seem to have a botch in the protocol negotiation, where
we leak some protocol packets into the packfile; this will
need to be fixed before we put this change in.
the pack cache was very stupid: it would close packs
as early as possible, which would prevent packs from
getting reused effectively. It would also select a
bad pack to close.
This picks the oldest pack, refcounts correctly, and
keeps up to Npackcache open at once (though it will
go over if more are in use).
git/revert requires a file name argument, but when none is given
it fails in a strange way:
	% git/revert
	usage: cleanname [-d pwd] name...
	/bin/git/revert:15: null list in concatenation
Due to the way LCA is defined, using a strict LCA
on a graph like this:
	<--a--b--c--d--e--f--g
	    \               /
	     +-----h-------
can lead to spurious requests to merge. This happens
because 'lca(b, g)' would return 'a', since it can be
reached in one step from 'b', and 2 steps from 'g', while
reaching 'b' from 'a' would be a longer path.
As a result, we need to implement an lca variant that
returns the starting node if one is reachable from the
other, even if it's already found the technically correct
least common ancestor.
This replaces our LCA algorithm with one based on the
painting we do while finding a twixt, making it give
the results we want.
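The rule "return the starting node if one is reachable from the other" can be shown on the example graph with a plain reachability check. This is not the painting algorithm the change implements, only a sketch of the result we want; history before 'a' is omitted and all names are made up for the example.

```c
enum { Ca, Cb, Cc, Cd, Ce, Cf, Cg, Ch, Ncommit, None = -1 };

/* parents for the graph in the text: g merges f and h, h forks from a */
static const int parent[Ncommit][2] = {
	[Ca] = {None, None},
	[Cb] = {Ca, None},
	[Cc] = {Cb, None},
	[Cd] = {Cc, None},
	[Ce] = {Cd, None},
	[Cf] = {Ce, None},
	[Cg] = {Cf, Ch},
	[Ch] = {Ca, None},
};

/* Is anc reachable from c by following parent links? */
static int
reaches(int c, int anc)
{
	if(c == None)
		return 0;
	if(c == anc)
		return 1;
	return reaches(parent[c][0], anc) || reaches(parent[c][1], anc);
}

/*
 * If one endpoint is an ancestor of the other, it is the merge
 * base we want. Otherwise return None and let the general LCA
 * search decide.
 */
int
endpointbase(int x, int y)
{
	if(reaches(y, x))
		return x;
	if(reaches(x, y))
		return y;
	return None;
}
```

Here endpointbase(Cb, Cg) is Cb, so merging 'b' into 'g' is recognized as already done even though 'a' is also a common ancestor.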
Git has the ability to track the person who
creates a commit separately from the person
who wrote the commit. For git9, we ignored
this feature.
However, as we start using git/import more,
it will be useful to figure out who imported
a commit, as well as who wrote it.
This change adds support for seeing this
information in git, as well as setting the
author and committer separately in git/import.
Per the docs:
the sender SHOULD include a LF, but the
receiver MUST NOT complain if it is not
present.
I typoed away the SHOULD, and missed the
MUST NOT.
thanks qbit.
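On the receiving side, tolerating the missing LF is a one-line check. A minimal sketch; chomplf is a hypothetical helper name, and the payload is assumed to be already separated from its length prefix.

```c
#include <stddef.h>

/*
 * Return the payload length with one trailing LF removed,
 * if there is one. Lines without the LF are accepted as-is.
 */
size_t
chomplf(const char *line, size_t n)
{
	if(n > 0 && line[n-1] == '\n')
		n--;
	return n;
}
```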
the subst utility no longer supports a '-g'
flag, but this was left behind in commit;
this means that the lines listing modified
files were not correctly commented in the
commit header.
This is mostly harmless, but when using an
editor like sam to edit the commit message,
the modified lines would have to be removed
manually.
Often, people (including myself) will write emails that
can almost be applied with git/import. This changes
git/diff and git/import so that things will generally
work even when assembling diffs by hand:
1. git/import becomes slightly more lax:
	^diff ...
	^--- ...
   will both be detected as the start of a patch.
2. git/diff produces the same format of diff
   as git/export, starting with paths:
	--- a/path/to/file
	+++ b/path/to/file
   which means that the 'ape/patch -p1' used
   within git/import will just work.
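The laxer detection in point 1 amounts to a prefix test on each line. git/import itself is a script, so the C below is only a sketch of the rule; ispatchstart is a hypothetical name and whatever follows the prefix is not inspected.

```c
#include <string.h>

/* Does this line look like the start of a patch? */
int
ispatchstart(const char *line)
{
	return strncmp(line, "diff ", 5) == 0
	    || strncmp(line, "--- ", 4) == 0;
}
```

Everything before the first such line is treated as the commit message, which is why a hand-assembled email with a bare '---' header now works.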
So with this, if you send an email to the mailing list,
write up a committable description, and append the
output of git/diff to the end of the email, git/import
should just work.
[this patch was sent through the mailing list using the
above procedure, and will be committed with git/import
to verify that it works as advertised]
Git currently gets a bit confused if you try to
manipulate files by absolute path. There were also a
number of places where user-controlled file paths ended
up getting passed to regex interpretation, which could
confuse things.
This change mainly does 2 things:
- Adds a 'drop' function which drops
a non-regex prefix from a string, and uses
that to manipulate paths, simplifies 'subst',
and removes 'subst -g', which was only used
with fixed regexes; sed does this job fine.
- When getting a path from a user, we
make it absolute and then strip out the head.
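The point of 'drop' is that the prefix is compared literally, so regex metacharacters in a path cannot change the result. The real function lives in the git scripts; the C version below is an assumption-laden sketch of the same behavior.

```c
#include <string.h>

/*
 * If s starts with prefix, return the remainder of s;
 * otherwise return s unchanged. The prefix is compared byte
 * for byte and is never interpreted as a pattern.
 */
const char*
drop(const char *prefix, const char *s)
{
	size_t n;

	n = strlen(prefix);
	if(strncmp(s, prefix, n) == 0)
		return s + n;
	return s;
}
```

For instance, drop("a.*", "abc") leaves "abc" alone, where a regex substitution would have matched and eaten the whole string.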
Along the way it cleans up a couple of stupids:
- 'for(f in $list) if(! ~ $#f 0) use $f':
$f can't be a nil list because of
list flattening.
- removes a useless substitution here:
all=`$nl{{git/query -c $1 $2; git/query -c $2 $3} | sed 's/^..//' | \
gsubst '^('$ourbr'|'$basebr'|'$theirbr')/*' | sort | uniq}
where git/query -c doesn't produce
paths prefixed with the query.
The '-m' flag was added to date largely
to support git scripts. It predates the
tmdate code, which is why it exists, but
it's a recent enough addition that nothing
I'm aware of uses it, other than git.
As a result, it would be good to remove
it, so let's do that.
currently, git/fetch prints the refs
to update before it fully fetches the
pack files; this can lead to updates
to the refs before we're 100% certain
that the objects are present.
This change prints the updates after
the packfile has been successfully
indexed.