orib.dev: Git/serve

A while ago, I released git9, a git client for Plan 9. However, it always felt like it was missing something: a git server. Over a few weekends of work, I sat down and put one together; this Labor Day weekend, I took the opportunity to labor on it, and got something working.

usage

Git/serve is a git server designed to fit into the Plan 9 ecosystem, allowing you to host and interact with code on a Plan 9 server, while still allowing legacy clients running on Unix to get a copy of the code.

Git/serve only speaks the git:// protocol. But because it runs behind aux/listen and tlssrv, we effectively get two additional protocols for free: gits://, which is TLS-encrypted git://, and hjgit://, which is git:// over TLS with Plan 9 authentication to support user login. Unfortunately, only the unencrypted git:// protocol is supported out of the box by upstream git. There may be ways to solve this using custom transports and a Unix port of tlsclient.

To start a git:// server, just run it under aux/listen1. To enable encryption and user authentication, run it under tlssrv -a. This doesn't need a certificate, because the Plan 9 authentication generates the secret used for TLS. And, finally, if you just want encryption, you can use tlssrv with a certificate.

# git:// server, serving every repository under the current directory
% aux/listen1 'tcp!*!9418' git/serve -r `{pwd}

# gits:// server, doing the same: But with encryption.
% aux/listen1 'tcp!*!9418' tlssrv -c /path/to/cert.pem git/serve -r `{pwd}

# hjgit:// server doing the same: Requires account on server to connect
% aux/listen1 'tcp!*!9418' tlssrv -a git/serve -r `{pwd}

If you want to allow people to write, you can just add the -w flag to git/serve. While you can do this on any of the protocols, keep in mind that only hjgit:// authenticates the users. Push to the world, but don't let the world push to you!

implementation

All the protocols git uses are closely related. The ssh:// protocol is just the git:// protocol, with the command selected slightly differently. The smart http:// protocols are the same as the git:// protocol, with a different handshake, split over multiple POST requests.

Git9 implements the git protocol in under 500 lines of C. The full code is available on github, in serve.c.

The protocol looks something like this for pushing:

<=w= 0031:	"git-receive-pack /oridb/git9\0host=github.com\n"
=r=> 0086:	"747e9e80f710c0b8bbd928080745915ad2493322 HEAD\n"
=r=> 009d:	"747e9e80f710c0b8bbd928080745915ad2493322 refs/heads/master\n"
<=w= 0076:	"747e9e80f710c0b8bbd928080745915ad2493322 \
	dc802c499dc7164aa208b919a07c5bd18701b0e1 \
	refs/heads/master\0report-status\n"
<=w= 0000
<=w= 	[packfile data]
=r=> 000e:	"unpack ok\n"
=r=> 0019:	"ok refs/heads/master\n"

And this for pulling:

<=w= 0030:	"git-upload-pack /oridb/git9\0host=github.com\n"
=r=> 013b:	"dc802c499dc7164aa208b919a07c5bd18701b0e1 HEAD\n"
=r=> 003f:	"dc802c499dc7164aa208b919a07c5bd18701b0e1 refs/heads/master\n"
=r=> 0000
<=w= 0032:	"want 5d761590cfdce23e4b17e94050b0776c8804d4a1\n"
<=w= 0032:	"want 8f0eb1386a2c748cb7bf26917684479ff8c6d5b5\n"
<=w= 0032:	"want dc4a9d467c6a28dab3927c8611a1bfbc571f1568\n"
<=w= 0000
<=w= 0033:	"have dc802c499dc7164aa208b919a07c5bd18701b0e1\n"
<=w= 0009:	"done\n"
=r=> 0031:	"ACK dc802c499dc7164aa208b919a07c5bd18701b0e1\n"

The git protocol consists of pkt-lines, which are strings with a four-character hex-formatted length prefix. The length prefix includes itself, so the string "hi" would be formatted as 0006hi. A packet whose length field is zero -- the literal bytes 0000, where the prefix does not count itself -- is special. This is called a 'flush packet', and is used to terminate a phase of negotiation.
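
For concreteness, here's roughly what writing a pkt-line looks like. This is a minimal sketch, not git9's fmtpkt: writepkt and writeflush are made-up names, and error handling is reduced to returning -1.

#include <u.h>
#include <libc.h>

/*
 * Sketch of a pkt-line writer. The 4-byte hex prefix counts
 * itself, so the string "hi" goes out as "0006hi".
 */
int
writepkt(int fd, char *s)
{
	char len[5];
	int n;

	n = strlen(s) + 4;
	snprint(len, sizeof(len), "%04x", n);
	if(write(fd, len, 4) != 4)
		return -1;
	if(write(fd, s, n - 4) != n - 4)
		return -1;
	return n;
}

/* a flush packet is just the literal bytes "0000" */
int
writeflush(int fd)
{
	return write(fd, "0000", 4);
}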

All negotiation is done with these packet lines; once negotiation finishes, git flips over to pure binary mode to transfer a pack file.

In the git:// protocol, the client always starts off by saying which action it wants to do: either it wants the server to receive a pack from the client (a push), or it wants the server to upload a pack to the client (a pull). That's this line:

<=w= 0031:	"git-receive-pack /oridb/git9\0host=github.com\n"
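
On the server side, handling this boils down to reading that first pkt-line and dispatching on the command name. The fragment below is an illustrative sketch rather than a quote from serve.c: servpush and servpull are made-up handler names, while readpkt and fmtpkt are the same helpers that show up in the real snippets later in this post.

	char pkt[65536];
	int n;

	/* the first pkt-line names the service and the repository path */
	if((n = readpkt(c, pkt, sizeof(pkt))) == -1)
		sysfatal("read command: %r");
	if(strncmp(pkt, "git-receive-pack ", strlen("git-receive-pack ")) == 0)
		servpush(c, pkt + strlen("git-receive-pack "));
	else if(strncmp(pkt, "git-upload-pack ", strlen("git-upload-pack ")) == 0)
		servpull(c, pkt + strlen("git-upload-pack "));
	else
		fmtpkt(c, "ERR unsupported service\n");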

The push side of this protocol is simpler on the server, so we'll go through it first.

pushing

When handling a push, the server begins by sending a list of references that the client may update. In git/serve we grab all the refs in the repository, and filter down to just the ones beginning with heads/.

	if((nrefs = listrefs(&refs, &names)) == -1)
		sysfatal("listrefs: %r");
	for(i = 0; i < nrefs; i++){
		if(strncmp(names[i], "heads/", strlen("heads/")) != 0)
			continue;
		if(fmtpkt(c, "%H refs/%s\n", refs[i], names[i]) == -1)
			goto error;
	}
	if(flushpkt(c) == -1)
		goto error;

In the first phase of the protocol, the client sends the list of references it wants to update. Each update takes the form:

	"OLDHASH NEWHASH refs/heads/updateme\n"

Because the client knows what the reference versions are on the server, it can compute what commits are between the version on the server and the version that it has. If the new hash is the zero hash, this signals to the server that the reference should be deleted. This list of updates is terminated with a flush packet. The code in git9 that handles this is below:

	while(1){
		/* Did we get a packet? */
		if((n = readpkt(c, pkt, sizeof(pkt))) == -1)
			goto error;
		/* Was it a flush packet? */
		if(n == 0)
			break;
		/* Split it up into the 3 parts: old, new, reference */
		if(getfields(pkt, sp, nelem(sp), 1, " \t\n\r") != 3){
			fmtpkt(c, "ERR  protocol garble %s\n", pkt);
			goto error;
		}
		/* verify that these are valid hashes */
		if(hparse(&old, sp[0]) == -1){
			fmtpkt(c, "ERR bad old hash %s\n", sp[0]);
			goto error;
		}
		if(hparse(&new, sp[1]) == -1){
			fmtpkt(c, "ERR bad new hash %s\n", sp[1]);
			goto error;
		}
		/* and valid refs */
		if(!validref(sp[2])){
			fmtpkt(c, "ERR invalid ref %s\n", sp[2]);
			goto error;
		}
		/* and then remember them for when we do the update */
		*cur = erealloc(*cur, (*nupd + 1)*sizeof(Hash));
		*upd = erealloc(*upd, (*nupd + 1)*sizeof(Hash));
		*ref = erealloc(*ref, (*nupd + 1)*sizeof(char*));
		(*cur)[*nupd] = old;
		(*upd)[*nupd] = new;
		(*ref)[*nupd] = estrdup(sp[2]);
		*nupd += 1;
	}

Next, the client uploads the pack. This is a single blob containing the new objects. It goes into .git/objects/pack/recv-$pid.pack, at least until we can index it and rename it.

	while(1){
		n = read(c->rfd, buf, sizeof(buf));
		if(n == 0)
			break;
		if(n == -1 || write(pfd, buf, n) != n)
			return -1;
		packsz += n;
	}
	if(checkhash(pfd, packsz, &h) == -1){
		dprint(1, "hash mismatch\n");
		goto error1;
	}
	if(indexpack(packtmp, idxtmp, h) == -1){
		dprint(1, "indexing failed\n");
		goto error1;
	}
	if(rename(packtmp, idxtmp, h) == -1){
		dprint(1, "rename failed: %r\n");
		goto error2;
	}

Finally, we update the references. Note that we haven't locked the repository yet. This is because none of the data we have here can conflict: all objects are addressed by hash, so a race would at worst leave us with a duplicate copy of an object. Eventually, a git/repack will clean that up.

However, updating the references can conflict. So, for updating the references, we acquire a lock file. We then read all the references, make sure that they match the old reference, and then write in the new reference.

	for(i = 0; i < nupd; i++){
		if(resolveref(&h, ref[i]) == 0 && !hasheq(&h, &cur[i])){
			fmtpkt(c, "ERR old ref changed: %s", ref[i]);
			goto error;
		}
		if((o = readobject(upd[i])) == nil){
			fmtpkt(c, "ERR update to nonexistent hash %H", upd[i]);
			goto error;
		}
		if(o->type != GCommit){
			fmtpkt(c, "ERR not commit: %H", upd[i]);
			goto error;
		}
		unref(o);
		if(snprint(refpath, sizeof(refpath), ".git/%s", ref[i]) == sizeof(refpath)){
			fmtpkt(c, "ERR ref path too long: %s", ref[i]);
			goto error;
		}
		if((fd = create(refpath, OWRITE, 0644)) == -1){
			fmtpkt(c, "ERR open ref: %r");
			goto error;
		}
		if(fprint(fd, "%H", upd[i]) == -1){
			close(fd);
			fmtpkt(c, "ERR update ref: %r");
			goto error;
		}
		close(fd);
	}
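
One thing the snippet above doesn't show is taking the lock itself. On Plan 9, a simple way to get a repository-wide lock is an exclusive-use file; the sketch below shows the idea, but the lock path and the lockrepo name are made up, not what git9 actually does.

#include <u.h>
#include <libc.h>

/*
 * Sketch: take a lock by opening an exclusive-use file.
 * DMEXCL means only one process may hold the file open at
 * a time; closing the descriptor releases the lock.
 */
int
lockrepo(void)
{
	int fd;

	while((fd = create(".git/_lock", ORDWR, DMEXCL|0600)) == -1)
		sleep(100);	/* someone else holds the lock; wait and retry */
	return fd;
}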

And that's pushing.

pulling

Pulling takes more work on the server side, because the server needs to compute a reasonable packfile to send to the client. The protocol itself is still fairly simple.

It begins the same way as pushing, by sending all the branches that a client may want to obtain from the git repository.

=r=> 013b:	"dc802c499dc7164aa208b919a07c5bd18701b0e1 HEAD\n"
=r=> 003f:	"dc802c499dc7164aa208b919a07c5bd18701b0e1 refs/heads/master\n"
=r=> 0000

The client then starts telling us what commits it wants, and what commits it has. This lets us compute a graph difference between the server's and the client's commit graphs, and generate a pack file that contains few, if any, extraneous commits.

<=w= 0032:	"want 5d761590cfdce23e4b17e94050b0776c8804d4a1\n"
<=w= 0032:	"want 8f0eb1386a2c748cb7bf26917684479ff8c6d5b5\n"
<=w= 0032:	"want dc4a9d467c6a28dab3927c8611a1bfbc571f1568\n"
<=w= 0000
<=w= 0033:	"have dc802c499dc7164aa208b919a07c5bd18701b0e1\n"
<=w= 0009:	"done\n"
=r=> 0031:	"ACK dc802c499dc7164aa208b919a07c5bd18701b0e1\n"

Computing the commits that go into a pack is not always trivial. The history around the commits that the client is on may have "bubbles" in the graph, so simply walking back from the tip of the server's history until we reach a commit the client has may end up walking around that commit, leading to nearly the whole history of the repository being sent, instead of just one or two commits.

Consider a repo where the server is ahead of the client, and the client is trying to pull the changes. The client has the commits marked c, and we're trying to compute a pack containing only the commits marked o.

                o---o
               /     \
    --c---c---c---c---o---o <-- server
                  ^
                client

The client does a git/pull, and tells the server that it has the commit marked [c]. Since the server has every commit the client does, and more, it could rule out every ancestor of the client's commit. If we were smart, we would do just that, but a naive backwards walk does not. So here's what happens:

                o---o
               /     \
    --c---c---c--[c]--o---o <-- server
                  ^
                client

The server walks back, selecting commits, and marking them with an 'X':

                o---o
               /     \
    --c---c---c--[c]--X---X <-- server
                  ^
                client

Since history forked, the server walks down both forks. It should stop at the stage below, because the commit it has reached is an ancestor of the client's commit, and is therefore something the client already has -- but with a naive walk this doesn't happen.

                X---X
               /     \
    --c---c---c--[c]--X---X <-- server
                  ^
                client

So the naive walk gingerly steps around the client commit, and walks all the way through the rest of the history, creating an incredibly bloated pack file.


                X---X
               /     \
    --X---X---X--[c]--X---X <-- server
                  ^
                client

So, this is the algorithm I settled on to compute the stopping point:

paint(keepheads, dropheads) {
    q = nil;
    for(n in dropheads)
        insert(q, n, Drop)
    for(n in keepheads)
        insert(q, n, Keep)
    while((n, color = qpop(q)) != nil){
        match(n.color, color){
        | (Drop, Drop):  continue
        | (Keep, Keep):  continue
        | (Drop, Keep):  continue
        | (Keep, Drop):  repaint(n)
        | (Blank, _):
            n.color = color
            for(p : n.parents)
                insert(q, p, color)
        }
    }
}

repaint(n){
    if(n.color == Blank){
        /*
         * paint one blank node past the
         * frontier, to flush any keeps
         * still sitting in the queue.
         */
        n.color = Drop
        return
    }
    if(n.color == Keep){
        n.color = Drop
        for(p : n.parents)
            repaint(p)
    }
}

What this algorithm does is paint backwards from both the nodes we want to keep from the server, and the nodes that the client already has, which can be dropped from the pack. When the 'keep' walk reaches a node that's already painted 'drop', we're done with that path through the graph. When the 'drop' walk reaches a node that's already painted 'keep', that node and all the 'keep' nodes behind it in the graph are repainted to 'drop'. Then one blank node past the repainted frontier is painted 'drop', to force a flush of the 'keep' entries still in the queue.

Blank nodes are painted with the color of the node we're coming from, and nodes that already have a color are left untouched in all other cases.

To give a simple example, starting from the same situation as above, the server starts to walk back from both the client and server heads:

                o---o
               /     \
    --c---c---c---c---o---o <-- server
                  ^
                client

Marking the initial commits as either keep or drop:

                o---o
               /     \
    --c---c---c---D---o---K <-- server
                  ^
                client

Adding the parents of the commits to the queue, and then popping off the queue to mark things appropriately.

                o---o
               /     \
    --c---c---D---D---K---K <-- server
                  ^
                client

This continues until the keeps meet the drops, and the commits going into the pack are known. At this point, the computation is done.

                K---K
               /     \
    --c---D---D---D---K---K <-- server
                  ^
                client

Now that the desired commits are selected, we need to load all of the git objects reachable from them, and put them into a pack file. This is done by deltifying the objects against each other. This is the bulk of the code in the git server. It's too much code to include inline, so I'll just put in a brief sketch of the solution.

First, the git objects are sorted, using a sort order that's likely to put similar objects close to each other. Any sort order will produce a correct pack, but a good sort key will produce a smaller one.

The sort key that's used is the tuple of (type, path, mtime). The idea is that trees are likely to deltify well off of other trees, files with the same path will generally be modifications of each other, and older files will be modifications of newer ones.
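
To make this concrete, a comparison function over that key could look like the sketch below. The Meta struct and its fields are invented for the example (git9's real metadata type differs), and the usual Plan 9 includes are assumed.

typedef struct Meta Meta;
struct Meta {
	int	type;	/* object type: commit, tree, or blob */
	char	*path;	/* path the object was found under */
	vlong	mtime;	/* time of the commit that introduced it */
	vlong	size;	/* object size; used when picking deltas below */
};

/* qsort comparison over the (type, path, mtime) key */
static int
metacmp(void *pa, void *pb)
{
	Meta *a, *b;
	int c;

	a = *(Meta**)pa;
	b = *(Meta**)pb;
	if(a->type != b->type)
		return a->type - b->type;
	if((c = strcmp(a->path, b->path)) != 0)
		return c;
	/* newer first, so an older version can delta against the newer one just before it */
	if(a->mtime != b->mtime)
		return a->mtime > b->mtime ? -1 : 1;
	return 0;
}

Sorting an array of Meta pointers with this comparison puts likely delta candidates next to each other.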

From there, we use a modification of the rsync rolling hash algorithm to compute a reasonable delta of each object against the last 10 objects in the list. We then pick the smallest delta -- or decide not to deltify at all, if none of the deltas are small enough to be worth it.
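
In outline, that base selection is a short window search over the sorted list, reusing the Meta type from the sketch above. Here deltasz is a hypothetical stand-in for the rolling-hash delta computation, assumed to return the size the delta would have.

enum {
	Window	= 10,	/* how many previous objects to try as a delta base */
};

vlong deltasz(Meta*, Meta*);	/* hypothetical: size of the delta of the first against the second */

/*
 * Sketch: pick the best delta base for m[i] among the previous
 * Window entries. Returns the index of the base, or -1 to store
 * the object whole.
 */
static int
pickbase(Meta **m, int i)
{
	int j, best;
	vlong sz, lim;

	best = -1;
	lim = m[i]->size;	/* a delta is only worth it if it's smaller than the object itself */
	for(j = i - 1; j >= 0 && j >= i - Window; j--){
		sz = deltasz(m[i], m[j]);
		if(sz < lim){
			lim = sz;
			best = j;
		}
	}
	return best;
}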

We then re-sort the list by date, so that objects close in time are packed together, and write the result out to the client.

And that's pulling.