Great Experimental File Shredder
--------------------------------

Gefs is currently the Great Experimental File Shredder. With time,
it will become the Good Enough File System.

It aims to provide a crash-safe, corruption-detecting, simple, and
fast snapshotting file system for Plan 9. It's currently extremely
experimental.

Gefs is built in 3 layers: the 9p layer sits on top of a key-value
store, which sits on top of a block store.

The Key-value store
-------------------

Gefs is built on a key-value store. This provides an API that allows
for the following operations:

	upsert(msg[]) -> status:
		Atomically insert a sequence of messages into the
		key-value map. Either all messages become visible at
		once, or none of them do. The messages represent
		mutations to the leaf nodes.

	lookup(key) -> kvp:
		Looks up the value associated with a key.

	scan(prefix) -> iterator:
		Scans across the values associated with a prefix.
		All values are returned in order.

The key-value store is based on a Bε tree. Bε trees are essentially
B+ trees, where the internal nodes are split into a write buffer, as
well as pivot keys. This is better described here:

	http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf

There are a few modifications made to the above paper, in order to
allow for the tree to be implemented copy on write. Insertions are
done bottom up instead of top down, and tree splits are done eagerly
to guarantee that there is room to insert into the tree on future
flushes. In addition, the splits, rotations, and merges are done
lazily. This means that the balance conditions for B-trees are
relaxed, and the correction of the weights of siblings can be delayed
until the next insertion that crosses them.

The 9p Interface
----------------

The gefs interface uses 2 key-value stores: the snapshot store, which
stores roots of the file system trees, and the file system trees,
which store a snapshot in time of the file system contents.

The key-value stores have several different key types. Some are used
for file system trees, the remainder are used for snapshot trees.

These are the ones used to represent the file system:

	Kent: (parent qid, name) => dirent:
		This maps from a directory entry to a file. It's used
		for walking, listing, and statting.
	Ksuper: qid => (parent qid, name):
		This key is only created for directory entries, and is
		used for implementing '..'.
	Kdat: (qid, block offset) => blockptr:
		This key is used for file contents.

These are the ones used to represent the snapshots:

	Klabel: (snapname) => snapid:
		This is a moving, user friendly name for a snapshot.
		The default name is 'main'.
	Ktref: (uniqueid) => snapid:
		This is a transient, temporary name for a snapshot.
		It is used for exec and scan snapshots.
	Ksnap: snapid => tree:
		This is a reference to a tree root in the copy on
		write file system tree. Snapshot ids do not move
		between tree roots, though they can be created or
		deleted.

To load up the file system and open a mount:

	0. snaptree = opentree(superblock.snapaddr)
	1. id = lookup(snaptree, snapname)
	2. addr = lookup(snaptree, id)
	3. mount = opentree(addr)

To walk to a file, iteratively look up the components:

	ent = {path: -1, name: ""}	# qid/path for root
	for(name in walkpath)
		enc = encodekey(ent.path, name)	# Kent key: (parent qid, name)
		ent = lookup(mount, enc)
		qid.path = ent.path
		qid.name = name

To pread data:

	id = {path: qid.path, off: offset & ~Blockmask}
	blk = lookup(mount, encodekey(id))
	memcpy(resp, blk.data + offset%Blksz, len)

To list a directory, do a prefix scan over the tree:

	prefix = encodekey(parentqid)
	scan = scanprefix(mount, prefix)
	for(kvp in scan)
		handledirent(kvp)
	etc.
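The walk loop above maps directly onto Kent keys and the lookup call
from the key-value section. Here is a minimal C sketch of that
mapping; the key encoding, the struct layouts, and the helper names
(packent, kvlookup) are assumptions made for illustration, not gefs's
actual code or on-disk format.

	/*
	 * Sketch: walking a path by repeated Kent lookups.
	 * Key layout assumed here: tag byte, 8-byte parent qid, name.
	 */
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	enum { Kent = 1 };	/* assumed tag byte for directory-entry keys */

	typedef struct Key Key;
	typedef struct Dent Dent;

	struct Key {
		uint8_t	buf[512];	/* packed key bytes */
		int	len;
	};

	struct Dent {			/* decoded value of a Kent key */
		uint64_t qpath;		/* qid.path of the child */
		uint32_t mode;
		/* ...remaining stat fields... */
	};

	/* pack (parent qid, name) into a Kent key: tag, 8-byte qid, name */
	static void
	packent(Key *k, uint64_t parent, char *name)
	{
		k->buf[0] = Kent;
		memcpy(k->buf+1, &parent, 8);
		k->len = 9 + snprintf((char*)k->buf+9, sizeof(k->buf)-9, "%s", name);
	}

	/* assumed lookup entry point into a mounted file system tree */
	extern int kvlookup(void *mount, Key *k, Dent *d);

	/* walk from the root (qid.path == -1 above) down a list of names */
	int
	walk(void *mount, char **names, int nname, Dent *d)
	{
		uint64_t parent;
		Key k;
		int i;

		parent = (uint64_t)-1;
		for(i = 0; i < nname; i++){
			packent(&k, parent, names[i]);
			if(kvlookup(mount, &k, d) < 0)
				return -1;	/* component not found */
			parent = d->qpath;	/* child becomes the next parent */
		}
		return 0;
	}

A pread would follow the same pattern, with a Kdat key of
(qid, block offset) in place of the Kent key.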
Block allocation
----------------

The disk is split into a number of equal sized arenas. Blocks are
allocated in a round robin fashion from each arena, allocating a
certain number of blocks from each arena before rotating. The arena
is selected as:

	a = arenas[roundrobin++/ncontig + blocktype]

Segregating arenas by block type keeps data blocks roughly contiguous,
and increases read speeds significantly.

Block allocations are recorded into a per-arena allocation log, which
consists of log entries packed into blocks. Each block contains a
sequence of entries of the following forms:

	LogAlloc	[addr, len]
	LogFree		[addr, len]
	LogAlloc1	[addr]
	LogFree1	[addr]
	LogChain	[nextent]
	LogEnd

These entries are packed into a linked list of logs. The logs are
replayed into an AVL tree when the file system starts up, and the AVL
tree is used to record the allocations. At file system start (and
soon, periodically while running), the allocation log is compressed
and rewritten, taking contiguous smaller logged allocs, such as
per-block LogAlloc1 entries, and compressing them into a single run
of blocks.

Snapshots
---------

Snapshots work by taking a reference to a file system root node. The
labels of open snapshots are updated on a regular timer.

Blocks are deleted across snapshots using the deadlist mechanism,
lifted wholesale from ZFS, and explained here:

	https://www.youtube.com/watch?v=NXg86uBDSqI
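For a concrete picture of what "taking a reference to a file system
root node" might amount to in terms of the Klabel and Ksnap keys
described earlier, here is a hedged sketch: a fresh snapid is bound
to the current tree root, and a label is pointed at that snapid,
using the atomic upsert from the key-value layer. All types and
helpers here (Tree, Msg, newsnapid, packsnap, packlabel, kvupsert)
are illustrative assumptions, not the actual gefs implementation,
and the deadlist bookkeeping is omitted entirely.

	#include <stdint.h>

	typedef struct Tree Tree;	/* opaque handle on the snapshot store */
	typedef struct Msg Msg;

	struct Msg {			/* one mutation in an atomic upsert batch */
		int	op;		/* insert/delete */
		void	*key;
		void	*val;
	};

	/* assumed helpers over the snapshot key-value store */
	extern uint64_t	newsnapid(void);
	extern void	packsnap(Msg*, uint64_t snapid, uint64_t rootaddr);	/* Ksnap */
	extern void	packlabel(Msg*, char *snapname, uint64_t snapid);	/* Klabel */
	extern int	kvupsert(Tree *snaptree, Msg *msg, int nmsg);		/* atomic batch */

	int
	snapshot(Tree *snaptree, char *snapname, uint64_t rootaddr)
	{
		Msg m[2];
		uint64_t id;

		id = newsnapid();
		packsnap(&m[0], id, rootaddr);	/* Ksnap: snapid => tree root */
		packlabel(&m[1], snapname, id);	/* Klabel: snapname => snapid */

		/* both keys become visible atomically, or not at all */
		return kvupsert(snaptree, m, 2);
	}

This mirrors the mount sequence earlier, which resolves a label to a
snapid and then the snapid to a tree root.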