Great Experimental File Shredder
--------------------------------

Gefs is currently the Great Experimental File Shredder. With time,
it will become the Good Enough File System.

It aims to provide a crash-safe, corruption-detecting, simple, and
fast snapshotting file system for Plan 9. It's currently extremely
experimental.

Gefs is built in 3 layers: the 9p layer sits on top of a key-value
store, which sits on top of a block store.

The Key-value store
-------------------

Gefs is built on a key-value store. This provides an API that allows
for the following operations:

	upsert(msg[]) -> status:
		Atomically insert a sequence of messages into the
		key-value map. Either all messages become visible at
		once, or none of them do. The messages represent
		mutations to the leaf nodes.

	lookup(key) -> kvp:
		Looks up the value associated with a key.

	scan(prefix) -> iterator:
		Scans across the values associated with a prefix.
		All values are returned in order.

The key-value store is based on a Bε tree. Bε trees are essentially
B+ trees, where the internal nodes are split into a write buffer, as
well as pivot keys. This is better described here:

	http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf

There are a few modifications made to the above paper, in order to
allow for the tree to be implemented copy on write. Insertions are
done bottom up instead of top down, and tree splits are done eagerly
to guarantee that there is room to insert into the tree on future
flushes. In addition, the splits, rotations, and merges are done
lazily. This means that the balance conditions for B-trees are
relaxed, and the correction of the weights of siblings can be delayed
until the next insertion that crosses them.

The 9p Interface
----------------

The gefs interface uses 2 key-value stores: the snapshot store, which
stores roots of the file system trees, and the file system trees,
which store a snapshot in time of the file system contents.

The key-value stores have several different key types. Some are used
for file system trees, the remainder are used for snapshot trees.

These are the ones used to represent the file system:

	Kent: (parent qid, name) => dirent:
		This maps from a directory entry to a file. It's used
		for walking, listing, and statting.
	Ksuper: qid => (parent qid, name):
		This key is only created for directory entries, and is
		used for implementing '..'.
	Kdat: (qid, block offset) => blockptr:
		This key is used for file contents.

These are the ones used to represent the snapshots:

	Klabel: (snapname) => snapid:
		This is a moving, user friendly name for a snapshot.
		The default name is 'main'.
	Ktref: (uniqueid) => snapid:
		This is a transient, temporary name for a snapshot.
		It is used for exec and scan snapshots.
	Ksnap: snapid => tree:
		This is a reference to a tree root in the copy on
		write file system tree. Snapshot ids do not move
		between tree roots, though they can be created or
		deleted.

To load up the file system and open a mount:

	0. snaptree = opentree(superblock.snapaddr)
	1. id = lookup(snaptree, snapname)
	2. addr = lookup(snaptree, id)
	3. mount = opentree(addr)

To walk to a file, iteratively look up the components:

	ent = {path: -1, name: ""}	# qid/path for root
	for(name in walkpath)
		enc = encodekey(ent.path, name)	# Kent key: (parent qid, name)
		ent = lookup(mount, enc)
		qid.path = ent.path
		qid.name = name

To pread data:

	id = {path: qid.path, off: offset & ~Blockmask}
	blk = lookup(mount, encodekey(id))
	memcpy(resp, blk.data + offset%Blksz, len)

To list a directory, do a prefix scan over the tree:

	prefix = encodekey(parentqid)
	scan = scanprefix(mount, prefix)
	for(kvp in scan)
		handledirent(kvp)
	etc.
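The walk loop above maps directly onto Kent keys and the lookup call
from the key-value section. Here is a minimal C sketch of that
mapping; the key encoding, the struct layouts, and the helper names
(packent, kvlookup) are assumptions made for illustration, not gefs's
actual code or on-disk format.

	/*
	 * Sketch: walking a path by repeated Kent lookups.
	 * Key layout assumed here: tag byte, 8-byte parent qid, name.
	 */
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	enum { Kent = 1 };	/* assumed tag byte for directory-entry keys */

	typedef struct Key Key;
	typedef struct Dent Dent;

	struct Key {
		uint8_t	buf[512];	/* packed key bytes */
		int	len;
	};

	struct Dent {			/* decoded value of a Kent key */
		uint64_t qpath;		/* qid.path of the child */
		uint32_t mode;
		/* ...remaining stat fields... */
	};

	/* pack (parent qid, name) into a Kent key: tag, 8-byte qid, name */
	static void
	packent(Key *k, uint64_t parent, char *name)
	{
		k->buf[0] = Kent;
		memcpy(k->buf+1, &parent, 8);
		k->len = 9 + snprintf((char*)k->buf+9, sizeof(k->buf)-9, "%s", name);
	}

	/* assumed lookup entry point into a mounted file system tree */
	extern int kvlookup(void *mount, Key *k, Dent *d);

	/* walk from the root (qid.path == -1 above) down a list of names */
	int
	walk(void *mount, char **names, int nname, Dent *d)
	{
		uint64_t parent;
		Key k;
		int i;

		parent = (uint64_t)-1;
		for(i = 0; i < nname; i++){
			packent(&k, parent, names[i]);
			if(kvlookup(mount, &k, d) < 0)
				return -1;	/* component not found */
			parent = d->qpath;	/* child becomes the next parent */
		}
		return 0;
	}

A pread would follow the same pattern, with a Kdat key of
(qid, block offset) in place of the Kent key.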
Block allocation
----------------

The disk is split into a number of equal sized arenas. Blocks are
allocated in a round robin fashion from each arena, allocating a
certain number of blocks from each arena before rotating. The arena
is selected as:

	a = arenas[roundrobin++/ncontig + blocktype]

Segregating arenas by block type keeps data blocks roughly contiguous,
and increases read speeds significantly.

Block allocations are recorded into a per-arena allocation log, which
consists of log entries packed into blocks. Each block contains a
sequence of entries of the following forms:

	LogAlloc	[addr, len]
	LogFree		[addr, len]
	LogAlloc1	[addr]
	LogFree1	[addr]
	LogChain	[nextent]
	LogEnd

These entries are packed into a linked list of logs. The logs are
replayed into an AVL tree when the file system starts up, and the AVL
tree is used to record the allocations. At file system start (and
soon, periodically while running), the allocation log is compressed
and rewritten, taking contiguous smaller logged allocs, such as
per-block LogAlloc1 entries, and compressing them into a single run
of blocks.

Snapshots
---------

Snapshots work by taking a reference to a file system root node. The
labels of open snapshots are updated on a regular timer.

Blocks are deleted across snapshots using the deadlist mechanism,
lifted wholesale from ZFS, and explained here:

	https://www.youtube.com/watch?v=NXg86uBDSqI
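For a concrete picture of what "taking a reference to a file system
root node" might amount to in terms of the Klabel and Ksnap keys
described earlier, here is a hedged sketch: a fresh snapid is bound
to the current tree root, and a label is pointed at that snapid,
using the atomic upsert from the key-value layer. All types and
helpers here (Tree, Msg, newsnapid, packsnap, packlabel, kvupsert)
are illustrative assumptions, not the actual gefs implementation,
and the deadlist bookkeeping is omitted entirely.

	#include <stdint.h>

	typedef struct Tree Tree;	/* opaque handle on the snapshot store */
	typedef struct Msg Msg;

	struct Msg {			/* one mutation in an atomic upsert batch */
		int	op;		/* insert/delete */
		void	*key;
		void	*val;
	};

	/* assumed helpers over the snapshot key-value store */
	extern uint64_t	newsnapid(void);
	extern void	packsnap(Msg*, uint64_t snapid, uint64_t rootaddr);	/* Ksnap */
	extern void	packlabel(Msg*, char *snapname, uint64_t snapid);	/* Klabel */
	extern int	kvupsert(Tree *snaptree, Msg *msg, int nmsg);		/* atomic batch */

	int
	snapshot(Tree *snaptree, char *snapname, uint64_t rootaddr)
	{
		Msg m[2];
		uint64_t id;

		id = newsnapid();
		packsnap(&m[0], id, rootaddr);	/* Ksnap: snapid => tree root */
		packlabel(&m[1], snapname, id);	/* Klabel: snapname => snapid */

		/* both keys become visible atomically, or not at all */
		return kvupsert(snaptree, m, 2);
	}

This mirrors the mount sequence earlier, which resolves a label to a
snapid and then the snapid to a tree root.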