Storage Management Requirements

GW
Advisor/Customer: Eelco Dolstra

Most current Wikis are based on RCS or CVS. This has some limitations, the main one being that files (Wiki pages), not directories, are versioned. This means that:

  • It is not possible to rename or move files, or to refactor the directory structure, without losing version history.

  • Wikis generally do not support a recursive directory structure.

  • Commits are per-file, so a change that spans several files is a series of independent commits rather than one atomic commit. If you want to make related changes to a bunch of pages, the Wiki may be in an inconsistent state while you're making the changes, and it's hard to undo (back out) the changes afterwards.

Since Subversion solves all these problems, we will use it as the storage back-end of GW. An entire Wiki will be stored in a single Subversion repository.

The goal of the storage team is to implement the Subversion storage back-end. I envision the following milestone releases (subject to change):

Release 0 - initial GW

On startup GW does a checkout of the repository. Edits happen on the working copy. However, there are no commits, so all changes to the Wiki are lost when the server is restarted. There is no locking of any kind.

Release 1 - persistent storage

Save operations should cause a commit to happen. When a user edits a new page, the page should of course first be added to the repository. There is still no locking, though.
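The save path for this release can be sketched as follows. This is only a minimal illustration in Python that drives the svn command-line client; the function name save_page, the injectable run parameter, and the rule that a page missing from the working copy counts as new are assumptions made for the example, not part of GW.

```python
import subprocess
from pathlib import Path


def save_page(wc_root, page, text, run=subprocess.check_call):
    """Write a page into the global working copy and commit it right away.

    run executes one svn command line; it is a parameter only so that the
    sketch can be exercised without a real repository.
    """
    path = Path(wc_root) / page
    # Assumption: every earlier save was committed, so a page that is
    # missing from the working copy has never been added.
    new_page = not path.exists()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    if new_page:
        # A new page has to be scheduled for addition before the commit.
        run(["svn", "add", "--parents", str(path)])
    # Commit from the working copy root so that newly added parent
    # directories are part of the same commit.
    run(["svn", "commit", "-m", f"Edit {page}", str(wc_root)])
```

Committing from the root is acceptable here only because every save is committed immediately, so no unrelated changes are pending in the working copy.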

Release 2 - multi-file edits ("transactions")

When a user starts editing a file, create a new per-session working copy.

Edits happen on this per-session working copy and are not visible to other users. So a "Save" causes the per-session working copy to be modified, but nothing is committed.

There should now be a "Commit" operation that causes the entire per-session working copy to be committed. After this the per-session working copy can be deleted, and the global working copy should be updated. This makes the changes globally visible.
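The intended visibility rules can be illustrated with a small in-memory model. It does not touch Subversion at all; the class names GlobalCopy and EditSession are hypothetical, and conflicts are ignored exactly as in this release (the last commit wins).

```python
class GlobalCopy:
    """In-memory stand-in for the global working copy: page name -> text."""

    def __init__(self, pages=None):
        self.pages = dict(pages or {})

    def start_session(self):
        return EditSession(self)


class EditSession:
    """Stand-in for a per-session working copy."""

    def __init__(self, wiki):
        self._wiki = wiki
        self._snapshot = dict(wiki.pages)  # clone made when editing starts
        self._saved = {}                   # saved, but not yet committed

    def view(self, page):
        # The session sees its own saved pages first, then its snapshot.
        if page in self._saved:
            return self._saved[page]
        return self._snapshot.get(page)

    def save(self, page, text):
        # "Save" only touches the per-session copy.
        self._saved[page] = text

    def commit(self):
        # "Commit" publishes all saved pages in one step and then throws
        # the per-session state away.
        self._wiki.pages.update(self._saved)
        self._saved = {}
        self._snapshot = dict(self._wiki.pages)
```

A session that saves two related pages leaves the global copy unchanged until commit is called, at which point both pages change together.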

Merge conflicts are ignored for now.

Release 3 - merge conflicts

If a merge conflict occurs on commit, the per-session working copy should be retained. The user should be presented with a list of conflicting pages (showing the conflicts in those pages) and be allowed to edit those pages. Most of the work here should be done by the versioning UI team, but the storage team has to provide the supporting infrastructure.
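One piece of that supporting infrastructure could be a function that lists the conflicting pages in a per-session working copy. The sketch below assumes the list is derived from the text printed by svn status, where a C in the first column marks a conflicted item and the path starts after the seven one-character status columns and a space. The function name conflicting_pages is hypothetical, and the column layout should be checked against the Subversion version actually in use.

```python
def conflicting_pages(status_output):
    """Return the paths that svn status reports as conflicted.

    Only text conflicts (first column) are handled; tree conflicts are
    reported in a different column and are ignored in this sketch.
    """
    pages = []
    for line in status_output.splitlines():
        # Assumed layout: 7 status columns, one space, then the path.
        if len(line) > 8 and line[0] == "C":
            pages.append(line[8:])
    return pages
```

The versioning UI could then show this list to the user and reopen each listed page for editing.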

Release 4 - use the RA layer

Using working copies is inefficient. For instance, to start an edit session, we have to clone the entire working copy. This doesn't scale well. So instead of using Subversion's working copy (WC) layer, we should use the remote access (RA) layer, which allows us to fetch and edit just those files that are involved in an operation.

It's possible that the high-level Subversion bindings only support WC operations, not RA operations, so additional C/Java bindings might have to be created.
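To make the idea of fetching just one file concrete, here is a rough sketch that uses the svn cat command, which prints a single file from a repository URL without needing a working copy. It goes through the command-line client rather than the RA layer bindings, so it only illustrates the access pattern; the function name fetch_page and the run parameter are assumptions for the example.

```python
import subprocess


def fetch_page(repo_url, page, revision=None, run=subprocess.check_output):
    """Fetch one page straight from the repository, without a working copy."""
    command = ["svn", "cat"]
    if revision is not None:
        command += ["-r", str(revision)]
    command.append(f"{repo_url.rstrip('/')}/{page}")
    # run returns the raw bytes printed by the command.
    return run(command).decode("utf-8")
```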

Release 5 - caching for RA operations

Using the RA layer is scalable, but it's also slow. For instance, to view a page, we have to fetch it from the Subversion server every time (while before R4, we could just get it from the working copy). So RA fetches should be cached.

Of course, the cache should be properly invalidated on edit operations.
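A minimal sketch of such a cache, assuming all edits go through the same server process: views are answered from the cache when possible, and an edit drops the cached entry for that page. The class name PageCache and the fetch and store callables are hypothetical placeholders for whatever the RA-based back-end ends up providing.

```python
class PageCache:
    """Cache page contents in front of a slow fetch operation."""

    def __init__(self, fetch, store):
        self._fetch = fetch   # callable: page -> text (e.g. an RA fetch)
        self._store = store   # callable: (page, text) -> None
        self._pages = {}

    def view(self, page):
        # Only go to the repository on a cache miss.
        if page not in self._pages:
            self._pages[page] = self._fetch(page)
        return self._pages[page]

    def edit(self, page, text):
        self._store(page, text)
        # Invalidate so that the next view fetches the new contents.
        self._pages.pop(page, None)
```

If the repository can also be changed from outside the Wiki, invalidating on local edits is not enough, and the cache would additionally have to compare revisions with the repository.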

Additional complications

The storage layer is quite fundamental. The entire Wiki depends on it. The storage team should design a simple but sufficient interface to the storage layer that other teams can develop against. In particular, the work of the versioning UI team is closely related to that of the storage team. For instance, merge support requires that the storage layer offer the upper layers a notification that there is a merge conflict, a way to query what the conflicts are, and a way to clear the conflict situation. Close communication and frequent syncing with the versioning UI team is probably required. Big bang integration is not an option.
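As a starting point for discussion between the teams, the three conflict-related operations could look roughly like this. The sketch is a toy in-memory model with per-page revision numbers, not a Subversion binding; all names (MergeConflict, InMemoryStore, Session, conflicts, resolve) are hypothetical, and recording the base revision at the first save of a page is an assumption of the example.

```python
class MergeConflict(Exception):
    """Notification: raised by commit() when pages changed underneath us."""

    def __init__(self, pages):
        super().__init__("conflicts in: " + ", ".join(pages))
        self.pages = pages


class InMemoryStore:
    """Toy repository: page name -> (revision, text)."""

    def __init__(self):
        self.pages = {}

    def revision(self, page):
        return self.pages.get(page, (0, ""))[0]

    def open_session(self):
        return Session(self)


class Session:
    def __init__(self, store):
        self.store = store
        self.base = {}    # page -> revision the edit was based on
        self.edits = {}   # page -> new text, private to the session

    def save(self, page, text):
        # Remember the base revision the first time a page is edited.
        self.base.setdefault(page, self.store.revision(page))
        self.edits[page] = text

    def conflicts(self):
        """Query: pages whose repository revision moved past our base."""
        return sorted(p for p, rev in self.base.items()
                      if self.store.revision(p) != rev)

    def resolve(self, page):
        """Clear: accept the session's text on top of the current revision."""
        self.base[page] = self.store.revision(page)

    def commit(self):
        pending = self.conflicts()
        if pending:
            # The session, including its edits, is retained.
            raise MergeConflict(pending)
        for page, text in self.edits.items():
            self.store.pages[page] = (self.base[page] + 1, text)
        self.base.clear()
        self.edits.clear()
```

An upper layer would call commit, catch MergeConflict, show the pages it names, let the user edit them, call resolve for each, and then commit again.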

Maybe R4/R5 aren't such a good idea, since replicating a lot of the functionality of the WC layer (such as support for moving files) is a lot of work. However, something should be done about the scalability problem in R3. Alternatives might be to clone working copies using hard or symbolic links, to clone the working copy lazily, and so on.
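The hard-link alternative can be sketched as follows: the clone shares file contents with the global working copy, and a page is only really copied when it is written. Both function names are hypothetical. The sketch assumes every write to the clone goes through write_page; whether Subversion's own administrative files tolerate being hard-linked like this would need to be verified.

```python
import os
import tempfile


def clone_with_hard_links(src, dst):
    """Recreate the directory tree of src under dst, hard-linking each file.

    dst must be outside src and on the same file system.
    """
    for dirpath, _dirnames, filenames in os.walk(src):
        target = os.path.join(dst, os.path.relpath(dirpath, src))
        os.makedirs(target, exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name), os.path.join(target, name))


def write_page(path, text):
    """Replace the file instead of writing in place.

    Writing in place would also change the hard-linked original; writing
    a temporary file and renaming it over the link leaves it untouched.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    os.replace(tmp, path)
```

Cloning then costs one link per file rather than one copy per file, which helps with disk space but still takes time proportional to the number of files.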